基礎大模型能像人類一樣標註資料嗎？

自從 ChatGPT 出現以來，我們見證了大語言模型 (LLM) 領域前所未有的發展，尤其是對話類模型，經過微調以後可以根據給出的提示語 (prompt) 來完成相關要求和命令。然而，直到如今我們也無法對比這些大模型的效能，因為缺乏一個統一的基準，難以嚴謹地去測試它們各自的效能。評測我們發給它們的指令以及對話模型本身，從本質上來講就很困難，畢竟使用者的評價標準都是圍繞對回答的質量的主觀感受; 而現有的自然語言處理任務的效能評價標準，卻大多侷限於特定指標和某些定量標準。

在這一領域，通常當一個新的大語言模型釋出時，都會這麼宣傳: 我們的模型比在百分之多少的情況下優於 ChatGPT。這句話的潛在意思是，模型使用某些基於 GPT-4 的評價標準，在百分之多少的情況下優於 ChatGPT。這些分數實際想表達的是一種不同評價標準的代替方案: 由人類打標者提供的分數。利用人類反饋的強化學習 (RLHF) 大量提供了對比兩個模型的介面和資料。這些從 RLHF 而來的資料被用來訓練一個獎勵模型，用以評判哪個回答是更好的，但為模型輸出結果進行打分和排名的這一思想已經演變成了一種更通用的模型評測工具。

這裡我們展示一些範例，分別來自我們的盲測資料中的 instruct 和 code-instruct 兩個子集合。

從迭代速度來講，使用一個語言模型去評測模型輸出已經十分高效了，但這裡遺漏了一個大問題: 有沒有調研這個下游的快捷工具是否針對原有的評測形式進行了校準對齊。在本文中，我們將會詳細瞭解: 通過擴充套件 Open LLM Leaderboard 評測體系，哪些時候你需要相信或不相信你從你選擇的大語言模型中得到的資料標籤。

現如今，各種排行榜已經開始湧現，比如 LMSYS 以及 nomic / GPT4All 等，用以從各種角度對比模型。但我們仍需要一個更完整的資源，用以對比模型效能。有些人會使用現有的 NLP 測試基準，看一看提問回答的能力; 還有些人會用一些眾包的、開放式問答的排行榜。為了為大家提供一個更全面通用的評測方法，我們擴充套件了 Hugging Face Open LLM Leaderboard，囊括了各個自動化的學術評測基準、專業人員的打標，以及 GPT-4 相關的評測方法。

評估開源模型的偏好

在訓練階段的任何時間點，人工組織資料的需求從內在來看都是成本很高的。迄今為止，在這一領域僅存在少量人工標註的偏好資料集可以用來訓練大模型，例如 Anthropic’s HHH data、OpenAssistant’s dialogue rankings 或是 OpenAI 的 Learning to Summarize / WebGPT 資料集。相同的偏好標籤也可以用模型輸出獲取，用以構建兩兩模型間的 Elo 排序 (Elo 排序是常用於象棋或遊戲中的一種通過兩兩對比構建全域性排行榜的方法，排名越高越好)。當給到標註者的文字源是由我們關注的模型生成時，資料就變得很有趣了。

訓練模型的過程會發生很多意想不到的趣事，所以我們需要對各個開源模型做一個更嚴格的對照實驗，看看偏好收集過程如何轉化為當今盛行的 GPT-4/ChatGPT 偏好評估，以及與它們的差異對比。

為了這一目的，我們組織了一個指令提示語的集合，以及對應的一系列由開源模型 ( Koala 13b、Vicuna 13b、OpenAssistant 12b、Dolly 12b) 完成的補全。

我們從 Self-Instruct 評測集中收集了一系列高質量、由人編寫的提示語，同時也從資料商那裡收集了一些早期的討論類對話資料，涵蓋了生成、頭腦風暴、問答、總結、常識、程式設計等各種任務類別。總共有 327 條覆蓋這些任務型別的提示語，其中 25 條是程式設計相關的。

這裡我們列出一些提示語相關的資料統計，包括其句子長度。

	prompt	completions
count	327	327
length (mean ± std. dev.) in tokens	24 ± 38	69 ± 79
min. length	3	1
25% percentile length	10	18
50% percentile length	15	42
75% percentile length	23	83
max	381	546

藉助這些資料，我們開始使用 Scale AI 和 GPT-4 去評估模型質量。我們針對偏好模型使用 Antropic 的方法，並要求打分者去依照利克特 (Likert) 量表打 1 到 8 分。在這個範圍內，1 分代表打分者相比第一個模型，對當前模型有強烈偏好; 4 分表示基本與第一個模型持平; 8 分則代表評價者的觀點與 1 分完全相反。

人工 Elo 結果

我們與 Scale AI 合作，為一些開源的指令調教的模型基於我們的盲測集收集了高質量的人工標註。我們要求標註者以一種兩兩對比的形式，針對有用性和真實性對模型的回答打分。為此，我們為每個提示語生成了 $ n \choose 2 $ 個組合對，其中 $n$ 是我們要評測的模型的數量。下圖就是一個為人工標註者提供的打分頁面，其中包含了給標註者發出的任務指示 (相關工具由 Scale AI 提供)。

使用這一資料，我們依據兩兩模型間的勝率「自舉地」估算出了 Elo 排序結果。如果讀者對 Elo 的過程感興趣，可以閱讀 LMSYS 的這篇 notebook，我們盲測集上的 Elo 分數也在 leaderboard 上有列出。

在本文中，我們會展示「自舉」估算 Elo 的方法以及其中的錯誤估算。下表列出了依據我們盲測集和標註人員的排行榜:

**************不含平局的 Elo 排序 (由 1000 輪取樣的對局自舉計算而來) **************

Model	Elo ranking (median)	5th and 95th percentiles
Vicuna-13B	1140	1061 ↔ 1219
Koala-13B	1073	999 ↔ 1147
Oasst-12B	986	913 ↔ 1061
Dolly-12B	802	730 ↔ 878

針對利克特量表，打分打到 4 或 5 分是否構成一場勝利也很有爭議。所以我們也計算了一個當 4 或 5 分表示平局時的 Elo 排序。但基於這一改變，我們並沒有觀察到模型間相互勝負的太大變化。平局的計數 (每對模型間對局 327 次) 和新的 Elo 排序如下圖所示。每格的數位代表兩個模型的平局數，比如，Koala-13B 和 Vicuna-13B 之間由最多的平局 (96 次)，說明它們可能效能非常相近。

注意這張表的讀法: 選取一行，如 oasst-12b ，然後橫向看這個模型和其它各個模型分別有多少場平局。

**************包含平局的 Elo 排序 (由 1000 輪取樣的對局自舉計算而來) **************

Model	Elo ranking (median)	5th and 95th percentiles
Vicuna-13B	1130	1066 ↔ 1192
Koala-13B	1061	998 ↔ 1128
Oasst-12B	988	918 ↔ 1051
Dolly-12B	820	760 ↔ 890

下圖統計了打分的分佈情況，資料來自 Scale AI taskforce。

在本文接下來內容中，你還會看到基於不同資料生成準測的相似分析流程。

GPT-4 的 Elo 結果

接下來我們來看看 GPT-4 如何對比結果。幾個待評測模型的順序不變，但相對的間隔變了。

不含平局的 Elo 排序 (由 1000 輪取樣的對局自舉計算而來)

Model	Elo ranking (median)	2.5th and 97.5th percentiles
vicuna-13b	1134	1036 ↔ 1222
koala-13b	1082	989 ↔ 1169
oasst-12b	972	874 ↔ 1062
dolly-12b	812	723 ↔ 909

包含平局的 Elo 排序 (由 1000 輪取樣的對局自舉計算而來)

提醒一下，利克特 1 到 8 分量表中，4 或 5 分代表平局

Model	Elo ranking (median)	2.5th and 97.5th percentiles
vicuna-13b	1114	1033 ↔ 1194
koala-13b	1082	995 ↔ 1172
oasst-12b	973	885 ↔ 1054
dolly-12b	831	742 ↔ 919

為此，我們使用了一個由 FastChat evaluation prompts 改編而來的提示語，以便用較短的句子長度來獲得更快和更低成本的生成結果 (比如解釋性的話語大多數時間都捨棄掉了)。

### Question
{question}

### The Start of Assistant 1's Answer
{answer_1}
### The End of Assistant 1's Answer

### The Start of Assistant 2's Answer
{answer_2}
### The End of Assistant 2's Answer

### System
We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above.
Please compare the helpfulness, relevance, accuracy, level of details of their responses.
The rating should be from the set of 1, 2, 3, 4, 5, 6, 7, or 8, where higher numbers indicated that Assistant 2 was better than Assistant 1.
Please first output a single line containing only one value indicating the preference between Assistant 1 and 2.
In the subsequent line, please provide a brief explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment.

下面基於 GPT-4 關於回答的柱狀圖開始顯示一個基於大語言模型評測的明顯問題了: 位置偏差 ( positional bias )。哪個模型對應上面的 answer_1 是完全隨機的。

鑑於 GPT-4 評價的不確定性，我們決定加上一個新的評測基準: 高度訓練過的人類打標者的補全。我們希望知道人類給出的 Elo 排序會是什麼樣的。

GPT-4 Elo 結果及展示

最終，人類給出的 Elo 排序看上去非常令人困惑。很多假設可以解釋這一點，但它最終指向了一個潛在的風格優勢，這種優勢存在於在大語言模型的輸出資料上訓練的那些模型上 (當我們拿 Dolly 做對比時)。這是由於訓練和評測方法是並行開發的，導致了兩者之間的「非故意摻雜」( unintentional doping )。

不含平局的 Elo 排序 (由 1000 輪取樣的對局自舉計算而來)

Model	Elo ranking (median)	2.5th and 975th percentiles
Vicuna-13b	1148	1049 ↔ 1239
koala-13b	1097	1002 ↔ 1197
Oasst-12b	985	896 ↔ 1081
human	940	840 ↔ 1034
dolly-12b	824	730 ↔ 922

GPT-4 評測樣例

下面我們展示了一些樣例，用以展示評測是如何進行的，讓你能感覺到 GPT-4 是如何挑選答案的:

Question:
Indicate the genre of the book to which it belongs.\n Input: Love in the Time of Cholera.

Model response (koala):

The genre of 「Love in the Time of Cholera」 is a romance novel with elements of magical realism and historical fiction.
It tells the story of two men, Fermina Daza and Florentino Ariza, who are both deeply in love with her but cannot be together due to various circumstances.
Throughout their lives, they encounter obstacles that prevent them from being reunited, including war, illness, and societal pressure.
However, their love for each other never fades away, and they continue to hold onto hope that one day they will be able to be together again.
This combination of romantic elements and historical context makes this book fall under the category of romance novels with magical realistic and historical fictional elements.

Human response:

Fiction

GPT-4 rating: 1 (model greatly better)

Assistant 1 provided a more detailed and accurate answer, discussing the elements of romance, magical realism, and historical fiction present in the novel.
Assistant 2’s answer was too brief and did not provide sufficient information about the genre of the book.

Question:

Write a LinkedIn post to announce that you have accepted a new job offer.\n Input:

Model response (koala):

Exciting news! I am thrilled to announce that I have officially accepted an exciting new job opportunity as [Job Title] at [Company Name]. This role will allow me to leverage my skills and experience in [industry/field of expertise], while also providing opportunities for personal and professional growth.
I want to take this moment to express my gratitude to everyone who has supported me throughout the hiring process, including [previous employer name(s) or individuals who provided guidance/advice/references]. Your support means the world to me and I couldn’t be more grateful for your help.

I am excited about what lies ahead and can’t wait to join the team at [Company Name]. Thank you again for all your support!

Human response:

I’m excited beyond words to share with you my decision to accept the role of Marketing Director at the XYZ Company!

I couldn’t have had this opportunity if not for the help of my friend, [name], who shared this job opportunity with me, and my former boss, [boss’s name], for her stellar recommendation and guidance.

I happily look forward to starting this new journey and growing professionally with my new family—the marketing rock stars of XYZ Company.

GPT-4 rating: 6 (model better)

Both assistants provided helpful and relevant responses for a LinkedIn post announcing a new job offer.
However, Assistant 2’s answer was more detailed and included placeholders for specific information, making it easier for the user to customize the post.
Assistant 1’s response was also good, but slightly less structured and detailed compared to Assistant 2’s response.

Question:

You are given a search query and a document. Classify whether the document is relevant to the search query or not relevant.
Input: Search: why sky is blue
Document: The Short Answer: Sunlight reaches Earth’s atmosphere and is scattered in all directions by all the gases and particles in the air. Blue light is scattered more than the other colors because it travels as shorter, smaller waves. This is why we see a blue sky most of the time

Model response (dolly):

The provided example search 「why sky is blue」 is related to the input document which explains the reason for blue color of the sky.

Advantages

This answer only takes up 34 characters compared to longer explanations like sunlight reaching earth due to its speed through space etc.. which would take up 80 characters. Hence this answer is highly relevant and useful to many users.

Disadvantage

Now if the question was about reasons behind changes in the weather one could think of many documents relating to climate change, pollution, static charges in the air etc., these documents might be relevant to the user but taking up lots of additional real-estate makes them less relevant compared to the above mentioned answer.

Human response:

relevant

GPT-4 rating: 5 (model slightly better)

Both assistants provided relevant answers to the question, but Assistant 1’s answer was more detailed and explained the reason behind the sky’s blue color, while Assistant 2’s answer was concise but not as informative. Assistant 1’s answer is more helpful for users who want to understand the phenomenon, while Assistant 2’s answer is suitable for users who only want a quick confirmation.

Category	Correlation: GPT-4 to Human Labels
Brainstorm	0.60
Creative generation	0.55
Commonsense reasoning	0.46
Question answering	0.44
Summarization	0.40
Natural language to code	0.33

Model	Elo ranking (median)
Vicuna-13b	1136
koala-13b	1081
Oasst-12b	961
human	958
dolly-12b	862

Model	Elo ranking (median)
koala-13b	1105
Oasst-12b	1075
Vicuna-13b	1066
human	916
dolly-12b	835

Model	Elo ranking (median)
Vicuna-13b	1110
koala-13b	1085
Oasst-12b	1075
human	923
dolly-12b	804

要點和討論

總結我們實驗的幾個重要發現:

GPT-4 存在位置偏差，當我們使用 1 到 8 分範圍，兩兩對比模型偏差時，GPT-4 傾向於給第一個模型更高分。
在提示語中要求 GPT-4 不要保留這種位置偏差，會導致模型偏向另一邊，但不會像上述情況那麼差。
針對事實性和有用性的回答上，GPT-4 更傾向於偏好那些用 InstructGPT、GPT-4、ChatGPT 生成的資料訓練的模型。比如，在人類手寫的輸出方面，GPT-4 更傾向 Vicuna 和 Alpaca。
GPT-4 和人工評價者在評價非程式設計任務時，有著 0.5 左右的正相關性; 而對於程式設計任務，仍保持正相關，但數值遠沒有那麼高。
如果我們按任務型別看，人類和 GPT-4 的相關性在」高熵「任務 (如頭腦風暴和生成類任務) 中相關性最高，而在程式設計類任務中相關性低。

這一領域的工作還是很新的，所以

利克特 vs 評分: 在我們的評測過程中，我們使用了利克特量表作為評測工具——如何收集偏好資料來使用 RLHF 去訓練模型。在這裡，我們反覆展示了，使用打出來的分數去訓練一個偏好模型並不能產出有用的訓練資訊 (相比於相對性的排名來說)。類似地，我們也發現在分數上評測也不太可能生成出長期有效的訓練訊號。

此外，值得注意的是，ChatGPT (效能稍低點的模型) 實際上難以返回一個符合利克特量表定義的分數，但有的時候可以多少可靠地做出排名。這暗示著這些模型其實才剛剛開始學習格式方面的資訊，來符合我們的評測框架; 這個能力的獲取距離它成為一個有用的評測工具還有較長的距離。

為評測做提示語: 在我們的實驗中，我們看到了 GPT-4 評測時潛在的位置偏差。但仍有很多其它因素，可能影響提示語的質量。在最近的一個 podcast 中，Riley Goodside 描述了 LLM 輸出時每個分詞的資訊量極限，所以在提示語中首先輸出分數可能會限制像 GPT-4 這樣的模型能力，使得它不能充分闡述緣由。
打分和排名的範圍: 我們還不清楚打分或利克特排名的真正合理範圍是多少。LLM 習慣去看到訓練資料中的某種組合 (如 1 到 5 星)，這可能會讓模型打分數產生偏差。給定特定的 token 讓模型去返回，而不是讓模型直接輸出一個分數，可能會降低這樣的偏差。
句子長度的偏差: ChatGPT 受歡迎的一個原因也在於它可以輸出有趣的、很長的回答。我們可以看到，在使用 GPT-4 評測時，GPT-4 非常不喜歡簡介而正確的回答，僅僅是因為另一個模型喜歡持續輸出更長的回答。
正確的生成引數: 在我們早期實驗中，我們需要花費大量時間獲取各個模型正確的對話格式 (可參考 FastChat 的 conversation.py)。這可能使得模型僅僅獲得 70-90% 的的潛在能力。剩下的部分需要通過調生成引數來獲取 (如 temperature 或 top-p 等)，但我們仍缺少可信的評測基準，迄今也沒有一個公平的方法去做這個。針對我們的實驗，我們設定 temperature 為 0.5，top-k 為 50 以及 top-p 為 0.95 (針對生成，OpenAI 的評測還需要別的引數)。

資源和參照

更多關於針對標註的指示資訊可以檢視這裡.

如果你有一個模型需要 GPT-4 或人工標註者去評測，可以在 the leaderboard discussions 留言。

@article{rajani2023llm_labels,
  author = {Rajani, Nazneen, and Lambert, Nathan and Han, Sheon and Wang, Jean and Nitski, Osvald and Beeching, Edward and Tunstall, Lewis},
  title = {Can foundation models label data like humans?},
  journal = {Hugging Face Blog},
  year = {2023},
  note = {https://huggingface.co/blog/llm-v-human-data},
}

感謝 Joao 指出我們表格中一個拼寫錯誤

英文原文: https://hf.co/blog/llm-leaderboard

作者: Nazneen Rajani, Nathan Lambert, Sheon Han, Jean Wang, Osvald Nitski, Edward Beeching, Lewis Tunstall, Julien Launay, Thomas Wolf

譯者: Hoi2022

排版/審校: zhongdongy (阿東)

基礎大模型能像人類一樣標註資料嗎？

目錄

評估開源模型的偏好

人工 Elo 結果

GPT-4 的 Elo 結果

GPT-4 Elo 結果及展示

相關工作

GPT-4 評測樣例

更多實驗

人工打標和 GPT-4 打標的相關性分析

對比實驗

要點和討論

資源和參照