千「垂」百鍊:垂直領域與語言模型(1)

2023-04-09 15:01:55

Using Language Models in Specific Domains (1)

微信公眾號版本:https://mp.weixin.qq.com/s/G24skuUbyrSatxWczVxEAg

這一系列文章仍然堅持走「通俗理解」的風格,用盡量簡短、簡單、通俗的話來描述清楚每一件事情。本系列主要關注語言模型在垂直領域嘗試的相關工作。

This series of articles still sticks to the "general understanding" style, describing everything in as short, simple and easy-to-understand terms as possible. This series focuses on the work of language models in specific domains.

目錄 (Table of Contents):

1 引言(←)

  • 1.1 語言模型的能力
  • 1.2 落地垂直領域的靈魂發問

2 歸根到底是可用的垂直領域資料

  • 2.1 醫療領域的嘗試:醫患對話(ChatDoctor)
  • 2.2 Stanford Alpaca解決資料稀缺的思路
  • 2.3 Self-Instruct半自動生成資料

更多(待定)

1 Introduction(←)

  • 1.1 Power of Language Models
  • 1.2 Questioning: Are You Sure Specific Domains?

2 Essential: Domain-specific Training Data

  • 2.1 Attempts in Medical Domain (ChatDoctor)
  • 2.2 Stanford Alpaca: Idea for Obtaining Data
  • 2.3 Self-Instruct: Semi-automatic Data Generation

More (to be confirmed)

1 引言(←)

Introduction

1.1 語言模型的能力

Power of Language Models

最近,語言模型讓我們看到,它迴應人類指令的表現效果大大提高了。Recently, language models have shown us that it responds to human commands even more amazingly well compared to before.

而在此之前,人們與AI智慧體的聊天互動基本上只侷限於 Prior to this, people's chat interactions with AI intelligence were mostly limited to:

  • 真正的閒聊(並且聊天質量不高)chit-chat (and the quality of the chat was not good)
  • 讓AI完成特定的任務(訂餐、訂票、問答等,這種互動方式幾乎不允許聊與此任務無關的內容)having the AI perform a specific task (ordering food, booking tickets, Q&A, etc., and this type of interaction barely allowed chatting about anything unrelated to this task)

如今,我們可以自由地發出指令。儘管這些指令五花八門,語言模型總是可以給出不錯、甚至超出預期的迴應。Nowadays, we can give instructions freely. Despite the variety of these commands, the language model can always respond well, or even beyond our expectations.

1.2 落地垂直領域的靈魂發問

Questioning: Are You Sure Specific Domains?

這一部分內容仁者見仁,智者見智。There are a thousand Hamlets in a thousand people's eyes.

是否能夠、有必要將這種語言模型和自己的垂直領域業務相結合,可能要先問自身幾個問題。In order to figure out whether it is possible and necessary to combine this language model with your own domain-specific business, you may want to ask yourself a few questions first.

1. 我不缺錢,我就是想把這種AI語言模型想盡辦法和我的業務結合。我不管這種結合是真的契合還是勉強的。這樣可以嗎? I'm not short of money, I just want to combine this AI language model with my business in any way I can. I don't care if that combination is a real fit. Is that alright?

可以,因為不缺錢,可以盡情的試錯(羨慕)。No problem because there is no shortage of money and you can try and experiment to your heart's content (extremely jealous).

言歸正傳,可以的原因大概有2個 Back to the main story, why is it possible:

  • 它很可能已經具備垂直領域的知識。It is likely to already have domain-specific knowledge. 這種AI模型是學習過海量資料的,無論你是在哪個垂直領域,它可能都有所涉及。它對於垂直領域的互動不見得會效果不好。This AI model is learned from vast amounts of information, and it has probably covered whatever domain you are in. The model may work well in your particular domain.

  • 看重的是它的某項技能。You are looking at it for a particular skill. 你可能也不需要這個AI模型學習過垂直領域相關的資料(換句話說,它即使不懂這個領域,同樣可以幫助到你)。在這種情況下,取決於你看上了語言模型的哪些語言技能。比如,AI語言模型具有不錯的文字總結能力,隨便扔給它一篇業內的文章,雖然它可能看不太懂,但是它仍然可以總結出質量不錯的簡報。You may also not need this AI model to have learned the knowledge of a certain domain (in other words, it can help you equally well even if it doesn't know the domain). In this case, it depends on which linguistic skills you look for in a language model. For example, an AI language model with good text summarisation skills can be given a casual article from the domain and it can still summarise a good-quality brief.

2. 我的垂直領域能接受語言模型的不完美嗎? Can my domain accept the imperfections of the language model?

雖然現在語言模型很強大,但它仍然有一些不完美的地方需要引起注意。As powerful as the language model is now, it still has some imperfections that need to be drawn to our attention.

  • 會犯錯 Mistakes can be made:它的回答可能會出現違背事實的錯誤。換句話說,可能會一本正經的胡說八道。its answers may be wrong against the facts.

  • 不確定性 Uncertainty:面對同一個問題,語言模型每次的回答是可以不一樣的。你喜歡它某一次的回答,不代表它每次的回答都會令你滿意。A language model can respond differently to the same instruction each time. It does not guarantee that every answer will be to your satisfaction.

  • 不方便「教訓」它 Not convenient to "teach" it:目前很多廠家會提供語言模型的介面,但是我們只可以使用,不能直接去「教訓」它。如果在自己的領域有表現不滿意的地方,在短時間內我們幾乎無能為力。Many companies currently provide interfaces to language models, but we can only use them and cannot "teach" them directly. If we are not satisfied with the performance, there is very little we can do about it in the short term.

  • 不靈活 Not lightweight:即使你擁有屬於自己的語言模型並且你可以任意「教訓」它,如果你想修改、校正、調整它的記憶和技能可不容易。你可能需要「教訓」它很多次、給它看很多例子它才能記住你的訓導。即使它說它記住了,那它是否真的記住了、它記住了這個是否又忘記了別的、教訓完後它每次的表現是否都能夠達到預期等都需要經過嚴格的測試才能知道。總之,訓導它和訓練真正的人類還是有很大區別。Even if you have your own language model and you can 'teach' it as much as you like, it is not easy to tune, calibrate and modify its memory and skills if you want to. You may have to 'teach' it many times and show it many examples before it understands your instructions. Even if it says it remembers, you need to test it carefully to see if it really remembers, if it remembers one thing and forgets another, and if it performs as expected after the training. In short, there is a big difference between training it and training a real human.

  • 帶來額外支出 Additional costs required:如果呼叫第三方介面去使用語言模型,會收取費用(一般來講,與介面傳送的資料越多,收費越高);如果自己部署語言模型,需要購置能夠執行語言模型的軟硬體資源;擁有語言模型並不是全部,還是需要投入人力、財力、時間去打磨如何讓模型與自己的業務相結合。If you call a third-party interface to use the language model, you will be billed (generally speaking, the more data you transfer to the interface, the higher the bill); if you deploy the language model by yourself, you will need to purchase the hardware and software resources to run it; owning the language model is not the end of the story; you will still need to invest manpower, money and time to work out how to integrate the model with your business.

3. 我想把這種語言模型融入到自己的垂直領域,這到底是我無意識陷入了盲目跟隨潮流,還是真的會對我的業務有幫助? I want to incorporate this language model into my domain - am I unconsciously falling into blindly following a trend, or will it really help my business?

夢想和理智並存。Dreams and sanity exist together.

  • 有夢想合理 Having dreams is reasonable:出現跟隨潮流的想法是合理的。因為語言模型確實在很多方面表現不錯,有潛力。It is reasonable to have the idea of following trends because language models do perform well in many ways and have potential.

  • 是否有幫助看效果 Whether it helps depends on the results:對業務有無幫助看實際驗證的效果,不憑空想象。如果找不到和自己業務類似的先例,這個問題的答案只有自己才能找到。Whether it helps your business or not depends on the actual validated results, not on imagination. If you can't find a previous example similar to your own business, the answer to this question can only be found by yourself.

  • 不失理智 No loss of sanity:

    • 不做超出自己承受能力的嘗試(能夠承擔的住失敗的代價)Do not try beyond what you can afford (can afford to fail)
    • 一開始可以先精選一個或少數業務進行嘗試 You can start with one or a few selected cases to explore

4. 我不懂技術原理,如果我提出來一些天馬行空、甚至不切實際、超出模型能力範圍的想法,技術/研發人員會笑話我、反感我嗎? I don't have any technical background, if I come up with some pie-in-the-sky, even unrealistic, ideas that are beyond the model's capabilities, will the technical/R&D team laugh at me and dislike me?

不會,垂直領域的落地正需要非技術和技術想法之間的碰撞。Will not, making models work in specific domains is requiring the collaboration between non-technical and technical ideas.

兩者之間需要互相配合、彼此校正。The two need to work together and correct each other. 碰撞的過程可能不總是愉快的,需要有商有量,互相理解。The collaboration may not always be pleasant and requires mutual understanding.

  • 從非技術人員的角度來看,我們需要他/她進行大膽、創新的業務規劃。同時也需要技術人員對能夠實現的功能進行評估(比如需要多少資源),對無法實現的業務功能及時提醒對方。From the perspective of a non-technical person, we need him/her to make brave and creative business plans. We also need the technical person to assess the features that can be achieved (e.g. how many resources are needed) and to remind the other person in a timely manner of the business features that cannot be achieved.

  • 從技術人員的角度看,我們同樣可以為業務規劃貢獻想法。AI技術是不斷髮展的,以前很難實現、遙不可及的功能,在今天可能很容易就可以實現,但非技術人員可能沒有及時的意識到這一點。這需要我們去提醒非技術人員,耐心的向他們科普目前技術能夠做到哪些事情。From the perspective of technical staff, we can also contribute ideas to business planning. AI technology is constantly evolving, and what was once difficult and out-of-reach may be easily implemented today, but non-technical staff may not realise this in time. It's important for us to remind non-technical people and patiently explain to them what the technology can currently do.

對語言模型設定合理預期,避免過高過低。Set reasonable expectations for the language model and avoid going too high or too low. 語言模型確實很強大,但它不是完美的。The language model is indeed powerful, but it is not perfect.

  • 預期不能過高 Expectations must not be too high:一個想法可能是好的但無法/很難實現(如果經濟實力足夠可以轉為研發專案。但需要沉得住氣,不能指望短期出成果)An idea may be good but unrealisable or require a great deal of cost (can be turned into an R&D project if financially strong enough. But we need to be patient and not expect short term results)

  • 預期不用太低 Expectations don't have to be low:非技術人員以為無法實現,砍掉了本來可以上線的功能(此時需要技術人員及時指正)a feature could have been implemented, but was removed because a non-technical person thought it couldn't be implemented (at which point it needed to be corrected by a technical person in a timely manner)

  • 模型一時表現不佳,不代表一直不佳 A model that performs poorly for a while does not mean it will always perform poorly:如果功能可以實現,但距離預期仍有差距,給模型適應的時間。它可以持續學習(尤其是從人類的反饋中)。經過堅持不懈的努力,它可以做的更好。If features can be implemented, but the performance of the feature still falls short of expectations, we need to give the model some time to improve. It can continue to learn (especially from human feedback). With consistent effort, it can do better.

5. 我聽說做這個很燒錢,但是我沒有那麼多錢,我還有機會試一試嗎? I've heard it's very expensive to do this but I don't have that much money, do I have a chance to try it?

有機會。這裡的「燒錢」主要是指從0到1創造模型的過程需要很大的開銷。而我們的目的主要是藉助現有經驗或使用現有模型,不是從頭創造。There is. The reason it is expensive is mainly that the process of creating a model from scratch requires a lot of expense. Our aim is to leverage existing experience or use existing models, not to create them from scratch.

  • 創造什麼都會的模型很燒錢:搜一下創造模型的公司都投入了多少資源就大概知道了 Creating models that can do everything is expensive: you can find out how much resources have been invested based on some news

  • 僅使用已經創造好的模型有一定開銷,但沒那麼燒錢:相比「很燒錢」,這部分開銷非常非常非常小 There is some cost in only using models that have already been created, but not very expensive: this cost to us is small compared to the cost of creating these models from scratch

  • 創造專精自己領域的模型不一定很燒錢 Creating models that specialise in your own domain does not have to be expensive:

    • 現有工作已經向我們證實了一條可行之路,我們可以少走彎路 The existing work has confirmed feasible solutions, saving us the effort of exploring on our own

    • 有關研究已經證明,即使使用很小的模型(小模型的學習效率和知識儲備能力不如大模型),經過恰當的訓練(尤其根據人類的反饋),小模型是有機會與大模型的表現相媲美的(在垂直領域表現如何需要自行驗證)Studies have demonstrated that small models (smaller models do not learn as efficiently or have the same knowledge-base capacity as larger models) have the opportunity to match the performance of large models with appropriate training (especially based on human feedback). How well it performs in your specific domains needs to be validated.

    • 現有的模型訓練技術允許我們低成本的在大模型的基礎上再次訓練(並且效果還不錯)Existing model training techniques allow us to retrain on a large model at a low cost (and with good results)

6. 現有的可用語言模型很好,但是在我的領域表現還不夠出色,我還是想要針對自己的領域研發一個模型。最應該注意什麼? The existing available language models are good, but they don't perform well enough in my domain and I still want to develop a model for my own domain. What are the most important things to be aware of?

至少應該注意3點 At least 3 points should be noted:

  • 業務剛需還是為了華而不實的功能 Develop your own models for essential functions or for impractical ones
  • 巧婦難為無米之炊,有無語言模型可用的學習資料 Availability of learning data for language models
  • 在現有模型基礎上繼續研發是否合規 Whether it is appropriate to continue to develop on the basis of existing models

業務剛需還是為了華而不實的功能 Develop your own models for essential functions or for impractical ones 在開展這個工作之前,需要結合自身的情況(例如戰略佈局、業務規劃)來決定開展自研工作是否是剛需。如果僅僅是為了實現華而不實的功能或者預算緊張,則需要再三考慮。Before undertaking this work, you need to decide whether the work to develop your own model is just what you need in the context of your own situation (e.g. strategic plan, business plan). If the purpose is simply to achieve an impractical function or if you are on a tight budget, you need to think twice.

巧婦難為無米之炊,有無語言模型可用的學習資料 Availability of learning data for language models 語言模型讀懂指令並做出反應的能力是學習出來的,這需要學習資料的支援。同理,在垂直領域你是否有合適的模型學習資料是非常重要的。目前業務上積累下來的資料可不可以直接用、如何將其轉化成語言模型可用的學習資料等,我們在後續的文章中有所提及。The ability of language models to understand and respond to instructions is learned, and this needs to be supported by learning data. Similarly, it is important that you have appropriate model learning data in your domain. Whether the data currently gathered in your business can be used directly and how it can be transformed into learning data usable by language models will be covered in subsequent articles.

在現有模型基礎上繼續研發是否合規 Whether it is appropriate to continue to develop on the basis of existing models 需仔細閱讀現有模型的許可證。有些模型雖然是開源直接可用的,但是在它們的許可證(license) 中有明確描述:模型以及模型的變體(例如再次訓練之後的模型)不能用於商用,不能用於提供醫療意見、解讀醫療報告等。The licences of existing models need to be read carefully. Some models are open-sourced and directly available, but their license clearly states that the model and derivatives of the model (e.g. after fine-tuning) cannot be used for commercial purposes, providing medical advice, interpreting medical reports, etc.

2.1 醫療領域的嘗試:醫患對話(ChatDoctor)

(未完待續, To be continued)