Introduction

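　　Picking up from the previous post, "Integrating GPT into a WeCom App: Making Work Fun", in which I connected GPT to our WeCom app: quite a few colleagues have started trying it out. Some only dabbled, others dug deep, and the questions were all over the map. Here are a few samples: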

　　"Write me a love letter. We were university classmates, and I have had a secret crush for ten years."

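　　"How long after a vaginal delivery can I start using a postpartum belly band? After giving birth, is it better to use one or not?" (Background: the company's main business is postpartum care centers, and nursing domain knowledge is a key part of its nurse training.)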

　　"What is my salary?" (That is asking a bit much of the bot, although if it had the company's HR data it should be able to answer.)

　　 ......

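　　Overall, apart from people trying it for novelty or curiosity, a fair share of the questions asked about internal company information, such as HR topics like parental leave. The rest were mostly about maternal and infant care. That is no surprise: nearly 60% of the company's staff are nursing staff who spend their working days with babies and new mothers.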

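　　After seeing these questions, I started trying to train an internal nursing bot through fine-tuning, hoping it could make the nurses' work a little easier. After many failed attempts, I set it aside for a while.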

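　　Then the May Day holiday came and we went back to my wife's parents' home. By my count it had been 3 years and 6 months since our last visit; our second child is almost 3 and had never met their grandmother. To my surprise the kids took to their uncle right away, which left me with free time, so I picked the nursing bot back up and started tinkering again. Hence this post.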

Fine-tuning may really not be a good fit

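　　When I first read the introduction to fine-tuning, I thought: what if I built a customized model through fine-tuning, fed it the company's maternal and infant care knowledge, and let it evolve through future Q&A into an in-house expert? So that was the path I explored at first. After all, the introduction says training can be done with a small number of samples, and that a task like classification may need only about 200. (In practice, a Q&A model probably needs at least a few thousand samples before it shows any effect.)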

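　　Of course, the docs also include guidelines and best practices for fine-tuning. But first, the docs are all in English and I did not understand them deeply; second, ignorance makes you bold, and I would not give up without trying. Roughly, the docs say that fine-tuning suits classification-like tasks (true/false judgments, sentiment such as optimistic vs. pessimistic, email categorization) as well as scenarios like expansion and summarization. The docs also mention a "Customer support chatbot" case study, which may be one reason people attempt this. The related demo, however, recommends implementing it with embeddings, which is also the main topic of this post. More on that later.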

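　　Fine-tuning did not give good results in the end. Perhaps there were too few samples, or their quality was poor, or perhaps I missed something along the way. I will walk through it here for discussion, since the fine-tuning approach is still very appealing. The code mostly follows the fine-tuned_qa demo in openai-cookbook. The rough flow is as follows: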

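1. Environment setup, which I will not go into (Python 3.10.4). The process was mostly smooth, apart from a long struggle with the VPN itself: I had been using mono, and switching to a different client fixed it.
2. Collect the text data and split it into sensible sections according to the token limit. (In my case I found an electronic copy of our internal maternal and infant care training material.)
3. Use the text-davinci-003 model to generate several questions for each section, then generate answers from the section and its questions.
4. Assemble all generated questions and answers into the dataset that fine-tuning requires.
5. Create the new model and use it.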

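　　1. Splitting the text: the material I got was a Word file with headings, so I split it by heading, and anything over 2048 is split once more. The code is below (learned on the fly, so it is fairly rough):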

import docx
import pandas as pd


def getText(fileName):
    """Split a Word document into sections, one per 'Heading 1' paragraph."""
    doc = docx.Document(fileName)
    TextList = []

    data = {"title": "", "content": ""}
    for paragraph in doc.paragraphs:
        if paragraph.style.name == 'Heading 1':
            print("title %s " % paragraph.text)
            # A new heading closes the previous section, if it has any content
            if len(data['content']) > 0:
                datax = {}
                datax['title'] = data['title']
                datax['content'] = data['content']

                TextList.append(datax)
            data['title'] = paragraph.text
            data['content'] = ''
        else:
            data['content'] += paragraph.text + "\n"
    # Append the last section
    TextList.append(data)
    return TextList


# Convert the doc to CSV
if __name__ == '__main__':
    fileName = '/Users/jijunjian/openai/test2.docx'

    articList = getText(fileName)
    count = 0
    for article in articList:
        # Print sections longer than 800 characters for inspection
        if len(article['content']) > 800:
            print("%s,%s,\n%s" % (article['title'], len(article['content']), article['content']))
        count += 1

    header = ['title', 'content']
    print("%s articles in total" % count)
    pd.DataFrame(articList, columns=header).to_csv('data_oring.csv', index=False, encoding='utf-8')
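
　　The listing above splits the document only at 'Heading 1' paragraphs; the second pass for sections longer than 2048 is mentioned in the text but not shown. Below is a minimal sketch of how such a pass could look. It assumes the limit is measured in characters (as `len()` is in the listing) and that breaking at line boundaries is acceptable. The names `split_long_section` and `split_sections` are hypothetical and not part of the original code.

```python
def split_long_section(content, max_chars=2048):
    """Split `content` into pieces of at most `max_chars` characters.

    Pieces break at line boundaries where possible; a single line longer
    than `max_chars` is cut hard.
    """
    pieces, current = [], ""
    for line in content.splitlines(keepends=True):
        # Hard-cut any single line that is itself too long
        while len(line) > max_chars:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new piece when adding this line would exceed the limit
        if len(current) + len(line) > max_chars:
            pieces.append(current)
            current = ""
        current += line
    if current:
        pieces.append(current)
    return pieces


def split_sections(sections, max_chars=2048):
    """Expand {'title', 'content'} dicts so that no content exceeds max_chars.

    Sections with empty content produce no output rows.
    """
    result = []
    for section in sections:
        for piece in split_long_section(section["content"], max_chars):
            result.append({"title": section["title"], "content": piece})
    return result
```

　　Under these assumptions, `split_sections(getText(fileName))` could be called before writing the CSV. If the limit is really meant in tokens, the length check would need a tokenizer instead of `len()`.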

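　　2. Generating questions and answers: the quality of questions and answers generated this way may not be very high, so in real use it is probably better to have domain experts review and correct them.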

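　　According to the official docs, the prompts and completions in the dataset should each end with a fixed marker that, as far as possible, does not appear anywhere else in the text. So here we use "\n\n###\n\n" as the end marker.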

import pandas as pd
import openai
import sys
sys.path.append("..")
from tools.OpenaiInit import openai_config  # local helper module (not shown here)
from transformers import GPT2TokenizerFast

# Note: this script uses the legacy openai-python interface (versions before 1.0).

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")


def count_tokens(text: str) -> int:
    """count the number of tokens in a string"""
    return len(tokenizer.encode(text))


COMPLETION_MODEL = "text-davinci-003"
FILE_TUNE_FILE = "search_data.jsonl"


# Load the training data
def get_training_data():
    file_name = "data_oring.csv"
    df = pd.read_csv(file_name)
    df['context'] = df.title + "\n\n" + df.content
    print(f"{len(df)} rows in the data.")
    return df


# Generate questions from the content
def get_questions(context):
    print("Generating questions")
    try:
        response = openai.Completion.create(
            engine=COMPLETION_MODEL,
            # Prompt (Chinese): "Generate questions based on the text below"
            prompt=f"基於下面的文字生成問題\n\n文字: {context}\n\n問題集:\n1.",
            temperature=0,
            max_tokens=500,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=["\n\n"]
        )
        return response['choices'][0]['text']
    except Exception as e:
        print("Error while generating questions: %s" % e)
        return ""


# Generate answers from the questions
def get_answers(row):
    print("Generating answers")
    try:
        response = openai.Completion.create(
            engine=COMPLETION_MODEL,
            # Prompt (Chinese): "Generate answers based on the text below"
            prompt=f"基於下面的文字生成答案\n\n文字: {row.context}\n\n問題集:\n{row.questions}\n\n答案集:\n1.",
            temperature=0,
            max_tokens=500,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
        )
        return response['choices'][0]['text']
    except Exception as e:
        print(e)
        return ""


# Build the training data /Users/jijunjian/tuningdata.xlsx
if __name__ == '__main__':
    openai_config()
    df = get_training_data()
    df['tokens'] = df.context.apply(count_tokens)
    # The prompt ends with "1.", so put it back in front of the returned text
    df['questions'] = df.context.apply(get_questions)
    df['questions'] = "1." + df.questions

    df['answers'] = df.apply(get_answers, axis=1)
    df['answers'] = "1." + df.answers
    df = df.dropna().reset_index().drop('index', axis=1)

    print("Saving data")
    df.to_csv('nursing_qa.csv', index=False)

    # Both prompt and completion end with the fixed marker "\n\n###\n\n"
    df['prompt'] = df.context + "\n\n###\n\n"
    df['completion'] = " yes\n\n###\n\n"

    df[['prompt', 'completion']].to_json(FILE_TUNE_FILE, orient='records', lines=True)

    with open(FILE_TUNE_FILE) as f:
        search_file = openai.File.create(file=f, purpose='fine-tune')
    qa_search_fileid = search_file['id']
    print("File uploaded, file ID: %s" % qa_search_fileid)

    # file_id = file-Bv5gP2lAmxLL9rRtdaQXixHF
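
　　The script above writes only "yes" completions. For step 4 of the flow (assembling generated questions and answers into a fine-tuning dataset), the pairing could look roughly like the sketch below. It assumes the model returned both questions and answers as numbered lists ("1. ...", "2. ...") in matching order, which is not guaranteed. The helpers `split_numbered`, `to_records` and `to_jsonl` are hypothetical names introduced for illustration.

```python
import json
import re

SEPARATOR = "\n\n###\n\n"  # the fixed end marker described in the text


def split_numbered(text):
    """Split '1. foo\\n2. bar' into ['foo', 'bar'] (numbers at line starts only)."""
    items = re.split(r"^\s*\d+[.、]\s*", text, flags=re.MULTILINE)
    return [item.strip() for item in items if item.strip()]


def to_records(questions, answers):
    """Pair each question with the answer at the same position."""
    qs, ans = split_numbered(questions), split_numbered(answers)
    # zip() silently drops trailing items that have no partner
    return [
        {"prompt": q + SEPARATOR, "completion": " " + a + SEPARATOR}
        for q, a in zip(qs, ans)
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    records = to_records("1. What is A?\n2. What is B?", "1. A is x.\n2. B is y.")
    print(to_jsonl(records))
```

　　An answer line that itself begins with a number followed by a period would be split incorrectly, which is one more reason to have domain experts review the generated pairs.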

　　3. Creating a new model from the generated dataset.

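　　The official demo also generates validation and test sets, and adds adversarial examples by generating similar text for the same questions and answers. Since the final results were not good, I will not go into those here. The demo also uses the search module, which has since been retired, so I imitated it with the prompt-completion data structure; I do not know whether that had any effect. Creating the model with the openai CLI tool gives some interactive feedback and makes it easy to see progress and cost, so that is what I use for the demonstration here. After it has run for a while, the newly created model shows up in "openai.Model.list()". At the time I had roughly 1,000 questions and answers, and the job cost $0.78. (I tried this on April 13. Because the results were poor, I left it alone for more than half a month. Time really does fly.)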

openai api fine_tunes.create -t "discriminator_train.jsonl" -v "discriminator_test.jsonl" --batch_size 16  --compute_classification_metrics --classification_positive_class yes --model ada --suffix 'discriminator'

Uploaded file from discriminator_train.jsonl: file-5OeHx3bMDqk******
Uploaded file from discriminator_test.jsonl: file-AnOiDwG1Oqv3Jh******
Created fine-tune: ft-cQBMLPzqVNml1ZWqkGYQKUdO
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-04-13 23:17:05] Created fine-tune: ft-cQBMLPz********
[2023-04-13 23:17:22] Fine-tune costs $0.78
[2023-04-13 23:17:23] Fine-tune enqueued. Queue number: 3

　　In the end the results were disappointing. After some more attempts, I noticed this note in the docs:

　　"Note: To answer questions based on text documents, we recommend the procedure in Question Answering using Embeddings. Some of the code below may rely on deprecated API endpoints." So I used the free time over the May Day holiday to start trying the embedding approach.

Embedding may be the best option right now

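　　GPT is good at answering questions that are covered by its training data. For less common topics, or for a company's internal corpus, the relevant information can be placed in the context and passed to GPT, which then answers based on that context. Because each model has a token limit, and because tokens themselves cost money, we cannot pass the whole corpus every time and have to select only the most relevant parts.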

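　　In practice, we split the text into chunks and embed each chunk, which turns it into a vector. When a question arrives, we embed it the same way, find the closest chunks, and pass them to GPT as context. The official notebook describes it as follows: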

　　Specifically, this notebook demonstrates the following procedure:

Prepare search data (once)
1. Collect: We'll download a few hundred Wikipedia articles about the 2022 Olympics
2. Chunk: Documents are split into short, mostly self-contained sections to be embedded
3. Embed: Each section is embedded with the OpenAI API
4. Store: Embeddings are saved (for large datasets, use a vector database)
Search (once per query)
1. Given a user question, generate an embedding for the query from the OpenAI API
2. Using the embeddings, rank the text sections by relevance to the query
Ask (once per query)
1. Insert the question and the most relevant sections into a message to GPT
2. Return GPT's answer
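
　　The Search and Ask steps above can be sketched in a few lines. To stay runnable without an API key, the sketch below replaces the OpenAI embedding call with a toy character-bigram counter (`toy_embed`, a hypothetical stand-in that is far weaker than a real embedding model); only the ranking by cosine similarity and the prompt assembly are meant to carry over. All function names and the prompt wording are assumptions for illustration.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding API: character-bigram counts."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(question, chunks, k=1):
    """Rank chunks by similarity to the question and return the best k."""
    q_vec = toy_embed(question)
    # A real system would embed the chunks once and store the vectors
    scored = [(cosine_similarity(q_vec, toy_embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def build_prompt(question, chunks, k=1):
    """Insert the most relevant chunks and the question into one prompt."""
    context = "\n\n".join(top_chunks(question, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


if __name__ == "__main__":
    docs = [
        "Parental leave policy: employees may take parental leave each year.",
        "Belly bands after delivery: ask the nurse before using a belly band.",
    ]
    print(build_prompt("When can I use a belly band?", docs))
```

　　The resulting prompt is what would be sent to GPT in the Ask step. The value of `k` and the chunk size together determine how many tokens each question consumes.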

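　　At first I planned to write the code based on the demo Question_answering_using_embeddings.ipynb. Then I happened upon an implementation using llama_index, which places no requirements on the format of the corpus, so I borrowed it. Thanks to the code's contributor for saving everyone a lot of time.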

#!/usr/bin/env python
# coding=utf-8

# Written against early-2023 releases of llama_index and langchain;
# later versions renamed or removed several of these classes.
from langchain import OpenAI
from llama_index import SimpleDirectoryReader, GPTSimpleVectorIndex, PromptHelper
from llama_index import LLMPredictor, ServiceContext
import gradio as gr
import os
os.environ["OPENAI_API_KEY"] = 'sk-fHstI********************'

#MODEL_NAME = "text-davinci-003"
MODEL_NAME = "ada:ft-primecare:*************"

# Use the same path for saving and loading the index
INDEX_PATH = 'data/index.json'


def construct_index(directory_path):
    max_input_size = 2048
    num_outputs = 512
    max_chunk_overlap = 20
    chunk_size_limit = 600
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.7, model_name=MODEL_NAME, max_tokens=num_outputs))
    documents = SimpleDirectoryReader(directory_path).load_data()
    #index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)

    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)
    index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

    os.makedirs(os.path.dirname(INDEX_PATH), exist_ok=True)
    index.save_to_disk(INDEX_PATH)
    return index


def chatbot(input_text):
    # The index is reloaded from disk for every question
    index = GPTSimpleVectorIndex.load_from_disk(INDEX_PATH)
    response = index.query(input_text, response_mode="compact")
    return response.response


if __name__ == '__main__':

    iface = gr.Interface(fn=chatbot, inputs=gr.inputs.Textbox(lines=7, label="Enter your question"), outputs="text", title="Nursing Assistant Bot")
    ## Run once to build the index from the files in the docs folder
    ##index = construct_index("docs")
    iface.launch(share=True, server_name='0.0.0.0', server_port=8012)

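　　I used gradio for the demo. The bot can basically answer from our internal training material. The one drawback is that a reply usually takes more than ten seconds, but that is still a big improvement over the earlier fine-tuning attempt. At last, some relief from half a month of frustration. (One small regret: I tried several times to change the "output" label in the gradio UI to custom text and never succeeded.)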

The bumpy road to deployment

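　　As Mencius said, enjoying something alone is not as good as enjoying it with others. Getting my colleagues to try the bot was the start of another hard task. The nursing experts also needed to review the quality of the replies and work out how to improve the source text. I had assumed deployment would be simple, but for a Python beginner like me every step was a struggle.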

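　　At first I thought I could package it with pyinstaller and run it directly on the server. But after a long time trying pyinstaller -F and -D, the dependencies still would not bundle. I also tried --hidden-import and a .spec file, and neither worked, so I gave up on that.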

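　　By 12:30 a.m. I had made no progress, so I simply put the source code on the server. Then the install failed because the specified version of langchain could not be installed. So I upgraded pip to the latest version and Python to 3.10.4 (the same as on my local machine).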

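　　After the Python upgrade, I got ModuleNotFoundError: No module named '_bz2'. At least the error message had changed. Roughly, the Python version that came with the system has the _bz2 module and the newly installed 3.10 does not. The fix I found everywhere is to copy that file into the new version: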

mv _bz2.cpython-36m-x86_64-linux-gnu.so /usr/local/python/lib/python3.10/lib-dynload/_bz2.cpython-310-x86_64-linux-gnu.so

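　　I ran it again and it finally started. What a relief. I configured the firewall and the Tencent Cloud security group, typed the external IP followed by :8012, and hit Enter with a flourish: still unreachable. To borrow Chairman Mao's words for how I felt at that moment: it is like a ship far out at sea whose masthead can already be seen from the shore; it is like the morning sun in the east whose shimmering rays are visible from a high mountain top; it is like a child about to be born, moving restlessly in its mother's womb. Besides, it really was very late, so I went to bed with my mind at ease.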

/usr/local/python/lib/python3.10/site-packages/gradio/inputs.py:27: UserWarning: Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your component from gradio.components
  warnings.warn(
/usr/local/python/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: `optional` parameter is deprecated, and it has no effect
  warnings.warn(value)
/usr/local/python/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: `numeric` parameter is deprecated, and it has no effect
  warnings.warn(value)
Running on local URL:  http://127.0.0.1:8012
Running on public URL: https://11d5*****.gradio.live

　　The next day I found that Interface.launch in gradio has a server_name parameter; setting server_name='0.0.0.0' makes the app reachable by IP. Running ss -tnlp | grep ":8012" also shows the listening address change from "127.0.0.1:8012" to "0.0.0.0:8012".

LISTEN 0      128          0.0.0.0:8012      0.0.0.0:*    users:(("python",pid=2801254,fd=7))
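
　　The difference between the two listening addresses can be reproduced with a plain socket, independent of gradio. The sketch below binds a TCP socket to each address on a port chosen by the OS and reports what it is bound to; `bound_address` is a hypothetical helper name.

```python
import socket


def bound_address(host):
    """Bind a TCP socket to `host` on an OS-chosen free port; return the bound host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick a free port
        sock.listen(1)
        return sock.getsockname()[0]


if __name__ == "__main__":
    # Loopback only: accepts connections from the same machine
    print(bound_address("127.0.0.1"))
    # All IPv4 interfaces: also accepts connections arriving via the external IP
    print(bound_address("0.0.0.0"))
```

　　A server bound to 127.0.0.1 never sees traffic from other hosts, no matter how the firewall or security group is configured, which matches the behaviour described above.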

Looking ahead

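　　Based on testing so far, each question costs around 10 US cents, which is still fairly high. One direction for optimization is the chunk size: too small and a chunk cannot hold enough context, too large and the cost goes up. Looking back at fine-tuning, its upfront training cost should be higher and its per-answer cost lower, but for now the training results are poor, and other articles report the same problem. As things stand, embedding is probably the more practical way to put this into production.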
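
　　The trade-off between chunk size and cost can be made concrete with a rough estimate. The sketch below assumes a single per-1K-token price for both prompt and completion tokens and ignores the prompt template and the embedding call, so it is a lower bound at best. The price used in the example is a placeholder, not a quoted rate, and `estimate_cost_usd` is a hypothetical helper.

```python
def estimate_cost_usd(chunk_tokens, num_chunks, question_tokens,
                      answer_tokens, price_per_1k_tokens):
    """Rough cost of one question: context chunks + question + answer."""
    prompt_tokens = chunk_tokens * num_chunks + question_tokens
    total_tokens = prompt_tokens + answer_tokens
    return total_tokens / 1000 * price_per_1k_tokens


if __name__ == "__main__":
    # 600 and 512 mirror chunk_size_limit and num_outputs from the listing;
    # the question length and the price are placeholder assumptions.
    for chunks in (1, 3, 5):
        cost = estimate_cost_usd(600, chunks, 50, 512, price_per_1k_tokens=0.02)
        print(f"{chunks} chunk(s): about ${cost:.4f} per question")
```

　　The estimate grows linearly with both the chunk size and the number of chunks sent per question, which is why tuning those two values is the obvious first step.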

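　　Next I will watch how it gets used. If it works well, I am considering adding speech-to-text for the questions, having GPT write the answer, and even reading the answer aloud with text-to-speech, which could make the nurses' work more convenient and more fun.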

　　Become an excellent programmer!
