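Chinese relation extraction based on the paper SpERT: "Span-based Entity and Relation Transformer". The model jointly extracts entities, entity types, and relation types.
Original paper: https://arxiv.org/abs/1909.07755 (published at ECAI 2020)
Original code: https://github.com/lavis-nlp/spert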
pip install transformers==4.1.1
pip install tensorboardX
pip install tqdm
pip install jinja2
pip install spacy==3.3.1
In addition, download https://github.com/explosion/spacy-models/releases/download/zh_core_web_sm-3.3.0/zh_core_web_sm-3.3.0.tar.gz and run: pip install zh_core_web_sm-3.3.0.tar.gz
You also need to download chinese-bert-wwm-ext from Hugging Face into model_hub/chinese-bert-wwm-ext/.
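Before training, it can help to confirm that the checkpoint was downloaded into the expected directory. The following is a minimal sketch, not part of the repository: the helper name `missing_model_files` is hypothetical, and the file names are an assumption based on the usual layout of a PyTorch BERT checkpoint on Hugging Face.

```python
from pathlib import Path

# Assumed file names for a PyTorch BERT checkpoint in the Hugging Face layout.
REQUIRED_FILES = ("config.json", "vocab.txt", "pytorch_model.bin")


def missing_model_files(model_dir):
    """Return the expected checkpoint files that are missing from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    problems = missing_model_files("model_hub/chinese-bert-wwm-ext")
    print("OK" if not problems else f"Missing: {problems}")
```

An empty list means every expected file was found; otherwise the missing names are returned in the order listed in `REQUIRED_FILES`.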
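The data used here is the information extraction dataset from 千言 (LUGE), a comprehensive collection of open-source Chinese datasets, where it can be downloaded. Download and extract it to obtain duie_train.json, duie_dev.json, and duie_schema.json, place them under data/duie/, then run process.py in that directory to generate:
train.json # training set
dev.json # validation set; test.json can be generated as well if a test set is available
duie_prediction_example.json # prediction samples
duie_types.json # the entity types and relation types
entity_types.txt # not used by the code; for manual inspection only
relation_types.txt # not used by the code; for manual inspection only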
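The conversion step can be pictured with the sketch below. It is not the repository's process.py. It assumes each raw record has a `text` field and a `spo_list` whose items carry `subject`, `subject_type`, `predicate`, and `object` / `object_type` dictionaries keyed by `@value`, as in the public DuIE 2.0 release. The helper names are hypothetical. Tokens are single characters, span ends are exclusive, and only the first occurrence of each mention is matched.

```python
import json


def find_span(text, mention):
    """Return (start, end) character offsets of mention in text (end exclusive), or None."""
    if not mention:
        return None
    start = text.find(mention)  # first occurrence only
    if start < 0:
        return None
    return start, start + len(mention)


def convert_example(raw):
    """Convert one DuIE-style record (assumed layout) into tokens/entities/relations."""
    text = raw["text"]
    entities, relations, index_of = [], [], {}

    def add_entity(mention, etype):
        span = find_span(text, mention)
        if span is None:
            return None
        key = (span, etype)
        if key not in index_of:  # reuse an entity that was already added
            index_of[key] = len(entities)
            entities.append({"type": etype, "start": span[0], "end": span[1]})
        return index_of[key]

    for spo in raw.get("spo_list", []):
        head = add_entity(spo["subject"], spo["subject_type"])
        tail = add_entity(spo["object"]["@value"], spo["object_type"]["@value"])
        if head is not None and tail is not None:
            relations.append({"type": spo["predicate"], "head": head, "tail": tail})

    return {"tokens": list(text), "entities": entities, "relations": relations}


if __name__ == "__main__":
    raw = {
        "text": "《廢物小說》是新片場出品",
        "spo_list": [{
            "predicate": "出品公司",
            "subject": "廢物小說", "subject_type": "影視作品",
            "object": {"@value": "新片場"}, "object_type": {"@value": "企業"},
        }],
    }
    print(json.dumps(convert_example(raw), ensure_ascii=False))
```

Relations whose subject or object cannot be located in the text are skipped rather than guessed.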
The data in train.json and dev.json has the following format:
[
{"tokens": ["這", "件", "婚", "事", "原", "本", "與", "陳", "國", "峻", "無", "關", ",", "但", "陳", "國", "峻", "卻", "「", "欲", "求", "配", "而", "無", "由", ",", "夜", "間", "乃", "潛", "入", "天", "城", "公", "主", "所", "居", "通", "之"], "entities": [{"type": "人物", "start": 8, "end": 10}, {"type": "人物", "start": 31, "end": 35}], "relations": [{"type": "丈夫", "tail": 0, "head": 1}, {"type": "妻子", "head": 0, "tail": 1}]},
......
]
Note that head and tail in relations are indices into that document's entities list.
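The indexing can be made concrete with a small sketch that resolves each relation back to the text of its two entities. The function name `describe_relations` is hypothetical. The `end` offset is treated as exclusive, which matches the 天城公主 span in the sample above (start 31, end 35).

```python
def describe_relations(doc):
    """Return (head_text, relation_type, tail_text) triples for one document."""
    tokens = doc["tokens"]

    def span_text(entity):
        # `end` is exclusive, so a plain slice recovers the entity's characters.
        return "".join(tokens[entity["start"]:entity["end"]])

    triples = []
    for rel in doc["relations"]:
        head = doc["entities"][rel["head"]]  # index into the entities list
        tail = doc["entities"][rel["tail"]]
        triples.append((span_text(head), rel["type"], span_text(tail)))
    return triples


if __name__ == "__main__":
    doc = {
        "tokens": list("《廢物小說》是新片場出品"),
        "entities": [
            {"type": "影視作品", "start": 1, "end": 5},
            {"type": "企業", "start": 7, "end": 10},
        ],
        "relations": [{"type": "出品公司", "head": 0, "tail": 1}],
    }
    print(describe_relations(doc))
```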
duie_types.json has the following format:
{"entities": {"行政區": {"short": "行政區", "verbose": "行政區"}, "人物": {"short": "人物", "verbose": "人物"}, "氣候": {"short": "氣候", "verbose": "氣候"}, "文學作品": {"short": "文學作品", "verbose": "文學作品"}, "Text": {"short": "Text", "verbose": "Text"}, "學科專業": {"short": "學科專業", "verbose": "學科專業"}, "作品": {"short": "作品", "verbose": "作品"}, "獎項": {"short": "獎項", "verbose": "獎項"}, "國家": {"short": "國家", "verbose": "國家"}, "電視綜藝": {"short": "電視綜藝", "verbose": "電視綜藝"}, "影視作品": {"short": "影視作品", "verbose": "影視作品"}, "企業": {"short": "企業", "verbose": "企業"}, "語言": {"short": "語言", "verbose": "語言"}, "歌曲": {"short": "歌曲", "verbose": "歌曲"}, "Date": {"short": "Date", "verbose": "Date"}, "企業/品牌": {"short": "企業/品牌", "verbose": "企業/品牌"}, "地點": {"short": "地點", "verbose": "地點"}, "Number": {"short": "Number", "verbose": "Number"}, "圖書作品": {"short": "圖書作品", "verbose": "圖書作品"}, "景點": {"short": "景點", "verbose": "景點"}, "城市": {"short": "城市", "verbose": "城市"}, "學校": {"short": "學校", "verbose": "學校"}, "音樂專輯": {"short": "音樂專輯", "verbose": "音樂專輯"}, "機構": {"short": "機構", "verbose": "機構"}},
"relations": {"編劇": {"short": "編劇", "verbose": "編劇", "symmetric": false}, "修業年限": {"short": "修業年限", "verbose": "修業年限", "symmetric": false}, "畢業院校": {"short": "畢業院校", "verbose": "畢業院校", "symmetric": false}, "氣候": {"short": "氣候", "verbose": "氣候", "symmetric": false}, "配音": {"short": "配音", "verbose": "配音", "symmetric": false}, "註冊資本": {"short": "註冊資本", "verbose": "註冊資本", "symmetric": false}, "成立日期": {"short": "成立日期", "verbose": "成立日期", "symmetric": false}, "父親": {"short": "父親", "verbose": "父親", "symmetric": false}, "面積": {"short": "面積", "verbose": "面積", "symmetric": false}, "專業程式碼": {"short": "專業程式碼", "verbose": "專業程式碼", "symmetric": false}, "作者": {"short": "作者", "verbose": "作者", "symmetric": false}, "首都": {"short": "首都", "verbose": "首都", "symmetric": false}, "丈夫": {"short": "丈夫", "verbose": "丈夫", "symmetric": false}, "嘉賓": {"short": "嘉賓", "verbose": "嘉賓", "symmetric": false}, "官方語言": {"short": "官方語言", "verbose": "官方語言", "symmetric": false}, "作曲": {"short": "作曲", "verbose": "作曲", "symmetric": false}, "號": {"short": "號", "verbose": "號", "symmetric": false}, "票房": {"short": "票房", "verbose": "票房", "symmetric": false}, "簡稱": {"short": "簡稱", "verbose": "簡稱", "symmetric": false}, "母親": {"short": "母親", "verbose": "母親", "symmetric": false}, "製片人": {"short": "製片人", "verbose": "製片人", "symmetric": false}, "導演": {"short": "導演", "verbose": "導演", "symmetric": false}, "歌手": {"short": "歌手", "verbose": "歌手", "symmetric": false}, "改編自": {"short": "改編自", "verbose": "改編自", "symmetric": false}, "海拔": {"short": "海拔", "verbose": "海拔", "symmetric": false}, "佔地面積": {"short": "佔地面積", "verbose": "佔地面積", "symmetric": false}, "出品公司": {"short": "出品公司", "verbose": "出品公司", "symmetric": false}, "上映時間": {"short": "上映時間", "verbose": "上映時間", "symmetric": false}, "所在城市": {"short": "所在城市", "verbose": "所在城市", "symmetric": false}, "主持人": {"short": "主持人", "verbose": "主持人", "symmetric": false}, "作詞": {"short": "作詞", "verbose": "作詞", "symmetric": false}, "人口數量": {"short": "人口數量", "verbose": "人口數量", 
"symmetric": false}, "祖籍": {"short": "祖籍", "verbose": "祖籍", "symmetric": false}, "校長": {"short": "校長", "verbose": "校長", "symmetric": false}, "朝代": {"short": "朝代", "verbose": "朝代", "symmetric": false}, "主題曲": {"short": "主題曲", "verbose": "主題曲", "symmetric": false}, "獲獎": {"short": "獲獎", "verbose": "獲獎", "symmetric": false}, "代言人": {"short": "代言人", "verbose": "代言人", "symmetric": false}, "主演": {"short": "主演", "verbose": "主演", "symmetric": false}, "所屬專輯": {"short": "所屬專輯", "verbose": "所屬專輯", "symmetric": false}, "飾演": {"short": "飾演", "verbose": "飾演", "symmetric": false}, "董事長": {"short": "董事長", "verbose": "董事長", "symmetric": false}, "主角": {"short": "主角", "verbose": "主角", "symmetric": false}, "妻子": {"short": "妻子", "verbose": "妻子", "symmetric": false}, "總部地點": {"short": "總部地點", "verbose": "總部地點", "symmetric": false}, "國籍": {"short": "國籍", "verbose": "國籍", "symmetric": false}, "創始人": {"short": "創始人", "verbose": "創始人", "symmetric": false}, "郵政編碼": {"short": "郵政編碼", "verbose": "郵政編碼", "symmetric": false}}}
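(1) Train on the DuIE training set and evaluate on the validation set. Note that only 10,000 examples from the training set and 10,000 examples from the validation set were used here, and the model was trained for just 1 epoch.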
python ./spert.py train --config configs/duie_train.conf
--------------------------------------------------
Config:
{'label': 'duie_train', 'model_type': 'spert', 'model_path': 'model_hub/chinese-bert-wwm-ext', 'tokenizer_path': 'model_hub/chinese-bert-wwm-ext', 'train_path': 'data/duie/train.json', 'valid_path': 'data/duie/dev.json', 'types_path': 'data/duie/duie_types.json', 'train_batch_size': '2', 'eval_batch_size': '1', 'neg_entity_count': '100', 'neg_relation_count': '100', 'epochs': '1', 'lr': '5e-5', 'lr_warmup': '0.1', 'weight_decay': '0.01', 'max_grad_norm': '1.0', 'rel_filter_threshold': '0.4', 'size_embedding': '25', 'prop_drop': '0.1', 'max_span_size': '20', 'store_predictions': 'true', 'store_examples': 'true', 'sampling_processes': '2', 'max_pairs': '1000', 'final_eval': 'true', 'log_path': 'data/log/', 'save_path': 'data/save/'}
Repeat 1 times
--------------------------------------------------
Iteration 0
--------------------------------------------------
2022-11-17 06:48:16,488 [MainThread ] [INFO ] Datasets: data/duie/train.json, data/duie/dev.json
2022-11-17 06:48:16,489 [MainThread ] [INFO ] Model type: spert
Parse dataset 'train': 100% 10000/10000 [00:52<00:00, 189.61it/s]
<spert.entities.Dataset object at 0x7f24c8c19550>
Parse dataset 'valid': 100% 10000/10000 [00:52<00:00, 191.25it/s]
<spert.entities.Dataset object at 0x7f24c8c19250>
2022-11-17 06:50:02,108 [MainThread ] [INFO ] Relation type count: 49
2022-11-17 06:50:02,108 [MainThread ] [INFO ] Entity type count: 25
2022-11-17 06:50:02,108 [MainThread ] [INFO ] Entities:
2022-11-17 06:50:02,108 [MainThread ] [INFO ] No Entity=0
2022-11-17 06:50:02,108 [MainThread ] [INFO ] 行政區=1
2022-11-17 06:50:02,109 [MainThread ] [INFO ] 人物=2
2022-11-17 06:50:02,109 [MainThread ] [INFO ] 氣候=3
2022-11-17 06:50:02,109 [MainThread ] [INFO ] 文學作品=4
2022-11-17 06:50:02,109 [MainThread ] [INFO ] Text=5
2022-11-17 06:50:02,109 [MainThread ] [INFO ] 學科專業=6
2022-11-17 06:50:02,109 [MainThread ] [INFO ] 作品=7
2022-11-17 06:50:02,109 [MainThread ] [INFO ] 獎項=8
2022-11-17 06:50:02,109 [MainThread ] [INFO ] 國家=9
2022-11-17 06:50:02,109 [MainThread ] [INFO ] 電視綜藝=10
2022-11-17 06:50:02,110 [MainThread ] [INFO ] 影視作品=11
2022-11-17 06:50:02,110 [MainThread ] [INFO ] 企業=12
2022-11-17 06:50:02,110 [MainThread ] [INFO ] 語言=13
2022-11-17 06:50:02,110 [MainThread ] [INFO ] 歌曲=14
2022-11-17 06:50:02,110 [MainThread ] [INFO ] Date=15
2022-11-17 06:50:02,110 [MainThread ] [INFO ] 企業/品牌=16
2022-11-17 06:50:02,110 [MainThread ] [INFO ] 地點=17
2022-11-17 06:50:02,110 [MainThread ] [INFO ] Number=18
2022-11-17 06:50:02,111 [MainThread ] [INFO ] 圖書作品=19
2022-11-17 06:50:02,111 [MainThread ] [INFO ] 景點=20
2022-11-17 06:50:02,111 [MainThread ] [INFO ] 城市=21
2022-11-17 06:50:02,111 [MainThread ] [INFO ] 學校=22
2022-11-17 06:50:02,111 [MainThread ] [INFO ] 音樂專輯=23
2022-11-17 06:50:02,111 [MainThread ] [INFO ] 機構=24
2022-11-17 06:50:02,111 [MainThread ] [INFO ] Relations:
2022-11-17 06:50:02,111 [MainThread ] [INFO ] No Relation=0
2022-11-17 06:50:02,112 [MainThread ] [INFO ] 編劇=1
2022-11-17 06:50:02,112 [MainThread ] [INFO ] 修業年限=2
2022-11-17 06:50:02,112 [MainThread ] [INFO ] 畢業院校=3
2022-11-17 06:50:02,112 [MainThread ] [INFO ] 氣候=4
2022-11-17 06:50:02,112 [MainThread ] [INFO ] 配音=5
2022-11-17 06:50:02,112 [MainThread ] [INFO ] 註冊資本=6
2022-11-17 06:50:02,112 [MainThread ] [INFO ] 成立日期=7
2022-11-17 06:50:02,112 [MainThread ] [INFO ] 父親=8
2022-11-17 06:50:02,113 [MainThread ] [INFO ] 面積=9
2022-11-17 06:50:02,113 [MainThread ] [INFO ] 專業程式碼=10
2022-11-17 06:50:02,113 [MainThread ] [INFO ] 作者=11
2022-11-17 06:50:02,113 [MainThread ] [INFO ] 首都=12
2022-11-17 06:50:02,113 [MainThread ] [INFO ] 丈夫=13
2022-11-17 06:50:02,113 [MainThread ] [INFO ] 嘉賓=14
2022-11-17 06:50:02,113 [MainThread ] [INFO ] 官方語言=15
2022-11-17 06:50:02,113 [MainThread ] [INFO ] 作曲=16
2022-11-17 06:50:02,113 [MainThread ] [INFO ] 號=17
2022-11-17 06:50:02,114 [MainThread ] [INFO ] 票房=18
2022-11-17 06:50:02,114 [MainThread ] [INFO ] 簡稱=19
2022-11-17 06:50:02,114 [MainThread ] [INFO ] 母親=20
2022-11-17 06:50:02,114 [MainThread ] [INFO ] 製片人=21
2022-11-17 06:50:02,114 [MainThread ] [INFO ] 導演=22
2022-11-17 06:50:02,114 [MainThread ] [INFO ] 歌手=23
2022-11-17 06:50:02,114 [MainThread ] [INFO ] 改編自=24
2022-11-17 06:50:02,114 [MainThread ] [INFO ] 海拔=25
2022-11-17 06:50:02,114 [MainThread ] [INFO ] 佔地面積=26
2022-11-17 06:50:02,115 [MainThread ] [INFO ] 出品公司=27
2022-11-17 06:50:02,115 [MainThread ] [INFO ] 上映時間=28
2022-11-17 06:50:02,115 [MainThread ] [INFO ] 所在城市=29
2022-11-17 06:50:02,115 [MainThread ] [INFO ] 主持人=30
2022-11-17 06:50:02,115 [MainThread ] [INFO ] 作詞=31
2022-11-17 06:50:02,115 [MainThread ] [INFO ] 人口數量=32
2022-11-17 06:50:02,115 [MainThread ] [INFO ] 祖籍=33
2022-11-17 06:50:02,115 [MainThread ] [INFO ] 校長=34
2022-11-17 06:50:02,116 [MainThread ] [INFO ] 朝代=35
2022-11-17 06:50:02,116 [MainThread ] [INFO ] 主題曲=36
2022-11-17 06:50:02,116 [MainThread ] [INFO ] 獲獎=37
2022-11-17 06:50:02,116 [MainThread ] [INFO ] 代言人=38
2022-11-17 06:50:02,116 [MainThread ] [INFO ] 主演=39
2022-11-17 06:50:02,116 [MainThread ] [INFO ] 所屬專輯=40
2022-11-17 06:50:02,116 [MainThread ] [INFO ] 飾演=41
2022-11-17 06:50:02,116 [MainThread ] [INFO ] 董事長=42
2022-11-17 06:50:02,117 [MainThread ] [INFO ] 主角=43
2022-11-17 06:50:02,117 [MainThread ] [INFO ] 妻子=44
2022-11-17 06:50:02,117 [MainThread ] [INFO ] 總部地點=45
2022-11-17 06:50:02,117 [MainThread ] [INFO ] 國籍=46
2022-11-17 06:50:02,117 [MainThread ] [INFO ] 創始人=47
2022-11-17 06:50:02,117 [MainThread ] [INFO ] 郵政編碼=48
2022-11-17 06:50:02,117 [MainThread ] [INFO ] Dataset: train
2022-11-17 06:50:02,117 [MainThread ] [INFO ] Document count: 10000
2022-11-17 06:50:02,118 [MainThread ] [INFO ] Relation count: 18119
2022-11-17 06:50:02,118 [MainThread ] [INFO ] Entity count: 28033
2022-11-17 06:50:02,118 [MainThread ] [INFO ] Dataset: valid
2022-11-17 06:50:02,118 [MainThread ] [INFO ] Document count: 10000
2022-11-17 06:50:02,118 [MainThread ] [INFO ] Relation count: 18223
2022-11-17 06:50:02,118 [MainThread ] [INFO ] Entity count: 28071
2022-11-17 06:50:02,118 [MainThread ] [INFO ] Updates per epoch: 5000
2022-11-17 06:50:02,118 [MainThread ] [INFO ] Updates total: 5000
Some weights of the model checkpoint at model_hub/chinese-bert-wwm-ext were not used when initializing SpERT: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing SpERT from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing SpERT from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of SpERT were not initialized from the model checkpoint at model_hub/chinese-bert-wwm-ext and are newly initialized: ['rel_classifier.weight', 'rel_classifier.bias', 'entity_classifier.weight', 'entity_classifier.bias', 'size_embeddings.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2022-11-17 06:50:07,261 [MainThread ] [INFO ] Train epoch: 0
Train epoch 0: 100% 5000/5000 [09:01<00:00, 9.24it/s]
2022-11-17 06:59:08,476 [MainThread ] [INFO ] Evaluate: valid
Evaluate epoch 1: 0% 0/10000 [00:00<?, ?it/s]/content/drive/MyDrive/spert/spert/prediction.py:84: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
valid_rel_indices = rel_nonzero // rel_class_count
Evaluate epoch 1: 100% 10000/10000 [06:36<00:00, 25.20it/s]
Evaluation
--- Entities (named entity recognition (NER)) ---
An entity is considered correct if the entity type and span is predicted correctly
type precision recall f1-score support
語言 0.00 0.00 0.00 9
行政區 41.29 87.37 56.08 95
電視綜藝 43.94 81.69 57.14 355
獎項 20.90 74.87 32.68 199
Text 42.69 78.23 55.23 634
學校 47.59 93.20 63.01 647
氣候 69.64 79.59 74.29 49
Number 29.01 96.58 44.62 292
歌曲 54.55 87.14 67.10 1617
地點 26.25 57.58 36.06 264
影視作品 57.05 92.34 70.53 2704
城市 62.79 46.55 53.47 58
人物 60.93 95.98 74.54 14283
音樂專輯 54.75 79.34 64.79 334
文學作品 35.14 13.27 19.26 98
Date 47.23 97.15 63.56 1193
企業/品牌 26.88 46.30 34.01 54
作品 0.00 0.00 0.00 22
企業 35.62 73.86 48.07 1144
圖書作品 64.91 87.12 74.39 1724
機構 39.45 79.37 52.70 1076
學科專業 0.00 0.00 0.00 2
景點 25.00 3.23 5.71 31
國家 29.92 93.28 45.31 640
micro 53.15 90.82 67.06 27524
macro 38.15 64.33 45.52 27524
--- Relations ---
Without named entity classification (NEC)
A relation is considered correct if the relation type and the spans of the two related entities are predicted correctly (entity type is not considered)
type precision recall f1-score support
成立日期 19.31 88.94 31.74 868
註冊資本 9.57 87.50 17.25 56
主角 15.45 15.18 15.32 112
飾演 40.00 9.74 15.67 308
祖籍 20.98 73.17 32.61 82
作曲 22.67 59.92 32.90 484
編劇 47.27 7.22 12.53 360
修業年限 0.00 0.00 0.00 1
妻子 24.99 57.30 34.80 747
改編自 0.00 0.00 0.00 34
佔地面積 20.69 29.27 24.24 41
主演 33.06 90.21 48.39 2574
氣候 39.33 70.00 50.36 50
父親 15.13 67.36 24.71 916
朝代 11.67 75.84 20.23 356
歌手 23.50 81.08 36.44 1221
導演 32.93 84.82 47.44 1179
面積 7.14 73.53 13.02 34
所在城市 3.12 3.23 3.17 31
海拔 57.14 66.67 61.54 24
票房 4.13 94.83 7.91 116
主持人 27.25 73.46 39.75 260
代言人 10.97 45.61 17.69 57
嘉賓 19.13 51.17 27.84 342
專業程式碼 0.00 0.00 0.00 1
創始人 19.10 46.22 27.03 119
所屬專輯 33.30 81.21 47.23 431
人口數量 16.07 40.91 23.08 22
製片人 0.00 0.00 0.00 97
作者 35.77 83.67 50.11 1837
董事長 14.06 84.77 24.12 440
配音 8.77 46.35 14.74 233
作詞 32.24 67.88 43.72 520
上映時間 12.87 92.70 22.60 356
畢業院校 31.41 91.05 46.71 503
獲獎 3.66 71.14 6.96 201
官方語言 0.00 0.00 0.00 9
丈夫 24.59 55.96 34.16 747
郵政編碼 0.00 0.00 0.00 1
首都 80.00 14.81 25.00 27
主題曲 19.35 64.17 29.74 187
號 34.08 79.17 47.65 96
母親 14.44 36.99 20.77 519
簡稱 13.24 65.40 22.02 237
校長 16.77 93.92 28.45 148
總部地點 5.51 49.38 9.92 160
出品公司 18.49 77.78 29.87 405
國籍 11.03 87.44 19.59 661
micro 19.89 72.78 31.25 18210
macro 19.80 54.94 24.77 18210
With named entity classification (NEC)
A relation is considered correct if the relation type and the two related entities are predicted correctly (in span and entity type)
type precision recall f1-score support
成立日期 17.54 80.76 28.82 868
註冊資本 8.20 75.00 14.79 56
主角 6.36 6.25 6.31 112
飾演 40.00 9.74 15.67 308
祖籍 20.98 73.17 32.61 82
作曲 22.67 59.92 32.90 484
編劇 47.27 7.22 12.53 360
修業年限 0.00 0.00 0.00 1
妻子 24.99 57.30 34.80 747
改編自 0.00 0.00 0.00 34
佔地面積 20.69 29.27 24.24 41
主演 33.04 90.17 48.36 2574
氣候 39.33 70.00 50.36 50
父親 15.13 67.36 24.71 916
朝代 11.50 74.72 19.93 356
歌手 22.51 77.64 34.90 1221
導演 32.86 84.65 47.34 1179
面積 7.14 73.53 13.02 34
所在城市 0.00 0.00 0.00 31
海拔 14.29 16.67 15.38 24
票房 4.13 94.83 7.91 116
主持人 27.10 73.08 39.54 260
代言人 9.70 40.35 15.65 57
嘉賓 19.02 50.88 27.68 342
專業程式碼 0.00 0.00 0.00 1
創始人 10.42 25.21 14.74 119
所屬專輯 26.93 65.66 38.19 431
人口數量 16.07 40.91 23.08 22
製片人 0.00 0.00 0.00 97
作者 35.19 82.31 49.30 1837
董事長 14.02 84.55 24.05 440
配音 8.77 46.35 14.74 233
作詞 32.24 67.88 43.72 520
上映時間 12.16 87.64 21.36 356
畢業院校 31.41 91.05 46.71 503
獲獎 3.64 70.65 6.92 201
官方語言 0.00 0.00 0.00 9
丈夫 24.59 55.96 34.16 747
郵政編碼 0.00 0.00 0.00 1
首都 80.00 14.81 25.00 27
主題曲 19.19 63.64 29.49 187
號 34.08 79.17 47.65 96
母親 14.44 36.99 20.77 519
簡稱 11.36 56.12 18.89 237
校長 16.77 93.92 28.45 148
總部地點 3.07 27.50 5.52 160
出品公司 18.31 77.04 29.59 405
國籍 10.97 86.99 19.49 661
micro 19.36 70.83 30.41 18210
macro 18.08 51.39 22.69 18210
2022-11-17 07:08:01,224 [MainThread ] [INFO ] Logged in: data/log/duie_train/2022-11-17_06:48:16.414088
2022-11-17 07:08:01,224 [MainThread ] [INFO ] Saved in: data/save/duie_train/2022-11-17_06:48:16.414088
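The scores in the tables above follow the stated matching rule: a prediction counts only when it exactly matches a gold item. The sketch below illustrates that rule with set matching over `(start, end, type)` tuples. It is a simplified illustration rather than the repository's evaluator, and the function names are hypothetical. The usual convention is that the micro row pools counts over all types, while the macro row averages the per-type scores.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (same unit in, same unit out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(gold, predicted):
    """Score predictions against gold items by exact match; results are in percent."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = 100.0 * correct / len(predicted) if predicted else 0.0
    recall = 100.0 * correct / len(gold) if gold else 0.0
    return precision, recall, f1_from(precision, recall)


if __name__ == "__main__":
    gold = {(1, 5, "影視作品"), (7, 10, "企業"), (13, 16, "人物")}
    predicted = {(1, 5, "影視作品"), (7, 10, "機構"), (13, 16, "人物"), (17, 22, "人物")}
    # Two of four predictions are correct, and two of three gold items are found.
    print([round(x, 2) for x in precision_recall_f1(gold, predicted)])
    # The micro entity row above: precision 53.15 and recall 90.82 give F1 67.06.
    print(round(f1_from(53.15, 90.82), 2))
```

In the example, the span (7, 10) is predicted with the wrong type, so it counts as both a false positive and a missed gold entity.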
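(2) Evaluate on the test set. Since there is no test set here, the dataset path in the config is set to the validation set. The path to the saved model in duie_eval.conf must be updated; the model is normally found under data/save/duie_train/<timestamp folder>/final_model. If the test set is the same as the validation set, the results are identical to those shown above.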
python ./spert.py eval --config configs/duie_eval.conf
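Locating the saved model by hand is easy to get wrong when several runs exist. The sketch below picks the most recent run directory that contains a `final_model` folder. It assumes the run folders are named by timestamp, as in the `Saved in:` log line of the training output, so that sorting by name is chronological. The helper name `latest_final_model` is hypothetical.

```python
from pathlib import Path


def latest_final_model(save_root):
    """Return the final_model directory of the most recent run under save_root, or None."""
    root = Path(save_root)
    if not root.is_dir():
        return None
    runs = sorted(p for p in root.iterdir() if (p / "final_model").is_dir())
    # Timestamp-named folders sort chronologically, so the last one is the newest.
    return runs[-1] / "final_model" if runs else None


if __name__ == "__main__":
    print(latest_final_model("data/save/duie_train"))
```

The returned path is the value to place in the config file's model path entry.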
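(3) Prediction. The path to the saved model in duie_eval.conf must be updated; the model is normally found under data/save/duie_train/<timestamp folder>/final_model. Prediction uses duie_prediction_example.json, which has the following format: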
[{"tokens": ["《", "廢", "物", "小", "說", "》", "是", "新", "片", "場", "出", "品", ",", "杜", "煜", "峰", "(", "東", "北", "花", "澤", "類", ")", "導", "演", "2", "的", "動", "畫", "首", "作", ",", "作", "品", "延", "續", "了", "他", "一", "貫", "的", "脫", "力", "系", "搞", "笑", "風", "格"], "entities": [{"type": "影視作品", "start": 1, "end": 5}, {"type": "企業", "start": 7, "end": 10}, {"type": "人物", "start": 13, "end": 16}], "relations": [{"type": "出品公司", "head": 0, "tail": 1}, {"type": "導演", "head": 0, "tail": 2}]}, {"tokens": ["《", "廢", "物", "小", "說", "》", "是", "新", "片", "場", "出", "品", ",", "杜", "煜", "峰", "(", "東", "北", "花", "澤", "類", ")", "導", "演", "2", "的", "動", "畫", "首", "作", ",", "作", "品", "延", "續", "了", "他", "一", "貫", "的", "脫", "力", "系", "搞", "笑", "風", "格"], "entities": [{"type": "影視作品", "start": 1, "end": 5}, {"type": "企業", "start": 7, "end": 10}, {"type": "人物", "start": 13, "end": 16}], "relations": [{"type": "出品公司", "head": 0, "tail": 1}, {"type": "導演", "head": 0, "tail": 2}]}, {"tokens": ["《", "廢", "物", "小", "說", "》", "是", "新", "片", "場", "出", "品", ",", "杜", "煜", "峰", "(", "東", "北", "花", "澤", "類", ")", "導", "演", "2", "的", "動", "畫", "首", "作", ",", "作", "品", "延", "續", "了", "他", "一", "貫", "的", "脫", "力", "系", "搞", "笑", "風", "格"], "entities": [{"type": "影視作品", "start": 1, "end": 5}, {"type": "企業", "start": 7, "end": 10}, {"type": "人物", "start": 13, "end": 16}], "relations": [{"type": "出品公司", "head": 0, "tail": 1}, {"type": "導演", "head": 0, "tail": 2}]}]
python ./spert.py predict --config configs/example_predict.conf
[{"tokens": ["《", "廢", "物", "小", "說", "》", "是", "新", "片", "場", "出", "品", ",", "杜", "煜", "峰", "(", "東", "北", "花", "澤", "類", ")", "導", "演", "2", "的", "動", "畫", "首", "作", ",", "作", "品", "延", "續", "了", "他", "一", "貫", "的", "脫", "力", "系", "搞", "笑", "風", "格"], "entities": [{"type": "影視作品", "start": 1, "end": 5}, {"type": "企業", "start": 7, "end": 10}, {"type": "人物", "start": 13, "end": 16}], "relations": [{"type": "出品公司", "head": 0, "tail": 1}, {"type": "導演", "head": 0, "tail": 2}]}, {"tokens": ["《", "廢", "物", "小", "說", "》", "是", "新", "片", "場", "出", "品", ",", "杜", "煜", "峰", "(", "東", "北", "花", "澤", "類", ")", "導", "演", "2", "的", "動", "畫", "首", "作", ",", "作", "品", "延", "續", "了", "他", "一", "貫", "的", "脫", "力", "系", "搞", "笑", "風", "格"], "entities": [{"type": "影視作品", "start": 1, "end": 5}, {"type": "企業", "start": 7, "end": 10}, {"type": "人物", "start": 13, "end": 16}], "relations": [{"type": "出品公司", "head": 0, "tail": 1}, {"type": "導演", "head": 0, "tail": 2}]}, {"tokens": ["《", "廢", "物", "小", "說", "》", "是", "新", "片", "場", "出", "品", ",", "杜", "煜", "峰", "(", "東", "北", "花", "澤", "類", ")", "導", "演", "2", "的", "動", "畫", "首", "作", ",", "作", "品", "延", "續", "了", "他", "一", "貫", "的", "脫", "力", "系", "搞", "笑", "風", "格"], "entities": [{"type": "影視作品", "start": 1, "end": 5}, {"type": "企業", "start": 7, "end": 10}, {"type": "人物", "start": 13, "end": 16}], "relations": [{"type": "出品公司", "head": 0, "tail": 1}, {"type": "導演", "head": 0, "tail": 2}]}]
There are three results here, one per input document; in other words, any of the input formats in duie_prediction_example.json will work.
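To run prediction on new sentences, an input file can be built from raw text. The sketch below is an illustration only: it assumes the predictor needs just the `tokens` field for pre-tokenized input, and it uses one token per character, as in the samples above. The function names and the output file name `my_prediction_input.json` are hypothetical.

```python
import json


def to_prediction_doc(text):
    """Wrap raw text as a pre-tokenized document with one token per character."""
    return {"tokens": list(text)}


def write_prediction_file(texts, path):
    """Write a list of documents as JSON, keeping Chinese characters readable."""
    docs = [to_prediction_doc(t) for t in texts]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(docs, f, ensure_ascii=False)
    return docs


if __name__ == "__main__":
    docs = write_prediction_file(["《廢物小說》是新片場出品"], "my_prediction_input.json")
    print(docs[0]["tokens"][:6])
```

Because the characters are used directly as tokens, the `start` and `end` offsets in the predicted output are character positions in the original sentence.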
Reference: lavis-nlp/spert, PyTorch code for SpERT: Span-based Entity and Relation Transformer (https://github.com/lavis-nlp/spert)