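Large language models are everywhere right now, and when you work with them you keep running into the transformers library, in particular classes named xxxCausalLM and xxxForConditionalGeneration. This post walks through the basic task classes that transformers provides.
We use three kinds of model as examples: bert (autoencoding), gpt (autoregressive) and bart (encoder-decoder).
First, an overview of the task classes each model offers: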
from ..bart.modeling_bart import (
    BartForCausalLM,
    BartForConditionalGeneration,
    BartForQuestionAnswering,
    BartForSequenceClassification,
    BartModel,
)
from ..bert.modeling_bert import (
    BertForMaskedLM,
    BertForMultipleChoice,
    BertForNextSentencePrediction,
    BertForPreTraining,
    BertForQuestionAnswering,
    BertForSequenceClassification,
    BertForTokenClassification,
    BertLMHeadModel,
    BertModel,
)
from ..gpt2.modeling_gpt2 import GPT2ForSequenceClassification, GPT2LMHeadModel, GPT2Model
BertModel(BertPreTrainedModel): the bare BERT model; it yields a sentence-level vector or a vector for every token.
BertForPreTraining(BertPreTrainedModel): BertModel plus a pre-training head:
self.bert = BertModel(config)
self.cls = BertPreTrainingHeads(config)

class BertPreTrainingHeads(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.predictions = BertLMPredictionHead(config)
        self.seq_relationship = nn.Linear(config.hidden_size, 2)

    def forward(self, sequence_output, pooled_output):
        prediction_scores = self.predictions(sequence_output)
        seq_relationship_score = self.seq_relationship(pooled_output)
        return prediction_scores, seq_relationship_score

class BertLMPredictionHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transform = BertPredictionHeadTransform(config)
        # The output weights are the same as the input embeddings, but there is
        # an output-only bias for each token.
        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.bias = nn.Parameter(torch.zeros(config.vocab_size))
        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
        self.decoder.bias = self.bias

    def forward(self, hidden_states):
        hidden_states = self.transform(hidden_states)
        hidden_states = self.decoder(hidden_states)
        return hidden_states
These correspond to BERT's two pre-training tasks: masked language modeling (MLM) and next sentence prediction (NSP).
BertForMaskedLM(BertPreTrainedModel): keeps only the MLM head:
self.bert = BertModel(config, add_pooling_layer=False)
self.cls = BertOnlyMLMHead(config)

class BertOnlyMLMHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.predictions = BertLMPredictionHead(config)

    def forward(self, sequence_output: torch.Tensor) -> torch.Tensor:
        prediction_scores = self.predictions(sequence_output)
        return prediction_scores

class BertLMPredictionHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transform = BertPredictionHeadTransform(config)
        # The output weights are the same as the input embeddings, but there is
        # an output-only bias for each token.
        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.bias = nn.Parameter(torch.zeros(config.vocab_size))
        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
        self.decoder.bias = self.bias

    def forward(self, hidden_states):
        hidden_states = self.transform(hidden_states)
        hidden_states = self.decoder(hidden_states)
        return hidden_states
BertForNextSentencePrediction(BertPreTrainedModel): keeps only the NSP head:
self.bert = BertModel(config)
self.cls = BertOnlyNSPHead(config)

class BertOnlyNSPHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.seq_relationship = nn.Linear(config.hidden_size, 2)

    def forward(self, pooled_output):
        seq_relationship_score = self.seq_relationship(pooled_output)
        return seq_relationship_score
BertForSequenceClassification(BertPreTrainedModel): sentence-level classification:
self.bert = BertModel(config)
classifier_dropout = (
    config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
)
self.dropout = nn.Dropout(classifier_dropout)
self.classifier = nn.Linear(config.hidden_size, config.num_labels)

BertForMultipleChoice(BertPreTrainedModel): multiple choice; each candidate is scored with a single logit:
self.bert = BertModel(config)
classifier_dropout = (
    config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
)
self.dropout = nn.Dropout(classifier_dropout)
self.classifier = nn.Linear(config.hidden_size, 1)

BertForTokenClassification(BertPreTrainedModel): per-token classification:
self.bert = BertModel(config, add_pooling_layer=False)
classifier_dropout = (
    config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
)
self.dropout = nn.Dropout(classifier_dropout)
self.classifier = nn.Linear(config.hidden_size, config.num_labels)

BertForQuestionAnswering(BertPreTrainedModel): extractive question answering; predicts the start and end position of the answer:
self.bert = BertModel(config, add_pooling_layer=False)
self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
GPT2Model(GPT2PreTrainedModel): the bare GPT-2 model; it returns a vector for every token.
GPT2LMHeadModel(GPT2PreTrainedModel): the language modeling task, i.e. predicting the token that follows each token.
self.transformer = GPT2Model(config)
self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
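GPT2DoubleHeadsModel(GPT2PreTrainedModel): a language modeling head plus a multiple-choice head:
config.num_labels = 1
self.transformer = GPT2Model(config)
self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
self.multiple_choice_head = SequenceSummary(config)
This one is easier to understand through an example: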
import torch
from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2DoubleHeadsModel.from_pretrained('gpt2')
choices = [
    "Bob likes candy ; what does Bob like ? Bag <|endoftext|>",
    "Bob likes candy ; what does Bob like ? Burger <|endoftext|>",
    "Bob likes candy ; what does Bob like ? Candy <|endoftext|>",
    "Bob likes candy ; what does Bob like ? Apple <|endoftext|>",
]
encoded_choices = [tokenizer.encode(s) for s in choices]
eos_token_location = [tokens.index(tokenizer.eos_token_id) for tokens in encoded_choices]
input_ids = torch.tensor(encoded_choices).unsqueeze(0)
mc_token_ids = torch.tensor([eos_token_location])
print(input_ids.shape)
print(mc_token_ids.shape)
outputs = model(input_ids, mc_token_ids=mc_token_ids)
lm_prediction_scores, mc_prediction_scores = outputs[:2]
print(lm_prediction_scores.shape)
print(mc_prediction_scores)
"""
torch.Size([1, 4, 13])
torch.Size([1, 4])
torch.Size([1, 4, 13, 50257])
tensor([[-6.0075, -6.0649, -6.0657, -6.0585]], grad_fn=<SqueezeBackward1>)
"""
Confused by GPT2DoubleHeadsModel example · Issue #1794 · huggingface/transformers (github.com)
How to use GPT2DoubleHeadsModel? · Issue #3680 · huggingface/transformers (github.com)
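To make the role of mc_token_ids in the example above more concrete, here is a small dependency-free sketch. It uses plain Python lists in place of tensors, and the token ids other than 50256 (GPT-2's <|endoftext|> id) are made up for illustration. The helper names eos_positions and pick_choice are hypothetical and not part of transformers.

```python
def eos_positions(encoded_choices, eos_token_id):
    """Index of the first <|endoftext|> token in each encoded choice.

    These indices are what the example passes as mc_token_ids: they tell the
    multiple-choice head which position of each choice to summarise.
    """
    return [tokens.index(eos_token_id) for tokens in encoded_choices]


def pick_choice(mc_scores):
    """Index of the highest-scoring choice."""
    return max(range(len(mc_scores)), key=lambda i: mc_scores[i])


# Hypothetical token ids; only 50256 (<|endoftext|>) is a real GPT-2 id.
encoded_choices = [[10, 11, 12, 50256], [10, 11, 13, 50256]]
print(eos_positions(encoded_choices, eos_token_id=50256))  # [3, 3]

# The multiple-choice scores printed in the example above.
print(pick_choice([-6.0075, -6.0649, -6.0657, -6.0585]))  # 0
```

Note that the top-scoring index for the printed scores is 0 ("Bag"), not "Candy". As far as I can tell, the multiple-choice head is not stored in the gpt2 checkpoint and is freshly initialised, so these scores only become meaningful after fine-tuning.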
GPT2ForSequenceClassification(GPT2PreTrainedModel): sequence classification:
self.transformer = GPT2Model(config)
self.score = nn.Linear(config.n_embd, self.num_labels, bias=False)

GPT2ForTokenClassification(GPT2PreTrainedModel): per-token classification:
self.transformer = GPT2Model(config)
if hasattr(config, "classifier_dropout") and config.classifier_dropout is not None:
    classifier_dropout = config.classifier_dropout
elif hasattr(config, "hidden_dropout") and config.hidden_dropout is not None:
    classifier_dropout = config.hidden_dropout
else:
    classifier_dropout = 0.1
self.dropout = nn.Dropout(classifier_dropout)
self.classifier = nn.Linear(config.hidden_size, config.num_labels)
An example with the bare GPT2Model:
import torch
from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
text = [
    "Bob likes candy ; what does Bob like ? Bag <|endoftext|>",
    "Bob likes candy ; what does Bob like ? Bag <|endoftext|>",
]
inputs = tokenizer(text, return_tensors="pt")
print(inputs)
print(tokenizer.decode(inputs["input_ids"][0]))
output = model(**inputs)
print(output[0].shape)
"""
{'input_ids': tensor([[18861, 7832, 18550, 2162, 644, 857, 5811, 588, 5633, 220,
20127, 220, 50256],
[18861, 7832, 18550, 2162, 644, 857, 5811, 588, 5633, 220,
20127, 220, 50256]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Bob likes candy ; what does Bob like? Bag <|endoftext|>
torch.Size([2, 13, 768])
"""
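BartModel(BartPretrainedModel): the bare BART model; it returns the decoder's vector for every token, and other outputs are available as options.
BartForConditionalGeneration(BartPretrainedModel): as the name suggests, conditional text generation.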
self.model = BartModel(config)
self.register_buffer("final_logits_bias", torch.zeros((1, self.model.shared.num_embeddings)))
self.lm_head = nn.Linear(config.d_model, self.model.shared.num_embeddings, bias=False)
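The inputs we usually supply are input_ids (encoder input), attention_mask (encoder attention mask), decoder_input_ids (decoder input) and decoder_attention_mask (decoder attention mask; the training script later in this post first builds it as target_attention_mask and then renames it). When labels are passed without decoder_input_ids, the model derives the decoder input by shifting the labels one position to the right. The two outputs we normally use are loss=masked_lm_loss and logits=lm_logits.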
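When only labels are given, BART builds decoder_input_ids by shifting the labels right. The sketch below mirrors that logic with plain Python lists instead of tensors. The token ids are illustrative, and treating 102 ([SEP]) as the decoder start token and 0 as the pad token is an assumption that matches the ids used in the couplet example later in this post.

```python
def shift_tokens_right(labels, pad_token_id, decoder_start_token_id):
    """Build decoder inputs from labels (one list of token ids per example)."""
    shifted = []
    for row in labels:
        # Prepend the start token and drop the last position.
        new_row = [decoder_start_token_id] + row[:-1]
        # Positions masked out of the loss (-100) become padding on the input side.
        new_row = [pad_token_id if tok == -100 else tok for tok in new_row]
        shifted.append(new_row)
    return shifted


# Illustrative ids: 101 = [CLS], 102 = [SEP], -100 = ignored by the loss.
labels = [[101, 2769, 4263, 102, -100, -100]]
decoder_input_ids = shift_tokens_right(labels, pad_token_id=0, decoder_start_token_id=102)
print(decoder_input_ids)  # [[102, 101, 2769, 4263, 102, 0]]
```

At every position the decoder therefore sees the tokens before the one it has to predict, which is the usual teacher-forcing setup.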
BartForSequenceClassification(BartPretrainedModel): sequence classification:
self.model = BartModel(config)
self.classification_head = BartClassificationHead(
    config.d_model,
    config.d_model,
    config.num_labels,
    config.classifier_dropout,
)

class BartClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""

    def __init__(
        self,
        input_dim: int,
        inner_dim: int,
        num_classes: int,
        pooler_dropout: float,
    ):
        super().__init__()
        self.dense = nn.Linear(input_dim, inner_dim)
        self.dropout = nn.Dropout(p=pooler_dropout)
        self.out_proj = nn.Linear(inner_dim, num_classes)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.dense(hidden_states)
        hidden_states = torch.tanh(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.out_proj(hidden_states)
        return hidden_states

The logits are obtained like this:
hidden_states = outputs[0]  # last hidden state
# Mark the positions of the <eos> tokens
eos_mask = input_ids.eq(self.config.eos_token_id).to(hidden_states.device)
if len(torch.unique_consecutive(eos_mask.sum(1))) > 1:
    raise ValueError("All examples must have the same number of <eos> tokens.")
# Use the hidden state at the last <eos> of each sequence as the sentence representation
sentence_representation = hidden_states[eos_mask, :].view(hidden_states.size(0), -1, hidden_states.size(-1))[
    :, -1, :
]
logits = self.classification_head(sentence_representation)
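The indexing in the snippet above is dense, so here is a rough plain-Python equivalent of the pooling step. It uses nested lists in place of tensors; the function name pool_last_eos and the toy values are hypothetical.

```python
def pool_last_eos(hidden_states, input_ids, eos_token_id):
    """Return, for each sequence, the hidden vector at its last <eos> position.

    hidden_states: [batch][seq_len][hidden] nested lists
    input_ids:     [batch][seq_len] token ids
    """
    eos_counts = {ids.count(eos_token_id) for ids in input_ids}
    if len(eos_counts) > 1:
        raise ValueError("All examples must have the same number of <eos> tokens.")
    pooled = []
    for vectors, ids in zip(hidden_states, input_ids):
        # Position of the last <eos>; max() raises ValueError if there is none.
        last_eos = max(i for i, tok in enumerate(ids) if tok == eos_token_id)
        pooled.append(vectors[last_eos])
    return pooled


# Toy batch: eos id 2, pad id 1, hidden size 1.
input_ids = [[5, 6, 2, 1], [7, 2, 1, 1]]
hidden_states = [
    [[0.0], [0.1], [0.2], [0.3]],
    [[1.0], [1.1], [1.2], [1.3]],
]
print(pool_last_eos(hidden_states, input_ids, eos_token_id=2))  # [[0.2], [1.1]]
```

The pooled vectors are what gets passed to the classification head.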
Loss computation:
loss = None
if labels is not None:
    labels = labels.to(logits.device)
    if self.config.problem_type is None:
        if self.config.num_labels == 1:
            self.config.problem_type = "regression"
        elif self.config.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
            self.config.problem_type = "single_label_classification"
        else:
            self.config.problem_type = "multi_label_classification"

    if self.config.problem_type == "regression":
        loss_fct = MSELoss()
        if self.config.num_labels == 1:
            loss = loss_fct(logits.squeeze(), labels.squeeze())
        else:
            loss = loss_fct(logits, labels)
    elif self.config.problem_type == "single_label_classification":
        loss_fct = CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.config.num_labels), labels.view(-1))
    elif self.config.problem_type == "multi_label_classification":
        loss_fct = BCEWithLogitsLoss()
        loss = loss_fct(logits, labels)
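The way problem_type is inferred when it is not set in the config can be summarised as a small decision function. This is only a sketch of the branching shown above: the function name and the boolean argument labels_are_integers (standing in for the dtype check) are hypothetical.

```python
def infer_problem_type(num_labels, labels_are_integers):
    """Mirror the branching that picks the problem type when none is configured."""
    if num_labels == 1:
        return "regression"
    if num_labels > 1 and labels_are_integers:
        return "single_label_classification"
    return "multi_label_classification"


# Loss class used for each problem type in the snippet above.
LOSS_FOR_PROBLEM_TYPE = {
    "regression": "MSELoss",
    "single_label_classification": "CrossEntropyLoss",
    "multi_label_classification": "BCEWithLogitsLoss",
}

for args in [(1, False), (3, True), (3, False)]:
    problem_type = infer_problem_type(*args)
    print(args, problem_type, LOSS_FOR_PROBLEM_TYPE[problem_type])
```

In practice this means float labels with num_labels > 1 are treated as multi-label targets, so it is worth checking the label dtype before training.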
BartForQuestionAnswering(BartPretrainedModel): extractive question answering:
config.num_labels = 2
self.num_labels = config.num_labels
self.model = BartModel(config)
self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)

The two output channels are split into start and end logits:
sequence_output = outputs[0]
logits = self.qa_outputs(sequence_output)
start_logits, end_logits = logits.split(1, dim=-1)
start_logits = start_logits.squeeze(-1).contiguous()
end_logits = end_logits.squeeze(-1).contiguous()
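Once the start and end logits are available, an answer span still has to be chosen from them. The post does not cover that step, so the following is just one simple way to do it, sketched with plain lists: score every (start, end) pair with start <= end and keep the best. The function name best_span and the max_answer_len limit are assumptions for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Pick the (start, end) pair with the highest summed logit, start <= end."""
    best, best_score = (None, None), float("-inf")
    for start, start_score in enumerate(start_logits):
        last = min(len(end_logits), start + max_answer_len)
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best, best_score


# Per-token logits with two channels, as produced by qa_outputs for one sequence.
logits = [[0.1, 0.0], [2.0, 0.5], [0.3, 3.0], [0.2, 0.1]]
start_logits = [pair[0] for pair in logits]  # same role as logits.split(1, dim=-1)
end_logits = [pair[1] for pair in logits]
print(best_span(start_logits, end_logits))  # ((1, 2), 5.0)
```

Here tokens 1 to 2 form the predicted answer span.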
BartForCausalLM(BartPretrainedModel): uses only the BART decoder as a causal language model:
config = copy.deepcopy(config)
config.is_decoder = True
config.is_encoder_decoder = False
super().__init__(config)
self.model = BartDecoderWrapper(config)
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

outputs = self.model.decoder(
    input_ids=input_ids,
    attention_mask=attention_mask,
    encoder_hidden_states=encoder_hidden_states,
    encoder_attention_mask=encoder_attention_mask,
    head_mask=head_mask,
    cross_attn_head_mask=cross_attn_head_mask,
    past_key_values=past_key_values,
    inputs_embeds=inputs_embeds,
    use_cache=use_cache,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
logits = self.lm_head(outputs[0])
An example:
>>> from transformers import AutoTokenizer, BartForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
>>> model = BartForCausalLM.from_pretrained("facebook/bart-base", add_cross_attention=False)
>>> assert model.config.is_decoder, f"{model.__class__} has to be configured as a decoder."
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> expected_shape = [1, inputs.input_ids.shape[-1], model.config.vocab_size]
>>> list(logits.shape) == expected_shape
True
Next, let us work through a hands-on example for xxxCausalLM and for xxxForConditionalGeneration to understand them better. First install the dependencies:
pip install transformers==4.28.1
pip install evaluate
pip install datasets
For xxxCausalLM, download the data from: https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/ChnSentiCorp_htl_all/ChnSentiCorp_htl_all.csv
Here is the code:
import torch
from tqdm import tqdm
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import (
    default_data_collator,
    get_linear_schedule_with_warmup,
)
from torch.utils.data import DataLoader

data_file = "./ChnSentiCorp_htl_all.csv"  # path to the data file; download it beforehand
max_length = 86
train_batch_size = 64
eval_batch_size = 64
num_epochs = 10
lr = 3e-4

# Load the dataset
dataset = load_dataset("csv", data_files=data_file)
dataset = dataset.filter(lambda x: x["review"] is not None)
dataset = dataset["train"].train_test_split(0.2, seed=123)

model_name_or_path = "uer/gpt2-chinese-cluecorpussmall"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)

# example = {'label': 1, 'review': '早餐太差,無論去多少人,那邊也不加食品的。酒店應該重視一下這個問題了。房間本身很好。'}
def process(example):
    text = example["review"]
    # text = ["399真的很值得之前也住過別的差不多價位的酒店式公寓沒有這間好廚房很像廚房很大整個格局也都很舒服早上的早餐我訂的8點半的已經冷了。。。位置啊什麼還是很好的下次還會去服務也很周到"]
    batch_size = len(text)
    inputs = tokenizer(text, add_special_tokens=False, truncation=True, max_length=max_length)
    inputs["labels"] = []
    for i in range(batch_size):
        input_ids = inputs["input_ids"][i]
        if len(input_ids) + 1 <= max_length:
            # Append [PAD] as the end-of-text marker, then pad to max_length;
            # padding positions get label -100 so they are ignored by the loss
            inputs["input_ids"][i] = input_ids + [tokenizer.pad_token_id] + [0] * (max_length - len(input_ids) - 1)
            inputs["labels"].append(input_ids + [tokenizer.pad_token_id] + [-100] * (max_length - len(input_ids) - 1))
            inputs["attention_mask"][i] = [1] * len(input_ids) + [0] + [0] * (max_length - len(input_ids) - 1)
        else:
            # Already at max_length: replace the last token with the end-of-text marker
            inputs["input_ids"][i] = input_ids[:max_length - 1] + [tokenizer.pad_token_id]
            inputs["labels"].append(inputs["input_ids"][i])
            inputs["attention_mask"][i] = [1] * max_length
        inputs["token_type_ids"][i] = [0] * max_length
        # for k, v in inputs.items():
        #     print(k, len(v[0]))
        # assert len(inputs["labels"][i]) == len(inputs["input_ids"][i]) == len(inputs["token_type_ids"][i]) == len(inputs["attention_mask"][i]) == 86
    return inputs

# process(None)
train_dataset = dataset["train"].map(process, batched=True, num_proc=1, remove_columns=dataset["train"].column_names)
test_dataset = dataset["test"].map(process, batched=True, num_proc=1, remove_columns=dataset["test"].column_names)

train_dataloader = DataLoader(
    train_dataset, collate_fn=default_data_collator, shuffle=True, batch_size=train_batch_size, pin_memory=True
)
test_dataloader = DataLoader(
    test_dataset, collate_fn=default_data_collator, batch_size=eval_batch_size, pin_memory=True
)

# optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

# lr scheduler
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=(len(train_dataloader) * num_epochs),
)

model.cuda()

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    t = tqdm(train_dataloader)
    for step, batch in enumerate(t):
        for k, v in batch.items():
            batch[k] = v.cuda()
        outputs = model(
            input_ids=batch["input_ids"],
            token_type_ids=batch["token_type_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["labels"],
        )
        loss = outputs.loss
        t.set_description("loss:{:.6f}".format(loss.item()))
        total_loss += loss.detach().float()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
    train_epoch_loss = total_loss / len(train_dataloader)
    model.save_pretrained("gpt2-chinese/")
    print(f"epoch:{epoch+1}/{num_epochs} loss:{train_epoch_loss}")
Training output:
loss:2.416899: 100%|██████████| 98/98 [01:51<00:00, 1.14s/it]
epoch:1/10 loss:2.7781832218170166
loss:2.174688: 100%|██████████| 98/98 [01:54<00:00, 1.17s/it]
epoch:2/10 loss:2.3192219734191895
loss:2.123909: 100%|██████████| 98/98 [01:55<00:00, 1.17s/it]
epoch:3/10 loss:2.037835121154785
loss:1.785878: 100%|██████████| 98/98 [01:55<00:00, 1.18s/it]
epoch:4/10 loss:1.7687807083129883
loss:1.466153: 100%|██████████| 98/98 [01:55<00:00, 1.18s/it]
epoch:5/10 loss:1.524872064590454
loss:1.465316: 100%|██████████| 98/98 [01:54<00:00, 1.17s/it]
epoch:6/10 loss:1.3074666261672974
loss:1.150320: 100%|██████████| 98/98 [01:54<00:00, 1.16s/it]
epoch:7/10 loss:1.1217808723449707
loss:1.043044: 100%|██████████| 98/98 [01:53<00:00, 1.16s/it]
epoch:8/10 loss:0.9760875105857849
loss:0.790678: 100%|██████████| 98/98 [01:53<00:00, 1.16s/it]
epoch:9/10 loss:0.8597695827484131
loss:0.879025: 100%|██████████| 98/98 [01:53<00:00, 1.16s/it]
epoch:10/10 loss:0.790839433670044
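The training script relies on get_linear_schedule_with_warmup to shape the learning rate. To see what that schedule does, the multiplier it applies to the base learning rate can be written as a small pure-Python function. This is a sketch of the formula as I understand it, not the library code; the function name linear_warmup_decay is hypothetical. The 98 steps per epoch come from the progress bars above.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear warm-up, then linear decay to zero."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 3e-4
total_steps = 98 * 10  # 98 batches per epoch, 10 epochs, as in the log above
for step in (0, 490, 980):
    # With num_warmup_steps=0 the rate starts at base_lr and decays linearly.
    print(step, base_lr * linear_warmup_decay(step, 0, total_steps))
```

With no warm-up, the rate is at its full value on the first step, half of it midway through training, and zero at the last step.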
Prediction can then be run like this:
import random

from transformers import AutoTokenizer, TextGenerationPipeline, AutoModelForCausalLM
from datasets import load_dataset

data_file = "./ChnSentiCorp_htl_all.csv"  # path to the data file; download it beforehand
dataset = load_dataset("csv", data_files=data_file)
dataset = dataset.filter(lambda x: x["review"] is not None)
dataset = dataset["train"].train_test_split(0.2, seed=123)

model_name_or_path = "uer/gpt2-chinese-cluecorpussmall"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained("./gpt2-chinese/")
text_generator = TextGenerationPipeline(model, tokenizer)

examples = dataset["train"]
example = random.choice(examples)
text = example["review"]
print(text)
print(text[:10])
# With do_sample=False decoding is greedy, so top_p and temperature have no effect here.
# eos_token_id=0 is [PAD], which the training script used as the end-of-text marker.
print(text_generator(
    text[:10],
    max_length=100,
    do_sample=False,
    top_p=0.8,
    repetition_penalty=10.0,
    temperature=0.95,
    eos_token_id=0,
))
"""
第一次住在這裡兒,我針對大家的意見,特別關注了一下,感覺如下吧!1、標準間雖然有點舊但很乾淨,被子蓋得很舒服,也很暖和,衛生間也蠻大的,因是在商業中心離很多還算很近。2、酒店服務還算可以,沒有像這裡說的那樣,入住時,退房時也挺快的,總的來說我很滿意。3、早餐也還可以,環境也不錯,有點江南的感覺;菜品種品也不少,挺可口。4、可能是在市或者離火車站的距離很近,稍微有點「熱鬧」,來找我辦事的人不方便停車,但還好這裡有地下停車場。總體來說,我感覺很不錯,值得推薦!!!
第一次住在這裡兒,我
[{'generated_text': '第一次住在這裡兒,我 感 覺 很 溫 馨 。 房 間 寬 敞 、 幹 淨 還 有 水 果 送 ( 每 人 10 元 ) ; 飯 菜 也 不 錯 ! 價 格 合 理 經 濟 實 惠 .'}]
"""
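A few points to note:
Different models use different tokenizers, so pay attention to how they differ, especially in pad_token_id and eos_token_id. eos_token_id is typically used to mark the end of generated text.
Some Chinese generative pre-trained models still use the BERT tokenizer; pass add_special_tokens=False when tokenizing to avoid adding [CLS] and [SEP].
BertTokenizer's eos_token_id is None, so here we treat [PAD], whose index is 0, as the end-of-generation marker. You could equally use another special token from the vocabulary, such as [SEP].
For tokens that should not contribute to the loss, set the label to -100.
Why are labels and input_ids identical, when the task is to predict the next token from the previous ones? Because the model does the shifting for us internally; see the code: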
shift_logits = lm_logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
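To see how the shift and the -100 labels interact, here is a dependency-free sketch of the loss for a single sequence, using Python lists in place of tensors. It is an illustration under simplifying assumptions, not the library implementation; in particular, returning 0.0 when every target is ignored is a choice made here for simplicity.

```python
import math

IGNORE_INDEX = -100


def causal_lm_loss(logits, labels):
    """Mean next-token cross-entropy for one sequence.

    logits: one list of vocabulary scores per position (length seq_len)
    labels: token ids (length seq_len), with -100 at positions to ignore
    """
    shift_logits = logits[:-1]  # predictions made at positions 0..n-2
    shift_labels = labels[1:]   # targets are the tokens at positions 1..n-1
    total, count = 0.0, 0
    for scores, target in zip(shift_logits, shift_labels):
        if target == IGNORE_INDEX:
            continue  # padding positions do not contribute to the loss
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[target]
        count += 1
    return total / count if count else 0.0


# Vocabulary of 3 tokens with uniform scores; the last label is padding.
logits = [[0.0, 0.0, 0.0] for _ in range(4)]
labels = [0, 2, 1, IGNORE_INDEX]
print(round(causal_lm_loss(logits, labels), 4))  # 1.0986
```

With uniform scores over three tokens, each counted position costs ln(3), and the padded position is skipped, so the mean is ln(3) ≈ 1.0986.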
For xxxForConditionalGeneration we use a couplet dataset. Download the data from: https://www.modelscope.cn/datasets/minisnow/couplet_samll.git
Here is the code:
import pandas as pd
import torch
from tqdm import tqdm
from datasets import Dataset
from transformers import BertTokenizer, BartForConditionalGeneration
from transformers import (
    default_data_collator,
    get_linear_schedule_with_warmup,
)
from torch.utils.data import DataLoader

# =============================
# Load the data
train_path = "couplet_samll/train.csv"
train_dataset = Dataset.from_csv(train_path)
test_path = "couplet_samll/test.csv"
test_dataset = Dataset.from_csv(test_path)

max_len = 24
train_batch_size = 64
eval_batch_size = 64
lr = 3e-4
num_epochs = 1

# Convert to the format the model expects
def tokenize_dataset(tokenizer, dataset, max_len):
    def convert_to_features(batch):
        text1 = batch["text1"]
        text2 = batch["text2"]
        inputs = tokenizer.batch_encode_plus(
            text1,
            max_length=max_len,
            padding="max_length",
            truncation=True,
        )
        targets = tokenizer.batch_encode_plus(
            text2,
            max_length=max_len,
            padding="max_length",
            truncation=True,
        )
        outputs = {
            "input_ids": inputs["input_ids"],
            "attention_mask": inputs["attention_mask"],
            "target_ids": targets["input_ids"],
            "target_attention_mask": targets["attention_mask"]
        }
        return outputs

    dataset = dataset.map(convert_to_features, batched=True)
    dataset = dataset.rename_column('target_ids', 'labels')
    dataset = dataset.rename_column('target_attention_mask', 'decoder_attention_mask')
    dataset = dataset.remove_columns(['text1', 'text2'])
    # Set the tensor type and the columns which the dataset should return.
    # with_format returns a new dataset, so its result has to be assigned.
    columns = ['input_ids', 'labels', 'attention_mask', 'decoder_attention_mask']
    dataset = dataset.with_format(type='torch', columns=columns)
    return dataset

tokenizer = BertTokenizer.from_pretrained("fnlp/bart-base-chinese")
train_data = tokenize_dataset(tokenizer, train_dataset, max_len)
test_data = tokenize_dataset(tokenizer, test_dataset, max_len)

train_dataset = train_data
train_dataloader = DataLoader(
    train_dataset, collate_fn=default_data_collator, shuffle=True, batch_size=train_batch_size, pin_memory=True
)
test_dataset = test_data
test_dataloader = DataLoader(
    test_dataset, collate_fn=default_data_collator, batch_size=eval_batch_size, pin_memory=True
)

# optimizer
model = BartForConditionalGeneration.from_pretrained("fnlp/bart-base-chinese")
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

# lr scheduler
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=(len(train_dataloader) * num_epochs),
)

model.cuda()

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    t = tqdm(train_dataloader)
    for step, batch in enumerate(t):
        for k, v in batch.items():
            batch[k] = v.cuda()
        outputs = model(**batch)
        loss = outputs.loss
        t.set_description("loss:{:.6f}".format(loss.item()))
        total_loss += loss.detach().float()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
    train_epoch_loss = total_loss / len(train_dataloader)
    model.save_pretrained("bart-couplet/")
    tokenizer.save_pretrained("bart-couplet/")
    print(f"epoch:{epoch+1}/{num_epochs} loss:{train_epoch_loss}")
Output:
loss:1.593506: 100%|██████████| 4595/4595 [33:28<00:00, 2.29it/s]
epoch:1/1 loss:1.76453697681427
Prediction can be run like this:
import pandas as pd
from transformers import BertTokenizer, BartForConditionalGeneration, Text2TextGenerationPipeline

model_path = "bart-couplet"
# model_path = "fnlp/bart-base-chinese"
model = BartForConditionalGeneration.from_pretrained(model_path)
tokenizer = BertTokenizer.from_pretrained(model_path)
generator = Text2TextGenerationPipeline(model=model, tokenizer=tokenizer)

max_len = 24
test_path = "couplet_samll/test.csv"
test_data = pd.read_csv(test_path)
texts = test_data["text1"].values.tolist()
labels = test_data["text2"].values.tolist()

results = generator(texts, max_length=max_len, eos_token_id=102, pad_token_id=0, do_sample=True)
# Printed labels: 上聯 = first line of the couplet, 真實下聯 = reference second line, 預測下聯 = predicted second line
for text, label, res in zip(texts, labels, results):
    print("上聯:", text)
    print("真實下聯:", label)
    print("預測下聯:", "".join(res["generated_text"].split(" ")))
    print("="*100)
"""
上聯: 幾幀山水關秋路
真實下聯: 無奈胭脂點絳脣
預測下聯: 天高雲淡月光明
====================================================================================================
上聯: 許多心事懶收拾
真實下聯: 大好青春莫撂荒
預測下聯: 何妨明月照寒窗
====================================================================================================
上聯: 誰同執手人間老
真實下聯: 自願並肩化外遊
預測下聯: 心中有夢月當頭
====================================================================================================
上聯: 畫地為牢封自步
真實下聯: 齊天大聖悟空行
預測下聯: 不妨一世好清閒
====================================================================================================
上聯: 布穀攜春臨五嶽
真實下聯: 流鶯送喜到千家
預測下聯: 萬家燈火慶豐年
====================================================================================================
上聯: 冤家宜解不宜結
真實下聯: 窮寇定殲必定追
預測下聯: 不因風雨誤春秋
====================================================================================================
上聯: 汪倫情義人間少
真實下聯: 法律條文格外繁
預測下聯: 一江春水向東流
====================================================================================================
上聯: 潑墨吟詩,銀髮人生添雅興
真實下聯: 手機簡訊,古稀老叟逐新潮
預測下聯: 春風得意,萬里千帆逐浪高
====================================================================================================
上聯: 刊岫展屏山,雲凝罨畫
真實下聯: 平湖環鏡檻,波漾空明
預測下聯: 千年古邑,百花芳草淹春
====================================================================================================
上聯: 且向人間賒一醉
真實下聯: 直如島外泛孤舟
預測下聯: 春風得意樂逍遙
====================================================================================================
"""
Things to note: the same prediction can also be run by calling model.generate directly, here decoding with skip_special_tokens=False so that the special tokens stay visible:
model = BartForConditionalGeneration.from_pretrained(model_path)
model = model.to("cuda")
model.eval()

inputs = tokenizer(
    texts,
    padding="max_length",
    truncation=True,
    max_length=max_len,
    return_tensors="pt",
)
input_ids = inputs.input_ids.to(model.device)
attention_mask = inputs.attention_mask.to(model.device)

# Generate
outputs = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_length=max_len,
    do_sample=True,
    pad_token_id=0,
    eos_token_id=102,
)

# Convert the tokens back to text
output_str = tokenizer.batch_decode(outputs, skip_special_tokens=False)
output_str = [s.replace(" ", "") for s in output_str]
for text, label, pred in zip(texts, labels, output_str):
    print("上聯:", text)
    print("真實下聯:", label)
    print("預測下聯:", pred)
    print("="*100)
Output:
上聯: 幾幀山水關秋路
真實下聯: 無奈胭脂點絳脣
預測下聯: [SEP][CLS]春風送暖柳含煙[SEP][PAD][PAD][PAD][PAD][PAD]
====================================================================================================
上聯: 許多心事懶收拾
真實下聯: 大好青春莫撂荒
預測下聯: [SEP][CLS]無私奉獻為人民[SEP][PAD][PAD][PAD][PAD][PAD]
====================================================================================================
上聯: 誰同執手人間老
真實下聯: 自願並肩化外遊
預測下聯: [SEP][CLS]清風明月是知音[SEP][PAD][PAD][PAD][PAD][PAD]
====================================================================================================
上聯: 畫地為牢封自步
真實下聯: 齊天大聖悟空行
預測下聯: [SEP][CLS]月明何處不相逢[SEP][PAD][PAD][PAD][PAD][PAD]
====================================================================================================
上聯: 布穀攜春臨五嶽
真實下聯: 流鶯送喜到千家
預測下聯: [SEP][CLS]一壺老酒醉春風[SEP][PAD][PAD][PAD][PAD][PAD]
====================================================================================================
上聯: 冤家宜解不宜結
真實下聯: 窮寇定殲必定追
預測下聯: [SEP][CLS]風流人物不虛名[SEP][PAD][PAD][PAD][PAD][PAD]
====================================================================================================
上聯: 汪倫情義人間少
真實下聯: 法律條文格外繁
預測下聯: [SEP][CLS]萬里江山萬里春[SEP][PAD][PAD][PAD][PAD][PAD]
====================================================================================================
上聯: 潑墨吟詩,銀髮人生添雅興
真實下聯: 手機簡訊,古稀老叟逐新潮
預測下聯: [SEP][CLS]和諧社會,和諧和諧幸福家[SEP]
====================================================================================================
上聯: 刊岫展屏山,雲凝罨畫
真實下聯: 平湖環鏡檻,波漾空明
預測下聯: [SEP][CLS]天下無雙,人壽年豐[SEP][PAD][PAD][PAD]
====================================================================================================
上聯: 且向人間賒一醉
真實下聯: 直如島外泛孤舟
預測下聯: [SEP][CLS]不知何處有閒人[SEP][PAD][PAD][PAD][PAD][PAD]
====================================================================================================
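The raw strings above still carry [SEP], [CLS] and [PAD]. Passing skip_special_tokens=True to tokenizer.batch_decode removes them at decoding time; if you only have the already-decoded strings, a small post-processing helper along these lines also works. The function name clean_generated is hypothetical, and the token list is assumed from the output shown above.

```python
SPECIAL_TOKENS = ("[CLS]", "[SEP]", "[PAD]")


def clean_generated(text, special_tokens=SPECIAL_TOKENS):
    """Strip BERT-style special tokens and the spaces between characters."""
    for token in special_tokens:
        text = text.replace(token, "")
    return text.replace(" ", "")


raw = "[SEP][CLS]春風送暖柳含煙[SEP][PAD][PAD][PAD][PAD][PAD]"
print(clean_generated(raw))  # 春風送暖柳含煙
```

The leading [SEP] in the raw output is consistent with the decoder start token being [SEP] for this checkpoint, followed by the [CLS] that the BERT tokenizer put at the beginning of every target during training.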
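By now you have seen the models that ship with the transformers library and the tasks built on top of them, and you should have a deeper understanding of the generative ones in particular. Go and try them out.
Finally, some related material: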
https://zhuanlan.zhihu.com/p/624845975
transformers.models.auto.modeling_auto — transformers 4.4.2 documentation (huggingface.co)