Do you know which free, open-source TTS project is the strongest right now? That's right: Bert-vits2, bar none. It takes the already very powerful VITS project and fuses in a BERT language model, which largely fixes VITS's problems with tone and prosody; the results are excellent, and the training cost is still well within reach of ordinary users.
The core idea of BERT is to pre-train on large unlabeled text corpora to learn general-purpose language representations, and then fine-tune those representations on downstream tasks. Compared with traditional word-embedding models, BERT models bidirectional context, which lets it capture the semantics and relationships within a sentence much better.
BERT's architecture is based on the Transformer and consists of a stack of encoder layers. Each encoder layer combines multi-head self-attention with a feed-forward network, extracting multi-level features and representations from the input sequence. During pre-training, BERT learns language representations through two tasks: the Masked Language Model (MLM) and Next Sentence Prediction (NSP). Through these two tasks, BERT learns context-aware word embeddings and sentence-level semantic representations.
In practice, a pre-trained BERT model can be applied to all kinds of downstream tasks, such as text classification, named entity recognition, and question answering. Fine-tuning the pre-trained model yields better performance on a specific task without having to train a model from scratch.
BERT has had a major impact on natural language processing, becoming the foundation of much recent research and many applications. It achieved state-of-the-art results on a range of tasks and pushed natural language understanding forward.
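To make the "contextual representation" idea concrete, here is a minimal sketch (my own illustration, not part of the original post) that loads the Chinese RoBERTa checkpoint linked later in this post with the Hugging Face transformers library and extracts per-token hidden states; contextual features of this kind are what Bert-vits2 feeds into the TTS model as prosody conditioning.
# Hedged sketch: pull contextual token features from the Chinese RoBERTa model
# used by Bert-vits2 (assumes the "transformers" package is installed).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext-large")
bert = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext-large")

with torch.no_grad():
    inputs = tokenizer("不用著急,好好挑選吧。", return_tensors="pt")
    hidden = bert(**inputs).last_hidden_state[0]  # one 1024-dim vector per token
print(hidden.shape)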
This time, let's use the Bert-vits2 project to clone the voices of 渣渣輝 (Nick Cheung) and 劉青雲 (Sean Lau) and put together a guichu remix of the "Tsingtao Beer" clip currently topping the trending charts.
First we need the original audio of the two actors. The source clip from 《掃毒》 (The White Storm) is available here: https://www.bilibili.com/video/BV1R64y1F7SQ/.
Extract each lead actor's voice separately, then run background/vocal separation, denoising, and audio slicing in turn. These steps were covered in detail in an earlier post, see: 民謠女神唱流行,基於AI人工智慧so-vits庫訓練自己的音色模型(葉蓓/Python3.10). For brevity they are not repeated here; a rough sketch of the slicing step follows below.
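For reference, here is a slicing sketch using the pydub library (my own illustration with assumed file names, not the exact tooling from the linked article): it splits a cleaned vocal track on silence into the short per-utterance clips that the annotation step below expects.
# Hedged sketch: cut a denoised vocal track into short slices at silent gaps.
# "zhazhahui_vocals.wav" and the output folder name are assumptions.
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("zhazhahui_vocals.wav")
chunks = split_on_silence(audio,
                          min_silence_len=500,  # a gap of at least 0.5 s counts as a break
                          silence_thresh=-40,   # anything below -40 dBFS is treated as silence
                          keep_silence=200)     # keep 0.2 s of padding around each slice
for i, chunk in enumerate(chunks):
    chunk.export(f"./custom_character_voice/zhazhahui/slice_{i}.wav", format="wav")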
With the raw material prepped, clone the project:
git clone https://github.com/Stardust-minus/Bert-VITS2
Then install the project's dependencies:
cd Bert-VITS2
pip3 install -r requirements.txt
Next, download the BERT models and place them in the project's bert directory.
BERT model download links:
Chinese: https://huggingface.co/hfl/chinese-roberta-wwm-ext-large
Japanese: https://huggingface.co/cl-tohoku/bert-base-japanese-v3/tree/main
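If you prefer to script the download rather than click through the Hugging Face pages, a small sketch with the huggingface_hub package also works (the local_dir names are assumptions; match them to whatever layout the project's bert directory expects):
# Hedged sketch: mirror the two checkpoints into the project's bert directory.
# Requires huggingface_hub; the local_dir names are assumptions.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="hfl/chinese-roberta-wwm-ext-large",
                  local_dir="./bert/chinese-roberta-wwm-ext-large")
snapshot_download(repo_id="cl-tohoku/bert-base-japanese-v3",
                  local_dir="./bert/bert-base-japanese-v3")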
Next we need to annotate the sliced audio clips. For this we use the open-source whisper library; for background on whisper, see: 聞其聲而知雅意,M1 Mac基於PyTorch(mps/cpu/cuda)的人工智慧AI本地語音識別庫Whisper(Python3.10).
Write the annotation script:
import whisper
import os
import json
import torchaudio
import argparse
import torch

lang2token = {
    'zh': "ZH|",
    'ja': "JP|",
    "en": "EN|",
}

def transcribe_one(audio_path):
    # load audio and pad/trim it to fit 30 seconds
    audio = whisper.load_audio(audio_path)
    audio = whisper.pad_or_trim(audio)
    # make log-Mel spectrogram and move to the same device as the model
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    # detect the spoken language
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")
    lang = max(probs, key=probs.get)
    # decode the audio
    options = whisper.DecodingOptions(beam_size=5)
    result = whisper.decode(model, mel, options)
    # print the recognized text
    print(result.text)
    return lang, result.text

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--languages", default="CJ")
    parser.add_argument("--whisper_size", default="medium")
    args = parser.parse_args()
    if args.languages == "CJE":
        lang2token = {
            'zh': "ZH|",
            'ja': "JP|",
            "en": "EN|",
        }
    elif args.languages == "CJ":
        lang2token = {
            'zh': "ZH|",
            'ja': "JP|",
        }
    elif args.languages == "C":
        lang2token = {
            'zh': "ZH|",
        }
    assert (torch.cuda.is_available()), "Please enable GPU in order to run Whisper!"
    model = whisper.load_model(args.whisper_size)
    parent_dir = "./custom_character_voice/"
    speaker_names = list(os.walk(parent_dir))[0][1]
    speaker_annos = []
    total_files = sum([len(files) for r, d, files in os.walk(parent_dir)])
    # resample audios
    # 2023/4/21: Get the target sampling rate
    with open("./configs/config.json", 'r', encoding='utf-8') as f:
        hps = json.load(f)
    target_sr = hps['data']['sampling_rate']
    processed_files = 0
    for speaker in speaker_names:
        for i, wavfile in enumerate(list(os.walk(parent_dir + speaker))[0][2]):
            # try to load file as audio
            if wavfile.startswith("processed_"):
                continue
            try:
                wav, sr = torchaudio.load(parent_dir + speaker + "/" + wavfile, frame_offset=0, num_frames=-1,
                                          normalize=True, channels_first=True)
                wav = wav.mean(dim=0).unsqueeze(0)
                if sr != target_sr:
                    wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=target_sr)(wav)
                if wav.shape[1] / sr > 20:
                    print(f"{wavfile} too long, ignoring\n")
                save_path = parent_dir + speaker + "/" + f"processed_{i}.wav"
                torchaudio.save(save_path, wav, target_sr, channels_first=True)
                # transcribe text
                lang, text = transcribe_one(save_path)
                if lang not in list(lang2token.keys()):
                    print(f"{lang} not supported, ignoring\n")
                    continue
                #text = "ZH|" + text + "\n"
                text = lang2token[lang] + text + "\n"
                speaker_annos.append(save_path + "|" + speaker + "|" + text)
                processed_files += 1
                print(f"Processed: {processed_files}/{total_files}")
            except:
                continue
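Note that the script above only collects the annotation lines in speaker_annos; to actually produce the filelist shown next, they still need to be written out at the end of the main block. A minimal completion (the output path is my assumption, point it wherever your config expects the filelist):
    # at the end of the __main__ block: write the collected annotations to a
    # filelist (the path below is an assumption, adjust it to your setup)
    with open("./filelists/short_character_anno.list", "w", encoding="utf-8") as f:
        for line in speaker_annos:
            f.write(line)
Assuming the script is saved as, say, transcribe_audio.py in the project root, it can be run with:
python3 transcribe_audio.py --languages C --whisper_size medium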
After annotation, a filelist is generated that maps each sliced audio file to its speaker and transcription:
./genshin_dataset/ying/vo_dialog_DPEQ003_raidenEi_01.wav|ying|ZH|神子…臣民對我的畏懼…
./genshin_dataset/ying/vo_dialog_DPEQ003_raidenEi_02.wav|ying|ZH|我不會那麼做…
./genshin_dataset/ying/vo_dialog_SGLQ002_raidenEi_01.wav|ying|ZH|不用著急,好好挑選吧,我就在這裡等著。
./genshin_dataset/ying/vo_dialog_SGLQ003_raidenEi_01.wav|ying|ZH|現在在做的事就是「留影」…
./genshin_dataset/ying/vo_dialog_SGLQ003_raidenEi_02.wav|ying|ZH|嗯,不錯,又學到新東西了。快開始吧。
In plain terms, Whisper first converts each character's speech into text; the project's text preprocessing step then adds the corresponding phoneme and tone annotations:
./genshin_dataset/ying/vo_dialog_DPEQ003_raidenEi_01.wav|ying|ZH|神子…臣民對我的畏懼…|_ sh en z i0 … ch en m in d ui w o d e w ei j v … _|0 2 2 5 5 0 2 2 2 2 4 4 3 3 5 5 4 4 4 4 0 0|1 2 2 1 2 2 2 2 2 2 2 1 1
./genshin_dataset/ying/vo_dialog_DPEQ003_raidenEi_02.wav|ying|ZH|我不會那麼做…|_ w o b u h ui n a m e z uo … _|0 3 3 2 2 4 4 4 4 5 5 4 4 0 0|1 2 2 2 2 2 2 1 1
./genshin_dataset/ying/vo_dialog_SGLQ002_raidenEi_01.wav|ying|ZH|不用著急,好好挑選吧,我就在這裡等著.|_ b u y ong zh ao j i , h ao h ao t iao x van b a , w o j iu z ai zh e l i d eng zh e . _|0 2 2 4 4 2 2 2 2 0 2 2 3 3 1 1 3 3 5 5 0 3 3 4 4 4 4 4 4 3 3 3 3 5 5 0 0|1 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 1 1
./genshin_dataset/ying/vo_dialog_SGLQ003_raidenEi_01.wav|ying|ZH|現在在做的事就是'留影'…|_ x ian z ai z ai z uo d e sh ir j iu sh ir ' l iu y ing ' … _|0 4 4 4 4 4 4 4 4 5 5 4 4 4 4 4 4 0 2 2 3 3 0 0 0|1 2 2 2 2 2 2 2 2 1 2 2 1 1 1
./genshin_dataset/ying/vo_dialog_SGLQ003_raidenEi_02.wav|ying|ZH|恩,不錯,又學到新東西了.快開始吧.|_ EE en , b u c uo , y ou x ve d ao x in d ong x i l e . k uai k ai sh ir b a
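As a rough illustration of where those phoneme and tone columns come from, here is a simplified sketch with pypinyin (my own example; the project's actual Chinese cleaner also handles tone sandhi and the word2ph alignment):
# Hedged sketch with pypinyin: split a sentence into initials/finals plus tone
# digits, roughly the "phones" and "tones" columns above.
from pypinyin import lazy_pinyin, Style

text = "我不會那麼做"
initials = lazy_pinyin(text, style=Style.INITIALS, strict=False)
finals = lazy_pinyin(text, style=Style.FINALS_TONE3, neutral_tone_with_five=True)
for ini, fin in zip(initials, finals):
    tone = fin[-1] if fin and fin[-1].isdigit() else "5"
    print(ini, fin.rstrip("012345"), tone)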
Finally, the annotated filelist is converted into BERT feature files that the model can read:
import torch
from multiprocessing import Pool
import commons
import utils
from tqdm import tqdm
from text import cleaned_text_to_sequence, get_bert
import argparse
import torch.multiprocessing as mp

def process_line(line):
    # worker index of the current multiprocessing process (0 in the main process)
    rank = mp.current_process()._identity
    rank = rank[0] if len(rank) > 0 else 0
    if torch.cuda.is_available():
        # spread worker processes across the available GPUs
        gpu_id = rank % torch.cuda.device_count()
        device = torch.device(f"cuda:{gpu_id}")
    wav_path, _, language_str, text, phones, tone, word2ph = line.strip().split("|")
    phone = phones.split(" ")
    tone = [int(i) for i in tone.split(" ")]
    word2ph = [int(i) for i in word2ph.split(" ")]
    word2ph = [i for i in word2ph]
    phone, tone, language = cleaned_text_to_sequence(phone, tone, language_str)
    # insert blank tokens between symbols (matches add_blank in the config)
    phone = commons.intersperse(phone, 0)
    tone = commons.intersperse(tone, 0)
    language = commons.intersperse(language, 0)
    # word2ph counts phones per character; double it to account for the blanks
    for i in range(len(word2ph)):
        word2ph[i] = word2ph[i] * 2
    word2ph[0] += 1
    bert_path = wav_path.replace(".wav", ".bert.pt")
    try:
        # reuse a cached .bert.pt file if it matches the phone sequence
        bert = torch.load(bert_path)
        assert bert.shape[-1] == len(phone)
    except Exception:
        # otherwise extract BERT features for this utterance and save them
        bert = get_bert(text, word2ph, language_str, device)
        assert bert.shape[-1] == len(phone)
        torch.save(bert, bert_path)
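The snippet above only defines process_line; in the project it is driven over the training and validation filelists with a multiprocessing pool. A hedged sketch of that driver loop (the argument names and default config path are assumptions, based on the config.json shown next):
# Hedged sketch of a main loop for process_line: read both filelists from the
# config and fan the lines out to a pool of worker processes.
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--config", type=str, default="configs/config.json")
    parser.add_argument("--num_processes", type=int, default=2)
    args = parser.parse_args()
    hps = utils.get_hparams_from_file(args.config)
    lines = []
    with open(hps.data.training_files, encoding="utf-8") as f:
        lines.extend(f.readlines())
    with open(hps.data.validation_files, encoding="utf-8") as f:
        lines.extend(f.readlines())
    with Pool(processes=args.num_processes) as pool:
        for _ in tqdm(pool.imap_unordered(process_line, lines), total=len(lines)):
            pass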
At this point, open the config.json file under the project's configs directory:
{
  "train": {
    "log_interval": 100,
    "eval_interval": 100,
    "seed": 52,
    "epochs": 200,
    "learning_rate": 0.0001,
    "betas": [
      0.8,
      0.99
    ],
    "eps": 1e-09,
    "batch_size": 4,
    "fp16_run": false,
    "lr_decay": 0.999875,
    "segment_size": 16384,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0,
    "skip_optimizer": true
  },
  "data": {
    "training_files": "filelists/train.list",
    "validation_files": "filelists/val.list",
    "max_wav_value": 32768.0,
    "sampling_rate": 44100,
    "filter_length": 2048,
    "hop_length": 512,
    "win_length": 2048,
    "n_mel_channels": 128,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 1,
    "cleaned_text": true,
    "spk2id": {
      "ying": 0
    }
  },
  "model": {
    "use_spk_conditioned_encoder": true,
    "use_noise_scaled_mas": true,
    "use_mel_posterior_encoder": false,
    "use_duration_discriminator": true,
    "inter_channels": 192,
    "hidden_channels": 192,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 6,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [
      3,
      7,
      11
    ],
    "resblock_dilation_sizes": [
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ]
    ],
    "upsample_rates": [
      8,
      8,
      2,
      2,
      2
    ],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [
      16,
      16,
      8,
      2,
      2
    ],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "gin_channels": 256
  }
}
The parameter that needs adjusting here is batch_size. As a rule of thumb, the value is roughly in line with the local VRAM size in GB, but it is safer to set it a little lower: on an 8 GB RTX 4060, a batch_size of 4 works well, whereas 8 can still occasionally run out of VRAM.
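If you are unsure how much VRAM the card actually reports, a quick check along the lines of that rule of thumb (my own helper, not part of the project):
# Hedged helper: read total VRAM on the first CUDA device and suggest a
# conservative batch_size (roughly half the GB figure, per the rule above).
import torch

vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024 ** 3
print(f"VRAM: {vram_gb:.1f} GB, try batch_size around {max(1, int(vram_gb) // 2)}")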
Then start training:
python3 train_ms.py
The program prints:
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [v3u.net]:65280 (system error: 10049 - 在其上下文中,該請求的地址無效。).
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [v3u.net]:65280 (system error: 10049 - 在其上下文中,該請求的地址無效。).
2023-10-23 15:36:08.293 | INFO | data_utils:_filter:61 - Init dataset...
100%|█████████████████████████████████████████████████████████████████████████████| 562/562 [00:00<00:00, 14706.57it/s]
2023-10-23 15:36:08.332 | INFO | data_utils:_filter:76 - skipped: 0, total: 562
2023-10-23 15:36:08.333 | INFO | data_utils:_filter:61 - Init dataset...
100%|████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<?, ?it/s]
2023-10-23 15:36:08.334 | INFO | data_utils:_filter:76 - skipped: 0, total: 4
Using noise scaled MAS for VITS2
Using duration discriminator for VITS2
INFO:OUTPUT_MODEL:Loaded checkpoint './logs\OUTPUT_MODEL\DUR_4600.pth' (iteration 33)
INFO:OUTPUT_MODEL:Loaded checkpoint './logs\OUTPUT_MODEL\G_4600.pth' (iteration 33)
INFO:OUTPUT_MODEL:Loaded checkpoint './logs\OUTPUT_MODEL\D_4600.pth' (iteration 33)
If you see output like this, everything is working; training logs and checkpoints are stored under the project's logs directory.
You can then monitor training with TensorBoard:
python3 -m tensorboard.main --logdir=logs\OUTPUT_MODEL
When the loss levels off, the model has converged.
Finally, we can use the trained model to generate the speech we want to hear:
python3 webui.py -m ./logs\OUTPUT_MODEL\G_47700.pth
Note that the argument points to the checkpoint of the iteration you want to use. If the current checkpoint already sounds good enough, simply copy the .pth file and config.json out of the project, and then carry on training toward the next checkpoint.
The guichu video featuring the Bert-vits2 clones of 渣渣輝 and 劉青雲 is already online on YouTube (and Bilibili); search for 劉悅的技術部落格 (Liu Yue's tech blog) to check it out.