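Hearing the Voice, Knowing the Meaning: Whisper, a Local AI Speech Recognition Library Built on PyTorch (mps/cpu/cuda) (Python 3.10)

Recap: the previous post, "Words Unspoken, a Voice Like a Hidden Orchid: The Strongest Free AI Text-to-Speech (TTS) Service, Microsoft Azure (Python 3.10 Integration)", used AI to synthesize speech from text. This time we go the other way and use the open-source library Whisper to turn speech back into text, which is what "hearing the voice, knowing the meaning" refers to.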

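Whisper is an open-source speech recognition library developed by OpenAI that supports speech recognition in many languages. It uses a Transformer encoder-decoder (sequence-to-sequence) model: the audio is converted to a log-Mel spectrogram, the encoder processes it, and the decoder generates the text. A single model handles multilingual transcription, translation of speech into English, spoken-language identification, and voice activity detection. Whisper is built on PyTorch, can be called through a Python API or a command-line tool, and ships with a set of pretrained models of different sizes to help users get started; load_model also accepts a path to a custom checkpoint, so it can serve as the engine behind a speech recognition service.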

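Installing PyTorch with MPS Support

As we know, PyTorch has never supported CUDA mode on macOS machines with M-series chips. The new MPS backend now extends the PyTorch ecosystem and lets existing scripts set up and run operations on the GPU.

At the time of writing, PyTorch is not compatible with Python 3.11, so we will use the latest 3.10.x release.

Make sure the latest Python 3.10 is installed: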

➜  transformers git:(stable) python3  
Python 3.10.9 (main, Dec 15 2022, 17:11:09) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin  
Type "help", "copyright", "credits" or "license" for more information.  
>>>

Then run the install command:

pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu

Once the installation succeeds, verify the status of PyTorch MPS in the terminal:

➜  transformers git:(stable) python3  
Python 3.10.9 (main, Dec 15 2022, 17:11:09) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin  
Type "help", "copyright", "credits" or "license" for more information.  
>>> import torch  
>>> torch.backends.mps.is_available()  
True  
>>>

If it returns True, you are all set.
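The availability check above can be folded into a small device-selection helper. The following is a minimal sketch, not part of the original post: the function names pick_device and detect_device are hypothetical, and the preference order (CUDA, then MPS, then CPU) is an assumption that suits most setups.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then MPS, then fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch which backends are available and pick one of them."""
    import torch  # imported lazily so pick_device() has no PyTorch dependency

    return pick_device(torch.cuda.is_available(),
                       torch.backends.mps.is_available())
```

On the machine used in this post, detect_device() would be expected to return "mps", since CUDA is unavailable and the MPS check returns True.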

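PyTorch MPS (Metal Performance Shaders) Performance Test

MPS here stands for Metal Performance Shaders, Apple's framework of GPU compute kernels (it is unrelated to NVIDIA's Multi-Process Service). The PyTorch MPS backend is a device backend built on that framework: tensors and models placed on the "mps" device have their operations executed on the Mac's GPU through Metal, which speeds up tensor computation compared with running it on the CPU.

The MPS backend runs on Macs and can be used for both training and inference. It can accelerate convolutional neural networks (CNNs), recurrent neural networks (RNNs), and other kinds of networks without any change to the model architecture: the model and its input tensors only need to be moved to the mps device.

Now for a simple test: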

import torch  
import timeit  
import random  
  
x = torch.ones(50000000,device='cpu')  
print(timeit.timeit(lambda:x*random.randint(0,100),number=1))

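The script first creates a tensor x of 50,000,000 ones and places it on the CPU. It then uses timeit.timeit to measure how long it takes to multiply x by a random integer between 0 and 100. number=1 means the operation runs only once, so the script measures a single elementwise multiplication on the CPU.

Result: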

➜  nlp_chinese /opt/homebrew/bin/python3.10 "/Users/liuyue/wodfan/work/nlp_chinese/mps_test.py"  
0.020812375005334616

On the 10-core CPU of an M1 Pro, the run time is 0.020812375005334616 seconds.

Now switch to MPS mode:

import torch  
import timeit  
import random  
  
x = torch.ones(50000000,device='mps')  
print(timeit.timeit(lambda:x*random.randint(0,100),number=1))

The program returns:

➜  nlp_chinese /opt/homebrew/bin/python3.10 "/Users/liuyue/wodfan/work/nlp_chinese/mps_test.py"  
0.003058041911572218

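The 16-core GPU takes only 0.003058041911572218 seconds.

In other words, in this single-run test MPS is about 7 times faster than the CPU (0.0208 s versus 0.0031 s, a ratio of roughly 6.8).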
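A single timed run is a rough indicator at best, because GPU work on MPS is queued asynchronously and the Python call can return before the kernel finishes. The sketch below is one way to get a steadier number: it warms up, repeats the measurement, and synchronizes the device around each timed call. The helpers bench and bench_multiply are hypothetical names, and the use of torch.mps.synchronize assumes a PyTorch version that provides it.

```python
import time
from statistics import median
from typing import Callable, Optional


def bench(fn: Callable[[], object],
          sync: Optional[Callable[[], None]] = None,
          warmup: int = 3,
          repeats: int = 10) -> float:
    """Return the median wall-clock time of fn() in seconds.

    If sync is given, it is called right before the clock starts and right
    before it stops, so asynchronously queued device work is fully counted.
    """
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    timings = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return median(timings)


def bench_multiply(device: str, n: int = 50_000_000) -> float:
    """Time an elementwise multiply on a tensor of n ones (requires PyTorch)."""
    import torch  # imported lazily so bench() itself has no PyTorch dependency

    x = torch.ones(n, device=device)
    # Assumption: torch.mps.synchronize exists in the installed PyTorch version.
    sync = torch.mps.synchronize if device == "mps" else None
    return bench(lambda: x * 3, sync=sync)
```

Calling bench_multiply("cpu") and bench_multiply("mps") on the same machine gives two medians that can be compared directly; the resulting ratio may differ from the single-run figure above.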

Whisper Speech Recognition

With PyTorch installed, we install Whisper:

pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git

After the installation, verify it:

➜  transformers git:(stable) whisper     
usage: whisper [-h] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large}] [--model_dir MODEL_DIR]  
               [--device DEVICE] [--output_dir OUTPUT_DIR] [--verbose VERBOSE] [--task {transcribe,translate}]  
               [--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,hi,hr,ht,hu,hy,id,is,it,iw,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}]

Then install ffmpeg:

brew install ffmpeg
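Whisper relies on the ffmpeg command-line tool to decode audio files, so it is worth confirming that the executable is on the PATH before running any recognition code. A minimal sketch using only the standard library (find_ffmpeg is a hypothetical helper name):

```python
import shutil
from typing import Optional


def find_ffmpeg() -> Optional[str]:
    """Return the full path of the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which("ffmpeg")


if __name__ == "__main__":
    path = find_ffmpeg()
    print(f"ffmpeg found at {path}" if path else "ffmpeg not found; install it first")
```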

Next, write the speech recognition code:

import whisper  
  
model = whisper.load_model("small")  
  
# load audio and pad/trim it to fit 30 seconds  
audio = whisper.load_audio("/Users/liuyue/wodfan/work/mydemo/b1.wav")  
audio = whisper.pad_or_trim(audio)  
  
# make log-Mel spectrogram and move to the same device as the model  
  
mel = whisper.log_mel_spectrogram(audio).to("cpu")  
  
# detect the spoken language  
_, probs = model.detect_language(mel)  
print(f"Detected language: {max(probs, key=probs.get)}")  
  
# decode the audio  
options = whisper.DecodingOptions(fp16 = False)  
result = whisper.decode(model, mel, options)  
  
# print the recognized text  
print(result.text)

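After the audio is loaded, whisper.log_mel_spectrogram computes the log-Mel spectrogram, model.detect_language identifies the spoken language from it, and whisper.decode produces the text: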
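In the script, probs is a dictionary that maps language codes to probabilities, and max(probs, key=probs.get) picks the code with the highest probability. The sketch below shows the same idea on made-up numbers and extends it to the top few candidates; top_languages is a hypothetical helper, and the probabilities are illustrative only, not real model output.

```python
from typing import Dict, List, Tuple


def top_languages(probs: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most probable (language, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up probabilities for illustration; real values come from model.detect_language(mel).
example_probs = {"zh": 0.91, "ja": 0.05, "en": 0.03, "ko": 0.01}

best = max(example_probs, key=example_probs.get)  # same expression as in the script
print(best, top_languages(example_probs, k=2))
```

Looking at the runners-up can help when the audio is short or mixes languages and the top probability is not decisive.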

➜  minGPT git:(master) ✗ /opt/homebrew/bin/python3.10 "/Users/liuyue/wodfan/work/minGPT/wisper_test.py"  
Detected language: zh  
Hello大家好,這裡是劉悅的技術部落格,眾神殿內,高朋滿座,聖有如雲,VMware,Virtual Box,UPM等虛擬機器器大神群英匯翠,指見位於C位王座上的Parallels唱網擡頭,緩緩群尋,屁膩群小,目光到處,無人敢擡頭對視。是的,如果說虛擬機器器領域有一位王者,非Parallels不能領袖群倫,畢竟大廠背書,功能滿格,美中不足之處就是價格略高,

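This run uses the small model; larger models such as medium and large can be used as well. Larger models are generally more accurate, at the cost of a bigger download, more memory, and slower inference.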
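The script above calls whisper.pad_or_trim, which fixes the input at 30 seconds, so only the first 30 seconds of the recording are decoded. For longer files, Whisper's higher-level model.transcribe method processes the whole file in consecutive windows. The following is a sketch under the assumption that the file is readable by ffmpeg; transcribe_file and format_segments are hypothetical helper names.

```python
from typing import Dict, List


def format_segments(segments: List[Dict]) -> str:
    """Render segments as '[start -> end] text' lines, with times in seconds."""
    return "\n".join(
        f"[{seg['start']:7.2f} -> {seg['end']:7.2f}] {seg['text'].strip()}"
        for seg in segments
    )


def transcribe_file(path: str, model_name: str = "small") -> str:
    """Transcribe a whole audio file (requires the whisper package and ffmpeg)."""
    import whisper  # imported lazily so format_segments() works without it

    model = whisper.load_model(model_name)
    # fp16=False mirrors the DecodingOptions used in the script above.
    result = model.transcribe(path, fp16=False)
    return format_segments(result["segments"])
```

The full text is also available as result["text"]; the per-segment form is convenient when timestamps are needed, for example for subtitles.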

To run on MPS, one option is to edit the Whisper source so that load_model falls back to mps by default when no device is passed:

def load_model(name: str, device: Optional[Union[str, torch.device]] = None, download_root: str = None, in_memory: bool = False) -> Whisper:  
    """  
    Load a Whisper ASR model  
  
    Parameters  
    ----------  
    name : str  
        one of the official model names listed by `whisper.available_models()`, or  
        path to a model checkpoint containing the model dimensions and the model state_dict.  
    device : Union[str, torch.device]  
        the PyTorch device to put the model into  
    download_root: str  
        path to download the model files; by default, it uses "~/.cache/whisper"  
    in_memory: bool  
        whether to preload the model weights into host memory  
  
    Returns  
    -------  
    model : Whisper  
        The Whisper ASR model instance  
    """  
  
    if device is None:  
        device = "cuda" if torch.cuda.is_available() else "mps"

The modified code is at line 18.
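As an alternative to editing the library source, the device parameter in the signature above can be passed directly when loading the model. The sketch below does that and moves the spectrogram to whatever device the model ended up on; decode_first_30s is a hypothetical helper name, and whether every operation actually runs on MPS depends on the installed PyTorch version.

```python
def decode_first_30s(audio_path: str,
                     model_name: str = "medium",
                     device: str = "mps") -> str:
    """Decode the first 30 seconds of an audio file on the given device."""
    import whisper  # imported lazily; requires the whisper package and ffmpeg

    # Pass the device explicitly instead of changing the default in the source.
    model = whisper.load_model(model_name, device=device)

    # Load the audio and pad or trim it to exactly 30 seconds.
    audio = whisper.pad_or_trim(whisper.load_audio(audio_path))

    # Keep the spectrogram on the same device as the model.
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    options = whisper.DecodingOptions(fp16=False)
    return whisper.decode(model, mel, options).text
```

Using model.device rather than a hard-coded string keeps the script working unchanged if the device argument is later switched to "cpu" or "cuda".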

Then change the script to use mps as well:

import whisper  
  
model = whisper.load_model("medium")  
  
# load audio and pad/trim it to fit 30 seconds  
audio = whisper.load_audio("/Users/liuyue/wodfan/work/mydemo/b1.wav")  
audio = whisper.pad_or_trim(audio)  
  
# make log-Mel spectrogram and move to the same device as the model  
  
mel = whisper.log_mel_spectrogram(audio).to("mps")  
  
# detect the spoken language  
_, probs = model.detect_language(mel)  
print(f"Detected language: {max(probs, key=probs.get)}")  
  
# decode the audio  
options = whisper.DecodingOptions(fp16 = False)  
result = whisper.decode(model, mel, options)  
  
# print the recognized text  
print(result.text)

This time, with the medium model, the program returns:

➜  minGPT git:(master) ✗ /opt/homebrew/bin/python3.10 "/Users/liuyue/wodfan/work/minGPT/wisper_test.py"  
100%|█████████████████████████████████████| 1.42G/1.42G [02:34<00:00, 9.90MiB/s]  
Detected language: zh  
Hello 大家好,這裡是劉悅的技術部落格,眾神殿內,高朋滿座,聖有如雲,VMware,Virtualbox,UTM等虛擬機器器大神群音惠翠,只見位於C位王座上的Parallels唱往擡頭,緩緩輕尋,屁逆群小,目光到處,無人敢擡頭對視。

Efficiency and accuracy both improve noticeably, but the medium model is also larger: the download is 1.42 GB.

Conclusion

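As an open-source speech recognition library, Whisper supports many languages and uses a Transformer encoder-decoder model to convert speech to text. It accepts custom checkpoints, can serve as the engine behind a speech recognition service, and handles transcription, translation into English, language identification, and voice activity detection. With PyTorch's MPS backend behind it, it is like a tiger that has grown wings: a superb library, well worth having.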