A dream steps into reality: Microsoft, true to form, has open-sourced J.A.R.V.I.S., its artificial-intelligence assistant system. Jarvis, short for "Just A Rather Very Intelligent System", is the AI that helps Iron Man, Tony Stark, with all manner of tasks and challenges in the films: controlling and managing his armor, providing real-time intelligence and data analysis, helping him make decisions, and so on.
Today we can have a Jarvis of our own, and the cost is just one RTX 3090 graphics card.
In deep learning, the usual entry-level cards are the RTX 2070 or 3070, and the 3090 is about as far as consumer-grade deep-learning hardware goes.
Beyond that you are into the workstation and datacenter A-series and V-series cards. VRAM is the hard requirement, because the original large models have to be loaded; you can edit the code to "trim" what gets loaded, but some functionality will inevitably be lost. If you don't have a 3090, two 12 GB RTX 3060s running in parallel can reach the VRAM bar, but their compute and overall performance still fall short of a single 3090.
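Before going further, it's worth confirming how much VRAM is actually available. A minimal sketch, assuming PyTorch with CUDA support is installed:
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, total VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA device found; local inference would fall back to the CPU (very slow).")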
Once your local hardware is up to supporting Jarvis, the usual first step is to clone the project:
git clone https://github.com/microsoft/JARVIS.git
Then enter the project directory:
cd JARVIS
Next, edit the project's configuration file, server/config.yaml:
openai:
  key: your_personal_key # gradio, your_personal_key
huggingface:
  cookie: # required for huggingface inference
local: # ignore: just for development
  endpoint: http://localhost:8003
dev: false
debug: false
log_file: logs/debug.log
model: text-davinci-003 # text-davinci-003
use_completion: true
inference_mode: hybrid # local, huggingface or hybrid
local_deployment: minimal # no, minimal, standard or full
num_candidate_models: 5
max_description_length: 100
proxy:
httpserver:
  host: localhost
  port: 8004
modelserver:
  host: localhost
  port: 8005
logit_bias:
  parse_task: 0.1
  choose_model: 5
Only three settings really need changing here: the OpenAI key, the Hugging Face cookie token taken from the huggingface.co site, and the OpenAI model; the default model is text-davinci-003.
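If you want to make sure those edits took effect before launching anything, a quick pre-flight check could look like the sketch below (illustrative only, assuming PyYAML is installed; the key names are taken from the config above):
import yaml  # pip install pyyaml

with open("server/config.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

assert cfg["openai"]["key"], "openai.key is still empty"
assert cfg["huggingface"]["cookie"], "huggingface.cookie is still empty"
print("model in use:", cfg["model"])  # text-davinci-003 by default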
With the config done, the project officially recommends a conda virtual environment with Python 3.8. Personally I see no need at all for a virtual environment here; plain Python 3.10 is fine. Next, install the dependencies:
pip3 install -r requirements.txt
The project's dependencies are as follows:
git+https://github.com/huggingface/diffusers.git@8c530fc2f6a76a2aefb6b285dce6df1675092ac6#egg=diffusers
git+https://github.com/huggingface/transformers@c612628045822f909020f7eb6784c79700813eda#egg=transformers
git+https://github.com/patrickvonplaten/controlnet_aux@78efc716868a7f5669c288233d65b471f542ce40#egg=controlnet_aux
tiktoken==0.3.3
pydub==0.25.1
espnet==202301
espnet_model_zoo==0.1.7
flask==2.2.3
flask_cors==3.0.10
waitress==2.1.2
datasets==2.11.0
asteroid==0.6.0
speechbrain==0.5.14
timm==0.6.13
typeguard==2.13.3
accelerate==0.18.0
pytesseract==0.3.10
gradio==3.24.1
The web interface here is built on the fairly recent Flask 2.2, yet, oddly, Microsoft does not use the async features that newer Flask versions offer.
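For reference, an async view in Flask 2.x looks like the sketch below. This is purely illustrative and is not how the JARVIS server is written; it also assumes the flask[async] extra (asgiref) is installed:
import asyncio
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/ping")
async def ping():
    await asyncio.sleep(0.1)  # stand-in for an awaitable model or I/O call
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(port=5000)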
After the installation finishes, enter the models directory:
cd models
Download the models and datasets:
sh download.sh
Brace yourself here: the models alone take up an enormous amount of disk space, to say nothing of the datasets. Everything is pulled from Hugging Face:
models="
nlpconnect/vit-gpt2-image-captioning
lllyasviel/ControlNet
runwayml/stable-diffusion-v1-5
CompVis/stable-diffusion-v1-4
stabilityai/stable-diffusion-2-1
Salesforce/blip-image-captioning-large
damo-vilab/text-to-video-ms-1.7b
microsoft/speecht5_asr
facebook/maskformer-swin-large-ade
microsoft/biogpt
facebook/esm2_t12_35M_UR50D
microsoft/trocr-base-printed
microsoft/trocr-base-handwritten
JorisCos/DCCRNet_Libri1Mix_enhsingle_16k
espnet/kan-bayashi_ljspeech_vits
facebook/detr-resnet-101
microsoft/speecht5_tts
microsoft/speecht5_hifigan
microsoft/speecht5_vc
facebook/timesformer-base-finetuned-k400
runwayml/stable-diffusion-v1-5
superb/wav2vec2-base-superb-ks
openai/whisper-base
Intel/dpt-large
microsoft/beit-base-patch16-224-pt22k-ft22k
facebook/detr-resnet-50-panoptic
facebook/detr-resnet-50
openai/clip-vit-large-patch14
google/owlvit-base-patch32
microsoft/DialoGPT-medium
bert-base-uncased
Jean-Baptiste/camembert-ner
deepset/roberta-base-squad2
facebook/bart-large-cnn
google/tapas-base-finetuned-wtq
distilbert-base-uncased-finetuned-sst-2-english
gpt2
mrm8488/t5-base-finetuned-question-generation-ap
Jean-Baptiste/camembert-ner
t5-base
impira/layoutlm-document-qa
ydshieh/vit-gpt2-coco-en
dandelin/vilt-b32-finetuned-vqa
lambdalabs/sd-image-variations-diffusers
facebook/timesformer-base-finetuned-k400
facebook/maskformer-swin-base-coco
Intel/dpt-hybrid-midas
lllyasviel/sd-controlnet-canny
lllyasviel/sd-controlnet-depth
lllyasviel/sd-controlnet-hed
lllyasviel/sd-controlnet-mlsd
lllyasviel/sd-controlnet-openpose
lllyasviel/sd-controlnet-scribble
lllyasviel/sd-controlnet-seg
"
# CURRENT_DIR=$(cd `dirname $0`; pwd)
CURRENT_DIR=$(pwd)
for model in $models;
do
echo "----- Downloading from https://huggingface.co/"$model" -----"
if [ -d "$model" ]; then
# cd $model && git reset --hard && git pull && git lfs pull
cd $model && git pull && git lfs pull
cd $CURRENT_DIR
else
# git clone already pulls the lfs files
git clone https://huggingface.co/$model $model
fi
done
datasets="Matthijs/cmu-arctic-xvectors"
for dataset in $datasets;
do
echo "----- Downloading from https://huggingface.co/datasets/"$dataset" -----"
if [ -d "$dataset" ]; then
cd $dataset && git pull && git lfs pull
cd $CURRENT_DIR
else
git clone https://huggingface.co/datasets/$dataset $dataset
fi
done
You could also consider splitting this into two shell scripts and downloading with multiple processes, which is much faster; a rough sketch of the idea follows.
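A minimal sketch of that parallel-download idea in Python (illustrative only: it assumes git and git-lfs are installed, and the model list here is truncated; fill it in from download.sh):
import subprocess
from concurrent.futures import ThreadPoolExecutor

models = [
    "nlpconnect/vit-gpt2-image-captioning",
    "runwayml/stable-diffusion-v1-5",
    # ... the rest of the list from download.sh
]

def clone(repo: str) -> None:
    # each clone runs as its own git process, so the downloads overlap
    subprocess.run(["git", "clone", f"https://huggingface.co/{repo}", repo], check=False)

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(clone, models))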
But honestly, don't. The files are simply too large; this is not something an ordinary machine can handle. You can also choose not to download the local models and datasets and still run the project; see below.
Once the long download is over, Jarvis is set up.
If you did choose to download every model and dataset (you have my respect), start the service in a terminal:
python models_server.py --config config.yaml
A Flask service process then starts; per the config above, the HTTP endpoint listens on port 8004, and Jarvis can be driven with a plain HTTP request:
curl --location 'http://localhost:8004/hugginggpt' \
--header 'Content-Type: application/json' \
--data '{
"messages": [
{
"role": "user",
"content": "please generate a video based on \"Spiderman is surfing\""
}
]
}'
This asks Jarvis to generate a video of "Spiderman is surfing".
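The same request can be issued from Python; a small sketch, assuming the requests package is installed and the service from the previous step is running:
import requests

payload = {
    "messages": [
        {"role": "user",
         "content": 'please generate a video based on "Spiderman is surfing"'}
    ]
}
resp = requests.post("http://localhost:8004/hugginggpt", json=payload, timeout=600)
print(resp.json())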
Of course, with my hardware there is no way this will run, so the set of models that gets loaded can be "trimmed" as needed. Around line 81 of models_server.py:
other_pipes = {
"nlpconnect/vit-gpt2-image-captioning":{
"model": VisionEncoderDecoderModel.from_pretrained(f"{local_fold}/nlpconnect/vit-gpt2-image-captioning"),
"feature_extractor": ViTImageProcessor.from_pretrained(f"{local_fold}/nlpconnect/vit-gpt2-image-captioning"),
"tokenizer": AutoTokenizer.from_pretrained(f"{local_fold}/nlpconnect/vit-gpt2-image-captioning"),
"device": "cuda:0"
},
"Salesforce/blip-image-captioning-large": {
"model": BlipForConditionalGeneration.from_pretrained(f"{local_fold}/Salesforce/blip-image-captioning-large"),
"processor": BlipProcessor.from_pretrained(f"{local_fold}/Salesforce/blip-image-captioning-large"),
"device": "cuda:0"
},
"damo-vilab/text-to-video-ms-1.7b": {
"model": DiffusionPipeline.from_pretrained(f"{local_fold}/damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"),
"device": "cuda:0"
},
"facebook/maskformer-swin-large-ade": {
"model": MaskFormerForInstanceSegmentation.from_pretrained(f"{local_fold}/facebook/maskformer-swin-large-ade"),
"feature_extractor" : AutoFeatureExtractor.from_pretrained("facebook/maskformer-swin-large-ade"),
"device": "cuda:0"
},
"microsoft/trocr-base-printed": {
"processor": TrOCRProcessor.from_pretrained(f"{local_fold}/microsoft/trocr-base-printed"),
"model": VisionEncoderDecoderModel.from_pretrained(f"{local_fold}/microsoft/trocr-base-printed"),
"device": "cuda:0"
},
"microsoft/trocr-base-handwritten": {
"processor": TrOCRProcessor.from_pretrained(f"{local_fold}/microsoft/trocr-base-handwritten"),
"model": VisionEncoderDecoderModel.from_pretrained(f"{local_fold}/microsoft/trocr-base-handwritten"),
"device": "cuda:0"
},
"JorisCos/DCCRNet_Libri1Mix_enhsingle_16k": {
"model": BaseModel.from_pretrained("JorisCos/DCCRNet_Libri1Mix_enhsingle_16k"),
"device": "cuda:0"
},
"espnet/kan-bayashi_ljspeech_vits": {
"model": Text2Speech.from_pretrained(f"espnet/kan-bayashi_ljspeech_vits"),
"device": "cuda:0"
},
"lambdalabs/sd-image-variations-diffusers": {
"model": DiffusionPipeline.from_pretrained(f"{local_fold}/lambdalabs/sd-image-variations-diffusers"), #torch_dtype=torch.float16
"device": "cuda:0"
},
"CompVis/stable-diffusion-v1-4": {
"model": DiffusionPipeline.from_pretrained(f"{local_fold}/CompVis/stable-diffusion-v1-4"),
"device": "cuda:0"
},
"stabilityai/stable-diffusion-2-1": {
"model": DiffusionPipeline.from_pretrained(f"{local_fold}/stabilityai/stable-diffusion-2-1"),
"device": "cuda:0"
},
"runwayml/stable-diffusion-v1-5": {
"model": DiffusionPipeline.from_pretrained(f"{local_fold}/runwayml/stable-diffusion-v1-5"),
"device": "cuda:0"
},
"microsoft/speecht5_tts":{
"processor": SpeechT5Processor.from_pretrained(f"{local_fold}/microsoft/speecht5_tts"),
"model": SpeechT5ForTextToSpeech.from_pretrained(f"{local_fold}/microsoft/speecht5_tts"),
"vocoder": SpeechT5HifiGan.from_pretrained(f"{local_fold}/microsoft/speecht5_hifigan"),
"embeddings_dataset": load_dataset(f"{local_fold}/Matthijs/cmu-arctic-xvectors", split="validation"),
"device": "cuda:0"
},
"speechbrain/mtl-mimic-voicebank": {
"model": WaveformEnhancement.from_hparams(source="speechbrain/mtl-mimic-voicebank", savedir="models/mtl-mimic-voicebank"),
"device": "cuda:0"
},
"microsoft/speecht5_vc":{
"processor": SpeechT5Processor.from_pretrained(f"{local_fold}/microsoft/speecht5_vc"),
"model": SpeechT5ForSpeechToSpeech.from_pretrained(f"{local_fold}/microsoft/speecht5_vc"),
"vocoder": SpeechT5HifiGan.from_pretrained(f"{local_fold}/microsoft/speecht5_hifigan"),
"embeddings_dataset": load_dataset(f"{local_fold}/Matthijs/cmu-arctic-xvectors", split="validation"),
"device": "cuda:0"
},
"julien-c/wine-quality": {
"model": joblib.load(cached_download(hf_hub_url("julien-c/wine-quality", "sklearn_model.joblib")))
},
"facebook/timesformer-base-finetuned-k400": {
"processor": AutoImageProcessor.from_pretrained(f"{local_fold}/facebook/timesformer-base-finetuned-k400"),
"model": TimesformerForVideoClassification.from_pretrained(f"{local_fold}/facebook/timesformer-base-finetuned-k400"),
"device": "cuda:0"
},
"facebook/maskformer-swin-base-coco": {
"feature_extractor": MaskFormerFeatureExtractor.from_pretrained(f"{local_fold}/facebook/maskformer-swin-base-coco"),
"model": MaskFormerForInstanceSegmentation.from_pretrained(f"{local_fold}/facebook/maskformer-swin-base-coco"),
"device": "cuda:0"
},
"Intel/dpt-hybrid-midas": {
"model": DPTForDepthEstimation.from_pretrained(f"{local_fold}/Intel/dpt-hybrid-midas", low_cpu_mem_usage=True),
"feature_extractor": DPTFeatureExtractor.from_pretrained(f"{local_fold}/Intel/dpt-hybrid-midas"),
"device": "cuda:0"
}
}
Simply comment out the models you don't need; one way of doing that is sketched after this paragraph.
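A rough sketch of the idea (not the repo's own mechanism, and the whitelist contents here are arbitrary): gate each heavy pipeline behind a whitelist so only the models you actually need are loaded eagerly. local_fold refers to the variable defined earlier in models_server.py that points at the downloaded model folder.
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

# keep only what you actually need on a single consumer GPU
wanted = {"nlpconnect/vit-gpt2-image-captioning"}

other_pipes = {}
if "nlpconnect/vit-gpt2-image-captioning" in wanted:
    other_pipes["nlpconnect/vit-gpt2-image-captioning"] = {
        "model": VisionEncoderDecoderModel.from_pretrained(f"{local_fold}/nlpconnect/vit-gpt2-image-captioning"),
        "feature_extractor": ViTImageProcessor.from_pretrained(f"{local_fold}/nlpconnect/vit-gpt2-image-captioning"),
        "tokenizer": AutoTokenizer.from_pretrained(f"{local_fold}/nlpconnect/vit-gpt2-image-captioning"),
        "device": "cuda:0",
    }
# ...repeat the same guard for each remaining entry, or simply delete the
# entries you don't want from the original dict.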
Alternatively, you can skip the local model service entirely by setting inference_mode: huggingface in config.yaml.
Then run the ChatGPT-driven Jarvis directly in the terminal:
cd server
python3 awesome_chat.py --config config.yaml --mode cli
The result looks like this:
Welcome to Jarvis! A collaborative system that consists of an LLM as the controller and numerous expert models as collaborative executors. Jarvis can plan tasks, schedule Hugging Face models, generate friendly responses based on your requests, and help you with many things. Please enter your request (`exit` to exit).
[ User ]: Given a collection of image A: /examples/cat.jpg, B: /examples/z3.jpg, C: /examples/z1.jpg, please tell me how many zebras in these picture?
[ Jarvis ]: Based on your request, I have used several models to process the images you provided. For the image A: /examples/cat.jpg, I used the image-to-text model nlpconnect/vit-gpt2-image-captioning to generate the text "a cat sitting on a window sill looking out". Then I used the object-detection model facebook/detr-resnet-50 to generate the image with predicted box '/images/a3dc.jpg', which contains two objects with labels 'potted plant' and 'cat'. For the image B: /examples/z3.jpg, I used the image-to-text model nlpconnect/vit-gpt2-image-captioning to generate the text "two zebras standing in a field of dry grass". Then I used the object-detection model facebook/detr-resnet-50 to generate the image with predicted box '/images/f5df.jpg', which contains three objects with labels 'zebra'. For the image C: /examples/z1.jpg, I used the image-to-text model nlpconnect/vit-gpt2-image-captioning to generate the text "a zebra standing in a field of tall grass". Then I used the object-detection model facebook/detr-resnet-50 to generate the image with predicted box '/images/69c7.jpg', which contains one object with label 'zebra'. Finally, I used the canny-control model canny-control to generate the image '/images/7e18.png' based on the image /examples/savanna.jpg. Therefore, there are four zebras in these pictures. Is there anything else I can help you with?
[ User ]: Please answer all the named entities in the sentence: Iron Man is a superhero appearing in American comic books published by Marvel Comics. The character was co-created by writer and editor Stan Lee, developed by scripter Larry Lieber, and designed by artists Don Heck and Jack Kirby.
[ Jarvis ]: Sure, I understand your request. Based on the inference results of the models, I will answer your request step by step. Firstly, I used the token-classification model dslim/bert-base-NER to recognize the named entities in the sentence. The inference result is that the named entities are Iron Man, Stan Lee, Larry Lieber, Don Heck and Jack Kirby. Then, I used the question-answering model deepset/roberta-base-squad2 to answer your request. The inference result is that there is no output since the context cannot be empty. Therefore, I cannot make it. I hope this answer your request. Is there anything else I can help you with?
[ User ]: Please generate a canny image based on /examples/savanna.jpg
[ Jarvis ]: Sure. I have generated a canny image based on /examples/savanna.jpg. To do this, I first used the image-to-text model nlpconnect/vit-gpt2-image-captioning to generate the text description of the image, which is "a herd of giraffes and zebras grazing in a field". Then I used the canny-control model to generate the canny image of the original image, which is stored in the path "/images/1ea2.png". Finally, I used the canny-text-to-image model lllyasviel/sd-controlnet-canny to generate the canny image based on the text description and the canny image, which is stored in the path "/images/ba66.png". Do you need anything else?
You can think of it as a ChatGPT-flavored text-to-image (and text-to-whatever) pipeline wrapped in a Jarvis shell; as theatre, it scores full marks.
All in all, as with Microsoft's earlier open-source "Visual ChatGPT", the symbolic value of projects like this far outweighs their practical value. Jarvis stands for a vision most of us in the field share, and the direction of this kind of AI deserves recognition, but with the hardware bar this high, don't expect too much of it in the short term.