WinUI (WASDK): Making a Robot Smarter with ChatGPT, Camera Gesture Recognition, and TTS

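Preface

I previously wrote a blog post on classifying hand keypoints with ML.NET, which extracts hands from an image and classifies the gesture. I have since combined that gesture classification with live camera data and integrated it into 電子腦殼, the software I develop.

電子腦殼 is a desktop application project that provides software features for ElectronBot, the desktop robot open-sourced by 稚暉君. It is developed by me, 綠蔭阿廣, using Microsoft's WASDK (Windows App SDK) framework.

電子腦殼 is essentially my practice project for learning WinUI development. By studying a number of open-source projects, I have integrated several features into it. For example, gesture recognition triggers speech-to-text, the recognized text is sent to ChatGPT, and the reply is spoken through text-to-speech, which lets the robot hold a conversation.

This post is a hands-on record, so I can hit the pitfalls before you do.

The image below links to a demo video of the robot. In the conversation I asked ChatGPT to tell me the story of 駱駝祥子 (Rickshaw Boy). The story went a little off the rails: the first part was fine, but after that it started making things up, for example that Xiangzi got a donkey and eventually became a merchant.

If you enjoy the video, please give it a like.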

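Implementation approach

1. Overview of the approach

The overall flow is shown in the diagram below. The diagram may not follow any standard notation, but it captures the general idea:

Handle the camera frame event: process each camera frame and match it against the known gestures.
The gesture-result handler invokes the speech-to-text logic.
The transcribed text is sent to the ChatGPT API to get an intelligent reply.
The reply text is played via TTS through the robot's speaker, completing one round of conversation.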
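The four steps above can be sketched as a small orchestration class. This is only an illustration under assumptions: `ConversationPipeline`, its delegates, and the trigger-gesture label are hypothetical names that do not come from the project, and the real speech, chat, and TTS calls are passed in as delegates so the flow itself stays independent of WinRT.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical orchestration of the four steps; none of these names come from the project.
public sealed class ConversationPipeline
{
    private readonly string _triggerGesture;
    private readonly Func<Task<string>> _listenAsync;       // step 2: speech-to-text
    private readonly Func<string, Task<string>> _askAsync;  // step 3: chat API call
    private readonly Func<string, Task> _speakAsync;        // step 4: TTS playback
    private int _busy;                                       // 1 while a round is running

    public ConversationPipeline(
        string triggerGesture,
        Func<Task<string>> listenAsync,
        Func<string, Task<string>> askAsync,
        Func<string, Task> speakAsync)
    {
        _triggerGesture = triggerGesture;
        _listenAsync = listenAsync;
        _askAsync = askAsync;
        _speakAsync = speakAsync;
    }

    // Step 1 feeds in the label produced by gesture classification for one frame.
    // Returns true when a full conversation round was completed.
    public async Task<bool> OnGestureAsync(string gestureLabel)
    {
        if (!string.Equals(gestureLabel, _triggerGesture, StringComparison.OrdinalIgnoreCase))
            return false;

        // Frames arrive continuously, so ignore triggers while a round is running.
        if (Interlocked.CompareExchange(ref _busy, 1, 0) != 0)
            return false;

        try
        {
            string question = await _listenAsync();
            if (string.IsNullOrWhiteSpace(question))
                return false;

            string reply = await _askAsync(question);
            await _speakAsync(reply);
            return true;
        }
        finally
        {
            Interlocked.Exchange(ref _busy, 0);
        }
    }
}
```

Passing the three stages in as delegates is a design choice made for this sketch: it makes it easy to swap the chat backend or to exercise the flow with fakes, without a camera or microphone.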

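2. Technologies used

WASDK (Windows App SDK)
MediaPipe offers open source cross-platform, customizable ML solutions for live and streaming media.
ML.NET: an open-source, cross-platform machine learning framework

I covered this stack in my previous article, so I won't go into it here. If you are interested, follow the link to the earlier post:

WinUI (WASDK): Detecting Hand Keypoints with MediaPipe and Classifying Gestures with ML.NET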

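Code walkthrough

1. Project overview

電子腦殼 is a standard MVVM WinUI project. It uses Microsoft's lightweight DI container to manage object lifetimes. For MVVM it uses the framework provided by the Community Toolkit, which supports source generation and keeps the ViewModel code short.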
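As a rough illustration of that combination, the sketch below registers a service and a view model with Microsoft.Extensions.DependencyInjection and uses the CommunityToolkit.Mvvm source generators. It assumes both NuGet packages are referenced; `IGreetingService`, `GreetingService`, `DemoViewModel`, and `DemoComposition` are hypothetical names invented for this example, not types from the project.

```csharp
using CommunityToolkit.Mvvm.ComponentModel;
using CommunityToolkit.Mvvm.Input;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical service, for illustration only.
public interface IGreetingService
{
    string Greet(string name);
}

public sealed class GreetingService : IGreetingService
{
    public string Greet(string name) => $"Hello, {name}!";
}

// 'partial' lets the toolkit's source generator add the generated members.
public partial class DemoViewModel : ObservableObject
{
    private readonly IGreetingService _greetingService;

    public DemoViewModel(IGreetingService greetingService) =>
        _greetingService = greetingService;

    // Generates a public 'Message' property that raises PropertyChanged.
    [ObservableProperty]
    private string message = string.Empty;

    // Generates a public 'GreetCommand' that XAML can bind to.
    [RelayCommand]
    private void Greet() => Message = _greetingService.Greet("ElectronBot");
}

public static class DemoComposition
{
    // Singleton service, transient view model: one possible lifetime choice.
    public static ServiceProvider BuildServices() =>
        new ServiceCollection()
            .AddSingleton<IGreetingService, GreetingService>()
            .AddTransient<DemoViewModel>()
            .BuildServiceProvider();
}
```

A caller would resolve the view model with `DemoComposition.BuildServices().GetRequiredService<DemoViewModel>()`, which plays a role similar to the `App.GetService<T>()` calls seen in the project's code.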

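2. Core code walkthrough

Parsing gestures from the live video stream. Using the MediaCapture class in the Windows.Media.Capture namespace and the MediaFrameReader class in the Windows.Media.Capture.Frames namespace, we create the objects and register a frame handler. The handler processes each video frame and passes it to the gesture recognition service. The main code is as follows.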

// Handler subscribed to processed camera frames
private void Current_SoftwareBitmapFrameCaptured(object? sender, SoftwareBitmapEventArgs e)
{
    if (e.SoftwareBitmap is not null)
    {
        // Normalize to Bgra8 with premultiplied alpha before further processing
        if (e.SoftwareBitmap.BitmapPixelFormat != BitmapPixelFormat.Bgra8 ||
            e.SoftwareBitmap.BitmapAlphaMode == BitmapAlphaMode.Straight)
        {
            e.SoftwareBitmap = SoftwareBitmap.Convert(
                e.SoftwareBitmap, BitmapPixelFormat.Bgra8, BitmapAlphaMode.Premultiplied);
        }
        // Get the gesture recognition service
        var service = App.GetService<GestureClassificationService>();
        // Run gesture analysis (fire-and-forget, so the frame handler is not blocked)
        _ = service.HandPredictResultUnUseQueueAsync(calculator, modelPath, e.SoftwareBitmap);
    }
}

The code involved is in:

MainViewModel

CameraFrameService
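For readers who have not used these classes, here is a minimal sketch of how a frame reader can be set up and how frames can be surfaced as SoftwareBitmap objects. It is not the project's CameraFrameService: `CameraFrameSketch` and its `FrameCaptured` event are hypothetical names, and picking the first color camera is an assumption made for brevity.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Windows.Graphics.Imaging;
using Windows.Media.Capture;
using Windows.Media.Capture.Frames;
using Windows.Media.MediaProperties;

// Hypothetical helper, for illustration only.
public sealed class CameraFrameSketch
{
    private MediaCapture? _mediaCapture;
    private MediaFrameReader? _frameReader;

    // Raised with a copy of each frame; the subscriber owns the bitmap.
    public event EventHandler<SoftwareBitmap>? FrameCaptured;

    public async Task StartAsync()
    {
        // Assumption: use the first source group that exposes a color stream.
        var groups = await MediaFrameSourceGroup.FindAllAsync();
        var group = groups.FirstOrDefault(g =>
                g.SourceInfos.Any(i => i.SourceKind == MediaFrameSourceKind.Color))
            ?? throw new InvalidOperationException("No color camera found.");

        _mediaCapture = new MediaCapture();
        await _mediaCapture.InitializeAsync(new MediaCaptureInitializationSettings
        {
            SourceGroup = group,
            SharingMode = MediaCaptureSharingMode.SharedReadOnly,
            StreamingCaptureMode = StreamingCaptureMode.Video,
            // CPU memory so that frames expose a SoftwareBitmap.
            MemoryPreference = MediaCaptureMemoryPreference.Cpu
        });

        var source = _mediaCapture.FrameSources.Values
            .First(s => s.Info.SourceKind == MediaFrameSourceKind.Color);

        // Ask the reader to deliver frames already converted to Bgra8.
        _frameReader = await _mediaCapture.CreateFrameReaderAsync(
            source, MediaEncodingSubtypes.Bgra8);
        _frameReader.FrameArrived += OnFrameArrived;
        await _frameReader.StartAsync();
    }

    private void OnFrameArrived(MediaFrameReader sender, MediaFrameArrivedEventArgs args)
    {
        using var frame = sender.TryAcquireLatestFrame();
        var bitmap = frame?.VideoMediaFrame?.SoftwareBitmap;
        if (bitmap is null)
        {
            return;
        }

        // Copy the pixels: the frame's bitmap is released when the frame is disposed.
        FrameCaptured?.Invoke(this, SoftwareBitmap.Copy(bitmap));
    }
}
```

A subscriber to `FrameCaptured` would play the role of the Current_SoftwareBitmapFrameCaptured handler shown earlier.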

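Implementing speech-to-text. WinUI (WASDK) inherits UWP's modern UI and can also make good use of the WinRT APIs. The main object involved is SpeechRecognizer in the Windows.Media.SpeechRecognition namespace.

Official documentation: Speech interactions, Define custom recognition constraints

Part of the speech-to-text code is shown below; the full code is linked from this text.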

// Create a topic constraint for the web-search scenario
var webSearchGrammar = new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario.WebSearch, "webSearch", "sound");
//webSearchGrammar.Probability = SpeechRecognitionConstraintProbability.Min;
speechRecognizer.Constraints.Add(webSearchGrammar);
SpeechRecognitionCompilationResult result = await speechRecognizer.CompileConstraintsAsync();

if (result.Status != SpeechRecognitionResultStatus.Success)
{
    // Disable the recognition buttons.
}
else
{
    // Handle continuous recognition events. Completed fires when various error states occur. ResultGenerated fires when
    // some recognized phrases occur, or the garbage rule is hit.
    // Register the event handlers
    speechRecognizer.ContinuousRecognitionSession.Completed += ContinuousRecognitionSession_Completed;
    speechRecognizer.ContinuousRecognitionSession.ResultGenerated += ContinuousRecognitionSession_ResultGenerated;
}
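The fragment above compiles the constraints and registers the handlers; the session still has to be started and later stopped. A minimal sketch of that lifecycle might look like the following. `SpeechToTextSketch` and its `TextRecognized` event are hypothetical names, and using the dictation scenario with the system speech language is an assumption, not necessarily what the project does.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Windows.Media.SpeechRecognition;

// Hypothetical wrapper, for illustration only.
public sealed class SpeechToTextSketch : IDisposable
{
    private readonly SpeechRecognizer _recognizer =
        new(SpeechRecognizer.SystemSpeechLanguage);
    private bool _prepared;

    // Raised with the recognized text for medium- or high-confidence results.
    public event EventHandler<string>? TextRecognized;

    public async Task StartAsync()
    {
        if (!_prepared)
        {
            _recognizer.Constraints.Add(new SpeechRecognitionTopicConstraint(
                SpeechRecognitionScenario.Dictation, "dictation"));

            var compilation = await _recognizer.CompileConstraintsAsync();
            if (compilation.Status != SpeechRecognitionResultStatus.Success)
            {
                throw new InvalidOperationException(
                    $"Constraint compilation failed: {compilation.Status}");
            }

            var session = _recognizer.ContinuousRecognitionSession;
            session.ResultGenerated += (_, args) =>
            {
                var confidence = args.Result.Confidence;
                if (confidence == SpeechRecognitionConfidence.Medium ||
                    confidence == SpeechRecognitionConfidence.High)
                {
                    TextRecognized?.Invoke(this, args.Result.Text);
                }
            };
            session.Completed += (_, args) =>
                Debug.WriteLine($"Recognition session completed: {args.Status}");

            _prepared = true;
        }

        // Only start when idle; starting an active session throws.
        if (_recognizer.State == SpeechRecognizerState.Idle)
        {
            await _recognizer.ContinuousRecognitionSession.StartAsync();
        }
    }

    public async Task StopAsync()
    {
        if (_recognizer.State != SpeechRecognizerState.Idle)
        {
            await _recognizer.ContinuousRecognitionSession.StopAsync();
        }
    }

    public void Dispose() => _recognizer.Dispose();
}
```

The app also needs microphone access for this to work; how that permission is declared depends on how the app is packaged.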

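Once speech has been converted to text, the ChatGPT API is called to get a reply. This is done with the ChatGPTSharp wrapper library.

The code is as follows: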

private async void ContinuousRecognitionSession_ResultGenerated(SpeechContinuousRecognitionSession sender, SpeechContinuousRecognitionResultGeneratedEventArgs args)
{
    // The garbage rule will not have a tag associated with it, the other rules will return a string matching the tag provided
    // when generating the grammar.
    var tag = "unknown";

    if (args.Result.Constraint != null && isListening)
    {
        tag = args.Result.Constraint.Tag;

        App.MainWindow.DispatcherQueue.TryEnqueue(() =>
        {
            ToastHelper.SendToast(tag, TimeSpan.FromSeconds(3));
        });


        Debug.WriteLine($"識別內容---{tag}");
    }

    // Developers may decide to use per-phrase confidence levels in order to tune the behavior of their 
    // grammar based on testing.
    if (args.Result.Confidence == SpeechRecognitionConfidence.Medium ||
        args.Result.Confidence == SpeechRecognitionConfidence.High)
    {
        var result = string.Format("Heard: '{0}', (Tag: '{1}', Confidence: {2})", args.Result.Text, tag, args.Result.Confidence.ToString());


        App.MainWindow.DispatcherQueue.TryEnqueue(() =>
        {
            ToastHelper.SendToast(result, TimeSpan.FromSeconds(3));
        });


        // "開啟B站" means "open Bilibili"
        if (args.Result.Text.ToUpper() == "開啟B站")
        {
            await Launcher.LaunchUriAsync(new Uri(@"https://www.bilibili.com/"));
        }
        // "撒個嬌" means "act cute"
        else if (args.Result.Text.ToUpper() == "撒個嬌")
        {
            ElectronBotHelper.Instance.ToPlayEmojisRandom();
        }
        else
        {
            try
            {
                // Use the chatbot client factory to create a client of the configured type, so that multiple chat APIs can be supported
                var chatBotClientFactory = App.GetService<IChatbotClientFactory>();

                var chatBotClientName = (await App.GetService<ILocalSettingsService>()
                     .ReadSettingAsync<ComboxItemModel>(Constants.DefaultChatBotNameKey))?.DataKey;

                if (string.IsNullOrEmpty(chatBotClientName))
                {
                    throw new Exception("未設定語音提供程式機密資料");
                }

                var chatBotClient = chatBotClientFactory.CreateChatbotClient(chatBotClientName);
                // Call the selected implementation to get the chat reply
                var resultText = await chatBotClient.AskQuestionResultAsync(args.Result.Text);

                //isListening = false;
                await ReleaseRecognizerAsync();
                // Convert the reply to speech and play it
                await ElectronBotHelper.Instance.MediaPlayerPlaySoundByTTSAsync(resultText, false);
            }
            catch (Exception ex)
            {
                App.MainWindow.DispatcherQueue.TryEnqueue(() =>
                {
                    ToastHelper.SendToast(ex.Message, TimeSpan.FromSeconds(3));
                });

            }
        }
    }
    else
    {
    }
}
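The handler relies on a factory that maps a configured name to a chat client, which is what allows several chat APIs to sit behind one call site. The project's actual interfaces are not shown in this post, so the sketch below is a guess at their shape: the interface and method names mirror the ones used in the handler, while `EchoChatbotClient`, `ChatbotClientFactory`, and `Register` are hypothetical and exist only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Names mirror those used in the handler; the definitions are assumptions.
public interface IChatbotClient
{
    Task<string> AskQuestionResultAsync(string question);
}

public interface IChatbotClientFactory
{
    IChatbotClient CreateChatbotClient(string name);
}

// Hypothetical stand-in client that needs no network access.
public sealed class EchoChatbotClient : IChatbotClient
{
    public Task<string> AskQuestionResultAsync(string question) =>
        Task.FromResult($"You said: {question}");
}

// Hypothetical factory: looks up a creator delegate by (case-insensitive) name.
public sealed class ChatbotClientFactory : IChatbotClientFactory
{
    private readonly Dictionary<string, Func<IChatbotClient>> _creators =
        new(StringComparer.OrdinalIgnoreCase);

    public ChatbotClientFactory Register(string name, Func<IChatbotClient> creator)
    {
        _creators[name] = creator;
        return this;
    }

    public IChatbotClient CreateChatbotClient(string name) =>
        _creators.TryGetValue(name, out var creator)
            ? creator()
            : throw new ArgumentException(
                $"No chatbot client registered under '{name}'.", nameof(name));
}
```

With this shape, adding another chat backend means registering one more creator, and the recognition handler does not change.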

Converting the reply text to speech and playing it. Using the SpeechSynthesizer class in the Windows.Media.SpeechSynthesis namespace, the code below converts text into a Stream.

using SpeechSynthesizer synthesizer = new();
// Convert the text to a stream, which is then played by a media player
var synthesisStream = await synthesizer.SynthesizeTextToStreamAsync(text);

A MediaPlayer object is then used to play the speech.


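/// <summary>
/// Converts the given text to speech and plays it.
/// </summary>
/// <param name="content">The text to speak.</param>
/// <param name="isOpenMediaEnded">Stored in _isOpenMediaEnded; defaults to true.</param>
/// <returns>A task that completes once playback has been started.</returns>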
public async Task MediaPlayerPlaySoundByTTSAsync(string content, bool isOpenMediaEnded = true)
{
    _isOpenMediaEnded = isOpenMediaEnded;
    if (!string.IsNullOrWhiteSpace(content))
    {
        try
        {
            var localSettingsService = App.GetService<ILocalSettingsService>();

            var audioModel = await localSettingsService
                .ReadSettingAsync<ComboxItemModel>(Constants.DefaultAudioNameKey);

            var audioDevs = await EbHelper.FindAudioDeviceListAsync();

            if (audioModel != null)
            {
                var audioSelect = audioDevs.FirstOrDefault(c => c.DataValue == audioModel.DataValue) ?? new ComboxItemModel();

                var selectedDevice = (DeviceInformation)audioSelect.Tag!;

                if (selectedDevice != null)
                {
                    mediaPlayer.AudioDevice = selectedDevice;
                }
            }
            // Get the TTS service instance
            var speechAndTTSService = App.GetService<ISpeechAndTTSService>();
            // Convert the text to a stream
            var stream = await speechAndTTSService.TextToSpeechAsync(content);
            // Play the stream
            mediaPlayer.SetStreamSource(stream);
            mediaPlayer.Play();
            isTTS = true;
        }
        catch (Exception)
        {
        }
    }
}
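Putting synthesis and playback together, a stripped-down version without the device selection and settings lookup might look like this. It is a sketch rather than the project's code: `TtsPlaybackSketch` is a hypothetical name, and it sets the source through MediaSource.CreateFromStream, which is one of several ways to hand a stream to MediaPlayer.

```csharp
using System;
using System.Threading.Tasks;
using Windows.Media.Core;
using Windows.Media.Playback;
using Windows.Media.SpeechSynthesis;

// Hypothetical helper, for illustration only.
public sealed class TtsPlaybackSketch : IDisposable
{
    private readonly MediaPlayer _player = new();

    public async Task SpeakAsync(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
        {
            return;
        }

        using var synthesizer = new SpeechSynthesizer();

        // Synthesize the text into an in-memory audio stream.
        SpeechSynthesisStream stream =
            await synthesizer.SynthesizeTextToStreamAsync(text);

        // Wrap the stream in a MediaSource and start playback on the default device.
        _player.Source = MediaSource.CreateFromStream(stream, stream.ContentType);
        _player.Play();
    }

    public void Dispose() => _player.Dispose();
}
```

To route audio to the robot's speaker instead of the default output, the player's AudioDevice property can be set before playing, as the method above does.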

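That completes one full recognition-and-conversation round. The software's interface is shown below. If you are interested, click the image to go to the project's source code and explore its other features:

Personal reflections

Personally, I feel the .NET ecosystem still falls somewhat short, and ML.NET in particular has too few ready-made libraries. After all, not many people contribute, and transferring knowledge has a cost: people who know other machine learning frameworks may not know .NET.

So, as a member of the community, I think we need to go out and then come back. Going out means learning other machine learning frameworks first; coming back means applying them in .NET. The more libraries we build this way, the more the community will thrive.

And I will get to copy and paste even more of everyone's code.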