前段時間收到一個朋友的求助,說他的程式執行緒數瘋漲,尋求如何解決。
等我分析完之後,我覺得這個問題很有代表性,所以拿出來和大家分享下,還是上老工具 WinDbg。
要想查執行緒有沒有瘋漲,可以用 !t
命令看一下。
0:000:x86> !t
ThreadCount: 382
UnstartedThread: 1
BackgroundThread: 376
PendingThread: 0
DeadThread: 2
Hosted Runtime: no
Lock
ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
0 1 59c 00e52fb0 26020 Preemptive 12D67610:00000000 00e4b408 0 STA
2 2 2b30 00e61aa0 2b220 Preemptive 00000000:00000000 00e4b408 0 MTA (Finalizer)
3 3 18cc 00ea72b8 202b220 Preemptive 00000000:00000000 00e4b408 0 MTA
5 4 1f18 00f02998 1020220 Preemptive 00000000:00000000 00e4b408 0 Ukn (Threadpool Worker)
XXXX 6 0 00f056f8 39820 Preemptive 00000000:00000000 00e4b408 0 MTA
6 7 2154 09052448 202b020 Preemptive 12E353E0:00000000 00e4b408 0 MTA
...
377 373 2ee8 21a90958 1029220 Preemptive 12D1FCCC:00000000 00e4b408 0 MTA (Threadpool Worker)
378 374 227c 21b1d510 1029220 Preemptive 12DCBFC8:00000000 00e4b408 0 MTA (Threadpool Worker)
379 375 7e8 21b1baa8 1029220 Preemptive 12D39ADC:00000000 00e4b408 0 MTA (Threadpool Worker)
380 376 1d1c 21a8fec8 1029220 Preemptive 12D11F40:00000000 00e4b408 0 MTA (Threadpool Worker)
381 366 19ec 215c1bd0 1029220 Preemptive 12DB42D8:00000000 00e4b408 0 MTA (Threadpool Worker)
382 377 1dc8 21b1bff0 1029220 Preemptive 12C71F9C:00000000 00e4b408 0 MTA (Threadpool Worker)
383 378 f94 215bc750 1029220 Preemptive 12E10568:00000000 00e4b408 0 MTA (Threadpool Worker)
384 379 17d4 21ac5580 1029220 Preemptive 12D8EE98:00000000 00e4b408 0 MTA (Threadpool Worker)
385 381 2c1c 21b1b018 1029220 Preemptive 12D0DD00:00000000 00e4b408 0 MTA (Threadpool Worker)
386 380 309c 21b1da58 1029220 Preemptive 12E25028:00000000 00e4b408 0 MTA (Threadpool Worker)
387 382 3048 21ac6aa0 1029220 Preemptive 12DFA918:00000000 00e4b408 0 MTA (Threadpool Worker)
從卦中看,主執行緒是一個 STA,說明是一個表單程式,一個表單能做到 387 個執行緒,也是挺牛的,同時也能觀察到大多都是 ThreadPool Worker
,也就是執行緒池工作執行緒。
這裡有一個小技巧,那就是執行緒號越大,往往都是最新建立的,往往從這裡面就能套出來一些有用的東西,言外之意就扒一扒 380 ~ 387
這些執行緒的呼叫棧。
0:387:x86> ~387s
ntdll_77380000!NtWaitForSingleObject+0xc:
773f29dc c20c00 ret 0Ch
0:387:x86> k
CvRegToMachine(x86) conversion failure for 0x14f
X86MachineInfo::SetVal: unknown register 0 requested
# ChildEBP RetAddr
00 31fef104 755a1539 ntdll_77380000!NtWaitForSingleObject+0xc
01 31fef104 74b3ee3b KERNELBASE!WaitForSingleObjectEx+0x99
02 31fef168 74b3efed clr!CLRSemaphore::Wait+0xbe
03 31fef19c 74b3eee2 clr!ThreadpoolMgr::UnfairSemaphore::Wait+0x13a
04 31fef204 74a54c27 clr!ThreadpoolMgr::WorkerThreadStart+0x328
05 31feff24 7649fa29 clr!Thread::intermediateThreadProc+0x58
06 31feff34 773e7a7e kernel32!BaseThreadInitThunk+0x19
07 31feff90 773e7a4e ntdll_77380000!__RtlUserThreadStart+0x2f
0:387:x86> ~386s
ntdll_77380000!NtWaitForSingleObject+0xc:
773f29dc c20c00 ret 0Ch
0:386:x86> k
CvRegToMachine(x86) conversion failure for 0x14f
X86MachineInfo::SetVal: unknown register 0 requested
# ChildEBP RetAddr
00 31d6ede4 755a1539 ntdll_77380000!NtWaitForSingleObject+0xc
01 31d6ede4 74b3ee3b KERNELBASE!WaitForSingleObjectEx+0x99
02 31d6ee48 74b3efed clr!CLRSemaphore::Wait+0xbe
03 31d6ee7c 74b3eee2 clr!ThreadpoolMgr::UnfairSemaphore::Wait+0x13a
04 31d6eee4 74a54c27 clr!ThreadpoolMgr::WorkerThreadStart+0x328
05 31d6fb84 7649fa29 clr!Thread::intermediateThreadProc+0x58
06 31d6fb94 773e7a7e kernel32!BaseThreadInitThunk+0x19
07 31d6fbf0 773e7a4e ntdll_77380000!__RtlUserThreadStart+0x2f
0:386:x86> ~385s
ntdll_77380000!NtWaitForSingleObject+0xc:
773f29dc c20c00 ret 0Ch
0:385:x86> k
CvRegToMachine(x86) conversion failure for 0x14f
X86MachineInfo::SetVal: unknown register 0 requested
# ChildEBP RetAddr
00 31eaee64 755a1539 ntdll_77380000!NtWaitForSingleObject+0xc
01 31eaee64 74b3ee3b KERNELBASE!WaitForSingleObjectEx+0x99
02 31eaeec8 74b3efed clr!CLRSemaphore::Wait+0xbe
03 31eaeefc 74b3eee2 clr!ThreadpoolMgr::UnfairSemaphore::Wait+0x13a
04 31eaef64 74a54c27 clr!ThreadpoolMgr::WorkerThreadStart+0x328
05 31eafb7c 7649fa29 clr!Thread::intermediateThreadProc+0x58
06 31eafb8c 773e7a7e kernel32!BaseThreadInitThunk+0x19
07 31eafbe8 773e7a4e ntdll_77380000!__RtlUserThreadStart+0x2f
從執行緒棧上看,這些執行緒都在 UnfairSemaphore
處等待,這是一個正常現象,因為這些執行緒都是通過 UnfairSemaphore
鎖來喚醒,不過奇怪的是,這些執行緒為什麼產生,又為什麼不被消亡?
根據經驗預測:肯定有程式碼在不斷的排程
執行緒池
執行緒,然後又做了一個短命的操作,導致執行緒池執行緒不斷新增,又得不到執行緒可以被消亡的閾值。
既然猜測是程式在頻繁的呼叫執行緒池執行緒,能做的只能是觀察此時 dump 中的所有執行緒的執行緒棧,看能不能挖到點有價值的東西,可以使用 ~*e !clrstack
命令。
經過仔細觀察這近400個執行緒棧,發現有 37 處都是 System.Threading.Thread.Sleep(Int32)
, 而且大多都是 HslCommunication.Core.Net.NetworkBase.ThreadPoolCheckTimeOut(System.Object)
函數,能清楚的看到是由執行緒池發起的,接下來就是用 ILSpy 反編譯下這個函數看下到底是怎麼回事。
protected void ThreadPoolCheckTimeOut(object obj)
{
HslTimeOut hslTimeOut;
if ((hslTimeOut = obj as HslTimeOut) == null)
{
return;
}
while (!hslTimeOut.IsSuccessful)
{
Thread.Sleep(100);
if ((DateTime.Now - hslTimeOut.StartTime).TotalMilliseconds > (double)hslTimeOut.DelayTime)
{
if (!hslTimeOut.IsSuccessful)
{
LogNet?.WriteWarn(ToString(), "Wait Time Out : " + hslTimeOut.DelayTime);
hslTimeOut.Operator?.Invoke();
hslTimeOut.WorkSocket?.Close();
}
break;
}
}
}
接下來通過 ILSpy 檢視這個方法的參照,發現有很多處,抽幾個如下:
protected OperateResult<TNetMessage> ReceiveMessage<TNetMessage>(Socket socket, int timeOut, TNetMessage netMsg) where TNetMessage : INetMessage
{
...
if (timeOut > 0)
{
ThreadPool.QueueUserWorkItem(ThreadPoolCheckTimeOut, hslTimeOut);
}
...
}
protected OperateResult<Socket> CreateSocketAndConnect(IPEndPoint endPoint, int timeOut)
{
...
ThreadPool.QueueUserWorkItem(ThreadPoolCheckTimeOut, hslTimeOut);
...
}
protected void CreateSocketAndConnect(IPEndPoint endPoint, int timeOut, Action<OperateResult<Socket>> connectCallback)
{
...
ThreadPool.QueueUserWorkItem(ThreadPoolCheckTimeOut, hslTimeOut);
...
}
從上面程式碼看,確實存在一些商榷的地方,很多的 socket 操作都用執行緒池來處理 ThreadPoolCheckTimeOut()
函數,而在這個函數內當 hslTimeOut.IsSuccessful =false
的時候,在 if ((DateTime.Now - hslTimeOut.StartTime).TotalMilliseconds > (double)hslTimeOut.DelayTime)
不滿足的時間區間內會一直 sleep,這就導致當 socket 請求量上去後,導致很多執行緒處於 sleep 狀態, 執行緒池又不得不生成更多的執行緒來處理 ThreadPoolCheckTimeOut()
邏輯。
到這裡終於就找到了符合 執行緒池執行緒 瘋漲的底層邏輯,接下來看看 HslCommunication.dll
為何物,去找一下它的類庫宣告。
[assembly: CompilationRelaxations(8)]
[assembly: RuntimeCompatibility(WrapNonExceptionThrows = true)]
[assembly: Debuggable(DebuggableAttribute.DebuggingModes.IgnoreSymbolStoreSequencePoints)]
[assembly: AssemblyTitle("HslCommunication")]
[assembly: AssemblyDescription("一個框架庫,包含完善的網路通訊及紀錄檔元件")]
[assembly: AssemblyConfiguration("")]
[assembly: AssemblyCompany("")]
[assembly: AssemblyProduct("HslCommunication")]
[assembly: AssemblyCopyright("Copyright © By Richard.Hu 2018")]
[assembly: AssemblyTrademark("")]
[assembly: ComVisible(false)]
[assembly: Guid("d3710b78-1b32-4d53-9604-0451a795a2f5")]
[assembly: AssemblyFileVersion("5.3.2.0")]
[assembly: AssemblyVersion("5.3.2.0")]
可以看到,這是一個商業元件。
由於定位到疑似是 HslCommunication
元件的問題,看了下還是 商業版
, 這就尷尬了,建議的解決辦法如下:
1) 短期:
用 ThreadPool.SetMaxThreads
限定執行緒上限。
2) 長期:
找作者看看有沒有最新版,或者到 https://github.com/dathlin/HslCommunication
上提一個 issue,讓別人系統性解決一下。