記一次 .NET 某智慧物流 WCS系統 CPU 爆高分析

2022-07-27 18:00:57

一:背景

1. 講故事

哈哈,再次見到物流類軟體,上個月有位朋友找到我,說他的程式出現了 CPU 爆高,讓我幫忙看下什麼原因,由於那段時間在苦心研究 C++,分析和經驗分享也就懈怠了,今天就給大家安排上。

話不多說,上 windbg 說話。

二:WinDbg 分析

1. CPU 真的爆高嗎

既然說 CPU 爆高,那就用 !tp 驗證下。


0:000> !tp
    Method table is shared (not implemented): System.Threading.ThreadPool
CPU utilization: 81 Unknown format characterUnknown format control characterWorker Thread: Total: 203 Running: 183 Idle: 0 MaxLimit: 300 MinLimit: 150
Work Request in Queue: 0
--------------------------------------
Number of Timers: 40
--------------------------------------
Completion Port Thread:Total: 21 Free: 21 MaxFree: 80 CurrentLimit: 21 MaxLimit: 300 MinLimit: 150

從卦中看確實 CPU=81%,不過輸出資訊很奇怪,方法表都出錯了,猜的不錯應該是觸發 GC 把 託管堆給關閉了,原始碼如下:


    GCScan::GcRuntimeStructuresValid (FALSE);
    plan_phase (n);
    GCScan::GcRuntimeStructuresValid (TRUE);

也可以用 !dumpheap -stat 來驗證。


0:000> !dumpheap -stat
The garbage collector data structures are not in a valid state for traversal.
It is either in the "plan phase," where objects are being moved around, or
we are at the initialization or shutdown of the gc heap. Commands related to 
displaying, finding or traversing objects as well as gc heap segments may not 
work properly. !dumpheap and !verifyheap may incorrectly complain of heap 
consistency errors.
Could not request method table data for object 000001E49376D520 (MethodTable: FFFFFFFFFFE026C0).

2. 為什麼會觸發 GC

此時我們已知道是 GC 觸發,接下來可以通過 !t + !clrstack 找到那個觸發 GC 的執行緒,通過執行緒棧看看正在幹嘛 ?

0:000> !t 
ThreadCount:      382
UnstartedThread:  0
BackgroundThread: 340
PendingThread:    0
DeadThread:       0
Hosted Runtime:   no
                                                                                                            Lock  
 DBG   ID     OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   0    1     1ba4 000001E45C018C90    2a020 Preemptive  0000000000000000:0000000000000000 000001e45c368cb0 1     MTA 
 297  286     2144 000001E478521200  1029220 Cooperative 0000000000000000:0000000000000000 000001e45c368cb0 0     MTA (GC) (Threadpool Worker) 

0:297> !clrstack 
OS Thread Id: 0x2144 (297)

        Child SP               IP Call Site
        0e 0000002f`2927ade0 00007ffa`afda2096     coreclr!WKS::gc_heap::garbage_collect+0x2a1 [e:\a\_work\191\s\src\gc\gc.cpp @ 16967] 
0f 0000002f`2927aee0 00007ffa`afdbe746     coreclr!WKS::GCHeap::GarbageCollectGeneration+0x156 [e:\a\_work\191\s\src\gc\gc.cpp @ 35107] 
10 (Inline Function) --------`--------     coreclr!WKS::gc_heap::try_allocate_more_space+0x1f5 [e:\a\_work\191\s\src\gc\gc.cpp @ 13197] 
11 0000002f`2927af30 00007ffa`afd80c9f     coreclr!WKS::gc_heap::allocate_more_space+0x216 [e:\a\_work\191\s\src\gc\gc.cpp @ 13490] 
12 (Inline Function) --------`--------     coreclr!WKS::gc_heap::allocate+0x37e [e:\a\_work\191\s\src\gc\gc.cpp @ 13521] 
13 (Inline Function) --------`--------     coreclr!WKS::GCHeap::Alloc+0x3e5 [e:\a\_work\191\s\src\gc\gc.cpp @ 34419] 
14 (Inline Function) --------`--------     coreclr!Alloc+0x4be [e:\a\_work\191\s\src\vm\gchelpers.cpp @ 241] 
15 (Inline Function) --------`--------     coreclr!AllocateObject+0x512 [e:\a\_work\191\s\src\vm\gchelpers.cpp @ 1156] 
16 0000002f`2927af90 00007ffa`51c05122     coreclr!JIT_New+0x5ff [e:\a\_work\191\s\src\vm\jithelpers.cpp @ 2810] 
...
0000002F2927B228 00007ffaafd63aff [HelperMethodFrame: 0000002f2927b228] 
0000002F2927B340 00007ffa51c05122 Jint.Native.Object.ObjectInstance..ctor(Jint.Engine)
0000002F2927B380 00007ffa51c058aa Jint.Native.Array.ArrayConstructor.CreateArrayConstructor(Jint.Engine)
0000002F2927B3D0 00007ffa51c0407c Jint.Engine..ctor(System.Action`1<Jint.Options>)
...

由於資訊比較敏感,我就不過多的輸出了,不過可以看出 GC 的引發是由於 Jint 元件,查了下資料是 JavaScript.NET 用來互動的,為了進一步驗證,觀察下此時 GC 觸發的代以及什麼原因。


0:297> dx -r1 (*((coreclr!WKS::gc_mechanisms *)0x7ffab021df90))
(*((coreclr!WKS::gc_mechanisms *)0x7ffab021df90))                 [Type: WKS::gc_mechanisms]
    [+0x000] gc_index         : 0x984ab [Type: unsigned __int64]
    [+0x008] condemned_generation : 0 [Type: int]
    [+0x00c] promotion        : 1 [Type: int]
    [+0x010] compaction       : 1 [Type: int]
    [+0x014] loh_compaction   : 0 [Type: int]
    [+0x018] heap_expansion   : 0 [Type: int]
    [+0x01c] concurrent       : 0x0 [Type: unsigned int]
    [+0x020] demotion         : 1 [Type: int]
    [+0x024] card_bundles     : 1 [Type: int]
    [+0x028] gen0_reduction_count : 0 [Type: int]
    [+0x02c] should_lock_elevation : 0 [Type: int]
    [+0x030] elevation_locked_count : 0 [Type: int]
    [+0x034] elevation_reduced : 0 [Type: int]
    [+0x038] minimal_gc       : 0 [Type: int]
    [+0x03c] reason           : reason_alloc_soh (0) [Type: gc_reason]
    [+0x040] pause_mode       : pause_interactive (1) [Type: WKS::gc_pause_mode]
    [+0x044] found_finalizers : 1 [Type: int]
    [+0x048] background_p     : 0 [Type: int]
    [+0x04c] b_state          : bgc_not_in_process (0) [Type: bgc_state]
    [+0x050] allocations_allowed : 1 [Type: int]
    [+0x054] stress_induced   : 0 [Type: int]
    [+0x058] entry_memory_load : 0x0 [Type: unsigned int]
    [+0x05c] exit_memory_load : 0x0 [Type: unsigned int]

從卦中看,當前觸發的是 0 代GC,觸發原因是 0代 的閾值滿了,這是一個很正常的 GC 操作,理應不會造成 CPU 爆高,除非是那些傷害性比較大的 FULLGC,由於沒有更多的 dump 可以參考,到這裡就沒法更進一步確認了。

3. 還有其他線索嗎

雖然 .NET 程式大多 CPU 爆高是由於 GC 的頻繁觸發所致,但也有其他情況,比如 CPU 密集型操作往往也會,就像我之前解讀 B站的LUA死迴圈導致的CPU爆高場景下如何通過 火焰圖 去尋找熱點函數。

那這個 dump 會不會也存在這種情況呢? 不管有沒有,在一個 dump 的情況下也只能 死馬當作活馬醫 了,可以用 !runaway 查查當前執行緒執行時間。


0:297> !runaway
 User Mode Time
  Thread       Time
  269:2354     0 days 0:07:04.171
  274:15d4     0 days 0:06:16.453
  280:1c98     0 days 0:05:32.406
  284:438      0 days 0:04:37.703
  283:183c     0 days 0:04:29.531
  282:122c     0 days 0:04:24.703
  288:2060     0 days 0:03:59.953
  286:28d0     0 days 0:03:56.640
  289:2a84     0 days 0:03:50.859
  290:1224     0 days 0:03:44.640
  291:2e4c     0 days 0:03:29.937
  292:f0c      0 days 0:03:28.656
  293:2454     0 days 0:03:26.640
  275:2810     0 days 0:03:23.828
  294:2f34     0 days 0:03:22.312
  295:24ec     0 days 0:03:17.625
  297:2144     0 days 0:03:16.609
  298:2c34     0 days 0:03:14.609
  299:2480     0 days 0:03:11.218
  ...

執行緒還是蠻多的,取樣幾個看一下,發現有很多函數與 序列化 有關。


0:269> !clrstack 
OS Thread Id: 0x2354 (269)
       Child SP               IP Call Site
0000002F080FD658 00007ffacb236124 [HelperMethodFrame: 0000002f080fd658] 
0000002F080FD770 00007ffab11d806b System.Runtime.Serialization.Formatters.Binary.SizedArray..ctor() [E:\A\_work\322\s\corefx\src\System.Runtime.Serialization.Formatters\src\System\Runtime\Serialization\Formatters\Binary\BinaryUtilClasses.cs @ 203]
0000002F080FD7A0 00007ffab11d6964 System.Runtime.Serialization.Formatters.Binary.BinaryParser.get_ObjectMapIdTable() [E:\A\_work\322\s\corefx\src\System.Runtime.Serialization.Formatters\src\System\Runtime\Serialization\Formatters\Binary\BinaryParser.cs @ 57]
0000002F080FD7E0 00007ffa515132c1 System.Runtime.Serialization.Formatters.Binary.BinaryParser.ReadObjectWithMapTyped(System.Runtime.Serialization.Formatters.Binary.BinaryObjectWithMapTyped) [E:\A\_work\322\s\corefx\src\System.Runtime.Serialization.Formatters\src\System\Runtime\Serialization\Formatters\Binary\BinaryParser.cs @ 532]
0000002F080FD8B0 00007ffab11d74ed System.Runtime.Serialization.Formatters.Binary.BinaryParser.ReadObjectWithMapTyped(System.Runtime.Serialization.Formatters.Binary.BinaryHeaderEnum) [E:\A\_work\322\s\corefx\src\System.Runtime.Serialization.Formatters\src\System\Runtime\Serialization\Formatters\Binary\BinaryParser.cs @ 504]

0:280> !clrstack 
OS Thread Id: 0x1c98 (280)
       Child SP               IP Call Site
0000002F185FCE38 00007ffacb236124 [HelperMethodFrame: 0000002f185fce38] 
0000002F185FCF30 00007ffaaf59bb61 System.String.Ctor(Char[], Int32, Int32) [E:\A\_work\191\s\src\mscorlib\shared\System\String.cs @ 79]
0000002F185FCF90 00007ffa5033f984 Newtonsoft.Json.JsonTextReader.ParseReadString(Char, Newtonsoft.Json.ReadType)
0000002F185FD040 00007ffa5099cd0b Newtonsoft.Json.JsonTextReader.ReadStringValue(Newtonsoft.Json.ReadType)
0000002F185FD0B0 00007ffa5099cb0e Newtonsoft.Json.JsonTextReader.ReadAsString()
0000002F185FD0E0 00007ffa514c68fc Newtonsoft.Json.JsonReader.ReadForType(Newtonsoft.Json.Serialization.JsonContract, Boolean)

0:284> !clrstack 
OS Thread Id: 0x438 (284)
       Child SP               IP Call Site
0000002F1ED7C9C8 00007ffacb236124 [RedirectedThreadFrame: 0000002f1ed7c9c8] 
0000002F1ED7CA48 00007ffaaf5a6863 System.Buffer.Memmove(Byte*, Byte*, UInt64) [E:\A\_work\191\s\src\mscorlib\src\System\Buffer.cs @ 211]
0000002F1ED7CA50 00007ffaaf59bbb2 System.String.Ctor(Char[], Int32, Int32) [E:\A\_work\191\s\src\mscorlib\shared\System\String.cs @ 83]
0000002F1ED7CAB0 00007ffa5033f984 Newtonsoft.Json.JsonTextReader.ParseReadString(Char, Newtonsoft.Json.ReadType)
0000002F1ED7CB60 00007ffa5099cd0b Newtonsoft.Json.JsonTextReader.ReadStringValue(Newtonsoft.Json.ReadType)
0000002F1ED7CBD0 00007ffa5099cb0e Newtonsoft.Json.JsonTextReader.ReadAsString()

有了線索之後,接下來用 ~*e !clrstack 把所有的執行緒棧調出來,發現很多的 JsonConvert ,並且還有 5 個執行緒在做 DeepClone,截圖如下:

接下來把 DeepClone 函數匯出來看看,發現是用 BinaryFormatter 來實現物件的深複製。


public static T DeepClone<T>(this T obj) where T : class
{
	BinaryFormatter binaryFormatter = new BinaryFormatter();
	using MemoryStream memoryStream = new MemoryStream();
	binaryFormatter.Serialize(memoryStream, obj);
	memoryStream.Seek(0L, SeekOrigin.Begin);
	return (T)binaryFormatter.Deserialize(memoryStream);
}

把發現的這些線索反饋給朋友後,確實也驗證了是 序列化 造成的。

三:總結

分析完畢,這個 dump 給我們的教訓是:

  1. 物件的深複製慎用 BinaryFormatter 這種流式操作,尤其是在大物件的情況下,它是一種 CPU 密集性的,建議採用 AutoMapper 這類 帶 ILEmit, ExpressionTree 還帶編譯快取的開源工具包。

  2. 高階偵錯是一場破案之旅,你第一眼看到的往往是程式故意讓你看到的,需要不斷的積累破案經驗練就一雙慧眼。

圖片名稱