Frequently updating CGroup settings triggers a Linux kernel bug that leaves the CGroup's running tasks unscheduled

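1. Notes

1> This post describes a problem I hit on Linux at work: a process under CGroup control sat in the R state but did not run, did not exit, and could not be killed. After some digging it turned out to be a kernel bug in CGroup

2> After finding the bug, I reported it to RedHat as a vulnerability last year. Unfortunately the report was not accepted, for reasons I do not know, so I am publishing it on my blog

3> The two previous posts, "A Minimal CFS Fair Scheduling Algorithm" and "Minimal Group Scheduling: How CGroup Limits CPU", were written as background for this kernel bug. You need the basics of Linux process scheduling and CGroup control to follow how the bug comes about

4> The kernel debugging tool used here is crash. See the official site for how to use its commands; I will not go into them here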

https://crash-utility.github.io/help.html

2. The problem

2.1 Code that triggers the bug

2.1.1 Code

#include <iostream>
#include <sys/types.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <sys/stat.h>
#include <pthread.h>
#include <sys/time.h>
#include <string>

using namespace std;
std::string sub_cgroup_dir("/sys/fs/cgroup/cpu/test");

// common lib
bool is_dir(const std::string& path)
{
    struct stat statbuf;
    if (stat(path.c_str(), &statbuf) == 0 )
    {
        if (0 != S_ISDIR(statbuf.st_mode))
        {
            return true;
        }
    }
    return false;
}

bool write_file(const std::string& file_path, int num)
{
    FILE* fp = fopen(file_path.c_str(), "w");
    if (fp == NULL)
    {
        return false;
    }

    std::string write_data = to_string(num);
    fputs(write_data.c_str(), fp);
    fclose(fp);
    return true;
}

// current wall-clock time in milliseconds
long get_ms_timestamp()
{
    timeval tv;
    gettimeofday(&tv, NULL);
    return (tv.tv_sec * 1000 + tv.tv_usec / 1000);
}

// cgroup
bool create_cgroup()
{
    if (is_dir(sub_cgroup_dir) == false)
    {
        if (mkdir(sub_cgroup_dir.c_str(), S_IRWXU | S_IRGRP) != 0)
        {
            cout << "mkdir cgroup dir fail" << endl;
            return false;
        }
    }

    int pid = getpid();
    cout << "pid is " << pid << endl;
    std::string procs_path = sub_cgroup_dir + "/cgroup.procs";
    return write_file(procs_path, pid);
}

bool set_period(int period)
{
    std::string period_path = sub_cgroup_dir + "/cpu.cfs_period_us";
    return write_file(period_path, period);
}

bool set_quota(int quota)
{
    std::string quota_path = sub_cgroup_dir + "/cpu.cfs_quota_us";
    return write_file(quota_path, quota);
}

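// worker thread: spins on the cpu and sleeps 1 ms every `interval` ms
// param: interval in ms, passed by value through the void* argument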
void* thread_func(void* param)
{
    int i = 0;
    int interval = (long)param;
    long last = get_ms_timestamp();

    while (true)
    {
        i++;
        if (i % 1000 != 0)
        {
            continue;
        }

        long current = get_ms_timestamp();
        if ((current - last) >= interval)
        {
            usleep(1000);
            last = current;
        }
    }

    pthread_exit(NULL);
}

void test_thread()
{
    const int k_thread_num = 10;
    pthread_t pthreads[k_thread_num];

    for (int i = 0; i < k_thread_num; i++)
    {
        // pass the interval (i + 1 ms) by value; cast through long so it matches the pointer size
        if (pthread_create(&pthreads[i], NULL, thread_func, (void*)(long)(i + 1)) != 0)
        {
            cout << "create thread fail" << endl;
        }
        else
        {
            cout << "create thread success,tid is " << pthreads[i] << endl;
        }
    }
}

// argv[1] : period (us)
// argv[2] : quota (us)
int main(int argc,char* argv[])
{
    if (argc <3)
    {
        cout << "usage : ./inactive_timer $period $quota" << endl;
        return -1;
    }

    int period = stoi(argv[1]);
    int quota = stoi(argv[2]);
    cout << "period is " << period << endl;
    cout << "quota is " << quota << endl;

    test_thread();
    if (create_cgroup() == false)
    {
        cout << "create cgroup fail" << endl;
        return -1;
    }

    int i =0;
    while (true)
    {
        if (i > 20)
        {
            i = 0;
        }

        i++;
        long current = get_ms_timestamp();
        long last = current;
        while ((current - last) < i)
        {
            usleep(1000);
            current = get_ms_timestamp();
        }
        
        set_period(period);
        set_quota(quota);
    }

    return 0;
}


2.1.2 Compile

g++ -std=c++11 trigger_cgroup_timer_inactive.cpp -o inactive_timer -pthread

2.1.3 Run the program on CentOS 7.0~7.5

./inactive_timer 100000 10000

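2.1.4 The code above does three main things

1> Puts its own process under a CGroup that controls cpu

2> Repeatedly sets the CGroup's cpu.cfs_period_us and cpu.cfs_quota_us

3> Starts 10 threads that burn cpu

2.1.5 "Minimal Group Scheduling: How CGroup Limits CPU" already covered how CGroup limits cpu:

Within each period of length cfs_period_us, the processes in a CGroup may use at most cfs_quota_us of cpu time. If they use more than cfs_quota_us within the period, the group is throttled: it is removed from the fair scheduler's run queue and waits for the timer to start the next period and unthrottle it, at which point it is put back. That is how cpu usage is capped.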
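To make the throttle/unthrottle cycle concrete, here is a minimal user-space sketch. It is not kernel code: BandwidthModel, run() and on_period_timer() are names invented for illustration, and the model covers a single group on one cpu with whole-microsecond accounting.

```cpp
#include <cassert>

// Hypothetical, simplified model of one cgroup's cpu bandwidth on one cpu.
struct BandwidthModel {
    long quota_us;         // models cpu.cfs_quota_us
    long runtime_left_us;  // quota still unused in the current period
    bool throttled;        // true once the quota of this period is used up

    explicit BandwidthModel(long quota)
        : quota_us(quota), runtime_left_us(quota), throttled(false) {}

    // The group asks for want_us of cpu time. Returns how much it really got.
    long run(long want_us) {
        if (throttled) return 0;  // off the run queue: gets nothing
        long ran = want_us < runtime_left_us ? want_us : runtime_left_us;
        runtime_left_us -= ran;
        if (runtime_left_us == 0) throttled = true;  // quota exhausted
        return ran;
    }

    // Models the period timer callback: refill the quota and unthrottle.
    void on_period_timer() {
        runtime_left_us = quota_us;
        throttled = false;
    }
};
```

With a quota of 10000, two calls to run(6000) leave the group throttled after 10000 us in total, and only on_period_timer() lets it run again. If that callback never fires, the group stays throttled indefinitely.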

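2.2 Symptoms

1> After the program runs for a few minutes, all threads stay in the running state, but none of them is actually executing any more, and cpu usage stays at 0

2> The threads' stacks show every task in a system call return path

3> crash shows that the process's main thread, 32764, is indeed in the "running" state, but the cfs run queue of the rq on the corresponding cpu 0 contains no runnable task

4> The se of the task is not on the rq, and its cfs_rq is marked throttled

"Minimal Group Scheduling: How CGroup Limits CPU" explained that one period after a throttle (100 ms in this program), the CGroup's timer hands out quota again and unthrottles, putting the group se back on the rq. Here the throttle never lifts, so the timer itself becomes the suspect

5> Looking at the period timer of the cfs_bandwidth that belongs to the task group, its state is 0, i.e. HRTIMER_STATE_INACTIVE: the timer is not armed. This is the problem. Normally this timer is active. With it inactive, the group cfs_rq on each cpu can no longer be allocated quota, and once the quota runs out the corresponding se is taken off the rq. The task is still runnable at that point, but because it is not on the rq it is never scheduled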
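When crash is not at hand, a rough user-space cross-check may be possible through the cgroup v1 cpu.stat file, which reports nr_periods, nr_throttled and throttled_time. The sketch below is a heuristic resting on assumptions, not part of the crash session above: parse_cpu_stat() and period_timer_looks_stalled() are hypothetical helpers, and the idea is that nr_periods stops advancing once the period timer no longer fires. An idle group also stops its timer, so the check only means something while the group has tasks that want cpu.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Parses "key value" lines, the layout used by cgroup v1 cpu.stat.
std::map<std::string, long long> parse_cpu_stat(const std::string& text) {
    std::map<std::string, long long> out;
    std::istringstream in(text);
    std::string key;
    long long value;
    while (in >> key >> value) {
        out[key] = value;
    }
    return out;
}

// Heuristic (assumption): if nr_periods is identical in two samples taken
// several periods apart, the period timer did not fire in between.
bool period_timer_looks_stalled(const std::map<std::string, long long>& earlier,
                                const std::map<std::string, long long>& later) {
    auto a = earlier.find("nr_periods");
    auto b = later.find("nr_periods");
    if (a == earlier.end() || b == later.end()) {
        return false;  // field missing: cannot tell
    }
    return b->second == a->second;
}
```

In practice you would read /sys/fs/cgroup/cpu/test/cpu.stat twice, about a second apart, and feed both contents to these helpers.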

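3. Cause

3.1 Linux timers are one-shot: after expiry a timer must be armed again to keep firing. Searching the code shows that period_timer is armed in __start_cfs_bandwidth(), which calls start_bandwidth_timer()

The key point is that when cfs_b->timer_active is non-zero, __start_cfs_bandwidth() does not arm period_timer, which matches the symptom. So when is cfs_b->timer_active non-zero?

3.2 Setting the CGroup's quota or period eventually reaches tg_set_cfs_bandwidth(), which sets cfs_b->timer_active to 0 and then calls __start_cfs_bandwidth()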

tg_set_cfs_quota()
    tg_set_cfs_bandwidth()
            /* restart the period timer (if active) to handle new period expiry */
            if (runtime_enabled && cfs_b->timer_active) {
                /* force a reprogram */
                cfs_b->timer_active = 0;
                __start_cfs_bandwidth(cfs_b);
            }

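Look closely at the code above and consider this scenario:

1> Thread A sets the CGroup's quota or period, sets cfs_b->timer_active to 0 and calls __start_cfs_bandwidth(). Before it reaches hrtimer_cancel() at line 580 of __start_cfs_bandwidth(), the cpu switches to thread B

2> Thread B also calls __start_cfs_bandwidth(), runs it to completion, sets cfs_b->timer_active to 1 and calls start_bandwidth_timer() to arm the timer. The cpu then switches back to thread A

3> Thread A resumes and calls hrtimer_cancel(), which disarms period_timer. When it then reaches line 585 of __start_cfs_bandwidth() it sees that cfs_b->timer_active is 1 and returns directly, without arming period_timer again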

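3.3 Searching for callers of __start_cfs_bandwidth() shows that the timer tick interrupt calls update_curr(), which eventually calls assign_cfs_rq_runtime() to check the cgroup's cpu quota usage and decide whether to throttle. When cfs_b->timer_active is 0, this path also calls __start_cfs_bandwidth(). In other words it runs the thread B code above, races with thread A that is setting the CGroup, and leaves the timer disarmed.

1> Flow chart of the complete execution path

2> Once the timer is disarmed, cfs_b->timer_active is still 1 because thread B set it in 3.2. So even when a later tick reaches assign_cfs_rq_runtime(), it wrongly concludes that the timer is active and does not call __start_cfs_bandwidth() to arm it again. The throttled group se is then never unthrottled and never put back on the rq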
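The interleaving in 3.2 and 3.3 can be replayed deterministically in user space. This is only a sketch of the state changes described in this post, not kernel code: CfsBandwidthModel and all function names are invented, and each function stands for the stretch of code a thread runs between two context switches.

```cpp
#include <cassert>

// Hypothetical model of the two pieces of state involved in the race.
struct CfsBandwidthModel {
    bool timer_active = true;  // models cfs_b->timer_active
    bool timer_armed = true;   // models period_timer really being queued
};

// Thread A, first half: the setter forces a reprogram by clearing the flag.
void a_force_reprogram(CfsBandwidthModel& b) {
    b.timer_active = false;
}

// Thread B (tick path): sees the flag at 0, sets it and arms the timer.
void b_start_timer(CfsBandwidthModel& b) {
    b.timer_active = true;
    b.timer_armed = true;
}

// Thread A, second half: cancel the timer, then re-check the flag.
// Returns true if A armed the timer again itself.
bool a_cancel_and_recheck(CfsBandwidthModel& b) {
    b.timer_armed = false;  // models hrtimer_cancel()
    if (b.timer_active) {
        return false;       // "someone else restarted it": return early
    }
    b.timer_active = true;
    b.timer_armed = true;   // models start_bandwidth_timer()
    return true;
}

// Tick path decision: it only restarts the timer when the flag is 0.
bool tick_would_restart(const CfsBandwidthModel& b) {
    return !b.timer_active;
}

// Replays the problematic order: A (first half), B, A (second half).
CfsBandwidthModel run_racy_interleaving() {
    CfsBandwidthModel b;
    a_force_reprogram(b);
    b_start_timer(b);
    a_cancel_and_recheck(b);
    return b;
}
```

After run_racy_interleaving() the model ends with timer_active set but timer_armed cleared, and tick_would_restart() returns false, so nothing in the model arms the timer again. If thread A runs both halves back to back, the timer ends up armed.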

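3.4 Summary

Frequently changing a CGroup's settings races, inside __start_cfs_bandwidth(), with the tick interrupt path that checks the group quota. As a result period_timer is cancelled and never armed again, the tasks controlled by the CGroup can no longer be allocated cpu quota, and they stop being scheduled

3.5 Recovery

Knowing the cause, we can also see that tg_set_cfs_quota() calls __start_cfs_bandwidth(), which cancels the timer and then arms it again, so the timer callback can unthrottle. Manually writing this CGroup's cpu.cfs_period_us or cpu.cfs_quota_us once is therefore enough to get it running again.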
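A small helper for that manual write might look like the sketch below. write_cgroup_value() and kick_period_timer() are hypothetical names; the only thing taken from this post is that writing cpu.cfs_quota_us goes through tg_set_cfs_quota(). Writing a real cgroup file normally needs root.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Writes one integer to a control file. Returns false if the file cannot
// be opened or the write fails.
bool write_cgroup_value(const std::string& path, long value) {
    std::ofstream out(path);
    if (!out) {
        return false;
    }
    out << value;
    out.flush();
    return out.good();
}

// Writes the quota of the cgroup in cgroup_dir again, which is the step
// described above for getting a stuck group running.
bool kick_period_timer(const std::string& cgroup_dir, long quota_us) {
    return write_cgroup_value(cgroup_dir + "/cpu.cfs_quota_us", quota_us);
}
```

For the test program in this post the call would be kick_period_timer("/sys/fs/cgroup/cpu/test", 10000), run from a process outside the stuck CGroup.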

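4. The fix

Kernels 3.10.0-693 and later do not have this problem. Comparing against the 2.6.32 code (right side of the figure below), the 3.10.0-693 code (left side) changes hrtimer_cancel() to hrtimer_try_to_cancel() and protects both it and the cfs_b->timer_active check with the spinlock, so period_timer can no longer be cancelled after cfs_b->timer_active has been set to 1. Judging from the mailing list discussion of this fix, though, it was written for a different problem and fixed this one along the way, so I missed my chance to submit a patch to Linux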

ref : https://gfiber.googlesource.com/kernel/bruno/+/09dc4ab03936df5c5aa711d27c81283c6d09f495
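Why holding the lock across the check and the cancel closes the window can be shown with a deliberately simple model. It is an illustration only and does not reproduce the patch: FixedBandwidthModel and the function names are invented, and a function body here stands for one critical section that no other caller can interrupt.

```cpp
#include <cassert>

// Hypothetical model of the state after the fix.
struct FixedBandwidthModel {
    bool timer_active = true;  // models cfs_b->timer_active
    bool timer_armed = true;   // models period_timer really being queued
};

// Models the start path with the lock held throughout: cancel, flag update
// and arming happen as one step.
void start_locked(FixedBandwidthModel& b) {
    b.timer_armed = false;  // cancel the old timer
    b.timer_active = true;
    b.timer_armed = true;   // arm the new timer
}

// Setter path: clear the flag and restart inside the same critical section.
void set_bandwidth(FixedBandwidthModel& b) {
    b.timer_active = false;
    start_locked(b);
}

// Tick path: restart only when the flag says the timer is inactive.
void tick_path(FixedBandwidthModel& b) {
    if (!b.timer_active) {
        start_locked(b);
    }
}
```

Because each path is a single step in this model, the only possible orders are setter then tick, or tick then setter, and both leave the timer armed.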

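5. Exploiting the bug

1> In China, many companies still run CentOS6 and CentOS7.0~7.5, all of which have this bug. Anyone limiting cpu with CGroup on them may trigger it and suffer a service outage, and a restart will not necessarily recover it

2> Once the bug is triggered, the task is already in the running state, and because it never gets scheduled a kill cannot take effect. This can therefore be used to attack arbitrary software (antivirus software, for example), leaving it unable to run and unable to restart (many programs ensure that only one instance of their process exists). Even if the target does not use CGroup, an attacker can create one for it

3> Because this is a Linux kernel bug, it is hard to notice and diagnose once triggered: the process state shows running, so intuition says the process is still working normally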