k8s Eviction Series (6) - kube-controller-manager Eviction - NodeLifecycleController Source Code Analysis

2023-06-24 12:00:17

Overview

In Kubernetes v1.16, NodeController has been split into NodeIpamController and NodeLifecycleController; this article focuses on NodeLifecycleController.

NodeLifecycleController's main responsibilities are:
(1) periodically check node heartbeats; when a node has not reported a heartbeat for a certain time, update the node's ready condition to false or unknown, and, if taint-based eviction is enabled, add a NoExecute taint to the node;
(2) when taint-based eviction is disabled, evict (delete) the pods on a node once the node's ready condition has been false or unknown for a (configurable) period of time;
(3) when taint-based eviction is enabled, once a node carries a NoExecute taint, immediately evict (delete) the pods that cannot tolerate the taint; pods that do tolerate it are only evicted (deleted) after the minimum of the toleration times for all the node's taints has elapsed;

Source Code Analysis

The source code analysis is divided into three parts:
(1) startup parameter analysis;
(2) initialization and related struct analysis;
(3) processing logic analysis;

1. Relevant Startup Parameters

// cmd/kube-controller-manager/app/core.go
func startNodeLifecycleController(ctx ControllerContext) (http.Handler, bool, error) {
	lifecycleController, err := lifecyclecontroller.NewNodeLifecycleController(
		ctx.InformerFactory.Coordination().V1().Leases(),
		ctx.InformerFactory.Core().V1().Pods(),
		ctx.InformerFactory.Core().V1().Nodes(),
		ctx.InformerFactory.Apps().V1().DaemonSets(),
		// node lifecycle controller uses existing cluster role from node-controller
		ctx.ClientBuilder.ClientOrDie("node-controller"),
		ctx.ComponentConfig.KubeCloudShared.NodeMonitorPeriod.Duration,
		ctx.ComponentConfig.NodeLifecycleController.NodeStartupGracePeriod.Duration,
		ctx.ComponentConfig.NodeLifecycleController.NodeMonitorGracePeriod.Duration,
		ctx.ComponentConfig.NodeLifecycleController.PodEvictionTimeout.Duration,
		ctx.ComponentConfig.NodeLifecycleController.NodeEvictionRate,
		ctx.ComponentConfig.NodeLifecycleController.SecondaryNodeEvictionRate,
		ctx.ComponentConfig.NodeLifecycleController.LargeClusterSizeThreshold,
		ctx.ComponentConfig.NodeLifecycleController.UnhealthyZoneThreshold,
		ctx.ComponentConfig.NodeLifecycleController.EnableTaintManager,
		utilfeature.DefaultFeatureGate.Enabled(features.TaintBasedEvictions),
	)
	if err != nil {
		return nil, true, err
	}
	go lifecycleController.Run(ctx.Stop)
	return nil, true, nil
}

Looking at the arguments of lifecyclecontroller.NewNodeLifecycleController in the startNodeLifecycleController function above, several kube-controller-manager startup parameters are passed in;

(1)ctx.ComponentConfig.KubeCloudShared.NodeMonitorPeriod.Duration

This is the kcm startup parameter --node-monitor-period, default 5 seconds; it is the period at which NodeLifecycleController syncs the node object's status (node taints and node condition values);

fs.DurationVar(&o.NodeMonitorPeriod.Duration, "node-monitor-period", o.NodeMonitorPeriod.Duration,
		"The period for syncing NodeStatus in NodeController.")

(2)ctx.ComponentConfig.NodeLifecycleController.NodeStartupGracePeriod.Duration

This is the kcm startup parameter --node-startup-grace-period, default 60 seconds; it is how long a just-started node is allowed to be unresponsive before its conditions are updated (i.e. before it is marked unhealthy);

fs.DurationVar(&o.NodeStartupGracePeriod.Duration, "node-startup-grace-period", o.NodeStartupGracePeriod.Duration,
		"Amount of time which we allow starting Node to be unresponsive before marking it unhealthy.")

(3)ctx.ComponentConfig.NodeLifecycleController.NodeMonitorGracePeriod.Duration

This is the kcm startup parameter --node-monitor-grace-period, default 40 seconds; once more than 40s have passed since the node's last heartbeat, the node's conditions are updated to unknown (the kubelet reports heartbeats by updating the node lease);

fs.DurationVar(&o.NodeMonitorGracePeriod.Duration, "node-monitor-grace-period", o.NodeMonitorGracePeriod.Duration,
		"Amount of time which we allow running Node to be unresponsive before marking it unhealthy. "+
			"Must be N times more than kubelet's nodeStatusUpdateFrequency, "+
			"where N means number of retries allowed for kubelet to post node status.")

(4)ctx.ComponentConfig.NodeLifecycleController.PodEvictionTimeout.Duration

This is the kcm startup parameter --pod-eviction-timeout, default 5 minutes; it only takes effect when taint-based eviction is disabled: once a node's ready condition has been false or unknown for 5 minutes, the pods on that node are evicted (deleted);

fs.DurationVar(&o.PodEvictionTimeout.Duration, "pod-eviction-timeout", o.PodEvictionTimeout.Duration, "The grace period for deleting pods on failed nodes.")

(5)ctx.ComponentConfig.NodeLifecycleController.EnableTaintManager

This is the kcm startup parameter --enable-taint-manager, default true; it controls whether the taintManager is started. When a pod already scheduled onto a node cannot tolerate the node's NoExecute taint, the taintManager is responsible for evicting it; if set to false the taintManager is not started, and evictions are driven by --pod-eviction-timeout instead;

fs.BoolVar(&o.EnableTaintManager, "enable-taint-manager", o.EnableTaintManager, "WARNING: Beta feature. If set to true enables NoExecute Taints and will evict all not-tolerating Pod running on Nodes tainted with this kind of Taints.")

(6)utilfeature.DefaultFeatureGate.Enabled(features.TaintBasedEvictions)

This is the kcm startup parameter --feature-gates=TaintBasedEvictions=xxx, default true; it works together with --enable-taint-manager, and taint-based eviction is enabled only when both are true;

(7)ctx.ComponentConfig.NodeLifecycleController.NodeEvictionRate

This is the kcm startup parameter --node-eviction-rate, default 0.1; when a zone in the cluster is healthy (the zone concept is introduced in detail below), this is the number of nodes per second on which pod eviction is triggered; the default of 0.1 means pod eviction is triggered on at most one node every 10 seconds;

fs.Float32Var(&o.NodeEvictionRate, "node-eviction-rate", 0.1, "Number of nodes per second on which pods are deleted in case of node failure when a zone is healthy (see --unhealthy-zone-threshold for definition of healthy/unhealthy). Zone refers to entire cluster in non-multizone clusters.")

(8)ctx.ComponentConfig.NodeLifecycleController.SecondaryNodeEvictionRate

This is the kcm startup parameter --secondary-node-eviction-rate; if the fraction of unhealthy nodes in a zone exceeds --unhealthy-zone-threshold (default 0.55), the eviction rate is reduced: if the cluster is not a LargeCluster (i.e. the zone has no more than --large-cluster-size-threshold nodes, default 50), eviction stops; if it is a LargeCluster, the eviction rate drops to --secondary-node-eviction-rate nodes per second, default 0.01;

fs.Float32Var(&o.SecondaryNodeEvictionRate, "secondary-node-eviction-rate", 0.01, "Number of nodes per second on which pods are deleted in case of node failure when a zone is unhealthy (see --unhealthy-zone-threshold for definition of healthy/unhealthy). Zone refers to entire cluster in non-multizone clusters. This value is implicitly overridden to 0 if the cluster size is smaller than --large-cluster-size-threshold.")

(9)ctx.ComponentConfig.NodeLifecycleController.LargeClusterSizeThreshold

This is the kcm startup parameter --large-cluster-size-threshold, default 50; a zone whose node count exceeds this value is treated as a LargeCluster; when a zone is not a LargeCluster, the corresponding SecondaryNodeEvictionRate is effectively overridden to 0;

fs.Int32Var(&o.LargeClusterSizeThreshold, "large-cluster-size-threshold", 50, "Number of nodes from which NodeController treats the cluster as large for the eviction logic purposes. --secondary-node-eviction-rate is implicitly overridden to 0 for clusters this size or smaller.")

(10)ctx.ComponentConfig.NodeLifecycleController.UnhealthyZoneThreshold

This is the kcm startup parameter --unhealthy-zone-threshold, the threshold for declaring a zone unhealthy, i.e. it determines when the secondary eviction rate kicks in; default 0.55: when more than 55% of the nodes in a zone are not ready (ready condition not true), the zone is considered unhealthy;

fs.Float32Var(&o.UnhealthyZoneThreshold, "unhealthy-zone-threshold", 0.55, "Fraction of Nodes in a zone which needs to be not Ready (minimum 3) for zone to be treated as unhealthy. ")

(11) --feature-gates=NodeLease=xxx: default true; node heartbeats are reported via lease objects, replacing the old approach of updating node status, which greatly reduces the load on the apiserver;

The zone concept

Nodes are divided into zones according to the region and zone label values on each node object;

nodes whose region and zone values are both identical belong to the same zone; a simplified sketch of the zone-key computation follows.
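For reference, the zone key used to group nodes is derived from those two labels; the sketch below mirrors the idea behind the utilnode.GetZoneKey helper used later in monitorNodeHealth (simplified, not verbatim source; the exact label keys and separator may differ between versions, so treat this as illustrative only):

// Simplified sketch (not verbatim source): derive a node's zone key from its labels.
// The real helper is utilnode.GetZoneKey; the label keys shown are the
// failure-domain.beta.kubernetes.io ones used in this era of Kubernetes.
func getZoneKey(node *v1.Node) string {
	if node.Labels == nil {
		return ""
	}
	region := node.Labels["failure-domain.beta.kubernetes.io/region"]
	zone := node.Labels["failure-domain.beta.kubernetes.io/zone"]
	if region == "" && zone == "" {
		return ""
	}
	// Nodes with the same region and zone values get the same key,
	// i.e. they are grouped into the same zone.
	return region + ":\x00:" + zone
}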

Zone states

A zone can be in one of four states:
(1) Initial: the initial state;
(2) FullDisruption: the number of ready nodes is 0 and the number of not-ready nodes is greater than 0;
(3) PartialDisruption: the number of not-ready nodes is greater than 2 and their fraction is greater than or equal to unhealthyZoneThreshold;
(4) Normal: any situation not covered by the three states above;

Note how the secondary eviction rate affects eviction, i.e. the kcm startup parameter --secondary-node-eviction-rate: if the fraction of unhealthy nodes in a zone exceeds --unhealthy-zone-threshold (default 0.55), the eviction rate is reduced; if the cluster is not a LargeCluster (the zone has no more than --large-cluster-size-threshold nodes, default 50), eviction stops; if it is a LargeCluster, the eviction rate drops to --secondary-node-eviction-rate nodes per second, default 0.01. For example, in a zone of 60 nodes with 40 not ready (about 67%, above 0.55), the zone counts as large (60 > 50), so the rate drops to 0.01, i.e. roughly one node's pods are evicted every 100 seconds;

// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) ComputeZoneState(nodeReadyConditions []*v1.NodeCondition) (int, ZoneState) {
	readyNodes := 0
	notReadyNodes := 0
	for i := range nodeReadyConditions {
		if nodeReadyConditions[i] != nil && nodeReadyConditions[i].Status == v1.ConditionTrue {
			readyNodes++
		} else {
			notReadyNodes++
		}
	}
	switch {
	case readyNodes == 0 && notReadyNodes > 0:
		return notReadyNodes, stateFullDisruption
	case notReadyNodes > 2 && float32(notReadyNodes)/float32(notReadyNodes+readyNodes) >= nc.unhealthyZoneThreshold:
		return notReadyNodes, statePartialDisruption
	default:
		return notReadyNodes, stateNormal
	}
}

2. Initialization and Related Structs

2.1 The Controller struct

Key fields of the Controller struct:
(1) taintManager: the manager responsible for taint-based eviction;
(2) enterPartialDisruptionFunc: returns the eviction rate to use when a zone is in the PartialDisruption state (when the zone's node count exceeds largeClusterThreshold it returns secondaryEvictionLimiterQPS, i.e. the kcm startup parameter --secondary-node-eviction-rate, otherwise it returns 0);
(3) enterFullDisruptionFunc: returns the eviction rate to use when a zone is in the FullDisruption state (it simply returns the NodeEvictionRate value, the kcm startup parameter --node-eviction-rate); a simplified sketch of both rate functions follows this list;
(4) computeZoneStateFunc: the function that computes a zone's state, i.e. the ComputeZoneState method shown in the zone-state section above;
(5) nodeHealthMap: records the most recently observed health information for every node;
(6) zoneStates: records the state of every zone;
(7) nodeMonitorPeriod, nodeStartupGracePeriod, nodeMonitorGracePeriod, podEvictionTimeout, evictionLimiterQPS, secondaryEvictionLimiterQPS, largeClusterThreshold, unhealthyZoneThreshold: already covered in the startup-parameter analysis above;
(8) runTaintManager: set from the kcm startup parameter --enable-taint-manager; whether to start the taintManager;
(9) useTaintBasedEvictions: whether taint-based eviction is enabled; set from the kcm startup parameter --feature-gates=TaintBasedEvictions=xxx (default true), working together with --enable-taint-manager: taint-based eviction is enabled only when both are true;
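The two rate functions described in (2) and (3) are wired up in NewNodeLifecycleController below as nc.ReducedQPSFunc and nc.HealthyQPSFunc; their behaviour is roughly the following sketch (simplified, not verbatim source):

// Sketch of the FullDisruption rate function (HealthyQPSFunc): always evict at the
// normal --node-eviction-rate.
func (nc *Controller) healthyQPSFuncSketch(nodeNum int) float32 {
	return nc.evictionLimiterQPS
}

// Sketch of the PartialDisruption rate function (ReducedQPSFunc): large zones fall back
// to --secondary-node-eviction-rate, small zones stop evicting entirely.
func (nc *Controller) reducedQPSFuncSketch(nodeNum int) float32 {
	if int32(nodeNum) > nc.largeClusterThreshold {
		return nc.secondaryEvictionLimiterQPS
	}
	return 0
}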

Two key queues in the Controller struct:
(1) zonePodEvictor: the queue of nodes whose pods need to be evicted (only used when taint-based eviction is disabled); when a node's ready condition has been false or unknown for podEvictionTimeout, the node is put into this queue, and a worker then reads nodes from it and evicts the pods on them;
(2) zoneNoExecuteTainter: the queue of nodes whose taints need updating; when a node's ready condition becomes false or unknown, the node is put into this queue, and a worker then reads nodes from it and updates their taints (adding the not-ready or unreachable taint);

// pkg/controller/nodelifecycle/node_lifecycle_controller.go
type Controller struct {
    ...
	taintManager *scheduler.NoExecuteTaintManager

	// This timestamp is to be used instead of LastProbeTime stored in Condition. We do this
	// to avoid the problem with time skew across the cluster.
	now func() metav1.Time

	enterPartialDisruptionFunc func(nodeNum int) float32
	enterFullDisruptionFunc    func(nodeNum int) float32
	computeZoneStateFunc       func(nodeConditions []*v1.NodeCondition) (int, ZoneState)

	knownNodeSet map[string]*v1.Node
	// per Node map storing last observed health together with a local time when it was observed.
	nodeHealthMap *nodeHealthMap

	// evictorLock protects zonePodEvictor and zoneNoExecuteTainter.
	// TODO(#83954): API calls shouldn't be executed under the lock.
	evictorLock     sync.Mutex
	nodeEvictionMap *nodeEvictionMap
	// workers that evicts pods from unresponsive nodes.
	zonePodEvictor map[string]*scheduler.RateLimitedTimedQueue
	// workers that are responsible for tainting nodes.
	zoneNoExecuteTainter map[string]*scheduler.RateLimitedTimedQueue

	nodesToRetry sync.Map

	zoneStates map[string]ZoneState

	daemonSetStore          appsv1listers.DaemonSetLister
	daemonSetInformerSynced cache.InformerSynced

	leaseLister         coordlisters.LeaseLister
	leaseInformerSynced cache.InformerSynced
	nodeLister          corelisters.NodeLister
	nodeInformerSynced  cache.InformerSynced

	getPodsAssignedToNode func(nodeName string) ([]*v1.Pod, error)

	recorder record.EventRecorder

	// Value controlling Controller monitoring period, i.e. how often does Controller
	// check node health signal posted from kubelet. This value should be lower than
	// nodeMonitorGracePeriod.
	// TODO: Change node health monitor to watch based.
	nodeMonitorPeriod time.Duration

	// When node is just created, e.g. cluster bootstrap or node creation, we give
	// a longer grace period.
	nodeStartupGracePeriod time.Duration

	// Controller will not proactively sync node health, but will monitor node
	// health signal updated from kubelet. There are 2 kinds of node healthiness
	// signals: NodeStatus and NodeLease. NodeLease signal is generated only when
	// NodeLease feature is enabled. If it doesn't receive update for this amount
	// of time, it will start posting "NodeReady==ConditionUnknown". The amount of
	// time before which Controller start evicting pods is controlled via flag
	// 'pod-eviction-timeout'.
	// Note: be cautious when changing the constant, it must work with
	// nodeStatusUpdateFrequency in kubelet and renewInterval in NodeLease
	// controller. The node health signal update frequency is the minimal of the
	// two.
	// There are several constraints:
	// 1. nodeMonitorGracePeriod must be N times more than  the node health signal
	//    update frequency, where N means number of retries allowed for kubelet to
	//    post node status/lease. It is pointless to make nodeMonitorGracePeriod
	//    be less than the node health signal update frequency, since there will
	//    only be fresh values from Kubelet at an interval of node health signal
	//    update frequency. The constant must be less than podEvictionTimeout.
	// 2. nodeMonitorGracePeriod can't be too large for user experience - larger
	//    value takes longer for user to see up-to-date node health.
	nodeMonitorGracePeriod time.Duration

	podEvictionTimeout          time.Duration
	evictionLimiterQPS          float32
	secondaryEvictionLimiterQPS float32
	largeClusterThreshold       int32
	unhealthyZoneThreshold      float32

	// if set to true Controller will start TaintManager that will evict Pods from
	// tainted nodes, if they're not tolerated.
	runTaintManager bool

	// if set to true Controller will taint Nodes with 'TaintNodeNotReady' and 'TaintNodeUnreachable'
	// taints instead of evicting Pods itself.
	useTaintBasedEvictions bool

	nodeUpdateQueue workqueue.Interface
	podUpdateQueue  workqueue.RateLimitingInterface
}

2.2 Initialization logic

The main logic of the NewNodeLifecycleController function is:
(1) initialize the Controller struct, which represents the NodeLifecycleController;
(2) register EventHandlers on the podInformer (part of the logic is related to the TaintManager);
(3) check whether taint-based eviction is enabled, i.e. whether the --enable-taint-manager startup parameter is true; if so, initialize the TaintManager, assign it to the Controller's taintManager field, and then register TaintManager-related EventHandlers on the nodeInformer;
(4) register EventHandlers on the nodeInformer;

// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func NewNodeLifecycleController(
	leaseInformer coordinformers.LeaseInformer,
	podInformer coreinformers.PodInformer,
	nodeInformer coreinformers.NodeInformer,
	daemonSetInformer appsv1informers.DaemonSetInformer,
	kubeClient clientset.Interface,
	nodeMonitorPeriod time.Duration,
	nodeStartupGracePeriod time.Duration,
	nodeMonitorGracePeriod time.Duration,
	podEvictionTimeout time.Duration,
	evictionLimiterQPS float32,
	secondaryEvictionLimiterQPS float32,
	largeClusterThreshold int32,
	unhealthyZoneThreshold float32,
	runTaintManager bool,
	useTaintBasedEvictions bool,
) (*Controller, error) {

	...

    // (1) initialize the Controller struct
	nc := &Controller{
		kubeClient:                  kubeClient,
		now:                         metav1.Now,
		knownNodeSet:                make(map[string]*v1.Node),
		nodeHealthMap:               newNodeHealthMap(),
		nodeEvictionMap:             newNodeEvictionMap(),
		recorder:                    recorder,
		nodeMonitorPeriod:           nodeMonitorPeriod,
		nodeStartupGracePeriod:      nodeStartupGracePeriod,
		nodeMonitorGracePeriod:      nodeMonitorGracePeriod,
		zonePodEvictor:              make(map[string]*scheduler.RateLimitedTimedQueue),
		zoneNoExecuteTainter:        make(map[string]*scheduler.RateLimitedTimedQueue),
		nodesToRetry:                sync.Map{},
		zoneStates:                  make(map[string]ZoneState),
		podEvictionTimeout:          podEvictionTimeout,
		evictionLimiterQPS:          evictionLimiterQPS,
		secondaryEvictionLimiterQPS: secondaryEvictionLimiterQPS,
		largeClusterThreshold:       largeClusterThreshold,
		unhealthyZoneThreshold:      unhealthyZoneThreshold,
		runTaintManager:             runTaintManager,
		useTaintBasedEvictions:      useTaintBasedEvictions && runTaintManager,
		nodeUpdateQueue:             workqueue.NewNamed("node_lifecycle_controller"),
		podUpdateQueue:              workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "node_lifecycle_controller_pods"),
	}
	if useTaintBasedEvictions {
		klog.Infof("Controller is using taint based evictions.")
	}

	nc.enterPartialDisruptionFunc = nc.ReducedQPSFunc
	nc.enterFullDisruptionFunc = nc.HealthyQPSFunc
	nc.computeZoneStateFunc = nc.ComputeZoneState

    // (2) register EventHandlers on the podInformer (part of the logic is related to the TaintManager)
	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*v1.Pod)
			nc.podUpdated(nil, pod)
			if nc.taintManager != nil {
				nc.taintManager.PodUpdated(nil, pod)
			}
		},
		UpdateFunc: func(prev, obj interface{}) {
			prevPod := prev.(*v1.Pod)
			newPod := obj.(*v1.Pod)
			nc.podUpdated(prevPod, newPod)
			if nc.taintManager != nil {
				nc.taintManager.PodUpdated(prevPod, newPod)
			}
		},
		DeleteFunc: func(obj interface{}) {
			pod, isPod := obj.(*v1.Pod)
			// We can get DeletedFinalStateUnknown instead of *v1.Pod here and we need to handle that correctly.
			if !isPod {
				deletedState, ok := obj.(cache.DeletedFinalStateUnknown)
				if !ok {
					klog.Errorf("Received unexpected object: %v", obj)
					return
				}
				pod, ok = deletedState.Obj.(*v1.Pod)
				if !ok {
					klog.Errorf("DeletedFinalStateUnknown contained non-Pod object: %v", deletedState.Obj)
					return
				}
			}
			nc.podUpdated(pod, nil)
			if nc.taintManager != nil {
				nc.taintManager.PodUpdated(pod, nil)
			}
		},
	})
	nc.podInformerSynced = podInformer.Informer().HasSynced
	podInformer.Informer().AddIndexers(cache.Indexers{
		nodeNameKeyIndex: func(obj interface{}) ([]string, error) {
			pod, ok := obj.(*v1.Pod)
			if !ok {
				return []string{}, nil
			}
			if len(pod.Spec.NodeName) == 0 {
				return []string{}, nil
			}
			return []string{pod.Spec.NodeName}, nil
		},
	})

	podIndexer := podInformer.Informer().GetIndexer()
	nc.getPodsAssignedToNode = func(nodeName string) ([]*v1.Pod, error) {
		objs, err := podIndexer.ByIndex(nodeNameKeyIndex, nodeName)
		if err != nil {
			return nil, err
		}
		pods := make([]*v1.Pod, 0, len(objs))
		for _, obj := range objs {
			pod, ok := obj.(*v1.Pod)
			if !ok {
				continue
			}
			pods = append(pods, pod)
		}
		return pods, nil
	}
	nc.podLister = podInformer.Lister()

    // (3) if taint-based eviction is enabled (--enable-taint-manager=true), initialize the TaintManager,
    // assign it to the Controller's taintManager field, and register TaintManager-related EventHandlers on the nodeInformer
	if nc.runTaintManager {
		podGetter := func(name, namespace string) (*v1.Pod, error) { return nc.podLister.Pods(namespace).Get(name) }
		nodeLister := nodeInformer.Lister()
		nodeGetter := func(name string) (*v1.Node, error) { return nodeLister.Get(name) }
		nc.taintManager = scheduler.NewNoExecuteTaintManager(kubeClient, podGetter, nodeGetter, nc.getPodsAssignedToNode)
		nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
			AddFunc: nodeutil.CreateAddNodeHandler(func(node *v1.Node) error {
				nc.taintManager.NodeUpdated(nil, node)
				return nil
			}),
			UpdateFunc: nodeutil.CreateUpdateNodeHandler(func(oldNode, newNode *v1.Node) error {
				nc.taintManager.NodeUpdated(oldNode, newNode)
				return nil
			}),
			DeleteFunc: nodeutil.CreateDeleteNodeHandler(func(node *v1.Node) error {
				nc.taintManager.NodeUpdated(node, nil)
				return nil
			}),
		})
	}
    
    // (4) register EventHandlers on the nodeInformer
	klog.Infof("Controller will reconcile labels.")
	nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: nodeutil.CreateAddNodeHandler(func(node *v1.Node) error {
			nc.nodeUpdateQueue.Add(node.Name)
			nc.nodeEvictionMap.registerNode(node.Name)
			return nil
		}),
		UpdateFunc: nodeutil.CreateUpdateNodeHandler(func(_, newNode *v1.Node) error {
			nc.nodeUpdateQueue.Add(newNode.Name)
			return nil
		}),
		DeleteFunc: nodeutil.CreateDeleteNodeHandler(func(node *v1.Node) error {
			nc.nodesToRetry.Delete(node.Name)
			nc.nodeEvictionMap.unregisterNode(node.Name)
			return nil
		}),
	})

	nc.leaseLister = leaseInformer.Lister()
	nc.leaseInformerSynced = leaseInformer.Informer().HasSynced

	nc.nodeLister = nodeInformer.Lister()
	nc.nodeInformerSynced = nodeInformer.Informer().HasSynced

	nc.daemonSetStore = daemonSetInformer.Lister()
	nc.daemonSetInformerSynced = daemonSetInformer.Informer().HasSynced

	return nc, nil
}

3. Processing Logic

The Run method is the entry point for NodeLifecycleController's processing logic. Its main steps are:
(1) wait for the caches of the leaseInformer, nodeInformer, podInformer and daemonSetInformer to sync;

(2) if taint-based eviction is enabled, start a goroutine that calls nc.taintManager.Run to launch the taintManager;

(3) start 8 goroutines (workers) that repeatedly call nc.doNodeProcessingPassWorker to process the nc.nodeUpdateQueue queue;
nc.doNodeProcessingPassWorker does two things:
(3-1) call nc.doNoScheduleTaintingPass, which updates node.Spec.Taints based on node.Status.Conditions and node.Spec.Unschedulable, mainly setting taints with Effect NoSchedule;
(3-2) call nc.reconcileNodeLabels, which reconciles the os- and arch-related labels on the node object;

(4) start 4 goroutines (workers) that repeatedly call nc.doPodProcessingWorker to process the nc.podUpdateQueue queue;
nc.doPodProcessingWorker does two things:
(4-1) when taint-based eviction is disabled, check the node object's status: if the node's ready condition has been false or unknown for at least nc.podEvictionTimeout, evict (delete) the pods on that node;
(4-2) if the node's ready condition is not true, update the pods' ready condition to false;

(5) if nc.useTaintBasedEvictions is true, i.e. taint-based eviction is enabled, start a goroutine that repeatedly calls nc.doNoExecuteTaintingPass;
nc.doNoExecuteTaintingPass updates node.Spec.Taints based on node.Status.Conditions, mainly setting taints with Effect NoExecute;

(6) when taint-based eviction is disabled, start a goroutine that repeatedly calls nc.doEvictionPass;
nc.doEvictionPass takes nodes from nc.zonePodEvictor and evicts (deletes) all pods on them except daemonset pods;

(7) start a goroutine that calls nc.monitorNodeHealth every nc.nodeMonitorPeriod (the kcm startup parameter --node-monitor-period, default 5 seconds);
nc.monitorNodeHealth continuously monitors node status; based on the number of unhealthy nodes in each zone and the eviction-rate startup parameters, it sets a different eviction rate per zone (this rate applies whether or not taint-based eviction is enabled), and when a node's heartbeat (the renew time of its node lease) is older than nodeMonitorGracePeriod (or nodeStartupGracePeriod for a node that has just started), it updates the node object's ready condition and performs the corresponding eviction handling:
(7-1) with taint-based eviction enabled, if the node's ready condition is not true, add the NoExecute taint and put the node into zoneNoExecuteTainter; the taintManager then performs the eviction;
(7-2) with taint-based eviction disabled, if the node's ready condition has not been true for podEvictionTimeout, start evicting its pods;

// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) Run(stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()

	klog.Infof("Starting node controller")
	defer klog.Infof("Shutting down node controller")

	if !cache.WaitForNamedCacheSync("taint", stopCh, nc.leaseInformerSynced, nc.nodeInformerSynced, nc.podInformerSynced, nc.daemonSetInformerSynced) {
		return
	}

	if nc.runTaintManager {
		go nc.taintManager.Run(stopCh)
	}

	// Close node update queue to cleanup go routine.
	defer nc.nodeUpdateQueue.ShutDown()
	defer nc.podUpdateQueue.ShutDown()

	// Start workers to reconcile labels and/or update NoSchedule taint for nodes.
	for i := 0; i < scheduler.UpdateWorkerSize; i++ {
		// Thanks to "workqueue", each worker just need to get item from queue, because
		// the item is flagged when got from queue: if new event come, the new item will
		// be re-queued until "Done", so no more than one worker handle the same item and
		// no event missed.
		go wait.Until(nc.doNodeProcessingPassWorker, time.Second, stopCh)
	}

	for i := 0; i < podUpdateWorkerSize; i++ {
		go wait.Until(nc.doPodProcessingWorker, time.Second, stopCh)
	}

	if nc.useTaintBasedEvictions {
		// Handling taint based evictions. Because we don't want a dedicated logic in TaintManager for NC-originated
		// taints and we normally don't rate limit evictions caused by taints, we need to rate limit adding taints.
		go wait.Until(nc.doNoExecuteTaintingPass, scheduler.NodeEvictionPeriod, stopCh)
	} else {
		// Managing eviction of nodes:
		// When we delete pods off a node, if the node was not empty at the time we then
		// queue an eviction watcher. If we hit an error, retry deletion.
		go wait.Until(nc.doEvictionPass, scheduler.NodeEvictionPeriod, stopCh)
	}

	// Incorporate the results of node health signal pushed from kubelet to master.
	go wait.Until(func() {
		if err := nc.monitorNodeHealth(); err != nil {
			klog.Errorf("Error monitoring node health: %v", err)
		}
	}, nc.nodeMonitorPeriod, stopCh)

	<-stopCh
}

3.1 nc.taintManager.Run

The taintManager's main job is: once a node is tainted with a NoExecute taint, pods on it that cannot tolerate the taint are evicted by the taintManager, and newly created pods must also tolerate the taint in order to be scheduled onto that node;

whether the taintManager is started is controlled by the kcm startup parameter --enable-taint-manager; it is started when true (the default);

the kcm startup parameter --feature-gates=TaintBasedEvictions=xxx (default true) works together with --enable-taint-manager, and taint-based eviction is enabled only when both are true;

the taintManager is large enough that it will be analyzed in a separate follow-up article;
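As background for how tolerating pods are delayed rather than evicted immediately (point (3) of the overview), the snippet below is an illustration only and is not taken from the controller source: a toleration that keeps a pod on a NotReady node for 300 seconds before the taintManager evicts it (assumes k8s.io/api/core/v1 is imported as v1):

// Illustration (not from the controller source): a pod toleration for the
// node.kubernetes.io/not-ready NoExecute taint with a 300s toleration window.
tolerationSeconds := int64(300)
toleration := v1.Toleration{
	Key:               v1.TaintNodeNotReady, // "node.kubernetes.io/not-ready"
	Operator:          v1.TolerationOpExists,
	Effect:            v1.TaintEffectNoExecute,
	TolerationSeconds: &tolerationSeconds,
}
// Appending this to pod.Spec.Tolerations means the taintManager waits 300 seconds
// after the taint appears before deleting the pod.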

3.2 nc.doNodeProcessingPassWorker

nc.doNodeProcessingPassWorker does two things:
(1) call nc.doNoScheduleTaintingPass, which updates node.Spec.Taints based on node.Status.Conditions and node.Spec.Unschedulable, mainly setting taints with Effect NoSchedule;
(2) call nc.reconcileNodeLabels, which reconciles the os- and arch-related labels on the node object;

Main logic:
(1) loop consuming the nodeUpdateQueue, taking one nodeName from the queue at a time;
(2) call nc.doNoScheduleTaintingPass to handle the node's taints;
(3) call nc.reconcileNodeLabels to handle the os- and arch-related labels on the node object;

// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) doNodeProcessingPassWorker() {
	for {
		obj, shutdown := nc.nodeUpdateQueue.Get()
		// "nodeUpdateQueue" will be shutdown when "stopCh" closed;
		// we do not need to re-check "stopCh" again.
		if shutdown {
			return
		}
		nodeName := obj.(string)
		if err := nc.doNoScheduleTaintingPass(nodeName); err != nil {
			klog.Errorf("Failed to taint NoSchedule on node <%s>, requeue it: %v", nodeName, err)
			// TODO(k82cn): Add nodeName back to the queue
		}
		// TODO: re-evaluate whether there are any labels that need to be
		// reconcile in 1.19. Remove this function if it's no longer necessary.
		if err := nc.reconcileNodeLabels(nodeName); err != nil {
			klog.Errorf("Failed to reconcile labels for node <%s>, requeue it: %v", nodeName, err)
			// TODO(yujuhong): Add nodeName back to the queue
		}
		nc.nodeUpdateQueue.Done(nodeName)
	}
}

3.2.1 nc.doNoScheduleTaintingPass

nc.doNoScheduleTaintingPass updates node.Spec.Taints based on node.Status.Conditions and node.Spec.Unschedulable, mainly setting taints with Effect NoSchedule;

Main logic:
(1) call nc.nodeLister.Get to fetch the node object from the informer's local cache; return immediately if it does not exist;

(2) derive the corresponding taints from node.Status.Conditions:
(2-1) if node.Status.Conditions contains a condition of type Ready: when its status is false, add a taint with key node.kubernetes.io/not-ready and Effect NoSchedule; when its status is unknown, add a taint with key node.kubernetes.io/unreachable and Effect NoSchedule;
(2-2) if there is a condition of type MemoryPressure with status true, add a taint with key node.kubernetes.io/memory-pressure and Effect NoSchedule;
(2-3) if there is a condition of type DiskPressure with status true, add a taint with key node.kubernetes.io/disk-pressure and Effect NoSchedule;
(2-4) if there is a condition of type NetworkUnavailable with status true, add a taint with key node.kubernetes.io/network-unavailable and Effect NoSchedule;
(2-5) if there is a condition of type PIDPressure with status true, add a taint with key node.kubernetes.io/pid-pressure and Effect NoSchedule;

(3) if node.Spec.Unschedulable is true, additionally append a taint with key node.kubernetes.io/unschedulable and Effect NoSchedule;

(4) call nodeutil.SwapNodeControllerTaint to update the node's taints;

// pkg/controller/nodelifecycle/node_lifecycle_controller.go

var(
    ...
    nodeConditionToTaintKeyStatusMap = map[v1.NodeConditionType]map[v1.ConditionStatus]string{
		v1.NodeReady: {
			v1.ConditionFalse:   v1.TaintNodeNotReady,
			v1.ConditionUnknown: v1.TaintNodeUnreachable,
		},
		v1.NodeMemoryPressure: {
			v1.ConditionTrue: v1.TaintNodeMemoryPressure,
		},
		v1.NodeDiskPressure: {
			v1.ConditionTrue: v1.TaintNodeDiskPressure,
		},
		v1.NodeNetworkUnavailable: {
			v1.ConditionTrue: v1.TaintNodeNetworkUnavailable,
		},
		v1.NodePIDPressure: {
			v1.ConditionTrue: v1.TaintNodePIDPressure,
		},
	}
    ...
)

func (nc *Controller) doNoScheduleTaintingPass(nodeName string) error {
	node, err := nc.nodeLister.Get(nodeName)
	if err != nil {
		// If node not found, just ignore it.
		if apierrors.IsNotFound(err) {
			return nil
		}
		return err
	}

	// Map node's condition to Taints.
	var taints []v1.Taint
	for _, condition := range node.Status.Conditions {
		if taintMap, found := nodeConditionToTaintKeyStatusMap[condition.Type]; found {
			if taintKey, found := taintMap[condition.Status]; found {
				taints = append(taints, v1.Taint{
					Key:    taintKey,
					Effect: v1.TaintEffectNoSchedule,
				})
			}
		}
	}
	if node.Spec.Unschedulable {
		// If unschedulable, append related taint.
		taints = append(taints, v1.Taint{
			Key:    v1.TaintNodeUnschedulable,
			Effect: v1.TaintEffectNoSchedule,
		})
	}

	// Get exist taints of node.
	nodeTaints := taintutils.TaintSetFilter(node.Spec.Taints, func(t *v1.Taint) bool {
		// only NoSchedule taints are candidates to be compared with "taints" later
		if t.Effect != v1.TaintEffectNoSchedule {
			return false
		}
		// Find unschedulable taint of node.
		if t.Key == v1.TaintNodeUnschedulable {
			return true
		}
		// Find node condition taints of node.
		_, found := taintKeyToNodeConditionMap[t.Key]
		return found
	})
	taintsToAdd, taintsToDel := taintutils.TaintSetDiff(taints, nodeTaints)
	// If nothing to add not delete, return true directly.
	if len(taintsToAdd) == 0 && len(taintsToDel) == 0 {
		return nil
	}
	if !nodeutil.SwapNodeControllerTaint(nc.kubeClient, taintsToAdd, taintsToDel, node) {
		return fmt.Errorf("failed to swap taints of node %+v", node)
	}
	return nil
}

3.2.2 nc.reconcileNodeLabels

nc.reconcileNodeLabels reconciles the os- and arch-related labels on the node object; these labels are added by the kubelet when it registers the node with the apiserver;

Main logic (a simplified sketch follows this list):
(1) fetch the node object from the informer cache; return immediately if it does not exist;
(2) if the node has no labels, return immediately;
(3) if the node has a label with key "beta.kubernetes.io/os", set a label with key "kubernetes.io/os" and the same value;
(4) if the node has a label with key "beta.kubernetes.io/arch", set a label with key "kubernetes.io/arch" and the same value;
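The following is a condensed sketch of that logic, not verbatim source; the real function drives this from a small table of beta/GA label pairs, and the controller-util helper AddOrUpdateLabelsOnNode used at the end is assumed here rather than quoted:

// Condensed sketch of nc.reconcileNodeLabels (simplified, not verbatim source):
// copy the beta os/arch labels onto their GA keys when the GA values are missing
// or out of date.
func (nc *Controller) reconcileNodeLabelsSketch(nodeName string) error {
	node, err := nc.nodeLister.Get(nodeName)
	if err != nil {
		if apierrors.IsNotFound(err) {
			return nil // node no longer exists, nothing to do
		}
		return err
	}
	if node.Labels == nil {
		return nil
	}

	labelsToUpdate := map[string]string{}
	for betaKey, gaKey := range map[string]string{
		"beta.kubernetes.io/os":   "kubernetes.io/os",
		"beta.kubernetes.io/arch": "kubernetes.io/arch",
	} {
		if v, ok := node.Labels[betaKey]; ok && node.Labels[gaKey] != v {
			labelsToUpdate[gaKey] = v
		}
	}
	if len(labelsToUpdate) == 0 {
		return nil
	}
	// The real code applies the update through a controller-util helper
	// (AddOrUpdateLabelsOnNode), which writes the labels back via the apiserver.
	return nodeutil.AddOrUpdateLabelsOnNode(nc.kubeClient, labelsToUpdate, node)
}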

3.3 nc.doPodProcessingWorker

nc.doPodProcessingWorker inspects the node object's status: when the node's ready condition has been false or unknown for at least nc.podEvictionTimeout, it evicts (deletes) the pods on that node, and if the node's ready condition is not true it also updates the pods' ready condition to false;

note that when the taint manager is enabled, pod eviction is handled by the taint manager, so no eviction is performed here.

Main logic:
(1) loop consuming the podUpdateQueue, taking one podUpdateItem (a pod's namespace and name) from the queue at a time;
(2) call nc.processPod for further processing;

// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) doPodProcessingWorker() {
	for {
		obj, shutdown := nc.podUpdateQueue.Get()
		// "podUpdateQueue" will be shutdown when "stopCh" closed;
		// we do not need to re-check "stopCh" again.
		if shutdown {
			return
		}

		podItem := obj.(podUpdateItem)
		nc.processPod(podItem)
	}
}

3.3.1 nc.processPod

Main logic of nc.processPod:
(1) fetch the pod object from the informer's local cache;
(2) read the pod's nodeName and call nc.nodeHealthMap.getDeepCopy with it to obtain the nodeHealth; return immediately if the nodeHealth is empty;
(3) call nodeutil.GetNodeCondition to get the node's ready condition from nodeHealth.status; return immediately if it cannot be found;
(4) if the taint manager is not enabled, call nc.processNoTaintBaseEviction for further handling of the pod (the eviction logic);
(5) if the node's ready condition is not true, call nodeutil.MarkPodsNotReady to update the pod's ready condition to false;

Note: when the taint manager is enabled, pod eviction is handled by the taint manager, so it is not handled here.

// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) processPod(podItem podUpdateItem) {
	defer nc.podUpdateQueue.Done(podItem)
	pod, err := nc.podLister.Pods(podItem.namespace).Get(podItem.name)
	if err != nil {
		if apierrors.IsNotFound(err) {
			// If the pod was deleted, there is no need to requeue.
			return
		}
		klog.Warningf("Failed to read pod %v/%v: %v.", podItem.namespace, podItem.name, err)
		nc.podUpdateQueue.AddRateLimited(podItem)
		return
	}

	nodeName := pod.Spec.NodeName

	nodeHealth := nc.nodeHealthMap.getDeepCopy(nodeName)
	if nodeHealth == nil {
		// Node data is not gathered yet or node has beed removed in the meantime.
		// Pod will be handled by doEvictionPass method.
		return
	}

	node, err := nc.nodeLister.Get(nodeName)
	if err != nil {
		klog.Warningf("Failed to read node %v: %v.", nodeName, err)
		nc.podUpdateQueue.AddRateLimited(podItem)
		return
	}

	_, currentReadyCondition := nodeutil.GetNodeCondition(nodeHealth.status, v1.NodeReady)
	if currentReadyCondition == nil {
		// Lack of NodeReady condition may only happen after node addition (or if it will be maliciously deleted).
		// In both cases, the pod will be handled correctly (evicted if needed) during processing
		// of the next node update event.
		return
	}

	pods := []*v1.Pod{pod}
	// In taint-based eviction mode, only node updates are processed by NodeLifecycleController.
	// Pods are processed by TaintManager.
	if !nc.useTaintBasedEvictions {
		if err := nc.processNoTaintBaseEviction(node, currentReadyCondition, nc.nodeMonitorGracePeriod, pods); err != nil {
			klog.Warningf("Unable to process pod %+v eviction from node %v: %v.", podItem, nodeName, err)
			nc.podUpdateQueue.AddRateLimited(podItem)
			return
		}
	}

	if currentReadyCondition.Status != v1.ConditionTrue {
		if err := nodeutil.MarkPodsNotReady(nc.kubeClient, pods, nodeName); err != nil {
			klog.Warningf("Unable to mark pod %+v NotReady on node %v: %v.", podItem, nodeName, err)
			nc.podUpdateQueue.AddRateLimited(podItem)
		}
	}
}

3.3.2 nc.processNoTaintBaseEviction

Main logic of nc.processNoTaintBaseEviction:
(1) when the node's ready condition has been false or unknown for at least nc.podEvictionTimeout, call nc.evictPods to add the node to the nc.zonePodEvictor queue; another worker consumes that queue and evicts (deletes) the pods on the node;
(2) when the node's ready condition is true, call nc.cancelPodEviction to remove the node from the nc.zonePodEvictor queue, i.e. cancel pod eviction for that node;

// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) processNoTaintBaseEviction(node *v1.Node, observedReadyCondition *v1.NodeCondition, gracePeriod time.Duration, pods []*v1.Pod) error {
	decisionTimestamp := nc.now()
	nodeHealthData := nc.nodeHealthMap.getDeepCopy(node.Name)
	if nodeHealthData == nil {
		return fmt.Errorf("health data doesn't exist for node %q", node.Name)
	}
	// Check eviction timeout against decisionTimestamp
	switch observedReadyCondition.Status {
	case v1.ConditionFalse:
		if decisionTimestamp.After(nodeHealthData.readyTransitionTimestamp.Add(nc.podEvictionTimeout)) {
			enqueued, err := nc.evictPods(node, pods)
			if err != nil {
				return err
			}
			if enqueued {
				klog.V(2).Infof("Node is NotReady. Adding Pods on Node %s to eviction queue: %v is later than %v + %v",
					node.Name,
					decisionTimestamp,
					nodeHealthData.readyTransitionTimestamp,
					nc.podEvictionTimeout,
				)
			}
		}
	case v1.ConditionUnknown:
		if decisionTimestamp.After(nodeHealthData.probeTimestamp.Add(nc.podEvictionTimeout)) {
			enqueued, err := nc.evictPods(node, pods)
			if err != nil {
				return err
			}
			if enqueued {
				klog.V(2).Infof("Node is unresponsive. Adding Pods on Node %s to eviction queues: %v is later than %v + %v",
					node.Name,
					decisionTimestamp,
					nodeHealthData.readyTransitionTimestamp,
					nc.podEvictionTimeout-gracePeriod,
				)
			}
		}
	case v1.ConditionTrue:
		if nc.cancelPodEviction(node) {
			klog.V(2).Infof("Node %s is ready again, cancelled pod eviction", node.Name)
		}
	}
	return nil
}

3.3.3 nc.evictPods

The source of nc.evictPods is not listed in full here; it is enough to know that the method adds the node to the nc.zonePodEvictor queue, and another worker consumes that queue and evicts (deletes) the pods on the node; a rough sketch of the idea follows.
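Rough sketch only, not verbatim source (the real method also consults nodeEvictionMap before enqueuing and can delete pods directly for nodes already marked as evicted): the node is enqueued into its zone's pod evictor so that doEvictionPass picks it up.

// Rough sketch of nc.evictPods (simplified, not verbatim source): enqueue the node
// into its zone's rate-limited eviction queue.
func (nc *Controller) evictPodsSketch(node *v1.Node) bool {
	nc.evictorLock.Lock()
	defer nc.evictorLock.Unlock()
	// Add is deduplicating: a node that is already queued is not enqueued twice.
	return nc.zonePodEvictor[utilnode.GetZoneKey(node)].Add(node.Name, string(node.UID))
}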

3.3.4 Consuming the nc.zonePodEvictor queue - nc.doEvictionPass

nc.doEvictionPass is responsible for consuming the nc.zonePodEvictor queue, calling nodeutil.DeletePods to evict (delete) the pods on the node; an abridged excerpt follows, and the full method is analyzed in section 3.5;

// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) doEvictionPass() {
	nc.evictorLock.Lock()
	defer nc.evictorLock.Unlock()
	for k := range nc.zonePodEvictor {
	    nc.zonePodEvictor[k].Try(func(value scheduler.TimedValue) (bool, time.Duration) {
	        ...
	        remaining, err := nodeutil.DeletePods(nc.kubeClient, pods, nc.recorder, value.Value, nodeUID, nc.daemonSetStore)
	        ...
	    })
	}
}
// pkg/controller/util/node/controller_utils.go
func DeletePods(...) ... {
    ...
    for i := range pods {
        if err := kubeClient.CoreV1().Pods(pod.Namespace).Delete(pod.Name, nil); err != nil {
        ...
    }
    ...
}

3.4 nc.doNoExecuteTaintingPass

nc.doNoExecuteTaintingPass updates node.Spec.Taints based on node.Status.Conditions, mainly setting taints with Effect NoExecute;

its main logic is to iterate over nc.zoneNoExecuteTainter and handle each entry:
(1) take a node from nc.zoneNoExecuteTainter, then fetch the node object from the informer's local cache;
(2) read the node's ready condition and check its value: if it is false, build a taint with key node.kubernetes.io/not-ready and Effect NoExecute; if it is unknown, build a taint with key node.kubernetes.io/unreachable and Effect NoExecute;
(3) finally call nodeutil.SwapNodeControllerTaint to apply the constructed taint to the node object (note that only one of these two NoExecute taints exists on a node at any given time; adding one removes the other);

// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) doNoExecuteTaintingPass() {
	nc.evictorLock.Lock()
	defer nc.evictorLock.Unlock()
	for k := range nc.zoneNoExecuteTainter {
		// Function should return 'false' and a time after which it should be retried, or 'true' if it shouldn't (it succeeded).
		nc.zoneNoExecuteTainter[k].Try(func(value scheduler.TimedValue) (bool, time.Duration) {
			node, err := nc.nodeLister.Get(value.Value)
			if apierrors.IsNotFound(err) {
				klog.Warningf("Node %v no longer present in nodeLister!", value.Value)
				return true, 0
			} else if err != nil {
				klog.Warningf("Failed to get Node %v from the nodeLister: %v", value.Value, err)
				// retry in 50 millisecond
				return false, 50 * time.Millisecond
			}
			_, condition := nodeutil.GetNodeCondition(&node.Status, v1.NodeReady)
			// Because we want to mimic NodeStatus.Condition["Ready"] we make "unreachable" and "not ready" taints mutually exclusive.
			taintToAdd := v1.Taint{}
			oppositeTaint := v1.Taint{}
			switch condition.Status {
			case v1.ConditionFalse:
				taintToAdd = *NotReadyTaintTemplate
				oppositeTaint = *UnreachableTaintTemplate
			case v1.ConditionUnknown:
				taintToAdd = *UnreachableTaintTemplate
				oppositeTaint = *NotReadyTaintTemplate
			default:
				// It seems that the Node is ready again, so there's no need to taint it.
				klog.V(4).Infof("Node %v was in a taint queue, but it's ready now. Ignoring taint request.", value.Value)
				return true, 0
			}

			result := nodeutil.SwapNodeControllerTaint(nc.kubeClient, []*v1.Taint{&taintToAdd}, []*v1.Taint{&oppositeTaint}, node)
			if result {
				//count the evictionsNumber
				zone := utilnode.GetZoneKey(node)
				evictionsNumber.WithLabelValues(zone).Inc()
			}

			return result, 0
		})
	}
}

3.5 nc.doEvictionPass

nc.doEvictionPass takes nodes from nc.zonePodEvictor and evicts (deletes) all pods on them except daemonset pods;

its main logic is to iterate over nc.zonePodEvictor and handle each entry:
(1) take a node from nc.zonePodEvictor, then fetch the node object from the informer's local cache;
(2) call nc.getPodsAssignedToNode to get all pods on the node;
(3) call nodeutil.DeletePods to delete all pods on the node except daemonset pods;

// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) doEvictionPass() {
	nc.evictorLock.Lock()
	defer nc.evictorLock.Unlock()
	for k := range nc.zonePodEvictor {
		// Function should return 'false' and a time after which it should be retried, or 'true' if it shouldn't (it succeeded).
		nc.zonePodEvictor[k].Try(func(value scheduler.TimedValue) (bool, time.Duration) {
			node, err := nc.nodeLister.Get(value.Value)
			if apierrors.IsNotFound(err) {
				klog.Warningf("Node %v no longer present in nodeLister!", value.Value)
			} else if err != nil {
				klog.Warningf("Failed to get Node %v from the nodeLister: %v", value.Value, err)
			}
			nodeUID, _ := value.UID.(string)
			pods, err := nc.getPodsAssignedToNode(value.Value)
			if err != nil {
				utilruntime.HandleError(fmt.Errorf("unable to list pods from node %q: %v", value.Value, err))
				return false, 0
			}
			remaining, err := nodeutil.DeletePods(nc.kubeClient, pods, nc.recorder, value.Value, nodeUID, nc.daemonSetStore)
			if err != nil {
				// We are not setting eviction status here.
				// New pods will be handled by zonePodEvictor retry
				// instead of immediate pod eviction.
				utilruntime.HandleError(fmt.Errorf("unable to evict node %q: %v", value.Value, err))
				return false, 0
			}
			if !nc.nodeEvictionMap.setStatus(value.Value, evicted) {
				klog.V(2).Infof("node %v was unregistered in the meantime - skipping setting status", value.Value)
			}
			if remaining {
				klog.Infof("Pods awaiting deletion due to Controller eviction")
			}

			if node != nil {
				zone := utilnode.GetZoneKey(node)
				evictionsNumber.WithLabelValues(zone).Inc()
			}

			return true, 0
		})
	}
}

3.6 nc.monitorNodeHealth

nc.monitorNodeHealth continuously monitors node status; based on the number of unhealthy nodes in each zone of the cluster and the eviction-rate startup parameters, it sets a different eviction rate for each zone (this rate applies whether or not taint-based eviction is enabled), and when a node's heartbeat (the renew time of its node lease) is older than nodeMonitorGracePeriod (or nodeStartupGracePeriod for a node that has just started), it updates the node object's ready condition and performs the corresponding eviction handling:
(1) with taint-based eviction enabled, if the node's ready condition is not true, add the NoExecute taint and put the node into zoneNoExecuteTainter; the taintManager then performs the eviction;
(2) with taint-based eviction disabled, if the node's ready condition has not been true for podEvictionTimeout, start evicting its pods;

Main logic:
(1) fetch all node objects from the informer's local cache;

(2) call nc.classifyNodes to split these nodes into three groups: added, deleted and newZoneRepresentatives (nodes belonging to a new zone), and handle each group differently (each node is assigned to a zone based on its region and zone labels); a zone's state is one of Initial, Normal, FullDisruption or PartialDisruption; a newly added node's zone starts in Initial, and each of the other states corresponds to a different eviction rate;

(3) iterate over the node list and process each node:
(3-1) call nc.tryUpdateNodeHealth, which updates the data in nc.nodeHealthMap and the node's condition values based on the currently observed ready condition of the node, the node lease's renew time, and so on, and obtains the node's gracePeriod, observedReadyCondition and currentReadyCondition; observedReadyCondition can be understood as the node's state at the previous probe, and currentReadyCondition as the state computed this time;
(3-2) if currentReadyCondition is not empty, call nc.getPodsAssignedToNode to get the full pod list of the node;
(3-3) if nc.useTaintBasedEvictions is true, i.e. taint-based eviction is enabled, call nc.processTaintBaseEviction: when the node's ready condition is false, remove the unreachable taint, add the not-ready taint and put the node into the zoneNoExecuteTainter queue; when it is unknown, remove the not-ready taint, add the unreachable taint and put the node into the zoneNoExecuteTainter queue; when it is true, remove both the not-ready and unreachable taints and remove the node from the zoneNoExecuteTainter queue;
(3-4) if nc.useTaintBasedEvictions is false, call nc.processNoTaintBaseEviction for the non-taint eviction logic: when the node's ready condition is false, check whether more than podEvictionTimeout has passed since the node's last readyTransitionTimestamp, and if so add the node to the zonePodEvictor queue so that its pods are eventually evicted; when the ready condition is unknown, check whether more than podEvictionTimeout has passed since the node's last probeTimestamp, and if so add the node to the zonePodEvictor queue so that its pods are eventually evicted; when the ready condition is true, call nc.cancelPodEviction to remove the node from the zonePodEvictor queue, meaning its pods are no longer scheduled for eviction;
(3-5) when the node's Ready condition changes from true to false, call nodeutil.MarkPodsNotReady to mark all pods on the node as not ready;

(4) call nc.handleDisruption, which sets a different eviction rate for each zone based on the number of unhealthy nodes in each zone and the eviction-rate startup parameters (the rate applies whether or not taint-based eviction is enabled);
nc.handleDisruption is not analyzed in detail here; its core rate-switching idea is shown in the sketch below.
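A rough sketch of that per-zone rate switching, simplified and not verbatim source (the real handleDisruption additionally stops evictions in all zones when every zone is fully disrupted, and restores the normal rate when zones recover):

// Rough sketch (not verbatim source): swap the rate limiter of the zone's queue
// according to the zone state computed by ComputeZoneState.
func (nc *Controller) setZoneEvictionRateSketch(zone string, zoneSize int, state ZoneState) {
	var qps float32
	switch state {
	case stateNormal:
		qps = nc.evictionLimiterQPS // --node-eviction-rate
	case statePartialDisruption:
		qps = nc.enterPartialDisruptionFunc(zoneSize) // secondary rate, or 0 for small zones
	case stateFullDisruption:
		qps = nc.enterFullDisruptionFunc(zoneSize) // normal rate
	default:
		return
	}
	if nc.useTaintBasedEvictions {
		nc.zoneNoExecuteTainter[zone].SwapLimiter(qps)
	} else {
		nc.zonePodEvictor[zone].SwapLimiter(qps)
	}
}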

// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) monitorNodeHealth() error {
	// We are listing nodes from local cache as we can tolerate some small delays
	// comparing to state from etcd and there is eventual consistency anyway.
	nodes, err := nc.nodeLister.List(labels.Everything())
	if err != nil {
		return err
	}
	added, deleted, newZoneRepresentatives := nc.classifyNodes(nodes)

	for i := range newZoneRepresentatives {
		nc.addPodEvictorForNewZone(newZoneRepresentatives[i])
	}

	for i := range added {
		klog.V(1).Infof("Controller observed a new Node: %#v", added[i].Name)
		nodeutil.RecordNodeEvent(nc.recorder, added[i].Name, string(added[i].UID), v1.EventTypeNormal, "RegisteredNode", fmt.Sprintf("Registered Node %v in Controller", added[i].Name))
		nc.knownNodeSet[added[i].Name] = added[i]
		nc.addPodEvictorForNewZone(added[i])
		if nc.useTaintBasedEvictions {
			nc.markNodeAsReachable(added[i])
		} else {
			nc.cancelPodEviction(added[i])
		}
	}

	for i := range deleted {
		klog.V(1).Infof("Controller observed a Node deletion: %v", deleted[i].Name)
		nodeutil.RecordNodeEvent(nc.recorder, deleted[i].Name, string(deleted[i].UID), v1.EventTypeNormal, "RemovingNode", fmt.Sprintf("Removing Node %v from Controller", deleted[i].Name))
		delete(nc.knownNodeSet, deleted[i].Name)
	}

	zoneToNodeConditions := map[string][]*v1.NodeCondition{}
	for i := range nodes {
		var gracePeriod time.Duration
		var observedReadyCondition v1.NodeCondition
		var currentReadyCondition *v1.NodeCondition
		node := nodes[i].DeepCopy()
		if err := wait.PollImmediate(retrySleepTime, retrySleepTime*scheduler.NodeHealthUpdateRetry, func() (bool, error) {
			gracePeriod, observedReadyCondition, currentReadyCondition, err = nc.tryUpdateNodeHealth(node)
			if err == nil {
				return true, nil
			}
			name := node.Name
			node, err = nc.kubeClient.CoreV1().Nodes().Get(name, metav1.GetOptions{})
			if err != nil {
				klog.Errorf("Failed while getting a Node to retry updating node health. Probably Node %s was deleted.", name)
				return false, err
			}
			return false, nil
		}); err != nil {
			klog.Errorf("Update health of Node '%v' from Controller error: %v. "+
				"Skipping - no pods will be evicted.", node.Name, err)
			continue
		}

		// Some nodes may be excluded from disruption checking
		if !isNodeExcludedFromDisruptionChecks(node) {
			zoneToNodeConditions[utilnode.GetZoneKey(node)] = append(zoneToNodeConditions[utilnode.GetZoneKey(node)], currentReadyCondition)
		}

		if currentReadyCondition != nil {
			pods, err := nc.getPodsAssignedToNode(node.Name)
			if err != nil {
				utilruntime.HandleError(fmt.Errorf("unable to list pods of node %v: %v", node.Name, err))
				if currentReadyCondition.Status != v1.ConditionTrue && observedReadyCondition.Status == v1.ConditionTrue {
					// If error happened during node status transition (Ready -> NotReady)
					// we need to mark node for retry to force MarkPodsNotReady execution
					// in the next iteration.
					nc.nodesToRetry.Store(node.Name, struct{}{})
				}
				continue
			}
			if nc.useTaintBasedEvictions {
				nc.processTaintBaseEviction(node, &observedReadyCondition)
			} else {
				if err := nc.processNoTaintBaseEviction(node, &observedReadyCondition, gracePeriod, pods); err != nil {
					utilruntime.HandleError(fmt.Errorf("unable to evict all pods from node %v: %v; queuing for retry", node.Name, err))
				}
			}

			_, needsRetry := nc.nodesToRetry.Load(node.Name)
			switch {
			case currentReadyCondition.Status != v1.ConditionTrue && observedReadyCondition.Status == v1.ConditionTrue:
				// Report node event only once when status changed.
				nodeutil.RecordNodeStatusChange(nc.recorder, node, "NodeNotReady")
				fallthrough
			case needsRetry && observedReadyCondition.Status != v1.ConditionTrue:
				if err = nodeutil.MarkPodsNotReady(nc.kubeClient, pods, node.Name); err != nil {
					utilruntime.HandleError(fmt.Errorf("unable to mark all pods NotReady on node %v: %v; queuing for retry", node.Name, err))
					nc.nodesToRetry.Store(node.Name, struct{}{})
					continue
				}
			}
		}
		nc.nodesToRetry.Delete(node.Name)
	}
	nc.handleDisruption(zoneToNodeConditions, nodes)

	return nil
}

3.6.1 nc.tryUpdateNodeHealth

nc.tryUpdateNodeHealth updates the data in nc.nodeHealthMap and the node's condition values based on the currently observed ready condition of the node, the node lease's renew time, and so on, and returns the gracePeriod, the previously recorded ready condition observedReadyCondition and the ready condition computed this time currentReadyCondition;

Main logic of nc.tryUpdateNodeHealth (a condensed sketch follows this list):
(1) read the ready condition from the node object; if it is missing, set observedReadyCondition to unknown and gracePeriod to nc.nodeStartupGracePeriod, otherwise set gracePeriod to nc.nodeMonitorGracePeriod and observedReadyCondition to the ready condition read from the node object;
(2) update the nodeHealth fields status, probeTimestamp and readyTransitionTimestamp: status is set to node.Status, probeTimestamp to the current time, and readyTransitionTimestamp to the current time whenever the ready condition's LastTransitionTime has changed;
(3) fetch the node's lease object and check whether its spec.RenewTime is newer than the previously recorded one (nodeHealth.lease.spec.RenewTime); if so, store the new lease object in nodeHealth and update probeTimestamp to the current time;
(4) if the time elapsed since the last probeTimestamp exceeds nc.nodeMonitorGracePeriod, update the node's ready condition, memoryPressure condition, diskPressure condition and pidPressure condition to unknown; finally, if the ready condition recorded last time differs from this time's value, update nodeHealth's readyTransitionTimestamp to the current time;
(5) finally return the gracePeriod, the previously recorded ready condition observedReadyCondition and the ready condition computed this time currentReadyCondition;
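A heavily condensed sketch of the heart of that flow, not verbatim source (error handling, health-map initialization and the actual condition updates through the apiserver are omitted):

// Heavily condensed sketch (not verbatim source): pick the grace period, refresh
// probeTimestamp from the node lease, and detect a stale heartbeat.
func (nc *Controller) tryUpdateNodeHealthSketch(node *v1.Node) {
	_, currentReadyCondition := nodeutil.GetNodeCondition(&node.Status, v1.NodeReady)
	gracePeriod := nc.nodeMonitorGracePeriod
	if currentReadyCondition == nil {
		// The kubelet has never posted node status: use the longer startup grace period.
		gracePeriod = nc.nodeStartupGracePeriod
	}

	health := nc.nodeHealthMap.getDeepCopy(node.Name)
	if health == nil {
		return // the real code initializes a new nodeHealth entry here
	}

	// The node lease acts as the heartbeat: a RenewTime newer than the previously
	// recorded one moves probeTimestamp forward.
	if lease, err := nc.leaseLister.Leases(v1.NamespaceNodeLease).Get(node.Name); err == nil && lease.Spec.RenewTime != nil {
		if health.lease == nil || health.lease.Spec.RenewTime == nil ||
			lease.Spec.RenewTime.After(health.lease.Spec.RenewTime.Time) {
			health.lease = lease
			health.probeTimestamp = nc.now()
		}
	}

	if nc.now().After(health.probeTimestamp.Add(gracePeriod)) {
		// Heartbeat is stale: the real code sets the node's Ready, MemoryPressure,
		// DiskPressure and PIDPressure conditions to Unknown via the apiserver.
	}
}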

3.6.2 nc.processTaintBaseEviction

nc.processTaintBaseEviction implements the eviction handling when taint-based eviction is enabled:
(1) when the node's ready condition is false, remove the unreachable taint, add the not-ready taint, and put the node into the zoneNoExecuteTainter queue;
(2) when it is unknown, remove the not-ready taint, add the unreachable taint, and put the node into the zoneNoExecuteTainter queue;
(3) when it is true, remove both the not-ready and unreachable taints, and remove the node from the zoneNoExecuteTainter queue;

// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) processTaintBaseEviction(node *v1.Node, observedReadyCondition *v1.NodeCondition) {
	decisionTimestamp := nc.now()
	// Check eviction timeout against decisionTimestamp
	switch observedReadyCondition.Status {
	case v1.ConditionFalse:
		// We want to update the taint straight away if Node is already tainted with the UnreachableTaint
		if taintutils.TaintExists(node.Spec.Taints, UnreachableTaintTemplate) {
			taintToAdd := *NotReadyTaintTemplate
			if !nodeutil.SwapNodeControllerTaint(nc.kubeClient, []*v1.Taint{&taintToAdd}, []*v1.Taint{UnreachableTaintTemplate}, node) {
				klog.Errorf("Failed to instantly swap UnreachableTaint to NotReadyTaint. Will try again in the next cycle.")
			}
		} else if nc.markNodeForTainting(node) {
			klog.V(2).Infof("Node %v is NotReady as of %v. Adding it to the Taint queue.",
				node.Name,
				decisionTimestamp,
			)
		}
	case v1.ConditionUnknown:
		// We want to update the taint straight away if Node is already tainted with the UnreachableTaint
		if taintutils.TaintExists(node.Spec.Taints, NotReadyTaintTemplate) {
			taintToAdd := *UnreachableTaintTemplate
			if !nodeutil.SwapNodeControllerTaint(nc.kubeClient, []*v1.Taint{&taintToAdd}, []*v1.Taint{NotReadyTaintTemplate}, node) {
				klog.Errorf("Failed to instantly swap UnreachableTaint to NotReadyTaint. Will try again in the next cycle.")
			}
		} else if nc.markNodeForTainting(node) {
			klog.V(2).Infof("Node %v is unresponsive as of %v. Adding it to the Taint queue.",
				node.Name,
				decisionTimestamp,
			)
		}
	case v1.ConditionTrue:
		removed, err := nc.markNodeAsReachable(node)
		if err != nil {
			klog.Errorf("Failed to remove taints from node %v. Will retry in next iteration.", node.Name)
		}
		if removed {
			klog.V(2).Infof("Node %s is healthy again, removing all taints", node.Name)
		}
	}
}

3.6.3 nc.processNoTaintBaseEviction

nc.processNoTaintBaseEviction implements the eviction handling when taint-based eviction is disabled:
(1) when the node's ready condition is false, check whether more than podEvictionTimeout has passed since the node's last readyTransitionTimestamp; if so, add the node to the zonePodEvictor queue so that its pods are eventually evicted;
(2) when the node's ready condition is unknown, check whether more than podEvictionTimeout has passed since the node's last probeTimestamp; if so, add the node to the zonePodEvictor queue so that its pods are eventually evicted;
(3) when the node's ready condition is true, call nc.cancelPodEviction to remove the node from the zonePodEvictor queue, meaning its pods are no longer scheduled for eviction;

// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) processNoTaintBaseEviction(node *v1.Node, observedReadyCondition *v1.NodeCondition, gracePeriod time.Duration, pods []*v1.Pod) error {
	decisionTimestamp := nc.now()
	nodeHealthData := nc.nodeHealthMap.getDeepCopy(node.Name)
	if nodeHealthData == nil {
		return fmt.Errorf("health data doesn't exist for node %q", node.Name)
	}
	// Check eviction timeout against decisionTimestamp
	switch observedReadyCondition.Status {
	case v1.ConditionFalse:
		if decisionTimestamp.After(nodeHealthData.readyTransitionTimestamp.Add(nc.podEvictionTimeout)) {
			enqueued, err := nc.evictPods(node, pods)
			if err != nil {
				return err
			}
			if enqueued {
				klog.V(2).Infof("Node is NotReady. Adding Pods on Node %s to eviction queue: %v is later than %v + %v",
					node.Name,
					decisionTimestamp,
					nodeHealthData.readyTransitionTimestamp,
					nc.podEvictionTimeout,
				)
			}
		}
	case v1.ConditionUnknown:
		if decisionTimestamp.After(nodeHealthData.probeTimestamp.Add(nc.podEvictionTimeout)) {
			enqueued, err := nc.evictPods(node, pods)
			if err != nil {
				return err
			}
			if enqueued {
				klog.V(2).Infof("Node is unresponsive. Adding Pods on Node %s to eviction queues: %v is later than %v + %v",
					node.Name,
					decisionTimestamp,
					nodeHealthData.readyTransitionTimestamp,
					nc.podEvictionTimeout-gracePeriod,
				)
			}
		}
	case v1.ConditionTrue:
		if nc.cancelPodEviction(node) {
			klog.V(2).Infof("Node %s is ready again, cancelled pod eviction", node.Name)
		}
	}
	return nil
}

Summary

NodeLifecycleController's main responsibilities are:
(1) periodically check node heartbeats; when a node has not reported a heartbeat for a certain time, update the node's ready condition to false or unknown, and, if taint-based eviction is enabled, add a NoExecute taint to the node;
(2) when taint-based eviction is disabled, evict (delete) the pods on a node once the node's ready condition has been false or unknown for a (configurable) period of time;
(3) when taint-based eviction is enabled, once a node carries a NoExecute taint, immediately evict (delete) the pods that cannot tolerate the taint; pods that do tolerate it are only evicted (deleted) after the minimum of the toleration times for all the node's taints has elapsed;

The zone concept in eviction

Nodes are divided into zones according to the region and zone label values on each node object;

nodes whose region and zone values are both identical belong to the same zone;

Zone states

A zone can be in one of four states:
(1) Initial: the initial state;
(2) FullDisruption: the number of ready nodes is 0 and the number of not-ready nodes is greater than 0;
(3) PartialDisruption: the number of not-ready nodes is greater than 2 and their fraction is greater than or equal to unhealthyZoneThreshold;
(4) Normal: any situation not covered by the three states above;

Note how the secondary eviction rate affects eviction, i.e. the kcm startup parameter --secondary-node-eviction-rate: if the fraction of unhealthy nodes in a zone exceeds --unhealthy-zone-threshold (default 0.55), the eviction rate is reduced; if the cluster is not a LargeCluster (the zone has no more than --large-cluster-size-threshold nodes, default 50), eviction stops; if it is a LargeCluster, the eviction rate drops to --secondary-node-eviction-rate nodes per second, default 0.01;