In Kubernetes v1.16 the old NodeController was split into NodeIpamController and NodeLifecycleController; this article covers NodeLifecycleController.
NodeLifecycleController's main responsibilities are:
(1) periodically check node heartbeats; when a node has not reported a heartbeat for a certain amount of time, update the node's ready condition to false or unknown and, if taint-based eviction is enabled, add a NoExecute taint to the node;
(2) when taint-based eviction is disabled, evict (delete) the pods on a node once its ready condition has been false or unknown for a configurable period of time;
(3) when taint-based eviction is enabled, evict (delete) the pods that do not tolerate a node's NoExecute taint as soon as the taint appears; pods that do tolerate it are evicted (deleted) only after the smallest toleration time among their matching tolerations has elapsed (a small illustration follows this list);
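To make point (3) concrete, here is a minimal, hypothetical sketch (not code from the controller) showing the two NoExecute taints the controller works with and a toleration similar to the one the DefaultTolerationSeconds admission plugin adds to pods, which tolerates the not-ready taint for 300 seconds:
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// The two NoExecute taints used for node-lifecycle eviction
	// (keys written as string literals here purely for illustration).
	notReady := v1.Taint{Key: "node.kubernetes.io/not-ready", Effect: v1.TaintEffectNoExecute}
	unreachable := v1.Taint{Key: "node.kubernetes.io/unreachable", Effect: v1.TaintEffectNoExecute}

	// A toleration that tolerates the not-ready taint for 300 seconds:
	// such a pod stays on the node for up to 300s after that taint appears.
	seconds := int64(300)
	tol := v1.Toleration{
		Key:               "node.kubernetes.io/not-ready",
		Operator:          v1.TolerationOpExists,
		Effect:            v1.TaintEffectNoExecute,
		TolerationSeconds: &seconds,
	}

	fmt.Println(tol.ToleratesTaint(&notReady))    // true: evicted only after 300s
	fmt.Println(tol.ToleratesTaint(&unreachable)) // false: would be evicted immediately
}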
The source code analysis is split into three parts:
(1) startup parameters;
(2) initialization and the relevant structs;
(3) processing logic;
// cmd/kube-controller-manager/app/core.go
func startNodeLifecycleController(ctx ControllerContext) (http.Handler, bool, error) {
lifecycleController, err := lifecyclecontroller.NewNodeLifecycleController(
ctx.InformerFactory.Coordination().V1().Leases(),
ctx.InformerFactory.Core().V1().Pods(),
ctx.InformerFactory.Core().V1().Nodes(),
ctx.InformerFactory.Apps().V1().DaemonSets(),
// node lifecycle controller uses existing cluster role from node-controller
ctx.ClientBuilder.ClientOrDie("node-controller"),
ctx.ComponentConfig.KubeCloudShared.NodeMonitorPeriod.Duration,
ctx.ComponentConfig.NodeLifecycleController.NodeStartupGracePeriod.Duration,
ctx.ComponentConfig.NodeLifecycleController.NodeMonitorGracePeriod.Duration,
ctx.ComponentConfig.NodeLifecycleController.PodEvictionTimeout.Duration,
ctx.ComponentConfig.NodeLifecycleController.NodeEvictionRate,
ctx.ComponentConfig.NodeLifecycleController.SecondaryNodeEvictionRate,
ctx.ComponentConfig.NodeLifecycleController.LargeClusterSizeThreshold,
ctx.ComponentConfig.NodeLifecycleController.UnhealthyZoneThreshold,
ctx.ComponentConfig.NodeLifecycleController.EnableTaintManager,
utilfeature.DefaultFeatureGate.Enabled(features.TaintBasedEvictions),
)
if err != nil {
return nil, true, err
}
go lifecycleController.Run(ctx.Stop)
return nil, true, nil
}
Looking at the arguments passed to lifecyclecontroller.NewNodeLifecycleController inside the startNodeLifecycleController function above, several of them come directly from kube-controller-manager (kcm) startup parameters:
(1) ctx.ComponentConfig.KubeCloudShared.NodeMonitorPeriod.Duration
This is the kcm startup parameter --node-monitor-period, default 5 seconds: the period at which NodeLifecycleController syncs node status (node taints and node conditions).
fs.DurationVar(&o.NodeMonitorPeriod.Duration, "node-monitor-period", o.NodeMonitorPeriod.Duration,
"The period for syncing NodeStatus in NodeController.")
(2) ctx.ComponentConfig.NodeLifecycleController.NodeStartupGracePeriod.Duration
This is the kcm startup parameter --node-startup-grace-period, default 60 seconds: how long a newly started node is allowed to be unresponsive before its conditions are marked unhealthy.
fs.DurationVar(&o.NodeStartupGracePeriod.Duration, "node-startup-grace-period", o.NodeStartupGracePeriod.Duration,
"Amount of time which we allow starting Node to be unresponsive before marking it unhealthy.")
(3) ctx.ComponentConfig.NodeLifecycleController.NodeMonitorGracePeriod.Duration
This is the kcm startup parameter --node-monitor-grace-period, default 40 seconds: once more than 40s have passed since the node's last heartbeat, the node's conditions are updated to unknown (the kubelet reports heartbeats by renewing its node lease).
fs.DurationVar(&o.NodeMonitorGracePeriod.Duration, "node-monitor-grace-period", o.NodeMonitorGracePeriod.Duration,
"Amount of time which we allow running Node to be unresponsive before marking it unhealthy. "+
"Must be N times more than kubelet's nodeStatusUpdateFrequency, "+
"where N means number of retries allowed for kubelet to post node status.")
(4) ctx.ComponentConfig.NodeLifecycleController.PodEvictionTimeout.Duration
This is the kcm startup parameter --pod-eviction-timeout, default 5 minutes. It only takes effect when taint-based eviction is disabled: once a node's ready condition has been false or unknown for 5 minutes, the pods on that node are evicted (deleted).
fs.DurationVar(&o.PodEvictionTimeout.Duration, "pod-eviction-timeout", o.PodEvictionTimeout.Duration, "The grace period for deleting pods on failed nodes.")
(5) ctx.ComponentConfig.NodeLifecycleController.EnableTaintManager
This is the kcm startup parameter --enable-taint-manager, default true, which starts the taintManager: when a pod already scheduled onto a node cannot tolerate the node's NoExecute taints, the taintManager evicts it. If set to false the taintManager is not started and evictions are driven by --pod-eviction-timeout instead.
fs.BoolVar(&o.EnableTaintManager, "enable-taint-manager", o.EnableTaintManager, "WARNING: Beta feature. If set to true enables NoExecute Taints and will evict all not-tolerating Pod running on Nodes tainted with this kind of Taints.")
(6) utilfeature.DefaultFeatureGate.Enabled(features.TaintBasedEvictions)
This is the kcm startup parameter --feature-gates=TaintBasedEvictions=xxx, default true. It works together with --enable-taint-manager: taint-based eviction is enabled only when both are true.
(7) ctx.ComponentConfig.NodeLifecycleController.NodeEvictionRate
This is the kcm startup parameter --node-eviction-rate, default 0.1: the number of nodes per second on which pod eviction is triggered while a zone (the zone concept is described in detail later) is healthy. 0.1 means pod eviction is triggered on at most one node every 10 seconds.
fs.Float32Var(&o.NodeEvictionRate, "node-eviction-rate", 0.1, "Number of nodes per second on which pods are deleted in case of node failure when a zone is healthy (see --unhealthy-zone-threshold for definition of healthy/unhealthy). Zone refers to entire cluster in non-multizone clusters.")
(8) ctx.ComponentConfig.NodeLifecycleController.SecondaryNodeEvictionRate
This is the kcm startup parameter --secondary-node-eviction-rate, default 0.01. When the fraction of unhealthy nodes in a zone exceeds --unhealthy-zone-threshold (default 0.55), the eviction rate is reduced: if the zone is not a LargeCluster (a zone with no more than --large-cluster-size-threshold nodes, default 50), evictions stop entirely; if it is a LargeCluster, the eviction rate drops to --secondary-node-eviction-rate nodes per second.
fs.Float32Var(&o.SecondaryNodeEvictionRate, "secondary-node-eviction-rate", 0.01, "Number of nodes per second on which pods are deleted in case of node failure when a zone is unhealthy (see --unhealthy-zone-threshold for definition of healthy/unhealthy). Zone refers to entire cluster in non-multizone clusters. This value is implicitly overridden to 0 if the cluster size is smaller than --large-cluster-size-threshold.")
(9) ctx.ComponentConfig.NodeLifecycleController.LargeClusterSizeThreshold
This is the kcm startup parameter --large-cluster-size-threshold, default 50. A zone with more nodes than this value is treated as a LargeCluster; for zones that are not a LargeCluster, the effective SecondaryNodeEvictionRate is overridden to 0.
fs.Int32Var(&o.LargeClusterSizeThreshold, "large-cluster-size-threshold", 50, "Number of nodes from which NodeController treats the cluster as large for the eviction logic purposes. --secondary-node-eviction-rate is implicitly overridden to 0 for clusters this size or smaller.")
(10) ctx.ComponentConfig.NodeLifecycleController.UnhealthyZoneThreshold
This is the kcm startup parameter --unhealthy-zone-threshold, default 0.55: the threshold at which a zone is considered unhealthy, which in turn determines when the secondary eviction rate kicks in. A zone is treated as unhealthy when at least 55% of its nodes (and at least 3 of them) are not ready (ready condition not true).
fs.Float32Var(&o.UnhealthyZoneThreshold, "unhealthy-zone-threshold", 0.55, "Fraction of Nodes in a zone which needs to be not Ready (minimum 3) for zone to be treated as unhealthy. ")
(11) --feature-gates=NodeLease=xxx: default true. Node heartbeats are reported through lease objects instead of the old mechanism of updating the node's status, which greatly reduces the load on the apiserver.
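A minimal sketch of the staleness check this lease-based heartbeat enables, assuming a hypothetical helper written only for illustration (not the controller's actual code):
package main

import (
	"fmt"
	"time"

	coordv1 "k8s.io/api/coordination/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// staleLease reports whether a node's Lease has not been renewed within the
// grace period; in that case the controller considers the heartbeat stale and
// sets the node's conditions to Unknown.
func staleLease(lease *coordv1.Lease, gracePeriod time.Duration, now time.Time) bool {
	if lease == nil || lease.Spec.RenewTime == nil {
		return true
	}
	return now.Sub(lease.Spec.RenewTime.Time) > gracePeriod
}

func main() {
	renew := metav1.NewMicroTime(time.Now().Add(-50 * time.Second))
	lease := &coordv1.Lease{Spec: coordv1.LeaseSpec{RenewTime: &renew}}
	// With the default --node-monitor-grace-period of 40s this lease is stale.
	fmt.Println(staleLease(lease, 40*time.Second, time.Now())) // true
}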
Nodes are divided into zones according to the region and zone label values on each node object; nodes whose region and zone values are both the same belong to the same zone.
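A simplified sketch of that grouping, assuming the legacy failure-domain labels used around this version (the controller itself uses utilnode.GetZoneKey, whose exact key format differs):
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// zoneKey is an illustrative helper: nodes whose region and zone labels are
// both equal produce the same key and therefore fall into the same zone.
func zoneKey(node *v1.Node) string {
	region := node.Labels["failure-domain.beta.kubernetes.io/region"]
	zone := node.Labels["failure-domain.beta.kubernetes.io/zone"]
	return region + "/" + zone
}

func main() {
	n := &v1.Node{ObjectMeta: metav1.ObjectMeta{
		Name: "node-1",
		Labels: map[string]string{
			"failure-domain.beta.kubernetes.io/region": "region-1",
			"failure-domain.beta.kubernetes.io/zone":   "zone-a",
		},
	}}
	fmt.Println(zoneKey(n)) // region-1/zone-a
}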
A zone can be in one of four states:
(1) Initial: the initial state;
(2) FullDisruption: the number of ready nodes is 0 and the number of not ready nodes is greater than 0;
(3) PartialDisruption: the number of not ready nodes is greater than 2 and their fraction is greater than or equal to unhealthyZoneThreshold;
(4) Normal: every other situation falls into this state;
Note how the secondary eviction rate affects evictions, i.e. the kcm startup parameter --secondary-node-eviction-rate: when the fraction of unhealthy nodes in a zone exceeds --unhealthy-zone-threshold (default 0.55), the eviction rate is reduced. If the zone is not a LargeCluster (no more than --large-cluster-size-threshold nodes, default 50), evictions stop entirely; if it is a LargeCluster, the eviction rate drops to --secondary-node-eviction-rate nodes per second, default 0.01.
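Each zone's eviction queue enforces this rate with a rate limiter. A minimal sketch of the idea using client-go's token-bucket limiter (illustration only; the controller wraps its limiter inside a RateLimitedTimedQueue and swaps the QPS as zone states change):
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	// A token bucket at 0.1 QPS (the default --node-eviction-rate) hands out
	// roughly one token every 10 seconds, i.e. one node's pods may be evicted
	// about every 10s per zone.
	limiter := flowcontrol.NewTokenBucketRateLimiter(0.1, 1)

	start := time.Now()
	for i := 0; i < 2; i++ {
		limiter.Accept() // blocks until a token is available
		fmt.Printf("eviction slot %d granted after %v\n", i+1, time.Since(start).Round(time.Second))
	}
}
The ComputeZoneState method below is what classifies each zone into the states listed above.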
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) ComputeZoneState(nodeReadyConditions []*v1.NodeCondition) (int, ZoneState) {
readyNodes := 0
notReadyNodes := 0
for i := range nodeReadyConditions {
if nodeReadyConditions[i] != nil && nodeReadyConditions[i].Status == v1.ConditionTrue {
readyNodes++
} else {
notReadyNodes++
}
}
switch {
case readyNodes == 0 && notReadyNodes > 0:
return notReadyNodes, stateFullDisruption
case notReadyNodes > 2 && float32(notReadyNodes)/float32(notReadyNodes+readyNodes) >= nc.unhealthyZoneThreshold:
return notReadyNodes, statePartialDisruption
default:
return notReadyNodes, stateNormal
}
}
Key fields of the Controller struct:
(1) taintManager: the manager responsible for taint-based eviction;
(2) enterPartialDisruptionFunc: returns the eviction rate to use when a zone is in the PartialDisruption state (if the zone has more nodes than largeClusterThreshold it returns secondaryEvictionLimiterQPS, i.e. the kcm startup parameter --secondary-node-eviction-rate, otherwise it returns 0);
(3) enterFullDisruptionFunc: returns the eviction rate to use when a zone is in the FullDisruption state (it simply returns NodeEvictionRate, the kcm startup parameter --node-eviction-rate);
(4) computeZoneStateFunc: the function that computes a zone's state, i.e. the ComputeZoneState method shown above;
(5) nodeHealthMap: records the most recently observed health data for every node;
(6) zoneStates: records the state of every zone;
(7) nodeMonitorPeriod, nodeStartupGracePeriod, nodeMonitorGracePeriod, podEvictionTimeout, evictionLimiterQPS, secondaryEvictionLimiterQPS, largeClusterThreshold, unhealthyZoneThreshold: already covered in the startup-parameter analysis above;
(8) runTaintManager: set from the kcm startup parameter --enable-taint-manager; whether to start the taintManager;
(9) useTaintBasedEvictions: whether taint-based eviction is enabled; set from the kcm startup parameter --feature-gates=TaintBasedEvictions=xxx (default true), working together with --enable-taint-manager: taint-based eviction is enabled only when both are true;
Two key queues in the Controller struct:
(1) zonePodEvictor: the queue of nodes whose pods need to be evicted (used only when taint-based eviction is disabled). When a node's ready condition has been false or unknown for podEvictionTimeout, the node is put into this queue, and a worker reads nodes from it and evicts (deletes) the pods on them;
(2) zoneNoExecuteTainter: the queue of nodes whose taints need to be updated. When a node's ready condition becomes false or unknown, the node is put into this queue, and a worker reads nodes from it and updates their taints (adding the not-ready or unreachable NoExecute taint);
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
type Controller struct {
...
taintManager *scheduler.NoExecuteTaintManager
// This timestamp is to be used instead of LastProbeTime stored in Condition. We do this
// to avoid the problem with time skew across the cluster.
now func() metav1.Time
enterPartialDisruptionFunc func(nodeNum int) float32
enterFullDisruptionFunc func(nodeNum int) float32
computeZoneStateFunc func(nodeConditions []*v1.NodeCondition) (int, ZoneState)
knownNodeSet map[string]*v1.Node
// per Node map storing last observed health together with a local time when it was observed.
nodeHealthMap *nodeHealthMap
// evictorLock protects zonePodEvictor and zoneNoExecuteTainter.
// TODO(#83954): API calls shouldn't be executed under the lock.
evictorLock sync.Mutex
nodeEvictionMap *nodeEvictionMap
// workers that evicts pods from unresponsive nodes.
zonePodEvictor map[string]*scheduler.RateLimitedTimedQueue
// workers that are responsible for tainting nodes.
zoneNoExecuteTainter map[string]*scheduler.RateLimitedTimedQueue
nodesToRetry sync.Map
zoneStates map[string]ZoneState
daemonSetStore appsv1listers.DaemonSetLister
daemonSetInformerSynced cache.InformerSynced
leaseLister coordlisters.LeaseLister
leaseInformerSynced cache.InformerSynced
nodeLister corelisters.NodeLister
nodeInformerSynced cache.InformerSynced
getPodsAssignedToNode func(nodeName string) ([]*v1.Pod, error)
recorder record.EventRecorder
// Value controlling Controller monitoring period, i.e. how often does Controller
// check node health signal posted from kubelet. This value should be lower than
// nodeMonitorGracePeriod.
// TODO: Change node health monitor to watch based.
nodeMonitorPeriod time.Duration
// When node is just created, e.g. cluster bootstrap or node creation, we give
// a longer grace period.
nodeStartupGracePeriod time.Duration
// Controller will not proactively sync node health, but will monitor node
// health signal updated from kubelet. There are 2 kinds of node healthiness
// signals: NodeStatus and NodeLease. NodeLease signal is generated only when
// NodeLease feature is enabled. If it doesn't receive update for this amount
// of time, it will start posting "NodeReady==ConditionUnknown". The amount of
// time before which Controller start evicting pods is controlled via flag
// 'pod-eviction-timeout'.
// Note: be cautious when changing the constant, it must work with
// nodeStatusUpdateFrequency in kubelet and renewInterval in NodeLease
// controller. The node health signal update frequency is the minimal of the
// two.
// There are several constraints:
// 1. nodeMonitorGracePeriod must be N times more than the node health signal
// update frequency, where N means number of retries allowed for kubelet to
// post node status/lease. It is pointless to make nodeMonitorGracePeriod
// be less than the node health signal update frequency, since there will
// only be fresh values from Kubelet at an interval of node health signal
// update frequency. The constant must be less than podEvictionTimeout.
// 2. nodeMonitorGracePeriod can't be too large for user experience - larger
// value takes longer for user to see up-to-date node health.
nodeMonitorGracePeriod time.Duration
podEvictionTimeout time.Duration
evictionLimiterQPS float32
secondaryEvictionLimiterQPS float32
largeClusterThreshold int32
unhealthyZoneThreshold float32
// if set to true Controller will start TaintManager that will evict Pods from
// tainted nodes, if they're not tolerated.
runTaintManager bool
// if set to true Controller will taint Nodes with 'TaintNodeNotReady' and 'TaintNodeUnreachable'
// taints instead of evicting Pods itself.
useTaintBasedEvictions bool
nodeUpdateQueue workqueue.Interface
podUpdateQueue workqueue.RateLimitingInterface
}
The main logic of the NewNodeLifecycleController function is:
(1) initialize the Controller struct, which represents the NodeLifecycleController;
(2) register EventHandlers on the podInformer (part of the logic relates to the TaintManager);
(3) check whether taint-based eviction is enabled, i.e. whether the --enable-taint-manager startup parameter is true; if so, initialize the TaintManager, assign it to the Controller's taintManager field, and register the TaintManager-related EventHandlers on the nodeInformer;
(4) register EventHandlers on the nodeInformer;
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func NewNodeLifecycleController(
leaseInformer coordinformers.LeaseInformer,
podInformer coreinformers.PodInformer,
nodeInformer coreinformers.NodeInformer,
daemonSetInformer appsv1informers.DaemonSetInformer,
kubeClient clientset.Interface,
nodeMonitorPeriod time.Duration,
nodeStartupGracePeriod time.Duration,
nodeMonitorGracePeriod time.Duration,
podEvictionTimeout time.Duration,
evictionLimiterQPS float32,
secondaryEvictionLimiterQPS float32,
largeClusterThreshold int32,
unhealthyZoneThreshold float32,
runTaintManager bool,
useTaintBasedEvictions bool,
) (*Controller, error) {
...
// (1) initialize the Controller struct
nc := &Controller{
kubeClient: kubeClient,
now: metav1.Now,
knownNodeSet: make(map[string]*v1.Node),
nodeHealthMap: newNodeHealthMap(),
nodeEvictionMap: newNodeEvictionMap(),
recorder: recorder,
nodeMonitorPeriod: nodeMonitorPeriod,
nodeStartupGracePeriod: nodeStartupGracePeriod,
nodeMonitorGracePeriod: nodeMonitorGracePeriod,
zonePodEvictor: make(map[string]*scheduler.RateLimitedTimedQueue),
zoneNoExecuteTainter: make(map[string]*scheduler.RateLimitedTimedQueue),
nodesToRetry: sync.Map{},
zoneStates: make(map[string]ZoneState),
podEvictionTimeout: podEvictionTimeout,
evictionLimiterQPS: evictionLimiterQPS,
secondaryEvictionLimiterQPS: secondaryEvictionLimiterQPS,
largeClusterThreshold: largeClusterThreshold,
unhealthyZoneThreshold: unhealthyZoneThreshold,
runTaintManager: runTaintManager,
useTaintBasedEvictions: useTaintBasedEvictions && runTaintManager,
nodeUpdateQueue: workqueue.NewNamed("node_lifecycle_controller"),
podUpdateQueue: workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "node_lifecycle_controller_pods"),
}
if useTaintBasedEvictions {
klog.Infof("Controller is using taint based evictions.")
}
nc.enterPartialDisruptionFunc = nc.ReducedQPSFunc
nc.enterFullDisruptionFunc = nc.HealthyQPSFunc
nc.computeZoneStateFunc = nc.ComputeZoneState
// (2) register EventHandlers on the podInformer (partly related to the TaintManager)
podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) {
pod := obj.(*v1.Pod)
nc.podUpdated(nil, pod)
if nc.taintManager != nil {
nc.taintManager.PodUpdated(nil, pod)
}
},
UpdateFunc: func(prev, obj interface{}) {
prevPod := prev.(*v1.Pod)
newPod := obj.(*v1.Pod)
nc.podUpdated(prevPod, newPod)
if nc.taintManager != nil {
nc.taintManager.PodUpdated(prevPod, newPod)
}
},
DeleteFunc: func(obj interface{}) {
pod, isPod := obj.(*v1.Pod)
// We can get DeletedFinalStateUnknown instead of *v1.Pod here and we need to handle that correctly.
if !isPod {
deletedState, ok := obj.(cache.DeletedFinalStateUnknown)
if !ok {
klog.Errorf("Received unexpected object: %v", obj)
return
}
pod, ok = deletedState.Obj.(*v1.Pod)
if !ok {
klog.Errorf("DeletedFinalStateUnknown contained non-Pod object: %v", deletedState.Obj)
return
}
}
nc.podUpdated(pod, nil)
if nc.taintManager != nil {
nc.taintManager.PodUpdated(pod, nil)
}
},
})
nc.podInformerSynced = podInformer.Informer().HasSynced
podInformer.Informer().AddIndexers(cache.Indexers{
nodeNameKeyIndex: func(obj interface{}) ([]string, error) {
pod, ok := obj.(*v1.Pod)
if !ok {
return []string{}, nil
}
if len(pod.Spec.NodeName) == 0 {
return []string{}, nil
}
return []string{pod.Spec.NodeName}, nil
},
})
podIndexer := podInformer.Informer().GetIndexer()
nc.getPodsAssignedToNode = func(nodeName string) ([]*v1.Pod, error) {
objs, err := podIndexer.ByIndex(nodeNameKeyIndex, nodeName)
if err != nil {
return nil, err
}
pods := make([]*v1.Pod, 0, len(objs))
for _, obj := range objs {
pod, ok := obj.(*v1.Pod)
if !ok {
continue
}
pods = append(pods, pod)
}
return pods, nil
}
nc.podLister = podInformer.Lister()
// (3) if taint-based eviction is enabled (--enable-taint-manager=true), initialize the TaintManager, assign it to the Controller's taintManager field, and register TaintManager-related EventHandlers on the nodeInformer
if nc.runTaintManager {
podGetter := func(name, namespace string) (*v1.Pod, error) { return nc.podLister.Pods(namespace).Get(name) }
nodeLister := nodeInformer.Lister()
nodeGetter := func(name string) (*v1.Node, error) { return nodeLister.Get(name) }
nc.taintManager = scheduler.NewNoExecuteTaintManager(kubeClient, podGetter, nodeGetter, nc.getPodsAssignedToNode)
nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: nodeutil.CreateAddNodeHandler(func(node *v1.Node) error {
nc.taintManager.NodeUpdated(nil, node)
return nil
}),
UpdateFunc: nodeutil.CreateUpdateNodeHandler(func(oldNode, newNode *v1.Node) error {
nc.taintManager.NodeUpdated(oldNode, newNode)
return nil
}),
DeleteFunc: nodeutil.CreateDeleteNodeHandler(func(node *v1.Node) error {
nc.taintManager.NodeUpdated(node, nil)
return nil
}),
})
}
// (4) register EventHandlers on the nodeInformer
klog.Infof("Controller will reconcile labels.")
nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: nodeutil.CreateAddNodeHandler(func(node *v1.Node) error {
nc.nodeUpdateQueue.Add(node.Name)
nc.nodeEvictionMap.registerNode(node.Name)
return nil
}),
UpdateFunc: nodeutil.CreateUpdateNodeHandler(func(_, newNode *v1.Node) error {
nc.nodeUpdateQueue.Add(newNode.Name)
return nil
}),
DeleteFunc: nodeutil.CreateDeleteNodeHandler(func(node *v1.Node) error {
nc.nodesToRetry.Delete(node.Name)
nc.nodeEvictionMap.unregisterNode(node.Name)
return nil
}),
})
nc.leaseLister = leaseInformer.Lister()
nc.leaseInformerSynced = leaseInformer.Informer().HasSynced
nc.nodeLister = nodeInformer.Lister()
nc.nodeInformerSynced = nodeInformer.Informer().HasSynced
nc.daemonSetStore = daemonSetInformer.Lister()
nc.daemonSetInformerSynced = daemonSetInformer.Informer().HasSynced
return nc, nil
}
The Run method is the entry point into NodeLifecycleController's processing logic. Its main steps are:
(1) wait for the caches of the leaseInformer, nodeInformer, podInformer and daemonSetInformer to sync;
(2) if taint-based eviction is enabled, start a goroutine that calls nc.taintManager.Run to run the taintManager;
(3) start 8 goroutines (workers) that loop over nc.doNodeProcessingPassWorker to process the nc.nodeUpdateQueue queue;
nc.doNodeProcessingPassWorker does two things:
(3-1) call nc.doNoScheduleTaintingPass, which updates node.Spec.Taints according to node.Status.Conditions and node.Spec.Unschedulable, mainly setting taints with the NoSchedule effect;
(3-2) call nc.reconcileNodeLabels, which handles the OS- and arch-related labels on the node object;
(4) start 4 goroutines (workers) that loop over nc.doPodProcessingWorker to process the nc.podUpdateQueue queue;
nc.doPodProcessingWorker does two things:
(4-1) when taint-based eviction is disabled, check the node's status: if the node's ready condition has been false or unknown for at least nc.podEvictionTimeout, evict (delete) the pods on that node;
(4-2) if the node's ready condition is not true, update the pods' ready condition to false;
(5) if nc.useTaintBasedEvictions is true, i.e. taint-based eviction is enabled, start a goroutine that loops over nc.doNoExecuteTaintingPass;
nc.doNoExecuteTaintingPass updates node.Spec.Taints according to node.Status.Conditions, mainly setting taints with the NoExecute effect;
(6) when taint-based eviction is disabled, start a goroutine that loops over nc.doEvictionPass;
nc.doEvictionPass takes nodes from nc.zonePodEvictor and evicts (deletes) all pods on those nodes except daemonset pods;
(7) start a goroutine that calls nc.monitorNodeHealth every nc.nodeMonitorPeriod (the kcm startup parameter --node-monitor-period, default 5 seconds);
nc.monitorNodeHealth continuously monitors node status. Based on the number of unhealthy nodes in each zone of the cluster and the eviction-rate related kcm startup parameters, it sets a different eviction rate per zone (the rate applies whether or not taint-based eviction is enabled). When a node's heartbeat (the node lease renew time) is older than nodeMonitorGracePeriod (or nodeStartupGracePeriod for a node that has just started), it updates the node's ready condition and performs the corresponding eviction handling:
(7-1) when taint-based eviction is enabled and the node's ready condition is not true, add the NoExecute taint and put the node into zoneNoExecuteTainter, letting the taintManager perform the eviction;
(7-2) when taint-based eviction is disabled and the node's ready condition has not been true for podEvictionTimeout, start evicting the pods;
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) Run(stopCh <-chan struct{}) {
defer utilruntime.HandleCrash()
klog.Infof("Starting node controller")
defer klog.Infof("Shutting down node controller")
if !cache.WaitForNamedCacheSync("taint", stopCh, nc.leaseInformerSynced, nc.nodeInformerSynced, nc.podInformerSynced, nc.daemonSetInformerSynced) {
return
}
if nc.runTaintManager {
go nc.taintManager.Run(stopCh)
}
// Close node update queue to cleanup go routine.
defer nc.nodeUpdateQueue.ShutDown()
defer nc.podUpdateQueue.ShutDown()
// Start workers to reconcile labels and/or update NoSchedule taint for nodes.
for i := 0; i < scheduler.UpdateWorkerSize; i++ {
// Thanks to "workqueue", each worker just need to get item from queue, because
// the item is flagged when got from queue: if new event come, the new item will
// be re-queued until "Done", so no more than one worker handle the same item and
// no event missed.
go wait.Until(nc.doNodeProcessingPassWorker, time.Second, stopCh)
}
for i := 0; i < podUpdateWorkerSize; i++ {
go wait.Until(nc.doPodProcessingWorker, time.Second, stopCh)
}
if nc.useTaintBasedEvictions {
// Handling taint based evictions. Because we don't want a dedicated logic in TaintManager for NC-originated
// taints and we normally don't rate limit evictions caused by taints, we need to rate limit adding taints.
go wait.Until(nc.doNoExecuteTaintingPass, scheduler.NodeEvictionPeriod, stopCh)
} else {
// Managing eviction of nodes:
// When we delete pods off a node, if the node was not empty at the time we then
// queue an eviction watcher. If we hit an error, retry deletion.
go wait.Until(nc.doEvictionPass, scheduler.NodeEvictionPeriod, stopCh)
}
// Incorporate the results of node health signal pushed from kubelet to master.
go wait.Until(func() {
if err := nc.monitorNodeHealth(); err != nil {
klog.Errorf("Error monitoring node health: %v", err)
}
}, nc.nodeMonitorPeriod, stopCh)
<-stopCh
}
The taintManager's main job: when a node is tainted with a NoExecute taint, the taintManager evicts the pods on that node that do not tolerate the taint; newly created pods must also tolerate the taint to be scheduled onto the node.
Whether the taintManager is started is controlled by the kcm startup parameter --enable-taint-manager; it is started when true (the default).
The kcm startup parameter --feature-gates=TaintBasedEvictions=xxx (default true) works together with --enable-taint-manager: taint-based eviction is enabled only when both are true.
The taintManager itself is large enough that it will be analyzed in a separate article.
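Even so, the eviction rule described above can be illustrated with a short sketch. The helper below is hypothetical (taintManager's real matching and bookkeeping are more involved): for a pod that tolerates every NoExecute taint on a node, eviction is delayed by the smallest TolerationSeconds among the matching tolerations, a nil TolerationSeconds meaning "tolerate forever"; if any NoExecute taint is not tolerated, the pod is evicted immediately.
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
)

// minTolerationSeconds is a simplified, illustrative helper: it returns the
// eviction delay for a pod facing the given taints, and whether the pod
// tolerates all NoExecute taints at all.
func minTolerationSeconds(pod *v1.Pod, taints []v1.Taint) (time.Duration, bool) {
	minSeconds := int64(-1)
	for i := range taints {
		if taints[i].Effect != v1.TaintEffectNoExecute {
			continue
		}
		tolerated := false
		for j := range pod.Spec.Tolerations {
			t := pod.Spec.Tolerations[j]
			if !t.ToleratesTaint(&taints[i]) {
				continue
			}
			tolerated = true
			if t.TolerationSeconds != nil && (minSeconds < 0 || *t.TolerationSeconds < minSeconds) {
				minSeconds = *t.TolerationSeconds
			}
		}
		if !tolerated {
			return 0, false // at least one NoExecute taint is not tolerated: evict immediately
		}
	}
	if minSeconds < 0 {
		return 0, true // all taints tolerated without a time limit: never evict
	}
	return time.Duration(minSeconds) * time.Second, true
}

func main() {
	seconds := int64(300)
	pod := &v1.Pod{Spec: v1.PodSpec{Tolerations: []v1.Toleration{{
		Key: "node.kubernetes.io/not-ready", Operator: v1.TolerationOpExists,
		Effect: v1.TaintEffectNoExecute, TolerationSeconds: &seconds,
	}}}}
	taints := []v1.Taint{{Key: "node.kubernetes.io/not-ready", Effect: v1.TaintEffectNoExecute}}
	fmt.Println(minTolerationSeconds(pod, taints)) // 5m0s true
}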
nc.doNodeProcessingPassWorker does two things:
(1) call nc.doNoScheduleTaintingPass, which updates node.Spec.Taints according to node.Status.Conditions and node.Spec.Unschedulable, mainly setting taints with the NoSchedule effect;
(2) call nc.reconcileNodeLabels, which handles the OS- and arch-related labels on the node object;
Main logic:
(1) keep consuming the nodeUpdateQueue queue, taking one nodeName from it at a time;
(2) call nc.doNoScheduleTaintingPass to process the node's taints;
(3) call nc.reconcileNodeLabels to process the node's OS- and arch-related labels;
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) doNodeProcessingPassWorker() {
for {
obj, shutdown := nc.nodeUpdateQueue.Get()
// "nodeUpdateQueue" will be shutdown when "stopCh" closed;
// we do not need to re-check "stopCh" again.
if shutdown {
return
}
nodeName := obj.(string)
if err := nc.doNoScheduleTaintingPass(nodeName); err != nil {
klog.Errorf("Failed to taint NoSchedule on node <%s>, requeue it: %v", nodeName, err)
// TODO(k82cn): Add nodeName back to the queue
}
// TODO: re-evaluate whether there are any labels that need to be
// reconcile in 1.19. Remove this function if it's no longer necessary.
if err := nc.reconcileNodeLabels(nodeName); err != nil {
klog.Errorf("Failed to reconcile labels for node <%s>, requeue it: %v", nodeName, err)
// TODO(yujuhong): Add nodeName back to the queue
}
nc.nodeUpdateQueue.Done(nodeName)
}
}
nc.doNoScheduleTaintingPass updates node.Spec.Taints according to node.Status.Conditions and node.Spec.Unschedulable, mainly setting taints with the NoSchedule effect.
Main logic:
(1) call nc.nodeLister.Get to fetch the node object from the informer's local cache; return immediately if it does not exist;
(2) derive the taints from node.Status.Conditions:
(2-1) if node.Status.Conditions contains a condition of type Ready: when its status is false, add a taint with key node.kubernetes.io/not-ready and effect NoSchedule; when its status is unknown, add a taint with key node.kubernetes.io/unreachable and effect NoSchedule;
(2-2) if node.Status.Conditions contains a condition of type MemoryPressure: when its status is true, add a taint with key node.kubernetes.io/memory-pressure and effect NoSchedule;
(2-3) if node.Status.Conditions contains a condition of type DiskPressure: when its status is true, add a taint with key node.kubernetes.io/disk-pressure and effect NoSchedule;
(2-4) if node.Status.Conditions contains a condition of type NetworkUnavailable: when its status is true, add a taint with key node.kubernetes.io/network-unavailable and effect NoSchedule;
(2-5) if node.Status.Conditions contains a condition of type PIDPressure: when its status is true, add a taint with key node.kubernetes.io/pid-pressure and effect NoSchedule;
(3) if node.Spec.Unschedulable is true, additionally append a taint with key node.kubernetes.io/unschedulable and effect NoSchedule;
(4) call nodeutil.SwapNodeControllerTaint to update the node's taints.
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
var(
...
nodeConditionToTaintKeyStatusMap = map[v1.NodeConditionType]map[v1.ConditionStatus]string{
v1.NodeReady: {
v1.ConditionFalse: v1.TaintNodeNotReady,
v1.ConditionUnknown: v1.TaintNodeUnreachable,
},
v1.NodeMemoryPressure: {
v1.ConditionTrue: v1.TaintNodeMemoryPressure,
},
v1.NodeDiskPressure: {
v1.ConditionTrue: v1.TaintNodeDiskPressure,
},
v1.NodeNetworkUnavailable: {
v1.ConditionTrue: v1.TaintNodeNetworkUnavailable,
},
v1.NodePIDPressure: {
v1.ConditionTrue: v1.TaintNodePIDPressure,
},
}
...
)
func (nc *Controller) doNoScheduleTaintingPass(nodeName string) error {
node, err := nc.nodeLister.Get(nodeName)
if err != nil {
// If node not found, just ignore it.
if apierrors.IsNotFound(err) {
return nil
}
return err
}
// Map node's condition to Taints.
var taints []v1.Taint
for _, condition := range node.Status.Conditions {
if taintMap, found := nodeConditionToTaintKeyStatusMap[condition.Type]; found {
if taintKey, found := taintMap[condition.Status]; found {
taints = append(taints, v1.Taint{
Key: taintKey,
Effect: v1.TaintEffectNoSchedule,
})
}
}
}
if node.Spec.Unschedulable {
// If unschedulable, append related taint.
taints = append(taints, v1.Taint{
Key: v1.TaintNodeUnschedulable,
Effect: v1.TaintEffectNoSchedule,
})
}
// Get exist taints of node.
nodeTaints := taintutils.TaintSetFilter(node.Spec.Taints, func(t *v1.Taint) bool {
// only NoSchedule taints are candidates to be compared with "taints" later
if t.Effect != v1.TaintEffectNoSchedule {
return false
}
// Find unschedulable taint of node.
if t.Key == v1.TaintNodeUnschedulable {
return true
}
// Find node condition taints of node.
_, found := taintKeyToNodeConditionMap[t.Key]
return found
})
taintsToAdd, taintsToDel := taintutils.TaintSetDiff(taints, nodeTaints)
// If nothing to add not delete, return true directly.
if len(taintsToAdd) == 0 && len(taintsToDel) == 0 {
return nil
}
if !nodeutil.SwapNodeControllerTaint(nc.kubeClient, taintsToAdd, taintsToDel, node) {
return fmt.Errorf("failed to swap taints of node %+v", node)
}
return nil
}
nc.reconcileNodeLabels handles the OS- and arch-related labels on the node object; these labels are attached by the kubelet when it registers the node with the apiserver.
Main logic (a small sketch follows this list):
(1) fetch the node object from the informer cache; return immediately if it does not exist;
(2) if the node has no labels, return immediately;
(3) if the node has a label with key "beta.kubernetes.io/os", set a label with key "kubernetes.io/os" and the same value;
(4) if the node has a label with key "beta.kubernetes.io/arch", set a label with key "kubernetes.io/arch" and the same value;
nc.doPodProcessingWorker inspects the node object's status: when the node's ready condition has been false or unknown for at least nc.podEvictionTimeout, it evicts (deletes) the pods on the node; and when the node's ready condition is not true, it updates the pods' ready condition to false.
Note that when the taint manager is enabled, pod eviction is handled by the taint manager, so no pod eviction is performed here.
Main logic:
(1) keep consuming the podUpdateQueue queue, taking one podUpdateItem from it at a time;
(2) call nc.processPod for further processing;
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) doPodProcessingWorker() {
for {
obj, shutdown := nc.podUpdateQueue.Get()
// "podUpdateQueue" will be shutdown when "stopCh" closed;
// we do not need to re-check "stopCh" again.
if shutdown {
return
}
podItem := obj.(podUpdateItem)
nc.processPod(podItem)
}
}
Main logic of nc.processPod:
(1) fetch the pod object from the informer's local cache;
(2) get the nodeName the pod is running on and call nc.nodeHealthMap.getDeepCopy with it to fetch the nodeHealth; return immediately if nodeHealth is nil;
(3) call nodeutil.GetNodeCondition to get the node's ready condition from nodeHealth.status; return immediately if it cannot be found;
(4) if the taint manager is not enabled, call nc.processNoTaintBaseEviction to handle the pod further (the eviction logic);
(5) if the node's ready condition is not true, call nodeutil.MarkPodsNotReady to update the pod's ready condition to false;
Note: when the taint manager is enabled, pod eviction is handled by the taint manager, so it is not handled here.
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) processPod(podItem podUpdateItem) {
defer nc.podUpdateQueue.Done(podItem)
pod, err := nc.podLister.Pods(podItem.namespace).Get(podItem.name)
if err != nil {
if apierrors.IsNotFound(err) {
// If the pod was deleted, there is no need to requeue.
return
}
klog.Warningf("Failed to read pod %v/%v: %v.", podItem.namespace, podItem.name, err)
nc.podUpdateQueue.AddRateLimited(podItem)
return
}
nodeName := pod.Spec.NodeName
nodeHealth := nc.nodeHealthMap.getDeepCopy(nodeName)
if nodeHealth == nil {
// Node data is not gathered yet or node has beed removed in the meantime.
// Pod will be handled by doEvictionPass method.
return
}
node, err := nc.nodeLister.Get(nodeName)
if err != nil {
klog.Warningf("Failed to read node %v: %v.", nodeName, err)
nc.podUpdateQueue.AddRateLimited(podItem)
return
}
_, currentReadyCondition := nodeutil.GetNodeCondition(nodeHealth.status, v1.NodeReady)
if currentReadyCondition == nil {
// Lack of NodeReady condition may only happen after node addition (or if it will be maliciously deleted).
// In both cases, the pod will be handled correctly (evicted if needed) during processing
// of the next node update event.
return
}
pods := []*v1.Pod{pod}
// In taint-based eviction mode, only node updates are processed by NodeLifecycleController.
// Pods are processed by TaintManager.
if !nc.useTaintBasedEvictions {
if err := nc.processNoTaintBaseEviction(node, currentReadyCondition, nc.nodeMonitorGracePeriod, pods); err != nil {
klog.Warningf("Unable to process pod %+v eviction from node %v: %v.", podItem, nodeName, err)
nc.podUpdateQueue.AddRateLimited(podItem)
return
}
}
if currentReadyCondition.Status != v1.ConditionTrue {
if err := nodeutil.MarkPodsNotReady(nc.kubeClient, pods, nodeName); err != nil {
klog.Warningf("Unable to mark pod %+v NotReady on node %v: %v.", podItem, nodeName, err)
nc.podUpdateQueue.AddRateLimited(podItem)
}
}
}
Main logic of nc.processNoTaintBaseEviction:
(1) when the node's ready condition has been false or unknown for at least nc.podEvictionTimeout, call nc.evictPods to add the node to the nc.zonePodEvictor queue; another worker consumes that queue and evicts (deletes) the pods on the node;
(2) when the node's ready condition is true, call nc.cancelPodEviction to remove the node from the nc.zonePodEvictor queue, i.e. cancel the eviction of the pods on that node;
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) processNoTaintBaseEviction(node *v1.Node, observedReadyCondition *v1.NodeCondition, gracePeriod time.Duration, pods []*v1.Pod) error {
decisionTimestamp := nc.now()
nodeHealthData := nc.nodeHealthMap.getDeepCopy(node.Name)
if nodeHealthData == nil {
return fmt.Errorf("health data doesn't exist for node %q", node.Name)
}
// Check eviction timeout against decisionTimestamp
switch observedReadyCondition.Status {
case v1.ConditionFalse:
if decisionTimestamp.After(nodeHealthData.readyTransitionTimestamp.Add(nc.podEvictionTimeout)) {
enqueued, err := nc.evictPods(node, pods)
if err != nil {
return err
}
if enqueued {
klog.V(2).Infof("Node is NotReady. Adding Pods on Node %s to eviction queue: %v is later than %v + %v",
node.Name,
decisionTimestamp,
nodeHealthData.readyTransitionTimestamp,
nc.podEvictionTimeout,
)
}
}
case v1.ConditionUnknown:
if decisionTimestamp.After(nodeHealthData.probeTimestamp.Add(nc.podEvictionTimeout)) {
enqueued, err := nc.evictPods(node, pods)
if err != nil {
return err
}
if enqueued {
klog.V(2).Infof("Node is unresponsive. Adding Pods on Node %s to eviction queues: %v is later than %v + %v",
node.Name,
decisionTimestamp,
nodeHealthData.readyTransitionTimestamp,
nc.podEvictionTimeout-gracePeriod,
)
}
}
case v1.ConditionTrue:
if nc.cancelPodEviction(node) {
klog.V(2).Infof("Node %s is ready again, cancelled pod eviction", node.Name)
}
}
return nil
}
The source of nc.evictPods is not shown here; it is enough to know that it adds the node to the nc.zonePodEvictor queue, and another worker consumes that queue and evicts (deletes) the pods on the node.
nc.doEvictionPass consumes the nc.zonePodEvictor queue and calls nodeutil.DeletePods to evict (delete) the pods on the node.
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) doEvictionPass() {
nc.evictorLock.Lock()
defer nc.evictorLock.Unlock()
for k := range nc.zonePodEvictor {
nc.zonePodEvictor[k].Try(func(value scheduler.TimedValue) (bool, time.Duration) {
...
remaining, err := nodeutil.DeletePods(nc.kubeClient, pods, nc.recorder, value.Value, nodeUID, nc.daemonSetStore)
...
})
}
}
// pkg/controller/util/node/controller_utils.go
func DeletePods(...) ... {
...
for i := range pods {
if err := kubeClient.CoreV1().Pods(pod.Namespace).Delete(pod.Name, nil); err != nil {
...
}
...
}
nc.doNoExecuteTaintingPass updates node.Spec.Taints according to node.Status.Conditions, mainly setting taints with the NoExecute effect.
Its main logic iterates over nc.zoneNoExecuteTainter and processes each entry:
(1) take a node from nc.zoneNoExecuteTainter and fetch the node object from the informer's local cache;
(2) look at the node's ready condition: if it is false, build a taint with key node.kubernetes.io/not-ready and effect NoExecute; if it is unknown, build a taint with key node.kubernetes.io/unreachable and effect NoExecute;
(3) finally call nodeutil.SwapNodeControllerTaint to apply the built taint to the node object (note that only one of these two NoExecute taints exists on a node at any time: adding one removes the other).
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) doNoExecuteTaintingPass() {
nc.evictorLock.Lock()
defer nc.evictorLock.Unlock()
for k := range nc.zoneNoExecuteTainter {
// Function should return 'false' and a time after which it should be retried, or 'true' if it shouldn't (it succeeded).
nc.zoneNoExecuteTainter[k].Try(func(value scheduler.TimedValue) (bool, time.Duration) {
node, err := nc.nodeLister.Get(value.Value)
if apierrors.IsNotFound(err) {
klog.Warningf("Node %v no longer present in nodeLister!", value.Value)
return true, 0
} else if err != nil {
klog.Warningf("Failed to get Node %v from the nodeLister: %v", value.Value, err)
// retry in 50 millisecond
return false, 50 * time.Millisecond
}
_, condition := nodeutil.GetNodeCondition(&node.Status, v1.NodeReady)
// Because we want to mimic NodeStatus.Condition["Ready"] we make "unreachable" and "not ready" taints mutually exclusive.
taintToAdd := v1.Taint{}
oppositeTaint := v1.Taint{}
switch condition.Status {
case v1.ConditionFalse:
taintToAdd = *NotReadyTaintTemplate
oppositeTaint = *UnreachableTaintTemplate
case v1.ConditionUnknown:
taintToAdd = *UnreachableTaintTemplate
oppositeTaint = *NotReadyTaintTemplate
default:
// It seems that the Node is ready again, so there's no need to taint it.
klog.V(4).Infof("Node %v was in a taint queue, but it's ready now. Ignoring taint request.", value.Value)
return true, 0
}
result := nodeutil.SwapNodeControllerTaint(nc.kubeClient, []*v1.Taint{&taintToAdd}, []*v1.Taint{&oppositeTaint}, node)
if result {
//count the evictionsNumber
zone := utilnode.GetZoneKey(node)
evictionsNumber.WithLabelValues(zone).Inc()
}
return result, 0
})
}
}
nc.doEvictionPass takes nodes from nc.zonePodEvictor and evicts (deletes) all pods on them except daemonset pods.
Its main logic iterates over nc.zonePodEvictor and processes each entry:
(1) take a node from nc.zonePodEvictor and fetch the node object from the informer's local cache;
(2) call nc.getPodsAssignedToNode to get all pods on the node;
(3) call nodeutil.DeletePods to delete all pods on the node except daemonset pods;
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) doEvictionPass() {
nc.evictorLock.Lock()
defer nc.evictorLock.Unlock()
for k := range nc.zonePodEvictor {
// Function should return 'false' and a time after which it should be retried, or 'true' if it shouldn't (it succeeded).
nc.zonePodEvictor[k].Try(func(value scheduler.TimedValue) (bool, time.Duration) {
node, err := nc.nodeLister.Get(value.Value)
if apierrors.IsNotFound(err) {
klog.Warningf("Node %v no longer present in nodeLister!", value.Value)
} else if err != nil {
klog.Warningf("Failed to get Node %v from the nodeLister: %v", value.Value, err)
}
nodeUID, _ := value.UID.(string)
pods, err := nc.getPodsAssignedToNode(value.Value)
if err != nil {
utilruntime.HandleError(fmt.Errorf("unable to list pods from node %q: %v", value.Value, err))
return false, 0
}
remaining, err := nodeutil.DeletePods(nc.kubeClient, pods, nc.recorder, value.Value, nodeUID, nc.daemonSetStore)
if err != nil {
// We are not setting eviction status here.
// New pods will be handled by zonePodEvictor retry
// instead of immediate pod eviction.
utilruntime.HandleError(fmt.Errorf("unable to evict node %q: %v", value.Value, err))
return false, 0
}
if !nc.nodeEvictionMap.setStatus(value.Value, evicted) {
klog.V(2).Infof("node %v was unregistered in the meantime - skipping setting status", value.Value)
}
if remaining {
klog.Infof("Pods awaiting deletion due to Controller eviction")
}
if node != nil {
zone := utilnode.GetZoneKey(node)
evictionsNumber.WithLabelValues(zone).Inc()
}
return true, 0
})
}
}
nc.monitorNodeHealth continuously monitors node status. Based on the number of unhealthy nodes in each zone of the cluster and the eviction-rate related kcm startup parameters, it sets a different eviction rate per zone (the rate applies whether or not taint-based eviction is enabled). When a node's heartbeat (the node lease renew time) is older than nodeMonitorGracePeriod (or nodeStartupGracePeriod for a node that has just started), it updates the node's ready condition and performs the corresponding eviction handling:
(1) when taint-based eviction is enabled and the node's ready condition is not true, add the NoExecute taint and put the node into zoneNoExecuteTainter, letting the taintManager perform the eviction;
(2) when taint-based eviction is disabled and the node's ready condition has not been true for podEvictionTimeout, start evicting the pods;
Main logic:
(1) fetch all node objects from the informer's local cache;
(2) call nc.classifyNodes to classify these nodes into added, deleted and newZoneRepresentatives (newly added, deleted, and belonging to a new zone), and handle each class accordingly. Each node is assigned to a zone based on its region and zone labels; a zone's state is one of Initial, Normal, FullDisruption and PartialDisruption. A newly added node's zone starts in the Initial state, and the other states each map to a different eviction rate;
(3) iterate over the node list and process each node:
(3-1) call nc.tryUpdateNodeHealth, which updates the data in nc.nodeHealthMap and the node's condition values according to the node object's current ready condition, the node lease renew time, etc., and returns the node's gracePeriod, observedReadyCondition and currentReadyCondition; observedReadyCondition can be understood as the node's state at the previous probe, and currentReadyCondition as the state computed this time;
(3-2) if currentReadyCondition is not nil, call nc.getPodsAssignedToNode to get the list of all pods on the node;
(3-3) if nc.useTaintBasedEvictions is true, i.e. taint-based eviction is enabled, call nc.processTaintBaseEviction: when the node's ready condition is false, remove the Unreachable taint, add the NotReady taint and put the node into the zoneNoExecuteTainter queue; when it is unknown, remove the NotReady taint, add the Unreachable taint and put the node into the zoneNoExecuteTainter queue; when it is true, remove both the NotReady and Unreachable taints and remove the node from the zoneNoExecuteTainter queue;
(3-4) if nc.useTaintBasedEvictions is false, call nc.processNoTaintBaseEviction for further eviction handling: when the node's ready condition is false, check whether podEvictionTimeout has passed since the node's last readyTransitionTimestamp, and if so add the node to the zonePodEvictor queue so its pods are eventually evicted; when the ready condition is unknown, check whether podEvictionTimeout has passed since the node's last probeTimestamp, and if so add the node to the zonePodEvictor queue so its pods are eventually evicted; when the ready condition is true, call nc.cancelPodEviction to remove the node from the zonePodEvictor queue, i.e. stop evicting the pods on that node;
(3-5) when the node's ready condition changes from true to not true (or a retry is pending), call nodeutil.MarkPodsNotReady to mark all pods on the node as not ready;
(4) call nc.handleDisruption, which, based on the number of unhealthy nodes in each zone and the eviction-rate related kcm startup parameters, sets a different eviction rate per zone (the rate applies whether or not taint-based eviction is enabled);
nc.handleDisruption is not expanded in this article and can be read on its own; a rough sketch of how the per-zone rate is chosen follows.
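The helper below is hypothetical, not the real handleDisruption (which, among other things, also pauses evictions entirely when every zone is fully disrupted); it only reflects the enterFullDisruptionFunc/enterPartialDisruptionFunc behavior described earlier:
package main

import "fmt"

// zoneQPS sketches how the per-zone eviction rate could be chosen from the
// zone state and the kcm flags discussed above; the controller applies the
// chosen QPS to each zone's rate-limited queue.
func zoneQPS(state string, nodesInZone int, evictionRate, secondaryRate float32, largeClusterThreshold int) float32 {
	switch state {
	case "FullDisruption":
		// every node in the zone is not ready: keep the normal rate
		return evictionRate
	case "PartialDisruption":
		// too many unhealthy nodes: slow down, or stop entirely in small zones
		if nodesInZone > largeClusterThreshold {
			return secondaryRate
		}
		return 0
	default: // Normal / Initial
		return evictionRate
	}
}

func main() {
	// Defaults: --node-eviction-rate=0.1, --secondary-node-eviction-rate=0.01,
	// --large-cluster-size-threshold=50.
	fmt.Println(zoneQPS("PartialDisruption", 100, 0.1, 0.01, 50)) // 0.01
	fmt.Println(zoneQPS("PartialDisruption", 30, 0.1, 0.01, 50))  // 0
}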
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) monitorNodeHealth() error {
// We are listing nodes from local cache as we can tolerate some small delays
// comparing to state from etcd and there is eventual consistency anyway.
nodes, err := nc.nodeLister.List(labels.Everything())
if err != nil {
return err
}
added, deleted, newZoneRepresentatives := nc.classifyNodes(nodes)
for i := range newZoneRepresentatives {
nc.addPodEvictorForNewZone(newZoneRepresentatives[i])
}
for i := range added {
klog.V(1).Infof("Controller observed a new Node: %#v", added[i].Name)
nodeutil.RecordNodeEvent(nc.recorder, added[i].Name, string(added[i].UID), v1.EventTypeNormal, "RegisteredNode", fmt.Sprintf("Registered Node %v in Controller", added[i].Name))
nc.knownNodeSet[added[i].Name] = added[i]
nc.addPodEvictorForNewZone(added[i])
if nc.useTaintBasedEvictions {
nc.markNodeAsReachable(added[i])
} else {
nc.cancelPodEviction(added[i])
}
}
for i := range deleted {
klog.V(1).Infof("Controller observed a Node deletion: %v", deleted[i].Name)
nodeutil.RecordNodeEvent(nc.recorder, deleted[i].Name, string(deleted[i].UID), v1.EventTypeNormal, "RemovingNode", fmt.Sprintf("Removing Node %v from Controller", deleted[i].Name))
delete(nc.knownNodeSet, deleted[i].Name)
}
zoneToNodeConditions := map[string][]*v1.NodeCondition{}
for i := range nodes {
var gracePeriod time.Duration
var observedReadyCondition v1.NodeCondition
var currentReadyCondition *v1.NodeCondition
node := nodes[i].DeepCopy()
if err := wait.PollImmediate(retrySleepTime, retrySleepTime*scheduler.NodeHealthUpdateRetry, func() (bool, error) {
gracePeriod, observedReadyCondition, currentReadyCondition, err = nc.tryUpdateNodeHealth(node)
if err == nil {
return true, nil
}
name := node.Name
node, err = nc.kubeClient.CoreV1().Nodes().Get(name, metav1.GetOptions{})
if err != nil {
klog.Errorf("Failed while getting a Node to retry updating node health. Probably Node %s was deleted.", name)
return false, err
}
return false, nil
}); err != nil {
klog.Errorf("Update health of Node '%v' from Controller error: %v. "+
"Skipping - no pods will be evicted.", node.Name, err)
continue
}
// Some nodes may be excluded from disruption checking
if !isNodeExcludedFromDisruptionChecks(node) {
zoneToNodeConditions[utilnode.GetZoneKey(node)] = append(zoneToNodeConditions[utilnode.GetZoneKey(node)], currentReadyCondition)
}
if currentReadyCondition != nil {
pods, err := nc.getPodsAssignedToNode(node.Name)
if err != nil {
utilruntime.HandleError(fmt.Errorf("unable to list pods of node %v: %v", node.Name, err))
if currentReadyCondition.Status != v1.ConditionTrue && observedReadyCondition.Status == v1.ConditionTrue {
// If error happened during node status transition (Ready -> NotReady)
// we need to mark node for retry to force MarkPodsNotReady execution
// in the next iteration.
nc.nodesToRetry.Store(node.Name, struct{}{})
}
continue
}
if nc.useTaintBasedEvictions {
nc.processTaintBaseEviction(node, &observedReadyCondition)
} else {
if err := nc.processNoTaintBaseEviction(node, &observedReadyCondition, gracePeriod, pods); err != nil {
utilruntime.HandleError(fmt.Errorf("unable to evict all pods from node %v: %v; queuing for retry", node.Name, err))
}
}
_, needsRetry := nc.nodesToRetry.Load(node.Name)
switch {
case currentReadyCondition.Status != v1.ConditionTrue && observedReadyCondition.Status == v1.ConditionTrue:
// Report node event only once when status changed.
nodeutil.RecordNodeStatusChange(nc.recorder, node, "NodeNotReady")
fallthrough
case needsRetry && observedReadyCondition.Status != v1.ConditionTrue:
if err = nodeutil.MarkPodsNotReady(nc.kubeClient, pods, node.Name); err != nil {
utilruntime.HandleError(fmt.Errorf("unable to mark all pods NotReady on node %v: %v; queuing for retry", node.Name, err))
nc.nodesToRetry.Store(node.Name, struct{}{})
continue
}
}
}
nc.nodesToRetry.Delete(node.Name)
}
nc.handleDisruption(zoneToNodeConditions, nodes)
return nil
}
nc.tryUpdateNodeHealth updates the data in nc.nodeHealthMap and the node's condition values according to the node object's current ready condition, the node lease renew time, etc., and returns the gracePeriod, the previously recorded ready condition observedReadyCondition and the newly computed ready condition currentReadyCondition.
Main logic of nc.tryUpdateNodeHealth (a simplified sketch of step (4) follows this list):
(1) read the ready condition from the node object. If it is missing, set observedReadyCondition to unknown and gracePeriod to nc.nodeStartupGracePeriod; otherwise set gracePeriod to nc.nodeMonitorGracePeriod and observedReadyCondition to the ready condition read from the node object;
(2) update the nodeHealth status, probeTimestamp and readyTransitionTimestamp fields: status is set to node.Status, probeTimestamp is set to the current time, and when the ready condition's LastTransitionTime has changed, readyTransitionTimestamp is set to the current time;
(3) get the node's lease object and check whether its spec.RenewTime is newer than the previously recorded one (nodeHealth.lease.spec.RenewTime); if so, store the new lease in nodeHealth and update probeTimestamp to the current time;
(4) check whether more than nc.nodeMonitorGracePeriod has passed since the last probeTimestamp; if so, update the node's ready condition, memoryPressure condition, diskPressure condition and pidPressure condition to unknown, and if the newly computed ready condition differs from the previously recorded one, update nodeHealth's readyTransitionTimestamp to the current time;
(5) finally return gracePeriod, the previously recorded ready condition observedReadyCondition and the newly computed ready condition currentReadyCondition.
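A minimal sketch of that step-(4) check, assuming a hypothetical helper operating on an in-memory node (the real tryUpdateNodeHealth also updates LastTransitionTime, records the result in nodeHealthMap and persists the new conditions through the API server):
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markStaleConditionsUnknown rewrites the listed conditions to Unknown once
// the last observed heartbeat (probe timestamp) is older than the grace period.
func markStaleConditionsUnknown(node *v1.Node, probeTimestamp metav1.Time, gracePeriod time.Duration, now metav1.Time) bool {
	if now.Sub(probeTimestamp.Time) <= gracePeriod {
		return false
	}
	stale := []v1.NodeConditionType{v1.NodeReady, v1.NodeMemoryPressure, v1.NodeDiskPressure, v1.NodePIDPressure}
	for _, condType := range stale {
		for i := range node.Status.Conditions {
			if node.Status.Conditions[i].Type == condType {
				node.Status.Conditions[i].Status = v1.ConditionUnknown
				node.Status.Conditions[i].Reason = "NodeStatusUnknown"
				node.Status.Conditions[i].LastHeartbeatTime = now
			}
		}
	}
	return true
}

func main() {
	node := &v1.Node{Status: v1.NodeStatus{Conditions: []v1.NodeCondition{
		{Type: v1.NodeReady, Status: v1.ConditionTrue},
	}}}
	probe := metav1.NewTime(time.Now().Add(-60 * time.Second))
	changed := markStaleConditionsUnknown(node, probe, 40*time.Second, metav1.Now())
	fmt.Println(changed, node.Status.Conditions[0].Status) // true Unknown
}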
nc.processTaintBaseEviction implements the taint-based eviction handling:
(1) when the node's ready condition is false, remove the Unreachable taint, add the NotReady taint and put the node into the zoneNoExecuteTainter queue;
(2) when it is unknown, remove the NotReady taint, add the Unreachable taint and put the node into the zoneNoExecuteTainter queue;
(3) when it is true, remove both the NotReady and Unreachable taints and remove the node from the zoneNoExecuteTainter queue;
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) processTaintBaseEviction(node *v1.Node, observedReadyCondition *v1.NodeCondition) {
decisionTimestamp := nc.now()
// Check eviction timeout against decisionTimestamp
switch observedReadyCondition.Status {
case v1.ConditionFalse:
// We want to update the taint straight away if Node is already tainted with the UnreachableTaint
if taintutils.TaintExists(node.Spec.Taints, UnreachableTaintTemplate) {
taintToAdd := *NotReadyTaintTemplate
if !nodeutil.SwapNodeControllerTaint(nc.kubeClient, []*v1.Taint{&taintToAdd}, []*v1.Taint{UnreachableTaintTemplate}, node) {
klog.Errorf("Failed to instantly swap UnreachableTaint to NotReadyTaint. Will try again in the next cycle.")
}
} else if nc.markNodeForTainting(node) {
klog.V(2).Infof("Node %v is NotReady as of %v. Adding it to the Taint queue.",
node.Name,
decisionTimestamp,
)
}
case v1.ConditionUnknown:
// We want to update the taint straight away if Node is already tainted with the UnreachableTaint
if taintutils.TaintExists(node.Spec.Taints, NotReadyTaintTemplate) {
taintToAdd := *UnreachableTaintTemplate
if !nodeutil.SwapNodeControllerTaint(nc.kubeClient, []*v1.Taint{&taintToAdd}, []*v1.Taint{NotReadyTaintTemplate}, node) {
klog.Errorf("Failed to instantly swap UnreachableTaint to NotReadyTaint. Will try again in the next cycle.")
}
} else if nc.markNodeForTainting(node) {
klog.V(2).Infof("Node %v is unresponsive as of %v. Adding it to the Taint queue.",
node.Name,
decisionTimestamp,
)
}
case v1.ConditionTrue:
removed, err := nc.markNodeAsReachable(node)
if err != nil {
klog.Errorf("Failed to remove taints from node %v. Will retry in next iteration.", node.Name)
}
if removed {
klog.V(2).Infof("Node %s is healthy again, removing all taints", node.Name)
}
}
}
nc.processNoTaintBaseEviction implements the eviction handling when taint-based eviction is disabled:
(1) when the node's ready condition is false, check whether podEvictionTimeout has passed since the node's last readyTransitionTimestamp; if so, add the node to the zonePodEvictor queue so the pods on it are eventually evicted;
(2) when the node's ready condition is unknown, check whether podEvictionTimeout has passed since the node's last probeTimestamp; if so, add the node to the zonePodEvictor queue so the pods on it are eventually evicted;
(3) when the node's ready condition is true, call nc.cancelPodEviction to remove the node from the zonePodEvictor queue, i.e. stop evicting the pods on that node;
// pkg/controller/nodelifecycle/node_lifecycle_controller.go
func (nc *Controller) processNoTaintBaseEviction(node *v1.Node, observedReadyCondition *v1.NodeCondition, gracePeriod time.Duration, pods []*v1.Pod) error {
decisionTimestamp := nc.now()
nodeHealthData := nc.nodeHealthMap.getDeepCopy(node.Name)
if nodeHealthData == nil {
return fmt.Errorf("health data doesn't exist for node %q", node.Name)
}
// Check eviction timeout against decisionTimestamp
switch observedReadyCondition.Status {
case v1.ConditionFalse:
if decisionTimestamp.After(nodeHealthData.readyTransitionTimestamp.Add(nc.podEvictionTimeout)) {
enqueued, err := nc.evictPods(node, pods)
if err != nil {
return err
}
if enqueued {
klog.V(2).Infof("Node is NotReady. Adding Pods on Node %s to eviction queue: %v is later than %v + %v",
node.Name,
decisionTimestamp,
nodeHealthData.readyTransitionTimestamp,
nc.podEvictionTimeout,
)
}
}
case v1.ConditionUnknown:
if decisionTimestamp.After(nodeHealthData.probeTimestamp.Add(nc.podEvictionTimeout)) {
enqueued, err := nc.evictPods(node, pods)
if err != nil {
return err
}
if enqueued {
klog.V(2).Infof("Node is unresponsive. Adding Pods on Node %s to eviction queues: %v is later than %v + %v",
node.Name,
decisionTimestamp,
nodeHealthData.readyTransitionTimestamp,
nc.podEvictionTimeout-gracePeriod,
)
}
}
case v1.ConditionTrue:
if nc.cancelPodEviction(node) {
klog.V(2).Infof("Node %s is ready again, cancelled pod eviction", node.Name)
}
}
return nil
}
To summarize, NodeLifecycleController's main responsibilities are:
(1) periodically check node heartbeats; when a node has not reported a heartbeat for a certain amount of time, update its ready condition to false or unknown and, if taint-based eviction is enabled, add a NoExecute taint to the node;
(2) when taint-based eviction is disabled, evict (delete) the pods on a node once its ready condition has been false or unknown for a configurable period of time;
(3) when taint-based eviction is enabled, evict (delete) the pods that do not tolerate a node's NoExecute taint as soon as the taint appears; pods that do tolerate it are evicted (deleted) only after the smallest toleration time among their matching tolerations has elapsed;
Nodes are divided into zones according to the region and zone label values on each node object; nodes whose region and zone values are both the same belong to the same zone.
A zone can be in one of four states:
(1) Initial: the initial state;
(2) FullDisruption: the number of ready nodes is 0 and the number of not ready nodes is greater than 0;
(3) PartialDisruption: the number of not ready nodes is greater than 2 and their fraction is greater than or equal to unhealthyZoneThreshold;
(4) Normal: every other situation falls into this state;
Also note how the secondary eviction rate affects evictions, i.e. the kcm startup parameter --secondary-node-eviction-rate: when the fraction of unhealthy nodes in a zone exceeds --unhealthy-zone-threshold (default 0.55), the eviction rate is reduced. If the zone is not a LargeCluster (no more than --large-cluster-size-threshold nodes, default 50), evictions stop entirely; if it is a LargeCluster, the eviction rate drops to --secondary-node-eviction-rate nodes per second, default 0.01.