--enable-taint-manager
WARNING: Beta feature. If set to true enables NoExecute Taints and will evict all not-tolerating Pods running on
Nodes tainted with this kind of Taints. (default true) When enabled, the controller adds NoExecute taints to a node when it becomes unready, removes them when the node becomes ready again, and evicts pods that do not tolerate the NoExecute taints.
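To make the flag's effect concrete, here is a minimal, self-contained sketch (not the controller's own code) of the NoExecute taints it places on unready/unreachable nodes and the kind of toleration a pod needs to delay eviction:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// NoExecute taints the node lifecycle controller puts on nodes that are
	// not ready / unreachable (well-known taint keys from k8s.io/api/core/v1).
	notReady := v1.Taint{Key: "node.kubernetes.io/not-ready", Effect: v1.TaintEffectNoExecute}
	unreachable := v1.Taint{Key: "node.kubernetes.io/unreachable", Effect: v1.TaintEffectNoExecute}

	// A pod toleration that delays eviction from a not-ready node for 300s;
	// without a matching NoExecute toleration the taint manager evicts the pod.
	seconds := int64(300)
	toleration := v1.Toleration{
		Key:               "node.kubernetes.io/not-ready",
		Operator:          v1.TolerationOpExists,
		Effect:            v1.TaintEffectNoExecute,
		TolerationSeconds: &seconds,
	}

	fmt.Printf("taints: %v %v\ntoleration: %v\n", notReady, unreachable, toleration)
}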
--large-cluster-size-threshold int32
Number of nodes from which NodeController treats the cluster as large for the eviction logic purposes.
--secondary-node-eviction-rate is implicitly overridden to 0 for clusters this size or smaller. (default 50) The number of nodes above which the cluster is treated as a large cluster.
--node-eviction-rate float32
Number of nodes per second on which pods are deleted in case of node failure when a zone is healthy (see
--unhealthy-zone-threshold for definition of healthy/unhealthy). Zone refers to entire cluster in non-multizone
clusters. (default 0.1) The eviction rate while the zone is healthy (i.e. the fraction of unhealthy nodes is below --unhealthy-zone-threshold).
--node-monitor-grace-period duration
Amount of time which we allow running Node to be unresponsive before marking it unhealthy. Must be N times more
than kubelet's nodeStatusUpdateFrequency, where N means number of retries allowed for kubelet to post node
status. (default 40s) How long a node may go unresponsive before it is considered unhealthy.
--node-startup-grace-period duration
Amount of time which we allow starting Node to be unresponsive before marking it unhealthy. (default 1m0s) How long a node that has just started may remain unresponsive before it is considered unhealthy.
--pod-eviction-timeout duration
The grace period for deleting pods on failed nodes. (default 5m0s) How long after a node becomes unhealthy its pods are deleted (only takes effect when the taint manager is not enabled).
--secondary-node-eviction-rate float32
Number of nodes per second on which pods are deleted in case of node failure when a zone is unhealthy (see
--unhealthy-zone-threshold for definition of healthy/unhealthy). Zone refers to entire cluster in non-multizone
clusters. This value is implicitly overridden to 0 if the cluster size is smaller than
--large-cluster-size-threshold. (default 0.01) When the zone is unhealthy, how many nodes per second have their pods evicted.
--unhealthy-zone-threshold float32
Fraction of Nodes in a zone which needs to be not Ready (minimum 3) for zone to be treated as unhealthy.
(default 0.55) The fraction of unhealthy nodes at which the zone is considered unhealthy.
--node-monitor-period duration
The period for syncing NodeStatus in NodeController. (default 5s) The period at which the controller actively scans all nodes and syncs their status (how often monitorNodeHealth runs).
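Taken together, these flags decide the eviction (or tainting) QPS the controller applies per zone. The sketch below is a simplified illustration of that decision; the function and parameter names are mine, not the controller's, and the real logic additionally never marks a zone with fewer than 3 nodes as unhealthy and treats full disruption separately:

package main

import "fmt"

// pickEvictionQPS is an illustrative approximation of how the per-zone
// eviction/tainting rate follows from the flags above.
func pickEvictionQPS(unreadyFraction float64, zoneSize int32,
	nodeEvictionRate, secondaryRate, unhealthyZoneThreshold float32,
	largeClusterThreshold int32) float32 {

	if unreadyFraction >= float64(unhealthyZoneThreshold) && zoneSize > 2 {
		// Zone is unhealthy (partial disruption): small clusters stop evicting,
		// large clusters fall back to the secondary rate.
		if zoneSize <= largeClusterThreshold {
			return 0
		}
		return secondaryRate
	}
	// Healthy zone: use the normal eviction rate.
	return nodeEvictionRate
}

func main() {
	// Defaults: --node-eviction-rate=0.1, --secondary-node-eviction-rate=0.01,
	// --unhealthy-zone-threshold=0.55, --large-cluster-size-threshold=50.
	fmt.Println(pickEvictionQPS(0.2, 100, 0.1, 0.01, 0.55, 50)) // 0.1  (healthy zone)
	fmt.Println(pickEvictionQPS(0.6, 100, 0.1, 0.01, 0.55, 50)) // 0.01 (unhealthy zone, large cluster)
	fmt.Println(pickEvictionQPS(0.6, 30, 0.1, 0.01, 0.55, 50))  // 0    (unhealthy zone, small cluster)
}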
type Controller struct {
	.........
	// Nodes found by the periodic active scan are stored here; used to detect newly added and deleted nodes.
	knownNodeSet map[string]*v1.Node
	// per Node map storing last observed health together with a local time when it was observed.
	// Node statuses read from the shared informer during the periodic scan are kept here.
	nodeHealthMap *nodeHealthMap

	// evictorLock protects zonePodEvictor and zoneNoExecuteTainter.
	// TODO(#83954): API calls shouldn't be executed under the lock.
	evictorLock sync.Mutex
	// Used only when the taint manager is not enabled; records whether the pods on a node
	// have already been evicted (a node's eviction status read from here is evicted or toBeEvicted).
	nodeEvictionMap *nodeEvictionMap
	// workers that evicts pods from unresponsive nodes.
	// Used only when the taint manager is not enabled; per-zone queues of nodes whose pods need to be evicted.
	zonePodEvictor map[string]*scheduler.RateLimitedTimedQueue
	// workers that are responsible for tainting nodes.
	// Used only when the taint manager is enabled; token-bucket rate-limited queues of unready
	// nodes whose taints need to be updated.
	zoneNoExecuteTainter map[string]*scheduler.RateLimitedTimedQueue
	// Health state of each zone: stateFullDisruption, statePartialDisruption, stateNormal or stateInitial.
	zoneStates map[string]ZoneState

	// Value controlling Controller monitoring period, i.e. how often does Controller
	// check node health signal posted from kubelet. This value should be lower than
	// nodeMonitorGracePeriod.
	// TODO: Change node health monitor to watch based.
	// The period of the active scan over all nodes.
	nodeMonitorPeriod time.Duration

	// When node is just created, e.g. cluster bootstrap or node creation, we give
	// a longer grace period.
	// For a node that has just registered, the timeout before it is considered unready.
	nodeStartupGracePeriod time.Duration

	// Controller will not proactively sync node health, but will monitor node
	// health signal updated from kubelet. There are 2 kinds of node healthiness
	// signals: NodeStatus and NodeLease. NodeLease signal is generated only when
	// NodeLease feature is enabled. If it doesn't receive update for this amount
	// of time, it will start posting "NodeReady==ConditionUnknown". The amount of
	// time before which Controller start evicting pods is controlled via flag
	// 'pod-eviction-timeout'.
	// Note: be cautious when changing the constant, it must work with
	// nodeStatusUpdateFrequency in kubelet and renewInterval in NodeLease
	// controller. The node health signal update frequency is the minimal of the
	// two.
	// There are several constraints:
	// 1. nodeMonitorGracePeriod must be N times more than the node health signal
	//    update frequency, where N means number of retries allowed for kubelet to
	//    post node status/lease. It is pointless to make nodeMonitorGracePeriod
	//    be less than the node health signal update frequency, since there will
	//    only be fresh values from Kubelet at an interval of node health signal
	//    update frequency. The constant must be less than podEvictionTimeout.
	// 2. nodeMonitorGracePeriod can't be too large for user experience - larger
	//    value takes longer for user to see up-to-date node health.
	// If a node does not update its status or lease for longer than this duration,
	// its Ready condition is set to Unknown.
	nodeMonitorGracePeriod time.Duration

	// How long after a node becomes unready the pods on it are evicted (taint manager disabled).
	podEvictionTimeout time.Duration
	// When the zone is healthy, how many nodes per second are processed for eviction / tainting.
	evictionLimiterQPS float32
	// When the zone is in statePartialDisruption and the node count exceeds largeClusterThreshold,
	// how many nodes per second are processed for eviction / tainting.
	secondaryEvictionLimiterQPS float32
	// The node count above which the cluster is considered large; used to decide whether, in
	// statePartialDisruption, the per-second eviction/tainting rate is set to 0 instead.
	largeClusterThreshold int32
	// The fraction of unready nodes at which a zone is considered to be in statePartialDisruption.
	unhealthyZoneThreshold float32

	// if set to true Controller will start TaintManager that will evict Pods from
	// tainted nodes, if they're not tolerated.
	runTaintManager bool

	// Workqueue without rate limiting.
	nodeUpdateQueue workqueue.Interface
	// Workqueue with rate limiting and exponential backoff.
	podUpdateQueue workqueue.RateLimitingInterface
}
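zonePodEvictor and zoneNoExecuteTainter are per-zone rate-limited queues from the controller's internal scheduler package. The following is a rough, self-contained sketch of that token-bucket pattern, built on client-go's flowcontrol limiter rather than the real RateLimitedTimedQueue, just to show why eviction/tainting proceeds at evictionLimiterQPS nodes per second:

package main

import (
	"fmt"
	"sync"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

// zoneQueue is an illustrative stand-in for one zone's rate-limited queue.
type zoneQueue struct {
	mu      sync.Mutex
	nodes   []string
	limiter flowcontrol.RateLimiter
}

func newZoneQueue(qps float32) *zoneQueue {
	return &zoneQueue{limiter: flowcontrol.NewTokenBucketRateLimiter(qps, 1)}
}

func (q *zoneQueue) Add(node string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.nodes = append(q.nodes, node)
}

// Try processes queued nodes, but only as fast as the token bucket allows.
func (q *zoneQueue) Try(process func(node string)) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.nodes) > 0 && q.limiter.TryAccept() {
		process(q.nodes[0])
		q.nodes = q.nodes[1:]
	}
}

func main() {
	q := newZoneQueue(0.1) // --node-eviction-rate default: roughly one node every 10s
	q.Add("node-1")
	q.Add("node-2")
	for i := 0; i < 3; i++ {
		q.Try(func(node string) { fmt.Println("tainting/evicting", node) })
		time.Sleep(time.Second)
	}
}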
// Run starts an asynchronous loop that monitors the status of cluster nodes.
func (nc *Controller) Run(stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()

	klog.Infof("Starting node controller")
	defer klog.Infof("Shutting down node controller")

	if !cache.WaitForNamedCacheSync("taint", stopCh, nc.leaseInformerSynced, nc.nodeInformerSynced, nc.podInformerSynced, nc.daemonSetInformerSynced) {
		return
	}

	// The taint manager checks whether each pod tolerates the taints on its node
	// and evicts the pods that do not.
	if nc.runTaintManager {
		go nc.taintManager.Run(stopCh)
	}

	// Close node update queue to cleanup go routine.
	defer nc.nodeUpdateQueue.ShutDown()
	defer nc.podUpdateQueue.ShutDown()

	// Start workers to reconcile labels and/or update NoSchedule taint for nodes.
	for i := 0; i < scheduler.UpdateWorkerSize; i++ {
		// Thanks to "workqueue", each worker just need to get item from queue, because
		// the item is flagged when got from queue: if new event come, the new item will
		// be re-queued until "Done", so no more than one worker handle the same item and
		// no event missed.
		go wait.Until(nc.doNodeProcessingPassWorker, time.Second, stopCh)
	}

	for i := 0; i < podUpdateWorkerSize; i++ {
		go wait.Until(nc.doPodProcessingWorker, time.Second, stopCh)
	}

	if nc.runTaintManager {
		// Handling taint based evictions. Because we don't want a dedicated logic in TaintManager for NC-originated
		// taints and we normally don't rate limit evictions caused by taints, we need to rate limit adding taints.
		go wait.Until(nc.doNoExecuteTaintingPass, scheduler.NodeEvictionPeriod, stopCh)
	} else {
		// Managing eviction of nodes:
		// When we delete pods off a node, if the node was not empty at the time we then
		// queue an eviction watcher. If we hit an error, retry deletion.
		go wait.Until(nc.doEvictionPass, scheduler.NodeEvictionPeriod, stopCh)
	}

	// Incorporate the results of node health signal pushed from kubelet to master.
	go wait.Until(func() {
		if err := nc.monitorNodeHealth(); err != nil {
			klog.Errorf("Error monitoring node health: %v", err)
		}
	}, nc.nodeMonitorPeriod, stopCh)

	<-stopCh
}
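All the workers started above follow the workqueue pattern described in the comment inside the first loop: Get marks an item as being processed so no other worker picks it up, and Done releases it so an update queued in the meantime is not lost. Below is a minimal standalone sketch of that pattern (this is not doNodeProcessingPassWorker itself):

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func worker(queue workqueue.Interface) {
	for {
		obj, shutdown := queue.Get()
		if shutdown {
			return
		}
		node := obj.(string)
		// The real worker would reconcile labels / the NoSchedule taint here.
		fmt.Println("processing node", node)
		// Done tells the queue processing has finished; if the same node was added
		// again in the meantime, it becomes available to (exactly one) worker again.
		queue.Done(obj)
	}
}

func main() {
	queue := workqueue.New()
	queue.Add("node-1")
	queue.Add("node-1") // deduplicated while still pending
	queue.Add("node-2")
	// ShutDown stops accepting new items; items already queued are still handed
	// out by Get before it reports shutdown.
	queue.ShutDown()

	worker(queue)
}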