Garden | Kube-Scheduler Policy

This article analyzes the scheduling policies built into Kubernetes.

Predicate

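Overview

Policy name	Algorithm
CheckNodeUnschedulable	A Node that carries the NodeUnschedulable mark is not scheduled to.
CheckVolumeBinding	Validates the logic of the PVC-to-PV binding process.
GeneralPredicates	The combination of PodFitsHostPorts, PodFitsResources, HostName, and MatchNodeSelector.
MatchInterPodAffinity	Affinity check: the Node is schedulable when none of the Pods running on it and the pending Pod exclude each other.
MaxAzureDiskVolumeCount	The Node is unschedulable when the Azure Disk volumes attached to it exceed the default limit.
MaxCSIVolumeCountPred	The Node is unschedulable when the CSI volumes attached to it exceed the default limit.
MaxEBSVolumeCount	The Node is unschedulable when the AWS EBS volumes attached to it exceed the default limit of 39.
MaxGCEPDVolumeCount	The Node is unschedulable when the GCE Persistent Disks attached to it exceed the default limit of 16.
MaxQcloudCbsVolumeCount	The Node is unschedulable when the Qcloud CBS volumes attached to it exceed the default limit.
NoDiskConflict	The Node is unschedulable when the volumes used by the Pods already on it conflict with the volumes used by the pending Pod.
NoVolumeZoneConflict	The Node is schedulable when its zone labels include the zone labels of the Pod's PVs. A Node without zone labels has no zone restriction and is also schedulable.
PodToleratesNodeTaints	The Node is schedulable only when the Pod tolerates all of the Node's taints.
PodFitsHostPorts	The Pod cannot be scheduled when a HostPort used by any of its containers conflicts with a port already in use on the Node.
PodFitsResources	The Pod cannot be scheduled when (total resources - sum of the requests of all Pods on the Node) < the pending Pod's total requests.
HostName	If the pending Pod specifies pod.Spec.NodeName, it is scheduled to that host.
MatchNodeSelector	Verifies that Pod.Spec.Affinity.NodeAffinity and Pod.Spec.NodeSelector match the Node's labels.
CheckNodeMemoryPressure	When the Node is under memory pressure, BestEffort Pods cannot be scheduled to it.
CheckNodeDiskPressure	When the Node is under disk pressure, Pods cannot be scheduled to it.
EvenPodsSpread	Enabled by default in 1.18. Enforces the spreading requirement of a group of matching Pods over the specified TopologyKey.
CheckNodeLabelPresence	Checks whether the specified labels exist on the Node.
CheckServiceAffinityPred	Schedules based on the nodes where the other Pods of the Service that the current Pod belongs to are already running. The goal is to place Pods of the same Service on the same node or the same class of nodes to improve efficiency. This predicate tries to schedule Pods whose node selectors carry specific labels to nodes that have the same labels; the exact labels are defined by the user.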

Storage-related

NoVolumeZoneConflictPred

When zones are used in a k8s cluster, every Node is marked with zone labels. The following four label keys are the common ones:

LabelZoneFailureDomain       = "failure-domain.beta.kubernetes.io/zone"
LabelZoneRegion              = "failure-domain.beta.kubernetes.io/region"
LabelZoneFailureDomainStable = "topology.kubernetes.io/zone"
LabelZoneRegionStable        = "topology.kubernetes.io/region"

For example:

apiVersion: v1
kind: Node
metadata:
  name: 10.0.1.28
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2020-07-20T12:11:34Z"
  resourceVersion: "334106446"
  selfLink: /api/v1/nodes/10.0.1.28
  uid: 5943d3fc-0841-43f2-b519-c32af755c1c5
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: QCLOUD
    beta.kubernetes.io/os: linux
    cloud.tencent.com/node-instance-id: ins-r3gy6izp
    failure-domain.beta.kubernetes.io/region: bj
    failure-domain.beta.kubernetes.io/zone: "800002"
    topology.kubernetes.io/region: bj
    topology.kubernetes.io/zone: "800002"
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: 10.0.1.28
    kubernetes.io/os: linux
spec:
  podCIDR: 172.18.0.128/26
  podCIDRs:
  - 172.18.0.128/26
  providerID: qcloud:///800002/ins-r3gy6izp

When a Pod requires storage volumes, the scheduler checks whether the zone constraints of those volumes conflict with the Node's zone restrictions.

		for k, v := range pv.ObjectMeta.Labels {
			if !volumeZoneLabels.Has(k) {
				continue
			}
			nodeV, _ := nodeConstraints[k]
			volumeVSet, err := volumehelpers.LabelZonesToSet(v)
			if err != nil {
				klog.Warningf("Failed to parse label for %q: %q. Ignoring the label. err=%v. ", k, v, err)
				continue
			}

			if !volumeVSet.Has(nodeV) {
				klog.V(10).Infof("Won't schedule pod %q onto node %q due to volume %q (mismatch on %q)", pod.Name, node.Name, pvName, k)
				return framework.NewStatus(framework.UnschedulableAndUnresolvable, ErrReasonConflict)
			}
		}

The check passes only if every volume that belongs to the Pod matches the zone labels on the Node.
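
The matching rule can be illustrated with a minimal, self-contained sketch. It is not the scheduler's implementation: `volumeFitsNode` is a hypothetical helper, plain maps stand in for the PV and Node objects, and the `__` separator for multi-zone label values is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// zoneLabelKeys are the four label keys listed earlier in this section.
var zoneLabelKeys = map[string]bool{
	"failure-domain.beta.kubernetes.io/zone":   true,
	"failure-domain.beta.kubernetes.io/region": true,
	"topology.kubernetes.io/zone":              true,
	"topology.kubernetes.io/region":            true,
}

// volumeFitsNode reports whether the zone/region labels on a PV are
// compatible with the zone/region labels on a node.
func volumeFitsNode(pvLabels, nodeLabels map[string]string) bool {
	// Collect the node's zone constraints; a node without any is unrestricted.
	constraints := map[string]string{}
	for k, v := range nodeLabels {
		if zoneLabelKeys[k] {
			constraints[k] = v
		}
	}
	if len(constraints) == 0 {
		return true
	}
	for k, v := range pvLabels {
		if !zoneLabelKeys[k] {
			continue // not a zone constraint
		}
		// Assumption: a PV label value may list several zones joined by "__".
		matched := false
		for _, zone := range strings.Split(v, "__") {
			if zone == constraints[k] {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	node := map[string]string{"topology.kubernetes.io/zone": "800002"}
	pv := map[string]string{"topology.kubernetes.io/zone": "800001__800002"}
	fmt.Println(volumeFitsNode(pv, node))
}
```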

Registration logic:

	registry.registerPredicateConfigProducer(NoVolumeZoneConflictPred,
		func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Filter = appendToPluginSet(plugins.Filter, volumezone.Name, nil)
			return
		})

CheckVolumeBindingPred

Validates the logic of the PVC-to-PV binding process. The logic is fairly complex and is mostly about how to reuse PVs.

Registration logic:

	registry.registerPredicateConfigProducer(CheckVolumeBindingPred,
		func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.PreFilter = appendToPluginSet(plugins.PreFilter, volumebinding.Name, nil)
			plugins.Filter = appendToPluginSet(plugins.Filter, volumebinding.Name, nil)
			plugins.Reserve = appendToPluginSet(plugins.Reserve, volumebinding.Name, nil)
			plugins.PreBind = appendToPluginSet(plugins.PreBind, volumebinding.Name, nil)
			return
		})

NoDiskConflictPred

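Checks whether the Pod's volumes conflict with volumes already in use on this host, so that a SCSI-style storage volume is not mounted twice. If the host has already mounted a volume, other Pods that use the same volume cannot be scheduled to it. The exact rules differ between storage backends.

Registration logic: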

	registry.registerPredicateConfigProducer(NoDiskConflictPred,
		func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Filter = appendToPluginSet(plugins.Filter, volumerestrictions.Name, nil)
			return
		})

MaxCSIVolumeCountPred

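When a Pod requests volumes, the node may already have volumes attached, so the scheduler checks whether the total after adding this Pod exceeds the maximum number of volumes the Node allows. MaxCSIVolumeCountPred validates the per-node maximum PV count that the CSI plugin imposes for the provisioner specified on the PVC.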

	for volumeLimitKey, count := range newVolumeCount {
		maxVolumeLimit, ok := nodeVolumeLimits[v1.ResourceName(volumeLimitKey)]
		if ok {
			currentVolumeCount := attachedVolumeCount[volumeLimitKey]
			if currentVolumeCount+count > int(maxVolumeLimit) {
				return framework.NewStatus(framework.Unschedulable, ErrReasonMaxVolumeCountExceeded)
			}
		}
	}
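
The loop above can be tried out in isolation with the following sketch. `exceedsVolumeLimit` and the limit key are hypothetical, and plain maps replace the scheduler's own types; only the comparison logic mirrors the snippet.

```go
package main

import "fmt"

// exceedsVolumeLimit reports whether attaching the incoming volumes would push
// any limit key over the node's limit. Keys without a declared limit are not
// restricted, which corresponds to the `if ok` branch in the snippet above.
func exceedsVolumeLimit(attached, incoming, limits map[string]int) bool {
	for key, count := range incoming {
		limit, ok := limits[key]
		if !ok {
			continue
		}
		if attached[key]+count > limit {
			return true
		}
	}
	return false
}

func main() {
	// "csi-example" is a made-up limit key used only for illustration.
	limits := map[string]int{"csi-example": 2}
	attached := map[string]int{"csi-example": 1}

	fmt.Println(exceedsVolumeLimit(attached, map[string]int{"csi-example": 1}, limits))
	fmt.Println(exceedsVolumeLimit(attached, map[string]int{"csi-example": 2}, limits))
}
```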

Registration logic:

	registry.registerPredicateConfigProducer(MaxCSIVolumeCountPred,
		func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Filter = appendToPluginSet(plugins.Filter, nodevolumelimits.CSIName, nil)
			return
		})

MaxNonCSIVolumeCountPred

Storage plugins that do not follow the CSI standard must also satisfy a maximum PV count; the overall logic is similar.

	if numExistingVolumes+numNewVolumes > maxAttachLimit {
		// violates MaxEBSVolumeCount or MaxGCEPDVolumeCount
		return framework.NewStatus(framework.Unschedulable, ErrReasonMaxVolumeCountExceeded)
	}

MaxEBSVolumeCountPred

Registration logic:

	registry.registerPredicateConfigProducer(MaxEBSVolumeCountPred,
		func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Filter = appendToPluginSet(plugins.Filter, nodevolumelimits.EBSName, nil)
			return
		})

MaxGCEPDVolumeCountPred

Registration logic:

	registry.registerPredicateConfigProducer(MaxGCEPDVolumeCountPred,
		func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Filter = appendToPluginSet(plugins.Filter, nodevolumelimits.GCEPDName, nil)
			return
		})

MaxAzureDiskVolumeCountPred

Registration logic:

	registry.registerPredicateConfigProducer(MaxAzureDiskVolumeCountPred,
		func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Filter = appendToPluginSet(plugins.Filter, nodevolumelimits.AzureDiskName, nil)
			return
		})

MaxCinderVolumeCountPred

Registration logic:

	registry.registerPredicateConfigProducer(MaxCinderVolumeCountPred,
		func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Filter = appendToPluginSet(plugins.Filter, nodevolumelimits.CinderName, nil)
			return
		})

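Pod-to-Node matching

CheckNodeCondition: verifies that the node is ready to be scheduled to. It checks that, among the node.condition entries, the Ready condition is true and NetworkUnavailable is false, and that Node.Spec.Unschedulable is false.
PodFitsHostPorts: verifies whether the ports declared by the Pod's containers are already used by Pods assigned to the Node.
MatchNodeSelector: verifies that Pod.Spec.Affinity.NodeAffinity and Pod.Spec.NodeSelector match the Node's labels.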

PodFitsHostPortsPred

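The PodFitsHostPorts policy verifies whether the ports declared by the Pod's containers are already used by Pods assigned to the Node.

In the PreFilter phase, it collects the ports of all containers in the current Pod and writes them to cycleState.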

// PreFilter invoked at the prefilter extension point.
func (pl *NodePorts) PreFilter(ctx context.Context, cycleState *framework.CycleState, pod *v1.Pod) *framework.Status {
	s := getContainerPorts(pod)
	cycleState.Write(preFilterStateKey, preFilterState(s))
	return nil
}

In the Filter phase, it reads the ports requested by the current Pod from cycleState and compares them with the ports already in use on the node to see whether a conflict would occur.

func (pl *NodePorts) Filter(ctx context.Context, cycleState *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	wantPorts, err := getPreFilterState(cycleState)
	if err != nil {
		return framework.NewStatus(framework.Error, err.Error())
	}

	fits := fitsPorts(wantPorts, nodeInfo)
	if !fits {
		return framework.NewStatus(framework.Unschedulable, ErrReason)
	}

	return nil
}
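
The conflict check itself can be sketched as follows. This is a simplified stand-in, not the plugin's code: `hostPort` and `portsFit` are hypothetical names, and host IPs are ignored, so two ports conflict whenever protocol and port number are equal.

```go
package main

import "fmt"

// hostPort identifies a host port request; host IPs are left out of this sketch.
type hostPort struct {
	protocol string
	port     int32
}

// portsFit reports whether none of the wanted host ports is already in use.
func portsFit(want []hostPort, used map[hostPort]bool) bool {
	for _, p := range want {
		if p.port == 0 {
			continue // the container did not ask for a host port
		}
		if used[p] {
			return false
		}
	}
	return true
}

func main() {
	used := map[hostPort]bool{{protocol: "TCP", port: 8080}: true}

	fmt.Println(portsFit([]hostPort{{protocol: "TCP", port: 8080}}, used))
	fmt.Println(portsFit([]hostPort{{protocol: "UDP", port: 8080}}, used))
}
```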

Registration logic:

	registry.registerPredicateConfigProducer(PodFitsHostPortsPred,
		func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Filter = appendToPluginSet(plugins.Filter, nodeports.Name, nil)
			plugins.PreFilter = appendToPluginSet(plugins.PreFilter, nodeports.Name, nil)
			return
		})

PodFitsResourcesPred

Registration logic:

	registry.registerPredicateConfigProducer(PodFitsResourcesPred,
		func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Filter = appendToPluginSet(plugins.Filter, noderesources.FitName, nil)
			plugins.PreFilter = appendToPluginSet(plugins.PreFilter, noderesources.FitName, nil)
			if args.NodeResourcesFitArgs != nil {
				pluginConfig = append(pluginConfig,
					config.PluginConfig{Name: noderesources.FitName, Args: args.NodeResourcesFitArgs})
			}
			return
		})
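
The rule given for PodFitsResources in the overview table (the Pod does not fit when the node's remaining resources are less than the Pod's requests) can be sketched like this. The `resources` type and `fitsResources` function are hypothetical, and only CPU and memory are modeled.

```go
package main

import "fmt"

// resources holds the two quantities considered in this sketch.
type resources struct {
	milliCPU int64
	memory   int64 // bytes
}

// fitsResources reports whether the pod's requests fit into what is left on
// the node after subtracting the requests of the pods already placed there.
func fitsResources(allocatable, requested, pod resources) bool {
	freeCPU := allocatable.milliCPU - requested.milliCPU
	freeMem := allocatable.memory - requested.memory
	return pod.milliCPU <= freeCPU && pod.memory <= freeMem
}

func main() {
	allocatable := resources{milliCPU: 4000, memory: 8 << 30}
	requested := resources{milliCPU: 3500, memory: 2 << 30}

	fmt.Println(fitsResources(allocatable, requested, resources{milliCPU: 500, memory: 1 << 30}))
	fmt.Println(fitsResources(allocatable, requested, resources{milliCPU: 600, memory: 1 << 30}))
}
```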

PodToleratesNodeTaintsPred

The PodToleratesNodeTaints policy verifies that the Node's taints are covered by the Pod's tolerations. Only taints with the NoSchedule and NoExecute effects are checked; if one of them is not tolerated, an unschedulable status is returned.

func (pl *TaintToleration) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	if nodeInfo == nil || nodeInfo.Node() == nil {
		return framework.NewStatus(framework.Error, "invalid nodeInfo")
	}

	filterPredicate := func(t *v1.Taint) bool {
		// PodToleratesNodeTaints is only interested in NoSchedule and NoExecute taints.
		return t.Effect == v1.TaintEffectNoSchedule || t.Effect == v1.TaintEffectNoExecute
	}

	taint, isUntolerated := v1helper.FindMatchingUntoleratedTaint(nodeInfo.Node().Spec.Taints, pod.Spec.Tolerations, filterPredicate)
	if !isUntolerated {
		return nil
	}

	errReason := fmt.Sprintf("node(s) had taint {%s: %s}, that the pod didn't tolerate",
		taint.Key, taint.Value)
	return framework.NewStatus(framework.UnschedulableAndUnresolvable, errReason)
}
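
What "tolerated" means can be made concrete with a small sketch. The `taint` and `toleration` structs and both functions are simplified, hypothetical stand-ins for the API types and for the helper called in the Filter above; the matching rules assumed here are: effect and key must match when set, `Exists` ignores the value, and `Equal` (or an empty operator) compares values.

```go
package main

import "fmt"

type taint struct{ key, value, effect string }

type toleration struct{ key, operator, value, effect string }

// tolerates reports whether a single toleration covers a single taint.
func tolerates(tol toleration, t taint) bool {
	if tol.effect != "" && tol.effect != t.effect {
		return false
	}
	if tol.key != "" && tol.key != t.key {
		return false
	}
	switch tol.operator {
	case "Exists":
		return true
	case "", "Equal":
		return tol.value == t.value
	}
	return false
}

// firstUntolerated returns the first NoSchedule/NoExecute taint that no
// toleration covers; other effects are skipped, as in the Filter above.
func firstUntolerated(taints []taint, tols []toleration) (taint, bool) {
	for _, t := range taints {
		if t.effect != "NoSchedule" && t.effect != "NoExecute" {
			continue
		}
		covered := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				covered = true
				break
			}
		}
		if !covered {
			return t, true
		}
	}
	return taint{}, false
}

func main() {
	taints := []taint{{key: "dedicated", value: "gpu", effect: "NoSchedule"}}
	tols := []toleration{{key: "dedicated", operator: "Exists"}}

	_, blocked := firstUntolerated(taints, nil)
	fmt.Println("without toleration, blocked:", blocked)
	_, blocked = firstUntolerated(taints, tols)
	fmt.Println("with toleration, blocked:", blocked)
}
```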

Registration logic:

	registry.registerPredicateConfigProducer(PodToleratesNodeTaintsPred,
		func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Filter = appendToPluginSet(plugins.Filter, tainttoleration.Name, nil)
			return
		})

HostNamePred

The HostNamePred policy, implemented by the NodeName plugin, checks whether the node name declared in the Pod spec matches the Node's actual name.

// Filter invoked at the filter extension point.
func (pl *NodeName) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	if nodeInfo.Node() == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	if !Fits(pod, nodeInfo) {
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, ErrReason)
	}
	return nil
}

// Fits actually checks if the pod fits the node.
func Fits(pod *v1.Pod, nodeInfo *framework.NodeInfo) bool {
	return len(pod.Spec.NodeName) == 0 || pod.Spec.NodeName == nodeInfo.Node().Name
}

Registration logic:

	registry.registerPredicateConfigProducer(HostNamePred,
		func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Filter = appendToPluginSet(plugins.Filter, nodename.Name, nil)
			return
		})

MatchNodeSelectorPred

The MatchNodeSelectorPred policy verifies that Pod.Spec.Affinity.NodeAffinity and Pod.Spec.NodeSelector match the Node's labels.

func PodMatchesNodeSelectorAndAffinityTerms(pod *v1.Pod, node *v1.Node) bool {
	// Check if node.Labels match pod.Spec.NodeSelector.
	if len(pod.Spec.NodeSelector) > 0 {
		selector := labels.SelectorFromSet(pod.Spec.NodeSelector)
		if !selector.Matches(labels.Set(node.Labels)) {
			return false
		}
	}

	// 1. nil NodeSelector matches all nodes (i.e. does not filter out any nodes)
	// 2. nil []NodeSelectorTerm (equivalent to non-nil empty NodeSelector) matches no nodes
	// 3. zero-length non-nil []NodeSelectorTerm matches no nodes also, just for simplicity
	// 4. nil []NodeSelectorRequirement (equivalent to non-nil empty NodeSelectorTerm) matches no nodes
	// 5. zero-length non-nil []NodeSelectorRequirement matches no nodes also, just for simplicity
	// 6. non-nil empty NodeSelectorRequirement is not allowed
	nodeAffinityMatches := true
	affinity := pod.Spec.Affinity
	if affinity != nil && affinity.NodeAffinity != nil {
		nodeAffinity := affinity.NodeAffinity
		// if no required NodeAffinity requirements, will do no-op, means select all nodes.
		// TODO: Replace next line with subsequent commented-out line when implement RequiredDuringSchedulingRequiredDuringExecution.
		if nodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution == nil {
			// if nodeAffinity.RequiredDuringSchedulingRequiredDuringExecution == nil && nodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution == nil {
			return true
		}

		// Match node selector for requiredDuringSchedulingRequiredDuringExecution.
		// TODO: Uncomment this block when implement RequiredDuringSchedulingRequiredDuringExecution.
		// if nodeAffinity.RequiredDuringSchedulingRequiredDuringExecution != nil {
		// 	nodeSelectorTerms := nodeAffinity.RequiredDuringSchedulingRequiredDuringExecution.NodeSelectorTerms
		// 	klog.V(10).Infof("Match for RequiredDuringSchedulingRequiredDuringExecution node selector terms %+v", nodeSelectorTerms)
		// 	nodeAffinityMatches = nodeMatchesNodeSelectorTerms(node, nodeSelectorTerms)
		// }

		// Match node selector for requiredDuringSchedulingIgnoredDuringExecution.
		if nodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution != nil {
			nodeSelectorTerms := nodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms
			nodeAffinityMatches = nodeAffinityMatches && nodeMatchesNodeSelectorTerms(node, nodeSelectorTerms)
		}

	}
	return nodeAffinityMatches
}

This is a typical Node affinity example:

		{
			pod: &v1.Pod{
				Spec: v1.PodSpec{
					Affinity: &v1.Affinity{
						NodeAffinity: &v1.NodeAffinity{
							RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
								NodeSelectorTerms: []v1.NodeSelectorTerm{
									{
										MatchExpressions: []v1.NodeSelectorRequirement{
											{
												Key:      "kernel-version",
												Operator: v1.NodeSelectorOpGt,
												Values:   []string{"0204"},
											},
										},
									},
								},
							},
						},
					},
				},
			},
			labels: map[string]string{
				// We use two digit to denote major version and two digit for minor version.
				"kernel-version": "0206",
			},
			name: "Pod with matchExpressions using Gt operator that matches the existing node",
		},
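
In the test case above, the label value "0206" matches a Gt requirement of "0204" because the Gt operator compares the values as integers. The sketch below reproduces that comparison under this assumption; `labelGreaterThan` is a hypothetical helper, not a function from the scheduler.

```go
package main

import (
	"fmt"
	"strconv"
)

// labelGreaterThan parses both strings as base-10 integers and reports
// whether the node's label value is strictly greater than the threshold.
func labelGreaterThan(labelValue, threshold string) (bool, error) {
	lv, err := strconv.ParseInt(labelValue, 10, 64)
	if err != nil {
		return false, fmt.Errorf("label value %q is not an integer: %w", labelValue, err)
	}
	tv, err := strconv.ParseInt(threshold, 10, 64)
	if err != nil {
		return false, fmt.Errorf("threshold %q is not an integer: %w", threshold, err)
	}
	return lv > tv, nil
}

func main() {
	ok, err := labelGreaterThan("0206", "0204")
	fmt.Println(ok, err)
}
```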

Registration logic:

	registry.registerPredicateConfigProducer(MatchNodeSelectorPred,
		func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Filter = appendToPluginSet(plugins.Filter, nodeaffinity.Name, nil)
			return
		})

GeneralPred

Registration logic:

	registry.registerPredicateConfigProducer(GeneralPred,
		func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			// GeneralPredicate is a combination of predicates.
			plugins.Filter = appendToPluginSet(plugins.Filter, noderesources.FitName, nil)
			plugins.PreFilter = appendToPluginSet(plugins.PreFilter, noderesources.FitName, nil)
			if args.NodeResourcesFitArgs != nil {
				pluginConfig = append(pluginConfig,
					config.PluginConfig{Name: noderesources.FitName, Args: args.NodeResourcesFitArgs})
			}
			plugins.Filter = appendToPluginSet(plugins.Filter, nodename.Name, nil)
			plugins.Filter = appendToPluginSet(plugins.Filter, nodeports.Name, nil)
			plugins.PreFilter = appendToPluginSet(plugins.PreFilter, nodeports.Name, nil)
			plugins.Filter = appendToPluginSet(plugins.Filter, nodeaffinity.Name, nil)
			return
		})

CheckNodeUnschedulablePred

CheckNodeUnschedulable: if a node carries the NodeUnschedulable mark, Pods are not scheduled to it. Such a node looks like this:

	node: &v1.Node{
		Spec: v1.NodeSpec{
			Unschedulable: true,
		},
	},

In version 1.16 this Unschedulable flag became a taint. The check therefore verifies whether the Pod's tolerations tolerate that taint: a Pod that tolerates the unschedulable taint also tolerates the unschedulable flag in NodeSpec.

func (pl *NodeUnschedulable) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	if nodeInfo == nil || nodeInfo.Node() == nil {
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, ErrReasonUnknownCondition)
	}
	// If pod tolerate unschedulable taint, it's also tolerate `node.Spec.Unschedulable`.
	podToleratesUnschedulable := v1helper.TolerationsTolerateTaint(pod.Spec.Tolerations, &v1.Taint{
		Key:    v1.TaintNodeUnschedulable,
		Effect: v1.TaintEffectNoSchedule,
	})
	// TODO (k82cn): deprecates `node.Spec.Unschedulable` in 1.13.
	if nodeInfo.Node().Spec.Unschedulable && !podToleratesUnschedulable {
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, ErrReasonUnschedulable)
	}
	return nil
}

Registration logic:

	registry.registerPredicateConfigProducer(CheckNodeUnschedulablePred,
		func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Filter = appendToPluginSet(plugins.Filter, nodeunschedulable.Name, nil)
			return
		})

CheckNodeLabelPresencePred

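The CheckNodeLabelPresencePred policy checks whether specified labels exist on a Node. It covers two cases:

The first checks that the Node has the specified labels. For example, when region/zone/rack labels partition the cluster, you may want Pods scheduled to Nodes with a particular region/zone/rack.
The second checks that the Node does not have the specified labels. For example, some Nodes are labeled retiring, and you want Pods kept off those Nodes.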

func (pl *NodeLabel) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	nodeLabels := labels.Set(node.Labels)
	check := func(labels []string, presence bool) bool {
		for _, label := range labels {
			exists := nodeLabels.Has(label)
			if (exists && !presence) || (!exists && presence) {
				return false
			}
		}
		return true
	}
	if check(pl.args.PresentLabels, true) && check(pl.args.AbsentLabels, false) {
		return nil
	}

	return framework.NewStatus(framework.UnschedulableAndUnresolvable, ErrReasonPresenceViolated)
}

For this policy, the plugin arguments must be set at registration time.

Registration logic:

	registry.registerPredicateConfigProducer(CheckNodeLabelPresencePred,
		func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Filter = appendToPluginSet(plugins.Filter, nodelabel.Name, nil)
			if args.NodeLabelArgs != nil {
				pluginConfig = append(pluginConfig,
					config.PluginConfig{Name: nodelabel.Name, Args: args.NodeLabelArgs})
			}
			return
		})

Pod-to-Pod matching

MatchInterPodAffinityPred

Registration logic:

	registry.registerPredicateConfigProducer(MatchInterPodAffinityPred,
		func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Filter = appendToPluginSet(plugins.Filter, interpodaffinity.Name, nil)
			plugins.PreFilter = appendToPluginSet(plugins.PreFilter, interpodaffinity.Name, nil)
			return
		})

Pod service spreading

EvenPodsSpread

Registration logic:

	registry.registerPredicateConfigProducer(EvenPodsSpreadPred,
		func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.PreFilter = appendToPluginSet(plugins.PreFilter, podtopologyspread.Name, nil)
			plugins.Filter = appendToPluginSet(plugins.Filter, podtopologyspread.Name, nil)
			return
		})

CheckServiceAffinity

Registration logic:

	registry.registerPredicateConfigProducer(CheckServiceAffinityPred,
		func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Filter = appendToPluginSet(plugins.Filter, serviceaffinity.Name, nil)
			if args.ServiceAffinityArgs != nil {
				pluginConfig = append(pluginConfig,
					config.PluginConfig{Name: serviceaffinity.Name, Args: args.ServiceAffinityArgs})
			}
			plugins.PreFilter = appendToPluginSet(plugins.PreFilter, serviceaffinity.Name, nil)
			return
		})

Priority

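Overview

Policy name	Algorithm	Weight
BalancedResourceAllocation*	The closer the CPU and memory utilization are to each other, the higher the score.	1
ImageLocalityPriority*	The more of the images required by the pending Pod a node already has, the higher the score.	1
InterPodAffinityPriority*	The better the Pod's affinity matches the other Pods running on the Node, the higher the score.	1
LeastRequestedPriority*	The more resources remain, the higher the score.	1
NodeAffinityPriority*	The better the Pod's affinity matches the Node, the higher the score.	1
NodePreferAvoidPodsPriority*	When the Node's annotation scheduler.alpha.kubernetes.io/preferAvoidPods is set, the Node prefers not to receive the Pod and scores low.	10000
SelectorSpreadPriority*	The more spread out the Pods of the same service/rc, the higher the score.	1
TaintTolerationPriority*	The more of the Node's taints the Pod tolerates, the higher the score.	1
ServiceSpreadingPriority	The more spread out the Pods of the same Service, the higher the score. Superseded by SelectorSpreadPriority; kept in the system but not used.	1
EqualPriority	All nodes score the same.	1
MostRequestedPriority	The more resources requested, the higher the score; the opposite of LeastRequestedPriority.	1
EvenPodsSpreadPriority	Enabled by default in 1.18. Expresses the spreading requirement of a group of matching Pods over a topology. It is flexible and customizable, but also relatively complex to use.	2
RequestedToCapacityRatioPriority	Lets users implement bin packing for CPU, memory, and extended resources such as accelerator cards.
NodeLabel	Gives priority to Nodes with specific labels. The algorithm is simple: at startup it takes the label values configured in the scheduling policy (SchedulerPolicy) and checks whether a Node satisfies the label condition; nodes that do are preferred.
ServiceAffinity	Supports balancing the Pods of a Service according to the value of a Node label.

Scoring algorithms mainly address cluster fragmentation, disaster tolerance, watermark levels, affinity, and anti-affinity. They fall into the following four categories.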

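Resource watermark

Terms used in the resource watermark formulas: Request is the amount of resources already allocated on the Node; Allocatable is the Node's schedulable resources.
Spread first: place the Pod on the node with the highest free-resource ratio, rather than the node with the most free resources in absolute terms. Formula: free ratio = (Allocatable - Request) / Allocatable. The larger the value, the higher the score, and high-scoring nodes are preferred. Here (Allocatable - Request) is the amount of resources left free after the Pod is assigned to the node.
Stack first: place the Pod on the node with the highest resource usage ratio. Formula: usage ratio = Request / Allocatable. The higher the usage ratio, the higher the score, and high-scoring nodes are preferred.
Fragmentation ratio: the difference between the usage ratios of the different resources on a Node. CPU, Mem, and Disk are currently supported. Considering only CPU and Mem, fragmentation ratio = Abs[CPU(Request / Allocatable) - Mem(Request / Allocatable)]. For example, when the CPU allocation rate is 99% and the memory allocation rate is 50%, the fragmentation ratio = 99% - 50% = 49%. In this example 1% of CPU and 50% of Mem remain, and hardly any container spec can use up that memory. Score = 1 - fragmentation ratio, so the higher the fragmentation ratio, the lower the score.
Specified ratio: when the scheduler starts, a score can be configured for each resource usage ratio, which controls the distribution curve of resource allocation across the cluster's nodes.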
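
The three ratios defined above are simple enough to check numerically. The following sketch uses hypothetical function names and plain float64 values; it only restates the formulas from the list and reruns the 99% CPU / 50% memory example.

```go
package main

import (
	"fmt"
	"math"
)

// freeRatio is (Allocatable - Request) / Allocatable; higher favors spreading.
func freeRatio(allocatable, request float64) float64 {
	return (allocatable - request) / allocatable
}

// usageRatio is Request / Allocatable; higher favors stacking.
func usageRatio(allocatable, request float64) float64 {
	return request / allocatable
}

// fragmentation is the absolute difference between the CPU and memory
// usage ratios; the score in the text is 1 - fragmentation.
func fragmentation(cpuUsage, memUsage float64) float64 {
	return math.Abs(cpuUsage - memUsage)
}

func main() {
	cpu := usageRatio(100, 99) // 99% of CPU allocated
	mem := usageRatio(100, 50) // 50% of memory allocated
	frag := fragmentation(cpu, mem)

	fmt.Printf("fragmentation=%.2f score=%.2f\n", frag, 1-frag)
	fmt.Printf("free CPU ratio=%.2f\n", freeRatio(100, 99))
}
```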

LeastRequestedPriority

The LeastRequestedPriority policy gives higher priority to Nodes with lower utilization. This balances resource usage across nodes.

The formula is: $$ (cpu((capacity-sum(requested))*MaxNodeScore/capacity) + memory((capacity-sum(requested))*MaxNodeScore/capacity))/weightSum $$

Registration logic:

	registry.registerPriorityConfigProducer(LeastRequestedPriority,
		func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Score = appendToPluginSet(plugins.Score, noderesources.LeastAllocatedName, &args.Weight)
			return
		})

MostRequestedPriority

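The MostRequestedPriority policy gives higher priority to Nodes with higher utilization. It suits dynamically scaled clusters: Pods are scheduled to the most utilized hosts first, so that when the cluster scales in, idle machines are freed up and can be shut down.

The formula is: $$ (cpu(MaxNodeScore * sum(requested) / capacity) + memory(MaxNodeScore * sum(requested) / capacity)) / weightSum $$

Registration logic: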

	registry.registerPriorityConfigProducer(MostRequestedPriority,
		func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Score = appendToPluginSet(plugins.Score, noderesources.MostAllocatedName, &args.Weight)
			return
		})
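
The LeastRequestedPriority and MostRequestedPriority formulas mirror each other, which a worked example makes easy to see. The sketch below is not the plugin code: the function names are hypothetical, MaxNodeScore is taken to be 100, CPU and memory are given equal weight (so weightSum is 2), and integer arithmetic is used.

```go
package main

import "fmt"

const maxNodeScore = 100 // assumed value of MaxNodeScore

// leastRequested scores one resource: the more capacity is left, the higher.
func leastRequested(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return (capacity - requested) * maxNodeScore / capacity
}

// mostRequested scores one resource: the more capacity is used, the higher.
func mostRequested(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return requested * maxNodeScore / capacity
}

// combine averages the CPU and memory scores with equal weights.
func combine(cpuScore, memScore int64) int64 {
	return (cpuScore + memScore) / 2
}

func main() {
	// Node: 4000m CPU with 1000m requested, 8 GiB memory with 4 GiB requested.
	cpuReq, cpuCap := int64(1000), int64(4000)
	memReq, memCap := int64(4<<30), int64(8<<30)

	least := combine(leastRequested(cpuReq, cpuCap), leastRequested(memReq, memCap))
	most := combine(mostRequested(cpuReq, cpuCap), mostRequested(memReq, memCap))
	fmt.Println("least requested:", least, "most requested:", most)
}
```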

BalancedResourceAllocation

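BalancedResourceAllocation prefers machines whose resources are more balanced after the Pod is deployed. It cannot be used alone and must be used together with LeastRequestedPriority. It computes the fractions of CPU and memory in use on the host, and the host's score is determined by the "distance" between the CPU fraction and the memory fraction.

The formula is: $$ score = (1 - variance(cpuFraction,memoryFraction,volumeFraction)) * MaxNodeScore $$

Registration logic: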

	registry.registerPriorityConfigProducer(BalancedResourceAllocation,
		func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Score = appendToPluginSet(plugins.Score, noderesources.BalancedAllocationName, &args.Weight)
			return
		})
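
The BalancedResourceAllocation formula can be evaluated with a short sketch. The names are hypothetical, MaxNodeScore is assumed to be 100, and variance is taken as the population variance of the fractions; the real plugin may treat edge cases differently.

```go
package main

import "fmt"

const maxNodeScore = 100 // assumed value of MaxNodeScore

// variance returns the population variance of the given fractions.
func variance(fractions ...float64) float64 {
	if len(fractions) == 0 {
		return 0
	}
	var sum float64
	for _, f := range fractions {
		sum += f
	}
	mean := sum / float64(len(fractions))

	var sq float64
	for _, f := range fractions {
		sq += (f - mean) * (f - mean)
	}
	return sq / float64(len(fractions))
}

// balancedScore implements score = (1 - variance(...)) * MaxNodeScore.
func balancedScore(fractions ...float64) int64 {
	return int64((1 - variance(fractions...)) * maxNodeScore)
}

func main() {
	fmt.Println("balanced node:  ", balancedScore(0.5, 0.5, 0.5))
	fmt.Println("unbalanced node:", balancedScore(0.9, 0.1, 0.5))
}
```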

RequestedToCapacityRatioPriority

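RequestedToCapacityRatioPriority lets users implement bin packing for CPU, memory, and extended resources such as accelerator cards.

Bin packing is a classic problem in operations research. Given a number of small boxes to be packed into a finite number of bins of a given size, the question is how to pack as many as possible, as quickly as possible, so that each bin is filled as far as possible and fewer bins are used. The problem has many variants: when the number of bins is limited to 1 and each box has a value and a weight, bin packing becomes the knapsack problem.

The resource scheduling strategy that Kubernetes enables by default is a spread strategy: resources are spread out as much as possible, which produces more resource fragments and lowers overall utilization. RequestedToCapacityRatioPriority lets you configure weights for CPU, memory, and extended resources such as GPUs. In the scoring phase it computes the utilization of those resources and ranks nodes by it, filling one node before moving on to the next, which achieves bin packing.

The behavior of the RequestedToCapacityRatioResourceAllocation priority function is controlled by a configuration option named requestedToCapacityRatioArguments. The option consists of two parameters, shape and resources. shape lets the user tune the function toward least requested or most requested behavior through utilization and score values. resources consists of name and weight: name specifies the resource to consider when scoring, and weight specifies the weight of each resource.

The following configuration example sets requestedToCapacityRatioArguments to bin-packing behavior for the extended resources intel.com/foo and intel.com/bar: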

{
    "kind" : "Policy",
    "apiVersion" : "v1",
    ...
    "priorities" : [
       ...
      {
        "name": "RequestedToCapacityRatioPriority",
        "weight": 2,
        "argument": {
          "requestedToCapacityRatioArguments": {
            "shape": [
              {"utilization": 0, "score": 0},
              {"utilization": 100, "score": 10}
            ],
            "resources": [
              {"name": "intel.com/foo", "weight": 3},
              {"name": "intel.com/bar", "weight": 5}
            ]
          }
        }
      }
    ]
}

In effect, the shape parameter defines the score for each utilization level, which generalizes LeastRequestedPriority and MostRequestedPriority.

This configuration corresponds to LeastRequestedPriority:

 {"utilization": 0, "score": 10},
 {"utilization": 100, "score": 0}

This configuration corresponds to MostRequestedPriority:

 {"utilization": 0, "score": 0},
 {"utilization": 100, "score": 10}
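
One way to read a shape is as a piecewise-linear function from utilization to score. The sketch below interpolates between shape points under that assumption; `shapePoint` and `scoreForUtilization` are hypothetical names, the points are expected to be sorted by utilization, and integer arithmetic is used.

```go
package main

import "fmt"

// shapePoint is one {"utilization": ..., "score": ...} entry of a shape.
type shapePoint struct {
	utilization int64
	score       int64
}

// scoreForUtilization linearly interpolates the score for utilization u.
// Values outside the shape's range are clamped to the first or last point.
func scoreForUtilization(shape []shapePoint, u int64) int64 {
	if len(shape) == 0 {
		return 0
	}
	if u <= shape[0].utilization {
		return shape[0].score
	}
	for i := 1; i < len(shape); i++ {
		if u <= shape[i].utilization {
			prev, cur := shape[i-1], shape[i]
			return prev.score + (cur.score-prev.score)*(u-prev.utilization)/(cur.utilization-prev.utilization)
		}
	}
	return shape[len(shape)-1].score
}

func main() {
	leastShape := []shapePoint{{utilization: 0, score: 10}, {utilization: 100, score: 0}}
	mostShape := []shapePoint{{utilization: 0, score: 0}, {utilization: 100, score: 10}}

	// At 30% utilization the two shapes rank the node in opposite directions.
	fmt.Println("least-requested shape:", scoreForUtilization(leastShape, 30))
	fmt.Println("most-requested shape: ", scoreForUtilization(mostShape, 30))
}
```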

Registration logic:

	registry.registerPriorityConfigProducer(noderesources.RequestedToCapacityRatioName,
		func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Score = appendToPluginSet(plugins.Score, noderesources.RequestedToCapacityRatioName, &args.Weight)
			if args.RequestedToCapacityRatioArgs != nil {
				pluginConfig = append(pluginConfig,
					config.PluginConfig{Name: noderesources.RequestedToCapacityRatioName, Args: args.RequestedToCapacityRatioArgs})
			}
			return
		})

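Pod spreading

Pod spreading supports the requirement that a group of matching Pods be deployed across different topologies.

ServiceSpreadingPriority

ServiceSpreadingPriority: the official comments describe it as largely replaced by SelectorSpreadPriority, which matches the summary table above. As for the idea behind it, my personal understanding is that a Service represents a group of Pods providing one service, so being able to spread Pods per Service was considered sufficient.

Registration logic (the snippet given here is the EvenPodsSpreadPred registration, identical to the one in the EvenPodsSpread predicate section):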

	registry.registerPredicateConfigProducer(EvenPodsSpreadPred,
		func(_ ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.PreFilter = appendToPluginSet(plugins.PreFilter, podtopologyspread.Name, nil)
			plugins.Filter = appendToPluginSet(plugins.Filter, podtopologyspread.Name, nil)
			return
		})

EvenPodsSpread

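EvenPodsSpreadPriority expresses the spreading requirement of a group of matching Pods over a topology. It is flexible and customizable, but also relatively complex to use. Because the usage may keep changing, assume the topology is as follows: the Spec requires spreading across nodes. Following the formula from the accompanying figure, we count the Pods on a node that match the labelSelector given in the Spec, compute the maximum difference, and then compute the weight assigned to the Node; the larger the value, the higher the priority.

Registration logic: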

	registry.registerPriorityConfigProducer(EvenPodsSpreadPriority,
		func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.PreScore = appendToPluginSet(plugins.PreScore, podtopologyspread.Name, nil)
			plugins.Score = appendToPluginSet(plugins.Score, podtopologyspread.Name, &args.Weight)
			return
		})

CheckServiceAffinity

Registration logic:

	registry.registerPriorityConfigProducer(serviceaffinity.Name,
		func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			// If there are n ServiceAffinity priorities in the policy, the weight for the corresponding
			// score plugin is n*weight (note that the validation logic verifies that all ServiceAffinity
			// priorities specified in Policy have the same weight).
			weight := args.Weight * int32(len(args.ServiceAffinityArgs.AntiAffinityLabelsPreference))
			plugins.Score = appendToPluginSet(plugins.Score, serviceaffinity.Name, &weight)
			if args.ServiceAffinityArgs != nil {
				pluginConfig = append(pluginConfig,
					config.PluginConfig{Name: serviceaffinity.Name, Args: args.ServiceAffinityArgs})
			}
			return
		})

SelectorSpreadPriority

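SelectorSpreadPriority spreads all Pods under the controller that owns a Pod across Nodes. It works as follows: based on the controller that owns the pending Pod, it counts all Pods under that controller, say T in total, and groups them by the Node they run on; let N be the count for a given Node. The Node's score is then (T-N)/T. A larger value means fewer of the controller's Pods are deployed on that node, so the score is higher, which satisfies the workload's Pod spreading requirement.

Registration logic: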

	registry.registerPriorityConfigProducer(SelectorSpreadPriority,
		func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Score = appendToPluginSet(plugins.Score, selectorspread.Name, &args.Weight)
			plugins.PreScore = appendToPluginSet(plugins.PreScore, selectorspread.Name, nil)
			return
		})
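
The (T-N)/T rule described for SelectorSpreadPriority can be sketched as follows. `spreadScore` is a hypothetical helper; scaling the fraction to a maximum score of 100 and giving every node the maximum when the controller has no Pods yet are assumptions of this sketch.

```go
package main

import "fmt"

const maxNodeScore = 100 // assumed value of MaxNodeScore

// spreadScore scales (T-N)/T to the score range, where total is T (all pods
// of the controller) and onNode is N (those already on the node being scored).
func spreadScore(total, onNode int) int64 {
	if total == 0 {
		return maxNodeScore // nothing to spread against yet
	}
	return int64(total-onNode) * maxNodeScore / int64(total)
}

func main() {
	// The controller has 4 pods: 3 on node-a, 1 on node-b, 0 on node-c.
	podsPerNode := map[string]int{"node-a": 3, "node-b": 1, "node-c": 0}
	for _, name := range []string{"node-a", "node-b", "node-c"} {
		fmt.Println(name, spreadScore(4, podsPerNode[name]))
	}
}
```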

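Pod affinity/anti-affinity

InterPodAffinityPriority

InterPodAffinityPriority: first, the use cases. In the first example, application A provides data and application B provides a service; deploying A and B together lets them use the local network and optimizes network transfer. In the second example, applications A and B are both CPU-intensive and have been shown to interfere with each other, so this rule can be used to keep them off the same node as far as possible. The Pod affinity policy is similar to NodeAffinityPriority and supports two kinds of selectors: requiredDuringSchedulingIgnoredDuringExecution (the chosen host must satisfy all of the Pod's rules for hosts) and preferredDuringSchedulingIgnoredDuringExecution (the scheduler tries to satisfy all requirements of the selector but does not guarantee it). It has two sub-policies: podAffinity and podAntiAffinity.

Registration logic: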

	registry.registerPriorityConfigProducer(InterPodAffinityPriority,
		func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.PreScore = appendToPluginSet(plugins.PreScore, interpodaffinity.Name, nil)
			plugins.Score = appendToPluginSet(plugins.Score, interpodaffinity.Name, &args.Weight)
			if args.InterPodAffinityArgs != nil {
				pluginConfig = append(pluginConfig,
					config.PluginConfig{Name: interpodaffinity.Name, Args: args.InterPodAffinityArgs})
			}
			return
		})

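Node affinity/anti-affinity

NodeAffinityPriority: satisfies affinity and anti-affinity between Pods and Nodes.
ServiceAntiAffinity: supports balancing the Pods of a Service according to the value of a Node label. For example, if the cluster has two groups of nodes, on-cloud and off-cloud, and a service must be distributed evenly between them, and the Nodes carry a label that distinguishes the groups, ServiceAntiAffinity can spread the Pods accordingly.
NodeLabelPrioritizer: gives priority to Nodes with specific labels. The algorithm is simple: at startup it takes the label values configured in the scheduling policy (SchedulerPolicy) and checks whether a Node satisfies the label condition; nodes that do are preferred.
ImageLocalityPriority: this form of node affinity is mainly about image download speed. If a node already has the image, the Pod is preferentially scheduled there. Image size is also considered: when a Pod uses several images, larger images take longer to download, so nodes are preferred according to the total size of the required images they already hold.

NodePreferAvoidPodsPriority

The NodePreferAvoidPodsPriority policy lets certain controllers avoid certain nodes where possible. An annotation on the node declares which controllers should not be placed on it; nodes whose annotation does not match the Pod's controller are preferred.

Concretely, an annotation like the following is added to the Node: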

	annotations1 := map[string]string{
		v1.PreferAvoidPodsAnnotationKey: `
							{
							    "preferAvoidPods": [
							        {
							            "podSignature": {
							                "podController": {
							                    "apiVersion": "v1",
							                    "kind": "ReplicationController",
							                    "name": "foo",
							                    "uid": "abcdef123456",
							                    "controller": true
							                }
							            },
							            "reason": "some reason",
							            "message": "some message"
							        }
							    ]
							}`,
	}

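During the check, Pods that are not owned by a ReplicaSet or a ReplicationController are skipped and the node receives the highest score. If the Pod's controller matches the one recorded in the annotation, the node receives the lowest score.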

func (pl *NodePreferAvoidPods) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from Snapshot: %v", nodeName, err))
	}

	node := nodeInfo.Node()
	if node == nil {
		return 0, framework.NewStatus(framework.Error, "node not found")
	}

	controllerRef := metav1.GetControllerOf(pod)
	if controllerRef != nil {
		// Ignore pods that are owned by other controller than ReplicationController
		// or ReplicaSet.
		if controllerRef.Kind != "ReplicationController" && controllerRef.Kind != "ReplicaSet" {
			controllerRef = nil
		}
	}
	if controllerRef == nil {
		return framework.MaxNodeScore, nil
	}

	avoids, err := v1helper.GetAvoidPodsFromNodeAnnotations(node.Annotations)
	if err != nil {
		// If we cannot get annotation, assume it's schedulable there.
		return framework.MaxNodeScore, nil
	}
	for i := range avoids.PreferAvoidPods {
		avoid := &avoids.PreferAvoidPods[i]
		if avoid.PodSignature.PodController.Kind == controllerRef.Kind && avoid.PodSignature.PodController.UID == controllerRef.UID {
			return 0, nil
		}
	}
	return framework.MaxNodeScore, nil
}
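
To see the annotation and the scoring rule working together without the scheduler's packages, here is a standard-library sketch. The `controllerRef` and `avoidPods` types and the `score` function are hypothetical mirrors of the JSON structure shown earlier, and `maxNodeScore = 100` is assumed to correspond to `framework.MaxNodeScore`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// maxNodeScore is assumed to mirror framework.MaxNodeScore.
const maxNodeScore = 100

// controllerRef and avoidPods are simplified, hypothetical mirrors of the
// annotation's JSON structure.
type controllerRef struct {
	Kind string `json:"kind"`
	UID  string `json:"uid"`
}

type avoidPods struct {
	PreferAvoidPods []struct {
		PodSignature struct {
			PodController controllerRef `json:"podController"`
		} `json:"podSignature"`
	} `json:"preferAvoidPods"`
}

// score returns 0 when the annotation names the pod's controller and
// maxNodeScore otherwise. owner is nil for pods without a controller.
func score(annotation string, owner *controllerRef) int64 {
	if owner == nil || (owner.Kind != "ReplicationController" && owner.Kind != "ReplicaSet") {
		return maxNodeScore
	}
	var avoids avoidPods
	if err := json.Unmarshal([]byte(annotation), &avoids); err != nil {
		// An unreadable or missing annotation is treated as "no restriction".
		return maxNodeScore
	}
	for _, a := range avoids.PreferAvoidPods {
		c := a.PodSignature.PodController
		if c.Kind == owner.Kind && c.UID == owner.UID {
			return 0
		}
	}
	return maxNodeScore
}

func main() {
	annotation := `{"preferAvoidPods":[{"podSignature":{"podController":{"kind":"ReplicationController","uid":"abcdef123456"}}}]}`
	avoided := &controllerRef{Kind: "ReplicationController", UID: "abcdef123456"}
	other := &controllerRef{Kind: "ReplicaSet", UID: "zzz"}
	fmt.Println(score(annotation, avoided), score(annotation, other), score(annotation, nil))
}
```

Only an exact match on both Kind and UID pushes the score to zero; every other case, including a malformed annotation, keeps the node fully eligible.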

Algorithm registration logic:

	registry.registerPriorityConfigProducer(NodePreferAvoidPodsPriority,
		func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Score = appendToPluginSet(plugins.Score, nodepreferavoidpods.Name, &args.Weight)
			return
		})

NodeAffinityPriority

The NodeAffinityPriority policy satisfies affinity and anti-affinity between a Pod and a Node.

	affinity := pod.Spec.Affinity

	var count int64
	// A nil element of PreferredDuringSchedulingIgnoredDuringExecution matches no objects.
	// An element of PreferredDuringSchedulingIgnoredDuringExecution that refers to an
	// empty PreferredSchedulingTerm matches all objects.
	if affinity != nil && affinity.NodeAffinity != nil && affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution != nil {
		// Match PreferredDuringSchedulingIgnoredDuringExecution term by term.
		for i := range affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution {
			preferredSchedulingTerm := &affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution[i]
			if preferredSchedulingTerm.Weight == 0 {
				continue
			}

			// TODO: Avoid computing it for all nodes if this becomes a performance problem.
			nodeSelector, err := v1helper.NodeSelectorRequirementsAsSelector(preferredSchedulingTerm.Preference.MatchExpressions)
			if err != nil {
				return 0, framework.NewStatus(framework.Error, err.Error())
			}

			if nodeSelector.Matches(labels.Set(node.Labels)) {
				count += int64(preferredSchedulingTerm.Weight)
			}
		}
	}
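
The loop above can be tried out in isolation with the following sketch, which sums the weights of the preferred terms that a node matches. `preferredTerm` and `sumMatchedWeights` are hypothetical names, and label matching is simplified to exact key/value equality instead of full match expressions.

```go
package main

import "fmt"

// preferredTerm is a hypothetical, simplified preferred scheduling term:
// the node matches when it carries every listed label with the given value.
// A term with no labels matches every node.
type preferredTerm struct {
	weight int32
	labels map[string]string
}

func (t preferredTerm) matches(nodeLabels map[string]string) bool {
	for k, v := range t.labels {
		if got, ok := nodeLabels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// sumMatchedWeights skips zero-weight terms and adds up the weights of all
// terms that the node's labels satisfy.
func sumMatchedWeights(terms []preferredTerm, nodeLabels map[string]string) int64 {
	var count int64
	for _, t := range terms {
		if t.weight == 0 {
			continue
		}
		if t.matches(nodeLabels) {
			count += int64(t.weight)
		}
	}
	return count
}

func main() {
	terms := []preferredTerm{
		{weight: 80, labels: map[string]string{"disktype": "ssd"}},
		{weight: 20, labels: map[string]string{"zone": "a"}},
	}
	fmt.Println(sumMatchedWeights(terms, map[string]string{"disktype": "ssd", "zone": "b"}))
	fmt.Println(sumMatchedWeights(terms, map[string]string{"disktype": "ssd", "zone": "a"}))
}
```

A node matching only the first term collects 80, while a node matching both collects 100, so heavier terms dominate the ranking.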

Algorithm registration logic:

	registry.registerPriorityConfigProducer(NodeAffinityPriority,
		func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Score = appendToPluginSet(plugins.Score, nodeaffinity.Name, &args.Weight)
			return
		})

TaintTolerationPriority

With the TaintTolerationPriority policy, the more of a Node's taints a Pod can tolerate, the higher that Node's priority.

In the PreScore phase, the plugin collects all Tolerations whose Effect is PreferNoSchedule or empty and writes them to cycleState.

func (pl *TaintToleration) PreScore(ctx context.Context, cycleState *framework.CycleState, pod *v1.Pod, nodes []*v1.Node) *framework.Status {
	if len(nodes) == 0 {
		return nil
	}
	tolerationsPreferNoSchedule := getAllTolerationPreferNoSchedule(pod.Spec.Tolerations)
	state := &preScoreState{
		tolerationsPreferNoSchedule: tolerationsPreferNoSchedule,
	}
	cycleState.Write(preScoreStateKey, state)
	return nil
}

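In the Score phase, the raw score is the number of taints the Pod cannot tolerate, so more intolerable taints mean a higher raw score. The Normalize step later normalizes the scores and reverses their order, so such a node ends up with a lower priority.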

func (pl *TaintToleration) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil || nodeInfo.Node() == nil {
		return 0, framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from Snapshot: %v", nodeName, err))
	}
	node := nodeInfo.Node()

	s, err := getPreScoreState(state)
	if err != nil {
		return 0, framework.NewStatus(framework.Error, err.Error())
	}

	score := int64(countIntolerableTaintsPreferNoSchedule(node.Spec.Taints, s.tolerationsPreferNoSchedule))
	return score, nil
}
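
The count-then-reverse behaviour can be sketched as follows. All names here (`taint`, `toleration`, `countIntolerable`, `normalizeReverse`) are hypothetical, every taint is assumed to have the PreferNoSchedule effect, a toleration is assumed to match on exact key and value, and `maxNodeScore = 100` is assumed to mirror `framework.MaxNodeScore`. The normalization shown is one plausible form of a reversed linear scaling, not a quotation of the scheduler's code.

```go
package main

import "fmt"

// maxNodeScore is assumed to mirror framework.MaxNodeScore.
const maxNodeScore int64 = 100

// taint and toleration are hypothetical simplifications; every taint is
// assumed to carry the PreferNoSchedule effect.
type taint struct{ key, value string }
type toleration struct{ key, value string }

// countIntolerable returns how many taints no toleration matches.
func countIntolerable(taints []taint, tols []toleration) int64 {
	var n int64
	for _, t := range taints {
		tolerated := false
		for _, tol := range tols {
			if tol.key == t.key && tol.value == t.value {
				tolerated = true
				break
			}
		}
		if !tolerated {
			n++
		}
	}
	return n
}

// normalizeReverse scales raw counts to [0, maxNodeScore] relative to the
// largest count and then reverses them, so fewer intolerable taints yield
// a higher final score.
func normalizeReverse(counts []int64) []int64 {
	var max int64
	for _, c := range counts {
		if c > max {
			max = c
		}
	}
	out := make([]int64, len(counts))
	for i, c := range counts {
		if max == 0 {
			out[i] = maxNodeScore
			continue
		}
		out[i] = maxNodeScore - c*maxNodeScore/max
	}
	return out
}

func main() {
	tols := []toleration{{"dedicated", "batch"}}
	nodes := [][]taint{
		{},
		{{"dedicated", "batch"}, {"gpu", "true"}},
		{{"gpu", "true"}, {"spot", "true"}},
	}
	counts := make([]int64, len(nodes))
	for i, taints := range nodes {
		counts[i] = countIntolerable(taints, tols)
	}
	fmt.Println(counts, normalizeReverse(counts))
}
```

The untainted node ends up with the top score and the node with the most intolerable taints ends up at zero, which matches the "more tolerated means higher priority" description.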

Algorithm registration logic:

	registry.registerPriorityConfigProducer(TaintTolerationPriority,
		func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.PreScore = appendToPluginSet(plugins.PreScore, tainttoleration.Name, nil)
			plugins.Score = appendToPluginSet(plugins.Score, tainttoleration.Name, &args.Weight)
			return
		})

ImageLocalityPriority

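The ImageLocalityPriority policy is mainly about image download speed. If a node already has the image, the Pod is preferentially scheduled there. Image size is also taken into account: a Pod may use several images, and the larger an image is the slower it downloads, so nodes are ranked by the total size of the required images they already hold.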

func sumImageScores(nodeInfo *framework.NodeInfo, containers []v1.Container, totalNumNodes int) int64 {
	var sum int64
	for _, container := range containers {
		if state, ok := nodeInfo.ImageStates[normalizedImageName(container.Image)]; ok {
			sum += scaledImageScore(state, totalNumNodes)
		}
	}
	return sum
}
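
The bodies of `scaledImageScore` and the final conversion to a node score are not shown above. The sketch below assumes that an image's size is scaled by the fraction of nodes that already hold it, and that the summed value is clamped between two thresholds and mapped linearly onto 0 to 100. The threshold values, the `imageState` type, and `calculatePriority` are illustrative assumptions.

```go
package main

import "fmt"

const (
	mb           int64 = 1024 * 1024
	minThreshold int64 = 23 * mb   // assumed lower clamp
	maxThreshold int64 = 1000 * mb // assumed upper clamp
	maxNodeScore int64 = 100       // assumed to mirror framework.MaxNodeScore
)

// imageState is a hypothetical summary of one image: its size in bytes and
// the number of nodes that already have it.
type imageState struct {
	size     int64
	numNodes int
}

// scaledImageScore scales the image size by how widely the image is spread
// across the cluster (an assumption about the unseen function body).
func scaledImageScore(s imageState, totalNumNodes int) int64 {
	spread := float64(s.numNodes) / float64(totalNumNodes)
	return int64(float64(s.size) * spread)
}

// calculatePriority clamps the summed score and maps it linearly onto
// the range [0, maxNodeScore].
func calculatePriority(sum int64) int64 {
	if sum < minThreshold {
		sum = minThreshold
	} else if sum > maxThreshold {
		sum = maxThreshold
	}
	return maxNodeScore * (sum - minThreshold) / (maxThreshold - minThreshold)
}

func main() {
	// A 500 MB image present on 2 of 4 nodes.
	sum := scaledImageScore(imageState{size: 500 * mb, numNodes: 2}, 4)
	fmt.Println(sum/mb, calculatePriority(sum))
}
```

Under these assumptions, tiny images contribute nothing and very large ones saturate at the maximum, so the policy only matters where pulling the image would be noticeably slow.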

Algorithm registration logic:

	registry.registerPriorityConfigProducer(ImageLocalityPriority,
		func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			plugins.Score = appendToPluginSet(plugins.Score, imagelocality.Name, &args.Weight)
			return
		})

NodeLabel

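The NodeLabel policy gives priority to Nodes that carry certain labels. The algorithm is simple: at startup it reads the labels configured in the scheduling policy (SchedulerPolicy), checks whether each Node satisfies the label condition, and prefers the Nodes that do.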

func (pl *NodeLabel) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil || nodeInfo.Node() == nil {
		return 0, framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from Snapshot: %v, node is nil: %v", nodeName, err, nodeInfo.Node() == nil))
	}

	node := nodeInfo.Node()
	score := int64(0)
	for _, label := range pl.args.PresentLabelsPreference {
		if labels.Set(node.Labels).Has(label) {
			score += framework.MaxNodeScore
		}
	}
	for _, label := range pl.args.AbsentLabelsPreference {
		if !labels.Set(node.Labels).Has(label) {
			score += framework.MaxNodeScore
		}
	}
	// Take average score for each label to ensure the score doesn't exceed MaxNodeScore.
	score /= int64(len(pl.args.PresentLabelsPreference) + len(pl.args.AbsentLabelsPreference))

	return score, nil
}
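
As a worked example of the averaging step, the sketch below scores a node against "present" and "absent" label preferences. `nodeLabelScore` is a hypothetical standalone function, `maxNodeScore = 100` is assumed to mirror `framework.MaxNodeScore`, and the guard against an empty preference list is an addition of this sketch that the quoted code does not have.

```go
package main

import "fmt"

// maxNodeScore is assumed to mirror framework.MaxNodeScore.
const maxNodeScore int64 = 100

// nodeLabelScore awards maxNodeScore for every preferred label that is
// present and every unwanted label that is absent, then averages over the
// number of preferences so the result never exceeds maxNodeScore.
func nodeLabelScore(nodeLabels map[string]string, present, absent []string) int64 {
	n := int64(len(present) + len(absent))
	if n == 0 {
		// Guard added in this sketch to avoid dividing by zero.
		return 0
	}
	var score int64
	for _, l := range present {
		if _, ok := nodeLabels[l]; ok {
			score += maxNodeScore
		}
	}
	for _, l := range absent {
		if _, ok := nodeLabels[l]; !ok {
			score += maxNodeScore
		}
	}
	return score / n
}

func main() {
	node := map[string]string{"ssd": "true", "spot": "true"}
	// Satisfies 1 of 3 preferences: has "ssd", lacks "gpu", carries "spot".
	fmt.Println(nodeLabelScore(node, []string{"ssd", "gpu"}, []string{"spot"}))
	// Satisfies both preferences.
	fmt.Println(nodeLabelScore(node, []string{"ssd"}, []string{"gpu"}))
}
```

Because the score is an average, a node that satisfies one of three preferences gets roughly a third of the maximum rather than being treated as all-or-nothing.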

Algorithm registration logic:

	registry.registerPriorityConfigProducer(nodelabel.Name,
		func(args ConfigProducerArgs) (plugins config.Plugins, pluginConfig []config.PluginConfig) {
			// If there are n LabelPreference priorities in the policy, the weight for the corresponding
			// score plugin is n*weight (note that the validation logic verifies that all LabelPreference
			// priorities specified in Policy have the same weight).
			weight := args.Weight * int32(len(args.NodeLabelArgs.PresentLabelsPreference)+len(args.NodeLabelArgs.AbsentLabelsPreference))
			plugins.Score = appendToPluginSet(plugins.Score, nodelabel.Name, &weight)
			if args.NodeLabelArgs != nil {
				pluginConfig = append(pluginConfig,
					config.PluginConfig{Name: nodelabel.Name, Args: args.NodeLabelArgs})
			}
			return
		})