Garden | Kubelet PLEG

PLEG 全称叫 Pod Lifecycle Event Generator，即 Pod 生命周期事件生成器。实际上它只是 Kubelet 中的一个模块，主要职责就是通过每个匹配的 Pod 级别事件来调整容器运行时的状态，并将调整的结果写入缓存，使 Pod 的缓存保持最新状态。

Background

在 Kubernetes 中，每个节点上都运行着一个守护进程 Kubelet 来管理节点上的容器，调整容器的实际状态以匹配 spec 中定义的状态。具体来说，Kubelet 需要对两个地方的更改做出及时的回应：

Pod spec 中定义的状态：对于 Pod，Kubelet 会从多个数据来源 watch Pod spec 中的变化
容器运行时的状态：对于容器，Kubelet 会定期（例如10s）轮询容器运行时，以获取所有容器的最新状态。

随着 Pod 和容器数量的增加，轮询会产生不可忽略的开销，并且会由于 Kubelet 的并行操作而加剧这种开销¹²³（为每个 Pod 分配一个 goruntine，用来获取容器的状态）。轮询带来的周期性大量并发请求会导致较高的 CPU 使用率峰值（即使 Pod 的定义和容器的状态没有发生改变），降低性能。最后容器运行时可能不堪重负，从而降低系统的可靠性，限制 Kubelet 的可扩展性。

为了降低 Pod 的管理开销，提升 Kubelet 的性能和可扩展性，引入了 PLEG⁴，改进了之前的工作方式：

减少空闲期间的不必要工作（例如 Pod 的定义和容器的状态没有发生更改）。
减少获取容器状态的并发请求数量。

Overview

整体的工作流程如下图所示，虚线部分是 PLEG 的工作内容。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


// pkg/kubelet/kubelet.go
func NewMainKubelet(...) (*Kubelet, error) {
  // ...

  klet := &Kubelet{
    nodeName:                                nodeName,
    kubeClient:                              kubeDeps.KubeClient,
    heartbeatClient:                         kubeDeps.HeartbeatClient,
    // ...
  }

	klet.pleg = pleg.NewGenericPLEG(klet.containerRuntime, plegChannelCapacity, plegRelistPeriod, klet.podCache, clock.RealClock{})
	klet.runtimeState = newRuntimeState(maxWaitForContainerRuntime)
	klet.runtimeState.addHealthCheck("PLEG", klet.pleg.Healthy)

  // ...
}

Pod Lifecycle Event

Pod Lifecycle Event 是 Pod 级别抽象出来的用来解释底层容器状态变化，从而在 Kubernetes 看是对于容器运行时是无感知的。当前定义的 Event Type 如下所示：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28


// pkg/kubelet/pleg/pleg.go
type PodLifeCycleEventType string

const (
	// ContainerStarted - event type when the new state of container is running.
	ContainerStarted PodLifeCycleEventType = "ContainerStarted"
	// ContainerDied - event type when the new state of container is exited.
	ContainerDied PodLifeCycleEventType = "ContainerDied"
	// ContainerRemoved - event type when the old state of container is exited.
	ContainerRemoved PodLifeCycleEventType = "ContainerRemoved"
	// PodSync is used to trigger syncing of a pod when the observed change of
	// the state of the pod cannot be captured by any single event above.
	PodSync PodLifeCycleEventType = "PodSync"
	// ContainerChanged - event type when the new state of container is unknown.
	ContainerChanged PodLifeCycleEventType = "ContainerChanged"
)

// PodLifecycleEvent is an event reflects the change of the pod state.
type PodLifecycleEvent struct {
	// The pod ID.
	ID types.UID
	// The type of the event.
	Type PodLifeCycleEventType
	// The accompanied data which varies based on the event type.
	//   - ContainerStarted/ContainerStopped: the container name (string).
	//   - All other event types: unused.
	Data interface{}
}

PLEG 的具体实现逻辑主要是以下三个函数：

1
2
3
4
5
6
7
8


// pkg/kubelet/pleg/pleg.go

// PodLifecycleEventGenerator contains functions for generating pod life cycle events.
type PodLifecycleEventGenerator interface {
	Start()
	Watch() chan *PodLifecycleEvent
	Healthy() (bool, error)
}

Relist

前面提到，Kubelet 通过 PLEG 定期产生 Pod Lifecycle Event，通过 channel 将事件传递给 syncLoop 来处理 Pod 的创建和删除等事件。以 Docker 为例，在 Pod 中启动一个容器就会在 Kubelet 中注册一个 ContainerStarted Pod 生命周期事件。

那么 PLEG 是如何知道新启动了一个 infra 容器呢？它会定期重新列出节点上的所有容器（例如 docker ps），并与上一次的容器列表进行对比，以此来判断容器状态的变化。其实这就是 relist() 函数干的事情，尽管这种方法和以前的 Kubelet 轮询类似，但现在只有一个线程，就是 PLEG。现在不需要所有的线程并发获取容器的状态，只有相关的线程会被唤醒用来同步容器状态。而且 relist 与容器运行时无关，也不需要外部依赖。

当前 Kubelet 设置 relist 的调用周期为 1s。尽管每秒钟调用一次 relist，但它的完成时间仍然有可能超过 1s。因为下一次调用 relist 必须得等上一次 relist 执行结束，设想一下，如果容器运行时响应缓慢，或者一个周期内有大量的容器状态发生改变，那么 relist 的完成时间将不可忽略，假设是 5s，那么下一次调用 relist 将要等到 6s 之后。

1
2


// pkg/kubelet/kubelet.go
plegRelistPeriod = time.Second * 1

前面提到，我们在 NewMainKubelet 中创建了 GenericPLEG 对象，当 Kubelet.Run() 时，我们会开启 Start() 开启一个 goroutine 以 1s 为周期定期去 relist。相关的源代码如下：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


// pkg/kubelet/pleg/generic.go

// Start spawns a goroutine to relist periodically.
func (g *GenericPLEG) Start() {
  go wait.Until(g.relist, g.relistPeriod, wait.NeverStop)
}

func (g *GenericPLEG) relist() {
... WE WILL REVIEW HERE ...
}

Healthy Check

为了检查每次 relist 是否正常，Kubelet 设置了 relist 的完成时间为 3min。Kubelet 在 syncLoop 中会定期检查 relist 进程（PLEG 的关键任务）是否在 3 分钟内完成。如果 relist 进程的完成时间超过了 3 分钟，就会报告 PLEG is not healthy 。

下面是调用 Healthy() 函数的相关代码：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


//// pkg/kubelet/pleg/generic.go - Healthy()

// The threshold needs to be greater than the relisting period + the
// relisting time, which can vary significantly. Set a conservative
// threshold to avoid flipping between healthy and unhealthy.
relistThreshold = 3 * time.Minute

func (g *GenericPLEG) Healthy() (bool, error) {
  relistTime := g.getRelistTime()
  elapsed := g.clock.Since(relistTime)
  if elapsed > relistThreshold {
    return false, fmt.Errorf("pleg was last seen active %v ago; threshold is %v", elapsed, relistThreshold)
  }
  return true, nil
}

那么 Healthy 是如何被调用的呢？Healthy() 函数会以 PLEG 的形式添加到 runtimeState 中，Kubelet 在一个同步循环（SyncLoop() 函数）中会定期（默认是 10s）调用 Healthy() 函数。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41


//// pkg/kubelet/kubelet.go - NewMainKubelet()
func NewMainKubelet(kubeCfg *kubeletconfiginternal.KubeletConfiguration, ...) {
  klet.runtimeState.addHealthCheck("PLEG", klet.pleg.Healthy)
}

//// pkg/kubelet/kubelet.go - syncLoop()
func (kl *Kubelet) syncLoop(updates <-chan kubetypes.PodUpdate, handler SyncHandler) {

// The resyncTicker wakes up kubelet to checks if there are any pod workers
// that need to be sync'd. A one-second period is sufficient because the
// sync interval is defaulted to 10s.

  const (
    base   = 100 * time.Millisecond
    max    = 5 * time.Second
    factor = 2
  )
  duration := base
  for {
      if rs := kl.runtimeState.runtimeErrors(); len(rs) != 0 {
          glog.Infof("skipping pod synchronization - %v", rs)
          // exponential backoff
          time.Sleep(duration)
          duration = time.Duration(math.Min(float64(max), factor*float64(duration)))
          continue
      }
      // ...
  }
  // ...
}

//// pkg/kubelet/runtime.go - runtimeErrors()
func (s *runtimeState) runtimeErrors() []string {
:
    for _, hc := range s.healthChecks {
        if ok, err := hc.fn(); !ok {
            ret = append(ret, fmt.Sprintf("%s is not healthy: %v", hc.name, err))
        }
    }
:
}

深入解读 relist 函数

下面我们来看一下 relist() 函数的内部实现。完整的流程如下图所示：

如上所示，relist 函数第一步就是记录 Kubelet 的相关指标（例如 kubelet_pleg_relist_latency_microseconds），然后通过 CRI 从容器运行时获取当前的 Pod 列表（包括停止的 Pod）。该 Pod 列表会和之前的 Pod 列表进行比较，检查哪些状态发生了变化，然后同时生成相关的 Pod 生命周期事件和 更改后的状态 。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


//// pkg/kubelet/pleg/generic.go - relist()
  :
  // get a current timestamp
  timestamp := g.clock.Now()

  // kubelet_pleg_relist_latency_microseconds for prometheus metrics
  defer func() {
      metrics.PLEGRelistLatency.Observe(metrics.SinceInMicroseconds(timestamp))
  }()

  // Get all the pods.
  podList, err := g.runtime.GetPods(true)
  :

GetPods

其中 GetPods() 函数的调用堆栈如下图所示：

computeEvents

将当前的 Pod 列表和上一次 relist 的 Pod 列表进行对比之后，就会针对每一个变化生成相应的 Pod 级别的事件。相关的源代码如下：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21


//// pkg/kubelet/pleg/generic.go - relist()

  pods := kubecontainer.Pods(podList)
  g.podRecords.setCurrent(pods)

  // Compare the old and the current pods, and generate events.
  eventsByPodID := map[types.UID][]*PodLifecycleEvent{}
  for pid := range g.podRecords {
    oldPod := g.podRecords.getOld(pid)
    pod := g.podRecords.getCurrent(pid)

    // Get all containers in the old and the new pod.
    allContainers := getContainersFromPods(oldPod, pod)
    for _, container := range allContainers {
          events := computeEvents(oldPod, pod, &container.ID)

          for _, e := range events {
                updateEvents(eventsByPodID, e)
          }
        }
  }

其中 generateEvents() 函数（computeEvents() 函数会调用它）用来生成相应的 Pod 级别的事件（例如 ContainerStarted、ContainerDied 等等），然后通过 updateEvents() 函数来更新事件。

computeEvents() 函数的内容如下：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31


//// pkg/kubelet/pleg/generic.go - computeEvents()

func computeEvents(oldPod, newPod *kubecontainer.Pod, cid *kubecontainer.ContainerID) []*PodLifecycleEvent {
:
    return generateEvents(pid, cid.ID, oldState, newState)
}

//// pkg/kubelet/pleg/generic.go - generateEvents()

func generateEvents(podID types.UID, cid string, oldState, newState plegContainerState) []*PodLifecycleEvent {
:
    glog.V(4).Infof("GenericPLEG: %v/%v: %v -> %v", podID, cid, oldState, newState)
    switch newState {
    case plegContainerRunning:
      return []*PodLifecycleEvent{{ID: podID, Type: ContainerStarted, Data: cid}}
    case plegContainerExited:
      return []*PodLifecycleEvent{{ID: podID, Type: ContainerDied, Data: cid}}
    case plegContainerUnknown:
      return []*PodLifecycleEvent{{ID: podID, Type: ContainerChanged, Data: cid}}
    case plegContainerNonExistent:
      switch oldState {
      case plegContainerExited:
        // We already reported that the container died before.
        return []*PodLifecycleEvent{{ID: podID, Type: ContainerRemoved, Data: cid}}
      default:
        return []*PodLifecycleEvent{{ID: podID, Type: ContainerDied, Data: cid}, {ID: podID, Type: ContainerRemoved, Data: cid}}
      }
    default:
      panic(fmt.Sprintf("unrecognized container state: %v", newState))
  }
}

updateCache

relist 的最后一个任务是检查是否有与 Pod 关联的事件，更新事件到 plegCh，并按照下面的流程更新 podCache。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32


//// pkg/kubelet/pleg/generic.go - relist()

  // If there are events associated with a pod, we should update the
  // podCache.
  for pid, events := range eventsByPodID {
    pod := g.podRecords.getCurrent(pid)
    if g.cacheEnabled() {
      // updateCache() will inspect the pod and update the cache. If an
      // error occurs during the inspection, we want PLEG to retry again
      // in the next relist. To achieve this, we do not update the
      // associated podRecord of the pod, so that the change will be
      // detect again in the next relist.
      // TODO: If many pods changed during the same relist period,
      // inspecting the pod and getting the PodStatus to update the cache
      // serially may take a while. We should be aware of this and
      // parallelize if needed.
      if err := g.updateCache(pod, pid); err != nil {
        glog.Errorf("PLEG: Ignoring events for pod %s/%s: %v", pod.Name, pod.Namespace, err)
        :
      }
      :
    }
    // Update the internal storage and send out the events.
    g.podRecords.update(pid)
    for i := range events {
      // Filter out events that are not reliable and no other components use yet.
      if events[i].Type == ContainerChanged {
           continue
      }
      g.eventChannel <- events[i]
     }
  }

updateCache() 将会检查每个 Pod，并在单个循环中依次对其进行更新。因此，如果在同一个 relist 中更改了大量的 Pod，那么 updateCache 过程将会成为瓶颈。最后，更新后的 Pod 生命周期事件将会被发送到 eventChannel。

某些远程客户端还会调用每一个 Pod 来获取 Pod 的 spec 定义信息，这样一来，Pod 数量越多，延时就可能越高，因为 Pod 越多就会生成越多的事件。

updateCache() 的详细调用堆栈如下图所示，其中 GetPodStatus() 用来获取 Pod 的 spec 定义信息：

完整的代码如下：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73


//// pkg/kubelet/pleg/generic.go - updateCache()

func (g *GenericPLEG) updateCache(pod *kubecontainer.Pod, pid types.UID) error {
:
    timestamp := g.clock.Now()
    // TODO: Consider adding a new runtime method
    // GetPodStatus(pod *kubecontainer.Pod) so that Docker can avoid listing
    // all containers again.
    status, err := g.runtime.GetPodStatus(pod.ID, pod.Name, pod.Namespace)
  :
    g.cache.Set(pod.ID, status, err, timestamp)
    return err
}

//// pkg/kubelet/kuberuntime/kuberuntime_manager.go - GetPodStatus()

// GetPodStatus retrieves the status of the pod, including the
// information of all containers in the pod that are visible in Runtime.
func (m *kubeGenericRuntimeManager) GetPodStatus(uid kubetypes.UID, name, namespace string) (*kubecontainer.PodStatus, error) {
  podSandboxIDs, err := m.getSandboxIDByPodUID(uid, nil)
  :
    for idx, podSandboxID := range podSandboxIDs {
        podSandboxStatus, err := m.runtimeService.PodSandboxStatus(podSandboxID)
    :
    }

    // Get statuses of all containers visible in the pod.
    containerStatuses, err := m.getPodContainerStatuses(uid, name, namespace)
  :
}

//// pkg/kubelet/kuberuntime/kuberuntime_sandbox.go - getSandboxIDByPodUID()

// getPodSandboxID gets the sandbox id by podUID and returns ([]sandboxID, error).
// Param state could be nil in order to get all sandboxes belonging to same pod.
func (m *kubeGenericRuntimeManager) getSandboxIDByPodUID(podUID kubetypes.UID, state *runtimeapi.PodSandboxState) ([]string, error) {
  :
  sandboxes, err := m.runtimeService.ListPodSandbox(filter)
  :  
  return sandboxIDs, nil
}

//// pkg/kubelet/remote/remote_runtime.go - PodSandboxStatus()

// PodSandboxStatus returns the status of the PodSandbox.
func (r *RemoteRuntimeService) PodSandboxStatus(podSandBoxID string) (*runtimeapi.PodSandboxStatus, error) {
    ctx, cancel := getContextWithTimeout(r.timeout)
    defer cancel()

    resp, err := r.runtimeClient.PodSandboxStatus(ctx, &runtimeapi.PodSandboxStatusRequest{
        PodSandboxId: podSandBoxID,
    })
  :
    return resp.Status, nil
}

//// pkg/kubelet/kuberuntime/kuberuntime_container.go - getPodContainerStatuses()

// getPodContainerStatuses gets all containers' statuses for the pod.
func (m *kubeGenericRuntimeManager) getPodContainerStatuses(uid kubetypes.UID, name, namespace string) ([]*kubecontainer.ContainerStatus, error) {
  // Select all containers of the given pod.
  containers, err := m.runtimeService.ListContainers(&runtimeapi.ContainerFilter{
    LabelSelector: map[string]string{types.KubernetesPodUIDLabel: string(uid)},
  })
  :
  // TODO: optimization: set maximum number of containers per container name to examine.
  for i, c := range containers {
    status, err := m.runtimeService.ContainerStatus(c.Id)
    :
  }
  :
  return statuses, nil
}

上面就是 relist() 函数的完整调用堆栈，我在讲解的过程中结合了相关的源代码，希望能为你提供有关 PLEG 的更多细节。为了实时了解 PLEG 的健康状况，最好的办法就是监控 relist。

监控 relist

我们可以通过监控 Kubelet 的指标来了解 relist 的延时。relist 的调用周期是 1s，那么 relist 的完成时间 + 1s 就等于 kubelet_pleg_relist_interval_microseconds 指标的值。你也可以监控容器运行时每个操作的延时，这些指标在排查故障时都能提供线索。

你可以在每个节点上通过访问 URL http://127.0.0.1:10255/metrics 来获取 Kubelet 的指标。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33


# HELP kubelet_pleg_relist_duration_seconds [ALPHA] Duration in seconds for relisting pods in PLEG.
# TYPE kubelet_pleg_relist_duration_seconds histogram
kubelet_pleg_relist_duration_seconds_bucket{le="0.005"} 22
kubelet_pleg_relist_duration_seconds_bucket{le="0.01"} 462394
kubelet_pleg_relist_duration_seconds_bucket{le="0.025"} 1.535108e+06
kubelet_pleg_relist_duration_seconds_bucket{le="0.05"} 1.543127e+06
kubelet_pleg_relist_duration_seconds_bucket{le="0.1"} 1.54345e+06
kubelet_pleg_relist_duration_seconds_bucket{le="0.25"} 1.54347e+06
kubelet_pleg_relist_duration_seconds_bucket{le="0.5"} 1.543473e+06
kubelet_pleg_relist_duration_seconds_bucket{le="1"} 1.543473e+06
kubelet_pleg_relist_duration_seconds_bucket{le="2.5"} 1.543473e+06
kubelet_pleg_relist_duration_seconds_bucket{le="5"} 1.543473e+06
kubelet_pleg_relist_duration_seconds_bucket{le="10"} 1.543473e+06
kubelet_pleg_relist_duration_seconds_bucket{le="+Inf"} 1.543473e+06
kubelet_pleg_relist_duration_seconds_sum 17323.009421153103
kubelet_pleg_relist_duration_seconds_count 1.543473e+06

# HELP kubelet_pleg_relist_interval_seconds [ALPHA] Interval in seconds between relisting in PLEG.
# TYPE kubelet_pleg_relist_interval_seconds histogram
kubelet_pleg_relist_interval_seconds_bucket{le="0.005"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="0.01"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="0.025"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="0.05"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="0.1"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="0.25"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="0.5"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="1"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="2.5"} 1.543472e+06
kubelet_pleg_relist_interval_seconds_bucket{le="5"} 1.543472e+06
kubelet_pleg_relist_interval_seconds_bucket{le="10"} 1.543472e+06
kubelet_pleg_relist_interval_seconds_bucket{le="+Inf"} 1.543472e+06
kubelet_pleg_relist_interval_seconds_sum 1.5616756207034157e+06
kubelet_pleg_relist_interval_seconds_count 1.543472e+06

可以通过 Prometheus 对其进行监控：

PLEG is not healthy

造成 PLEG is not healthy 的因素有很多，下面是几种可能的原因：

Container Runtime

RPC 调用过程中 容器运行时⁵ 响应超时（有可能是性能下降，死锁或者出现了 bug）。这种情况下可以通过以下命令检查执行 list container 时间

1
2


# 假设节点使用 docker 
$ time sudo docker ps

一种 workaround 的方法⁶是：

We created a bash script that checks if docker ps is executed properly in less than 60 seconds. If this check fails 3 times, the Docker service is restarted. It happens from time to time, but the script solves the issue.

Overload

节点上的 Pod 数量太多，导致 relist 无法在 3 分钟内完成。事件数量和延时与 Pod 数量成正比，与节点资源无关。

System Bug

relist 出现了死锁⁷，该 bug 已在 Kubernetes 1.14 中修复。
获取 Pod 的网络堆栈信息时 CNI 出现了 bug⁸。
因为 systemd 导致的问题⁹