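Garden | Pod Lifecycle and Status

This article covers the lifecycle of a Pod in Kubernetes: its phases, liveness and readiness probes, restart policy, and related status fields.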

type Pod struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`

	Spec PodSpec `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"`
	Status PodStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"`
}

The PodStatus object reflects the state of a Pod. Its data structure is shown below:

type PodStatus struct {
	Phase PodPhase `json:"phase,omitempty" protobuf:"bytes,1,opt,name=phase,casttype=PodPhase"`
	Conditions []PodCondition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type" protobuf:"bytes,2,rep,name=conditions"`
	Message string `json:"message,omitempty" protobuf:"bytes,3,opt,name=message"`
	Reason string `json:"reason,omitempty" protobuf:"bytes,4,opt,name=reason"`
	// nominatedNodeName is set only when this pod preempts other pods on the node, but it cannot be
	// scheduled right away as preemption victims receive their graceful termination periods.
	// This field does not guarantee that the pod will be scheduled on this node. Scheduler may decide
	// to place the pod elsewhere if other nodes become available sooner. Scheduler may also decide to
	// give the resources on this node to a higher priority pod that is created after preemption.
	// As a result, this field may be different than PodSpec.nodeName when the pod is
	// scheduled.
	// +optional
	NominatedNodeName string `json:"nominatedNodeName,omitempty" protobuf:"bytes,11,opt,name=nominatedNodeName"`

	// IP address of the host to which the pod is assigned. Empty if not yet scheduled.
	// +optional
	HostIP string `json:"hostIP,omitempty" protobuf:"bytes,5,opt,name=hostIP"`
	// IP address allocated to the pod. Routable at least within the cluster.
	// Empty if not yet allocated.
	// +optional
	PodIP string `json:"podIP,omitempty" protobuf:"bytes,6,opt,name=podIP"`

	// podIPs holds the IP addresses allocated to the pod. If this field is specified, the 0th entry must
	// match the podIP field. Pods may be allocated at most 1 value for each of IPv4 and IPv6. This list
	// is empty if no IPs have been allocated yet.
	// +optional
	// +patchStrategy=merge
	// +patchMergeKey=ip
	PodIPs []PodIP `json:"podIPs,omitempty" protobuf:"bytes,12,rep,name=podIPs" patchStrategy:"merge" patchMergeKey:"ip"`

	// RFC 3339 date and time at which the object was acknowledged by the Kubelet.
	// This is before the Kubelet pulled the container image(s) for the pod.
	// +optional
	StartTime *metav1.Time `json:"startTime,omitempty" protobuf:"bytes,7,opt,name=startTime"`

	// The list has one entry per init container in the manifest. The most recent successful
	// init container will have ready = true, the most recently started container will have
	// startTime set.
	// More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-and-container-status
	InitContainerStatuses []ContainerStatus `json:"initContainerStatuses,omitempty" protobuf:"bytes,10,rep,name=initContainerStatuses"`

	// The list has one entry per container in the manifest. Each entry is currently the output
	// of `docker inspect`.
	// More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-and-container-status
	// +optional
	ContainerStatuses []ContainerStatus `json:"containerStatuses,omitempty" protobuf:"bytes,8,rep,name=containerStatuses"`
	// The Quality of Service (QOS) classification assigned to the pod based on resource requirements
	// See PodQOSClass type for available QOS classes
	// More info: https://git.k8s.io/community/contributors/design-proposals/node/resource-qos.md
	// +optional
	QOSClass PodQOSClass `json:"qosClass,omitempty" protobuf:"bytes,9,rep,name=qosClass"`
	// Status for any ephemeral containers that have run in this pod.
	// This field is alpha-level and is only populated by servers that enable the EphemeralContainers feature.
	// +optional
	EphemeralContainerStatuses []ContainerStatus `json:"ephemeralContainerStatuses,omitempty" protobuf:"bytes,13,rep,name=ephemeralContainerStatuses"`
}

Pod Phase

From creation until it is up and running, a Pod passes through several phases. The source code represents them with the PodPhase type:

PodPending PodPhase = "Pending"
PodRunning PodPhase = "Running"
PodSucceeded PodPhase = "Succeeded"
PodFailed PodPhase = "Failed"
PodUnknown PodPhase = "Unknown"

Creating a Pod end to end usually triggers a series of watch events. The watch event types relevant here are:

Added    EventType = "ADDED"
Modified EventType = "MODIFIED"
Deleted  EventType = "DELETED"
Error    EventType = "ERROR"
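
As a rough illustration of how a watcher might react to these event types, the sketch below switches on the type and returns a short description. The EventType declarations are copied locally so the example compiles without any Kubernetes dependency, and describe is a hypothetical helper, not part of any Kubernetes library.

```go
package main

import "fmt"

// EventType mirrors the watch event type shown above. It is redeclared
// here so the sketch has no Kubernetes dependency.
type EventType string

const (
	Added    EventType = "ADDED"
	Modified EventType = "MODIFIED"
	Deleted  EventType = "DELETED"
	Error    EventType = "ERROR"
)

// describe is a hypothetical helper that maps an event type to the action
// a watcher would typically take.
func describe(t EventType) string {
	switch t {
	case Added:
		return "object created: start tracking it"
	case Modified:
		return "object updated: refresh cached state"
	case Deleted:
		return "object removed: stop tracking it"
	case Error:
		return "watch error: re-list and restart the watch"
	default:
		return "unrecognized event type: ignore"
	}
}

func main() {
	for _, t := range []EventType{Added, Modified, Deleted, Error} {
		fmt.Printf("%s -> %s\n", t, describe(t))
	}
}
```

The default branch is there so that an event type the watcher does not know about is ignored rather than treated as an error.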

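Pending: The request to create the Pod has been accepted by Kubernetes, but not all of its containers have started. The Pod may be at any of four stages: being written to etcd, being scheduled, pulling images, or starting containers. Pending is usually accompanied by ADDED and MODIFIED events.
Running: The Pod has been bound to a node and all of its containers have been started. At least one container is still running or is in the process of being restarted.
Succeeded: All containers in the Pod have terminated voluntarily with exit code 0, and Kubernetes will not restart them. This phase typically shows up when running Jobs.
Failed: All containers in the Pod have terminated, and at least one of them terminated in failure (it exited with a non-zero exit code or was stopped by the system).
Unknown: For some reason the state of the Pod could not be obtained, typically because of an error communicating with the Pod's host.

A Pod's phase is a simple, high-level summary of where the Pod is in its lifecycle. It is not meant to be a comprehensive rollup of container or Pod state, nor a comprehensive state machine. The number and meaning of Pod phase values are strictly defined. Do not assume that a Pod can have any phase value other than the ones listed in this document.

The possible values of phase are: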

	// PodPending means the pod has been accepted by the system, but one or more of the containers
	// has not been started. This includes time before being bound to a node, as well as time spent
	// pulling images onto the host.
	PodPending PodPhase = "Pending"
	// PodRunning means the pod has been bound to a node and all of the containers have been started.
	// At least one container is still running or is in the process of being restarted.
	PodRunning PodPhase = "Running"
	// PodSucceeded means that all containers in the pod have voluntarily terminated
	// with a container exit code of 0, and the system is not going to restart any of these containers.
	PodSucceeded PodPhase = "Succeeded"
	// PodFailed means that all containers in the pod have terminated, and at least one container has
	// terminated in a failure (exited with a non-zero exit code or was stopped by the system).
	PodFailed PodPhase = "Failed"
	// PodUnknown means that for some reason the state of the pod could not be obtained, typically due
	// to an error in communicating with the host of the pod.
	PodUnknown PodPhase = "Unknown"

The figure below illustrates the Pod lifecycle and shows how the Pod's state changes.

Pod Condition

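The PodStatus object contains an array of PodCondition entries. Each element has a type field and a status field.

The type field is a string; possible values are PodScheduled, Ready, Initialized, and ContainersReady. (Unschedulable is a reason reported on the PodScheduled condition, not a condition type of its own.)
The status field is a string; possible values are True, False, and Unknown.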

type PodCondition struct {
	Type PodConditionType `json:"type" protobuf:"bytes,1,opt,name=type,casttype=PodConditionType"`
	Status ConditionStatus `json:"status" protobuf:"bytes,2,opt,name=status,casttype=ConditionStatus"`
	LastProbeTime metav1.Time `json:"lastProbeTime,omitempty" protobuf:"bytes,3,opt,name=lastProbeTime"`
	LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty" protobuf:"bytes,4,opt,name=lastTransitionTime"`
	Reason string `json:"reason,omitempty" protobuf:"bytes,5,opt,name=reason"`
	Message string `json:"message,omitempty" protobuf:"bytes,6,opt,name=message"`
}

Condition Type

	// ContainersReady indicates whether all containers in the pod are ready.
	ContainersReady PodConditionType = "ContainersReady"
	// PodInitialized means that all init containers in the pod have started successfully.
	PodInitialized PodConditionType = "Initialized"
	// PodReady means the pod is able to service requests and should be added to the
	// load balancing pools of all matching services.
	PodReady PodConditionType = "Ready"
	// PodScheduled represents status of the scheduling process for this pod.
	PodScheduled PodConditionType = "PodScheduled"

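PodScheduled: the status of the scheduling process for the Pod.
Initialized: all init containers in the Pod have completed successfully.
ContainersReady: all containers in the Pod are ready.
Ready: the Pod can serve requests and should be added to the load-balancing pools of all matching Services.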

Condition Status

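The status indicates whether the Pod currently satisfies a given condition (PodScheduled, Ready, Initialized, ContainersReady): True means it does, False means it does not, and Unknown means the state could not be determined.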
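
To make the type/status pairing concrete, here is a minimal sketch of looking up one condition in a conditions list. The types are trimmed local copies of the ones shown above, and conditionStatus is a hypothetical helper. Returning Unknown for a condition that is absent is a choice made for this sketch, not something the API prescribes.

```go
package main

import "fmt"

type PodConditionType string
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// PodCondition keeps only the two fields this sketch needs.
type PodCondition struct {
	Type   PodConditionType
	Status ConditionStatus
}

// conditionStatus returns the status of the condition with the given type.
// If no such condition exists, it returns Unknown (an assumption of this sketch).
func conditionStatus(conds []PodCondition, t PodConditionType) ConditionStatus {
	for _, c := range conds {
		if c.Type == t {
			return c.Status
		}
	}
	return ConditionUnknown
}

func main() {
	conds := []PodCondition{
		{Type: "PodScheduled", Status: ConditionTrue},
		{Type: "Ready", Status: ConditionFalse},
	}
	fmt.Println("Ready:", conditionStatus(conds, "Ready"))
	fmt.Println("Initialized:", conditionStatus(conds, "Initialized"))
}
```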

Container State

type ContainerState struct {
	// Details about a waiting container
	// +optional
	Waiting *ContainerStateWaiting `json:"waiting,omitempty" protobuf:"bytes,1,opt,name=waiting"`
	// Details about a running container
	// +optional
	Running *ContainerStateRunning `json:"running,omitempty" protobuf:"bytes,2,opt,name=running"`
	// Details about a terminated container
	// +optional
	Terminated *ContainerStateTerminated `json:"terminated,omitempty" protobuf:"bytes,3,opt,name=terminated"`
}
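
Each of the three pointers is set only when the container is in that state, so reading a ContainerState amounts to checking which pointer is non-nil. The sketch below shows one way to do that. The detail structs are simplified stand-ins (the real ones carry more fields), stateName is a hypothetical helper, and treating an empty struct as waiting is an assumption based on how the upstream type is documented.

```go
package main

import "fmt"

// Simplified stand-ins for the detail structs; the real ones have more fields.
type ContainerStateWaiting struct{ Reason string }
type ContainerStateRunning struct{}
type ContainerStateTerminated struct{ ExitCode int32 }

// ContainerState mirrors the struct above without the serialization tags.
type ContainerState struct {
	Waiting    *ContainerStateWaiting
	Running    *ContainerStateRunning
	Terminated *ContainerStateTerminated
}

// stateName reports which member of the state is set.
func stateName(s ContainerState) string {
	switch {
	case s.Running != nil:
		return "running"
	case s.Terminated != nil:
		return fmt.Sprintf("terminated (exit code %d)", s.Terminated.ExitCode)
	case s.Waiting != nil:
		return "waiting (" + s.Waiting.Reason + ")"
	default:
		// Assumption: with no member set, the container counts as waiting.
		return "waiting"
	}
}

func main() {
	creating := ContainerState{Waiting: &ContainerStateWaiting{Reason: "ContainerCreating"}}
	fmt.Println(stateName(creating))
	fmt.Println(stateName(ContainerState{Running: &ContainerStateRunning{}}))
}
```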

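Container Probes

A probe is a diagnostic that the kubelet performs periodically on a container. To run the diagnostic, the kubelet calls a Handler implemented by the container. There are three types of handlers:

ExecAction: executes a specified command inside the container. The diagnostic succeeds if the command exits with status code 0.
TCPSocketAction: performs a TCP check against the container's IP address on a specified port. The diagnostic succeeds if the port is open.
HTTPGetAction: performs an HTTP GET request against the container's IP address on a specified port and path. The diagnostic succeeds if the response status code is greater than or equal to 200 and less than 400.

Each probe has one of three results:

Success: the container passed the diagnostic.
Failure: the container failed the diagnostic.
Unknown: the diagnostic itself failed, so no action is taken.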
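
The HTTP success rule above (status code at least 200 and below 400) can be sketched in a few lines. This is an illustration of the rule only, not the kubelet's prober. The names classify and probeHTTP are hypothetical, the Unknown result is not modeled, and treating a transport error as Failure is an assumption of the sketch. The demo uses a throwaway local test server.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

type Result string

const (
	Success Result = "Success"
	Failure Result = "Failure"
)

// classify applies the HTTPGetAction rule: 200 <= code < 400 is a success.
func classify(statusCode int) Result {
	if statusCode >= 200 && statusCode < 400 {
		return Success
	}
	return Failure
}

// probeHTTP issues one GET request and classifies the response.
func probeHTTP(url string, timeout time.Duration) Result {
	client := &http.Client{
		Timeout: timeout,
		// Return 3xx responses as-is instead of following the redirect.
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	resp, err := client.Get(url)
	if err != nil {
		return Failure // assumption: transport errors count as failures
	}
	defer resp.Body.Close()
	return classify(resp.StatusCode)
}

func main() {
	// Local test server: /healthz is healthy, everything else returns 500.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/healthz" {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusInternalServerError)
	}))
	defer srv.Close()

	fmt.Println("/healthz:", probeHTTP(srv.URL+"/healthz", time.Second))
	fmt.Println("/broken:", probeHTTP(srv.URL+"/broken", time.Second))
}
```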

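The kubelet can optionally perform and react to two kinds of probes on running containers:

livenessProbe: indicates whether the container is running. If the liveness probe fails, the kubelet kills the container, and the container is subjected to its restart policy. If a container does not provide a liveness probe, the default state is Success.
readinessProbe: indicates whether the container is ready to serve requests. If the readiness probe fails, the endpoints controller removes the Pod's IP address from the endpoints of all Services that match the Pod. Before the initial delay has elapsed, the readiness state defaults to Failure. If a container does not provide a readiness probe, the default state is Success.

When should you use liveness and readiness probes?

If the process in your container is able to crash on its own whenever it hits a problem or becomes unhealthy, you do not necessarily need a liveness probe; the kubelet automatically takes the correct action according to the Pod's restartPolicy.

If you want your container to be killed and restarted when a probe fails, specify a liveness probe and set restartPolicy to Always or OnFailure.

If you want traffic to reach a Pod only once a probe succeeds, specify a readiness probe. In this case the readiness probe may be identical to the liveness probe, but the presence of a readiness probe in the spec means that the Pod starts without receiving any traffic and only begins receiving traffic after the probe starts succeeding.

If you want your container to be able to take itself down for maintenance, you can specify a readiness probe that checks an endpoint different from the one used by the liveness probe.

Note that if you only want requests to be drained when the Pod is deleted, you do not necessarily need a readiness probe. On deletion, the Pod automatically puts itself into an unready state whether or not a readiness probe exists, and it stays unready while it waits for its containers to stop.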

readinessGates

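Since Kubernetes 1.14 (the release in which readinessGates reached GA; it was alpha in 1.11), the extended Pod readiness mechanism is supported by default.

Applications can inject extra feedback or signals into PodStatus: Pod readiness. To use this feature, set readinessGates in the PodSpec to specify a list of additional conditions that the kubelet evaluates for Pod readiness.

Readiness gates are determined by the current state of the Pod's status.conditions field. If Kubernetes cannot find such a condition in the Pod's status.conditions field, the status of that condition defaults to "False".

Here is an example.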

kind: Pod
...
spec:
  readinessGates:
    - conditionType: "www.example.com/feature-1"
status:
  conditions:
    - type: Ready                              # built-in Pod condition
      status: "False"
      lastProbeTime: null
      lastTransitionTime: 2018-01-01T00:00:00Z
    - type: "www.example.com/feature-1"        # extra Pod condition
      status: "False"
      lastProbeTime: null
      lastTransitionTime: 2018-01-01T00:00:00Z
  containerStatuses:
    - containerID: docker://abcd...
      ready: true
...

The names of the Pod conditions you add must conform to the Kubernetes label key format.

A Pod is Ready only when all containers in the Pod are Ready and every condition listed in readinessGates is True.
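
The readiness rule can be written down as a small predicate. The sketch below is an illustration under simplifying assumptions rather than the kubelet's implementation: container readiness is passed in as a slice of booleans, conditions use plain strings, and podReady and gateStatus are hypothetical names. A gate with no matching condition counts as "False", as described above.

```go
package main

import "fmt"

// PodCondition keeps only the fields needed to evaluate readiness gates.
type PodCondition struct {
	Type   string
	Status string
}

// gateStatus returns the status of the named condition, defaulting to
// "False" when the condition is missing from the list.
func gateStatus(conds []PodCondition, conditionType string) string {
	for _, c := range conds {
		if c.Type == conditionType {
			return c.Status
		}
	}
	return "False"
}

// podReady is true only if every container is ready and every readiness
// gate has a condition whose status is "True".
func podReady(containersReady []bool, gates []string, conds []PodCondition) bool {
	for _, ready := range containersReady {
		if !ready {
			return false
		}
	}
	for _, gate := range gates {
		if gateStatus(conds, gate) != "True" {
			return false
		}
	}
	return true
}

func main() {
	gates := []string{"www.example.com/feature-1"}
	conds := []PodCondition{{Type: "www.example.com/feature-1", Status: "False"}}
	// Mirrors the YAML example: the container is ready but the gate is False.
	fmt.Println("Ready:", podReady([]bool{true}, gates, conds))
}
```

With the values from the YAML example, the container reports ready: true but the Pod's Ready condition stays False because the gate condition is still False.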

Pod and Container Status

For details about Pod and container status, see PodStatus and ContainerStatus. Note that the Pod status information reported depends on the current ContainerState.

Restart Policy

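A PodSpec has a restartPolicy field whose possible values are Always, OnFailure, and Never. The default is Always. restartPolicy applies to all containers in the Pod and refers only to restarts of containers by the kubelet on the same node. Failed containers are restarted by the kubelet with an exponential back-off delay (10 seconds, 20 seconds, 40 seconds, ...) capped at five minutes, and the delay is reset after ten minutes of successful execution. As described in the Pod documentation, once bound to a node, a Pod is never rebound to another node.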
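
The back-off sequence is easy to reproduce. The following sketch computes the delay before the n-th restart from the numbers given above (10-second base, doubling, five-minute cap). The function name restartDelay is hypothetical, and the reset after ten minutes of successful execution is deliberately left out.

```go
package main

import (
	"fmt"
	"time"
)

const (
	initialDelay = 10 * time.Second // delay before the first restart
	maxDelay     = 5 * time.Minute  // upper bound on the delay
)

// restartDelay returns the delay before the n-th restart (n starts at 1):
// the delay doubles on every restart until it reaches maxDelay.
func restartDelay(n int) time.Duration {
	d := initialDelay
	for i := 1; i < n; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

func main() {
	for n := 1; n <= 7; n++ {
		fmt.Printf("restart %d: wait %v\n", n, restartDelay(n))
	}
}
```

The sixth restart would nominally wait 320 seconds, so from that point on the delay stays at the five-minute cap.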

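Pod Lifetime

In general, Pods do not disappear until someone destroys them. That someone may be a human or a controller. The only exception to this rule is that Pods whose phase has been Succeeded or Failed for longer than some duration (determined by the master) expire and are destroyed automatically.

There are three types of controllers available:

Use a Job for Pods that are expected to terminate, for example batch computations. Jobs are appropriate only for Pods with a restartPolicy of OnFailure or Never.
Use a ReplicationController, ReplicaSet, or Deployment for Pods that are not expected to terminate, for example web servers. A ReplicationController is appropriate only for Pods with a restartPolicy of Always.
Use a DaemonSet for Pods that provide a machine-specific system service and need to run once per machine.

All three types of controllers contain a PodTemplate. It is recommended to create the appropriate controller and let it create Pods, rather than creating Pods directly yourself. A bare Pod cannot recover automatically from a machine failure, whereas a controller can.

If a node dies or is disconnected from the rest of the cluster, Kubernetes applies a policy that sets the phase of all Pods on the lost node to Failed.

Examples

Advanced liveness probe example

Liveness probes are executed by the kubelet, so all requests are made in the kubelet's network namespace.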

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
  - args:
    - /server
    image: k8s.gcr.io/liveness
    livenessProbe:
      httpGet:
        # when "host" is not defined, "PodIP" will be used
        # host: my-host
        # when "scheme" is not defined, "HTTP" scheme will be used. Only "HTTP" and "HTTPS" are allowed
        # scheme: HTTPS
        path: /healthz
        port: 8080
        httpHeaders:
        - name: X-Custom-Header
          value: Awesome
      initialDelaySeconds: 15
      timeoutSeconds: 1
    name: liveness

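Example states

The Pod is running and has a single container. The container exits successfully.
- Log a completion event.
- If restartPolicy is:
  - Always: restart the container; Pod phase stays Running.
  - OnFailure: Pod phase becomes Succeeded.
  - Never: Pod phase becomes Succeeded.
The Pod is running and has a single container. The container exits with failure.
- Log a failure event.
- If restartPolicy is:
  - Always: restart the container; Pod phase stays Running.
  - OnFailure: restart the container; Pod phase stays Running.
  - Never: Pod phase becomes Failed.
The Pod is running and has two containers. Container 1 exits with failure.
- Log a failure event.
- If restartPolicy is:
  - Always: restart the container; Pod phase stays Running.
  - OnFailure: restart the container; Pod phase stays Running.
  - Never: do not restart the container; Pod phase stays Running.
- If container 1 is not running and container 2 exits:
  - Log a failure event.
  - If restartPolicy is:
    - Always: restart the container; Pod phase stays Running.
    - OnFailure: restart the container; Pod phase stays Running.
    - Never: Pod phase becomes Failed.
The Pod is running and has a single container. The container exceeds its memory limit:
- The container terminates in failure.
- Log an OOM event.
- If restartPolicy is:
  - Always: restart the container; Pod phase stays Running.
  - OnFailure: restart the container; Pod phase stays Running.
  - Never: log a failure event; Pod phase becomes Failed.
The Pod is running and a disk fails:
- Kill all containers.
- Log an appropriate event.
- Pod phase becomes Failed.
- If the Pod is run by a controller, it is recreated elsewhere.
The Pod is running and its node is partitioned from the cluster:
- The node controller waits for a timeout.
- The node controller sets the Pod phase to Failed.
- If the Pod is run by a controller, it is recreated elsewhere.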
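
The container-exit scenarios above follow one pattern, which the sketch below captures as a single function. It is a simplified model for illustration, not the kubelet's actual phase computation: it ignores init containers, events, and node-level failures, and the names phaseAfterExit and containerState are hypothetical.

```go
package main

import "fmt"

type RestartPolicy string
type PodPhase string

const (
	Always    RestartPolicy = "Always"
	OnFailure RestartPolicy = "OnFailure"
	Never     RestartPolicy = "Never"

	Running   PodPhase = "Running"
	Succeeded PodPhase = "Succeeded"
	Failed    PodPhase = "Failed"
)

// containerState is a coarse model of what a container is doing.
type containerState int

const (
	stateRunning containerState = iota
	stateExitedOK
	stateExitedFailed
)

// phaseAfterExit derives the Pod phase from the restart policy and the
// states of the Pod's containers, following the scenarios listed above.
func phaseAfterExit(policy RestartPolicy, containers []containerState) PodPhase {
	anyFailed := false
	for _, c := range containers {
		switch c {
		case stateRunning:
			return Running // a running container keeps the Pod Running
		case stateExitedFailed:
			anyFailed = true
		}
	}
	// From here on, every container has exited.
	switch {
	case policy == Always, policy == OnFailure && anyFailed:
		return Running // the container will be restarted
	case anyFailed:
		return Failed
	default:
		return Succeeded
	}
}

func main() {
	oneFailed := []containerState{stateExitedFailed}
	for _, p := range []RestartPolicy{Always, OnFailure, Never} {
		fmt.Printf("single container failed, %s -> %s\n", p, phaseAfterExit(p, oneFailed))
	}
}
```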

Kubelet

var ErrCrashLoopBackOff = errors.New("CrashLoopBackOff")

var (
	// ErrContainerNotFound returned when a container in the given pod with the
	// given container name was not found, amongst those managed by the kubelet.
	ErrContainerNotFound = errors.New("no matching container")
)

var (
	ErrRunContainer     = errors.New("RunContainerError")
	ErrKillContainer    = errors.New("KillContainerError")
	ErrVerifyNonRoot    = errors.New("VerifyNonRootError")
	ErrRunInitContainer = errors.New("RunInitContainerError")
	ErrCreatePodSandbox = errors.New("CreatePodSandboxError")
	ErrConfigPodSandbox = errors.New("ConfigPodSandboxError")
	ErrKillPodSandbox   = errors.New("KillPodSandboxError")
)

var (
	ErrSetupNetwork    = errors.New("SetupNetworkError")
	ErrTeardownNetwork = errors.New("TeardownNetworkError")
)
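
These are sentinel errors, and their messages are the strings that typically surface as a container's waiting reason (CrashLoopBackOff being the familiar one). The sketch below shows the general Go pattern for matching such sentinels with errors.Is even after they have been wrapped. Two of the variables are copied from the listing above; startContainer and waitingReason are hypothetical helpers written for this example and do not reflect how the kubelet is structured internally.

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel errors copied from the listing above.
var (
	ErrCrashLoopBackOff = errors.New("CrashLoopBackOff")
	ErrRunContainer     = errors.New("RunContainerError")
)

// startContainer is a hypothetical stand-in for an operation that can fail.
// It wraps the sentinel with %w so callers can still match it.
func startContainer(name string, inBackOff bool) error {
	if inBackOff {
		return fmt.Errorf("container %q: %w", name, ErrCrashLoopBackOff)
	}
	return nil
}

// waitingReason maps an error to a short reason string by matching it
// against the known sentinels.
func waitingReason(err error) string {
	switch {
	case err == nil:
		return ""
	case errors.Is(err, ErrCrashLoopBackOff):
		return ErrCrashLoopBackOff.Error()
	case errors.Is(err, ErrRunContainer):
		return ErrRunContainer.Error()
	default:
		return "Unknown"
	}
}

func main() {
	err := startContainer("busybox", true)
	fmt.Println("error: ", err)
	fmt.Println("reason:", waitingReason(err))
}
```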

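Kubernetes Pod status analysis

The walkthrough below creates a Pod and looks at the events triggered between creation and a successful start, along with the changes in the Pod's status.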

kubectl apply -f busybox.yaml

Step 1: the data is written to etcd

event type: ADDED 
event object: 
{
	"phase": "Pending",
	"qosClass": "BestEffort"
}

Step 2: scheduling has started, but the Pod has not yet landed on a specific node. Note that PodScheduled already has status "True".

event type: MODIFIED
{
  "phase": "Pending",
  "conditions": [
    {
      "type": "PodScheduled",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2017-06-06T07:57:06Z"
    }
  ],
  "qosClass": "BestEffort"
}

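Step 3: the Pod has been scheduled to a specific node and hostIP is set. Initialized is already True (the Pod in my busybox.yaml defines no init containers, so this condition is set to True almost immediately). The target node sees the Pod through its watch and starts creating the container (at this stage it is pulling the image). Ready is still False, and a closer look shows that the container's state in containerStatuses is waiting.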

event type: MODIFIED
{
  "phase": "Pending",
  "conditions": [
    {
      "type": "Initialized",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2017-06-06T07:57:06Z"
    },
    {
      "type": "Ready",
      "status": "False",
      "lastProbeTime": null,
      "lastTransitionTime": "2017-06-06T07:57:06Z",
      "reason": "ContainersNotReady",
      "message": "containers with unready status: [busybox]"
    },
    {
      "type": "PodScheduled",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2017-06-06T07:57:06Z"
    }
  ],
  "hostIP": "10.39.1.35",
  "startTime": "2017-06-06T07:57:06Z",
  "containerStatuses": [
    {
      "name": "busybox",
      "state": {
        "waiting": {
          "reason": "ContainerCreating"
        }
      },
      "lastState": {},
      "ready": false,
      "restartCount": 0,
      "image": "busybox",
      "imageID": ""
    }
  ],
  "qosClass": "BestEffort"
}

Step 4: the container has been created and Ready has status "True". The container's state is now running, and the Pod's phase is Running accordingly.

event type: MODIFIED
{
  "phase": "Running",
  "conditions": [
    {
      "type": "Initialized",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2017-06-06T07:57:06Z"
    },
    {
      "type": "Ready",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2017-06-06T07:57:08Z"
    },
    {
      "type": "PodScheduled",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2017-06-06T07:57:06Z"
    }
  ],
  "hostIP": "10.39.1.35",
  "podIP": "192.168.107.204",
  "startTime": "2017-06-06T07:57:06Z",
  "containerStatuses": [
    {
      "name": "busybox",
      "state": {
        "running": {
          "startedAt": "2017-06-06T07:57:08Z"
        }
      },
      "lastState": {},
      "ready": true,
      "restartCount": 0,
      "image": "busybox:latest",
      "imageID": "docker-pullable://busybox@sha256:c79345819a6882c31b41bc771d9a94fc52872fa651b36771fbe0c8461d7ee558",
      "containerID": "docker://a6af9d58c7dabf55fdfe8d4222b2c16349e3b49b3d0eca4bc761fdb571f3cf44"
    }
  ],
  "qosClass": "BestEffort"
}
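
When following a sequence of status documents like the ones above, the two things worth extracting from each are usually the phase and whether the Ready condition is True. Here is a minimal sketch using only the standard library. The podStatus struct covers just the fields it reads, summarize is a hypothetical helper, and the two inputs are abbreviated versions of steps 1 and 4.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podStatus declares only the fields this sketch reads; all other keys
// in the JSON document are ignored by the decoder.
type podStatus struct {
	Phase      string `json:"phase"`
	Conditions []struct {
		Type   string `json:"type"`
		Status string `json:"status"`
	} `json:"conditions"`
}

// summarize decodes one status document and reports the phase and
// whether the Ready condition has status "True".
func summarize(raw string) (phase string, ready bool, err error) {
	var s podStatus
	if err = json.Unmarshal([]byte(raw), &s); err != nil {
		return "", false, err
	}
	for _, c := range s.Conditions {
		if c.Type == "Ready" && c.Status == "True" {
			ready = true
		}
	}
	return s.Phase, ready, nil
}

func main() {
	step1 := `{"phase": "Pending", "qosClass": "BestEffort"}`
	step4 := `{"phase": "Running", "conditions": [{"type": "Ready", "status": "True"}]}`
	for i, raw := range []string{step1, step4} {
		phase, ready, err := summarize(raw)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Printf("document %d: phase=%s ready=%t\n", i+1, phase, ready)
	}
}
```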