Garden | GPU Monitor

Background

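When GPUs are used for deep learning training and inference, we need visibility into GPU usage across the cluster:

To decide, from current GPU resource usage, whether new applications can be deployed and whether the cluster needs to scale out, so that GPU services get the same capacity guarantees as CPU services and the GPU gap in capacity assurance is closed
To find bottlenecks and weak points from current GPU resource usage, drive optimization, and improve resource utilization and service performance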

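To obtain GPU monitoring data, NVIDIA offers the following three approaches:

NVML: NVIDIA Management Library, a C-based library for monitoring and managing GPUs; the nvidia-smi command is built on it
DCGM: Data Center GPU Manager, a complete suite of GPU monitoring and management tools built on NVML and CUDA
Third-party tools: monitoring tools built on DCGM or NVML that can be combined with tools such as Prometheus to provide a database, a UI, and so on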

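Comparing the characteristics of the three:

NVML
- Stateless queries; only current data can be queried
- A low-level API for controlling GPUs
- Management tools built on NVML are cheap to run but costly to develop
- Management tools built on NVML must run on the same node as the GPU
DCGM
- Can query several hours of metric history
- Provides GPU health checks and diagnostics
- Can query a group of GPUs in one batch
- Can run in either remote or local mode
Third-party tools
- Provide a database, graphs, and a polished UI

The rest of this article focuses on DCGM.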

DCGM

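The figure below shows how DCGM runs in a cluster. DCGM is deployed as an agent on the compute nodes, and tools on the management node manage and monitor GPUs through the API that DCGM provides.

DCGM provides the following four key features: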

Active Health Monitoring
GPU Diagnostics
Policy and Alerting
Configuration Management

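Installation and Deployment

DCGM must be downloaded and installed separately. Download the matching package from the NVIDIA website; the rpm package is used here. After the download completes: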

# Remove any previously installed DCGM version
$ yum remove datacenter-gpu-manager
# Install
$ rpm -ivh datacenter-gpu-manager-2.0.13-1-x86_64.rpm

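DCGM's shared libraries are installed to the /usr/lib64 directory
The Python bindings are installed to the /usr/local/dcgm/bindings directory

DCGM is designed for cluster management, so before using it you must start an agent, nv-hostengine, on the target machine. The commands are as follows: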

# Start nv-hostengine
$ nv-hostengine --port 39999 --bind-interface 127.0.0.1
Host Engine Listener Started
Started host engine version 2.0.13 using port number: 39999

# List devices
$ dcgmi discovery --host 127.0.0.1:39999 -l
4 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: Tesla T4                                                       |
|        | PCI Bus ID: 00000000:00:08.0                                         |
|        | Device UUID: GPU-0bf43c76-0f1a-f49f-a362-92d5b9bbbc9f                |
+--------+----------------------------------------------------------------------+
| 1      | Name: Tesla T4                                                       |
|        | PCI Bus ID: 00000000:00:09.0                                         |
|        | Device UUID: GPU-c55a4e5e-47dd-48c2-99d0-2630042bf619                |
+--------+----------------------------------------------------------------------+
| 2      | Name: Tesla T4                                                       |
|        | PCI Bus ID: 00000000:00:0A.0                                         |
|        | Device UUID: GPU-95e5fe58-f03d-815e-c871-65637b623aca                |
+--------+----------------------------------------------------------------------+
| 3      | Name: Tesla T4                                                       |
|        | PCI Bus ID: 00000000:00:0B.0                                         |
|        | Device UUID: GPU-70747d1b-2b7a-9895-29f5-485608c1742e                |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+

# Stop nv-hostengine (for demonstration only; it must be running for the steps that follow)
$ nv-hostengine -t
Host engine successfully terminated.

Here, --port and --bind-interface set the listening port and the bound IP address, respectively. Communication over a UNIX socket is also supported.

Once nv-hostengine is running, we can operate on it with dcgmi.
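
When dcgmi is driven from a script rather than by hand, its table output has to be parsed. The following is a minimal sketch that parses the `dcgmi discovery -l` table shown above. The helper names `parse_discovery` and `list_gpus` are made up for illustration, and the parser assumes the table layout stays as in this transcript, which may differ between DCGM versions.

```python
import re
import subprocess

# One table row: "| <optional GPU id> | <key>: <value> |"
ROW = re.compile(r"^\|\s*(\d*)\s*\|\s*([^:|]+):\s*(.*?)\s*\|$")


def parse_discovery(text):
    """Turn the `dcgmi discovery -l` table into a list of dicts, one per GPU."""
    gpus = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # border, header, or summary line
        gpu_id, key, value = match.groups()
        if gpu_id:  # a non-empty first column starts a new GPU entry
            gpus.append({"id": int(gpu_id)})
        if gpus:
            gpus[-1][key.strip()] = value
    return gpus


def list_gpus(host="127.0.0.1:39999"):
    """Hypothetical wrapper: needs dcgmi on PATH and a running nv-hostengine."""
    result = subprocess.run(
        ["dcgmi", "discovery", "--host", host, "-l"],
        capture_output=True, text=True, check=True,
    )
    return parse_discovery(result.stdout)


# Two entries copied from the transcript above.
SAMPLE = """\
| 0      | Name: Tesla T4                                                       |
|        | PCI Bus ID: 00000000:00:08.0                                         |
|        | Device UUID: GPU-0bf43c76-0f1a-f49f-a362-92d5b9bbbc9f                |
+--------+----------------------------------------------------------------------+
| 1      | Name: Tesla T4                                                       |
|        | PCI Bus ID: 00000000:00:09.0                                         |
|        | Device UUID: GPU-c55a4e5e-47dd-48c2-99d0-2630042bf619                |
"""

for gpu in parse_discovery(SAMPLE):
    print(gpu["id"], gpu["Name"], gpu["Device UUID"])
```

Parsing is kept separate from the subprocess call so it can be tested on captured text without a GPU.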

Group Operations

Unlike NVML, most DCGM features are group-oriented, so you must create a group before using the features DCGM provides.

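# After obtaining the device list, create a group with the following command
# On success the command prints the group ID, which later operations need (group ID 2 in this example)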
$ dcgmi group --host 127.0.0.1:39999 -c GPU_GROUP
Successfully created group "GPU_GROUP" with a group ID of 2

$ dcgmi group --host 127.0.0.1:39999 -l
+-------------------+----------------------------------------------------------+
| GROUPS                                                                       |
| 1 group found.                                                               |
+===================+==========================================================+
| Groups            |                                                          |
| -> 2              |                                                          |
|    -> Group ID    | 2                                                        |
|    -> Group Name  | GPU_GROUP                                                |
|    -> Entities    | None                                                     |
+-------------------+----------------------------------------------------------+

$ dcgmi discovery --host 127.0.0.1:39999 -l
4 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: Tesla T4                                                       |
|        | PCI Bus ID: 00000000:00:08.0                                         |
|        | Device UUID: GPU-0bf43c76-0f1a-f49f-a362-92d5b9bbbc9f                |
+--------+----------------------------------------------------------------------+
| 1      | Name: Tesla T4                                                       |
|        | PCI Bus ID: 00000000:00:09.0                                         |
|        | Device UUID: GPU-c55a4e5e-47dd-48c2-99d0-2630042bf619                |
+--------+----------------------------------------------------------------------+
| 2      | Name: Tesla T4                                                       |
|        | PCI Bus ID: 00000000:00:0A.0                                         |
|        | Device UUID: GPU-95e5fe58-f03d-815e-c871-65637b623aca                |
+--------+----------------------------------------------------------------------+
| 3      | Name: Tesla T4                                                       |
|        | PCI Bus ID: 00000000:00:0B.0                                         |
|        | Device UUID: GPU-70747d1b-2b7a-9895-29f5-485608c1742e                |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+

# After creating the group, add devices to it with the following command
$ dcgmi group --host 127.0.0.1:39999 -g 2 -a 0,1
Add to group operation successful.

$ dcgmi group --host 127.0.0.1:39999 -g 2 -i
+-------------------+----------------------------------------------------------+
| GROUP INFO                                                                   |
+===================+==========================================================+
| 2                 |                                                          |
| -> Group ID       | 2                                                        |
| -> Group Name     | GPU_GROUP                                                |
| -> Entities       | GPU 0, GPU 1                                             |
+-------------------+----------------------------------------------------------+

# Remove devices from the group with the following command
$ dcgmi group --host 127.0.0.1:39999 -g 2 -r 0,1
Remove from group operation successful.

# Delete the group with the following command
$ dcgmi group --host 127.0.0.1:39999 -d 2

Note: groups and devices have a many-to-many relationship.
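
Every later command needs the group ID, so a script has to capture it from the creation message. A small sketch, assuming the success message keeps the wording shown in the transcript; `parse_group_id` and `add_devices_command` are hypothetical helper names, not part of DCGM.

```python
import re

# Matches the success line printed by `dcgmi group -c <name>` in the transcript.
CREATED = re.compile(r'Successfully created group "([^"]+)" with a group ID of (\d+)')


def parse_group_id(output):
    """Extract the numeric group ID from the group creation message."""
    match = CREATED.search(output)
    if match is None:
        raise ValueError("group creation did not succeed: %r" % output)
    return int(match.group(2))


def add_devices_command(host, group_id, gpu_ids):
    """Build the argv for `dcgmi group -g <id> -a <comma-separated GPU ids>`."""
    gpu_list = ",".join(str(g) for g in gpu_ids)
    return ["dcgmi", "group", "--host", host, "-g", str(group_id), "-a", gpu_list]


output = 'Successfully created group "GPU_GROUP" with a group ID of 2'
group_id = parse_group_id(output)
print(" ".join(add_devices_command("127.0.0.1:39999", group_id, [0, 1])))
```

The printed command is the same add-to-group command used in the transcript; the argv list can be passed to `subprocess.run` unchanged.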

Job Statistics

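When a job needs GPU-accelerated computation, we want to know:

Which GPUs my job is running on
How much GPU my job is using
Whether any errors or warnings occurred while my job was running
Whether the system's GPUs are all healthy and ready for the next job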

# Group 3 currently looks like this
$ dcgmi group --host 127.0.0.1:39999 -g 3 -i
+-------------------+----------------------------------------------------------+
| GROUP INFO                                                                   |
+===================+==========================================================+
| 3                 |                                                          |
| -> Group ID       | 3                                                        |
| -> Group Name     | GPU_GROUP                                                |
| -> Entities       | GPU 0, GPU 1, GPU 2, GPU 3                               |
+-------------------+----------------------------------------------------------+

# Before dcgmi can report GPU statistics, stats recording must be enabled with the following command
$ dcgmi stats --host 127.0.0.1:39999 -g 3 --enable
Successfully started process watches.

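# Once stats recording is enabled, view the statistics of a specific process with the following command
# This assumes a CUDA application process has been started and is computing on the GPU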
$ dcgmi stats --host 127.0.0.1:39999 -g 3 -p 41861 -v
Successfully retrieved process info for PID: 41861. Process ran on 1 GPUs.
+------------------------------------------------------------------------------+
| GPU ID: 3                                                                    |
+====================================+=========================================+
|-----  Execution Stats  ------------+-----------------------------------------|
| Start Time                     *   | Wed Jan  6 16:54:16 2021                |
| End Time                       *   | Still Running                           |
| Total Execution Time (sec)     *   | Still Running                           |
| No. of Conflicting Processes   *   | 0                                       |
+-----  Performance Stats  ----------+-----------------------------------------+
| Energy Consumed (Joules)           | 2985                                    |
| Max GPU Memory Used (bytes)    *   | 12107907072                             |
| SM Clock (MHz)                     | Avg: 1590, Max: 1590, Min: 1590         |
| Memory Clock (MHz)                 | Avg: 5000, Max: 5000, Min: 5000         |
| SM Utilization (%)                 | Avg: 100, Max: 100, Min: 100            |
| Memory Utilization (%)             | Avg: 5, Max: 5, Min: 5                  |
| PCIe Rx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
| PCIe Tx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
+-----  Event Stats  ----------------+-----------------------------------------+
| Double Bit ECC Errors              | 0                                       |
| PCIe Replay Warnings               | 0                                       |
| Critical XID Errors                | 0                                       |
+-----  Slowdown Stats  -------------+-----------------------------------------+
| Due to - Power (%)                 | 0                                       |
|        - Thermal (%)               | 0                                       |
|        - Reliability (%)           | 0                                       |
|        - Board Limit (%)           | 0                                       |
|        - Low Utilization (%)       | 0                                       |
|        - Sync Boost (%)            | 0                                       |
+-----  Process Utilization  --------+-----------------------------------------+
| PID                                | 41861                                   |
|     Avg SM Utilization (%)         | 99                                      |
|     Avg Memory Utilization (%)     | 3                                       |
+-----  Overall Health  -------------+-----------------------------------------+
| Overall Health                     | Healthy                                 |
+------------------------------------+-----------------------------------------+

(*) Represents a process statistic. Otherwise device statistic during
    process lifetime listed.
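
To feed these job statistics into another system, the two-column table can be converted into a dictionary. The sketch below works on a few rows copied from the output above. `parse_stats` and `parse_value` are illustrative names, and the row format is assumed to match this transcript.

```python
def parse_value(raw):
    """Convert a value cell: 'Avg/Max/Min' triples become dicts, digits become ints."""
    if raw.startswith("Avg:"):
        triple = {}
        for part in raw.split(","):
            key, _, value = part.partition(":")
            value = value.strip()
            triple[key.strip().lower()] = None if value == "N/A" else int(value)
        return triple
    return int(raw) if raw.isdigit() else raw


def parse_stats(text):
    """Parse '| name | value |' rows; borders and section headers are skipped."""
    stats = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 2 or not cells[0]:
            continue
        name = cells[0].rstrip("*").strip()  # drop the process-statistic marker
        stats[name] = parse_value(cells[1])
    return stats


# Rows copied from the transcript above.
SAMPLE = """\
+-----  Performance Stats  ----------+-----------------------------------------+
| Energy Consumed (Joules)           | 2985                                    |
| Max GPU Memory Used (bytes)    *   | 12107907072                             |
| SM Clock (MHz)                     | Avg: 1590, Max: 1590, Min: 1590         |
| SM Utilization (%)                 | Avg: 100, Max: 100, Min: 100            |
| PCIe Rx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
"""

stats = parse_stats(SAMPLE)
max_mem_gib = stats["Max GPU Memory Used (bytes)"] / 2**30
print("peak memory: %.2f GiB" % max_mem_gib)
print("average SM utilization:", stats["SM Utilization (%)"]["avg"])
```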

Configuration Management

DCGM can change GPU settings. The supported settings are listed below; first, view the existing settings:

$ dcgmi config  --host 127.0.0.1:39999 -g 3 --get
+------------------------------+------------------------------+------------------------------+
| GPU_GROUP                                                                                  |
| Group of 4 GPUs                                                                            |
+==============================+==============================+==============================+
| Field                        | Target                       | Current                      |
+------------------------------+------------------------------+------------------------------+
| Compute Mode                 | Not Specified                | Unrestricted                 |
| ECC Mode                     | Not Specified                | Enabled                      |
| Sync Boost                   | Not Specified                | Not Supported                |
| Memory Application Clock     | Not Specified                | 5001                         |
| SM Application Clock         | Not Specified                | 585                          |
| Power Limit                  | Not Specified                | 70                           |
+------------------------------+------------------------------+------------------------------+

# Parameter descriptions
$ dcgmi config -h

 config -- Used to configure settings for groups of GPUs.

Usage: dcgmi config
   dcgmi config [--host <IP/FQDN>] [-g <groupId>] --enforce
   dcgmi config [--host <IP/FQDN>] [-g <groupId>] --get [-v] [-j]
   dcgmi config [--host <IP/FQDN>] [-g <groupId>] --set [-e <0/1>] [-s
        <0/1>] [-a <mem,proc>] [-P <limit>] [-c <mode>]

  ...
  -c  --compmode   mode       Configure Compute Mode. Can be any of the
                               following:
                               0 - Unrestricted
                               1 - Prohibited
                               2 - Exclusive Process
  -P  --powerlimit limit      Configure Power Limit (Watts).
  -a  --appclocks  mem,proc   Configure Application Clocks. Must use memory,proc
                               clocks (csv) format(MHz).
  -s  --syncboost  0/1        Configure Syncboost. (1 to Enable, 0 to Disable)
  -e  --eccmode    0/1        Configure Ecc mode. (1 to Enable, 0 to Disable)

# Change a setting
$ dcgmi config  --host 127.0.0.1:39999 -g 3 --set -c 2

# Check the result
$ dcgmi config  --host 127.0.0.1:39999 -g 3 --get
+------------------------------+------------------------------+------------------------------+
| GPU_GROUP                                                                                  |
| Group of 4 GPUs                                                                            |
+==============================+==============================+==============================+
| Field                        | Target                       | Current                      |
+------------------------------+------------------------------+------------------------------+
| Compute Mode                 | E. Process                   | E. Process                   |
| ECC Mode                     | Not Specified                | Enabled                      |
| Sync Boost                   | Not Specified                | Not Supported                |
| Memory Application Clock     | Not Specified                | 5001                         |
| SM Application Clock         | Not Specified                | 585                          |
| Power Limit                  | Not Specified                | 70                           |
+------------------------------+------------------------------+------------------------------+

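Note that DCGM changes settings declaratively: the user specifies the desired target settings through dcgmi, and nv-hostengine automatically adjusts the GPUs so that the current settings converge to the target settings.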
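
The Target and Current columns can be read as a small reconciliation problem. The sketch below only illustrates the declarative idea; it is not how nv-hostengine is implemented. The function names are invented, and the field values are taken from the two tables above.

```python
NOT_SPECIFIED = "Not Specified"


def pending_changes(target, current):
    """Fields whose current value must change to match an explicitly set target."""
    return {
        field: wanted
        for field, wanted in target.items()
        if wanted != NOT_SPECIFIED and current.get(field) != wanted
    }


def reconcile(target, current):
    """Return the settings after one reconciliation pass (the input is not mutated)."""
    updated = dict(current)
    updated.update(pending_changes(target, current))
    return updated


# State right after `--set -c 2`, before the engine has applied it.
target = {"Compute Mode": "E. Process", "ECC Mode": NOT_SPECIFIED, "Power Limit": NOT_SPECIFIED}
current = {"Compute Mode": "Unrestricted", "ECC Mode": "Enabled", "Power Limit": "70"}

print(pending_changes(target, current))
current = reconcile(target, current)
print(pending_changes(target, current))  # empty once current matches target
```

Fields left at "Not Specified" are never touched, which matches the second table: only Compute Mode changes.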

Policy and Alerting

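DCGM provides a policy feature. A policy is essentially a watch mechanism: you first define a violation condition and can then configure the action to take when that condition is violated. Typically, you set a condition, register a listener, and wait for DCGM to notify you.

For example: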

# Set a violation condition of 50 degrees maximum temperature with the following command
$ dcgmi policy --host 127.0.0.1:39999 -g 3 --set 0,0 -T 50

# Query the policy that was just set with the following command
$ dcgmi policy --host 127.0.0.1:39999 -g 3 --get
Policy information
+-----------------------------+------------------------------------------------+
| Policy Information                                                           |
| GPU_GROUP                                                                    |
+=============================+================================================+
| Violation conditions        | Max temperature threshold - 50                 |
| Isolation mode              | Manual                                         |
| Action on violation         | None                                           |
| Validation after action     | None                                           |
| Validation failure action   | None                                           |
+-----------------------------+------------------------------------------------+

$ dcgmi policy --host 127.0.0.1:39999 -g 3 --reg
Timestamp: Wed Jan  6 17:02:27 2021
The maximum thermal limit has violated policy manager values.
Temperature: 65
Listening for violations.
Timestamp: Wed Jan  6 17:02:37 2021
The maximum thermal limit has violated policy manager values.
Temperature: 65
...
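
A listener process usually wants structured events rather than raw text. Below is a minimal sketch that groups the `--reg` output above into one record per violation. `parse_violations` is a hypothetical helper, and the line prefixes are assumed to stay as shown in this transcript.

```python
def parse_violations(lines):
    """Group the listener output into one dict per 'Timestamp:' block."""
    events = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith("Timestamp:"):
            current = {"timestamp": line.split(":", 1)[1].strip()}
            events.append(current)
        elif current is not None and line.startswith("Temperature:"):
            current["temperature"] = int(line.split(":", 1)[1])
        elif current is not None and "violated" in line:
            current["message"] = line
    return events


# Copied from the transcript above.
SAMPLE = """\
Timestamp: Wed Jan  6 17:02:27 2021
The maximum thermal limit has violated policy manager values.
Temperature: 65
Listening for violations.
Timestamp: Wed Jan  6 17:02:37 2021
The maximum thermal limit has violated policy manager values.
Temperature: 65
"""

THRESHOLD = 50  # the -T value set earlier
for event in parse_violations(SAMPLE.splitlines()):
    over = event["temperature"] - THRESHOLD
    print(event["timestamp"], "exceeded the limit by", over, "degrees")
```

Because the function accepts any iterable of lines, it could also consume the stdout of a long-running `dcgmi policy --reg` process line by line.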

Parameters

   --set        actn,val   (OR required)  Set the current violation policy.
                               Use csv action,validation (ie. 1,2)
                               -----
                               Action to take when any of the violations
                               specified occur.
                               0 - None
                               1 - GPU Reset
                               -----
                               Validation to take after the violation action has
                               been performed.
                               0 - None
                               1 - System Validation (short)
                               2 - System Validation (medium)
                               3 - System Validation (long)
  -x  --xiderrors             Add XID errors to the policy conditions.
  -n  --nvlinkerrors           Add NVLink errors to the policy conditions.
  -p  --pcierrors             Add PCIe replay errors to the policy conditions.
  -e  --eccerrors             Add ECC double bit errors to the policy
                               conditions.
  -P  --maxpower   max        Specify the maximum power a group's GPUs can reach
                               before triggering a violation.
  -T  --maxtemp    max        Specify the maximum temperature a group's GPUs can
                               reach before triggering a violation.
  -M  --maxpages   max        Specify the maximum number of retired pages that
                               will trigger a violation.

Health check

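DCGM health checks are non-invasive and provide real-time monitoring and aggregated health data. They work as follows:

Enable health checks and choose which items to watch
DCGM runs in the background and monitors the corresponding components according to that configuration
The user runs the dcgmi health command to query the errors found so far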

$ dcgmi health --check -g 1
Health Monitor Report
+------------------+---------------------------------------------------------+
| Overall Health:   Healthy                                                  |
+==================+=========================================================+

$ dcgmi health --check -g 1
Health Monitor Report
+----------------------------------------------------------------------+
| Group 1       | Overall Health: Warning                              |
+==================+===================================================+
| GPU ID: 0     | Warning                                              |
|               | PCIe system: Warning - Detected more than 8 PCIe     |
|               | replays per minute for GPU 0: 13                     |
+---------------+------------------------------------------------------+
| GPU ID: 1     | Warning                                              |
|               | InfoROM system: Warning - A corrupt InfoROM has been |
|               | detected in GPU 1.                                   |
+---------------+------------------------------------------------------+
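
For alerting, the report can be reduced to one status per GPU plus a worst-case summary. The sketch below assumes the overall status is simply the most severe per-GPU status, which is consistent with the sample report but not confirmed by it. The "Failure" level and the helper names are assumptions as well.

```python
import re

# Ordered from best to worst. "Failure" is an assumed third level; the sample
# reports above only show "Healthy" and "Warning".
SEVERITY = {"Healthy": 0, "Warning": 1, "Failure": 2}
GPU_ROW = re.compile(r"^\|\s*GPU ID:\s*(\d+)\s*\|\s*(\w+)\s*\|$")


def per_gpu_status(report):
    """Map GPU ID to the status word on its 'GPU ID: n' row."""
    status = {}
    for line in report.splitlines():
        match = GPU_ROW.match(line.strip())
        if match:
            status[int(match.group(1))] = match.group(2)
    return status


def worst_status(statuses):
    """Most severe status in the iterable; 'Healthy' when nothing was reported."""
    return max(statuses, key=lambda s: SEVERITY[s], default="Healthy")


# Rows copied from the second report above.
SAMPLE = """\
| GPU ID: 0     | Warning                                              |
|               | PCIe system: Warning - Detected more than 8 PCIe     |
|               | replays per minute for GPU 0: 13                     |
+---------------+------------------------------------------------------+
| GPU ID: 1     | Warning                                              |
|               | InfoROM system: Warning - A corrupt InfoROM has been |
|               | detected in GPU 1.                                   |
"""

status = per_gpu_status(SAMPLE)
print(status, "->", worst_status(status.values()))
```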

GPU Diagnostics

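Diagnostics are an active check. Three levels are available; each run executes the test programs for the selected level to uncover problems.

Run it as follows: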

$ dcgmi diag --host 127.0.0.1:39999 -g 3 -r 1
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Deployment  --------+------------------------------------------------|
| Blacklist                 | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement           | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+---------------------------+------------------------------------------------+

Profile

The profile feature collects GPU utilization data and per-process performance data with low overhead. It has some hard requirements on the driver version and GPU model:

DCGM version later than 1.7
Driver version later than 418.43
nv-hostengine started as root
Currently only Tesla V100 and Tesla T4 cards are supported

The available performance metrics are:

Metric	Description	FIELD_NAME
Graphics Engine Activity	Ratio of time the graphics engine is active. The graphics engine is active if a graphics/compute context is bound and the graphics pipe or compute pipe is busy.	PROF_GR_ENGINE_ACTIVE (ID: 1001)
SM Activity	The ratio of cycles an SM has at least 1 warp assigned (computed from the number of cycles and elapsed cycles)	PROF_SM_ACTIVE (ID: 1002)
SM Occupancy	The ratio of number of warps resident on an SM. (number of resident warps as a percentage of the theoretical maximum number of warps per elapsed cycle)	PROF_SM_OCCUPANCY (ID: 1003)
Tensor Activity	The ratio of cycles the tensor (HMMA) pipe is active (off the peak sustained elapsed cycles)	PROF_PIPE_TENSOR_ACTIVE (ID: 1004)
Memory BW Utilization	The ratio of cycles the device memory interface is active sending or receiving data.	PROF_DRAM_ACTIVE (ID: 1005)
Engine Activity	Ratio of cycles the fp64 /fp32 / fp16 / HMMA / IMMA pipes are active.	PROF_PIPE_FPXY_ACTIVE (ID: 1006 (FP64); 1007 (FP32); 1008 (FP16))
NVLink Activity	The number of bytes of active NVLink rx or tx data including both header and payload.	DEV_NVLINK_BANDWIDTH_L0
PCIe Bandwidth pci_bytes{rx, tx}	The number of bytes of active PCIe rx or tx data including both header and payload.	PROF_PCIE_[TR]X_BYTES (ID: 1009 (TX); 1010 (RX))
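
The field IDs in the table are what command-line tools take as input. The sketch below turns the table into a lookup and builds a `dcgmi dmon` command from field names. The FP64/FP32/FP16 names are expanded from the PROF_PIPE_FPXY_ACTIVE pattern in the table. The `-e` (field IDs) and `-d` (delay in milliseconds) flags are how I understand `dcgmi dmon` to work, so verify them with `dcgmi dmon -h` for your version.

```python
# Field IDs taken from the table above.
PROF_FIELD_IDS = {
    "PROF_GR_ENGINE_ACTIVE": 1001,
    "PROF_SM_ACTIVE": 1002,
    "PROF_SM_OCCUPANCY": 1003,
    "PROF_PIPE_TENSOR_ACTIVE": 1004,
    "PROF_DRAM_ACTIVE": 1005,
    "PROF_PIPE_FP64_ACTIVE": 1006,
    "PROF_PIPE_FP32_ACTIVE": 1007,
    "PROF_PIPE_FP16_ACTIVE": 1008,
    "PROF_PCIE_TX_BYTES": 1009,
    "PROF_PCIE_RX_BYTES": 1010,
}


def dmon_command(host, group_id, fields, interval_ms=1000):
    """Build an argv that watches the given profiling fields for one group.

    The flag names are assumptions; check `dcgmi dmon -h` before relying on them.
    """
    unknown = [f for f in fields if f not in PROF_FIELD_IDS]
    if unknown:
        raise KeyError("unknown profiling fields: %s" % ", ".join(unknown))
    ids = ",".join(str(PROF_FIELD_IDS[f]) for f in fields)
    return ["dcgmi", "dmon", "--host", host, "-g", str(group_id),
            "-e", ids, "-d", str(interval_ms)]


cmd = dmon_command("127.0.0.1:39999", 3, ["PROF_SM_ACTIVE", "PROF_DRAM_ACTIVE"])
print(" ".join(cmd))
```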

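Integrating GPU Telemetry in k8s

System monitoring usually requires the following components:

Data collection component: a collector that serves as the data source
Time-series database component: stores the collected metrics
Visualization component: presents the collected data in a friendly visual interface

Prometheus is an excellent solution of the cloud-native era; combined with components such as Grafana and Alertmanager, it provides system monitoring for k8s clusters. Its component architecture is shown below; for more details, see my other blog post.

Likewise, to obtain GPU monitoring data, NVIDIA released dcgm-exporter, which wraps DCGM and, like node-exporter, exposes GPU data to Prometheus:

Deploying dcgm-exporter

dcgm-exporter runs as a DaemonSet on every Node that has GPUs. A Service is created alongside it so that Prometheus can scrape the data it collects.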

apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: kube-system
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.1.1"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.1.1"
      name: "dcgm-exporter"
    spec:
      containers:
      - image: "nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
---
kind: Service
apiVersion: v1
metadata:
  namespace: kube-system
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.1.1"
  ports:
  - name: "metrics"
    port: 9400

After this step, the metrics on each Node can be retrieved:
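
The exporter serves the standard Prometheus text format on port 9400 under /metrics. As a rough sketch, the snippet below parses a few such lines into name, labels, and value. The sample values are invented for illustration and are not taken from a real scrape. `DCGM_FI_DEV_GPU_UTIL` and `DCGM_FI_DEV_FB_USED` are metric names I expect dcgm-exporter to emit, but the exact set depends on its configuration.

```python
import re

# "<name>{<labels>} <value>"; comment lines starting with '#' are skipped.
LINE = re.compile(r"^(\w+)\{(.*)\}\s+(\S+)$")
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse labelled samples in Prometheus text format into (name, labels, value)."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append((name, dict(LABEL.findall(labels)), float(value)))
    return samples


# Invented sample; the UUIDs are reused from the discovery output earlier in the post.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-0bf43c76-0f1a-f49f-a362-92d5b9bbbc9f"} 37
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-c55a4e5e-47dd-48c2-99d0-2630042bf619"} 0
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-0bf43c76-0f1a-f49f-a362-92d5b9bbbc9f"} 11547
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, "gpu=" + labels["gpu"], value)
```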

After deployment, add a gpu-metrics job to scrape_configs in the Prometheus configuration; it locates the dcgm-exporter service through the kubernetes_sd_configs service discovery mechanism.

- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - kube-system
    selectors:
    - role: pod
      label: "app.kubernetes.io/name=dcgm-exporter"
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node

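Monitoring with Grafana

NVIDIA provides a dedicated Grafana dashboard for GPU monitoring. After importing it into Grafana, the GPU monitoring dashboard is available:

OpenFalcon GPU Monitoring Plugin

OpenFalcon is an open-source monitoring solution from Xiaomi; its architecture is shown in the figure below. A falcon-agent daemon runs on every node and collects data from that node.

To support GPU monitoring, OpenFalcon has a dedicated GPU monitoring plugin that relies on DCGM to obtain metrics. Some commonly used metrics are:

GPUUtils             GPU utilization (%)
MemUtils             GPU memory utilization (%)
FBUsed               GPU memory used (MB)
Performance          GPU performance state (0-15, where 0 is the highest)
DeviceTemperature    Current GPU device temperature (℃)
PowerUsed            GPU power draw
SingleBitError       Total accumulated single-bit ECC errors
DoubleBitError       Total accumulated double-bit ECC errors

GPU Manager Monitoring Data Analysis

Unlike OpenFalcon, GPU Manager is built on the NVML library and obtains monitoring data at the GPU Pod level.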

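// getDeviceUsage sums the GPU memory and SM utilization of the processes in
// pidsInCont that run on device deviceIdx. It sorts pidsInCont in place and
// returns nil if any NVML query fails.
func (disp *Display) getDeviceUsage(pidsInCont []int, deviceIdx int) *displayapi.DeviceInfo {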
	nvml.Init()
	defer nvml.Shutdown()

	dev, err := nvml.DeviceGetHandleByIndex(uint(deviceIdx))
	if err != nil {
		klog.Warningf("can't find device %d, error %s", deviceIdx, err)
		return nil
	}

	processSamples, err := dev.DeviceGetProcessUtilization(1024, time.Second)
	if err != nil {
		klog.Warningf("can't get processes utilization from device %d, error %s", deviceIdx, err)
		return nil
	}

	processOnDevices, err := dev.DeviceGetComputeRunningProcesses(1024)
	if err != nil {
		klog.Warningf("can't get processes info from device %d, error %s", deviceIdx, err)
		return nil
	}

	busID, err := dev.DeviceGetPciInfo()
	if err != nil {
		klog.Warningf("can't get pci info from device %d, error %s", deviceIdx, err)
		return nil
	}

	sort.Slice(pidsInCont, func(i, j int) bool {
		return pidsInCont[i] < pidsInCont[j]
	})

	usedMemory := uint64(0)
	usedPids := make([]int32, 0)
	usedGPU := uint(0)
	for _, info := range processOnDevices {
		idx := sort.Search(len(pidsInCont), func(pivot int) bool {
			return pidsInCont[pivot] >= int(info.Pid)
		})

		if idx < len(pidsInCont) && pidsInCont[idx] == int(info.Pid) {
			usedPids = append(usedPids, int32(pidsInCont[idx]))
			usedMemory += info.UsedGPUMemory
		}
	}

	for _, sample := range processSamples {
		idx := sort.Search(len(pidsInCont), func(pivot int) bool {
			return pidsInCont[pivot] >= int(sample.Pid)
		})

		if idx < len(pidsInCont) && pidsInCont[idx] == int(sample.Pid) {
			usedGPU += sample.SmUtil
		}
	}

	return &displayapi.DeviceInfo{
		Id:      busID.BusID,
		CardIdx: fmt.Sprintf("%d", deviceIdx),
		Gpu:     float32(usedGPU),
		Mem:     float32(usedMemory >> 20),
		Pids:    usedPids,
	}
}
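
The core of the function is a join between the PIDs that belong to a container and the per-process data NVML reports for a device. The sketch below restates that aggregation in Python with plain data classes in place of NVML structures. The class and function names are made up, and a set lookup replaces the sort plus binary search used in the Go code.

```python
from dataclasses import dataclass


@dataclass
class ProcessInfo:
    pid: int
    used_gpu_memory: int  # bytes, as in NVML's running-process list


@dataclass
class ProcessSample:
    pid: int
    sm_util: int  # percent, as in NVML's process utilization samples


def container_usage(pids_in_container, processes, samples):
    """Sum memory and SM utilization over the processes owned by one container."""
    pids = set(pids_in_container)
    used_pids = [p.pid for p in processes if p.pid in pids]
    used_memory = sum(p.used_gpu_memory for p in processes if p.pid in pids)
    used_gpu = sum(s.sm_util for s in samples if s.pid in pids)
    return {
        "gpu": float(used_gpu),
        "mem_mib": float(used_memory >> 20),  # bytes to MiB, as in the Go code
        "pids": used_pids,
    }


# Invented data: PIDs 100 and 200 belong to the container, PID 300 does not.
processes = [ProcessInfo(100, 1 << 30), ProcessInfo(300, 2 << 30)]
samples = [ProcessSample(100, 30), ProcessSample(200, 20), ProcessSample(300, 50)]
print(container_usage([100, 200], processes, samples))
```

As in the original, a PID can appear in the utilization samples without appearing in the running-process list, so the two sums are computed independently.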

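Discussion of GPU Monitoring Metrics

For GPU monitoring in k8s, which metrics do we actually need?

Cluster level
- How many GPUs the whole cluster has, and of which models
- Cluster-level GPU compute usage (absolute) and compute utilization (relative)
- Cluster-level GPU memory usage (absolute) and memory utilization (relative)
Node level
- How many GPUs the Node has, and of which models
- Node-level GPU compute usage (absolute) and compute utilization (relative)
- Node-level GPU memory usage (absolute) and memory utilization (relative)
Pod level
- Which GPU the Pod is running on
- Pod-level GPU compute usage (absolute) and compute utilization (relative)
- Pod-level GPU memory usage (absolute) and memory utilization (relative)
Other related statistics
- GPU power, temperature, clock frequency, fan speed, and so on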
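
The cluster-level and node-level numbers can all be derived from the same per-GPU samples. The sketch below shows one possible roll-up. The sample layout (`node`, `util_pct`, `mem_used_mib`, `mem_total_mib`) is an assumed shape, not an existing schema, and "compute usage" is expressed here as the equivalent number of fully busy GPUs, which is only one of several reasonable definitions.

```python
from collections import defaultdict


def summarize(samples):
    """Absolute and relative usage for a set of per-GPU samples."""
    count = len(samples)
    util_sum = sum(s["util_pct"] for s in samples)
    mem_used = sum(s["mem_used_mib"] for s in samples)
    mem_total = sum(s["mem_total_mib"] for s in samples)
    return {
        "gpus": count,
        "compute_used_gpus": util_sum / 100.0,  # equivalent fully busy GPUs
        "compute_util_pct": util_sum / count if count else 0.0,
        "mem_used_mib": mem_used,
        "mem_util_pct": 100.0 * mem_used / mem_total if mem_total else 0.0,
    }


def by_level(samples):
    """Roll the same samples up to node level and to cluster level."""
    nodes = defaultdict(list)
    for sample in samples:
        nodes[sample["node"]].append(sample)
    return {
        "cluster": summarize(samples),
        "nodes": {name: summarize(group) for name, group in nodes.items()},
    }


# Invented samples for two nodes.
samples = [
    {"node": "node-a", "util_pct": 100, "mem_used_mib": 8000, "mem_total_mib": 16000},
    {"node": "node-a", "util_pct": 0, "mem_used_mib": 0, "mem_total_mib": 16000},
    {"node": "node-b", "util_pct": 50, "mem_used_mib": 4000, "mem_total_mib": 16000},
]
report = by_level(samples)
print(report["cluster"])
print(report["nodes"]["node-a"])
```

Pod-level figures need an extra mapping from Pod to GPU and process, which is what the GPU Manager code above derives from container PIDs.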