NVIDIA® OpenFabrics Enterprise Distribution for Linux (MLNX_OFED) is a single Virtual Protocol Interconnect (VPI) software stack that operates across all NVIDIA network adapter solutions.

NVIDIA OFED (MLNX_OFED) is an NVIDIA-tested and packaged version of OFED. It supports two interconnect types, InfiniBand and Ethernet, through the same RDMA (Remote Direct Memory Access) and kernel-bypass APIs, known as OFED verbs. Up to 200 Gb/s InfiniBand and RoCE (based on the RDMA over Converged Ethernet standard) over 10/25/40/50/100/200 GbE are supported, enabling OEMs and system integrators to meet the needs of end users in these markets.[1]

OFED Performance Tests

RDMA performance can be measured with perftest.[2] Install it with:

apt-get install perftest

ib_read_bw

ServerA: ib_read_bw -a -d mlx4_0
ServerB: ib_read_bw -a -F <ServerAIP> -d mlx4_0 --report_gbits

Example:

# Server A
# Here -q sets the number of QPs to 2 and -x selects the GID index
$ ib_read_bw -q 2 -x 3 --report_g --run_infinitely

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Read BW Test
 Dual-port       : OFF		Device         : mlx5_1
 Number of qps   : 2		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 PCIe relax order: ON
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Outstand reads  : 16
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x02d9 PSN 0xf326ce OUT 0x10 RKey 0x060d13 VAddr 0x007fa3bccfc000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:12
 local address: LID 0000 QPN 0x02da PSN 0x1403a0 OUT 0x10 RKey 0x060d13 VAddr 0x007fa3bcd0c000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:12
 remote address: LID 0000 QPN 0x02c9 PSN 0x861fee OUT 0x10 RKey 0x050f13 VAddr 0x007f55afc0f000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:05
 remote address: LID 0000 QPN 0x02ca PSN 0xfad640 OUT 0x10 RKey 0x050f13 VAddr 0x007f55afc1f000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:05
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]

# Server B
$ ib_read_bw 172.18.0.237 -q 2 -x 3 --report_g --run_infinitely
---------------------------------------------------------------------------------------
                    RDMA_Read BW Test
 Dual-port       : OFF		Device         : mlx5_3
 Number of qps   : 2		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 PCIe relax order: ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Outstand reads  : 16
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x02c9 PSN 0x861fee OUT 0x10 RKey 0x050f13 VAddr 0x007f55afc0f000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:05
 local address: LID 0000 QPN 0x02ca PSN 0xfad640 OUT 0x10 RKey 0x050f13 VAddr 0x007f55afc1f000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:05
 remote address: LID 0000 QPN 0x02d9 PSN 0xf326ce OUT 0x10 RKey 0x060d13 VAddr 0x007fa3bccfc000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:12
 remote address: LID 0000 QPN 0x02da PSN 0x1403a0 OUT 0x10 RKey 0x060d13 VAddr 0x007fa3bcd0c000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:12
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      829982           0.00               87.06  		   0.166061
 65536      828835           0.00               86.94  		   0.165832
 65536      828849           0.00               86.95  		   0.165835
 65536      828828           0.00               86.94  		   0.165831
 65536      828801           0.00               86.94  		   0.165825
 65536      828795           0.00               86.94  		   0.165824
 65536      828852           0.00               86.95  		   0.165835
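The MsgRate column can be cross-checked against the reported bandwidth: with 65536-byte messages, 87.06 Gb/s should come out to about 0.166 Mpps. A quick sanity check of that arithmetic, using the numbers from the output above:

```shell
# Mpps = (Gb/s * 1e9 bits) / (message size in bits) / 1e6
awk 'BEGIN { printf "%.3f Mpps\n", 87.06e9 / (65536 * 8) / 1e6 }'
# prints: 0.166 Mpps
```

This matches the 0.166061 MsgRate reported by ib_read_bw.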

ib_write_bw

ServerA: ib_write_bw -a -d mlx4_0
ServerB: ib_write_bw -a -F <ServerAIP> -d mlx4_0 --report_gbits

ib_send_bw

ServerA: ib_send_bw -a -d mlx4_0
ServerB: ib_send_bw -a -F <ServerAIP> -d mlx4_0 --report_gbits
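All three bandwidth tests share the same client-side flag pattern, so the command line can be built by a small wrapper. A minimal sketch, assuming a hypothetical helper name and an illustrative server IP:

```shell
#!/bin/sh
# Hypothetical helper: build the client-side command line for any of the
# three perftest bandwidth tools (read | write | send).
build_bw_client_cmd() {
    test_name=$1    # read, write, or send
    server_ip=$2    # IP the server side is listening on
    device=$3       # RDMA device, e.g. mlx4_0
    printf 'ib_%s_bw -a -F %s -d %s --report_gbits' \
        "$test_name" "$server_ip" "$device"
}

# Example (the IP is illustrative):
build_bw_client_cmd read 192.168.0.12 mlx4_0
# → ib_read_bw -a -F 192.168.0.12 -d mlx4_0 --report_gbits
```

The server side is the same command without the `-F <ip> --report_gbits` client flags.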

Latency Tests

There are likewise three latency-test commands, used in the same way as above:

  • ib_read_lat
  • ib_write_lat
  • ib_send_lat

Taking ib_read_lat as an example, the test results look like this:

# Server A
$ ib_read_lat -x 3 --report_g

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Read Latency Test
 Dual-port       : OFF		Device         : mlx5_1
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 PCIe relax order: ON
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Outstand reads  : 16
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x02db PSN 0x144f6a OUT 0x10 RKey 0x060d14 VAddr 0x00000001849000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:12
 remote address: LID 0000 QPN 0x02cb PSN 0x1c38a6 OUT 0x10 RKey 0x050f14 VAddr 0x00000000e59000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:05
---------------------------------------------------------------------------------------

# Server B
$ ib_read_lat 172.18.0.237 -x 3 --report_g
---------------------------------------------------------------------------------------
                    RDMA_Read Latency Test
 Dual-port       : OFF		Device         : mlx5_3
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 PCIe relax order: ON
 TX depth        : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Outstand reads  : 16
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x02cb PSN 0x1c38a6 OUT 0x10 RKey 0x050f14 VAddr 0x00000000e59000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:05
 remote address: LID 0000 QPN 0x02db PSN 0x144f6a OUT 0x10 RKey 0x060d14 VAddr 0x00000001849000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:12
---------------------------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
 2       1000          10.87          14.62        11.40    	       11.48       	0.34   		12.74   		14.62
---------------------------------------------------------------------------------------
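Since this test keeps only one read outstanding (TX depth 1), the average latency bounds the single-QP operation rate at roughly 1e6 / t_avg operations per second. Checking that against the 11.48 usec average above:

```shell
# With one outstanding read, ops/s ≈ 1e6 / t_avg[usec]
awk 'BEGIN { printf "%.0f ops/s\n", 1e6 / 11.48 }'
# prints: 87108 ops/s
```

Higher rates require more outstanding operations, which is what the bandwidth tests measure.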

InfiniBand Diagnostics Commands

See [3].

  • ibstat: Shows the host adapter's status.
  • ibstatus: Similar to ibstat but implemented as a script.
  • ibnetdiscover: Scans the topology.
  • ibaddr: Shows the LID range and default GID of the target (default is the local port).
  • ibroute: Displays unicast and multicast forwarding tables of the switches.
  • ibtracert: Displays the unicast or multicast route from source to destination.
  • ibping: Uses vendor MADs to validate connectivity between InfiniBand nodes. On exit, (IP) ping-like output is shown.
  • ibsysstat: Obtains basic information for a specific node, which may be remote: hostname, CPUs, memory utilization.
  • sminfo: Queries the SMInfo attribute on a node.
  • smpdump: A general-purpose SMP utility that gets SM attributes from a specified SMA. The result is dumped in hex by default.
  • smpquery: Enables a basic subset of standard SMP queries, including node info, node description, switch info, and port info. Fields are displayed in human-readable format.
  • perfquery: Dumps (and optionally clears) the performance counters of the destination port (including error counters).
  • ibcheckport: Performs basic tests on the specified port.
  • ibchecknode: Performs basic tests on the specified node.
  • ibcheckerrs: Checks whether the error counters of the port/node have exceeded predefined thresholds.
  • ibchecknet: Performs port/node/error checks on the subnet. ibnetdiscover output can be used as an input topology.
  • ibswitches: Scans the net or uses an existing topology file and lists all switches.
  • ibhosts: Scans the net or uses an existing topology file and lists all hosts.
  • ibnodes: Scans the net or uses an existing topology file and lists all nodes.
  • ibportstate: Gets the logical and physical port states of an InfiniBand port, or disables/enables the port (only on a switch).
  • ibcheckwidth: Performs a port-width check on the subnet; used to find ports with 1x link width.
  • ibcheckportwidth: Performs a 1x port-width check on the specified port.
  • ibcheckstate: Performs a port-state check (as well as a physical-port-state check) on the subnet; used to find ports whose physical state is not LinkUp or whose port state is not Active.
  • ibcheckportstate: Performs a port-state check (as well as a physical-port-state check) on the specified port.
  • ibcheckerrors: Performs an error check on the subnet; used to find ports whose error counters (Performance Management Agent (PMA) PortCounters) exceed the indicated thresholds.
  • ibclearerrors: Clears all error counters on the subnet.
  • ibclearcounters: Clears all port counters on the subnet.
  • ibdiscover.pl: Takes the ibnetdiscover output and a map file and produces a topology file (local node GUID and port connected to remote node GUID and port).
  • saquery: Issues SA queries.
  • ibdiagnet: Scans the fabric using directed-route packets and extracts all available information about its connectivity and devices.
  • ibnetsplit: Automatically groups hosts and creates scripts that split the network into sub-networks, each containing one group of hosts.
Install the diagnostics package with:

apt-get install infiniband-diags

ibstatus

A script that displays basic information obtained from the local InfiniBand driver. Output includes LID, SMLID, port state, link width active, and port physical state.

ibstatus [-h] [devname[:port]]

Example:

ibstatus                      # display status of all IB ports
ibstatus mthca1               # status of mthca1 ports
ibstatus mthca1:1 mthca0:2    # show status of specified ports
$ ibstatus mlx5_1
Infiniband device 'mlx5_1' port 1 status:
        default gid:     fe80:0000:0000:0000:bace:f6ff:fec6:1286
        base lid:        0x0
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            200 Gb/sec (4X HDR)
        link_layer:      Ethernet

ibstat

Similar to the ibstatus utility but implemented as a binary and not as a script. Includes options to list CAs and/or ports.

ibstat [-d(ebug) -l(ist_of_cas) -p(ort_list) -s(hort)] <ca_name> [portnum]
ibstat            # display status of all IB ports
ibstat mthca1     # status of mthca1 ports
ibstat mthca1 2   # show status of specified ports
ibstat -p mthca0  # list the port guids of mthca0
ibstat -l         # list all CA names

See also: ibstatus

$ ibstat mlx5_1
CA 'mlx5_1'
        CA type: MT4125
        Number of ports: 1
        Firmware version: 22.36.1010
        Hardware version: 0
        Node GUID: 0xb8cef60300c61286
        System image GUID: 0xb8cef60300c61286
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xbacef6fffec61286
                Link layer: Ethernet

show_gids

show_gids is a script that lists the GID table of each RDMA device; Mellanox and Intel each ship their own version.

https://enterprise-support.nvidia.com/s/article/understanding-show-gids-script

$ show_gids
DEV	PORT	INDEX	GID					IPv4  		VER	DEV
---	----	-----	---					------------  	---	---
mlx5_bond_0	1	0	fe80:0000:0000:0000:0225:9dff:fe00:b01e			v1	bond0
mlx5_bond_0	1	1	fe80:0000:0000:0000:0225:9dff:fe00:b01e			v2	bond0
mlx5_bond_0	1	2	0000:0000:0000:0000:0000:ffff:0a7c:2c3b	10.124.44.59  	v1	bond0
mlx5_bond_0	1	3	0000:0000:0000:0000:0000:ffff:0a7c:2c3b	10.124.44.59  	v2	bond0
mlx5_bond_0	1	4	fe80:0000:0000:0000:1270:fdff:feb5:7853			v1	bond0
mlx5_bond_0	1	5	fe80:0000:0000:0000:1270:fdff:feb5:7853			v2	bond0
mlx5_bond_1	1	0	fe80:0000:0000:0000:0225:9dff:fe00:b025			v1	bond2
mlx5_bond_1	1	1	fe80:0000:0000:0000:0225:9dff:fe00:b025			v2	bond2
mlx5_bond_1	1	2	0000:0000:0000:0000:0000:ffff:ac10:08bf	172.16.8.191  	v1	bond2
mlx5_bond_1	1	3	0000:0000:0000:0000:0000:ffff:ac10:08bf	172.16.8.191  	v2	bond2
mlx5_bond_1	1	4	fe80:0000:0000:0000:1270:fdff:fea8:c115			v1	bond2
mlx5_bond_1	1	5	fe80:0000:0000:0000:1270:fdff:fea8:c115			v2	bond2
mlx5_bond_2	1	0	fe80:0000:0000:0000:0225:9dff:fe00:b026			v1	bond3
mlx5_bond_2	1	1	fe80:0000:0000:0000:0225:9dff:fe00:b026			v2	bond3
mlx5_bond_2	1	2	0000:0000:0000:0000:0000:ffff:ac10:0971	172.16.9.113  	v1	bond3
mlx5_bond_2	1	3	0000:0000:0000:0000:0000:ffff:ac10:0971	172.16.9.113  	v2	bond3
mlx5_bond_2	1	4	fe80:0000:0000:0000:1270:fdff:feb5:7317			v1	bond3
mlx5_bond_2	1	5	fe80:0000:0000:0000:1270:fdff:feb5:7317			v2	bond3
mlx5_bond_3	1	0	fe80:0000:0000:0000:0225:9dff:fe00:b01f			v1	bond1
mlx5_bond_3	1	1	fe80:0000:0000:0000:0225:9dff:fe00:b01f			v2	bond1
mlx5_bond_3	1	2	0000:0000:0000:0000:0000:ffff:ac10:0873	172.16.8.115  	v1	bond1
mlx5_bond_3	1	3	0000:0000:0000:0000:0000:ffff:ac10:0873	172.16.8.115  	v2	bond1
mlx5_bond_3	1	4	fe80:0000:0000:0000:1270:fdff:fea8:db7d			v1	bond1
mlx5_bond_3	1	5	fe80:0000:0000:0000:1270:fdff:fea8:db7d			v2	bond1
mlx5_bond_4	1	0	fe80:0000:0000:0000:0225:9dff:fe00:b027			v1	bond4
mlx5_bond_4	1	1	fe80:0000:0000:0000:0225:9dff:fe00:b027			v2	bond4
mlx5_bond_4	1	2	0000:0000:0000:0000:0000:ffff:ac10:09d5	172.16.9.213  	v1	bond4
mlx5_bond_4	1	3	0000:0000:0000:0000:0000:ffff:ac10:09d5	172.16.9.213  	v2	bond4
mlx5_bond_4	1	4	fe80:0000:0000:0000:1270:fdff:fea8:c110			v1	bond4
mlx5_bond_4	1	5	fe80:0000:0000:0000:1270:fdff:fea8:c110			v2	bond4
mlx5_bond_4	1	6	fe80:0000:0000:0000:9c48:9cff:fee8:7126			v1	carma_vxlan0
mlx5_bond_4	1	7	fe80:0000:0000:0000:9c48:9cff:fee8:7126			v2	carma_vxlan0
n_gids_found=32
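The IPv4 column above is just the last 32 bits of an IPv4-mapped GID (::ffff:a.b.c.d) rendered in dotted-quad form. A small sketch of that decoding, using a hypothetical function name and GIDs taken from the output above:

```shell
#!/bin/bash
# Hypothetical helper: decode the IPv4 address embedded in an
# IPv4-mapped RoCE GID (0000:...:ffff:XXXX:YYYY).
gid_to_ipv4() {
    local tail=${1##*ffff:}          # e.g. "0a7c:2c3b"
    local hi=${tail%:*} lo=${tail#*:}
    printf '%d.%d.%d.%d\n' \
        "0x${hi:0:2}" "0x${hi:2:2}" "0x${lo:0:2}" "0x${lo:2:2}"
}

gid_to_ipv4 0000:0000:0000:0000:0000:ffff:0a7c:2c3b
# prints: 10.124.44.59
```

The fe80:... entries are link-local GIDs derived from the MAC address and carry no IPv4 mapping.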

ibdev2netdev

ibdev2netdev is a script that maps RDMA devices to their network interfaces; Mellanox and Intel each ship their own version.

$ ibdev2netdev
mlx5_bond_0 port 1 ==> bond0 (Up)
mlx5_bond_1 port 1 ==> bond2 (Up)
mlx5_bond_2 port 1 ==> bond3 (Up)
mlx5_bond_3 port 1 ==> bond1 (Up)
mlx5_bond_4 port 1 ==> bond4 (Up)

ibverbs-utils

The ibverbs-utils package[4] provides tools to check whether the host has RDMA devices:

$ apt-get install ibverbs-utils
$ ibv_<TAB>
ibv_asyncwatch     ibv_devices        ibv_devinfo        ibv_rc_pingpong    ibv_srq_pingpong   ibv_uc_pingpong    ibv_ud_pingpong    ibv_xsrq_pingpong  

ibv_devices

ibv_devices, a tool included in the libibverbs-utils package, lists the RDMA devices on the local host:

$ ibv_devices
    device                 node GUID
    ------              ----------------
    mlx5_0              08c0eb0300f4248c
    mlx5_1              b8cef60300c61286
    mlx5_2              08c0eb0300f427a8
    mlx5_3              08c0eb03009a14e4

ibv_devinfo

ibv_devinfo, also part of the libibverbs-utils package, opens a device and queries its attributes. It can be used to verify that the user-space and kernel-space RDMA stacks work together correctly:

ibv_devinfo: invalid option -- 'h'
Usage: ibv_devinfo             print the ca attributes

Options:
  -d, --ib-dev=<dev>     use IB device <dev> (default first device found)
  -i, --ib-port=<port>   use port <port> of IB device (default all ports)
  -l, --list             print only the IB devices names
$ ibv_devinfo -d mlx5_1
hca_id: mlx5_1
        transport:                      InfiniBand (0)
        fw_ver:                         22.36.1010
        node_guid:                      b8ce:f603:00c6:1286
        sys_image_guid:                 b8ce:f603:00c6:1286
        vendor_id:                      0x02c9
        vendor_part_id:                 4125
        hw_ver:                         0x0
        board_id:                       MT_0000000362
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

At least one port must be in the PORT_ACTIVE state to confirm that the RDMA components are up and running.
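That check is easy to script by grepping the ibv_devinfo output. A minimal sketch (the function name is made up; the demo feeds it a captured line rather than querying real hardware):

```shell
#!/bin/sh
# Hypothetical helper: succeed iff the input contains an active port.
has_active_port() {
    grep -q 'state:[[:space:]]*PORT_ACTIVE'
}

# Real use would be:  ibv_devinfo -d mlx5_1 | has_active_port
# Demo against a line mirroring the output above:
printf 'state: PORT_ACTIVE (4)\n' | has_active_port && echo "RDMA stack is up"
```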

The mlxlink tool can also be used to inspect the physical link of a device:

mlxlink -d 

ethtool

# Packet-drop statistics
$ ethtool -S bond0 | grep drop

$ watch -n 1 "ethtool -S bond0 | grep drop"
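To see how fast the drop counters grow, snapshot the counters twice and diff them. A sketch with a hypothetical helper; the demo feeds it synthetic files, since the real commands need the NIC present:

```shell
#!/bin/sh
# Hypothetical helper: print per-counter deltas between two
# "name: value" snapshots (e.g. from `ethtool -S bond0`).
counter_delta() {
    awk 'NR==FNR { before[$1] = $2; next }
         { printf "%s %d\n", $1, $2 - before[$1] }' "$1" "$2"
}

# Real use: ethtool -S bond0 > /tmp/s1; sleep 10; ethtool -S bond0 > /tmp/s2
printf 'rx_dropped: 10\ntx_dropped: 3\n' > /tmp/s1
printf 'rx_dropped: 25\ntx_dropped: 3\n' > /tmp/s2
counter_delta /tmp/s1 /tmp/s2
# prints:
# rx_dropped: 15
# tx_dropped: 0
```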

Reference