NVIDIA XID Message
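An Xid message is an error report from the NVIDIA driver that is written to the operating system's kernel log or event log. It indicates that a general GPU error occurred, most often because the driver programmed the GPU incorrectly or because the commands sent to the GPU were corrupted.
An Xid message can have one of three kinds of cause: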
- Hardware Problem
- NVIDIA Software Problem
- User Application Problem
Xid messages serve as a diagnostic aid for debugging reported errors. The meaning of each Xid message stays the same across NVIDIA driver versions.
Viewing Xid Errors
On Linux, Xid errors are written to the kernel log and can be found in /var/log/messages.
[Figure: log excerpt showing an XID 14 error message]
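As a rough illustration of scanning such a log, the Go sketch below extracts the device address and Xid number from a single line. It assumes the driver prints lines of the form `NVRM: Xid (<PCI address>): <code>, <details>`; the sample line and the names `ParseXid` and `XidEvent` are made up for this sketch, and the exact format may differ between driver versions.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// xidLine matches driver log lines of the assumed form
// "NVRM: Xid (<PCI address>): <code>, <details>".
var xidLine = regexp.MustCompile(`NVRM: Xid \(([^)]+)\): (\d+)(?:, (.*))?`)

// XidEvent holds the fields extracted from one Xid log line.
type XidEvent struct {
	Device string // PCI address as printed by the driver
	Code   int    // Xid number
	Detail string // remainder of the message, may be empty
}

// ParseXid returns the Xid event found in line, or ok=false
// when the line does not contain an Xid message.
func ParseXid(line string) (ev XidEvent, ok bool) {
	m := xidLine.FindStringSubmatch(line)
	if m == nil {
		return XidEvent{}, false
	}
	code, err := strconv.Atoi(m[2])
	if err != nil {
		return XidEvent{}, false
	}
	return XidEvent{Device: m[1], Code: code, Detail: m[3]}, true
}

func main() {
	// Hypothetical sample line written for this sketch.
	line := "kernel: NVRM: Xid (0000:03:00): 14, Channel 00000001"
	if ev, ok := ParseXid(line); ok {
		fmt.Printf("device=%s xid=%d detail=%q\n", ev.Device, ev.Code, ev.Detail)
	}
}
```

Feeding each line of /var/log/messages through `ParseXid` (for example with a `bufio.Scanner`) would give a simple Xid filter.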
The NVML library provided by NVIDIA can also be used to listen for Xid errors on a GPU.
[Code sample: Go listener for Xid errors using NVML]
Common Xid Errors
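XID 13: GR: SW Notify Error
XID 13 is a general error caused by a user process, typically an out-of-bounds array access, an illegal instruction, or an illegal register access. Only rarely does it indicate a hardware or kernel driver problem; in most cases the user process is at fault.
When this error occurs, NVIDIA recommends the following steps:
- Run the application in cuda-gdb or cuda-memcheck, or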
- Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or
- File a bug if the previous two come back inconclusive to eliminate potential NVIDIA driver or hardware bug.
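The second step can be scripted. The Go sketch below launches a program with `CUDA_DEVICE_WAITS_ON_EXCEPTION=1` added to its environment so that cuda-gdb can be attached afterwards. The helper name `debugCommand` and the usage string are hypothetical; only the environment variable comes from the steps above.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// debugCommand builds a command that runs the given program with
// CUDA_DEVICE_WAITS_ON_EXCEPTION=1 appended to the current environment.
// When a key appears twice in Env, os/exec uses the last value.
func debugCommand(name string, args ...string) *exec.Cmd {
	cmd := exec.Command(name, args...)
	cmd.Env = append(os.Environ(), "CUDA_DEVICE_WAITS_ON_EXCEPTION=1")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: xidwait <program> [args...]")
		return
	}
	// Run the target program; attach cuda-gdb to it from another terminal.
	if err := debugCommand(os.Args[1], os.Args[2:]...).Run(); err != nil {
		fmt.Fprintln(os.Stderr, "run failed:", err)
		os.Exit(1)
	}
}
```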
XID 31: Fifo: MMU Error
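XID 31 is reported by the MMU, for example when a user process accesses an illegal address. It is usually an application-level bug, but it can also be a driver or hardware bug.
When this error occurs, NVIDIA recommends the following steps:
- Run the application in cuda-gdb or cuda-memcheck, or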
- Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or
- File a bug if the previous two come back inconclusive to eliminate potential NVIDIA driver or hardware bug.
XID 32: PBDMA Error
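XID 32 is reported by the DMA controller, which manages communication between the NVIDIA driver and the GPU over the PCIe bus.
This error is usually caused by quality issues on the PCIe link and is generally not caused by the user application.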
XID 43: Reset Channel VERIF Error
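XID 43 is reported when a user application hits a fault and has to be terminated. The GPU itself remains healthy in this case.
In most cases the cause is the user process rather than a driver bug.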
XID 45: OS: Preemptive Channel Removal
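XID 45 is reported when a user process aborts and the kernel driver has to tear down the application running on the GPU. Ctrl-C, a GPU reset, and SIGKILL all lead to this situation.
In most cases the cause is the user process rather than a driver bug.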
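XID 48: DBE (Double Bit Error) ECC Error
XID 48 is reported when the GPU detects an uncorrectable error; the error is also reported back to the user process. A GPU reset or a node reboot is needed to clear it. The nvidia-smi tool provides a summary of ECC errors.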
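The notes on these six codes can be condensed into a small lookup, sketched below in Go. The type `Advice`, the function `Triage`, and the wording of each entry are illustrative choices for this sketch rather than anything defined by NVIDIA; the entries only restate the descriptions above.

```go
package main

import "fmt"

// Advice condenses the notes above for one Xid code.
type Advice struct {
	Name        string // short name from the section heading
	LikelyCause string // most probable origin according to the text above
	NeedsReset  bool   // true when a GPU reset or node reboot is called for
}

// commonXids covers only the six codes discussed above.
var commonXids = map[int]Advice{
	13: {"GR: SW Notify Error", "user application", false},
	31: {"Fifo: MMU Error", "user application", false},
	32: {"PBDMA Error", "PCIe link quality", false},
	43: {"Reset Channel VERIF Error", "user application", false},
	45: {"OS: Preemptive Channel Removal", "user application", false},
	48: {"DBE ECC Error", "uncorrectable ECC error", true},
}

// Triage looks up a code; ok is false for codes outside the six above.
func Triage(code int) (Advice, bool) {
	a, ok := commonXids[code]
	return a, ok
}

func main() {
	for _, code := range []int{13, 48, 79} {
		if a, ok := Triage(code); ok {
			fmt.Printf("XID %d (%s): likely cause: %s, reset needed: %t\n",
				code, a.Name, a.LikelyCause, a.NeedsReset)
		} else {
			fmt.Printf("XID %d: not covered here, consult the full listing\n", code)
		}
	}
}
```

A map keeps the sketch easy to extend as further codes are documented.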
Xid Error Listing
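The following table lists all Xid errors. Each X marks a possible cause. Empty cells are collapsed in this listing, so the position of an X does not identify its cause column.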
| XID | Failure | HW Error | Driver Error | User App Error | System Memory Corruption | Bus Error | Thermal Issue | FB Corruption |
|---|---|---|---|---|---|---|---|---|
| 1 | Invalid or corrupted push buffer stream | X | X | X | X | |||
| 2 | Invalid or corrupted push buffer stream | X | X | X | X | |||
| 3 | Invalid or corrupted push buffer stream | X | X | X | X | |||
| 4 | Invalid or corrupted push buffer stream | X | X | X | X | |||
| 4 | GPU semaphore timeout | X | X | X | X | X | |||
| 5 | Unused | |||||||
| 6 | Invalid or corrupted push buffer stream | X | X | X | X | |||
| 7 | Invalid or corrupted push buffer address | X | X | X | ||||
| 8 | GPU stopped processing | X | X | X | X | |||
| 9 | Driver error programming GPU | X | ||||||
| 10 | Unused | |||||||
| 11 | Invalid or corrupted push buffer stream | X | X | X | X | |||
| 12 | Driver error handling GPU exception | X | ||||||
| 13 | Graphics Engine Exception | X | X | X | X | X | X | |
| 14 | Unused | |||||||
| 15 | Unused | |||||||
| 16 | Display engine hung | X | ||||||
| 17 | Unused | |||||||
| 18 | Bus mastering disabled in PCI Config Space | X | ||||||
| 19 | Display Engine error | X | ||||||
| 20 | Invalid or corrupted Mpeg push buffer | X | X | X | X | |||
| 21 | Invalid or corrupted Motion Estimation push buffer | X | X | X | X | |||
| 22 | Invalid or corrupted Video Processor push buffer | X | X | X | X | |||
| 23 | Unused | |||||||
| 24 | GPU semaphore timeout | X | X | X | X | X | X | |
| 25 | Invalid or illegal push buffer stream | X | X | X | X | X | ||
| 26 | Framebuffer timeout | X | ||||||
| 27 | Video processor exception | X | ||||||
| 28 | Video processor exception | X | ||||||
| 29 | Video processor exception | X | ||||||
| 30 | GPU semaphore access error | X | ||||||
| 31 | GPU memory page fault | X | X | |||||
| 32 | Invalid or corrupted push buffer stream | X | X | X | X | X | ||
| 33 | Internal micro-controller error | X | ||||||
| 34 | Video processor exception | X | ||||||
| 35 | Video processor exception | X | ||||||
| 36 | Video processor exception | X | ||||||
| 37 | Driver firmware error | X | X | X | ||||
| 38 | Driver firmware error | X | ||||||
| 39 | Unused | |||||||
| 40 | Unused | |||||||
| 41 | Unused | |||||||
| 42 | Video processor exception | X | ||||||
| 43 | GPU stopped processing | X | X | |||||
| 44 | Graphics Engine fault during context switch | X | ||||||
| 45 | Preemptive cleanup, due to previous errors – Most likely to see when running multiple cuda applications and hitting a DBE | X | ||||||
| 46 | GPU stopped processing | X | ||||||
| 47 | Video processor exception | X | ||||||
| 48 | Double Bit ECC Error | X | ||||||
| 49 | Unused | |||||||
| 50 | Unused | |||||||
| 51 | Unused | |||||||
| 52 | Unused | |||||||
| 53 | Unused | |||||||
| 54 | Auxiliary power is not connected to the GPU board | |||||||
| 55 | Unused | |||||||
| 56 | Display Engine error | X | X | |||||
| 57 | Error programming video memory interface | X | X | X | ||||
| 58 | Unstable video memory interface detected | X | X | |||||
| 58 | EDC error – clarified in printout | X | ||||||
| 59 | Internal micro-controller error (older drivers) | X | ||||||
| 60 | Video processor exception | X | ||||||
| 61 | Internal micro-controller breakpoint/warning (newer drivers) | |||||||
| 62 | Internal micro-controller halt (newer drivers) | X | X | X | ||||
| 63 | ECC page retirement recording event | X | X | X | ||||
| 64 | ECC page retirement recording failure | X | X | |||||
| 65 | Video processor exception | X | X | |||||
| 66 | Illegal access by driver | X | X | |||||
| 67 | Illegal access by driver | X | X | |||||
| 68 | Video processor exception | X | X | |||||
| 69 | Graphics Engine class error | X | X | |||||
| 70 | CE3: Unknown Error | X | X | |||||
| 71 | CE4: Unknown Error | X | X | |||||
| 72 | CE5: Unknown Error | X | X | |||||
| 73 | NVENC2 Error | X | X | |||||
| 74 | NVLINK Error | X | X | X | ||||
| 75 | Reserved | |||||||
| 76 | Reserved | |||||||
| 77 | Reserved | |||||||
| 78 | vGPU Start Error | X | ||||||
| 79 | GPU has fallen off the bus | X | X | X | X | X | ||
| 80 | Corrupted data sent to GPU | X | X | X | X | X | ||
| 81 | VGA Subsystem Error | X | ||||||
| 82 | Reserved | |||||||
| 83 | Reserved | |||||||
| 84 | Reserved | |||||||
| 85 | Reserved | |||||||
| 86 | Reserved | |||||||
| 87 | Reserved | |||||||
| 88 | Reserved | |||||||
| 89 | Reserved | |||||||
| 90 | Reserved | |||||||
| 91 | Reserved | |||||||
| 92 | High single-bit ECC error rate | X | X |