After covering the model architecture, we will move on to how these models are trained: an introduction to the basics of DeepSpeed and Megatron, leading into the various parallelism strategies — Tensor / Pipeline / ZeRO / 3D Parallelism.
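As a warm-up for the Megatron material below, here is a minimal sketch of tensor parallelism: a linear layer's weight matrix is split column-wise across workers, each worker computes a partial output, and the partials are gathered. This uses NumPy to simulate two "workers" in one process; the shapes and variable names are illustrative, not Megatron's actual API.

```python
import numpy as np

# Megatron-style column-parallel linear layer, simulated on one process:
# split the weight matrix column-wise across two "workers", let each
# compute a partial output, then concatenate (an all-gather in practice).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # batch of 4, hidden size 8
W = rng.standard_normal((8, 16))   # full weight: 8 -> 16

# Shard the columns of W across 2 workers.
W_shards = np.split(W, 2, axis=1)  # two 8x8 shards

# Each worker computes its partial result independently.
partials = [x @ shard for shard in W_shards]

# "All-gather" the partial outputs along the feature dimension.
y_parallel = np.concatenate(partials, axis=1)

# The sharded computation matches the unsharded one.
assert np.allclose(y_parallel, x @ W)
```

The key property is that no worker ever needs the full weight matrix, which is what lets a single layer exceed one GPU's memory.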

References:

Microsoft's blog post introducing ZeRO and DeepSpeed:

https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/

Explainers on the various parallelism strategies:

DeepSpeed

Megatron-LM

GPipe

ZeRO

Colossal
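The GPipe entry in the list above is about pipeline parallelism: the model is cut into stages placed on different devices, and the batch is split into micro-batches so the stages can overlap. A minimal sketch, again simulating the stages serially in NumPy (the stage functions and sizes are made up for illustration):

```python
import numpy as np

# GPipe-style pipeline parallelism, simulated serially: the model is cut
# into two stages (as if on two devices), and the batch is split into
# micro-batches so stages can work on different micro-batches at once.
rng = np.random.default_rng(1)
W1 = rng.standard_normal((8, 8))   # stage 0's weights
W2 = rng.standard_normal((8, 4))   # stage 1's weights

stage0 = lambda h: np.maximum(h @ W1, 0)   # "device 0"
stage1 = lambda h: h @ W2                  # "device 1"

batch = rng.standard_normal((16, 8))
micro_batches = np.split(batch, 4)         # 4 micro-batches of size 4

# Pipelined forward: stage 1 starts as soon as the first micro-batch
# clears stage 0 (here collapsed into a simple loop on one process).
outputs = [stage1(stage0(mb)) for mb in micro_batches]
y_pipe = np.concatenate(outputs)

# Micro-batching does not change the result of the forward pass.
assert np.allclose(y_pipe, stage1(stage0(batch)))
```

In a real pipeline the micro-batch schedule is what hides the "bubble" of idle time at the start and end; the forward result is unchanged, as the assertion shows.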

state-of-gpt-2

Ray

OpenAI's 7,500-node Kubernetes cluster and its Ray Cluster stack

How OpenAI uses Ray