• Read through ultrascale playbook
  • Grad accumulation
  • FSDP2 publish
  • FSDP2 and fp8 mixed precision training
  • Pipeline parallel like dualpipe/zbv
  • LongContext Training long-context-llm
  • DeepEP alltoall comm optimization deep-ep
  • Distributed ckpt
  • Streaming dataloader
  • MoE
  • DTensor
  • Pathways and multi-controller/single-controller
  • Ray and RLHF and Hybrid Flow
  • TPU

FP16 and BF16 mixed precision training
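
To make the difference between the two 16-bit formats concrete, here is a minimal stdlib-only sketch that rounds a Python float to FP16 (5 exponent bits, 10 mantissa bits) and to BF16 (8 exponent bits, 7 mantissa bits). The helper names `to_fp16` and `to_bf16` are made up for this note; BF16 is emulated by rounding a float32 bit pattern to its top 16 bits, which is an assumption about how to model it, not a library API.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; values beyond the FP16 range become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses to pack values that do not fit into half precision
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Emulate BF16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # round-to-nearest-even on the 16 bits that are dropped
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    for value in (1.001, 70000.0):
        print(value, "fp16:", to_fp16(value), "bf16:", to_bf16(value))
```

FP16 keeps more mantissa bits but overflows just above 65504, which is why FP16 training usually needs loss scaling. BF16 keeps the float32 exponent range at the cost of coarser precision.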

FSDP natively supports mixed precision training

FP8
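
FP8 has so little dynamic range that tensors are normally scaled before casting. The sketch below illustrates per-tensor dynamic scaling for the E4M3 format (largest finite value 448) in plain Python. `round_e4m3`, `quantize_per_tensor` and `dequantize` are hypothetical helpers written for this note, and the rounding is a software emulation under the assumption of 3 mantissa bits and a minimum normal exponent of -6. It is not how any particular library implements the cast.

```python
import math

E4M3_MAX = 448.0  # largest finite value of the E4M3 (e4m3fn) format


def round_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1
    exp = max(math.frexp(mag)[1] - 1, -6)  # below 2**-6 values are subnormal
    step = 2.0 ** (exp - 3)  # 3 mantissa bits: 8 steps per power of two
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize_per_tensor(values):
    """Scale so the largest magnitude maps to E4M3_MAX, then round to E4M3."""
    amax = max(abs(v) for v in values)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [round_e4m3(v * scale) for v in values], scale


def dequantize(quantized, scale):
    """Undo the scaling to recover approximate original values."""
    return [q / scale for q in quantized]


if __name__ == "__main__":
    weights = [0.001, -0.02, 0.5]
    q, scale = quantize_per_tensor(weights)
    print("scale:", scale)
    print("quantized:", q)
    print("restored:", dequantize(q, scale))
    print("unscaled cast of 0.0005:", round_e4m3(0.0005))
```

Without the scale, small values fall into the subnormal range or flush to zero. With it, every element in this example is recovered to within a few percent.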

FSDP and FP8

TransformerBlock(
    (attention): Attention(
        (wq/wk/wv/wo): Float8Linear(in=4096, out=4096, bias=False) 
    )
    (feed_forward): FeedForward(
        (w1/w2/w3): Float8Linear(in=4096, out=14336, bias=False)
    )
    (attention_norm / ffn_norm): RMSNorm()
)
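
As a rough sanity check on what FP8 buys in combination with FSDP, the arithmetic below counts the `Float8Linear` weights in the block printed above and compares the bytes that would be all-gathered per block at 2 bytes per element (BF16) against 1 byte per element (FP8). It assumes every projection has exactly the shape shown in the printout and that weights are cast to FP8 before the all-gather. Both are assumptions of this sketch, and scale factors and the RMSNorm weights are ignored.

```python
# Shapes copied from the printed TransformerBlock
DIM = 4096
HIDDEN = 14336

attn_params = 4 * DIM * DIM    # wq, wk, wv, wo
ffn_params = 3 * DIM * HIDDEN  # w1, w2, w3
linear_params = attn_params + ffn_params

BYTES_BF16 = 2
BYTES_FP8 = 1

bf16_bytes = linear_params * BYTES_BF16
fp8_bytes = linear_params * BYTES_FP8

if __name__ == "__main__":
    mib = 1024 ** 2
    print(f"linear params per block: {linear_params:,}")
    print(f"all-gather payload in bf16: {bf16_bytes / mib:.0f} MiB")
    print(f"all-gather payload in fp8:  {fp8_bytes / mib:.0f} MiB")
```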