Goals:

  1. Super-large models with high sparsity and rapid updates
  2. Super-large cluster scale
  3. Tight step-time budget

Problems faced:

  1. Large models (200B to 1T+ parameters)
    1. Large tensors on a tight memory budget
  2. Sparse, long-context models
    1. Massive communication overhead
    2. Rapid model updates & diverse training jobs
  3. 20k+ GPUs
    1. Complicated network environment
    2. Frequent hardware failures and stragglers
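
To make the "large tensors on a tight memory budget" point concrete, the sketch below estimates model-state memory per GPU when parameters, gradients, and optimizer state are sharded evenly across all GPUs. The figure of 16 bytes per parameter is an assumption (mixed-precision Adam: bf16 weights and gradients plus fp32 master weights and two fp32 moments), and the function name is hypothetical. The estimate leaves out activations, temporarily gathered full parameters, and allocator fragmentation.

```python
# Rough per-GPU model-state memory under full sharding.
# Assumption: mixed-precision Adam, about 16 bytes per parameter:
#   2 (bf16 weights) + 2 (bf16 grads) + 4 (fp32 master weights) + 8 (two fp32 Adam moments)
BYTES_PER_PARAM = 2 + 2 + 4 + 8


def model_state_gib_per_gpu(num_params: float, num_gpus: int) -> float:
    """Return model-state memory per GPU in GiB, assuming an even shard per GPU."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return num_params * BYTES_PER_PARAM / num_gpus / 2**30


if __name__ == "__main__":
    # Model sizes and GPU counts taken from the ranges in the outline above.
    for num_params in (200e9, 1e12):
        for num_gpus in (1, 1024, 20_000):
            gib = model_state_gib_per_gpu(num_params, num_gpus)
            print(f"{num_params:.0e} params on {num_gpus:>6} GPUs: {gib:,.2f} GiB per GPU")
```

Under these assumptions a 1T-parameter model carries about 16 TB of model state in total, which comes to about 0.8 GB per GPU when sharded over 20,000 GPUs. The sharded state is therefore not the whole problem at this scale: the terms this sketch omits, such as gathered full-size tensors and activations, can take a large share of the memory budget.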

Challenges of large-scale FSDP

Memory optimization

Training efficiency optimization

Fault tolerance and recovery