FSDP2 10k

Goals:

Super-large models with high sparsity and rapid updates
Super-large cluster scale
Tight step-time budget

Problems:

Large models (200B to 1T+ parameters)
  1. Large tensors on a tight memory budget
Sparse, long-context models
  1. Massive communication overhead
  2. Rapid model updates and diverse training jobs
20k+ GPUs
  1. Complex network environment
  2. Frequent hardware failures and stragglers
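
To make the memory pressure concrete, here is a rough back-of-the-envelope sketch. It is not taken from the original post: the 16 bytes per parameter figure assumes bf16 parameters and gradients plus fp32 master weights and two fp32 Adam moments, and the 10B-parameter wrapping unit is a hypothetical size chosen only for illustration. Under those assumptions, the fully sharded steady state is small per GPU, while the transient unsharded tensors that FSDP all-gathers before compute are what strain the budget.

```python
GIB = 2**30

def sharded_state_gib(num_params, num_gpus, bytes_per_param=16):
    """Per-GPU memory (GiB) for fully sharded params, grads, and optimizer state.

    bytes_per_param=16 is an assumption: bf16 param (2) + bf16 grad (2)
    + fp32 master copy (4) + two fp32 Adam moments (4 + 4).
    """
    return num_params * bytes_per_param / num_gpus / GIB

def unsharded_unit_gib(unit_params, bytes_per_element=2):
    """Transient memory (GiB) for one all-gathered unit, assuming bf16 elements."""
    return unit_params * bytes_per_element / GIB

if __name__ == "__main__":
    # Model sizes and GPU count follow the ranges named in the outline above.
    for num_params in (200e9, 1e12):
        per_gpu = sharded_state_gib(num_params, num_gpus=20_000)
        print(f"{num_params:.0e} params on 20k GPUs: {per_gpu:.3f} GiB sharded state per GPU")

    # Hypothetical wrapping unit of 10B parameters, materialized in full on each GPU.
    print(f"10B-param unit, unsharded: {unsharded_unit_gib(10e9):.1f} GiB")
```

Activations, communication buffers, and allocator fragmentation are left out of this estimate, so real per-GPU usage would be higher.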

Challenges of Large-Scale FSDP

Memory Optimization

Training Efficiency Optimization

Fault Tolerance and Recovery

Author houmin

Publish January 1, 0001

LastMod May 12, 2026

License CC BY-NC-ND 4.0
