FSDP2 10k
Goals:
- Super-large models with high sparsity and rapid updates
- Super-large cluster scale
- Tight step-time budget
Problems faced:
- Large models (200B to 1T+ parameters)
- Large tensors on tight memory budget
- Sparse, long-context models
- Massive communication overhead
- Rapid model updates & diverse training jobs
- 20k+ GPUs
- Complicated network environment
- Frequent hardware failures and stragglers
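To make the "large tensors on tight memory budget" point concrete, here is a rough back-of-envelope sketch. It is not taken from the source: the 16 bytes per parameter figure (a common mixed-precision Adam accounting), the 10B-parameter wrapped unit, and both function names are assumptions chosen purely for illustration.

```python
GIB = 2**30

def sharded_state_gib(num_params, num_gpus, bytes_per_param=16):
    """Persistent per-GPU memory when all model state is evenly sharded.

    Assumption: 16 bytes per parameter (weights, gradients, and Adam
    optimizer state under mixed precision). Activations are not counted.
    """
    return num_params * bytes_per_param / num_gpus / GIB

def unsharded_unit_gib(unit_params, bytes_per_param=2):
    """Transient per-GPU memory while one wrapped unit is all-gathered.

    Assumption: parameters are gathered in a 2-byte dtype such as bf16.
    This cost does not shrink as more GPUs are added.
    """
    return unit_params * bytes_per_param / GIB

if __name__ == "__main__":
    for n in (200e9, 1e12):
        gib = sharded_state_gib(n, 20_000)
        print(f"{n:.0e} params, 20,000 GPUs: {gib:.2f} GiB sharded state per GPU")
    # Hypothetical 10B-parameter unit (for example, a large expert group).
    print(f"10B-param unit all-gathered: {unsharded_unit_gib(10e9):.2f} GiB per GPU")
```

Under these assumptions the sharded state per GPU stays small at this cluster size, while a single large all-gathered unit can dominate the budget, which is one plausible reading of why large individual tensors are listed as a problem.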
Challenges of large-scale FSDP
Memory optimization
Training efficiency optimization
Fault tolerance and recovery