FSDP + TP/EP: Advantages

FSDP + TP/EP: Implementation

Applies the EP (when enabled) + FSDP2 parallel strategy to the model.

Flow:
1. Apply EP: expert tensors of shape [128, H, I] become local tensors of shape [32, H, I] on each EP rank (128 experts split evenly across 4 EP ranks)
2. Apply FSDP2 to expert modules: shard the local expert tensors along dim 1 (the hidden dim)
3. Apply FSDP2 to regular modules: standard dim-0 sharding
4. Result: each rank holds expert params of shape [32, H/fsdp_size, I]; regular params use standard FSDP2 dim-0 sharding
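
The shape arithmetic in the flow above can be checked with a small sketch. This is plain Python with no distributed setup, not the actual implementation: the helper names `expert_local_shape` and `regular_local_shape` are hypothetical, the values H = 2048, I = 768, and fsdp_size = 8 are illustrative assumptions, and even divisibility is assumed so that no padding or uneven shards come into play.

```python
def expert_local_shape(global_shape, ep_size, fsdp_size):
    """Per-rank shape of an expert weight after EP (dim 0), then FSDP2 (dim 1).

    Hypothetical helper for illustration; assumes even divisibility.
    """
    num_experts, hidden, inter = global_shape
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    if hidden % fsdp_size != 0:
        raise ValueError("hidden dim must be divisible by fsdp_size")
    # Step 1: EP splits the expert dimension (dim 0) across EP ranks.
    after_ep = (num_experts // ep_size, hidden, inter)
    # Step 2: FSDP2 shards the local expert tensor along dim 1 (hidden dim).
    return (after_ep[0], after_ep[1] // fsdp_size, after_ep[2])


def regular_local_shape(global_shape, fsdp_size):
    """Per-rank shape of a regular parameter under standard dim-0 sharding.

    Hypothetical helper for illustration; assumes even divisibility.
    """
    dim0, *rest = global_shape
    if dim0 % fsdp_size != 0:
        raise ValueError("dim 0 must be divisible by fsdp_size")
    # Step 3: standard FSDP2 sharding splits dim 0 across FSDP ranks.
    return (dim0 // fsdp_size, *rest)


# Illustrative sizes (assumptions, not taken from the source).
H, I = 2048, 768
expert_shape = expert_local_shape((128, H, I), ep_size=4, fsdp_size=8)
regular_shape = regular_local_shape((H, H), fsdp_size=8)
print(expert_shape)   # (32, 256, 768)
print(regular_shape)  # (256, 2048)
```

With ep_size = 4, the expert dimension goes from 128 to 32, matching step 1, and the hidden dimension is then divided by fsdp_size, matching the [32, H/fsdp_size, I] result in step 4.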

References