TorchTitan Paper Walkthrough: A PyTorch-Native Solution for LLM Pre-training
I recently saw that TorchTitan was accepted to ICLR 2025. Below is a quote from the paper chair's decision (LegoScale was the placeholder name used for the double-blind submission):
"
I recommend Accept for the following reasons:
- This is a production-grade framework that covers a wide range of parallelism methods and can be useful for the ML community. The ability of the framework to unify (or at least attempt to unify) and improve distributed training workflows is likely to have significant impact, particularly for researchers and practitioners working on LLMs.
- The open-source nature of this work and already active community engagement supports the value of the framework.
- The rebuttal addressed most of the reviewer concerns, including clarifications on contributions, comparisons with related systems, and updates to incorporate CP (context parallelism) and expert parallelism (I believe the community can extend the framework even further).
- While I agree with the reviewers regarding limited research novelty, the detailed design of LegoScale can be a strong research tool (and possibly a strong baseline for comparisons) for the community and can inspire further innovations and extensions in the field.
Overall, this submission represents a valuable addition to the ML community, providing a well-engineered solution to challenges in LLM pre-training and serving as a benchmark for distributed training frameworks.
"
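To make the "wide range of parallelism methods" concrete: the unifying abstraction behind composing data, tensor, pipeline, and context parallelism is a logical device mesh, where the world of ranks is factored into named dimensions and each technique runs collectives along one dimension. The sketch below is my own dependency-free illustration of that idea; the names and layout are assumptions for exposition, not TorchTitan's actual implementation, which builds on PyTorch's DTensor/DeviceMesh.

```python
from itertools import product

def build_mesh(world_size, dims):
    """Factor `world_size` ranks into a named mesh, e.g. {"pp": 2, "dp": 2, "tp": 2}.
    Returns a rank -> coordinate mapping. Illustrative sketch only."""
    total = 1
    for s in dims.values():
        total *= s
    assert total == world_size, "mesh dims must multiply to the world size"
    coords = {}
    for rank, coord in enumerate(product(*(range(s) for s in dims.values()))):
        coords[rank] = dict(zip(dims.keys(), coord))
    return coords

def group_along(coords, dim):
    """Process groups along one mesh dimension: ranks that differ only in `dim`.
    E.g. the "tp" groups are the rank sets over which tensor-parallel
    collectives (all-reduce / all-gather) would run."""
    groups = {}
    for rank, c in coords.items():
        key = tuple(v for k, v in c.items() if k != dim)
        groups.setdefault(key, []).append(rank)
    return list(groups.values())

# 8 ranks factored as 2-way pipeline x 2-way data x 2-way tensor parallelism.
mesh = build_mesh(8, {"pp": 2, "dp": 2, "tp": 2})
print(mesh[5])                  # mesh coordinate of rank 5
print(group_along(mesh, "tp"))  # the tensor-parallel groups
```

Composing a new parallelism then amounts to adding one more mesh dimension, which is one way to read the chair's point about the framework unifying distributed training workflows.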