PYTORCH SHIPS METAL ACCELERATORS AND SPARSE DTYPE SUPPORT WHILE TORCHTITAN ADVANCES EXPERT PARALLELISM
By RepoJournal · Filed · About PyTorch
PyTorch landed critical MPS kernel migrations and sparse tensor improvements overnight while the Titan team built out the distributed training pipeline for expert-parallel models.
The core team completed Metal Performance Shaders (MPS) implementations for two ops: the GLU forward pass [1] now runs 2x faster via MPSGraph instead of TensorIterator, and CTC loss [2] shipped with full forward-pass support optimized for batch parallelism. On the sparse front, `torch.sparse.sampled_addmm` now handles float16 and bfloat16 on CUDA [3], fixing a backward-pass gap that broke half-precision sparse matrix multiplication.

Meanwhile, TorchTitan merged the graph trainer's expert parallelism (EP) infrastructure: the EP overlap scheduler [5] and chunking pass [6] enable Inductor to optimize token dispatch across distributed model experts, and a new `DPRequestRouter` [4] centralizes data-parallel routing logic.

In pytorch/test-infra, CI safeguards tightened: the AI-advisor outage guard [7] does not bail on entire PRs when broad failures are expected, and the CRCR zombie-workflow cleaner [8] now purges stale cross-repo CI entries from Redis.

ExecuTorch fixed Windows CI by forcing CPU-only builds [10] to avoid CUDA toolkit conflicts, shipped Arm TOSA binary op support [11], and bumped Vela to 5.1.0 [9].
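For readers unfamiliar with what `torch.sparse.sampled_addmm` computes, the following is a minimal, torch-free sketch of its documented semantics: the dense product `mat1 @ mat2` is evaluated only at the positions stored in the sparse CSR `input`, scaled by `alpha`, and added to `beta * input`. The function name `sampled_addmm_reference` and the list-based CSR layout are illustrative assumptions, not PyTorch code, and the sketch says nothing about how the CUDA kernel or the half-precision path in [3] is implemented.

```python
def sampled_addmm_reference(crow, col, values, mat1, mat2, beta=1.0, alpha=1.0):
    """Hypothetical pure-Python reference for the sampled_addmm semantics.

    crow, col, values describe the sparse CSR `input` (row pointers,
    column indices, stored values). mat1 and mat2 are dense matrices
    given as nested lists. Returns the new CSR values; the sparsity
    pattern (crow, col) of the result is the same as the input's.
    """
    inner = len(mat2)  # shared dimension of mat1 @ mat2
    out = []
    for row in range(len(crow) - 1):
        # Visit only the entries stored in this row of the sparse input.
        for k in range(crow[row], crow[row + 1]):
            j = col[k]
            dot = sum(mat1[row][p] * mat2[p][j] for p in range(inner))
            out.append(beta * values[k] + alpha * dot)
    return out


# Sparse 2x2 input with entries on the diagonal only.
crow = [0, 1, 2]
col = [0, 1]
values = [10.0, 20.0]

mat1 = [[1.0, 2.0], [3.0, 4.0]]
mat2 = [[5.0, 6.0], [7.0, 8.0]]

# mat1 @ mat2 = [[19, 22], [43, 50]]; only (0, 0) and (1, 1) are sampled.
result = sampled_addmm_reference(crow, col, values, mat1, mat2)
print(result)  # [29.0, 70.0]
```

In PyTorch itself the corresponding call is `torch.sparse.sampled_addmm(input, mat1, mat2, beta=1.0, alpha=1.0)` with `input` as a sparse CSR tensor. Per [3], the float16 and bfloat16 variants of that call are what the CUDA change enables.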
Action items
- → If using sparse half-precision ops on CUDA, upgrade to pick up the `sampled_addmm` fixes [3] pytorch/pytorch [plan]
- → Review the TorchTitan EP overlap scheduler and chunking PRs if building distributed expert-parallel models [5] [6] pytorch/torchtitan [monitor]
- → ExecuTorch Windows CI is stable again: Windows unittest jobs should go green with the CPU-only fix [10] pytorch/executorch [monitor]
References
- [1] [MPS] Migrate GLU to Metal (#187833) pytorch/pytorch
- [2] [MPS] Add `ctc_loss` forward pass (#187716) pytorch/pytorch
- [3] Add float16/bfloat16 support to sparse CSR sampled_addmm (#187681) pytorch/pytorch
- [4] Add DPRequestRouter and use it in generator pytorch/torchtitan
- [5] [graph_trainer] Add EP overlap scheduling pass pytorch/torchtitan
- [6] [graph_trainer] Add graph EP chunking pass pytorch/torchtitan
- [7] [torchci] AI advisor: stable-hash sanity cap + ci-no-td outage-guard bypass pytorch/test-infra
- [8] [CRCR] Implement zombie workflow entries cleaner pytorch/test-infra
- [9] Arm backend: Bump vela to 5.1.0 (#20181) pytorch/executorch
- [10] Fix Windows unittest CI: force CPU-only build (CUDA 13.2 toolkit on runner breaks _portable_lib load) (#20527) pytorch/executorch
- [11] Arm backend: Add TOSA binary op visitors pytorch/executorch
FAQ
- What changed in PyTorch on June 27, 2026?
- PyTorch landed critical MPS kernel migrations and sparse tensor improvements overnight while the Titan team built out the distributed training pipeline for expert-parallel models.
- What should PyTorch teams do about it?
- If using sparse half-precision ops on CUDA, upgrade to pick up the `sampled_addmm` fixes [3] • Review the TorchTitan EP overlap scheduler and chunking PRs if building distributed expert-parallel models [5] [6] • ExecuTorch Windows CI is stable again: Windows unittest jobs should go green with the CPU-only fix [10]
- Which PyTorch repositories shipped on June 27, 2026?
- pytorch/pytorch, pytorch/torchtitan, pytorch/test-infra, pytorch/executorch