ROCM DISTRIBUTED JOBS GRADUATE FROM SANDBOX, HELION SHIPS CUDAGRAPH OPTIMIZATION
By RepoJournal · Filed · About PyTorch
PyTorch's ROCm CI pipeline is moving its mi200 distributed test suite out of sandbox and onto the periodic production schedule, while Helion reports a roughly 5x speedup on GB200 from CUDA graph integration.
The pytorch/pytorch team graduated the distributed ROCm jobs from trunk-rocm-sandbox to periodic-rocm-mi200, removing the sandbox copies and restoring the periodic cron schedule [1]. The mi200 distributed jobs had been timing out consistently in sandbox, so the move signals confidence that they will be reliable on the periodic schedule. On the Helion front, CUDA graph support for running examples delivers a large gain: with cudagraph enabled, the add.py example drops from 0.0208 ms to parity with torch's 0.0076 ms baseline [2]. The change is reported as a 5.4x improvement on GB200, although the two timings quoted here differ by about 2.7x. The same team is shipping a CuTe NVFP4 GEMV example [3] with inline assembly support for FP4/E4M3 decode operations, and has stacked on it a change [4] that infers FP4 conversions directly from tensor dtypes, avoiding unsupported scalar dereference paths. Separately, a pytorch/pytorch PR removed an accidental skip in OpInfo's NumPy reference tests that had been silencing CPU validation [5], though the PR was auto-reverted moments later [6] because of downstream failures. TorchTitan's graph trainer is simplifying its FSDP memory policy by removing the redundant `fsdp_reshard_after_fwd_pass` parameter [7] and consolidating reshard logic into the unified memory policy framework. Documentation debt is also being cleared: the torch.signal.windows functions were removed from the coverage ignore lists [8], which brings 11 window functions under the docstring requirements.
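For readers unfamiliar with the FP4/E4M3 decode mentioned for the NVFP4 GEMV example [3], the following is a minimal pure-Python sketch of what such a decode computes. It is not Helion's code or its inline assembly. It assumes the commonly described NVFP4 layout: 4-bit E2M1 elements scaled by an FP8 E4M3 value per block. The function names, the low-nibble-first packing order, and the omission of any per-tensor scale are assumptions made for illustration only.

```python
import math

# E2M1 magnitudes indexed by the low 3 bits (2 exponent bits, 1 mantissa bit, bias 1).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 value; bit 3 is the sign."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 value (bias 7, no infinities, S.1111.111 is NaN)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return math.nan
    if exponent == 0:
        # Subnormal: no implicit leading one.
        return sign * (mantissa / 8.0) * 2.0 ** -6
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequantize_block(packed: bytes, scale_byte: int) -> list[float]:
    """Unpack two FP4 values per byte and apply the block scale.

    Assumption: the low nibble holds the first element of each pair.
    """
    scale = decode_e4m3(scale_byte)
    values = []
    for b in packed:
        values.append(decode_e2m1(b & 0xF) * scale)
        values.append(decode_e2m1(b >> 4) * scale)
    return values
```

For example, `dequantize_block(bytes([0x21]), 0x40)` decodes the nibbles 0x1 and 0x2 (0.5 and 1.0) and multiplies each by the scale 0x40 (2.0). A GEMV would then take the dot product of these dequantized values with the input vector.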
Action items
- → Verify your ROCm CI jobs are migrating to periodic-rocm-mi200; deprecated sandbox distributed jobs will be removed pytorch/pytorch [plan]
- → If using TorchTitan graph training, audit for hardcoded `fsdp_reshard_after_fwd_pass` parameters and migrate to the memory policy framework pytorch/torchtitan [plan]
- → Monitor the reverted removal of the OpInfo NumPy test skip; the root cause appears to be dtype validation in meta registrations, so expect rework pytorch/pytorch [monitor]
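The TorchTitan audit suggested above can be started with a small script. This is a minimal sketch, not a TorchTitan tool: the parameter name is the one in reference [7], while the helper names and the set of file suffixes scanned are assumptions chosen for illustration.

```python
import re
from pathlib import Path

# Parameter name taken from reference [7]; adjust if your configs spell it differently.
PARAM_PATTERN = re.compile(r"\bfsdp_reshard_after_fwd_pass\b")


def find_param_uses(text: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for each line that mentions the parameter."""
    return [
        (lineno, line.strip())
        for lineno, line in enumerate(text.splitlines(), start=1)
        if PARAM_PATTERN.search(line)
    ]


def scan_tree(root: str, suffixes=(".py", ".toml", ".yaml", ".yml")) -> dict:
    """Walk a directory and collect matches per file (the suffix list is an assumption)."""
    hits = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            found = find_param_uses(path.read_text(encoding="utf-8", errors="ignore"))
            if found:
                hits[str(path)] = found
    return hits
```

Calling `scan_tree(".")` from a project root returns a mapping from file path to the matching lines, which gives a checklist of call sites and config entries to migrate.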
References
- [1] [ROCm][CI] Remove sandbox distributed jobs; restore periodic-rocm-mi200 cron schedule (#183914) pytorch/pytorch
- [2] Enable cudagraph for running examples pytorch/helion
- [3] Add CuTe NVFP4 GEMV example pytorch/helion
- [4] Infer CuTe NVFP4 conversions from dtypes pytorch/helion
- [5] [test] Remove unintentional skip for OpInfo test against NumPy on CPU (#182999) pytorch/pytorch
- [6] Revert "[test] Remove unintentional skip for OpInfo test against NumPy on CPU (#182999)" pytorch/pytorch
- [7] [graph_trainer] Remove fsdp_reshard_after_fwd_pass pytorch/torchtitan
- [8] [Docathon]: removed torch.signal.windows functions from coverage ignore (#183454) pytorch/pytorch
FAQ
- What changed in PyTorch on May 16, 2026?
- PyTorch's ROCm CI pipeline is moving its mi200 distributed test suite out of sandbox and onto the periodic production schedule, while Helion reports a roughly 5x speedup on GB200 from CUDA graph integration.
- What should PyTorch teams do about it?
- Verify your ROCm CI jobs are migrating to periodic-rocm-mi200; deprecated sandbox distributed jobs will be removed • If using TorchTitan graph training, audit for hardcoded `fsdp_reshard_after_fwd_pass` parameters and migrate to the memory policy framework • Monitor the reverted removal of the OpInfo NumPy test skip; the root cause appears to be dtype validation in meta registrations, so expect rework
- Which PyTorch repositories shipped on May 16, 2026?
- pytorch/pytorch, pytorch/helion, pytorch/torchtitan