Who contributed to PyTorch on May 16, 2026?

3 developers shipped this update, including choijon5, oulgen, and SherlockNoMad.

What were the notable PyTorch updates?

[ROCm][CI] Remove sandbox distributed jobs; restore periodic-rocm-mi200 cron schedule (#183914), Enable cudagraph for running examples, and Add CuTe NVFP4 GEMV example.

@pytorch

PyTorch and the broader machine-learning ecosystem

ROCM DISTRIBUTED JOBS GRADUATE FROM SANDBOX, HELION SHIPS CUDAGRAPH OPTIMIZATION

By RepoJournal · Filed 06:04 UTC on May 16, 2026 · About PyTorch

3 people shipped this

oulgen @oulgen 2 cited

choijon5 @choijon5 1 cited

SherlockNoMad @SherlockNoMad 1 cited

PyTorch's ROCm CI is moving its mi200 distributed test jobs out of sandbox and onto the periodic-rocm-mi200 schedule, while Helion reports a roughly 5x speedup on GB200 through CUDA graph integration.

The pytorch/pytorch team graduated distributed ROCm jobs from trunk-rocm-sandbox to periodic-rocm-mi200 and restored the periodic cron schedule [1]. The mi200 distributed jobs had been timing out consistently in sandbox; the move nonetheless signals confidence in their reliability.

On the Helion front, CUDA graph support for running examples delivers large gains on GB200: cudagraph cuts the add.py example from 0.0208ms down to parity with torch's 0.0076ms baseline [2]. Those two timings work out to roughly a 2.7x reduction (0.0208 / 0.0076 ≈ 2.74), so the 5.4x figure quoted for GB200 presumably comes from a different measurement. The same team shipped a CuTe NVFP4 GEMV example [3] with inline assembly support for FP4/E4M3 decode operations, and stacked on it a change [4] that infers FP4 conversions directly from tensor dtypes, avoiding unsupported scalar dereference paths.

Separately, a PR removed an accidental skip in OpInfo's NumPy reference tests that had been silencing CPU validation [5], but it was auto-reverted shortly afterward [6] due to downstream failures. TorchTitan's graph trainer is simplifying its FSDP memory policy by removing the redundant `fsdp_reshard_after_fwd_pass` parameter [7], consolidating reshard logic into the unified memory policy framework. Documentation debt is also being cleared: the torch.signal.windows functions were removed from the coverage ignore lists [8], bringing 11 window functions under the docstring coverage requirements.
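To make the FP4/E4M3 decode mentioned above concrete, here is a minimal pure-Python reference sketch of NVFP4-style dequantization followed by a matrix-vector product. It is not the Helion or CuTe code from [3] or [4]. The function names are hypothetical, and two layout choices are assumptions: the low nibble of each byte holds the first element, and one E4M3 scale byte covers a block of 16 values.

```python
# Reference sketch only; a real kernel would do this in registers on the GPU.
# Assumptions (not taken from the PRs): low nibble first, 16 values per scale.

E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed number of FP4 values sharing one E4M3 scale


def decode_e2m1(code):
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def decode_e4m3(byte):
    """Decode an FP8 E4M3 byte: bias 7, no infinities, S.1111.111 is NaN."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0xF and mant == 0x7:
        return float("nan")
    if exp == 0:  # subnormal range
        return sign * (mant / 8.0) * 2.0 ** -6
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)


def unpack_fp4(packed):
    """Split each byte into two FP4 values, low nibble first (assumed order)."""
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0xF))
        values.append(decode_e2m1(byte >> 4))
    return values


def dequantize_row(packed, scales):
    """Apply the per-block E4M3 scale to every decoded FP4 value in a row."""
    values = unpack_fp4(packed)
    return [v * decode_e4m3(scales[i // BLOCK]) for i, v in enumerate(values)]


def gemv(rows, x):
    """rows is a list of (packed_bytes, scale_bytes) pairs, one per output."""
    return [
        sum(w * xi for w, xi in zip(dequantize_row(packed, scales), x))
        for packed, scales in rows
    ]
```

A row of 16 weights occupies 8 packed bytes plus 1 scale byte, so `gemv([(packed, scales)], x)` with a 16-element `x` returns a single dot product. Deriving the conversion from the storage format, rather than from a per-call flag, is the same idea the dtype-inference change [4] describes at the kernel level.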

Action items

→ Verify that your ROCm CI jobs target periodic-rocm-mi200; the sandbox distributed jobs are being removed. pytorch/pytorch [plan]
→ If using TorchTitan graph training, audit for hardcoded fsdp_reshard_after_fwd_pass parameters and migrate to the memory policy framework. pytorch/torchtitan [plan]
→ Monitor the reverted OpInfo NumPy test skip; the root cause appears to be dtype validation in meta registrations, so expect rework. pytorch/pytorch [monitor]
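The TorchTitan audit can be started with a plain text search. The sketch below is a hypothetical helper, not part of TorchTitan: it walks a checkout and lists every line that mentions the parameter name used in reference [7]. The file suffixes scanned are an assumption about where such settings usually live.

```python
from pathlib import Path

# Name taken from reference [7]; pass a different needle if your code differs.
NEEDLE = "fsdp_reshard_after_fwd_pass"


def find_references(root, needle=NEEDLE, suffixes=(".py", ".toml", ".yaml", ".yml")):
    """Return (path, line_number, stripped_line) for every line containing needle."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip unreadable or binary files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Calling `find_references(".")` from the root of a training repository gives the list of call sites and config entries to review by hand before migrating them to the memory policy.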

References

[1] [ROCm][CI] Remove sandbox distributed jobs; restore periodic-rocm-mi200 cron schedule (#183914) pytorch/pytorch
[2] Enable cudagraph for running examples pytorch/helion
[3] Add CuTe NVFP4 GEMV example pytorch/helion
[4] Infer CuTe NVFP4 conversions from dtypes pytorch/helion
[5] [test] Remove unintentional skip for OpInfo test against NumPy on CPU (#182999) pytorch/pytorch
[6] Revert "[test] Remove unintentional skip for OpInfo test against NumPy on CPU (#182999)" pytorch/pytorch
[7] [graph_trainer] Remove fsdp_reshard_after_fwd_pass pytorch/torchtitan
[8] [Docathon]: removed torch.signal.windows functions from coverage ignore (#183454) pytorch/pytorch

Quick answers

What shipped in PyTorch on May 16, 2026?: ROCm mi200 distributed jobs moved from sandbox to the periodic-rocm-mi200 schedule, and Helion added CUDA graph support for running examples. In total, 74 commits and 20 pull requests landed.
