Who contributed to PyTorch on May 20, 2026?

2 developers shipped this update, including AmesingFlank and mcremon-meta.

What were the notable PyTorch updates?

[fsdp] Remove redundant stream waits (#183983), Fix broken ConvBNReLu from new Convert1DConvTo2D pass (#19558), and Only include common outputs as outputs of traced if subgraph.

FSDP2 CUDA GRAPHS STREAM EXPLOSION FIXED, EXECUTORCH FUSION PIPELINE PATCHED

By RepoJournal · Filed 06:04 UTC on May 20, 2026 · About PyTorch

PyTorch's FSDP implementation shipped a critical fix for stream proliferation in CUDA graphs, while ExecuTorch closed a gap in its convolution fusion pipeline that was breaking downstream models.

The big win: FSDP2 now eliminates redundant stream waits when running under CUDA graphs [1]. The fix does not fully solve stream proliferation (TP, EP, and micro-batching edge cases remain), but it cuts the stream count significantly. It requires CUDA 13.2 or higher.

In parallel, ExecuTorch's ConvBNReLU fusion pipeline had been broken by a new Convert1DConvTo2D pass that did not account for batch norm followed by an activation function [2]. That is now patched. ExecuTorch also fixed permute cancellation around rank-changing views [5] and added QNN backend support for the randn operation [6], both landing in time for broader model export use cases.

Elsewhere, the docs preview pipeline was rearchitected to avoid S3 write permission issues with fork PRs [4], moving artifact staging to GitHub Actions instead of the Kubernetes pod. Separately, Helion fixed a critical logic bug in traced if-subgraph outputs that was leaking branch-local variables across branch boundaries [3], which could cause silent correctness issues in control flow tracing.
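To make the Helion item concrete, here is a toy sketch of what "only include common outputs" could mean for an if statement: treat a name as an output only when both branches assign it. This is an illustration of the idea in the commit title, not Helion's implementation; the helper names `assigned_names` and `common_if_outputs` are hypothetical, and the real tracer may apply additional rules (for example, for names defined before the branch).

```python
import ast


def assigned_names(stmts):
    """Collect every name that is assigned somewhere in a list of statements."""
    names = set()
    for stmt in stmts:
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
                names.add(node.id)
    return names


def common_if_outputs(source):
    """Return the names assigned in BOTH branches of the first `if` in source.

    Hypothetical helper: names assigned in only one branch are treated as
    branch-local and are excluded from the outputs.
    """
    tree = ast.parse(source)
    if_node = next(n for n in ast.walk(tree) if isinstance(n, ast.If))
    return assigned_names(if_node.body) & assigned_names(if_node.orelse)


SRC = """
if flag:
    y = x + 1
    tmp = y * 2   # assigned only in this branch
else:
    y = x - 1
"""

# `tmp` exists in one branch only, so it is not reported as an output.
outputs = common_if_outputs(SRC)
print(sorted(outputs))
```

In this toy model, exporting `tmp` as an output of the whole `if` would be the kind of cross-branch leak the article describes, since the `else` path never defines it.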

Action items

→ If running FSDP2 + CUDA graphs: upgrade to get the stream wait fix and verify CUDA >= 13.2 (pytorch/pytorch) [immediate]
→ ExecuTorch users: verify that your ConvBNReLU fusion works with the new pass after upgrading (pytorch/executorch) [plan]
→ Monitor Helion's if-subgraph fix if using traced control flow in production (pytorch/helion) [monitor]
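For the first action item, a minimal sketch of a CUDA version check might look like the following. It assumes the 13.2 threshold quoted above and uses `torch.version.cuda`, which reports the CUDA version PyTorch was built against (or `None` for CPU-only builds), not the installed driver version. The helper names `parse_version` and `cuda_at_least` are hypothetical.

```python
def parse_version(text):
    """Turn a 'major.minor' string into a comparable (major, minor) tuple."""
    parts = [int(part) for part in text.split(".")[:2]]
    while len(parts) < 2:
        parts.append(0)  # treat "13" as "13.0"
    return tuple(parts)


def cuda_at_least(found, minimum="13.2"):
    """Return True if the CUDA version string `found` is >= `minimum`.

    `found` may be None (for example, a CPU-only PyTorch build).
    The default threshold is the one quoted in the article.
    """
    if found is None:
        return False
    return parse_version(found) >= parse_version(minimum)


if __name__ == "__main__":
    try:
        import torch
        found = torch.version.cuda  # build-time CUDA version, or None
    except ImportError:
        found = None
    print(f"CUDA version: {found}, meets minimum: {cuda_at_least(found)}")
```

Comparing parsed tuples rather than raw strings avoids ordering mistakes such as "13.10" sorting before "13.2".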

References

[1] [fsdp] Remove redundant stream waits (#183983) pytorch/pytorch
[2] Fix broken ConvBNReLu from new Convert1DConvTo2D pass (#19558) pytorch/executorch
[3] Only include common outputs as outputs of traced if subgraph pytorch/helion
[4] Upload docs preview from a workflow_run job, not the OSDC pod (#184414) pytorch/pytorch
[5] Handle rank-changing views in RemovePermutesAroundElementwiseOps (#19538) pytorch/executorch
[6] Qualcomm AI Engine Direct - Adding QNN backend support for randn core ATen op (#19377) pytorch/executorch

Quick answers

What shipped in PyTorch on May 20, 2026? PyTorch's FSDP implementation shipped a critical fix for stream proliferation in CUDA graphs, while ExecuTorch closed a gap in its convolution fusion pipeline that was breaking downstream models. In total, 97 commits and 43 pull requests landed.
