The Wire · Showcase
FSDP2 CUDA GRAPHS STREAM EXPLOSION FIXED, EXECUTORCH FUSION PIPELINE PATCHED
By RepoJournal · Filed · About PyTorch
PyTorch's FSDP implementation shipped a critical fix for stream proliferation in CUDA graphs, while ExecuTorch closed a gap in its convolution fusion pipeline that was breaking downstream models.
The big win: FSDP2 now eliminates redundant stream waits when running under CUDA graphs [1]. The fix does not fully solve stream proliferation: edge cases remain for tensor parallelism (TP), expert parallelism (EP), and micro-batching. It does cut the stream count significantly, and it requires CUDA 13.2 or higher.
In parallel, ExecuTorch's ConvBNReLU fusion pipeline had been broken by a new Convert1DConvTo2D pass that did not account for batch norm followed by an activation function [2]; that is now patched. Separately, Helion fixed a critical logic bug in which traced if-subgraph outputs leaked local variables across branch boundaries [3], which could cause silent correctness issues in control flow tracing.
The docs preview pipeline was rearchitected to avoid S3 write permission issues with fork PRs [4], moving artifact staging to GitHub Actions instead of the Kubernetes pod. ExecuTorch also fixed permute cancellation around rank-changing views [5] and added QNN backend support for the randn operation [6], both of which support broader model export use cases.
Action items
- If running FSDP2 with CUDA graphs: upgrade to pick up the stream wait fix and verify CUDA >= 13.2 (pytorch/pytorch) [immediate]
- ExecuTorch users: after upgrading, verify that ConvBNReLU fusion still works with the new Convert1DConvTo2D pass (pytorch/executorch) [plan]
- If using traced control flow in production: monitor Helion's if-subgraph fix (pytorch/helion) [monitor]
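The CUDA >= 13.2 check in the first action item can be scripted. The following is a minimal sketch, not an official check: the helper names `parse_cuda_version`, `meets_cuda_minimum`, and `torch_cuda_version` are hypothetical, and it assumes that the version reported by `torch.version.cuda` (the CUDA toolkit the installed PyTorch build was compiled against, or `None` on CPU-only builds) is the relevant one to compare. The installed driver may also matter and is not checked here.

```python
from typing import Optional, Tuple


def parse_cuda_version(version: Optional[str]) -> Optional[Tuple[int, int]]:
    """Parse a 'major.minor' string (e.g. '13.2') into an integer tuple.

    Returns None when the string is missing or not numeric.
    """
    if not version:
        return None
    parts = version.split(".")
    try:
        major = int(parts[0])
        minor = int(parts[1]) if len(parts) > 1 else 0
    except ValueError:
        return None
    return (major, minor)


def meets_cuda_minimum(
    version: Optional[str], minimum: Tuple[int, int] = (13, 2)
) -> bool:
    """Return True if `version` is at least `minimum`.

    Tuples are compared numerically, so '13.10' counts as newer than '13.2'.
    """
    parsed = parse_cuda_version(version)
    return parsed is not None and parsed >= minimum


def torch_cuda_version() -> Optional[str]:
    """Return torch.version.cuda, or None if PyTorch is not installed.

    Assumption: the build-time CUDA version is the one that matters here.
    """
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda
```

Calling `meets_cuda_minimum(torch_cuda_version())` in the target environment returns `False` for CPU-only builds and for builds older than 13.2. Comparing integer tuples rather than raw strings avoids misordering versions such as 13.10 and 13.2.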
References
- [1] [fsdp] Remove redundant stream waits (#183983), pytorch/pytorch
- [2] Fix broken ConvBNReLu from new Convert1DConvTo2D pass (#19558), pytorch/executorch
- [3] Only include common outputs as outputs of traced if subgraph, pytorch/helion
- [4] Upload docs preview from a workflow_run job, not the OSDC pod (#184414), pytorch/pytorch
- [5] Handle rank-changing views in RemovePermutesAroundElementwiseOps (#19538), pytorch/executorch
- [6] Qualcomm AI Engine Direct - Adding QNN backend support for randn core ATen op (#19377), pytorch/executorch
FAQ
- What changed in PyTorch on May 20, 2026?
- PyTorch's FSDP implementation shipped a critical fix for stream proliferation in CUDA graphs, while ExecuTorch closed a gap in its convolution fusion pipeline that was breaking downstream models.
- What should PyTorch teams do about it?
- If running FSDP2 with CUDA graphs: upgrade to pick up the stream wait fix and verify CUDA >= 13.2. ExecuTorch users: after upgrading, verify that ConvBNReLU fusion still works with the new pass. If using traced control flow in production: monitor Helion's if-subgraph fix.
- Which PyTorch repositories shipped on May 20, 2026?
- pytorch/pytorch, pytorch/executorch, pytorch/helion