

PALLAS JAGGED KERNELS NOW PIPELINE, FSDP2 UNBLOCKS BACKWARD COMPUTE

By RepoJournal · Filed 22:43 UTC on June 8, 2026 · About PyTorch

2 people shipped this: ethche (@ethche, 1 cited) and Gasoonjia (@Gasoonjia, 1 cited).

Helion's Pallas backend now supports emit_pipeline BlockSpecs over runtime-determined tile ranges in jagged kernels, while FSDP2 adds a buffer control to keep reduce-scatter from stalling the backward pass.

Helion's Pallas backend now enables BlockSpecs in emit_pipeline for jagged tile ranges [1], so kernels that load tile offsets at runtime can stage their computation through the pipeline. This matters because jagged (variable-length) layouts are common in LLM attention kernels, and the previous code could not address tiles correctly when their boundaries were not block-aligned.

In parallel, FSDP2 adds set_reduce_scatter_max_input_buffers [2] to mitigate a backward-pass bottleneck: the compute stream stalls while it waits for a reduce-scatter to finish so that the reduce-scatter's input buffer can be reused, an idle period reported at tens of milliseconds per step. Dynamo now boxes resume-frame values when a del can run [3], so intermediates that user code deletes are no longer kept alive longer in compiled graphs than in eager execution.

On the ROCm side, test_upsamplingNearest2d_launch_rocm was removed because ROCm reduced its maximum grid size, leaving the test unable to run as written [4]. DTensor now uses explicit hints when sharding tensors with unbacked (data-dependent) sizes [5] instead of relying on conservative fallback behavior. In ExecuTorch, the MLX backend gained Q6K kernels for Gemma4 GGUF inference [6], the NXP backend moved to eIQ Neutron SDK 3.1.2 [7], and shared device test utilities were extracted to reduce redundancy [8].
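To see why non-aligned boundaries are awkward, here is a minimal pure-Python sketch (not Helion or Pallas code) that splits a packed jagged buffer into tiles from a runtime offsets array. The helper names jagged_tiles and is_block_aligned, and the layout where sequence i spans offsets[i] to offsets[i + 1], are illustrative assumptions. It also assumes block-indexed addressing, where a tile's start is its block index times the block size, so only multiples of the block size can be named directly.

```python
from typing import List, Tuple


def jagged_tiles(offsets: List[int], block: int) -> List[Tuple[int, int, int]]:
    """Split a packed jagged buffer into (seq, start, end) tiles.

    Hypothetical helper: sequence i occupies elements offsets[i]..offsets[i+1].
    Each tile covers at most `block` elements and never crosses a sequence
    boundary, so tile starts follow the runtime offsets, not multiples of block.
    """
    if block <= 0:
        raise ValueError("block must be positive")
    tiles = []
    for seq, (lo, hi) in enumerate(zip(offsets[:-1], offsets[1:])):
        # Empty sequences (lo == hi) produce no tiles; the last tile may be partial.
        for start in range(lo, hi, block):
            tiles.append((seq, start, min(start + block, hi)))
    return tiles


def is_block_aligned(start: int, block: int) -> bool:
    """Block-indexed addressing can only name starts that are multiples of block."""
    return start % block == 0


if __name__ == "__main__":
    offsets = [0, 5, 12, 12, 20]  # four sequences; the third is empty
    for seq, start, end in jagged_tiles(offsets, block=4):
        status = "aligned" if is_block_aligned(start, 4) else "unaligned"
        print(seq, start, end, status)
```

With these offsets, the second sequence's tiles begin at elements 5 and 9, which fall between 4-element block boundaries. That is the non-aligned case the paragraph above describes; a kernel has to address such tiles by element offset (or mask a padded block) rather than by block index alone.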
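The buffer-reuse stall that set_reduce_scatter_max_input_buffers targets [2] can be illustrated with a toy timing model. This is a sketch under stated assumptions, not FSDP2 code: simulate_backward and its parameters are hypothetical, and the model assumes the setting caps how many reduce-scatter input buffers may be in flight before the compute stream has to wait. The real method's signature and exact semantics are not described in this update.

```python
from typing import List, Tuple


def simulate_backward(
    num_layers: int, compute_ms: float, rs_ms: float, max_buffers: int
) -> Tuple[float, float]:
    """Toy timing model of backward compute overlapped with reduce-scatter.

    Hypothetical assumptions: each layer's gradient compute needs a free
    reduce-scatter input buffer before it starts; a buffer is released only
    when that layer's reduce-scatter finishes; reduce-scatters run one at a
    time on a separate communication stream.

    Returns (time compute finishes, total compute-stream stall) in ms.
    """
    if max_buffers < 1:
        raise ValueError("max_buffers must be at least 1")
    compute_free = 0.0  # when the compute stream can start the next layer
    comm_free = 0.0  # when the communication stream becomes idle
    rs_done: List[float] = []  # finish time of each layer's reduce-scatter
    stall = 0.0
    for i in range(num_layers):
        start = compute_free
        if i >= max_buffers:
            # Every buffer is in flight: wait for the one used max_buffers layers ago.
            released = rs_done[i - max_buffers]
            if released > start:
                stall += released - start
                start = released
        compute_free = start + compute_ms
        # The reduce-scatter starts once the gradient exists and the comm stream is idle.
        rs_start = max(compute_free, comm_free)
        comm_free = rs_start + rs_ms
        rs_done.append(comm_free)
    return compute_free, stall


if __name__ == "__main__":
    for n in (1, 2, 4):
        done, stall = simulate_backward(
            num_layers=4, compute_ms=2.0, rs_ms=5.0, max_buffers=n
        )
        print(f"max_buffers={n}: compute done at {done} ms, stalled {stall} ms")
```

With four layers, 2 ms of compute and 5 ms of reduce-scatter per layer, this model reports 15 ms of compute stall with one buffer, 6 ms with two, and none with four. The trade-off it captures is memory for overlap: in the model, each extra slot costs one more layer-sized buffer but lets compute run further ahead of communication.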
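The general idea behind boxing values for a resume frame [3] can be sketched in plain Python. This illustrates the technique only and is not Dynamo's implementation: Intermediate, resume_unboxed, resume_boxed, and call_resume are hypothetical names, and the sketch assumes that, without boxing, the resume function sees a value through a container its caller keeps alive. When the value travels in a mutable box instead, the resume function can clear the slot and drop the last strong reference, as del would in eager code. The liveness check relies on CPython's reference counting, which frees an object as soon as its last reference disappears.

```python
import weakref


class Intermediate:
    """Stand-in for a large intermediate value (hypothetical)."""


def resume_unboxed(args: tuple) -> bool:
    # The value arrives inside an immutable tuple that the caller still holds,
    # so `del x` only drops this frame's local name.
    x = args[0]
    ref = weakref.ref(x)
    del x
    return ref() is not None  # is the value still alive after the delete?


def resume_boxed(box: list) -> bool:
    # The value arrives in a mutable box; clearing the slot drops what is
    # now the last strong reference, so the object is freed immediately.
    ref = weakref.ref(box[0])
    box[0] = None  # emulates `del` reaching through the box
    return ref() is not None


def call_resume(boxed: bool) -> bool:
    """Create an intermediate, hand it to a resume function, report liveness."""
    value = Intermediate()
    if boxed:
        box = [value]
        del value  # the box now owns the only strong reference
        return resume_boxed(box)
    args = (value,)
    del value  # the tuple owns the only strong reference, and this frame keeps it
    return resume_unboxed(args)


if __name__ == "__main__":
    print("unboxed, alive after del:", call_resume(boxed=False))
    print("boxed, alive after del:", call_resume(boxed=True))
```

In the unboxed path the object outlives the delete because the caller's tuple still references it; in the boxed path it is released at the point of deletion, which is the eager-mode behavior the Dynamo change aims to preserve.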

Action items

→ Review FSDP2 reduce-scatter buffer tuning if reduce-scatter blocks your backward pass (pytorch/pytorch) [plan]
→ Test Dynamo resume functions in workflows that hit graph breaks (pytorch/pytorch) [monitor]
→ Pull the Helion jagged tile-range change if you use jagged attention kernels (pytorch/helion) [plan]
→ Validate ExecuTorch Q6K inference on the MLX backend (pytorch/executorch) [monitor]

References

[1] [pallas] enable emit_pipeline BlockSpecs for jagged tile ranges · pytorch/helion
[2] [FSDP2] Add set_reduce_scatter_max_input_buffers to mitigate reduce-scatter blocking backward compute (#186000) · pytorch/pytorch
[3] [dynamo] Box resume frame values when del can run (#185561) · pytorch/pytorch
[4] [ROCm] Remove test_upsamplingNearest2d_launch_rocm test as ROCm reduces max grid size (#186257) · pytorch/pytorch
[5] [DTensor] Use explicit hints for unbacked sharding (#183545) · pytorch/pytorch
[6] [MLX][Gemma4] Introduce Q6K kernels (#20004) · pytorch/executorch
[7] NXP backend: Update eIQ Neutron SDK to 3.1.2 (#19938) · pytorch/executorch
[8] Extract shared device test utilities to reduce redundancy (#20061) · pytorch/executorch

Quick answers

What shipped in PyTorch on June 8, 2026?: Helion's Pallas backend now supports emit_pipeline BlockSpecs over runtime-determined tile ranges in jagged kernels, while FSDP2 adds a buffer control to keep reduce-scatter from stalling the backward pass. In total, 73 commits and 23 pull requests landed.
Who contributed to PyTorch on June 8, 2026?: 2 developers shipped this update: ethche and Gasoonjia.
What were the notable PyTorch updates?: [pallas] enable emit_pipeline BlockSpecs for jagged tile ranges, [FSDP2] Add set_reduce_scatter_max_input_buffers to mitigate reduce-scatter blocking backward compute (#186000), and [dynamo] Box resume frame values when del can run (#185561).

