Who contributed to PyTorch on June 9, 2026?

3 developers shipped this update, including tianyu-l, Gasoonjia, and yuchengliu1.

What were the notable PyTorch updates?

[Operators] Implement remainder operator in Dynamo (#185654), [Operators] Implement true division operator in Dynamo (#185653), and [Operators] Implement floor division operator in Dynamo (#185652).

@pytorch

PyTorch and the broader machine-learning ecosystem

DYNAMO GAINS ARITHMETIC OPERATORS AS FSDP2 CUTS BACKWARD COMPUTE STALLS

By RepoJournal · Filed 06:03 UTC on June 9, 2026 · About PyTorch

3 people shipped this

tianyu-l @tianyu-l 1 cited

Gasoonjia @Gasoonjia 1 cited

yuchengliu1 @yuchengliu1 1 cited

PyTorch's JIT compiler now handles Python's core division and modulo operators natively, while FSDP2 adds buffering controls to stop reduce-scatter from blocking gradient computation.

Dynamo landed three operator implementations [1] [2] [3] that wire floor division (//), true division (/), and remainder (%) directly into torch.compile. The changes add CPython's number protocol slots to VariableTracker, so compiled code can use native Python arithmetic without falling back.

On the distributed training front, FSDP2 shipped set_reduce_scatter_max_input_buffers [4] to decouple reduce-scatter blocking from backward compute. It mitigates a latency bottleneck in which compute streams stall at every layer while waiting for buffers to be recycled (measured at 37.6 ms per step per layer in production).

ExecuTorch expanded MLX support with fused Q6_K quantized kernels [8] for Gemma 4 31B GGUF export, eliminating the slow dequant path and shrinking `.pte` size through kernel blob deduplication. Meanwhile, XNNPACK was removed from default builds [5] now that ExecuTorch is the recommended mobile inference path.

TorchTitan consolidated FSDP configuration by deprecating the llama4 folder [6] [7] and unifying MoE and dense model setup in a single distributed/fsdp.py file. The AO team added int8/fp8 quantized QKV fusion for x86 [11], fusing three GEMMs and scaled dot product attention into a single kernel pair.

Elsewhere in ExecuTorch, device tensor helpers were hardened with proper metadata preservation and error reporting [9], and test utilities moved to shared modules [10] to remove duplication. FBGEMM added deterministic seeding infrastructure [12] for legacy test reproducibility and optimized ROCm gradient accumulation with block-wise loop unrolling [13].
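For readers unfamiliar with the number protocol mentioned in the Dynamo item, the sketch below shows the Python-level view of it: the `//`, `/`, and `%` operators dispatch to `__floordiv__`, `__truediv__`, and `__mod__` (the counterparts of CPython's `nb_floor_divide`, `nb_true_divide`, and `nb_remainder` slots), with reflected variants used when the left operand declines. `TracedValue` is a hypothetical toy class invented for this illustration. It is not Dynamo's VariableTracker and does not reflect how the cited PRs are implemented.

```python
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class TracedValue:
    """Hypothetical toy tracer: wraps a number and records the expression."""

    value: float
    expr: str

    @staticmethod
    def _lift(other):
        # Accept other traced values or plain Python numbers; reject the rest.
        if isinstance(other, TracedValue):
            return other
        if isinstance(other, (int, float)):
            return TracedValue(other, repr(other))
        return None

    def _binary(self, other, symbol, fn, reflected=False):
        rhs = self._lift(other)
        if rhs is None:
            # Tells Python to try the other operand's reflected method.
            return NotImplemented
        left, right = (rhs, self) if reflected else (self, rhs)
        return TracedValue(
            fn(left.value, right.value),
            f"({left.expr} {symbol} {right.expr})",
        )

    # Forward methods: used when the traced value is the left operand.
    def __floordiv__(self, other):
        return self._binary(other, "//", operator.floordiv)

    def __truediv__(self, other):
        return self._binary(other, "/", operator.truediv)

    def __mod__(self, other):
        return self._binary(other, "%", operator.mod)

    # Reflected methods: used for expressions such as `10 / x`.
    def __rfloordiv__(self, other):
        return self._binary(other, "//", operator.floordiv, reflected=True)

    def __rtruediv__(self, other):
        return self._binary(other, "/", operator.truediv, reflected=True)

    def __rmod__(self, other):
        return self._binary(other, "%", operator.mod, reflected=True)


x = TracedValue(7, "x")
y = (x // 2) % 3   # dispatches to __floordiv__, then __mod__
z = 10 / x         # int declines, so Python calls x.__rtruediv__(10)
print(y.expr, y.value)
print(z.expr, z.value)
```

The point of the toy is only that an object which fills in these slots can take part in ordinary arithmetic expressions and observe each operation as it happens, which is the general idea behind tracing operators rather than falling back.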

Action items

→ Verify in local tests that torch.compile handles your division/modulo ops without fallback. pytorch/pytorch [plan]
→ Adopt the new set_reduce_scatter_max_input_buffers API in FSDP2 code if you run large-scale training with exposed reduce-scatter. pytorch/pytorch [monitor]
→ Update torchtitan training scripts to use the consolidated FSDP config from distributed/fsdp.py. pytorch/torchtitan [plan]
→ Test the x86 QKV fusion pass on your quantized transformers to validate int8/fp8 throughput gains. pytorch/ao [monitor]
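One way to approach the first action item is a small eager-versus-candidate comparison, sketched below under stated assumptions. The helper `mismatches`, the sample function `arithmetic_kernel`, and the `CASES` list are hypothetical names made up for this illustration, and the check assumes scalar numeric outputs. To keep the sketch runnable without PyTorch, the candidate here is a plain-Python stand-in; with PyTorch installed you would pass `torch.compile(arithmetic_kernel, fullgraph=True)` instead, since `fullgraph=True` makes torch.compile raise an error on a graph break rather than silently falling back. Whether a given input type compiles cleanly depends on your PyTorch build.

```python
import math
from typing import Callable, Iterable, List, Tuple


def arithmetic_kernel(a, b):
    """Sample function exercising floor division, true division, and remainder."""
    return (a // b) + (a / b) + (a % b)


def mismatches(reference: Callable, candidate: Callable,
               cases: Iterable[Tuple]) -> List[tuple]:
    """Return (args, expected, actual) for every case where the two disagree."""
    bad = []
    for args in cases:
        expected = reference(*args)
        actual = candidate(*args)
        if not math.isclose(expected, actual, rel_tol=1e-9, abs_tol=1e-12):
            bad.append((args, expected, actual))
    return bad


# Mixed signs matter: Python floors toward negative infinity, and the
# remainder takes the sign of the divisor.
CASES = [(7, 2), (-7, 2), (7, -2), (7.5, 2), (-7.5, 2.5)]


def stand_in_candidate(a, b):
    # Placeholder for a compiled function (an assumption of this sketch).
    return arithmetic_kernel(a, b)


def truncating_candidate(a, b):
    # Deliberately wrong: C-style truncation and fmod instead of Python
    # semantics, to show what a semantic mismatch looks like.
    return math.trunc(a / b) + (a / b) + math.fmod(a, b)


print(mismatches(arithmetic_kernel, stand_in_candidate, CASES))
for args, expected, actual in mismatches(arithmetic_kernel, truncating_candidate, CASES):
    print(args, expected, actual)
```

Including negative operands is the design choice that matters here: floor division and truncating division agree on positive inputs, so a test suite with only positive values would not catch a backend that uses the wrong rounding rule.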

References

[1] [Operators] Implement remainder operator in Dynamo (#185654) pytorch/pytorch
[2] [Operators] Implement true division operator in Dynamo (#185653) pytorch/pytorch
[3] [Operators] Implement floor division operator in Dynamo (#185652) pytorch/pytorch
[4] [FSDP2] Add set_reduce_scatter_max_input_buffers to mitigate reduce-scatter blocking backward compute (#186000) pytorch/pytorch
[5] Remove XNNPACK availability check from binary smoke test (#186662) pytorch/pytorch
[6] [BE] deprecate llama4, move apply_fsdp to common file pytorch/torchtitan
[7] [BE] deprecate llama4, move apply_fsdp to common file (#3573) pytorch/torchtitan
[8] [MLX][Gemma4] Introduce Q6K kernels (#20004) pytorch/executorch
[9] Address review feedback on device tensor helpers (#20078) pytorch/executorch
[10] Extract shared device test utilities to reduce redundancy (#20061) pytorch/executorch
[11] add quantized qkv-fusion pass for x86 pytorch/ao
[12] Add seed_all() deterministic-seed test helper (#5851) pytorch/FBGEMM
[13] Add `PROCESS_BLOCK` macro for grad accumulation loop unrolling (#5835) pytorch/FBGEMM

Quick answers

What shipped in PyTorch on June 9, 2026?: PyTorch's JIT compiler now handles Python's core division and modulo operators natively, while FSDP2 adds buffering controls to stop reduce-scatter from blocking gradient computation. In total, 86 commits and 32 pull requests landed.

CRITICAL OIDC INJECTION IN DOCS PREVIEW WORKFLOW PATCHED

PyTorch's docs-preview CI trusted fork-controlled artifacts in a context with token-write permissions, exposing the entire build pipeline to code injection.

python 66 shipped 1-min read

@pytorch 1 day ago

PYTORCH AUTOGRAD GETS 7% FASTER, AOTI FIXES SILENT FAILURES

Interned attribute names in autograd.Function shaved microseconds off the hot path while AOTI's scatter operations now properly report errors instead of silently corrupting results.

python 64 shipped 1-min read

@pytorch 4 days ago

DYNAMO REVERTS BREAKING CHANGE, EXECUTORCH CLEANS UP DEPRECATED TYPES

PyTorch reverted a Dynamo optimization that broke internal tests, while ExecuTorch is aggressively deprecating c10 shims in favor of standard library types.

python 91 shipped 1-min read

@pytorch 5 days ago

PYTORCH SHIPS BUILD FIX WHILE HELION TUNES H100 KERNELS TO DEFAULT

A critical build regression in cusparselt.cpp is now patched, while the kernel autotuner promotes its pointwise seed heuristic to production defaults on H100 and B200.

python 36 shipped 1-min read
