@pytorch

PyTorch and the broader machine-learning ecosystem

PYTORCH FIXES PRECISION DRIFT IN INDUCTOR, EXECUTORCH BACKS OUT XNNPACK SERIALIZATION

By RepoJournal · Filed 06:03 UTC on May 26, 2026 · About PyTorch

3 people shipped this

jeanschmidt @jeanschmidt 2 cited

julianchan-meta @julianchan-meta 1 cited

vacu9708 @vacu9708 1 cited

PyTorch's core team traced Inductor fusion-test failures on ARM to floating-point reduction-order drift and adjusted the test tolerances to match, while ExecuTorch is reverting a global XNNPACK serialization change that triggered latency regressions in production mobile models.

The Inductor team traced a math-equivalence failure in test_two_local_buffers_in_outer_loop_fusion to fp32 reduction-order drift between eager and compiled code [1]. The change removes stale xfails and applies explicit tolerances (atol=1e-5, rtol=2e-6) that reflect the real numerical behavior. In parallel, the CUDA group preserved internal precision for native_group_norm by deferring float16 truncation until kernel exit [2], eliminating differences between the eager and Inductor paths. On the spec side, ShapesSpec now supports args, *args, and **kwargs at the spec and dynamo source level [3], enabling more flexible shape tracing for dynamic signatures.

ExecuTorch is backing out D106123930 [4], a global XNNPACK serialization patch that degraded latency on PhoneLLM, Llama4-mini TISO, and on-device NGTTS deployments. The Arm backend expanded bf16 support to aten.index_select and aten.unfold_copy [5], both of which now flow through TOSA GATHER without dtype restrictions.

On the CI side, node_fleet overrides [6] decouple large-instance runners from shared Karpenter fleets, reducing contention. Pre-merge CI is now split into fast and slow gates [7]: the fast gate (30-60 min) is what blocks merges, in place of the full 2-3 hour battery.
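The reduction-order drift behind the tolerance change in [1] can be illustrated without PyTorch. The sketch below is not the actual test: it emulates fp32 accumulation with the standard library, sums the same hypothetical values in two orders, and compares the results with the closeness rule `|actual - expected| <= atol + rtol * |expected|`, which is the rule documented for `torch.testing.assert_close`. The helper names (`f32`, `sum_f32`, `is_close`) and the input values are assumptions made for illustration; only the atol and rtol values come from the summary above.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values) -> float:
    """Sum left to right, rounding the accumulator to fp32 after every add."""
    acc = 0.0
    for v in values:
        acc = f32(acc + f32(v))
    return acc


def is_close(actual: float, expected: float,
             atol: float = 1e-5, rtol: float = 2e-6) -> bool:
    """Closeness rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Hypothetical data: the same three numbers reduced in two different orders.
forward = sum_f32([2.0**24, 1.0, 1.0])    # each 1.0 is rounded away in fp32
reordered = sum_f32([1.0, 1.0, 2.0**24])  # the 1.0s combine first and survive

print(forward, reordered)
print("bitwise equal:", forward == reordered)
print("within atol=1e-5, rtol=2e-6:", is_close(forward, reordered))
```

The two sums differ even though they are mathematically identical, so an exact comparison fails while the tolerance check passes. In this example the relative term dominates because the sums are large; for results near zero the atol term takes over.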

Action items

→ If you maintain Inductor fusion tests with ARM targets, apply the new tolerance values [1] to your local test suite to unblock CI pytorch/pytorch [plan]
→ ExecuTorch teams running XNNPACK: monitor mobile latency metrics; the global serialization change was reverted pending investigation [4] pytorch/executorch [monitor]
→ Review the ShapesSpec PR [3] if you're building dynamo shape tracing with variadic arguments pytorch/pytorch [plan]
→ CI teams: merge queue behavior flipped; pre-merge-ok is now the sole required check [7] pytorch/ci-infra [immediate]

References

[1] Adjust tolerances in test_two_local_buffers_in_outer_loop_fusion and (#183932) pytorch/pytorch
[2] [cuda][eager] Preserve internal precision for native_group_norm (#183946) pytorch/pytorch
[3] [ShapesSpec] Support args, *args, **kwargs at the spec / dynamo source level (#184129) pytorch/pytorch
[4] Back out "Globally serialize XNNPACK execution, add logging" (#19752) ↗ pytorch/executorch
[5] Arm backend: Add bf16 support for aten.index_select and aten.unfold_copy ↗ pytorch/executorch
[6] Add node_fleet override to decouple large-instance runners from shared node fleets ↗ pytorch/ci-infra
[7] Split pre-merge CI into fast and slow gates ↗ pytorch/ci-infra

Quick answers

What shipped in PyTorch on May 26, 2026?: PyTorch's core team traced Inductor fusion-test failures on ARM to floating-point reduction-order drift and adjusted the test tolerances to match, while ExecuTorch is reverting a global XNNPACK serialization change that triggered latency regressions in production mobile models. In total, 44 commits and 11 pull requests landed.
Who contributed to PyTorch on May 26, 2026?: 3 developers shipped this update, including julianchan-meta, vacu9708, and jeanschmidt.
What were the notable PyTorch updates?: Adjust tolerances in test_two_local_buffers_in_outer_loop_fusion and (#183932), [cuda][eager] Preserve internal precision for native_group_norm (#183946), and [ShapesSpec] Support args, *args, **kwargs at the spec / dynamo source level (#184129).

CRITICAL OIDC INJECTION IN DOCS PREVIEW WORKFLOW PATCHED

PyTorch's docs-preview CI trusted fork-controlled artifacts in a context with token-write permissions, exposing the entire build pipeline to code injection.

python 66 shipped 1-min read

@pytorch 1 day ago

PYTORCH AUTOGRAD GETS 7% FASTER, AOTI FIXES SILENT FAILURES

Interned attribute names in autograd.Function shaved microseconds off the hot path while AOTI's scatter operations now properly report errors instead of silently corrupting results.

python 64 shipped 1-min read

@pytorch 4 days ago

DYNAMO REVERTS BREAKING CHANGE, EXECUTORCH CLEANS UP DEPRECATED TYPES

PyTorch reverted a Dynamo optimization that broke internal tests, while ExecuTorch is aggressively deprecating c10 shims in favor of standard library types.

python 91 shipped 1-min read

@pytorch 5 days ago

PYTORCH SHIPS BUILD FIX WHILE HELION TUNES H100 KERNELS TO DEFAULT

A critical build regression in cusparselt.cpp is now patched, while the kernel autotuner promotes its pointwise seed heuristic to production defaults on H100 and B200.

python 36 shipped 1-min read

Elsewhere on the wire

AI Agents about 10 hours ago

CLAUDE OPUS 5 LANDS ACROSS THE STACK

The newest Anthropic model is now live in langchain, Cline, and llama-index, with native support for extended reasoning and 1M context windows.

ai-agents 28 shipped 1-min read

Local LLMs about 10 hours ago

OLLAMA LANDS LAGUNA SUPPORT AND CRUSHES MEMORY LEAKS WHILE SGLANG HITS V0.5.16 WITH CONFIDENCE-DRIVEN SPECULATIVE DECODING

Ollama shipped three critical performance and reliability fixes for Metal residency and concurrent access patterns, while SGLang released 0.5.16 with a new speculative algorithm hitting 383.7 tok/s on DeepSeek-V4.