RepoJournal
PyTorch

@pytorch

PyTorch and the broader machine-learning ecosystem

PYTORCH FIXES PRECISION DRIFT IN INDUCTOR, EXECUTORCH BACKS OUT XNNPACK SERIALIZATION

By RepoJournal

PyTorch's core team traced a failing Inductor fusion test on ARM to floating-point reduction-order drift and resolved it with explicit tolerances, while ExecuTorch is reverting a global XNNPACK serialization change that caused latency regressions in production mobile models.

The Inductor team traced a math-equivalence failure in test_two_local_buffers_in_outer_loop_fusion to fp32 reduction-order drift between eager and compiled code [1]. The fix removes stale xfails and applies explicit tolerances (atol=1e-5, rtol=2e-6) that reflect the real numerical behavior. In parallel, the CUDA group preserved internal precision for native_group_norm by deferring float16 truncation until kernel exit [2], eliminating differences between the eager and Inductor paths. On the spec side, ShapesSpec now supports variadic *args and **kwargs at the dynamo source level [3], enabling more flexible shape tracing for dynamic signatures.

ExecuTorch is backing out D106123930 [4], a global XNNPACK serialization patch that degraded latency on PhoneLLM, Llama4-mini TISO, and on-device NGTTS deployments. The Arm backend expanded bf16 support to aten.index_select and aten.unfold_copy [5], both of which now flow through TOSA GATHER without dtype restrictions.

CI infrastructure shipped node_fleet overrides [6] that decouple large-instance runners from shared Karpenter fleets, reducing contention. A new fast pre-merge gate [7] (30-60 min) now blocks merges in place of the full 2-3 hour battery.
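To make the reduction-order point in [1] concrete, here is a minimal pure-Python sketch, not the actual test. It emulates fp32 rounding with the standard struct module and sums the same three values in two orders. The helpers f32, sum_f32, and is_close are hypothetical names introduced for illustration, and is_close assumes the usual check |actual - expected| <= atol + rtol * |expected|, the form documented for torch.testing.assert_close.

```python
import struct


def f32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values):
    """Sum left to right, rounding to fp32 after every addition."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total


def is_close(actual, expected, atol, rtol):
    """Hypothetical helper: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


tiny = 2.0 ** -24  # half an fp32 ulp at 1.0

# Large value first: each tiny addend is rounded away (ties-to-even).
left_to_right = sum_f32([1.0, tiny, tiny])   # 1.0

# Small values first: they combine to a full ulp before meeting 1.0.
small_first = sum_f32([tiny, tiny, 1.0])     # 1.0 + 2**-23

exact_match = left_to_right == small_first   # False: the orders disagree
within_tol = is_close(left_to_right, small_first, atol=1e-5, rtol=2e-6)  # True
```

The two sums differ by one fp32 ulp (about 1.2e-7), so an exact-equality comparison fails while the tolerances quoted above accept the result. That is the general reason a test comparing two reduction orders needs explicit atol and rtol values.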

References

  [1] Adjust tolerances in test_two_local_buffers_in_outer_loop_fusion and (#183932) · pytorch/pytorch
  [2] [cuda][eager] Preserve internal precision for native_group_norm (#183946) · pytorch/pytorch
  [3] [ShapesSpec] Support args, *args, **kwargs at the spec / dynamo source level (#184129) · pytorch/pytorch
  [4] Back out "Globally serialize XNNPACK execution, add logging" (#19752) · pytorch/executorch
  [5] Arm backend: Add bf16 support for aten.index_select and aten.unfold_copy · pytorch/executorch
  [6] Add node_fleet override to decouple large-instance runners from shared node fleets · pytorch/ci-infra
  [7] Split pre-merge CI into fast and slow gates · pytorch/ci-infra

FAQ

What changed in PyTorch on May 26, 2026?
PyTorch's core team traced a failing Inductor fusion test on ARM to floating-point reduction-order drift and resolved it with explicit tolerances, while ExecuTorch is reverting a global XNNPACK serialization change that caused latency regressions in production mobile models.
What should PyTorch teams do about it?
  • If you maintain Inductor fusion tests with ARM targets, apply the new tolerance values [1] to your local test suite to unblock CI.
  • ExecuTorch teams running XNNPACK: monitor mobile latency metrics; the global serialization change is being backed out pending investigation [4].
  • Review the ShapesSpec PR [3] if you're building dynamo shape tracing with variadic arguments.
Which PyTorch repositories shipped on May 26, 2026?
pytorch/pytorch, pytorch/executorch, pytorch/ci-infra
