PYTORCH FIXES PRECISION DRIFT IN INDUCTOR, EXECUTORCH BACKS OUT XNNPACK SERIALIZATION
By RepoJournal · About PyTorch
PyTorch's core team debugged and resolved a floating-point reduction-order drift in Inductor that was breaking tensor fusion tests on ARM, while ExecuTorch is backing out a global XNNPACK serialization change that triggered latency regressions in production mobile models.
The Inductor team traced a math-equivalence failure in test_two_local_buffers_in_outer_loop_fusion to fp32 reduction-order drift between eager and compiled code [1], then removed stale xfails and applied explicit tolerances (atol=1e-5, rtol=2e-6) that reflect the real numerical behavior. In parallel, the CUDA group preserved internal precision for native_group_norm by deferring float16 truncation until kernel exit [2], eliminating differences between the eager and Inductor paths. On the spec side, ShapesSpec now supports variadic *args and **kwargs at the dynamo source level [3], enabling more flexible shape tracing for dynamic signatures. ExecuTorch is backing out D106123930 [4], a global XNNPACK serialization patch that degraded latency on PhoneLLM, Llama4-mini TISO, and on-device NGTTS deployments. The Arm backend expanded bf16 support to aten.index_select and aten.unfold_copy [5], both of which now flow through TOSA GATHER without dtype restrictions. On the CI side, node_fleet overrides [6] decouple large-instance runners from shared Karpenter fleets to reduce contention, and a new fast pre-merge gate [7] (30-60 min) is now the check that blocks merges, taking over that role from the full 2-3 hour battery.
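The group-norm change described above rests on a general numerical point: rounding to float16 after every intermediate step discards information that a single rounding at the end keeps. The following is a minimal Python sketch of that effect on a plain mean reduction, not the actual CUDA kernel. The helper names are hypothetical, and Python's 64-bit float stands in for the kernel's wider internal accumulator.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def mean_truncating_each_step(values):
    """Hypothetical 'early truncation' path: every partial sum is cut to fp16."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + v)
    return to_fp16(acc / len(values))


def mean_deferred_truncation(values):
    """Hypothetical 'deferred truncation' path: accumulate wide, round once at exit."""
    acc = 0.0
    for v in values:
        acc += v  # wider accumulator (fp64 here, standing in for fp32)
    return to_fp16(acc / len(values))


# All inputs are exactly representable in fp16.
values = [2048.0, 0.5, 0.5, 0.5, 0.5]

# Near 2048 the fp16 spacing is 2, so each +0.5 is rounded away in the early path.
print(mean_truncating_each_step(values))  # 409.5
print(mean_deferred_truncation(values))   # 410.0
```

Both paths return a float16 value, but only the deferred path matches the exact mean of the inputs (2050 / 5 = 410).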
Action items
- → If you maintain Inductor fusion tests with ARM targets, apply the new tolerance values [1] to your local test suite to unblock CI · pytorch/pytorch [plan]
- → ExecuTorch teams running XNNPACK: monitor mobile latency metrics; the global serialization change was backed out pending investigation [4] · pytorch/executorch [monitor]
- → Review the ShapesSpec PR [3] if you're building dynamo shape tracing with variadic arguments · pytorch/pytorch [plan]
- → CI teams: merge queue behavior flipped; pre-merge-ok is now the sole required check [7] · pytorch/ci-infra [immediate]
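For the tolerance action item above, the underlying effect can be reproduced without PyTorch. The sketch below emulates fp32 accumulation in pure Python and sums the same three values in two orders. It illustrates reduction-order drift in general and is not taken from the Inductor test itself. The helper names are hypothetical, and the closeness rule is the common |actual - expected| <= atol + rtol * |expected| form, which is also how torch.testing.assert_close documents its check.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def fp32_sum(values):
    """Accumulate in the given order, rounding to fp32 after every add."""
    acc = 0.0
    for v in values:
        acc = to_fp32(acc + v)
    return acc


def is_close(actual, expected, atol=1e-5, rtol=2e-6):
    """Closeness rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# 2**-24 is half the fp32 spacing just above 1.0, so adding it to 1.0 is rounded away.
values = [1.0, 2.0**-24, 2.0**-24]

sum_a = fp32_sum(values)                  # large value first: both small terms vanish
sum_b = fp32_sum(list(reversed(values)))  # small terms first: they combine and survive

print(sum_a == sum_b)         # False: bitwise comparison fails
print(sum_b - sum_a)          # 1.1920928955078125e-07, i.e. 2**-23
print(is_close(sum_b, sum_a)) # True under atol=1e-5, rtol=2e-6
```

Neither order is wrong; the two results differ by one fp32 unit in the last place near 1.0, which is why an exact-equality assertion fails while an explicit tolerance passes.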
References
- [1] Adjust tolerances in test_two_local_buffers_in_outer_loop_fusion and (#183932) · pytorch/pytorch
- [2] [cuda][eager] Preserve internal precision for native_group_norm (#183946) · pytorch/pytorch
- [3] [ShapesSpec] Support args, *args, **kwargs at the spec / dynamo source level (#184129) · pytorch/pytorch
- [4] Back out "Globally serialize XNNPACK execution, add logging" (#19752) · pytorch/executorch
- [5] Arm backend: Add bf16 support for aten.index_select and aten.unfold_copy · pytorch/executorch
- [6] Add node_fleet override to decouple large-instance runners from shared node fleets · pytorch/ci-infra
- [7] Split pre-merge CI into fast and slow gates · pytorch/ci-infra
FAQ
- What changed in PyTorch on May 26, 2026?
- PyTorch's core team debugged and resolved a floating-point reduction-order drift in Inductor that was breaking tensor fusion tests on ARM, while ExecuTorch is backing out a global XNNPACK serialization change that triggered latency regressions in production mobile models.
- What should PyTorch teams do about it?
- If you maintain Inductor fusion tests with ARM targets, apply the new tolerance values [1] to your local test suite to unblock CI • ExecuTorch teams running XNNPACK: monitor mobile latency metrics; the global serialization change was backed out pending investigation [4] • Review the ShapesSpec PR [3] if you're building dynamo shape tracing with variadic arguments
- Which PyTorch repositories shipped on May 26, 2026?
- pytorch/pytorch, pytorch/executorch, pytorch/ci-infra