HALIDE FUSION BUG CRUSHED, ROCM DEADLOCK FIXES ACROSS THE STACK
By RepoJournal · Filed · About PyTorch
PyTorch's Halide backend had a critical inplace mutation fusion bug that could silently corrupt results on aliased reads, and it's not alone: ROCm deadlocks are being systematically hunted across FBGEMM and core compute kernels.
The Halide inplace mutation fusion fix [1] addresses a subtle but dangerous bug: TailStrategy::ShiftInwards could overcompute output tiles and reread stale buffer values during vertically fused operations with transposed destinations. This is production-critical for anyone using Halide autoscheduling on mutation-heavy workloads.
Meanwhile, the ROCm team is methodically closing deadlock vectors. In FBGEMM's compute_amax_and_quantize_kernel [2], some threads hit early returns before barrier synchronization, a pattern that deadlocks HIP execution without obvious error messages. The same audit caught grid overflow issues in direct_mapped_lxu_cache_lookup_kernel [3] and applied canonical caps to prevent silent launch failures.
On the ExecuTorch side, the Arm backend is shipping incremental gains: TOSA dialect ARGMAX support [4], dim mapping helpers for shape-changing operators [5], and adaptive pooling decomposition [6].
TorchTitan shipped a checkpointer cleanup [7] that removes its dependency on PyTorch distributed state_dict APIs, a change aimed at checkpoint portability. It also reverted a deterministic topk change [8] that broke internal numerics; the change will reland upstream once semantics stabilize. The MoE sequence parallelism bug fix [9] corrects token index placement when tensor, expert, and sequence parallelism run together, preventing routing misplacement in large multi-axis parallel setups.
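To make the ShiftInwards hazard concrete, here is a minimal Python sketch of a simplified 1-D analogue. It is not the Inductor Halide codegen or the actual fix in [1]; the function names are hypothetical. It assumes only the general behavior of a shift-inwards tail policy: when the extent is not a multiple of the tile size, the last tile is moved back so that it overlaps its neighbour and some points are computed twice. That is harmless for a pure function, but an in-place update rereads values it has already overwritten.

```python
def tile_starts(extent, tile):
    """Tile origins under a shift-inwards tail policy (hypothetical helper).

    The last tile is moved back so it ends exactly at `extent`,
    which makes it overlap the previous tile.
    """
    if tile > extent:
        raise ValueError("tile must not exceed extent")
    starts = list(range(0, extent - tile + 1, tile))
    if extent % tile:
        starts.append(extent - tile)  # shifted tail tile
    return starts


def increment_in_place(buf, tile):
    """Tiled in-place update buf[i] += 1.

    Points covered by two tiles read the value the first tile already
    wrote, so they are incremented twice.
    """
    for start in tile_starts(len(buf), tile):
        for i in range(start, start + tile):
            buf[i] = buf[i] + 1
    return buf


def increment_from_snapshot(buf, tile):
    """Same tiling, but every read comes from an unmodified copy of the input."""
    src = list(buf)
    for start in tile_starts(len(buf), tile):
        for i in range(start, start + tile):
            buf[i] = src[i] + 1
    return buf


print(tile_starts(10, 4))
print(increment_in_place([0] * 10, 4))
print(increment_from_snapshot([0] * 10, 4))
```

With an extent of 10 and a tile of 4, the tail tile starts at index 6, so indices 6 and 7 are visited twice. The in-place version double-counts them, while the snapshot version does not. Reading from a copy is only one way to avoid the aliasing; the upstream fix may take a different approach.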
Action items
- → Review Halide fusion behavior in production pipelines using transposed mutations (pytorch/pytorch) [plan]
- → If running ROCm with FP4 quantization or split embeddings, pull the FBGEMM deadlock fixes (pytorch/FBGEMM) [immediate]
- → TorchTitan users: validate checkpoint compatibility after the state_dict API decoupling (pytorch/torchtitan) [plan]
- → Monitor the deterministic topk reland for MoE correctness (pytorch/pytorch) [monitor]
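For teams auditing their own kernels for the early-return pattern behind the ROCm deadlock item, the hazard can be illustrated with a CPU analogy. This is a sketch under stated assumptions, not the FBGEMM kernel or its fix: Python threads stand in for GPU threads, `threading.Barrier` stands in for `__syncthreads()`, and `run_block` is a hypothetical name. The timeout exists only so the demo terminates; a real GPU barrier has none.

```python
import threading


def run_block(num_threads, num_valid, early_return):
    """Simulate one thread block; return how many threads got past the barrier."""
    barrier = threading.Barrier(num_threads)
    passed = []
    lock = threading.Lock()

    def kernel(tid):
        in_range = tid < num_valid
        if early_return and not in_range:
            return  # buggy pattern: this thread never reaches the barrier
        if in_range:
            pass  # per-thread work would go here
        try:
            barrier.wait(timeout=0.5)  # stand-in for __syncthreads()
        except threading.BrokenBarrierError:
            return  # timed out waiting; on a GPU the block would simply hang
        with lock:
            passed.append(tid)

    threads = [threading.Thread(target=kernel, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(passed)


print(run_block(4, 3, early_return=True))   # out-of-range thread bails out early
print(run_block(4, 3, early_return=False))  # work is guarded, everyone syncs
```

In the first call, the out-of-range thread returns before the barrier, so the remaining threads wait for an arrival that never comes. In the second call, the bounds check only guards the work, every thread reaches the barrier, and the block proceeds.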
References
- [1] Fix Halide inplace mutation fusion with aliased reads (#186121) pytorch/pytorch
- [2] Fix ROCm __syncthreads deadlock in compute_amax_and_quantize_kernel pytorch/FBGEMM
- [3] Fix HIP grid overflow in direct_mapped_lxu_cache_lookup_kernel (#5882) pytorch/FBGEMM
- [4] Arm backend: Add TOSA dialect ARGMAX op pytorch/executorch
- [5] Arm backend: Add dim mapping helpers pytorch/executorch
- [6] Arm backend: Add adaptive pooling node visitors pytorch/executorch
- [7] [Checkpointer] Remove the dependencies on PyTorch distributed state_dict APIs (#3623) pytorch/torchtitan
- [8] Revert "Add deterministic topk for MoE routing" pytorch/torchtitan
- [9] [Bug] Fix MoE SP token combine indices pytorch/torchtitan
FAQ
- What changed in PyTorch on June 13, 2026?
- PyTorch's Halide backend had a critical inplace mutation fusion bug that could silently corrupt results on aliased reads, and it's not alone: ROCm deadlocks are being systematically hunted across FBGEMM and core compute kernels.
- What should PyTorch teams do about it?
- Review Halide fusion behavior in production pipelines using transposed mutations; if running ROCm with FP4 quantization or split embeddings, pull the FBGEMM deadlock fixes; TorchTitan users should validate checkpoint compatibility after the state_dict API decoupling.
- Which PyTorch repositories shipped on June 13, 2026?
- pytorch/pytorch, pytorch/FBGEMM, pytorch/executorch, pytorch/torchtitan