The Wire · Showcase
INDUCTOR DOUBLES DOWN ON PROFILING; C10D BACKENDS GET UNIFIED API
By RepoJournal · Filed · About PyTorch
PyTorch shipped major Inductor profiler improvements and a c10d collective communication overhaul overnight, with changes ranging from kernel provenance tracking to a fix for a 390K-entry kernel cache search.
The Inductor profiler now attaches kernel provenance metadata to Chrome traces [1], threading Triton and extern-kernel context through the debug paths so you can see which source lines generated which CUDA kernels in your timeline. That lands alongside a sweeping c10d backend refactor [2] that introduces the torchcomms `_single` collective names on the C++ API while keeping the legacy names backward compatible, unifying how distributed training backends declare their capabilities.
On the compiler front, Inductor's NVGEMM handler fixed a costly performance bug [3]: choice enumeration iterated over a kernel cache of roughly 390K candidates twice per GEMM operation, and a new single-pass partition function removes the redundant second pass. The same area shed dead code [4], where a duplicate `nvgemm_max_profiling` config was shadowing the canonical definition. On the autograd side, `linear_cross_entropy` no longer materializes zero-filled gradients [5] for unused chunked-op outputs, cutting unnecessary work in the backward pass.
TorchTitan's MoE layer got leaner: production models all shipped with `score_before_experts=False`, so [6] removed the dead `True` branch entirely. The repo also extended the Qwen3 MoE deterministic loss test [7] with a TP+EP config (FSDP4+TP2+EP4) to catch silent numerical corruption, and landed MinimalAsyncEP [8], a cudagraphable expert dispatcher with minimal kernel overhead and a ping-pong buffer pattern that avoids symm-mem copies during recompute.
FBGEMM migrated the TBE backward template [9] to the new `cap_grid_dim_x` threshold-guard API, cleaned up the legacy `get_max_thread_blocks` helpers, and replaced deprecated PyTorch calls [10] across the fbgemm_gpu surface.
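To make the NVGEMM fix [3] concrete, here is a minimal sketch of the general single-pass partition idea: instead of filtering a large candidate list once for the matches and again for the rest, each candidate is tested exactly once. The `partition` helper, the `kernels` list, and the `efficient` field are hypothetical stand-ins for illustration; this is not Inductor's actual code or data layout.

```python
from __future__ import annotations

from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def partition(
    items: Iterable[T], pred: Callable[[T], bool]
) -> tuple[list[T], list[T]]:
    """Split items into (matching, rest), calling pred once per item."""
    matching: list[T] = []
    rest: list[T] = []
    for item in items:
        # One predicate call decides which bucket the item lands in.
        (matching if pred(item) else rest).append(item)
    return matching, rest


# Hypothetical stand-in for a kernel cache; the real one has ~390K entries.
kernels = [{"name": f"gemm_{i}", "efficient": i % 3 == 0} for i in range(12)]

# The two-pass alternative would scan `kernels` twice:
#   fast = [k for k in kernels if k["efficient"]]
#   slow = [k for k in kernels if not k["efficient"]]
fast, slow = partition(kernels, lambda k: k["efficient"])
```

With a cache of several hundred thousand entries, halving the number of scans per GEMM adds up quickly when many GEMMs are compiled.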
Action items
- → Review Inductor profiler changes if you instrument CUDA kernels; kernel provenance in traces is now the source of truth pytorch/pytorch [plan]
- → Verify c10d backend implementations match new `_single` method signatures before next distributed training run pytorch/pytorch [plan]
- → Rebuild with latest FBGEMM if you hit deprecated PyTorch API warnings in embedding gradients pytorch/FBGEMM [monitor]
- → Run TorchTitan MoE loss tests on your cluster; deterministic loss validation now catches EP breakage pytorch/torchtitan [plan]
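For readers unfamiliar with deterministic loss validation, the sketch below shows one common shape of such a check: run the same training loop twice from the same seed and require the loss curves to match exactly. It uses a toy pure-Python model, and the names `toy_training_losses` and `losses_match` are hypothetical; TorchTitan's actual test harness and its comparison criteria are not shown here and may differ.

```python
from __future__ import annotations

import random


def toy_training_losses(seed: int, steps: int = 5) -> list[float]:
    """Fit y = 3x with a single weight via SGD and return the per-step losses."""
    rng = random.Random(seed)  # all randomness flows from the seed
    w = 0.0
    losses: list[float] = []
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        y = 3.0 * x
        pred = w * x
        loss = (pred - y) ** 2
        grad = 2.0 * (pred - y) * x  # d(loss)/dw
        w -= 0.1 * grad
        losses.append(loss)
    return losses


def losses_match(run, seed: int) -> bool:
    """Return True if two runs with the same seed give identical losses."""
    # Exact equality, not a tolerance: the goal is to flag any drift at all.
    return run(seed) == run(seed)


assert losses_match(toy_training_losses, seed=0)
```

Exact comparison is the point of the design: a tolerance-based check can hide the small, silent numerical differences that a parallelism bug tends to introduce.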
References
- [1] Add Inductor profiler timeline provenance (#186230) pytorch/pytorch
- [2] Add _single c10d::Backend methods and migrate backends to them (#187140) pytorch/pytorch
- [3] [Inductor][NVGEMM] Avoid double iteration over kernel cache during choice enumeration (#185966) pytorch/pytorch
- [4] [Inductor][NVGEMM] Remove duplicate nvgemm_max_profiling config (#185965) pytorch/pytorch
- [5] linear_cross_entropy: do not materialize gradients for unused chunked-op outputs (#187219) pytorch/pytorch
- [6] [MoE] Remove unused score_before_experts dispatcher flag pytorch/torchtitan
- [7] [MoE] Add TP+EP config to Qwen3 MoE deterministic loss test (FSDP4+TP2+EP4) pytorch/torchtitan
- [8] Add MinimalAsyncEP pytorch/torchtitan
- [9] Migrate TBE backward template to cap_grid_dim_x; clean up legacy get_max_thread_blocks helpers (#5853) pytorch/FBGEMM
- [10] Replace deprecated PyTorch APIs in fbgemm_gpu (#5896) pytorch/FBGEMM
FAQ
- What changed in PyTorch on June 14, 2026?
- PyTorch shipped major Inductor profiler improvements and a c10d collective communication overhaul overnight, with changes ranging from kernel provenance tracking to a fix for a 390K-entry kernel cache search.
- What should PyTorch teams do about it?
- Review Inductor profiler changes if you instrument CUDA kernels; kernel provenance in traces is now the source of truth. Verify c10d backend implementations match the new `_single` method signatures before the next distributed training run. Rebuild with the latest FBGEMM if you hit deprecated PyTorch API warnings in embedding gradients.
- Which PyTorch repositories shipped on June 14, 2026?
- pytorch/pytorch, pytorch/torchtitan, pytorch/FBGEMM