Who contributed to PyTorch on May 19, 2026?

3 developers shipped this update, including felipemello1, fegin, and Xia-Weiwen.

What were the notable PyTorch updates?

[rl] Add TITO generator and gen metrics, [graph_trainer] Fix H100 CI failure from DeepEP compilation break (#3390), and [graph_trainer] Defer cudagraph compatibility check to pass execution… (#3355).

@pytorch

PyTorch and the broader machine-learning ecosystem


TORCHTITAN SHIPS TOKEN-IN-TOKEN-OUT GENERATOR FOR RL, FIXES CRITICAL CI BREAKS ACROSS PLATFORMS

By RepoJournal · Filed 06:04 UTC on May 19, 2026

3 people shipped this: felipemello1 (@felipemello1), fegin (@fegin), and Xia-Weiwen (@Xia-Weiwen), 1 cited each.

TorchTitan's RL training pipeline now encodes prompts once and passes tokenized inputs directly to the generator, eliminating retokenization bugs that plagued distributed training.

The TITO (token-in/token-out) generator change [1] is the standard approach in production RL systems and gives a cleaner separation between prompt encoding and generation logic. It also wires generation metrics to wandb automatically, giving you visibility into sampling behavior without manual instrumentation.

Meanwhile, TorchTitan fixed two critical CI failures that were blocking H100 and graph-compiled workloads. The DeepEP ABI break against PyTorch nightly [2] is now non-fatal, so the remaining tests continue to run. Cudagraph compatibility checks now run at pass execution time instead of eagerly rejecting flex_attention kernels that regional_inductor will compile away [3].

Full DTensor mode for Llama3 landed [4] with declarative CP handling via LocalMapSpec instead of hooks, giving you a cleaner path to multi-dimensional SPMD meshes.

On the compiler side, PyTorch's Inductor test infrastructure is now roughly 4x faster on collection and 1.7x faster on execution [5] thanks to ISA subprocess caching. TorchAO shipped multi-ISA portable X86 kernels with runtime dispatch [6], so one build works across AVX512, AVX10.2, and scalar targets without rebuilding per machine. PyTorch core also now preserves pin_memory metadata in Inductor constructors [7], fixing cases where pinned allocations requested through torch.tensor and torch.rand were silently lowered away.
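To make the retokenization problem behind the TITO change [1] concrete, here is a minimal sketch in plain Python. It is not TorchTitan's generator or tokenizer API: the three-piece vocabulary and the `encode` and `decode` helpers are hypothetical, chosen only so that decoding sampled tokens and encoding the text again yields different ids.

```python
# Toy illustration only: this vocabulary and tokenizer are hypothetical
# and are not TorchTitan's tokenizer or generator API.
VOCAB = {"a": 0, "b": 1, "ab": 2}
ID_TO_PIECE = {i: p for p, i in VOCAB.items()}


def encode(text: str) -> list[int]:
    """Greedy longest-match tokenization over the toy vocabulary."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate piece first.
        for length in (2, 1):
            piece = text[pos:pos + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
    return ids


def decode(ids: list[int]) -> str:
    """Concatenate the pieces for the given ids."""
    return "".join(ID_TO_PIECE[i] for i in ids)


# Suppose the sampler emitted "a" then "b" as two separate tokens.
sampled_ids = [VOCAB["a"], VOCAB["b"]]

# Text-in/text-out: the trainer only sees the decoded string and re-encodes it.
retokenized_ids = encode(decode(sampled_ids))

# Token-in/token-out: the trainer receives the sampled ids unchanged.
tito_ids = list(sampled_ids)

print(sampled_ids, retokenized_ids, tito_ids)  # [0, 1] [2] [0, 1]
```

Both id sequences decode to the same string, yet they differ as token sequences. In an RL loop that plausibly matters because per-token quantities such as log-probabilities belong to the ids that were actually sampled, which is why passing ids through unchanged avoids the mismatch.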

Action items

→ Merge the TITO generator change into your RL training branch before your next experiment run · pytorch/torchtitan [plan]
→ If you run H100 CI, pull the fix that makes the DeepEP compilation failure non-fatal, so the remaining tests are unblocked · pytorch/torchtitan [immediate]
→ Pull the Inductor subprocess caching optimization to speed up your local test suites · pytorch/pytorch [plan]
→ If you ship X86 wheels, upgrade TorchAO to pick up the multi-ISA portable kernels · pytorch/ao [plan]

References

[1] [rl] Add TITO generator and gen metrics. pytorch/torchtitan
[2] [graph_trainer] Fix H100 CI failure from DeepEP compilation break (#3390). pytorch/torchtitan
[3] [graph_trainer] Defer cudagraph compatibility check to pass execution… (#3355). pytorch/torchtitan
[4] [Full DTensor] Config-based Full DTensor for Llama3. pytorch/torchtitan
[5] Speed up inductor test infrastructure (~4x collection, ~1.7x execution) (#181617). pytorch/pytorch
[6] [X86] multi-ISA portable kernel compilation and runtime dispatch. pytorch/ao
[7] [inductor] Preserve pin_memory for constructors (#183977). pytorch/pytorch

Quick answers

What shipped in PyTorch on May 19, 2026?: TorchTitan's RL training pipeline now encodes prompts once and passes tokenized inputs directly to the generator, eliminating retokenization bugs that plagued distributed training. In total, 96 commits and 46 pull requests landed.

