The Wire · Showcase
TORCHTITAN SHIPS TOKEN-IN-TOKEN-OUT GENERATOR FOR RL, FIXES CRITICAL CI BREAKS ACROSS PLATFORMS
By RepoJournal
TorchTitan's RL training pipeline now encodes prompts once and passes tokenized inputs directly to the generator, eliminating retokenization bugs that plagued distributed training.
The TITO (token-in/token-out) generator change [1] follows an approach that is common in production RL systems and gives a cleaner separation between prompt encoding and generation logic. It also wires generation metrics into wandb automatically, giving you visibility into sampling behavior without manual instrumentation.
Meanwhile, TorchTitan fixed two critical CI failures that were blocking H100 and graph-compiled workloads. The DeepEP ABI break against PyTorch nightly [2] is now non-fatal, so the remaining tests continue to run. Cudagraph compatibility checks now run at pass execution time instead of eagerly rejecting flex_attention kernels that regional_inductor will compile away [3].
Full DTensor mode for Llama3 landed [4]. It handles context parallelism (CP) declaratively through LocalMapSpec instead of hooks, giving you a cleaner path to multi-dimensional SPMD meshes.
On the compiler side, PyTorch's Inductor test infrastructure is now roughly 4x faster on collection and roughly 1.7x faster on execution [5] thanks to ISA subprocess caching. TorchAO shipped multi-ISA portable X86 kernels [6], so a single build works across AVX512, AVX10.2, and scalar targets without rebuilding per machine. PyTorch core also preserved pin_memory metadata in Inductor constructors [7], fixing torch.tensor and torch.rand cases where the pinned allocation was silently lost during lowering.
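To illustrate why token-in/token-out matters, here is a minimal sketch with a toy tokenizer. Nothing here is TorchTitan's API: `VOCAB`, `encode`, `decode`, `toy_generate`, `Rollout`, and both rollout functions are hypothetical names invented for this example. It shows how decoding sampled tokens to text and encoding them again can yield different token IDs than the ones the policy actually sampled, and how keeping the IDs end to end avoids that.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical toy vocabulary; "ab" is a merged token, as in BPE-style tokenizers.
VOCAB = {"a": 0, "b": 1, "ab": 2}
INVERSE_VOCAB = {token_id: piece for piece, token_id in VOCAB.items()}


def encode(text: str) -> List[int]:
    """Greedy longest-match encoding over the toy vocabulary."""
    ids: List[int] = []
    i = 0
    while i < len(text):
        for length in (2, 1):
            piece = text[i:i + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i += len(piece)
                break
        else:
            raise ValueError(f"cannot encode {text[i]!r}")
    return ids


def decode(ids: List[int]) -> str:
    """Concatenate the text pieces for each token ID."""
    return "".join(INVERSE_VOCAB[token_id] for token_id in ids)


def toy_generate(prompt_ids: List[int]) -> List[int]:
    """Stand-in for a sampler: pretend the policy emitted 'a' then 'b' as two tokens."""
    return [VOCAB["a"], VOCAB["b"]]


@dataclass
class Rollout:
    prompt_ids: List[int]
    completion_ids: List[int]


def rollout_tito(prompt: str) -> Rollout:
    """Token-in/token-out: encode the prompt once and keep the sampled IDs untouched."""
    prompt_ids = encode(prompt)
    return Rollout(prompt_ids, toy_generate(prompt_ids))


def rollout_via_text(prompt: str) -> Rollout:
    """Text round trip: decode the completion, then encode it again for training."""
    prompt_ids = encode(prompt)
    completion_text = decode(toy_generate(prompt_ids))
    return Rollout(prompt_ids, encode(completion_text))


if __name__ == "__main__":
    print(rollout_tito("ab").completion_ids)      # IDs exactly as sampled
    print(rollout_via_text("ab").completion_ids)  # IDs after the text round trip
```

In the round-trip variant, the two sampled tokens merge into the single "ab" token, so any per-token quantity recorded at sampling time (log-probabilities, for instance) no longer lines up with the sequence the trainer sees.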
Action items
- → Merge the TITO generator change into your RL training branch before your next experiment run · pytorch/torchtitan [plan]
- → If you run H100 CI, treat a failed DeepEP installation as non-fatal so the remaining tests can run · pytorch/torchtitan [immediate]
- → Pull the Inductor subprocess caching optimization to speed up your local test suites · pytorch/pytorch [plan]
- → If you ship X86 wheels, upgrade TorchAO to pick up the multi-ISA portable kernels · pytorch/ao [plan]
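The Inductor speedup is attributed to ISA subprocess caching. As a rough illustration of the general idea only, and not the code from [5], the sketch below memoizes the result of a subprocess-based probe so that repeated queries for the same ISA launch the subprocess once. The function name `isa_is_supported` and the trivial probe command are assumptions made for this example.

```python
import functools
import subprocess
import sys


@functools.lru_cache(maxsize=None)
def isa_is_supported(isa_name: str) -> bool:
    """Hypothetical probe: run a subprocess once per ISA name and cache the verdict.

    The probe below is a stand-in that always exits with status 0; a real check
    would compile or run a small ISA-specific snippet. Here isa_name only keys
    the cache.
    """
    completed = subprocess.run(
        [sys.executable, "-c", "raise SystemExit(0)"],
        capture_output=True,
        check=False,
    )
    return completed.returncode == 0


if __name__ == "__main__":
    isa_is_supported("avx512")  # launches the probe subprocess
    isa_is_supported("avx512")  # served from the cache, no new subprocess
    print(isa_is_supported.cache_info())
```

When many test modules ask the same question at collection time, caching like this trades a small amount of memory for skipping repeated process launches.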
References
- [1] [rl] Add TITO generator and gen metrics · pytorch/torchtitan
- [2] [graph_trainer] Fix H100 CI failure from DeepEP compilation break (#3390) · pytorch/torchtitan
- [3] [graph_trainer] Defer cudagraph compatibility check to pass execution… (#3355) · pytorch/torchtitan
- [4] [Full DTensor] Config-based Full DTensor for Llama3 · pytorch/torchtitan
- [5] Speed up inductor test infrastructure (~4x collection, ~1.7x execution) (#181617) · pytorch/pytorch
- [6] [X86] multi-ISA portable kernel compilation and runtime dispatch · pytorch/ao
- [7] [inductor] Preserve pin_memory for constructors (#183977) · pytorch/pytorch
FAQ
- What changed in PyTorch on May 19, 2026?
- TorchTitan's RL training pipeline now encodes prompts once and passes tokenized inputs directly to the generator, eliminating retokenization bugs that plagued distributed training.
- What should PyTorch teams do about it?
- Merge the TITO generator change into your RL training branch before your next experiment run; if you run H100 CI, treat a failed DeepEP installation as non-fatal so the remaining tests can run; pull the Inductor subprocess caching optimization to speed up your local test suites.
- Which PyTorch repositories shipped changes on May 19, 2026?
- pytorch/torchtitan, pytorch/pytorch, pytorch/ao