The Wire · Showcase
GEMMA4 CUTS 19 GIB TRAINING BLOAT, TRANSFORMERS REVERTS FSDP CHAOS, CANDLE SPEEDS SCALAR OPS
By RepoJournal · About Hugging Face
The Gemma4 implementation in Transformers stopped materializing a roughly 19 GiB intermediate tensor during training by replacing one-hot matmuls with embedding lookups, while Transformers also reverted an FSDP refactor after internal discussion flagged problems.
The Gemma4 implementation in Transformers now replaces its one-hot encoding plus matmul pattern in the position embeddings with two F.embedding lookups, so a roughly 19 GiB intermediate tensor is never materialized during large-batch training [1]. It is a surgical fix that matters for real production workloads. Meanwhile, Transformers reverted a significant FSDP plus DTensor refactor after internal discussion flagged problems [2], a necessary move that keeps the main branch stable while the team sorts out the architecture.

On the performance front, Candle is improving kernel dispatch for binary broadcast operations where one operand is a scalar: new Layout helpers identify scalars that masquerade as strided tensors, which avoids unnecessary indexing overhead [3]. Candle also bumped three core dependencies: rubato from 1 to 2, hf-hub from 0.4.1 to 0.5.0, and symphonia from 0.5.3 to 0.6.0 [4] [5] [6].

The infrastructure work is solid too. hf-mount now supports a JSON log format for environments that run log shippers [7], and Transformers enabled Metal Flash SDPA on Apple Silicon, with fixes for the generate and generate_batch paths [8]. The Transformers test CI was hardened with a fix for read-only cache permissions on metrics and the removal of a token [9] [10]. Finally, TRL aligned KTO training with DPO by removing the null_ref_context indirection layer [11], which simplifies how the code handles a missing reference model.
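To see why swapping a one-hot matmul for an embedding lookup saves memory, here is a minimal sketch. It is written in NumPy rather than PyTorch, with plain row indexing standing in for F.embedding, and the function names, shapes, and sizes are illustrative assumptions, not code from the Gemma4 change [1]. The point it shows is that both paths return the same rows, but only the one-hot path allocates an intermediate whose size grows with tokens × table rows.

```python
import numpy as np


def lookup_via_one_hot(indices, table):
    """One-hot pattern: build a (tokens, rows) matrix, then matmul with the table.

    Returns the gathered rows and the byte size of the one-hot intermediate.
    """
    num_rows, dim = table.shape
    flat = indices.ravel()
    # This intermediate is the memory cost: tokens x rows x bytes per element.
    one_hot = np.zeros((flat.size, num_rows), dtype=table.dtype)
    one_hot[np.arange(flat.size), flat] = 1
    out = one_hot @ table
    return out.reshape(*indices.shape, dim), one_hot.nbytes


def lookup_via_gather(indices, table):
    """Lookup pattern: index the table rows directly, with no one-hot intermediate."""
    return table[indices]


# Hypothetical sizes, chosen only to keep the demo small.
rng = np.random.default_rng(0)
table = rng.standard_normal((32, 8)).astype(np.float32)  # (positions, hidden dim)
indices = rng.integers(0, 32, size=(4, 6))                # (batch, sequence)

dense, one_hot_bytes = lookup_via_one_hot(indices, table)
gathered = lookup_via_gather(indices, table)

print("outputs match:", np.allclose(dense, gathered))
print("one-hot intermediate bytes:", one_hot_bytes)
print("output bytes:", gathered.nbytes)
```

Because the one-hot intermediate scales with the number of tokens times the number of table rows, it can dwarf the output at large batch sizes, while the direct lookup only ever allocates the output itself.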
Action items
- → Review the Transformers main branch carefully: the FSDP refactor was reverted, so coordinate with your team before relying on distributed training changes (huggingface/transformers) [immediate]
- → Pull the Gemma4 optimization if you run large-batch training: it avoids materializing a roughly 19 GiB intermediate tensor (huggingface/transformers) [plan]
- → Update Candle dependencies (rubato, hf-hub, symphonia) at the next minor version bump (huggingface/candle) [plan]
- → Test Metal Flash SDPA on Apple Silicon if you support MPS inference (huggingface/transformers) [monitor]
References
- [1] [Gemma4] Replace one-hot matmul with F.embedding in position embeddings (#46176) huggingface/transformers
- [2] [`Revert`] FSDP+Dtensor refactor related changes huggingface/transformers
- [3] Binary broadcast scalar support huggingface/candle
- [4] chore(deps): update rubato requirement from 1 to 2 huggingface/candle
- [5] chore(deps): update hf-hub requirement from 0.4.1 to 0.5.0 huggingface/candle
- [6] chore(deps): update symphonia requirement from 0.5.3 to 0.6.0 huggingface/candle
- [7] feat: add json log format huggingface/hf-mount
- [8] Enable kernels-community/metal-flash-sdpa on MPS (#45974) huggingface/transformers
- [9] Fix cache read-only permission for metrics (#19) huggingface/transformers-test-ci
- [10] Remove token huggingface/transformers-test-ci
- [11] Align KTO with DPO: Remove null_ref_context huggingface/trl
FAQ
- What changed in Hugging Face on May 29, 2026?
- The Gemma4 implementation in Transformers stopped materializing a roughly 19 GiB intermediate tensor during training by replacing one-hot matmuls with embedding lookups, and Transformers reverted an FSDP refactor after internal discussion flagged problems.
- What should Hugging Face teams do about it?
- Review the Transformers main branch carefully, since the FSDP refactor was reverted, and coordinate with your team before relying on distributed training changes; pull the Gemma4 optimization if you run large-batch training, because it avoids materializing a roughly 19 GiB intermediate tensor; and update Candle dependencies (rubato, hf-hub, symphonia) at the next minor version bump.
- Which Hugging Face repositories shipped on May 29, 2026?
- huggingface/transformers, huggingface/candle, huggingface/hf-mount, huggingface/transformers-test-ci, huggingface/trl