DISTILLATION TRAINER GRADIENT ACCUMULATION BUG FIXED, LEROBOT IMAGE STATS OVERFLOW PATCHED
By RepoJournal · Filed · About Hugging Face
TRL's distillation trainers were silently miscalculating loss under gradient accumulation, and LeRobot's image statistics were overflowing to zero for valid data.
The GKDTrainer, GOLDTrainer, and DistillationTrainer accepted `num_items_in_batch` in their loss computation but never used it [1], so the JSD distillation loss was normalized by the local microbatch token count instead of the global count. Under gradient accumulation, this breaks the gradient scaling that the base Trainer in transformers expects and silently produces wrong gradients. This is a critical fix for anyone training distilled models at scale [2].
In parallel, LeRobot's image statistics computation in RunningQuantileStats promoted uint8 samples to float only *after* squaring them. The squares overflowed in uint8, so the computed variance came out negative and was clamped to zero [3], and stats.json reported `std=0` for non-constant image data [4]. Both fixes are merged.
TRL also patched the GLM-4-MoE chat template so that assistant turns are properly terminated with role markers rather than left without end-of-turn tokens [5]. Routine dependency bumps landed in trl and ml-intern [6], [7]. Chat-UI raised MiniMax-M3's max_tokens to 65536 to keep the router from truncating reasoning-heavy outputs at 2048 tokens [8].
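To make the normalization issue concrete, here is a minimal arithmetic sketch. It is not TRL's implementation: it assumes a simplified model of gradient accumulation in which each microbatch loss is divided by its own token count and the results are averaged across accumulation steps. The per-token loss values and the function names are hypothetical.

```python
# Hypothetical per-token loss values for two microbatches that belong to
# one gradient accumulation window. The numbers are invented for illustration.
MICROBATCHES = [
    [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # 6 tokens
    [8.0, 8.0],                      # 2 tokens
]


def local_normalized_loss(microbatches):
    """Divide each microbatch by its own token count, then average the
    per-microbatch means across accumulation steps (simplified model)."""
    per_microbatch_means = [sum(mb) / len(mb) for mb in microbatches]
    return sum(per_microbatch_means) / len(microbatches)


def global_normalized_loss(microbatches):
    """Divide the summed token losses by the token count of the whole
    accumulation window, which is the role num_items_in_batch plays."""
    num_items_in_batch = sum(len(mb) for mb in microbatches)
    return sum(sum(mb) for mb in microbatches) / num_items_in_batch


local_loss = local_normalized_loss(MICROBATCHES)
global_loss = global_normalized_loss(MICROBATCHES)
print(local_loss)   # 5.0: the short microbatch is over-weighted
print(global_loss)  # 3.5: every token carries equal weight
```

With equal-sized microbatches the two results agree. They diverge once token counts differ between microbatches, which is typical for variable-length sequences.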
Action items
- Update TRL if running GKDTrainer, GOLDTrainer, or DistillationTrainer with gradient accumulation (huggingface/trl) [immediate]
- Rebuild LeRobot dataset statistics if using image features with uint8 casting (huggingface/lerobot) [plan]
- Review GLM-4-MoE fine-tuning chat templates if in production (huggingface/trl) [plan]
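The overflow behind the LeRobot item can be reproduced in a few lines of NumPy. This is a sketch of the failure mode only, not LeRobot's RunningQuantileStats code: the pixel values and both function names are hypothetical, and the variance is computed as the mean of squares minus the squared mean.

```python
import numpy as np

# Hypothetical, clearly non-constant uint8 pixel values.
pixels = np.array([10, 100, 200, 250], dtype=np.uint8)


def variance_square_then_cast(x):
    """Buggy order: square in uint8 (wraps modulo 256), then cast to float."""
    mean = x.astype(np.float64).mean()
    mean_of_squares = (x * x).astype(np.float64).mean()
    # The wrapped squares make the difference negative; clamp it to zero.
    return max(float(mean_of_squares - mean**2), 0.0)


def variance_cast_then_square(x):
    """Fixed order: cast to float first, then square."""
    xf = x.astype(np.float64)
    return float((xf * xf).mean() - xf.mean() ** 2)


buggy_variance = variance_square_then_cast(pixels)
fixed_variance = variance_cast_then_square(pixels)
print(buggy_variance)  # 0.0, so std is reported as 0
print(fixed_variance)  # 8550.0
```

A stored `std` of exactly 0 for an image feature whose pixels visibly vary is therefore a reasonable signal that the statistics were computed before the fix and should be rebuilt.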
References
- [1] Normalize JSD distillation loss by num_items_in_batch for gradient accumulation. huggingface/trl
- [2] Normalize JSD distillation loss by num_items_in_batch for gradient accumulation (#6006). huggingface/trl
- [3] fix(datasets): avoid uint8 overflow in image stats (#3697). huggingface/lerobot
- [4] fix(datasets): avoid uint8 overflow in image stats. huggingface/lerobot
- [5] [fix] GLM-4-MoE template: turn-terminating token to the turn itself. huggingface/trl
- [6] Bump the actions group with 4 updates. huggingface/trl
- [7] Bump the actions group with 4 updates. huggingface/ml-intern
- [8] Set max_tokens 65536 for MiniMax-M3. huggingface/chat-ui
FAQ
- What changed in Hugging Face on June 14, 2026?
- TRL's distillation trainers were silently miscalculating loss under gradient accumulation, and LeRobot's image statistics were overflowing to zero for valid data.
- What should Hugging Face teams do about it?
- Update TRL if running GKDTrainer, GOLDTrainer, or DistillationTrainer with gradient accumulation; rebuild LeRobot dataset statistics if using image features with uint8 casting; review GLM-4-MoE fine-tuning chat templates if in production.
- Which Hugging Face repositories shipped on June 14, 2026?
- huggingface/trl, huggingface/lerobot, huggingface/ml-intern, huggingface/chat-ui