The Wire · Showcase
TRANSFORMERS CI FIXES OVERSIZED TRACES; TRL STREAMLINES QLORA TRAINING
By RepoJournal · About Hugging Face
The transformers test pipeline stopped silently dropping its largest jobs from dashboards. Three critical fixes ship today.
transformers-ci fixed a cascade of trace observability failures that left engineers without visibility into big test runs. Test traces had ballooned to 36-65 MB with 100k spans; Tempo ingested them but failed on retrieval, so the largest jobs silently disappeared from the dashboard [1]. The root cause: pytest-opentelemetry emits a per-test protocol span plus phase and fixture spans, which inflates every trace [2]. The fix drops everything but the protocol span, leaving one span per test and cutting traces to a retrievable size. A second bug marked traces as "settled" once span counts held steady, which froze out ERROR spans from failed tests that arrived late and out of order [4]. Together, these fixes make the biggest jobs visible and accurate again [3].
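The two ideas above (one span per test, and a settle check that keeps watching for late spans) can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the transformers-ci code: the `Span` record, its `kind` labels, the `SettleTracker` class, and the `stable_polls` and `reverify_polls` parameters are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record (not an OpenTelemetry type)."""
    name: str
    kind: str          # assumed labels: "protocol", "phase", or "fixture"
    status: str = "OK"


def keep_protocol_spans(spans):
    """Drop phase and fixture spans so each test contributes one span."""
    return [span for span in spans if span.kind == "protocol"]


class SettleTracker:
    """Decide when a trace is settled, with a window for late spans.

    A trace is "settled" once its span count has been unchanged for
    `stable_polls` consecutive polls. It is only "frozen" (no longer
    re-checked) after `reverify_polls` further unchanged polls, so a
    late-arriving span during that window reopens the trace.
    """

    def __init__(self, stable_polls=2, reverify_polls=3):
        self.stable_polls = stable_polls
        self.reverify_polls = reverify_polls
        self.last_count = None
        self.unchanged = 0

    def observe(self, span_count):
        """Record one poll and return "pending", "settled", or "frozen"."""
        if span_count == self.last_count:
            self.unchanged += 1
        else:
            # A new span arrived (possibly late): restart the countdown.
            self.last_count = span_count
            self.unchanged = 0
        if self.unchanged < self.stable_polls:
            return "pending"
        if self.unchanged < self.stable_polls + self.reverify_polls:
            return "settled"
        return "frozen"
```

In this sketch a trace that freezes the moment its count first holds steady corresponds to `reverify_polls=0`, which is the behavior the second fix moves away from.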
On the training side, TRL shipped four quality-of-life wins. The `quantization_config` trainer argument lands across SFTTrainer, DPOTrainer, GRPOTrainer, RLOOTrainer, and RewardTrainer, replacing the pattern of reaching into `model_init_kwargs` or loading the quantized model manually [5]. Data collators across DPO, SFT, Reward, and KTO got a consistency pass with unified docstrings, naming, and structure [6]. SFT now truncates sequences once during dataset preparation instead of on every batch, which speeds up iteration and sets up future work on dropping untrained rows [7]. The KTO trainer now aligns with DPO by supporting PEFT models with the Liger fused loss, fixing a blanket rejection that blocked a common pattern [8]. One breaking change: support for vLLM 0.15 is dropped [9].
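To make the SFT truncation change concrete, here is a minimal sketch of the general idea: truncate each tokenized example once while preparing the dataset, so the collator only has to pad. It is not TRL's implementation. The function names, the `max_length` and `pad_token_id` parameters, and the assumption that every field of an example is a per-token list are all illustrative.

```python
def truncate_example(example, max_length):
    """Return a copy of a tokenized example cut to at most max_length tokens.

    Assumes every field (e.g. "input_ids") is a per-token list.
    """
    return {key: value[:max_length] for key, value in example.items()}


def prepare_dataset(examples, max_length):
    """Truncate every example once, before training starts."""
    return [truncate_example(example, max_length) for example in examples]


def collate(batch, pad_token_id=0):
    """Pad already-truncated examples to the longest one in the batch."""
    width = max(len(example["input_ids"]) for example in batch)
    input_ids = []
    attention_mask = []
    for example in batch:
        ids = example["input_ids"]
        padding = width - len(ids)
        input_ids.append(ids + [pad_token_id] * padding)
        attention_mask.append([1] * len(ids) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

Because truncation runs once per example rather than once per batch, repeated epochs over the same data do not redo the work, and the collator stays a pure padding step.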
Serge's pod-per-task security model is now durable and documented [10]: the design is folded into docs/security.md and the separate plan doc is dropped. CI now runs the task tests on the in-VPC runner with proper route acceptance so internal ALB connections work [11]. Diffusers 0.39.0 shipped Cosmos 3, NVIDIA's unified world foundation model for Physical AI, which runs omni-generation and reasoning in a single transformer [12]. huggingface.js registered the hi terminal coding agent in its harness registry [13].
Action items
- Upgrade transformers-ci to pull the trace fixes (drop phase/fixture spans) before the next large test run · huggingface/transformers-ci [immediate]
- Update TRL to 0.39+ if using QLoRA, to simplify quantization setup via `quantization_config`; note that vLLM 0.15 is no longer supported · huggingface/trl [plan]
- Review the Serge security docs; pod-per-task is now the durable model on prod · huggingface/serge [monitor]
References
- [1] Fix: large shard traces (tests_torch) silently dropped from dashboards huggingface/transformers-ci
- [2] Emit one span per test — drop phase + fixture spans (durable fix for oversized traces) huggingface/transformers-ci
- [3] Copied from `transformers-test-ci` huggingface/transformers-ci
- [4] Fix: reverify window so out-of-order late spans aren't frozen out huggingface/transformers-ci
- [5] Add `quantization_config` trainer argument (streamline QLoRA) huggingface/trl
- [6] Align data collators across DPO / SFT / Reward / KTO huggingface/trl
- [7] SFT: Truncate during dataset preparation, not collation huggingface/trl
- [8] Align KTO with DPO: Support PEFT with Liger huggingface/trl
- [9] Drop vLLM 0.15 support (#6239) huggingface/trl
- [10] docs: fold per-task-pod security into docs/security.md; drop the plan doc (#43) huggingface/serge
- [11] CI: run serge-task-test on the in-VPC runner (aws-general-8-plus) (#41) huggingface/serge
- [12] Diffusers 0.39.0: New image and video pipelines, core library improvements, and more huggingface/diffusers
- [13] Add hi agent harness (#2269) huggingface/huggingface.js
FAQ
- What changed in Hugging Face on July 4, 2026?
- The transformers test pipeline stopped silently dropping its largest jobs from dashboards. Three critical fixes ship today.
- What should Hugging Face teams do about it?
- Upgrade transformers-ci to pull the trace fixes (drop phase/fixture spans) before the next large test run • Update TRL to 0.39+ if using QLoRA, to simplify quantization setup via `quantization_config`; note that vLLM 0.15 is no longer supported • Review the Serge security docs; pod-per-task is now the durable model on prod
- Which Hugging Face repositories shipped on July 4, 2026?
- huggingface/transformers-ci, huggingface/trl, huggingface/serge, huggingface/diffusers, huggingface/huggingface.js