RepoJournal
Hugging Face

@huggingface

Transformers, Datasets, and the open AI-model layer

SWIFT TOKENIZERS FIXED FOR UNICODE, TRANSFORMERS CI PATCHED, MONGOKU SHIPS SCHEMA AUDIT

By RepoJournal · Filed · About Hugging Face

Swift-transformers shipped three critical tokenizer fixes for grapheme cluster bugs that broke emoji and combining marks across Unigram, BPE, and BasicTokenizer [1] [2], while transformers CI avoided an E2BIG argument-size failure on large PRs [3].

The Swift fixes address a fundamental mismatch: SentencePiece vocabularies are indexed by Unicode scalar, but Swift's Character type represents an extended grapheme cluster, so emoji like '1️⃣' and combining marks in Thai, Devanagari, and Japanese failed to tokenize [4] [5] [6]. These aren't edge cases; they're vocabulary coverage holes that would silently produce wrong tokens in production. Transformers sidestepped a CI failure by refetching PR files inside the script instead of passing them through environment variables, an approach that was hitting the kernel's MAX_ARG_STRLEN limit on PRs with large patches [7]. Mongoku merged a schema auditing feature with a medium-risk warning: the new collection-wide introspection endpoints could trigger expensive aggregations that time out on large datasets [8]. Hub-docs auto-bumped inference provider packages and regenerated docs without incident [9].
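The scalar-versus-grapheme mismatch can be illustrated with a small Python sketch. This is not the swift-transformers code: `VOCAB`, `scalars`, `approx_clusters`, and `lookup` are hypothetical names, the vocabulary is a made-up three-entry table, and the cluster grouping is only a rough stand-in for Swift's Character (it attaches combining marks to the preceding base and does not implement full Unicode text segmentation).

```python
import unicodedata


def scalars(text):
    """Split text into Unicode scalars (a Python str iterates by code point)."""
    return list(text)


def approx_clusters(text):
    """Rough stand-in for grapheme clusters: glue mark characters
    (general categories Mn, Mc, Me) onto the preceding base character.
    This is an approximation, not full Unicode segmentation."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Hypothetical vocabulary keyed by single scalars, as a scalar-indexed
# vocabulary would be. The IDs are invented for illustration.
VOCAB = {"1": 0, "\ufe0f": 1, "\u20e3": 2}


def lookup(pieces, vocab):
    """Map each piece to its ID, or None when the piece is not in the vocab."""
    return [vocab.get(piece) for piece in pieces]


# Keycap emoji: DIGIT ONE + VARIATION SELECTOR-16 + COMBINING ENCLOSING KEYCAP
keycap = "1\ufe0f\u20e3"

by_cluster = lookup(approx_clusters(keycap), VOCAB)  # one piece, no match
by_scalar = lookup(scalars(keycap), VOCAB)           # three pieces, all match
```

Under these assumptions, the cluster-based lookup sees the keycap as a single piece that the vocabulary does not contain, while the scalar-based lookup finds an entry for every piece. That is the kind of coverage hole the fixes describe.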

References

[1] Unigram lattice walks Unicode scalars (#352, Bug 3) (#356), huggingface/swift-transformers
[2] BPE merge by Unicode scalar, not grapheme cluster (#352, Bug 4) (#355), huggingface/swift-transformers
[3] bugfix(ci): avoid E2BIG in pr_slow_ci_suggestion (#45983), huggingface/transformers
[4] Unigram lattice walks Unicode scalars (#352, Bug 3), huggingface/swift-transformers
[5] BPE merge by Unicode scalar, not grapheme cluster (#352, Bug 4), huggingface/swift-transformers
[6] Strip Japanese voiced-kana marks in BasicTokenizer (#352, Bug 2), huggingface/swift-transformers
[7] bugfix(ci): avoid E2BIG in pr_slow_ci_suggestion, huggingface/transformers
[8] Add schema auditing endpoints and navigation tab, huggingface/Mongoku
[9] [Bot] Update Inference Providers documentation, huggingface/hub-docs

FAQ

What changed in Hugging Face on May 17, 2026?
Swift-transformers shipped three critical tokenizer fixes that resolve grapheme cluster bugs breaking emoji and combining marks across Unigram, BPE, and BasicTokenizer, while transformers CI dodged an E2BIG argument limit blowup on large PRs.
What should Hugging Face teams do about it?
  • Merge the swift-transformers tokenizer fixes into your build pipeline before your next release
  • Test the Mongoku schema audit against your largest collections before enabling it in production
  • Monitor transformers CI on large PR submissions to confirm the E2BIG fix holds
Which Hugging Face repositories shipped on May 17, 2026?
huggingface/swift-transformers, huggingface/transformers, huggingface/Mongoku, huggingface/hub-docs
