“Training a Tokenizer That Actually Speaks Italian”

Tuesday 21st April, 2026 - Bruce Sterling

https://medium.com/@fabio_angeletti/training-a-tokenizer-that-actually-speaks-italian-4084919224db

(…)

Why English Tokenizers Fail at Italian
The apostrophe problem

In English, apostrophes mark possessives or contractions: “it’s,” “don’t,” “Sarah’s.” They’re grammatically optional — you could rewrite any sentence without them.

In Italian, apostrophes are elisions — they mark where two words fuse into a single grammatical unit. “L’intelligenza” means “the intelligence.” “Dell’algoritmo” means “of the algorithm.” “Un’ottimizzazione” means “an optimization.” The apostrophe connects. Remove it, and you’ve broken the syntax.

Every major English tokenizer — GPT’s, LLaMA’s, Mistral’s — treats apostrophes as split points. They were designed for English, where that’s the right behavior. But when you feed them Italian text, “dell’algoritmo” becomes three separate tokens: ["dell", "'", "algoritmo"]. The model sees a broken article, a punctuation mark, and a noun — when an Italian reader sees a single, inseparable phrase.

This isn’t just an efficiency problem. When the apostrophe lands in a different token from both the article and the noun, the model’s attention mechanism has to work harder to learn that these three pieces form one grammatical unit. Multiply that across every elision in every Italian sentence, and you’re systematically handicapping the model’s ability to learn Italian syntax.

The accent problem

Italian uses six accented vowels in daily writing: à, è, é, ì, ò, ù. The word “perché” (why/because) appears in virtually every Italian text. So does “è” (is), “più” (more), “già” (already), “così” (so)….
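The cost is visible at the byte level. Accented vowels are not ASCII: each one occupies two bytes in UTF-8, and a byte-level BPE vocabulary trained mostly on English may never learn merges that cover them, leaving them as raw byte tokens. A quick check (plain Python, no tokenizer library required):

```python
# Each accented Italian vowel costs two UTF-8 bytes; plain ASCII vowels cost one.
for ch in "àèéìòù":
    b = ch.encode("utf-8")
    print(f"{ch} -> 0x{b.hex()} ({len(b)} bytes)")

word = "perché"
print(len(word), "characters,", len(word.encode("utf-8")), "UTF-8 bytes")
# 6 characters, 7 UTF-8 bytes: the accent adds a byte before any merging starts
```

So even before vocabulary effects, every “è”, “più”, and “perché” starts one byte behind its unaccented English counterpart — and if the tokenizer never merged those byte pairs during training, the gap widens to whole extra tokens.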