{"id":6374,"date":"2026-04-21T09:05:44","date_gmt":"2026-04-21T08:05:44","guid":{"rendered":"https:\/\/toshareproject.it\/artmakerblog\/?p=6374"},"modified":"2026-04-21T09:05:44","modified_gmt":"2026-04-21T08:05:44","slug":"training-a-tokenizer-that-actually-speaks-italian","status":"publish","type":"post","link":"https:\/\/toshareproject.it\/artmakerblog\/training-a-tokenizer-that-actually-speaks-italian\/","title":{"rendered":"&#8220;Training a Tokenizer That Actually Speaks Italian&#8221;"},"content":{"rendered":"<p><a href=\"https:\/\/medium.com\/@fabio_angeletti\/training-a-tokenizer-that-actually-speaks-italian-4084919224db\">https:\/\/medium.com\/@fabio_angeletti\/training-a-tokenizer-that-actually-speaks-italian-4084919224db<\/a><\/p>\n<p>(&#8230;)<\/p>\n<p>Why English Tokenizers Fail at Italian<br \/>\nThe apostrophe problem<\/p>\n<p>In English, apostrophes mark possessives or contractions: \u201cit\u2019s,\u201d \u201cdon\u2019t,\u201d \u201cSarah\u2019s.\u201d They\u2019re grammatically optional \u2014 you could rewrite any sentence without them.<\/p>\n<p>In Italian, apostrophes are elisions \u2014 they mark where two words fuse into a single grammatical unit. \u201cL\u2019intelligenza\u201d means \u201cthe intelligence.\u201d \u201cDell\u2019algoritmo\u201d means \u201cof the algorithm.\u201d \u201cUn\u2019ottimizzazione\u201d means \u201can optimization.\u201d The apostrophe connects. Remove it, and you\u2019ve broken the syntax.<\/p>\n<p>Every major English tokenizer \u2014 GPT\u2019s, LLaMA\u2019s, Mistral\u2019s \u2014 treats apostrophes as split points. They were designed for English, where that\u2019s the right behavior. But when you feed them Italian text, \u201cdell\u2019algoritmo\u201d becomes three separate tokens: [\u201cdell\u201d, \u201c\u2019\u201d, \u201calgoritmo\u201d]. 
The model sees a broken article, a punctuation mark, and a noun \u2014 when an Italian reader sees a single, inseparable phrase.<\/p>\n<p>This isn\u2019t just an efficiency problem. When the apostrophe lands in a different token from both the article and the noun, the model\u2019s attention mechanism has to work harder to learn that these three pieces form one grammatical unit. Multiply that across every elision in every Italian sentence, and you\u2019re systematically handicapping the model\u2019s ability to learn Italian syntax.<\/p>\n<p>The accent problem<\/p>\n<p>Italian uses six accented vowels in daily writing: \u00e0, \u00e8, \u00e9, \u00ec, \u00f2, \u00f9. The word \u201cperch\u00e9\u201d (why\/because) appears in virtually every Italian text. So does \u201c\u00e8\u201d (is), \u201cpi\u00f9\u201d (more), \u201cgi\u00e0\u201d (already), \u201ccos\u00ec\u201d (so)&#8230;.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/medium.com\/@fabio_angeletti\/training-a-tokenizer-that-actually-speaks-italian-4084919224db (&#8230;) Why English Tokenizers Fail at Italian The apostrophe problem In English, apostrophes mark possessives or contractions: \u201cit\u2019s,\u201d \u201cdon\u2019t,\u201d \u201cSarah\u2019s.\u201d They\u2019re grammatically optional \u2014 you could rewrite any sentence without them. In Italian, apostrophes are elisions \u2014 they mark where two words fuse into a single grammatical unit. 
\u201cL\u2019intelligenza\u201d means \u201cthe intelligence.\u201d \u201cDell\u2019algoritmo\u201d means [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-6374","post","type-post","status-publish","format-standard","hentry","category-uncategorised"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v17.0 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>&quot;Training a Tokenizer That Actually Speaks Italian&quot; | Artmaker Blog<\/title>\n<meta name=\"description\" content=\"&quot;Training a Tokenizer That Actually Speaks Italian&quot; | Artmaker Blog\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/toshareproject.it\/artmakerblog\/training-a-tokenizer-that-actually-speaks-italian\/\" \/>\n<meta property=\"og:locale\" content=\"en_GB\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"&quot;Training a Tokenizer That Actually Speaks Italian&quot; | Artmaker Blog\" \/>\n<meta property=\"og:description\" content=\"&quot;Training a Tokenizer That Actually Speaks Italian&quot; | Artmaker Blog\" \/>\n<meta property=\"og:url\" content=\"https:\/\/toshareproject.it\/artmakerblog\/training-a-tokenizer-that-actually-speaks-italian\/\" \/>\n<meta property=\"og:site_name\" content=\"Artmaker Blog\" \/>\n<meta property=\"article:published_time\" content=\"2026-04-21T08:05:44+00:00\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Bruce Sterling\" \/>\n\t<meta name=\"twitter:label2\" content=\"Estimated reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"1 minute\" \/>\n<script 
type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebSite\",\"@id\":\"https:\/\/toshareproject.it\/artmakerblog\/#website\",\"url\":\"https:\/\/toshareproject.it\/artmakerblog\/\",\"name\":\"Artmaker Blog\",\"description\":\"on Toshareproject.it - curated by Bruce Sterling\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/toshareproject.it\/artmakerblog\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-GB\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/toshareproject.it\/artmakerblog\/training-a-tokenizer-that-actually-speaks-italian\/#webpage\",\"url\":\"https:\/\/toshareproject.it\/artmakerblog\/training-a-tokenizer-that-actually-speaks-italian\/\",\"name\":\"\\\"Training a Tokenizer That Actually Speaks Italian\\\" | Artmaker Blog\",\"isPartOf\":{\"@id\":\"https:\/\/toshareproject.it\/artmakerblog\/#website\"},\"datePublished\":\"2026-04-21T08:05:44+00:00\",\"dateModified\":\"2026-04-21T08:05:44+00:00\",\"author\":{\"@id\":\"https:\/\/toshareproject.it\/artmakerblog\/#\/schema\/person\/6f20726ed2761431f3e0ff4e096c3085\"},\"description\":\"\\\"Training a Tokenizer That Actually Speaks Italian\\\" | Artmaker Blog\",\"breadcrumb\":{\"@id\":\"https:\/\/toshareproject.it\/artmakerblog\/training-a-tokenizer-that-actually-speaks-italian\/#breadcrumb\"},\"inLanguage\":\"en-GB\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/toshareproject.it\/artmakerblog\/training-a-tokenizer-that-actually-speaks-italian\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/toshareproject.it\/artmakerblog\/training-a-tokenizer-that-actually-speaks-italian\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/toshareproject.it\/artmakerblog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"&#8220;Training a 
Tokenizer That Actually Speaks Italian&#8221;\"}]},{\"@type\":\"Person\",\"@id\":\"https:\/\/toshareproject.it\/artmakerblog\/#\/schema\/person\/6f20726ed2761431f3e0ff4e096c3085\",\"name\":\"Bruce Sterling\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/toshareproject.it\/artmakerblog\/#personlogo\",\"inLanguage\":\"en-GB\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/c390e8ed4db57a34278dcf667f928a643cf769a865c8a8632dcd310412bb9a99?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/c390e8ed4db57a34278dcf667f928a643cf769a865c8a8632dcd310412bb9a99?s=96&d=mm&r=g\",\"caption\":\"Bruce Sterling\"},\"description\":\"Art director at Share Festival, author and journalist\",\"sameAs\":[\"http:\/\/toshareproject.it\/tomorrowart\"],\"url\":\"https:\/\/toshareproject.it\/artmakerblog\/author\/brucesterling\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"\"Training a Tokenizer That Actually Speaks Italian\" | Artmaker Blog","description":"\"Training a Tokenizer That Actually Speaks Italian\" | Artmaker Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/toshareproject.it\/artmakerblog\/training-a-tokenizer-that-actually-speaks-italian\/","og_locale":"en_GB","og_type":"article","og_title":"\"Training a Tokenizer That Actually Speaks Italian\" | Artmaker Blog","og_description":"\"Training a Tokenizer That Actually Speaks Italian\" | Artmaker Blog","og_url":"https:\/\/toshareproject.it\/artmakerblog\/training-a-tokenizer-that-actually-speaks-italian\/","og_site_name":"Artmaker Blog","article_published_time":"2026-04-21T08:05:44+00:00","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Bruce Sterling","Estimated reading time":"1 
minute"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebSite","@id":"https:\/\/toshareproject.it\/artmakerblog\/#website","url":"https:\/\/toshareproject.it\/artmakerblog\/","name":"Artmaker Blog","description":"on Toshareproject.it - curated by Bruce Sterling","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/toshareproject.it\/artmakerblog\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-GB"},{"@type":"WebPage","@id":"https:\/\/toshareproject.it\/artmakerblog\/training-a-tokenizer-that-actually-speaks-italian\/#webpage","url":"https:\/\/toshareproject.it\/artmakerblog\/training-a-tokenizer-that-actually-speaks-italian\/","name":"\"Training a Tokenizer That Actually Speaks Italian\" | Artmaker Blog","isPartOf":{"@id":"https:\/\/toshareproject.it\/artmakerblog\/#website"},"datePublished":"2026-04-21T08:05:44+00:00","dateModified":"2026-04-21T08:05:44+00:00","author":{"@id":"https:\/\/toshareproject.it\/artmakerblog\/#\/schema\/person\/6f20726ed2761431f3e0ff4e096c3085"},"description":"\"Training a Tokenizer That Actually Speaks Italian\" | Artmaker Blog","breadcrumb":{"@id":"https:\/\/toshareproject.it\/artmakerblog\/training-a-tokenizer-that-actually-speaks-italian\/#breadcrumb"},"inLanguage":"en-GB","potentialAction":[{"@type":"ReadAction","target":["https:\/\/toshareproject.it\/artmakerblog\/training-a-tokenizer-that-actually-speaks-italian\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/toshareproject.it\/artmakerblog\/training-a-tokenizer-that-actually-speaks-italian\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/toshareproject.it\/artmakerblog\/"},{"@type":"ListItem","position":2,"name":"&#8220;Training a Tokenizer That Actually Speaks Italian&#8221;"}]},{"@type":"Person","@id":"https:\/\/toshareproject.it\/artmakerblog\/#\/schema\/person\/6f20726ed2761431f3e0ff4e096c3085","name":"Bruce 
Sterling","image":{"@type":"ImageObject","@id":"https:\/\/toshareproject.it\/artmakerblog\/#personlogo","inLanguage":"en-GB","url":"https:\/\/secure.gravatar.com\/avatar\/c390e8ed4db57a34278dcf667f928a643cf769a865c8a8632dcd310412bb9a99?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/c390e8ed4db57a34278dcf667f928a643cf769a865c8a8632dcd310412bb9a99?s=96&d=mm&r=g","caption":"Bruce Sterling"},"description":"Art director at Share Festival, author and journalist","sameAs":["http:\/\/toshareproject.it\/tomorrowart"],"url":"https:\/\/toshareproject.it\/artmakerblog\/author\/brucesterling\/"}]}},"_links":{"self":[{"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/posts\/6374","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/comments?post=6374"}],"version-history":[{"count":1,"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/posts\/6374\/revisions"}],"predecessor-version":[{"id":6375,"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/posts\/6374\/revisions\/6375"}],"wp:attachment":[{"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/media?parent=6374"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/categories?post=6374"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/tags?post=6374"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}