i built an inference engine in rust, shipped it, then said in a linkedin post i wanted to look into Arabic tokenization. so i actually did. what follows is what i found, and it wasn't what i expected.
Arabic uses a root-and-pattern (non-concatenative) morphological system. a 3-consonant root like k-t-b (write) combines with templatic vowel patterns to produce kataba (he wrote), kitaab (book), maktab (office), maktūb (written), and yaktubu (he writes). the root never appears as a standalone word; it's always embedded inside a surface form.
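to make the interleaving concrete, here's a minimal sketch that fills a template with root consonants. the `C`-slot notation, the `apply_pattern` helper, and the `PATTERNS` table are my own illustrative assumptions working on transliteration; real Arabic morphology has many more patterns plus phonological adjustments this ignores.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Fill each 'C' slot in the pattern with the next root consonant.

    `root` is a hyphen-separated string like "k-t-b" (hypothetical notation).
    """
    consonants = root.split("-")
    if pattern.count("C") != len(consonants):
        raise ValueError("number of C slots must match number of root consonants")
    slots = iter(consonants)
    return "".join(next(slots) if ch == "C" else ch for ch in pattern)


# Templates for the five forms in the text; glosses apply to k-t-b only.
PATTERNS = {
    "CaCaCa": "he wrote",
    "CiCaaC": "book",
    "maCCaC": "office",
    "maCCūC": "written",
    "yaCCuCu": "he writes",
}

forms = [apply_pattern("k-t-b", p) for p in PATTERNS]
for form, gloss in zip(forms, PATTERNS.values()):
    print(f"{form}: {gloss}")
```

notice that the consonants k, t, b end up scattered through each output string, never sitting next to each other as one unit.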
a single Arabic word can encode what takes five separate words in English. fasayaktubūnahā (فسيكتبونها) means "and they will write it": one word with proclitics, stem, and enclitic fused together.
BPE was designed for frequency-based merging of contiguous character sequences. Arabic's meaning lives in non-contiguous root-pattern interleaving. these are structurally incompatible. add optional diacritics, no consistent spacing, and diglossia (MSA vs Egyptian vs Gulf vs Levantine), and you have a genuinely hard problem.
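the contiguity point can be shown with a toy BPE trainer. this is a simplified sketch of the merge loop, not any production tokenizer: the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) and the four-word transliterated corpus are assumptions for illustration. every token it learns is a contiguous substring of some word, so the discontinuous root k-t-b can never become a vocabulary entry.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent pair of adjacent symbols, or None."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges; return the merges and segmented words."""
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = [merge_pair(w, pair) for w in words]
    return merges, words


# Toy corpus: four k-t-b forms in transliteration (illustrative only).
corpus = ["kataba", "kitaab", "maktab", "yaktubu"]
merges, segmented = train_bpe(corpus, num_merges=6)
vocab = {tok for word in segmented for tok in word}

print(segmented)
print("ktb" in vocab)  # the root is never a contiguous substring here
```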
the field has tried fixing this with morphology-aware tokenization. a lot:
reasonable assumption: if you segment morphemes correctly before the model sees them, it should learn morphological structure better.
the 2026 paper evaluated 7 LLMs on Arabic (ALLAM, FANAR, GPT-4, GPT-4o, LLaMA-3, Qwen-3, Cohere) on two dimensions: (1) how well their tokenizers align with gold-standard morpheme boundaries, and (2) how well the models can productively generate Arabic root-pattern forms, including nonce (invented) roots, which tests real generalization, not memorization.
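for the first dimension, alignment metrics like fertility and boundary precision can be sketched roughly as below. the definitions here are my assumptions (fertility as average tokens per word, boundary precision as the share of predicted cut points that match gold cut points); the paper's exact formulas may differ. the gold segmentation is my own clitic split of the example word, and the "predicted" tokens are made up, not output from any real tokenizer.

```python
def fertility(tokenized_words):
    """Average number of tokens per word (assumed definition)."""
    return sum(len(tokens) for tokens in tokenized_words) / len(tokenized_words)


def boundaries(segments):
    """Character offsets of the internal cut points of a segmentation."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts


def boundary_precision_recall(predicted, gold):
    """Precision and recall of predicted cut points against gold cut points."""
    pred_cuts, gold_cuts = boundaries(predicted), boundaries(gold)
    hits = len(pred_cuts & gold_cuts)
    precision = hits / len(pred_cuts) if pred_cuts else 0.0
    recall = hits / len(gold_cuts) if gold_cuts else 0.0
    return precision, recall


# "and they will write it": proclitics + stem + enclitic (illustrative gold).
gold = ["fa", "sa", "yaktubūna", "hā"]
# Hypothetical subword split, invented for this example.
predicted = ["fas", "ay", "akt", "ubūna", "hā"]

precision, recall = boundary_precision_recall(predicted, gold)
print(fertility([predicted]), precision, recall)
```

a tokenizer can score badly on both numbers, which is the situation the results below describe for GPT-4.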
alignment doesn't predict competence
ALLAM had the best tokenizer morphological alignment (MCR 83-86%) but collapsed on nonce words: 20% accuracy. it's memorizing, not generalizing.
GPT-4 had the worst tokenizer alignment (fertility 4× higher than ideal, boundary precision 17%) but scored 92% on nonce root-pattern generation. second best overall.
GPT-4o scored 97% on nonce words despite similarly bad tokenizer alignment.
tokenizer alignment metrics don't predict morphological generation performance: morpheme F1 and MCR show zero or weakly negative correlation with generation accuracy.
FANAR, which has a morphologically-informed tokenizer (MorphBPE), performed consistently but didn't dominate. its steadier performance might come from instruction-following, not tokenizer quality.
English prompts outperformed Arabic prompts on most models, plausibly because instruction-tuning data is overwhelmingly English.
the paper's conclusion: morphological competence should be defined by productive generalization, not surface segmentation alignment. compositional reasoning + instruction-following substitutes for explicit morphological parsing.
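a productive-generalization test of this kind could be set up roughly like the sketch below. everything in it is hypothetical scaffolding: `CONSONANTS`, `KNOWN_ROOTS`, `make_nonce_items`, `accuracy`, and the `generate` callable standing in for a model are my own names, not the paper's protocol. the tiny `KNOWN_ROOTS` set is a placeholder; a real experiment would need to check candidates against an actual Arabic root lexicon, since randomly sampled consonant triples can easily be real roots.

```python
import random

# Toy consonant inventory in transliteration (assumption, not exhaustive).
CONSONANTS = list("bdfjklmnrstz")
# Placeholder for a real lexicon of attested roots.
KNOWN_ROOTS = {("k", "t", "b"), ("d", "r", "s")}


def apply_pattern(root, pattern):
    """Fill each 'C' slot in the pattern with the next root consonant."""
    slots = iter(root)
    return "".join(next(slots) if ch == "C" else ch for ch in pattern)


def make_nonce_items(pattern, n, seed=0):
    """Sample n distinct invented 3-consonant roots and their expected forms."""
    rng = random.Random(seed)
    seen, items = set(), []
    while len(items) < n:
        root = tuple(rng.sample(CONSONANTS, 3))
        if root in KNOWN_ROOTS or root in seen:
            continue
        seen.add(root)
        items.append({
            "root": "-".join(root),
            "pattern": pattern,
            "expected": apply_pattern(root, pattern),
        })
    return items


def accuracy(items, generate):
    """Exact-match accuracy of generate(root, pattern) against expected forms."""
    correct = sum(generate(it["root"], it["pattern"]) == it["expected"] for it in items)
    return correct / len(items)


items = make_nonce_items("maCCūC", n=5)
# Stand-in "model" that applies the rule perfectly; swap in a real model call.
oracle = lambda root, pattern: apply_pattern(root.split("-"), pattern)
print(accuracy(items, oracle))
```

the point of the nonce roots is that a model can't score well by recalling forms it saw in training; it has to apply the pattern.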
the 2020 result (morphology-aware tokenization helps) and the 2026 result (tokenizer alignment doesn't predict generation) aren't fully contradictory. they're scale-dependent:
the field has been asking "which tokenizer wins on classification tasks" for years. multiple papers (2023, 2024) kept landing on "it depends on the task and dataset." the 2026 paper reframes the question entirely: stop asking about surface segmentation and start asking about productive generalization.
the real gap might not be at the tokenizer or even the embedding layer; it might be in the internal representations the model builds. GPT-4 and GPT-4o are doing something ALLAM isn't, despite ALLAM having more Arabic training data and a better tokenizer. what is it?
i'm going to look at the internal activation angle: what do the representations actually look like inside models that succeed vs fail on nonce Arabic words?
colab + small Arabic models is feasible on CPU hardware. behavioral probing (logit-level) doesn't require loading full weights. and i built an inference engine that gives me direct access to hidden states at every layer.
other threads i'm tracking:
this started as a curiosity from a linkedin post. now i have actual research questions and a direction that doesn't seem fully explored. more updates as i go.