the tokenizer isn't the problem

what i learned reading arabic nlp papers for a week
mohammed al-thobaiti · 2026-05-16

i built an inference engine in rust, shipped it, then said in a linkedin post i wanted to look into Arabic tokenization. so i actually did. what follows is what i found, and it wasn't what i expected.

why Arabic is a real stress test for tokenizers

Arabic uses a root-and-pattern (non-concatenative) morphological system. a 3-consonant root like k-t-b (write) combines with templatic vowel patterns to produce: kataba (he wrote), kitāb (book), maktab (office), maktūb (written), yaktubu (he writes). the root never appears as a standalone word; it's always embedded inside a surface form.
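a minimal sketch of the interleaving, working on romanized forms. the template notation (a capital "C" marking each root-consonant slot) and the function name are assumptions made for illustration, not a standard from any toolkit.

```python
def interdigitate(root: str, template: str) -> str:
    """Fill each 'C' slot in the template with the next root consonant."""
    consonants = root.split("-")
    out, i = [], 0
    for ch in template:
        if ch == "C":
            out.append(consonants[i])  # root consonants keep their order
            i += 1
        else:
            out.append(ch)  # vowels and affix letters come from the pattern
    return "".join(out)

# hypothetical template strings for the five forms listed above
TEMPLATES = {
    "CaCaCa": "he wrote",
    "CiCāC": "book",
    "maCCaC": "office",
    "maCCūC": "written",
    "yaCCuCu": "he writes",
}

forms = {t: interdigitate("k-t-b", t) for t in TEMPLATES}
for template, form in forms.items():
    print(f"{template:8} -> {form:8} ({TEMPLATES[template]})")
```

the point of the toy: the root's consonants keep their order but never sit next to each other in a fixed way, and the pattern supplies everything in between.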

a single Arabic word can encode what English spreads across five separate words. fasayaktubūnahā (فسيكتبونها) = "and they will write it": one word with proclitics, stem, and enclitic fused together.
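one plausible morpheme-level breakdown of that word, as a sketch. the segmentation and glosses below are a simplified reading of the romanized form (assumptions for illustration, not the output of any analyzer).

```python
# simplified gloss of the romanized example word; labels are illustrative
MORPHEMES = [
    ("fa", "proclitic: 'and / so'"),
    ("sa", "proclitic: future marker"),
    ("ya", "imperfective prefix, 3rd person"),
    ("ktub", "stem built on root k-t-b"),
    ("ūna", "suffix: masculine plural"),
    ("hā", "enclitic: object pronoun 'it / her'"),
]

# concatenating the pieces gives back the single surface word
surface = "".join(piece for piece, _ in MORPHEMES)
print(surface, "=", " + ".join(piece for piece, _ in MORPHEMES))
```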

BPE was designed for frequency-based merging of contiguous character sequences. much of Arabic's meaning lives in non-contiguous root-pattern interleaving, so the two are structurally mismatched. add optional diacritics, no consistent spacing, and diglossia (MSA vs Egyptian vs Gulf vs Levantine), and you have a genuinely hard problem.
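a toy illustration of the mismatch, assuming romanized forms with all vowels written out. it shows only the pair-counting step that BPE uses to pick its next merge, not a full BPE trainer, and the helper names are made up for this sketch. one caveat: in everyday undiacritized script short vowels are not written, so the root letters sit next to each other more often than this romanized view suggests.

```python
from collections import Counter

def adjacent_pair_counts(words):
    """Count adjacent symbol pairs: the statistic BPE uses to choose a merge."""
    pairs = Counter()
    for word in words:
        pairs.update(zip(word, word[1:]))
    return pairs

def is_subsequence(needle, haystack):
    """True if needle's characters appear in haystack in order, gaps allowed."""
    remaining = iter(haystack)
    return all(ch in remaining for ch in needle)

words = ["kataba", "kitāb", "maktab", "maktūb", "yaktubu"]
root = "ktb"

pairs = adjacent_pair_counts(words)
contiguous = [w for w in words if root in w]                # root as a substring
scattered = [w for w in words if is_subsequence(root, w)]   # root with gaps

print(pairs.most_common(3))
print("contiguous:", contiguous)
print("scattered:", scattered)
```

in this toy corpus the root is a subsequence of every form but a substring of none, so no sequence of adjacent-pair merges can ever produce a single "ktb" unit.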

the obvious hypothesis: fix the tokenizer

the field tried this. a lot:

CAMeL Tools and Farasa: morphological analyzers that produce near-perfect morpheme segmentation (Farasa gets ~99% morpheme F1 on MSA)
MorphBPE: BPE extended with morphological supervision
Alkaoud & Syed (WANLP 2020): modified Word2Vec and BERT to use morphology-aware tokenization at the embedding layer. result: 60% smaller vocab, better OOV handling, SOTA on two Arabic datasets without retraining. this seemed promising.

reasonable assumption: if you segment morphemes correctly before the model sees them, it should learn morphological structure better.

then a 2026 paper broke the assumption

Morphemes Without Borders: Evaluating Root-Pattern Morphology in Arabic Tokenizers and LLMs

alakeel, qwaider, aldarmaki, alqahtani · LREC 2026, arXiv:2603.15773
from SDAIA, MBZUAI, and PNU

they evaluated 7 LLMs, both Arabic-centric and general-purpose (ALLAM, FANAR, GPT-4, GPT-4o, LLaMA-3, Qwen-3, Cohere), on two dimensions: (1) how well their tokenizers align with gold-standard morpheme boundaries, and (2) how well the models can productively generate Arabic root-pattern forms, including nonce (invented) roots, which tests real generalization, not memorization.
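to make the first dimension concrete, here is a sketch of boundary precision/recall/F1 and fertility under common textbook definitions. the paper's exact formulas may differ, MCR is not implemented here, and the "predicted" segmentation is a hypothetical tokenizer output invented for the example.

```python
def boundaries(segments):
    """Character offsets of the internal cuts in a segmentation."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:  # the end of the last segment is not a cut
        pos += len(seg)
        cuts.add(pos)
    return cuts

def boundary_scores(predicted, gold):
    """Precision, recall and F1 of predicted cuts against gold cuts."""
    p, g = boundaries(predicted), boundaries(gold)
    hits = len(p & g)
    precision = hits / len(p) if p else 0.0
    recall = hits / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

def fertility(tokenized_words):
    """Average number of tokens produced per word."""
    return sum(len(tokens) for tokens in tokenized_words) / len(tokenized_words)

gold = ["fa", "sa", "ya", "ktub", "ūna", "hā"]    # simplified gold morphemes
predicted = ["fas", "ay", "aktub", "ūna", "hā"]   # hypothetical tokenizer output

precision, recall, f1 = boundary_scores(predicted, gold)
print(precision, recall, round(f1, 3), fertility([predicted]))
```

only the cuts before "ūna" and "hā" line up with the gold boundaries here, which is the kind of partial overlap these metrics are meant to quantify.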

Key Finding

alignment doesn't predict competence

ALLAM had the best tokenizer morphological alignment (MCR 83-86%) but collapsed on nonce words: 20% accuracy. it's memorizing, not generalizing.

GPT-4 had the worst tokenizer alignment (fertility 4× higher than ideal, boundary precision 17%) but scored 92% on nonce root-pattern generation. second best overall.

GPT-4o scored 97% on nonce words despite similarly bad tokenizer alignment.

tokenizer alignment metrics did not predict morphological generation performance: morpheme F1 and MCR showed zero or weakly negative correlation with generation accuracy.
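for reference, a rank correlation of this kind can be computed as below. this is a plain Spearman implementation chosen for illustration (which statistic the paper used is not stated here), and the numbers in the usage lines are made-up placeholders, not values from the paper.

```python
from math import sqrt

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# PLACEHOLDER values, invented to exercise the function (not paper data)
alignment_score = [0.9, 0.7, 0.5, 0.3]
generation_accuracy = [0.2, 0.6, 0.7, 0.9]
rho = spearman(alignment_score, generation_accuracy)
print(round(rho, 3))
```

with only seven models, any such coefficient is a rough signal at best, so the direction and size of the effect matter more than the exact value.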

FANAR, which has a morphologically informed tokenizer (MorphBPE), performed consistently but didn't dominate. its steadier performance might come from instruction-following rather than tokenizer quality.

English prompts outperformed Arabic prompts on most models, plausibly because instruction-tuning data is overwhelmingly English.

the paper's conclusion: morphological competence should be defined by productive generalization, not surface segmentation alignment. compositional reasoning + instruction-following substitutes for explicit morphological parsing.
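a sketch of what a productive-generalization test can look like, assuming romanized forms and a tiny stand-in lexicon. everything here is hypothetical scaffolding: the root list, the consonant inventory, and the two stub "models" (one that only looks up forms it has seen, one that applies the pattern). a real run would call an LLM instead of the stubs, check candidates against a full root lexicon, and respect Arabic phonotactic constraints, which this toy ignores.

```python
from itertools import islice, permutations

KNOWN_ROOTS = {"ktb", "drs", "ksr"}  # stand-in lexicon, far from complete
CONSONANTS = "bdfklmnrst"            # small romanized inventory for the toy

def apply_template(root, template):
    """Fill each 'C' slot in the template with the next root consonant."""
    out, i = [], 0
    for ch in template:
        if ch == "C":
            out.append(root[i])
            i += 1
        else:
            out.append(ch)
    return "".join(out)

def nonce_roots(n):
    """First n three-consonant roots that are absent from the lexicon."""
    candidates = ("".join(p) for p in permutations(CONSONANTS, 3))
    return list(islice((r for r in candidates if r not in KNOWN_ROOTS), n))

def accuracy(model, roots, template):
    """Share of roots for which the model returns the expected form."""
    hits = sum(model(r, template) == apply_template(r, template) for r in roots)
    return hits / len(roots)

# stub 1: can only reproduce forms it has "seen" for known roots
SEEN = {(r, "maCCūC"): apply_template(r, "maCCūC") for r in KNOWN_ROOTS}
def memorizer(root, template):
    return SEEN.get((root, template), "")

# stub 2: applies the pattern to whatever root it is given
def generalizer(root, template):
    return apply_template(root, template)

roots = nonce_roots(5)
print("memorizer, known roots:", accuracy(memorizer, sorted(KNOWN_ROOTS), "maCCūC"))
print("memorizer, nonce roots:", accuracy(memorizer, roots, "maCCūC"))
print("generalizer, nonce roots:", accuracy(generalizer, roots, "maCCūC"))
```

the memorizer looks perfect on known roots and falls to zero on nonce roots, which is why the nonce condition is the one that separates lookup from rule application.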

what this means

the 2020 result (morphology-aware tokenization helps) and the 2026 result (tokenizer alignment doesn't predict generation) aren't fully contradictory. they're scale-dependent:

at Word2Vec/mBERT scale (~100M params), explicit morphological injection at the embedding layer compensates for limited capacity to learn it implicitly
at GPT-4 scale, compositional reasoning + instruction-following compensates for terrible tokenizer alignment

the open question: where is the inflection point? at what scale does explicit morphological structure stop mattering as an input? and can you reach that capability for Arabic without needing GPT-4's compute?

the field has been asking "which tokenizer wins on classification tasks" for years. multiple papers (2023, 2024) kept landing on "it depends on the task and dataset." the 2026 paper reframes the question entirely: stop asking about surface segmentation and start asking about productive generalization.

the real gap might not be at the tokenizer or even the embedding layer. it might be in the internal representations the model builds. GPT-4 and GPT-4o are doing something ALLAM isn't, despite ALLAM presumably having a larger share of Arabic in its training data and a better-aligned tokenizer. what is it?

what's next

i'm going to look at the internal activation angle: what do the representations actually look like inside models that succeed vs fail on nonce Arabic words?

running small Arabic models in colab is feasible on CPU hardware. behavioral probing (logit-level) only needs a model's output logits, not its internal states. and i built an inference engine that gives me direct access to hidden states at every layer.
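a minimal sketch of logit-level probing: score two candidate continuations by summed token log-probability and see which one the model prefers. it assumes the logits have already been obtained with each candidate fed through teacher forcing (one logits vector per step). the function names and the toy numbers are hypothetical, and no particular engine or library API is implied.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one logits vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sequence_logprob(step_logits, token_ids):
    """Sum of log-probabilities of token_ids, one logits vector per step."""
    return sum(
        log_softmax(logits)[token_id]
        for logits, token_id in zip(step_logits, token_ids)
    )

# toy 3-token vocabulary; numbers are invented to exercise the functions
correct_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
correct_ids = [0, 1]
wrong_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
wrong_ids = [1, 0]

score_correct = sequence_logprob(correct_logits, correct_ids)
score_wrong = sequence_logprob(wrong_logits, wrong_ids)
print("model prefers the correct form:", score_correct > score_wrong)
```

candidates that tokenize to different lengths are not directly comparable on a summed score, so dividing by the number of tokens is a common adjustment.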

other threads i'm tracking:

the dialect problem (MSA vs colloquial) keeps appearing across every paper as unsolved. morphological analyzers are MSA-biased. dialectal data is underrepresented.
arabizi (romanized Arabic with numbers, like "3arabi" for عربي) is completely unaddressed in the tokenization literature i've seen.
the scaling question: can we find the crossover point where implicit morphology learning overtakes explicit morphological injection? that's a specific, answerable research question.
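on the arabizi thread above, a small sketch of the digit convention. the table covers only four widely used digit-to-letter substitutions. conventions vary by region and writer, the Latin letters are left untouched, and the "only inside tokens that contain a letter" rule is a heuristic assumed for this example.

```python
# common arabizi digit substitutions; not exhaustive, and usage varies
ARABIZI_DIGITS = {
    "2": "ء",  # hamza
    "3": "ع",  # ʿayn
    "5": "خ",  # khāʾ
    "7": "ح",  # ḥāʾ
}

def map_arabizi_digits(text):
    """Replace arabizi digits, but only in tokens that also contain a letter."""
    out = []
    for token in text.split(" "):
        if any(ch.isalpha() for ch in token):
            token = "".join(ARABIZI_DIGITS.get(ch, ch) for ch in token)
        out.append(token)  # bare numbers like "3" are left as numbers
    return " ".join(out)

print(map_arabizi_digits("3arabi"))
print(map_arabizi_digits("7abibi 3 times"))
```

this only resolves the digits. mapping the remaining Latin letters back to Arabic script is ambiguous (short vowels, doubled letters, dialect spellings) and is the harder part of the problem.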

this started as a curiosity from a linkedin post. now i have actual research questions and a direction that doesn't seem fully explored. more updates as i go.

papers referenced

alakeel, qwaider, aldarmaki, alqahtani, "morphemes without borders", LREC 2026, arXiv:2603.15773
alkaoud & syed, "on the importance of tokenization in Arabic embedding models", WANLP 2020, ACL Anthology
attia, "Arabic tokenization system", ~2007 (rule-based, finite-state)
alrefaie et al., "exploring tokenization strategies and vocabulary sizes for enhanced Arabic language models", arXiv:2403.11130, 2024