token-morpheme alignment doesn't predict whether a model can generate Arabic root-pattern forms. a tokenizer that segments morphemes cleanly doesn't guarantee good generation, and one that over-segments badly (GPT-4) doesn't prevent it.
Arabic uses a root-and-pattern (non-concatenative) system. consonantal roots combine with templatic vowel patterns to form words. example: root ktb (write) + pattern mafʿūl → maktūb (written). the root and pattern interleave; they aren't concatenated like prefix+stem+suffix in English. that makes Arabic a stress test for subword tokenizers (BPE, Unigram, WordPiece), which were built for concatenative morphology.
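a minimal sketch of the interleaving described above. the numbered-slot template notation (1, 2, 3 for the root consonants) and the name apply_pattern are illustrative assumptions, not notation from the study. it also assumes each root consonant is a single character in the transliteration.

```python
def apply_pattern(root: str, template: str) -> str:
    """Fill the numbered slots of a template with the root's consonants.

    Hypothetical notation: '1', '2', '3' mark where the first, second and
    third root consonants go; every other character is copied unchanged.
    """
    out = []
    for ch in template:
        if ch in "123":
            out.append(root[int(ch) - 1])  # slot -> matching root consonant
        else:
            out.append(ch)  # vowel or affix material from the pattern
    return "".join(out)


# mafʿūl (passive participle) written with slots instead of f-ʿ-l
print(apply_pattern("ktb", "ma12ū3"))  # maktūb
# fāʿil (active participle) with the same root
print(apply_pattern("ktb", "1ā2i3"))   # kātib
```

the root consonants k, t, b end up non-adjacent in the output, so a subword tokenizer has no contiguous "root" string to isolate.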
measured how well each tokenizer's segments match gold-standard morpheme boundaries from the CAMEL and Farasa analyzers, on MSA (ATB3) and dialectal (BOLT) Arabic. tokenizer metrics cited in the findings below include boundary precision, MCR, and fertility.
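two of the tokenizer metrics that appear in the findings, boundary precision and fertility, can be sketched as below. the definitions are common ones and are assumptions here: the study's exact formulas may differ. function names are hypothetical, and the example word is a simplified ASCII transliteration.

```python
def boundaries(segments):
    """Internal character offsets at which a segmentation splits a word."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:  # the end of the word is not a boundary
        pos += len(seg)
        offsets.add(pos)
    return offsets


def boundary_precision(token_segs, gold_segs):
    """Share of tokenizer boundaries that are also gold morpheme boundaries."""
    pred = boundaries(token_segs)
    if not pred:
        return None  # single-token word: precision is undefined
    return len(pred & boundaries(gold_segs)) / len(pred)


def fertility(tokenized_words):
    """Average number of tokens per word."""
    return sum(len(toks) for toks in tokenized_words) / len(tokenized_words)


gold = ["wa", "sa", "yaktub", "una", "ha"]          # morpheme segmentation
tokens = ["was", "ay", "akt", "ub", "una", "ha"]    # made-up tokenizer output
print(boundary_precision(tokens, gold))  # 0.4
print(fertility([tokens, ["kitab"]]))    # 3.5
```

high fertility with low boundary precision is what "over-segmenting" looks like: many cuts, few of them at morpheme edges.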
three probing tasks using real roots and nonce (invented) roots:
models tested: ALLAM, FANAR, GPT-4, GPT-4o, LLaMA-3, Qwen-3, Cohere. FANAR uses MorphBPE (morphologically-informed tokenization); the rest use standard BPE/Unigram/WordPiece. zero-shot and one-shot prompts, each written in both English and Arabic.
finding 1
no correlation between tokenizer alignment and generation performance. GPT-4o scored highest across all tasks (97% nonce accuracy) with one of the worst alignment scores (17% boundary precision). ALLAM had the best MCR (83-86%) but fell to 20% on nonce words.
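one way to check a claim like this is a rank correlation between per-model alignment scores and per-model task accuracy. the sketch below is a plain-Python Spearman correlation; it is an assumption that a rank statistic is appropriate here, since these notes don't say which test the study used. no study data is included.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

usage would be spearman(alignment_scores, nonce_accuracies) with one entry per model. with only seven models, any coefficient has wide uncertainty, so a value near zero is weak evidence rather than proof of independence.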
finding 2
Arabic-centric models can't handle nonce words. ALLAM and FANAR drop sharply on nonce roots, which points to memorized lexemes rather than productive rules. FANAR's morphological tokenizer didn't help it generalize.
finding 3
English prompts work better than Arabic prompts. most models did worse when prompted in Arabic. probably a side effect of English-heavy instruction-tuning data.
finding 4
one-shot prompting helps weak models, not strong ones. GPT-4 and GPT-4o were flat across zero-shot and one-shot. LLaMA-3, Qwen-3, and Cohere improved with an example; the weaker models need in-context scaffolding to figure out the transformation.
finding 5
five error modes. pattern misapplication (root right, template wrong), root deformation (consonants changed), real-word substitution (outputting a valid word instead of applying the pattern), incorrect affix ordering, and partial truncation.
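a rough heuristic for sorting outputs into three of these modes. everything here is an assumption for illustration: the function names, the label strings, the order of the checks, and the use of a word list to detect real-word substitution. affix ordering and truncation are not handled, and the study's own annotation procedure is not described in these notes.

```python
def is_subsequence(needle, haystack):
    """True if the characters of needle occur in haystack in the same order."""
    it = iter(haystack)
    return all(ch in it for ch in needle)  # 'in' consumes the iterator


def classify_output(root, expected, output, lexicon):
    """Assign a coarse error label to a model output (heuristic sketch)."""
    if output == expected:
        return "correct"
    if output in lexicon:
        # a valid existing word was produced instead of the target form
        return "real_word_substitution"
    if is_subsequence(root, output):
        # root consonants survive in order, so the template must be wrong
        return "pattern_misapplication"
    # at least one root consonant was changed, dropped or reordered
    return "root_deformation"


lexicon = {"kātib"}  # toy word list
print(classify_output("ktb", "maktūb", "maktūb", lexicon))  # correct
print(classify_output("ktb", "maktūb", "kātib", lexicon))   # real_word_substitution
print(classify_output("ktb", "maktūb", "kutayb", lexicon))  # pattern_misapplication
print(classify_output("ktb", "maktūb", "maktūd", lexicon))  # root_deformation
```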
language-specific tokenizers (MorphBPE, Splinter, etc.) are expensive to build. large-scale pretraining and instruction tuning seem to compensate for their absence, and models without them can even outperform models that have them. GPT-4o's tokenizer over-segments Arabic (fertility > 3, boundary precision 17%) yet nails nonce patterns 97% of the time.
compositional reasoning + instruction-following replaces explicit morphological parsing. models that follow instructions apply morphological rules consistently. models that can't follow instructions fail regardless of tokenizer quality.
ALLAM and FANAR underperformed GPT-4 and GPT-4o on every task despite more Arabic training data. scale and tuning methodology seem to outweigh language-specific data advantages in current LLMs.
ALLAM scored 67% on real roots, 20% on nonce. a 47-point gap. nonce probes are the only way to tell memorization from genuine productivity. benchmarks without nonce conditions give an incomplete picture.
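a sketch of the two pieces a nonce condition needs: invented roots that are absent from a known-root list, and the real-vs-nonce gap in percentage points. the consonant inventory is a simplified ASCII stand-in, the function names are hypothetical, and Arabic phonotactic constraints on root shape are ignored, so this is not how the study built its probes.

```python
import random

CONSONANTS = "btjdrzsfqklmnh"  # simplified inventory (assumption)


def make_nonce_roots(known_roots, n, seed=0):
    """Sample n triconsonantal roots that are not in known_roots.

    Loops forever if n exceeds the number of available unseen roots.
    """
    rng = random.Random(seed)  # seeded so the probe set is reproducible
    nonce = set()
    while len(nonce) < n:
        candidate = "".join(rng.sample(CONSONANTS, 3))  # 3 distinct consonants
        if candidate not in known_roots:
            nonce.add(candidate)
    return sorted(nonce)


def memorization_gap(real_acc, nonce_acc):
    """Real-root accuracy minus nonce-root accuracy, in percentage points."""
    return real_acc - nonce_acc


print(memorization_gap(67, 20))  # 47, the ALLAM figures quoted above
```

a gap near zero means the model applies the pattern equally well to unseen roots. a large positive gap means real-root accuracy is being carried by memorized forms.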