Morphemes Without Borders: Evaluating Root-Pattern Morphology in Arabic Tokenizers and LLMs
Alakeel, Qwaider, Aldarmaki, Alqahtani · LREC 2026
Arabic NLP · morphology · tokenization · LLM evaluation · arXiv:2603.15773

the core claim

token-morpheme alignment doesn't predict whether a model can generate Arabic root-pattern forms. a tokenizer that segments morphemes cleanly doesn't guarantee good generation, and one that over-segments badly (GPT-4) doesn't prevent it.

why Arabic morphology is a good test

Arabic uses a root-and-pattern (non-concatenative) system. consonantal roots combine with templatic vowel patterns to form words. example: root ktb (write) + pattern mafʿūl → maktūb (written). the root and pattern interleave; they aren't concatenated like prefix+stem+suffix in English, which makes Arabic a stress test for subword tokenizers (BPE, Unigram, WordPiece) built for concatenative morphology.
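
a minimal sketch of the interleaving, assuming a simplified template notation that is not from the paper: digits 1-3 mark the slots for the three root consonants, and everything else in the template is copied through. the function name and notation are made up for illustration, and it works on transliteration only.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Interleave a triliteral root into a template.

    Digits 1-3 in the pattern are slots for the root consonants
    (hypothetical notation); every other character is copied unchanged.
    """
    if len(root) != 3:
        raise ValueError("this sketch only handles triliteral roots")
    return "".join(root[int(ch) - 1] if ch in "123" else ch for ch in pattern)


print(apply_pattern("ktb", "ma12ū3"))  # maktūb (written)
print(apply_pattern("ktb", "1ā2i3"))   # kātib (writer)
```

note that no contiguous substring of maktūb equals the root ktb, so a tokenizer that can only cut a word into contiguous pieces can never isolate the root as one token.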

experimental design

part 1: tokenizer morphological alignment

measured how well each tokenizer's segments match gold-standard morpheme boundaries from CAMEL and Farasa analyzers, on MSA (ATB3) and dialectal (BOLT) Arabic. metrics:

part 2: morphological generation

three probing tasks using real roots and nonce (invented) roots:

models evaluated

ALLAM, FANAR, GPT-4, GPT-4o, LLaMA-3, Qwen-3, Cohere. FANAR uses MorphBPE (morphologically-informed tokenization); the rest use standard BPE/Unigram/WordPiece. zero-shot and one-shot prompts, tested in both English and Arabic.
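
a rough sketch of how the zero-shot and one-shot conditions could differ, assuming a simple root/pattern/word layout. the prompt wording and the `build_prompt` helper are hypothetical; the paper's actual prompts are not reproduced in these notes.

```python
def build_prompt(root, pattern, example=None):
    """Build an English prompt; pass `example` to make it one-shot.

    `example` is a (root, pattern, word) triple shown before the query.
    The instruction text is illustrative, not the paper's wording.
    """
    parts = [
        "Apply the Arabic morphological pattern to the root "
        "and reply with the resulting word only."
    ]
    if example is not None:
        ex_root, ex_pattern, ex_word = example
        parts.append(f"Root: {ex_root}\nPattern: {ex_pattern}\nWord: {ex_word}")
    parts.append(f"Root: {root}\nPattern: {pattern}\nWord:")
    return "\n\n".join(parts)


zero_shot = build_prompt("drs", "mafʿūl")
one_shot = build_prompt("drs", "mafʿūl", example=("ktb", "mafʿūl", "maktūb"))
print(one_shot)
```

the only difference between the two conditions is the single worked example, so any accuracy change can be attributed to that demonstration.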


key findings

finding 1

no correlation between tokenizer alignment and generation performance. GPT-4o scored highest across all tasks (97% nonce accuracy) with one of the worst alignment scores (17% boundary precision). ALLAM had the best MCR (83-86%) but fell to 20% on nonce words.

finding 2

Arabic-centric models can't handle nonce words. ALLAM and FANAR drop sharply on nonce roots, which suggests they rely on memorized lexemes rather than productive rules. FANAR's morphological tokenizer didn't help it generalize.
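
a sketch of how nonce roots could be constructed, assuming a simplified consonant inventory and a tiny stand-in lexicon of attested roots. both are placeholders, and the paper's actual construction procedure may apply further constraints.

```python
import random

# Simplified, ASCII-only consonant inventory (assumption, not the full Arabic set).
CONSONANTS = ["b", "t", "j", "d", "r", "z", "s", "f", "q", "k", "l", "m", "n", "h"]


def make_nonce_root(known_roots, rng):
    """Sample three distinct consonants until the result is not an attested root."""
    while True:
        root = "".join(rng.sample(CONSONANTS, 3))  # sampling without replacement
        if root not in known_roots:
            return root


known = {"ktb", "drs", "qtl"}  # stand-in for a real root lexicon
rng = random.Random(0)         # seeded so the probe set is reproducible
print([make_nonce_root(known, rng) for _ in range(3)])
```

a candidate only counts as nonce relative to the lexicon it is checked against, so with a lexicon this small some outputs will be real roots; a proper probe set needs a full root inventory.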

finding 3

English prompts work better than Arabic prompts. most models did worse when prompted in Arabic. probably a side effect of English-heavy instruction-tuning data.

finding 4

one-shot prompting helps weak models, not strong ones. GPT-4 and GPT-4o were flat across zero-shot and one-shot. LLaMA-3, Qwen-3, and Cohere improved with an example. they need in-context scaffolding to figure out the transformation.

Observation

finding 5

five error modes. pattern misapplication (root right, template wrong), root deformation (consonants changed), real-word substitution (outputting a valid word instead of applying the pattern), incorrect affix ordering, and partial truncation.
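
a heuristic sketch of how three of these error modes could be told apart automatically, assuming transliterated strings, a known expected form, and a small lexicon of real words. the `classify` function and its rule order are my own illustration, not the paper's annotation procedure, and it does not cover affix ordering or truncation.

```python
def contains_root(word, root):
    """True if the root consonants occur in order somewhere in the word."""
    chars = iter(word)
    return all(c in chars for c in root)  # `in` consumes the iterator, enforcing order


def classify(output, expected, root, lexicon):
    """Assign a coarse error label to one model output (heuristic)."""
    if output == expected:
        return "correct"
    if output in lexicon:
        return "real-word substitution"
    if not contains_root(output, root):
        return "root deformation"
    return "pattern misapplication"  # root intact, template wrong


lexicon = {"maktūb", "kātib"}        # stand-in for a real-word list
root, expected = "fkz", "mafkūz"     # placeholder root, treated as nonce for this example
for out in ["mafkūz", "fākiz", "maktūb", "mafkūs"]:
    print(out, "->", classify(out, expected, root, lexicon))
```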


what this means for Arabic NLP research

1. morphology-aware tokenizers may not be worth the complexity

language-specific tokenizers (MorphBPE, Splinter, etc.) are expensive to build. large-scale pretraining and instruction tuning seem to compensate for a poorly aligned tokenizer, and models without morphology-aware tokenization can even outperform models that have it. GPT-4o's tokenizer over-segments Arabic (fertility > 3, boundary precision 17%) yet the model gets nonce patterns right 97% of the time.
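
for reference, a sketch of what fertility and boundary precision measure, assuming the common definitions: fertility as average tokens per word, and boundary precision as the share of tokenizer cut points that fall on gold morpheme boundaries. the paper's exact formulations may differ, and the token split below is made up rather than taken from a real tokenizer.

```python
def boundaries(segments):
    """Character offsets where one segment ends and the next begins."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:  # the end of the last segment is the word edge
        pos += len(seg)
        offsets.add(pos)
    return offsets


def boundary_precision(tokens, morphemes):
    """Share of tokenizer cut points that coincide with gold morpheme boundaries."""
    if "".join(tokens) != "".join(morphemes):
        raise ValueError("both segmentations must spell the same word")
    predicted, gold = boundaries(tokens), boundaries(morphemes)
    if not predicted:
        return None  # word kept whole: precision is undefined
    return len(predicted & gold) / len(predicted)


def fertility(tokenized_words):
    """Average number of tokens per word."""
    return sum(len(tokens) for tokens in tokenized_words) / len(tokenized_words)


gold = ["wa", "kataba", "hā"]    # wa-kataba-hā, "and he wrote it"
tokens = ["wak", "ataba", "hā"]  # hypothetical subword split
print(boundary_precision(tokens, gold))  # 0.5
print(fertility([tokens, ["kitāb"]]))    # 2.0
```

one of the two cut points (before hā) lands on a morpheme boundary, hence 0.5. a tokenizer with high fertility makes many cuts, and most of them landing mid-morpheme is what drags precision down.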

2. instruction-following matters more than tokenizer design

compositional reasoning plus instruction-following appears to substitute for explicit morphological parsing. models that follow instructions well apply morphological rules consistently; models that can't fail regardless of tokenizer quality.

3. Arabic-centric models have room to grow

ALLAM and FANAR underperformed GPT-4 and GPT-4o on every task despite more Arabic training data. scale and tuning methodology seem to outweigh language-specific data advantages in current LLMs.

4. real-word-only benchmarks hide failure modes

ALLAM scored 67% on real roots, 20% on nonce. a 47-point gap. nonce probes are the only way to tell memorization from genuine productivity. benchmarks without nonce conditions give an incomplete picture.
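
a small sketch of reporting the real-vs-nonce gap, assuming each evaluation record is a (condition, correct) pair. the records below are toy values for illustration, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Map each condition ("real", "nonce") to its accuracy."""
    hits, totals = defaultdict(int), defaultdict(int)
    for condition, correct in records:
        totals[condition] += 1
        hits[condition] += int(correct)
    return {c: hits[c] / totals[c] for c in totals}


# Toy records, made up for the example.
records = [
    ("real", True), ("real", True), ("real", False),
    ("nonce", True), ("nonce", False), ("nonce", False), ("nonce", False),
]
acc = accuracy_by_condition(records)
gap_points = 100 * (acc["real"] - acc["nonce"])
print(acc, f"gap: {gap_points:.0f} points")
```

reporting the two conditions separately, rather than one pooled accuracy, is what makes a memorization gap visible at all.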

5. future directions


limitations