Morphemes Without Borders: Evaluating Root-Pattern Morphology in Arabic Tokenizers and LLMs

Alakeel, Qwaider, Aldarmaki, Alqahtani · LREC 2026
tags: Arabic NLP, morphology, tokenization, LLM evaluation · arXiv:2603.15773

the core claim

token-morpheme alignment doesn't predict whether a model can generate Arabic root-pattern forms. a tokenizer that segments morphemes cleanly doesn't guarantee good generation, and one that over-segments badly (GPT-4) doesn't prevent it.

why Arabic morphology is a good test

Arabic uses a root-and-pattern (non-concatenative) system: consonantal roots combine with templatic vowel patterns to form words. example: root k-t-b (write) + pattern mafʿūl → maktūb (written). the root and pattern interleave rather than concatenate like prefix+stem+suffix in English, which makes Arabic a stress test for subword tokenizers (BPE, Unigram, WordPiece) built for concatenative morphology.
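
a minimal sketch of the interleaving, assuming a simplified transliteration where each root consonant is a single symbol and the template marks root slots with the digits 1, 2, 3. the name apply_pattern and the digit-slot notation are illustrative, not from the paper.

```python
from typing import Sequence


def apply_pattern(root: Sequence[str], template: str) -> str:
    """Fill the slots 1, 2, 3 of a template with the consonants of a triliteral root.

    Assumption: slots are written as the digits 1-3; every other character
    of the template is copied through unchanged.
    """
    if len(root) != 3:
        raise ValueError("expected a triliteral root")
    out = []
    for ch in template:
        if ch in "123":
            out.append(root[int(ch) - 1])  # slot -> root consonant
        else:
            out.append(ch)                 # pattern material (vowels, affixes)
    return "".join(out)


print(apply_pattern("ktb", "ma12ū3"))  # maktūb (written)
print(apply_pattern("ktb", "1ā2i3"))   # kātib (writer)
```

note that the string "ktb" never appears as one contiguous substring of either output, which is why a tokenizer that can only cut contiguous pieces has no clean way to isolate the root.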

experimental design

part 1: tokenizer morphological alignment

measured how well each tokenizer's segments match gold-standard morpheme boundaries from CAMEL and Farasa analyzers, on MSA (ATB3) and dialectal (BOLT) Arabic. metrics:

fertility: tokens per word
morpheme F₁: exact morpheme match
boundary F₁: boundary detection precision/recall
MCR: morpheme coverage rate (avoids internal splits)
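
a rough sketch of how these four metrics could be computed. the definitions below are my reading of the one-line descriptions above (micro-averaged over words, MCR counted as the share of gold morphemes with no token boundary strictly inside them); the paper's exact formulas may differ, and all function names are hypothetical.

```python
def boundaries(segments):
    """Internal cut positions (character offsets) of a segmentation."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts


def spans(segments):
    """(start, end) character spans of each segment."""
    out, pos = set(), 0
    for seg in segments:
        out.add((pos, pos + len(seg)))
        pos += len(seg)
    return out


def f1(tp, n_pred, n_gold):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if tp == 0:
        return 0.0
    p, r = tp / n_pred, tp / n_gold
    return 2 * p * r / (p + r)


def alignment_scores(pairs):
    """pairs: list of (tokens, gold_morphemes), both spelling the same word."""
    n_tokens = covered = 0
    b_tp = b_pred = b_gold = 0
    m_tp = m_pred = m_gold = 0
    for tokens, gold in pairs:
        n_tokens += len(tokens)
        tb, gb = boundaries(tokens), boundaries(gold)
        b_tp, b_pred, b_gold = b_tp + len(tb & gb), b_pred + len(tb), b_gold + len(gb)
        ts, gs = spans(tokens), spans(gold)
        m_tp, m_pred, m_gold = m_tp + len(ts & gs), m_pred + len(ts), m_gold + len(gs)
        # assumed MCR: a gold morpheme counts as covered when no token cut falls inside it
        covered += sum(1 for s, e in gs if not any(s < c < e for c in tb))
    return {
        "fertility": n_tokens / len(pairs),
        "boundary_f1": f1(b_tp, b_pred, b_gold),
        "morpheme_f1": f1(m_tp, m_pred, m_gold),
        "mcr": covered / m_gold,
    }


# toy transliterated words, invented for illustration only
toy = [
    (["wak", "atab", "ha"], ["wa", "katab", "ha"]),
    (["maktub"], ["maktub"]),
]
print(alignment_scores(toy))
# {'fertility': 2.0, 'boundary_f1': 0.5, 'morpheme_f1': 0.5, 'mcr': 0.75}
```

in the toy data the first word has one misplaced cut (after "wak" instead of "wa"), which costs one boundary, two exact morpheme matches, and one covered morpheme.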

part 2: morphological generation

three probing tasks using real roots and nonce (invented) roots:

root-pattern real: apply a pattern to a real triliteral root
root-pattern nonce: same task with made-up roots (tests generalization, not memorization)
affix-build: concatenative affix ordering on a base word
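
a sketch of how the nonce condition and the zero-shot/one-shot prompts might be set up. everything here is an assumption for illustration: the consonant inventory is a simplified one-letter transliteration, the known-root lexicon is a stand-in, the sampler ignores Arabic phonotactic constraints on which consonants can co-occur, and the prompt wording is not the paper's.

```python
import random

# hypothetical, simplified consonant inventory (one Latin letter per consonant)
CONSONANTS = list("btjdrzsfqklmnhw")


def nonce_roots(known_roots, n, seed=0):
    """Sample n triliteral roots that are absent from a lexicon of real roots."""
    rng = random.Random(seed)  # seeded so the probe set is reproducible
    out = []
    while len(out) < n:
        cand = "".join(rng.sample(CONSONANTS, 3))  # three distinct consonants
        if cand not in known_roots and cand not in out:
            out.append(cand)
    return out


def build_prompt(root, pattern, example=None):
    """Zero-shot prompt, or one-shot when an (root, pattern, answer) example is given."""
    parts = ["Apply the pattern to the root and answer with the resulting word only."]
    if example is not None:
        ex_root, ex_pattern, ex_answer = example
        parts.append(f"Root: {ex_root}\nPattern: {ex_pattern}\nWord: {ex_answer}")
    parts.append(f"Root: {root}\nPattern: {pattern}\nWord:")
    return "\n\n".join(parts)


known = {"ktb", "drs"}  # stand-in for a real root lexicon
for root in nonce_roots(known, 3):
    print(build_prompt(root, "mafʿūl", example=("ktb", "mafʿūl", "maktūb")))
    print("---")
```

filtering against a real lexicon is what makes the nonce condition a test of productivity: a model can't retrieve the answer for a root it has never seen.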

models evaluated

ALLAM, FANAR, GPT-4, GPT-4o, LLaMA-3, Qwen-3, Cohere. FANAR uses MorphBPE (morphologically-informed tokenization); the rest use standard BPE/Unigram/WordPiece. zero-shot and one-shot prompts, tested in both English and Arabic.

key findings

Key Finding

finding 1

no correlation between tokenizer alignment and generation performance. GPT-4o scored highest across all tasks (97% nonce accuracy) with one of the worst alignment scores (17% boundary precision). ALLAM had the best MCR (83-86%) but fell to 20% on nonce words.
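
one way to check a claim like this is a rank correlation between per-model alignment scores and per-model generation accuracy. the sketch below is a plain-Python Spearman correlation with average ranks for ties; whether the paper used Spearman, Pearson, or something else is not stated in these notes, so treat the choice as an assumption. the inputs are sanity-check sequences, not results from the paper.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (vx * vy) ** 0.5


# sanity checks on synthetic sequences (not model scores)
print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
print(spearman([1, 2, 3, 4], [40, 30, 20, 10]))  # -1.0
```

with only seven models, any rho should be read loosely; a value near zero is consistent with the finding but a sample this small can't pin it down.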

finding 2

Arabic-centric models can't handle nonce words. ALLAM and FANAR drop sharply on nonce roots, which suggests they memorized lexemes rather than learning productive rules. FANAR's morphological tokenizer didn't help it generalize.

finding 3

English prompts work better than Arabic prompts. most models did worse when prompted in Arabic. probably a side effect of English-heavy instruction-tuning data.

finding 4

one-shot prompting helps weak models, not strong ones. GPT-4 and GPT-4o were flat across zero-shot and one-shot. LLaMA-3, Qwen-3, and Cohere improved with an example. they need in-context scaffolding to figure out the transformation.

Observation

finding 5

five error modes. pattern misapplication (root right, template wrong), root deformation (consonants changed), real-word substitution (outputting a valid word instead of applying the pattern), incorrect affix ordering, and partial truncation.

what this means for Arabic NLP research

1. morphology-aware tokenizers may not be worth the complexity

language-specific tokenizers (MorphBPE, Splinter, etc.) are expensive to build. large-scale pretraining and instruction tuning seem to compensate for a poorly aligned tokenizer, and models built that way can even outperform the morphology-aware ones. GPT-4o's tokenizer over-segments Arabic (fertility > 3, boundary precision 17%) yet nails nonce patterns 97% of the time.

2. instruction-following matters more than tokenizer design

compositional reasoning + instruction-following appears to substitute for explicit morphological parsing. models that follow instructions apply morphological rules consistently. models that can't follow instructions fail regardless of tokenizer quality.

3. Arabic-centric models have room to grow

ALLAM and FANAR underperformed GPT-4 and GPT-4o on every task despite more Arabic training data. scale and tuning methodology seem to outweigh language-specific data advantages in current LLMs.

4. real-word-only benchmarks hide failure modes

ALLAM scored 67% on real roots and 20% on nonce roots, a 47-point gap. nonce probes are what separate memorization from genuine productivity, so benchmarks without a nonce condition give an incomplete picture.
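
the gap itself is simple arithmetic. a small sketch, assuming exact-match scoring (the notes don't say how outputs were graded) and using hypothetical function names:

```python
def accuracy(predictions, references):
    """Exact-match accuracy in percent (exact match is an assumption here)."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


def real_nonce_gap(real_acc, nonce_acc):
    """Percentage-point drop from real roots to nonce roots."""
    return real_acc - nonce_acc


# ALLAM's figures as reported in these notes
print(real_nonce_gap(67.0, 20.0))  # 47.0
```

a large positive gap points at memorized lexemes; a gap near zero means the model does about as well on roots it can't have seen.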

5. future directions

tokenization-agnostic morphology learning: teach LLMs morphology through data and tuning, not custom tokenizers
adaptive/hybrid tokenization: character-level processing only when morphologically needed
controlled experiments: isolate tokenizer design from architecture, data, and tuning

limitations

generation task conflates morphology with instruction-following; models that can't follow format instructions score lower regardless of morphological competence
the comparison is correlational: architecture, data, and tuning are confounded with tokenizer choice, so neither a link nor its absence can be read causally
only 7 models and 13 patterns: a small sample for claims about correlation
dialectal Arabic (BOLT) tested for alignment but not generation