
tokenization in Arabic embedding models

On the Importance of Tokenization in Arabic Embedding Models

alkaoud & syed · WANLP 2020
Arabic NLP tokenization morphology embeddings

the claim

morphology-aware tokenization at the embedding layer improves Arabic NLP performance without retraining. the authors modified Word2Vec and BERT to use morpheme-level tokens instead of surface-word or BPE tokens, then evaluated on NER, sentiment, and POS tagging.

results

60% smaller vocab: morphology-aware tokenization collapsed surface forms into their constituent morphemes
better OOV handling: unseen words composed from known morphemes were representable
SOTA without retraining: the modified BERT achieved SOTA on two Arabic datasets despite using a frozen pretrained model
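
a minimal sketch of the first two points, not the authors' implementation. it assumes a toy clitic inventory in a schematic romanization (`PREFIXES`, `SUFFIXES`, `MIN_STEM`, and the tiny corpus are all hypothetical), and uses greedy affix stripping where a real system would use a proper morphological analyzer. it only illustrates how splitting shrinks the vocabulary and makes an unseen surface form representable.

```python
# Toy clitic inventory in a schematic romanization (hypothetical, for
# illustration only; a real system would use a morphological analyzer).
PREFIXES = ("wa", "al")   # roughly "and", "the"
SUFFIXES = ("ha", "at")   # roughly "her", feminine plural
MIN_STEM = 3              # refuse splits that leave a very short stem


def segment(word: str) -> list[str]:
    """Greedily peel known prefixes, then at most one known suffix."""
    prefixes: list[str] = []
    stripped = True
    while stripped:
        stripped = False
        for p in PREFIXES:
            if word.startswith(p) and len(word) - len(p) >= MIN_STEM:
                prefixes.append(p + "+")
                word = word[len(p):]
                stripped = True
                break
    suffixes: list[str] = []
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= MIN_STEM:
            suffixes.append("+" + s)
            word = word[: -len(s)]
            break
    return prefixes + [word] + suffixes


def representable(word: str, vocab: set[str]) -> bool:
    """True if every morpheme of `word` is already in `vocab`."""
    return all(m in vocab for m in segment(word))


# Hypothetical training corpus of surface forms.
corpus = ["kitab", "alkitab", "wakitab", "waalkitab",
          "qalam", "alqalam", "kitabha"]

surface_vocab = set(corpus)
morpheme_vocab = {m for w in corpus for m in segment(w)}

print(len(surface_vocab), len(morpheme_vocab))  # 7 5

unseen = "waalqalam"                             # never appears in the corpus
print(unseen in surface_vocab)                   # False
print(segment(unseen))                           # ['wa+', 'al+', 'qalam']
print(representable(unseen, morpheme_vocab))     # True
```

the gap between the two vocabularies widens as the corpus grows, because each new stem adds one morpheme entry but many surface forms.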

context

this worked because Arabic's morphological system has a finite (and relatively small) set of roots, patterns, and affixes. a morphology-aware tokenizer maps surface forms to these primitives, reducing the effective vocabulary from ~1M surface forms to ~20K morphemes. the embedding layer accounts for a large fraction of the parameter count in smaller models, so shrinking the embedding table while keeping semantic compositionality intact produces gains.
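
a back-of-the-envelope check of the embedding-table argument, using the rough vocabulary figures above. the embedding dimension `DIM = 300` is an assumption (a common Word2Vec-style size), not a number from the paper. the ratio between the two tables does not depend on it.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    """Parameter count of a dense embedding table: one dim-sized row per token."""
    return vocab_size * dim


DIM = 300  # assumed embedding dimension, not taken from the paper

surface_params = embedding_params(1_000_000, DIM)   # ~1M surface forms
morpheme_params = embedding_params(20_000, DIM)     # ~20K morphemes

print(surface_params)                    # 300000000
print(morpheme_params)                   # 6000000
print(surface_params / morpheme_params)  # 50.0
```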

hypothesis

why this matters alongside the 2026 results

the 2020 result (morphological injection helps) and the 2026 result (tokenizer alignment doesn't predict generation quality) are consistent if the key variable is model scale. at Word2Vec/mBERT scale, explicit morphological structure at the embedding layer compensates for the model's limited capacity to learn it implicitly. at GPT-4 scale, compositional reasoning and instruction-following substitute for explicit morphological parsing. the question that remains: where is the inflection point, and can you reach it without GPT-4's compute?