arabic tokenization system

a modular, finite-state tokenizer for Arabic built on top of the XFST platform (Beesley & Karttunen, 2003). unlike statistical tokenizers that learn from data, this is a hand-engineered system with explicit rules for clitic detection, morphological analysis, and token disambiguation.

architecture

three tokenization models

model 1: tokenization without a morphological analyzer. uses a clitic guesser, a finite-state transducer with hand-written patterns for proclitic prefixes and enclitic suffixes. strips all possible clitic combinations. a separate small-scale lexc transducer handles detached clitic forms. this is robust, works on any word regardless of dictionary coverage, and modular, tokenization and morphology are independent components. the tradeoff: the guesser is non-deterministic, producing combinatorial ambiguity (a three-word sentence can yield 8 tokenization paths).

model 2: tokenization with morphological analyzer hints. the stem is constrained to entries in the morphological analyzer. not implemented in the paper; described as a future refinement that would reduce ambiguity at the cost of requiring lexicon coverage.

model 3: tokenization dependent on the morphological analyzer. the stem is constrained to actual entries in the morphological analyzer's lexicon rather than guessed. reduces ambiguity but loses robustness, unknown words can't be tokenized.

key engineering details

preprocessing: whitespace normalization

Arabic text uses multiple whitespace characters (spaces, newlines, Arabic-specific space variants) inconsistently, especially around punctuation and clitics. a finite-state normalizer collapses whitespace to a canonical form before tokenization runs.

clitic demarcation with the cashida

the Arabic elongation line (cashida, ـ) is repurposed as a disambiguation marker. when the tokenizer splits enclitics, it inserts a cashida between the stem and the clitic. this resolves ambiguity between proclitics and homographic enclitics, and between morphologically-attached clitics and free-standing homographs. example: kitabuhum (their book) becomes kitab@ـhum, where the cashida signals the pronoun is attached, not a separate word. clever use of existing typographic infrastructure.

post-processing: token filter

a finite-state transducer composed on top of the core tokenizer marks undesired tokenizations with a +undesired tag. it imports lists from the MWE transducer (curfew → forbidden + walking) and a hand-curated set of improbable splits (e.g., ba'd (after) being split as bi + 'add). the filter doesn't query for morpho-syntactic information, it matches surface forms against known undesirable patterns.

multiword expressions

MWEs like hazr al-tajawwul (curfew, lit. "forbidding of walking") are handled by keeping internal spaces intact, the tokenizer outputs them as single tokens. a dedicated MWE transducer supplies the list of expressions, and the token filter blocks the compositional analyses that a non-deterministic tokenizer would otherwise produce.

Observation

key insight

tokenization can be modular. by separating clitic detection (guesser) from morphological analysis (lexicon), the system gains robustness and maintainability. the guesser handles unknown words; the filter cleans up the noise. this is the opposite of "tokenization baked into the model" approaches and prefigures the modular NLP pipeline debates still active today.

relevance today

finite-state approaches are alive in constrained settings

neural tokenizers have largely replaced hand-written finite-state systems for Arabic. but the structural problems haven't changed: Arabic's clitic stacking still over-segments BPE tokenizers (GPT-4o's boundary precision was 17%), and the guess-and-filter architecture anticipates modern two-stage pipelines (over-generate + rerank). the cashida trick is still used in Arabic NLP preprocessing pipelines.

the modularity lesson

the cleanest insight from this paper isn't any specific algorithm, it's the argument that tokenization should be separable from morphological analysis. the guesser handles the surface string problem (where do the clitic boundaries go?); the analyzer handles the linguistic problem (what does this stem mean?). conflating them produces systems that fail on unknown words. this lesson didn't make it into neural NLP, where tokenization is baked into model architectures and changed only by retraining.