the morphemes without borders paper showed GPT-4o gets 97% on nonce Arabic root-pattern generation despite having terrible tokenizer alignment, while ALLAM drops to 20% despite better alignment and more Arabic training data. GPT-4o is learning morphology somewhere inside the model, and ALLAM isn't. where, and what do the representations look like?
the 2026 paper leaves an open question: morphological competence in LLMs is defined by productive generalization, not tokenizer alignment, but we don't know how the model achieves it. the authors suggest "compositional reasoning + instruction-following" as the mechanism, but that's a behavioral description, not a mechanistic one.
three facts make this tractable right now:
one caveat on model selection. GPT-2, LLaMA 3, and ALLAM differ in architecture, training data composition, and scale simultaneously — any comparison across them confounds all three. within this set, the LLaMA 3 family (1B→3B→8B) provides the cleanest scaling curve: same architecture, same training distribution, different sizes. the GPT-2 and ALLAM comparisons are still informative for cross-family patterns (e.g. whether probe accuracy and layer geometry generalize), but claims about scaling effects specifically should be grounded in the LLaMA curve.
build a stimulus set of ~200 nonce triliteral roots (consonant triplets that don't exist in Arabic, filtered against a lexicon) crossed with ~10 common patterns (fa3ala, maf3ūl, yaf3alu, fā3il, etc.). each stimulus is: "apply pattern X to root Y" → expected surface form. example:
the gold-standard dataset from Alakeel et al. is public (github). start there, extend with more patterns if needed.
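a minimal sketch of how the root × pattern cross could be generated, assuming a plain Latin transliteration. the consonant inventory, the template strings, and the helper names `nonce_roots` and `build_stimuli` are illustrative assumptions, not the format of the public dataset; weak consonants (w, y) are left out here so the templates apply without irregularities.

```python
import itertools
import random

# Assumed consonant inventory in Latin transliteration ("3" stands for ayn,
# as in the pattern names). Weak consonants w/y are deliberately excluded.
CONSONANTS = list("btjdrzsfqklmnh3")

# Patterns written as templates over the three root consonants C1, C2, C3.
PATTERNS = {
    "fa3ala": "{0}a{1}a{2}a",
    "maf3ūl": "ma{0}{1}ū{2}",
    "yaf3alu": "ya{0}{1}a{2}u",
    "fā3il": "{0}ā{1}i{2}",
}


def nonce_roots(lexicon_roots, n, seed=0):
    """Sample n consonant triplets that do not appear in lexicon_roots.

    lexicon_roots is a set of 3-tuples of consonants (the lexicon filter).
    """
    candidates = [
        root for root in itertools.permutations(CONSONANTS, 3)
        if root not in lexicon_roots
    ]
    return random.Random(seed).sample(candidates, n)


def build_stimuli(roots):
    """Cross every root with every pattern and record the expected form."""
    stimuli = []
    for root in roots:
        label = "-".join(root)
        for name, template in PATTERNS.items():
            stimuli.append({
                "root": label,
                "pattern": name,
                "prompt": f"apply pattern {name} to root {label}",
                "expected": template.format(*root),
            })
    return stimuli
```

a quick sanity check for the templates is to run a real root through them: `build_stimuli([("k", "t", "b")])` should give kataba, maktūb, and kātib for fa3ala, maf3ūl, and fā3il.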
for each stimulus:
Q1: where does root identity live?
train linear probes on the hidden states at each layer to classify the root (which of the 200 nonce roots produced this activation?). a high-accuracy probe at a given layer means root identity is linearly decodable there. compare across models.
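a minimal numpy sketch of the per-layer probe, assuming hidden states have already been collected into an array of shape (n_layers, n_stimuli, d_model). the probe here is plain softmax regression trained by gradient descent; the function names, the 80/20 split, and the hyperparameters are assumptions for illustration, and a library solver would be a reasonable replacement.

```python
import numpy as np


def train_linear_probe(X, y, n_classes, epochs=300, l2=1e-3):
    """Fit a softmax-regression probe by gradient descent. Returns (W, b)."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    # Conservative step size derived from the largest singular value of X.
    sigma2 = np.linalg.norm(X, 2) ** 2 / n
    step = 1.0 / (0.5 * (sigma2 + 1.0) + l2)
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= step * (X.T @ G + l2 * W)
        b -= step * G.sum(axis=0)
    return W, b


def probe_accuracy(W, b, X, y):
    """Fraction of rows whose arg-max class matches the label."""
    return float(((X @ W + b).argmax(axis=1) == y).mean())


def probe_all_layers(hidden, y, n_classes, train_frac=0.8, seed=0):
    """hidden: (n_layers, n_stimuli, d_model). Held-out accuracy per layer."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(hidden.shape[1])
    cut = int(train_frac * len(idx))
    tr, te = idx[:cut], idx[cut:]
    accs = []
    for H in hidden:
        # Standardize with training-split statistics only.
        mu, sd = H[tr].mean(axis=0), H[tr].std(axis=0) + 1e-8
        W, b = train_linear_probe((H[tr] - mu) / sd, y[tr], n_classes)
        accs.append(probe_accuracy(W, b, (H[te] - mu) / sd, y[te]))
    return accs
```

accuracy is reported on held-out stimuli, so each root needs several examples (for instance one per pattern) for the split to be meaningful. a shuffled-label run through the same function gives a chance-level baseline to compare against.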
Q2: where does pattern identity live?
same probe, different target: classify which pattern was applied. does pattern information appear at the same layers as root information? does it appear earlier or later?
Q3: are root and pattern disentangled?
for an open-weight model that succeeds at the task (i.e. one showing GPT-4o-like nonce accuracy), are the root and pattern representations in orthogonal subspaces of the hidden state? compute the cosine similarity between the root-probe weight vectors and the pattern-probe weight vectors (one vector per class) at each layer. low similarity = disentangled.
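a sketch of two ways to score the overlap, assuming each probe's weights are stored as a matrix with one row per class, shape (n_classes, d_model). `mean_abs_cosine` is the pairwise comparison described above; `subspace_overlap` is an added variant based on principal angles between the two row spaces, which is an assumption of this sketch rather than part of the plan.

```python
import numpy as np


def mean_abs_cosine(W_root, W_pat):
    """Mean |cosine| over all (root-class, pattern-class) weight-vector pairs."""
    A = W_root / np.linalg.norm(W_root, axis=1, keepdims=True)
    B = W_pat / np.linalg.norm(W_pat, axis=1, keepdims=True)
    return float(np.abs(A @ B.T).mean())


def _row_space_basis(W, tol=1e-10):
    """Orthonormal basis of the row space, dropping near-zero directions."""
    _, s, Vt = np.linalg.svd(W, full_matrices=False)
    return Vt[s > tol * s.max()]


def subspace_overlap(W_root, W_pat):
    """Mean squared cosine of the principal angles between the row spaces.

    0 means the subspaces are orthogonal; 1 means the smaller subspace
    lies entirely inside the larger one.
    """
    cosines = np.linalg.svd(
        _row_space_basis(W_root) @ _row_space_basis(W_pat).T,
        compute_uv=False,
    )
    return float(np.mean(cosines ** 2))
```

random directions in a high-dimensional space are already close to orthogonal (typical |cosine| on the order of 1/sqrt(d_model)), so "low similarity" is only informative relative to a baseline, e.g. the same scores computed for probes trained on shuffled labels.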
Q4: where does the model "figure it out"?
compare the hidden states of correct vs incorrect predictions within the same model. at which layer do the representations for correct and incorrect outputs diverge? this tells us where the computation succeeds or fails: is the failure early (bad encoding of root/pattern) or late (good encoding but bad output projection)?
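one simple way to locate the divergence, sketched under assumptions: score each layer by the distance between the mean hidden state of correct and incorrect stimuli, scaled by the within-group spread, and report the first layer above a threshold. the metric, the function names, and the threshold are illustrative choices, not something the source specifies.

```python
import numpy as np


def layerwise_divergence(hidden, correct):
    """hidden: (n_layers, n_stimuli, d_model); correct: bool, (n_stimuli,).

    Returns one score per layer: distance between the group means divided
    by the pooled within-group spread.
    """
    correct = np.asarray(correct, dtype=bool)
    scores = []
    for H in hidden:
        good, bad = H[correct], H[~correct]
        gap = np.linalg.norm(good.mean(axis=0) - bad.mean(axis=0))
        spread = np.sqrt(0.5 * (good.var(axis=0).sum() + bad.var(axis=0).sum()))
        scores.append(float(gap / (spread + 1e-12)))
    return scores


def first_divergent_layer(scores, threshold):
    """Index of the first layer whose score exceeds threshold, else None."""
    for layer, score in enumerate(scores):
        if score > threshold:
            return layer
    return None
```

the threshold needs calibrating per model; a permutation baseline (recompute the scores with the correct/incorrect labels shuffled) gives the score range expected when there is no real difference.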
Q5: the scaling question
if possible, run the same probes on GPT-2 small → medium → large (or LLaMA 3 1B → 3B → 8B). at what parameter count does the model transition from memorization (ALLAM-like, real roots only) to productive generalization (GPT-4o-like, nonce roots)? this answers the inflection-point question from the tokenizer writeup.
ember already has everything needed for the base inference. the probing pipeline needs two additions:
| model | params | why |
|---|---|---|
| GPT-2 small | 124M | baseline; English-centric, poor Arabic tokenizer alignment. tested indirectly in the 2026 paper |
| LLaMA 3 1B / 3B | 1B / 3B | middle of the pack in the 2026 paper. likely has some Arabic in pretraining |
| AraBERT / CAMeLBERT | ~110M | Arabic-specific BERT with good tokenizer alignment. expected to behave like ALLAM: good on real roots, bad on nonce? |
| LLaMA 3 8B | 8B | largest feasible on a consumer CPU. where does the inflection happen? |
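this is CPU-friendly by design. a single forward pass on GPT-2 small (124M params) takes ~50ms on modern x86. 200 stimuli × 12 layers × 2 snapshots = 4,800 activation vectors per model, and probe training on 4,800 samples takes seconds in numpy, so the whole pipeline for GPT-2 runs in under a minute. the full cross of ~200 roots × ~10 patterns is ~2,000 stimuli, i.e. ~48,000 vectors and roughly 100 seconds of forward passes at 50ms each, which is still small. LLaMA 3 8B with Q4_K quantization fits in ~5 GB of RAM, which is feasible on a laptop.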
GPT-4o's weights aren't accessible, so the probing plan as written works only on open-weight models. the 2026 paper's result (GPT-4o at 97% nonce accuracy) provides the ceiling. the probing experiment asks: do open-weight models at various scales show the same internal organization that presumably enables GPT-4o's performance? we're looking for the emergence of the right representations, not replicating GPT-4o.
this means morphological information isn't linearly decodable from individual layer activations; it may be distributed across layers, or encoded non-linearly. follow-up: non-linear probes (MLP), or inter-layer representational similarity analysis (CKA).
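for the CKA follow-up, a minimal sketch of linear CKA between the activations of two layers on the same stimuli, using the standard centered formulation. the function names and the (n_layers, n_stimuli, d_model) array layout are assumptions.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between X (n, d1) and Y (n, d2), rows aligned by stimulus."""
    X = X - X.mean(axis=0)  # center each feature
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / (norm_x * norm_y))


def cka_matrix(hidden):
    """hidden: (n_layers, n_stimuli, d_model). Symmetric layer-by-layer CKA."""
    n_layers = hidden.shape[0]
    M = np.eye(n_layers)
    for i in range(n_layers):
        for j in range(i + 1, n_layers):
            M[i, j] = M[j, i] = linear_cka(hidden[i], hidden[j])
    return M
```

linear CKA is invariant to rotations and uniform rescaling of either representation, so a block structure in `cka_matrix` points to groups of layers that hold similar information even when no single layer is linearly probeable.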
the model might be encoding the combined root+pattern as a single fused representation rather than disentangling them. this would mean the model learned a lookup table rather than a compositional rule, which is consistent with ALLAM's behavior (good on real roots, collapses on nonce).
this is the most actionable result. it means explicit morphological injection (tokenizer-side, embedding-side) might only help below ~1B parameters. above that, the model learns it. this quantifies the tradeoff and tells researchers when to invest in language-specific tokenization vs when to scale.