
probing Arabic morphology inside LLMs

a research plan · 2026-05-16
Open Question

the morphemes without borders paper showed GPT-4o gets 97% on nonce Arabic root-pattern generation despite having terrible tokenizer alignment, while ALLAM drops to 20% despite better alignment and more Arabic training data. GPT-4o is learning morphology somewhere inside the model, and ALLAM isn't. where, and what do the representations look like?

motivation

the 2026 paper leaves an open question: morphological competence in LLMs is defined by productive generalization, not tokenizer alignment, but we don't know how the model achieves it. the authors suggest "compositional reasoning + instruction-following" as the mechanism, but that's a behavioral description, not a mechanistic one.

three facts make this tractable right now:

  1. ember gives direct access to hidden states. after every transformer block, i can call backend.data(&x) and read the activations. no hooks, no CUDA synchronization, no framework overhead.
  2. behavioral probing is cheap. i don't need to train probes on 100K examples. the nonce root-pattern task from the 2026 paper gives a clean ground-truth signal: feed a root + pattern, check if the output is correct. probe at every layer.
  3. the comparison writes itself. run the same probe on GPT-2 (terrible tokenizer alignment, unknown Arabic performance), LLaMA 3 (tested in the 2026 paper, middle of the pack), and a dedicated Arabic model (ALLAM if accessible, otherwise AraBERT). layer-by-layer, same stimuli.

one caveat on model selection. GPT-2, LLaMA 3, and ALLAM differ in architecture, training data composition, and scale simultaneously, so any comparison across them confounds all three. within this set, the LLaMA 3 family (1B→3B→8B) provides the cleanest scaling curve: closely related architecture and training data, different sizes. one wrinkle: the 1B and 3B checkpoints are Llama 3.2 releases, pruned and distilled from larger models rather than pretrained from scratch, so even this curve is not perfectly controlled. the GPT-2 and ALLAM comparisons are still informative for cross-family patterns (e.g. whether probe accuracy and layer geometry generalize), but claims about scaling effects specifically should be grounded in the LLaMA curve.

experimental design

stimuli: root-pattern nonce pairs

build a stimulus set of ~200 nonce triliteral roots (consonant triplets that don't exist in Arabic, filtered against a lexicon) crossed with ~10 common patterns (fa3ala, maf3ūl, yaf3alu, fā3il, etc.). each stimulus is: "apply pattern X to root Y" → expected surface form. example, using the real root k-t-b as a stand-in for a nonce triplet: fā3il + k-t-b → kātib.

the gold-standard dataset from Alakeel et al. is public (github). start there, extend with more patterns if needed.
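
the root × pattern crossing above can be sketched in a few lines of Rust. this is a minimal illustration, not the dataset's actual format: `Stimulus`, `apply_pattern`, and `build_stimuli` are hypothetical names, roots are given in transliteration, and the sketch assumes the letters f, 3, l occur in a template only as placeholders for the three root consonants.

```rust
use std::collections::HashSet;

/// One probing stimulus: a root, a pattern, and the expected surface form.
#[derive(Debug, Clone, PartialEq)]
struct Stimulus {
    root: String,
    pattern: String,
    expected: String,
}

/// Fill a template written in the f-3-l convention with a triliteral root.
/// Assumption: 'f', '3', 'l' appear only as placeholders for C1, C2, C3.
fn apply_pattern(pattern: &str, root: [&str; 3]) -> String {
    let mut out = String::new();
    for ch in pattern.chars() {
        match ch {
            'f' => out.push_str(root[0]),
            '3' => out.push_str(root[1]),
            'l' => out.push_str(root[2]),
            other => out.push(other),
        }
    }
    out
}

/// Cross every root that is absent from the lexicon with every pattern.
fn build_stimuli(
    roots: &[[&str; 3]],
    patterns: &[&str],
    lexicon: &HashSet<String>,
) -> Vec<Stimulus> {
    let mut stimuli = Vec::new();
    for root in roots {
        let key = root.join("-");
        if lexicon.contains(&key) {
            continue; // attested root, so not a nonce
        }
        for pattern in patterns {
            stimuli.push(Stimulus {
                root: key.clone(),
                pattern: pattern.to_string(),
                expected: apply_pattern(pattern, *root),
            });
        }
    }
    stimuli
}
```

with the real root k-t-b, `apply_pattern("maf3ūl", ["k", "t", "b"])` gives `maktūb`. patterns that contain a literal f or l would need an explicit placeholder syntax instead of this character mapping.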

probing setup

for each stimulus:

  1. run a forward pass through the model with caching disabled (we want hidden states at every block, not just the final logits)
  2. extract hidden states after each transformer block, specifically after the attention residual add and after the MLP residual add (two snapshots per layer)
  3. extract the logits for the final token position
  4. record correctness: does the greedy-decoded continuation match the expected surface form? (a surface form usually spans several tokens, so a single argmax token is not enough to decide this.)

analysis questions

Open Question

Q1: where does root identity live?

train linear probes on the hidden states at each layer to classify the root (which of the 200 nonce roots produced this activation?). high probe accuracy on held-out stimuli at a given layer, measured against a chance rate of 1/200, means root identity is linearly decodable there. compare across models.
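
a linear probe here is just multinomial logistic regression on the captured activations. the sketch below is a std-only stand-in, assuming plain full-batch gradient descent with no regularization; `LinearProbe` and its methods are hypothetical names, and a real run would more likely use an existing library and report accuracy on held-out stimuli.

```rust
/// Multinomial logistic regression probe; weights[class][dim].
struct LinearProbe {
    weights: Vec<Vec<f64>>,
    bias: Vec<f64>,
}

impl LinearProbe {
    fn new(n_classes: usize, dim: usize) -> Self {
        LinearProbe { weights: vec![vec![0.0; dim]; n_classes], bias: vec![0.0; n_classes] }
    }

    /// One linear score per class.
    fn logits(&self, x: &[f64]) -> Vec<f64> {
        self.weights
            .iter()
            .zip(&self.bias)
            .map(|(w, b)| w.iter().zip(x).map(|(wi, xi)| wi * xi).sum::<f64>() + b)
            .collect()
    }

    /// Numerically stable softmax over the logits.
    fn probs(&self, x: &[f64]) -> Vec<f64> {
        let z = self.logits(x);
        let max = z.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        let exps: Vec<f64> = z.iter().map(|v| (v - max).exp()).collect();
        let sum: f64 = exps.iter().sum();
        exps.iter().map(|e| e / sum).collect()
    }

    fn predict(&self, x: &[f64]) -> usize {
        let z = self.logits(x);
        let mut best = 0;
        for (i, v) in z.iter().enumerate() {
            if *v > z[best] {
                best = i;
            }
        }
        best
    }

    /// Full-batch gradient descent on the mean cross-entropy loss.
    fn fit(&mut self, xs: &[Vec<f64>], ys: &[usize], lr: f64, epochs: usize) {
        let n = xs.len() as f64;
        for _ in 0..epochs {
            let mut gw = vec![vec![0.0; self.weights[0].len()]; self.weights.len()];
            let mut gb = vec![0.0; self.bias.len()];
            for (x, &y) in xs.iter().zip(ys) {
                let p = self.probs(x);
                for c in 0..p.len() {
                    let err = p[c] - if c == y { 1.0 } else { 0.0 };
                    gb[c] += err;
                    for (g, xi) in gw[c].iter_mut().zip(x) {
                        *g += err * xi;
                    }
                }
            }
            for c in 0..gb.len() {
                self.bias[c] -= lr * gb[c] / n;
                for (w, g) in self.weights[c].iter_mut().zip(&gw[c]) {
                    *w -= lr * g / n;
                }
            }
        }
    }

    fn accuracy(&self, xs: &[Vec<f64>], ys: &[usize]) -> f64 {
        let mut hits = 0;
        for (x, &y) in xs.iter().zip(ys) {
            if self.predict(x) == y {
                hits += 1;
            }
        }
        hits as f64 / xs.len() as f64
    }
}
```

one probe is fit per layer snapshot, with `xs` holding that layer's activation vectors and `ys` the root (or pattern) index of each stimulus. the per-class rows of `weights` are the vectors a later subspace comparison would use.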

Open Question

Q2: where does pattern identity live?

same probe, different target: classify which pattern was applied. does pattern information appear at the same layers as root information? does it appear earlier or later?

Open Question

Q3: are root and pattern disentangled?

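for any open-weight model that succeeds at the task, are the root and pattern representations in orthogonal subspaces of the hidden state? compute the cosine similarity between root-probe weight vectors and pattern-probe weight vectors at each layer. low similarity = disentangled. since random directions in a high-dimensional space are already close to orthogonal, compare against probes trained on shuffled labels before reading anything into a low value.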
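
one way to turn this into a single number per layer is the mean absolute cosine over every (root class, pattern class) pair of probe weight vectors. the sketch below assumes each probe exposes one weight vector per class; `cosine` and `mean_abs_cosine` are hypothetical helper names, and the aggregation choice (mean of absolute values) is mine, not something the source fixes.

```rust
/// Cosine similarity between two vectors; 0.0 if either has zero norm.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Mean |cosine| over all pairs of root-probe and pattern-probe weight vectors.
/// Values near 0 suggest the two probes read from near-orthogonal directions.
fn mean_abs_cosine(root_w: &[Vec<f64>], pattern_w: &[Vec<f64>]) -> f64 {
    let mut total = 0.0;
    let mut count = 0usize;
    for r in root_w {
        for p in pattern_w {
            total += cosine(r, p).abs();
            count += 1;
        }
    }
    if count == 0 {
        0.0
    } else {
        total / count as f64
    }
}
```

running the same function on weight vectors from shuffled-label probes gives the baseline the real value should be compared to.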

Open Question

Q4: where does the model "figure it out"?

compare the hidden states of correct vs incorrect predictions within the same model. at which layer do the representations for correct and incorrect outputs diverge? this tells us where the computation succeeds or fails: is the failure early (bad encoding of root/pattern) or late (good encoding but bad output projection)?
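
a simple first pass at "where do they diverge" is the distance between the mean activation of correct stimuli and the mean activation of incorrect stimuli, computed per layer. this is one possible operationalization, not the only one; `mean_vector` and `layer_divergence` are hypothetical names, and `acts[layer][stimulus]` is an assumed layout for the captured activations.

```rust
/// Mean of a set of equal-length vectors; None if the set is empty.
fn mean_vector(rows: &[&[f64]]) -> Option<Vec<f64>> {
    let first = rows.first()?;
    let mut mean = vec![0.0; first.len()];
    for row in rows {
        for (m, v) in mean.iter_mut().zip(row.iter()) {
            *m += v / rows.len() as f64;
        }
    }
    Some(mean)
}

/// Euclidean distance between the correct and incorrect centroids at each layer.
/// A layer yields None when either group is empty.
fn layer_divergence(acts: &[Vec<Vec<f64>>], correct: &[bool]) -> Vec<Option<f64>> {
    acts.iter()
        .map(|layer| -> Option<f64> {
            let mut hit: Vec<&[f64]> = Vec::new();
            let mut miss: Vec<&[f64]> = Vec::new();
            for (v, &ok) in layer.iter().zip(correct) {
                if ok {
                    hit.push(v.as_slice());
                } else {
                    miss.push(v.as_slice());
                }
            }
            let a = mean_vector(&hit)?;
            let b = mean_vector(&miss)?;
            Some(a.iter().zip(&b).map(|(p, q)| (p - q).powi(2)).sum::<f64>().sqrt())
        })
        .collect()
}
```

raw distances grow with activation norm across depth, so in practice the curve should probably be normalized (for example by the within-group spread) before comparing layers.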

Hypothesis

Q5: the scaling question

if possible, run the same probes on GPT-2 small → medium → large (or LLaMA 3 1B → 3B → 8B). at what parameter count does the model transition from memorization (ALLAM-like, real roots only) to productive generalization (GPT-4o-like, nonce roots)? this answers the inflection-point question from the tokenizer writeup.

implementation in ember

what needs to change

ember already has everything needed for the base inference. the probing pipeline needs two additions:

  1. activation capture mode: a flag or separate function that runs the forward pass but saves the hidden state after each block instead of discarding it. currently Gpt2::forward_with_cache overwrites x at each layer. a variant that pushes x.clone() to a Vec<B::Tensor> before the next block is ~5 lines.
  2. probe training harness: a small Rust module or Python script that takes the saved activations, fits logistic regression probes (via linfa in Rust, or scikit-learn on top of numpy in Python), and reports accuracy per layer. this doesn't need to be fast: we're running ~200 stimuli, not 200K.
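
the capture loop can be sketched without touching ember itself. everything below is a stand-in: the `Block` trait, `ToyBlock`, and this `forward_with_activations` signature are hypothetical and use plain `Vec<f32>` instead of ember's backend tensors. the only point is the shape of the loop: clone the residual stream once after the attention add and once after the MLP add.

```rust
/// Hypothetical stand-in for one transformer block; ember's real types differ.
trait Block {
    fn attn(&self, x: &[f32]) -> Vec<f32>;
    fn mlp(&self, x: &[f32]) -> Vec<f32>;
}

/// In-place residual add: x += delta.
fn add_residual(x: &mut [f32], delta: &[f32]) {
    for (xi, di) in x.iter_mut().zip(delta) {
        *xi += di;
    }
}

/// Run all blocks and return the final hidden state plus two snapshots
/// per block (after the attention add, then after the MLP add).
fn forward_with_activations(
    blocks: &[Box<dyn Block>],
    mut x: Vec<f32>,
) -> (Vec<f32>, Vec<Vec<f32>>) {
    let mut snapshots = Vec::with_capacity(blocks.len() * 2);
    for block in blocks {
        let a = block.attn(&x);
        add_residual(&mut x, &a);
        snapshots.push(x.clone()); // post-attention residual stream
        let m = block.mlp(&x);
        add_residual(&mut x, &m);
        snapshots.push(x.clone()); // post-MLP residual stream
    }
    (x, snapshots)
}

/// Toy block used only to exercise the capture loop.
struct ToyBlock {
    attn_bias: f32,
    mlp_scale: f32,
}

impl Block for ToyBlock {
    fn attn(&self, x: &[f32]) -> Vec<f32> {
        x.iter().map(|_| self.attn_bias).collect()
    }
    fn mlp(&self, x: &[f32]) -> Vec<f32> {
        x.iter().map(|v| v * self.mlp_scale).collect()
    }
}
```

snapshot `2 * i` is block `i` after attention and snapshot `2 * i + 1` is block `i` after the MLP, which matches the two-snapshots-per-layer layout in the probing setup.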

models to test

model | params | why
GPT-2 small | 124M | baseline; english-centric, bad Arabic tokenizer. tested in 2026 paper indirectly
LLaMA 3 1B / 3B | 1B / 3B | middle of the pack in 2026 paper. likely has some Arabic in pretraining
AraBERT / CAMeLBERT | ~110M | Arabic-specific BERT. good tokenizer alignment. should behave like ALLAM, good on real, bad on nonce?
LLaMA 3 8B | 8B | largest feasible on consumer CPU. where does the inflection happen?

compute requirements

this is CPU-friendly by design. a single forward pass on GPT-2 (124M params) takes ~50ms on modern x86. 200 stimuli × 12 layers × 2 snapshots = 4,800 activation vectors per model (ten times that if all ~200 roots are crossed with all ~10 patterns). each probe trains only on its own layer's slice of those vectors, so probe training is seconds in numpy. the whole pipeline for GPT-2 runs in under a minute. LLaMA 3 8B with Q4_K quantization fits in ~5 GB RAM, feasible on a laptop.
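
the storage side of this estimate is easy to check. the sketch below assumes one vector per snapshot (final token position only) stored as f32; GPT-2 small's 12 blocks and hidden width of 768 are the model's published dimensions, while the function names are hypothetical.

```rust
/// Number of activation vectors captured for one model.
fn activation_count(stimuli: usize, layers: usize, snapshots_per_layer: usize) -> usize {
    stimuli * layers * snapshots_per_layer
}

/// Bytes needed to hold `count` vectors of width `hidden_dim`.
fn activation_bytes(count: usize, hidden_dim: usize, bytes_per_value: usize) -> usize {
    count * hidden_dim * bytes_per_value
}
```

for GPT-2 small, `activation_count(200, 12, 2)` is 4,800 and `activation_bytes(4_800, 768, 4)` is 14,745,600 bytes, so the full activation set for one model is under 15 MB and fits comfortably in memory.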

expected results & interpretation

if GPT-4o succeeds but we can't probe it

GPT-4o's weights aren't accessible. the probing plan as written works on open-weight models. the 2026 paper's result (GPT-4o 97% nonce accuracy) provides the ceiling. the probing experiment asks: do open-weight models at various scales show the same internal organization that presumably enables GPT-4o's performance? we're looking for the emergence of the right representations, not replicating GPT-4o.

if probes are near chance everywhere

this means morphological information isn't linearly decodable from individual layer activations. it may be distributed across layers, or encoded non-linearly. follow-up: non-linear probes (MLP), or inter-layer representational similarity analysis (CKA).
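
for the CKA follow-up, linear CKA between two layers is the normalized Frobenius inner product of their centered Gram matrices: CKA(X, Y) = ⟨K, L⟩ / (‖K‖ ‖L‖), with K and L built from column-centered activations. a std-only sketch, assuming `x` and `y` each hold one activation vector per stimulus in the same stimulus order (function names are hypothetical):

```rust
/// Subtract each column's mean so every feature is centered across stimuli.
fn center_columns(x: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n = x.len() as f64;
    let dim = x.first().map_or(0, |row| row.len());
    let mut means = vec![0.0; dim];
    for row in x {
        for (m, v) in means.iter_mut().zip(row) {
            *m += v / n;
        }
    }
    x.iter()
        .map(|row| row.iter().zip(&means).map(|(v, m)| v - m).collect())
        .collect()
}

/// Gram matrix over stimuli: K[i][j] = <x_i, x_j>.
fn gram(x: &[Vec<f64>]) -> Vec<Vec<f64>> {
    x.iter()
        .map(|a| {
            x.iter()
                .map(|b| a.iter().zip(b).map(|(p, q)| p * q).sum::<f64>())
                .collect()
        })
        .collect()
}

/// Frobenius inner product of two equally shaped matrices.
fn frobenius_inner(a: &[Vec<f64>], b: &[Vec<f64>]) -> f64 {
    a.iter()
        .zip(b)
        .map(|(ra, rb)| ra.iter().zip(rb).map(|(p, q)| p * q).sum::<f64>())
        .sum()
}

/// Linear CKA between two sets of activations over the same stimuli.
fn linear_cka(x: &[Vec<f64>], y: &[Vec<f64>]) -> f64 {
    let k = gram(&center_columns(x));
    let l = gram(&center_columns(y));
    frobenius_inner(&k, &l) / (frobenius_inner(&k, &k) * frobenius_inner(&l, &l)).sqrt()
}
```

the Gram-matrix form costs O(n²) in the number of stimuli, which is fine at a few hundred stimuli and lets the two layers have different widths. the result is 1 for identical (or uniformly rescaled) representations and is undefined if either input is constant across stimuli.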

if root and pattern are in the same subspace

the model might be encoding the combined root+pattern as a single fused representation rather than disentangling them. this would mean the model learned a lookup table rather than a compositional rule, which is consistent with ALLAM's behavior (good on real roots, collapses on nonce).

if the inflection point is between 1B and 3B

this is the most actionable result. it means explicit morphological injection (tokenizer-side, embedding-side) might only help below the inflection point, somewhere in the 1B to 3B range. above it, the model learns morphology on its own. this quantifies the tradeoff and tells researchers when to invest in language-specific tokenization vs when to scale.

related work to cite

next steps

  1. add activation capture to ember: ~30 lines in model.rs. a new method forward_with_activations that returns hidden states alongside logits.
  2. build stimulus set: start with the public dataset from Alakeel et al., filter to nonce roots only, expand to ~200 stimuli with 10+ patterns.
  3. run GPT-2 probes: the baseline. GPT-2's Arabic is probably poor, but the probe structure works regardless.
  4. run LLaMA 3 1B/3B probes: the interesting comparison. where's the inflection?
  5. write up: if the scaling curve is clear, this is a short paper (findings or workshop). 6 pages, clean story: "here's what the 2026 paper showed at the behavioral level; here's what's happening inside the model to explain it."