research notes · voidwest

running notes on papers, experiments, and ideas. updated when i learn something worth writing down.

research & direction

When the Result Gets Less Flashy but More Real

this week ember's Arabic morphology probing shifted from hidden-state extraction to leakage-aware measurement. the result is narrower: POS survives stricter held-out evaluation in Qwen3-0.6B and Llama-3.2-1B, while root, lemma, and pattern need a different evaluation framework.

2026-06-22 · Arabic NLP probing morphology ember Qwen LLaMA
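
a minimal sketch of one common reading of "leakage-aware" held-out evaluation for the entry above: split by group, so that no root appears in both the train and test sets of a probe. this is an assumption about the setup, not ember's actual code. the function name `group_holdout_split`, the field names, and the toy transliterated examples are all hypothetical.

```python
import random
from collections import defaultdict


def group_holdout_split(examples, group_key, test_fraction=0.2, seed=0):
    """Split examples so that no group value (e.g. a root) lands on both sides.

    Hypothetical helper for illustration; not part of any library.
    """
    groups = defaultdict(list)
    for ex in examples:
        groups[ex[group_key]].append(ex)

    # Shuffle the group keys, not the examples, so whole groups move together.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])

    train = [ex for k in keys if k not in test_keys for ex in groups[k]]
    test = [ex for k in keys if k in test_keys for ex in groups[k]]
    return train, test


# Toy, hand-written transliterations; not data from the experiments.
examples = [
    {"word": "kataba", "root": "ktb", "pos": "VERB"},
    {"word": "kaatib", "root": "ktb", "pos": "NOUN"},
    {"word": "darasa", "root": "drs", "pos": "VERB"},
    {"word": "madrasa", "root": "drs", "pos": "NOUN"},
    {"word": "fahima", "root": "fhm", "pos": "VERB"},
    {"word": "mafhuum", "root": "fhm", "pos": "ADJ"},
    {"word": "salima", "root": "slm", "pos": "VERB"},
    {"word": "saliim", "root": "slm", "pos": "ADJ"},
    {"word": "jalasa", "root": "jls", "pos": "VERB"},
    {"word": "majlis", "root": "jls", "pos": "NOUN"},
]

train, test = group_holdout_split(examples, "root", test_fraction=0.2)

# The leakage check: the two sides share no root.
shared_roots = {ex["root"] for ex in train} & {ex["root"] for ex in test}
print(len(train), len(test), shared_roots)
```

a POS probe trained on `train` is then scored only on words whose roots it never saw. scikit-learn's `GroupShuffleSplit` provides the same kind of split if that dependency is available.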

cross-family Arabic morphology probes

preliminary layerwise probe results across LLaMA, Qwen, and Gemma on Arabic nonce root-pattern stimuli. root-probe results vary across the final layers; pattern probes saturate and should not be used for scale or family claims in this run.

2026-06-14 · Arabic NLP probing morphology LLaMA Qwen Gemma preliminary

what LLaMA knows about Arabic morphology (and won't say)

probing LLaMA 3.2 1B/3B/8B with linear classifiers, CCA, and RSA. the representations are there — structured, disentangled, scaling non-monotonically — but every model outputs "The". findings, charts, and next steps.

2026-05-26 · Arabic NLP probing morphology LLaMA scaling findings
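
for readers new to RSA, a rough stdlib-only sketch of the idea named in the entry above: build a dissimilarity matrix over the same stimuli for two sets of representations, then rank-correlate the two matrices. cosine distance and Spearman correlation are assumed choices here, the vectors are made up, and none of this is the pipeline used in the post.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def rdm(vectors):
    """Upper triangle of the representational dissimilarity matrix, flattened."""
    return [cosine_distance(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)]


def ranks(values):
    """Ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rsa_score(reps_a, reps_b):
    """Spearman correlation between the two RDMs (Pearson on ranks)."""
    return pearson(ranks(rdm(reps_a)), ranks(rdm(reps_b)))


# Made-up "hidden states" for four stimuli; layer_b is layer_a scaled by 2,
# so the cosine geometry is unchanged.
layer_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [1.0, 2.0, 3.0]]
layer_b = [[2.0 * x for x in vec] for vec in layer_a]

score = rsa_score(layer_a, layer_b)
print(score)
```

because cosine distance ignores vector length, the scaled copy has the same geometry and the score comes out at 1 (up to floating-point rounding). with real hidden states, the same comparison can be run between layers or between models, as long as both sides see the same stimuli in the same order.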

the tokenizer isn't the problem

what i learned reading Arabic NLP papers for a week. the 2026 paper that broke the assumption, and why the real question is about internal representations, not tokenization.

2026-05-16 · Arabic NLP morphology tokenization writeup

probing Arabic morphology inside LLMs

research plan: use ember for activation probing to find where and how models learn Arabic root-pattern morphology internally. builds on the open question from Alakeel et al. (2026).

2026-05-16 · Arabic NLP probing ember research plan

paper notes

morphemes without borders

alakeel, qwaider, aldarmaki, alqahtani · LREC 2026. token-morpheme alignment doesn't predict morphological generation in Arabic LLMs. GPT-4o scores 97% on nonce roots with terrible tokenizer alignment.

2026-05-15 · Arabic NLP morphology tokenization LLM evaluation

arabic tokenization system

attia · ~2007. a finite-state, modular Arabic tokenizer with clitic guesser, morphological analyzer, and cashida-based disambiguation. the guess-and-filter architecture.

2026-05-15 · Arabic NLP tokenization finite-state

tokenization in Arabic embedding models

alkaoud & syed · WANLP 2020. morphology-aware tokenization at the embedding layer: 60% smaller vocab, better OOV handling, SOTA without retraining. worked at Word2Vec/mBERT scale.

2026-05-16 · Arabic NLP embeddings tokenization morphology