
research notes

running notes on papers, experiments, and ideas. updated when i learn something worth writing down.

research & direction

this week, Ember's Arabic morphology probing shifted from hidden-state extraction to leakage-aware measurement. the result is narrower: POS survives stricter held-out evaluation in Qwen3-0.6B and Llama-3.2-1B, while root, lemma, and pattern need a different evaluation framework.
2026-06-22  ·  Arabic NLP probing morphology ember Qwen LLaMA
preliminary layerwise probe results across LLaMA, Qwen, and Gemma on Arabic nonce root-pattern stimuli. root-label results vary across the final layers; pattern-label results saturate, so they should not be used for scale or family claims in this run.
2026-06-14  ·  Arabic NLP probing morphology LLaMA Qwen Gemma preliminary
probing LLaMA 3.2 1B/3B/8B with linear classifiers, CCA, and RSA. the representations are there — structured, disentangled, scaling non-monotonically — but every model outputs "The". findings, charts, and next steps.
2026-05-26  ·  Arabic NLP probing morphology LLaMA scaling findings
what i learned reading Arabic NLP papers for a week: the 2026 paper that broke the tokenizer-alignment assumption, and why the real question is about internal representations, not tokenization.
2026-05-16  ·  Arabic NLP morphology tokenization writeup
research plan: use ember for activation probing to find where and how models learn Arabic root-pattern morphology internally. builds on the open question from Alakeel et al. (2026).
2026-05-16  ·  Arabic NLP probing ember research plan

paper notes

alakeel, qwaider, aldarmaki, alqahtani · LREC 2026. token-morpheme alignment doesn't predict morphological generation in Arabic LLMs. GPT-4o scores 97% on nonce roots despite terrible tokenizer alignment.
2026-05-15  ·  Arabic NLP morphology tokenization LLM evaluation
attia · ~2007. a finite-state, modular Arabic tokenizer with a clitic guesser, a morphological analyzer, and cashida-based disambiguation. the guess-and-filter architecture.
2026-05-15  ·  Arabic NLP tokenization finite-state
alkaoud & syed · WANLP 2020. morphology-aware tokenization at the embedding layer: 60% smaller vocab, better OOV handling, SOTA without retraining. worked at Word2Vec/mBERT scale.
2026-05-16  ·  Arabic NLP embeddings tokenization morphology