running notes on papers, experiments, and ideas. updated when i
learn something worth writing down.
this week Ember's Arabic morphology probing shifted from
hidden-state extraction to leakage-aware measurement. the
resulting claim is narrower: POS survives stricter held-out
evaluation in Qwen3-0.6B and Llama-3.2-1B, while root, lemma,
and pattern need a different evaluation framework.
2026-06-22
·
Arabic NLP
probing
morphology
ember
Qwen
LLaMA
preliminary layerwise probe results across LLaMA, Qwen, and
Gemma on Arabic nonce root-pattern stimuli. root labels vary
across final layers; pattern labels saturate and should not
be used for scale or family claims in this run.
2026-06-14
·
Arabic NLP
probing
morphology
LLaMA
Qwen
Gemma
preliminary
probing LLaMA 3.2 1B/3B/8B with linear classifiers, CCA, and
RSA. the representations are there (structured, disentangled,
scaling non-monotonically), but every model still
outputs "The". findings, charts, and next steps.
2026-05-26
·
Arabic NLP
probing
morphology
LLaMA
scaling
findings
what i learned reading arabic nlp papers for a week: the
2026 paper that broke the tokenizer-alignment assumption, and
why the real question is about internal representations, not
tokenization.
2026-05-16
·
Arabic NLP
morphology
tokenization
writeup
research plan: use ember for activation probing to find
where and how models learn Arabic root-pattern morphology
internally. builds on the open question from Alakeel et al.
(2026).
2026-05-16
·
Arabic NLP
probing
ember
research plan
alakeel, qwaider, aldarmaki, alqahtani · LREC 2026.
token-morpheme alignment doesn't predict morphological
generation in Arabic LLMs. GPT-4o scores 97% on nonce roots
despite poor tokenizer alignment.
2026-05-15
·
Arabic NLP
morphology
tokenization
LLM evaluation
attia · ~2007. a finite-state, modular Arabic tokenizer with a
clitic guesser, morphological analyzer, and cashida-based
disambiguation. the guess-and-filter architecture.
2026-05-15
·
Arabic NLP
tokenization
finite-state
alkaoud & syed · WANLP 2020. morphology-aware tokenization
at the embedding layer: 60% smaller vocab, better OOV
handling, SOTA without retraining. worked at Word2Vec/mBERT
scale.
2026-05-16
·
Arabic NLP
embeddings
tokenization
morphology