ember is a research layer over GGUF models. it owns dataset handling, prompt construction, token-position selection, hidden-state artifacts, probes, baselines, metrics, reports, and validation. it is not trying to replace llama.cpp; ember uses llama.cpp when scale and model coverage matter, while keeping a native rust backend for inspectability and validation.
the extraction artifact contract is backend-neutral: native ember and future llama.cpp extractors write the same manifest, samples, tokenization, positions, layer shards, checksums, and report files. downstream probes read that contract instead of backend-specific output.
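
a minimal sketch of what "backend-neutral" could mean in practice: one check that only looks at which contract components were written, never at which backend wrote them. the component identifiers, `ArtifactListing`, and `missing_components` are assumptions for illustration, not ember's actual names or layout.

```rust
use std::collections::BTreeSet;

/// Components every extractor is expected to write, taken from the list above.
/// The string identifiers are illustrative, not ember's actual file names.
pub const REQUIRED_COMPONENTS: [&str; 7] = [
    "manifest",
    "samples",
    "tokenization",
    "positions",
    "layer_shards",
    "checksums",
    "report",
];

/// Hypothetical summary of what one extraction run produced.
pub struct ArtifactListing {
    pub backend: String,
    pub components: BTreeSet<String>,
}

/// Returns the required components that are missing, in contract order.
/// The backend name is deliberately ignored: the check is backend-neutral.
pub fn missing_components(listing: &ArtifactListing) -> Vec<&'static str> {
    REQUIRED_COMPONENTS
        .iter()
        .copied()
        .filter(|name| !listing.components.contains(*name))
        .collect()
}
```

a downstream probe would run the same check on a native listing and on an external one, and refuse to read either if the returned list is non-empty.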
the first llama.cpp integration point is deliberately narrow: ember can spawn a `llama-cpp-external` extractor through a request file and validate the resulting tokenization/logits artifact skeleton before any intermediate hidden-state patch is required.
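
the request-file handoff can be sketched roughly as follows. everything specific here is an assumption: the `key=value` request format, the `--request` flag, the field names in `ExtractRequest`, and the file names passed to the skeleton check are placeholders, since the real request schema is not described on this page. the sketch only builds the command; it does not run it.

```rust
use std::path::Path;
use std::process::Command;

/// Hypothetical request fields; ember's real request schema is not shown here.
pub struct ExtractRequest {
    pub model_path: String,
    pub prompts_path: String,
    pub output_dir: String,
}

/// Serialize the request as simple `key=value` lines (an assumed format).
pub fn render_request(req: &ExtractRequest) -> String {
    format!(
        "model={}\nprompts={}\noutput_dir={}\n",
        req.model_path, req.prompts_path, req.output_dir
    )
}

/// Build, but do not run, the command for the external extractor.
/// `--request` is an assumed flag name, not a documented one.
pub fn extractor_command(binary: &Path, request_file: &Path) -> Command {
    let mut cmd = Command::new(binary);
    cmd.arg("--request").arg(request_file);
    cmd
}

/// Skeleton check: list the expected files that are absent from the output
/// directory. Content validation would come after this step.
pub fn skeleton_missing(output_dir: &Path, expected: &[&str]) -> Vec<String> {
    expected
        .iter()
        .filter(|name| !output_dir.join(**name).exists())
        .map(|name| name.to_string())
        .collect()
}
```

keeping the handoff to one file in and one directory out is what makes the integration point narrow: the caller needs no knowledge of llama.cpp internals, only of the artifact skeleton it expects back.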

| level | what it shows |
|---|---|
| smoke | structural execution only: the command loaded artifacts and produced output. |
| golden logits | output-logit comparison against a trusted reference for the same prompt, tokenizer, model, and quantization path. |
| activation checks | hidden-state comparison by prompt, tokenizer, model, layer, and token position. |
| probes | linear or MLP decodability/recoverability: evidence that information is present, not that the model causally uses it. |
| interventions | supports behavioral claims only when downstream logits or continuations change. |

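
the golden-logit, activation, and intervention rows all reduce to comparing two vectors of floats. a minimal sketch of the comparisons involved, assuming both vectors come from the same prompt, tokenizer, model, and position; the function names are illustrative and no pass/fail tolerance is implied, since this page does not state one.

```rust
/// Largest absolute element-wise difference.
/// Returns None if the lengths differ or the inputs are empty.
pub fn max_abs_diff(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    Some(a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0_f32, f32::max))
}

/// Cosine similarity, useful for hidden states where scale may drift.
/// Returns None on length mismatch, empty input, or a zero-norm vector.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Index of the largest logit; ties keep the first index.
/// NaN inputs are not handled specially in this sketch.
pub fn argmax(v: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &x) in v.iter().enumerate() {
        match best {
            Some((_, b)) if x <= b => {}
            _ => best = Some((i, x)),
        }
    }
    best.map(|(i, _)| i)
}

/// Intervention check: did the top-ranked token change after the edit?
pub fn top_token_changed(baseline: &[f32], patched: &[f32]) -> Option<bool> {
    Some(argmax(baseline)? != argmax(patched)?)
}
```

`max_abs_diff` fits the golden-logit row, `cosine_similarity` fits activation checks, and `top_token_changed` is the weakest form of the intervention test: if it returns `Some(false)` and the logits barely moved, the intervention supports no behavioral claim.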

| area | status | how to read it |
|---|---|---|
| CPU runtime | works locally across small/medium GGUF paths | engineering artifact, not production parity |
| Qwen3 0.6B | generation/probe paths run | needs trusted golden-logit reference |
| LLaMA 1B/3B/8B | local smoke/probe artifacts exist | research conclusions remain preliminary |
| Gemma 4 E2B | dense text-only path runs local smoke/benchmark | experimental until golden checks cover architecture details |
| encoder benchmarks | mBERT PADT smoke completed; suite manifest exists | full XLM-R/AraBERTv2 suite still pending |
the newest engineering pass added a thread-count benchmark section under engineering. local results show that larger dense Q8_0 models benefit from threaded runtime paths on this machine, while the small Qwen3 0.6B run does not. the page keeps that claim deliberately local; it is not a cloud-speed forecast.
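
a thread-count sweep of this kind can be sketched with the standard library alone. the workload below (a chunked sum of squares) is a stand-in, not ember's runtime, and `sum_squares` and `time_thread_counts` are hypothetical names; the point is only the shape of the measurement: same input, same result, varying thread count.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Sum of squares over `data`, split into at most `threads` contiguous chunks.
/// This is a placeholder workload standing in for a real forward pass.
pub fn sum_squares(data: &[f64], threads: usize) -> f64 {
    let threads = threads.max(1);
    let chunk = ((data.len() + threads - 1) / threads).max(1);
    thread::scope(|s| {
        // Spawn all workers first so the chunks run concurrently.
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|part| s.spawn(move || part.iter().map(|x| x * x).sum::<f64>()))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<f64>()
    })
}

/// Time the same workload at each thread count, returning (threads, elapsed).
pub fn time_thread_counts(data: &[f64], counts: &[usize]) -> Vec<(usize, Duration)> {
    counts
        .iter()
        .map(|&n| {
            let start = Instant::now();
            // black_box keeps the optimizer from discarding the computation.
            std::hint::black_box(sum_squares(data, n));
            (n, start.elapsed())
        })
        .collect()
}
```

on a small input the spawn and join overhead can outweigh the work itself, which is one plausible reason a small model would not benefit from more threads; any timings from a sweep like this describe the machine they ran on and nothing more.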