ember engineering update

the recent ember work moved the project from "model support works in a narrow demo" toward an inspectable research pipeline. the main changes are runtime plumbing, streamed activation extraction, benchmark manifests, golden-logit summaries, smoke reports, and conservative causal-intervention reporting. the point is not broader model coverage for its own sake; it is making every run easier to reproduce, compare, and distrust productively.

runtime	CPU backend helpers, q8_0 scratch reuse, Rayon attention paths, deterministic greedy sampling, and pooled activation extraction.
artifacts	streamed .npy activation files, sidecar metadata, smoke summaries, benchmark summaries, and Markdown reports.
validation	golden-logit comparison scripts and activation-reference design docs, with independent references still required for strong claims.
research pipeline	split-policy metadata, benchmark manifests, encoder extraction, MDL curves, and conservative intervention summaries.

engineering pages

thread-count benchmark

local post-SIMD benchmark for Rayon thread counts across Qwen3, LLaMA, and Gemma Q8_0 models. useful as runtime evidence, not as a cloud-speed prediction.

tags: runtime, rayon, q8_0, local benchmark

simd kernels, qwen 3, and gemma 4

the earlier engineering writeup on q8_0 SIMD kernels and narrow model-family support. linked here as part of the engineering track.

tags: simd, qwen3, gemma4, model support

internals

lower-level notes on tensors, the backend trait, KV cache, GGUF loading, sampling, and the original probing pipeline.

tags: architecture, memory, backend

runtime changes

the CPU backend now carries more of the execution contract. row helpers, q8_0 prefill scratch, cached attention helpers, and Rayon parallelism reduce repeated hot-path work without changing the public model contract. the SIMD kernel benchmark is covered in the separate SIMD/Qwen3/Gemma page; the local benchmark below is only about post-SIMD thread-count behavior.

probe extraction also changed shape. instead of treating probe mode as a large in-memory dump, ember streams activation rows to .npy and records pooled per-layer states for the selected token positions. that makes cloud pullback and repeated benchmark runs less dependent on raw activation transfer.
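a rough sketch of what "stream rows instead of dumping everything" can look like. this is not ember's implementation: it is a stdlib-only illustration that reserves a fixed-size .npy v1.0 header, appends float32 rows one at a time, and rewrites the row count on close. the names `NpyRowWriter`, `npy_preamble`, and `mean_pool` are hypothetical, and mean pooling is an assumption, since the pooling rule is not stated here. with NumPy available, `numpy.lib.format.open_memmap` would be the more usual tool.

```python
import struct

MAGIC = b"\x93NUMPY\x01\x00"  # .npy magic string followed by format version 1.0
PREAMBLE = 128                 # bytes reserved up front so the header can be rewritten


def npy_preamble(rows, width):
    """Fixed-size .npy v1.0 preamble for a little-endian float32 (rows, width) array."""
    text = "{'descr': '<f4', 'fortran_order': False, 'shape': (%d, %d), }" % (rows, width)
    body_len = PREAMBLE - len(MAGIC) - 2        # room left after magic and length field
    body = text.ljust(body_len - 1) + "\n"      # space padding, newline terminator
    if len(body) != body_len:
        raise ValueError("shape does not fit in the reserved header")
    return MAGIC + struct.pack("<H", body_len) + body.encode("ascii")


class NpyRowWriter:
    """Append rows to a .npy file without holding the full matrix in memory."""

    def __init__(self, path, width):
        self.width = width
        self.rows = 0
        self.file = open(path, "wb")
        self.file.write(npy_preamble(0, width))  # placeholder shape, fixed later

    def append(self, row):
        if len(row) != self.width:
            raise ValueError("row width does not match the file width")
        self.file.write(struct.pack("<%df" % self.width, *row))
        self.rows += 1

    def close(self):
        self.file.seek(0)                        # rewrite the header in place
        self.file.write(npy_preamble(self.rows, self.width))
        self.file.close()


def mean_pool(token_states, positions):
    """Average one layer's hidden states over the selected token positions."""
    if not positions:
        raise ValueError("at least one token position is required")
    picked = [token_states[p] for p in positions]
    return [sum(column) / len(picked) for column in zip(*picked)]
```

the fixed-size header is the design choice that matters: the row count is unknown while streaming, so the writer reserves space and patches the shape at the end instead of buffering rows.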

thread-count benchmark

i ran scripts/benchmark_threads.py locally on an Intel i5-1135G7 laptop CPU with 4 physical cores and 8 hardware threads. all rows use Q8_0 GGUF files and ember's own --benchmark decode timer. the runs reload the model for each thread count, so wall time is useful operational context, but the table below reports decode throughput in tok/s, derived from the decode timings in the benchmark output rather than from wall time.

model	repeats	1-thread decode	best local decode	best thread count	interpretation
Qwen3 0.6B	3	6.70 tok/s	6.70 tok/s	1	no local threading gain in this small run
LLaMA 3.2 1B	2	2.63 tok/s	3.23 tok/s	4	modest decode improvement
LLaMA 3.2 3B	2	1.09 tok/s	1.53 tok/s	8	clearer local decode improvement
Gemma 4 E2B	2	1.74 tok/s	2.73 tok/s	4	best local thread count was not the maximum
LLaMA 3.1 8B	1	0.59 tok/s	1.24 tok/s	8	directional smoke result only

the careful conclusion is narrow: on this machine, larger dense Q8_0 models benefited from the threaded runtime paths, while the small Qwen3 0.6B run did not. this does not predict a specific cloud speedup; it only says the threading work has measurable local effect once the model is large enough for the overhead to pay back.
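to make the size of the effect concrete, the table rows can be turned into speedup ratios. this is a small sketch using only the numbers copied from the table; the helper names are illustrative, and "per-thread efficiency" here just means speedup divided by thread count, not a measured hardware metric.

```python
# (model, 1-thread tok/s, best tok/s, best thread count), copied from the table above
ROWS = [
    ("Qwen3 0.6B", 6.70, 6.70, 1),
    ("LLaMA 3.2 1B", 2.63, 3.23, 4),
    ("LLaMA 3.2 3B", 1.09, 1.53, 8),
    ("Gemma 4 E2B", 1.74, 2.73, 4),
    ("LLaMA 3.1 8B", 0.59, 1.24, 8),
]


def speedup(single, best):
    """Ratio of best observed decode throughput to the 1-thread baseline."""
    return best / single


def per_thread_efficiency(single, best, threads):
    """Speedup divided by thread count; 1.0 would be perfect linear scaling."""
    return speedup(single, best) / threads


if __name__ == "__main__":
    for model, single, best, threads in ROWS:
        print(
            f"{model:<14} {speedup(single, best):.2f}x at {threads} threads "
            f"({per_thread_efficiency(single, best, threads):.0%} per-thread efficiency)"
        )
```

the ratios come out to roughly 1.00x, 1.23x, 1.40x, 1.57x, and 2.10x. with one to three repeats per row, these are directional numbers, not stable estimates.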

benchmark and probe pipeline

the Python side now has a clearer benchmark surface: run_benchmark.py runs manifest-defined jobs, benchmark_summary.py records artifact status, and render_benchmark_report.py turns summaries into human-readable Markdown. the report language is intentionally about decodability and artifact status, not scientific conclusions.
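a minimal sketch of the "artifact status, not conclusions" idea. the manifest shape (`jobs`, `name`, `artifacts`) and the function names are assumptions made for illustration; they are not the actual schema or API of run_benchmark.py, benchmark_summary.py, or render_benchmark_report.py.

```python
from pathlib import Path


def summarize_artifacts(manifest, root):
    """Record, per job, whether each expected artifact is present, empty, or missing."""
    summary = {}
    for job in manifest["jobs"]:              # hypothetical manifest layout
        status = {}
        for rel in job["artifacts"]:
            path = Path(root) / rel
            if not path.is_file():
                status[rel] = "missing"
            elif path.stat().st_size == 0:
                status[rel] = "empty"
            else:
                status[rel] = "present"
        summary[job["name"]] = status
    return summary


def render_markdown(summary):
    """Render the status summary as a Markdown table with no interpretation attached."""
    lines = ["| job | artifact | status |", "| --- | --- | --- |"]
    for job, artifacts in summary.items():
        for rel, status in artifacts.items():
            lines.append(f"| {job} | {rel} | {status} |")
    return "\n".join(lines)
```

the renderer only restates what exists on disk, which keeps the report from drifting into claims the artifacts cannot support.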

probe training gained stricter split-policy handling for grouped experiments. random stratified splits still exist, but root-heldout, pattern-heldout, root-pattern combination-heldout, sentence-heldout, and template-heldout policies can now be recorded instead of being implicit. requested grouped splits fail when the required field is missing; they do not silently fall back to random.
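the fail-instead-of-fall-back rule can be sketched as follows. this is an illustration under assumptions, not the project's code: `grouped_split`, `MissingGroupField`, the record layout, and the returned metadata keys are all hypothetical.

```python
import random


class MissingGroupField(ValueError):
    """Raised when a grouped split is requested but records lack the group field."""


def grouped_split(records, group_field, test_fraction=0.2, seed=0):
    """Hold out whole groups so no group value appears in both train and test."""
    missing = [i for i, rec in enumerate(records) if rec.get(group_field) is None]
    if missing:
        # Refuse to degrade silently into a random split.
        raise MissingGroupField(
            f"{len(missing)} record(s) lack '{group_field}'; grouped split not possible"
        )
    groups = sorted({rec[group_field] for rec in records})
    if len(groups) < 2:
        raise ValueError("need at least two distinct groups to hold one out")
    rng = random.Random(seed)                 # seeded so the split is reproducible
    rng.shuffle(groups)
    n_test = min(len(groups) - 1, max(1, round(len(groups) * test_fraction)))
    held_out = set(groups[:n_test])
    train = [rec for rec in records if rec[group_field] not in held_out]
    test = [rec for rec in records if rec[group_field] in held_out]
    metadata = {
        "policy": f"{group_field}-heldout",   # recorded explicitly, never implied
        "seed": seed,
        "held_out": sorted(held_out),
    }
    return train, test, metadata
```

returning the policy metadata alongside the split is what lets a later report state which policy was actually used.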

validation and reporting

smoke reports now record more of the machine and command context. golden-logit reports can be summarized into compact JSON and Markdown. causal-intervention reports can also render Markdown, but their interpretation is deliberately narrow: probe-direction removal can affect decodability, while behavioral causality requires changed logits or continuations.
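the distinction between decodability and behavior can be made concrete with a small sketch. removing a probe direction is an orthogonal projection, h' = h - ((h·w)/(w·w)) w, and the report keeps the probe-accuracy change separate from any logit change. the function names, the report keys, and the `logit_tol` threshold are assumptions for illustration; this version also only checks logits, not continuations.

```python
def remove_direction(hidden, direction):
    """Project `hidden` onto the subspace orthogonal to `direction`."""
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        raise ValueError("direction must be non-zero")
    coeff = sum(h * d for h, d in zip(hidden, direction)) / norm_sq
    return [h - coeff * d for h, d in zip(hidden, direction)]


def max_abs_logit_shift(before, after):
    """Largest absolute per-token logit change caused by the intervention."""
    return max(abs(a - b) for a, b in zip(before, after))


def describe_intervention(probe_acc_before, probe_acc_after,
                          logits_before, logits_after, logit_tol=1e-4):
    """Keep the decodability change and the behavioral evidence as separate fields."""
    shift = max_abs_logit_shift(logits_before, logits_after)
    return {
        "decodability_change": probe_acc_after - probe_acc_before,
        "max_abs_logit_shift": shift,
        # A drop in probe accuracy alone never sets this flag.
        "behavioral_evidence": shift > logit_tol,
    }
```

keeping the two fields apart means a large drop in probe accuracy with unchanged logits is reported as a decodability effect only.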

the activation-reference design doc is the next important validation step. golden logits say the output surface matches a reference for a prompt. hidden-state probing also needs layer-by-layer activation checks for the same prompt, tokenizer, model, layer, and token position before treating representational geometry as numerically validated.
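one possible shape for that check, sketched under assumptions: activations are keyed by the same five fields listed above, and each vector is compared against an independent reference with an absolute plus relative tolerance. `ActivationKey`, `compare_vectors`, `check_activations`, and the default tolerances are hypothetical and not taken from the design doc.

```python
from typing import NamedTuple


class ActivationKey(NamedTuple):
    """Everything that must match before two hidden states are comparable."""
    model: str
    tokenizer: str
    prompt: str
    layer: int
    token_position: int


def compare_vectors(ours, reference, atol=1e-3, rtol=1e-3):
    """Element-wise check: |a - b| <= atol + rtol * |b| for every element."""
    if len(ours) != len(reference):
        raise ValueError("hidden sizes differ")
    max_abs_diff = 0.0
    within = True
    for a, b in zip(ours, reference):
        diff = abs(a - b)
        max_abs_diff = max(max_abs_diff, diff)
        if diff > atol + rtol * abs(b):
            within = False
    return {"max_abs_diff": max_abs_diff, "within_tolerance": within}


def check_activations(ours, reference, atol=1e-3, rtol=1e-3):
    """Compare every reference entry; a missing key is reported, not skipped."""
    report = {}
    for key, ref_vec in reference.items():
        if key not in ours:
            report[key] = {"status": "missing"}
            continue
        result = compare_vectors(ours[key], ref_vec, atol, rtol)
        result["status"] = "ok" if result["within_tolerance"] else "mismatch"
        report[key] = result
    return report
```

reporting missing keys explicitly matters here: a layer that was never compared should not read as a layer that matched.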

docs and site

the docs were split into clearer lanes. the main ember page now separates engineering from research/results. the SIMD/Qwen3/Gemma page explains narrow runtime and model-family changes. the research notes remain responsible for Arabic morphology interpretation, and those claims should stay tied to actual validation artifacts.

what this does not prove

next engineering pressure

the most useful next work is boring: independent golden-logit references for Qwen3 and LLaMA, activation reference checks for one small model and one prompt, full encoder benchmark reports, compact cloud artifact pullback, and continued cleanup of stale docs that overstate what has been validated.