
ember engineering update

2026-06-06 · post-SIMD/Qwen3/Gemma support · engineering status, not a research-result claim

the recent ember work moved the project from "model support works in a narrow demo" toward an inspectable research pipeline. the main changes are runtime plumbing, streamed activation extraction, benchmark manifests, golden-logit summaries, smoke reports, and conservative causal-intervention reporting. the point is not broader model coverage for its own sake; it is making every run easier to reproduce, compare, and distrust productively.

runtime: CPU backend helpers, q8_0 scratch reuse, Rayon attention paths, deterministic greedy sampling, and pooled activation extraction.
artifacts: streamed .npy activation files, sidecar metadata, smoke summaries, benchmark summaries, and Markdown reports.
validation: golden-logit comparison scripts and activation-reference design docs, with independent references still required for strong claims.
research pipeline: split-policy metadata, benchmark manifests, encoder extraction, MDL curves, and conservative intervention summaries.

engineering pages

the earlier engineering writeup on q8_0 SIMD kernels and narrow model-family support. linked here as part of the engineering track.
tags: simd, qwen3, gemma4, model support

lower-level notes on tensors, the backend trait, KV cache, GGUF loading, sampling, and the original probing pipeline.
tags: architecture, memory, backend

runtime changes

the CPU backend now carries more of the execution contract. row helpers, q8_0 prefill scratch, cached attention helpers, and Rayon parallelism reduce repeated hot-path work without changing the public model contract. the SIMD kernel benchmark is covered in the separate SIMD/Qwen3/Gemma page; the local benchmark below is only about post-SIMD thread-count behavior.

probe extraction also changed shape. instead of treating probe mode as a large in-memory dump, ember streams activation rows to .npy and records pooled per-layer states for the selected token positions. that makes cloud pullback and repeated benchmark runs less dependent on raw activation transfer.
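
a minimal sketch of what "pooled per-layer states for the selected token positions" could look like. it assumes mean pooling, which the text above does not confirm, and the names pool_layer_states and pool_all_layers are hypothetical rather than ember's actual functions. the .npy streaming step is not shown.

```python
from statistics import fmean


def pool_layer_states(layer_rows, positions):
    """Mean-pool the hidden-state rows at the selected token positions.

    layer_rows: per-token rows (lists of floats) for one layer.
    positions: token indices to include; negative indices count from the end.
    Mean pooling is an assumption made for this sketch.
    """
    if not positions:
        raise ValueError("at least one token position is required")
    selected = [layer_rows[p] for p in positions]
    width = len(selected[0])
    if any(len(row) != width for row in selected):
        raise ValueError("rows must share one hidden size")
    # Average each hidden dimension across the selected rows.
    return [fmean(column) for column in zip(*selected)]


def pool_all_layers(activations, positions):
    """activations maps layer index -> per-token rows; returns layer -> pooled row."""
    return {
        layer: pool_layer_states(rows, positions)
        for layer, rows in sorted(activations.items())
    }


# Toy data: 2 layers, 3 tokens, hidden size 2.
activations = {
    0: [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    1: [[0.0, 0.0], [2.0, 2.0], [4.0, 8.0]],
}
pooled = pool_all_layers(activations, positions=[1, 2])
print(pooled)
```

storing one pooled row per layer instead of every token row is what keeps the artifact small enough to pull back cheaply.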

thread-count benchmark

i ran scripts/benchmark_threads.py locally on an Intel i5-1135G7 laptop CPU with 4 physical cores and 8 hardware threads. all rows use Q8_0 GGUF files and ember's own --benchmark decode timer. each thread count reloads the model, so wall time is useful operational context only; the table below reports decode throughput (tok/s) derived from the decode time in the benchmark output, not from wall time.
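
for reference, converting a decode time in milliseconds into tokens per second is a one-line calculation. the helper name and the example numbers below are illustrative assumptions, not measured values or ember's actual code.

```python
def decode_tokens_per_second(generated_tokens, decode_ms):
    """Convert a decode duration in milliseconds into tokens per second."""
    if decode_ms <= 0:
        raise ValueError("decode_ms must be positive")
    return generated_tokens / (decode_ms / 1000.0)


# Illustrative numbers only: 16 generated tokens decoded in 8000 ms.
rate = decode_tokens_per_second(16, 8000.0)
print(f"{rate:.2f} tok/s")
```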

[figure: local ember thread-count benchmark showing decode tokens per second for Qwen3 0.6B, LLaMA 3.2 1B, LLaMA 3.2 3B, Gemma 4 E2B, and LLaMA 3.1 8B across 1, 2, 4, and 8 Rayon threads]
local smoke benchmark only. Qwen3 0.6B used 32 generated tokens and 3 repeats; LLaMA 1B/3B and Gemma E2B used 16 tokens and 2 repeats; LLaMA 8B used 8 tokens and 1 repeat.
| model | repeats | 1 thread decode | best local decode | best threads | read |
| --- | --- | --- | --- | --- | --- |
| Qwen3 0.6B | 3 | 6.70 tok/s | 6.70 tok/s | 1 | no local threading gain in this small run |
| LLaMA 3.2 1B | 2 | 2.63 tok/s | 3.23 tok/s | 4 | modest decode improvement |
| LLaMA 3.2 3B | 2 | 1.09 tok/s | 1.53 tok/s | 8 | clearer local decode improvement |
| Gemma 4 E2B | 2 | 1.74 tok/s | 2.73 tok/s | 4 | best local thread count was not the maximum |
| LLaMA 3.1 8B | 1 | 0.59 tok/s | 1.24 tok/s | 8 | directional smoke result only |

the careful conclusion is narrow: on this machine, larger dense Q8_0 models benefited from the threaded runtime paths, while the small Qwen3 0.6B run did not. this does not predict a specific cloud speedup; it only says the threading work has measurable local effect once the model is large enough for the overhead to pay back.
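
the relative gain per model can be recomputed from the table as best local decode divided by 1-thread decode. the snippet below only restates the published tok/s figures; the tuple layout is an assumption made for this sketch.

```python
# (model, 1-thread decode tok/s, best local decode tok/s, best thread count)
rows = [
    ("Qwen3 0.6B", 6.70, 6.70, 1),
    ("LLaMA 3.2 1B", 2.63, 3.23, 4),
    ("LLaMA 3.2 3B", 1.09, 1.53, 8),
    ("Gemma 4 E2B", 1.74, 2.73, 4),
    ("LLaMA 3.1 8B", 0.59, 1.24, 8),
]

# Speedup relative to the single-thread run, rounded to two decimals.
speedups = {name: round(best / single, 2) for name, single, best, _ in rows}

for name, _, _, threads in rows:
    print(f"{name}: {speedups[name]:.2f}x at {threads} thread(s)")
```

the ratios come out to roughly 1.00x, 1.23x, 1.40x, 1.57x, and 2.10x. the 8B figure rests on a single repeat of 8 tokens, so it should be read as direction only.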

benchmark and probe pipeline

the Python side now has a clearer benchmark surface: run_benchmark.py runs manifest-defined jobs, benchmark_summary.py records artifact status, and render_benchmark_report.py turns summaries into human-readable Markdown. the report language is intentionally about decodability and artifact status, not scientific conclusions.
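
a rough sketch of the manifest-to-summary-to-report flow, under loud assumptions: the manifest schema ("jobs", "name", "artifacts") and both function names are invented for illustration and are not the actual interfaces of run_benchmark.py, benchmark_summary.py, or render_benchmark_report.py.

```python
import tempfile
from pathlib import Path


def summarize_artifacts(manifest, root):
    """Record whether each job's expected artifact files exist and are non-empty."""
    summary = []
    for job in manifest["jobs"]:  # hypothetical schema
        status = {}
        for rel in job["artifacts"]:
            path = Path(root) / rel
            present = path.is_file() and path.stat().st_size > 0
            status[rel] = "present" if present else "missing"
        summary.append({
            "job": job["name"],
            "artifacts": status,
            "complete": all(s == "present" for s in status.values()),
        })
    return summary


def render_markdown(summary):
    """Render artifact status only; no scientific interpretation is added."""
    lines = ["| job | artifact | status |", "| --- | --- | --- |"]
    for entry in summary:
        for rel, status in entry["artifacts"].items():
            lines.append(f"| {entry['job']} | {rel} | {status} |")
    return "\n".join(lines)


manifest = {"jobs": [{"name": "demo-job", "artifacts": ["acts.npy", "acts.json"]}]}

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder bytes stand in for a real activation file.
    (Path(tmp) / "acts.npy").write_bytes(b"placeholder")
    summary = summarize_artifacts(manifest, tmp)
    report = render_markdown(summary)

print(report)
```

keeping the summary as plain data and the rendering as a separate step means the Markdown can be regenerated without rerunning any job.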

probe training gained stricter split-policy handling for grouped experiments. random stratified splits still exist, but root-heldout, pattern-heldout, root-pattern combination-heldout, sentence-heldout, and template-heldout policies can now be recorded instead of being implicit. requested grouped splits fail when the required field is missing; they do not silently fall back to random.
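
the fail-instead-of-fall-back behavior can be sketched as follows. the policy names mirror the ones listed above, but the field names (root, pattern, sentence_id, template_id), the function name, and the returned metadata layout are assumptions for illustration.

```python
import random

# Hypothetical mapping from split policy to the example fields it groups on.
GROUP_FIELDS = {
    "root-heldout": ("root",),
    "pattern-heldout": ("pattern",),
    "root-pattern-heldout": ("root", "pattern"),
    "sentence-heldout": ("sentence_id",),
    "template-heldout": ("template_id",),
}


def grouped_split(examples, policy, test_fraction=0.25, seed=0):
    """Hold out whole groups; raise if a required field is missing."""
    fields = GROUP_FIELDS[policy]  # unknown policies raise KeyError
    keys = []
    for i, ex in enumerate(examples):
        missing = [f for f in fields if ex.get(f) is None]
        if missing:
            # Never fall back to a random split silently.
            raise ValueError(f"{policy!r} needs {missing} on example {i}")
        keys.append(tuple(ex[f] for f in fields))
    groups = sorted(set(keys))
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    held = set(groups[:n_test])
    train = [ex for ex, k in zip(examples, keys) if k not in held]
    test = [ex for ex, k in zip(examples, keys) if k in held]
    meta = {"policy": policy, "fields": list(fields), "seed": seed,
            "heldout_groups": n_test}
    return train, test, meta


examples = [
    {"root": r, "pattern": p, "label": i % 2}
    for i, (r, p) in enumerate(
        (r, p) for r in ("ktb", "drs", "slm", "qtl") for p in ("CaCaC", "CuCiC")
    )
]
train, test, meta = grouped_split(examples, "root-heldout")
print(meta, len(train), len(test))
```

recording meta next to the probe results is what makes the split policy explicit rather than implicit.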

validation and reporting

smoke reports now record more of the machine and command context. golden-logit reports can be summarized into compact JSON and Markdown. causal-intervention reports can also render Markdown, but their interpretation is deliberately narrow: probe-direction removal can affect decodability, while behavioral causality requires changed logits or continuations.
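
one way to encode that narrow interpretation is a small classifier over the report fields. everything here is hypothetical: the field names, the tolerance, and the label strings are assumptions for this sketch, not ember's report schema.

```python
def classify_intervention(report, logit_tol=1e-4):
    """Return the strongest statement a report supports, and nothing stronger."""
    probe_drop = report["probe_acc_before"] - report["probe_acc_after"]
    logits_changed = report.get("max_abs_logit_delta", 0.0) > logit_tol
    continuation_changed = (
        report.get("continuation_before") != report.get("continuation_after")
    )
    if logits_changed or continuation_changed:
        # Necessary evidence for a behavioral claim; not sufficient on its own.
        return "behavioral-change-recorded"
    if probe_drop > 0:
        # The probe direction was removed, but model outputs did not move.
        return "decodability-only"
    return "no-effect-observed"


decodability_only = classify_intervention({
    "probe_acc_before": 0.91,
    "probe_acc_after": 0.55,
    "max_abs_logit_delta": 0.0,
    "continuation_before": "abc",
    "continuation_after": "abc",
})
behavioral = classify_intervention({
    "probe_acc_before": 0.91,
    "probe_acc_after": 0.55,
    "max_abs_logit_delta": 0.8,
    "continuation_before": "abc",
    "continuation_after": "abd",
})
print(decodability_only, behavioral)
```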

the activation-reference design doc is the next important validation step. golden logits say the output surface matches a reference for a prompt. hidden-state probing also needs layer-by-layer activation checks for the same prompt, tokenizer, model, layer, and token position before treating representational geometry as numerically validated.
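
a sketch of what a layer-by-layer activation check might look like, keyed on the same five fields named above. the ActivationKey type, the max-absolute-difference metric, and the 1e-3 tolerance are assumptions for illustration; the design doc may specify something different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivationKey:
    """Identifies one activation vector; all five fields must match the reference."""
    model: str
    tokenizer: str
    prompt: str
    layer: int
    token_position: int


def max_abs_diff(a, b):
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def check_against_reference(candidate, reference, atol=1e-3):
    """Compare every reference vector to the candidate vector with the same key."""
    results = []
    for key, ref_vec in reference.items():
        if key not in candidate:
            results.append((key, "missing", None))
            continue
        diff = max_abs_diff(candidate[key], ref_vec)
        results.append((key, "ok" if diff <= atol else "mismatch", diff))
    return results


def key(layer):
    # Placeholder identifiers, not real model or tokenizer names.
    return ActivationKey("model-x", "tok-x", "hello", layer, 3)


reference = {key(0): [0.5, -1.0], key(1): [2.0, 0.0], key(2): [1.0, 1.0]}
candidate = {key(0): [0.5004, -1.0], key(1): [2.5, 0.0]}

results = check_against_reference(candidate, reference)
statuses = [status for _, status, _ in results]
print(statuses)
```

a missing key is reported separately from a mismatch so that an incomplete extraction is not mistaken for numerical disagreement.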

docs and site

the docs were split into clearer lanes. the main ember page now separates engineering from research/results. the SIMD/Qwen3/Gemma page explains narrow runtime and model-family changes. the research notes remain responsible for Arabic morphology interpretation, and those claims should stay tied to actual validation artifacts.

what this does not prove

next engineering pressure

the most useful next work is boring: independent golden-logit references for Qwen3 and LLaMA, activation reference checks for one small model and one prompt, full encoder benchmark reports, compact cloud artifact pullback, and continued cleanup of stale docs that overstate what has been validated.