simd kernels, qwen 3, and gemma 4 support

ember started as a clean cpu path: row-major tensors, gguf loading, q8_0 weights kept compressed, and a backend trait around the model operations. the newer work keeps that shape. simd is not a new backend, and qwen 3 / gemma 4 are not separate applications. they are narrow additions to the same inference loop.

the important constraint is boring but useful: one prompt path, one probe path, one kv-cache contract. architecture-specific code should enter only where the model family actually differs.

what changed?	q8_0 decode got simd kernels; qwen 3 gained qwen metadata + q/k norm; gemma 4 gained a separate dense text model path.
why did it matter?	decode is dominated by quantized linear layers, and new model families need the same generation and probe tooling.
hard part	keep fast paths narrow: no new backend, no forked generation loop, no duplicated probe pipeline.
same contract	ForwardModel, q8_0 storage, gguf loading, sampling, logits dump, and probe extraction stay shared.
result	8.2x fused q8_0 decode speedup in the kernel benchmark, plus qwen 3 and dense text-only gemma 4 support through the existing cli.

simd path

q8_0 weights are stored as raw ggml blocks: two bytes of fp16 scale followed by 32 signed int8 values. the old scalar path dequantized a block into f32, then multiplied. the simd path keeps the same layout and changes only the inner loops.

runtime dispatch lives in src/simd.rs. x86-64 uses avx2 for widening and fma for dot products. aarch64 uses neon. unsupported machines take the scalar fallback. the public call sites stay in CpuBackend, so model code still just calls Linear::forward.

path	kernel	why it exists
q8_0 row dequant	avx2 / neon / scalar	turn compressed rows into f32 without changing storage
single-token decode	fused q8_0 dot product	avoid a dense temporary and sgemm overhead
prefill	block dequant + sgemm	reuse dense matrix multiply when many prompt rows share weights
elementwise ops	sum squares, add, elemul, weighted add	speed up rms norm, attention accumulation, and mlp glue

the unsafe surface is intentionally small. the backend validates tensor rank and inner dimensions before calling the q8_0 kernels; QuantizedWeight::try_new checks block alignment and byte length when external gguf data is loaded. inside the kernels, unsafe is used for target-feature intrinsics and pointer loads, not for a broader ownership trick.

benchmark

this benchmark is the core performance reason for the simd work. on an intel i5-1135G7 release build with 4096x4096 q8_0 weights, dispatching to the simd kernels drops q8_0 row dequantization from 1.50 us to 0.40 us per row, and the fused single-token decode call from 17.54 ms to 2.14 ms. that is 3.8x for raw dequantization and 8.2x for the decode kernel that generation hits every new token.

qwen 3

qwen 3 runs through the llama-family transformer in src/llama.rs. that is the right split because the block is still rms norm, separate q/k/v projections, rope, grouped query attention, and swiglu. the differences are metadata and q/k normalization, not a new execution engine.

LlamaConfig::from_gguf_metadata reads general.architecture and switches the metadata prefix to qwen3.*. it also honors explicit attention.key_length when present, instead of assuming embedding_length / head_count. per-layer attn_q_norm.weight and attn_k_norm.weight tensors are optional; when present, the attention path applies rms normalization per head after rope and before attention.

the result is that qwen 3 can use the same generation, logits dump, and probe modes as llama. the qwen-specific flag mostly says: choose the qwen tokenizer default and read qwen metadata without pretending the file is llama.

gemma 4

gemma 4 gets its own module, src/gemma4.rs, because its text model is close to the llama family but not identical. it has local and global attention layers, separate local/global rope settings, optional final-logit softcapping, and optional per-layer input embeddings. squeezing that into Llama would make both paths harder to reason about.

the supported target is dense text-only gemma 4. the loader rejects moe metadata, and the cli still keeps demo and interactive mode limited to the older gpt-2 path. generation, last-logit dump, and probe extraction use the shared ForwardModel trait.

feature	qwen 3	gemma 4
module	src/llama.rs	src/gemma4.rs
attention	llama-like attention with optional q/k norm	local/global attention mix with sliding windows
rope	shared llama-family rope path with qwen metadata keys	separate local and global rope settings
what changed	metadata prefix, explicit head dim, q/k rms norm	new block type, layer types, softcap, optional PLE
what stayed same	generation, sampling, q8_0 weights, probe extraction	generation, sampling, q8_0 weights, probe extraction
hard boundary	llama-family assumptions still apply	dense text-only; no moe or multimodal path
cli	--arch qwen3, tokenizer-qwen3.json	--arch gemma4, tokenizer-gemma4.json

why this matters

simd changes speed without changing the model contract. qwen 3 changes metadata and attention normalization without changing the llama-family core. gemma 4 changes the transformer module, but still implements the same ForwardModel surface. that is the point of the architecture: new kernels and new model families can land without rewriting generation, sampling, probing, or gguf loading from scratch.