simd kernels, qwen 3, and gemma 4 support

ember started as a clean cpu path: row-major tensors, gguf loading, q8_0 weights kept compressed, and a backend trait around the model operations. the newer work keeps that shape. simd is not a new backend, and qwen 3 / gemma 4 are not separate applications. they are narrow additions to the same inference loop.

the important constraint is boring but useful: one prompt path, one probe path, one kv-cache contract. architecture-specific code should enter only where the model family actually differs.

what changed? q8_0 decode got simd kernels; qwen 3 gained qwen metadata + q/k norm; gemma 4 gained a separate dense text model path.
why did it matter? decode is dominated by quantized linear layers, and new model families need the same generation and probe tooling.
hard part: keep fast paths narrow. no new backend, no forked generation loop, no duplicated probe pipeline.
same contract: ForwardModel, q8_0 storage, gguf loading, sampling, logits dump, and probe extraction stay shared.
result: 8.2x fused q8_0 decode speedup in the kernel benchmark, plus qwen 3 and dense text-only gemma 4 support through the existing cli.

simd path

q8_0 weights are stored as raw ggml blocks: two bytes of fp16 scale followed by 32 signed int8 values. the old scalar path dequantized a block into f32, then multiplied. the simd path keeps the same layout and changes only the inner loops.

  1. block layout: one fp16 scale value plus 32 signed int8 weights, [scale][q0 ... q31].
  2. scalar idea: turn each int8 into f32 with q * scale, then multiply by the activation value.
  3. decode path: compute x · dequant(w_row) directly from compressed bytes, accumulate into fp32, and skip the dense temporary.
  4. prefill path: dequantize output-column blocks into scratch, then run matrixmultiply::sgemm when many prompt rows reuse the same weights.
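the first three steps above can be sketched in scalar rust. this is a minimal illustration, not ember's kernel: the function names are hypothetical, the fp16 scale is assumed to be stored little-endian, and the fp16 conversion is hand-written only to keep the block free of external crates.

```rust
/// Bytes per q8_0 block: 2 bytes of fp16 scale + 32 int8 weights.
const BLOCK_BYTES: usize = 34;
/// Weights per q8_0 block.
const BLOCK_VALUES: usize = 32;

/// Convert IEEE 754 half-precision bits to f32 (zero, subnormal, inf/nan, normal).
fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h >> 15) & 1) as u32;
    let exp = ((h >> 10) & 0x1f) as u32;
    let frac = (h & 0x3ff) as u32;
    let bits = match (exp, frac) {
        (0, 0) => sign << 31,
        (0, _) => {
            // Subnormal half: value = frac * 2^-24.
            let v = frac as f32 * (1.0 / 16_777_216.0);
            return if sign == 1 { -v } else { v };
        }
        (0x1f, _) => (sign << 31) | 0x7f80_0000 | (frac << 13),
        // Rebias the exponent: 127 (f32) - 15 (f16) = 112.
        _ => (sign << 31) | ((exp + 112) << 23) | (frac << 13),
    };
    f32::from_bits(bits)
}

/// Read the scale stored in the first two bytes of a block (assumed little-endian).
fn block_scale(block: &[u8]) -> f32 {
    f16_bits_to_f32(u16::from_le_bytes([block[0], block[1]]))
}

/// Scalar idea: expand one block into 32 f32 values, each q * scale.
fn dequant_block(block: &[u8], out: &mut [f32]) {
    assert_eq!(block.len(), BLOCK_BYTES);
    assert_eq!(out.len(), BLOCK_VALUES);
    let scale = block_scale(block);
    for (o, &q) in out.iter_mut().zip(&block[2..]) {
        *o = (q as i8) as f32 * scale;
    }
}

/// Decode path: x · dequant(w_row) computed straight from the compressed bytes.
/// The scale is applied once per block, and no dense temporary is written.
fn fused_dot(row: &[u8], x: &[f32]) -> f32 {
    assert_eq!(row.len() % BLOCK_BYTES, 0, "row is not a whole number of blocks");
    assert_eq!(row.len() / BLOCK_BYTES * BLOCK_VALUES, x.len(), "inner dimension mismatch");
    let mut acc = 0.0f32;
    for (block, xs) in row.chunks_exact(BLOCK_BYTES).zip(x.chunks_exact(BLOCK_VALUES)) {
        let mut partial = 0.0f32;
        for (&q, &xv) in block[2..].iter().zip(xs) {
            partial += (q as i8) as f32 * xv;
        }
        acc += block_scale(block) * partial;
    }
    acc
}
```

factoring the scale out of the inner loop is what makes the fused form cheap: the hot loop is an int8-to-f32 widen plus a multiply-add, which is the part a simd kernel would vectorize.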

runtime dispatch lives in src/simd.rs. x86-64 uses avx2 for widening and fma for dot products. aarch64 uses neon. unsupported machines take the scalar fallback. the public call sites stay in CpuBackend, so model code still just calls Linear::forward.
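a hedged sketch of what this kind of runtime dispatch can look like, using a plain f32 dot product rather than ember's q8_0 kernels. the function names are hypothetical and the real contents of src/simd.rs are not shown in this post; only the std feature-detection macros and std::arch intrinsics are real apis.

```rust
/// Portable fallback: always available.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2_fma(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let n = a.len().min(b.len());
    let main = n - n % 8;
    // SAFETY: every 8-lane load starts at i with i + 8 <= main <= n.
    unsafe {
        let mut acc = _mm256_setzero_ps();
        for i in (0..main).step_by(8) {
            let va = _mm256_loadu_ps(a.as_ptr().add(i));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i));
            acc = _mm256_fmadd_ps(va, vb, acc); // acc += va * vb
        }
        let mut lanes = [0.0f32; 8];
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
        // Horizontal sum of the lanes, then the scalar tail.
        lanes.iter().sum::<f32>() + dot_scalar(&a[main..n], &b[main..n])
    }
}

#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon")]
unsafe fn dot_neon(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::aarch64::*;
    let n = a.len().min(b.len());
    let main = n - n % 4;
    // SAFETY: every 4-lane load starts at i with i + 4 <= main <= n.
    unsafe {
        let mut acc = vdupq_n_f32(0.0);
        for i in (0..main).step_by(4) {
            let va = vld1q_f32(a.as_ptr().add(i));
            let vb = vld1q_f32(b.as_ptr().add(i));
            acc = vfmaq_f32(acc, va, vb); // acc += va * vb
        }
        vaddvq_f32(acc) + dot_scalar(&a[main..n], &b[main..n])
    }
}

/// Single public entry point: callers never see which kernel ran.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "length mismatch");
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: both required CPU features were detected at runtime.
            return unsafe { dot_avx2_fma(a, b) };
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            // SAFETY: neon was detected at runtime.
            return unsafe { dot_neon(a, b) };
        }
    }
    dot_scalar(a, b)
}
```

the shape is the point: one safe wrapper checks lengths and cpu features, and the unsafe kernels stay private behind it.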

| path | kernel | why it exists |
| --- | --- | --- |
| q8_0 row dequant | avx2 / neon / scalar | turn compressed rows into f32 without changing storage |
| single-token decode | fused q8_0 dot product | avoid a dense temporary and sgemm overhead |
| prefill | block dequant + sgemm | reuse dense matrix multiply when many prompt rows share weights |
| elementwise ops | sum squares, add, elemul, weighted add | speed up rms norm, attention accumulation, and mlp glue |
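to make the elementwise row concrete, here is a scalar sketch of how two of those primitives compose: rms norm is a sum of squares plus a scaled multiply, and attention accumulation is a weighted add. the signatures are illustrative assumptions, not ember's; a simd version would swap the loop bodies and leave the call shape alone.

```rust
/// Sum of squares: the reduction at the heart of rms norm.
fn sum_squares(x: &[f32]) -> f32 {
    x.iter().map(|v| v * v).sum()
}

/// out[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i]
fn rms_norm(x: &[f32], weight: &[f32], eps: f32, out: &mut [f32]) {
    assert_eq!(x.len(), weight.len());
    assert_eq!(x.len(), out.len());
    let mean_sq = sum_squares(x) / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    for ((o, &xv), &w) in out.iter_mut().zip(x).zip(weight) {
        *o = xv * inv_rms * w;
    }
}

/// acc[i] += weight * v[i]: accumulate one value row scaled by its attention weight.
fn weighted_add(acc: &mut [f32], v: &[f32], weight: f32) {
    assert_eq!(acc.len(), v.len());
    for (a, &x) in acc.iter_mut().zip(v) {
        *a += weight * x;
    }
}
```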

the unsafe surface is intentionally small. the backend validates tensor rank and inner dimensions before calling the q8_0 kernels; QuantizedWeight::try_new checks block alignment and byte length when external gguf data is loaded. inside the kernels, unsafe is used for target-feature intrinsics and pointer loads, not for a broader ownership trick.
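a sketch of the kind of load-time check described above. Q8Weight and Q8Error are hypothetical stand-ins; the real QuantizedWeight::try_new signature is not shown in this post. the idea is that once the column count and byte length are validated, a kernel can index blocks without re-checking bounds.

```rust
const BLOCK_VALUES: usize = 32;
const BLOCK_BYTES: usize = 34;

#[derive(Debug, PartialEq)]
enum Q8Error {
    ColsNotBlockAligned { cols: usize },
    ByteLengthMismatch { expected: usize, actual: usize },
    SizeOverflow,
}

#[derive(Debug)]
struct Q8Weight {
    rows: usize,
    cols: usize,
    bytes: Vec<u8>,
}

impl Q8Weight {
    /// Validate externally supplied bytes before any kernel touches them.
    fn try_new(rows: usize, cols: usize, bytes: Vec<u8>) -> Result<Self, Q8Error> {
        // Each row must be a whole number of 32-value blocks.
        if cols % BLOCK_VALUES != 0 {
            return Err(Q8Error::ColsNotBlockAligned { cols });
        }
        // Checked arithmetic so hostile metadata cannot wrap the expected size.
        let expected = rows
            .checked_mul(cols / BLOCK_VALUES)
            .and_then(|blocks| blocks.checked_mul(BLOCK_BYTES))
            .ok_or(Q8Error::SizeOverflow)?;
        if bytes.len() != expected {
            return Err(Q8Error::ByteLengthMismatch { expected, actual: bytes.len() });
        }
        Ok(Self { rows, cols, bytes })
    }

    /// Compressed bytes of one row; in bounds because try_new checked the total length.
    fn row(&self, r: usize) -> &[u8] {
        assert!(r < self.rows, "row out of range");
        let stride = self.cols / BLOCK_VALUES * BLOCK_BYTES;
        &self.bytes[r * stride..(r + 1) * stride]
    }
}
```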

benchmark

this benchmark is the core performance reason for the simd work. on an intel i5-1135G7 release build with 4096x4096 q8_0 weights, dispatching to the simd kernels drops q8_0 row dequantization from 1.50 us to 0.40 us per row, and the fused single-token decode call from 17.54 ms to 2.14 ms. that is 3.8x for raw dequantization and 8.2x for the decode kernel that generation hits every new token.
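the arithmetic behind those ratios, plus one derived number, as a small check. the byte count follows from the block layout (34 bytes per 32 weights). the bandwidth figure is only a back-of-envelope estimate that assumes the fused call streams every weight byte exactly once; it is not a measured result.

```rust
/// Size in bytes of a rows x cols q8_0 matrix: 34 bytes per block of 32 weights.
fn q8_0_bytes(rows: usize, cols: usize) -> usize {
    rows * (cols / 32) * 34
}

/// Speedup as old time over new time (same units).
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}

/// Bytes moved per second, in GB/s, if `bytes` are read once in `millis` milliseconds.
fn effective_gb_per_s(bytes: usize, millis: f64) -> f64 {
    bytes as f64 / (millis * 1e-3) / 1e9
}

/// Print the derived numbers for the 4096x4096 case quoted in the text.
fn report() {
    let bytes = q8_0_bytes(4096, 4096); // 17_825_792 bytes, about 17 MiB
    println!("q8_0 weight bytes: {bytes}");
    println!("dequant speedup:   {:.2}x", speedup(1.50, 0.40));
    println!("decode speedup:    {:.2}x", speedup(17.54, 2.14));
    println!("est. bandwidth:    {:.2} GB/s", effective_gb_per_s(bytes, 2.14));
}
```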

[figure: ember q8_0 cpu kernel benchmark showing 3.8x q8_0 dequant speedup and 8.2x fused q8_0 decode speedup]
kernel benchmark, not full-model latency. full generation still pays tokenizer, attention, sampling, allocation, and cache costs, but the quantized linear layers are the expensive part of decode.

qwen 3

qwen 3 runs through the llama-family transformer in src/llama.rs. that is the right split because the block is still rms norm, separate q/k/v projections, rope, grouped query attention, and swiglu. the differences are metadata and q/k normalization, not a new execution engine.

cargo run --release -- \
  --arch qwen3 \
  --model Qwen3-0.6B-Q8_0.gguf \
  --tokenizer tokenizer-qwen3.json \
  --prompt "اكتب جملة قصيرة"

LlamaConfig::from_gguf_metadata reads general.architecture and switches the metadata prefix to qwen3.*. it also honors explicit attention.key_length when present, instead of assuming embedding_length / head_count. per-layer attn_q_norm.weight and attn_k_norm.weight tensors are optional; when present, the attention path applies rms normalization per head to q and k before rope, as the reference qwen 3 implementation does.
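a minimal sketch of those two rules, assuming metadata has already been parsed into a flat map of integer values. the helper names and the map type are hypothetical; only the gguf key shapes ({arch}.attention.key_length and friends) follow the convention the text describes. the numbers used with it below are illustrative, picked so that the explicit key length disagrees with embedding_length / head_count.

```rust
use std::collections::HashMap;

/// Resolve the attention head dimension for `arch`:
/// prefer an explicit `{arch}.attention.key_length`, else embedding_length / head_count.
fn head_dim(meta: &HashMap<String, u64>, arch: &str) -> Option<usize> {
    let get = |suffix: &str| meta.get(&format!("{arch}.{suffix}")).copied();
    if let Some(explicit) = get("attention.key_length") {
        return Some(explicit as usize);
    }
    let embd = get("embedding_length")?;
    let heads = get("attention.head_count")?;
    if heads == 0 {
        return None;
    }
    Some((embd / heads) as usize)
}

/// Apply rms norm independently to each head of a flat [n_heads * head_dim] buffer.
/// `weight` has length head_dim and is shared by every head.
fn rms_norm_heads(x: &mut [f32], weight: &[f32], eps: f32) {
    let d = weight.len();
    assert!(d > 0 && x.len() % d == 0, "buffer is not a whole number of heads");
    for head in x.chunks_exact_mut(d) {
        let mean_sq = head.iter().map(|v| v * v).sum::<f32>() / d as f32;
        let inv_rms = 1.0 / (mean_sq + eps).sqrt();
        for (v, &w) in head.iter_mut().zip(weight) {
            *v *= inv_rms * w;
        }
    }
}
```

with embedding_length = 1024 and head_count = 16, the fallback gives 64; adding key_length = 128 overrides it, which is exactly the case the explicit key exists for.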

the result is that qwen 3 can use the same generation, logits dump, and probe modes as llama. the qwen-specific flag mostly says: choose the qwen tokenizer default and read qwen metadata without pretending the file is llama.

gemma 4

gemma 4 gets its own module, src/gemma4.rs, because its text model is close to the llama family but not identical. it has local and global attention layers, separate local/global rope settings, optional final-logit softcapping, and optional per-layer input embeddings. squeezing that into Llama would make both paths harder to reason about.
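a sketch of the pieces that make the block different, under stated assumptions: the type names are hypothetical, the window convention (current token plus the previous window - 1) is one common choice rather than a confirmed detail of src/gemma4.rs, and the softcap uses the usual cap * tanh(x / cap) form.

```rust
/// Per-layer attention flavor.
#[derive(Clone, Copy, Debug, PartialEq)]
enum LayerKind {
    /// Sliding-window attention over the most recent `window` positions.
    Local { window: usize },
    /// Full causal attention.
    Global,
}

/// Separate rope bases for the two layer kinds.
struct RopeSettings {
    local_base: f32,
    global_base: f32,
}

impl RopeSettings {
    fn base_for(&self, kind: LayerKind) -> f32 {
        match kind {
            LayerKind::Local { .. } => self.local_base,
            LayerKind::Global => self.global_base,
        }
    }
}

/// Key positions a query at `pos` may attend to (causal in both cases).
fn visible_keys(kind: LayerKind, pos: usize) -> std::ops::RangeInclusive<usize> {
    match kind {
        LayerKind::Global => 0..=pos,
        LayerKind::Local { window } => (pos + 1).saturating_sub(window)..=pos,
    }
}

/// Optional final-logit softcapping: squashes logits smoothly into (-cap, cap).
fn softcap(logits: &mut [f32], cap: Option<f32>) {
    if let Some(c) = cap {
        for l in logits.iter_mut() {
            *l = c * (*l / c).tanh();
        }
    }
}
```

each of these is a per-layer or per-model switch, which is why threading them through the llama path would mean a conditional at nearly every step of the block.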

cargo run --release -- \
  --arch gemma4 \
  --model models/gemma-4-E2B-it.Q8_0.gguf \
  --tokenizer tokenizer-gemma4.json \
  --prompt "اكتب جملة قصيرة"

the supported target is dense text-only gemma 4. the loader rejects moe metadata, and the cli still keeps demo and interactive mode limited to the older gpt-2 path. generation, last-logit dump, and probe extraction use the shared ForwardModel trait.
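to show why a shared trait keeps the generation loop single, here is a toy version. the trait below is a hypothetical reduction; ember's actual ForwardModel signature is not given in this post. two stand-in models implement it, and one greedy loop drives both.

```rust
/// Hypothetical minimal contract: feed one token at `pos`, get next-token logits back.
/// A real implementation would update its kv cache inside `forward`.
trait ForwardModel {
    fn forward(&mut self, token: u32, pos: usize) -> Vec<f32>;
}

/// Index of the largest logit (first one wins on ties).
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// One generation loop for every model family: prefill the prompt, then decode greedily.
fn generate<M: ForwardModel>(model: &mut M, prompt: &[u32], max_new: usize) -> Vec<u32> {
    let mut out = prompt.to_vec();
    let mut logits = Vec::new();
    for (pos, &tok) in prompt.iter().enumerate() {
        logits = model.forward(tok, pos);
    }
    for _ in 0..max_new {
        if logits.is_empty() {
            break; // empty prompt: nothing to condition on
        }
        let next = argmax(&logits);
        out.push(next);
        logits = model.forward(next, out.len() - 1);
    }
    out
}

/// Toy model A: always predicts (token + 1) mod vocab.
struct Counter {
    vocab: usize,
}

impl ForwardModel for Counter {
    fn forward(&mut self, token: u32, _pos: usize) -> Vec<f32> {
        let mut logits = vec![0.0; self.vocab];
        logits[(token as usize + 1) % self.vocab] = 1.0;
        logits
    }
}

/// Toy model B: always predicts the token it was just given.
struct Echo {
    vocab: usize,
}

impl ForwardModel for Echo {
    fn forward(&mut self, token: u32, _pos: usize) -> Vec<f32> {
        let mut logits = vec![0.0; self.vocab];
        logits[token as usize % self.vocab] = 1.0;
        logits
    }
}
```

generate never names a concrete model, so adding a family means adding an impl, not a second loop.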

| feature | qwen 3 | gemma 4 |
| --- | --- | --- |
| module | src/llama.rs | src/gemma4.rs |
| attention | llama-like attention with optional q/k norm | local/global attention mix with sliding windows |
| rope | shared llama-family rope path with qwen metadata keys | separate local and global rope settings |
| what changed | metadata prefix, explicit head dim, q/k rms norm | new block type, layer types, softcap, optional PLE |
| what stayed same | generation, sampling, q8_0 weights, probe extraction | generation, sampling, q8_0 weights, probe extraction |
| hard boundary | llama-family assumptions still apply | dense text-only; no moe or multimodal path |
| cli | --arch qwen3, tokenizer-qwen3.json | --arch gemma4, tokenizer-gemma4.json |

why this matters

simd changes speed without changing the model contract. qwen 3 changes metadata and attention normalization without changing the llama-family core. gemma 4 changes the transformer module, but still implements the same ForwardModel surface. that is the point of the architecture: new kernels and new model families can land without rewriting generation, sampling, probing, or gguf loading from scratch.