← voidwest    research    engineering

ember internals

this is a line-by-line walkthrough of the data structures, algorithms, and memory layouts that make ember work. it assumes you've read the portal page and want the raw details.

architecture

the project is intentionally small enough to follow from entry point to logits. main.rs owns CLI dispatch, generation, logits dump, and probe modes. model families implement ForwardModel. compute operations go through the Backend trait, with CpuBackend as the current runtime.

main.rsentry point, cli args, generation, logits dump, probe modes
├─ loader.rsgguf v3 parser, metadata, tensor loading
├─ model.rsgpt-2 blocks and ForwardModel trait
├─ llama.rsllama/qwen-family transformer path
├─ gemma4.rsdense text-only gemma 4 path
├─ backend.rsbackend trait, cpu backend, q8_0 and attention helpers
├─ tensor.rsrow-major f32 tensor operations
├─ kv_cache.rsflat k/v cache and scratch buffers
└─ probes/python probing, validation, and report scripts

design decisions

backend traitmodel code talks to a bounded compute interface instead of scattering CPU details through every architecture.
row-major tensorsrows are contiguous, so embeddings and output rows are direct slices; column slices are explicit gathers.
gguf mmapmodel files are mapped and tensors are copied/dequantized into explicit runtime buffers.
flat kv cachesingle allocation with fixed stride math: [layer][head][pos][head_dim].
artifacts over claimsprobe, benchmark, and validation scripts write machine-readable summaries before interpretation.

math primitives

the low-level math is ordinary but unforgiving: softmax needs an all-masked-row fallback, layer norm and RMS norm must write their result back into the block stream, and activation functions must be numerically stable enough not to silently deform logits.

primitivewhere it mattersfailure mode
softmaxattention and samplingall -inf rows produce NaN without a fallback
GELU / SiLUMLP blockswrong activation changes logits without crashing
layer norm / RMS normbefore attention and MLP blocksforgetting to bind the returned tensor leaves stale activations
RoPEllama/qwen/gemma attentionposition convention mistakes show up as plausible but wrong text

attention

attention is split into q/k/v projections, RoPE or positional handling, causal masking, softmax, and weighted value accumulation. the cache turns decode from recomputing the whole sequence every token into a prefill pass plus one cached step per generated token.

bugs and first light

the most useful early failures were not crashes. they were coherent enough to mislead: scrambled GGUF layouts produced nonsense text, and the first KV-cache prefill bug let the model attend only to the final prompt token.

column-major gguf layout
q8_0 and f16 tensors needed layout handling before runtime matmul. after fixing the loader/transpose path, correlation with a PyTorch reference went from near-noise to near-identical.
kv cache prefill overwrite
cache writes originally reused the same cursor position during prefill. passing the explicit absolute token position fixed prompt-token retention.

first coherent output was still a smoke result, not full validation:

The cheese snowball Cheese Cheese Cheese Cheese Cheese Cheese Cheese Cheese Cheese Cheese Cheese Cheese Cheese Cheese Cheese Cheese
ember v0.1, greedy GPT-2 output after the column-major layout fix

tensor

CpuTensor is the only data structure every other component touches. three fields:

pub struct CpuTensor {
    pub shape: Vec<usize>,      // e.g. [12, 64, 768]
    pub strides: Vec<usize>,    // contiguous row-major strides
    pub data: Vec<f32>,         // flat f32 buffer
}

strides are computed but never used for indexing. the compute_strides function builds a standard row-major stride array (e.g. [768, 1] for shape [64, 768]), but every access goes through direct offset math like r * cols + c. the strides exist purely for get(&[usize]), the multi-dimensional indexer used only in testing.

every operation allocates. no views, no mutation in place. add() returns a new CpuTensor. softmax() returns a new CpuTensor. matmul() returns a new CpuTensor. the allocator sees a stream of identical-sized allocations during decode (token count is constant), so jemalloc or the system allocator reuses the same slab. this is not accidental, it's the reason the hot path can get away with per-op allocations without profiling as a malloc storm.

#[must_use] on every pure op. if you write x.softmax(); without binding the result, the allocation drops silently and the model runs on stale data. the attribute makes that a compile error. it caught the bug of forgetting to assign layer norm output back to x in the transformer block loop, a bug that produces no panic, no NaN, just progressively degraded text.

matmul: matrixmultiply::sgemm

the matmul delegates to bluss's matrixmultiply crate, a pure-rust sgemm with no blas linking. the call site:

unsafe {
    matrixmultiply::sgemm(
        m, k, n,                // dimensions
        1.0,                    // alpha
        a.as_ptr(), k as isize, 1,   // A: m×k, col stride = k, row stride = 1
        b.as_ptr(), n as isize, 1,   // B: k×n, col stride = n, row stride = 1
        0.0,                    // beta
        c.as_mut_ptr(), n as isize, 1, // C: m×n, row-major
    );
}

matrixmultiply is scalar, so this is the throughput bottleneck. the Backend trait is the insertion point for simd: implement SimdBackend with the same trait, swap the backend type, and every matmul in the model uses the new kernel with zero model code changes.

backend trait

the Backend trait is the abstraction that decouples model code from hardware. every operation that touches tensor data goes through a &B reference:

pub trait Backend {
    type Tensor: Clone + Send + Sync;
    type Error: core::error::Error;

    fn zeroes(&self, shape: &[usize]) -> Result<Self::Tensor, Self::Error>;
    fn matmul(&self, a: &Self::Tensor, b: &Self::Tensor) -> Result<Self::Tensor, Self::Error>;
    fn matmul_q8_0(&self, x: &Self::Tensor, w: &QuantizedWeight)
        -> Result<Self::Tensor, Self::Error>;
    fn add(&self, a: &Self::Tensor, b: &Self::Tensor) -> Result<Self::Tensor, Self::Error>;
    fn softmax(&self, x: &Self::Tensor) -> Result<Self::Tensor, Self::Error>;
    fn gelu(&self, x: &Self::Tensor) -> Result<Self::Tensor, Self::Error>;
    fn layer_norm(&self, x: &Self::Tensor, w: &Self::Tensor, b: &Self::Tensor, eps: f32)
        -> Result<Self::Tensor, Self::Error>;
    fn rms_norm(&self, x: &Self::Tensor, w: &Self::Tensor, eps: f32)
        -> Result<Self::Tensor, Self::Error>;
    fn silu(&self, x: &Self::Tensor) -> Result<Self::Tensor, Self::Error>;
    fn elemul(&self, a: &Self::Tensor, b: &Self::Tensor)
        -> Result<Self::Tensor, Self::Error>;
    fn apply_rotary_emb(&self, x: &Self::Tensor, cos: &Self::Tensor,
        sin: &Self::Tensor, start_pos: usize)
        -> Result<Self::Tensor, Self::Error>;
    fn index_select(&self, t: &Self::Tensor, index: usize) -> Result<Self::Tensor, Self::Error>;
    fn assign_row(&self, dst: &mut Self::Tensor, index: usize, src: &Self::Tensor);
    fn slice_cols(&self, x: &Self::Tensor, start: usize, end: usize) -> Self::Tensor;
    fn shape<'a>(&self, x: &'a Self::Tensor) -> &'a [usize];
    fn data<'a>(&self, x: &'a Self::Tensor) -> &'a [f32];
    fn load_from_cpu(&self, data: Vec<f32>, shape: &[usize])
        -> Result<Self::Tensor, Self::Error>;
    fn add_broadcast(&self, x: &Self::Tensor, bias: &Self::Tensor)
        -> Result<Self::Tensor, Self::Error>;
}

CpuBackend is a zero-sized struct that delegates every method to CpuTensor. CpuError wraps TensorError for shape mismatches and out-of-bounds indexing. the trait methods are designed to be implementable on any hardware: gpu, simd, or accelerator. the five llama-family primitives (rms_norm, silu, elemul, apply_rotary_emb, matmul_q8_0) sit alongside the gpt-2 primitives on the trait, each with a CpuBackend implementation.

kv cache

layout

k: Vec<f32> // flat [layer][head][pos][head_dim] v: Vec<f32> // same layout offset math: layer_offset = layer * n_heads * max_seq_len * head_dim head_offset = head * max_seq_len * head_dim pos_offset = pos * head_dim absolute = layer_offset + head_offset + pos_offset

why this layout. cache.get(layer) returns a contiguous &[f32] slice of all heads and positions for that layer. the attention loop indexes into it with fixed-stride math, no pointer chasing, no nested vecs, single allocation. for gpt-2 small at 2048 context: 12 × 12 × 2048 × 64 × 4 × 2 = ~72 MB.

cursor semantics

the cursor is the write position in the sequence dimension. it is not advanced per-layer. each layer reads cache.cursor() to know where to store, but the cursor only advances after all layers finish in Gpt2::forward_with_cache:

for (layer, block) in self.blocks.iter().enumerate() {
    x = block.forward_with_cache(backend, &x, cache, layer)?;
}
// cursor advances here, after all 12 layers have stored.
for _ in 0..seq_len {
    cache.advance_cursor();
}

this is the design that caused bug #5. during prefill, the attention loop called append for each prompt token, every call landed at cursor = 0, overwriting the previous. only the final token's k/v survived. the fix passes an explicit pos to append, computed as cache.cursor() + token_index in the attention forward pass.

qk_scratch

a separate Vec<f32> pre-allocated to max_seq_len (2048). reused across every head and every token in a decode step:

pub fn qk_scratch_mut(&mut self) -> &mut Vec<f32> {
    &mut self.qk_scratch
}

the caller does:

let scratch = cache.qk_scratch_mut();
scratch.clear();
scratch.resize(total_seq_len, f32::NEG_INFINITY);

// fill scratch[pos] = dot(q, k[pos]) for pos in 0..total_seq_len
// softmax in-place on scratch
// weight sum of v[pos] by scratch[pos]

because scratch was allocated to max_seq_len, the resize never re-allocates as long as total_seq_len ≤ max_seq_len. this eliminates 144 small heap allocations per generated token (12 layers × 12 heads).

gguf format & quantization

file structure

offset field 0x00 magic: 0x47 0x55 0x46 0x47 ("GGUF") 0x04 version: u32 (3) 0x08 n_tensors: u64 0x10 n_metadata: u64 0x18 metadata kv pairs (n_metadata entries) ... tensor info table (n_tensors entries) ... tensor data (raw bytes at offsets from info table)

the loader walks the metadata to find model hyperparameters (gpt2.block_count, gpt2.context_length, gpt2.embedding_length, etc.), then reads the tensor info table to get each weight's name, shape, dtype, and byte offset. weights are matched by hardcoded gpt-2 tensor names ("tok_embeddings.weight", "blk.{i}.attn.output.weight", etc.).

q8_0 block quantization

each block encodes 32 f32 values into 34 bytes:

bytes 0-1: d (fp16 scale factor) bytes 2-33: q[0..31] (int8 quantized values) reconstruction: dst[j] = q[j] as f32 * d.to_f32()

the dequantization loop:

for i in 0..n_blocks {
    let d_bits = u16::from_le_bytes(src[block_start..block_start + 2]);
    let d = f16::from_bits(d_bits).to_f32();

    for j in 0..Q8_0_BLOCK_SIZE {
        let q = src[block_start + 2 + j] as i8;  // signed int8
        dst[out_start + j] = q as f32 * d;
    }
}

the int8 values are signed (i8), range [-128, 127]. most gguf implementations use signed int8 for q8_0. the fp16 scale means the representable range per block is roughly [-128 × d, 127 × d] where d varies per block. for gpt-2 weights, typical scales are in the range 0.001-0.05, giving roughly ±0.1 to ±6.4 per weight element.

the column-major trap

gguf stores q8_0 and f16 tensors in column-major order. the q8_0 block size is 32, and the innermost dimension must be a multiple of 32. for a weight matrix shaped [768, 50257], the column dimension (50257) is the block axis, it's padded to a multiple of 32 on disk. the raw bytes are laid out: all 768 rows of column 0, then all 768 rows of column 1, etc.

the tensor info header reports the logical shape [768, 50257] in row-major convention, but the dequantized buffer is column-major. the loader originally called reshape(&[768, 50257]), which assumes row-major, producing a transposed weight matrix. the fix is one line in the loader: reverse dims before calling reshape. after dequantizing or converting f16, the loader does:

// dims = [rows, cols] from the tensor info header
// but data is column-major, so reverse to match:
dims.reverse();
let tensor = CpuTensor::from_data(dims, flat_f32_data);

then in Gpt2::from_loader, linear layer weights are transposed back to restore [in_features, out_features] for matmul. embeddings skip the transpose since the dim reversal already makes index_select pick contiguous rows.

ForwardModel trait

ForwardModel<B: Backend> is the trait that unifies gpt-2 and llama under a single inference interface. both Gpt2<B> and Llama<B> implement it:

pub trait ForwardModel<B: Backend> {
    fn create_cache(&self, backend: &B, max_seq_len: usize) -> KVCache;
    fn forward_with_cache(
        &self, backend: &B, token_ids: &[u32],
        cache: &mut KVCache, start_pos: usize,
    ) -> Result<B::Tensor, B::Error>;
    fn n_layers(&self) -> usize;
    fn embed_dim(&self) -> usize;
    fn forward_with_activations(
        &self, backend: &B, token_ids: &[u32],
    ) -> Result<(Vec<Vec<f32>>, B::Tensor), B::Error>;
}

forward_with_cache is the standard inference path: embed tokens, run through transformer blocks with kv cache, project to logits. forward_with_activations is the probing entry point — same forward pass, but it collects hidden states after every transformer block at the last token position instead of discarding them. this is what --probe mode calls to extract activations for downstream analysis.

llama architecture specifics

llama-family models differ from gpt-2 in four structural ways. each required a new primitive on the Backend trait.

rotary position embeddings (RoPE)

instead of learned position embeddings, llama encodes position by rotating pairs of dimensions in the query and key vectors. the rotation angle for dimension pair 2i, 2i+1 at position p is p / theta^(2i/d) where theta is typically 500,000.0 for llama 3.

cos/sin tables are precomputed once in Llama::from_loader via compute_rope_freqs(max_seq_len, head_dim, theta_base) and cloned into each LlamaAttention layer. at inference time, apply_rotary_emb rotates the q and k tensors in-place using these tables. the precomputation means RoPE costs one 2d-subspace rotation per dimension pair per token — no trigonometry in the hot path.

RMS norm

rms_norm normalizes activations by the root mean square without centering:

rms = sqrt(mean(x²) + eps)
y[i] = x[i] * weight[i] / rms

layer norm subtracts the mean before dividing by standard deviation; RMS norm skips the mean subtraction. in deep networks, the mean of activations is approximately zero anyway, so the subtraction is redundant. dropping it saves one subtraction per element and one reduction pass. llama uses RMS norm before attention and before the mlp in every block, and as the final output norm.

SwiGLU mlp

llama's feed-forward network replaces gpt-2's c_fc → gelu → c_proj with a gated variant:

gate = silu(gate_proj(x))    // swish activation on gate
up   = up_proj(x)           // linear projection
y    = gate ⊙ up            // element-wise multiply
out  = down_proj(y)         // final projection

the element-wise product (elemul) is the new primitive. SwiGLU uses two weight matrices (gate_proj and up_proj) instead of one (c_fc), roughly doubling the mlp parameter count for the same hidden size. llama compensates by using an intermediate size of 8/3 * hidden_dim instead of gpt-2's 4 * hidden_dim.

grouped-query attention (GQA)

llama uses fewer kv heads than query heads to reduce cache memory. llama 3.2 1B uses 32 q heads and 8 kv heads — four q heads share each kv pair. the KVCache stores only n_kv_heads k/v per layer, not n_heads.

during attention, kv heads are repeated to match the query head count:

let n_repeat = n_heads / n_kv_heads;  // 4 for llama 3.2 1B
let kv_h = h / n_repeat;              // which kv head does query head h use?
// read k[kv_h], v[kv_h] from cache
// compute dot(q[h], k[kv_h]) as usual

for gpt-2, n_heads == n_kv_heads so this is a zero-behavior-change path. for llama, the cache uses 75% less memory (8 kv heads vs 32 query heads), enabling longer context windows for the same ram budget. the qk_scratch buffer size is unchanged since qk scores are per query head.

probing pipeline

probe mode (--probe) is a research feature that extracts hidden states from every transformer block. the flow:

stimuli json (e.g. nonce_root_pattern.json) │ ▼ tokenize each stimulus model.forward_with_activations(token_ids) │ ▼ collect per-layer activations at last token position save as .npy (shape: [n_stimuli, n_layers, embed_dim]) │ ▼ python probe scripts train_linear_probe.py → logistic regression classifier cca_analysis.py → canonical correlation with linguistic features rsa_analysis.py → representational similarity matrices divergence_analysis.py → layer-wise js divergence

forward_with_activations runs the same forward pass as inference but collects hidden states after each block at the last token position — shape [embed_dim] per layer. this is the standard probing setup: train a simple classifier on these representations and test whether the model encodes specific linguistic properties (e.g. morphological features) at particular layers.

sampler

the sampling pipeline is five stages applied in order:

1. temperature

if temperature > 0.0 {
    for l in &mut logits { *l /= temperature; }
}

temperature = 0 means greedy argmax, the pipeline skips temperature scaling and goes straight to argmax in categorical_sample. any positive value divides logits: t < 1.0 sharpens (peaks get higher relative weight), t > 1.0 flattens (more uniform).

2. top-k

let mut indexed: Vec<(usize, f32)> = logits.iter().cloned().enumerate().collect();
indexed.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(Equal));

let threshold = indexed[k - 1].1;
for l in logits.iter_mut() {
    if *l < threshold { *l = f32::NEG_INFINITY; }
}

sorts a copy of (index, logit) pairs descending, reads the k-th largest value, masks everything below it. O(V log V) where V = vocab size (50257 for gpt-2). a no-op when k ≥ V or k = 0.

3. top-p (nucleus)

let soft = softmax_1d(logits);              // temporary distribution
let cutoff = nucleus_cutoff(&soft, p);     // threshold value
for (i, s) in soft.iter().enumerate() {
    if *s < cutoff { logits[i] = f32::NEG_INFINITY; }
}

computes a temporary softmax to find the smallest set of tokens whose cumulative probability exceeds p. the cutoff is the smallest probability value in that set. the final softmax is computed once by the caller (sample_token) after all filters run, avoiding double-softmax.

4. softmax

let max = logits.iter().fold(f32::NEG_INFINITY, |a, &b| a.max(b));
if max == f32::NEG_INFINITY {
    return vec![1.0 / logits.len() as f32; logits.len()];
}
let exps: Vec<f32> = logits.iter().map(|x| (x - max).exp()).collect();
let sum: f32 = exps.iter().sum();
exps.iter().map(|x| x / sum).collect()

the max trick prevents overflow: subtracting max from every logit ensures the largest exponentiated value is exp(0) = 1. the all-masked fallback returns a uniform distribution.

5. categorical sample (inverse cdf)

let r: f32 = rng.gen();  // uniform in [0, 1)
let mut cum = 0.0;
for (i, &p) in dist.iter().enumerate() {
    cum += p;
    if r < cum { return i; }
}
// floating-point rounding fallback: argmax
dist.iter().enumerate()
    .max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
    .map(|(i, _)| i).unwrap_or(0)

walks the cumulative sum of probabilities. the first index where the running sum exceeds the random draw is the sampled token.

profiling notes

wall-clock breakdown of a single decode step (gpt-2 small, x86_64, release build). measurements from std::time::Instant with 1000-sample microbenchmark:

decode step total (~84 ms) ──────────────────────────────────── matmul (12 layers × 4 matmuls): ~55 ms (66%) c_attn (768→2304): ~14 ms c_proj (768→768): ~6 ms c_fc (768→3072): ~19 ms c_proj (3072→768): ~16 ms attention (scalar loops): ~12 ms (14%) gelu (12×3072 elements): ~5 ms (6%) layer_norm (12×2×768 elements): ~4 ms (5%) softmax (12 heads × 64 dim): ~3 ms (4%) other (adds, copies, sampling): ~4 ms (5%)

matmul dominates. the c_fc and output projection are the two heaviest matmuls, 768×3072 and 3072×768 respectively, each 2.36 million multiply-adds. simd would drop this to roughly 8-10 ms, pushing throughput toward 40-50 tok/s.

edge cases worth knowing

all-masked softmax rows. the causal mask sets future positions to -inf. on the first decode step (only one token in the sequence), every row after the first is entirely -inf. standard softmax produces NaN ((-inf - -inf).exp() evaluates to NaN per ieee 754). one branch, if max == f32::NEG_INFINITY { return uniform }, prevents NaN from propagating through 12 layers into the output logits.

temperature = 0. the sampler checks if temperature > 0.0 and skips scaling. in categorical_sample, the softmax distribution for sharp logits concentrates all mass on the argmax token, and inverse cdf sampling selects it deterministically. this is how --temperature 0 produces identical output every run (given the same rng seed).

scratch buffer resize. the qk_scratch buffer is allocated to max_seq_len (2048). during decode at position 47, the caller resizes it to 48. because 48 ≤ 2048, the resize is a no-op, capacity was set at construction time. this is a deliberate invariant: the hot path never calls the global allocator.

transposed embeddings. the gguf file stores token embeddings as [vocab, embed] (50257 × 768). the loader transposes them to [embed, vocab] (768 × 50257) during model construction. this means index_select(token_id) picks a contiguous 768-element slice rather than gathering 768 elements strided by vocab_size. cost: one transpose at startup. benefit: every inference step does a memcpy instead of a gather.