this is a line-by-line walkthrough of the data structures, algorithms, and memory layouts that make ember work. it assumes you've read the portal page and want the raw details.
the project is intentionally small enough to follow from entry point to logits. main.rs owns CLI dispatch, generation, logits dump, and probe modes. model families implement ForwardModel. compute operations go through the Backend trait, with CpuBackend as the current runtime.
| backend trait | model code talks to a bounded compute interface instead of scattering CPU details through every architecture. |
| row-major tensors | rows are contiguous, so embeddings and output rows are direct slices; column slices are explicit gathers. |
| gguf mmap | model files are mapped and tensors are copied/dequantized into explicit runtime buffers. |
| flat kv cache | single allocation with fixed stride math: [layer][head][pos][head_dim]. |
| artifacts over claims | probe, benchmark, and validation scripts write machine-readable summaries before interpretation. |
the low-level math is ordinary but unforgiving: softmax needs an all-masked-row fallback, layer norm and RMS norm must write their result back into the block stream, and activation functions must be numerically stable enough not to silently deform logits.
| primitive | where it matters | failure mode |
|---|---|---|
| softmax | attention and sampling | all -inf rows produce NaN without a fallback |
| GELU / SiLU | MLP blocks | wrong activation changes logits without crashing |
| layer norm / RMS norm | before attention and MLP blocks | forgetting to bind the returned tensor leaves stale activations |
| RoPE | llama/qwen/gemma attention | position convention mistakes show up as plausible but wrong text |
attention is split into q/k/v projections, RoPE or positional handling, causal masking, softmax, and weighted value accumulation. the cache turns decode from recomputing the whole sequence every token into a prefill pass plus one cached step per generated token.
the most useful early failures were not crashes. they were coherent enough to mislead: scrambled GGUF layouts produced nonsense text, and the first KV-cache prefill bug let the model attend only to the final prompt token.
first coherent output was still a smoke result, not full validation:
CpuTensor is the only data structure every other component touches. three fields:
pub struct CpuTensor {
pub shape: Vec<usize>, // e.g. [12, 64, 768]
pub strides: Vec<usize>, // contiguous row-major strides
pub data: Vec<f32>, // flat f32 buffer
}
strides are computed but never used for indexing. the compute_strides function builds a standard row-major stride array (e.g. [768, 1] for shape [64, 768]), but every access goes through direct offset math like r * cols + c. the strides exist purely for get(&[usize]), the multi-dimensional indexer used only in testing.
every operation allocates. no views, no mutation in place. add() returns a new CpuTensor. softmax() returns a new CpuTensor. matmul() returns a new CpuTensor. the allocator sees a stream of identical-sized allocations during decode (token count is constant), so jemalloc or the system allocator reuses the same slab. this is not accidental, it's the reason the hot path can get away with per-op allocations without profiling as a malloc storm.
#[must_use] on every pure op. if you write x.softmax(); without binding the result, the allocation drops silently and the model runs on stale data. the attribute makes that a compile error. it caught the bug of forgetting to assign layer norm output back to x in the transformer block loop, a bug that produces no panic, no NaN, just progressively degraded text.
the matmul delegates to bluss's matrixmultiply crate, a pure-rust sgemm with no blas linking. the call site:
unsafe {
matrixmultiply::sgemm(
m, k, n, // dimensions
1.0, // alpha
a.as_ptr(), k as isize, 1, // A: m×k, col stride = k, row stride = 1
b.as_ptr(), n as isize, 1, // B: k×n, col stride = n, row stride = 1
0.0, // beta
c.as_mut_ptr(), n as isize, 1, // C: m×n, row-major
);
}
matrixmultiply is scalar, so this is the throughput bottleneck. the Backend trait is the insertion point for simd: implement SimdBackend with the same trait, swap the backend type, and every matmul in the model uses the new kernel with zero model code changes.
the Backend trait is the abstraction that decouples model code from hardware. every operation that touches tensor data goes through a &B reference:
pub trait Backend {
type Tensor: Clone + Send + Sync;
type Error: core::error::Error;
fn zeroes(&self, shape: &[usize]) -> Result<Self::Tensor, Self::Error>;
fn matmul(&self, a: &Self::Tensor, b: &Self::Tensor) -> Result<Self::Tensor, Self::Error>;
fn matmul_q8_0(&self, x: &Self::Tensor, w: &QuantizedWeight)
-> Result<Self::Tensor, Self::Error>;
fn add(&self, a: &Self::Tensor, b: &Self::Tensor) -> Result<Self::Tensor, Self::Error>;
fn softmax(&self, x: &Self::Tensor) -> Result<Self::Tensor, Self::Error>;
fn gelu(&self, x: &Self::Tensor) -> Result<Self::Tensor, Self::Error>;
fn layer_norm(&self, x: &Self::Tensor, w: &Self::Tensor, b: &Self::Tensor, eps: f32)
-> Result<Self::Tensor, Self::Error>;
fn rms_norm(&self, x: &Self::Tensor, w: &Self::Tensor, eps: f32)
-> Result<Self::Tensor, Self::Error>;
fn silu(&self, x: &Self::Tensor) -> Result<Self::Tensor, Self::Error>;
fn elemul(&self, a: &Self::Tensor, b: &Self::Tensor)
-> Result<Self::Tensor, Self::Error>;
fn apply_rotary_emb(&self, x: &Self::Tensor, cos: &Self::Tensor,
sin: &Self::Tensor, start_pos: usize)
-> Result<Self::Tensor, Self::Error>;
fn index_select(&self, t: &Self::Tensor, index: usize) -> Result<Self::Tensor, Self::Error>;
fn assign_row(&self, dst: &mut Self::Tensor, index: usize, src: &Self::Tensor);
fn slice_cols(&self, x: &Self::Tensor, start: usize, end: usize) -> Self::Tensor;
fn shape<'a>(&self, x: &'a Self::Tensor) -> &'a [usize];
fn data<'a>(&self, x: &'a Self::Tensor) -> &'a [f32];
fn load_from_cpu(&self, data: Vec<f32>, shape: &[usize])
-> Result<Self::Tensor, Self::Error>;
fn add_broadcast(&self, x: &Self::Tensor, bias: &Self::Tensor)
-> Result<Self::Tensor, Self::Error>;
}
CpuBackend is a zero-sized struct that delegates every method to CpuTensor. CpuError wraps TensorError for shape mismatches and out-of-bounds indexing. the trait methods are designed to be implementable on any hardware: gpu, simd, or accelerator. the five llama-family primitives (rms_norm, silu, elemul, apply_rotary_emb, matmul_q8_0) sit alongside the gpt-2 primitives on the trait, each with a CpuBackend implementation.
why this layout. cache.get(layer) returns a contiguous &[f32] slice of all heads and positions for that layer. the attention loop indexes into it with fixed-stride math, no pointer chasing, no nested vecs, single allocation. for gpt-2 small at 2048 context: 12 × 12 × 2048 × 64 × 4 × 2 = ~72 MB.
the cursor is the write position in the sequence dimension. it is not advanced per-layer. each layer reads cache.cursor() to know where to store, but the cursor only advances after all layers finish in Gpt2::forward_with_cache:
for (layer, block) in self.blocks.iter().enumerate() {
x = block.forward_with_cache(backend, &x, cache, layer)?;
}
// cursor advances here, after all 12 layers have stored.
for _ in 0..seq_len {
cache.advance_cursor();
}
this is the design that caused bug #5. during prefill, the attention loop called append for each prompt token, every call landed at cursor = 0, overwriting the previous. only the final token's k/v survived. the fix passes an explicit pos to append, computed as cache.cursor() + token_index in the attention forward pass.
a separate Vec<f32> pre-allocated to max_seq_len (2048). reused across every head and every token in a decode step:
pub fn qk_scratch_mut(&mut self) -> &mut Vec<f32> {
&mut self.qk_scratch
}
the caller does:
let scratch = cache.qk_scratch_mut();
scratch.clear();
scratch.resize(total_seq_len, f32::NEG_INFINITY);
// fill scratch[pos] = dot(q, k[pos]) for pos in 0..total_seq_len
// softmax in-place on scratch
// weight sum of v[pos] by scratch[pos]
because scratch was allocated to max_seq_len, the resize never re-allocates as long as total_seq_len ≤ max_seq_len. this eliminates 144 small heap allocations per generated token (12 layers × 12 heads).
the loader walks the metadata to find model hyperparameters (gpt2.block_count, gpt2.context_length, gpt2.embedding_length, etc.), then reads the tensor info table to get each weight's name, shape, dtype, and byte offset. weights are matched by hardcoded gpt-2 tensor names ("tok_embeddings.weight", "blk.{i}.attn.output.weight", etc.).
each block encodes 32 f32 values into 34 bytes:
the dequantization loop:
for i in 0..n_blocks {
let d_bits = u16::from_le_bytes(src[block_start..block_start + 2]);
let d = f16::from_bits(d_bits).to_f32();
for j in 0..Q8_0_BLOCK_SIZE {
let q = src[block_start + 2 + j] as i8; // signed int8
dst[out_start + j] = q as f32 * d;
}
}
the int8 values are signed (i8), range [-128, 127]. most gguf implementations use signed int8 for q8_0. the fp16 scale means the representable range per block is roughly [-128 × d, 127 × d] where d varies per block. for gpt-2 weights, typical scales are in the range 0.001-0.05, giving roughly ±0.1 to ±6.4 per weight element.
gguf stores q8_0 and f16 tensors in column-major order. the q8_0 block size is 32, and the innermost dimension must be a multiple of 32. for a weight matrix shaped [768, 50257], the column dimension (50257) is the block axis, it's padded to a multiple of 32 on disk. the raw bytes are laid out: all 768 rows of column 0, then all 768 rows of column 1, etc.
the tensor info header reports the logical shape [768, 50257] in row-major convention, but the dequantized buffer is column-major. the loader originally called reshape(&[768, 50257]), which assumes row-major, producing a transposed weight matrix. the fix is one line in the loader: reverse dims before calling reshape. after dequantizing or converting f16, the loader does:
// dims = [rows, cols] from the tensor info header
// but data is column-major, so reverse to match:
dims.reverse();
let tensor = CpuTensor::from_data(dims, flat_f32_data);
then in Gpt2::from_loader, linear layer weights are transposed back to restore [in_features, out_features] for matmul. embeddings skip the transpose since the dim reversal already makes index_select pick contiguous rows.
ForwardModel<B: Backend> is the trait that unifies gpt-2 and llama under a single inference interface. both Gpt2<B> and Llama<B> implement it:
pub trait ForwardModel<B: Backend> {
fn create_cache(&self, backend: &B, max_seq_len: usize) -> KVCache;
fn forward_with_cache(
&self, backend: &B, token_ids: &[u32],
cache: &mut KVCache, start_pos: usize,
) -> Result<B::Tensor, B::Error>;
fn n_layers(&self) -> usize;
fn embed_dim(&self) -> usize;
fn forward_with_activations(
&self, backend: &B, token_ids: &[u32],
) -> Result<(Vec<Vec<f32>>, B::Tensor), B::Error>;
}
forward_with_cache is the standard inference path: embed tokens, run through transformer blocks with kv cache, project to logits. forward_with_activations is the probing entry point — same forward pass, but it collects hidden states after every transformer block at the last token position instead of discarding them. this is what --probe mode calls to extract activations for downstream analysis.
llama-family models differ from gpt-2 in four structural ways. each required a new primitive on the Backend trait.
instead of learned position embeddings, llama encodes position by rotating pairs of dimensions in the query and key vectors. the rotation angle for dimension pair 2i, 2i+1 at position p is p / theta^(2i/d) where theta is typically 500,000.0 for llama 3.
cos/sin tables are precomputed once in Llama::from_loader via compute_rope_freqs(max_seq_len, head_dim, theta_base) and cloned into each LlamaAttention layer. at inference time, apply_rotary_emb rotates the q and k tensors in-place using these tables. the precomputation means RoPE costs one 2d-subspace rotation per dimension pair per token — no trigonometry in the hot path.
rms_norm normalizes activations by the root mean square without centering:
rms = sqrt(mean(x²) + eps)
y[i] = x[i] * weight[i] / rms
layer norm subtracts the mean before dividing by standard deviation; RMS norm skips the mean subtraction. in deep networks, the mean of activations is approximately zero anyway, so the subtraction is redundant. dropping it saves one subtraction per element and one reduction pass. llama uses RMS norm before attention and before the mlp in every block, and as the final output norm.
llama's feed-forward network replaces gpt-2's c_fc → gelu → c_proj with a gated variant:
gate = silu(gate_proj(x)) // swish activation on gate
up = up_proj(x) // linear projection
y = gate ⊙ up // element-wise multiply
out = down_proj(y) // final projection
the element-wise product (elemul) is the new primitive. SwiGLU uses two weight matrices (gate_proj and up_proj) instead of one (c_fc), roughly doubling the mlp parameter count for the same hidden size. llama compensates by using an intermediate size of 8/3 * hidden_dim instead of gpt-2's 4 * hidden_dim.
llama uses fewer kv heads than query heads to reduce cache memory. llama 3.2 1B uses 32 q heads and 8 kv heads — four q heads share each kv pair. the KVCache stores only n_kv_heads k/v per layer, not n_heads.
during attention, kv heads are repeated to match the query head count:
let n_repeat = n_heads / n_kv_heads; // 4 for llama 3.2 1B
let kv_h = h / n_repeat; // which kv head does query head h use?
// read k[kv_h], v[kv_h] from cache
// compute dot(q[h], k[kv_h]) as usual
for gpt-2, n_heads == n_kv_heads so this is a zero-behavior-change path. for llama, the cache uses 75% less memory (8 kv heads vs 32 query heads), enabling longer context windows for the same ram budget. the qk_scratch buffer size is unchanged since qk scores are per query head.
probe mode (--probe) is a research feature that extracts hidden states from every transformer block. the flow:
forward_with_activations runs the same forward pass as inference but collects hidden states after each block at the last token position — shape [embed_dim] per layer. this is the standard probing setup: train a simple classifier on these representations and test whether the model encodes specific linguistic properties (e.g. morphological features) at particular layers.
the sampling pipeline is five stages applied in order:
if temperature > 0.0 {
for l in &mut logits { *l /= temperature; }
}
temperature = 0 means greedy argmax, the pipeline skips temperature scaling and goes straight to argmax in categorical_sample. any positive value divides logits: t < 1.0 sharpens (peaks get higher relative weight), t > 1.0 flattens (more uniform).
let mut indexed: Vec<(usize, f32)> = logits.iter().cloned().enumerate().collect();
indexed.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(Equal));
let threshold = indexed[k - 1].1;
for l in logits.iter_mut() {
if *l < threshold { *l = f32::NEG_INFINITY; }
}
sorts a copy of (index, logit) pairs descending, reads the k-th largest value, masks everything below it. O(V log V) where V = vocab size (50257 for gpt-2). a no-op when k ≥ V or k = 0.
let soft = softmax_1d(logits); // temporary distribution
let cutoff = nucleus_cutoff(&soft, p); // threshold value
for (i, s) in soft.iter().enumerate() {
if *s < cutoff { logits[i] = f32::NEG_INFINITY; }
}
computes a temporary softmax to find the smallest set of tokens whose cumulative probability exceeds p. the cutoff is the smallest probability value in that set. the final softmax is computed once by the caller (sample_token) after all filters run, avoiding double-softmax.
let max = logits.iter().fold(f32::NEG_INFINITY, |a, &b| a.max(b));
if max == f32::NEG_INFINITY {
return vec![1.0 / logits.len() as f32; logits.len()];
}
let exps: Vec<f32> = logits.iter().map(|x| (x - max).exp()).collect();
let sum: f32 = exps.iter().sum();
exps.iter().map(|x| x / sum).collect()
the max trick prevents overflow: subtracting max from every logit ensures the largest exponentiated value is exp(0) = 1. the all-masked fallback returns a uniform distribution.
let r: f32 = rng.gen(); // uniform in [0, 1)
let mut cum = 0.0;
for (i, &p) in dist.iter().enumerate() {
cum += p;
if r < cum { return i; }
}
// floating-point rounding fallback: argmax
dist.iter().enumerate()
.max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
.map(|(i, _)| i).unwrap_or(0)
walks the cumulative sum of probabilities. the first index where the running sum exceeds the random draw is the sampled token.
wall-clock breakdown of a single decode step (gpt-2 small, x86_64, release build). measurements from std::time::Instant with 1000-sample microbenchmark:
matmul dominates. the c_fc and output projection are the two heaviest matmuls, 768×3072 and 3072×768 respectively, each 2.36 million multiply-adds. simd would drop this to roughly 8-10 ms, pushing throughput toward 40-50 tok/s.
all-masked softmax rows. the causal mask sets future positions to -inf. on the first decode step (only one token in the sequence), every row after the first is entirely -inf. standard softmax produces NaN ((-inf - -inf).exp() evaluates to NaN per ieee 754). one branch, if max == f32::NEG_INFINITY { return uniform }, prevents NaN from propagating through 12 layers into the output logits.
temperature = 0. the sampler checks if temperature > 0.0 and skips scaling. in categorical_sample, the softmax distribution for sharp logits concentrates all mass on the argmax token, and inverse cdf sampling selects it deterministically. this is how --temperature 0 produces identical output every run (given the same rng seed).
scratch buffer resize. the qk_scratch buffer is allocated to max_seq_len (2048). during decode at position 47, the caller resizes it to 48. because 48 ≤ 2048, the resize is a no-op, capacity was set at construction time. this is a deliberate invariant: the hot path never calls the global allocator.
transposed embeddings. the gguf file stores token embeddings as [vocab, embed] (50257 × 768). the loader transposes them to [embed, vocab] (768 × 50257) during model construction. this means index_select(token_id) picks a contiguous 768-element slice rather than gathering 768 elements strided by vocab_size. cost: one transpose at startup. benefit: every inference step does a memcpy instead of a gather.