ember is a gpt-2 inference engine that runs quantized models on cpu, in pure rust, with zero framework dependencies. it loads gguf files, runs token-by-token generation with temperature / top-k / top-p sampling, and produces coherent text from a 124m-parameter model, all in roughly 2,200 lines.
rust was chosen for two reasons. first, memory safety: juggling 12
layers of attention weights, a flat kv cache, and a row-major tensor
library with no garbage collector is where rust's ownership model
pays for itself. second, the Backend trait, a zero-cost
abstraction that means the same transformer code runs on cpu now and
will run on a gpu later without touching model logic.
the end target is embedded inference, small language models on devices without a gpu. a cortex-m isn't going to run gpt-2 at 12 tok/s, but the architecture here (explicit memory, no hidden allocations, alloc-first design) is the architecture that gets there. start on x86, target arm later.
the central idea is the Backend trait. every transformer component (attention, mlp, layer norm) is generic over B: Backend. there is one implementation:
CpuBackend, a zero-sized struct that delegates to
CpuTensor methods running in scalar cpu loops. the
trait abstracts linear ops (matmul,
add_broadcast), element-wise math (gelu,
softmax), normalization (layer_norm), and
tensor lifecycle (zeroes, load_from_cpu,
data, shape).
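a minimal sketch of the trait's shape, built from the ops named above (the associated-tensor structure and exact signatures are illustrative, not copied from the repo):

```rust
/// illustrative backend trait: one associated tensor type plus the ops
/// the transformer components need. names follow the list above.
pub trait Backend {
    type Tensor;

    // tensor lifecycle
    fn zeroes(&self, shape: &[usize]) -> Self::Tensor;
    fn load_from_cpu(&self, data: &[f32], shape: &[usize]) -> Self::Tensor;
    fn data(&self, t: &Self::Tensor) -> Vec<f32>;
    fn shape(&self, t: &Self::Tensor) -> Vec<usize>;

    // linear ops
    fn matmul(&self, a: &Self::Tensor, b: &Self::Tensor) -> Self::Tensor;
    fn add_broadcast(&self, a: &Self::Tensor, b: &Self::Tensor) -> Self::Tensor;

    // element-wise math and normalization
    fn gelu(&self, x: &Self::Tensor) -> Self::Tensor;
    fn softmax(&self, x: &Self::Tensor) -> Self::Tensor;
    fn layer_norm(&self, x: &Self::Tensor, w: &Self::Tensor, b: &Self::Tensor, eps: f32) -> Self::Tensor;
}

/// zero-sized cpu backend; every method delegates to scalar loops over CpuTensor.
pub struct CpuBackend;

/// components stay generic over the backend, so a future gpu backend never
/// touches model logic.
pub struct LayerNorm<B: Backend> {
    pub weight: B::Tensor,
    pub bias: B::Tensor,
    pub eps: f32,
}
```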
the one carve-out is attention: the attention code calls data() to extract raw f32 slices and runs the attention math in scalar cpu loops, bypassing the backend entirely. a gpu backend would still execute attention on the cpu through this path. the trait docs and readme both say this explicitly. adding fn attention(...) to the trait is planned; until then, the abstraction is honest about what it covers.
CpuTensor is row-major, backed by a flat
Vec<f32>. every operation allocates a new tensor.
no views, no mutation in place. the cost is extra allocations; the
benefit is that every op is trivially correct and the compiler can
reason about ownership without fighting, which mattered during
debugging (see section 7).
gguf, the binary format that replaced ggml in the llama.cpp
ecosystem, is a flat, self-describing file. a 4-byte magic number (the ascii bytes "GGUF", which read back as the little-endian u32 0x46554747), a 4-byte version (3),
then a count of tensors and metadata key-value pairs. the metadata
comes next, strings like
"gpt2.block_count" mapped to typed values. after that,
a tensor info table gives each tensor's name, dimensions, data type,
and byte offset. read the header, seek to each offset, read the
bytes.
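a sketch of reading that preamble (little-endian throughout, per the gguf spec; error handling trimmed and names illustrative):

```rust
use std::io::{Read, Result};

/// read the fixed gguf preamble: magic, version, tensor count, metadata kv count.
fn read_gguf_preamble<R: Read>(r: &mut R) -> Result<(u32, u64, u64)> {
    let mut b4 = [0u8; 4];
    let mut b8 = [0u8; 8];

    r.read_exact(&mut b4)?;
    let magic = u32::from_le_bytes(b4);
    assert_eq!(magic, 0x4655_4747, "not a gguf file"); // the bytes "GGUF" on disk

    r.read_exact(&mut b4)?;
    let version = u32::from_le_bytes(b4); // 3 for current files

    r.read_exact(&mut b8)?;
    let tensor_count = u64::from_le_bytes(b8);

    r.read_exact(&mut b8)?;
    let metadata_kv_count = u64::from_le_bytes(b8);

    Ok((version, tensor_count, metadata_kv_count))
}
```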
two things went wrong.
magic number confusion. interpreting a single ascii
'g' byte (0x67) as a big-endian u64 produces
7,421,932,185,906,577,408. that number is not the
gguf magic and not the ggml magic, it's a misaligned read. binary
formats are sensitive to endianness, alignment, and interpretation.
get one wrong and the file doesn't fail gracefully; it produces a
number like 7.4 quintillion and waits. this happened when i misread
a hex dump at 2 am, thinking the first byte was a u64 instead of the
start of a 4-byte magic.
the column-major trap. gguf stores q8_0 quantized
tensors (and, by convention, f16 tensors) in column-major order.
q8_0 works in blocks of 32 values, and the innermost dimension must
be a multiple of 32. for a weight matrix of shape
[768, 50257], the column dimension (50257) is the block
axis, it's larger and can be padded to a multiple of 32. so the data
on disk is laid out column-by-column: all 768 rows of column 0, then
all 768 rows of column 1, and so on.
the tensor info header reports the logical shape,
[768, 50257], but says nothing about storage order. the
loader dequantized the flat buffer into a
Vec<f32> and called
reshape(&[768, 50257]), which assumes row-major.
every weight matrix in the model was scrambled. the fix: reverse the
dims in the loader before reshaping. one line. correlation with
pytorch reference went from 0.004 to 0.999975.
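a small sketch of the two facts at play, with hypothetical helper names: where a column-major element actually lives, and the one-line dim reversal that makes the row-major reshape agree with it:

```rust
/// offset of logical element [i][j] in a rows x cols matrix stored
/// column-by-column: all of column 0, then all of column 1, ...
fn col_major_offset(i: usize, j: usize, rows: usize) -> usize {
    j * rows + i
}

/// the loader-side fix: reverse the dims reported by the tensor info header
/// before reshaping the dequantized flat buffer as row-major.
fn storage_dims(header_dims: &[usize]) -> Vec<usize> {
    let mut dims = header_dims.to_vec(); // e.g. [768, 50257]
    dims.reverse();                      // -> [50257, 768], matching the flat layout
    dims
}
```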
three operations carry the model. everything else is scaffolding around them.
softmax turns a vector of arbitrary numbers into a probability
distribution:
exp(x_i) / sum(exp(x)). implemented naively,
exp(88) overflows f32. the standard fix subtracts the
maximum value from every element first, the exponent becomes
exp(x_i - max), which never exceeds 1.0, and the math
is identical because exp(a)/exp(b) = exp(a-b).
one edge case: a causal mask sets future positions to
-inf. during early decode, when only one token has been
generated, every row after the first is entirely masked, every
element is -inf. under standard softmax,
max = -inf, so
(-inf - -inf).exp() evaluates to NaN per
ieee 754. that NaN propagates through 12 transformer layers and into
the output logits. the fix is one branch: if max is
-inf, return a uniform distribution of
1/n per position.
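both the max-subtraction and the all-masked guard fit in a few lines; a scalar sketch of one softmax row (not the exact CpuTensor method):

```rust
/// numerically stable softmax over one row, with the fully-masked guard.
fn softmax_row(row: &mut [f32]) {
    let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);

    // every position masked to -inf: exp(-inf - -inf) would be NaN,
    // so return a uniform distribution instead.
    if max == f32::NEG_INFINITY {
        let uniform = 1.0 / row.len() as f32;
        for x in row.iter_mut() {
            *x = uniform;
        }
        return;
    }

    // subtract the max so the largest exponent is exp(0) = 1.0
    let mut sum = 0.0f32;
    for x in row.iter_mut() {
        *x = (*x - max).exp();
        sum += *x;
    }
    for x in row.iter_mut() {
        *x /= sum;
    }
}
```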
gelu, the gaussian error linear unit, is gpt-2's activation
function. it's a smoother version of relu: instead of hard-clipping
negative values to zero, it squashes them toward zero
probabilistically, weighted by the gaussian cdf. the formula:
0.5 * x * (1 + erf(x / √2)).
it lives in the mlp block: the hidden layer projects from 768 to
3072, gelu fires on every element, then the output layer projects
back to 768. every token passes through this bottleneck 12 times
(once per transformer block). the implementation uses erff from the libm crate: pure rust, no system libm linking, portable across targets.
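the whole activation is one expression per element; a sketch using the libm crate's erff (the mlp shape around it is described above):

```rust
/// erf-based gelu: 0.5 * x * (1 + erf(x / sqrt(2))).
fn gelu(x: f32) -> f32 {
    0.5 * x * (1.0 + libm::erff(x * std::f32::consts::FRAC_1_SQRT_2))
}

/// applied element-wise on the 3072-wide hidden activations in the mlp block.
fn gelu_inplace(xs: &mut [f32]) {
    for x in xs.iter_mut() {
        *x = gelu(*x);
    }
}
```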
layer normalization keeps activations from drifting into numerically
useless ranges as they pass through 12 blocks of attention and mlp.
for each row of a [batch, features] tensor, it computes
the mean and variance across that row, normalizes every element to
(x - mean) / √(var + ε), then scales by a learned
weight and shifts by a learned bias. epsilon (1e-5) prevents
division by zero.
it's four lines of arithmetic per row. but get it wrong, forget to
apply the weight, skip the bias, or forget to assign the result back
to x in the block loop, and the model's output degrades
silently. no panic. no NaN. just progressively worse text.
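those four lines, sketched per row (weight and bias are the learned per-feature parameters; not the exact CpuTensor method):

```rust
/// layer norm over one row of a [batch, features] tensor:
/// (x - mean) / sqrt(var + eps), then scale by weight and shift by bias.
fn layer_norm_row(x: &mut [f32], weight: &[f32], bias: &[f32], eps: f32) {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    for i in 0..x.len() {
        // dropping the weight or bias here degrades output silently, no panic
        x[i] = (x[i] - mean) * inv_std * weight[i] + bias[i];
    }
}
```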
attention is the core mechanism that makes transformers work. here is what happens in the code.
q, k, v. every token's embedding (a vector of 768
numbers) passes through a single linear layer called
c_attn, which maps 768 → 2304. that 2304-element output
is three 768-element vectors packed together: query, key, and value.
for each attention head (gpt-2 small has 12), you take a 64-element
slice of q, k, and v. the query asks "what am i looking for?", the
key advertises "here's what i contain", and the value is the actual
content. the dot product of q and k gives similarity scores. those
scores become weights, and you sum the values according to those
weights.
the causal mask. during training the whole target sequence is present at once, so without a mask a token could cheat by looking at future tokens and the model would learn to copy the answer instead of predicting it. at inference time, future tokens
don't exist yet; they're generated one at a time. the solution is a
triangular mask: token at position i can attend to
positions 0..=i, and everything beyond
i is set to -inf before softmax. the
-inf becomes 0.0 after exponentiation, so those
positions contribute nothing.
why sqrt(head_dim)? the dot product of two random 64-element vectors has variance proportional to 64. scores with that much spread saturate the softmax: one attention score dominates and the rest are near zero, so the model can't blend information from multiple positions. dividing by √64 = 8 brings the variance back to 1.0, keeping the softmax in a regime where multiple tokens get
meaningful weight. the recip() call (1.0 / sqrt(head_dim)) is computed once at the top of the attention forward pass and
reused for every row.
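putting the three pieces together, a scalar sketch of one attention head over flat [seq, head_dim] slices (softmax_row is the stable version sketched earlier; ember's real path goes through the kv cache described next):

```rust
/// single-head causal attention: scores = (q · k) / sqrt(head_dim),
/// mask future positions to -inf, softmax, then weighted sum of v.
fn attention_head(q: &[f32], k: &[f32], v: &[f32], seq: usize, head_dim: usize, out: &mut [f32]) {
    let scale = 1.0 / (head_dim as f32).sqrt(); // computed once, reused per row
    let mut scores = vec![0.0f32; seq];

    for i in 0..seq {
        // similarity of query i against every key; future keys masked out
        for j in 0..seq {
            scores[j] = if j <= i {
                let qi = &q[i * head_dim..(i + 1) * head_dim];
                let kj = &k[j * head_dim..(j + 1) * head_dim];
                qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>() * scale
            } else {
                f32::NEG_INFINITY
            };
        }
        softmax_row(&mut scores);

        // blend values by attention weight
        let out_i = &mut out[i * head_dim..(i + 1) * head_dim];
        out_i.iter_mut().for_each(|o| *o = 0.0);
        for j in 0..=i {
            let vj = &v[j * head_dim..(j + 1) * head_dim];
            for d in 0..head_dim {
                out_i[d] += scores[j] * vj[d];
            }
        }
    }
}
```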
without a cache, every generated token requires recomputing attention over the entire sequence from scratch. generate 20 tokens with a 50-token prompt, and you run 71 full attention passes, each doing o(n²) dot products over a growing sequence. the kv cache changes this from o(n²) per step to o(n) per step.
the key and value projections for every token at every layer are deterministic given the model weights; once computed, there's no reason to recompute them. during prefill, you run the full prompt through all 12 layers and store every position's k and v. during decode, you compute k and v only for the single new token, append it to the cache, and run attention against the entire cached history. the query is the only thing that changes per token: it encodes "what should the next token pay attention to?", so it's always computed fresh.
pre-allocation. the cache is a flat
[layer][head][pos][head_dim] array: 12 layers × 12 heads × 2048 positions × 64 dimensions × 4 bytes = ~72 mb each for k and v, ~144 mb total. allocated once, upfront, in KVCache::new(). the
append path is a copy_from_slice into a pre-computed
offset. there's a scratch buffer for attention score rows
(qk_scratch), also pre-allocated to
max_seq_len, reused across all heads and tokens in a
decode step via clear() + resize(). the
resize never re-allocates because capacity was set at construction
time. zero heap allocations in the inference hot path.
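a sketch of that layout and the append path (field and method names approximate; the real KVCache also owns the qk_scratch buffer):

```rust
/// flat, pre-allocated kv cache: [layer][head][pos][head_dim] for k and v.
struct KvCacheSketch {
    k: Vec<f32>,
    v: Vec<f32>,
    n_heads: usize,
    max_seq_len: usize,
    head_dim: usize,
}

impl KvCacheSketch {
    /// allocate everything once, upfront; the hot path never grows these vecs.
    fn new(n_layers: usize, n_heads: usize, max_seq_len: usize, head_dim: usize) -> Self {
        let elems = n_layers * n_heads * max_seq_len * head_dim;
        Self { k: vec![0.0; elems], v: vec![0.0; elems], n_heads, max_seq_len, head_dim }
    }

    fn offset(&self, layer: usize, head: usize, pos: usize) -> usize {
        ((layer * self.n_heads + head) * self.max_seq_len + pos) * self.head_dim
    }

    /// append one position's k/v rows for one head: a copy_from_slice into a
    /// pre-computed offset, no allocation.
    fn append(&mut self, layer: usize, head: usize, pos: usize, k_row: &[f32], v_row: &[f32]) {
        let off = self.offset(layer, head, pos);
        self.k[off..off + self.head_dim].copy_from_slice(k_row);
        self.v[off..off + self.head_dim].copy_from_slice(v_row);
    }
}
```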
six bugs total. four were standard off-by-one and format-misunderstanding fare. two were the kind where the model produces output that looks almost correct but is systematically wrong.
the column-major scramble. the tensor info header reports the logical shape ([768, 50257] for the embedding matrix), but the dequantized flat buffer is laid out column-major: the data for tensor[i][j] is at offset j * rows + i, not i * cols + j. the loader reshaped with the logical dims in row-major order, scrambling every weight matrix.
the fix: reverse the dims vec so the reshape matches the column-major storage, then adjust the transposes in Gpt2::from_loader: embeddings need no manual transpose (the loader already reversed dims, so index_select picks rows directly), while linear weights get a .transpose() to restore [in_features, out_features] for matmul.
"1, 2, 3, 4, 5" →
"/5/5/5/5/5/5/5/". the model could attend to at
most one prompt token. it saw the last token ("5") and nothing
else.
the cause: KVCache::append wrote every entry at
self.cursor. but cursor is
deliberately not advanced until
Gpt2::forward_with_cache finishes all 12 layers.
during prefill, when
Attention::forward_with_cache looped over prompt
tokens calling append, every call landed at
cursor = 0, each overwriting the last. only the
final prompt token's k/v projection survived; all other
positions stayed zero.
the fix: append now takes an explicit pos: usize parameter. Attention::forward_with_cache snapshots cache.cursor() before the loop and passes cursor + pos, so each token lands at its correct absolute cache slot. the cursor is still batch-advanced afterward in Gpt2::forward_with_cache; that part was already correct.
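in terms of the cache sketch from earlier, the fixed prefill path looks roughly like this (names approximate):

```rust
/// prefill for one layer/head: write each prompt position at an absolute slot
/// derived from a cursor snapshot, instead of reusing a not-yet-advanced cursor.
fn prefill_one_head(
    cache: &mut KvCacheSketch,
    layer: usize,
    head: usize,
    base: usize,            // snapshot of cache.cursor() taken before the loop
    prompt_k: &[Vec<f32>],  // one k row per prompt token
    prompt_v: &[Vec<f32>],  // one v row per prompt token
) {
    for (pos, (k_row, v_row)) in prompt_k.iter().zip(prompt_v).enumerate() {
        cache.append(layer, head, base + pos, k_row, v_row);
    }
    // the cursor itself is still batch-advanced once per forward pass,
    // after all 12 layers have appended.
}
```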
"1, 2, 3, 4, 5" →
", 6, 7, 8, 9, 10, 11, 12,".
three design decisions that paid off immediately:
embeddings transposed at load time. on disk the embedding matrices are stored embed-dimension-first ([768, 50257] for the token embeddings); the loader transposes them to [vocab, embed] so index_select picks a row with one contiguous slice instead of gathering strided elements. cost: one transpose at startup. benefit: every inference step memcpys instead of gathering.
the fully-masked softmax guard. a softmax row where every position is masked (-inf) produces NaN under standard softmax. one branch, if max == -inf, return 1/n per position, prevents NaN from propagating through 12 transformer layers.
#[must_use] on all pure tensor ops.
tensor operations allocate new storage; they never mutate in place. calling x.softmax() without binding the result silently drops the allocation. #[must_use] caught, at compile time, the bug of forgetting to assign the layer norm output back to x in the transformer block loop. that bug produces silently wrong logits with no panic.
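a minimal illustration of the pattern (toy types, not ember's): the attribute turns the dropped-result bug into a compiler warning:

```rust
struct Tensor(Vec<f32>);

impl Tensor {
    /// pure op: allocates and returns a new tensor, never mutates in place.
    #[must_use = "layer_norm returns a new tensor; assign it back"]
    fn layer_norm(&self) -> Tensor {
        Tensor(self.0.clone()) // body elided; the attribute is the point
    }
}

fn block_loop(mut x: Tensor) -> Tensor {
    x.layer_norm();     // warning: unused result (the original silent bug)
    x = x.layer_norm(); // the fix: rebind x
    x
}
```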
the kv cache was the single largest performance change. before it,
every generation step recomputed attention over the full growing
sequence, 71 forward passes for a 50-token prompt with 20 generated
tokens, each doing o(n²) attention. after wiring the cache into
Attention::forward_with_cache, the generation loop does
one prefill pass (populating the cache) plus one o(n) decode step
per generated token.
| metric | before | after |
|---|---|---|
| forward passes (50-token prompt + 20 gen) | 71 (o(n²) each) | 1 prefill + 20 decode (o(n) each) |
| decode throughput | ~2 tok/s | ~12 tok/s |
| attention hot-path allocations | 144 per token (12 layers × 12 heads) | 0 (scratch buffer pre-allocated in kv cache) |
| kv cache memory (2048-context, 12 layers, 12 heads, 64 head_dim) | n/a | ~144 mb (~72 mb each for k and v) |
the scratch buffer eliminated 144 small heap allocations per
generated token for a 12-layer, 12-head model. the buffer is
pre-allocated in KVCache::new(); the attention loop
calls clear() + resize() on it, and the
resize never re-allocates because capacity equals
max_seq_len.
after five bugs and a column-major revelation, the model ran on the prompt "the" with greedy sampling, temperature zero, deterministic argmax, no randomness. if every tensor op was correct, the output would be the same every time. it was:
it's not good prose. it's a 124m-parameter model from 2019 with a one-word prompt. but every word is a real english word, capitalized correctly, in grammatical order. no invalid utf-8, no NaN tokens, no random integers. every tensor op, softmax, gelu, layer norm, matmul, the qkv split, the causal mask, the kv cache append-and-read, worked together, and what came out the other end was text.
the repetition is also informative. gpt-2 is not chatgpt. it's a next-token predictor trained on internet text. given "the", its most likely continuation involved cheese, and it repeated "cheese" 17 times because once it entered that loop, the attention pattern reinforced itself. that's not a bug in ember, it's the model doing what it was trained to do. the fact that ember faithfully reproduces that behavior is the proof that the pipeline is correct.
ember runs gpt-2. the path forward is llama-family support, which requires four new primitives:
rmsnorm. llama normalizes by root mean square only, with no mean subtraction and no bias; a simpler sibling of the existing layer_norm method.
silu. x * sigmoid(x), self-gating, smooth. the gated ffn in llama uses it in a gate_proj * up_proj element-wise product before the down projection.
rope. rotary position embeddings replace gpt-2's learned wpe lookup table. instead of storing a vector for each absolute position, rope encodes position by rotating pairs of dimensions in the query and key vectors by an angle proportional to the position. the attention score between positions i and j depends only on their relative distance (j − i), not their absolute positions, so the model generalizes to sequence lengths it wasn't trained on.
grouped-query attention. fewer k/v heads than query heads; each k/v head is shared by n_q_heads / n_kv_heads q heads.
each primitive is specified with starter code and tests in the repo.
the Backend trait was designed for exactly this: adding a new op to the trait makes it available to every future backend automatically.
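as a taste of how small each primitive is, a scalar sketch of rmsnorm (not code from the repo, just the per-row math a future Backend method would wrap): x / sqrt(mean(x²) + ε), scaled by a learned weight, with no mean subtraction and no bias.

```rust
/// rmsnorm over one row: normalize by the root mean square, then scale.
fn rms_norm_row(x: &mut [f32], weight: &[f32], eps: f32) {
    let n = x.len() as f32;
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / n;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    for i in 0..x.len() {
        x[i] = x[i] * inv_rms * weight[i];
    }
}
```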
simd. the current matmul delegates to
matrixmultiply::sgemm, a pure rust implementation that
is correct but scalar. the next performance jump comes from
vectorized kernels, avx2 on x86, neon on arm. the
Backend trait is the insertion point: a
SimdBackend implementing the same trait with
architecture-specific intrinsics. same model code. 4× to 8×
throughput.
after that: gpu support (the Backend trait means same
transformer code on cuda or metal), then embedded targets, arm
cortex-m, no_std, models in the 10m–100m parameter range running on
microcontrollers.