ember is a gpt-2 inference engine that runs quantized models on cpu, in pure rust, with zero framework dependencies. it loads gguf files, runs token-by-token generation with temperature / top-k / top-p sampling, and produces coherent text from a 124m-parameter model, all in roughly 2,200 lines.
rust was chosen for two reasons. first, memory safety: juggling 12
layers of attention weights, a flat kv cache, and a row-major tensor
library with no garbage collector is where rust's ownership model
pays for itself. second, the Backend trait, a zero-cost
abstraction that means the same transformer code runs on cpu now and
will run on a gpu later without touching model logic.
the end target is embedded inference, small language models on devices without a gpu. a cortex-m isn't going to run gpt-2 at 12 tok/s, but the architecture here (explicit memory, no hidden allocations, alloc-first design) is the architecture that gets there. start on x86, target arm later.
the central idea is the Backend trait. every transformer component (attention, mlp, layer norm) is generic over B: Backend. there is one implementation:
CpuBackend, a zero-sized struct that delegates to
CpuTensor methods running in scalar cpu loops. the
trait abstracts linear ops (matmul,
add_broadcast), element-wise math (gelu,
softmax), normalization (layer_norm), and
tensor lifecycle (zeroes, load_from_cpu,
data, shape).
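a minimal sketch of the trait's shape, built from the ops named above (the associated-tensor structure and exact signatures are illustrative, not copied from the repo):

```rust
/// illustrative backend trait: one associated tensor type plus the ops
/// the transformer components need. names follow the list above.
pub trait Backend {
    type Tensor;

    // tensor lifecycle
    fn zeroes(&self, shape: &[usize]) -> Self::Tensor;
    fn load_from_cpu(&self, data: &[f32], shape: &[usize]) -> Self::Tensor;
    fn data(&self, t: &Self::Tensor) -> Vec<f32>;
    fn shape(&self, t: &Self::Tensor) -> Vec<usize>;

    // linear ops
    fn matmul(&self, a: &Self::Tensor, b: &Self::Tensor) -> Self::Tensor;
    fn add_broadcast(&self, a: &Self::Tensor, b: &Self::Tensor) -> Self::Tensor;

    // element-wise math and normalization
    fn gelu(&self, x: &Self::Tensor) -> Self::Tensor;
    fn softmax(&self, x: &Self::Tensor) -> Self::Tensor;
    fn layer_norm(&self, x: &Self::Tensor, w: &Self::Tensor, b: &Self::Tensor, eps: f32) -> Self::Tensor;
}

/// zero-sized cpu backend; every method delegates to scalar loops over CpuTensor.
pub struct CpuBackend;

/// components stay generic over the backend, so a future gpu backend never
/// touches model logic.
pub struct LayerNorm<B: Backend> {
    pub weight: B::Tensor,
    pub bias: B::Tensor,
    pub eps: f32,
}
```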
the one carve-out is attention: the attention code calls data() to extract raw f32 slices and runs the attention math in scalar cpu loops, bypassing the backend entirely. a gpu backend would still execute attention on the cpu through this path. the trait docs and readme both say this explicitly. adding fn attention(...) to the trait is planned; until then, the abstraction is honest about what it covers.
CpuTensor is row-major, backed by a flat
Vec<f32>. every operation allocates a new tensor.
no views, no mutation in place. the cost is extra allocations; the
benefit is that every op is trivially correct and the compiler can
reason about ownership without fighting, which mattered during
debugging (see section 7).
gguf, the binary format that replaced ggml in the llama.cpp
ecosystem, is a flat, self-describing file. a 4-byte magic number (the ascii bytes "GGUF", which read back as the little-endian u32 0x46554747), a 4-byte version (3),
then a count of tensors and metadata key-value pairs. the metadata
comes next, strings like
"gpt2.block_count" mapped to typed values. after that,
a tensor info table gives each tensor's name, dimensions, data type,
and byte offset. read the header, seek to each offset, read the
bytes.
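a sketch of reading that preamble (little-endian throughout, per the gguf spec; error handling trimmed and names illustrative):

```rust
use std::io::{Read, Result};

/// read the fixed gguf preamble: magic, version, tensor count, metadata kv count.
fn read_gguf_preamble<R: Read>(r: &mut R) -> Result<(u32, u64, u64)> {
    let mut b4 = [0u8; 4];
    let mut b8 = [0u8; 8];

    r.read_exact(&mut b4)?;
    let magic = u32::from_le_bytes(b4);
    assert_eq!(magic, 0x4655_4747, "not a gguf file"); // the bytes "GGUF" on disk

    r.read_exact(&mut b4)?;
    let version = u32::from_le_bytes(b4); // 3 for current files

    r.read_exact(&mut b8)?;
    let tensor_count = u64::from_le_bytes(b8);

    r.read_exact(&mut b8)?;
    let metadata_kv_count = u64::from_le_bytes(b8);

    Ok((version, tensor_count, metadata_kv_count))
}
```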
two things went wrong.
magic number confusion. interpreting a single ascii
'g' byte (0x67) as a big-endian u64 produces
7,421,932,185,906,577,408. that number is not the
gguf magic and not the ggml magic, it's a misaligned read. binary
formats are sensitive to endianness, alignment, and interpretation.
get one wrong and the file doesn't fail gracefully; it produces a
number like 7.4 quintillion and waits. this happened when i misread
a hex dump at 2 am, thinking the first byte was a u64 instead of the
start of a 4-byte magic.
the column-major trap. gguf stores q8_0 quantized
tensors (and, by convention, f16 tensors) in column-major order.
q8_0 works in blocks of 32 values, and the innermost dimension must
be a multiple of 32. for a weight matrix of shape
[768, 50257], the column dimension (50257) is the block
axis, it's larger and can be padded to a multiple of 32. so the data
on disk is laid out column-by-column: all 768 rows of column 0, then
all 768 rows of column 1, and so on.
the tensor info header reports the logical shape,
[768, 50257], but says nothing about storage order. the
loader dequantized the flat buffer into a
Vec<f32> and called
reshape(&[768, 50257]), which assumes row-major.
every weight matrix in the model was scrambled. the fix: reverse the
dims in the loader before reshaping. one line. correlation with
pytorch reference went from 0.004 to 0.999975.
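a small sketch of the two facts at play, with hypothetical helper names: where a column-major element actually lives, and the one-line dim reversal that makes the row-major reshape agree with it:

```rust
/// offset of logical element [i][j] in a rows x cols matrix stored
/// column-by-column: all of column 0, then all of column 1, ...
fn col_major_offset(i: usize, j: usize, rows: usize) -> usize {
    j * rows + i
}

/// the loader-side fix: reverse the dims reported by the tensor info header
/// before reshaping the dequantized flat buffer as row-major.
fn storage_dims(header_dims: &[usize]) -> Vec<usize> {
    let mut dims = header_dims.to_vec(); // e.g. [768, 50257]
    dims.reverse();                      // -> [50257, 768], matching the flat layout
    dims
}
```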
three operations carry the model. everything else is scaffolding around them.
softmax turns a vector of arbitrary numbers into a probability
distribution:
exp(x_i) / sum(exp(x)). implemented naively,
exp(88) overflows f32. the standard fix subtracts the
maximum value from every element first, the exponent becomes
exp(x_i - max), which never exceeds 1.0, and the math
is identical because exp(a)/exp(b) = exp(a-b).
one edge case: a causal mask sets future positions to
-inf. during early decode, when only one token has been
generated, every row after the first is entirely masked, every
element is -inf. under standard softmax,
max = -inf, so
(-inf - -inf).exp() evaluates to NaN per
ieee 754. that NaN propagates through 12 transformer layers and into
the output logits. the fix is one branch: if max is
-inf, return a uniform distribution of
1/n per position.
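both the max-subtraction and the all-masked guard fit in a few lines; a scalar sketch of one softmax row (not the exact CpuTensor method):

```rust
/// numerically stable softmax over one row, with the fully-masked guard.
fn softmax_row(row: &mut [f32]) {
    let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);

    // every position masked to -inf: exp(-inf - -inf) would be NaN,
    // so return a uniform distribution instead.
    if max == f32::NEG_INFINITY {
        let uniform = 1.0 / row.len() as f32;
        for x in row.iter_mut() {
            *x = uniform;
        }
        return;
    }

    // subtract the max so the largest exponent is exp(0) = 1.0
    let mut sum = 0.0f32;
    for x in row.iter_mut() {
        *x = (*x - max).exp();
        sum += *x;
    }
    for x in row.iter_mut() {
        *x /= sum;
    }
}
```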
gelu, the gaussian error linear unit, is gpt-2's activation
function. it's a smoother version of relu: instead of hard-clipping
negative values to zero, it squashes them toward zero
probabilistically, weighted by the gaussian cdf. the formula:
0.5 * x * (1 + erf(x / √2)).
it lives in the mlp block: the hidden layer projects from 768 to
3072, gelu fires on every element, then the output layer projects
back to 768. every token passes through this bottleneck 12 times
(once per transformer block). the implementation uses erff from the libm crate: pure rust, no system libm linking, portable across targets.
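the whole activation is one expression per element; a sketch using the libm crate's erff (the mlp shape around it is described above):

```rust
/// erf-based gelu: 0.5 * x * (1 + erf(x / sqrt(2))).
fn gelu(x: f32) -> f32 {
    0.5 * x * (1.0 + libm::erff(x * std::f32::consts::FRAC_1_SQRT_2))
}

/// applied element-wise on the 3072-wide hidden activations in the mlp block.
fn gelu_inplace(xs: &mut [f32]) {
    for x in xs.iter_mut() {
        *x = gelu(*x);
    }
}
```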
layer normalization keeps activations from drifting into numerically
useless ranges as they pass through 12 blocks of attention and mlp.
for each row of a [batch, features] tensor, it computes
the mean and variance across that row, normalizes every element to
(x - mean) / √(var + ε), then scales by a learned
weight and shifts by a learned bias. epsilon (1e-5) prevents
division by zero.
it's four lines of arithmetic per row. but get it wrong, forget to
apply the weight, skip the bias, or forget to assign the result back
to x in the block loop, and the model's output degrades
silently. no panic. no NaN. just progressively worse text.
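those four lines, sketched per row (weight and bias are the learned per-feature parameters; not the exact CpuTensor method):

```rust
/// layer norm over one row of a [batch, features] tensor:
/// (x - mean) / sqrt(var + eps), then scale by weight and shift by bias.
fn layer_norm_row(x: &mut [f32], weight: &[f32], bias: &[f32], eps: f32) {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    for i in 0..x.len() {
        // dropping the weight or bias here degrades output silently, no panic
        x[i] = (x[i] - mean) * inv_std * weight[i] + bias[i];
    }
}
```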
attention is the core mechanism that makes transformers work. here is what happens in the code.
q, k, v. every token's embedding (a vector of 768
numbers) passes through a single linear layer called
c_attn, which maps 768 → 2304. that 2304-element output
is three 768-element vectors packed together: query, key, and value.
for each attention head (gpt-2 small has 12), you take a 64-element
slice of q, k, and v. the query asks "what am i looking for?", the
key advertises "here's what i contain", and the value is the actual
content. the dot product of q and k gives similarity scores. those
scores become weights, and you sum the values according to those
weights.
the causal mask. during training the whole target sequence is present at once, so without a mask a token could cheat by looking at future tokens and the model would learn to copy the answer instead of predicting it. at inference time, future tokens
don't exist yet; they're generated one at a time. the solution is a
triangular mask: token at position i can attend to
positions 0..=i, and everything beyond
i is set to -inf before softmax. the
-inf becomes 0.0 after exponentiation, so those
positions contribute nothing.
why sqrt(head_dim)? the dot product of two random 64-element vectors has variance proportional to 64. scores with that much spread saturate the softmax: one attention score dominates and the rest are near zero, so the model can't blend information from multiple positions. dividing by √64 = 8 brings the variance back to 1.0, keeping the softmax in a regime where multiple tokens get
meaningful weight. the recip() call (1.0 / sqrt(head_dim)) is computed once at the top of the attention forward pass and
reused for every row.
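putting the three pieces together, a scalar sketch of one attention head over flat [seq, head_dim] slices (softmax_row is the stable version sketched earlier; ember's real path goes through the kv cache described next):

```rust
/// single-head causal attention: scores = (q · k) / sqrt(head_dim),
/// mask future positions to -inf, softmax, then weighted sum of v.
fn attention_head(q: &[f32], k: &[f32], v: &[f32], seq: usize, head_dim: usize, out: &mut [f32]) {
    let scale = 1.0 / (head_dim as f32).sqrt(); // computed once, reused per row
    let mut scores = vec![0.0f32; seq];

    for i in 0..seq {
        // similarity of query i against every key; future keys masked out
        for j in 0..seq {
            scores[j] = if j <= i {
                let qi = &q[i * head_dim..(i + 1) * head_dim];
                let kj = &k[j * head_dim..(j + 1) * head_dim];
                qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>() * scale
            } else {
                f32::NEG_INFINITY
            };
        }
        softmax_row(&mut scores);

        // blend values by attention weight
        let out_i = &mut out[i * head_dim..(i + 1) * head_dim];
        out_i.iter_mut().for_each(|o| *o = 0.0);
        for j in 0..=i {
            let vj = &v[j * head_dim..(j + 1) * head_dim];
            for d in 0..head_dim {
                out_i[d] += scores[j] * vj[d];
            }
        }
    }
}
```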
without a cache, every generated token requires recomputing attention over the entire sequence from scratch. generate 20 tokens with a 50-token prompt, and you run 71 full attention passes, each doing o(n²) dot products over a growing sequence. the kv cache changes this from o(n²) per step to o(n) per step.
the key and value projections for every token at every layer are deterministic given the model weights; once computed, there's no reason to recompute them. during prefill, you run the full prompt through all 12 layers and store every position's k and v. during decode, you compute k and v only for the single new token, append it to the cache, and run attention against the entire cached history. the query is the only thing that changes per token: it encodes "what should the next token pay attention to?", so it's always computed fresh.
pre-allocation. the cache is a flat
[layer][head][pos][head_dim] array: 12 layers × 12 heads × 2048 positions × 64 dimensions × 4 bytes = ~72 mb each for k and v, ~144 mb total. allocated once, upfront, in KVCache::new(). the
append path is a copy_from_slice into a pre-computed
offset. there's a scratch buffer for attention score rows
(qk_scratch), also pre-allocated to
max_seq_len, reused across all heads and tokens in a
decode step via clear() + resize(). the
resize never re-allocates because capacity was set at construction
time. zero heap allocations in the inference hot path.
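a sketch of that layout and the append path (field and method names approximate; the real KVCache also owns the qk_scratch buffer):

```rust
/// flat, pre-allocated kv cache: [layer][head][pos][head_dim] for k and v.
struct KvCacheSketch {
    k: Vec<f32>,
    v: Vec<f32>,
    n_heads: usize,
    max_seq_len: usize,
    head_dim: usize,
}

impl KvCacheSketch {
    /// allocate everything once, upfront; the hot path never grows these vecs.
    fn new(n_layers: usize, n_heads: usize, max_seq_len: usize, head_dim: usize) -> Self {
        let elems = n_layers * n_heads * max_seq_len * head_dim;
        Self { k: vec![0.0; elems], v: vec![0.0; elems], n_heads, max_seq_len, head_dim }
    }

    fn offset(&self, layer: usize, head: usize, pos: usize) -> usize {
        ((layer * self.n_heads + head) * self.max_seq_len + pos) * self.head_dim
    }

    /// append one position's k/v rows for one head: a copy_from_slice into a
    /// pre-computed offset, no allocation.
    fn append(&mut self, layer: usize, head: usize, pos: usize, k_row: &[f32], v_row: &[f32]) {
        let off = self.offset(layer, head, pos);
        self.k[off..off + self.head_dim].copy_from_slice(k_row);
        self.v[off..off + self.head_dim].copy_from_slice(v_row);
    }
}
```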
six bugs total. four were standard off-by-one and format-misunderstanding fare. two were the kind where the model produces output that looks almost correct but is systematically wrong.
the column-major scramble. the tensor info header reports the logical shape ([768, 50257] for the embedding matrix), but the dequantized flat buffer is laid out column-major: the data for tensor[i][j] is at offset j * rows + i, not i * cols + j. the loader reshaped with the logical dims in row-major order, scrambling every weight matrix.
the fix: reverse the dims vec so the reshape matches the column-major storage, then adjust the transposes in Gpt2::from_loader: embeddings need no manual transpose (the loader already reversed dims, so index_select picks rows directly), while linear weights get a .transpose() to restore [in_features, out_features] for matmul.
"1, 2, 3, 4, 5" →
"/5/5/5/5/5/5/5/". the model could attend to at
most one prompt token. it saw the last token ("5") and nothing
else.
the cause: KVCache::append wrote every entry at
self.cursor. but cursor is
deliberately not advanced until
Gpt2::forward_with_cache finishes all 12 layers.
during prefill, when
Attention::forward_with_cache looped over prompt
tokens calling append, every call landed at
cursor = 0, each overwriting the last. only the
final prompt token's k/v projection survived; all other
positions stayed zero.
the fix: append now takes an explicit pos: usize parameter. Attention::forward_with_cache snapshots cache.cursor() before the loop and passes cursor + pos, so each token lands at its correct absolute cache slot. the cursor is still batch-advanced afterward in Gpt2::forward_with_cache; that part was already correct.
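in terms of the cache sketch from earlier, the fixed prefill path looks roughly like this (names approximate):

```rust
/// prefill for one layer/head: write each prompt position at an absolute slot
/// derived from a cursor snapshot, instead of reusing a not-yet-advanced cursor.
fn prefill_one_head(
    cache: &mut KvCacheSketch,
    layer: usize,
    head: usize,
    base: usize,            // snapshot of cache.cursor() taken before the loop
    prompt_k: &[Vec<f32>],  // one k row per prompt token
    prompt_v: &[Vec<f32>],  // one v row per prompt token
) {
    for (pos, (k_row, v_row)) in prompt_k.iter().zip(prompt_v).enumerate() {
        cache.append(layer, head, base + pos, k_row, v_row);
    }
    // the cursor itself is still batch-advanced once per forward pass,
    // after all 12 layers have appended.
}
```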
"1, 2, 3, 4, 5" →
", 6, 7, 8, 9, 10, 11, 12,".
three design decisions that paid off immediately:
embeddings transposed at load time. on disk the embedding matrices are stored embed-dimension-first ([768, 50257] for the token embeddings); the loader transposes them to [vocab, embed] so index_select picks a row with one contiguous slice instead of gathering strided elements. cost: one transpose at startup. benefit: every inference step memcpys instead of gathering.
the fully-masked softmax guard. a softmax row where every position is masked (-inf) produces NaN under standard softmax. one branch, if max == -inf, return 1/n per position, prevents NaN from propagating through 12 transformer layers.
#[must_use] on all pure tensor ops.
tensor operations allocate new storage; they never mutate in place. calling x.softmax() without binding the result silently drops the allocation. #[must_use] caught, at compile time, the bug of forgetting to assign the layer norm output back to x in the transformer block loop. that bug produces silently wrong logits with no panic.
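a minimal illustration of the pattern (toy types, not ember's): the attribute turns the dropped-result bug into a compiler warning:

```rust
struct Tensor(Vec<f32>);

impl Tensor {
    /// pure op: allocates and returns a new tensor, never mutates in place.
    #[must_use = "layer_norm returns a new tensor; assign it back"]
    fn layer_norm(&self) -> Tensor {
        Tensor(self.0.clone()) // body elided; the attribute is the point
    }
}

fn block_loop(mut x: Tensor) -> Tensor {
    x.layer_norm();     // warning: unused result (the original silent bug)
    x = x.layer_norm(); // the fix: rebind x
    x
}
```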
the kv cache was the single largest performance change. before it,
every generation step recomputed attention over the full growing
sequence, 71 forward passes for a 50-token prompt with 20 generated
tokens, each doing o(n²) attention. after wiring the cache into
Attention::forward_with_cache, the generation loop does
one prefill pass (populating the cache) plus one o(n) decode step
per generated token.
| metric | before | after |
|---|---|---|
| forward passes (50-token prompt + 20 gen) | 71 (o(n²) each) | 1 prefill + 20 decode (o(n) each) |
| decode throughput | ~2 tok/s | ~12 tok/s |
| attention hot-path allocations | 144 per token (12 layers × 12 heads) | 0 (scratch buffer pre-allocated in kv cache) |
| kv cache memory (2048-context, 12 layers, 12 heads, 64 head_dim) | n/a | ~144 mb (~72 mb each for k and v) |
the scratch buffer eliminated 144 small heap allocations per
generated token for a 12-layer, 12-head model. the buffer is
pre-allocated in KVCache::new(); the attention loop
calls clear() + resize() on it, and the
resize never re-allocates because capacity equals
max_seq_len.
after five bugs and a column-major revelation, the model ran on the prompt "the" with greedy sampling, temperature zero, deterministic argmax, no randomness. if every tensor op was correct, the output would be the same every time. it was:
it's not good prose. it's a 124m-parameter model from 2019 with a one-word prompt. but every word is a real english word, capitalized correctly, in grammatical order. no invalid utf-8, no NaN tokens, no random integers. every tensor op, softmax, gelu, layer norm, matmul, the qkv split, the causal mask, the kv cache append-and-read, worked together, and what came out the other end was text.
the repetition is also informative. gpt-2 is not chatgpt. it's a next-token predictor trained on internet text. given "the", its most likely continuation involved cheese, and it repeated "cheese" 17 times because once it entered that loop, the attention pattern reinforced itself. that's not a bug in ember, it's the model doing what it was trained to do. the fact that ember faithfully reproduces that behavior is the proof that the pipeline is correct.
ember runs gpt-2. the path forward is llama-family support, which requires four new primitives:
rmsnorm. llama normalizes by root mean square only, with no mean subtraction and no bias; a simpler sibling of the existing layer_norm method.
silu. x * sigmoid(x), self-gating, smooth. the gated ffn in llama uses it in a gate_proj * up_proj element-wise product before the down projection.
rope. rotary position embeddings replace gpt-2's learned wpe lookup table. instead of storing a vector for each absolute position, rope encodes position by rotating pairs of dimensions in the query and key vectors by an angle proportional to the position. the attention score between positions i and j depends only on their relative distance (j − i), not their absolute positions, so the model generalizes to sequence lengths it wasn't trained on.
grouped-query attention. fewer k/v heads than query heads; each k/v head is shared by n_q_heads / n_kv_heads q heads.
each primitive is specified with starter code and tests in the repo.
the Backend trait was designed for exactly this: adding a new op to the trait makes it available to every future backend automatically.
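as a taste of how small each primitive is, a scalar sketch of rmsnorm (not code from the repo, just the per-row math a future Backend method would wrap): x / sqrt(mean(x²) + ε), scaled by a learned weight, with no mean subtraction and no bias.

```rust
/// rmsnorm over one row: normalize by the root mean square, then scale.
fn rms_norm_row(x: &mut [f32], weight: &[f32], eps: f32) {
    let n = x.len() as f32;
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / n;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    for i in 0..x.len() {
        x[i] = x[i] * inv_rms * weight[i];
    }
}
```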
simd. the current matmul delegates to
matrixmultiply::sgemm, a pure rust implementation that
is correct but scalar. the next performance jump comes from
vectorized kernels, avx2 on x86, neon on arm. the
Backend trait is the insertion point: a
SimdBackend implementing the same trait with
architecture-specific intrinsics. same model code. 4× to 8×
throughput.
after that: gpu support (the Backend trait means same
transformer code on cuda or metal), then embedded targets, arm
cortex-m, no_std, models in the 10m–100m parameter range running on
microcontrollers.