when the logits lie: gemma 4 gguf debugging

llama and qwen3 both landed golden-logit parity against llama.cpp within a few days each. cosine ~0.9998, top-1 matched across english and arabic prompts, the pipeline was clean. gemma 4 was supposed to be the third one.

the model loaded. dimensions checked out, 35 layers, hidden dim 1536, vocab 262144. the forward pass ran. didn't crash. the logits weren't nan. first prompt i tried was arabic, "كتب الطالب".

the output was english. sometimes japanese. entropy 3.6 out of 12.48, nearly uniform. llama.cpp on the same prompt had entropy 0.06. the logits were so flat that the model couldn't pick a token with any confidence. it just... emitted things. coherent things, wrong language, like a multilingual slot machine.

llama: the sanity check

this matters because it set expectations. llama 3.2 1b hit cosine 0.99947–0.99992 across 7 prompts, top-1 matched every time, top-5 overlap was 5/5, top-10 was 9–10/10. max abs diff 0.36. the bugs along the way were real but diagnosable, kv cache recreated every decode step, entire token sequence passed during decode instead of just the new token, logits sliced with embed_dim instead of vocab_size. all fixable in an afternoon each.

qwen3 0.6b was similar. cosine 0.99955–0.99979, all top-1 match, max abs diff 0.59. arabic prompts dropped slightly, 0.99955 vs 0.99975, but still matched. different tokenizer, qk norm, slightly different rope config, but the architecture was close enough to llama's that the port was mostly swapping metadata.

prompt	llama cosine	qwen3 cosine
"Hello"	0.99991	—
"Hello world"	0.99986	0.99979
"كتب الطالب"	0.99984	0.99974
"الطالبات كتبن"	—	0.99955

llama: 7 prompts, max abs diff 0.36, mean 0.040 · qwen3: 5 prompts, max abs diff 0.59, mean 0.079

after two clean ports i had exactly the wrong amount of confidence for gemma 4. i expected a weekend. it took two weeks and the answer is still not really satisfying.

gemma 4: the play

act i: the shapes matched, which meant almost nothing

gemma 4 is weirder than llama. per-layer embeddings (ple), global projection layers, layer output scales, a separate rope_freqs.weight tensor for partial rope, bf16 weights in the gguf, tied embeddings for the lm head, final logit softcapping at 30.0. the block layout has three sub-pathways per layer instead of two. none of this is exotic, exactly, but there are more places to get things wrong.

the shapes matched. the model ran. the logits were flat garbage. "كتب الطالب" went in. english came out. sometimes japanese. i started trying things.

act ii: the ablation graveyard

each of these was a hypothesis. build, run golden-logit comparison against llama.cpp, check cosine. most took a few hours. some took a day.

what i changed	cosine vs llama.cpp	verdict
disable ple entirely	0.08	nope
disable final softcap	no real change	nope
move ple to end of block	0.10	nope
move ple to start of block	0.72	this one helped
disable embedding scaling (sqrt(1536))	0.86	minor, kept it
disable layer output scales	−0.54	essential, don't touch
unweighted rms norm on v	0.70 (worse)	nope
flip post-norm ordering	0.92 → 0.13	nope, and concerning
scale ple by sqrt(per_layer_dim)	0.88 → 0.82	nope
replace q8_0 matmul with f32	identical	not quantization
force scalar sum_squares (no simd)	identical	not simd
check rmsnorm weights vs gguf	cosine 1.0, diff 0.0	not weight loading
add rope freq_factors to global layers	minimal change	correct, kept it

the ple placement move from end-of-block to start-of-block was the first real jump, 0.10 to 0.72. but 0.72 is still wrong. a cosine of 0.72 on a vocabulary of 262k tokens means the model is in roughly the right neighborhood but not producing the right answer.

the post-norm ordering one was interesting in a bad way. llama.cpp applies norm before the residual add, that's the standard transformer formulation. when i flipped ember to match, cosine dropped from 0.92 to 0.13. that meant something else in the block was wrong in the opposite direction, and the two errors were accidentally cancelling. this kind of thing makes you question every assumption you have about the code.

after maybe ten rejected hypotheses the pattern was becoming visible. this wasn't going to be one clean bug with a satisfying one-line fix.

act iii: cosine, the beautiful liar

at one point cosine improved from 0.84 to 0.92. i thought i was making progress. top-5 overlap was still zero. top-1 was still wrong. the model was still producing english from arabic.

what cosine measures: the angle between two vectors in 262k-dimensional space. a vector where most entries are medium-valued (flat distribution) can have high cosine with a sharply peaked vector if they point in roughly the same direction. softmax doesn't care about the angle, it cares about the relative differences between entries. so cosine can go up while the model stays broken.

the metric that actually tracks correctness is top-k overlap with the reference. if llama.cpp's top-5 tokens don't appear in ember's top-5, cosine 0.99 is meaningless. you're sampling from a different distribution.

i burned at least two days chasing cosine improvements that turned out to be noise. don't do that. watch top-k.

act iv: the layerwise oracle

at some point comparing final logits wasn't giving enough signal. a final cosine of 0.87 could mean one layer is catastrophically wrong, or every layer is slightly wrong and the errors compound. you can't tell from the output alone.

so i patched llama.cpp to dump per-layer hidden states. three source files modified, llama-graph.h, llama-graph.cpp, llama-context.cpp, plus gemma4.cpp to push each block output onto a vector. compiled a small c++ helper that evaluates a bos token and writes 35 × 1536 floats to a binary file. ember already had a --dump-layers flag from earlier work, same format. python script reads both, computes per-layer cosine and l2 norms.

layer 0 (attn_norm): cosine 1.000 , bit-identical layer 1: cosine 0.998 layer 2: cosine 0.994 layer 3: cosine 0.990 layer 5 (global attention): cosine 0.82 , first real drop layer 10 (global attention): cosine 0.62 layer 15 (global attention): cosine 0.096 , worst single layer layer 23: cosine 0.031 layer 34 (final hidden): cosine 0.51 , before lm head final logits: cosine 0.87 , after output projection

first: layer 0 is bit-identical. the attn_norm output has the same floating-point values in both implementations. that means the embedding lookup, tokenization, and initial normalization are all correct. the pipeline starts exactly where it should.

second: the divergence is gradual. there's no single cliff where cosine drops from 0.99 to 0.10 in one layer. each layer loses a little more alignment. the global attention layers hit harder, every 5th layer has head_dim 512 instead of 256 and a different rope theta, so there's more surface area for numerical differences to accumulate.

third: the lm head recovers cosine from 0.51 to 0.87. the final hidden state is pretty divergent from llama.cpp's, but after multiplying by the tied token_embd.weight matrix, the resulting logit vector happens to point more toward the reference. this is not the model "getting better", it's an artefact of the projection matrix. and it's part of why cosine on final logits was misleading: it looked better than the hidden states actually were.

finale: the structural bugs were loud

with the layerwise pipeline working, i could actually see which changes helped and which didn't. the structural bugs, where ember was doing a fundamentally different computation from llama.cpp, were loud. fixing one moved cosine visibly.

ple pathway. gemma's per-layer embedding uses blk.{i}.inp_gate.weight (1536 → 256) and blk.{i}.proj.weight (256 → 1536). the proj weight was loaded with a transpose. additionally, there's a global ple projection: per_layer_model_proj [1536, 8960] combined with raw ple lookup, scaled by 1/sqrt(2). the global proj is stored as bf16 in the gguf, ember had no bf16 loader. added one.

block layout. the order of operations per block had to exactly match llama.cpp's graph: attn_norm → attention → post_attn_norm → residual, then ffn_norm → gate/gelu/up/down → post_ffn_norm → residual, then ple → post_ple_norm → residual, then multiply by layer_output_scale. getting one norm in the wrong place shifts everything downstream.

embedding scaling. token embeddings get multiplied by sqrt(1536). this is in the reference, i had missed it. minor impact on cosine (~0.02) but it's correct.

layer output scales. each block has a learned scalar with geometric mean ~0.42. disabling this gave cosine negative 0.54. not optional.

gelu. both mlp and ple use the tanh approximation (ggml_gelu), not exact gelu. this matters for matching numerics.

rope freq_factors. the rope_freqs.weight tensor (256 values) controls partial rope: 64 frequency pairs get rotation, 192 pairs get identity (factor ~1e30 → freq = 0). this is different from llama where all pairs get rotated.

final softcap. gguf metadata key gemma4.final_logit_softcapping says 30.0. i had it hardcoded to 15.0. halved the effective range.

tied embeddings. no separate output.weight, the lm head reuses token_embd.weight. path is hidden → output_norm → tied logits → softcap(30.0). not complicated, just different from llama.

fix	rough impact
ple placement at start of block	cosine 0.10 → 0.72
global ple projection + bf16 loader	correct pre-projection
block layout aligned to llama.cpp	operations match
embedding scale sqrt(1536)	~0.02
layer output scales	essential (disabled → −0.54)
gelu tanh approximation	matches ggml_gelu
rope freq_factors	partial rope correct
final softcap 30.0	doubled effective range
tied embeddings	no missing output.weight

after all of these landed, final logit cosine settled around 0.87. the model produced coherent english. arabic still diverged, but it was no longer producing japanese from arabic prompts. the remaining gap was about 0.13.

curtain call: the last 0.13

at this point i had fixed every structural mismatch i could find. the pipeline started bit-identical at layer 0. rmsnorm was verified six ways: manual computation matched the backend, weights matched gguf exactly (cosine 1.0, l2 diff 0.0), simd sum_squares matched scalar, and 26 tests passed. the code was correct as far as i could tell.

what i eventually understood, by dumping and comparing l4 attn_norm inputs from both implementations, was that the remaining gap isn't a bug. it's numerical sensitivity.

raw input cosine (before rmsnorm): 0.997 ← looks basically fine rmsnorm output cosine (after): 0.477 ← what rmsnorm weight stats for this layer: rms of weights: 41 max weight: 323 an input that still looked almost aligned by cosine became heavily misaligned after rmsnorm, because the weight vector scales certain dimensions much more than others. this happens at every layer's normalization step.

gemma 4's rmsnorm weights are just large. the max value in some layers is over 200. that means even a tiny angular difference in the input, from gelu precision, matmul accumulation order, attention softmax numerics, whatever, gets amplified into a visible divergence. this does not look like a simple bug in Rust, C, or the high-level implementation. it's a property of the model architecture interacting with floating-point non-associativity.

closing the remaining gap would likely require matching llama.cpp's numerical execution path much more closely: accumulation order, approximations, quantized matmul details, and attention/softmax precision. the practical answer is: after structural bugs are resolved, cosine ~0.87 with coherent output is a reasonable stopping point, matching llama.cpp's structure and behavior without cloning its exact numerical execution path.

i spent about a week on this conclusion and i'm still not entirely happy with it. but the evidence is consistent. l0 is bit-identical. the divergence is gradual, not catastrophic. rmsnorm weights are the amplifier. i couldn't find evidence of a remaining structural mismatch.

things i learned

current status

appendix: commands and numbers

golden-logit comparison

layerwise comparison

tests

summary numbers

model	prompts	cosine range	top-1 match	max abs diff
llama 3.2 1b	7	0.99947–0.99992	yes	0.36
qwen3 0.6b	5	0.99955–0.99979	yes	0.59
gemma 4 e2b (early, flat logits)	1	0.18	no	—
gemma 4 e2b (intermediate)	1	0.67	no	31.7
gemma 4 e2b (after fixes)	1	~0.87	no	—

key files

file	what
src/gemma4.rs	full gemma 4 forward pass (~1980 lines)
src/tensor.rs	compute_rope_freqs, rms_norm, softmax
src/loader.rs	gguf parsing, bf16 (type 30)
src/simd.rs	avx2/neon kernels, sum_squares check
src/quant.rs	q8_0 dequantization
tools/dump_llamacpp_layers.cpp	llama.cpp layer dump helper
scripts/compare_layer_dumps.py	layerwise comparison