
when the logits lie: gemma 4 gguf debugging

2026-06-10 · golden-logit parity · gemma 4 · llama.cpp reference
arabic went in. english and japanese came out. that was the first sign something was deeply wrong.

llama and qwen3 both landed golden-logit parity against llama.cpp within a few days each. cosine ~0.9998, top-1 matched across english and arabic prompts, the pipeline was clean. gemma 4 was supposed to be the third one.

the model loaded. dimensions checked out, 35 layers, hidden dim 1536, vocab 262144. the forward pass ran. didn't crash. the logits weren't nan. first prompt i tried was arabic, "كتب الطالب".

the output was english. sometimes japanese. entropy was 3.6 nats out of a possible 12.48 (ln 262144), far flatter than it should have been. llama.cpp on the same prompt had entropy 0.06. the logits were so flat that the model couldn't pick a token with any confidence. it just... emitted things. coherent things, wrong language, like a multilingual slot machine.
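
for reference, this is roughly how an entropy number like that can be computed from a logit vector. it's a minimal sketch with toy logits, not ember's code: `softmax_entropy` is a hypothetical helper, and it assumes entropy is measured in nats, which is what makes the ceiling for a 262144-token vocab come out to 12.48.

```python
import math

def softmax_entropy(logits):
    """Shannon entropy, in nats, of softmax(logits)."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    # H = -sum(p * ln p); entries with p == 0 contribute nothing
    return -sum((e / z) * math.log(e / z) for e in exps if e > 0.0)

vocab = 262144
max_entropy = math.log(vocab)                  # entropy of a uniform distribution

flat = [0.0] * 1000                            # toy logits: every token equally likely
peaked = [0.0] * 999 + [20.0]                  # toy logits: one token dominates

print(f"max entropy for vocab {vocab}: {max_entropy:.2f}")
print(f"flat toy logits:   {softmax_entropy(flat):.2f}")
print(f"peaked toy logits: {softmax_entropy(peaked):.4f}")
```

a confident model sits near the bottom of that range. the further a broken one drifts up from there, the more tokens it is effectively choosing between.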

this is what it took to figure out why.


llama: the sanity check

this matters because it set expectations. llama 3.2 1b hit cosine 0.99947–0.99992 across 7 prompts, top-1 matched every time, top-5 overlap was 5/5, top-10 was 9–10/10. max abs diff 0.36. the bugs along the way were real but diagnosable: kv cache recreated every decode step, the entire token sequence passed during decode instead of just the new token, logits sliced with embed_dim instead of vocab_size. each was fixable in an afternoon.

qwen3 0.6b was similar. cosine 0.99955–0.99979, all top-1 match, max abs diff 0.59. arabic prompts dropped slightly, 0.99955 vs 0.99975, but still matched. different tokenizer, qk norm, slightly different rope config, but the architecture was close enough to llama's that the port was mostly swapping metadata.

prompt             llama cosine   qwen3 cosine
"Hello"            0.99991        -
"Hello world"      0.99986        0.99979
"كتب الطالب"       0.99984        0.99974
"الطالبات كتبن"    -              0.99955

llama: 7 prompts, max abs diff 0.36, mean 0.040 · qwen3: 5 prompts, max abs diff 0.59, mean 0.079

after two clean ports i had exactly the wrong amount of confidence for gemma 4. i expected a weekend. it took two weeks and the answer is still not really satisfying.


gemma 4: the play

act i: the shapes matched, which meant almost nothing

gemma 4 is weirder than llama. per-layer embeddings (ple), global projection layers, layer output scales, a separate rope_freqs.weight tensor for partial rope, bf16 weights in the gguf, tied embeddings for the lm head, final logit softcapping at 30.0. the block layout has three sub-pathways per layer instead of two. none of this is exotic, exactly, but there are more places to get things wrong.

the shapes matched. the model ran. the logits were flat garbage. "كتب الطالب" went in. english came out. sometimes japanese. i started trying things.

the shapes matched, which meant almost nothing.

act ii: the ablation graveyard

each of these was a hypothesis. build, run golden-logit comparison against llama.cpp, check cosine. most took a few hours. some took a day.

what i changed                            cosine vs llama.cpp     verdict
disable ple entirely                      0.08                    nope
disable final softcap                     no real change          nope
move ple to end of block                  0.10                    nope
move ple to start of block                0.72                    this one helped
disable embedding scaling (sqrt(1536))    0.86                    minor, kept it
disable layer output scales               −0.54                   essential, don't touch
unweighted rms norm on v                  0.70 (worse)            nope
flip post-norm ordering                   0.92 → 0.13             nope, and concerning
scale ple by sqrt(per_layer_dim)          0.88 → 0.82             nope
replace q8_0 matmul with f32              identical               not quantization
force scalar sum_squares (no simd)        identical               not simd
check rmsnorm weights vs gguf             cosine 1.0, diff 0.0    not weight loading
add rope freq_factors to global layers    minimal change          correct, kept it

[figure: ablation graveyard. horizontal bar chart showing attempted fixes and their impact on cosine similarity]

the ple placement move from end-of-block to start-of-block was the first real jump, 0.10 to 0.72. but 0.72 is still wrong. a cosine of 0.72 on a vocabulary of 262k tokens means the model is in roughly the right neighborhood but not producing the right answer.

the post-norm ordering one was interesting in a bad way. llama.cpp applies norm before the residual add, that's the standard transformer formulation. when i flipped ember to match, cosine dropped from 0.92 to 0.13. that meant something else in the block was wrong in the opposite direction, and the two errors were accidentally cancelling. this kind of thing makes you question every assumption you have about the code.

after maybe ten rejected hypotheses the pattern was becoming visible. this wasn't going to be one clean bug with a satisfying one-line fix.


act iii: cosine, the beautiful liar

at one point cosine improved from 0.84 to 0.92. i thought i was making progress. top-5 overlap was still zero. top-1 was still wrong. the model was still producing english from arabic.

what cosine measures: the angle between two vectors in 262k-dimensional space. that angle is dominated by the bulk of the entries, so a flat logit vector can have high cosine with a sharply peaked one as long as the bulk agrees. softmax doesn't care about the angle, it cares about the relative differences between entries, and in practice about the few largest ones. so cosine can go up while the model stays broken.

cosine improved. top-k stayed at zero. cosine was lying.
[figure: cosine vs top-k overlap. dual-axis chart showing cosine similarity improving while top-5 overlap stayed at zero]

the metric that actually tracks correctness is top-k overlap with the reference. if llama.cpp's top-5 tokens don't appear in ember's top-5, cosine 0.99 is meaningless. you're sampling from a different distribution.
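
a toy version of the failure mode, to make it concrete. this is a sketch with made-up logits, not data from the gemma runs: two vectors share the same bulk, the reference is sharply peaked on five tokens, and the candidate is barely raised on five different tokens. `cosine`, `top_k` and `top_k_overlap` are hypothetical helper names.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(logits, k):
    """Indices of the k largest logits."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def top_k_overlap(a, b, k):
    """How many of a's top-k token ids also appear in b's top-k."""
    return len(set(top_k(a, k)) & set(top_k(b, k)))

n = 1000
# shared "bulk": every entry sits between -10.5 and -9.5
bulk = [-10.0 + 0.5 * math.sin(i) for i in range(n)]

reference = list(bulk)
for i in range(5):
    reference[i] = 5.0             # sharply peaked on tokens 0..4

candidate = list(bulk)
for i in range(500, 505):
    candidate[i] = -7.0            # barely above the bulk, and on the wrong tokens

cos = cosine(reference, candidate)
overlap = top_k_overlap(reference, candidate, 5)
print(f"cosine={cos:.4f}  top-5 overlap={overlap}/5")
```

the cosine comes out high because 990 of the 1000 entries are identical, while the two vectors disagree on every token that would actually get sampled.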

i burned at least two days chasing cosine improvements that turned out to be noise. don't do that. watch top-k.


act iv: the layerwise oracle

at some point comparing final logits wasn't giving enough signal. a final cosine of 0.87 could mean one layer is catastrophically wrong, or every layer is slightly wrong and the errors compound. you can't tell from the output alone.

so i patched llama.cpp to dump per-layer hidden states. three source files modified, llama-graph.h, llama-graph.cpp, llama-context.cpp, plus gemma4.cpp to push each block output onto a vector. compiled a small c++ helper that evaluates a bos token and writes 35 × 1536 floats to a binary file. ember already had a --dump-layers flag from earlier work, same format. python script reads both, computes per-layer cosine and l2 norms.
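
the comparison side of that is simple enough to sketch. this is not scripts/compare_layer_dumps.py itself, just a minimal stand-in under stated assumptions: the dump is a headerless, row-major run of little-endian float32 values, one row per layer. the demo writes two tiny 3 × 4 dumps to a temp directory instead of the real 35 × 1536, and `read_layer_dump`, `compare_dumps` and `write_dump` are hypothetical names.

```python
import math
import os
import struct
import tempfile

def read_layer_dump(path, layers, hidden):
    """Read layers*hidden little-endian float32 values, one row per layer."""
    with open(path, "rb") as f:
        data = f.read()
    expected = layers * hidden * 4
    if len(data) != expected:
        raise ValueError(f"{path}: expected {expected} bytes, got {len(data)}")
    flat = struct.unpack(f"<{layers * hidden}f", data)
    return [flat[i * hidden:(i + 1) * hidden] for i in range(layers)]

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    na, nb = l2(a), l2(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def compare_dumps(ember, reference):
    """Return (layer, cosine, ember_l2, reference_l2) for each layer."""
    return [(i, cosine(e, r), l2(e), l2(r))
            for i, (e, r) in enumerate(zip(ember, reference))]

def write_dump(path, rows):
    """Demo helper: write rows in the assumed dump format."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(struct.pack(f"<{len(row)}f", *row))

# toy data: layer 0 identical, later layers progressively perturbed
ref_rows = [[1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0, 2.0], [0.5, 0.5, 0.5, 0.5]]
emb_rows = [[1.0, 2.0, 3.0, 4.0], [1.0, 0.25, -1.0, 2.0], [0.5, -0.5, 0.5, 0.5]]

with tempfile.TemporaryDirectory() as tmp:
    ref_path = os.path.join(tmp, "reference.bin")
    emb_path = os.path.join(tmp, "ember.bin")
    write_dump(ref_path, ref_rows)
    write_dump(emb_path, emb_rows)
    report = compare_dumps(read_layer_dump(emb_path, 3, 4),
                           read_layer_dump(ref_path, 3, 4))

for layer, cos, e_norm, r_norm in report:
    print(f"layer {layer}: cosine {cos:.4f}  |ember| {e_norm:.3f}  |ref| {r_norm:.3f}")
```

printing the l2 norms next to the cosine is worth the extra column: a layer can keep its direction while its magnitude drifts, and cosine alone won't show that.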

layer 0 (attn_norm): cosine 1.000, bit-identical
layer 1: cosine 0.998
layer 2: cosine 0.994
layer 3: cosine 0.990
layer 5 (global attention): cosine 0.82, first real drop
layer 10 (global attention): cosine 0.62
layer 15 (global attention): cosine 0.096, worst single layer
layer 23: cosine 0.031
layer 34 (final hidden): cosine 0.51, before lm head
final logits: cosine 0.87, after output projection

[figure: layerwise cosine. bar chart showing per-layer cosine similarity between ember and llama.cpp hidden states, highlighting global attention layers and the final logit recovery]

three things stood out.

first: layer 0 is bit-identical. the attn_norm output has the same floating-point values in both implementations. that means the embedding lookup, tokenization, and initial normalization are all correct. the pipeline starts exactly where it should.

second: the divergence is gradual. there's no single cliff where cosine drops from 0.99 to 0.10 in one layer. each layer loses a little more alignment. the global attention layers hit harder, every 5th layer has head_dim 512 instead of 256 and a different rope theta, so there's more surface area for numerical differences to accumulate.

third: the lm head recovers cosine from 0.51 to 0.87. the final hidden state is pretty divergent from llama.cpp's, but after multiplying by the tied token_embd.weight matrix, the resulting logit vector happens to point more toward the reference. this is not the model "getting better", it's an artefact of the projection matrix. and it's part of why cosine on final logits was misleading: it looked better than the hidden states actually were.


finale: the structural bugs were loud

with the layerwise pipeline working, i could actually see which changes helped and which didn't. the structural bugs, where ember was doing a fundamentally different computation from llama.cpp, were loud. fixing one moved cosine visibly.

ple pathway. gemma's per-layer embedding uses blk.{i}.inp_gate.weight (1536 → 256) and blk.{i}.proj.weight (256 → 1536). the proj weight was loaded with a transpose. additionally, there's a global ple projection: per_layer_model_proj [1536, 8960] combined with raw ple lookup, scaled by 1/sqrt(2). the global proj is stored as bf16 in the gguf, ember had no bf16 loader. added one.
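
the bf16 part is small once you know the trick: a bf16 value is the top 16 bits of an ieee-754 f32, so decoding is a shift and a reinterpret. a minimal sketch in python rather than ember's rust, assuming the tensor data is a flat little-endian run of 16-bit values. `bf16_to_f32` and `decode_bf16_buffer` are hypothetical names.

```python
import struct

def bf16_to_f32(bits):
    """Widen one bf16 bit pattern to f32 by placing it in the top 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def decode_bf16_buffer(raw):
    """Decode a bytes object of little-endian bf16 values into python floats."""
    if len(raw) % 2 != 0:
        raise ValueError("bf16 buffer length must be a multiple of 2")
    count = len(raw) // 2
    return [bf16_to_f32(b) for b in struct.unpack(f"<{count}H", raw)]

# bit patterns 0x3F80, 0xC000, 0x3F00, 0x0000 stored little-endian
raw = bytes([0x80, 0x3F, 0x00, 0xC0, 0x00, 0x3F, 0x00, 0x00])
print(decode_bf16_buffer(raw))
```

going from bf16 to f32 is exact, so a loader like this can't introduce rounding of its own. the lossy direction is f32 to bf16.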

block layout. the order of operations per block had to exactly match llama.cpp's graph: attn_norm → attention → post_attn_norm → residual, then ffn_norm → gate/gelu/up/down → post_ffn_norm → residual, then ple → post_ple_norm → residual, then multiply by layer_output_scale. getting one norm in the wrong place shifts everything downstream.
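
written out as code, the order listed above looks something like this. it's a sketch of the ordering only: every op is passed in as a callable, the stand-ins in the demo just record their name and return the input unchanged, and what each real op needs beyond the hidden state (kv cache, positions, the per-layer embedding for the token) is left out. `gemma4_block` and `ops` are hypothetical names, not ember's api.

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]

def gemma4_block(h, ops, layer_output_scale):
    # 1. attention: norm -> attention -> post-norm -> residual add
    a = ops["attn_norm"](h)
    a = ops["attention"](a)
    a = ops["post_attn_norm"](a)
    h = add(h, a)

    # 2. ffn: norm -> gate/gelu/up/down -> post-norm -> residual add
    f = ops["ffn_norm"](h)
    f = ops["ffn"](f)
    f = ops["post_ffn_norm"](f)
    h = add(h, f)

    # 3. ple: per-layer embedding path -> post-norm -> residual add
    p = ops["ple"](h)
    p = ops["post_ple_norm"](p)
    h = add(h, p)

    # 4. learned per-layer scalar applied to the whole block output
    return [x * layer_output_scale for x in h]

# demo with identity stand-ins that record the call order
trace = []

def stand_in(name):
    def op(x):
        trace.append(name)
        return list(x)
    return op

names = ["attn_norm", "attention", "post_attn_norm",
         "ffn_norm", "ffn", "post_ffn_norm",
         "ple", "post_ple_norm"]
ops = {name: stand_in(name) for name in names}

out = gemma4_block([1.0, 2.0], ops, layer_output_scale=0.5)
print(" -> ".join(trace))
print(out)
```

the point of the trace is that the post-norms sit inside each sub-pathway, before the residual add. that's the detail that is easy to get backwards coming from llama.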

embedding scaling. token embeddings get multiplied by sqrt(1536). this is in the reference, i had missed it. minor impact on cosine (~0.02) but it's correct.

layer output scales. each block has a learned scalar with geometric mean ~0.42. disabling this gave cosine negative 0.54. not optional.

gelu. both mlp and ple use the tanh approximation (ggml_gelu), not exact gelu. this matters for matching numerics.
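
the two variants side by side, as a sketch. these are the textbook formulas, the erf form and the usual tanh approximation with the 0.044715 coefficient. i'm not claiming this reproduces ggml's kernel bit for bit, only that it shows how far apart the two definitions sit.

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)

def gelu_exact(x):
    """gelu via the gaussian cdf: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """the usual tanh approximation of gelu."""
    inner = SQRT_2_OVER_PI * (x + 0.044715 * x * x * x)
    return 0.5 * x * (1.0 + math.tanh(inner))

# scan a grid and report the largest gap between the two
xs = [i / 100.0 for i in range(-500, 501)]
max_diff = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
worst_x = max(xs, key=lambda x: abs(gelu_exact(x) - gelu_tanh(x)))

print(f"max |exact - tanh| on [-5, 5]: {max_diff:.6f} at x={worst_x:.2f}")
```

the gap per activation is small, but it is applied to every ffn and ple activation in every layer, which is exactly the kind of drift that matters when you're chasing parity.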

rope freq_factors. the rope_freqs.weight tensor (256 values) controls partial rope: 64 frequency pairs get rotation, 192 pairs get identity (factor ~1e30 → freq = 0). this is different from llama where all pairs get rotated.
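
a sketch of why a factor of 1e30 turns rotation off. it assumes the factor divides the per-pair frequency, which is my reading of how the reference applies it, and it uses a placeholder base of 1e6 rather than the real theta from the gguf metadata. `rope_angles` and `rotate_pair` are hypothetical names, and the pair layout (adjacent vs split halves) is deliberately left out.

```python
import math

def rope_angles(pos, n_pairs, base, freq_factors):
    """Rotation angle for each frequency pair at position pos."""
    angles = []
    for i in range(n_pairs):
        inv_freq = base ** (-i / n_pairs)      # same as base^(-2i/d) with d = 2 * n_pairs
        angles.append(pos * inv_freq / freq_factors[i])
    return angles

def rotate_pair(x0, x1, angle):
    """Standard 2d rotation of one (x0, x1) pair."""
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c

n_pairs = 256
freq_factors = [1.0] * 64 + [1e30] * 192       # 64 rotated pairs, 192 identity pairs
base = 1_000_000.0                             # placeholder, not read from the model

angles = rope_angles(pos=7, n_pairs=n_pairs, base=base, freq_factors=freq_factors)

print("pair 0   (factor 1.0): ", rotate_pair(1.0, 2.0, angles[0]))
print("pair 200 (factor 1e30):", rotate_pair(1.0, 2.0, angles[200]))
```

with the huge factor the angle is so close to zero that cos is 1 and sin vanishes in floating point, so the pair passes through untouched regardless of position.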

final softcap. gguf metadata key gemma4.final_logit_softcapping says 30.0. i had it hardcoded to 15.0. halved the effective range.
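
what the wrong constant does, as a sketch. this assumes the usual tanh form of softcapping, cap * tanh(logit / cap), and uses made-up raw logits. `softcap` is a hypothetical helper.

```python
import math

def softcap(logit, cap):
    """Squash a logit smoothly into the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)

raw = [-80.0, -10.0, 0.0, 1.0, 10.0, 40.0, 80.0]
for x in raw:
    print(f"raw {x:7.1f}   cap 30 -> {softcap(x, 30.0):8.3f}   cap 15 -> {softcap(x, 15.0):8.3f}")
```

small logits pass through almost unchanged either way. the damage is at the top end, where a cap of 15 squeezes the strongest candidates into half the space they should have.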

tied embeddings. no separate output.weight, the lm head reuses token_embd.weight. path is hidden → output_norm → tied logits → softcap(30.0). not complicated, just different from llama.

fix                                   rough impact
ple placement at start of block       cosine 0.10 → 0.72
global ple projection + bf16 loader   correct pre-projection
block layout aligned to llama.cpp     operations match
embedding scale sqrt(1536)            ~0.02
layer output scales                   essential (disabled → −0.54)
gelu tanh approximation               matches ggml_gelu
rope freq_factors                     partial rope correct
final softcap 30.0                    doubled effective range
tied embeddings                       no missing output.weight

[figure: structural fixes timeline. line chart showing cosine improvement as each fix was applied]

after all of these landed, final logit cosine settled around 0.87. the model produced coherent english. arabic still diverged, but it was no longer producing japanese from arabic prompts. the remaining gap was about 0.13.


curtain call: the last 0.13

at this point i had fixed every structural mismatch i could find. the pipeline started bit-identical at layer 0. rmsnorm was verified six ways: manual computation matched the backend, weights matched gguf exactly (cosine 1.0, l2 diff 0.0), simd sum_squares matched scalar, and 26 tests passed. the code was correct as far as i could tell.

and the final cosine was still 0.87.

the structural bugs were loud. the last bug was quiet.

what i eventually understood, by dumping and comparing l4 attn_norm inputs from both implementations, was that the remaining gap isn't a bug. it's numerical sensitivity.

raw input cosine (before rmsnorm): 0.997   ← looks basically fine
rmsnorm output cosine (after):     0.477   ← what

rmsnorm weight stats for this layer:
  rms of weights: 41
  max weight:     323

an input that still looked almost aligned by cosine became heavily misaligned after rmsnorm, because the weight vector scales certain dimensions much more than others. this happens at every layer's normalization step.

gemma 4's rmsnorm weights are just large. the max value in some layers is over 200. that means even a tiny angular difference in the input, from gelu precision, matmul accumulation order, attention softmax numerics, whatever, gets amplified into a visible divergence. this does not look like a simple bug in Rust, C, or the high-level implementation. it's a property of the model architecture interacting with floating-point non-associativity.
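
the amplification is easy to reproduce on a toy vector. this is a deliberately exaggerated sketch, not the real layer: eight dimensions, two inputs that differ only by a tiny sign flip in the last one, and a weight vector that is 1.0 everywhere except a single 323 borrowed from the max above. it assumes a plain weight multiply after normalization and an eps of 1e-6, and `rms_norm` here is a hypothetical helper, not ember's.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rms_norm(x, weight, eps=1e-6):
    """Divide by the root mean square, then scale each dimension by its weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

# two inputs that agree everywhere except a tiny disagreement in the last dimension
x = [1.0] * 7 + [0.01]
y = [1.0] * 7 + [-0.01]

# one dimension carries a very large learned weight
weight = [1.0] * 7 + [323.0]

before = cosine(x, y)
after = cosine(rms_norm(x, weight), rms_norm(y, weight))

print(f"cosine before rmsnorm: {before:.5f}")
print(f"cosine after rmsnorm:  {after:.5f}")
```

the inputs are nearly parallel going in. coming out, the one heavily weighted dimension dominates both vectors, and it is the dimension where they disagree.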

the final boss was not a transpose. it was sensitivity.
[figure: rmsnorm amplification. three-panel figure showing nearly identical inputs, large rmsnorm weights, and the resulting amplified output divergence]

closing the remaining gap would likely require matching llama.cpp's numerical execution path much more closely: accumulation order, approximations, quantized matmul details, and attention/softmax precision. the practical answer is: after structural bugs are resolved, cosine ~0.87 with coherent output is a reasonable stopping point, matching llama.cpp's structure and behavior without cloning its exact numerical execution path.

i spent about a week on this conclusion and i'm still not entirely happy with it. but the evidence is consistent. l0 is bit-identical. the divergence is gradual, not catastrophic. rmsnorm weights are the amplifier. i couldn't find evidence of a remaining structural mismatch.


things i learned


current status

  1. llama/qwen parity: validated, golden-logit cosine 0.9995+
  2. gemma structural parity: fixed (block layout, ple, rope, gelu, softcap, tied embeddings)
  3. gemma exact numerical parity: not bit-identical, cosine ~0.87, l0 starts perfect, divergence is gradual
  4. remaining gap: small numerical drift amplified by rmsnorm weights across 35 layers
  5. next step: dump and compare l4 ffn intermediates (gate_proj → gelu → up_proj → down_proj) to isolate the operation where cosine first drops below 0.99

appendix: commands and numbers

golden-logit comparison

target/release/ember --model gemma-4-E2B-it-Q8_0.gguf --arch gemma4 \
 --prompt "Hello world" --max-seq-len 128 --temperature 0 \
 --dump-logits /tmp/test.npy

python3 -c "
import numpy as np
e = np.load('/tmp/test.npy')[0]
r = np.load('artifacts/golden_logits_gemma/llamacpp_logits.npz')['logits'][0]
cos = np.dot(e,r)/(np.linalg.norm(e)*np.linalg.norm(r))
print(f'cosine={cos:.6f}')
"

layerwise comparison

# requires patched llama.cpp (see docs/layer-dump-tooling.md)
# 1. dump llama.cpp layers
./dump_llamacpp_layers gemma-4-E2B-it-Q8_0.gguf "" llama_layers.bin 16

# 2. dump ember layers
target/release/ember --model gemma-4-E2B-it-Q8_0.gguf --arch gemma4 \
 --prompt "" --max-seq-len 16 --temperature 0 \
 --dump-layers ember_layers.bin

# 3. compare
python3 scripts/compare_layer_dumps.py \
 --ember ember_layers.bin --reference llama_layers.bin \
 --layers 35 --hidden-size 1536 \
 --out-md report.md --out-json report.json

tests

cargo test --lib

summary numbers

model                              prompts   cosine range       top-1 match   max abs diff
llama 3.2 1b                       7         0.99947–0.99992    yes           0.36
qwen3 0.6b                         5         0.99955–0.99979    yes           0.59
gemma 4 e2b (early, flat logits)   1         0.18               no            -
gemma 4 e2b (intermediate)         1         0.67               no            31.7
gemma 4 e2b (after fixes)          1         ~0.87              no            -

key files

file                             what
src/gemma4.rs                    full gemma 4 forward pass (~1980 lines)
src/tensor.rs                    compute_rope_freqs, rms_norm, softmax
src/loader.rs                    gguf parsing, bf16 (type 30)
src/simd.rs                      avx2/neon kernels, sum_squares check
src/quant.rs                     q8_0 dequantization
tools/dump_llamacpp_layers.cpp   llama.cpp layer dump helper
scripts/compare_layer_dumps.py   layerwise comparison