A glossary of terms used across the site, with brief explanations.
- inference engine
- The program that runs a model and produces text. It takes a prompt and returns tokens. ember is an example.
- tokenizer
- Converts text to tokens (integer IDs) and back. The model doesn't read letters; it reads numbers. The tokenizer is the translator between them.
- token
- A small text unit. Can be a full word, part of a word, or a single character. The model generates one token after another.
- BPE (byte pair encoding)
- A tokenization method that repeatedly merges the most frequent pair of adjacent symbols (characters or bytes) into a new symbol. This builds a vocabulary of subword units. Used by GPT-2 and GPT-4.
- embeddings
- Vectors (lists of numbers) that represent a word or token in a multi-dimensional space. Similar words end up with nearby vectors.
- transformer
- The neural network architecture that powers GPT, LLaMA, and most modern LLMs. Built on attention instead of recurrence.
- attention
- A mechanism that lets each token "attend" to other tokens and gather weighted information based on relevance. This is what links words together.
- softmax
- A mathematical function that converts any set of numbers into a probability distribution (summing to 1). Used in attention and in picking the next token.
- logits
- The raw output from the model before conversion to probabilities. Unbounded numbers, one per token in the vocabulary; a higher logit means the model gives that token a higher probability relative to the others.
- temperature
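- Controls the randomness of the output by dividing the logits by the temperature before softmax. Low values = predictable, repetitive output. High values = creative, varied output. Zero is treated as deterministic (greedy) decoding: always pick the highest-probability token.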
- top-k sampling
- Randomly samples from the k highest-probability tokens only, after renormalizing their probabilities. Cuts off weak choices and keeps the selection among the best candidates.
- top-p (nucleus) sampling
- Samples from the smallest set of tokens whose cumulative probability reaches p. Unlike top-k, the number of candidates adapts to the probability distribution: few when the model is confident, more when it is uncertain.
- layer normalization
- Normalizes each token's activations across the feature dimension to have mean zero and variance one, then applies a learned scale and shift. Helps stabilize training and reduces the risk of exploding and vanishing gradients.
- GELU
- An activation function used in GPT-2. A smoother version of ReLU. Lets small negative inputs pass through partially instead of clipping them to zero.
- RoPE (rotary position embeddings)
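- A method for encoding a token's position in the sequence by rotating pairs of dimensions in the query and key vectors by an angle that depends on the position. Attention scores then depend on the relative distance between tokens, which can help with longer sequences, although going well beyond the training length usually needs extra techniques. Used in LLaMA.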
- GQA (grouped-query attention)
- An attention optimization that reduces the number of key/value heads relative to query heads, so that several query heads share one key/value head. Saves memory (a smaller key/value cache) and improves speed, usually with little quality loss.
- SwiGLU / SiLU
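- Newer activation functions used in LLaMA. SiLU = x times sigmoid(x). SwiGLU = a gated linear unit that uses SiLU on the gate: one linear projection passes through SiLU and multiplies a second linear projection element-wise.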
- RMS norm
- A simplified version of layer normalization. Divides the activations by their root mean square without subtracting the mean first. Cheaper to compute and used in LLaMA.
- probing
- A research technique for understanding what a model learns internally. Trains a simple classifier on hidden states and checks whether it can predict a specific linguistic property.
- morphology (الصرف)
- The study of word structure. In Arabic: how a triliteral root (e.g. k-t-b) merges with patterns (e.g. faʿala, mafʿūl) to produce different words.
- root-pattern (non-concatenative) system
- A morphological system in which the root (consonants) and the pattern (a vocalic template) interleave instead of concatenating. This is unlike English, which mostly glues prefix + stem + suffix in sequence.
- nonce roots
- Novel (invented) roots that don't exist in the language. Used in experiments to distinguish between memorization (the model memorized the word) and generalization (the model learned the pattern).
- clitic
- A small morpheme that behaves grammatically like a separate word but clings phonologically to a host word. In Arabic: prepositions (bi-, li-, ka-) and attached pronouns (-hu, -hum).
- finite-state transducer
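- A finite-state machine that reads an input string and writes an output string according to fixed transition rules. Fast, and its output is fully determined by those rules. Used by older NLP systems before neural networks, for example for morphological analysis.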
- RAG (retrieval-augmented generation)
- A technique that combines document retrieval with generation. Relevant passages are retrieved from a knowledge base and added to the prompt before the model answers, so it does not rely only on its internal memory.
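
As an illustration of the softmax, temperature, top-k, and top-p entries above, here is a minimal Python sketch of how they might fit together when picking the next token. It is not taken from ember or any particular library: the function names (`softmax`, `top_k_filter`, `top_p_filter`, `sample`, `greedy`) and the example logits are assumptions made for this example, and real inference engines typically work on tensors rather than Python lists.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature divides the logits first."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their probabilities."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in ranked)
    return {i: probs[i] / total for i in ranked}

def top_p_filter(probs, p):
    """Keep the smallest set of token ids whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample(candidates, rng=random):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

def greedy(logits):
    """Temperature zero is usually handled as greedy decoding: take the argmax."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits, temperature=0.8)
print(top_k_filter(probs, k=2))            # two candidates survive
print(top_p_filter(probs, p=0.9))          # candidate count depends on the distribution
print(sample(top_p_filter(probs, p=0.9)))  # one sampled token id
print(greedy(logits))                      # the single most likely token id
```

Lowering `temperature` sharpens the distribution, so fewer tokens are needed to reach a given p, while raising it flattens the distribution and lets more candidates through.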