Inference

What exactly happens when an AI runs? AI systems—at least those based on language models—may have all sorts of extra scaffolding that turn them into chatbots, structured workflows, or autonomous agents. At their core, however, all of these systems are performing inference—i.e. making calls to a single function that takes in some context and generates a response. Contexts and responses may carry all sorts of structure (as we will see in the later Templates section), but in their simplest form they are both just text, and hence inference is just a function that takes text input and generates text output:

$$\mathrm{generate} : \mathsf{Text} \to \mathsf{Text}$$

The core of this book—its first three chapters—will successively deconstruct this function until we have a completely unambiguous mathematical description of the entire inference process. As mentioned in the introduction, the book will parallel the code in Pianola, and in this chapter we will follow the specific module src/pianola/inference.py, which implements the above function:

def generate(
    tokenizer: Tokenizer,
    sampler: Sampler,
    model: Model,  
    context: str,
) -> str:
    """generate = encode ; complete ; decode."""
    tokens = tokenizer.encode(context).ids        # encode  : str -> list[int]
    completion = complete(sampler, model, tokens) # complete: list[int] -> list[int]
    return tokenizer.decode(completion)           # decode  : list[int] -> str

The first thing one will notice is that this function has several extra arguments beyond the context string. If one fills in tokenizer, sampler, and model with actual values, then one gets a function as above: i.e. one that takes a context and outputs a response. By far the most central of these is the model—it is the AI—the rest is just plumbing. In this chapter, we will deconstruct all of this plumbing. Then, after a mathematical interlude on tensor algebra in the next chapter, we will spend the third chapter on the model itself.

Tokenization

What exactly is the set $\mathsf{Text}$? It is natural to think of a text as a list of items of some given unit, which we call a token. To make this conversion, we will need a tokenizer, which consists of several components. Firstly, it has a vocabulary, i.e. an enumerated list of tokens, along with an encoding function that decomposes text into a list of token indices:

$$\mathrm{encode} : \mathsf{Text} \to \mathsf{Tokens}$$

We call the number of tokens the vocabulary size and typically denote it with the symbol $V$. In principle, one could make many different choices of tokenization scheme. The main tradeoff to consider is that between vocabulary size and the size of the tokenized output. Two naive choices for a vocabulary are characters and words. Using characters yields a small vocabulary, but produces very long token sequences. Using words shortens sequences, but leads to an impractically large (effectively unbounded) vocabulary.

Nearly all modern language models—including every model in this book—use a middle-ground approach called byte pair encoding (BPE). The procedure for training a BPE tokenizer is as follows. The vocabulary is initialized with individual bytes—a level below characters, since a single character may comprise multiple bytes under UTF-8 encoding. Then, using a reference text corpus, the most frequently occurring adjacent pair of tokens is iteratively merged into a single new token until the vocabulary reaches a target size. We save both the vocabulary and the ordered sequence of merge rules. This core loop is simple enough to describe in a few lines of pseudocode:

vocabulary = {all single bytes}
merge_rules = []
while len(vocabulary) < target_size:
    (a, b) = most_frequent_adjacent_pair(corpus)
    vocabulary.add(a + b)
    merge_rules.append((a, b))
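
The loop above can be made concrete in a short Python sketch. This is a toy illustration of the training procedure only—it skips the normalization and regex chunking that production tokenizers apply, and the name train_bpe is our own:

```python
from collections import Counter

def train_bpe(corpus: bytes, target_size: int):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent
    pair of tokens until the vocabulary reaches target_size."""
    tokens = [bytes([b]) for b in corpus]      # start from single bytes
    vocabulary = set(tokens)
    merge_rules = []
    while len(vocabulary) < target_size:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]    # most frequent adjacent pair
        vocabulary.add(a + b)
        merge_rules.append((a, b))
        merged, i = [], 0                      # replay the new merge over the corpus
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return vocabulary, merge_rules
```

On the corpus b"low low lower lowest" with a target size of 10, the first two merges are (l, o) and then (lo, w), so low survives as a single vocabulary entry.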

Given a trained tokenizer, the encode function is implemented as the following sequence:

  1. the input text is first normalized to a canonical unicode form
  2. to preserve semantic boundaries, it is then split into chunks by a regex
  3. the merge rules are applied in order within each chunk
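
Under the same simplifications as before (no normalization or regex chunking), step 3 can be sketched as a function that replays the merge rules in training order. This toy stand-in for tokenizer.encode returns vocabulary entries rather than integer indices, to keep the sketch short:

```python
def encode(text: str, merge_rules: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Toy encoder: split text into bytes, then apply each merge rule
    in the order it was learned during training."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for a, b in merge_rules:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens
```

For example, with the merge rules (l, o) and (lo, w), the text "lowlow" encodes to the two tokens low low.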

Different model families train their tokenizers on different reference corpora and to different vocabulary sizes, but the underlying algorithm is the same. As mentioned in the introduction, we will follow Qwen3 as a running example of a modern language model. Qwen3’s tokenizer—whose vocabulary has 151,665 tokens—produces the following example tokenizations:

| Input | Tokens |
|-------|--------|
| "Hello, world!" | Hello , ·world ! |
| "The quick brown fox" | The ·quick ·brown ·fox |
| "tokenization" | token ization |
| "transformer" | transform er |
| "x = 3.14" | x ·= · 3 . 1 4 |

Note the · marks—these indicate leading spaces. Since BPE operates on bytes, spaces are included just like any other byte. Common words survive as single tokens; rarer words are split into recognizable subwords; numbers and punctuation are handled byte by byte.

We can then define a decoding function

$$\mathrm{decode} : \mathsf{Tokens} \to \mathsf{Text}$$

by simply reversing the above process:

  1. each index is mapped back to its vocabulary entry
  2. the entries are concatenated
  3. the resulting bytes are decoded as text
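
These three steps fit in a minimal sketch, with a plain list standing in for the vocabulary:

```python
def decode(ids: list[int], vocabulary: list[bytes]) -> str:
    """Toy stand-in for tokenizer.decode."""
    entries = [vocabulary[i] for i in ids]  # 1. index -> vocabulary entry
    data = b"".join(entries)                # 2. concatenate the entries
    return data.decode("utf-8")             # 3. decode the bytes as text
```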

The round-trip recovers the original text up to unicode normalization:

$$\mathrm{decode}(\mathrm{encode}(t)) = \mathrm{normalize}(t) \quad \text{for all } t \in \mathsf{Text}$$

In current practice, all of the rules for both encoding and decoding are packaged in a single JSON file—tokenizer.json—distributed alongside each model. The reader is encouraged to open models/qwen3/tokenizer.json and inspect its top-level keys directly:

  • version—the format version (1.0)
  • truncation, padding—rules for trimming or padding token sequences to a fixed length; unused by Qwen3 (both null)
  • normalizer—a string code naming the unicode normalization form (NFC)
  • pre_tokenizer—the regex pattern that splits text into chunks before merging
  • post_processor—rules for inserting special tokens after encoding
  • decoder—a string code specifying how vocabulary entries are mapped back to raw bytes (ByteLevel)
  • added_tokens—special tokens beyond the BPE vocabulary; includes the end-of-sequence token <|endoftext|> and template delimiters discussed in the Templates section
  • model—the core: the BPE vocab and merges trained above

Instead of implementing encoding and decoding by hand, we use the industry-standard tokenizers library from Hugging Face, which provides Python bindings to a compiled Rust implementation that reads a tokenizer.json file and exposes encode and decode functions.

We have now described the first and last lines of the generate function:

"""generate = encode ; complete ; decode."""
tokens = tokenizer.encode(context).ids        # encode  : str -> list[int]
completion = complete(sampler, model, tokens) # complete: list[int] -> list[int]
return tokenizer.decode(completion)           # decode  : list[int] -> str

This means that the problem of defining a function from text to text has been reduced to defining a so-called completion function: one that takes a list of tokens and produces a list of tokens:

$$\mathrm{complete} : \mathsf{Tokens} \to \mathsf{Tokens}$$

In mathematical notation, we can depict this composition in the following commutative diagram:

      Text ----- generate ----> Text
       |                          ^
     encode                    decode
       v                          |
     Tokens ---- complete ----> Tokens

This diagram visually expresses that computing generate by going straight across from Text to Text is equal to going down, across, and then back up: first encode, then complete, and finally decode. We now turn to defining the completion function.

Completion

While there are other sorts of language models—e.g. diffusion language models—the vast majority of those in circulation at the time of writing (2026) are auto-regressive, meaning that they complete the response sequence one token at a time. Thus the complete function might be thought of as iterated application of a more elemental next_token function, which computes the next token:

$$\mathrm{next\_token} : \mathsf{Tokens} \to \mathsf{Token}$$

We can then define the completion of a token sequence as the successive application of next_token, performed while some “keep generating” condition holds:

$$\mathrm{response}_{n+1} = \mathrm{response}_n \mathbin{\|} \mathrm{next\_token}(\mathrm{context} \mathbin{\|} \mathrm{response}_n)$$

In pseudocode one may write this as the loop:

response = []
while keep_generating:
    token = next_token(context+response)
    response.append(token)
return response

The above however is only partially correct. You may have heard that language models are, in contrast to classic computation, “probabilistic” rather than “deterministic”. Technically this is not exactly correct: there is a clean separation of the deterministic component—which, somewhat ironically, is the part we actually refer to as the model—and the probabilistic component, which samples from the model output. This leads us to the actual implementation of the complete function:

def complete(
    sampler: Sampler,
    model: Model,
    context: list[int]
) -> list[int]:
    """Autoregressive completion"""
    response = []
    while sampler.keep_generating(response):
        logits = model(context + response)
        token = sampler.sample(logits)
        response.append(token)
    return response

The function body is almost identical to the above loop code, except for the presence of sampler, which packages the keep_generating condition above along with a function to actually sample from the model. The function model—which, to reiterate, is the language model itself—takes as input the list of tokens in the context, and outputs a numerical score—called a logit—for each token in the vocabulary. For the time being, we can represent these output logits as a real-valued array $\ell$ of size $V$, where the value of the array at index $i$ is the logit associated to the $i$-th token.

This function is parameterized by a large collection of real numbers called weights. A model’s advertised size—0.6B, 30B, 235B parameters—refers to the count of these values. The weights are not programmed—they are discovered, by a procedure called training that searches for the values that minimize prediction error on a vast corpus of text. As Karpathy puts it in Software 2.0: “No human is involved in writing this code.” Once training is complete, the weights are frozen, and the model becomes a fixed, deterministic mathematical function.

Softmax

We now turn to the question of how we sample from the logits array $\ell$. The first matter is to turn the logits into an actual probability distribution. Recall that a probability distribution assigns a non-negative number $p_i$ to each outcome $i$, with the requirement that these numbers sum to $1$:

$$p_i \geq 0, \qquad \sum_{i=1}^{V} p_i = 1$$

The relevant distribution for us is the probability $p_i$ of the next token being token $i$. The logits are not at all guaranteed to sum to $1$, so we cannot just set $p_i = \ell_i$. Naively, we could enforce this by simply dividing each logit by the total sum of all of the logits:

$$p_i = \frac{\ell_i}{\sum_{j=1}^{V} \ell_j}$$

This, however, only produces a valid probability distribution when all logits are non-negative, which is also not guaranteed. To fix this, we can generalize the above by first applying a function $f$ to each logit and then normalizing by the sum:

$$p_i = \frac{f(\ell_i)}{\sum_{j=1}^{V} f(\ell_j)}$$

The question now is how to choose the function $f$. At the very least, we want any choice to satisfy the following two properties:

  • non-negativity: $f(x) \geq 0$ for all $x$
  • monotonicity: if $x \leq y$ then $f(x) \leq f(y)$

The first property will guarantee that the output gives us a valid probability distribution. The second property respects the meaning of logits: a token with a higher score should be sampled with a higher probability. A natural choice satisfying both criteria is the exponential function, parameterized by a scalar $T > 0$:

$$f(x) = e^{x/T}$$

Taken together, we have just defined the softmax function, which takes an array of real numbers and outputs a probability distribution on them:

$$\mathrm{softmax}_T(\ell)_i = \frac{e^{\ell_i / T}}{\sum_{j=1}^{V} e^{\ell_j / T}}$$

We call the parameter $T$ the temperature, and use it to modulate the entropy—that is, the level of randomness—of the resulting distribution. In the limit $T \to 0$, this yields a deterministic output where we simply select the token with the highest logit. In the limit $T \to \infty$, this yields the uniform distribution across tokens. In practice, temperatures typically range from $0$ to $2$, with $1$ as a common default.
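
As a sketch, temperature-scaled softmax is a few lines of Python. Subtracting the maximum logit before exponentiating is the standard trick to avoid overflow, and does not change the result:

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax over a logits array."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, softmax([1.0, 2.0, 3.0]) puts roughly two thirds of the probability mass on the last token, while softmax([1.0, 2.0, 3.0], temperature=0.1) concentrates nearly all of it there.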

Sampling Strategies

Beyond temperature, two common filtering strategies restrict the candidate set before sampling. Top-$k$ keeps only the $k$ highest-probability tokens, zeroing out everything else. Top-$p$, or nucleus sampling, keeps the smallest set of tokens whose cumulative probability exceeds a threshold $p$. Both cut the long tail of low-probability tokens and are typically used independently.
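
Both filters can be sketched as functions on the probability array. These are illustrative implementations, not the ones in inference.py (for instance, this top-k version keeps every token tied at the k-th probability):

```python
def top_k_filter(probs: list[float], k: int) -> list[float]:
    """Keep the k most probable tokens, zero the rest, renormalize."""
    threshold = sorted(probs, reverse=True)[k - 1]
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs: list[float], p: float) -> list[float]:
    """Nucleus sampling: keep the smallest set of tokens whose cumulative
    probability reaches p, zero the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [0.0] * len(probs), 0.0
    for i in order:                 # walk tokens from most to least probable
        kept[i] = probs[i]
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(kept)
    return [q / total for q in kept]
```

On the distribution [0.5, 0.3, 0.2], both top_k_filter with k = 2 and top_p_filter with p = 0.7 drop the last token and renormalize the survivors to [0.625, 0.375, 0.0].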

The “keep generating condition” mentioned in complete is simple: generation stops when either the end-of-sequence token—<|endoftext|> in Qwen3’s tokenizer.json—is produced or a maximum token count is reached. These are the two conditions checked by sampler.keep_generating(response).
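
A minimal sketch of such a stopping condition follows. The end-of-sequence id below is a placeholder for illustration; the real value is listed under added_tokens in tokenizer.json:

```python
EOS_TOKEN = 151643  # placeholder id for <|endoftext|>; the real id comes from tokenizer.json

def keep_generating(response: list[int], max_tokens: int = 256) -> bool:
    """Continue until the end-of-sequence token appears or the length cap is hit."""
    if response and response[-1] == EOS_TOKEN:
        return False
    return len(response) < max_tokens
```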

The full implementation of Sampler in inference.py is straightforward Python packaging the above—temperature scaling, top-/top- filtering, and the stopping condition. The reader can consult it directly for the details.

Templates

At the beginning of the chapter, we made the simplifying assumption to treat the context as raw text. In practice, the context passed to inference is not a bare string but a structured object. A template converts this structured context into the flat text that the model actually sees, and the inverse operation—parsing—processes the model’s raw text output back into structured form. How much structure parsing extracts depends on the scaffolding—it may extract and route data to other parts of the system; our inference.py simply appends the raw response as an assistant message. The template is model-specific: it is distributed alongside the weights and tokenizer. Crucially, the model is trained on data formatted with its template, so the template is not an arbitrary formatting choice—it is baked into the model’s behavior and cannot be swapped freely.

This gives us one more layer of the same pattern. If we let $\mathsf{Context}$ denote the type of structured context objects and define an infer function that operates on them, then we have the following commutative diagram:

     Context ----- infer ----> Context
        |                         ^
     template                  parse
        v                         |
       Text ---- generate ----> Text

In code, this is the same nesting pattern as generate wrapping complete:

def infer(
    template: Template,
    tokenizer: Tokenizer,
    sampler: Sampler,
    model: Model,
    context: Context,
) -> Context:
    """infer = template ; generate ; parse."""
    text = template.apply(context)                       # template : Context -> str
    response = generate(tokenizer, sampler, model, text) # generate : str -> str 
    return template.parse(context, response)             # parse    : str -> Context

Messages

The simplest and most universal structured context is a sequence of messages alternating between roles. Every model family supports at least three roles: system for persistent instructions, appearing once at the start; user for the human’s input; and assistant for the model’s output. A simple conversation in this format:

[
  {"role": "system",    "content": "You are legendary mathematician Alexander Grothendieck."},
  {"role": "user",      "content": "What is your favorite prime?"},
  {"role": "assistant", "content": "57"}
]

Qwen3’s template converts this to the following flat text, using special tokens to delimit each message:

<|im_start|>system
You are legendary mathematician Alexander Grothendieck.<|im_end|>
<|im_start|>user
What is your favorite prime?<|im_end|>
<|im_start|>assistant
57<|im_end|>

The entire conversation—system prompt, user question, and assistant reply—is one flat token sequence. The special tokens <|im_start|> and <|im_end|>—listed in the added_tokens field of tokenizer.json—are the only structure the model sees; everything between them is ordinary text. This particular delimiter convention is called ChatML, and is the template used by Qwen3. Other model families use different delimiters but the same principle: role-tagged messages flattened into a single token sequence.
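
The ChatML rendering itself is simple enough to sketch directly. Production templates layer generation prompts, thinking blocks, and tool definitions on top of this core:

```python
def apply_chatml(messages: list[dict[str, str]]) -> str:
    """Flatten role-tagged messages into ChatML-delimited text."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    )
```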

Thinking

Recent models support extended thinking—generating a chain of reasoning tokens, as in Chain-of-Thought Prompting, before producing a visible answer. In Qwen3, the model demarcates thinking with <think></think> markers. There is no separate mechanism: thinking and visible tokens are generated in one continuous stream. The template extracts the thinking content when re-rendering the conversation for the next turn.

Tools

Language models can invoke external functions via tool use, also named function calling. The context carries a list of tool definitions—passed to the template, which injects them into the flat text—and when the model decides to call one, it generates a tool call as ordinary output tokens. The application-level scaffolding—code beyond what inference.py implements—evaluates the call and inserts the result back into the conversation. For example, given a tool that performs arithmetic:

{
  "name": "calculate",
  "parameters": {
    "operation": {"type": "str", "enum": ["add", "multiply", "subtract", "divide"]},
    "left":      {"type": "number"},
    "right":     {"type": "number"}
  }
}

the model might respond to “What is 19 × 3?” by generating the token sequence calculate(operation="multiply", left=19, right=3). The scaffolding evaluates it, returns 57 as a tool result message, and the model continues generating with that result in context. As with thinking, tool calls and results are demarcated by text markers—the model produces them as ordinary tokens, and the template layer interprets the structure.

Messages, thinking, and tools are the three content types common to virtually every language model provider today. Model creators can extend the template with additional structured types as the landscape evolves.

The Matryoshka

Let’s now step back and see the full structure at once. Each layer of the system is the layer below it, wrapped with one additional concern. Templates handle structured context. Tokenization handles text representation. Completion handles the autoregressive loop and sampling. What remains—the model itself—is, as mentioned above, the subject of the third chapter.

At the code level, this nesting is visible in the function signatures themselves. Each layer is the one below with one more configuration argument:

infer(template,tokenizer,sampler,model,context)
generate(      tokenizer,sampler,model,context)
complete(                sampler,model,context)
                                 model(context)

As a mathematical diagram, the full decomposition stacks the two commutative squares we built earlier:

     Context ----- infer ----> Context
        |                         ^
     template                  parse
        v                         |
       Text ---- generate ----> Text
        |                         ^
      encode                   decode
        v                         |
      Tokens --- complete ----> Tokens

We can now write the entire computation from structured context to structured response, autoregressive loop and all. Writing $t \sim p$ for a random draw from a distribution $p$:

$$\mathrm{infer}(c) = \mathrm{parse}\Big(c,\ \mathrm{decode}\big(\mathrm{complete}(\mathrm{encode}(\mathrm{template}(c)))\big)\Big), \qquad t_{n+1} \sim \mathrm{softmax}_T\big(\mathrm{model}(t_1, \ldots, t_n)\big)$$

Now that we have unravelled all of the scaffolding, what remains is the model itself. Before we can open it up, we need the mathematical language in which it is written: tensors and their operations.