
Structure and Execution of Language Models

This textbook accounts for every mathematical operation executed by a language model, expressed in both mathematical notation and code. Beyond the text itself, this project has two software components:

  1. Catform, a domain-specific language for expressing tensor computations
  2. Pianola, an engine that executes catform programs

AI programs today broadly have two layers—an outer scaffolding layer, and an inner layer, which is the language model itself. The book’s first chapter, Inference, covers in full detail the core logic of scaffolding: converting context into tokens (units of language) and then autoregressively sampling the language model to generate a response, one token at a time. The chapter also references Pianola’s implementation in src/pianola/inference.py line by line.
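The shape of that outer loop can be sketched in a few lines of Python. This is a sketch only: the helper names and the toy "model" below are hypothetical stand-ins, not Pianola's actual API.

```python
# Sketch of the autoregressive outer loop (hypothetical helper names,
# not Pianola's actual API).
def generate(model, sample, tokens: list[int], max_tokens: int) -> list[int]:
    """Repeatedly run the model on the whole prefix and append one sampled token."""
    tokens = list(tokens)
    for _ in range(max_tokens):
        logits = model(tokens)         # the model is a pure function of the prefix
        tokens.append(sample(logits))  # sampling picks the next token
    return tokens

# Toy stand-ins: a "model" whose logits always favor (last token + 1) mod 10,
# and greedy sampling (argmax over the logits).
toy_model = lambda ts: [1.0 if i == (ts[-1] + 1) % 10 else 0.0 for i in range(10)]
greedy = lambda logits: max(range(len(logits)), key=logits.__getitem__)

print(generate(toy_model, greedy, [3], 4))  # → [3, 4, 5, 6, 7]
```

Everything model-specific lives behind `model`; the loop itself never inspects the tensor computation it is driving.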

The language model itself is a purely deterministic mathematical function—it can be described entirely as a sequence of tensor operations. In this text, we write language models in catform—short for categorical form, inspired by category theory—a notation designed from first principles to mirror the algebra of these tensor operations. Writing the model in a custom language, rather than in a Python tensor framework, isolates its mathematical description as a standalone artifact: a single .cat file.

Catform is a minimal language: it uses just six primitive tensor operations to express the full computation of any modern transformer-based language model—and the resulting specification is no more verbose than the Python it replaces. The book’s second chapter, Tensors, is an interlude on the mathematics of tensor computations, expressed as executable catform. This set of operations is also closed under differentiation: the derivative of any catform program is itself a catform program. The upcoming Gradients chapter develops this property by implementing the transformation.

The third chapter Models walks through the complete mathematical description of Qwen3, a popular and representative open-weight language model. The description is expressed as a single .cat file, which Pianola can execute with either PyTorch or JAX without any modification to the model code.

Modern language models share the same core transformer architecture, varying in only a handful of components. The upcoming Architecture chapter will show how variants—mixture of experts, compressed latent attention, linear attention—differ from the baseline by substituting a small number of functions in the .cat file.

A model is determined by both its architecture—its sequence of operations—and by its weights, numerical parameters (counting in the billions, and more recently trillions) that parameterize these operations. These weights are not programmed, but rather discovered by searching for the model that minimizes a loss function—a measure of its prediction error on a giant corpus of data. An optimizer performs the search, stepping through the space in the direction of the loss gradient—computed in the aforementioned Gradients chapter. The upcoming Training chapter walks through this process.
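The search can be seen in miniature with a one-parameter toy. The quadratic loss below is invented for illustration; real training differentiates a tensor program over billions of weights, but the stepping rule is the same.

```python
# Toy gradient descent: minimize the loss (w - 3)^2, whose gradient is 2*(w - 3).
def loss(w: float) -> float:
    return (w - 3.0) ** 2

def grad(w: float) -> float:
    return 2.0 * (w - 3.0)

w = 0.0                      # initial weight, chosen arbitrarily
for _ in range(100):
    w -= 0.1 * grad(w)       # step against the gradient to reduce the loss

print(round(w, 3))           # converges toward the minimizer w = 3
```

The optimizer never inspects the loss directly; it only follows the gradient, which is why the Gradients chapter's transformation is a prerequisite for training.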

Philosophy

Our title is a direct homage to the seminal programming textbook Structure and Interpretation of Computer Programs. The debt runs deeper than the name:

The general technique of isolating the parts of a program that deal with how data objects are represented from the parts of a program that deal with how data objects are used is a powerful design methodology called data abstraction.

— SICP, Section 2.1

The relationship between catform and Pianola is an instance of what SICP calls data abstraction. The .cat file is a declarative artifact: a piece of data that describes what to compute, without prescribing how. The execution layer—which can lower to PyTorch, JAX, or in principle to any target hardware backend—consumes that data and runs it. Portability across frameworks is not an independent feature but a consequence of the abstraction: neither layer knows the other’s details. This is precisely the separation named in the title: the structure is the catform specification, and its execution is Pianola playing it like a piano roll.
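The pattern can be sketched in a few lines: the "program" below is plain data describing what to compute, and the executor supplies the meaning of each operation. (Illustrative only; this is neither catform's grammar nor Pianola's API.)

```python
# A program as data: each line names an operation, its input, and its output.
program = [("square", "x", "y"), ("double", "y", "z")]

def run(program, env, ops):
    """Execute the program against a table of operation implementations."""
    env = dict(env)
    for op, src, dst in program:
        env[dst] = ops[op](env[src])   # the executor decides how each op runs
    return env

# One possible backend; a different ops table could target different machinery.
math_ops = {"square": lambda v: v * v, "double": lambda v: 2 * v}
print(run(program, {"x": 4}, math_ops)["z"])  # → 32
```

Swapping `math_ops` for another table reruns the same data against a different backend, which is the toy analogue of executing one .cat file with either PyTorch or JAX.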

Organization

Both the book and the code are focused entirely on correctness and clarity, without attention to performance. Future chapters may focus on making all of these systems fast.

All six chapters were described above. The first three are available now.

  1. Inference covers the outer loop for running inference on a model
  2. Tensors is a mathematical interlude on tensor computations
  3. Models covers the complete mathematical description of a modern language model (Qwen3)

The subsequent three will be released soon.

  1. Architecture will survey architectural variants—mixture of experts (Qwen3 MoE), compressed latent attention (DeepSeek V3), and linear attention—showing how each modifies the baseline transformer
  2. Gradients will be a mathematical interlude on computing the derivative of a tensor computation
  3. Training will cover the components of training—the training loop, loss functions, and optimizers—in both supervised and reinforcement learning settings

The dependency graph of the chapters is as follows.

Audience

This book is primarily for two kinds of reader. The first is the programmer—especially one working in or around AI—who wants a complete understanding of the mathematical structure of language models. The second is the student of mathematics, who may be distant from programming, but desires an entry point into AI that speaks in their language and meets their standard of precision.

The primary programming prerequisite for this book is a basic understanding of Python: specifically functions, types (including enums, dataclasses, and containers like tuples and dicts), and basic control flow. Python is universally used in AI engineering, and we follow suit. The primary mathematical prerequisite is a basic understanding of linear algebra—not a full course, just familiarity with vectors, linear maps, matrices, and the notion of dimension. Later on, in the chapter on gradients, the reader will want familiarity with differential calculus; single-variable suffices.

Both traditions share the same elemental concepts: things (terms or elements), collections of things (types or sets), and transformations between them (functions). Since this spine is shared, we need only establish conventions: mathematicians speak of sets and their elements; programmers speak of types and their terms. We use the latter throughout, noting that for our purposes the theoretical distinctions between the two foundations are not relevant. Throughout this text, most concepts are stated twice: once as a mathematical expression and once as a snippet of working code. This is not redundancy—it is the point. The two notations say the same thing to different readers—including the one we call a computer—and seeing them side by side reveals that the distance between blackboard and terminal is shorter than it appears.

Notation

In mathematical notation, we declare that a term $x$ is of type $X$ by writing:

$$x : X$$

In Python and catform alike, a type annotation looks like:

x: X

Given types $X$ and $Y$, a function $f : X \to Y$ takes an input $x : X$ and returns an output $y : Y$. In mathematical notation:

$$f : X \to Y$$

In Python:

def f(x: X) -> Y: ...

In catform:

f(x: X) -> (y: Y) {...}

Given two functions $f : X \to Y$ and $g : Y \to Z$, their composition is the function $X \to Z$ that applies $f$ first and then $g$. In mathematical notation, we represent this either as an arrow diagram:

$$X \xrightarrow{\;f\;} Y \xrightarrow{\;g\;} Z$$

or as a binary operator. The classical notation writes composition as $g \circ f$—read aloud as “$g$ of $f$”, reading right to left. We often prefer to match the diagrammatic order and use forward composition, written $f ; g$ and read aloud as “$f$ then $g$”.

In both Python and catform, our convention for readability is to express composition line by line:

y = f(x)
z = g(y)

In catform, this line-by-line form is the only way to express it: each line is exactly one assignment of a value to the output of an operation.

y: Y = f(x)
z: Z = g(y)
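In Python, forward composition can also be packaged as a higher-order helper; this is a sketch of one possible rendering, while the book's code keeps the line-by-line convention above.

```python
# "f then g" as a higher-order function: apply f first, then g.
def then(f, g):
    return lambda x: g(f(x))

f = lambda x: x + 1
g = lambda y: y * 2
h = then(f, g)    # the composite X -> Z

print(h(3))       # g(f(3)) = (3 + 1) * 2 → 8
```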

Setup

The book is more useful—and more fun—when you can see the math actually running on your computer. Setup instructions are in the README.