The Anatomy of a Prompt: A Journey from Python to Silicon - Part 2 of 5
The Attention Tax: O(n²) and Why Context Windows Cost So Much
Chapter 2: Understanding the Computation - What the Model Actually Does
We have now seen how simple text in a prompt box becomes massive voltage management on the GPU. So what exactly does the model actually do?
The Attention Question
The brutal truth about language models is that in order to predict word N+1, the model must consider every word from 1 to N. Not just the last word. Not just the last sentence. Everything.
This is the Attention Mechanism - both the boon and the bane of the Transformer architecture.
Take the task of completing a sentence: “The cat sat on the __”.
To predict “mat”, the model does not just look at “the”. It asks:
What was the subject? (“cat” — probably something a cat sits on)
What was the action? (“sat” — we need a surface)
What’s the grammar? (“the ___” — singular noun coming)
Back in the day, when I had just started my Software Engineering journey, I worked on a research team at SAP. This was the pre-deep-learning era, and we were doing a fair bit of NLP work. One of the things we were trying to solve was anaphora resolution - figuring out that “it” in “The cat sat on the mat. It was soft.” refers to the mat, not the cat - so we could understand references to products in long-form text. When I look at the evolution of transformers and the attention mechanism, I cannot help but smile at the ease with which attention models now solve the anaphora resolution problem.
Query, Key, Value
The model answers these questions by computing three vectors for every token:
| Vector | Role | Analogy |
| --- | --- | --- |
| Query (Q) | "What am I looking for?" | A search query |
| Key (K) | "What do I contain?" | A file's label |
| Value (V) | "What information do I carry?" | The file's contents |
Each token gets transformed into these three representations through learned weight matrices.
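To make this concrete, here is a minimal sketch in PyTorch. The sizes are toy values and the weight matrices are random stand-ins; in a real model they are learned during training:

```python
import torch

d_model = 64                          # embedding width (toy value)
n_tokens = 5                          # "The cat sat on the"

x = torch.randn(n_tokens, d_model)    # token embeddings
W_q = torch.randn(d_model, d_model)   # learned during training; random here
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q = x @ W_q   # "What am I looking for?"
K = x @ W_k   # "What do I contain?"
V = x @ W_v   # "What information do I carry?"
```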
The Attention Matrix
Now that we have seen the ingredients of attention, we arrive at the place where the cost actually explodes.
To compute attention, the model calculates Q × K^T - a matrix multiplication between every Query and every Key. The result is a matrix of “attention scores”: how much each token should pay attention to every other token.
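In code, continuing the toy sketch above (the 1/√d scaling and softmax come from the standard Transformer formulation):

```python
import torch

d_model, n_tokens = 64, 5
Q = torch.randn(n_tokens, d_model)
K = torch.randn(n_tokens, d_model)
V = torch.randn(n_tokens, d_model)

scores = Q @ K.T / d_model ** 0.5        # (5, 5): every token against every token
weights = torch.softmax(scores, dim=-1)  # each row: "how much do I attend to you?"
output = weights @ V                     # blend the Values by those weights
print(scores.shape)                      # torch.Size([5, 5]) -- n x n
```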
The Quadratic Trap
To see how that matrix multiplication balloons, let's imagine our context has a mere 1,000 tokens: the score matrix is then 1,000 x 1,000. If our context has 100,000 tokens (roughly a couple hundred pages of text), it is a 100,000 x 100,000 matrix - 10 billion elements.
And the model has 80+ layers, each with 64+ attention heads.
| Context Length | Attention Matrix Size | Operations |
| --- | --- | --- |
| 1,000 tokens | 1M elements | ~1 million |
| 10,000 tokens | 100M elements | ~100 million |
| 100,000 tokens | 10B elements | ~10 billion |

This is O(n²) scaling. Double your context length, quadruple your compute.
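A quick back-of-the-envelope makes the trap tangible. Assuming the score matrix is naively materialized in fp16 (2 bytes per element), per attention head, per layer:

```python
# Memory needed to hold one n x n attention score matrix in fp16
for n in (1_000, 10_000, 100_000):
    elements = n * n
    gib = elements * 2 / 1024 ** 3   # 2 bytes per fp16 element
    print(f"{n:>7,} tokens -> {elements:>18,} elements -> {gib:8.2f} GiB")
```

At 100,000 tokens that is roughly 18.6 GiB for a single head in a single layer, which is why real implementations work hard to avoid ever materializing the full matrix.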
This is why longer conversations cost more. This is why “context window” is a headline feature. And this is why my 20GB GPU can’t run a 70B model with a long conversation — the attention computation alone can exceed available memory.
There are other challenges with bigger context windows than just size, but let's leave those for a future conversation.
The Forward Pass
One prediction, one token, requires a full forward pass through the entire network.
Embedding Layer: Convert tokens to vectors
Layer 1: Attention → Feed-Forward → Normalize
Layer 2: Attention → Feed-Forward → Normalize
...repeat 80 times...
Output Layer: Convert final vector to token probabilities
The critical constraint: Layer 2 cannot start until Layer 1 finishes. You can parallelize within layers, but not across them.
This is why generating text is slow (speed is relative). Each new token requires a full pass through all 80 layers, and each layer must wait for the previous one.
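A toy loop makes the constraint visible. The layer below is a stand-in, not a real Transformer block, but the data dependency is the point:

```python
import torch

def layer(x):
    # stand-in for Attention -> Feed-Forward -> Normalize
    return torch.tanh(x @ torch.randn(64, 64))

x = torch.randn(5, 64)      # embedded tokens
for i in range(80):         # 80 layers, strictly one after another
    x = layer(x)            # layer i+1 cannot start until layer i returns
# the final vector would then be projected to token probabilities
```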
Chapter 3: The Journey - Moving the Data
Up to this point, we have seen how text gets converted into voltage, and what the model actually does when it's given the text we typed in as a prompt.
GPU is an island
Before any of the math can happen on the GPU, the data must travel.
The GPU is an island. It has its own memory (VRAM), separate from your computer’s main RAM. Data must cross a physical bridge called the PCIe Bus.
The Loading Dock
When Python code runs .to('cuda'), the CPU packs binary data into packets and drives them across this bridge. But the bridge is narrow, orders of magnitude slower than the GPU’s internal speed.
This is why we see CPU usage spike during model loading. The CPU is the “manager” frantically packing boxes to keep the “factory” fed.
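You can watch the trip yourself. A minimal sketch, assuming PyTorch and a CUDA-capable GPU:

```python
import time
import torch

x = torch.randn(4096, 4096)        # ~64 MB of fp32 data sitting in CPU RAM

t0 = time.perf_counter()
x_gpu = x.to('cuda')               # the CPU ships the bytes across the PCIe bus
torch.cuda.synchronize()           # wait until the transfer actually completes
print(f"transfer took {time.perf_counter() - t0:.4f} s")
```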
The Driver as the Translator
Once the data arrives on the GPU, who tells the GPU hardware what to do? The GPU does not speak Python. The GPU speaks electricity.
This is the main role of the driver. The NVIDIA drivers we have all installed hundreds of times on Linux machines to run model training and inference - yes, those very drivers - translate high-level commands into instructions the GPU understands.
Kernels and the Choreography
The driver is not just saying “here, do the math.” It activates specific, pre-written programs called kernels.
A kernel is code (typically C++/CUDA) that describes exactly how thousands of threads should coordinate. It is a choreography script, at absolutely unbelievable scale:
“Thread 1, grab data from address A.”
“Thread 2, grab data from address B.”
“Everyone wait here.”
“Now multiply.”
“Thread 1, write your result to address C.”
The quality of these kernels determines whether your GPU sits idle or runs at full capacity.
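From Python's side, this choreography is invisible: each operation merely queues a kernel launch and returns immediately, while the driver feeds the GPU in the background. A minimal sketch, assuming a CUDA device:

```python
import torch

a = torch.randn(8192, 8192, device='cuda')
b = torch.randn(8192, 8192, device='cuda')

c = a @ b                   # enqueues a matmul kernel (cuBLAS under the hood)
                            # and returns before the GPU has finished
torch.cuda.synchronize()    # block until every queued kernel has completed
```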
Why My AMD GPU Struggles
Once I got to this point of understanding what is actually happening, I thought: okay, all this makes sense, but it still does not answer why my AMD GPU struggles. This is where the mystery started to unravel.
When I run a game, be it Cyberpunk or Frostpunk 2, it uses standardized APIs such as DirectX, Vulkan, or OpenGL. Game engines have been abstracting away GPU differences for decades. When a game developer writes rendering code, it works on both NVIDIA and AMD because both vendors support the same standards.
Machine learning has no such luxury, yet. This is why we see companies building trillion-dollar moats, and why it takes other companies a lot of time, money, and talent to potentially breach them. That kernel program is the key unlock.
NVIDIA created CUDA, a proprietary programming model. For 15 years, they have optimized their kernel libraries (cuBLAS, cuDNN) to perfectly match the physical characteristics of their chips.
AMD has ROCm, their equivalent. It works; however, it is years behind in optimization. They are working very hard on it, but many kernels have not been ported, and some operations fall back to slow, generic implementations. Remember, speed is relative here: when we say slow, it is still mind-bogglingly fast, just not fast enough for machine learning.
When I run Cyberpunk, my GPU executes standardized rendering kernels that AMD has optimized for years.
When I run Llama, the model calls CUDA kernels that literally do not exist on my hardware. There is fast-paced work going on in this space, though, and PyTorch's AMD integration via ROCm is getting better.
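If you want to check which kernel ecosystem your own PyTorch build targets, here is a quick probe (ROCm builds report a HIP version, CUDA builds a CUDA version):

```python
import torch

print("GPU available:", torch.cuda.is_available())          # True on ROCm builds too
print("CUDA version:", torch.version.cuda)                   # None on a ROCm build
print("HIP version:", getattr(torch.version, 'hip', None))   # None on a CUDA build
```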
Next week: Inside the GPU Factory
So now I understood why my AMD GPU struggles — CUDA kernels that literally don’t exist on my hardware. 15 years of NVIDIA optimization that AMD is still catching up to.
But I wasn’t satisfied. If kernels are the choreography, I wanted to see the dancers.
What’s actually happening inside those 16,000 CUDA cores? Why can’t the CPU just do this work? And why, despite all this parallel power, do GPUs spend 80% of their time... waiting?
The answer involves something called the “memory wall.” And it starts with leaky buckets of electrons.
Part 3 drops next week. Subscribe so you don’t miss it.