The Anatomy of a Prompt: A Journey from Python to Silicon - Part 5 of 5 - Tensor Parallelism, NVLink, and Why Output Tokens Cost 6x More
Chapter 7: Scale, the Context Crisis, and the KV Cache
We have seen a lot so far, and when you think about handling a few prompts, none of this looks too bad. But if you are Anthropic, Google, OpenAI, or xAI, you are probably looking at billions of tokens a minute.
I started wondering: how do they manage such massive loads? And that is just inference; I am, at the moment, scared to go down the path of understanding distributed training. Maybe that should be a separate series in itself.
They cannot dedicate GPUs or TPUs to one user, or even a group of users, all the time.
They cannot keep a user's entire token history and context in one place, especially for long conversations.
When the model generates the word “cat,” it computes Key and Value vectors for that token. If it throws them away, it must recompute them for every subsequent token — rereading the entire conversation each time.
That seems like a lot of cycles lost, not to mention power wasted. We have used caches in the software stack for a very long time, and it makes absolute sense that KV caches were introduced in this scenario.
KV Cache Growth
For a 70B model:

KV cache per token ≈ 1.6 MB
1,000 tokens = 1.6 GB
10,000 tokens = 16 GB
100,000 tokens = 160 GB (two full H100s for one user)

The model weights alone take 140 GB. Add a long conversation's KV cache, and you've exceeded the memory of most GPU setups.
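To make that growth concrete, here is a minimal sizing sketch in Python. The formula (2 matrices × layers × per-layer KV width × bytes) is the standard rule of thumb; the layer count and KV width below are illustrative assumptions chosen to back out the ~1.6 MB/token figure above, not the specs of any particular checkpoint (grouped-query attention shrinks this considerably).

```python
# Back-of-the-envelope KV cache sizing (a sketch; exact numbers depend
# on the architecture, and grouped-query attention shrinks this a lot).
N_LAYERS = 80      # transformer layers, typical for a 70B-class model
KV_WIDTH = 5120    # per-layer KV width (heads x head_dim) that backs
                   # out the ~1.6 MB/token figure used above (assumption)
BYTES = 2          # FP16

def kv_bytes_per_token(layers=N_LAYERS, kv_width=KV_WIDTH, dtype_bytes=BYTES):
    return 2 * layers * kv_width * dtype_bytes   # 2 = one Key + one Value

per_token = kv_bytes_per_token()
print(f"KV cache per token: {per_token / 1e6:.2f} MB")   # ~1.64 MB
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} tokens -> {n * per_token / 1e9:.1f} GB")
```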
This is why I couldn't run Llama 70B on my 20GB card. The weights alone don't fit, and every token of conversation makes it worse. (I knew Llama 70B wouldn't fit; I was trying my luck with quantization :D)
The Tetris Problem
So, when I saw how the KV cache grows over a long conversation, my first question was: how does the memory allocation happen? Is it one big chunk up front? How does it expand later if we need more, and what happens if we don't use all of it?
Early inference systems reserved contiguous memory blocks for each user “just in case” they had a long conversation.
If User A chats for 5 minutes (large context) and User B chats for 1 minute (small context), you get gaps. Because memory was contiguous, you couldn’t fit new users into the gaps.
Result: “Out of Memory” errors with 30% of VRAM sitting empty.
So obviously this is a challenge, and I wanted to know how it is being solved.
Humans are amazing (most of the time), and it's incredible how they keep coming up with solutions to problems at scale. Reliably, even.
PagedAttention: The Solution
PagedAttention solves this problem by treating memory like a library instead of a warehouse.
I remember reading about something similar, demand paging, in my undergrad operating systems course. This is an extension of that mindset. PagedAttention works as follows:
Break each conversation into small “pages” (blocks of 16 tokens)
Scatter them wherever there’s free space
Keep a “card catalog” (Block Table) mapping logical position to physical address
The GPU kernels are rewritten to follow the map, hopping between addresses to gather scattered pages.
This eliminates fragmentation. Memory utilization jumps from 70% to 95%+.
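To make the "card catalog" idea concrete, here is a toy block table in Python. The block size of 16 matches the description above; the class and method names are my own illustration, not vLLM's actual API, and all the GPU-kernel work is elided.

```python
# A toy block table: logical token positions -> scattered physical pages.
# This sketches only the bookkeeping; real PagedAttention lives in GPU
# kernels, and these names are illustrative, not vLLM's API.
BLOCK_SIZE = 16  # tokens per page, as described above

class BlockTable:
    def __init__(self, free_blocks):
        self.free = list(free_blocks)  # pool of free physical page IDs
        self.table = []                # logical page index -> physical page

    def slot_for(self, logical_pos):
        """Map a token position to (physical page, offset), allocating on demand."""
        page_idx = logical_pos // BLOCK_SIZE
        while page_idx >= len(self.table):
            self.table.append(self.free.pop(0))  # grab any free page, anywhere
        return self.table[page_idx], logical_pos % BLOCK_SIZE

# A conversation's pages land wherever there is free space; no contiguity needed.
conv = BlockTable(free_blocks=[42, 7, 99, 3])
for pos in range(40):            # 40 tokens -> 3 pages of 16
    conv.slot_for(pos)
print(conv.table)                # [42, 7, 99] -- scattered physical pages
```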
When One GPU isn’t Enough
It is very clear that to run at scale, we need more GPUs working together.
The more incoming prompts, the more tokens we need to manage. Despite PagedAttention, we cannot fit them all on one GPU. Multi-GPU is the answer.
It’s not a question of if, but when and how many. The reasons stack up:
Model weights alone exceed one GPU. Llama 70B in FP16 = 140GB. An H100 has 80GB. You literally cannot fit the model on one chip without quantization.
KV cache grows without bound. Even with PagedAttention eliminating fragmentation, the raw bytes still accumulate. At 1.6MB per token, a 100K context conversation consumes 160GB — two full H100s just for one user’s memory.
You’re not the only user. I was just using Claude with a massive context — I’ve been going back and forth about a certain topic for a week now, and that conversation would exceed any KV cache size one GPU could hold. Not to mention there are probably a million more users like me on any given day.
So the question becomes: how do we split the work across multiple GPUs?
How Can We Parallelize Across GPUs?
We are all familiar with distributed computing, where we have found ways to stay resilient while distributing compute across the globe. The challenge here is not just distribution: latency plays a massive role, especially for real-time inference.
I had to find out how one distributes a model across GPUs. What are our options?
There are several ways to split a model across GPUs. Each has different tradeoffs:
Tensor Parallelism (TP): Split each layer horizontally.
GPU 1 handles Attention Heads 1-32
GPU 2 handles Attention Heads 33-64
Every layer, the GPUs must synchronize results
Pipeline Parallelism (PP): Split the model vertically.
GPU 1 handles Layers 1-20
GPU 2 handles Layers 21-40
Data flows through GPUs like an assembly line
Data Parallelism: Each GPU handles different users entirely.
Works for serving many small requests
Doesn’t help when one user’s context exceeds one GPU
For inference with long contexts, Tensor Parallelism is the dominant approach. Here’s why: you need all the KV cache accessible for every attention computation. If you split by layers (Pipeline), the KV cache would need to live on every GPU anyway.
The moment I read about tensor parallelism, I wanted to know how it works in reality. There is still latency, now in more than one place, and I wanted to understand how that is solved, because the Gemini and Opus models handle long contexts so well at such low latency.
Tensor Parallelism: Splitting Attention
Here is a trace of what happens when attention is split across 2 GPUs.
In single-GPU attention:
One GPU has all 64 attention heads
It computes Q, K, V for all heads
It stores the entire KV cache
It computes the full attention output
In 2-way Tensor Parallelism:
GPU 1 has heads 1-32, GPU 2 has heads 33-64
Each GPU computes Q, K, V for its heads only
Each GPU stores KV cache for its heads only
Each GPU computes attention output for its heads
But here’s the catch: after attention, the outputs from all heads must be combined. And after the Feed-Forward layer, results must be synchronized again.
Full_Output = Concat(Partial₁, Partial₂) × W_O

This means the GPUs need to talk to each other at some point.
Based on my intuition, I had to ask: "Would it not make sense to just send both partials to one GPU, concatenate, multiply, done?"
But that creates a bottleneck — one GPU does all the work while the other sits idle.
As stated above, all this scaling work is a bunch of clever methods bundled together.
The Clever Trick — Split W_O So Partials SUM Instead of Concatenate
Here’s the insight that makes Tensor Parallelism efficient.
We split the output projection matrix W_O by rows:
GPU 1 holds W_O_top (rows corresponding to heads 1-32)
GPU 2 holds W_O_bottom (rows corresponding to heads 33-64)
Now the math transforms:
Concat(Partial₁, Partial₂) × W_O
  = Concat(Partial₁, Partial₂) × [ W_O_top    ]
                                 [ W_O_bottom ]
  = (Partial₁ × W_O_top) + (Partial₂ × W_O_bottom)
     \_______________/     \____________________/
      GPU 1 computes          GPU 2 computes
          locally                 locally

Concatenation followed by a matrix multiply equals local multiplies followed by an addition.
Each GPU multiplies its partial result with its portion of W_O. The results just need to be summed, not concatenated. And because both GPUs need the full result for the next layer’s split weights, ALL-REDUCE gives them each the sum simultaneously — no bottleneck.
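The identity is easy to check numerically. A minimal NumPy sketch, with arbitrary illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
seq, half, dim = 4, 64, 128               # arbitrary illustrative shapes

p1 = rng.standard_normal((seq, half))     # GPU 1's partial (heads 1-32)
p2 = rng.standard_normal((seq, half))     # GPU 2's partial (heads 33-64)
w_o = rng.standard_normal((2 * half, dim))

w_top, w_bottom = w_o[:half], w_o[half:]  # row-split of W_O

full = np.concatenate([p1, p2], axis=1) @ w_o   # the "naive" concat path
summed = (p1 @ w_top) + (p2 @ w_bottom)         # the tensor-parallel path

assert np.allclose(full, summed)  # identical results, no concatenation needed
print("concat-then-multiply == local-multiply-then-sum")
```

The row-split of W_O is exactly what turns the combine step into a sum, which is the operation ALL-REDUCE is built for.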
ALL-REDUCE — Distributed Sum
This is where communication happens. Each GPU has a partial result. We need the sum on both GPUs (because the next layer’s weights are also split, so both GPUs need the full input).
ALL-REDUCE does exactly this:
GPU 1 sends Result₁ to GPU 2
GPU 2 sends Result₂ to GPU 1
Both GPUs compute Result₁ + Result₂
Both GPUs now have identical full output
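In practice this is a single collective call. A minimal sketch of what each rank would run, using torch.distributed with the NCCL backend (the standard choice for GPU collectives); launch details are the usual torchrun boilerplate:

```python
# Minimal 2-GPU all-reduce sketch with torch.distributed (NCCL backend).
# Launch with: torchrun --nproc_per_node=2 allreduce_demo.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Each GPU holds its partial result from the local W_O multiply.
partial = torch.full((8192,), float(rank + 1), device="cuda")

# In-place sum across all ranks; afterwards EVERY GPU holds the total.
dist.all_reduce(partial, op=dist.ReduceOp.SUM)

print(f"rank {rank}: {partial[0].item()}")  # both ranks print 3.0
dist.destroy_process_group()
```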
Why Both GPUs Need the Result
Because the next operation — Feed-Forward Network — also has split weights. GPU 1 holds half of the FFN weights, GPU 2 holds the other half. Both need the full attention output to multiply with their portion.
If only one GPU had the result, we’d have to broadcast it, defeating the parallelism.
The FFN follows the same pattern: split weights, local multiply, ALL-REDUCE to sum.
Each layer requires two ALL-REDUCE operations:
After Attention (combining head outputs)
After Feed-Forward (combining FFN outputs)
The Communication Tax
There is no free lunch; we all know this. It was evident to me that we are paying a heavy tax for communication. How on earth is the latency as low as it is, when we need so much communication across GPUs?
I had to take another look at what is actually being transferred between GPUs.
At each ALL-REDUCE, each GPU sends its partial result to the other. For a 70B model with hidden dimension 8192:
Per ALL-REDUCE:
- Each GPU sends: [sequence_length × hidden_dim] values
- At batch_size=1, seq_len=1 (decode): 8192 × 2 bytes (FP16) = 16 KB
- At batch_size=32, seq_len=1: 32 × 16 KB = 512 KB
- During prefill with 2000 tokens: 2000 × 16 KB = 32 MB

That seems small. But multiply by 160 syncs per forward pass:
Per token (decode):
- 160 syncs × 16 KB = 2.56 MB transferred per GPU

Per prefill (2000 tokens):
- 160 syncs × 32 MB = 5.12 GB transferred per GPU

Now consider you're generating 100 tokens per second with batching:

Sustained bandwidth needed:
- 100 tokens/sec × 160 syncs × 512 KB = 8.2 GB/sec per GPU

This is where NVLink comes in. Remember all those tweets from Elon about xAI's infrastructure and those mega GPU clusters stitched together with NVLink? I finally understood why it matters so much.
Interconnect     Bandwidth    160 syncs @ 512 KB    Time
---------------------------------------------------------
PCIe 4.0         32 GB/s      82 MB                 2.6 ms
PCIe 5.0         64 GB/s      82 MB                 1.3 ms
NVLink 4.0       900 GB/s     82 MB                 0.09 ms

With PCIe, you're spending 2.6 milliseconds per token just on communication — often longer than the actual computation.
With NVLink, communication drops to 0.09 milliseconds — nearly invisible.
Bandwidth is only part of the story. Each ALL-REDUCE also has latency — the fixed overhead of initiating a transfer, regardless of size.
PCIe goes through the CPU's PCIe controller:

GPU → PCIe → CPU/Switch → PCIe → GPU
Latency: ~1-5 microseconds per hop

NVLink is direct GPU-to-GPU:

GPU → NVLink → GPU
Latency: ~0.5-1 microseconds

With 160 syncs per token, even 5 microseconds of latency adds up:

PCIe:   160 × 5 μs = 800 μs = 0.8 ms of pure latency
NVLink: 160 × 1 μs = 160 μs = 0.16 ms of pure latency

Now, does this hold when we have a cluster of, say, 30,000 GPUs?
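Putting bandwidth and latency together, here is a small script that recomputes the per-token communication tax from the figures above (all inputs are the article's illustrative numbers):

```python
# Per-token communication tax for 2-way tensor parallelism, using the
# illustrative numbers above (160 syncs/token, 512 KB per sync at batch 32).
SYNCS_PER_TOKEN = 160
BYTES_PER_SYNC = 512 * 1024   # batch_size=32, hidden_dim=8192, FP16

links = {
    # name: (bandwidth in bytes/s, per-sync latency in seconds)
    "PCIe 4.0":   (32e9,  5e-6),
    "NVLink 4.0": (900e9, 1e-6),
}

for name, (bw, lat) in links.items():
    transfer = SYNCS_PER_TOKEN * BYTES_PER_SYNC / bw   # bandwidth-bound time
    overhead = SYNCS_PER_TOKEN * lat                   # fixed latency cost
    print(f"{name:>10}: {1e3*transfer:.2f} ms transfer + "
          f"{1e3*overhead:.2f} ms latency per token")
```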
For simplicity's sake (I could not comprehend a cluster of 30,000 GPUs, though I know xAI has built one), I asked instead: how does this work with just 8 GPUs? Every GPU needs partial outputs from every other GPU, and that data movement and shuffling alone must take so much more time and energy.
With 8-way Tensor Parallelism, ALL-REDUCE becomes more complex. Each GPU must exchange with all other GPUs:
2 GPUs: Each sends to 1 other → 1 exchange
8 GPUs: Each sends to 7 others → 7 exchanges (or clever ring/tree algorithms)

This is where the NVSwitch in DGX systems helps. Instead of point-to-point connections, NVSwitch acts as a crossbar — any GPU can talk to any other GPU simultaneously at full bandwidth.
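The "clever ring algorithms" deserve a sketch. In a ring all-reduce, each GPU talks only to its neighbor, and per-GPU traffic stays roughly constant (about 2x the data size) no matter how many GPUs join the ring. Below is a toy single-process simulation of the idea; real implementations live inside NCCL.

```python
# Toy single-process simulation of ring all-reduce across N "GPUs".
# Each rank's buffer is split into N chunks; after N-1 reduce-scatter steps
# and N-1 all-gather steps, every rank holds the full sum, and each rank
# only ever talked to its ring neighbor.
N = 4
bufs = [[float(r + 1)] * N for r in range(N)]  # rank r holds [r+1, r+1, ...]

# Phase 1: reduce-scatter -- at step s, rank r sends chunk (r - s) onward.
for s in range(N - 1):
    sends = [(r, (r - s) % N, bufs[r][(r - s) % N]) for r in range(N)]
    for r, c, val in sends:
        bufs[(r + 1) % N][c] += val   # neighbor accumulates the chunk

# Phase 2: all-gather -- circulate each fully reduced chunk around the ring.
for s in range(N - 1):
    sends = [(r, (r + 1 - s) % N, bufs[r][(r + 1 - s) % N]) for r in range(N)]
    for r, c, val in sends:
        bufs[(r + 1) % N][c] = val    # neighbor copies the finished chunk

print(bufs[0])  # every rank now holds [10.0, 10.0, 10.0, 10.0] (1+2+3+4)
```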
The Bottom Line ('Cause Stone Cold Said So)
Tensor Parallelism only works economically with NVLink. Without it:
Communication dominates computation time
GPUs spend more time waiting than working
You’d be better off with a single larger GPU (if one existed)
This is why NVIDIA’s expensive DGX systems — with their full NVLink mesh — command premium prices. The NVLink interconnect isn’t a luxury. It’s what makes multi-GPU inference viable.
So the NVIDIA moat gets bigger and stronger. I want to learn more about Google's TPUs and how they do this differently; stay tuned for another series. That is probably as deep a rabbit hole, or deeper, given how much Google has invested in their infrastructure.
Once I was able to fully understand how the scaling works (well, limited to 8 GPUs :D), the next question was economics. I have paid subscriptions to the Gemini and Claude models, along with Cursor (which I am thinking of moving away from, especially given how good Antigravity is getting, not to mention Claude Code). I could not stop thinking about pricing, given the CAPEX of such infrastructure and the need to serve billions of tokens a day.
Chapter 8: The Economics
Based on everything I know at this point, if I have to build an inference architecture that could actually serve millions of users per day, I know I need the following:
Multiple GPUs per model (Tensor Parallelism for 70B+ models)
NVLink connectivity (or you’re bottlenecked by communication)
Enough aggregate memory for KV caches (PagedAttention helps, but bytes are bytes)
Potentially separate prefill and decode clusters (disaggregated serving)
Now what would the cost of something like this look like? Time to bring out the envelopes and do some calculations.
A single DGX H100:
8 H100 GPUs: $240,000+ (at $30K each, often more)
NVLink mesh + NVSwitch: Included, but adds to system cost
System (CPU, RAM, storage, networking): $60,000+
Total: $300,000 - $500,000 per node
When has one node ever been enough?
Cluster Economics
To serve a product like Claude or Gemini, you need clusters of these nodes.
A modest inference cluster might have:
32 DGX H100 nodes = 256 GPUs
Cost: $10-15 million in hardware alone
A large-scale deployment:
Thousands of GPUs across multiple data centers
Cost: $100+ million in hardware
And that’s just the GPUs.
Once the CAPEX is accounted for, we look at the OPEX (operational expenses).
Operational Costs
Hardware is a one-time cost (sort of — GPUs get replaced every 2-3 years). Operations never stop.
Power: A DGX H100 draws 10.2 kW at full load. If you have been on Twitter (I still find it hard to call it X), you will have noticed xAI's CEO talking about why power is the biggest challenge in building a data center right now. Enormous money is being invested in power infrastructure, and China is far ahead in that game. We need a lot of power to run these machines. Nuclear fusion could not come sooner!
10.2 kW × 24 hours × 365 days = 89,352 kWh/year
At $0.10/kWh = $8,935/year per node just in electricity
At $0.15/kWh (more realistic) = $13,400/year per node

But that's just the GPUs. Cooling typically adds 40-100% overhead (PUE of 1.4-2.0).
Total power cost per DGX node: $15,000 - $25,000/year
For a 32-node cluster: $500,000 - $800,000/year in electricity alone.
Other operational costs:
Data center space: $10,000-30,000/rack/year
Network bandwidth: Variable, but significant
Staff (MLOps, SRE, on-call): $500K+/year for a small team
Maintenance and failures: GPUs fail, nodes go down
How do these costs translate to tokens, I dare ask?
Now let’s calculate what we can actually earn.
A single DGX H100 (8 GPUs) running Llama 70B with Tensor Parallelism:
Can serve the model with 8-way TP (one model replica)
With continuous batching: ~4,000-8,000 tokens/second throughput
Let’s say 5,000 tokens/second average
Revenue at $0.001 per 1,000 output tokens:

5,000 tokens/sec × $0.001/1000 = $0.005/second
$0.005 × 3600 × 24 × 365 = $157,680/year per DGX node

But wait — that's at 100% utilization.

Real-world utilization for inference is 30-60%. Let's say 50%.

$157,680 × 0.50 = $78,840/year actual revenue per node

Cost per node:

Hardware amortized over 3 years: $400K / 3 = $133,000/year
Operational costs: ~$40,000/year
Total: ~$173,000/year

Revenue: $78,840/year. Cost: $173,000/year. Loss: $94,160/year per node.
At $0.001/1K tokens, you lose money.
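The arithmetic is easy to play with. Here is a small sketch that reproduces the loss above and solves for the break-even price, using the same assumptions:

```python
# Per-node inference economics, using the assumptions above.
TOKENS_PER_SEC = 5_000           # 8-way TP Llama 70B with continuous batching
UTILIZATION = 0.50               # realistic average load
SECONDS_PER_YEAR = 3600 * 24 * 365

HW_COST = 400_000 / 3            # DGX amortized over 3 years
OPEX = 40_000                    # power, space, staff share
COST_PER_YEAR = HW_COST + OPEX   # ~$173K/year

tokens_per_year = TOKENS_PER_SEC * UTILIZATION * SECONDS_PER_YEAR

def revenue_per_year(price_per_1k_tokens):
    return tokens_per_year * price_per_1k_tokens / 1_000

print(f"At $0.001/1K: ${revenue_per_year(0.001):,.0f}/yr "
      f"vs ${COST_PER_YEAR:,.0f}/yr cost")          # ~$79K vs ~$173K
breakeven = COST_PER_YEAR * 1_000 / tokens_per_year
print(f"Break-even price: ${breakeven:.4f} per 1K tokens")  # ~$0.0022
```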
Given that I use Gemini a lot, I looked into Google's Gemini 3 Pro pricing to understand the situation and compare it against the base cost we just identified, to see whether Google is losing money on this.
The Price List Everyone Misreads
Let’s look at actual pricing. Here’s Google’s Gemini 3 Pro pricing (as of early 2026):
                         Per 1M Tokens    Per 1K Tokens
Input (≤200K context)    $2.00            $0.002
Input (>200K context)    $4.00            $0.004
Output (≤200K context)   $12.00           $0.012
Output (>200K context)   $18.00           $0.018

Most developers glance at this and think: "Oh, about $2-12 per million tokens. That's cheap!"
They’re missing three critical details that completely change the economics.
Detail #1: Output Costs 6x More Than Input
Why the huge difference? I wanted to understand this better.
Input (Prefill):
I send 2,000 tokens. The GPU processes them all at once:
Q: [2000 × dim] — all tokens in parallel
K: [2000 × dim] — all tokens in parallel
V: [2000 × dim] — all tokens in parallel
Attention = [2000 × 2000] matrix multiply
FFN = [2000 × dim] × [dim × 4×dim] matrix multiply

These are massive matrix-matrix multiplications. The GPU loads the model weights once, then does trillions of operations on 2,000 tokens. The Tensor Cores stay busy. This is compute-bound — the math is the bottleneck.
Output (Decode):
Now you generate tokens one at a time. For each new token:
Q: [1 × dim] — just the new token
K: [N × dim] — loaded from KV cache (all previous tokens)
V: [N × dim] — loaded from KV cache (all previous tokens)
Attention = [1 × N] vector — tiny
FFN = [1 × dim] × [dim × 4×dim] — matrix-vector multiply

The GPU loads the entire model weights and the entire KV cache... to produce one token.
Here’s the brutal math for a 100K context decode step:
Memory loaded: ~3.2 GB (KV cache) + model weights
Compute ops: ~1.6 billion (matrix-vector, not matrix-matrix)
H100 memory bandwidth: 3.35 TB/s → Load time: ~1 ms
H100 compute: 990 TFLOPS → Compute time: ~0.002 ms
You spend 99.8% of the time waiting for memory.

The model's response costs more than your entire prompt.
The 6x price difference reflects a real hardware efficiency gap:
Prefill: High utilization, amortized memory loads
Decode: Low utilization, repeated memory loads for single tokens
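You can reproduce the memory-bound math yourself. A sketch using the figures above (the byte and FLOP counts are the illustrative estimates for one 100K-context decode step):

```python
# Why decode is memory-bound: time to LOAD vs time to COMPUTE for
# one decode step at 100K context, using the illustrative figures above.
KV_BYTES = 3.2e9        # ~3.2 GB of KV cache touched per step
FLOPS = 1.6e9           # ~1.6 billion ops (matrix-vector, not matrix-matrix)

H100_BW = 3.35e12       # bytes/s of HBM bandwidth
H100_COMPUTE = 990e12   # FLOP/s (FP16 tensor cores)

t_mem = KV_BYTES / H100_BW          # ~0.96 ms just reading the KV cache
t_compute = FLOPS / H100_COMPUTE    # ~0.0016 ms of actual math

print(f"memory: {t_mem*1e3:.3f} ms, compute: {t_compute*1e3:.5f} ms")
print(f"fraction of time waiting on memory: {t_mem/(t_mem+t_compute):.1%}")
```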
Detail #2: Long Context Multiplies the Decode Problem
Notice the price doubles for prompts over 200K tokens:
Input: $2.00 → $4.00 per million
Output: $12.00 → $18.00 per million
For each output token, the GPU must load the KV cache for all previous tokens.
At 200K tokens, a Gemini-scale model might need:
200K × ~2MB/token = 400GB of KV cache for one user
That’s 5 H100 GPUs worth of memory just for your context.
The 50% output premium for long context ($12 → $18) actually undercharges relative to the true cost. It’s likely subsidized to encourage adoption.
Detail #3: The Million-Token Illusion
Pricing “per million tokens” makes costs feel tiny. Let’s convert to real usage:
A single heavy user session:
50K tokens of context (your conversation history)
10 back-and-forth exchanges
Average 500 tokens in, 1,000 tokens out per exchange
Input: 50K + (10 × 500) = 55K tokens × $0.002/1K = $0.11
Output: 10 × 1,000 = 10K tokens × $0.012/1K = $0.12
Total for one session: $0.23

At scale:
1 million daily active users
3 sessions per day average
= 3 million sessions/day
3M sessions × $0.23 = $690,000/day
= $251 million/year in inference revenue

If your infrastructure costs are $100M/year, you're making $150M profit on inference alone.
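Here is the same session math as a small reusable sketch, using the short-context Gemini 3 Pro rates quoted above and the hypothetical heavy-user profile:

```python
# Cost of the hypothetical heavy-user session above, at Gemini 3 Pro
# short-context rates ($2/M input, $12/M output).
IN_PRICE, OUT_PRICE = 2.00 / 1e6, 12.00 / 1e6   # dollars per token

def session_cost(history_tokens, exchanges, in_per_turn, out_per_turn):
    input_tokens = history_tokens + exchanges * in_per_turn
    output_tokens = exchanges * out_per_turn
    return input_tokens * IN_PRICE + output_tokens * OUT_PRICE

cost = session_cost(history_tokens=50_000, exchanges=10,
                    in_per_turn=500, out_per_turn=1_000)
print(f"one session: ${cost:.2f}")                 # ~$0.23
print(f"3M sessions/day: ${3e6 * cost:,.0f}/day")  # ~$690,000/day
```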
Now, do you see why Google, OpenAI, xAI, and NVIDIA are racing to build multi-billion-dollar data infrastructure? Usage will only go up. With every company getting into vibe coding, this cost will be compared with paying engineers' salaries, and with the optimizations still to come, the infrastructure game is very much the real game. Of course, every company needs a good model, which I think is the retention mechanism: once retention is triggered, switching to other models is not easy, which helps in predicting revenue in the coming months and years.
The frontier players (Google, Anthropic, OpenAI) are not struggling. They’re generating significant revenue from inference, which funds the next generation of model development.
The “brutal economics” exist in the commodity tier — companies racing to serve open-weight models like Llama at the lowest price. That’s where every optimization (Flash Attention, PagedAttention, quantization) is life-or-death.
For frontier providers, those same optimizations mean higher margins, not survival. And they have the talent, infrastructure, and data to keep bringing better models to market, so the field is incredibly competitive. Big tech is also building additional distribution channels to acquire and retain users.
The Developer’s Mistake
Most developers building on these APIs make a critical budgeting error:
What they estimate:

"We'll send 10K tokens per request, that's $0.02"
(Using input pricing, forgetting output)

What they actually pay:

Input: 10K × $0.002/1K = $0.02
Output: 5K × $0.012/1K = $0.06
Actual: $0.08 (4x their estimate)

Then they're shocked when their API bill comes in.
Best practice: Always estimate costs assuming 60-80% of your bill will be output tokens.
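A sketch of a safer estimator that bakes in that rule of thumb (the token counts below are just the example above; tune them to your workload):

```python
# A budgeting helper that refuses to forget output tokens.
# Prices are the Gemini 3 Pro short-context tiers quoted earlier.
def estimate_request_cost(input_tokens, output_tokens,
                          in_per_1k=0.002, out_per_1k=0.012):
    in_cost = input_tokens / 1_000 * in_per_1k
    out_cost = output_tokens / 1_000 * out_per_1k
    total = in_cost + out_cost
    return total, out_cost / total   # (cost, share of bill that is output)

cost, share = estimate_request_cost(input_tokens=10_000, output_tokens=5_000)
print(f"${cost:.2f} per request, {share:.0%} of it output")  # $0.08, 75%
```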
How is big tech already trying to improve their margins?
So the obvious next question is: what optimizations are already at play to improve token economics? The higher the margin, the better; I assume that is how the conversations go in the meeting rooms where these decisions are made.
Premium Features
When we look at the pricing, some interesting things stand out:
Longer context
Faster response
Larger models
Fine-tuned models for domains
I noticed that they are all premium features and they have a price multiplier.
Feature             Why it costs more                Price multiplier
----------------------------------------------------------------------
Longer context      More KV cache = more memory      2-10x
Faster response     Dedicated capacity, no batching  2-5x
Larger models       More GPUs required               2-4x
Fine-tuned models   Dedicated capacity               Variable

Maximize Utilization Through Batching
Remember continuous batching? The more users you can pack onto the same GPUs, the better your economics.
A request sitting in queue costs you nothing. A GPU sitting idle costs you everything.
The entire PagedAttention system exists to maximize how many users can share the same memory.
Quantization
Precision    Memory for 70B    Fits on             Quality loss
------------------------------------------------------------------
FP16         140 GB            2+ H100s            Baseline
INT8         70 GB             1 H100              ~1%
INT4         35 GB             1 H100 with room    ~3-5%

Going from FP16 to INT8 halves your GPU cost for roughly the same quality. At least, that is what the benchmarks show. Of course, for using AI to write emails and Slack messages, maybe the drop in quality hardly matters. If I were big tech, I would run variants at each precision and route requests by intent, so the user never sees a decrease in quality while margins improve.
We can apply quantization to the KV cache as well: storing it in INT8 instead of FP16 halves the memory and bandwidth needed without a big drop in quality.
Disaggregated serving (prefill/decode split)
Remember the prefill vs decode bottlenecks?
Prefill is compute-bound
Decode is memory-bound
By separating them:
Prefill servers can be optimized for compute density
Decode servers can be optimized for memory bandwidth
Overall efficiency improves 30-50%
Some providers are exploring using different hardware for each phase — even using CPUs or specialized accelerators for decode.
Within decoding, several further optimizations are at play. It was fascinating to read about speculative decoding, and Medusa in particular. I might do another post going deep on this; for now, it's good to see how much effort goes into providing low-latency, low-cost, yet high-quality tokens to users.
1. Speculative Decoding: Instead of generating 1 token at a time, use a small “draft” model to guess 4-8 tokens, then verify them all at once with the big model. If the guesses are right (often 70-80% are), you’ve done 4-8 tokens of work with one memory load.
2. Medusa / Multi-token Prediction: Train the model to output multiple tokens simultaneously. Still experimental but looks to be promising.
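The control loop for speculative decoding is simple enough to sketch. Below, draft() and target() are toy stand-ins for a small and a large model, and the greedy keep-the-matching-prefix acceptance rule is a simplification of the rejection-sampling scheme production systems use to match the target distribution exactly:

```python
# Speculative decoding, greedy-verification variant (a sketch).
# draft() and target() are toy stand-ins, not real models.
import random
random.seed(0)

def draft(ctx, k):
    """Small model cheaply guesses the next k tokens (toy: count upward)."""
    return [(ctx[-1] + i + 1) % 50 for i in range(k)]

def target(seq):
    """Big model's greedy next-token choice after each position (toy)."""
    return [(t + 1) % 50 if random.random() < 0.8 else (t + 2) % 50 for t in seq]

def spec_decode_step(ctx, k=4):
    guesses = draft(ctx, k)
    # ONE big-model pass scores the context plus all k guessed slots.
    preds = target(ctx + guesses[:-1])[-k:]
    accepted = []
    for g, p in zip(guesses, preds):
        if g != p:              # first mismatch: take the target's token, stop
            accepted.append(p)
            break
        accepted.append(g)      # match: keep the drafted token for free
    return accepted             # 1..k tokens from a single big-model load

ctx = [7]
for _ in range(5):
    ctx += spec_decode_step(ctx)
print(ctx)   # several tokens per big-model pass when the draft guesses well
```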
Custom Silicon, The Next Frontier
In the end, the ultimate win is building chips specialized specifically for inference. On this journey we have seen the kinds of optimizations put in place to squeeze performance out of existing hardware, right down to the electrons. Companies are looking at this and asking: what if we just built hardware better suited for ML?
I hope you have all seen the recent news of the massive investment in / buyout of Groq by NVIDIA.
Chapter 9: Connecting It All
Now I have seen what lies beneath the tip of the iceberg, and it all started with a simple question: why doesn't my AMD GPU support running all LLMs? I have grokked on this topic for weeks. I have had long taxi conversations with my non-techie partner about how fascinating this is: a simple prompt runs across so many abstractions in a matter of milliseconds, and it only keeps getting better. Let us not even get started on multi-modal inference right now. My brain needs a short break from so much information.
It is time we trace our entire journey!
I type a message into Claude. I have been talking to it for a week, building context, refining ideas. I press Enter.
In my browser: My text becomes a network request. (You can inspect the network traffic to see this.)
At the data center: A scheduler receives my request. It looks up my conversation ID. My KV cache has been partially offloaded to CPU RAM because I haven't messaged in 6 hours (duh, I have a life). The scheduler initiates a swap-in, pulling my context back into GPU memory across 4 H100s.
On the prefill cluster: My new message — let’s say 500 tokens — gets tokenized, embedded, and converted to binary. The tokens flow across PCIe to the GPUs. They’re processed through 80 transformer layers, each split across 4 GPUs via Tensor Parallelism. At each layer:
Each GPU computes attention for its 16 heads using its portion of my KV cache
PagedAttention kernels gather scattered memory blocks using block tables
Flash Attention keeps the computation in SRAM, never materializing the full attention matrix
ALL-REDUCE operations sync results across GPUs via NVLink
New KV entries are written to each GPU’s allocated blocks
My 500 tokens become 500 new rows in the KV cache: 800MB of state, distributed across 4 GPUs.
On the decode cluster: Now the response generates, one token at a time. For each token:
The last token’s embedding propagates through all 80 layers
At each layer, attention reaches back across my entire conversation — all 50,000+ tokens accumulated over a week
PagedAttention gathers the scattered KV pages (thousands of blocks, distributed across GPUs)
Flash Attention computes over tiles, never storing the 50K × 50K attention matrix that would require 5GB if materialized
Feed-forward networks activate
ALL-REDUCE syncs 160 times (twice per layer)
A probability distribution emerges
One token is sampled
This happens in about 20 milliseconds. Then it repeats for the next token. I dare you to read just the paragraph above in under 2 minutes; the reading takes longer than the generating. This is effing-baffling!
Back to me: Tokens stream to my browser as they're generated. I see words appearing in real time. It feels like magic.
Behind those words: petabytes of training data, billions of parameters, thousands of GPUs, millions of lines of kernel code, and the accumulated expertise of thousands of engineers working at every layer of the stack.
One word at a time.
FIN.
The End — And What’s Next
This journey started because my AMD GPU couldn’t run a model.
It ended with me understanding:
Why capacitors leak and SRAM can’t scale
Why Flash Attention never stores the attention matrix
Why tensor parallelism uses addition instead of concatenation
Why NVLink costs a fortune but NVSwitch costs more
Why output tokens cost 6x input tokens
Why NVIDIA is worth $3 trillion
I still have questions. How do TPUs work differently? What is Groq actually doing with no HBM at all? How does training differ from inference at this level?
More rabbit holes ahead.
If you made it through all 5 parts — thank you. You now understand AI infrastructure better than most engineers building on it.
Subscribe for future deep dives. And if this helped you, share it with someone who’d appreciate it.