The Anatomy of a Prompt: A journey from Python to Silicon - Part 3 of 5 - 16,000 Cores Waiting for Memory: The Physics of Forgetting
How are GPUs designed? Why can we not run deep learning models on CPUs?
Chapter 4: The Factory - Inside the GPU
Well, I finally got the answer as to why my AMD card struggles with model inference and training. But I was not done scratching the itch. I had come this far, so what exactly happens inside the GPU? How do the voltages work together? Why are DRAM prices going up if everything happens on the GPU? I decided to go further in my pursuit of understanding.
The transistor budget
I had a simple question for many years: why can we not run all of the ML processing on the CPU? Clearly the CPU can do a lot, so why can it not run kernels? It all comes down to silicon real estate. Yes, you heard me right: the same real estate challenge we have here in Berlin when it comes to housing also exists at the silicon level.
A chip designer has a limited budget of transistors. How they spend it defines the chip's personality.
The CPU, a group of Geniuses
The CPU (henceforth referred to as The Manager) is designed for latency: doing one thing as fast as possible. It spends transistors on massive control units to handle complex branching logic (think of a massive if/else ladder), deep caches to hide memory latency, and sophisticated branch predictors. It has very few actual “math stations”, maybe 8-16 cores.
The GPU, an army of Laborers
The GPU (henceforth referred to as The Factory) is designed for throughput: doing many things at once. The designer bulldozes the manager's office, shrinks the cafeteria, and fills the chip with thousands of tiny, simple calculators (ALUs - Arithmetic Logic Units). An H100 has over 16,000 CUDA cores. 16,000!
The CPU is a few geniuses. The GPU is an army of laborers.
The CUDA Hierarchy
Now we know that the CUDA kernel is responsible for the entire choreography. Imagine having to coordinate 10,000 dancers without chaos. That is exactly the challenge CUDA faces.
The CUDA programming model splits work into a rigid hierarchy (a minimal kernel sketch follows this list):
Grid: The entire job
Blocks: The grid is divided into blocks. Each block is assigned to a Streaming Multiprocessor (SM) — a mini-factory within the GPU.
Warps: Inside each block, threads are grouped into packs of 32. A Warp moves in lockstep — all 32 threads execute the same instruction at the same time.
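To make the hierarchy concrete, here is a minimal CUDA sketch. The kernel name and launch sizes are mine, purely for illustration:

```cuda
#include <cuda_runtime.h>

// Each thread handles one element. Threads are grouped into blocks,
// blocks together form the grid, and inside a block the hardware packs
// threads into warps of 32 automatically.
__global__ void add_one(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] += 1.0f;
}

// Launch: a grid of 4096 blocks, each with 256 threads (= 8 warps per block).
// add_one<<<4096, 256>>>(d_data, n);
```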
Warp Lockstep
All 32 threads in a Warp must do the same thing at the same time. If Thread 1 needs to add and Thread 17 needs to multiply, the warp cannot split up: it runs both paths one after the other, with the lanes that took the other branch sitting idle each time. It is like the synchronized swimming routines we see at the Olympics.
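Here is what that divergence looks like in a toy kernel (my own illustrative sketch, not from any real codebase):

```cuda
// Warp-divergence sketch: even and odd lanes of the same warp take
// different branches. The hardware runs the 'if' path with the odd lanes
// masked off, then the 'else' path with the even lanes masked off, so the
// two paths execute back to back instead of in parallel.
__global__ void divergent(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) {
        x[i] = x[i] + 1.0f;   // even lanes add...
    } else {
        x[i] = x[i] * 2.0f;   // ...odd lanes multiply, in a second pass
    }
}
```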
Latency Hiding
The GPU has a trick. Well, it has many tricks, I am assuming, but one of the tricks up its sleeve is this: when one warp stalls waiting for memory, the scheduler instantly swaps in another warp that is ready to work.
The individual workers are slow (slow, as always, is relative), but there are so many workers, and they swap so fast, that the factory floor never stops moving. Talk about taking a page from the Toyota Kaizen playbook.
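You can even ask the CUDA runtime how much of this swapping headroom a given kernel gets. A small sketch of mine, using the standard occupancy API on a trivially memory-bound kernel:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A trivially memory-bound kernel: each thread just copies one element,
// so almost all of its time is spent waiting on HBM.
__global__ void copy_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    int blockSize = 256;         // 8 warps per block
    int maxBlocksPerSM = 0;

    // How many blocks of this kernel fit on one SM at once?
    // More resident warps = more candidates the scheduler can swap in
    // while other warps sit waiting for memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM, copy_kernel, blockSize, 0 /* dynamic shared mem */);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int residentWarps = maxBlocksPerSM * (blockSize / prop.warpSize);
    printf("One SM can hold %d blocks = %d resident warps to hide latency\n",
           maxBlocksPerSM, residentWarps);
    return 0;
}
```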
Chapter 5: The Memory Wall - The Physics of Forgetting
Well, we have now seen what the input to the GPU looks like, how data moves from the CPU and main memory to the GPU, and the design decisions that make GPUs perfect for ML work.
However, the question that remains is: how does the GPU handle all that data?
The data is definitely larger than the memory it has access to.
Memory Hierarchy
The cores are fast. The kernels are optimized. But none of it matters without data.
And here we meet the true villain of our story: memory bandwidth.
Starving Cores
The data lives in HBM (High Bandwidth Memory), the 80 GB / 20 GB / x GB “warehouse” stacked next to the GPU die. But “high bandwidth” is relative. Fetching data from HBM takes 200+ clock cycles. In that time, a Tensor Core could have done hundreds of operations, if only it had data to work on.
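To put a rough number on that stall (the clock speed here is my own assumption, purely for scale):

200 cycles ÷ ~1.5 GHz ≈ 130 nanoseconds of waiting per trip to HBM.
An FMA unit issuing one fused multiply-add per cycle would have finished ~200 FMAs (~400 floating-point operations) in that window.
Across the 32 lanes of a single warp, that is over 12,000 operations of potential work skipped per stall, and that is before Tensor Cores even enter the picture.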
The Journey of Data
I then wanted to understand: why is HBM so slow? Why can't we just make the super-fast L1 cache bigger? If we can put together such a massive HBM stack, we should be able to make L1 caches bigger, right? Well, apparently, the answer is physics. Literal, electron-level physics.
The Leaky Bucket: DRAM/HBM
HBM (and all DRAM, despite the crazy prices) stores bits using capacitors, microscopic buckets that hold electrons.
Bucket full of electrons = 1
Bucket empty = 0
The problem: capacitors leak. Electrons quantum-tunnel through the insulator walls. If you store a ‘1’ and walk away, within about 64 milliseconds, it fades to garbage.
It's fascinating that we have been using computers, smartphones and other devices for so long, and we never think about electrons, nor do we notice the leaks. This is how deep the abstraction sits at the moment. Most of us are higher-level programmers who take the hardware almost for granted.
The Refresh Cycle
The first question I asked was: well, if DRAM and HBM leak, how the hell do these computations run? Of course there is a way this has been solved, but it comes at a cost. The system must constantly refresh: read every cell, check if it's still valid, and refill it. This refresh cycle consumes bandwidth and adds latency.
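For a rough sense of the cost, here is a back-of-envelope using typical DDR-class timing numbers (HBM's exact figures differ, so treat this as illustrative):

~64 ms retention window ÷ ~8,192 rows per bank ≈ one refresh command every ~7.8 µs.
Each refresh command locks the bank for roughly 350 ns.
350 ns ÷ 7,800 ns ≈ 4-5% of the time, the bank simply cannot serve your data because it is busy topping up its buckets.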
Destructive Reads
I started thinking: okay, maybe we could spend some of the Manager's real estate on a dedicated chip that just refreshes DRAM/HBM, if that were the only challenge. However, the deeper I went down the rabbit hole, the more I came across things like destructive reads. Reading a value from a capacitor is destructive: to measure the charge, you have to drain the bucket, and then refill it. This obviously comes at a cost; it takes time.
The Solid Switch: SRAM
Inside the GPU cores, the L1 cache uses SRAM, a completely different technology. This is the memory the GPU cores actually operate on.
SRAM stores bits using 6 transistors arranged in a feedback loop that actively hold each other in the on or off state. No leaking, no refresh, no destructive reads. Access time: ~1 clock cycle.
The Size-Speed Tradeoff
The immediate question I had was: okay, this is a no-brainer, why don't we use SRAM everywhere?
Cost and size. A single SRAM cell is physically ~6x larger than a DRAM cell. Building 80GB of SRAM would require a chip the size of a pizza box, costing millions of dollars.
The Physics Trap
We are now actively caught in a physical trade-off:
Memory Type | Speed | Capacity | Cost | Physics
--- | --- | --- | --- | ---
SRAM (L1 cache) | ~1 cycle | ~256 KB per SM | $$$ | 6 transistors, stable
HBM (VRAM) | ~200 cycles | 80 GB total | $ | 1 capacitor, leaky

We must use the slow, leaky buckets for capacity. We must write clever software to minimize trips to those buckets.
This is the memory wall we have to deal with. Once I saw it, it became very evident what is happening at both the GPU and CPU level: a lot of software optimization to fight against the physics.
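One classic example of that kind of software trick, written as a minimal CUDA sketch of my own (it assumes the matrix size N is a multiple of the tile size, purely to keep the code short): stage tiles in on-chip shared memory, which is SRAM, so each value fetched from HBM gets reused many times.

```cuda
#define TILE 32

// Tiled matrix multiply: each block stages a TILE x TILE tile of A and B
// in shared memory (on-chip SRAM), so every element fetched from HBM is
// reused TILE times instead of once. Assumes N is a multiple of TILE.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // One trip to the leaky buckets (HBM) per element...
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // ...then TILE reuses out of the solid switches (SRAM).
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```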
We live in a world where software development, fine-tuning and training of LLMs, and inference at scale are all happening at breakneck speed, and a whole lot of us have almost zero idea about the lower levels of abstraction. Without innovations in that space, we could not do what we do today. It's 2:00 AM Berlin time as I write this, and I am going to bed with more questions. I realise I need to go deeper down the rabbit hole. I need to know what those software optimizations and innovations are that make this happen. We will see that in Chapter 6.
Next Week: Flash Attention
We’re trapped by physics. Capacitors leak. SRAM is too expensive to scale. We can’t make memory faster.
By 2021, ML engineers knew this. They’d run profilers and see the brutal truth: Tensor Cores — the expensive, fast math units — were idle 80% of the time. Not computing. Waiting for memory.
The question became: We can’t make HBM faster. We can’t make SRAM bigger. Can we reorganize the work so the GPU stops waiting?
In 2022, Tri Dao asked a different question: “What if we never store the full attention matrix at all?”
The answer changed everything.
Part 4: Flash Attention — the trick that made GPUs 4x faster. Stay tuned.