NVIDIA GPU Pipeline — Zero to Expert

How a GPU Turns One Pixel
Into a Full Image

A zero-to-hero visual journey through NVIDIA GPU architecture. Watch data flow from your CPU through Streaming Multiprocessors, CUDA threads, and memory hierarchies — all the way to your screen.

CPU (Loads Image) → PCIe (Transfer) → VRAM (GPU Memory) → SM (Compute) → Display (Output)
The Full Journey

From JPEG on Disk
to Pixels on Screen

Every image you see passes through 5 distinct hardware and software stages: the CPU load, the PCIe transfer, GPU memory (VRAM), SM compute, and display output. Each step below explains what happens at that stage.
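As a rough sketch of how the middle stages look from code, the fragment below copies an RGBA buffer across PCIe into VRAM, runs one thread per pixel on the SMs, and copies the result back. The kernel `invert_rgba` and the wrapper `invert_on_gpu` are hypothetical names made up for illustration. The display stage is left out because it goes through a graphics API rather than these calls.

```cuda
#include <cstddef>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: inverts R, G and B and leaves alpha untouched.
__global__ void invert_rgba(uint8_t *pixels, int num_pixels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per pixel
    if (i >= num_pixels) return;                    // guard the partial last block
    uint8_t *p = pixels + 4 * i;                    // RGBA, 4 bytes per pixel
    p[0] = 255 - p[0];
    p[1] = 255 - p[1];
    p[2] = 255 - p[2];
}

// Hypothetical host wrapper. `host` must hold width * height * 4 bytes and is
// modified in place. Returns true if every CUDA call succeeded.
bool invert_on_gpu(std::vector<uint8_t> &host, int width, int height) {
    const int num_pixels = width * height;
    const std::size_t bytes = static_cast<std::size_t>(num_pixels) * 4;
    if (host.size() != bytes) return false;

    uint8_t *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;  // allocate VRAM

    // Host -> device copy: this is the PCIe transfer stage.
    bool ok = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threads = 256;                                   // threads per block
        const int blocks = (num_pixels + threads - 1) / threads;   // round up
        invert_rgba<<<blocks, threads>>>(dev, num_pixels);         // SM compute stage
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

Calling `invert_on_gpu(image, 1920, 1080)` on a decoded 1080p buffer runs the whole round trip. The block size of 256 is an arbitrary but common choice, not a tuned value.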

STEP 01 · ~8 MB for a 1080p image

CPU Loads the Image

Host → System RAM

Your CPU reads the image file from disk (JPEG/PNG), decodes it into raw pixel data (RGBA bytes), and stores it in system RAM. Each pixel is 4 bytes — Red, Green, Blue, Alpha.

snippet.cu
CUDA
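#include <cstdint>
#include "stb_image.h"  // define STB_IMAGE_IMPLEMENTATION in exactly one source file
int width, height, channels;  // filled in by stbi_load
uint8_t *pixels = stbi_load("photo.jpg", &width, &height, &channels, 4);  // force RGBA; ~8 MB at 1080p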
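To make the "~8 MB" figure and the 4-bytes-per-pixel layout concrete, here is a small host-side sketch of the arithmetic. It assumes a tightly packed, row-major RGBA buffer with no row padding. The helper names `rgba_buffer_bytes` and `pixel_offset` are hypothetical.

```cuda
#include <cstddef>

constexpr std::size_t kBytesPerPixel = 4;  // R, G, B, A: one byte each

// Total size in bytes of a tightly packed RGBA image.
constexpr std::size_t rgba_buffer_bytes(std::size_t width, std::size_t height) {
    return width * height * kBytesPerPixel;
}

// Byte offset of pixel (x, y) in a row-major RGBA buffer.
// R sits at offset + 0, G at + 1, B at + 2, A at + 3.
constexpr std::size_t pixel_offset(std::size_t x, std::size_t y, std::size_t width) {
    return (y * width + x) * kBytesPerPixel;
}

// 1920 * 1080 * 4 = 8,294,400 bytes, which is about 8.3 MB (7.9 MiB).
static_assert(rgba_buffer_bytes(1920, 1080) == 8294400, "1080p RGBA size");
// The last pixel's 4 bytes end exactly at the end of the buffer.
static_assert(pixel_offset(1919, 1079, 1920) + kBytesPerPixel == 8294400, "last pixel");
```

A buffer of this size is better placed on the heap than in a local array, since 8 MB can exceed the default stack size on many systems.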
Memory Hierarchy

Speed vs. Size:
The Memory Pyramid

GPUs have 5 levels of memory, each trading speed for capacity. Understanding this pyramid is the key to writing fast GPU code.

Faster & Smaller
Slower & Larger
L0

Registers

Per Thread

Latency: ~1 cycle
Bandwidth: highest of any level
Capacity: 256 KB / SM
Managed by: compiler

Fastest storage. Each CUDA thread gets its own private registers, like a calculator's display. Their contents disappear when the thread ends.

example.cu
float r = pixel.r; // stored in register
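One way to see registers in practice is to ask the CUDA runtime how many a kernel uses per thread and how large the register file is on your device. The sketch below does that for a hypothetical kernel, `luma_kernel`, whose local variables the compiler will normally keep in registers. The helper name `report_registers` is also made up for illustration, and the numbers it prints depend on your GPU and compiler version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: i, r, g and b are per-thread locals, which the
// compiler normally places in registers.
__global__ void luma_kernel(const uchar4 *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float r = in[i].x;  // uchar4 names its channels x, y, z, w
    float g = in[i].y;
    float b = in[i].z;
    out[i] = 0.299f * r + 0.587f * g + 0.114f * b;  // Rec. 601 luma weights
}

// Prints the kernel's register use and the device's register file size.
// Returns false if either query fails (for example, when no GPU is present).
bool report_registers(int device) {
    cudaFuncAttributes attr;
    cudaDeviceProp prop;
    if (cudaFuncGetAttributes(&attr, luma_kernel) != cudaSuccess) return false;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;

    // Each register is 32 bits wide, so multiply the count by 4 to get bytes.
    std::printf("registers per thread: %d\n", attr.numRegs);
    std::printf("register file per SM: %d KB\n", prop.regsPerMultiprocessor * 4 / 1024);
    return true;
}
```

On a device with 65,536 32-bit registers per SM, the second line works out to the 256 KB quoted above. If a kernel needs more registers than the compiler can assign, the extra values spill to slower memory.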
By the Numbers
16,384
CUDA Cores
in RTX 4090
Parallel compute units
~1 TB/s
VRAM Bandwidth
GDDR6X memory bus
Memory throughput
5 stages
Pipeline Stages
from disk to display
End-to-end processing
Faster than RAM
register vs. system RAM
Register speed advantage
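To put the bandwidth figure in perspective, this host-side sketch estimates how long one 1080p RGBA image takes to move at two different rates. Both rates are assumptions chosen for illustration: a nominal PCIe 4.0 x16 peak of 32 GB/s and roughly 1 TB/s for VRAM. The results are ideal lower bounds that ignore latency and driver overhead.

```cuda
#include <cstdio>

constexpr double kImageBytes = 1920.0 * 1080.0 * 4.0;  // 8,294,400 bytes of RGBA
constexpr double kPcieBytesPerSec = 32e9;  // assumption: nominal PCIe 4.0 x16 peak
constexpr double kVramBytesPerSec = 1e12;  // assumption: roughly 1 TB/s of VRAM bandwidth

// Seconds needed to move `bytes` at a sustained rate of `bytes_per_second`.
constexpr double transfer_seconds(double bytes, double bytes_per_second) {
    return bytes / bytes_per_second;
}

// Prints both estimates in microseconds.
void print_transfer_estimates() {
    std::printf("PCIe copy: %.1f us\n", transfer_seconds(kImageBytes, kPcieBytesPerSec) * 1e6);
    std::printf("VRAM pass: %.1f us\n", transfer_seconds(kImageBytes, kVramBytesPerSec) * 1e6);
}
```

Under these assumptions the PCIe copy takes about 259 microseconds and one pass over the same data in VRAM about 8 microseconds, which is why GPU code tries to keep data on the device between kernels.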