CUDA: The Software Layer Between You and the GPU
CUDA is NVIDIA's programming model that lets you write C++ code that runs on thousands of GPU cores simultaneously. Learn the thread hierarchy, kernel launches, and memory model from scratch.
Thread → Warp → Block → Grid: CUDA's Execution Model
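CUDA organizes millions of parallel threads into a four-level hierarchy: threads are scheduled in warps of 32, warps are grouped into blocks, and the blocks together make up the grid of one kernel launch.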
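To make the hierarchy concrete, here is a minimal sketch in which every thread writes down where it sits. The kernel `recordHierarchy`, the struct `ThreadCoords`, and the helper `mapHierarchy` are hypothetical names chosen for illustration, and error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread records its warp index within the block and its lane within the warp.
__global__ void recordHierarchy(int* warpInBlock, int* laneInWarp, int n) {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;  // position in the grid
    if (globalId >= n) return;
    warpInBlock[globalId] = threadIdx.x / warpSize;  // which warp of this block
    laneInWarp[globalId]  = threadIdx.x % warpSize;  // which lane of that warp
}

struct ThreadCoords {
    std::vector<int> warp;  // warp index inside the block, per thread
    std::vector<int> lane;  // lane index inside the warp, per thread
};

// Launches a 1D grid of `blocks` blocks with `threadsPerBlock` threads each.
ThreadCoords mapHierarchy(int blocks, int threadsPerBlock) {
    int n = blocks * threadsPerBlock;
    size_t bytes = n * sizeof(int);
    int* dWarp = nullptr;
    int* dLane = nullptr;
    cudaMalloc(&dWarp, bytes);
    cudaMalloc(&dLane, bytes);

    recordHierarchy<<<blocks, threadsPerBlock>>>(dWarp, dLane, n);

    ThreadCoords out{std::vector<int>(n), std::vector<int>(n)};
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out.warp.data(), dWarp, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(out.lane.data(), dLane, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dWarp);
    cudaFree(dLane);
    return out;
}
```

Calling `mapHierarchy(2, 64)` launches a grid of 2 blocks with 64 threads each, so every block contains two warps of 32 threads.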
👤 Thread: one worker at a single desk, doing their assigned calculation.
The smallest execution unit. Each thread runs the same kernel function but operates on different data (e.g., one pixel), and has its own private registers and local memory.
// threadIdx.x identifies this thread within its block
int x = threadIdx.x; // 0..31 for a block of 32 threads
float val = input[x];
output[x] = val * 2.0f;
Writing and Launching
Your First CUDA Kernel
Write the Kernel Function
A kernel is a C++ function marked with __global__. It runs on the GPU and is launched from the CPU. One launch starts many threads, and each thread executes the kernel body once; threadIdx and blockIdx (combined with blockDim) give each thread its unique identity.
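The host side of "called from the CPU" can be sketched as follows: allocate device memory, copy the input over, launch with a grid large enough to cover the image, and copy the result back. The kernel `invertImage` and the helper `invertOnGpu` are hypothetical stand-ins, the 16×16 block size is an arbitrary choice, the input is assumed to hold exactly width × height values, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: each thread inverts one pixel with a value in [0, 1].
__global__ void invertImage(float* d_output, const float* d_input,
                            int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // threads past the image edge do nothing
    int idx = y * width + x;
    d_output[idx] = 1.0f - d_input[idx];
}

// Assumes h_input.size() == width * height.
std::vector<float> invertOnGpu(const std::vector<float>& h_input,
                               int width, int height) {
    size_t bytes = h_input.size() * sizeof(float);
    float* d_input = nullptr;
    float* d_output = nullptr;
    cudaMalloc(&d_input, bytes);
    cudaMalloc(&d_output, bytes);
    cudaMemcpy(d_input, h_input.data(), bytes, cudaMemcpyHostToDevice);

    // Round the grid size up so partial tiles at the edges are still covered.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    invertImage<<<grid, block>>>(d_output, d_input, width, height);

    std::vector<float> h_output(h_input.size());
    // This copy waits for the kernel, which ran on the same default stream.
    cudaMemcpy(h_output.data(), d_output, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_input);
    cudaFree(d_output);
    return h_output;
}
```

For a 20×10 image the grid is 2×1 blocks, or 512 threads for 200 pixels; the bounds check in the kernel is what makes the surplus threads harmless.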
__global__ void processImage(
    float* d_output,
    const float* d_input,
    int width, int height
) {
    // Each thread handles one pixel
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    // Skip threads that fall outside the image
    if (x >= width || y >= height) return;
    int idx = y * width + x; // row-major pixel index
    float pixel = d_input[idx];
    // Gamma correction (powf keeps the math in single precision)
    d_output[idx] = powf(pixel, 1.0f / 2.2f);
}
Memory Access Patterns: The Difference Between Fast and Slow
When the 32 threads of a warp read consecutive 4-byte values from an aligned 128-byte region, the GPU can combine all 32 reads into ONE memory transaction. This is called coalesced access, and it is the ideal pattern.
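The contrast with a non-coalesced pattern can be sketched with two copy kernels plus a small host-side model. The names `copyCoalesced`, `copyStrided`, and `segmentsTouchedByWarp` are hypothetical, and the model is a simplification that assumes 4-byte floats, a 128-byte-aligned base address, and 128-byte segments; real transaction sizes depend on the GPU architecture.

```cuda
#include <set>
#include <cuda_runtime.h>

// Coalesced: neighbouring threads read neighbouring elements.
__global__ void copyCoalesced(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighbouring threads read elements `stride` apart,
// so one warp's reads are spread over many separate memory segments.
__global__ void copyStrided(float* out, const float* in, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = i * stride;
    if (j < n) out[i] = in[j];
}

// Simplified model: how many distinct 128-byte segments does one warp touch
// when lane k reads the float at element index k * strideElements?
int segmentsTouchedByWarp(int strideElements) {
    const int kWarpSize = 32;
    const long long kSegmentBytes = 128;
    const long long kElemBytes = static_cast<long long>(sizeof(float));
    std::set<long long> segments;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        long long byteOffset = lane * static_cast<long long>(strideElements) * kElemBytes;
        segments.insert(byteOffset / kSegmentBytes);
    }
    return static_cast<int>(segments.size());
}
```

Under this model a stride of 1 touches a single segment, while a stride of 32 touches 32 segments, one per thread, which is why strided access tends to waste most of the memory bandwidth it consumes.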
// Thread i reads element i
// Element indices: 0,1,2,3,...,31
// → 1 memory transaction
float val = d_array[threadIdx.x];
// All 32 threads access one
// contiguous 128-byte block (32 × 4-byte floats)
Why Not Just Use a Faster CPU?
Parallelism Demo: Process 16 pixels
CPU: one at a time. GPU: all at once.
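In code, the same contrast might look like the sketch below: the CPU version walks the pixels in a loop, while the GPU version replaces the loop with one thread per pixel. The per-pixel operation `darken` and the functions `darkenCpu`, `darkenKernel`, and `darkenGpu` are hypothetical examples; the sketch assumes a non-empty input and omits error checking.

```cuda
#include <vector>
#include <cuda_runtime.h>

// The per-pixel math shared by both versions (an arbitrary example: halve brightness).
__host__ __device__ inline float darken(float p) { return p * 0.5f; }

// CPU: one loop iteration per pixel, executed one after another.
std::vector<float> darkenCpu(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    for (size_t i = 0; i < in.size(); ++i) out[i] = darken(in[i]);
    return out;
}

// GPU: one thread per pixel; the loop counter becomes the thread index.
__global__ void darkenKernel(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = darken(in[i]);
}

std::vector<float> darkenGpu(const std::vector<float>& in) {
    int n = static_cast<int>(in.size());
    size_t bytes = n * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 16;  // 16 pixels fit in a single block of 16 threads
    int blocks = (n + threads - 1) / threads;
    darkenKernel<<<blocks, threads>>>(d_out, d_in, n);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

With only 16 pixels the transfer overhead would likely outweigh any gain; the pattern pays off when the pixel count reaches the millions.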
The Key Insight
A CPU is a sprinter: fast at one thing at a time and great for complex logic. A GPU is a swimming team: thousands of average swimmers all racing simultaneously. For an image with 2 million pixels, each needing the same color math, the GPU can win by a wide margin (on the order of 100×) because every pixel is independent and can be computed in parallel.