CUDA Integration
Compute Unified Device Architecture

CUDA: The Software Layer Between You and the GPU

CUDA is NVIDIA's programming model that lets you write C++ code that runs on thousands of GPU cores simultaneously. Learn the thread hierarchy, kernel launches, and memory model from scratch.

Threads per warp: 32
Max threads per block: 1024
CUDA version: 12.x
Thread Hierarchy

Thread → Warp → Block → Grid: CUDA's Execution Model

CUDA organizes millions of parallel threads into a four-level hierarchy: thread, warp, block, and grid.

Thread (1 unit)

Analogy: 👤 One worker at a single desk, doing their assigned calculation.

Private registers: up to 255 per thread
Unique ID: threadIdx.x/y/z

The smallest execution unit. Each thread runs the same kernel function but operates on different data (e.g., one pixel), and each has its own private registers and local memory.

kernel.cu
// threadIdx.x identifies this thread within its block
int x = threadIdx.x; // 0..31 for a 32-thread block
float val = input[x];
output[x] = val * 2.0f;
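To connect the thread level to the warp level above it, the host-only C++ sketch below shows how a thread index maps to a warp and a lane, assuming the 32-thread warp and 1024-thread block limit quoted earlier. The names `WarpPosition`, `warpPositionOf`, and `warpsPerBlock` are hypothetical helpers for illustration, not part of the CUDA API.

```cpp
#include <cassert>

// Warp size as stated in the text (32 threads per warp).
constexpr int kWarpSize = 32;

// Position of a thread inside its block, expressed in warp terms.
struct WarpPosition {
    int warp;  // which warp of the block the thread belongs to
    int lane;  // the thread's slot (0..31) inside that warp
};

// Maps a 1D thread index (the value a kernel would read from threadIdx.x)
// to its warp and lane. Hypothetical helper, not a CUDA API.
WarpPosition warpPositionOf(int threadIdxX) {
    assert(threadIdxX >= 0);
    return {threadIdxX / kWarpSize, threadIdxX % kWarpSize};
}

// Number of warps a block occupies; a partially filled last warp still counts.
int warpsPerBlock(int threadsPerBlock) {
    assert(threadsPerBlock > 0 && threadsPerBlock <= 1024);
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}
```

For example, thread 32 of a block is lane 0 of warp 1, and a 16 × 16 block (256 threads) occupies 8 warps.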
Kernel Launch

Writing and Launching Your First CUDA Kernel

Example launch: processImage<<<dim3(120, 68), dim3(16, 16)>>> starts a 120 × 68 grid of 16 × 16-thread blocks, 2,088,960 threads in total.
On a GPU with 128 SMs, the thread blocks are distributed across the SMs, and each SM processes the blocks assigned to it.
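The grid size in that launch can be derived rather than hard-coded. The host-only C++ sketch below assumes a 1920 × 1080 image, which is an inference from the (120, 68) grid and 16 × 16 blocks and is not stated in the source. `Dim2`, `ceilDiv`, `gridFor`, and `totalThreads` are hypothetical names standing in for what real code would do with CUDA's `dim3`.

```cpp
#include <cassert>

// Host-side stand-in for CUDA's dim3, limited to two dimensions.
struct Dim2 {
    int x;
    int y;
};

// Integer division rounded up, so a partial tile still gets its own block.
int ceilDiv(int n, int d) {
    assert(n >= 0 && d > 0);
    return (n + d - 1) / d;
}

// Smallest grid of `block`-sized tiles that covers a width x height image.
Dim2 gridFor(int width, int height, Dim2 block) {
    return {ceilDiv(width, block.x), ceilDiv(height, block.y)};
}

// Threads launched in total, including those that fall outside the image.
long long totalThreads(Dim2 grid, Dim2 block) {
    return 1LL * grid.x * grid.y * block.x * block.y;
}
```

Under the 1920 × 1080 assumption this gives a (120, 68) grid and 2,088,960 threads, which is 15,360 more than the image has pixels. Those surplus threads are the reason a kernel needs a bounds check.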

Write the Kernel Function

A kernel is a C++ function marked with __global__. It runs on the GPU and is launched from the CPU. A single launch starts many threads, and each thread executes the function body once; threadIdx and blockIdx (together with blockDim) give each thread its unique identity.

main.cu
__global__ void processImage(
  float* d_output,
  const float* d_input,
  int width, int height
) {
  // Each thread handles one pixel
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;

  // Threads past the image edge have no pixel to process
  if (x >= width || y >= height) return;

  int idx = y * width + x;
  float pixel = d_input[idx];

  // Gamma correction (powf is the single-precision power function)
  d_output[idx] = powf(pixel, 1.0f / 2.2f);
}
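One way to check the index arithmetic and the bounds check without a GPU is to emulate the grid on the CPU. The sketch below is a plain C++ reference, not the kernel itself: four nested loops stand in for the (blockIdx, threadIdx) pairs that the GPU would run in parallel. The function name `processImageReference` and the default 16 × 16 block size are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU emulation of processImage. Returns how many emulated threads
// passed the bounds check, i.e. actually processed a pixel.
int processImageReference(std::vector<float>& output,
                          const std::vector<float>& input,
                          int width, int height,
                          int blockDimX = 16, int blockDimY = 16) {
    assert(input.size() == static_cast<std::size_t>(width) * height);
    output.assign(input.size(), 0.0f);

    // Round up so partial tiles at the right and bottom edges are covered.
    const int gridDimX = (width + blockDimX - 1) / blockDimX;
    const int gridDimY = (height + blockDimY - 1) / blockDimY;

    int activeThreads = 0;
    for (int by = 0; by < gridDimY; ++by)
        for (int bx = 0; bx < gridDimX; ++bx)
            for (int ty = 0; ty < blockDimY; ++ty)
                for (int tx = 0; tx < blockDimX; ++tx) {
                    const int x = bx * blockDimX + tx;
                    const int y = by * blockDimY + ty;
                    if (x >= width || y >= height) continue;  // kernel's early return
                    const int idx = y * width + x;
                    output[idx] = std::pow(input[idx], 1.0f / 2.2f);  // gamma correction
                    ++activeThreads;
                }
    return activeThreads;
}
```

For a 20 × 10 image the emulated grid is 2 × 1 blocks (512 threads), but only 200 of them pass the bounds check, one per pixel. A reference like this can also be compared against the GPU output when testing the real kernel.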
Memory Patterns

Memory Access Patterns: The Difference Between Fast and Slow

Coalesced Access (1 transaction)

When the 32 threads of a warp read consecutive 4-byte values from one aligned 128-byte block, the GPU combines all 32 reads into a single memory transaction. This is called coalesced access, and it is the ideal pattern.

Memory Transactions (32 threads): threads T0 through T31 are served by one transaction (100% efficient).
memory_pattern.cu
// Thread i reads element i
// Element indices: 0,1,2,3,...,31 (4 bytes each)
// → 1 memory transaction
float val = d_array[threadIdx.x];
// All 32 threads access one
// contiguous 128-byte block
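To see why the access pattern matters, the host-only C++ sketch below counts how many 128-byte blocks a warp touches for a given stride. It is a simplified model, not a description of any particular GPU: it assumes 4-byte floats, an array that starts on a 128-byte boundary, and one transaction per distinct 128-byte block. `segmentsTouched` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

constexpr int kWarpSize = 32;                // threads per warp, as in the text
constexpr std::uint64_t kSegmentBytes = 128; // block size used in the text

// Counts the distinct 128-byte segments touched when lane i of a warp reads
// the float at element index i * strideElements.
// Simplified model: the array is assumed to start on a 128-byte boundary.
int segmentsTouched(int strideElements) {
    assert(strideElements >= 1);
    std::set<std::uint64_t> segments;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const std::uint64_t byteAddress =
            static_cast<std::uint64_t>(lane) * strideElements * sizeof(float);
        segments.insert(byteAddress / kSegmentBytes);
    }
    return static_cast<int>(segments.size());
}
```

In this model a stride of 1 (thread i reads element i) touches one segment, a stride of 2 touches two, and a stride of 32 touches 32, one per thread.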
CPU vs GPU

Why Not Just Use a Faster CPU?

Parallelism Demo: Process 16 pixels

CPU: one at a time. GPU: all at once.

CPU: sequential, 1 pixel at a time (P0 through P15 in turn). Time: ~3200ms.
GPU: parallel, all 16 pixels at once (P0 through P15 together). Time: ~200ms.
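The demo timings follow from simple arithmetic. The sketch below is an idealized model in C++, assuming every pixel costs the same time and ignoring launch and memory overhead. The 200 ms per pixel figure is inferred from 3200 ms / 16 pixels and is not stated directly in the source; `modelTimeMs` is a hypothetical helper name.

```cpp
#include <cassert>

// Idealized timing model for the demo: every pixel costs msPerPixel,
// and `lanes` pixels can be processed at the same time.
double modelTimeMs(int pixels, int lanes, double msPerPixel) {
    assert(pixels >= 0 && lanes > 0);
    const int rounds = (pixels + lanes - 1) / lanes;  // batches needed, rounded up
    return rounds * msPerPixel;
}
```

With 16 pixels at 200 ms each, one lane gives 3200 ms and 16 lanes give 200 ms, a 16× speedup. In this model the speedup can never exceed the number of lanes, which is why a GPU's large core count matters for independent per-pixel work.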
| Property | CPU | GPU |
|---|---|---|
| Core Count | 8–24 cores | 16,384 CUDA cores |
| Clock Speed | 4–6 GHz | 2–3 GHz |
| Task Type | Sequential logic | Parallel data |
| Memory BW | 50–100 GB/s | ~1 TB/s |
| Cache | Large (32+ MB L3) | Small per SM (128 KB) |
| Latency | Low (1–10 ns) | High (200–600 ns) |
| Throughput | Low | Extreme |
| Best For | OS, logic, branching | Images, AI, physics |

The Key Insight

A CPU is a sprinter: fast at one thing at a time and great for complex logic. A GPU is a swimming team: thousands of average swimmers all racing simultaneously. For an image with 2 million pixels, each needing the same color math, the GPU can win by as much as 100× because every pixel is independent.