Compute Unified Device Architecture

CUDA: The Software Layer Between You and the GPU

CUDA is NVIDIA's programming model that lets you write C++ code that runs on thousands of GPU cores simultaneously. Learn the thread hierarchy, kernel launches, and memory model from scratch.

32 Threads per Warp

1024 Max Threads/Block

12.x CUDA Version

Thread Hierarchy

Thread → Warp → Block → Grid: CUDA's Execution Model

CUDA organizes millions of parallel threads into a four-level hierarchy: thread, warp, block, and grid.

Thread (1 unit)

The smallest execution unit. Each thread runs the same kernel function but operates on different data (e.g., one pixel). It has its own private registers and local memory.

Analogy: 👤 One worker at a single desk, doing their assigned calculation.

Private registers: up to 255 per thread
Unique ID: threadIdx.x/y/z

kernel.cu

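// Each thread doubles one element (kernel name is illustrative)
__global__ void doubleElements(const float* input, float* output) {
  // threadIdx.x identifies this thread within its block
  int x = threadIdx.x; // 0..31 when launched with 32 threads per block
  float val = input[x];
  output[x] = val * 2.0f;
}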
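
To make the hierarchy concrete, here is a minimal host-side C++ sketch of the index arithmetic that relates a thread to its warp, block, and grid position. It is a model, not device code: `ThreadCoords` and `locate` are hypothetical names introduced for illustration, and the sketch assumes a one-dimensional block and the 32-thread warp size quoted above.

```cpp
#include <cassert>

// Where one thread sits in the hierarchy (1-D launch assumed).
struct ThreadCoords {
    int globalId;     // index within the whole grid
    int warpInBlock;  // which warp of its block the thread belongs to
    int lane;         // position inside that warp, 0..31
};

// Hypothetical helper: mirrors blockIdx.x * blockDim.x + threadIdx.x
// and splits threadIdx.x into warp number and lane.
ThreadCoords locate(int blockIdxX, int blockDimX, int threadIdxX) {
    assert(blockDimX >= 1 && blockDimX <= 1024);        // max threads per block
    assert(threadIdxX >= 0 && threadIdxX < blockDimX);
    const int kWarpSize = 32;
    return ThreadCoords{
        blockIdxX * blockDimX + threadIdxX,
        threadIdxX / kWarpSize,
        threadIdxX % kWarpSize
    };
}
```

For example, with 256 threads per block, `locate(3, 256, 70)` describes thread 70 of block 3: it is thread 838 of the grid, in warp 2 of its block, at lane 6.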

Kernel Launch

Writing and Launching Your First CUDA Kernel

Kernel Launch Simulator

processImage<<<dim3(120, 68), dim3(16, 16)>>>(...) launches 120 × 68 = 8,160 blocks of 16 × 16 = 256 threads each, 2,088,960 threads in total.

The simulated GPU has 128 SMs; each SM processes the thread blocks assigned to it.
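
The grid size in the launch above follows from rounding the image size up to whole blocks. The host-side C++ sketch below reproduces that arithmetic. It assumes a 1920 × 1080 image, which is not stated on this page but matches the 120 × 68 grid; `Dim2`, `ceilDiv`, `gridFor`, and `totalThreads` are hypothetical helper names.

```cpp
#include <cassert>

struct Dim2 { int x; int y; };

// Round up so a partial tile at the right or bottom edge still gets a block.
int ceilDiv(int n, int d) {
    assert(n >= 0 && d > 0);
    return (n + d - 1) / d;
}

// Number of blocks needed to cover a width x height image.
Dim2 gridFor(int width, int height, Dim2 block) {
    return Dim2{ ceilDiv(width, block.x), ceilDiv(height, block.y) };
}

// Total threads launched = blocks in the grid * threads per block.
long long totalThreads(Dim2 grid, Dim2 block) {
    return 1LL * grid.x * grid.y * block.x * block.y;
}
```

Under that assumption, `gridFor(1920, 1080, Dim2{16, 16})` gives a 120 × 68 grid and `totalThreads` gives 2,088,960, slightly more than the 2,073,600 pixels because 1080 is not a multiple of 16.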

Write the Kernel Function

A kernel is a C++ function marked with __global__. It runs on the GPU and is launched from the CPU. Every thread in the launch executes the function once; threadIdx and blockIdx, combined with blockDim, give each thread its unique identity.

main.cu

__global__ void processImage(
  float* d_output,
  const float* d_input,  // read-only input image
  int width, int height
) {
  // Each thread handles one pixel
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;

  // Threads past the image edge have no pixel to process
  if (x >= width || y >= height) return;

  int idx = y * width + x;  // row-major pixel index
  float pixel = d_input[idx];

  // Gamma correction (powf is the single-precision power function)
  d_output[idx] = powf(pixel, 1.0f / 2.2f);
}
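
The bounds check in the kernel matters because the grid usually covers slightly more than the image. The following host-side C++ sketch replays the kernel's index math for every launched thread and counts how many pass the guard. It is a simulation, not device code: `GuardStats` and `simulateGuard` are hypothetical names, and the 1920 × 1080 image size used in the example is an assumption consistent with the 120 × 68 grid of 16 × 16 blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct GuardStats {
    long long active;    // threads that process a pixel
    long long idle;      // threads that return at the bounds check
    bool eachPixelOnce;  // true if every pixel was written exactly once
};

GuardStats simulateGuard(int gridX, int gridY, int blockX, int blockY,
                         int width, int height) {
    assert(width > 0 && height > 0);
    std::vector<int> hits(static_cast<std::size_t>(width) * height, 0);
    GuardStats s{0, 0, true};
    for (int by = 0; by < gridY; ++by)
        for (int bx = 0; bx < gridX; ++bx)
            for (int ty = 0; ty < blockY; ++ty)
                for (int tx = 0; tx < blockX; ++tx) {
                    int x = bx * blockX + tx;  // same formula as the kernel
                    int y = by * blockY + ty;
                    if (x >= width || y >= height) { ++s.idle; continue; }
                    ++hits[static_cast<std::size_t>(y) * width + x];
                    ++s.active;
                }
    for (int h : hits)
        if (h != 1) s.eachPixelOnce = false;
    return s;
}
```

With `simulateGuard(120, 68, 16, 16, 1920, 1080)`, 2,073,600 threads process a pixel and 15,360 return early (the 8 surplus rows below the image). Without the guard, those threads would read and write past the end of the buffers.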

Memory Patterns

Memory Access Patterns: The Difference Between Fast and Slow

Coalesced Access

When the 32 threads of a warp read consecutive 4-byte values, the addresses fall in one contiguous 128-byte block, and the GPU combines all 32 reads into a single memory transaction. This is called coalesced access, and it is the ideal pattern.

Memory Transactions (32 threads): all threads of the warp are served by 1 transaction (100% efficient).

memory_pattern.cu

// Thread i reads element i (indices 0, 1, 2, ..., 31)
// 32 threads × 4 bytes = one contiguous 128-byte block
// → 1 memory transaction
float val = d_array[threadIdx.x];
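
To see why consecutive addresses matter, the host-side C++ sketch below counts how many 128-byte segments one warp touches when thread t reads the float at index t × stride. This is a deliberately simplified model: it assumes the array starts on a 128-byte boundary and treats each distinct segment as one transaction, whereas real hardware works at a finer granularity. `segmentsTouched` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Count the distinct 128-byte segments read by a 32-thread warp
// when thread t loads the 4-byte float at index t * stride.
int segmentsTouched(int stride) {
    assert(stride >= 1);
    const std::size_t kSegmentBytes = 128;
    const int kWarpSize = 32;
    std::set<std::size_t> segments;
    for (int t = 0; t < kWarpSize; ++t) {
        std::size_t byteOffset =
            static_cast<std::size_t>(t) * stride * sizeof(float);
        segments.insert(byteOffset / kSegmentBytes);
    }
    return static_cast<int>(segments.size());
}
```

In this model a stride of 1 (the coalesced case) touches 1 segment, a stride of 2 touches 2, and a stride of 32 touches 32, one per thread, so the same 32 values cost 32 times as much memory traffic.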

CPU vs GPU

Why Not Just Use a Faster CPU?

Parallelism Demo: Process 16 pixels

CPU: one at a time. GPU: all at once.

CPU: sequential, 1 pixel at a time. Time: ~3200ms

GPU: parallel, all 16 at once. Time: ~200ms

| Property | CPU | GPU |
|---|---|---|
| Core Count | 8–24 cores | 16,384 CUDA cores ★ |
| Clock Speed | 4–6 GHz ★ | 2–3 GHz |
| Task Type | Sequential logic | Parallel data |
| Memory BW | 50–100 GB/s | ~1 TB/s ★ |
| Cache | Large (32+ MB L3) ★ | Small per SM (128 KB) |
| Latency | Low (1–10 ns) ★ | High (200–600 ns) |
| Throughput | Low | Extreme ★ |
| Best For | OS, logic, branching | Images, AI, physics |

★ marks the side with the advantage in that row.

The Key Insight

A CPU is a sprinter: fast at one thing at a time and great for complex logic. A GPU is a swimming team: thousands of average swimmers all racing simultaneously. For an image with 2 million pixels, each needing the same color math, the GPU can win by a factor on the order of 100×, because every pixel is independent and can be computed by its own thread.
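
For comparison, a CPU version of the same gamma correction might look like the host-side C++ sketch below (`gammaCorrectCpu` is a hypothetical name, and pixel values are assumed to be non-negative). No loop iteration reads a value written by another, which is what allows a GPU to hand each iteration to its own thread.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sequential reference: one pixel per loop iteration.
std::vector<float> gammaCorrectCpu(const std::vector<float>& input) {
    std::vector<float> output(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        assert(input[i] >= 0.0f);  // assumed non-negative pixel values
        // Same math as the kernel body; depends only on input[i].
        output[i] = std::pow(input[i], 1.0f / 2.2f);
    }
    return output;
}
```

The loop index `i` plays the role that `idx` plays in the kernel: the GPU replaces the loop with one thread per index.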
