Shaders and Rasterization

A shader is a small program that runs on the GPU. The name originates from “to shade” since early shaders calculated how light interacts with surfaces. Today the term encompasses any GPU program.

The key mental model shift from CPU programming: you write code for one element (a single pixel or vertex), and the GPU executes it across thousands of elements automatically.

The graphics pipeline processes geometry through several stages. Rasterization converts vector geometry (triangles defined by vertices) into a raster (a grid of pixels). Once rasterization has determined which pixels a triangle covers, the fragment shader runs on each resulting fragment to compute its color. A fragment is a “potential pixel”: it may still be discarded (e.g., because another object occludes it) or blended before becoming a final pixel.
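To make the two steps concrete, here is a minimal CPU-side sketch in Python. It assumes a simple edge-function coverage test sampled at pixel centers; the names `edge`, `rasterize`, `fragment_shader`, and `render` are illustrative and not part of any real graphics API. Note that `fragment_shader` is written for a single fragment, and the loop in `render` stands in for what the GPU does in parallel.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area test: the sign tells which side of edge A->B point P is on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(tri, width, height):
    """Yield (x, y) for every pixel whose center lies inside the triangle."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    for y in range(height):
        for x in range(width):
            px, py = x + 0.5, y + 0.5  # sample at the pixel center
            w0 = edge(x1, y1, x2, y2, px, py)
            w1 = edge(x2, y2, x0, y0, px, py)
            w2 = edge(x0, y0, x1, y1, px, py)
            # Accept either winding order: all three tests must agree in sign.
            inside_ccw = w0 >= 0 and w1 >= 0 and w2 >= 0
            inside_cw = w0 <= 0 and w1 <= 0 and w2 <= 0
            if inside_ccw or inside_cw:
                yield x, y


def fragment_shader(x, y, width, height):
    """Written for ONE fragment: returns an RGBA gradient color."""
    return (x / width, y / height, 0.5, 1.0)


def render(tri, width, height):
    """Run the fragment shader on every fragment the rasterizer produces."""
    framebuffer = {}
    for x, y in rasterize(tri, width, height):
        framebuffer[(x, y)] = fragment_shader(x, y, width, height)
    return framebuffer


triangle = [(0, 0), (8, 0), (0, 8)]
image = render(triangle, 8, 8)
for y in range(8):
    print("".join("#" if (x, y) in image else "." for x in range(8)))
```

A real rasterizer adds fill rules, depth testing, and attribute interpolation; this sketch only shows the split between "which pixels are covered" and "what color each one gets".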

Early Vector Architectures

Early programmable GPUs used vec4 ALUs optimized for RGBA color processing. Each ALU processed all four components of a single pixel simultaneously (named X, Y, Z, W, or equivalently R, G, B, A), but with a critical constraint: all four components had to execute the same operation.

ADD vec4 → [ADD on X | ADD on Y | ADD on Z | ADD on W]

This worked well for uniform color operations. The problem arose when shaders needed different operations per component, or scalar operations on single values. A scalar addition would mask out three components:

ADD scalar → [ADD on X | masked | masked | masked]

Three out of four lanes are wasted on every such instruction. As shader workloads evolved to include more scalar and mixed operations, utilization dropped significantly.
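The utilization loss is simple arithmetic, sketched below in Python. The instruction mixes are made-up examples, not measurements from real shaders; each entry is the number of components an instruction actually writes.

```python
LANES = 4  # X, Y, Z, W


def utilization(instructions):
    """Fraction of lane-slots doing useful work.

    Each entry is the number of components (1 to 4) the instruction writes;
    the remaining lanes are masked out for that instruction.
    """
    used = sum(instructions)
    total = LANES * len(instructions)
    return used / total


color_shader = [4, 4, 4, 4]        # pure vec4 color math
scalar_shader = [1, 1, 1]          # scalar-only code
mixed_shader = [4, 1, 1, 3, 1, 2]  # hypothetical mix of widths

print(utilization(color_shader))   # 1.0
print(utilization(scalar_shader))  # 0.25
print(utilization(mixed_shader))   # 0.5
```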

VLIW Architecture

Very Long Instruction Word (VLIW) architectures solved the masking problem by allowing different opcodes per ALU within a single instruction.

The hardware—multiple ALUs per shader unit—already existed. VLIW changed the control mechanism: instead of one shared opcode, the instruction encodes separate opcodes for each ALU, each writing to a different destination component.

Without VLIW (4 cycles):
  Instruction 1: ADD on X [mask Y,Z,W]
  Instruction 2: MUL on Y [mask X,Z,W]
  Instruction 3: SUB on Z [mask X,Y,W]
  Instruction 4: DIV on W [mask X,Y,Z]

With VLIW (1 cycle):
  Instruction 1: [ADD on X | MUL on Y | SUB on Z | DIV on W]

The cost shifts to the compiler, which must find independent operations to pack together. This tradeoff makes sense for GPUs: with thousands of cores, every transistor saved per core is multiplied across the whole chip. Out-of-order superscalar designs (where hardware finds parallelism dynamically) need dependency-tracking logic, register renaming, and reorder buffers in every core, which is prohibitively expensive at GPU scale.
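As a rough illustration of the compiler's job, the following Python sketch packs operations greedily into four-slot bundles. It is a toy under stated assumptions: every destination is written only once (so only read-after-write dependencies matter), any operation may use any slot, and the `pack` function and its tuple format are hypothetical rather than taken from a real compiler.

```python
SLOTS = 4  # one slot per ALU (X, Y, Z, W)


def pack(ops):
    """Greedily pack ops into bundles.

    ops: list of (dest, opcode, sources) in program order.
    An op may only go into a bundle that comes after the bundles
    producing its sources.
    """
    bundles = []   # each bundle is a list of ops issued together
    ready_at = {}  # dest -> index of the first bundle allowed to read it
    for dest, opcode, sources in ops:
        # Values never produced by an op (shader inputs) are ready at bundle 0.
        i = max((ready_at.get(s, 0) for s in sources), default=0)
        while i < len(bundles) and len(bundles[i]) >= SLOTS:
            i += 1  # bundle is full, try the next one
        if i == len(bundles):
            bundles.append([])
        bundles[i].append((dest, opcode, sources))
        ready_at[dest] = i + 1
    return bundles


independent = [
    ("r0", "ADD", ("a", "b")),
    ("r1", "MUL", ("c", "d")),
    ("r2", "SUB", ("e", "f")),
    ("r3", "DIV", ("g", "h")),
]
chain = [
    ("r0", "ADD", ("a", "b")),
    ("r1", "MUL", ("r0", "c")),
    ("r2", "SUB", ("r1", "d")),
    ("r3", "DIV", ("r2", "e")),
]

print(len(pack(independent)))  # 1 bundle: all four ops issue together
print(len(pack(chain)))        # 4 bundles: each op waits for the previous one
```

The dependent chain shows the limit of the approach: when the shader offers no independent work, the extra slots stay empty no matter how clever the compiler is.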

VLIW Variants

AMD’s TeraScale architecture used VLIW5: five scalar ALUs (X, Y, Z, W, plus a transcendental unit) with five opcodes per instruction.

An alternative design is VLIW 3+1: one vec3 unit plus one scalar unit, encoding two opcodes per instruction. The vec3 unit handles operations on three-component vectors (like a dot product, which reduces two vec3 inputs to one scalar), while the scalar unit executes an independent operation:

V_DP3_S_MUL R2, R0, R1

Vec3 unit: DP3 on X,Y,Z → (R0.x × R1.x) + (R0.y × R1.y) + (R0.z × R1.z)
Scalar unit: MUL on W → R0.w × R1.w

This acknowledges that RGB operations often differ from alpha channel operations, and many calculations (lighting, normals) are inherently 3-component.
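The behavior of the combined instruction can be modeled in a few lines of Python. The function name mirrors the mnemonic above, but the model is only a sketch: the text does not say how the two results are laid out in R2, so the function simply returns them as a pair.

```python
def v_dp3_s_mul(r0, r1):
    """Model one 3+1 instruction: DP3 on x, y, z and an independent MUL on w.

    r0 and r1 are (x, y, z, w) tuples. Returns (dp3_result, mul_result);
    the placement of these results in the destination register is an
    assumption left open here.
    """
    x0, y0, z0, w0 = r0
    x1, y1, z1, w1 = r1
    dp3 = x0 * x1 + y0 * y1 + z0 * z1  # vec3 unit
    mul = w0 * w1                      # scalar unit, same cycle
    return dp3, mul


r0 = (1.0, 2.0, 3.0, 4.0)
r1 = (5.0, 6.0, 7.0, 0.5)
print(v_dp3_s_mul(r0, r1))  # (38.0, 2.0)
```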

SIMT and Massive Parallelism

VLIW exploits parallelism within a single pixel’s components. The massive parallelism of GPUs comes from a second level: executing the same shader across many pixels simultaneously.

SIMT (Single Instruction, Multiple Threads) means the GPU takes your shader and runs it on, say, 64 pixels at once. Each “thread” processes a different pixel but executes the same instruction at the same time.

// Your shader (runs on ONE pixel)
float x = a + b;
// GPU execution (64 pixels in parallel, same cycle)
Pixel 0:  x = a + b
Pixel 1:  x = a + b
...
Pixel 63: x = a + b

Even scalar shader instructions become SIMD operations when viewed across threads.

Waves, Warps, and Lock-Step Execution

A group of threads executing together has vendor-specific names:

  • Wave, Wavefront (AMD, typically 64 threads)
  • Warp (NVIDIA, typically 32 threads)
  • Subgroup (Vulkan/OpenGL, varies)

All threads in a wave execute in lock-step: the same instruction, the same cycle. They cannot independently diverge.

This constraint exists because lock-step execution allows 64 threads to share one instruction fetch unit, one decoder, and one scheduler. Without it, each thread would need its own control logic, multiplying that hardware roughly 64-fold.

Important: The term lane means different things in CPU vs GPU contexts. In CPU SIMD, a lane is a portion of a wide register (e.g., one 32-bit segment of a 128-bit register). In GPU SIMT, a lane is an entire thread with its own registers.

Control Flow and Divergence

Lock-step execution complicates branching. When threads in a wave encounter a conditional:

if (x > 0) {
    // branch A
} else {
    // branch B
}

If all threads agree (all take A, or all take B), the GPU can skip the other branch entirely. This is uniform control flow.

If threads disagree (some need A, others need B), the wave is divergent. The GPU must execute both branches, masking inactive threads in each, so the wave pays for both: roughly double the time when the two branches are similar in length.
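The difference between uniform and divergent control flow can be sketched with an explicit execution mask. This Python model is illustrative only; `run_branch` and its `passes` counter are hypothetical names, and real hardware tracks the mask in a dedicated register rather than a list.

```python
def run_branch(xs, branch_a, branch_b):
    """Execute `if x > 0: A else: B` for one wave in lock-step.

    Returns (results, passes), where passes counts how many branch
    bodies the whole wave had to step through.
    """
    mask_a = [x > 0 for x in xs]
    mask_b = [not m for m in mask_a]
    results = [None] * len(xs)
    passes = 0
    for mask, body in ((mask_a, branch_a), (mask_b, branch_b)):
        if not any(mask):
            continue  # uniform case: no lane needs this side, skip it
        passes += 1
        for lane, active in enumerate(mask):
            if active:  # inactive lanes are masked off and do nothing
                results[lane] = body(xs[lane])
    return results, passes


def double(x):
    return x * 2


def negate(x):
    return -x


print(run_branch([1, 2, 3, 4], double, negate))    # ([2, 4, 6, 8], 1)
print(run_branch([1, -2, 3, -4], double, negate))  # ([2, 2, 6, 4], 2)
```

The uniform wave makes one pass; the divergent wave makes two, even though each individual thread only needed one branch.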

GPU control flow support evolved through three stages:

The naive implementation executes both branches unconditionally, then uses a conditional move (CMOV) to select results. Cost: 2× execution time, 2× energy.

Predicated instructions still issue both branches, but masked threads don’t execute them (their ALUs idle). Cost: 2× execution time, roughly 1× energy.

True branching with jump instructions can skip code entirely, but only when all threads in the wave take the same path.

Divergent loops are particularly costly: the wave iterates as many times as the thread requiring the most iterations. Threads that finish early are simply masked off and wait.
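A small Python model shows how much work a divergent loop can waste. The per-thread trip counts are invented for illustration, and `run_wave_loop` is a hypothetical helper, not a real API.

```python
def run_wave_loop(trip_counts):
    """Simulate one wave running a loop with a different trip count per lane.

    Returns (wave_iterations, idle_lane_slots). The wave keeps iterating
    until the slowest lane is done; finished lanes sit masked off.
    """
    remaining = list(trip_counts)
    iterations = 0
    idle_slots = 0
    while any(r > 0 for r in remaining):
        iterations += 1
        for lane, r in enumerate(remaining):
            if r > 0:
                remaining[lane] -= 1  # active lane does one iteration
            else:
                idle_slots += 1       # masked lane waits
    return iterations, idle_slots


print(run_wave_loop([5, 5, 5, 5]))  # (5, 0): uniform, nothing wasted
print(run_wave_loop([2, 2, 2, 8]))  # (8, 18): one slow lane stalls the rest
```

In the second case the wave occupies 32 lane-slots to do 14 iterations of useful work.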

Cross-Lane Operations

Normally each thread can only access its own registers. Cross-lane operations let threads read values directly from other threads’ registers within the same wave.

These operations are sometimes called cross-lane swizzles, by analogy with vector swizzling. Vector swizzling reorders components within a vector (vec.wzyx). Cross-lane swizzling reorders values across threads:

Before:
  Thread 0: R0 = 10
  Thread 1: R0 = 20
  Thread 2: R0 = 30
  Thread 3: R0 = 40

After cross-lane "rotate right" on R0 (a shift that wraps around):
  Thread 0: R0 = 40  (from Thread 3)
  Thread 1: R0 = 10  (from Thread 0)
  Thread 2: R0 = 20  (from Thread 1)
  Thread 3: R0 = 30  (from Thread 2)

Cross-lane operates on specific values, not entire thread state. Swapping all registers would merely swap thread identities.

This is fast because all threads in a wave share the same physical register file. The alternative—communicating through shared memory—requires explicit stores, synchronization, and loads.
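The four-thread example above can be reproduced with a short Python model. Because the value leaving the last thread wraps around to thread 0, the operation is modeled as a rotate. The dictionary-of-lists register layout and the function name are assumptions made for this sketch; real GPUs expose such operations through vendor-specific instructions or shading-language intrinsics.

```python
def cross_lane_rotate_right(wave_regs, reg, shift=1):
    """Return a copy of wave_regs with only `reg` rotated across lanes.

    wave_regs maps a register name to a list holding one value per lane.
    After the call, lane i holds the value lane (i - shift) had before.
    """
    values = wave_regs[reg]
    n = len(values)
    rotated = [values[(lane - shift) % n] for lane in range(n)]
    new_regs = dict(wave_regs)
    new_regs[reg] = rotated  # other registers are left untouched
    return new_regs


wave = {"R0": [10, 20, 30, 40], "R1": [1, 2, 3, 4]}
after = cross_lane_rotate_right(wave, "R0")
print(after["R0"])  # [40, 10, 20, 30]
print(after["R1"])  # [1, 2, 3, 4]
```

Only R0 moves; R1 stays with its original thread, which is the point made above about operating on specific values rather than whole thread state.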

Temporal SIMT

Building a 64-wide SIMD unit is expensive. Temporal SIMT uses narrower hardware (e.g., 16-wide) and issues the same instruction multiple times across cycles.

AMD GCN uses 16-wide SIMD blocks. A 64-thread wave completes one instruction over 4 cycles:

Cycle 1: threads 0-15
Cycle 2: threads 16-31
Cycle 3: threads 32-47
Cycle 4: threads 48-63

The advantage emerges when considering multiple waves. A GPU processing thousands of pixels has many waves in flight. GCN’s scheduler can interleave them:

Cycle 1: Wave 0, threads 0-15
Cycle 2: Wave 1, threads 0-15
Cycle 3: Wave 2, threads 0-15
Cycle 4: Wave 3, threads 0-15
Cycle 5: Wave 0, threads 16-31
...

If Wave 0 stalls waiting for memory, the scheduler continues with other waves. A 64-wide design completing whole waves per cycle has fewer interleaving opportunities.
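The interleaved issue order shown above can be generated mechanically. The Python sketch below is an idealized model of that listing only, with no stalls and a strict round-robin over waves; it is not a description of how any real scheduler chooses waves, and `issue_schedule` is a hypothetical name.

```python
def issue_schedule(num_waves, wave_size=64, simd_width=16):
    """Yield (cycle, wave, first_thread, last_thread) in interleaved order.

    Each wave needs wave_size // simd_width passes through the SIMD unit;
    the passes of different waves are interleaved round-robin.
    """
    slices = wave_size // simd_width
    cycle = 1
    for s in range(slices):
        for wave in range(num_waves):
            first = s * simd_width
            yield cycle, wave, first, first + simd_width - 1
            cycle += 1


for cycle, wave, first, last in list(issue_schedule(4))[:5]:
    print(f"Cycle {cycle}: Wave {wave}, threads {first}-{last}")
```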

GPU Memory Hierarchy

Discrete GPUs have dedicated VRAM (the “16GB” in product specs): RAM on the graphics card, physically separate from system memory. The hierarchy:

Level       Latency
Registers   ~1 cycle
L1 Cache    ~20 cycles
VRAM        ~200-400 cycles

VRAM has high bandwidth but high latency. Memory stalls are the primary reason GPUs maintain many waves in flight—while one wave waits for data, others execute.

Core Structure

A GPU core contains several components:

┌──────────────────────────────┐
│           Core               │
│                              │
│  ┌─────┐ ┌─────┐ ┌────────┐  │
│  │ ALU │ │ ALU │ │ Load/  │  │
│  │     │ │     │ │ Store  │  │
│  │     │ │     │ │ Unit   │  │
│  └──┬──┘ └──┬──┘ └───┬────┘  │
│     └───────┴────────┘       │
│              │               │
│      ┌───────┴───────┐       │
│      │   Registers   │       │
│      └───────────────┘       │
└──────────────────────────────┘

ALUs perform arithmetic only. The Load/Store Unit handles memory requests separately. When a shader loads from memory:

  1. Load/Store Unit sends request to memory controller
  2. ALU is immediately free
  3. Memory controller fetches data (200+ cycles)
  4. Data arrives in registers
  5. Wave can continue

The wave blocks (cannot proceed without data), but the ALUs are available for other waves. This separation is what enables latency hiding through wave interleaving.
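How many waves it takes to hide memory latency can be estimated with a toy simulation. The Python model below assumes a single ALU that issues one instruction per cycle, waves that alternate a fixed amount of arithmetic with one load, and a fixed 200-cycle latency; the figure of 10 compute cycles per load is invented for illustration.

```python
def alu_utilization(num_waves, compute=10, latency=200, cycles=10_000):
    """Fraction of cycles in which the ALU issues an instruction.

    Every wave repeats: `compute` cycles of ALU work, then one load that
    blocks the wave for `latency` cycles while the ALU stays available.
    """
    work_left = [compute] * num_waves  # ALU cycles before the next load
    ready_at = [0] * num_waves         # cycle when the wave can run again
    busy = 0
    for now in range(cycles):
        for w in range(num_waves):
            if ready_at[w] <= now:
                # Issue one instruction from the first ready wave.
                busy += 1
                work_left[w] -= 1
                if work_left[w] == 0:
                    # The wave issues its load and blocks until data arrives.
                    work_left[w] = compute
                    ready_at[w] = now + 1 + latency
                break
    return busy / cycles


for n in (1, 10, 21):
    print(n, alu_utilization(n))
```

Under these assumptions a single wave keeps the ALU busy less than 5% of the time, while about (compute + latency) / compute = 21 waves are enough to keep it fully occupied.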