Multi-threading means that a single core can execute multiple threads. It is achieved in two ways:

  • simultaneous multi-threading (SMT), which Intel markets as Hyper-Threading
  • temporal multi-threading (TMT)

Simultaneous multi-threading (SMT)

In simultaneous multi-threading, instructions from more than one thread can occupy any given pipeline stage in the same cycle. The main additions to the processor architecture are the ability to fetch instructions from multiple threads in a cycle and a larger register file to hold data from multiple threads.
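As a rough illustration of why sharing issue slots helps, the following toy model fills the issue slots of one cycle from whichever threads have instructions ready. The function name `issue_cycle`, the `width` parameter, and the round-robin fill policy are assumptions made for this sketch; real SMT cores use far more elaborate fetch and issue policies.

```python
def issue_cycle(ready, width):
    """Fill up to `width` issue slots in one cycle.

    `ready` maps a thread id to the number of instructions that thread
    could issue this cycle (a hypothetical input for this sketch).
    Returns a list of thread ids, one entry per filled slot.
    """
    slots = []
    remaining = dict(ready)
    # Take one instruction per thread in turn until the slots are full
    # or no thread has anything left to issue.
    while len(slots) < width and any(remaining.values()):
        for tid in sorted(remaining):
            if remaining[tid] > 0 and len(slots) < width:
                slots.append(tid)
                remaining[tid] -= 1
    return slots


# One thread with two ready instructions leaves half of a 4-wide core idle.
single = issue_cycle({0: 2}, width=4)
# A second hardware thread can use the slots the first one cannot fill.
smt = issue_cycle({0: 2, 1: 3}, width=4)
print(len(single), len(smt))
```

In this model the second thread does not make the first one faster; it only raises the number of slots that do useful work per cycle.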

The most notable implementations are:

  • Intel Hyper-Threading (HT): first introduced in 2002 in the Intel Xeon and Pentium 4 and still used today, with two hardware threads per core
  • AMD SMT: introduced in 2017 with the AMD Ryzen processors (Zen microarchitecture)
  • IBM POWER processors: first implemented in POWER5 in 2004 with two hardware threads per core, later extended to up to eight per core (POWER8 in 2014, POWER9 in 2017)
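  • Sun Microsystems UltraSPARC T1: launched in 2005 with four hardware threads per core; each core issues from only one thread per cycle, so strictly speaking it is fine-grained multi-threading rather than SMT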

Temporal multi-threading (TMT)

In coarse-grained temporal multi-threading, the main processor pipeline contains a single thread at a time, and the processor must effectively perform a rapid context switch (thread switch) before executing a different thread.

Tip

In this case, the multi-threading comes from the processor's ability to switch threads quickly, often within a few cycles, compared to the 100+ cycles required in a traditional architecture. This is achieved by keeping multiple thread states in hardware, which avoids the need to save and restore registers on every switch.
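The effect of the switch cost can be sketched with a small switch-on-miss simulation. Everything here is an assumption for illustration: the function name `run_coarse_grained`, the burst lengths, the 20-cycle miss latency, and the policy of always switching to the thread that becomes ready first. It is a toy model, not a description of any particular processor.

```python
from collections import deque


def run_coarse_grained(bursts, miss_latency, switch_cost):
    """Simulate switch-on-miss scheduling on one pipeline.

    bursts[t] lists the compute-burst lengths (in cycles) of thread t;
    every burst except a thread's last ends in a miss that stalls the
    thread for `miss_latency` cycles. Returns (total_cycles, busy_cycles).
    """
    pending = [deque(b) for b in bursts]
    ready_at = [0] * len(bursts)  # cycle at which each thread may run again
    clock = busy = 0
    current = None
    while any(pending):
        runnable = [t for t, q in enumerate(pending) if q]
        # Pick the unfinished thread that becomes ready earliest.
        nxt = min(runnable, key=lambda t: ready_at[t])
        clock = max(clock, ready_at[nxt])  # idle if every thread is stalled
        if current is not None and nxt != current:
            clock += switch_cost  # pay for changing the active thread
        current = nxt
        burst = pending[nxt].popleft()
        clock += burst
        busy += burst
        if pending[nxt]:
            ready_at[nxt] = clock + miss_latency
    return clock, busy


work = [[10, 10], [10, 10]]  # two threads, two bursts each
fast = run_coarse_grained(work, miss_latency=20, switch_cost=2)
slow = run_coarse_grained(work, miss_latency=20, switch_cost=100)
print(fast, slow)  # (54, 40) (340, 40)
```

Both runs do the same 40 cycles of useful work. With a 2-cycle switch most of the miss latency is overlapped with the other thread, while with a 100-cycle switch the switching itself dominates the run time.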

In fine-grained (or interleaved) temporal multi-threading, the main processor pipeline may contain multiple threads, with context switches effectively occurring between pipe stages (e.g., in a barrel processor). This form of multi-threading can be more expensive than the coarse-grained form because execution resources that span multiple pipe stages may have to deal with multiple threads.
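A barrel-style schedule can be sketched as a strict round-robin over hardware threads, one issue slot per cycle. The names `barrel_schedule` and `hides_latency` are hypothetical helpers for this sketch, and the latency check is a deliberately simplified rule of thumb: if there are at least as many threads as an operation takes cycles, a thread's previous operation has finished by the time its turn comes around again.

```python
from itertools import cycle, islice


def barrel_schedule(num_threads, num_cycles):
    """Return the thread id that issues on each cycle (strict round-robin)."""
    return list(islice(cycle(range(num_threads)), num_cycles))


def hides_latency(num_threads, op_latency):
    """True if a thread's previous operation completes before its next slot."""
    return num_threads >= op_latency


print(barrel_schedule(4, 10))  # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
print(hides_latency(num_threads=4, op_latency=3))  # True
print(hides_latency(num_threads=4, op_latency=8))  # False
```

Because consecutive pipeline stages hold instructions from different threads, a single thread never waits on its own in-flight instruction in this model. The trade-off is that each thread runs at only a fraction of the pipeline's peak rate.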

Examples of popular TMT implementations are:

  • Barrel Processors (Fine-Grained TMT): the CDC 6600 Peripheral Processors (as early as 1964), which switched threads on every cycle to hide memory latency, and Sun Microsystems’ MAJC processor (1999)
  • Coarse-Grained TMT in GPUs: NVIDIA GPUs (starting in 2006) schedule different warps (groups of threads) when others stall, though their primary parallelism comes from SIMT (Single Instruction, Multiple Threads).
  • Cray XMT Supercomputers: Introduced in 2007, the Cray XMT implemented coarse-grained multi-threading to handle large-scale problems like graph traversal. The processor switched threads on memory latency to maximize pipeline utilization.