Multi-threading means that a single core can execute multiple threads. It is achieved in two ways:
- simultaneous multi-threading (SMT), which Intel markets as Hyper-Threading
- temporal multi-threading (TMT)
Simultaneous multi-threading (SMT)
In simultaneous multi-threading, instructions from more than one thread can be executed in any given pipeline stage at a time. The main additions to the processor architecture are the ability to fetch instructions from multiple threads in a cycle, and a larger register file to hold data from multiple threads.
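To make the benefit concrete, here is a toy model of a single pipeline stage (issue) on a superscalar core. It is only a sketch under simplifying assumptions: the 4-wide issue width and the per-cycle "ready instruction" counts are made up, and instructions that fail to issue are not carried over to the next cycle. It compares how many issue slots are filled when one thread runs alone and when two threads share the core.

```python
def issue_slots_used(ready_per_thread, width):
    """Return how many issue slots are filled in one cycle.

    ready_per_thread: number of ready instructions each hardware thread
    offers this cycle. Slots are filled thread by thread until the core's
    issue width is exhausted.
    """
    used = 0
    for ready in ready_per_thread:
        used += min(ready, width - used)
    return used


def utilization(trace, width):
    """Fraction of issue slots filled over a trace of cycles."""
    total = sum(issue_slots_used(cycle, width) for cycle in trace)
    return total / (width * len(trace))


# Hypothetical per-cycle ready counts for two threads on a 4-wide core.
thread_a = [2, 0, 1, 3, 0, 2]
thread_b = [1, 3, 0, 1, 2, 2]

single = utilization([[a] for a in thread_a], width=4)
smt = utilization([[a, b] for a, b in zip(thread_a, thread_b)], width=4)
print(f"one thread:  {single:.2f}")
print(f"two threads: {smt:.2f}")
```

With these invented numbers the second thread fills slots that the first one leaves empty, which is the effect SMT is designed to exploit. Real cores also share caches and execution units between threads, so actual gains are typically smaller.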
The most notable implementations are:
- Intel Hyper-Threading (HT): first introduced in 2002 in the Intel Pentium 4 and still used today, with two hardware threads per core
- AMD SMT: introduced in 2017 with the AMD Ryzen processors (Zen microarchitecture)
- IBM POWER Processors: first implemented in POWER5 in 2004 with two hardware threads per core, and later extended to up to eight per core (POWER8 in 2014, POWER9 in 2017)
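- Sun Microsystems UltraSPARC T1: launched in 2005, with four hardware threads per core (strictly a fine-grained, interleaved design rather than true SMT, since each core issues from one thread per cycle)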
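On a Linux machine, the number of hardware threads per core can usually be read from sysfs. The sketch below assumes the standard `thread_siblings_list` topology file is present; the helper names `parse_cpu_list` and `hardware_threads_per_core` are illustrative, not part of any library. On other operating systems, or where sysfs is unavailable, it returns `None`.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a Linux cpulist string such as "0,4" or "0-3,8" into CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def hardware_threads_per_core(cpu=0):
    """Return how many hardware threads share a core with `cpu`, or None."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    try:
        return len(parse_cpu_list(path.read_text()))
    except OSError:
        return None  # not Linux, or sysfs is not available


print(hardware_threads_per_core())
```

A result of 2 would be typical for a core with Hyper-Threading or AMD SMT enabled, and 1 for a core with SMT disabled or absent.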
Temporal multi-threading
In coarse-grained temporal multi-threading, the main processor pipeline contains a single thread at a time, and the processor must effectively perform a rapid context switch (thread switch) before executing a different thread.
Tip
In this case the multi-threading comes from the processor's ability to switch threads quickly, often within a few cycles, compared with the 100+ cycles a context switch requires on a traditional architecture. This is achieved by keeping multiple thread states in hardware, so registers do not have to be saved and restored on every switch.
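The effect of switch cost can be illustrated with a small switch-on-stall simulation. This is a sketch under invented assumptions: every thread alternates a 10-cycle compute burst with a 40-cycle memory stall, the core always switches to the thread that becomes ready first, and `run_coarse_grained` and its parameters are hypothetical names chosen for illustration.

```python
def run_coarse_grained(num_threads, bursts, burst_len, miss_latency, switch_cost):
    """Simulate switch-on-stall multi-threading.

    Each thread executes `bursts` compute bursts of `burst_len` cycles and
    stalls for `miss_latency` cycles after every burst (a modelled cache
    miss). The core runs one thread at a time and pays `switch_cost` cycles
    whenever it changes to a different thread. Returns the cycle at which
    the last compute burst finishes.
    """
    remaining = [bursts] * num_threads  # bursts left per thread
    ready_at = [0] * num_threads        # cycle at which each thread can run again
    cycle = 0
    current = None
    while any(remaining):
        # Pick the unfinished thread that becomes ready first.
        candidates = [t for t in range(num_threads) if remaining[t]]
        nxt = min(candidates, key=lambda t: ready_at[t])
        cycle = max(cycle, ready_at[nxt])  # idle if nobody is ready yet
        if current is not None and nxt != current:
            cycle += switch_cost           # thread switch penalty
        current = nxt
        cycle += burst_len                 # run until the thread stalls
        remaining[nxt] -= 1
        ready_at[nxt] = cycle + miss_latency
    return cycle


# Same total work (12 bursts) in three configurations.
one_thread = run_coarse_grained(1, 12, 10, 40, switch_cost=0)
fast_switch = run_coarse_grained(4, 3, 10, 40, switch_cost=2)
slow_switch = run_coarse_grained(4, 3, 10, 40, switch_cost=100)
print(one_thread, fast_switch, slow_switch)
```

In this model, four threads with a 2-cycle switch hide most of the memory latency, while a 100-cycle switch makes interleaving slower than simply letting one thread wait out its stalls. That is why the thread state has to live in hardware for this scheme to pay off.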
In fine-grained (interleaved) temporal multi-threading, the main processor pipeline may contain multiple threads, with context switches effectively occurring between pipe stages (e.g., in a barrel processor). This form of multi-threading can be more expensive than the coarse-grained form because execution resources that span multiple pipe stages may have to deal with multiple threads.
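A barrel processor can be sketched as a fixed rotation over hardware threads, one issue slot per cycle. The model below is a simplification with assumed parameters: every instruction takes `latency` cycles before the same thread may issue again, and `run_barrel` is a hypothetical helper, not a description of any specific machine.

```python
def run_barrel(num_threads, instrs_per_thread, latency):
    """Rotate over threads, one issue slot per cycle.

    Returns (total_cycles, bubbles), where a bubble is a slot in which the
    scheduled thread could not issue.
    """
    remaining = [instrs_per_thread] * num_threads
    ready_at = [0] * num_threads
    cycle = 0
    bubbles = 0
    while any(remaining):
        t = cycle % num_threads  # fixed rotation: each thread owns every n-th slot
        if remaining[t] and ready_at[t] <= cycle:
            remaining[t] -= 1
            ready_at[t] = cycle + latency  # thread may issue again after `latency` cycles
        else:
            bubbles += 1  # slot wasted: thread finished or still waiting
        cycle += 1
    return cycle, bubbles


print(run_barrel(num_threads=4, instrs_per_thread=3, latency=4))
print(run_barrel(num_threads=2, instrs_per_thread=3, latency=4))
```

When the number of threads is at least the instruction latency, each thread's previous instruction has completed by the time its slot comes around again, so the pipeline stays full. With fewer threads, the rotation leaves empty slots.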
Examples of popular TMT implementations are:
- Barrel Processors (Fine-Grained TMT): the CDC 6600 Peripheral Processors (as early as 1964), which switched threads at every pipeline stage to hide memory latency, and Sun Microsystems’ MAJC processor (1999)
- Coarse-Grained TMT in GPUs: NVIDIA GPUs (starting in 2006) schedule different warps (groups of threads) when others stall, though their primary parallelism comes from SIMT (Single Instruction, Multiple Threads).
- Cray XMT Supercomputers: Introduced in 2007, the Cray XMT used hardware multi-threading to handle large-scale problems like graph traversal. Its processors, which descend from the Tera MTA design, switched between hardware threads to hide memory latency and maximize pipeline utilization.