GPU Compute Runtimes
Why GPU Compute Runtimes Matter
GPUs (Graphics Processing Units) were originally designed for graphics rendering: computing the color of millions of pixels in parallel for every frame on a display. Their architecture reflects this origin. Instead of a few powerful cores optimized for sequential work (like a CPU), a GPU has thousands of simple cores designed to execute the same operation on many data points simultaneously. This execution model is called SIMD (Single Instruction, Multiple Data); NVIDIA calls its thread-oriented variant SIMT (Single Instruction, Multiple Threads).
This architecture turns out to be ideal for matrix multiplication — the core mathematical operation in neural network training and inference. A single forward pass through a neural network is essentially a chain of matrix multiplications interspersed with activation functions. A GPU can perform thousands of these multiply-accumulate operations in parallel, achieving orders-of-magnitude speedups over a CPU for the same workload.
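To make the "chain of matrix multiplications" concrete, here is a minimal pure-Python sketch of a two-layer forward pass. The input, weights, and function names are made up for illustration; a real framework would hand the same arithmetic to a GPU library rather than loop in Python.

```python
def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix stored as nested lists.

    Returns the product and the number of multiply-accumulate (MAC) steps.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    out = [[0.0] * n for _ in range(m)]
    macs = 0
    for i in range(m):        # each (i, j) output element is independent,
        for j in range(n):    # so a GPU can compute them in parallel
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # one multiply-accumulate
                macs += 1
            out[i][j] = acc
    return out, macs


def relu(matrix):
    """Activation function applied element by element."""
    return [[max(0.0, v) for v in row] for row in matrix]


# Hypothetical toy network: 3 inputs -> 2 hidden units -> 1 output.
x = [[1.0, 2.0, 3.0]]
w1 = [[0.5, -1.0], [1.5, -2.0], [-0.5, 1.0]]
w2 = [[2.0], [-1.0]]

hidden, macs1 = matmul(x, w1)
hidden = relu(hidden)
output, macs2 = matmul(hidden, w2)
print(output, macs1 + macs2)
```

Even this tiny network needs 8 multiply-accumulates for one input. The count grows with the product of the matrix dimensions, which is why hardware that runs many of them at once pays off.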
However, GPUs do not natively understand Python or ML frameworks. To use a GPU for general computation — a practice known as GPGPU (General-Purpose computing on Graphics Processing Units) — you need a compute runtime: a software layer that provides a compiler, driver interface, memory management, and a library of optimized operations (kernels) so that application code can dispatch work to the GPU hardware.
The choice of compute runtime determines which GPU hardware you can use, which ML frameworks are available to you, and how much of the GPU’s theoretical performance you can actually access.
CUDA
CUDA (Compute Unified Device Architecture) is NVIDIA’s proprietary GPU compute platform, announced in 2006 and first released in 2007. It is the dominant standard for AI/ML (Artificial Intelligence / Machine Learning) workloads.
What CUDA Provides
- nvcc: a C/C++ compiler that compiles GPU code (called “kernels”) into PTX (Parallel Thread Execution) intermediate representation and then into GPU machine code.
- Runtime API (Application Programming Interface): functions for allocating GPU memory, copying data between CPU and GPU, and launching kernels.
- cuDNN (CUDA Deep Neural Network library): a library of highly optimized primitives for neural network operations — convolutions, pooling, normalization, and attention. Framework authors (PyTorch, TensorFlow) call cuDNN rather than implementing these operations from scratch.
- cuBLAS (CUDA Basic Linear Algebra Subprograms): optimized matrix multiplication and linear algebra routines. This is where much of the raw performance comes from.
- Tensor Cores: specialized hardware units on NVIDIA GPUs (Volta architecture and later) that perform mixed-precision matrix multiply-accumulate operations on small matrix tiles, completing many multiply-accumulates per clock cycle. cuDNN and cuBLAS use Tensor Cores automatically when the hardware and data types allow it.
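The "mixed-precision" part of Tensor Cores can be illustrated numerically without a GPU. The sketch below is an assumption-laden toy: it uses Python's `struct` half-precision format to round values to FP16, and a normal Python float (64-bit) stands in for the wider FP32 accumulator. It models only the rounding behaviour, not how Tensor Cores are programmed.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot_fp16_accumulator(a, b):
    """Dot product where the inputs AND the running sum are kept in FP16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(x) * to_fp16(y))
    return acc


def dot_wide_accumulator(a, b):
    """Dot product with FP16 inputs but a wider accumulator (mixed precision)."""
    acc = 0.0  # 64-bit Python float, standing in for an FP32 accumulator
    for x, y in zip(a, b):
        acc += to_fp16(x) * to_fp16(y)
    return acc


ones = [1.0] * 4096
narrow = dot_fp16_accumulator(ones, ones)
wide = dot_wide_accumulator(ones, ones)
print(narrow, wide)
```

With these inputs the FP16 accumulator stops growing at 2048, because 2049 is not representable in half precision, while the wider accumulator reaches 4096. Keeping the inputs narrow and the running sum wide is the trade-off that mixed-precision hardware makes.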
CUDA Cores
A “CUDA core” is NVIDIA’s term for a single scalar processing unit on the GPU. Each CUDA core can execute one floating-point or integer operation per clock cycle. Modern GPUs have thousands of them: for example, an RTX 3060 has 3,584 CUDA cores, and an H100 SXM data center GPU has 16,896.
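Core counts translate into a rough peak-throughput estimate. The arithmetic below is a back-of-the-envelope sketch under two assumptions that are not stated above: a boost clock of about 1.78 GHz for the RTX 3060, and the common convention of counting a fused multiply-add as two floating-point operations per core per clock.

```python
def peak_fp32_tflops(cuda_cores, boost_clock_ghz, flops_per_core_per_clock=2):
    """Estimate theoretical peak FP32 throughput in TFLOPS.

    cores * GHz gives billions of operations per second; dividing by 1000
    converts GFLOPS to TFLOPS.
    """
    return cuda_cores * boost_clock_ghz * flops_per_core_per_clock / 1000.0


# 3,584 cores comes from the text above; the clock speed is an assumption.
rtx_3060 = peak_fp32_tflops(cuda_cores=3584, boost_clock_ghz=1.78)
print(f"RTX 3060 estimated peak: {rtx_3060:.1f} TFLOPS")
```

This is an upper bound. Real workloads are usually limited by memory bandwidth and scheduling long before every core is busy on every cycle.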
Why CUDA Dominates
PyTorch, TensorFlow, JAX, and nearly every ML framework are built primarily on CUDA. The libraries (cuDNN, cuBLAS, NCCL for multi-GPU communication, TensorRT for inference optimization) have years of engineering and optimization behind them. This ecosystem lock-in is NVIDIA’s primary competitive moat: even if competing hardware matches NVIDIA on raw compute, the software gap is enormous.
CUDA officially runs only on NVIDIA GPUs. There is no vendor-supported way to run CUDA binaries on AMD or Intel hardware; code has to be ported or translated to another runtime.
ROCm
ROCm (Radeon Open Compute) is AMD’s open-source GPU compute platform, positioning itself as an alternative to CUDA.
Architecture
ROCm’s key abstraction is HIP (Heterogeneous-compute Interface for Portability) — a C++ runtime API (Application Programming Interface) that is intentionally source-compatible with CUDA. HIP code looks nearly identical to CUDA code: the API calls have the same structure, with function names like hipMalloc mapping directly to cudaMalloc. This design means:
- HIP code compiles for both AMD and NVIDIA GPUs. On AMD hardware, HIP compiles through AMD’s compiler stack. On NVIDIA hardware, HIP calls map directly to CUDA.
- The hipify tool can automatically convert existing CUDA source code to HIP by performing a syntactic translation of API calls.
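The idea of a syntactic translation can be sketched in a few lines. This is a toy, not AMD's tool: the function name `toy_hipify` and the small mapping table are illustrative assumptions, and the real tooling handles far more of the API and its edge cases.

```python
import re

# A handful of CUDA runtime names and their HIP counterparts.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Try longer names first so cudaMemcpyHostToDevice is not cut off at cudaMemcpy.
_PATTERN = re.compile(
    "|".join(re.escape(name) for name in sorted(CUDA_TO_HIP, key=len, reverse=True))
)


def toy_hipify(source):
    """Replace known CUDA identifiers with their HIP equivalents."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(0)], source)


cuda_src = """#include <cuda_runtime.h>
cudaMalloc(&d_x, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
cudaFree(d_x);
"""
hip_src = toy_hipify(cuda_src)
print(hip_src)
```

Because the two APIs share their structure, a rename like this gets surprisingly far. Code that relies on NVIDIA-specific libraries or inline PTX still needs manual porting.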
ROCm includes its own equivalents of NVIDIA’s libraries: rocBLAS (linear algebra), MIOpen (neural network primitives, analogous to cuDNN), and RCCL (multi-GPU communication, analogous to NCCL).
Current State
- PyTorch officially supports ROCm. You can install a ROCm-enabled PyTorch build from the official channels.
- TensorFlow has experimental ROCm support but it is less mature.
- Hardware support is limited. ROCm primarily targets AMD Instinct accelerators (MI250, MI300X — data center GPUs). Some RDNA (Radeon DNA, AMD’s consumer GPU architecture) cards work, but support is inconsistent.
- Primarily Linux. AMD ships a HIP SDK for Windows, but the full ROCm stack and most ML framework builds target Linux.
- Improving rapidly but still has rough edges: driver compatibility issues, missing features, and less community knowledge compared to CUDA.
Vulkan Compute
Vulkan is a low-level, cross-platform graphics and compute API maintained by the Khronos Group (an industry consortium that also maintains OpenGL and OpenCL).
How It Works
Vulkan provides explicit control over the GPU at a level much closer to the hardware than CUDA or ROCm. The programmer manually manages:
- Memory allocation: choosing memory types (device-local, host-visible), allocating buffers, and binding them.
- Command buffers: recording sequences of GPU commands and submitting them to hardware queues.
- Synchronization: inserting barriers and semaphores to ensure correct ordering of operations.
- Shader compilation: compute operations are written as SPIR-V (Standard Portable Intermediate Representation) shaders, compiled from GLSL (OpenGL Shading Language) or HLSL (High-Level Shading Language).
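The record-then-submit pattern and the role of barriers can be modelled conceptually in plain Python. Everything here (`ToyCommandBuffer`, `ToyQueue`, the stage list) is hypothetical and is not the Vulkan API; it only shows that recording does no work and that a barrier splits the work into ordered stages.

```python
class ToyCommandBuffer:
    """Records commands without executing them."""

    def __init__(self):
        self.commands = []

    def dispatch(self, name, fn):
        self.commands.append(("dispatch", name, fn))

    def barrier(self):
        self.commands.append(("barrier", None, None))


class ToyQueue:
    """Executes a recorded command buffer when it is submitted."""

    def submit(self, cmd, buffer):
        stages, current = [], []
        for kind, name, fn in cmd.commands:
            if kind == "barrier":
                # Work after the barrier must wait for the work before it.
                stages.append(current)
                current = []
            else:
                fn(buffer)
                current.append(name)
        stages.append(current)
        return stages


def scale(buf):
    for i in range(len(buf)):
        buf[i] *= 2.0


def add_one(buf):
    for i in range(len(buf)):
        buf[i] += 1.0


data = [1.0, 2.0, 3.0]
cmd = ToyCommandBuffer()
cmd.dispatch("scale", scale)
cmd.barrier()  # add_one must see the results written by scale
cmd.dispatch("add_one", add_one)
snapshot_after_recording = list(data)  # still unchanged: nothing has run yet

stages = ToyQueue().submit(cmd, data)
print(snapshot_after_recording, data, stages)
```

In real Vulkan code the programmer also chooses the memory type for `data`, compiles each operation to SPIR-V, and specifies exactly which memory accesses a barrier orders. That is why the API is considered low-level.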
Why It Matters for GPU Compute
Vulkan works on virtually any modern GPU: NVIDIA, AMD, Intel, Qualcomm (mobile), and Apple (via MoltenVK translation layer). This universality makes it attractive as a fallback compute backend.
Vulkan is not widely used for ML training — the API is far too low-level and lacks the optimized neural network libraries that CUDA provides. However, it is used as an inference backend in some projects. Notably, llama.cpp includes a Vulkan backend, enabling GPU-accelerated LLM inference on any GPU vendor without requiring CUDA or ROCm.
Metal
Metal is Apple’s GPU compute and graphics framework, available on macOS and iOS. It is the primary, Apple-supported way to program the GPU on Apple Silicon chips (M1, M2, M3, M4).
Unified Memory Architecture
Apple Silicon uses a unified memory architecture (UMA): the CPU and GPU share the same physical memory pool. In a discrete GPU system, data must be explicitly copied from CPU RAM to GPU VRAM (Video Random Access Memory) over the PCIe bus — a major bottleneck. On Apple Silicon, both processors can access the same memory region with no copy required. This is particularly beneficial for LLM inference, where models must fit in GPU-accessible memory.
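A quick estimate shows what the PCIe copy costs on a discrete GPU. The numbers are assumptions for illustration: a hypothetical 16 GB set of model weights and a link with roughly 32 GB/s of theoretical bandwidth (about what a PCIe 4.0 x16 slot offers). Achieved bandwidth is normally lower.

```python
def copy_time_seconds(size_gb, bandwidth_gb_per_s):
    """Best-case time to move size_gb of data over a link."""
    return size_gb / bandwidth_gb_per_s


MODEL_SIZE_GB = 16.0          # hypothetical model weights
PCIE_BANDWIDTH_GB_S = 32.0    # assumed theoretical peak of the link

discrete_gpu_copy = copy_time_seconds(MODEL_SIZE_GB, PCIE_BANDWIDTH_GB_S)
print(f"Best-case host-to-device copy: {discrete_gpu_copy:.2f} s")
```

A one-time load of this size is tolerable. The cost matters more when data crosses the bus repeatedly, for example when a model does not fit in VRAM and layers are swapped in and out. With unified memory that transfer step does not exist.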
Neural Engine
Apple Silicon chips include a dedicated Neural Engine: fixed-function hardware designed specifically for ML inference operations (matrix multiplications and convolutions at INT8/INT16 precision). The Neural Engine is accessed through Apple’s Core ML framework, which decides whether to schedule work on the CPU, GPU, or Neural Engine; Metal shaders written by the developer run on the GPU. The Neural Engine is separate from the GPU cores.
MLX
See Model Formats for detailed coverage of MLX, Apple’s open-source ML framework built on Metal.
MLX is one of the main ways to run ML workloads natively on Apple Silicon, leveraging unified memory for efficient data sharing between CPU and GPU operations.
Relevance
Metal is not available on Linux or Windows. It is only relevant for macOS and iOS workloads. For LLM inference on Apple Silicon, both MLX and Ollama (via llama.cpp’s Metal backend) deliver strong performance.
llama.cpp and GGML
llama.cpp is a C/C++ implementation of LLM (Large Language Model) inference, originally written by Georgi Gerganov. It has become the de facto universal inference runtime for running LLMs locally.
Architecture
llama.cpp is built on GGML — a C tensor library (also by Gerganov) that provides the low-level tensor operations (matrix multiplication, attention, normalization). GGML is to llama.cpp what cuBLAS/cuDNN are to PyTorch: the computational foundation.
Models are stored in the GGUF (GGML Universal Format) file format, which packages model weights, architecture metadata, and tokenizer information into a single file. See Model Formats for details on GGUF and quantization options.
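As a small illustration of the single-file layout, the sketch below reads the fixed-size header at the start of a GGUF file. The layout used here (4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata counts, all little-endian) reflects my understanding of GGUF version 2 and later and should be treated as an assumption; the function name is made up, and the bytes being parsed are synthetic rather than taken from a real model.

```python
import struct

GGUF_HEADER = struct.Struct("<4sIQQ")  # magic, version, tensor count, metadata count


def read_gguf_header(data):
    """Parse the fixed-size header from the first bytes of a GGUF file."""
    if len(data) < GGUF_HEADER.size:
        raise ValueError("not enough bytes for a GGUF header")
    magic, version, tensor_count, kv_count = GGUF_HEADER.unpack_from(data, 0)
    if magic != b"GGUF":
        raise ValueError("missing GGUF magic bytes")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for demonstration; the counts are arbitrary.
fake_file_start = GGUF_HEADER.pack(b"GGUF", 3, 291, 24)
header = read_gguf_header(fake_file_start)
print(header)
```

For a real file, the same function could be given the first 24 bytes read in binary mode. The metadata key-value pairs and tensor descriptions follow the header and need additional parsing that is not shown here.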
Multi-Backend Support
llama.cpp supports multiple compute backends, selected at compile time or runtime:
- CPU: uses AVX2 (Advanced Vector Extensions 2) or AVX-512 SIMD instructions for vectorized computation on x86 processors, and NEON instructions on ARM chips.
- CUDA: for NVIDIA GPUs.
- ROCm/HIP: for AMD GPUs.
- Metal: for Apple Silicon GPUs.
- Vulkan: for any GPU with Vulkan drivers.
- SYCL (originally “System-wide Compute Language”): for Intel GPUs and accelerators.
This multi-backend design is what Ollama uses under the hood. Ollama is essentially a user-friendly wrapper around llama.cpp that handles model management, API serving, and automatic backend selection.
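Automatic backend selection can be pictured as a simple preference list with a CPU fallback. The sketch below is hypothetical: the ordering and the `pick_backend` function are assumptions for illustration and do not describe the actual logic in llama.cpp or Ollama.

```python
# Assumed preference order: vendor-specific backends first, then the
# portable Vulkan backend, then CPU as the universal fallback.
PREFERENCE = ["cuda", "rocm", "metal", "sycl", "vulkan", "cpu"]


def pick_backend(available):
    """Return the most preferred backend present in `available`."""
    for name in PREFERENCE:
        if name in available:
            return name
    return "cpu"  # the CPU path is always assumed to exist


print(pick_backend({"vulkan", "cpu"}))
print(pick_backend({"cuda", "vulkan", "cpu"}))
```

The point of the design is that the model file and the calling code stay the same. Only the backend that executes the tensor operations changes.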
Why This Matters
llama.cpp makes the GPU vendor choice significantly less critical for LLM inference specifically. Whether you have an NVIDIA, AMD, Intel, or Apple GPU, or only a CPU, llama.cpp can run the same model. Performance varies by backend (CUDA on NVIDIA hardware is typically the fastest), but the functionality is universal.
Comparison
| Runtime | Vendor Lock-in | Open Source | OS Support | ML Ecosystem | LLM Inference |
|---|---|---|---|---|---|
| CUDA | NVIDIA only | No | Linux, Windows | Dominant | Excellent |
| ROCm | AMD only | Yes | Linux (HIP SDK on Windows) | Growing | Good (via HIP) |
| Vulkan | Any GPU | Yes (spec) | Linux, Windows, macOS | Minimal | Via llama.cpp |
| Metal | Apple only | No (Metal); Yes (MLX) | macOS, iOS | Apple ecosystem | Excellent on Apple Silicon |
| llama.cpp | None | Yes | All major platforms | LLM inference only | Excellent everywhere |
Practical Guidance
- Training ML models: NVIDIA + CUDA is currently the default production choice. The ecosystem (PyTorch, cuDNN, NCCL, DeepSpeed, Megatron) is built around it. AMD ROCm is catching up but still carries more risk for production training workloads.
- LLM inference: llama.cpp (typically via Ollama) works on everything: CUDA, ROCm, Metal, Vulkan, or CPU-only. GPU vendor choice matters less here.
- Apple Silicon: MLX offers the best native performance by leveraging unified memory. Ollama (via llama.cpp’s Metal backend) is an excellent alternative with a simpler interface.
- AMD GPU: ROCm works for inference and experimental training. If ROCm does not support your specific card, Vulkan via llama.cpp is a reliable fallback for LLM inference.
See also
- NVME — PCIe bandwidth affects GPU data transfer rates
- External GPUs (eGPUs) — using desktop GPUs with laptops over Thunderbolt
- OLLAMA — LLM inference tool built on llama.cpp
- Model Formats — GGUF, MLX, SafeTensors, and other model file formats
- Accelerating inference — techniques for faster LLM inference
- Synchronization in GPU programming — low-level GPU synchronization primitives
- Arrow on the GPU — GPU-accelerated data processing with Apache Arrow