In GPU programming, synchronization ensures that operations on different devices (or between a host and a device) occur in the correct order, avoiding race conditions and ensuring data consistency. GPU devices, like those using CUDA or HIP, execute operations asynchronously by default, meaning tasks are dispatched to the GPU without blocking the CPU or subsequent commands. However, when interdependencies arise, event synchronization becomes crucial.
CUDA and HIP Streams
A stream is a sequence of commands (kernels, memory copies, etc.) that are issued to the GPU. Commands within a stream execute in the order they are issued, but commands in different streams can execute out of order or concurrently.
- CUDA Streams: Streams in NVIDIA’s CUDA framework are used to manage tasks running on GPUs.
- HIP Streams: A similar abstraction in AMD’s ROCm platform, analogous to CUDA streams.
Events in GPU Synchronization
An event is a marker that can be recorded at a specific point in a stream. Events are used to track execution progress or synchronize streams. Events are lightweight and offer finer control over synchronization compared to traditional host-based synchronization methods
Host-based synchronization involves the CPU actively managing or waiting for GPU tasks to complete, issuing a blocking call that waits for all GPU tasks to finish, such as with cudaDeviceSynchronize(). This form of busy-waiting involves frequent host-device communication, can introduce latency and overhead
GPU event-based synchronization allows synchronization to occur entirely on the GPU, without involving the CPU. When an event is recorded on a GPU stream using cudaEventRecord or hipEventRecord, the event is tied to a specific point in that stream.
For example:
cudaEventRecord(event, stream1);- Here,
stream1“publishes” the event, signaling when all previous operations instream1have completed. - If another stream (
stream2) is made to wait for this event, the GPU ensures that no tasks instream2execute until theeventis completed instream1.
Tip
cudaEvent_t (CUDA) and hipEvent_t (HIP) are C structs types to represent event objects in the Cuda and HIP
CUDA: cudaStreamWaitEvent
The cudaStreamWaitEvent function allows a stream to wait until a specific event has been recorded in another stream.
cudaEvent_t event;
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaEventCreate(&event);
#### Example
// Kernel execution in stream1
myKernel<<<blocks, threads, 0, stream1>>>(...);
// Record an event in stream1
cudaEventRecord(event, stream1);
// Make stream2 wait for the event
cudaStreamWaitEvent(stream2, event, 0);
// Execute a kernel in stream2 after the event
myOtherKernel<<<blocks, threads, 0, stream2>>>(...);
// Clean up
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
cudaEventDestroy(event);HIP: hipStreamWaitEvent
The hipStreamWaitEvent function mirrors CUDA’s functionality, enabling synchronization between streams in AMD’s HIP framework.
hipEvent_t event;
hipStream_t stream1, stream2;
hipStreamCreate(&stream1);
hipStreamCreate(&stream2);
hipEventCreate(&event);
// Kernel execution in stream1
hipLaunchKernelGGL(myKernel, blocks, threads, 0, stream1, ...);
// Record an event in stream1
hipEventRecord(event, stream1);
// Make stream2 wait for the event
hipStreamWaitEvent(stream2, event, 0);
// Execute a kernel in stream2 after the event
hipLaunchKernelGGL(myOtherKernel, blocks, threads, 0, stream2, ...);
// Clean up
hipStreamDestroy(stream1);
hipStreamDestroy(stream2);
hipEventDestroy(event);