In traditional blocking I/O, a thread would wait for a single operation (e.g., a file read or a socket connection) to complete before continuing. This leads to inefficiencies when managing multiple I/O operations, as each one requires a dedicated thread or process to handle blocking calls.

An I/O selector avoids this by allowing a single thread to manage multiple I/O sources, such as file descriptors or sockets, timers or signals. Such a thread will receive an event and offload work to thread in the task queue.

Tip

This idea of having a single thread co-located with the selector that dispatch to a thread pool for CPU-bounded work was implemented by Tokio scheduler in the version 0.1

epoll

epoll is a Linux-specific I/O selector that provides efficient readiness notifications for large numbers of file descriptors. It allows registering, deregistering, and polling events in an event-driven manner.

epoll supports two modes:

  • level-triggered(default): epoll keeps notifying the application until the monitored file remains in ready state
  • edge-triggered: the application gets notified only on state changes, and require full handling and then re-arming epoll

Re-arming epoll requires addressing the readiness condition, i.e. consuming all bytes on a socket.

Example

If you have been notified of data availability on a socket and you haven’t consumed it, if new data arrives and you are using edge-triggered mode you won’t receive a new notification, because there is no state change.

// EPOLLET: Edge triggered. 
// Removing the | EPOLLET use the defualt mode, level triggered
struct epoll_event event = { .events = EPOLLIN | EPOLLET, .data.fd = sockfd };
epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &event);
 
while (1) {
    struct epoll_event events[10];
    int n = epoll_wait(epfd, events, 10, -1);
 
    for (int i = 0; i < n; i++) {
        if (events[i].events & EPOLLIN) {
            char buffer[1024];
            ssize_t bytes_read;
            // Fully consume the data
            while ((bytes_read = read(events[i].data.fd, buffer, sizeof(buffer))) > 0) {
                process_data(buffer, bytes_read);
            }
            // If data remains unread, no further notifications will occur
        }
    }
}
 

epoll predecessors: poll and select

poll and select provided mechanisms for monitoring multiple file descriptors for readiness but suffered from significant scalability issues.:

  • select uses a fixed-size bitmask (fd_set) to track file descriptors, requiring the entire set to be reinitialized before each call. It was too inefficient for large numbers of descriptors, as it imposed a hard limit (typically 1024)
  • poll improved on select by using a dynamic array of structures (pollfd) to represent descriptors, lifting the hard limit on file descriptors and supporting a broader range of event types.
  • both required on each poll invocation linear scanning to find the ready file descriptors

epoll ready-list and tree for watches

epoll stores in the Kernel memory space a separate ready-list that is used to return the matching file descriptors to the program when the epoll_wait method is invoked, and a tree-like structure to store watches more efficiently and is used when epoll_ctl is invoked to add, modify or remove a watch

kqueue

kqueue, available on BSD-based systems (including macOS), is a scalable I/O selector that monitors events such as file descriptors, timers, and signals. It uses filters to specify the types of events to monitor and is known for its ability to handle a wide variety of system events beyond just I/O.

IOCP

I/O Completion Ports (IOCP) is the I/O selector mechanism used in Windows. Unlike epoll or kqueue, which are readiness-based, IOCP is completion-based, meaning that it notifies the application when an operation is fully completed. IOCP is particularly efficient for asynchronous file and network I/O.

Co-locating Threads with I/O Seilectors

In traditional event-driven systems, an I/O Selector operates on a single thread to manage multiple I/O events. Co-locating threads with I/O selectors refers to the practice of pairing a thread with its own dedicated I/O selector, allowing tasks to be executed on the same thread that monitors readiness for those tasks.

Benefits

  • Reduced Context Switching: Tasks that are ready can be immediately executed on the same thread, avoiding the overhead of moving tasks between threads.
  • Cache Locality: By keeping I/O operations and task execution on the same thread, the data accessed during execution remains in the local CPU cache, improving performance.
  • Simplified Synchronization: Since tasks are tied to specific threads, less cross-thread communication is required, reducing the need for locks or atomic operations.

Limitations

  • Single-threaded Bottleneck: If the I/O selector thread becomes overloaded with too many tasks, it can become a bottleneck.
  • Reduced Flexibility: Tasks tied to specific threads can limit the ability to fully utilize multi-core systems, especially in dynamic workloads.