In traditional blocking I/O, a thread would wait for a single operation (e.g., a file read or a socket connection) to complete before continuing. This leads to inefficiencies when managing multiple I/O operations, as each one requires a dedicated thread or process to handle blocking calls.
An I/O selector avoids this by allowing a single thread to manage multiple I/O sources, such as file descriptors or sockets, timers or signals. Such a thread will receive an event and offload work to thread in the task queue.
Tip
This idea of having a single thread co-located with the selector that dispatch to a thread pool for CPU-bounded work was implemented by Tokio scheduler in the version 0.1
epoll
epoll is a Linux-specific I/O selector that provides efficient readiness notifications for large numbers of file descriptors. It allows registering, deregistering, and polling events in an event-driven manner.
epoll supports two modes:
- level-triggered(default):
epollkeeps notifying the application until the monitored file remains in ready state - edge-triggered: the application gets notified only on state changes, and require full handling and then re-arming
epoll
Re-arming epoll requires addressing the readiness condition, i.e. consuming all bytes on a socket.
Example
If you have been notified of data availability on a socket and you haven’t consumed it, if new data arrives and you are using edge-triggered mode you won’t receive a new notification, because there is no state change.
// EPOLLET: Edge triggered.
// Removing the | EPOLLET use the defualt mode, level triggered
struct epoll_event event = { .events = EPOLLIN | EPOLLET, .data.fd = sockfd };
epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &event);
while (1) {
struct epoll_event events[10];
int n = epoll_wait(epfd, events, 10, -1);
for (int i = 0; i < n; i++) {
if (events[i].events & EPOLLIN) {
char buffer[1024];
ssize_t bytes_read;
// Fully consume the data
while ((bytes_read = read(events[i].data.fd, buffer, sizeof(buffer))) > 0) {
process_data(buffer, bytes_read);
}
// If data remains unread, no further notifications will occur
}
}
}
epoll predecessors: poll and select
poll and select provided mechanisms for monitoring multiple file descriptors for readiness but suffered from significant scalability issues.:
selectuses a fixed-size bitmask (fd_set) to track file descriptors, requiring the entire set to be reinitialized before each call. It was too inefficient for large numbers of descriptors, as it imposed a hard limit (typically 1024)pollimproved onselectby using a dynamic array of structures (pollfd) to represent descriptors, lifting the hard limit on file descriptors and supporting a broader range of event types.- both required on each poll invocation linear scanning to find the ready file descriptors
epollready-list and tree for watches
epollstores in the Kernel memory space a separate ready-list that is used to return the matching file descriptors to the program when theepoll_waitmethod is invoked, and a tree-like structure to store watches more efficiently and is used whenepoll_ctlis invoked to add, modify or remove a watch
kqueue
kqueue, available on BSD-based systems (including macOS), is a scalable I/O selector that monitors events such as file descriptors, timers, and signals. It uses filters to specify the types of events to monitor and is known for its ability to handle a wide variety of system events beyond just I/O.
IOCP
I/O Completion Ports (IOCP) is the I/O selector mechanism used in Windows. Unlike epoll or kqueue, which are readiness-based, IOCP is completion-based, meaning that it notifies the application when an operation is fully completed. IOCP is particularly efficient for asynchronous file and network I/O.
Co-locating Threads with I/O Seilectors
In traditional event-driven systems, an I/O Selector operates on a single thread to manage multiple I/O events. Co-locating threads with I/O selectors refers to the practice of pairing a thread with its own dedicated I/O selector, allowing tasks to be executed on the same thread that monitors readiness for those tasks.
Benefits
- Reduced Context Switching: Tasks that are ready can be immediately executed on the same thread, avoiding the overhead of moving tasks between threads.
- Cache Locality: By keeping I/O operations and task execution on the same thread, the data accessed during execution remains in the local CPU cache, improving performance.
- Simplified Synchronization: Since tasks are tied to specific threads, less cross-thread communication is required, reducing the need for locks or atomic operations.
Limitations
- Single-threaded Bottleneck: If the I/O selector thread becomes overloaded with too many tasks, it can become a bottleneck.
- Reduced Flexibility: Tasks tied to specific threads can limit the ability to fully utilize multi-core systems, especially in dynamic workloads.