What is epoll?

epoll is a new event polling mechanism introduced in Linux kernel 2.6. Like select(2) and poll(2), epoll allows a user program to check for events coming from multiple file descriptors (called "fds" below), such as data becoming available for reading, space becoming available in the write buffer, or an error happening on an fd. The user program is blocked until events happen on at least one of the fds, or until a certain timeout has expired; the events are then returned to the user program, which can do the reading or writing as necessary. This avoids wasting CPU time manually polling each fd in nonblocking mode when no events are happening. (Note that you must not use blocking I/O when you want to wait for events from more than one fd per thread, for who knows which fd will have events first?)

Why epoll? I already have select(2), poll(2), threads, etc., that do the same thing!

select(2) and poll(2) are inefficient when dealing with a large number of fds (denoted N below), since the whole list of fds to wait for is passed to the kernel on each call: the kernel has to parse that list every time, it must fill in the readiness information for each fd when returning, and the user program then has to do a linear scan over the returned readiness information. That makes three O(N) passes over the list of all fds, even if only a small number of them have events available. epoll solves this problem with a stateful API that stores the set of monitored fds inside the kernel, so there is no longer a need to send the fd list to the kernel every time events must be waited for, and with careful programming an overhead mostly independent of N (the total number of fds being waited for) can be achieved.

Of course, you can also use threads. However, many people (such as me) find writing correct multi-threaded programs harder than writing correct single-threaded event-driven programs, and having one thread for each client can be a rather high overhead at times. Anyway, using epoll does not preclude you from also using threads to split the load between multiple CPUs --- just let each thread handle a subset of the fds.

What do "edge-triggered" and "level-triggered" mean?

Ordinary select(2) and poll(2), as well as epoll by default, are level-triggered: they report an event on an fd whenever it is readable, writable, etc. In edge-triggered mode, an event is reported only when an fd becomes readable, writable, etc., and nothing more will be reported about the readability of that fd until its data to be read is exhausted at some point (read(2) returns EAGAIN; in some cases you can also be sure that the data to be read is exhausted if read(2) returns fewer bytes than the buffer size) and it then becomes readable again. The same applies to writability, etc.
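
For example, in edge-triggered mode a reader typically has to keep reading until read(2) fails with EAGAIN before it may expect another event for that fd. A rough sketch (handle_data() is a made-up placeholder, and the fd is assumed to be nonblocking):

    /* Drain everything currently readable from a nonblocking fd, as
       edge-triggered mode requires; handle_data() is only illustrative. */
    #include <errno.h>
    #include <unistd.h>

    extern void handle_data(const char *buf, ssize_t len);   /* illustrative */

    void drain_fd(int fd)
    {
        char buf[4096];
        for (;;) {
            ssize_t n = read(fd, buf, sizeof buf);
            if (n > 0) {
                handle_data(buf, n);   /* process whatever was read */
            } else if (n == 0) {
                break;                 /* end of file: peer closed the connection */
            } else if (errno == EAGAIN) {
                break;                 /* nothing left; wait for the next event */
            } else if (errno != EINTR) {
                break;                 /* a real error */
            }
        }
    }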

The level-triggered mode works well for select(2) and poll(2), for the list of fds being waited for is sent to the kernel every time anyway, so it can be changed arbitrarily between calls without extra overhead. However, when using the epoll interface, changing the list of fds requires extra system calls, which is overhead better avoided. It is often the case that you want to temporarily stop waiting for events from a certain fd; for example, a server won't be interested in reading a client's next command while the processing of the current one has not been finished. In level-triggered mode, you must remove the fd from the fd/event set being monitored, otherwise the kernel will keep reporting "client XXX has commands to be read" to a program that is not interested in that data at all. In edge-triggered mode, you may just keep the fd set as it is, since no more "readable" events will be received on that fd until the server actually has time to read new commands from it and drains all the data available for reading; this avoids the overhead of changing the set of fds/events to be monitored.

You are recommended to use the edge-triggered mode in newly-written programs, but you must be careful to drain all available data before expecting another event on an fd, something which programs using select(2) or poll(2) need not do. Therefore, when porting a program from select(2) or poll(2) to the epoll API, it is usually quite trivial to make it use the level-triggered mode, while converting it to work with the edge-triggered mode may need much more effort.

Let me have a look at the API.

No, I won't write an API reference, since the manpages are already good enough. If you don't have them on your system, you can grab them from http://www.xmailserver.org/linux-patches/{epoll,epoll_create,epoll_ctl,epoll_wait}.txt (replace the braces and the part between them with one of the items, such as "epoll_wait"). Here I'll just give an introduction to what the API can do.

  • epoll_create(2): creates an epoll fd that represents the state information, i.e. the set of currently monitored fds. After use the epoll fd should be closed using close(2).
  • epoll_ctl(2): add or remove fds from the set of fds monitored (associated with a certain epoll fd), or modify the events monitored for a certain fd in the set. For now, you can add/remove/modify only one fd at a time. The per-fd choice between level- and edge-triggered notification is also done with this system call.
  • epoll_wait(2): wait for events. The set of fds being monitored is designated by the epoll fd, rather than being sent to the kernel as an argument to this call. Like select(2), epoll_wait(2) blocks until at least one event is received, a timeout is reached, or a signal arrives (in which case the call fails with EINTR as usual). It stores the reported events in a separate buffer supplied by the user program (nothing is stored for fds that have no events to report, unlike select(2) and poll(2)), and returns the number of events reported, or -1 upon an error. A minimal usage sketch follows this list.
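
To make the three calls concrete, here is a minimal sketch of a level-triggered event loop built on them. Error handling is abbreviated, and listen_fd is assumed to be an already-created, nonblocking listening socket:

    /* Minimal level-triggered epoll sketch; error handling abbreviated. */
    #include <sys/epoll.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_EVENTS 64

    void event_loop(int listen_fd)
    {
        int epfd = epoll_create(10);        /* the argument is only a size hint */
        if (epfd < 0) { perror("epoll_create"); exit(1); }

        struct epoll_event ev;
        ev.events = EPOLLIN;
        ev.data.fd = listen_fd;
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev) < 0) {
            perror("epoll_ctl"); exit(1);
        }

        struct epoll_event events[MAX_EVENTS];
        for (;;) {
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1);  /* -1: no timeout */
            if (n < 0)
                continue;                   /* e.g. interrupted by a signal */
            for (int i = 0; i < n; i++) {
                if (events[i].data.fd == listen_fd) {
                    /* accept(2) the new connection here, make it nonblocking,
                       and add it to the epoll set */
                } else {
                    /* read from / write to events[i].data.fd here */
                }
            }
        }
    }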

How to effectively use the epoll APIs?

Note: I have done some server programming, but I'm far from a guru on it, so feel free to give your suggestions.

First and foremost, if you want to use event-polling APIs such as select(2), poll(2) or epoll, make your fds nonblocking! Otherwise you won't be able to make sure that no more bytes are available on a certain fd, for the luxury of EAGAIN is gone. Also, the events returned by select(2), etc., are merely hints; it is entirely possible that an event is reported on an fd but no data is available any more when you actually come to access it, because another reader of the same socket has eaten the data, or just because of some queerness in the drivers. Your program won't fare well if it blocks in the wrong place, so please play safe and use nonblocking mode.
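
The usual way to put an fd into nonblocking mode is fcntl(2); a small helper might look like this:

    /* Put an fd into nonblocking mode using fcntl(2); returns -1 on error. */
    #include <fcntl.h>

    int make_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags < 0)
            return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }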

Now, if you want to use a less-than-portable API such as epoll, you are asking for performance and scalability, so don't waste the benefits coming from epoll on stupidities such as dumb data structures with O(N) overhead. GLib will be your friend. Consider an O(N) scan over all fds an expensive operation, and you will see how rarely it is actually needed, with the help of sane data structures.

Well, I understand that many people have concerns about using too many libraries in systems programming, but GLib is such a perfectly lovely library, with its byte-order macros, string handling functions, data structures, and various OS-related utilities, that it is hard to resist. Anyway, the GLib dependency isn't that bad, for we are writing a server or something similar, not something that must fit on a boot floppy when statically linked. My opinion is that if your program is simple enough not to need fancy data structures, or if you are using C++/STL or Java, or if the code for these data structures is already available to your project, you may not need GLib; otherwise, having one more library dependency is far better than using a linked list where a hash table is needed, or writing yet another balanced-tree implementation that may well have subtle bugs in it, for no apparent gain.

Also, we will use the edge-triggered mode of epoll, for it is usually more efficient that way.

You should have a data structure called CoState representing the state information (including buffers for data read but not yet processed, and for data generated but not yet written) for each coroutine that processes the data coming from a certain fd. The fact that you are using epoll means you don't like the overhead or inconvenience of threads, so you will have to schedule these coroutines yourself. All the CoStates should be organized into an expansible array (you may use GPtrArray from GLib) or a hash table (GHashTable), indexed by the fd number. Each CoState structure should contain two fields named can_read and want_read if the coroutine may need to wait for data to read, and likewise for writing and other things (including conditions unrelated to I/O that a coroutine may wait for). can_read is set when epoll_wait() reports that the fd has become readable, and is cleared when reading from the fd results in EAGAIN (or, in certain cases, a short read). It should initially be set, for epoll will not report events about the initial readability/writability. want_read should be set or cleared when the coroutine returns after some processing: set when the coroutine returns in order to wait for more data to be read, and cleared when it returns for any other reason. Now you may determine whether a coroutine co is runnable using the following expression:

is_runnable = ! (co->want_read && !co->can_read) && ! (co->want_write && !co->can_write) && other_user_specified_conditions;
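
As a purely illustrative example, such a structure and the runnability test might look like this (the exact fields and names are of course up to you):

    /* A sketch of the per-coroutine state described above; all names are
       only illustrative. */
    typedef struct {
        int fd;                        /* the fd this coroutine handles         */
        int can_read,  want_read;      /* readability: reported / waited for    */
        int can_write, want_write;     /* the same for writability              */
        int in_queue;                  /* already on the run queue? (see below) */
        /* ... plus buffers for data read but not yet processed, data generated
           but not yet written, and any other per-connection state ... */
    } CoState;

    static int is_runnable(const CoState *co)
    {
        return !(co->want_read  && !co->can_read)
            && !(co->want_write && !co->can_write);
        /* && any other user-specified conditions */
    }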

In order to schedule the coroutines efficiently, you should use a queue holding all currently runnable coroutines (i.e., those not waiting for some fd to become readable/writable/etc.). For each iteration of the main loop, remove the element at the head of the queue and call the corresponding coroutine; inside it, the CoStates of newly runnable coroutines may be added to the tail of the queue, for example if a coroutine may continue because some user-specified condition is now met, or if the current coroutine is still runnable but returns early to avoid starving others. If the queue is empty, nothing can run for now, so we call epoll_wait(), set the can_read flags according to the returned events, and add the corresponding CoStates to the tail of the queue if necessary (i.e., if can_read changes from 0 to 1 and want_read is 1). Be careful to avoid introducing duplicate elements into the queue; a variety of methods may be used for that, but the simplest is to add an in_queue flag to the CoState data structure. Initially the queue should be empty, and the fd set for epoll should contain the initially existing fds, such as listening sockets. When a new coroutine is started for a new fd (for example by accepting a new connection), the new fd should also be added to epoll's fd set, and don't forget to make the new fd nonblocking.
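
Here is a sketch of that scheduling loop, assuming the CoState and is_runnable() sketched above, a GLib GQueue as the run queue, and made-up helpers run_coroutine() and costate_for_fd():

    /* A sketch of the scheduling loop described above; run_coroutine() and
       costate_for_fd() are made-up helpers, and a real program will differ. */
    #include <glib.h>
    #include <sys/epoll.h>

    extern void     run_coroutine(CoState *co);     /* illustrative */
    extern CoState *costate_for_fd(int fd);         /* illustrative */

    void main_loop(int epfd, GQueue *runq)
    {
        struct epoll_event events[64];

        for (;;) {
            /* Run everything that is currently runnable. */
            while (!g_queue_is_empty(runq)) {
                CoState *co = g_queue_pop_head(runq);
                co->in_queue = 0;
                run_coroutine(co);  /* may push newly runnable CoStates onto runq */
            }

            /* Nothing can run: block until some fd has events. */
            int n = epoll_wait(epfd, events, 64, -1);
            for (int i = 0; i < n; i++) {
                CoState *co = costate_for_fd(events[i].data.fd);
                if (events[i].events & (EPOLLIN  | EPOLLERR | EPOLLHUP))
                    co->can_read = 1;
                if (events[i].events & (EPOLLOUT | EPOLLERR | EPOLLHUP))
                    co->can_write = 1;
                if (is_runnable(co) && !co->in_queue) {   /* newly runnable */
                    co->in_queue = 1;
                    g_queue_push_tail(runq, co);
                }
            }
        }
    }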

There is one more thing to be aware of: when you are done with a certain fd, either because the connection was closed by the peer or because of some other action, do not rashly free any associated data structures (such as its CoState), for it might still be in the queue, and Bad Things might happen when it is reached. In the simplest case, where all such things happen within the coroutine associated with the fd (for example, you only destroy the CoState when read() returns 0 (end of file), or when write() results in an ECONNRESET or EPIPE), the CoState should not be in any queue, so you may just remove the corresponding entry from the fd=>CoState array or hash table, remove the fd from the fd set for epoll (not strictly necessary, since this is done automatically when the fd is closed), close the fd, and free the CoState. For more complex cases (for example, the server may forcefully close a connection for some other reason), you should probably just record the connections to close in a cleanup list, and after the queue has been walked, process the cleanup list and do the above cleanup for each fd and its associated CoState. Also, be careful if you want to close a connection while iterating over all the connections in the hash table (or another complex data structure), since you are probably not allowed to change the hash table during iteration.
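
For the simple case, the cleanup done from inside the coroutine might look roughly like this, using the CoState from the earlier sketch and a GHashTable keyed by GINT_TO_POINTER(fd); all names remain illustrative:

    /* A sketch of tearing down one connection; names are illustrative. */
    #include <glib.h>
    #include <unistd.h>
    #include <sys/epoll.h>

    void destroy_connection(int epfd, GHashTable *fd_to_costate, CoState *co)
    {
        struct epoll_event dummy;  /* kernels before 2.6.9 want a non-NULL pointer */

        g_hash_table_remove(fd_to_costate, GINT_TO_POINTER(co->fd));
        epoll_ctl(epfd, EPOLL_CTL_DEL, co->fd, &dummy); /* optional: close() also removes it */
        close(co->fd);
        g_free(co);
    }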

What about timeouts used in the coroutines?

I think you may put all the timeout-related information (expiration time, corresponding CoState) into a balanced binary tree (such as GTree; actually a heap will do), ordered by the expiration time. Each time before calling epoll_wait(), get the current time t0 and the smallest expiration time in the tree t1. If t0 < t1, use t1 - t0 as the timeout for epoll_wait(), otherwise use zero. If the tree is empty (no timers active), use an infinite timeout (-1) for epoll_wait(). After returning from epoll_wait(), process and remove all elements in the tree whose timeouts have expired (since a balanced binary tree or a heap is used, this should be fast) by putting the respective CoStates onto the run queue. This is done in addition to processing the events returned by epoll_wait().
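
Just to pin down the arithmetic, a hypothetical helper working in milliseconds could be:

    /* Compute the epoll_wait() timeout from the earliest timer expiration,
       with times in milliseconds; a purely illustrative helper. */
    int compute_timeout(long long now, int have_timers, long long earliest)
    {
        if (!have_timers)
            return -1;               /* no timers active: wait indefinitely */
        if (earliest <= now)
            return 0;                /* already expired: do not block at all */
        return (int)(earliest - now);
    }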

What about signals?

Well, in my opinion this is a sore spot in the API; why Linux does not provide a pselect(2)-like interface that allows one to atomically change the signal mask during the call is quite beyond me.

The naive way is to set a flag in the signal handler and check it in each main-loop iteration. It seems to work because epoll_wait() etc. are interrupted by signals, but it actually doesn't: if the signal arrives and the flag is set after the flag has been tested but before the call to epoll_wait() (improbable but possible), the signal will not be handled until epoll_wait() returns for some other reason, which is a Bad Thing. Even if you block the signals except around the call to epoll_wait(), the unblocking and the epoll_wait() call are not done atomically, so there is no way to completely avoid the race.

The correct way to do it when using the level-triggered mode (including good old select/poll, and epoll in level-triggered mode) is to siglongjmp() from the signal handler directly into the main loop, just before the flag-testing part, after setting the relevant flags. The signals should still be blocked except around the select/poll/epoll_wait() call, for the siglongjmp() may leave data structures inconsistent if done from arbitrary places.

Alas, when using edge-triggered notification, jumping like this can cause loss of events, especially if the signal arrives after epoll_wait() returns with some events but before its return value (the number of events returned) is stored. Because of the edge-triggeredness, you will never see the lost events again. Most frequently, you want to handle program-termination signals such as SIGINT and SIGTERM so that the program exits gracefully; in this case losing a few events does not matter, since you are going to terminate the main loop anyway. For other signals such as SIGCHLD, try to do everything inside the signal handler (but that can be too late if you want to wake up other coroutines); failing that, use a finite and bounded timeout for epoll_wait(), which is an ugly last resort but should work.

Other possibilities:

  • If the epoll fd had supported SIGIO (so that one is sent when epoll_wait won't block), one could have used sigtimedwait() to have a main loop based on signals instead. Unfortunately, it seems that the current epoll implementation (as of 2.6.3) does not...
  • Create a nonblocking pipe, include its read end in the fd set used for epoll, and write something to the pipe in the signal handler after setting the flags (if the pipe is full, the write can be skipped). When epoll_wait() indicates that the pipe is readable, drain the data in it, test the flags, and do the necessary handling; see the sketch below.
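
Here is a sketch of that pipe trick; the names are made up and error handling is minimal:

    /* Self-pipe sketch: the read end goes into the epoll set, and the signal
       handler sets a flag and writes a byte to the write end. */
    #include <unistd.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <string.h>

    static int signal_pipe[2];                 /* [0] = read end, [1] = write end */
    static volatile sig_atomic_t got_sigterm;

    static void on_sigterm(int signo)
    {
        (void)signo;
        got_sigterm = 1;
        /* If the pipe is full the write fails with EAGAIN, which is fine:
           there is already unread data that will wake up the main loop. */
        (void)write(signal_pipe[1], "x", 1);
    }

    int setup_signal_pipe(void)                /* returns the read end, or -1 */
    {
        struct sigaction sa;

        if (pipe(signal_pipe) < 0)
            return -1;
        for (int i = 0; i < 2; i++)
            fcntl(signal_pipe[i], F_SETFL,
                  fcntl(signal_pipe[i], F_GETFL, 0) | O_NONBLOCK);

        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sigterm;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGTERM, &sa, NULL);

        /* Add signal_pipe[0] to the epoll set with EPOLLIN.  When it becomes
           readable, drain it, then test got_sigterm and act on it. */
        return signal_pipe[0];
    }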

Please give me your suggestions here.

Also, you may need to ignore SIGPIPE, since writing to a socket that has been closed on the other side raises a SIGPIPE (as well as the error EPIPE for the write), which will terminate your program if not handled. I don't find it worthwhile to try to avoid provoking these SIGPIPEs in the first place.
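
Ignoring it is a one-liner at startup, for example:

    #include <signal.h>

    int main(void)
    {
        /* Ignore SIGPIPE so that writing to a closed socket just fails with
           EPIPE instead of killing the process. */
        signal(SIGPIPE, SIG_IGN);

        /* ... rest of the server ... */
        return 0;
    }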

I want portability!

I think it would be relatively easy to write a wrapper around poll(2) (or select(2), but then the fd numbers are limited by FD_SETSIZE) that gives it epoll-compatible semantics. Simply record the fds and the events monitored for each fd in a data structure, use them as the fd set for poll(2), and collect the returned events, simulating edge-triggeredness if necessary. The fds will not be removed from the set automatically upon close, though, so you will need to use EPOLL_CTL_DEL manually. The efficiency will be nowhere near epoll's, but it should be about the same as a normal program based on poll(2).
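
A rough idea of the bookkeeping such a wrapper needs (all names here are made up for illustration):

    /* Sketch of the state kept by an epoll-like wrapper over poll(2);
       purely illustrative. */
    #include <poll.h>

    struct fake_epoll {
        struct pollfd *pfds;     /* array handed to poll(2)             */
        void         **udata;    /* per-fd user data, parallel to pfds  */
        int            count;    /* number of monitored fds             */
    };

    /* The "ctl" operations just edit the two arrays; the "wait" operation
       calls poll(pfds, count, timeout) and copies the entries with nonzero
       revents into the caller's event buffer, filtering them as needed to
       imitate edge-triggered behaviour. */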

Also, take a look at libevent (http://monkey.org/~provos/libevent/).

The C10K document at http://www.kegel.com/c10k.html is a good introduction to various other event-polling mechanisms used on other systems. Worth a read.