Internet & Network Programming


A Templated Library for Fast Multiplexing

Christopher Rooney

If the number of your network connections hits the stratosphere, then you need this.


Introduction

As the Internet continues to grow, the traditional Unix network programming paradigms show more and more signs of age. What functioned perfectly well for 30 or 50 simultaneous connections fails spectacularly when asked to handle 10 or 15 thousand. The recognized (and portable) way to achieve better scaling behavior is to use standard multiplexing I/O (i.e., poll). Even multiplexing applications, though, can show considerable signs of strain on servers that demand high performance while scaling to thousands of concurrent connections. Fortunately, Unix platform vendors have recognized this failing and responded with more scalable alternatives to poll. Of course, Unix vendors being Unix vendors, their responses are platform dependent. Since the various schemes necessarily deliver at least as much functionality as standard poll, it is possible to abstract their behavior into a well-behaved class that provides a simple interface for a polling engine. This article presents a class that uses a templated delegate pattern instead of subclassing to achieve platform independence. The class provides implementations for /dev/poll, which is standard in Solaris 8 and available as a patch for Linux; the kqueue/kevent model, standard in FreeBSD and OpenBSD; and traditional poll.

The Dark Ages

Traditional approaches to Unix network programming involve using a separate thread or process to handle each concurrent client connection. The first problem with this approach is the cost of context switching, especially switching between contexts that are blocked in I/O calls. It’s easy to overlook the high price of context switching, since the behind-the-scenes processing it necessitates is transparent to the programmer, but the costs are serious: stack frames have to be saved and restored, memory pages swapped, and registers reset. Preforking (or prespawning) and allowing contexts to serve connections in sequence will amortize the cost of forking, but not of context switching. Unfortunately, context switching is not the only or even the most serious problem with an approach that uses a separate thread of execution per client. If you want to handle several thousand connections simultaneously, you need an equivalent number of threads or processes, and current operating systems are just not prepared to handle that many.

Take Apache as an example: say you compile a fairly lean Apache with shared modules. Each instance goes about 2 MB, and you have 1 GB of free RAM (ignore disk swap for this example). The theoretical maximum number of simultaneous connections is 500, as each connection requires an httpd instance. Servers that use threads instead of processes may be able to squeeze more connections into the same amount of memory, but are obviously still constrained to some relatively low upper limit. The presented polling engine allows and encourages single-threaded programming.

The Way of the Forefathers

Traditionally, multiplexing I/O uses the poll system call:

int poll(struct pollfd *fds, int nfds, int timeout);

It accepts an array of pollfd structs, each of which contains fields for the descriptor, a short integer representation of the events in which the caller is interested, and another short for the ready events returned from the kernel. There may be holes in the array, signified by a negative descriptor, which are examined and then ignored by the kernel. Polling in this fashion necessarily scales linearly with the number of pollfds to be checked, even when some of these structs are holes (have descriptors of -1). When the number of descriptors to be checked gets into the thousands, performance degrades noticeably. In addition, calling code must loop through the entire array, or at least until it discovers as many ready descriptors as the poll call reports. So if there are ready descriptors only in the first and hundredth indices of the pollfd array, both the kernel and the client code must inspect and test 98 unready entries to find 2 ready ones. These failings of traditional poll generated much comment on mailing lists and newsgroups, and the kernel implementers heard. And in grand Unix tradition, they came up with similar, yet incompatible, answers.

The Way of the Sun

Sun introduced the first response to the polling problem as a patch to Solaris 7; it became a standard feature in Solaris 8. Their innovation was /dev/poll, a polling “driver” (their description), which a caller opens, writes pollfd structs to, and closes. Since there is a nice orthogonal close to match open, the opposite of write must be read, right? Of course not. Reading is done with ioctl, to which a caller passes the /dev/poll descriptor, the integer constant DP_POLL, and a dvpoll struct. This dvpoll struct, a hack to conform to the ioctl interface, contains a pollfd pointer, the number of pollfds the pointer represents, and an integer timeout. This is almost irredeemably ugly, but the performance improvement over poll is not to be sneezed at and grows as the number of descriptors increases. The benchmark at [1] shows a 7-to-1 speed advantage over traditional poll at 100 descriptors, growing to nearly 400 to 1 at 10,000 descriptors. In my parish, those kinds of numbers redeem a gang of inelegance.

The Way of the Free

Jonathan Lemon, who implemented kernel queues in FreeBSD 4.1, also used the idea of opening and reading from a descriptor. But his vision was a little broader than his nameless counterpart at Sun, as the kqueue mechanism is intended not only for sockets and descriptors but also for process events (exits, forks, etc.), signals, vnode events (deletions, writes, appends, etc.), and others. To use a kernel queue, you get a descriptor from a call to kqueue. Then you call kevent to have the kernel read from an array of kevents and fill another with ready descriptors. Subsequent calls to kevent only need to pass new and changed interest info, while the kernel keeps track of which descriptors are ready. This is a pretty neat and elegant system. Its aims at genericity dictate that it use the more expansive struct kevent instead of the familiar pollfd. A small problem for C++ coders, though, is that the kevent struct and system call are identically named, so variables must always be declared to be of type “struct kevent” to avoid having calls to kevent interpreted as constructors, with the attendant hilarity of error messages that appear to make no sense.

The kqueue system and /dev/poll both rely on a kernel technique called “hinting,” in which files (sockets, in this case), upon becoming ready for an event, signify that readiness to the kernel. If any user code has specified interest in the event, the kernel stores the readiness indicator until asked.

But, But, assert(LINUX==UNIX) Fails...

Which of these will make its way into the Linux kernel? Looks like neither. There exist unofficial patches to get /dev/poll [2] under Linux, but Linus himself has said scathing things about both kqueue and /dev/poll, although what he suggests instead is not completely dissimilar to kqueue, at least to the user-space coder [3]. Various alternate interfaces for faster polling, of varying elegance, have been suggested on the Linux lists and newsgroups. What exists now for Linux, though, and around what (so help me) a lot of people seem to be writing code, are the POSIX real-time signals. The technique of applying real-time signals to multiplexing is so ugly it offends delicate sensibilities, so I will only describe its fatal flaw: when there are many concurrent connections, the signal queue can overflow and you must revert to using poll. So you either have to maintain a hopefully unnecessary pollfd array or construct one just when you are being creamed with client requests. And poll, of course, remains less than ideal.

If you are determined to use Linux for high concurrency TCP servers, you can either get the best /dev/poll patch you can find or set yourself up for heartache.

Enough Hooha, Let’s See Some Code

The polling engine presented here has an extremely simple interface (Listing 1). It exposes only three functions (and one overload) for adding, clearing, and polling the descriptors and events of interest. With add, you register a callback, which is simply a function object whose operator() returns either KEEP or REMOVE. poll determines which descriptors are ready to read and write and executes the callbacks associated with the ready ones. If the callback returns REMOVE, the descriptor and its associated callback are cleared. Generally, if the socket is not closed, it is expected that a callback adds another before returning REMOVE. In this way, the engine is self-managing; clear may never be needed.

The steps for accepting a connection hardly ever vary; code accepts a connection and gets a descriptor for the new socket, sets the descriptor to non-blocking, and registers a read callback for it. The library provides this standard accept_callback parameterized over a reader callback. Given that, the usual structure of a program using the polling engine is:

define at least one reader callback
define at least one writer callback
create a poller
open a listening socket
register an accept_callback<reader_callback> for the listening socket
while (1)
  poller::poll()
  unrelated processing (if necessary)
end while

The callback pointers used to register with the polling engine are wrapped in a smart pointer based on the Boost library’s implementation [4], which means you allocate a callback on the heap and pass in its address, and the polling engine takes responsibility for deleting it. It may seem a little odd at first, newing objects and not deleting them, but hey, you could be writing Java and feel odd all the time.

The callbacks are implemented as function objects to allow them to maintain state of which the polling engine is ignorant. Function object callbacks can efficiently hold as much state as desired because only (smart) pointers are ever copied by the polling engine. Scott Meyers describes this technique for storing stateful function objects in standard containers in Effective STL, Item 38 [5]. These callbacks can therefore be much more full-featured than conventional function-based callbacks. In particular, function object callbacks can have knowledge of other callback instances. A reader could know which writer is associated with its socket and append strings to a FIFO member of the writer. Alternatively, a callback could do all the reading and writing for a connection. Included with the source code (available for download from <www.cuj.com/code>) is a simple server that defines a reader that exists for the life of a connection, creating a new writer for every response. These stateful callbacks can do pretty much whatever you want, so go ahead; theWorld == you.getOyster().

Delegates

As Listing 1 shows, the poller class delegates almost all work to the class over which it is parameterized. This pattern is similar to the use of traits classes, but the emphasis is reversed: a traits class generally defines typedefs and provides low-level functionality, while here all the logic and algorithmic complexity is in the parameterized class. In the Delegate pattern [6] used by this polling engine, the parameterized class is merely an interface, similar to a pure virtual base class, except that none of the run-time performance costs of polymorphism are incurred. Note that in this instance we still need to do a little autoconf magic to compile conditionally, because no single current system has both /dev/poll and kqueue headers, but this is not inherent in the pattern.

All three delegates have similar logic. poll is called; it checks its descriptors and processes the ready ones. During this processing, additional descriptors may be added (when an accept socket is ready), and callbacks may signify intent to modify or purge themselves. Changes are stored in a delegate-specific way and processed again the next time poll is called. A cute optimization is that callbacks are executed before adding, as frequently they can be immediately removed and avoid further processing. For example, client sockets are usually writable when their requests are sent, well before their requests are received, read, and parsed.

Standard library containers (and the hash_map extension [7]) are used exclusively in the delegates. Even the arrays needed by system calls are maintained as vectors, with the address of the zeroth element passed where the system calls expect a pointer.

The kqueue and /dev/poll delegates actually have simpler implementations, because their underlying operating system constructs maintain state. A look at two delegates’ implementations of remove (Listing 2) shows some of the additional complexity the older poll requires. The newer, stateful approaches are able to aggregate the events that they have been asked to monitor. For the stateless poll, you have to do the aggregation yourself, or pay the price of evaluating the same descriptor multiple times. Therefore, the member data structure and the manipulation code are considerably more complicated. Additionally, a priority queue (available_) is used by the generic poller to store the lowest available index, as you want to keep the pollfd array as small and densely populated as possible, because of the linear scaling behavior mentioned above.

Exceptions and Expectations

The library purposefully throws only a small number of exceptions and expects clients to ignore most of them. Theoretically, setting a socket to non-blocking mode could fail, because it uses fcntl, which could return -1, although it is hard to imagine a circumstance in which changing the blocking flags for a valid socket will fail. Because the socket is always set to non-blocking immediately after creation, it will never be invalid. The convenience function set_non_blocking throws an exception if it can’t execute, but client code can in good conscience ignore it, as does the provided accept_callback.

Initial setup is another matter, however. If the delegate-specific poller cannot be created (i.e., the call to open(/dev/poll) or kqueue fails), an exception will be thrown that should be caught. The convenience function get_listen_socket combines five related system calls into one function that returns the descriptor on which the program listens for incoming connections. If this function throws, steps should be taken. Probably, the only possible response to one of these circumstances is a log message and graceful exit. This exception-handling code, though, can be kept outside the main polling loop.

The imaginatively titled exception class adds which and where to the what it inherits from std::exception. The complete implementation is shown in Listing 3 since it is generally useful as a translation layer between the POSIX errno framework and C++ exception handling and might be of interest to programmers writing non-networking code. Did you know that errno might be implemented as a macro and mysteriously break all kinds of things under Linux that work without problems on four other Unix versions? Consider reading a magazine the easy way of finding out.

Caveats

Is that really all there is to it? Can you be up and running and serving super high traffic on multiple platforms with just three method calls and a couple function objects? Well, not exactly. First you have to configure your system to allow a suitably large amount of descriptors per process, and possibly do other system tuning. Such explanation is beyond the scope of this article, but full information on necessary tweaks (and much, much more) can be found on Dan Kegel’s excellent C10K page [8].

The Solaris header file for /dev/poll has an interesting little quirk: a } near the end of the file needs to be commented out to compile with g++. This is curious, as the brace sits inside an #ifdef __cplusplus block. Presumably the header works with Sun’s own toolset.

The kevent syscall waits out the entire timeout, even if there are ready descriptors. This is different from poll and /dev/poll, and frankly, I am hard pressed to think of a situation in which this is desirable behavior. For portable speed, you probably want to pass zero timeouts.

The main development environment for this polling engine was OpenBSD 2.9 (Intel) [9]. The /dev/poll delegate was written on Solaris 8 (Intel). The kqueue system was also tested on FreeBSD 4.3 (Intel). The generic poll piece was compiled on Red Hat Linux 7.1 (Intel) and NetBSD 1.5.1 (AMD).

Conclusion

While money dribbles out of the Internet, consumer demand keeps streaming in. Consequently, there exists a need to be able to network more efficiently, to increase capability without purchasing new hardware. The currently much ballyhooed peer-to-peer models, in particular, will require maximal per-node network efficiency for commodity machines. The polling engine presented here provides a simple paradigm for writing high-powered, highly scalable Unix networking applications.

Notes

[1] <www.kegel.com/dkftpbench/Poller_bench.html>

[2] I have just become aware of a new /dev/poll patch (actually called /dev/epoll) for Linux that is supposed to perform wonderfully. I have not examined it. See <www.xmailserver.org/linux-patches/nio-improve.html>.

[3] < http://groups.google.com/groups?hl=en&safe=off&th=3c85e39303c716a2,62&seekm=fa.fb3h7jv.90o42k%40ifi.uio.no>. Daunting URL, I know. Search google groups for “linus /dev/poll interface _hell_”. It should be the only thread that returns.

[4] <www.boost.org/libs/smart_ptr/>

[5] Scott Meyers. Effective STL: 50 Specific Ways to Improve Your Use of the Standard Template Library (Addison-Wesley, 2001).

[6] Since writing this article and code, I have read Andrei Alexandrescu’s excellent Modern C++ Design (<www.moderncppdesign.com>) and recognize what I call Delegates as examples of what Andrei calls policy-based design. I will defer to Andrei and the inevitable and assume the pattern will be called the Policy Pattern, although I think Delegate remains a better name for it.

[7] Only the SGI (g++, STLPort) style of hash_map is used, as I have no access to the DinkumWare STL. If someone sends me a patch, I will apply it.

[8] <www.kegel.com/c10k.html>

[9] Thanks, Theo.

Christopher Rooney (<http://crooney.com>) is cofounder of Fistful of Digits (<http://fodny.com>), a software-consulting firm in New York City.