Dr. Dobb's Journal March 1997
In my work -- implementing communications protocols -- an efficient and flexible dynamic memory-management implementation is critical to success. We usually cannot rely on library- or operating-system-provided memory allocators, for three reasons:
Most workstation environments and applications include some type of network access. For communications software to be considered successful, it should be transparent to both applications and users. Inefficient communications software becomes readily apparent. Hence, it is part of the communications-software developer's job to ensure that all functions, at least in the main path of the product, are blazingly fast. And since the primary purpose of communications software is to move data buffers from the network to applications and back, developers use dynamic memory allocators extensively. The allocators we use must be as efficient as possible. On a workstation, the communications software is shared among all processes and therefore operates under many of the same conditions as system software.
In fact, many communications-protocol implementations are part of an operating system. In some cases, however, workstation operating systems do not include communications software, and the communications-protocol implementations are generally combinations of device drivers, user libraries, and user processes. The allocators available to these implementations were not constructed for use in system software. As data travels from an application through the library, protocol implementation, device driver, and network device, information is added at each layer to the buffer that the same layer on the receiver, or intermediate node, uses to correctly implement the protocol.
Historically, protocol designs specify that each layer prepends its information to the beginning of the buffer. This requires a flexible allocator. Implementations of communications protocols are inherently concurrent: the code has to operate on behalf of a number of processes as well as wait on multiple interfaces to the network for incoming traffic. In addition, each process may have any number of threads using the communications subsystem.
All these conditions require strict attention to the freeing of memory. With your own implementation of a memory allocator, you can add functions that help you find memory leaks, double frees, and uses of pointers to freed memory. In this article, I'll review memory allocators in general and discuss the allocator used in typical implementations of TCP/IP. I'll also describe a number of allocator-implementation techniques that are helpful in debugging applications that use the allocator.
In The Art of Computer Programming: Fundamental Algorithms (Addison-Wesley, 1973), Donald Knuth identifies two considerations for dynamic memory allocators: "How is this partitioning of available space to be represented inside the computer?" and "Given this representation of the available spaces, what is a good algorithm for finding a block of n consecutive free spaces and reserving them?" The solutions to these two requirements are fundamental to half of any memory allocator. The other half, of course, is the deallocation of blocks no longer in use.
Additionally, you may consider using a technique known as "memory compaction," which coalesces consecutive blocks from the free list on deallocation. Alternatively, the choice of many communications-protocol implementations is not to compact memory but to keep n free lists for n different sizes of blocks allocated by the allocator. Once a particular size block is carved out of memory, allocated, and deallocated, it goes on the free list for its size. Subsequent allocations use the block from the free list. If the free list for a particular size is empty, then the block is created out of the available pool. This variation works well only when a large percentage of requests are for similarly sized blocks. For communications protocols, this assumption is generally valid.
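The non-compacting, per-size free-list scheme described above can be sketched as follows. The pool size, the two size classes, and all names here are my own illustrative choices, not taken from any particular implementation:

```c
/* Sketch of a segregated-free-list allocator: one free list per block
 * size, no compaction.  Freed blocks go on the list for their size and
 * are reused before the pool is carved further. */
#include <stdalign.h>
#include <stddef.h>

#define POOL_SIZE (64 * 1024)
#define NSIZES    2
static const size_t sizes[NSIZES] = { 128, 2048 };  /* common request sizes */

typedef struct block { struct block *next; } block;

static alignas(16) unsigned char pool[POOL_SIZE];
static size_t pool_used;                 /* bytes carved from the pool   */
static block *free_lists[NSIZES];        /* one free list per size class */

static int size_class(size_t n)
{
    for (int i = 0; i < NSIZES; i++)
        if (n <= sizes[i]) return i;
    return -1;                           /* too large for this allocator */
}

void *proto_alloc(size_t n)
{
    int c = size_class(n);
    if (c < 0) return NULL;
    if (free_lists[c]) {                 /* reuse a freed block first    */
        block *b = free_lists[c];
        free_lists[c] = b->next;
        return b;
    }
    if (pool_used + sizes[c] > POOL_SIZE) return NULL;
    void *p = pool + pool_used;          /* carve from the available pool */
    pool_used += sizes[c];
    return p;
}

void proto_free(void *p, size_t n)       /* caller passes the request size */
{
    int c = size_class(n);
    if (c < 0) return;
    block *b = p;
    b->next = free_lists[c];             /* push onto the size's free list */
    free_lists[c] = b;
}
```

Because nothing is ever split or merged, both operations are a handful of instructions; the cost is that memory dedicated to one size class is never returned for use by another.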
Knuth gives a number of algorithms for allocation and deallocation using lists, as well as a description of the buddy system. One of the most interesting dynamic allocation schemes, the buddy system, excels in situations where requests are for widely varying block sizes because of the ease with which free blocks can be coalesced. For every request of size n, the allocator reserves a block of size 2^j, where 2^(j-1) < n <= 2^j. So each block, internally to the allocator, is a power of two in size, but externally appears as a block of the size requested. For a memory pool of size 2^m there are m free lists, one for each power of two. For a request of size n, if there is no block on list j, then the allocator searches the lists up to m to find the smallest k > j for which there is a free block. This block is then split in half repeatedly until a block of size 2^j is created. All other blocks, including the other half (the buddy) of the 2^j block, are added to the appropriate free lists. The 2^j block is marked "in use" and returned to the caller. Note that the address of each block of size 2^j is a multiple of 2^j, so at least the lowest j consecutive bits are 0s. Finding the buddy of a block of size 2^j whose address is x is shown in Figure 1.
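Because a block of size 2^j sits at an address that is a multiple of 2^j (taken relative to the start of the pool), its buddy's address differs from its own only in bit j, so the buddy can be found with a single exclusive-or. A minimal sketch of this computation (the function name is mine):

```c
/* Compute the buddy's address for a block of size 2^j at address x,
 * where x is an offset from the base of the memory pool.  Flipping
 * bit j maps the low half of a 2^(j+1) region to the high half and
 * vice versa. */
#include <stdint.h>

uintptr_t buddy_of(uintptr_t x, unsigned j)
{
    return x ^ ((uintptr_t)1 << j);   /* flip bit j */
}
```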
Each block has a flag indicating whether it is in use, so the coalescing function on deallocation is a simple test of the flag of the buddy block. If the buddy of the block being deallocated is not in use, then the two buddies are coalesced and added to the free list for size 2^(j+1). The buddy system fragments memory in situations where many requests are slightly larger than a power of two.
Most TCP/IP implementations use an allocator that dispenses blocks of memory in only two sizes, a small size of 128 bytes and a large size of 2000 bytes. In The Design and Implementation of the 4.3BSD UNIX Operating System (Addison-Wesley, 1989), Samuel J. Leffler, Marshall Kirk McKusick, Michael J. Karels, and John S. Quarterman state that the implementation of the TCP/IP stack needs an allocator distinctly different from those used by other parts of the operating system. It has to be efficient and also very flexible, as requests are of widely varying sizes. They also mention the need to prepend protocol headers to the beginning of data buffers.
The basic unit of memory allocation used by the TCP/IP implementation is an m_buf. The 128-byte m_bufs can contain 112 bytes of data. The other 16 bytes are used for a link pointer, the length of the data in the m_buf, a data offset from the beginning of the m_buf, a pointer used to chain lists of m_bufs, and a type field. Data is held in a number of chained m_bufs and copied to contiguous memory in only two places: into a buffer provided by an application for a receive and into a buffer used by the network device. The advantage of using m_bufs as the internal storage structure is that headers and trailers can be prepended and appended without copying the data. The operation to prepend a header simply writes the header into an m_buf and adds the m_buf to the beginning of the chain.
Stripping headers, also a frequent operation, is accomplished by simply removing the m_bufs from the head of the chain. Note that each m_buf need not be completely full because each m_buf carries the length of the data and the offset to the start of the data in the m_buf. For example, if a header is 24 bytes long, one m_buf can be used with its length field set to 24 and offset to 12. For large data, the larger m_bufs are used. Flexibility is maintained with a reduction in the chain-traversal time.
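A simplified rendering of the m_buf structure and the copy-free prepend and strip operations might look like the following. The field names and layout are approximations in the spirit of the 4.3BSD definitions, not the actual BSD source (which squeezes its header into a 128-byte budget on 32-bit machines); here the offset is measured from the start of the data area for simplicity:

```c
/* Illustrative m_buf-style structure: data lives in chained buffers so
 * headers can be added and removed without copying payload bytes. */
#include <stddef.h>
#include <string.h>

#define MDAT 112                      /* data bytes per small m_buf */

struct m_buf {
    struct m_buf *m_next;             /* next m_buf in this chain        */
    struct m_buf *m_nextpkt;          /* chains whole lists of m_bufs    */
    int           m_off;              /* data offset within m_dat        */
    int           m_len;              /* bytes of data in this m_buf     */
    int           m_type;             /* allocator bookkeeping           */
    char          m_dat[MDAT];        /* the data itself                 */
};

/* Prepend a header without touching existing data: write it into a
 * fresh m_buf (hb) and link that m_buf at the head of the chain. */
struct m_buf *m_prepend(struct m_buf *chain, struct m_buf *hb,
                        const void *hdr, int hlen)
{
    hb->m_off = MDAT - hlen;          /* place header at the tail of m_dat */
    hb->m_len = hlen;
    memcpy(hb->m_dat + hb->m_off, hdr, hlen);
    hb->m_next = chain;
    return hb;                        /* new head of the chain */
}

/* Strip a header: simply drop the m_buf at the head of the chain. */
struct m_buf *m_strip(struct m_buf *chain)
{
    return chain ? chain->m_next : NULL;
}
```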
In implementations of other communications protocols, m_bufs are not used, but the requirements of prepending, efficiency, and flexibility still exist and are solved in other ways. One other solution to the prepend problem is to define the maximum amount of header space that would ever be prepended, add that amount to every request, and return to the caller a pointer into the buffer that leaves the defined amount free in front. Although some performance is gained because contiguous buffers are maintained and fewer data copies occur, this method is less flexible than using m_bufs. However, designers may be willing to trade flexibility for the performance gain. The efficiency problem is almost always solved by keeping preallocated buffers of common sizes on free lists and not compacting memory. When the implementors gain experience with their product, they can accurately predict the sizes of buffers that will be requested and can hand-tailor the operation of the allocator.
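The reserve-headroom alternative can be sketched as below. MAX_HDR and the function names are assumptions chosen for illustration; a real design would size the headroom from the deepest header stack the protocol suite can produce:

```c
/* Sketch of the "reserve headroom" scheme: every allocation is padded
 * with a fixed worst-case header space, and the caller receives a
 * pointer past that space, so headers can later be written in front of
 * the data without copying it. */
#include <stdlib.h>
#include <string.h>

#define MAX_HDR 64     /* worst-case total header space, fixed at design time */

void *buf_alloc(size_t n)
{
    unsigned char *raw = malloc(MAX_HDR + n);
    return raw ? raw + MAX_HDR : NULL;   /* data area starts after headroom */
}

/* Prepend hlen header bytes in place; returns the new start of the
 * buffer.  Valid while total prepended headers stay <= MAX_HDR. */
void *buf_prepend(void *data, const void *hdr, size_t hlen)
{
    unsigned char *p = (unsigned char *)data - hlen;
    memcpy(p, hdr, hlen);
    return p;
}

/* Caller passes the pointer originally returned by buf_alloc. */
void buf_free(void *data)
{
    free((unsigned char *)data - MAX_HDR);
}
```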
Two important opportunities become available once you decide to write your own allocator. First, as the software is tested, a great deal of information can be gathered on the performance of the allocator, and the implementors can then optimize the allocator specifically for a particular implementation. Second, a debug version of the allocator can be built to uncover errors such as code that writes past the end or beginning of its allocated memory, a common problem with certain programming languages. Although often slower, a debug allocator can be invaluable. One technique developed by a number of my coworkers at IBM is to allocate memory in buffers defined by the operating system, rather than allocating out of a single large block using either our own allocator or something like malloc. When code overwrites such an operating-system-defined buffer, for which the operating system holds length, ownership, and permission information, the operating system throws an exception that can be caught, and the overwrite is discovered. Developers are generally very surprised at the number of latent bugs uncovered when a debug allocator is used.
The implementations of communications protocols impose special requirements on dynamic-memory allocators. The usual variations of memory allocators reported in the literature do not suffice. They lack efficiency because they try to do too good a job with respect to compaction and try to support a general environment. The buddy system comes close but is still too inefficient, can cause too much fragmentation, and is not flexible enough to be easily modified as developers gain experience with their software.
My thanks to Mike Kawalec at IBM for permission to describe the debugging method he developed.
DDJ