Disk Thrashing & the Pitfalls of Virtual Memory

Dr. Dobb's Journal May 2002

Solving software performance problems

By Bartosz Milewski

Bartosz is the author of C++ In Action, Industrial Strength Programming Techniques (Addison-Wesley, 2001) and president of Reliable Software, a distributed software company. He can be contacted at bartosz@relisoft.com.

Disk thrashing, also known as virtual memory thrashing, is among the more serious software performance problems. In this article, I will examine the mechanisms behind disk thrashing and explain how you can control it.

My most recent encounter with disk thrashing involved a version-control system called "Code Co-op" developed by my company (Reliable Software, http://www.relisoft.com/). One of Code Co-op's modes of synchronization involves sending change scripts via e-mail. Normally changes are small, but there are cases when large scripts must be exchanged. Since many users had problems sending large attachments through e-mail servers, we built script compression into Code Co-op. Everything worked fine in testing, even with large scripts. Users, however, had a different idea of what a large project was and disk thrashing suddenly raised its ugly head.

Disk thrashing occurs when programs have to deal with large amounts of data. When data size crosses a certain threshold, program performance can suddenly drop by orders of magnitude. In such cases, for instance, you won't see a lot of CPU activity when looking at the Windows Task Manager performance meter because time is being spent reading/writing to disk, making it look like the program is frozen. Still, you can hear the characteristic thrashing noises made by the disk.

Virtual Memory

To understand disk thrashing, you have to understand virtual memory. In DOS and early 16-bit versions of Windows, programs allocated only as much memory as was physically present in RAM, minus whatever the operating system needed. Modern operating systems don't limit allocations to the size of RAM, so you can allocate as much memory as there's room on disk, or up to the limit imposed on the size of your swap file.

To create the illusion of larger memory, the OS — with the help of the processor — performs some magic behind the scenes. When a new process starts, it is given access to virtual address space of 4 GB (the number of possible addresses on a 32-bit machine). But your program can't just randomly access any address in this range. If it tries to do so (for instance, by dereferencing a pointer), it generates an access violation, also called a "general protection" (GP) fault.

Memory has to be allocated before it can be accessed and the memory allocation mechanism must at some point call the OS to make raw chunks of address space accessible to the program. These basic chunks are multiples of page size, which on current Intel processors is equal to 4 KB. In Windows, programs can use this low-level allocation mechanism directly by calling the VirtualAlloc API. Normally, however, you do allocation using the run-time library's heap, which chops up large memory pages obtained from the OS into small pieces requested by malloc or new.

Memory allocated by your process is not immediately mapped to a region of RAM. (That's why it's called "virtual memory.") What the OS does during memory allocation is mark a certain range of address space as virtually allocated and reserve the corresponding space on disk in the swap file. If necessary, the system might grow the swap file. It's only when the system cannot grow the swap file — either because it has run out of disk space or because users impose limits on its growth — that memory allocation may fail. The call to new doesn't fail because there isn't enough RAM — it fails because there isn't enough disk space (of course, the process can run out of virtual address space, although this is uncommon on current PCs).

Since the freshly allocated memory is not mapped into RAM, accessing it causes a hardware exception — a page fault. The OS gets the first chance to respond to this fault. The system looks through its page tables and finds out if a given range of memory has been marked as allocated. If not, it turns the fault into access violation and passes it back to the program that caused it. If the memory has been allocated, the system quickly maps a range of virtual address space to a range of physical memory — the actual RAM. From then on, the processor translates all memory accesses within this virtual range into reads/writes to the corresponding range of physical memory. Virtual-to-physical address translation is implemented in hardware and is very fast.

Example of Mapping

To illustrate, suppose your program wants to allocate 256 MB of memory (that's 0x10000000 in hexadecimal) using the operator new:

char * p = new char [0x10000000];

Suppose also that your computer has 128 MB of physical memory, but lots of free disk space and the limit on the swap file is large. The call to new goes through the program's run time, which quickly determines that it doesn't have a chunk of memory handy to satisfy this allocation. It turns to the OS (for example, in Windows by calling VirtualAlloc). The system looks at the swap file, first trying to consolidate some free clusters within it, up to the total of 256 MB. If it fails, it grows the swap file.

The clusters in the swap file are then assigned to 65,536 consecutive pages in virtual memory (this is how many 4-KB pages there are in the 256 MB of memory that you requested). Suppose that the first such page corresponds to the virtual address of 0x40000000. This is the value returned by new and assigned to pointer p. (I'm conveniently ignoring the fact that the run time might offset this address slightly to make room for its own housekeeping.) Obviously, the address 0x40000000 cannot be used to access your computer's physical memory directly. The highest physical address in the example is 0x7FFFFFF.

But what happens when your program tries to access the newly allocated memory, for instance by dereferencing the pointer p, as in: p[0]='A';? The first such access causes a page fault. Since the memory at 0x40000000 has been allocated, the system satisfies the page fault by mapping a page of virtual memory to a page of physical memory (for performance reasons, the system usually maps more than one page at a time). How does it find a free physical page? The OS keeps track of the total usage of physical memory across all processes running on the computer. Suppose it finds a free physical page at the physical address 0xA000. It tells the processor that, until further notice, the range of virtual addresses between 0x40000000 and 0x40000FFF is mapped into the range of physical addresses between 0xA000 and 0xAFFF; see Figure 1. The page fault has been processed and the instruction causing it may proceed. The letter A is physically written into RAM at the address 0xA000. If you follow this write by another, as in p[1]='B';, for instance, the processor automatically maps it, and the letter B ends up at the physical location 0xA001, and so on.

Swapping

This is where the swap file comes into play: when, during the processing of a page fault, the system cannot find a free page in physical memory, it plays a game of "physical chairs" by unseating somebody else's virtual page. It does so by finding the least recently used (LRU) page and unmapping it. The physical page becomes free again and can satisfy the current page fault.

What happens to the unseated page? Somebody, maybe even the current process, might have written some data into it. If the page is indeed "dirty," the OS copies it to disk, into the cluster of the swap file that's been assigned to it. The page is now swapped out: it no longer has a corresponding physical address, and any access to it causes a page fault. The next time it is accessed, the page fault is processed differently. The page is swapped back in: as before, a fresh physical page is found (maybe by swapping some other page out), but this time it is filled with the data brought back from the swap file.

Example of Swapping

Returning to the example: Suppose you keep filling the memory buffer with data, for instance by running the loop:

for (int i = 0; i < 0x10000000; ++i)
    p [i] = 'x';

You keep faulting new pages every 4,096 writes (one page's worth) until the system runs out of physical memory, say, after reaching i=0x2D000. At that point, the system has to free some physical memory by swapping out a page. Suppose that it decides to swap out the first page you have accessed in this loop, the one starting at the virtual address 0x40000000. This page is dirty — you have just filled it with xs — so the system must write it to the swap file. That frees the physical page at 0xA000 and it can be used to satisfy the current page fault. The virtual page at 0x4002D000 is mapped to the physical page at 0xA000 and the loop continues executing. The next page fault most likely swaps out the page at 0x40001000 and so on. This process continues, with a window of physical memory sliding along a much larger buffer of virtual memory.

So what happens when you try to read the data at the beginning of the buffer, as in:

char c = p [0];

The page at the address pointed to by p has been swapped out. Since it's no longer mapped into any physical page, this access causes a page fault. The system checks its page table and finds out that the virtual page starting at 0x40000000 had been written into the swap file. It procures a new physical page (possibly by swapping out some other page), say at the physical address 0xF2000, fills it with data read from the swap file, and maps it to the virtual page at 0x40000000. Now the same virtual address that was once mapped to the physical address 0xA000 is mapped to a different physical address 0xF2000. But because of the swapping, it contains exactly the same data as before. Your program is completely unaware of all these behind-the-scenes machinations. There are, however, situations when users are aware of what's going on, like when your program starts thrashing the disk.

Thrashing

Our first implementation of compression in Code Co-op used a standard vector to store data. A standard vector is a handy dynamic buffer, most useful when you don't know in advance how much memory you will need. Indeed, you never know how much space will be needed for compressed data (presumably much less than the original size) during compression. Conversely, you never know how much memory to allocate for decompressed data during decompression. Using vectors, you can just keep pushing one byte at a time and not worry about any allocations. So that's what we did.

Internally, vectors contain buffers. In the beginning, the buffer is small. If you keep adding new entries, however, the vector runs out of buffer space and has to reallocate. It first allocates a new buffer that's twice the size of the old one. It copies the contents of the old buffer into the first half of the new buffer, swaps the two buffers, and deallocates the old one. Because of the doubling in size, the amortized performance of adding a new entry to a vector is constant. The copying itself takes O(N), but it happens less and less often as the vector grows, so its cost is diluted to a constant per insertion. In other words, the amortized cost of adding an element is always asymptotically constant, although under some circumstances the multiplier in front of that constant can be unusually large. We learned this the hard way: from complaints from our users.

For instance, one user hit a situation where the size of the output from the decompressor was much larger than the size of the available physical memory. The decompressor kept adding bytes to the output vector and the vector kept doubling its internal buffer. When the buffer reached a threshold size, it no longer fit in physical memory. The writes to the top of the buffer caused page faults that had to be satisfied by swapping out the pages at the bottom of the buffer. This wasn't a tragedy since the OS bunches the writes to the swap file into large chunks (64 KB is a common number). Adding more data became somewhat slower, due to occasional disk writes. The real trouble started at the next doubling of the buffer. The new buffer was allocated in virtual memory and the copying of the old buffer started. At that point, most of the old buffer had already been swapped out. Here's what happened:
  1. To fault in a chunk of the old buffer, the system first had to free physical pages by swapping out (writing to the swap file) some of the recently touched pages.

  2. The chunk of the old buffer was read back in from the swap file.

  3. Its contents were copied into the new buffer, whose pages were now dirty and, with physical memory still full, soon had to be swapped out in turn (another write to the swap file).

This procedure involving two disk writes and one disk read had to be repeated every 64 KB; see Figure 2. Moreover, the three disk accesses were likely scattered inside the swap file, thus requiring long seeks. A disk seek is orders of magnitude slower than random memory access, so even though the performance of the vector was still constant, the new multiplier was orders of magnitude larger. Disk seeks, which usually involve moving magnetic heads, are almost always audible, leading to characteristic sound effects associated with disk thrashing.

We didn't need sophisticated tools to diagnose our thrashing problem. It was enough to break into the program a few times under the debugger where we would invariably catch it inside the C++ Standard Library, busily resizing the decompressor's output vector.

Simple Remedy

A vector that is much larger than the available physical memory behaves more like a disk file than a random-access data structure. Data written to such a vector is periodically flushed to the disk (paged out). It has to be read back from the disk (paged in) when the vector is accessed again. Physical memory plays the role of a cache that keeps the freshly read or written parts of the file temporarily available for fast access.

If you look at a vector as a file, its growth mechanism seems inefficient. For instance, here is the algorithm using the language of file access:

  1. Start with a fixed-size file.

  2. Keep writing data into the file until either:

    • (a) You're done (exit).
    • (b) The file is completely filled.
  3. Create a new file, twice the size of the old file.

  4. Copy the contents of the old file into the new one.

  5. Delete the old file.

  6. Go back to 2.

This is definitely not the algorithm the filesystem uses to grow files. The filesystem never copies data when it has to extend a file — it allocates another cluster and links it to the previously filled clusters. File clusters don't have to be contiguous on disk because the filesystem has additional data structures, which serve as directories of clusters for each file.

Therefore, the simplest way to avoid disk thrashing when dealing with large allocations is to use data structures that imitate those used by the filesystem. Fortunately, the Standard Library has a replacement for a vector, called a "deque," with a structure similar to that of a file. Just as a file is divided into clusters, a deque is divided into blocks. The tradeoff of using noncontiguous memory is that random access to a deque is slightly slower than random access to a vector, because every access goes through a directory of blocks (typically an array of pointers), adding a level of indirection. Iteration, on the other hand, is almost as fast.

The most important difference from our point of view is in the growth mechanism. When you keep adding new elements to a deque, it keeps allocating new blocks. The blocks are large, so allocations don't happen very often. When physical memory is tight, old blocks will quietly get swapped out. But since there is no behind-the-scenes copying of old blocks, they stay swapped-out until you are done filling the deque.

Changing the implementation of Code Co-op's compressor/decompressor to use a deque instead of a vector was trivial since these two data structures have almost identical interfaces. The difference in performance, on the other hand, was staggering. Disk thrashing became history.

Things to Avoid

Whenever disk thrashing is a consideration, carefully choose your data structures and algorithms. The trick is to imagine that you are dealing with disk files rather than in-memory data structures. Large files have exceedingly bad performance characteristics when you're trying to access them randomly. (In the worst case, each access translates into a disk read.) If possible, large files should be read sequentially because filesystems are optimized for sequential access. Anticipating your program's needs, filesystems usually read many clusters ahead and keep them cached in memory.

A standard vector is not a bad choice for large amounts of data, as long as you know its target size up front and start by resizing it or reserving the room it needs for growth. Calling resize has the side effect of invoking default constructors on all elements of the vector. This most likely faults the whole vector in, which might not be such a good idea — calling reserve is better. To avoid thrashing, both fill the vector sequentially and access it sequentially. If, on the other hand, you're planning on accessing your data randomly, a vector is a bad choice.

Even worse are data structures that use many small allocations. A linked list comes to mind. Although linked lists are usually accessed sequentially, following the sequence of links might cause one page fault after another. Consecutive links are not necessarily allocated from the same page of virtual memory. The nightmare scenario occurs when each link is allocated from a different page. Similar arguments apply to hash tables, sets, and maps. They not only use small allocations, but they are also specifically designed for random access.

Algorithms that require random access should not be used on large data structures. The two most important groups of algorithms are sorting and searching.

QuickSort, which stands behind the standard library's sort algorithm, is a prime candidate for thrashing. (In general, any algorithm that demands random-access iterators should immediately raise a red flag.) A better choice is merge sort, which was designed specifically for sorting large files. The standard library's merge algorithm uses input and output iterators rather than random-access iterators.

Searching algorithms are tied to the data structures on which they operate. The crudest data structure, with no internal ordering, requires linear search. Repetitive linear searching through a large vector or deque is bound to cause thrashing. Searching through a linked list is even worse. Searching through sorted sequences can be much more efficient: binary search, for instance, causes only O(log N) page faults. Look in the standard library for binary_search, as well as the lower_bound and upper_bound algorithms.

Most data structures designed for fast searching, like hash tables or binary trees (standard sets and maps), will not work well with large amounts of data. Up to a certain size, a two-tiered approach is reasonable. Divide your sorted data into large, but not too large, chunks that can each be brought from disk in a single read. Keep a small searchable directory of chunks that stays in memory most of the time. A search through such a data structure quickly finds the correct chunk using the directory, then proceeds to search the chunk. The chunk might have to be faulted in, but once in memory, it can be searched using any algorithm.

For really gargantuan data sets, the two-tiered approach breaks down when the directory of chunks no longer fits in memory. Again, look to the filesystem for guidance. The B-tree data structure has been used as a backbone for many filesystems. B-trees are multitiered directories organized in such a way that their traversal requires as few disk hits as possible. A search using B-trees causes the least amount of thrashing. Currently there is no B-tree algorithm in the standard library.

Conclusion

For the most part, I've focused here on two levels of storage — disk and memory. In truth, there are more levels, starting with the Internet — a storage system that is even slower than your local disk and mostly read-only but of gigantic (and constantly growing) capacity. You rarely access the Internet directly — the browser usually "faults in" web pages from remote servers to your local disk. Recently visited pages are cached for days or weeks.

On the other end of the spectrum, the processor has an ultra-fast set of registers and a fast on-chip memory cache. As processors become faster, accessing main memory has become more of a bottleneck, so the processor uses an internal cache to keep recently accessed data close at hand. If current trends continue, we might soon start using the techniques described here to combat physical-memory thrashing. This is not such a far-fetched idea if you consider that there are physical limits on the speed of main-memory access. A signal moving at the speed of light can only travel about one foot between the clock ticks of a 1-GHz processor. As long as main memory is separated from the CPU by distances comparable to the size of a desktop or laptop computer, the laws of physics will prevent off-chip access speeds from keeping up with on-chip access speeds. You'll have to get used to treating RAM the way you treat disks now: as backing store for the fast cache.

DDJ