C/C++ CONTRIBUTING EDITORS


The Standard Librarian: File-Based Containers

Matt Austern

Looking for a way to create a file-based container? You might need to look outside the Standard C++ library, and memory mapping may be the answer.


The Standard C++ library makes it easy to combine file I/O with generic algorithms. For structured files where you’re always reading (or writing) the same data type, you can treat the file as a sequence of values that you access with an istream_iterator (or ostream_iterator). When treating a file as a simple sequence of characters, you can access it with an istreambuf_iterator (or ostreambuf_iterator). So, for example, you can combine streambuf iterators with an STL algorithm to perform “y to k” conversion:

std::ifstream in("f1");
std::ofstream out("f2");
std::replace_copy
    (std::istreambuf_iterator<char>(in),
     std::istreambuf_iterator<char>(),
     std::ostreambuf_iterator<char>(out),
     'y', 'k');

When you turn to more complicated (and less contrived) examples, you begin to run into problems. You can use std::find to look for the first occurrence of a character c, but what if, instead of looking for a character, you need to look for a multi-character string? That’s what std::search is for, but you can’t use it: its arguments must be Forward Iterators, and istreambuf_iterator is only an Input Iterator.

This isn’t an arbitrary restriction. Think about how you would implement std::search yourself. If you’re looking for a string in an input sequence, and you find a partial match, you can’t just restart your search where you first noticed the mismatch: in general you have to restart it a few characters back. (Consider looking for “ababac” within “abababac”.) There are clever algorithmic methods for avoiding unnecessary backtracking, but it can’t be avoided entirely. Saving “bookmarks” is just what you can’t do with an Input Iterator like istreambuf_iterator: input works one character at a time, and once you’ve read a character, you’ve read it.

This is a familiar problem in I/O: it often turns out that you can’t tell you’ve read enough until you’ve read too much. If you’re reading an integer, and the input contains “123x”, you can’t tell you’ve reached the end of the integer until you’ve consumed the unwanted character “x”. There are also several familiar solutions.

First, you can sometimes recast an algorithm — by using auxiliary storage, or by changing the algorithm so it asks a slightly different question, or by accepting restrictions on its arguments or its error handling — so that it requires nothing more than Input Iterator semantics. For example, the std::time_get facet, in the Standard C++ library, reads day and month names using a version of string match that works with Input Iterators. It’s not as flexible or robust as std::search, but it’s adequate for its purposes.

Second, you can simply read more characters than you need and then put the unwanted characters back on the input stream. If you’re reading characters through a streambuf, you can use streambuf::sputbackc to “unread” a character. This technique is often easy and effective, but it has some limitations (you can’t count on being able to put back more than one character), and there’s no way to combine it with the iterator interface of istream_iterator or istreambuf_iterator.

Third, you can read some portion of the input file into an internal buffer and then work with the buffer instead of working directly with the input stream. You have to be careful if you’re doing something that crosses the boundary between one buffer and the next, but that isn’t always necessary: it’s often possible, for example, to work with a single line at a time.

All of these techniques are useful, but, from the perspective of combining I/O with generic algorithms, none of them are completely satisfactory. All of them involve algorithmic changes, and one of them won’t even let you use an iterator interface. You might be able to express the same algorithm with putback as you could with Forward Iterators, but you’d have to write that algorithm twice.

Working with an internal buffer is easy if you don’t cross buffer boundaries, and you can ensure that you don’t if you use a large enough buffer, one that holds the entire file. This just takes a few lines of code:

std::ifstream in("f1");
std::istreambuf_iterator<char> first
    (in);
std::istreambuf_iterator<char> last;
std::vector<char> buf(first, last);

The container buf provides Random Access Iterators. But this technique isn’t very satisfactory either: it’s fine if f1 is a small file, but unworkable once files get large.

A file on disk is (on most operating systems) just a sequence of characters — it looks a lot like a container. Why can’t we find some way to treat it as a container, without having to read the whole thing into memory?

Memory Mapping

We can, of course. Modern operating systems provide a richer set of I/O primitives than are available through the C or C++ Standard libraries, and some of these primitives can be made to fit into the C++ library’s framework.

Most of today’s operating systems allow memory-mapped I/O: associating a file with a region of memory, so that reading from (writing to) that region of memory gets translated into reading from (writing to) the file. To a program, this region of memory looks just like any other. You address it with a pointer, and you can use any of the usual pointer operations. Dereferencing a pointer into a memory-mapped region gives you the corresponding value in the file, and writing to a pointer into a memory-mapped region changes the corresponding value in the file.

It shouldn’t be surprising that this is possible: it’s not so very different from the way that an operating system manages virtual memory. When you write an address in your program, it sometimes gets translated to an address in physical memory, and it sometimes points to memory that has been paged out to disk. The operating system brings pages in and out of physical memory without programmer intervention. You can think of memory mapping as an interface that exposes some of these details.

Why do operating systems provide memory mapping? The basic answer is performance: a memory-mapped file can be read in place, without the extra copy between kernel space and user space that ordinary read operations require.

Aside from performance, memory-mapped I/O gives us the simple container-like interface that we want. Memory mapping lets us access the contents of a file through a pointer. Pointers are iterators, so all we have to do is write a wrapper class that encapsulates memory mapping and that fulfils the syntactic requirements of the Container interface (Table 65 of the C++ Standard — plus, since our container will provide random access iterators, Table 66).

The first requirement in Table 65 is that every container class X must contain a typedef, X::value_type, that identifies the type of the container’s elements. What should we choose for this container’s value type? One answer might be that, following such examples as std::vector and std::list, we should allow a fully general value type by writing a templatized container class. In this case, however, that choice would present problems: an arbitrary type’s in-memory representation may contain pointers, padding, or other artifacts of a particular run of the program, none of which remain meaningful when the bytes are written to disk and read back later.

All of these problems can be and have been solved, of course: the key would be that storing an object in a memory-mapped container would involve some kind of serialization protocol (replacing pointers with offsets, for example), not just a byte-for-byte copy. But serialization and file I/O are separate issues; it makes more sense to build a persistent storage system on top of a low-level I/O library than to mix levels together and design the whole library around serialization. A file is a sequence of bytes, so that’s what our container will provide. It will be a non-template class, and its value type will just be char.

The next question is what to do about reads versus writes. When you open a file, you provide a flag that determines whether you’re opening it read/write or read-only. This distinction isn’t reflected in the type system: you can try to write to a file that you’ve opened read-only, and the attempt will fail at run time. For a container, however, this isn’t a reasonable option. A container provides iterators, and iterators are either mutable or constant. We can’t provide mutable iterators into a read-only file.

This suggests that, if we want to support both read-only and read/write files, we need two different classes with different type declarations. For now I’ll only discuss the read-only class, ro_file.

Once we’ve made these decisions, it’s easy to fill in everything that Table 65 calls for. First come the typedefs: ro_file::value_type is char, ro_file::iterator and ro_file::const_iterator are both const char*, ro_file::reverse_iterator is std::reverse_iterator<const char*>, and so on. Next we provide data access. We keep track of a pointer to the beginning of the file (the base address that we get when we map the file into memory), and we keep track of the file’s length; begin returns base, and end returns base + length. The full class declaration is shown in Listing 1.

All of the real work is performed in the constructor and in the other member functions that set up the private member variables base and length. These member functions are not inline, so they are not part of the class declaration; they are also highly system dependent. Unix and Windows (and other operating systems) support memory mapping, but they support it in different ways. I will show a Unix implementation, which has been tested under Linux.

The general idea is simple: the constructor takes a filename, opens the file using the low-level system call open, finds the file’s length, and then maps the entire file into memory with mmap. All of these operations can fail, so we check return codes and throw an exception if necessary. Finally, we close the file. In Unix, a memory-mapped file doesn’t have to be kept open: the mapping isn’t lost until you call munmap.

The only part that’s at all tricky is assignment and copying. What should the semantics be? One answer might be that assignment should replace the contents of one file with another, and that the copy constructor should create and open a new file. But those answers don’t seem quite right: it would be strange to have one mutating operation in an otherwise read-only class, and it would be even stranger to have a file-copy operation that doesn’t let you name the copy. We’ll provide copying and assignment — but copying and assignment of handles, not files. When we copy a ro_file, we’ll memory map the file a second time. The two copies will have identical contents. (This is close to the edge of what’s allowed by the standard container requirements, since we’re not supposed to share the same object between two different containers, but it’s just barely legal. Our container is nonmodifiable, and our value type is char; there’s no way to tell the difference between two different nonmodifiable char objects and a single nonmodifiable char object with two different addresses.)

The full Unix implementation is shown in Listing 2.

Limitations

Memory mapping is a useful technique, but it’s not always appropriate.

The most obvious limitation of ro_file is in the name: It’s read-only. What would an rw_file class look like? That’s not completely clear. Assignment and copying would probably have to be changed: mapping the same file twice is reasonable for an immutable container, but a lot less reasonable for a mutable one. (It would be very strange if changing a value in a container C1 caused changes in an apparently unrelated container C2.) It’s also not clear how useful an rw_file container would be. We can use ro_file to avoid the limitations of an Input Iterator when reading from a file and to avoid the overhead of an extra copy between kernel space and user space, but neither of these arguments is as compelling for output as for input.

A second limitation is slightly less obvious. Unix programmers are used to thinking of a file as a simple sequence of characters and are used to thinking of a line as nothing more than the region in between two '\n' characters, but life isn’t always so simple. Text files in some operating systems can have more structure than that. (Some mainframe operating systems, for example, support text files with a fixed line length.) The C++ Standard, like the C Standard before it, distinguishes between opening a file in text and in binary mode; opening a file in text mode means that the standard library automatically takes care of all of these formatting requirements.

When you memory map a file, you’re presented with the exact sequence of bytes that are stored on disk. That is, memory mapping is effectively limited to binary-mode I/O. If you want to see a file as if it had been opened in text mode, then you’ll have to understand how your operating system deals with text files [2], and you’ll have to do the conversion yourself.

Third, this class doesn’t exploit all of the features of memory mapping. Most notably, it doesn’t support the common technique of using memory mapping for interprocess communication.

Finally, the error reporting from memory mapping is unfriendly. Since I/O operations are written as pointer dereference, I/O errors appear as memory access violations. This is more an issue for output than for input, but even input operations can fail — suppose, for example, that, while you’re reading from a file, some other process opens the file and truncates it. Memory mapping is most appropriate if you know that this sort of situation won’t arise, or if you can handle errors that appear as signals. (It’s possible to write a proxy class that wraps pointer dereference, traps signals, and translates them into C++ exceptions. The overhead, however, would be prohibitive. If you need that kind of error reporting, you’re probably better off with something other than memory-mapped I/O.)

Conclusions

File I/O in the Standard C++ library uses a simple model of reading and writing. For many purposes, this simple model is sufficient. Modern operating systems, however, provide a much richer set of operations than that: asynchronous I/O, signal-driven I/O, multiplexing, memory mapping. All of these techniques are used in real programs, and all of them are currently outside the scope of the Standard C++ library.

Memory mapping is one technique that fits naturally into the framework of the Standard C++ library, since a memory-mapped file looks very much like a container — and, for many generic algorithms, it’s more convenient to treat a file as a container than to look at it one character at a time. How to support other kinds of advanced I/O within the framework of the C++ library is an open question.

Notes and References

[1] W. Richard Stevens. Advanced Programming in the Unix Environment (Addison-Wesley, 1992).

[2] Fortunately, the answer is simple in some popular operating systems. In Unix, there is no distinction between text and binary files, and in Windows the only issue to worry about is that, if you open a text file in binary mode, you’ll see that the lines end with the two-character sequence "\r\n".

Matt Austern is the author of Generic Programming and the STL and the chair of the C++ standardization committee’s library working group. He works at AT&T Labs — Research and can be contacted at austern@research.att.com.