Looking for a way to create a file-based container? You might need to look outside the Standard C++ library, and memory mapping may be the answer.
The Standard C++ library makes it easy to combine file I/O with generic algorithms. For structured files where you're always reading (or writing) the same data type, you can treat the file as a sequence of values that you access with an istream_iterator (or ostream_iterator). When treating a file as a simple sequence of characters, you can access it with an istreambuf_iterator (or ostreambuf_iterator). So, for example, you can combine streambuf iterators with an STL algorithm to perform y to k conversion:
std::ifstream in("f1");
std::ofstream out("f2");
std::replace_copy(std::istreambuf_iterator<char>(in),
                  std::istreambuf_iterator<char>(),
                  std::ostreambuf_iterator<char>(out),
                  'y', 'k');

When you turn to more complicated (and less contrived) examples, you begin to run into problems. You can use std::find to look for the first occurrence of a character c, but what if, instead of looking for a character, you need to look for a multi-character string? That's what std::search is for, but you can't use it: its arguments must be Forward Iterators, and istreambuf_iterator is only an Input Iterator.
This isn't an arbitrary restriction. Think about how you would implement std::search yourself. If you're looking for a string in an input sequence, and you find a partial match, you can't just restart your search where you first noticed the mismatch: in general you have to restart it a few characters back. (Consider looking for ababac within abababac.) There are clever algorithmic methods for avoiding unnecessary backtracking, but it can't be avoided entirely. Saving bookmarks is just what you can't do with an Input Iterator like istreambuf_iterator: input works one character at a time, and once you've read a character, you've read it.
This is a familiar problem in I/O: it often turns out that you can't tell you've read enough until you've read too much. If you're reading an integer, and the input contains 123x, you can't tell you've reached the end of the integer until you've consumed the unwanted character x. There are also several familiar solutions.
First, you can sometimes recast an algorithm by using auxiliary storage, or by changing the algorithm so it asks a slightly different question, or by accepting restrictions on its arguments or its error handling, so that it requires nothing more than Input Iterator semantics. For example, the std::time_get facet, in the Standard C++ library, reads day and month names using a version of string matching that works with Input Iterators. It's not as flexible or robust as std::search, but it's adequate for its purposes.
Second, you can simply read more characters than you need and then put the unwanted characters back on the input stream. If you're reading characters through a streambuf, you can use streambuf::sputbackc to unread a character. This technique is often easy and effective, but it has some limitations (you can't count on being able to put back more than one character), and there's no way to combine it with the iterator interface of istream_iterator or istreambuf_iterator.
Third, you can read some portion of the input file into an internal buffer and then work with the buffer instead of working directly with the input stream. You have to be careful if you're doing something that crosses the boundary between one buffer and the next, but that isn't always necessary: it's often possible, for example, to work with a single line at a time.
All of these techniques are useful, but, from the perspective of combining I/O with generic algorithms, none of them is completely satisfactory. All of them involve algorithmic changes, and one of them won't even let you use an iterator interface. You might be able to express the same algorithm with putback as you could with Forward Iterators, but you'd have to write that algorithm twice.
Working with an internal buffer is easy if you don't cross buffer boundaries, and you can ensure that you don't if you use a large enough buffer: one that holds the entire file. This just takes a few lines of code:
std::ifstream in("f1");
std::istreambuf_iterator<char> first(in);
std::istreambuf_iterator<char> last;
std::vector<char> buf(first, last);

The container buf provides Random Access Iterators. But this technique isn't very satisfactory either: it's fine if f1 is a small file, but unworkable once files get large.
A file on disk is (on most operating systems) just a sequence of characters; it looks a lot like a container. Why can't we find some way to treat it as a container, without having to read the whole thing into memory?
Memory Mapping
We can, of course. Modern operating systems provide a richer set of I/O primitives than are available through the C or C++ Standard libraries. Some of these choices can be made to fit into the C++ library's framework.
Most of today's operating systems allow memory-mapped I/O: associating a file with a region of memory, so that reading from (writing to) that region of memory gets translated into reading from (writing to) the file. To a program, this region of memory looks just like any other. You address it with a pointer, and you can use any of the usual pointer operations. Dereferencing a pointer into a memory-mapped region gives you the corresponding value in the file, and writing to a pointer into a memory-mapped region changes the corresponding value in the file.
It shouldn't be surprising that this is possible: it's not so very different from the way that an operating system manages virtual memory. When you write an address in your program, it sometimes gets translated to an address in physical memory, and it sometimes points to memory that has been paged out to disk. The operating system brings pages in and out of physical memory without programmer intervention. You can think of memory mapping as an interface that exposes some of these details.
Why do operating systems provide memory mapping? The basic answer is performance.
- Memory mapping can conserve space. Mapping a file into memory lets you access the file as if you had read it into an array (it's part of your program's address space), but you don't have to allocate physical memory for that array.
- Memory mapping can save time. When you do ordinary file I/O, using functions like istream::read or ostream::write, the operating system uses some internal buffer of its own to interact with the physical hardware, but then must copy the data between that buffer and yet another buffer in your own program. Memory mapping avoids that extra copy, because it exposes the operating system's internal buffer. In a test of simple file copying, Stevens [1] found that memory mapping sometimes gave more than a factor-of-two speedup.
Aside from performance, memory-mapped I/O gives us the simple container-like interface that we want. Memory mapping lets us access the contents of a file through a pointer. Pointers are iterators, so all we have to do is write a wrapper class that encapsulates memory mapping and that fulfils the syntactic requirements of the Container interface (Table 65 of the C++ Standard, plus, since our container will provide Random Access Iterators, Table 66).
The first requirement in Table 65 is that every container class X must contain a typedef, X::value_type, that identifies the type of the container's elements. What should we choose for this container's value type? One answer might be that, following such examples as std::vector and std::list, we should allow a fully general value type by writing a templatized container class. In this case, however, that choice would present problems.
- Classes with nontrivial constructors would be a problem. The whole point of having a constructor, after all, is that you can't just view an object as a sequence of bytes. In general, for example, you can't copy objects using memcpy.
- Even leaving constructors aside, classes that contain pointers would be a problem. When you memory map a file, the operating system chooses the base address. If you wrote an object to disk and memory mapped it in again, the pointers would no longer point to the right place.
- Even fundamental data types, like long and float, would be a problem. The binary representation of fundamental types varies from system to system; writing a float to disk as a series of raw bytes, and then reading it in again on a different system, is unlikely to give you the right answer.
All of these problems can be and have been solved, of course: the key is that storing an object in a memory-mapped container would involve some kind of serialization protocol (replacing pointers with offsets, for example), not just a byte-for-byte copy. But serialization and file I/O are separate issues; it makes more sense to build a persistent storage system on top of a low-level I/O library than to mix levels together and design the whole library around serialization. A file is a sequence of bytes, so that's what our container will provide. It will be a non-template class, and its value type will just be char.
The next question is what to do about reads versus writes. When you open a file, you provide a flag that determines whether you're opening it read/write or read-only. This distinction isn't reflected in the type system: you can try to write to a file that you've opened read-only, and the attempt will fail at run time. For a container, however, this isn't a reasonable option. A container provides iterators, and iterators are either mutable or constant. We can't provide mutable iterators into a read-only file.
This suggests that, if we want to support both read-only and read/write files, we need two different classes with different type declarations. For now I'll only discuss the read-only class, ro_file.
Once we've made these decisions, it's easy to fill in everything that Table 65 calls for. First come the typedefs: ro_file::value_type is char, ro_file::iterator and ro_file::const_iterator are both const char*, ro_file::reverse_iterator is std::reverse_iterator<const char*>, and so on. Next we provide data access. We keep track of a pointer to the beginning of the file (the base address that we get when we map the file into memory), and we keep track of the file's length; begin returns base, and end returns base + length. The full class declaration is shown in Listing 1.
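A sketch of what such a declaration might look like, following the description above (this is illustrative, not a reproduction of Listing 1; the member names base and length come from the text, the remaining details are assumptions):

```cpp
#include <cstddef>
#include <iterator>

// Illustrative sketch of a read-only, memory-mapped file container.
class ro_file {
public:
    typedef char                               value_type;
    typedef const char*                        iterator;
    typedef const char*                        const_iterator;
    typedef const char&                        reference;
    typedef const char&                        const_reference;
    typedef std::size_t                        size_type;
    typedef std::ptrdiff_t                     difference_type;
    typedef std::reverse_iterator<const char*> reverse_iterator;
    typedef std::reverse_iterator<const char*> const_reverse_iterator;

    explicit ro_file(const char* filename);   // open and map the file
    ro_file(const ro_file&);                  // map the same file again
    ro_file& operator=(const ro_file&);
    ~ro_file();                               // unmap

    iterator  begin() const { return base; }
    iterator  end()   const { return base + length; }
    size_type size()  const { return length; }
    bool      empty() const { return length == 0; }

private:
    const char* base;     // address returned by mmap
    size_type   length;   // the file's length in bytes
};
```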
All of the real work is performed in the constructor and the other member functions that set up the private member variables base and length. These member functions are not inline, so they are not part of the class declaration. These member functions are also highly system dependent. Unix and Windows (and other operating systems) support memory mapping, but they support it in different ways. I will show a Unix implementation, which has been tested under Linux.
The general idea is simple: the constructor takes a filename, opens the file using the low-level system call open, finds the file's length, and then maps the entire file into memory with mmap. All of these operations can fail, so we check return codes and throw an exception if necessary. Finally, we close the file. In Unix, a memory-mapped file doesn't have to be kept open: the mapping isn't lost until you call munmap.
The only part that's at all tricky is assignment and copying. What should the semantics be? One answer might be that assignment should replace the contents of one file with another, and that the copy constructor should create and open a new file. But those answers don't seem quite right: it would be strange to have one mutating operation in an otherwise read-only class, and it would be even stranger to have a file-copy operation that doesn't let you name the copy. We'll provide copying and assignment, but copying and assignment of handles, not files. When we copy a ro_file, we'll memory map the file a second time. The two copies will have identical contents. (This is close to the edge of what's allowed by the standard container requirements, since we're not supposed to share the same object between two different containers, but it's just barely legal. Our container is nonmodifiable, and our value type is char; there's no way to tell the difference between two different nonmodifiable char objects and a single nonmodifiable char object with two different addresses.)
The full Unix implementation is shown in Listing 2.
Limitations
Memory mapping is a useful technique, but it's not always appropriate.
The most obvious limitation of ro_file is in the name: it's read-only. What would an rw_file class look like? That's not completely clear. Assignment and copying would probably have to be changed: mapping the same file twice is reasonable for an immutable container, but a lot less reasonable for a mutable one. (It would be very strange if changing a value in a container C1 caused changes in an apparently unrelated container C2.) It's also not clear how useful an rw_file container would be. We can use ro_file to avoid the limitations of an Input Iterator when reading from a file, and to avoid the overhead of an extra copy between kernel space and user space, but neither of these arguments is as compelling for output as for input.
A second limitation is slightly less obvious. Unix programmers are used to thinking of a file as a simple sequence of characters, and of a line as nothing more than the region between two '\n' characters, but life isn't always so simple. Text files in some operating systems can have more structure than that. (Some mainframe operating systems, for example, support text files with a fixed line length.) The C++ Standard, like the C Standard before it, distinguishes between opening a file in text mode and in binary mode; opening a file in text mode means that the standard library automatically takes care of all of these formatting requirements.
When you memory map a file, you're presented with the exact sequence of bytes that is stored on disk. That is, memory mapping is effectively limited to binary-mode I/O. If you want to see a file as if it had been opened in text mode, then you'll have to understand how your operating system deals with text files [2], and you'll have to do the conversion yourself.
Third, this class doesn't exploit all of the features of memory mapping. Most notably, it doesn't support the common technique of using memory mapping for interprocess communication.
Finally, the error reporting from memory mapping is unfriendly. Since I/O operations are written as pointer dereferences, I/O errors appear as memory access violations. This is more an issue for output than for input, but even input operations can fail: suppose, for example, that, while you're reading from a file, some other process opens the file and truncates it. Memory mapping is most appropriate if you know that this sort of situation won't arise, or if you can handle errors that appear as signals. (It's possible to write a proxy class that wraps pointer dereference, traps signals, and translates them into C++ exceptions. The overhead, however, would be prohibitive. If you need that kind of error reporting, you're probably better off with something other than memory-mapped I/O.)
Conclusions
File I/O in the Standard C++ library uses a simple model of reading and writing. For many purposes, this simple model is sufficient. Modern operating systems, however, provide a much richer set of operations than that: asynchronous I/O, signal-driven I/O, multiplexing, memory mapping. All of these techniques are used in real programs, and all of them are currently outside the scope of the Standard C++ library.
Memory mapping is one technique that fits naturally into the framework of the Standard C++ library, since a memory-mapped file looks very much like a container, and, for many generic algorithms, it's more convenient to treat a file as a container than to look at it one character at a time. How to support other kinds of advanced I/O within the framework of the C++ library is an open question.
Notes and References
[1] W. Richard Stevens. Advanced Programming in the Unix Environment (Addison-Wesley, 1992).
[2] Fortunately, the answer is simple in some popular operating systems. In Unix, there is no distinction between text and binary files, and in Windows the only issue to worry about is that, if you open a text file in binary mode, you'll see that the lines end with the two-character sequence "\r\n".
Matt Austern is the author of Generic Programming and the STL and the chair of the C++ standardization committees library working group. He works at AT&T Labs Research and can be contacted at austern@research.att.com.