January 2001 Experts Forum/The (B)Leading Edge

C++ Experts Forum

The (B)Leading Edge: Using IOStreams, Part I

Jack W. Reeves

Introduction

Welcome to the new version of "The (B)Leading Edge." I would like to thank the editors and publisher of C/C++ Users Journal for the opportunity to continue writing this column. If you are a regular reader, I hope you will continue to find it interesting and useful. If you are a new reader, I hope you will be glad you found it.

Before I jump into IOStreams, I need to acknowledge some feedback. After my column on object-oriented design [1], reader Robert Allan Schwartz took me to task for claiming that Java was a "pure object-oriented language." As he says:

"Any language that contains types like boolean, char, byte, short, int, long, float, and double (none of which are classes) is not "pure"-ly object oriented.

Just because Java does not support non-member data and non-member functions does not mean 'anything written in Java is automatically object-oriented by definition.'"

He is absolutely correct. I replied that I didn't subscribe to the idea that Java was a pure object-oriented language myself, but that I had simply passed along a common assertion. Furthermore, my intention in doing so was to take a little tongue-in-cheek poke at those who do subscribe to this belief, and who may unconsciously also believe that code written in Java is somehow "purer," and hence better, than code written in C++. Stated so baldly, this assumption is pretty clearly nonsense. Nevertheless, I concede that Schwartz is correct: if I am going to write this column, I owe it to my readers to debunk nonsense, not promulgate it. So let me now make the following assertion: Java is not a "pure" object-oriented programming language. Rather, I consider Java to be a "practical" object-oriented programming language. Just like C++.

With this column, I am going to kick off a fairly in-depth look at the Standard IOStream library. As always, my approach will be based upon my own real-world experiences (or what passes for it in my case). First let me provide a little background. As you probably know, the IOStream library has been part of the C++ library for a long time. Certainly, it was available when I first started to use C++ in 1990.

When I encountered the IOStream library, I was immediately a fan. Here was an extensible, type-safe I/O mechanism that was built as an ordinary library and did not depend upon special support from the compiler. At the same time, it offered an efficient and compact notation. It was possible to write
os << "X=" << x << endl;
and leave it up to the compiler to know what type x really was and leave it up to x to know how to output itself. This was what object-oriented programming was supposed to be all about. So, I immediately stopped using the old C stdio library and began to use IOStreams exclusively. The more I used the library, and in particular the more I studied the library, the more impressed I became. I can honestly say that I learned a lot about how to do good design in C++ by studying the IOStream library. Here are a few examples of some of the things I found impressive about the IOStreams library even back in the early 1990s. (I will use the past tense to refer to the pre-Standard version of the IOStreams library and the present tense for referring to the Standard-compliant IOStreams. Most remarks will apply to both unless I specifically note otherwise, but I will try to avoid constant usage of the phrase "is/was.")

Abstraction: having a base class that provides an abstraction, and derived classes that provide different implementations, is a fundamental part of object-oriented design. It is easy to overlook just how good the IOStreams design was in that respect. About the only time I ever had to care about the actual type of IOStream that I was using was when I found it necessary to explicitly add an ends manipulator to a strstream. These days I use a stringstream (see sidebar, "strstream vs. stringstream") and even that distinction has disappeared. While the Standard IOStreams library comes with file-based streams and memory-based streams, lots of people have added implementations for other types of streams. We will be adding a few ourselves before this column is over (and many more before I am through).

Multiple layers of abstraction: the IOStreams library was built on multiple levels of abstraction. This is just plain good design. I repeatedly see high-level abstractions that are implemented in C for all practical purposes. Decomposing a problem into smaller pieces makes just as much sense for object-oriented design as it ever did before. In the object-oriented sense, the "smaller" pieces should themselves be good abstractions. As IOStreams shows, those lower-level abstractions can significantly simplify the implementation and often turn out to be extremely useful in their own right.

Multiple inheritance: the IOStreams library was one of the first good examples of the use of multiple inheritance that I encountered. It remains one of the best examples to this day.

Useful idioms: the IOStreams library contains a number of useful idioms that are applicable to a lot of other class designs. One of my favorites is the automatic conversion operators that allow the status of an IOStream object to be tested directly:
if (istrm) . . . // ok to use it
I use this idiom repeatedly in my own classes, and I wonder why it is not more commonly used.
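
As a minimal sketch of how this status-test idiom can be borrowed for an ordinary class, consider the following. The class Char_reader and the helper count_chars are hypothetical names invented for this illustration. The sketch uses the C++11 explicit operator bool; pre-C++11 streams achieved the same effect with a conversion to void*.

```cpp
#include <string>

// Hypothetical class for illustration: hands out the characters of a
// string one at a time and remembers whether the last read succeeded.
class Char_reader {
public:
    explicit Char_reader(const std::string& s) : str_(s), pos_(0), ok_(true) {}

    // Returns the next character, or '\0' (and enters a failed state)
    // once the end of the string has been reached.
    char next() {
        if (pos_ >= str_.size()) { ok_ = false; return '\0'; }
        return str_[pos_++];
    }

    // The idiom: let the object itself be tested in a condition.
    explicit operator bool() const { return ok_; }

private:
    std::string str_;
    std::string::size_type pos_;
    bool ok_;
};

// Counts characters by reading until the reader reports failure.
inline int count_chars(const std::string& s) {
    Char_reader r(s);
    int n = 0;
    for (r.next(); r; r.next())   // "r" is tested just like a stream
        ++n;
    return n;
}
```

Making the conversion explicit keeps the object usable in if, while, and for conditions without letting it silently convert to an integer elsewhere.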

It is the extensibility of the IOStreams library that is probably its best known aspect, however. In fact, the IOStreams library might just as well be called "the extensible IOStreams library." This extensibility comes in many different forms:

The simplest form is the ability to define input and output operators externally to the IOStreams library itself. This is so common that most of us do not even think about it when we write a new operator<< for some class. Nevertheless, this doesn't "just happen" to work. It works because it was designed that way.

Besides the ability to add new operators to the IOStreams library without having to change the library itself, the library provides mechanisms to allow extensible formatting information on a per-class basis, again without having to change the library itself. I will explore this mechanism in a future column.

The IOStreams library can be extended by derivation. Again, this doesn't "just happen." There is far more to designing a good base class than just making some of the functions virtual. I have often maintained that designing the protected part of a class's interface is the hardest part of doing a good abstraction. The protected part is the abstraction that is exposed to derived classes, but not to ordinary clients. Many class designs abdicate this responsibility by just declaring all the data as protected (or by providing protected accessors and mutators for all the data elements). This is effectively no abstraction at all, since it means that any changes to the base class's default implementation cause ripple changes to any derived classes. Alternatively, many classes just provide a "pure interface" with no data, no default implementations, and all functions declared virtual. This just transfers the full burden of implementation to every derived class [2]. The IOStreams library takes a middle approach. It provides a protected interface that is actually an abstraction, but lower level (and less safe) than the client abstraction. Then it provides a default implementation for all parts of this abstraction. This is one of the areas where the Standard version of IOStreams improves upon its predecessor.

Finally, the Standard IOStreams library is template based. This provides another dimension of extensibility that was not present in prior versions. In a future column, I will be taking advantage of this extensibility.
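
The simplest of the forms listed above, an output operator defined outside the library, can be sketched as follows. The type Point and the helper to_text are hypothetical names used only for illustration.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical user type; nothing in the IOStreams library knows about it.
struct Point {
    int x;
    int y;
};

// A non-member inserter: written outside the library, found by ordinary
// overload resolution, and chainable because it returns the stream.
std::ostream& operator<<(std::ostream& os, const Point& p) {
    return os << '(' << p.x << ',' << p.y << ')';
}

// Formats a Point through an ostream; an ostringstream collects the text.
std::string to_text(const Point& p) {
    std::ostringstream os;
    os << "P=" << p;
    return os.str();
}
```

Because the operator takes a std::ostream&, the same code works unchanged with file streams, string streams, or any derived stream class.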

This column is intended as an introduction to IOStreams design, not a tutorial on IOStreams usage. To set the stage, I am going to jump right into the very heart of IOStreams by providing a fairly detailed example of creating a derived stream class. I have found that in spite of the fact that IOStreams is designed to be extended, many programmers seem to feel that extending IOStreams is too complex or too much work. There is no better way to get past this aversion than to dive right in.

What I want is an IOStream class to replace some of the functionality of the deprecated strstream class that is not available with the new stringstream class. A little background: once upon a time, I created a class called substring [3]. This class was created specifically to allow me to efficiently parse input strings. Naturally, one of the things I needed to parse out of the strings was numbers. When I created substring, I ran into a problem: there was no Standard library routine that could be used to extract a number from a substring. The Standard C library provided some algorithms that did what I wanted (declared in stdlib.h), but they expected to receive a null-terminated character array as input. In order to create such an array from a substring, I would have to make a copy of the data. The new stringstream class would parse a number, but it required a string object as input, and again that would require making a copy of the substring's data. Finally, the strstream class came close to providing what I needed, but not exactly. Besides, it was deprecated, and I did not want to use it for that reason alone.

Eventually, I created my own strtoint and strtodouble algorithms. Their interface was modeled after the other STL algorithms (i.e., they operated on character sequences defined by a pair of iterators). This worked fine, but I was never really happy about it. There are a couple of areas where you expect your standard library implementation to be really tailored to a specific platform. One of these areas is memcpy and its cousins. Another area is numeric-to-string conversions. A lot of processors provide low-level instructions that perform these operations much more efficiently than is possible with generic C/C++ code. A good library implementation will take advantage of these low-level operations. This means that it is always prudent to route such operations through the standard library, if possible. Even if the library is a generic implementation, you haven't lost anything by using it. If it is tailored for the specific processor/platform there can be considerable gain.

Another reason for wanting to use the Standard IOStreams mechanism for parsing numbers is the locales facility. Some locales use ',' instead of '.' for fractions. Some represent money differently, and so on. Ideally, I wanted my input parser to be able to handle different locales automatically. In Standard C++, numeric conversions are actually part of the locales library. The typical way to access these functions is via an IOStream. What I really wanted was not an algorithm that converted a string to a number, but a way to invoke an IOStream operation on a substring. In other words, I wanted to be able to create an IOStream from a substring.
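
To illustrate how a stream picks up locale conventions when it parses a number, here is a small sketch. Since named locales may not be installed on a given system, it assumes a hand-written facet instead: Comma_decimal and parse_with_comma are hypothetical names, and the facet only changes the decimal point.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: a numpunct that uses ',' as the decimal point,
// standing in for a real named locale.
struct Comma_decimal : std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
};

// Parses a double from text using the comma-decimal convention.
double parse_with_comma(const std::string& text) {
    std::istringstream in(text);
    // The new locale takes ownership of the facet (it is reference counted).
    in.imbue(std::locale(in.getloc(), new Comma_decimal));
    double value = 0.0;
    in >> value;   // the extraction consults numpunct for the decimal point
    return value;
}
```

The parsing code itself never mentions the separator; swapping the imbued locale is enough to change the accepted format.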

In order to get what I wanted, I created a template String_stream<> that would wrap an IOStream interface around any class that provided the appropriate subset of the standard string interface. I propose to go through the reasoning and the variations I went through in creating String_stream as a way of demonstrating what is involved in creating a derived IOStream class.

A String_stream Template (Version 1)

The first step in creating any new IOStream class is to create the appropriate streambuf derived class. Readers who are familiar with the organization of the Standard IOStreams library will know that streambuf is actually a synonym for basic_streambuf<char>. In order to keep the discussion more readable, I will limit my descriptions to the char-based IOStreams. Extending things to support different character types is a fairly easy process.

There are twelve virtual functions provided by the streambuf interface that can be overridden by a derived class. For completeness' sake, I list these functions here:
imbue
setbuf
seekoff
seekpos
sync
showmanyc    (pronounced es-how-many-see)
xsgetn
underflow
uflow
pbackfail
xsputn
overflow
The first version of String_streambuf I created is shown in Listing 1. For the first version, I just wanted to transfer characters directly to and from the string object. In streambuf terms, this was to be "unbuffered I/O." In other words, I was moving characters directly to/from the source/destination without any intermediate buffering in the streambuf object. In order to do this, the minimum functions that I needed to implement were: underflow, uflow, pbackfail, and overflow. Note that each of these functions returns an int_type. By convention, most functions in streambuf return the EOF character to indicate a failure, and some valid character to indicate success. The first three functions are related to input, and overflow handles output. Because output is simpler, I will discuss it first.

A streambuf is allowed to impose a number of restrictions on the character sequence that it manages. Alternatively, you could say that a managed character sequence can have certain restrictions that the streambuf can enforce with relation to any IOStream that uses the streambuf. In this case, I decided arbitrarily that output would always append to the string (assuming the string was writable and had space available). This made the overflow function trivial. The overflow function is called when the output buffer is full or does not exist. Since this version of String_streambuf has no output buffer, overflow is called for every character written to the stream.

You will note that overflow takes an input argument of int_type rather than char. This means that overflow can be called with an argument of EOF. It is not clear to me under what circumstances this will actually happen. Nevertheless, I have decided (again arbitrarily) that any EOF "characters" that do arrive will be silently ignored. This requires that overflow return something other than the character it received if it does receive an EOF.
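
The output side described above can be sketched in a few lines. This is an illustration under the stated rules (always append, silently ignore EOF), not a copy of Listing 1; Append_buf and append_demo are hypothetical names, and the sketch is hard-wired to std::string rather than being a template.

```cpp
#include <ostream>
#include <streambuf>
#include <string>

// Sketch of the output half of an unbuffered string streambuf.
class Append_buf : public std::streambuf {
public:
    explicit Append_buf(std::string& s) : str_(s) {}

protected:
    // No put area is ever set, so every character written arrives here.
    int_type overflow(int_type c) override {
        if (traits_type::eq_int_type(c, traits_type::eof()))
            return traits_type::not_eof(c);    // ignore EOF but report success
        str_ += traits_type::to_char_type(c);  // output always appends
        return c;
    }

private:
    std::string& str_;
};

// Writes through an ordinary ostream and returns the resulting string.
std::string append_demo() {
    std::string s = "x=";
    Append_buf buf(s);
    std::ostream os(&buf);
    os << 42 << '!';
    return s;
}
```

Returning traits_type::not_eof(c) for an EOF argument is what lets overflow report success without returning the EOF value it received.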

The input side is only slightly more complicated. Since I did not want to actually remove characters from the string as they were read, I needed a member variable to keep track of the read position within the string. The functions underflow and uflow are called by the public interface of streambuf when the input buffer is empty, or it does not exist. Again, in this version of String_streambuf, one or the other of underflow or uflow will be called for every character that is read from the stream. The functions assure that the read position hasn't reached the end of the string and then return the character at that position. As you can see, the only difference between the two functions is that uflow increments the read position, but underflow does not.

The pbackfail function handles an attempt to push a character back into the input after it has been read. (The name comes from the fact that it is called if the public sputbackc function cannot push the character back into the input buffer. Since this version of String_streambuf has no input buffer, pbackfail is called on every attempt to push back a character.) Again, I decided that the function would not change the underlying string. This was necessary since I expected to often use a String_stream with a const string. pbackfail backs up the input position only if it is not already at the beginning of the string and provided the character being put back is actually the character at that position.
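
The three input functions can be sketched as follows. Again this is an illustration of the behavior described in the text rather than the actual Listing 1; Read_buf and read_demo are hypothetical names, and the sketch assumes the string outlives the streambuf.

```cpp
#include <istream>
#include <streambuf>
#include <string>

// Sketch of the input half of an unbuffered string streambuf.
class Read_buf : public std::streambuf {
public:
    explicit Read_buf(const std::string& s) : str_(s), pos_(0) {}

protected:
    // Peek: return the current character without consuming it.
    int_type underflow() override {
        if (pos_ >= str_.size()) return traits_type::eof();
        return traits_type::to_int_type(str_[pos_]);
    }

    // Read: return the current character and advance past it.
    int_type uflow() override {
        if (pos_ >= str_.size()) return traits_type::eof();
        return traits_type::to_int_type(str_[pos_++]);
    }

    // Put back: step back only if not at the beginning, and only if the
    // character offered (when one is offered) matches what was read there.
    int_type pbackfail(int_type c) override {
        if (pos_ == 0) return traits_type::eof();
        if (!traits_type::eq_int_type(c, traits_type::eof()) &&
            !traits_type::eq(traits_type::to_char_type(c), str_[pos_ - 1]))
            return traits_type::eof();
        --pos_;
        return traits_type::not_eof(c);
    }

private:
    const std::string& str_;        // never modified
    std::string::size_type pos_;    // current read position
};

// Parses two integers directly out of a const string.
int read_demo() {
    const std::string text = "12 34";
    Read_buf buf(text);
    std::istream in(&buf);
    int a = 0, b = 0;
    in >> a >> b;
    return a + b;
}
```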

I am not sure that I actually needed to implement the function setbuf. The Standard is not clear (to me anyway) what the default behavior for setbuf is if there is no input buffer assigned. Just to be on the safe side, I provided a version of setbuf that does nothing. It would probably be more appropriate for setbuf to throw an exception, but doing nothing mimics the default behavior of setbuf.

For all other functions, I accepted the default behavior. In particular, I did not implement the two seek functions. Their default behavior is simply to report failure. By design, seeking the output position is not supported — output is always appended to the string, and I decided to ignore attempts to seek the input position because I was lazy. Other functions have default implementations in terms of the functions above, or they do nothing and return an appropriate indication. For example, the sync function does nothing (there is nothing to synchronize in this case), but its return value indicates success. Likewise, the showmanyc function returns zero. This is basically a "don't know" value, not a failure indication. This is what I meant when I talked about how IOStreams has a well-designed protected interface.

The one function to note in this implementation is the uflow function. This is a new function in the Standard version of IOStreams that was not available in classic IOStreams. As you can see from the implementation, underflow returns, but does not consume, a character. This is because underflow is called by the public sgetc function, which does not consume the character in the buffer. But a call to the public function sbumpc should return the current character and consume it. In a classic IOStream implementation, this meant that a derived streambuf class had to provide an input buffer of at least one character to hold the value returned by underflow so that it could be processed correctly by either sgetc or sbumpc. Obviously, it was typical to provide a much larger buffer, but there was really no such thing as unbuffered input. The addition of the uflow function corrects this.

As you can see, the various String_stream classes are basically trivial. Their only task is to initialize the String_streambuf and then make the String_streambuf available to their base classes. There is one thing to note: each String_stream class has to initialize its base class with a pointer to the String_streambuf. Because of the order in which C++ constructs base classes and member variables, a String_stream has to initialize its base class before it has a chance to initialize the String_streambuf member. The Standard specifically says that if you initialize a stream class with a pointer to a streambuf that has not been properly initialized, the behavior is undefined. One way around this is to allocate the String_streambuf from the free store as part of the initialization of the String_stream base class (e.g., inherited(new String_streambuf<StringT>(str))). This complicates the destructor, however, so I just provide a null pointer to initialize the String_stream base class and then call the init function from within the constructor, after the String_streambuf member has been initialized.
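
The construction-order workaround can be sketched like this. Two things here are assumptions of the sketch rather than the column's code: the names Str_buf, IString_stream, and stream_demo are hypothetical, and the constructor attaches the buffer with the public rdbuf function (which also clears the error state) instead of the init call described above.

```cpp
#include <istream>
#include <streambuf>
#include <string>

// Minimal read-only, unbuffered streambuf over a string (helper for the demo).
class Str_buf : public std::streambuf {
public:
    explicit Str_buf(const std::string& s) : str_(s), pos_(0) {}
protected:
    int_type underflow() override {
        return pos_ < str_.size() ? traits_type::to_int_type(str_[pos_])
                                  : traits_type::eof();
    }
    int_type uflow() override {
        return pos_ < str_.size() ? traits_type::to_int_type(str_[pos_++])
                                  : traits_type::eof();
    }
private:
    const std::string& str_;
    std::string::size_type pos_;
};

// The stream class only wires the buffer into its base class.
class IString_stream : public std::istream {
public:
    explicit IString_stream(const std::string& s)
        : std::istream(nullptr),  // base is built first; buf_ does not exist yet
          buf_(s)
    {
        rdbuf(&buf_);             // attach the constructed buffer, clear badbit
    }
private:
    Str_buf buf_;
};

// Reads a double through the derived stream; returns -1.0 on failure.
double stream_demo() {
    const std::string text = "2.5 rest";
    IString_stream in(text);
    double d = 0.0;
    in >> d;
    return in ? d : -1.0;
}
```

Constructing the base with a null pointer leaves the stream in a bad state, which is why the buffer must be attached, and the state cleared, before the constructor returns.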

A String_stream Template (Version 2)

Listing 2 shows my second version of String_stream. The first version of String_streambuf uses a very minimal part of the string interface: the size, operator+=, and operator[] functions. Since every type of string that I could envision using provided these functions, everything was fine, so far. The only problem was that reading data out of a string involved a virtual function call for every character read. Virtual function calls are not expensive, but the whole point of String_stream was to be able to efficiently parse input strings in place. If I could get rid of the virtual function calls, it would be better. All of the string types that I wanted to use also provided a data function. Since the streambuf base class was designed to manipulate a buffer, it seemed reasonable to use the string::data function to provide such a buffer. This led to my second version of String_stream (Listing 2). You will note that all of the changes are in the String_streambuf class. (This is usually the case.)

By using the internal buffer of a string provided by the data function, the resulting String_streambuf reduces to a very minimalist version. It provides only the overflow function to add characters to the string. Since the buffer always provides access to all the characters currently available, the underflow and uflow functions revert to their default behavior, which is just to report failure (i.e., end-of-file). Likewise, the default behavior of pbackfail and setbuf is appropriate: the former does nothing and reports failure; the latter just does nothing.

Unfortunately, while the input side of String_streambuf has been simplified considerably, the output side has paid the price. When you modify a string object, it is possible that all existing references into the string are made invalid. This includes iterators and the pointers returned by the data and c_str functions. Since you do not know when the references will become invalid, you have to program as though every modification invalidates them [4]. That means that overflow has to reestablish the get buffer area on every call. Note that in order to accomplish this correctly, it has to preserve the value of the current get position in the buffer.
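
A sketch of this second approach follows. It is not Listing 2: Data_buf, reset_get_area, and data_demo are hypothetical names, the class is hard-wired to std::string, and it obtains a writable pointer with &str_[0] where the text speaks of the data function.

```cpp
#include <cstddef>
#include <istream>
#include <streambuf>
#include <string>

// Sketch: the string's own storage serves as the get area, so reads need
// no virtual calls; every append must re-establish that get area.
class Data_buf : public std::streambuf {
public:
    explicit Data_buf(std::string& s) : str_(s) { reset_get_area(0); }

protected:
    int_type overflow(int_type c) override {
        if (traits_type::eq_int_type(c, traits_type::eof()))
            return traits_type::not_eof(c);
        // Save the read offset first: the append may reallocate the
        // string and invalidate eback(), gptr(), and egptr().
        const std::ptrdiff_t offset = gptr() - eback();
        str_ += traits_type::to_char_type(c);
        reset_get_area(offset);
        return c;
    }

private:
    void reset_get_area(std::ptrdiff_t offset) {
        char* base = &str_[0];
        setg(base, base + offset, base + str_.size());
    }

    std::string& str_;
};

// Reads, appends, then reads again from the same stream.
int data_demo() {
    std::string s = "40";
    Data_buf buf(s);
    std::iostream io(&buf);
    int a = 0, b = 0;
    io >> a;       // parsing runs into the end of the string: eofbit is set
    io.clear();
    io << " 2";    // appended through overflow; read position preserved
    io >> b;
    return a + b;
}
```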

A String_stream Template (Version 3)

Listing 3 shows my third version of String_stream. Since I expected to be using a String_stream primarily to read strings and seldom to write them, I was tempted to just leave version 2 alone — but not quite tempted enough. In order to avoid the overhead of having to reset the get area on every call to overflow, I could do one of two things (at least that I thought of immediately). Both ideas depend on the fact that typically you either write or read, but seldom intermix both. In approach one, a flag could be set to indicate that a get buffer was valid. If a call to overflow occurred when the flag was set, it would disable the get buffer by setting the pointers to null and clear the flag before appending the character to the string. Future calls to overflow would find the get buffer marked as invalid and be able to append characters without concern. If the stream switched to input, the invalid get buffer would trigger a call to underflow, which would reset the get buffer.

This approach struck me as having one big disadvantage. It required an underflow function that would have to deal with two situations: an invalid get buffer and no more characters in the string. Not a particularly complicated problem, but like I said before, I am lazy. So I chose to go with solution two, which was to provide a small buffer to allow output to accumulate before a call to overflow has to update the string and reset the input buffer area.

Now, besides overflow, I have to provide a sync function. This is called (indirectly) whenever the stream is flushed. Note that once I provide sync, overflow does not actually update the string. Instead, it calls sync to append the characters in the buffer into the string and then puts the character it received (the overflow character) back into the buffer. This is perfectly legitimate — overflow's real job is not to update the string, but just to make some room available in the output buffer. This means that sync is the function responsible for updating the string, so it has to update the get buffer also.
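
The division of labor between overflow and sync might look like the sketch below. It is an illustration, not Listing 3: Buffered_buf and buffered_demo are hypothetical names, and the 16-character put buffer is an arbitrary size chosen for the example.

```cpp
#include <cstddef>
#include <istream>
#include <streambuf>
#include <string>

// Sketch: output accumulates in a small put buffer; sync moves it into
// the string and re-establishes the get area over the string's storage.
class Buffered_buf : public std::streambuf {
public:
    explicit Buffered_buf(std::string& s) : str_(s) {
        setp(out_, out_ + sizeof out_);
        reset_get_area(0);
    }

protected:
    // Called when the put buffer is full: make room, then store c.
    int_type overflow(int_type c) override {
        if (sync() != 0) return traits_type::eof();
        if (!traits_type::eq_int_type(c, traits_type::eof())) {
            *pptr() = traits_type::to_char_type(c);
            pbump(1);
        }
        return traits_type::not_eof(c);
    }

    // Called (indirectly) by flush: this is where the string is updated.
    int sync() override {
        const std::ptrdiff_t offset = gptr() - eback();
        str_.append(pbase(),
                    static_cast<std::string::size_type>(pptr() - pbase()));
        setp(out_, out_ + sizeof out_);  // put buffer is empty again
        reset_get_area(offset);          // the append may have reallocated
        return 0;
    }

private:
    void reset_get_area(std::ptrdiff_t offset) {
        char* base = &str_[0];
        setg(base, base + offset, base + str_.size());
    }

    std::string& str_;
    char out_[16];
};

// Writes, flushes, then reads the same characters back.
int buffered_demo() {
    std::string s;
    Buffered_buf buf(s);
    std::iostream io(&buf);
    io << "7 35";
    io.flush();    // without this, the read below would find nothing
    int a = 0, b = 0;
    io >> a >> b;
    return a + b;
}
```

Characters still sitting in the put buffer when the streambuf is destroyed are lost in this sketch, so a caller has to flush before relying on the string's contents.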

You will note that there is no underflow function. This means that a read can fail, even though there might be characters in the output buffer area that have not yet been transferred to the string. This is what typically happens if you don't call flush between switching from writing to reading the same stream. I could have created a version of underflow that would have checked for this situation and called sync if needed, but I decided that it wasn't worth the effort.

Again, the default behavior of the rest of the interface is acceptable. Readers who wish to explore further might consider which, if any, of the other virtual functions listed earlier it makes sense to implement. You need to be careful to understand the circumstances under which each of these virtual functions is called, and for that you really need access to a copy of the Standard. As a simple example, the showmanyc function is called by the public in_avail function only when there are no more characters in the input buffer. If there are characters available in the input buffer, in_avail returns that number. This means that the only reasonable implementation for a version of showmanyc is to return the number of characters that are in the output buffer, but not yet in the string. It doesn't make sense for showmanyc to return the size of the string, since that doesn't take into account the number of characters that have already been read. Likewise, if showmanyc returns the size of the string minus the number of characters that have been read, it will always return zero because of the way it is called. Since that is the default behavior, why bother?
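
The calling rule for showmanyc is easy to check with a small probe. Probe_buf and probe_demo are hypothetical names; the streambuf simply counts how often its showmanyc override runs.

```cpp
#include <ios>
#include <streambuf>

// Hypothetical probe: a read-only streambuf over a caller-supplied array
// that records every call to showmanyc.
struct Probe_buf : std::streambuf {
    int showmanyc_calls = 0;

    Probe_buf(char* first, char* last) { setg(first, first, last); }

protected:
    std::streamsize showmanyc() override {
        ++showmanyc_calls;
        return 0;   // same "don't know" answer as the default
    }
};

// Returns how many times showmanyc ran across two in_avail calls,
// or -1 if in_avail did not report the expected counts.
int probe_demo() {
    char text[] = {'a', 'b'};
    Probe_buf buf(text, text + 2);
    std::streamsize before = buf.in_avail();  // answered from the get area
    buf.sbumpc();
    buf.sbumpc();                             // get area is now exhausted
    std::streamsize after = buf.in_avail();   // falls through to showmanyc
    return (before == 2 && after == 0) ? buf.showmanyc_calls : -1;
}
```

Only the second in_avail call reaches showmanyc, which is why an override that merely reports what remains in the get area can never return anything but zero.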

I hope that this little example has convinced you that developing a custom version of an IOStream class is not an arcane art. There are some issues that you need to be aware of, and as a result you do need either the Standard or a good reference (e.g., [5]), but with just a little practice you will find that the vast majority of cases quickly boil down to something like version 3. You can usually get by with a streambuf class that provides an underflow, overflow, and sync function. The actual stream classes are usually just a constructor, with occasionally some additional functions to query the status of the underlying source/sink.

The Standard IOStream library is an excellent abstraction for all sorts of I/O situations. Once you start thinking in terms of a customized IOStream as your source or sink instead of some specialized, one-off API, you will quickly discover a great deal of synergy with your existing code base.

If you are interested in IOStream bugs, see the Sidebar "Stupid IOStream Bugs."

Next time, we will look at some other ways to extend the IOStreams library.

Notes and References

[1] Jack Reeves. "The (B)Leading Edge: Object Oriented Design and C++," C++ Report, May 2000.

[2] Obviously, this is the approach taken by Java interfaces. In C++, it is possible to have both pure interfaces and default implementations by using multiple inheritance. This paradigm is being advocated more and more often as the way to design base class abstractions: provide a pure interface in one class and a default implementation in a separate class. This approach has distinct advantages, but in the meantime IOStreams serves as an excellent example of the more traditional approach.

[3] Jack Reeves. "The (B)Leading Edge: More 'string' Utilities," C++ Report, June/July 1999.

[4] Some readers might consider the possibility of checking whether str.size() < str.capacity() before appending the character. In many string implementations, if this holds you can be sure that the pointer to the data buffer will not be invalidated. There are string implementations where size() always equals capacity() however. For such strings, we will still need to reset the get buffer after every character. The real solution to this problem is the third version of String_stream.

[5] Angelika Langer and Klaus Kreft. Standard C++ IOStreams and Locales (Addison-Wesley, 2000).

Jack W. Reeves is an engineer and consultant specializing in object-oriented software design and implementation. His background includes Space Shuttle simulators, military CCCI systems, medical imaging systems, financial data systems, and numerous middleware and low-level libraries. He currently is living and working in Europe and can be contacted via jack_reeves@bleading-edge.com.