Unicode & Filtering Stream Buffers

C/C++ Users Journal March, 2005

Using filters to convert between internal and external formats

By Tilman Kuepper

Tilman Kuepper is an R&D manager at XGraphic GmbH. He can be contacted at kuepper@xgraphic.de.

Unicode has become a standard in modern software development. Be it dealing with user interfaces or accessing XML files, the consistent use of wide-character strings in UTF-32 format offers many advantages over the classic approach of regional codepages and specially adapted character sets. However, text files are often saved in UTF-8 format, which makes it necessary to implement converter functions or classes between the internal UTF-32 characters and external UTF-8 octets.

The C++ Standard Library lets you make such conversions through the class std::basic_fstream, but first, code-conversion facets have to be defined and introduced into the file stream's locale. When implementing code-conversion facets, you need to ensure the proper functioning of the actual converter (whose tasks include converting correct UTF-8 data and recognizing and handling possible errors) as well as the correct interaction with the C++ Standard Library; even six years after the creation of the current C++ Standard, this is still not a trivial matter.

In this article, I present an alternative to the use of code-conversion facets. The conversion between the internal and external format occurs within a filtering stream buffer. (The complete source is at http://www.cuj.com/code/.) My three main objectives are:

To provide a robust and easy-to-use UTF-8/UTF-32 converter class. In case of encoding errors, it should be possible to correct them and continue the program execution.
To implement the converter class in an IOStream-compatible manner. The converter should be usable through standard stream objects.
To keep the converter independent of specific data sources or data sinks. For example, it should be possible to access data files as well as network connections.

I will not discuss Unicode in general, nor UTF-8 and UTF-32 in particular. Information on these topics can be found on the Internet [1] or in reference literature.

Files in UTF-8 Format

Before describing the design and implementation of the converter in detail, I will first address its use in practice. The wrapper class UTF8Filestream supports working with UTF-8 files. Listing 1 shows how a UTF-8 file is read in sequentially and resaved in a second UTF-8 file. The external UTF-8 octets are converted into internal UTF-32 characters while being read in, and the conversion runs in the opposite direction while the data is being saved. The loop terminates when the end of the input file is reached, or a UTF-8 encoding error is found. The converter classes are declared in the UTF8Lib namespace.

In practice, the handling of internal UTF-32 characters and strings is frequently based on the character type wchar_t. However, this is not a prerequisite for using the converter portrayed here. Template parameters allow configuring the converter for all UTF-32 character types supported by the IOStreams classes. This is particularly useful for working with Unicode characters outside of the "Basic Multilingual Plane" (that is, code-points above 0xFFFF). With some compilers, wchar_t is too small to hold such characters; for example, in Microsoft Visual C++ where sizeof(wchar_t) is 2.
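
Whether a given character type can hold every Unicode code point can be checked at compile time. The following is a minimal sketch; the helper name can_hold_all_code_points is hypothetical and is not part of the UTF8Lib namespace described in this article.

```cpp
#include <limits>

// Hypothetical helper (not part of UTF8Lib): true if CharT can represent
// every Unicode code point, that is, every value up to 0x10FFFF.
template <typename CharT>
constexpr bool can_hold_all_code_points() {
    return static_cast<unsigned long long>(std::numeric_limits<CharT>::max())
           >= 0x10FFFFull;
}

// char32_t and long are guaranteed to be at least 32 bits wide.
static_assert(can_hold_all_code_points<char32_t>(), "char32_t holds UTF-32");
static_assert(can_hold_all_code_points<long>(), "long holds UTF-32");
```

With a 16-bit wchar_t, can_hold_all_code_points<wchar_t>() evaluates to false, which is a signal to select a wider character type through the template parameters.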

With additional effort, you can both recognize and correct UTF-8 encoding errors. Listing 2 shows how the member function get_state() of the converter may be called to determine the reason for failed reading operations. In case of encoding errors, replacement characters are written and the program continues. This example also demonstrates the use of the character type long instead of wchar_t.

The member function get_state() returns one of the error codes in Table 1. Error codes that begin with RD_ refer to failed input operations. Error codes that begin with WR_ refer to failed output operations. If work with the file is to be continued after an error is recognized (except for RD_EOF or WR_EOF), first call the member function clear() of the stream object.
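
The requirement to call clear() is a general property of standard stream objects: once a stream has entered a failed state, further operations do nothing until the state is reset. Because Table 1 and get_state() are not reproduced here, the following sketch illustrates the pattern with a plain std::istringstream and integer parsing as a stand-in for UTF-8 decoding. The function name is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for error recovery: reads all integers from text,
// skipping over any character that cannot be parsed as part of a number.
std::vector<int> read_ints_skipping_garbage(const std::string& text) {
    std::istringstream in(text);
    std::vector<int> values;
    for (;;) {
        int value;
        if (in >> value) {          // successful read
            values.push_back(value);
            continue;
        }
        if (in.eof())               // end of data: nothing left to recover
            break;
        in.clear();                 // reset the error state first ...
        in.ignore(1);               // ... then skip the offending character
    }
    return values;
}
```

Without the call to clear(), the subsequent ignore() and every later read would fail immediately, and the loop would never make progress.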

Socket Connections in UTF-8 Format

The converter is not just restricted to working with files; it can be combined with numerous data sources and data sinks. Figure 1 shows the required prerequisites:

An external stream buffer object derived from std::streambuf. The converter reads external UTF-8 octets from here in order to convert them to internal UTF-32 characters. Similarly, UTF-8 octets generated by the converter during writing operations are sent to the external stream buffer.
A stream object derived from std::wiostream or std::basic_iostream. The stream object uses the converter as its stream buffer.

Listing 3 shows how UTF-8 data is transferred via a socket connection. Because the C++ Standard Library does not include any classes for network programming, a third-party library is required. This example uses the free Socket++ library [2] by Swaminathan and Straub. The class socketbuf from this library provides access to socket connections through a std::streambuf-compatible interface.

After the socket connection is established, the converter is initialized. The constructor of the UTF8Streambuf object accepts a reference to the previously created socketbuf object. The converter is now able to exchange UTF-8 data via the network connection. The empty template parameters indicate that the character type wchar_t is used for the internal UTF-32 characters. Finally, a stream object is configured to use the converter as its stream buffer. Subsequent program parts access the network connection through this stream object.

Implementing the Converter

As mentioned earlier, the actual converter is implemented as a filtering stream buffer. But what are filtering stream buffers? They have become known as practical tools, especially through the publications of James Kanze. In his article "Filtering Streambufs: Variations on a Theme by Schwarz" [3], Kanze writes:

In a filtering streambuf, the streambuf in question is not the ultimate sink or source, but simply processes the data and passes it on. In this way, it acts somewhat like a pipe in UNIX, but within the process. Anyone who has worked with UNIX is familiar with just how powerful this piping idiom can be.

Stream buffers connect stream objects with actual data sources and sinks. The C++ Standard Library defines stream buffers for accessing files (std::basic_filebuf) and strings (std::stringbuf). The converter I discuss here is implemented as a stream buffer, too. However, it is not limited to specific data sources or sinks, but rather delegates its input and output operations to a second ("external") stream buffer. Because the converter lies between the stream object and external stream buffer like a filter, it is called a "filtering stream buffer."

The converter is derived from std::basic_streambuf. From this base class it inherits its public interface, which is known and used by the stream layer. The member functions of this interface are nonvirtual and, therefore, cannot be redefined. Instead, the public nonvirtual member functions call protected virtual member functions within the stream buffer object. This internal, protected interface is the starting point for defining our new stream buffer class.
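
The relationship between the public and the protected interface can be seen in a very small example. This is only an illustrative sketch with a hypothetical class name: the stream buffer below overrides overflow() and counts the characters it receives. Callers use only the public, nonvirtual functions sputc() and sputn().

```cpp
#include <cstddef>
#include <streambuf>

// Hypothetical example class: discards all output but counts the characters.
class CountingStreambuf : public std::streambuf {
public:
    CountingStreambuf() : count_(0) {}
    std::size_t count() const { return count_; }

protected:
    // Called by the public interface whenever the put area is full.
    // No put area is ever set up here, so every character ends up in this
    // function.
    int_type overflow(int_type c = traits_type::eof()) override {
        if (!traits_type::eq_int_type(c, traits_type::eof()))
            ++count_;
        return traits_type::not_eof(c);   // report success
    }

private:
    std::size_t count_;
};
```

After buf.sputc('a') and buf.sputn("bcd", 3) on a CountingStreambuf object buf, the counter holds 4, although overflow() was never called directly.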

For output, the protected virtual member function overflow() needs to be implemented. It is invoked for each character to be written. Slightly oversimplifying, Listing 4 shows the selected implementation. You can see how internal UTF-32 characters are transferred from overflow() to put_next_char(), where they are converted into UTF-8 octets. They finally reach the external stream buffer via put_utf8_octet().
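
The output path can be sketched as follows. This is not the code from Listing 4 but a simplified illustration under stated assumptions: the class name Utf8OutputFilter, the helper names put_code_point() and put_octet(), and the function encode_to_utf8() are hypothetical, error states are reduced to a failed write, and the character type is assumed to be wide enough for the code points written.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical output-only filter, not the article's UTF8Streambuf.
template <typename CharT = wchar_t, typename Traits = std::char_traits<CharT> >
class Utf8OutputFilter : public std::basic_streambuf<CharT, Traits> {
public:
    typedef typename Traits::int_type int_type;

    explicit Utf8OutputFilter(std::streambuf* external) : external_(external) {
        assert(external_ != nullptr);
    }

protected:
    // No put area is set up, so every character written arrives here.
    int_type overflow(int_type c = Traits::eof()) override {
        if (Traits::eq_int_type(c, Traits::eof()))
            return Traits::not_eof(c);  // nothing to write
        const unsigned long cp =
            static_cast<unsigned long>(Traits::to_char_type(c));
        return put_code_point(cp) ? c : Traits::eof();
    }

private:
    // Converts one code point into one to four UTF-8 octets.
    bool put_code_point(unsigned long cp) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;               // not a valid Unicode scalar value
        if (cp < 0x80)
            return put_octet(cp);
        if (cp < 0x800)
            return put_octet(0xC0 | (cp >> 6)) &&
                   put_octet(0x80 | (cp & 0x3F));
        if (cp < 0x10000)
            return put_octet(0xE0 | (cp >> 12)) &&
                   put_octet(0x80 | ((cp >> 6) & 0x3F)) &&
                   put_octet(0x80 | (cp & 0x3F));
        return put_octet(0xF0 | (cp >> 18)) &&
               put_octet(0x80 | ((cp >> 12) & 0x3F)) &&
               put_octet(0x80 | ((cp >> 6) & 0x3F)) &&
               put_octet(0x80 | (cp & 0x3F));
    }

    // Hands a single octet to the external stream buffer.
    bool put_octet(unsigned long octet) {
        typedef std::streambuf::traits_type Ext;
        const char ch = static_cast<char>(static_cast<unsigned char>(octet));
        return !Ext::eq_int_type(external_->sputc(ch), Ext::eof());
    }

    std::streambuf* external_;
};

// Usage sketch: encode a wide string into UTF-8 octets held in a std::string.
std::string encode_to_utf8(const std::wstring& text) {
    std::stringbuf sink;               // stands in for a file or socket buffer
    Utf8OutputFilter<> filter(&sink);  // the filter writes into the sink
    std::wostream out(&filter);        // the stream writes into the filter
    out << text;
    return sink.str();
}
```

The std::stringbuf in encode_to_utf8() plays the role of the external stream buffer. Any other class derived from std::streambuf could be passed to the constructor instead.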

For input (Listing 5), data is read via the protected virtual member function underflow(). It calls get_next_char() where the external UTF-8 octets are converted to internal UTF-32 characters. The individual UTF-8 octets are read out of the external stream buffer by get_utf8_lead_octet() and get_utf8_cont_octet(), respectively. If problems occur during data input or during conversion, the member variable state_ is set to corresponding error codes; the same holds true for problems occurring during data output.
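
The input path can be sketched in a similar way. Again, this is not the code from Listing 5: the names Utf8InputFilter, get_code_point(), decoding_failed(), and decode_utf8() are hypothetical, and the error codes of Table 1 are reduced to a single flag. The sketch keeps the most recently decoded character in a one-element get area so that underflow() can be called repeatedly without consuming input; this is a choice made for the sketch and not necessarily how the article's converter handles it. The character type is assumed to be wide enough for every decoded code point.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical input-only filter, not the article's UTF8Streambuf.
template <typename CharT = wchar_t, typename Traits = std::char_traits<CharT> >
class Utf8InputFilter : public std::basic_streambuf<CharT, Traits> {
public:
    typedef typename Traits::int_type int_type;

    explicit Utf8InputFilter(std::streambuf* external)
        : external_(external), failed_(false), current_() {
        assert(external_ != nullptr);
    }

    // True once malformed UTF-8 has been seen (as opposed to end of data).
    bool decoding_failed() const { return failed_; }

protected:
    int_type underflow() override {
        if (this->gptr() < this->egptr())
            return Traits::to_int_type(*this->gptr());
        unsigned long cp = 0;
        if (!get_code_point(cp))
            return Traits::eof();
        current_ = static_cast<CharT>(cp);
        this->setg(&current_, &current_, &current_ + 1);
        return Traits::to_int_type(current_);
    }

private:
    typedef std::streambuf::traits_type Ext;

    bool fail() { failed_ = true; return false; }

    // Reads one lead octet plus its continuation octets.
    bool get_code_point(unsigned long& cp) {
        const Ext::int_type lead = external_->sbumpc();
        if (Ext::eq_int_type(lead, Ext::eof()))
            return false;                          // clean end of data
        const unsigned long b = static_cast<unsigned long>(lead);
        int extra = 0;
        unsigned long smallest = 0;                // rejects overlong forms
        if (b < 0x80)                { cp = b; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; smallest = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; smallest = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; smallest = 0x10000; }
        else return fail();                        // invalid lead octet
        for (int i = 0; i < extra; ++i) {
            const Ext::int_type next = external_->sgetc();   // peek only
            if (Ext::eq_int_type(next, Ext::eof()) || (next & 0xC0) != 0x80)
                return fail();                     // missing continuation octet
            external_->sbumpc();
            cp = (cp << 6) | static_cast<unsigned long>(next & 0x3F);
        }
        if (cp < smallest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return fail();
        return true;
    }

    std::streambuf* external_;
    bool failed_;
    CharT current_;
};

// Usage sketch: decode UTF-8 octets; ok reports whether all input was valid.
std::wstring decode_utf8(const std::string& octets, bool& ok) {
    std::stringbuf source(octets, std::ios_base::in);
    Utf8InputFilter<> filter(&source);
    std::wistream in(&filter);
    std::wstring text;
    wchar_t ch;
    while (in.get(ch))
        text += ch;
    ok = !filter.decoding_failed();
    return text;
}
```

Peeking at each continuation octet with sgetc() before consuming it leaves an unexpected octet in the external buffer, so a caller that wants to resume after an error can continue reading from that position.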

The definition of the new stream buffer class requires the implementation of only two protected virtual member functions, namely overflow() and underflow(). This minimal effort is possible because the converter defined here does not do any data buffering of its own. Buffering is already done by the external stream buffer and does not have to be repeated in the converter.

Summary

Defining new stream buffers offers many possibilities for expanding the familiar IOStreams classes from the C++ Standard Library. In this article, I presented an implementation of a new stream buffer class that translates between internal UTF-32 characters and external UTF-8 octets. The filtering stream buffer implementation allows flexibility; for example, in accessing UTF-8 files or UTF-8 network connections.

Platform independence was an important design goal. The present version of the converter was tested with Microsoft Visual C++ (Versions 6.0 to 7.1) and GCC 3.4. Compilation using the Comeau online compiler is also possible. The ready-to-use source code of the converter and some examples of use can be downloaded from http://www.cuj.com/code/.

Acknowledgments

Thanks to the participants of the C++ newsgroups on the Internet. In particular, the comments of Dietmar Kühl and James Kanze were highly valuable in implementing the converter. I would also like to thank Markus Kuhn for the Unicode information that he compiled on his homepage.

References

[1] http://www.unicode.org/.
[2] http://www.linuxhacker.at/socketxx.
[3] Kanze, J. "Filtering Streambufs: Variations on a Theme by Schwarz," C++ Report, September 1998.
[4] Langer, A. and K. Kreft. Standard C++ IOStreams and Locales, Addison-Wesley, 1999.
[5] Josuttis, N. The C++ Standard Library, Addison-Wesley, 1999.