Columns


Standard C

Multibyte Functions

P.J. Plauger


P.J. Plauger is senior editor of The C Users Journal. He is secretary of the ANSI C standards committee, X3J11, and convenor of the ISO C standards committee, WG14. His latest book is The Standard C Library, published by Prentice-Hall. You can reach him at uunet!plauger!pjp.

Introduction

Support for large character sets was probably the largest single addition to Standard C. You won't find much prior art for this stuff. A few Japanese companies worried about large character sets years ago, as did a few international companies keen to sell software to Japan. Even so, much of what went into Standard C was pure invention.

One of my major motivations in writing the Standard C library was to prove in the multibyte functions. It's nice to know whether they can be written efficiently, or whether they can be written at all. I was pleased to find that the basic specification in the C Standard is indeed viable.

What's hard is to write the critical functions to be usable with multiple character sets. Even within Japan, several multibyte encodings are popular. I didn't want to write multiple specialized versions of these functions. And I didn't want to require lots more coding to support Chinese, Korean, Arabic, or other large character sets around the world. That proved to be the fun part.

The functions I present here are the ones declared in the header <stdlib.h>. They are the bare minimum needed to support multibyte characters (sequences of one or more bytes to represent each character), wide characters (fixed-size integers that can represent all possible characters), and conversions between these two forms. Both the ISO C and C++ standards will soon incorporate a richer set of additional functions for manipulating large character sets.

What the C Standard Says

7.10.7 Multibyte character functions

The behavior of the multibyte character functions is affected by the LC_CTYPE category of the current locale. For a state-dependent encoding, each function is placed into its initial state by a call for which its character pointer argument, s, is a null pointer. Subsequent calls with s as other than a null pointer cause the internal state of the function to be altered as necessary. A call with s as a null pointer causes these functions to return a nonzero value if encodings have state dependency, and zero otherwise.[131] Changing the LC_CTYPE category causes the shift state of these functions to be indeterminate.

7.10.7.1 The mblen function

Synopsis

#include <stdlib.h>
int mblen(const char *s, size_t n);

Description

If s is not a null pointer, the mblen function determines the number of bytes contained in the multibyte character pointed to by s. Except that the shift state of the mbtowc function is not affected, it is equivalent to

mbtowc((wchar_t *)0, s, n);

The implementation shall behave as if no library function calls the mblen function.

Returns

If s is a null pointer, the mblen function returns a nonzero or zero value, if multibyte character encodings, respectively, do or do not have state-dependent encodings. If s is not a null pointer, the mblen function either returns 0 (if s points to the null character), or returns the number of bytes that are contained in the multibyte character (if the next n or fewer bytes form a valid multibyte character), or returns -1 (if they do not form a valid multibyte character).

Forward references: the mbtowc function (7.10.7.2).

7.10.7.2 The mbtowc function

Synopsis

#include <stdlib.h>
int mbtowc(wchar_t *pwc, const char *s, size_t n);

Description

If s is not a null pointer, the mbtowc function determines the number of bytes that are contained in the multibyte character pointed to by s. It then determines the code for the value of type wchar_t that corresponds to that multibyte character. (The value of the code corresponding to the null character is zero.) If the multibyte character is valid and pwc is not a null pointer, the mbtowc function stores the code in the object pointed to by pwc. At most n bytes of the array pointed to by s will be examined.

The implementation shall behave as if no library function calls the mbtowc function.

Returns

If s is a null pointer, the mbtowc function returns a nonzero or zero value, if multibyte character encodings, respectively, do or do not have state-dependent encodings. If s is not a null pointer, the mbtowc function either returns 0 (if s points to the null character), or returns the number of bytes that are contained in the converted multibyte character (if the next n or fewer bytes form a valid multibyte character), or returns -1 (if they do not form a valid multibyte character).

In no case will the value returned be greater than n or the value of the MB_CUR_MAX macro.

7.10.7.3 The wctomb function

Synopsis

#include <stdlib.h>
int wctomb(char *s, wchar_t wchar);

Description

The wctomb function determines the number of bytes needed to represent the multibyte character corresponding to the code whose value is wchar (including any change in shift state). It stores the multibyte character representation in the array object pointed to by s (if s is not a null pointer). At most MB_CUR_MAX characters are stored. If the value of wchar is zero, the wctomb function is left in the initial shift state.

The implementation shall behave as if no library function calls the wctomb function.

Returns

If s is a null pointer, the wctomb function returns a nonzero or zero value, if multibyte character encodings, respectively, do or do not have state-dependent encodings. If s is not a null pointer, the wctomb function returns -1 if the value of wchar does not correspond to a valid multibyte character, or returns the number of bytes that are contained in the multibyte character corresponding to the value of wchar.

In no case will the value returned be greater than the value of the MB_CUR_MAX macro.

7.10.8 Multibyte string functions

The behavior of the multibyte string functions is affected by the LC_CTYPE category of the current locale.

7.10.8.1 The mbstowcs function

Synopsis

#include <stdlib.h>
size_t mbstowcs(wchar_t *pwcs,
const char *s, size_t n);

Description

The mbstowcs function converts a sequence of multibyte characters that begins in the initial shift state from the array pointed to by s into a sequence of corresponding codes and stores not more than n codes into the array pointed to by pwcs. No multibyte characters that follow a null character (which is converted into a code with value zero) will be examined or converted. Each multibyte character is converted as if by a call to the mbtowc function, except that the shift state of the mbtowc function is not affected.

No more than n elements will be modified in the array pointed to by pwcs. If copying takes place between objects that overlap, the behavior is undefined.

Returns

If an invalid multibyte character is encountered, the mbstowcs function returns (size_t)-1. Otherwise, the mbstowcs function returns the number of array elements modified, not including a terminating zero code, if any.[132]

7.10.8.2 The wcstombs function

Synopsis

#include <stdlib.h>
size_t wcstombs(char *s, const
wchar_t *pwcs, size_t n);

Description

The wcstombs function converts a sequence of codes that correspond to multibyte characters from the array pointed to by pwcs into a sequence of multibyte characters that begins in the initial shift state and stores these multibyte characters into the array pointed to by s, stopping if a multibyte character would exceed the limit of n total bytes or if a null character is stored. Each code is converted as if by a call to the wctomb function, except that the shift state of the wctomb function is not affected.

No more than n bytes will be modified in the array pointed to by s. If copying takes place between objects that overlap, the behavior is undefined.

Returns

If a code is encountered that does not correspond to a valid multibyte character, the wcstombs function returns (size_t)-1. Otherwise, the wcstombs function returns the number of bytes modified, not including a terminating null character, if any.[132]

Footnotes

131. If the implementation employs special bytes to change the shift state, these bytes do not produce separate wide character codes, but are grouped with an adjacent multibyte character.

132. The array will not be null- or zero-terminated if the value returned is n.

Using the Functions

mblen — Use this function to determine the length of the multibyte sequence that defines a single wide character. That length cannot be greater than MB_CUR_MAX, defined in <stdlib.h>. Multibyte sequences can contain locking shifts that alter the interpretation of any number of characters that follow. Hence, mblen stores in a private static data object the shift state for the multibyte string it is currently scanning. If the call mblen(NULL, 0) is nonzero, you can safely scan only one multibyte string at a time by repeated calls to mblen. Here, for example, is a function that checks whether a multibyte string has a valid encoding:

#include <stdlib.h>

int mbcheck(const char *s)
   {  /* return zero if s is valid */
   int n;

   for (mblen(NULL, 0); ; s += n)
      if ((n = mblen(s, MB_CUR_MAX)) <= 0)
         return (n);
   }

mbstowcs — Use this function to convert an entire multibyte string to a wide-character string. You needn't worry about whether locking shifts occur, since the function processes the entire multibyte string. You also needn't worry that the resultant wide-character string is too long, since the third argument n limits the number of elements stored. If the function returns n, the wide-character string is not terminated and the conversion may be incomplete. If the function returns (size_t)-1, the multibyte string has an invalid encoding.
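
As a minimal sketch of that calling pattern, the helper below separates the three outcomes of a call to mbstowcs. The name to_wide and its return codes are hypothetical, chosen only for illustration:

```c
#include <stdlib.h>

/* Hypothetical helper: convert mbs into at most n elements of wcs.
   Returns 0 on success (wcs ends with a zero code), 1 if the output
   filled up before a zero code was stored, -1 on invalid input. */
int to_wide(wchar_t *wcs, const char *mbs, size_t n)
{
    size_t r = mbstowcs(wcs, mbs, n);

    if (r == (size_t)-1)
        return -1;      /* invalid multibyte sequence */
    if (r == n)
        return 1;       /* no room left for the terminating zero code */
    return 0;           /* r elements stored, followed by a zero code */
}
```

Note that the error test compares against (size_t)-1. The return type is unsigned, so a test such as r < 0 can never succeed.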

mbtowc — Use this function much the same as you would mblen, described above. Two differences exist between the functions mbtowc and mblen:

If the multibyte sequence is valid and pwc is not a null pointer, mbtowc also stores the corresponding wide-character code in the object pointed to by pwc.

The functions mblen and mbtowc maintain separate static data objects to store shift states. Thus, you can scan different strings at the same time with the two functions even when multibyte strings have locking shifts.
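
A scanning loop built on mbtowc looks much like one built on mblen. Here is a minimal sketch; the name mbdecode and its interface are hypothetical:

```c
#include <stdlib.h>

/* Hypothetical helper: decode the multibyte string s into out, storing
   at most max wide characters. Returns the number of characters
   decoded, or -1 if s contains an invalid sequence. */
long mbdecode(wchar_t *out, size_t max, const char *s)
{
    long count = 0;
    int n;

    mbtowc(NULL, NULL, 0);  /* put mbtowc in its initial shift state */
    while ((size_t)count < max) {
        n = mbtowc(&out[count], s, MB_CUR_MAX);
        if (n < 0)
            return -1;      /* invalid multibyte sequence */
        if (n == 0)
            break;          /* reached the terminating null character */
        s += n;             /* step over the bytes just consumed */
        ++count;
    }
    return count;
}
```

Because mbtowc keeps its own shift state, a loop like this one can run alongside a separate scan that uses mblen on another string.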

wcstombs — Use this function to convert an entire wide-character string to a multibyte string. You needn't worry about whether locking shifts occur, since the function processes the entire wide-character string. You also needn't worry that the resultant multibyte string is too long, since the third argument n limits the number of bytes stored. If the function returns n, the multibyte string is not terminated and the conversion may be incomplete. If the function returns (size_t)-1, the wide-character string is invalid.
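
One common use is to make a multibyte copy of a wide-character string in freshly allocated storage. The sketch below assumes a worst case of MB_CUR_MAX bytes for every wide character, including the terminating null. The name wcs_dup_mb is hypothetical:

```c
#include <stdlib.h>

/* Hypothetical helper: return a malloc'd multibyte copy of wcs, or
   NULL if wcs is invalid or memory runs out. The caller frees it. */
char *wcs_dup_mb(const wchar_t *wcs)
{
    size_t len = 0, size, r;
    char *mbs;

    while (wcs[len] != L'\0')
        ++len;
    /* worst case: each wide character, and the null, needs MB_CUR_MAX bytes */
    size = (len + 1) * MB_CUR_MAX;
    if ((mbs = malloc(size)) == NULL)
        return NULL;
    r = wcstombs(mbs, wcs, size);
    if (r == (size_t)-1 || r == size) {
        free(mbs);      /* invalid code, or no terminator was stored */
        return NULL;
    }
    return mbs;
}
```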

wctomb — Use this function to convert a wide-character string to a multibyte string one wide character at a time. Here, for example, is a function that checks whether a wide-character string has a valid encoding:

#include <limits.h>
#include <stdlib.h>
int wccheck(wchar_t *wcs)
   {  /* return zero if wcs is valid */
   char buf[MB_LEN_MAX];
   int n;

   for (wctomb(NULL, 0); ; ++wcs)
      if ((n = wctomb(buf, *wcs)) <= 0)
         return (-1);
      else if (buf[n - 1] == '\0')
         return (0);
   }

Note that wctomb includes the terminating null character in the count it returns. mbtowc does not.
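
You can observe the difference directly. In this small sketch, both function names are hypothetical; each simply reports the count returned for a null character:

```c
#include <limits.h>
#include <stdlib.h>

/* Count reported by wctomb when it converts the null wide character. */
int null_count_wctomb(void)
{
    char buf[MB_LEN_MAX];

    wctomb(NULL, 0);            /* start in the initial shift state */
    return wctomb(buf, L'\0');  /* the count includes the null byte */
}

/* Count reported by mbtowc when it converts the null character. */
int null_count_mbtowc(void)
{
    wchar_t wc;

    mbtowc(NULL, NULL, 0);      /* start in the initial shift state */
    return mbtowc(&wc, "", 1);  /* zero signals the null character */
}
```

In the "C" locale the first function should return 1 and the second 0. In a locale with shift states, the wctomb count can be larger, since it covers any bytes needed to return to the initial shift state.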

Implementing the Functions

Listing 1 shows the file mbtowc.c and Listing 2 shows the file mblen.c. Both mbtowc and mblen call the internal function _Mbtowc to do the actual work. Each provides separate storage of type _Mbsave, defined in <stdlib.h>, to memorize the shift state while walking a multibyte string. The data objects _Mbxlen and _Mbxtowc both have names with external linkage. That permits the header <stdlib.h> to define masking macros for both functions. mblen can, in principle, be simpler than mbtowc. In this implementation, however, little difference exists between what the two functions must do.

Listing 3 shows the file mbstowcs.c. The function mbstowcs calls _Mbtowc repeatedly to translate an entire multibyte string to a wide character string. It too provides storage of type _Mbsave, but it need not retain the shift state between calls.
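
The same structure can be sketched with public interfaces. The version below is not Listing 3. It assumes the restartable function mbrtowc and the type mbstate_t from <wchar.h>, later additions to Standard C, as stand-ins for the internal _Mbtowc and _Mbsave. The name my_mbstowcs is hypothetical:

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Sketch of an mbstowcs-like function. mbrtowc and mbstate_t stand in
   for the internal _Mbtowc and _Mbsave, which are not shown here. */
size_t my_mbstowcs(wchar_t *wcs, const char *s, size_t n)
{
    mbstate_t state;    /* local, so nothing persists between calls */
    size_t count;

    memset(&state, 0, sizeof state);    /* initial shift state */
    for (count = 0; count < n; ++count) {
        size_t len = mbrtowc(&wcs[count], s, MB_CUR_MAX, &state);

        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;  /* invalid or incomplete sequence */
        if (len == 0)
            break;              /* stored the terminating zero code */
        s += len;
    }
    return count;       /* elements stored, not counting the zero code */
}
```

Keeping the state in an automatic variable is what lets the function begin every call in the initial shift state without disturbing the state that mbtowc or mblen maintains.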

Listing 4 shows the file xmbtowc.c. The function _Mbtowc parses a multibyte sequence far enough to develop the next wide character that it represents. It does so as a finite-state machine executing the state table stored at _Mbstate, defined in the file xstate.c. (See Standard C, March and April 1991 for a discussion of how to specify state tables as part of a locale.)

_Mbtowc must be particularly cautious because _Mbstate can be flawed. It can change with locale category LC_CTYPE in ways that the Standard C library cannot control.

Note the various ways that the function can elect to take an error return in Listing 4.

The rest of _Mbtowc is simple by comparison. The function retains the wide-character accumulator (ps->_Wchar) as part of the state memory. That simplifies generating a sequence of wide characters with a common component while in a given shift state. _Mbtowc returns after delivering each wide character.
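
To make the ideas of separate state memory and a retained accumulator concrete, here is a toy sketch. It does not reproduce Listing 4 or its table-driven design. The encoding, the type toy_state, and all three function names are invented for illustration:

```c
#include <stddef.h>

/* Hypothetical toy encoding, invented for this sketch:
   byte 0x0E locks in page 0x3000, byte 0x0F returns to page 0,
   and any other byte b yields the wide character (page | b). */
typedef struct {
    wchar_t page;   /* accumulator: component shared by later characters */
} toy_state;

/* Shared worker: returns bytes consumed, 0 for null, -1 on error. */
static int toy_core(wchar_t *pwc, const char *s, size_t n, toy_state *ps)
{
    size_t i;

    for (i = 0; i < n; ++i) {
        unsigned char b = (unsigned char)s[i];

        if (b == 0x0E)
            ps->page = 0x3000;      /* locking shift */
        else if (b == 0x0F)
            ps->page = 0;           /* shift back to the initial state */
        else if (b == 0 && ps->page != 0)
            return -1;              /* null must occur in the initial state */
        else {
            if (pwc != NULL)
                *pwc = ps->page | b;
            return b == 0 ? 0 : (int)(i + 1);
        }
    }
    return -1;      /* ran out of bytes before completing a character */
}

int toy_mbtowc(wchar_t *pwc, const char *s, size_t n)
{
    static toy_state st;    /* private to toy_mbtowc */

    if (s == NULL) {
        st.page = 0;
        return 1;           /* nonzero: the encoding is state dependent */
    }
    return toy_core(pwc, s, n, &st);
}

int toy_mblen(const char *s, size_t n)
{
    static toy_state st;    /* private to toy_mblen */

    if (s == NULL) {
        st.page = 0;
        return 1;
    }
    return toy_core(NULL, s, n, &st);
}
```

Because the page lives in the state memory, every character that follows a locking shift picks up the common component with no extra work. And because each wrapper owns its state, a scan with toy_mblen does not disturb one in progress with toy_mbtowc.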

Listing 5 shows the file wctomb.c. The function wctomb calls the internal function _Wctomb solely to provide separate state memory. In this case, the shift state can be stored in a data object of type char. The data object _Wcxtomb has a name with external linkage so that the header <stdlib.h> can define a masking macro for wctomb.

Listing 6 shows the file wcstombs.c. The function wcstombs calls _Wctomb repeatedly to translate a wide-character string to a multibyte string. It too provides its own state memory, but it need not retain the shift state between calls.

What makes this function complex is the finite length of the char array it writes. If at least MB_CUR_MAX elements remain, _Wctomb can deliver characters directly. Otherwise, wcstombs must store the generated characters in an array of length MB_LEN_MAX and deliver as many as it can.
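
The buffering logic can be sketched with the public wctomb standing in for the internal _Wctomb. This is not Listing 6. The name my_wcstombs is hypothetical, and unlike the real wcstombs this sketch disturbs the shift state of wctomb. It also stops before any character that will not fit, following the wording of the C Standard quoted earlier:

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Sketch only: the public wctomb stands in for the internal _Wctomb,
   so this version alters wctomb's shift state as a side effect. */
size_t my_wcstombs(char *s, const wchar_t *wcs, size_t n)
{
    size_t used = 0;

    wctomb(NULL, 0);    /* start in the initial shift state */
    while (used < n) {
        char buf[MB_LEN_MAX];
        char *dst;
        int len;

        /* write in place only when a maximum-length character fits */
        dst = (size_t)MB_CUR_MAX <= n - used ? s + used : buf;
        if ((len = wctomb(dst, *wcs++)) <= 0)
            return (size_t)-1;      /* no valid representation */
        if (dst == buf) {
            if (n - used < (size_t)len)
                break;              /* character would not fit */
            memcpy(s + used, buf, (size_t)len);
        }
        if (s[used + len - 1] == '\0')
            return used + len - 1;  /* terminator is not counted */
        used += len;
    }
    return used;
}
```

The local buffer matters only near the end of the output array. Everywhere else the characters land directly in place, with no extra copy.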

Listing 7 shows the file xwctomb.c. The function _Wctomb converts a wide character to the one or more characters that comprise its multibyte representation. It does so as a finite-state machine executing the state table stored at _Wcstate, defined in the file xstate.c.

_Wctomb must also be cautious because _Wcstate can also be flawed. It can change with locale category LC_CTYPE in ways that the Standard C library cannot control. Note the various ways that the function can elect to take an error return in Listing 7.

The rest of _Wctomb is likewise simple by comparison. It returns after consuming each input wide character.

This article is excerpted from P.J. Plauger, The Standard C Library, (Englewood Cliffs, N.J.: Prentice-Hall, 1992).