P.J. Plauger is senior editor of The C Users Journal. He is secretary of the ANSI C standards committee, X3J11, and convenor of the ISO C standards committee, WG14. His latest books are The Standard C Library, published by Prentice-Hall, and ANSI and ISO Standard C (with Jim Brodie), published by Microsoft Press. You can reach him at pjp@plauger.com
Introduction
Last month, I introduced the header <string.h>. (See Standard C, CUJ Oct. '92.) I discussed the functions that copy and concatenate strings, and a few others. I also showed how to implement them. This month, I discuss the functions declared in <string.h> that help you compare strings in various ways.
What the C Standard Says
7.11.4 Comparison functions
The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.
7.11.4.1 The memcmp function
Synopsis
#include <string.h> int memcmp(const void *s1, const void *s2, size_t n);Description
The memcmp function compares the first n characters of the object pointed to by s1 to the first n characters of the object pointed to by s2 .136
Returns
The memcmp function returns an integer greater than, equal to, or less than zero, accordingly as the object pointed to by s1 is greater than, equal to, or less than the object pointed to by s2.
7.11.4.2 The strcmp function
Synopsis
#include <string.h> int strcmp(const char *s1, const char *s2);Description
The strcmp function compares the string pointed to by s1 to the string pointed to by s2.
Returns
The strcmp function returns an integer greater than, equal to, or less than zero, accordingly as the string pointed to by s1 is greater than, equal to, or less than the string pointed to by s2.
7.11.4.3 The strcoll function
Synopsis
#include <string.h> int strcoll(const char *s1, const char *s2);Description
The strcoll function compares the string pointed to by s1 to the string pointed to by s2, both interpreted as appropriate to the LC_COLLATE category of the current locale.
Returns
The strcoll function returns an integer greater than, equal to, or less than zero, accordingly as the string pointed to by s1 is greater than, equal to, or less than the string pointed to by s2 when both are interpreted as appropriate to the current locale.
7.11.4.4 The strncmp function
Synopsis
#include <string.h> int strncmp(const char *s1, const char *s2, size_t n);Description
The strncmp function compares not more than n characters (characters that follow a null character are not compared) from the array pointed to by s1 to the array pointed to by s2.
Returns
The strncmp function returns an integer greater than, equal to, or less than zero, accordingly as the possibly null-terminated array pointed to by s1 is greater than, equal to, or less than the possibly null-terminated array pointed to by s2.
7.11.4.5 The strxfrm function
Synopsis
#include <string.h> size_t strxfrm(char *s1, const char *s2, size_t n);Description
The strxfrm function transforms the string pointed to by s2 and places the resulting string into the array pointed to by s1. The transformation is such that if the strcmp function is applied to two transformed strings, it returns a value greater than, equal to, or less than zero, corresponding to the result of the strcoll function applied to the same two original strings. No more than n characters are placed into the resulting array pointed to by s1, including the terminating null character. If n is zero, s1 is permitted to be a null pointer. If copying takes place between objects that overlap, the behavior is undefined.
Returns
The strxfrm function returns the length of the transformed string (not including the terminating null character). If the value returned is n or more, the contents of the array pointed to by s1 are indeterminate.
Example
The value of the following expression is the size of the array needed to hold the transformation of the string pointed to by s.
1 + strxfrm(NULL, s, 0)Footnote:136. The contents of "holes" used as padding for purposes of alignment within structure objects are indeterminate. Strings shorter than their allocated space and unions may also cause problems in comparison.
Using the Comparison Functions
memcmp This function offers the quickest way to determine whether two character sequences of the same known length match character for character. You can also use it to establish a lexical ordering between two character sequences, but that ordering can change among implementations. If a portable result is important, you must write your own comparison function.strcmp This function offers the quickest way to determine whether two null-terminated strings match character for character. You can also use it to establish a lexical ordering between two strings, but that ordering can change among implementations. If a portable result is important, you must write your own comparison function. See also strcoll and strxfrm, below.
strcoll Use this function to determine the locale-specific lexical ordering of two null-terminated strings. You must know the current status of locale category LC_COLLATE to use this function wisely. (You must at least assume that someone else has set this category wisely.) Under some circumstances, you may want to use strxfrm, described below, instead.
strncmp This function offers the quickest way to determine whether two character sequences of the same known length match character for character up to and including any null character in both. You can also use it to establish a lexical ordering between two such character sequences, but that ordering can change among implementations. If a portable result is important, you must write your own comparison function.
strxfrm Use strxfrm(s1, s2, n) to map the null-terminated string s2 to a (non-overlapping) version at s1. Strings you map this way can later be compared by calling strcmp. The comparison determines the locale-specific lexical ordering of the two strings that you mapped from. You must know the current status of locale category LC_COLLATE to use this function wisely. (You must at least assume that someone else has set this category wisely.) Under most circumstances, you may want to use strcoll, described above, instead. Use strxfrm if you plan to make repeated comparisons or if the locale may change before you can make the comparison. Use malloc, declared in <stdlib.h>, to allocate storage for s1, as in:
size_t n = strxfrm(NULL, s2, 0); char *s1 = malloc(n + 1); if (s1) strxfrm(s1, s2, n);The first call to strxfrm determines the amount of storage required. The second performs the conversion (again) and stores the translated string in the allocated array.
Implementing the Comparison Functions
Listing 1 shows the file memcmp.c. memcmp is careful to perform unsigned char comparisons to meet the requirements of the C Standard.Listing 2 shows the file strncmp.c. The function strncmp is similar to memcmp, except that it also stops on a terminating null character. And unlike memcmp, strncmp can use its pointer arguments directly. It type casts them to pointer to unsigned char only to compute a nonzero return value.
Listing 3 shows the file strcmp.c. The function strcmp differs from strncmp only in not worrying about a limiting string length n.
The last two functions help you perform locale-specific string collation. Both strcoll and strxfrm determine collation sequence by mapping strings to a form that collates properly when compared using strcmp. The locale category LC_COLLATE determines this mapping. (See "Standard C", CUJ Nov. 1990.) It does so, in this implementation at least, by specifying a table-driven finite-state machine. You can write your own state tables in a locale file, as I describe briefly below. The state tables are used by the internal function _Strxfrm. Thus, strcoll and strxfrm call _Strxfrm to map strings appropriately.
Listing 4 shows the file xstrxfrm.h. All the collation functions include the internal header "xstrxfrm.h". It includes, in turn, the standard header <string.h> and the internal header "xstate.h", shown in Listing 5. (I won't describe all the magic here. I outlined the locale-file language for state tables in "Standard C", CUJ Jan. 1991.) "xstrxfrm.h" defines the type _Cosave and declares the function _Strxfrm. A data object of type _Cosave stores state information between calls to _Strxfrm.
Listing 6 shows the file strxfrm.c. The function strxfrm best illustrates how the collation functions work together. It stores the mapped string in the buffer pointed to by s1, of length n. Once the buffer is full, the function translates the remainder of the source string to determine the full length of the mapped string. strxfrm stores any such excess characters in its own dynamic temporary buffer buf.
Listing 7 shows the file xstrxfrm.c. It defines the function _Strxfrm that performs the actual mapping. It does so as a finite-state machine executing the state table stored at _Wcstate, defined in the file xstate.c. I won't repeat that file here it simply initializes _Wcstate to point at a table that transforms strings unchanged.
_Strxfrm must be particularly cautious because _Wcstate can be flawed. It can change with locale category LC_COLLATE in ways that the Standard C library cannot control.
Note the various ways that the function can elect to take an error return:
The rest of _Strxfrm is simple by comparison. The function retains the wide-character accumulator (ps->_Wchar) as part of the state memory. That simplifies generating a sequence of mapped characters with a common component while in a given shift state. _Strxfrm returns after it fills the output buffer (with size characters) or whenever it encounters the terminating null character in the source string.
- if a transfer occurs to an undefined state
- if no state table exists for a given state
- if the function makes so many state transitions since generating an output character that it must be looping
- if the state table entry specifically signals an error
That can happen more than once. Note the careful way that strxfrm distinguishes the three reasons why_Strxfrm returns:
Listing 8 shows the file strcoll.c. The function strcoll is somewhat more complex than strxfrm. It must translate two source strings a piece at a time so that it can compare their mapped forms. The type Sctl describes a data object that holds the information needed to process each source string. The internal function getxfrm calls _Strxfrm to update an Sctl data object.
- If the last character delivered is a null character, the translation is complete. _Strxfrm delivers a null character if an error occurs. It also jiggers the stored state information to fail immediately should it be inadvertently called again for the same string.
- Otherwise, if the next source character is a null character, _Strxfrm wants to rescan the source string. _Strxfrm will not point past a null character in the source string.
- Otherwise, _Strxfrm wants to continue where it left off.
The comparison loop within strcoll thus calls getxfrm for each source string that has no mapped characters in its Sctl buffer. That ensures that each source string is represented by at least one mapped character, if any such characters remain to be generated.strcoll compares all the mapped characters that it can. It returns zero only if both mapped strings compare equal character by character and have the same length.
State Tables
The function _Strxfrm maps one character string into another a byte at a time. It begins in state zero with a zero (16-bit) accumulator. It can transfer among up to 16 different states. It uses the next character to map as an index into the current state table. That tells the function what operations to perform and what state to enter next. The operations are:
This is not an easy language to program in, but you can do a lot with it. Just think of all the tricks you can play with a 16-state machine. Listing 9 shows a simple example. It implements a "dictionary sort" by the rules:
- $F Map the character to a new value and fold that value into the accumulator, replacing the lower eight bits.
- $R Rotate the accumulator left eight bits.
- $O If the new value is nonzero, map the character to that value. Otherwise, take the low eight bits of the accumulator as the mapped value. In either case, output the mapped character to the destination string. If the mapped value is zero, return.
- $I If the character to map is nonzero (not the null character), point past it. Otherwise, return.
The actual mapping discards all characters but letters on the first pass, mapping upper case to lower case. It then appends a dot ('.'). On the second pass over the input string, it copies the input unchanged.
- Ignore all characters except letters, and treat upper case letters the same as lower case.
- If two strings prove identical by the first rule, compare them again as raw ASCII text.
Here's how to read Listing 9. The first line defines a locale called DICT. The last line terminates the definition. The locale differs from the "C" locale only in the category LC_COLLATE. The second line is simply a comment. The remaining lines rewrite the collation state tables zero and one.
Now examine the line
collate[0, 0 ] '.' $0 $I $1It describes how to rewrite table 0, entry 0 (0, 0). The mapped value is the character dot ('.'). The operations to perform are output and input ($0 $I). And the successor state is 1 ($1). That entry handles putting out the dot when the null is encountered at the end of the first pass.The next entry rewrites the remainder of table 0 to consume all input and stay in state 0. The one after that re-rewrites the table for the lower case letters. And the one after that fixes up the upper case letters to map to lower case.
The last entry defines the entire table 1 to put out characters unchanged. No need to treat the null character specially here. It automatically stops the translation when it gets copied to the output string.
There seem to be no end of collation rules used around the world. Many European languages treat accented characters mostly the same as unaccented, except to distinguish otherwise identical strings. This scheme handles all the cases I've seen so far. It just takes some practice to get the notation down.
I'll leave it to your imagination to try more ambitious mappings. One of my favorite exercises simply extends the dictionary sort for telephone directories. Say you want names beginning with "Mac" to sort as if they began with "Mc". It takes a few more states, but it's quite doable. Try it.
This article is excerpted in part from P.J. Plauger, The Standard C Library, (Englewood Cliffs, N.J.: Prentice-Hall, 1992).