Columns


Standard C

Comparing Strings

P.J. Plauger


P.J. Plauger is senior editor of The C Users Journal. He is secretary of the ANSI C standards committee, X3J11, and convenor of the ISO C standards committee, WG14. His latest books are The Standard C Library, published by Prentice-Hall, and ANSI and ISO Standard C (with Jim Brodie), published by Microsoft Press. You can reach him at pjp@plauger.com

Introduction

Last month, I introduced the header <string.h>. (See Standard C, CUJ Oct. '92.) I discussed the functions that copy and concatenate strings, and a few others. I also showed how to implement them. This month, I discuss the functions declared in <string.h> that help you compare strings in various ways.

What the C Standard Says

7.11.4 Comparison functions

The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.

7.11.4.1 The memcmp function

Synopsis

#include <string.h>
int memcmp(const void *s1, const void *s2, size_t n);

Description

The memcmp function compares the first n characters of the object pointed to by s1 to the first n characters of the object pointed to by s2 .136

Returns

The memcmp function returns an integer greater than, equal to, or less than zero, accordingly as the object pointed to by s1 is greater than, equal to, or less than the object pointed to by s2.

7.11.4.2 The strcmp function

Synopsis

#include <string.h>
int strcmp(const char *s1, const char *s2);

Description

The strcmp function compares the string pointed to by s1 to the string pointed to by s2.

Returns

The strcmp function returns an integer greater than, equal to, or less than zero, accordingly as the string pointed to by s1 is greater than, equal to, or less than the string pointed to by s2.

7.11.4.3 The strcoll function

Synopsis

#include <string.h>
int strcoll(const char *s1, const char *s2);

Description

The strcoll function compares the string pointed to by s1 to the string pointed to by s2, both interpreted as appropriate to the LC_COLLATE category of the current locale.

Returns

The strcoll function returns an integer greater than, equal to, or less than zero, accordingly as the string pointed to by s1 is greater than, equal to, or less than the string pointed to by s2 when both are interpreted as appropriate to the current locale.

7.11.4.4 The strncmp function

Synopsis

#include <string.h>
int strncmp(const char *s1, const char *s2, size_t n);

Description

The strncmp function compares not more than n characters (characters that follow a null character are not compared) from the array pointed to by s1 to the array pointed to by s2.

Returns

The strncmp function returns an integer greater than, equal to, or less than zero, accordingly as the possibly null-terminated array pointed to by s1 is greater than, equal to, or less than the possibly null-terminated array pointed to by s2.

7.11.4.5 The strxfrm function

Synopsis

#include <string.h>
size_t strxfrm(char *s1, const char *s2, size_t n);

Description

The strxfrm function transforms the string pointed to by s2 and places the resulting string into the array pointed to by s1. The transformation is such that if the strcmp function is applied to two transformed strings, it returns a value greater than, equal to, or less than zero, corresponding to the result of the strcoll function applied to the same two original strings. No more than n characters are placed into the resulting array pointed to by s1, including the terminating null character. If n is zero, s1 is permitted to be a null pointer. If copying takes place between objects that overlap, the behavior is undefined.

Returns

The strxfrm function returns the length of the transformed string (not including the terminating null character). If the value returned is n or more, the contents of the array pointed to by s1 are indeterminate.

Example

The value of the following expression is the size of the array needed to hold the transformation of the string pointed to by s.

1 + strxfrm(NULL, s, 0)
Footnote:

136. The contents of "holes" used as padding for purposes of alignment within structure objects are indeterminate. Strings shorter than their allocated space and unions may also cause problems in comparison.

Using the Comparison Functions

memcmp — This function offers the quickest way to determine whether two character sequences of the same known length match character for character. You can also use it to establish a lexical ordering between two character sequences, but that ordering can change among implementations. If a portable result is important, you must write your own comparison function.

strcmp — This function offers the quickest way to determine whether two null-terminated strings match character for character. You can also use it to establish a lexical ordering between two strings, but that ordering can change among implementations. If a portable result is important, you must write your own comparison function. See also strcoll and strxfrm, below.

strcoll — Use this function to determine the locale-specific lexical ordering of two null-terminated strings. You must know the current status of locale category LC_COLLATE to use this function wisely. (You must at least assume that someone else has set this category wisely.) Under some circumstances, you may want to use strxfrm, described below, instead.

strncmp — This function offers the quickest way to determine whether two character sequences of the same known length match character for character up to and including any null character in both. You can also use it to establish a lexical ordering between two such character sequences, but that ordering can change among implementations. If a portable result is important, you must write your own comparison function.

strxfrm — Use strxfrm(s1, s2, n) to map the null-terminated string s2 to a (non-overlapping) version at s1. Strings you map this way can later be compared by calling strcmp. The comparison determines the locale-specific lexical ordering of the two strings that you mapped from. You must know the current status of locale category LC_COLLATE to use this function wisely. (You must at least assume that someone else has set this category wisely.) Under most circumstances, you may want to use strcoll, described above, instead. Use strxfrm if you plan to make repeated comparisons or if the locale may change before you can make the comparison. Use malloc, declared in <stdlib.h>, to allocate storage for s1, as in:

size_t n = strxfrm(NULL, s2, 0);
char *s1 = malloc(n + 1);

if (s1)
   strxfrm(s1, s2, n);
The first call to strxfrm determines the amount of storage required. The second performs the conversion (again) and stores the translated string in the allocated array.

Implementing the Comparison Functions

Listing 1 shows the file memcmp.c. memcmp is careful to perform unsigned char comparisons to meet the requirements of the C Standard.

Listing 2 shows the file strncmp.c. The function strncmp is similar to memcmp, except that it also stops on a terminating null character. And unlike memcmp, strncmp can use its pointer arguments directly. It type casts them to pointer to unsigned char only to compute a nonzero return value.

Listing 3 shows the file strcmp.c. The function strcmp differs from strncmp only in not worrying about a limiting string length n.

The last two functions help you perform locale-specific string collation. Both strcoll and strxfrm determine collation sequence by mapping strings to a form that collates properly when compared using strcmp. The locale category LC_COLLATE determines this mapping. (See "Standard C", CUJ Nov. 1990.) It does so, in this implementation at least, by specifying a table-driven finite-state machine. You can write your own state tables in a locale file, as I describe briefly below. The state tables are used by the internal function _Strxfrm. Thus, strcoll and strxfrm call _Strxfrm to map strings appropriately.

Listing 4 shows the file xstrxfrm.h. All the collation functions include the internal header "xstrxfrm.h". It includes, in turn, the standard header <string.h> and the internal header "xstate.h", shown in Listing 5. (I won't describe all the magic here. I outlined the locale-file language for state tables in "Standard C", CUJ Jan. 1991.) "xstrxfrm.h" defines the type _Cosave and declares the function _Strxfrm. A data object of type _Cosave stores state information between calls to _Strxfrm.

Listing 6 shows the file strxfrm.c. The function strxfrm best illustrates how the collation functions work together. It stores the mapped string in the buffer pointed to by s1, of length n. Once the buffer is full, the function translates the remainder of the source string to determine the full length of the mapped string. strxfrm stores any such excess characters in its own dynamic temporary buffer buf.

Listing 7 shows the file xstrxfrm.c. It defines the function _Strxfrm that performs the actual mapping. It does so as a finite-state machine executing the state table stored at _Wcstate, defined in the file xstate.c. I won't repeat that file here — it simply initializes _Wcstate to point at a table that transforms strings unchanged.

_Strxfrm must be particularly cautious because _Wcstate can be flawed. It can change with locale category LC_COLLATE in ways that the Standard C library cannot control.

Note the various ways that the function can elect to take an error return:

The rest of _Strxfrm is simple by comparison. The function retains the wide-character accumulator (ps->_Wchar) as part of the state memory. That simplifies generating a sequence of mapped characters with a common component while in a given shift state. _Strxfrm returns after it fills the output buffer (with size characters) or whenever it encounters the terminating null character in the source string.

That can happen more than once. Note the careful way that strxfrm distinguishes the three reasons why_Strxfrm returns:

Listing 8 shows the file strcoll.c. The function strcoll is somewhat more complex than strxfrm. It must translate two source strings a piece at a time so that it can compare their mapped forms. The type Sctl describes a data object that holds the information needed to process each source string. The internal function getxfrm calls _Strxfrm to update an Sctl data object.

The comparison loop within strcoll thus calls getxfrm for each source string that has no mapped characters in its Sctl buffer. That ensures that each source string is represented by at least one mapped character, if any such characters remain to be generated.strcoll compares all the mapped characters that it can. It returns zero only if both mapped strings compare equal character by character and have the same length.

State Tables

The function _Strxfrm maps one character string into another a byte at a time. It begins in state zero with a zero (16-bit) accumulator. It can transfer among up to 16 different states. It uses the next character to map as an index into the current state table. That tells the function what operations to perform and what state to enter next. The operations are:

This is not an easy language to program in, but you can do a lot with it. Just think of all the tricks you can play with a 16-state machine.
Listing 9 shows a simple example. It implements a "dictionary sort" by the rules:

The actual mapping discards all characters but letters on the first pass, mapping upper case to lower case. It then appends a dot ('.'). On the second pass over the input string, it copies the input unchanged.

Here's how to read Listing 9. The first line defines a locale called DICT. The last line terminates the definition. The locale differs from the "C" locale only in the category LC_COLLATE. The second line is simply a comment. The remaining lines rewrite the collation state tables zero and one.

Now examine the line

collate[0, 0     ] '.'     $0  $I  $1
It describes how to rewrite table 0, entry 0 (0, 0). The mapped value is the character dot ('.'). The operations to perform are output and input ($0 $I). And the successor state is 1 ($1). That entry handles putting out the dot when the null is encountered at the end of the first pass.

The next entry rewrites the remainder of table 0 to consume all input and stay in state 0. The one after that re-rewrites the table for the lower case letters. And the one after that fixes up the upper case letters to map to lower case.

The last entry defines the entire table 1 to put out characters unchanged. No need to treat the null character specially here. It automatically stops the translation when it gets copied to the output string.

There seem to be no end of collation rules used around the world. Many European languages treat accented characters mostly the same as unaccented, except to distinguish otherwise identical strings. This scheme handles all the cases I've seen so far. It just takes some practice to get the notation down.

I'll leave it to your imagination to try more ambitious mappings. One of my favorite exercises simply extends the dictionary sort for telephone directories. Say you want names beginning with "Mac" to sort as if they began with "Mc". It takes a few more states, but it's quite doable. Try it.

This article is excerpted in part from P.J. Plauger, The Standard C Library, (Englewood Cliffs, N.J.: Prentice-Hall, 1992).