C/C++ Contributing Editors


C++ Theory and Practice:
Replacing Character Arrays with Strings, Part 1

Dan Saks

It is a truism that well-designed library objects are superior to the more primitive C data structures. But it still helps to know the costs of converting to them.


Copyright © 1999 by Dan Saks

For most of the past year, I've been discussing principles of large-scale software design and how you can use C++ to apply those principles. I've tried to keep discussion practical by applying the principles to a specific programming example, a program called xr that generates a cross-reference of words read from a text stream. Though hardly a large program, xr is large enough to demonstrate many of the points I've been trying to make.

The unifying theme of all these articles is that good design and programming fosters an illusion of simplicity. It does this by decomposing programs into units — typically classes — that represent abstractions. Each class should hide some of the program's details from the rest of the program. By hiding details, each class lets you act in many ways as if the program is simpler than it really is.

Two of my more recent articles [1, 2] focused on another general principle for decomposing programs into classes, namely,

I started applying this principle in rewriting xr, but I haven't finished.

Although the Standard C++ library provides several general-purpose classes that I could use to simplify the program, I've been holding off on using them until now. As I explained in my previous article [3], the generality of the library classes sometimes exacts a heavy toll in both run-time and code size. I don't like to pay until I have a better sense of the price.

This month, we'll continue exploring the consequences of hiding each design decision in a separate class. Along the way, we'll see if it's possible to integrate the standard string class into the program without paying too dearly.

Picking Up the Thread

xr reads text from standard input and writes a cross-reference listing to standard output. The output is an alphabetized list of the words (actually identifiers as in C++) that appeared in the input. Each line in the output contains one word followed by the sequence of unique line numbers on which that word appeared in the input.

xr uses a binary tree to accumulate the words in alphabetical order. The program uses a class called cross_reference_table to encapsulate the tree [4]. table.h in Listing 1 contains the cross_reference_table class and inline member function definitions. table.cpp in Listing 2 contains the non-inline definitions for the cross_reference_table class.

Each node in the tree holds a word and its corresponding sequence of line numbers. The nodes in the tree are all of type tree_node, a structure declared in Listing 1 as a private member of class cross_reference_table. The complete definition for tree_node appears in Listing 2 as:

struct
cross_reference_table::tree_node
   {
   tree_node(unsigned n);
   deep_pointer<char> word;
   line_number_sequence lines;
   deep_pointer<tree_node>
      left, right;
   };

deep_pointer is a class template. For any type T, a deep_pointer<T> behaves just like a T *, except that a deep_pointer<T> const behaves like a T const *const [5].

line_number_sequence is a class that stores a sequence of line numbers. The class definition appears in Listing 3.

Conceptually, each word in a tree_node is a variable-length character string. But its declaration reveals that word is actually a pointer to the zeroth character of a null-terminated character sequence. Code in the body of cross_reference_table::add reveals that each null-terminated character sequence resides in a dynamically allocated array of char. The principle of hiding each design decision in a separate class suggests that these implementation details ought to be tucked away in a separate class.

Rather than declare tree_node's word as:

deep_pointer<char> word;

the program should use a type for the abstract concept of a variable-length character string. That type could be the standard library's string class or one written specifically for this application. In either event, the declaration of member word should be something such as:

string word;

With this change, tree_node looks like:

struct
cross_reference_table::tree_node
   {
   tree_node(unsigned n);
   string word;
   line_number_sequence lines;
   deep_pointer<tree_node>
      left, right;
   };

which reflects the program's design a bit better than the previous tree_node definition. Even without comments, it says "each node in the cross-reference table's tree associates a string with a corresponding line number sequence."

Eliminating Arbitrary Limits

A string class will be handy in other parts of the program. xr acquires the strings it places in the cross-reference table by reading them from standard input. The get_token function does most of that work, with a little help from main.

main and get_token appear in the source file xr.cpp (Listing 4). main repeatedly calls:

int get_token(char *s, size_t n);

to scan the next token from the input until it exhausts the input. If it detects a token, get_token copies that token and a trailing null into the character array starting at s, and returns a non-zero value (meaning true). If there's nothing left to scan, get_token copies just a null character (an empty string) into s, and returns zero (meaning false).

get_token's second parameter, n, is the maximum number of characters (including the null character) that s can hold. get_token takes pains to avoid copying more than n characters into s. When it encounters a token that's too long to fit in s, get_token simply discards the excess characters at the end of the token.

Most of the tokens in a typical input stream (a text file containing either prose or source code) are fairly short. Few are more that twenty characters long. Nonetheless, xr should be able to handle longer tokens when they occur. The main function in Listing 4 uses a fixed-length array whose size is generous enough to hold some pretty large tokens:

#define MAX_TOKEN 256
...
char token[MAX_TOKEN];

The value 256 was an arbitrary choice. Choosing such arbitrary numbers is always tricky business, and it often leaves you with maintenance or performance problems. If you choose a number that's too small, the program might not handle long input properly. If you choose a number that's too big, the program may tie up resources — usually memory — that it never uses.

A variable-length string class, such as the one in the Standard C++ library, eliminates the need to specify a maximum length for the input tokens. String objects can start out empty and expand as you stuff characters into them. Standard strings grow to hold any number of characters up to some implementation-defined limit. That limit is often exceedingly generous.

Scanning into a String

Let's rewrite xr so that it uses standard string objects instead of character arrays. As most of you know by now, I'm fairly conservative when it comes to making changes to working programs, so I'd rather not make too many changes all at once. Let's start by rewriting just xr.cpp (the main source file in Listing 4) and see how it works out.

Begin by adding:

#include <string>

somewhere near the top of the source file. This introduces the definition for the standard string class into the compilation.

The standard string class is a member of namespace std. You might want to add the using directive:

using namespace std;

or the using declaration:

using std::string;

so that you can use the unqualified name string instead of the fully-qualified name std::string. (See [6] for a refresher on using-directives and -declarations.) I recommend holding off on this until it's evident that using the fully-qualified name is inconvenient.

Most of the work involves rewriting get_token so that it scans the token into a string instead of a character array. A declaration for get_token appears above main as:

int get_token(char *s, size_t n);

You should change the type of parameter s from char * to either std::string * or std::string &. The second parameter, n, specifies the maximum capacity of s. Since strings grow as needed to hold whatever you stuff into them, get_token has no use for n. You can just discard it.

The choice of passing by address or passing by reference is largely a matter personal preference. It shouldn't affect performance. However, functions that modify a string parameter typically pass the string by reference, and I think it's appropriate here, too.

Although get_token's return type is int, the only values it returns are 0 (meaning false) and 1 (meaning true). Using int as the return type is just a holdover from when xr was a C program. As long as you're rewriting the function, you might as well change the return type to bool. The declaration now looks like:

bool get_token(std::string &s);

Now, let's rewrite main. Change the declaration for the local variable token from:

char token[MAX_TOKEN];

to:

std::string token;

This declaration invokes the default constructor, which initializes token as an empty string.

Next, change the condition in the while-loop from:

while (get_token(token, MAX_TOKEN))

to just:

while (get_token(token))

This eliminates all references to the symbol MAX_TOKEN. Thus, you can remove MAX_TOKEN's definition, which appears just above main.

main needs one more change. The add member function in the cross-reference table class is still declared as:

void add(char const *w, unsigned n);

Now that token has type std::string, the call:

table.add(token, ln);

appearing in main no longer compiles. The standard string class doesn't have a conversion operator that will convert a string to a char const * implicitly. Rather, it has a member:

char const *c_str() const;

that will do the conversion explicitly. Using this conversion function, the call to the add function becomes:

table.add(token.c_str(), ln);

Now let's look in the body of get_token. You need not change the while-loop at the beginning of the function, but the if-statement:

if (c == EOF)
   {
   *s = '\0';
   return 0;
   }

needs work.

When s pointed into an array, the assignment:

*s = '\0';

turned s into an empty sequence. Now that s refers to a string, you can empty it by using any of the following statements:

s = "";
s.clear();
s.erase();

(Microsoft Visual C++ 6.0 doesn't like s.clear() but it accepts the other forms.) Now that get_token returns a bool,

return 0;

should become:

return false;

It's been a while since I looked carefully at this part of the program, and it occurs to me now that setting the string empty isn't really necessary in this case. Code that calls get_token isn't supposed to pay any attention to the string unless get_token returns true. I think you can rewrite the if-statement as just:

if (c == EOF)
   return false;

These two assignments appear immediately after the if-statement:

*s++ = c;
n -= 2;

get_token no longer uses n, so you can just erase the latter statement. The former takes a bit more thought.

When s pointed into an array, the first of these assignments placed c into the first element, and positioned s to the second element. When s refers to a string, that statement becomes simply:

s = c;

which assigns the single character c to s. This assignment treats c as a character even though c has type int.

Whereas:

s = c;

replaces the value of s with c,

s += c;

appends c to the value of s. The latter will not work in this situation, for the following reason.

On each call to get_token, s refers to local variable token in main. On the first call, token is an empty string, so assigning and appending produce the same result. Neither get_token nor main sets token back to empty. Thus, on subsequent calls to get_token, token is non-empty. The assignment:

s = c;

in get_token wipes out the old value in s and starts assembling a new one. In contrast,

s += c;

appends the first character of the new token after the last character of the previous token. get_token will return a string containing every token it's seen so far. I don't think you want that to happen.

If you really prefer to replace:

*s++ = c;

with:

s += c;

you can get it to work if you clear or erase s first. Just insert:

s.erase();

(or one of its alternatives) at the beginning of get_token. This also restores get_token's old behavior of returning an empty string when it encounters EOF.

The next stretch of code to consider is in the second loop in get_token:

if (n > 0)
   {
   *s++ = c;
   —n;
   }

This code uses n to determine if there's room in the array to append c to the token. With strings, there's always room (up to some exceedingly generous limit), so you can delete everything except the assignment:

*s++ = c;

This you should replace with:

s += c;

Here, get_token really is appending each character to the token.

Finally, replace the last two statements in get_token:

*s = '\0';
return 1;

with just:

return true;

You never need to place an end marker such as '\0' into a string. Each string keeps track of its own length. Placing '\0' into a string object has no special significance. To a string, '\0' is just another character value.

A version of xr.cpp using strings appears in Listing 5. The fully-qualified name std::string appears only three times in the listing. The source file doesn't need a using-directive or using-declaration after all.

No Free Lunch

Of course, the flexibility and convenience of string objects comes at a price. Strings grow and shrink by allocating and deallocating memory at run time. All that allocating and deallocating uses memory more frugally, at the expense of greater code size and execution time.

To try to get a sense for what this costs I ran a set of measurements similar to the ones I did in my last column. I compared several versions of the cross-reference program:

1. a version written in Standard C, similar in style to the one appearing in exercise 6-3 of Tondo and Gimpel;

2. a version that uses classes for cross-reference tables and line number sequences, but still using stdio and character arrays. (It uses xr.cpp from Listing 4.)

3. a version that uses classes and stdio as in version 2, but uses strings instead of character arrays for input. (It uses xr.cpp from Listing 5.)

4. a variation of version 3 in which get_token erases the strings each time before appending characters. (It uses the get_token function from Figure 1.)

The first two programs are the same as ones I measured last time. However, since I made the earlier measurements, I replaced the hard drive on my machine with a bigger drive. I also switched from using Windows 95 to Windows 98. The run times for the first two programs are a little slower than they were.

Anyway, the results of my new measurements appear in Table 1. With compilers B and C, just using strings instead of character arrays for getting tokens increased the execution time for the C++ program by about 15%. I think that's tolerable for most non-time-critical work. For compiler A, the time increased by about 50%. That's a bit less tolerable.

The last row of the table shows that a seemingly small change in a string algorithm might produce a surprisingly large change in program execution time.

Next time, I'll look at integrating strings into the rest of the cross-reference program.

References

[1] Dan Saks. "C++ Theory and Practice: Isolating Design Decisions, Part 1," CUJ, July 1999.

[2] Dan Saks. "C++ Theory and Practice: Isolating Design Decisions, Part 2," CUJ, August 1999.

[3] Dan Saks. "C++ Theory and Practice: Standard C++ as a High-Level Language?," CUJ, November 1999.

[4] Dan Saks. "C++ Theory and Practice: Partitioning with Classes," CUJ, February 1999.

[5] Dan Saks. "C++ Theory and Practice: Thinking Even Deeper," CUJ, July 1999.

[6] Dan Saks. "C++ Theory and Practice: An Introduction to Namespaces," CUJ, January 1998.

Dan Saks is the president of Saks & Associates, which offers training and consulting in C++ and C. He is active in C++ standards, having served nearly seven years as secretary of the ANSI and ISO C++ standards committees. Dan is coauthor of C++ Programming Guidelines, and codeveloper of the Plum Hall Validation Suite for C++ (both with Thomas Plum). You can reach him at 393 Leander Dr., Springfield, OH 45504-4906 USA, by phone at +1-937-324-3601, or electronically at dsaks@wittenberg.edu.