An Idea for Dynamic C Strings

Daniel Lawrence

Give <string.h> its walking papers with this slick dynamic string library for C.

Native string handling in C leaves something to be desired. C provides no automatic bounds checking and no automatic growth. The programmer must always know the size of the buffer he is working with and must check sizes on every buffer operation. This is tedious and error prone. This article presents an idea that might make C string handling a little easier.

Usual Solutions

The usual solution to the string problem involves special string structures and a library of functions. The structures hold buffer sizes, and the library functions grow character arrays whenever necessary.

A typical structure looks something like this:

struct {
     char *buf;
     size_t size;
     size_t len;
};
size is the allocated size of the buffer, and len is the amount of the buffer currently used. Every time buf is written to, size and len are checked. If the operation would cause the buffer to overrun, the buffer is grown using realloc first.
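
To make the check-then-grow step concrete, here is a minimal sketch of an append routine for such a structure. The names dynbuf and dynbuf_append, the starting size of 16, and the doubling policy are all my own assumptions; the article's structure is anonymous and prescribes no growth strategy.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical names: the structure in the text is anonymous. */
struct dynbuf {
    char  *buf;   /* heap buffer, or NULL while empty */
    size_t size;  /* bytes allocated */
    size_t len;   /* bytes currently in use */
};

/* Append n bytes, growing the buffer with realloc first if they would
   not fit. Returns 0 on success, -1 if memory runs out (in which case
   the buffer is left unchanged). */
int dynbuf_append(struct dynbuf *d, const char *data, size_t n)
{
    if (n == 0)
        return 0;
    if (n > d->size - d->len) {
        size_t newsize = d->size ? d->size : 16;  /* assumed start size */
        char *p;

        while (newsize - d->len < n) {
            if (newsize > (size_t)-1 / 2)
                return -1;                        /* doubling would overflow */
            newsize *= 2;
        }
        p = realloc(d->buf, newsize);
        if (p == NULL)
            return -1;
        d->buf = p;
        d->size = newsize;
    }
    memcpy(d->buf + d->len, data, n);             /* embedded \0s are fine */
    d->len += n;
    return 0;
}
```

Because the length is stored rather than inferred, a call such as dynbuf_append(&d, "ab\0cd", 5) stores all five bytes, which is exactly what a \0-terminated string cannot do.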

The advantage to a system like this is that the string may contain any characters, including embedded \0s. However, every operation on the string requires a special function, and there have to be ways to convert back and forth between the special buffer and native C strings. These special buffers don't really feel like strings, and there is the question of how well these special structures will integrate with additional third-party libraries that may be part of the application. Additionally, the initialization and freeing of these special structures are themselves housekeeping operations that must be managed. The programmer almost has to "buy into" non-standard string management techniques.

What We Want

Although the special string structures do work, they are sometimes a bit over the top. They are cumbersome to use. What we really want is something that is compatible with native strings, but will handle automatic growth. Embedded \0s are not always necessary, so perhaps a simpler idea can be used to augment native string handling just a little.

Idea

Consider the fact that the standard memory allocation functions malloc and friends carry the same information as the above structure. When you free an allocated region of memory, the allocation library knows the size of the chunk. The programmer does not have to manage anything but a pointer to the actual usable memory area.

The same idea might be used to manage dynamic strings. Suppose you have a specialized string allocation function that returns a normal char *, but also records allocation size information somewhere else, like malloc does. And suppose you have other string functions such as string_cat, which can find this size information based on a given char *. string_cat would then be able to grow the string as necessary because it can easily look up the allocated size of the buffer based on the address of the destination string.

That idea is the basis for the simple library described here. The library consists of some functions that can determine the allocated size of a string, as well as a few functions to do things like copying, concatenating, and formatting strings. Because these functions can grow strings, we cannot rely on the regular strcpy and strcat type of functions. But otherwise, the strings managed by the library are normal C strings with \0 termination, and they can be passed around via char * as usual.

Implementation

The simplistic implementation used currently consists of an array of structures like this:

struct {
     char *s;
     size_t size;
};
The length of the string is available via strlen(), so it is not recorded in the struct.

When a dynamic string is allocated, a free element is found in the array and s and size are set. If the array needs to grow, realloc is used to extend it automatically.

Given a char * to a dynamic string, it is possible to scan the array looking for the corresponding element. This search uses the string's address as the key. When it finds a match, it knows the allocated size. The library rejects strings it has not allocated itself.
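
The array and its address-keyed search might be sketched as follows. This is not the library's source: the names find_entry, tracked_alloc, and tracked_size are hypothetical, and using a NULL s to mark a free element and doubling the array when it fills are assumptions of mine.

```c
#include <stdlib.h>

struct entry {
    char  *s;     /* address handed to the caller; NULL marks a free slot */
    size_t size;  /* bytes allocated for s */
};

static struct entry *table;   /* grown with realloc as needed */
static size_t        nslots;

/* Linear search using the string's address as the key. */
static struct entry *find_entry(const char *s)
{
    size_t i;

    if (s == NULL)
        return NULL;
    for (i = 0; i < nslots; i++)
        if (table[i].s == s)
            return &table[i];
    return NULL;
}

/* Allocate a tracked, empty string with room for `size` bytes.
   Returns NULL if memory runs out. */
char *tracked_alloc(size_t size)
{
    struct entry *e = NULL;
    size_t i;
    char *s;

    if (size == 0)
        size = 1;                         /* room for the terminator */
    for (i = 0; i < nslots; i++) {
        if (table[i].s == NULL) {         /* reuse a free element */
            e = &table[i];
            break;
        }
    }
    if (e == NULL) {                      /* no free element: grow the array */
        size_t n = nslots ? nslots * 2 : 8;
        struct entry *t = realloc(table, n * sizeof *t);

        if (t == NULL)
            return NULL;
        for (i = nslots; i < n; i++) {
            t[i].s = NULL;
            t[i].size = 0;
        }
        table = t;
        e = &table[nslots];
        nslots = n;
    }
    s = malloc(size);
    if (s == NULL)
        return NULL;
    s[0] = '\0';
    e->s = s;
    e->size = size;
    return s;
}

/* Allocated size of s, or 0 for a string the table does not know. */
size_t tracked_size(const char *s)
{
    struct entry *e = find_entry(s);

    return e ? e->size : 0;
}
```

A zero return from tracked_size is the point at which a real implementation would report the "martian" string.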

The following are some of the string management functions available in the library. The common theme is that a destination is passed to the function. This is expected to be NULL, which causes the function to allocate a new dynamic string, or the address of an existing dynamic string known to the library. The function will look up the size of the string and realloc the buffer, if necessary, before doing the operation. The function will then return the string's address. A NULL return value indicates there was not enough memory to complete the operation, and that the destination string has been freed.

The side effect of automatically freeing a string when there isn't enough memory is useful in simplifying code. See the file copy example below.
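
The value of that convention shows up in the s = f(s, ...) idiom. With plain realloc, writing p = realloc(p, n) leaks the old block on failure; a function that frees its destination before returning NULL makes the same one-line assignment safe. The following sketch illustrates only the calling convention. The name dyn_copy is hypothetical, and it assumes dst is NULL or came from malloc, so it simply calls realloc every time instead of consulting a size table as the library does.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for string_copy. Assumptions: dst is NULL or was
   obtained from malloc/realloc, and src does not point into dst. */
char *dyn_copy(char *dst, const char *src)
{
    size_t need = strlen(src) + 1;
    char *p = realloc(dst, need);   /* realloc(NULL, n) acts like malloc(n) */

    if (p == NULL) {
        free(dst);                  /* on failure, release the old string */
        return NULL;                /* so the caller has nothing to leak */
    }
    memcpy(p, src, need);           /* copies the terminating \0 as well */
    return p;
}
```

A caller can write s = dyn_copy(s, "text") and test only for NULL, with no temporary pointer and no cleanup branch.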

char *string_copy(char *dst, char *src);
Copy the string src to the dynamic string dst.

char *string_cat(char *dst, ..., NULL);
Concatenate the strings listed between dst and NULL to the dynamic string dst. Any number of strings may be listed before the NULL. For example:

s = string_cat(s, "dir", "c:\\www.cuj.com/", "file", NULL);
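
A NULL-terminated argument list like this can be walked with <stdarg.h>. The sketch below is a guess at the technique, not the library's code: dyn_cat is a hypothetical name, and it assumes dst is NULL or came from malloc, so it uses realloc directly where the library would look up the recorded size. It makes two passes, one to total the lengths and one to append.

```c
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for string_cat. Assumptions: dst is NULL or came
   from malloc/realloc, no argument points into dst, and the list ends
   with a null char pointer. */
char *dyn_cat(char *dst, ...)
{
    va_list ap;
    char *arg;
    size_t len = dst ? strlen(dst) : 0;
    size_t total = len;
    char *p;

    /* Pass 1: add up the lengths so that one realloc suffices. */
    va_start(ap, dst);
    while ((arg = va_arg(ap, char *)) != NULL)
        total += strlen(arg);
    va_end(ap);

    p = realloc(dst, total + 1);
    if (p == NULL) {
        free(dst);                  /* same convention: free on failure */
        return NULL;
    }

    /* Pass 2: append each piece after the existing text. */
    va_start(ap, dst);
    while ((arg = va_arg(ap, char *)) != NULL) {
        size_t n = strlen(arg);

        memcpy(p + len, arg, n);
        len += n;
    }
    va_end(ap);
    p[len] = '\0';
    return p;
}
```

One portability note: in a variadic call the terminator is safest written as (char *)NULL, because a bare NULL may be defined as the integer 0 and need not have the same size as a pointer.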

char *string_sprintf(char *buf, char *fmt, ...);
This is similar to sprintf, except that the dynamic string buf is used as the destination for the formatted string.
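
One plausible way to size the destination is to format twice: C99's vsnprintf, called with a zero-length buffer, returns the number of characters the output needs. The sketch below uses that approach under stated assumptions. dyn_sprintf is a hypothetical name, buf must be NULL or come from malloc, and realloc stands in for the library's size lookup. Whether the real string_sprintf measures this way is not stated in the article.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for string_sprintf. Assumptions: buf is NULL or
   came from malloc/realloc, and no %s argument points into buf. */
char *dyn_sprintf(char *buf, const char *fmt, ...)
{
    va_list ap;
    int n;
    char *p;

    /* Measure: with a size of 0, vsnprintf writes nothing and returns
       the length the formatted text would have. */
    va_start(ap, fmt);
    n = vsnprintf(NULL, 0, fmt, ap);
    va_end(ap);
    if (n < 0) {                    /* encoding or format error */
        free(buf);
        return NULL;
    }

    p = realloc(buf, (size_t)n + 1);
    if (p == NULL) {
        free(buf);                  /* free on failure, return NULL */
        return NULL;
    }

    /* Format for real; the argument list must be restarted. */
    va_start(ap, fmt);
    vsnprintf(p, (size_t)n + 1, fmt, ap);
    va_end(ap);
    return p;
}
```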

char *string_getline(char *buf, FILE *f, int *err);
This is similar to fgets, except that the \n-terminated line from f is placed in the dynamic string buf and the \n is not preserved.
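
A line reader with these properties can be sketched with getc and a buffer that doubles as it fills. Several things here are assumptions rather than facts about the library: the name dyn_getline, the meaning given to err (set to 1 on a read or memory error, 0 otherwise), the starting size of 64, and the decision to discard the old buffer and start afresh, which is needed only because this sketch has no table from which to learn the buffer's size.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for string_getline. Assumptions: buf is NULL or
   came from malloc/realloc; *err (if err is non-NULL) is set to 1 on a
   read or memory error and to 0 otherwise. */
char *dyn_getline(char *buf, FILE *f, int *err)
{
    size_t size = 0, len = 0;
    int c;

    if (err)
        *err = 0;
    free(buf);                      /* old size unknown here: start over */
    buf = NULL;

    while ((c = getc(f)) != EOF && c != '\n') {
        if (len + 1 >= size) {      /* keep one byte for the terminator */
            size_t newsize = size ? size * 2 : 64;
            char *p = realloc(buf, newsize);

            if (p == NULL) {
                free(buf);
                if (err)
                    *err = 1;
                return NULL;
            }
            buf = p;
            size = newsize;
        }
        buf[len++] = (char)c;
    }
    if (c == EOF && len == 0) {     /* nothing left: end of file or error */
        if (err && ferror(f))
            *err = 1;
        return NULL;                /* buf is already NULL at this point */
    }
    if (buf == NULL) {              /* an empty line: just "\n" */
        buf = malloc(1);
        if (buf == NULL) {
            if (err)
                *err = 1;
            return NULL;
        }
    }
    buf[len] = '\0';                /* the \n itself is not stored */
    return buf;
}
```

A final line that lacks a trailing \n is still returned; the NULL that ends the loop arrives on the call after it.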

char *string_free(char *s);
This explicitly frees the dynamic string s. If s is actually a string managed by the library, it will be freed and NULL will be returned. Otherwise s is returned. This is the only function that will not abort the application if passed a "martian" string.

If the "dynamic string" passed to a library function is not actually a string allocated and managed by the library, the library functions will print a "martian" message on stderr and abort the application.

Examples

Here is a simple-minded way to copy a file from standard input to standard output:

char *buf = NULL;

while ((buf = string_getline(buf, stdin, NULL)) != NULL)
     puts(buf);
This code takes advantage of two features of string_getline. First, because buf is NULL the first time it is passed to string_getline, a new dynamic string will automatically be allocated. Second, when the loop terminates because buf has become NULL again, the dynamic string has already been freed.

Here is a function to construct path names:

char *
mkpath(char *buf, char *dir, char *name)
{
     return string_sprintf(
          buf,
          "%s/%s",
          dir,
          name
     );
}
Because I am passing a dynamic string (buf) to mkpath and I am returning the result of string_sprintf, mkpath itself has no knowledge of dynamic strings. It can be called like this:

char *path = NULL;

if ((path = mkpath(path, "etc", "app.conf")) == NULL)
     ...error out of memory...
f = fopen(path, "r");
Before the function containing this code returns, you could call string_free(path) to release the string, or you could declare path static so that the same dynamic string is reused on every call.

Weaknesses

This obviously isn't the perfect solution to string handling in C. There are some weaknesses.

Because you are using native C strings, \0s are still string terminators and cannot be embedded in strings.

The extra bookkeeping involved in tracking the allocated size of each dynamic string slows down string manipulations somewhat. On the other hand, any dynamic string implementation will have to do the same bookkeeping, so this library might not be any worse.

The current implementation searches its array linearly, so every lookup costs time proportional to the number of live strings. An optimized implementation could use some sort of hash on the char * address or some other speedy lookup function.
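
As an illustration of the hashed alternative, here is a small chained hash table keyed on the string's address. Everything in it is hypothetical: the names reg_add, reg_size, and reg_remove, the fixed bucket count, and the particular shift-and-xor hash. It is a sketch of the general technique, not a proposal for the library's actual code.

```c
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 256            /* assumed size; must be a power of two */

struct node {
    const char  *s;             /* key: the string's address */
    size_t       size;          /* value: bytes allocated */
    struct node *next;          /* chain of entries sharing a bucket */
};

static struct node *buckets[NBUCKETS];

/* Hash the address itself. The lowest bits are shifted away because
   allocator alignment usually leaves them zero. */
static size_t hash_ptr(const char *s)
{
    uintptr_t v = (uintptr_t)s;

    return (size_t)((v >> 4) ^ (v >> 12)) & (NBUCKETS - 1);
}

/* Record a string and its size. Returns 0, or -1 if out of memory. */
int reg_add(const char *s, size_t size)
{
    struct node *n = malloc(sizeof *n);
    size_t h = hash_ptr(s);

    if (n == NULL)
        return -1;
    n->s = s;
    n->size = size;
    n->next = buckets[h];
    buckets[h] = n;
    return 0;
}

/* Allocated size of s, or 0 if s was never recorded. */
size_t reg_size(const char *s)
{
    struct node *n;

    for (n = buckets[hash_ptr(s)]; n != NULL; n = n->next)
        if (n->s == s)
            return n->size;
    return 0;
}

/* Forget a string. Returns 0, or -1 if s was not recorded. */
int reg_remove(const char *s)
{
    struct node **pp;

    for (pp = &buckets[hash_ptr(s)]; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->s == s) {
            struct node *dead = *pp;

            *pp = dead->next;
            free(dead);
            return 0;
        }
    }
    return -1;
}
```

With a reasonable spread of addresses, each lookup touches only the few entries in one bucket rather than every live string. Note that a string moved by realloc must be removed under its old address and added again under the new one.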

Where to Get the Library

Renamed versions of these string functions are available as part of The Toolbox at <www.alphazed.co.uk/software/toolbox/>. A real application that uses these dynamic string functions is available at <www.alphazed.co.uk/software/path/> and at <www.cuj.com/code>.

About The Author

Daniel Lawrence lives in London with his wife Judith and little boy Cain. He runs AlphaZed Ltd., a Unix admin and programming company, <www.alphazed.co.uk>.