October 1993/The SPLASH Class Library

Features

The SPLASH Class Library

Jim Morris

Jim Morris is currently Director of ROM Software at The 3DO Company. Despite this he has been a professional programmer for over 17 years. He has been using C++ sporadically for about 2 years. He can be reached via EMail at morris@netcom. com, via compuserve Mail as 73200,2306 or by phone at (415) 578-6765.
I have used Perl for many years and often wished that some of the data types and functions provided in Perl were available when I was writing my C programs. When I started writing C++ code, I realized that the list and string data types that I used in Perl scripts could be faithfully duplicated by a relatively simple class library. The reason for this is that C++ allows you to create user-defined types. Additionally C++ templates allow you to create a list class that can store any data type.
Some benefits of basing this class library on the existing Perl language are that the behavior and functions of the data types are already defined, so I can concentrate on the implementation. Also the functions and data types are known to be useful and effective, as they have been in use for years.
I am only going to talk about the string and list data types in this article; however, SPLASH implements associative arrays and array slices too.

A Brief Description of Perl
Perl is a public domain scripting language written by Larry Wall. It can run on most platforms, from UNIX to MS-DOS. Perl provides list, string, numeric, and associative array data types, and simplified file I/O. It also allows easy access to some very powerful regular expression (RE) functions. (See sidebar, "What is a Regular Expression?")
I call Perl a scripting language because, like UNIX csh or MS-DOS .BAT files, you don't have to compile the script, just edit and run. Perl is very useful for processing files and extracting information from them, or translating files in various ways. It is most often used by UNIX system administrators for writing scripts that run automatically at frequent intervals and do some relatively complex task, such as cleaning up log files or tracking accounting system usage. I have used Perl for varying tasks from a global search and replace script to calculating my monthly Compuserve bill from the dialing log generated by Procomm Plus.

A Brief Description of SPLASH
The SPLASH (Simple Perl-like String And List Handling) class library is a C++ implementation of the Perl list and string data types. It includes the regular-expression, substitute, and translate operations on strings. (For a function-by-function description of SPLASH, see the sidebar, "SPLASH Function Reference.")
SPLASH provides a dynamic list template class (SPList) that allows you to add and extract data from the top of a list (push and pop), and from the bottom of the list (unshift and shift). For example, a FIFO (First In, First Out) list could be implemented by:
SPList<int> mylist;
// put 1 and 2 onto the list
mylist.push(1); mylist.push(2);
y = mylist.shift(); // y gets 1
x = mylist.shift(); // x gets 2;
A list can be sorted by comparing the elements with operator< (which must be defined for the data type being sorted). The list can also be reversed, as with:
SPList<float> mylist, sortedlist;
// sorts the list, then reverses the
// the resulting list, and returns it
sortedlist= mylist.sort().reverse();
A list can be inserted anywhere within another list, or elements can be deleted from a list (with splice). Any individual element within a list can be randomly accessed using operator[]. The list is dynamic, which means it automatically shrinks and expands as necessary, without the user having to worry about memory management.
The string class (SPString) implements a dynamic string data type. The offset of a substring can be found (with index). A substring may be extracted or assigned to, effectively overwriting or deleting the substring (substr). The string may be used anywhere a const char* can be used, so it can be passed as a parameter to the Standard C str* functions. The standard comparison operators (<, >, ==, etc.) are defined, so intuitive operations can be applied to the strings. String concatenation is implemented by defining operator+ and operator+=.
Some of the most powerful functions are the regular-expression matching operations, which allow a regular expression to be applied to a string, returning a Boolean on the success or failure of the match, or even returning a list of substrings that matched (m). The substitute and replace function (s) allows you to replace a successful regular-expression match within the string with an arbitrary substring. The translate function (tr) will replace all occurrences of a specific character, or range of characters. For example, mystr.tr ("a-z", "A-Z") will convert each lowercase character in mystr with its corresponding uppercase character.
The string list class (SPStringList) is basically an SPList<SPString> class with some added functionality specific to lists of strings. It lets you grep ("globally look for a regular expression") within the list, returning a list of strings that match. You can generate a list of strings from a single string by splitting it on a given regular expression (split). The inverse operation (join) will generate a single string by concatenating a list of strings separated by a given string.
All the classes have stream operators for input and output, to allow convenient file or stream processing.

The Design Process
The trap I always fall into when writing C++ programs is trying to be too object-oriented. I get carried away trying to make my programs fit exactly into the object-oriented design paradigm. I find concepts such as data hiding, or encapsulation, easier to use than things like inheritance. The Regexp class is a good example of where encapsulation really worked in SPLASH. The Regexp class effectively wraps itself around an existing C regular expression package.
The reason this worked so well is that I used two different regex libraries during the design process. One was the standard regex library provided with most UNIX systems. The other one was Henry Spencers's public domain regex library. I ended up using the latter, because it matches Perl more closely. The interface to the Regexp class was such that when I changed the underlying C library I didn't have to change any of the implementations that used the Regexp class.
While designing SPLASH, I did not always find it intuitively obvious when to use public inheritance, private inheritance, or layering. I found Scott Meyers's book [1] to be very helpful when deciding which method to use. In the first implementation of SPLASH, I derived both SPList and SPString from a dynamic array class called MArray, which I had written a year previously for another project. MArray (Listing 1) is a very simple array class that will grow when elements are added to it. The only interface to it is the overloaded operator[] (for indexing), an add function, and a remove function.
Even though its interface and implementation are definitely not optimal for either class, it did get me going pretty quickly. Once the functionality of SPLASH was frozen and tested, I wrote two new classes to replace MArray. These two classes (SPListBase and VarString) were tailored specifically to the SPList and SPString classes, thereby improving efficiency, and proving that encapsulation really works. None of the test programs using either SPList or SPString needed changing, even though the underlying implementation of the two classes changed radically.

The Implementation
It could be argued that neither VarString nor SPListBase needed to be written, and that the functionality of these classes could just as easily have been put directly into SPString and SPList. The reason I decided to use a separate class in each case was that the required interface to these classes was very well defined, and the mechanisms therein were completely self-contained. In addition, I could optimize the low-level implementation of each without having to change the more complicated SPList and SPString classes. For instance, I might decide to re-implement SPListBase as a linked list instead of a linear array. This may be better suited to some list applications. If I were to do this, I could effectively replace SPListBase with a linked-list class, with no changes needed to SPList.
SPListBase (Listing 3) implements the mechanism for the dynamic list. It is a minimalist implementation of a linear array very similar to MArray, with one optimization. Because elements can be added to the front or back of the list, room is kept at both ends, so that data doesn't have to be shuffled up each time an element is added to the front of the list. This is accomplished using an offset (the private variable first) to indicate where the beginning of the array is. first initially points to the middle of the array.
For performance reasons the array never shrinks, although this would be trivial to implement if it was required. Also note the judicious use of const on all parameters which are never changed by the function. Other than reducing the chances of unexpected bugs, it also allows some compilers to do better optimization. I also make all functions which do not alter the state of the class (don't write to any variables in the class) const functions. This makes the intent of the functions clear to people reading the code. It also means that these functions can be called from within other const functions. The syntax for const functions can be seen in Listing 3, on the count member function.
SPListBase could be reused by any program that requires a dynamic array. This has some bearing on why I made some members private and some protected. I call any function that uses this class a client, whereas a class that inherits from this class I have called a derived class. grow is a private member because I don't want it used outside of the class. It implies a knowledge of how the list is stored, and this kind of knowledge needs to be hidden from any client or derived class. A client should be able to use this class without any knowledge of how the data is actually stored.
Hence, the interface to SPListBase is very basic. There is a way to randomly access any element within the array, a way to find out how many elements there are in the array, and two ways to add an element to the array. There is only one way to remove an element from the array, and that is using the protected member compact. I decided to make this a protected member because if I were to reuse this class as a generic dynamic array, the compact function would not normally be used. But a derived class usually has more intimate knowledge about its base class than a client would. Protected member functions are only accessible within the class and by any derived class.
There are two versions of operator[]. One is a const version and the other returns a reference to an element, which means that the element can be overwritten with new data. The const version implies that the data being indexed is constant data, and therefore must exist. So I generate an assert error if the index is out of range. The other version could be used for adding data to the array, so I silently expand the array if the index is out of range. I do this because it is the behavior of Perl lists.

Description of VarString
VarString (Listing 2) is similar to SPListBase in that it manages a variable-length array and is a minimal implementation. It has five constructors, which allow a VarString to be generated from different data types. Two of them allow a VarString to be generated from a char *, either a complete C string or a partial one. The latter is the mechanism used to implement substrings. The conversion operator (operator const char *()) allows a VarString to be converted to a const char *, so that it can be passed to any of the standard str* library functions.
As you can see, all this does is return the address of the private array stored in VarString. This could be very dangerous if the client decided to overwrite the array, so I specifically return a const char *, which will stop this from happening. Again, a client of VarString should make no assumptions about how the string is stored internally to VarString, and I am trying to use the compiler to enforce this rule.
An example of how the internal data representation may change is that if I wanted to store binary strings in VarString, the '\0' could no longer have any specific meaning, unlike in a C string where it marks the end of the string. So VarString could store a string with a separate length count in this case. Whenever the VarString needed to be converted to a C string, a '\0' could be appended to the array returned by the conversion operator.
I have also added a member function called remove which by default will delete one character from the array. But if a second parameter is given, it will delete that number of characters.

The TempString Class
A useful but simple class is TempString (Listing 7) . This is a mechanism for allocating a string for temporary usage which will automatically destroy itself when it passes out of scope. It replaces the often used code sequence:
char *astring, *tmp;
// ....
tmp= malloc(strlen(astring));
strcpy(tmp, astring);
// use tmp for something like strtok
free(tmp);
Because the compiler generates a destructor call when the TempString passes out of scope, the storage is automatically released. So it reduces the chance of a memory leak in your programs. Of course, you have to be careful to not pass the address of the TempString to something that will use it after it has been released, but even C++ can't save you from every logic mistake.

When to Inherit
The question of when to inherit and when to layer is a thorny one. The books say that inheritance implies that the relationship between the derived class and the base class is one of"IS A." When I applied this test to SPList and SPListBase, I decided that SPList "IS NOT A" SPListBase. Instead, it is "implemented in terms of" SPListBase.
"Implemented in terms of" can be achieved in two ways. One is layering, where the object you are using is declared as a variable in the class that is using it. Another way is to privately inherit from the class. Private inheritance means that only the derived class will have access to the protected and publicmembers of the base class. In other words, it is exactly as if the base class were a private member of the deriving class. So which one to use? According to Myers, layering is the preferred method. He recommends you use that method, unless you need access to a protected member in the layered class. SPListBase does indeed contain a protected member (compact), so I used private inheritance in this case.
One reason that I don't consider that SPList "IS A" S-PListBase is that I want users of SPList to see only the Perllike interface, not the raw interface to SPListBase.
SPString also fails the "IS A" test with regards to VarString. SPString is "implemented in terms of" VarString. VarString is a raw interface to a dynamic string type, and SPString packages this up with a Perl-like interface. Accordingly, SPString doesn't inherit from VarString, Instead, it defines a VarString as a private member (layering). VarString is a private data member because I want to hide the actual datastorage method from users of SPString, in case I decide to change it in the future.
A good example of an "IS A" relationship is that of SPStringList (Listing 6) and SPList. SPStringList publicly inherits from SPList (SPList<SPString> to be exact) because it "IS A" SPList. It just adds some more functionality to SPList. A client of SPStringlist has access to all the functions in SPList. In addition, SPStringList offers a few functions that are specific to a list of SPStrings.

Description of SPList
SPList (Listing 4) implements the Perl-like list data type. What the user of SPList sees is an interface similar to the various operators that can be applied to a Perl list.
SPList is privately derived from SPListBase. This means that all the member functions within SPList have access to the protected and public member functions of SPListBase, but the client of an SPList cannot use, or even see, any member functions within SPListBase.
SPList is a template class. This means that when it is used in a declaration the client must specify a type in angle brackets. The compiler then generates the code for the class, substituting the given type where appropriate. This saves me having to rewrite the class four times to support four different data types in the list. For example:
// Generate a list of Integers
SPList<int> intlist;
// Generate of list of pointers to char
SPList<char *> strlist;
The implementation of most of SPList is fairly straightforward, so I'll just mention a few of the more interesting features. operator void *(), which simply returns count(), is a trick that is used to allow the variable name to be used in an if statement as a Boolean test. The return value in this case is, of course, true only if the list is not empty — so:
 SPList<int> alist;
if (alist)
   cout << "I'm not empty" << endl;
else // this will get executed
   cout << "I'm empty" << endl;
alist.push(1);
if(alist)// this will get executed
   cout << "I'm not empty" << endl;
else
   cout << "I'm empty" << endl;
Note, in Listing 4, the two ways I have made functions in SPListBase available to users of SPList. The operator[] functions implicitly call the base class equivalent, whereas count uses another syntax which is given in the ARM (Annotated C++ Reference Manual [2]), page 244, section 11.3. However, I have found that many C++ compilers will not accept this syntax.

Description of SPString
SPString (Listing 5) implements the Perl string data type. It is more oriented towards processing data in files than manipulating strings, so it probably is not a good general purpose string data type. Because it has the conversion operator const char *(), an SPString can be used as a parameter to any of the Standard C string routines. In addition, the compare and equality operators have been defined three different ways, so a C string can be used in a comparison operation either on the right or the left of the operator, for example:
SPString mystring, yourstring;
// ....
if(mystring != "Hello") ....
if("abcd" < mystring) ....
if(mystring >= yourstring) ...
In order for the C string (a char *) to be used on the left of a binary operator, it must be defined as a friend function, so:
friend int operator<(const char *s,
   const SPString& sp)
   {return (strcmp(s, sp.pstr) < 0); }
This puts the scope of this function outside of the class it is defined within, as described in the ARM, page 248, section11.4.
As can be seen, the actual implementation just calls a regular C string compare function. This also demonstrates operator overloading, where operator< is overloaded to take an SPString and a const char *. Similarly, operator+ has been overloaded to do a concatenation of two strings. It can take an SPString and an SPString, an SPString and a const char *, an SPString and a char, and a const char * and an SPString. The last is implemented as a friend function, as I described above.
Why define operator<(const SPString& s) and operator<(const char *s) when a conversion operator from SPString to const char * is available? Why not just let the compiler generate the conversion, especially when it is going to take place anyway when it calls the strcmp function? The reason is the following compiler error message generated on the following lines of code:
SPString s1, s2;
// ...
if(s1 == s2)
// ....
error: ambiguous call: operator==
   (class SPString, class SPString)
choice of operator ==()s:
   int operator ==(const class SPString, const char *);
   int operator ==(const char *, const SPString&);
This very helpful error message explains the problem. Because we have a friend function to allow a char * to be on either side of the binary operator, the compiler is unable to decide which of the two SPStrings to convert to a const char *, so it gives this error message. I fixed the problem by disambiguating the call, by writing a specific operator for each case.

Substrings
I wanted to hide the substring class from the user, so I made it a nested class within the SPString class. This way the user has no access to the substring class. Unfortunately, not all C++ compilers support nesting of classes yet (where one class is defined within another class). And not all compilers handle the calling of member functions in the enclosing class from the nested class the same way. The Borland C++ 3.1 and cfront compilers do support nesting, so that is how I implemented it.
An alternative method, which has a similar effect, is to define the class in the normal way, but put all the constructors in the private section and declare SPString to be a friend of substring. This means that a user will not be able to instantiate this class, but SPString will be able to use it. In either method, SPString must make substring a friend, so substring can access SPString's private variable pstr.
Even though the user has no access to a substring, the SPString member function substr returns a substring. How can this work? By providing a constructor in SPString that can create an SPString from a substring, and by providing an assignment operator that allows a substring to be assigned to an SPString.
The substring class basically just remembers the offset within the SPString, the length of the substring, and the address of the first character of the substring. This is necessary in order to allow the substr function to appear on the left hand side of an expression, so you can do the following:

SPString mystr= "0123456789"; mystr.substr(3, 2)= "Ab"; // mystr becomes "012Ab56789"
The string will shrink or expand depending on whether the string being assigned is smaller or larger than the target substring. Thus, a problem to overcome is when the substrings overlap. So I check to see if the SPString I am assigning to is the same as the source string. If so, I make a copy of the source string, then copy it in. This way, I don't corrupt the source string as I copy it.
SPString::insert is a private member function because it is only used by substring. It basically inserts a given length string into the current string, expanding or contracting the string array as required.

Stream Input for SPString
The input stream operator for the SPString class is shown in Listing 8. This will allow a line of text (terminated by a newline) to be loaded into an SPString variable. The input can come from any stream type, be it an in-memory stream (strstream) or the standard input stream (cin). For example:
SPString instr;
while (cin){ // read until EOF
   cin >> instr; //read in line
// process line
   }
The tricky part is leaving the stream in a good state so that the typical syntax (above) used for reading a file will work. A line at end-of-file that is not terminated with a newline is considered a complete line, but the state the stream is left in is critical for the while (cin) test to work correctly. Specifically, the line must be read, but the stream must not be left in a fail state. The code in Listing 8 will correctly return the last line if it is terminated by end-of-file, with the stream still in a non-fail condition. But the stream will be left at end-of-file, so the next call will fail as required, and the while loop will terminate at the proper time.

Class Interdependencies
When a class A references another class B as anything but a pointer or a reference, the class B must already have been defined. (The compiler must know what the size of the class is and what member functions it contains).
This can be a problem if two classes refer to each other (a cyclic dependency). That could happen if class A contains a data member of class B and class B contains a data member of class A. The only way I know to get around this problem is to change one of the data members to a pointer. Then a forward declaration of one class is all that is needed to declare the other, which holds the pointer.
I had many cyclic dependencies in SPLASH, and I was constantly shuffling declarations and definitions of classes around to solve them. In one case, I had to change the data member to a pointer, with the accompanying syntax changes in all the implementations that used that data member.
The problem becomes even worse when templates are involved. Some compilers need to see the entire definition of a template member function before you can use it. Cyclic dependencies in this case can be very difficult to overcome. Thus, you should avoid cyclic dependencies in template classes, or use a compiler that is very intelligent in the way it handles templates. I am still trying to get SPLASH to compile with the GNU g++ and Zortech C++ compilers for exactly these reasons. Borland's C++ and the cfront compilers seem more tolerant in these cases, but now I have to rewrite some of the code to break these cyclic dependencies. If I had known of these problems ahead of time, it would have been easier to avoid them.

Class Templates
One of the things to note when using templates is that all the implementation of template member functions should go into the header file. This is so the compiler can expand a template when it is needed. As I mentioned above, the compiler must have seen the template definition before it can expand the function, much like an inline function.
Some compilers allow you to put the template definitions in a separate file and then it uses some technique to figure out what file to look at for the definition as it needs it. And some compilers seem to have difficulties finding a particular template, even though it appears to be staring them in the face. A work-around for this behavior is to specify the expansion in your program explicitly:

// Pull in the output operator for // a list of ints ostream SPList<int>::operator<< (ostream& os, int i);
This usually clears up any problems. Template handling currently varies considerably from compiler to compiler. It appears to be the least portable C++ construct I have used so far.

Summary
The SPLASH distribution contains a number of examples of how to use the various features of this class library, including associative arrays and array slices which have not been covered here, and examples of how to use the stream operators for processing text files.
I haven't been able to cover all the interesting things I discovered about C++ while I was writing this class library. I have pointed out some of the pitfalls and some useful techniques. I encourage you to get a copy of the source code and see how the various functions have been implemented. It is not the best and most efficient code, but then it wasn't designed to be; it was designed to work. I find the functions very useful for a wide range of applications, and reusability of classes is finally becoming a reality for me.

Bibliography
1. Meyers, Scott, Effective C++. Reading, MA: Addison-Wesley, 1992.
2. Ellis, Margaret A., & Bjarne Stroustrup. The Annotated C++ Reference Manual. Reading, MA: Addison-Wesley, 1992.

"How to Get SPLASH"