Portability


Some Thoughts On Portability

Jack Purdum, Ph.D.


Jack Purdum is president of Ecosoft, Inc., and has authored several articles and books on C, including The C Programming Guide and a newly released book on Quick-C. Dr. Purdum can be reached at 8295 Indy Court, Indianapolis, IN 46214.

It seems safe to say that C remains one of the most popular development languages available for personal computers. If you ask people who use C why they chose it over the alternatives, one reason often given is, "because C is a portable language." Perhaps a better statement would be, "C can be a portable language." Much has already been written about portability — more than we can cover in one article. The purpose of this article is to discuss some portability problems that we've run up against in the past, and to discuss what we did (or should have done) to cope with those problems.

What Is Portability?

In his book Portability and the C Programming Language (Howard Sams), Rex Jaeschke defines portability as, "the degree to which a program or other software can be moved from one computer system to another."

That is, how easy is it to successfully recompile the source code in a different environment? The "different environment" can take several different forms. For example, we might move the source code to:

1. a different operating system

2. a different CPU

3. a different memory model

4. a different compiler

5. a different combination of the above

The permutations of the last item in the list are considerable. You might have the same operating system on different CPUs (e.g., UNIX on an 80486 or 68030), different operating systems on the same CPU (MS-DOS or XENIX on an 80386), different compilers on different CPUs and operating systems, and so on.

To say that C will always fulfill the portability needs of every system combination the programmer might face assumes too much. Indeed, it seems unlikely that any non-trivial program can be moved between computing environments without some modification. The real issue, then, is, "What can I do to make the port less difficult?"

It should be obvious that the more portable a program is, the less expensive it is to move the code to the new environment. The benefits accrue not only in getting the program up and running in the new environment, but also in the support of the program once the port is complete.

Where Am I?

In the past thirteen years, our firm has been involved with moving two rather large programs (each consisting of several million bytes of source code) to different environments. The ports involved different operating systems, different CPUs, and even a switch in languages during the port. When we switched programming languages, we had the freedom to redesign the program with portability in mind. (Alas, we weren't sure where the code would be ported to, only that it would likely be ported "somewhere.") With our other program, we were more or less locked into the design and had to port the program within existing code constraints.

The degree to which your program is (or will be) portable is influenced by where you are in the development cycle. Writing portable code while the program is in the design stage is one matter; modifying existing code to be portable is quite another. In the remainder of this article, we will examine some of the things we did in both situations to improve program portability.

Getting Organized

If you have the luxury of designing a program with portability in mind, you have greater flexibility in writing portable code than if you are locked into an existing design. Almost any (nontrivial) program has five major parts, as shown in Table 1.

Program initialization often involves reading a configuration file (containing information such as: drive or directory where the data are found, colors, fonts, video display information, setting or disabling interrupts, etc.), allocating dynamic memory, and a host of other tasks. Program input might involve reading the keyboard, a data file, or some other device that supplies data to the system. Processing is the manipulation of the data, while output sends the processed data to the desired output device. Finally, program termination involves any housekeeping tasks required by the program (e.g., closing open files, freeing dynamic memory, updating configuration files, and so forth).

A useful step in the design stage is a "sideways" refinement of the five basic elements listed in Table 1. For example, the Output element might be refined as shown in Table 2.

This is a simple process that helps you to identify the type of functions that will be needed to display the program output. With a little additional thought, the process can help you identify those functions that will probably require hardware-dependent (i.e., non-portable) resources. In Table 2, it is probably inevitable that the clearscreen() and cursor control functions will use non-portable code.

Having marked the non-portable functions, you should form a source code organization plan. Each of the primary program elements in Table 1 should have its own subdirectory. For example, the primary directory might be the program name or just PROJECT. The subdirectories might be INIT, INPUT, PROCESS, OUTPUT, and TERMINAL. In some cases, it may be desirable to have subdirectories below each of these directories. For example, under the OUTPUT subdirectory, we might have PORT and NOTPORT. This helps to keep the non-portable source code separate from the code that can be ported easily.

Table 3 shows how a typical project might be laid out. (Only the OUTPUT directory shows the third layer of disk organization.)

In the PROJECT directory, all five elements are brought together to form the completed application. Each of the five elements will have test code that is used to exercise that particular element of the program. If done properly, much of the code in these subdirectories will likely end up in their own library (LIB) files.

An alternative is to simply place all of the non-portable source code into a single "non-portable" subdirectory. The important thing is for you to identify which elements are likely to be non-portable. Once the functions are identified and separated as being non-portable, you can look for possible alternatives that might allow you to recode them in a more portable fashion. At the very least, knowing which functions are non-portable will make you more aware of the types of resources you will need in the new environment.

Using a disk layout similar to that shown in Table 3 assumes that you have a means by which you can quickly locate any given function definition. That is, you will need a utility program that can read all source files in a subdirectory and tell you the file and line number where each function definition appears. An example of such a utility can be found in the February 1989 issue of Computer Language or in the C Programmer's Toolkit (Que Corporation). Something as simple as having an organized plan for the source code and being able to locate a given function quickly can significantly reduce development time.

While you are in the process of organizing things, you should funnel all I/O through as few functions as possible. For example, in our statistics package, all screen output goes through a single function. Although we didn't anticipate it, this is going to prove to be a very smart move when (if?) we move the program to Windows. The reason is that printf() is not usable in the Windows (or almost any other GUI) environment. Had we used printf() throughout the source code, it would take days of editing just to make this one change.

Know Your Resources

If you know the environment that you are moving to, it will pay to study the programming tools that will be used in both environments. For example, when we worked in the CP/M environment, the development tools we used supported identifiers of up to eight characters. When we switched compilers, the compiler still recognized the first eight characters as significant, but the linker now only recognized the first six as being significant. We made the mistake of assuming that both the compiler and linker recognized the same number of significant characters. This little erroneous assumption cost us almost two weeks of work reducing the variable names to the shorter length.

You should also exercise care in making assumptions about other resources and features available to you in the two environments. For example, if the existing code was written with a K&R compiler and the target machine supports an ANSI compiler, certain resources and features may be missing. The K&R compiler will support low-level (unbuffered) file I/O while the ANSI compiler is only required to supply high-level (buffered) functions as part of the standard library. Although most present compilers support both types of file I/O, it might prove costly to assume that they always will.

Another feature that may be missing (even on UNIX) is function prototyping; some compilers simply do not support it. If you find yourself faced with this limitation, you will need to create a header file that lists the function declarations with prototypes (using, for example, #ifdef ANSI) and without (#else) prototypes. The same method can be used to toggle K&R or ANSI coding style for function arguments in a function definition. Listing 1 shows an example, with the corresponding header file in Listing 2.

Note that both listings use the K&R style of #ifdef rather than the ANSI #if defined preprocessor directive. While the style is not particularly pleasing to the eye, it does let you have the advantages of function prototyping in the ANSI environment without modifying the source code. And, after all, the goal is to support one body of source code for all environments.

Compiler differences between environments will also affect other coding elements. For example, things like structure passing and assignment, certain keywords (e.g., const, enum, far, near, void, volatile), and initialization of auto arrays can be affected. Even the size of identical data structures may vary because one machine may require byte alignment and the other doesn't.

Certain coding techniques should always be used, even if you don't plan to port the code to another environment. Hopefully, you already use a number of these coding techniques while others might be new to you.

Magic Numbers

Every C programmer is familiar with the use of #define to avoid using magic numbers in a program. For example, define a macro:

#define TABLESIZE 50
and then use it as in:

for (i = 0; i < TABLESIZE; i++) {
   printf("\n%d", number[i]);
}
This is a common coding practice. The advantage is that changes to the number[] array are easily accounted for by a single change to the symbolic constant TABLESIZE regardless of how many times TABLESIZE appears in the source code.

While this approach reduces the amount of editing required in the source file, a better solution is possible. For example:

int tablesize;
/*  main() plus some other code */
tablesize = sizeof(number) / sizeof(number[0]);
for (i = 0; i < tablesize; i++) {
   printf("\n%d", number[i]);
}
There are several advantages to using this construct instead of the #define. First, any changes to the number of elements in the number[] array no longer require further editing of the source file. (Be honest. Have you ever miscounted the number of elements in an array?) The program automatically adjusts to the new size of the number[] array. Second, because tablesize is a variable, it may be more useful with a source code debugger. Symbolic constants are often lost to the debugger.

Another portability problem arises with macros used as bit masks. While the macro

#define INTEGER_MASK 0x7fff
works fine on one system, it may fail miserably on another. One reason is that the size of an integer may vary between machines. A second is the byte ordering of the integer in memory. (Is the high or low byte stored first?) Related problems may occur with any bitwise operator or manipulation of a bit-field data item. I'm not sure there is a totally portable means of coping with such problems. You should, however, document such items clearly in the source code.

Improper Use Of #define

In some cases, bad coding practices only show up after a port. One potential problem is using a #define when a typedef is more appropriate. We know that making the source code more readable is always a good idea. With that in mind, we try something like:

#define INTEGER_POINTER int  *
/* ....some code */
INTEGER_POINTER table;
After the preprocessor pass, the definition for table becomes:

int *table;
and the code works fine. However, in the process of making the port, you find you need a second pointer in the new environment and you change the definition statement to:

INTEGER_POINTER table, delta;
However, because a macro is a simple textual substitution, the statement actually becomes:

int *table, delta;
The variable table is defined as a pointer, but delta remains a straight integer. The proper solution is a typedef:

typedef int *INTEGER_POINTER;
Now the code works the way you intended it to work, even with multiple identifiers.

String Constants

Some time ago we were asked if we wanted to have our statistics package translated into a foreign language. While the benefits of the translation probably would have been substantial, it required giving out the program source code. Had we thought about foreign translations during the design stage, we would have altered the way we handled string constants in the program.

We should have written all of the string constants to be defined as an array of pointers to char. For example:

char *message[] = {
   "Select variable name:",     /*  Message # 0 */
   "All or Subset (A, S):",     /*  1 */
   "Printer or File (P, F):"    /*  2 */
   /* More in the list */
   "Out of memory"              /*  N */
};
First, this approach uses memory more efficiently because you can have multiple occurrences of a constant without using additional memory. (If you plan to port to Windows, this method is consistent with placing string constants in a resource file.) Second, you can use the "tablesize" approach discussed earlier to determine the size of the string table without using a macro. Third, anyone who wanted to translate the program need have only a copy of the string table, not the code itself. Finally, if a constant does need to be changed, a single edit is all that is necessary to change every instance of the constant throughout the program.

As you may know, many foreign countries use different formats for dates, time, currency, and similar information. Although our packages don't have these particular formatting requirements, they may well apply to your application. If that is the case, you should examine the locale.h header file to see if any of the symbolic constants defined there can be used to advantage in your code. It may help make the port a bit easier.

Program Input

Another fortuitous design decision we made was to use a single function to get all input from the keyboard. We felt we had to do this because we needed not only to read ASCII keystrokes but also to detect the simultaneous pressing of the function and shift keys. Porting the code to an environment like Windows won't be easy, but at least all of the keyboard input is isolated to one function.

While we're on the subject, it's been our experience that scanf() is not a good function to use for data input. Not only is scanf() a huge function that is difficult to use properly, it also makes it hard for your program to sense input errors. Further, functions that rely on a terminating newline character don't translate well in some environments. In Windows, for example, pressing the Enter key is the same as clicking on the OK button.

Another problem is that the behavior of scanf() may vary among compilers. For example, ANSI states that the e and g conversion characters are not case sensitive, while System V doesn't specify how these conversion characters are viewed. On the other hand, ANSI specifies that the l (for "long") modifier in scanf() is case sensitive. For example, "%lf" differs from "%Lf". The latter form is used to get a long double. In some environments, however, a long double won't even be available in scanf() because that data type is not supported (e.g., System V). Little details like these need to be investigated if you still wish to use scanf().

A related problem arises when testing for the end of a user's input from the keyboard, regardless of the function used to capture the input. Some compilers use the newline character ('\n') to terminate input while others use the carriage return character ('\r'). For example, the statement

if (buff[0] == '\n') {
   /* The user didn't enter anything */
}
is often used to see if the user entered anything from the keyboard. The code will work fine with one compiler but fail with a different compiler. If your code tests for either of these character constants to sense end of input, you may want to #define a symbolic constant rather than test for a specific character. That is, if your compiler uses the carriage return to terminate user input from the keyboard, the code would be more portable if you write

#define ENDOFINPUT '\r'
if (buff[0] == ENDOFINPUT) {
   /* The user didn't
      enter anything */
}
If data input is coming from a disk file, keep in mind that data sizes may vary among environments. For example,

#define RECORD 50
fread(buf, RECORD, 1, fpin);
may not work properly. First, the data types themselves may not be the same size. (Does an int require two or four bytes of storage?) Second, the system may require the data items to be aligned in some special way. This may mean that the data must be padded with extra bytes to fulfill the alignment requirements. The moral here is: Don't use a symbolic constant when the sizeof operator can accomplish the same task.

Numbers

We have already mentioned several instances where a difference in the size of a data item can have an impact on the program. Most of these considerations centered on the storage requirements (such as the size of an int). However, the storage requirement for a data item is not the only way such differences show up in a program. Clearly, a difference in storage requirements implies that the numbers are capable of different numeric ranges. Indeed, even data items with the same storage requirements can have differing ranges of values. For example, is the default for a char a signed or unsigned quantity?

If you are moving to an ANSI compliant compiler, the limits.h header file should be helpful in answering such questions. If your present compiler does not have the limits.h header file, you might want to consider writing your own. Most compilers supply enough information about the data types that creating your own limits.h header file is not very difficult. (If you haven't done so already, you should examine the symbolic constants defined in this header file.)

Another header file that may prove useful is float.h. It defines almost everything you need to know about floating point variables (e.g., number of digits of precision, floating point exceptions, exponent limits, etc.). One potential problem area for programs that use floating point variables is the epsilon factor for a floating point number. (This factor is called DBL_EPSILON for type double.) The epsilon factor is the smallest value that can be added to 1.0 and satisfy the condition:

1.0 + DBL_EPSILON != 1.0
The value of epsilon can vary widely among compilers. Such wide variation can produce bugs that are very difficult to track down. Most C programmers know that testing a floating point variable against 0.0 can be dangerous because of the epsilon factor. You should check the epsilon factor for both compilers to see if they are similar in magnitude. If they are not, you may have to incorporate the DBL_EPSILON constant into your code.

Odds And Ends

Without thinking about it very much, we developed the habit of documenting our code, often using nested comments. In other cases, we often leave test code in the source file and simply surround it with comment characters. However, because we tend to comment the test code as well, we end up with nested comments.

Some compilers do not allow nested comments. It is a real pain to "uncomment" nested comments. We have since moved away from commenting out test code and now use preprocessor directives to conditionally include the test code in the program. For example, we used to comment out the code in the following manner:

/*
/*  Inspect the values for
   table[] */
for (i = 0; i < MAXSIZE; i++) {
   printf("table[%d] = %g", i, table[i]);
}
*/
This approach results in a nested comment. Still, we feel that the comments are often useful and didn't want to simply throw them away. Now, we would write the same debug code as:

#ifdef DEBUG
/* Inspect the values for
   table[] */
for (i = 0; i < MAXSIZE; i++) {
   printf("table[%d] = %g", i, table[i]);
}
#endif
If we need to turn on the test code, we can simply insert a #define DEBUG and recompile the program to activate the debug code.

Finally, I'd like to present a brief laundry list of don'ts that can cause problems in the middle of a code port.

Don't use lowercase l as a modifier to a long constant; it looks too much like the digit 1. Use the uppercase L, as in

a = 20L;
Don't use a long double if you don't really need to. Some compilers don't support it yet, plus it can add significantly to the data segment requirements. While you're at it, check to see if float arithmetic is supported. If your application can use the lower precision, it might prove useful.

Don't assume a pointer is a fixed length, especially in a mixed-model environment. Sometimes all data pointers are only two bytes, but function pointers are four bytes.

Don't assume wildcard characters are universal across environments. A question mark and asterisk may mean nothing on a different system.

Don't assume a filename is limited to a certain length.

Don't assume a NULL pointer means all bits are set to 0; it is implementation defined. All you can safely assume is that the test for a NULL pointer behaves in the normal way (even though the bits may be nonzero).

Don't use environment variables if you can avoid them. If you can't live without them, make sure you mark those functions that use them.

Don't create your own symbolic constants if ANSI already provides for them (e.g., DBL_EPSILON, SEEK_CUR, etc.).

As a final rule, if you follow most of the suggestions presented here, do take the estimated time required for the port, double it, and try to live within that time frame. C is not the perfect portable language, but it's way out in front of whatever is in second place.