March 2002/Binary Code Reuse in a Linux Environment

Linux

Binary Code Reuse in a Linux Environment

John F. Hubbard

Traditional Unix-like filters meet C++ in these useful classes for launching and controlling processes in Linux.

Introduction

In a Linux environment, there is a significant body of well-tested, readily available code. Some of the most useful pieces, however, are lurking just out of sight, deep within the internals of compiled, executable programs. A C++ programmer new to Linux would not hesitate to make use of a standard routine such as qsort or std::cout, but that same programmer is likely to overlook the opportunity to reuse the larger works of code embodied in, for example, cpp [1]. With a tool such as cpp, sophisticated file preprocessing is easy; the catch is that cpp is a program, rather than a convenient library [2].

In fact, there are two distinct situations in which a Linux C++ programmer will find it helpful to implement “binary code reuse” of this sort. The first situation arises when you need to use an existing utility, such as the cpp program mentioned above. The second occurs when you are creating programs that, like the Linux utilities, expect to communicate via standard input and standard output. In each case, there is a need for a routine that accepts the name of an input file, the path to an external program, and the command-line arguments to be sent. This routine should read the input file, launch the external program with the supplied command-line arguments, pipe the file contents to the external program’s standard input via Unix pipes [3], and (typically) wait for the external program to finish its work before continuing on. The FeedFileToSubprogram routine shown in Listing 1 meets all of these requirements.
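The listings are not reproduced here, so the following is only a minimal sketch of the steps just described, not the code from Listing 1. The name FeedFileToSubprogramSketch, its three-string signature, and its return value are assumptions made for illustration; popen and pclose are the standard POSIX calls.

```cpp
#include <fstream>
#include <stdexcept>
#include <string>
#include <stdio.h>   // popen, pclose (POSIX), fwrite

// Hypothetical stand-in for the routine in Listing 1; the real signature
// may differ. Returns the raw status value reported by pclose().
int FeedFileToSubprogramSketch(const std::string& inputFile,
                               const std::string& programPath,
                               const std::string& arguments)
{
    std::ifstream input(inputFile.c_str(), std::ios::binary);
    if (!input)
        throw std::runtime_error("cannot open input file: " + inputFile);

    // popen takes one shell command string: program name plus arguments.
    const std::string command = programPath + " " + arguments;
    FILE* child = popen(command.c_str(), "w");   // "w": we write to its stdin
    if (child == NULL)
        throw std::runtime_error("cannot launch: " + command);

    // Copy the file to the child's standard input, one buffer at a time.
    char buffer[4096];
    while (input.read(buffer, sizeof buffer) || input.gcount() > 0)
    {
        const size_t count = static_cast<size_t>(input.gcount());
        if (fwrite(buffer, 1, count, child) != count)
            break;   // the write failed, so stop feeding the child
    }

    // pclose flushes and closes the pipe, then waits for the child to exit.
    return pclose(child);
}
```

Under these assumptions, calling FeedFileToSubprogramSketch("input_file.cpp", "cpp", "-P") would do the same job as typing cat input_file.cpp | cpp -P at the shell. Because popen hands the command string to the shell, arguments containing spaces or shell metacharacters would need quoting, which this sketch does not attempt.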

This has all been done countless times, of course. However, only in this article will you find a single routine, with a modern C++ interface, to do all of these things in one step. This article is also the only place that you can find a complete, concise (under 300 lines of code, total) set of parent and child programs, each with a modern C++ implementation, that provide an immediately accessible implementation of this form of binary code reuse. And if that’s not enough, if you read carefully, you’ll also find a standard Flex- and Bison-based parser hiding within the same 300 lines of code.

GNU’s Flex and Bison

In order to provide both an article example and a coding template to be used in the near future, I decided to create a simple calculator, using GNU’s Flex lexer generator and GNU’s Bison parser generator. Although the example might seem artificial, it is not. One of the single most powerful code reuse opportunities in the software industry is that of automatically generating lexers and parsers from high-level BNF (Backus-Naur Form) grammar specifications. Pick up any book on compilers and parsers, and you will be subjected to a lecture by the author on the value of using Lex and Yacc to generate your parser, or, if it is a really modern book, perhaps the author will even exhort you to use Flex and Bison [4]. And yet, despite all of this supposedly heavy backing, it is surprisingly time-consuming to get polished programs up and running using these tools unless you are already very familiar with them. I was amazed to discover that none of the manuals contain modern, fully working sample programs that use Flex and Bison. You have to read two or three manuals and work out the missing pieces and bug workarounds all by yourself, unless, of course, you are clever enough to use my two grammar files (one each for the lexer and parser) as the starting point for any new parsing project. I have kept the grammar files exceedingly small and simple, because I intend to use them as templates myself on an upcoming compiler project.

The ProcessDriver and SimpleCalculator Programs

Figure 1 shows the architecture and intended use of this mini-system. The SimpleCalculator program accepts floating-point numbers and performs some basic operations; it also handles nested parentheses correctly. It does all of this in 50 or 100 lines of BNF grammar, mixed in with some C++. (Unfortunately, space limitations preclude the possibility of actually discussing the BNF. There’s no help for it; you’ll just have to peek at the source code on the CUJ website.) This creates just the right sort of problem, because, lo and behold, Flex and Bison generate code that expects to perform I/O via standard input and standard output (stdin and stdout). If you left things as Flex and Bison have set them up, you would be in the embarrassing position of requiring your end users to type
cat input_file.txt | BigCompiler
to run your compiler. The ProcessDriver program provides a way to read the input file and send its contents to your new parser.

The ProcessDriver program, shown in Listing 2, acts as a driver program. The SimpleCalculator program, shown in Listing 3, acts as a subprogram. Both “driver-plus-subprogram” and “parent-child” phrases are useful ways to describe the relationship here; I use the terms interchangeably throughout both this article and the code itself. These programs are simple, but do not be deceived into thinking that they are unrealistic. The fact is that neither ProcessDriver.cpp nor SimpleCalculator.cpp needs any changes at all in order to be used as the foundation for a Flex- and Bison-based system.

Figure 2 shows an example session with the ProcessDriver program, including command-line parameters, input file contents, and program output. In upcoming sections, I will discuss selected pieces of the source code.

Using the CommandLine Class

In order to control a complex parser or compiler, the conventional approach is to supply long lists of command-line options. (GCC [5] is my favorite example, with scores of options that only a few rabid enthusiasts can remember without the aid of a hyperlinked index.) Of course, the only civilized way to manage basic command-line options is to use the CommandLine class, whose public interface is shown in Listing 4. In the simplest case, you construct a CommandLine object instance by passing it exactly what you receive from main’s argument list: argc and argv. Thereafter, any of the command-line parameters are available for your inspection via these two methods: Exists(optionName) and GetByName(optionName).
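Since Listing 4 is not reproduced here, the following is only a minimal sketch of a class with the same two query methods, not the CommandLine class itself. The class name MiniCommandLine, the assumed option syntax (-name=value or a bare -name flag), and the choice to throw a std::string for a missing option are all assumptions made for illustration.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the CommandLine class of Listing 4.
// Assumed syntax: each option is "-name=value" or a bare "-name" flag.
class MiniCommandLine
{
public:
    MiniCommandLine(int argc, const char* const argv[])
    {
        for (int i = 1; i < argc; ++i)          // argv[0] is the program name
        {
            const std::string arg(argv[i]);
            if (arg.size() < 2 || arg[0] != '-')
                continue;                       // this sketch skips non-options
            const std::string::size_type eq = arg.find('=');
            if (eq == std::string::npos)
                options_[arg.substr(1)] = "";   // flag with no value
            else
                options_[arg.substr(1, eq - 1)] = arg.substr(eq + 1);
        }
    }

    // True if the option was present on the command line.
    bool Exists(const std::string& name) const
    {
        return options_.count(name) != 0;
    }

    // Returns the option's value; throws std::string if it is absent.
    std::string GetByName(const std::string& name) const
    {
        const std::map<std::string, std::string>::const_iterator it =
            options_.find(name);
        if (it == options_.end())
            throw std::string("missing option: -" + name);
        return it->second;
    }

private:
    std::map<std::string, std::string> options_;
};
```

Storing the options in a std::map keeps both queries simple and makes the object cheap to hand around by const reference, since neither method modifies it.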

The most straightforward use of CommandLine is to use it to provide simplified ways to parameterize unit tests, as in this fragment:
int main(int argc, char* argv[])
{
    try
    {
        tools::CommandLine commands(argc, argv);
        FastSAX_Parser
          theParser(commands.GetByName("file"));
        // ...etc...
    }
    catch(const std::string& strError)
    {
        // Report the problem, show usage, and signal failure.
        std::cout << "Command-line error: "
                  << strError << std::endl;
        Usage();
        return 1;
    }
    return 0;
}
A more powerful technique is to pass around a const reference to the original CommandLine object. This allows read-only access to the original command-line parameters, available to any part of the program to which you care to hand the CommandLine reference. You can see how this is done in Listing 3, which shows the SimpleCalculator’s main module.

Out-of-Band Parameter Passing

As soon as you have more than one program, you have to deal with how to communicate between them. There are two choices: pass information from parent to child program in the command-line arguments, or simply pipe information between parent and child via Unix pipes. The cleanest, simplest choice is a judicious mix of both: you pass control information from parent to child in the child’s command-line arguments. (The argument list is actually part of a single string, passed as an argument to the popen call, which includes both the child program’s name and its argument list.) You pass data to be processed via the standard Unix pipe that is called stdin; this is the child program’s standard input. This approach allows you to avoid reinventing minor data-passing protocols that would be necessary in order to pass control information over a Unix pipe.

I call this approach “Out-of-band Parameter Passing.” As the name implies, there are two separate channels (“bands,” in telecom speak) for communications between the parent and child processes: one for control information, and one for the data to be processed. The CommandLine and ExtendedCommands classes make this easy to implement. You can see the entire mechanism in Listings 2 and 3.

Long and Short Forms: Command Filtering

For readability and self-documenting systems, long, spelled-out command-line options are indispensable. However, to keep from going mad when typing in seemingly endless streams of long options, short forms of the same options are equally valuable. Therefore, it is important to provide both: new users learn from the long forms, and once familiar with the program, they shift to the short forms (or more likely, simply use shell aliasing to hide the whole thing).

Having two forms of the same option is no big deal, until your hierarchy of programs (driver plus subprograms) begins to grow. Clearly, the driver program is in the perfect position to provide command filtering, in the following sense: the driver program can accept both long and short forms of the same option, and perhaps other combinations or aggregates, such as an -all option. However, each subprogram should only handle the long form of each option. The driver program examines all of the original options and then synthesizes a precise set of options for each of its subprograms. This fits the situation nicely; the subprograms require inhumanly precise option lists, which is fine as their only direct user is another program. The subprograms are thus spared the repetitive complexity of checking for alternate forms and may simply write:
    commands.GetByName( longFormOptionA );
rather than
    if (commands.Exists(shortNameA))
    {    optionA = commands.GetByName(shortNameA);
    }
    else if (commands.Exists(longNameA))
    {    optionA = commands.GetByName(longNameA);
    }
The ExtendedCommands class, whose implementation is shown in Listing 5, inherits from the CommandLine class. The ExtendedCommands class provides the ability to search for a command by both a long and a short name, as well as the ability to synthesize a new command-line string from an existing CommandLine object (plus a few hints).
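The filtering step performed by the driver can be sketched as follows. This is not the ExtendedCommands implementation from Listing 5; the function name SynthesizeLongForm, the OptionMap typedef, and the -name=value output syntax are assumptions made for illustration.

```cpp
#include <map>
#include <string>

typedef std::map<std::string, std::string> OptionMap;   // name -> value

// Hypothetical helper: rewrites whatever mix of short and long option
// names the user typed into a long-form-only command-line string for a
// subprogram. shortToLong maps each short name to its long name.
std::string SynthesizeLongForm(const OptionMap& given,
                               const OptionMap& shortToLong)
{
    // Pass 1: translate every short name; long names pass through as is.
    OptionMap normalized;
    for (OptionMap::const_iterator it = given.begin();
         it != given.end(); ++it)
    {
        const OptionMap::const_iterator alias = shortToLong.find(it->first);
        const std::string& longName =
            (alias == shortToLong.end()) ? it->first : alias->second;
        normalized[longName] = it->second;
    }

    // Pass 2: emit "-name=value" (or a bare "-name") for each option.
    std::string commandLine;
    for (OptionMap::const_iterator it = normalized.begin();
         it != normalized.end(); ++it)
    {
        if (!commandLine.empty())
            commandLine += ' ';
        commandLine += "-" + it->first;
        if (!it->second.empty())
            commandLine += "=" + it->second;
    }
    return commandLine;
}
```

In this sketch the options come out sorted by long name, and if the user supplies both forms of one option, whichever the map visits last wins. Values that contain spaces would also need shell quoting before the string is passed to popen; none of that is handled here.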

Emerging Conventions

As you can see from Listings 2 and 3, the parent and child programs cooperate in order to provide version reporting. This is important because a common mistake would be to update the child program (which the end user likely never even sees), yet leave the version number of the parent program unchanged. This causes you to instantly lose track of who has which version of the anonymous child program.

Linux Utility Programs

Part of the pleasure in working with Linux is the remarkable depth of pre-existing utilities. There are hundreds, if not thousands, of utilities that ship with a modern GNU/Linux distribution. Poke around in Linux long enough, and you will find a utility to do just about anything that can be done.

Because most of the utilities do just one thing, you may end up delighted at the near-infinite possible combinations available, or frustrated that you have to routinely run many programs to do just about anything useful from the shell prompt. Here is an example, complete with the legendary, apocryphal “regular expression” syntax that preprocesses a C/C++ file:
cat input_file.cpp | cpp | egrep -v "^[#\t\n]"
For those of you who are delighted, you have a lot of enjoyable projects ahead of you now that it is easy to gain precise control of each program to be launched. You now have two additional ways to run utilities such as cpp: 1) run the ProcessDriver program, or 2) programmatically call the FeedFileToSubprogram routine.

Details

Note that the FeedFileToSubprogram routine in Listing 1 calls popen with the mode argument "w", which means “write,” as in “launch the subprogram and write to the subprogram’s standard input.” You can also specify "r", which means “launch the subprogram and read from the subprogram’s standard output.” I will leave further exploration of this to the reader; popen is explained, with Stevens’ usual clarity, in [3].
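For readers who want a starting point for the "r" direction, here is a minimal sketch. The function name ReadSubprogramOutput is invented for illustration and does not appear in the article’s listings; popen, fread, and pclose are the standard calls.

```cpp
#include <stdexcept>
#include <string>
#include <stdio.h>   // popen, pclose (POSIX), fread

// Hypothetical helper: launches `command` through the shell and returns
// everything the subprogram writes to its standard output.
std::string ReadSubprogramOutput(const std::string& command)
{
    FILE* child = popen(command.c_str(), "r");   // "r": we read its stdout
    if (child == NULL)
        throw std::runtime_error("cannot launch: " + command);

    std::string output;
    char buffer[4096];
    size_t count;
    while ((count = fread(buffer, 1, sizeof buffer, child)) > 0)
        output.append(buffer, count);

    pclose(child);   // waits for the subprogram to exit
    return output;
}
```

Note that popen gives you one direction per call: a pipe opened with "r" cannot also be used to feed the subprogram’s standard input.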

The code shown in this article has been tested on Red Hat Linux 7.1 (x86), Solaris 7 (Sparc), Solaris 8 (Sparc), and Cygwin 1.3.x (running on Microsoft Windows 2000, SP1). The compiler in all cases was GNU’s GCC 2.95.3, augmented with STLport-4.5 [6]. I doubt that you’ll be able to compile the code as is on “stock” GCC; that compiler has been a bit slow to adopt the full C++ Standard library. STLport, on the other hand, is a thread-safe, portable, standards-compliant implementation of the C++ Standard library, and I’ve had excellent results with it so far.

The CommandLine class is supplied as part of the ClassCreator [7] application. As a convenience — and of course, to encourage wider use of ClassCreator — the entire ClassCreator source code is included along with this article’s source code at <www.cuj.com/code>.

Notes and References

[1] The cpp (C/C++ preprocessor program) reads text files (normally C or C++ source code, but not necessarily) and does the fussy work of parsing, including files, stripping out comments, processing macros, selecting active code, and all the rest of the wondrous behavior that you may previously have mentally lumped together with the C++ compiler itself.

[2] While it is true that all parts of a GNU/Linux distribution are required to ship with the source code for each program, most of us eventually come to the bleak realization that having a given set of source code does not necessarily mean that we’d want to use it in our current project. For a variety of reasons, the most effective way to reuse the code within a Linux utility is very often to simply run the utility and communicate with it via IPC.

[3] W. Richard Stevens. Advanced Programming in the Unix Environment (Addison Wesley Longman, 1993). Stevens has some of the best explanations of Unix pipes that you’ll find anywhere.

[4] Lex and Yacc are early implementations of automatic lexer and parser generators, respectively. These programs generate C code that generally will not compile, unmodified, as C++ code. GNU Flex and GNU Bison were designed to replace Lex and Yacc. Flex and Bison are widely acknowledged to be superior implementations; one of the most important improvements (there are many) is that Bison-generated code can be compiled as C++ code, allowing seamless integration with modern C++ programs.

[5] GCC is GNU’s C/C++ compiler, for those of you who have been off-planet for the past few years.

[6] STLport-4.5 is available at <www.stlport.org>.

[7] John F. Hubbard. “Building a Professional Software Toolkit,” C/C++ Users Journal, May 2001. I originally introduced the CommandLine class in that article, as part of a C++ code generation utility called ClassCreator. Since then, CommandLine has not changed, despite heavy use; the interface appears to be mature at this point.

John F. Hubbard spent eight years as a nuclear submarine line officer, logging thousands of hours of submerged operations before finally succumbing to the lure of civilian computer technology. He currently works as a senior software engineer at ATD Azad Technology Development Corporation, a software outsourcing company that specializes in real-time programming, embedded systems, and factory automation. Mr. Hubbard holds a BS in Electrical Engineering from Utah State University. He may be reached at hubbardjohn@earthlink.net.