April 1997/Yet Another Command-Line Parser

Features

Yet Another Command-Line Parser

Panos Kougiouris

Parsing command-line arguments is one of those tedious jobs that can really benefit from some predefined structure, particularly when you have to make changes later on.

Command Lines are a Pain

Writing a new C or C++ program can be a lot of fun — after you get past parsing the command-line arguments, that is. This parsing task has little to do with the nature of the program being developed. Nevertheless, you must dedicate significant thought and coding toward retrieving those command-line arguments. To make matters worse, every decent command-line interface will print a usage message in case of a syntax error. Writing and maintaining this code is among the most boring of programming tasks.

To implement the command-line parser, developers typically cut and paste the code from earlier programs, and then modify it to fit the new program. The result is an ugly looking main program with two or more nested loops iterating through the argv array. Since the modification process is error-prone, it usually introduces minor but annoying bugs.

All the command-line utilities I have seen (dating back to UNIX sources ) enforce a simple syntax and are implemented more or less the same way. The syntax is typically describable by a simple grammar. The command line is commonly separated by white space into switches (-v for example), switches that take arguments (-l panos for instance), and just plain arguments. I will henceforth refer to these three collectively as command-line elements. Some command line elements are optional and some are required.

On the implementation side, as the command-line utility parses the command-line elements, it sets a number of global variables. When the remainder of the program executes, it decides what to do based on the contents of these variables. For instance, an application might take an argument -v that says whether or not it is to run in verbose mode. The developer will usually define a corresponding global variable, say, gVerbose. At startup this variable is set to the default value, zero; if this value is never changed, then the program runs without printing lots of messages. But if the command-line parser finds the -v option, it sets gVerbose to 1 and the program becomes "verbose."

So each command-line element requires that code be maintained in three different locations within the source code: within the parsing loop, at the global variable, and at the site of the help text. Keeping these three code chunks in sync is a maintenance nightmare. Some libraries, such as the FSF's GNU utility, include functions that facilitate code reuse; but these libraries still do not address the maintenance problem. Other libraries, such as Microsoft's MFC, repackage the arguments but address neither the reusability nor the maintenance problems.

In this article, I present a small C++ framework that addresses all the problems mentioned above. The two main classes in this framework, CApp and CArg, are abstract base classes. Every program that uses this framework must define a new class that inherits from CApp. As its name implies, CArg defines the abstract interface for command-line argument types. I provide concrete subclasses for most of the major data types, such as int, const char*, bool, etc. You can introduce more types to extend the framework.

An Example

Before I dive into implementation details, I show a simple example of how the framework is used. Let's assume that you want to rewrite telnet. You look up the syntax and find that it looks something like the following:
telnet [-c][-r] [-l user] [-n file] host [ port ]
(I have omitted some of the arguments for simplicity.)

It's obvious that you'll need six variables to remember the state of each argument. Since this is C++ it makes sense to put these variables in a class, CMyTelnet:
Class CMyTelnet {
public:
    m_do_not_read_telnetrc;
    m_rlogin_ui;
    m_login_name;
    m_file_name;
    m_host;
    m_port;
};
Of course, the above will not compile, since the variables are not properly declared as types. The next thing is to define the types of the variables. Here is where the framework comes into play. Instead of using one of the language types, use the appropriate class derived from CArg. The CMyTelnet class definition becomes:
Class CMyTelnet : public CApp {
public:
    CApp::CToggleArg m_no_read_telrc;
    CApp::CToggleArg m_rlogin_ui;
    CApp::CCstringArg m_login_name;
    CApp::CCstringArg m_file_name;
    CApp::CCstringArg m_host;
    CApp::CIntArg     m_port;

    CMyTelnet();
};
The above member declarations are prefixed with CApp:: because CArg is a nested class of CApp. I made CArg nested to minimize the number of global identifiers exposed by the framework. If all C++ compilers supported namespaces I would have used them instead. If your program has multiple source files this declaration should probably go into a header file so that the global state is accessible by all the files composing your application.

The next step is to define the constructor for the class. In many classes, the constructor is just a placeholder for initialization of the member variables. But right here is where you can set all the interesting attributes of each command line element: is it a switch or an argument, is it required or not, what is its default value, and what message should be printed in the "syntax error" message.

After defining the class, you must instantiate an object of this class somewhere. My example instantiates this object in global scope so that the states can be accessible from everywhere.

Finally — and this is very important — your program must call member function parse as the very first statement in main. This one statement parses the arguments and initializes the global state of your program. You are done! Listing 1 shows the whole program.

If you try to run the program you see that the variables are set as expected. Furthermore, attempting to run the program with the wrong argument creates the following output:
prompt> mytelnet -?
Error: Unknown option -?
Usage: [-c] [-r] [-l<string>] [-n<string>] host<string> [port<integer>]
    "-c" Disables the reading of the user's telnetrc file Default: off
    "-r" Use rlogin semantics Default: off
    "-l" Use this login name instead of the current user Default:
    "-n" Open this trace file Default: NULL
    "host" The host to connect
    "port" The port to use Default: 23
prompt>
So far so good, but the strength of this approach can be fully appreciated only later on, during the maintenance cycle. Adding a new option is really easy. You need only add the argument in the class declaration and initialize it in the constructor. If you do the one step but not the other, a compile-time error will protect you from a run-time embarrassment.

Advanced Features

The CArg constructor invoked for each command-line element takes four arguments. The three first arguments are obvious: the name, the "syntax error" string, and the default value. The fourth argument is a flags argument. It determines if the element is required or optional; if it is a switch, argument, etc.; or if it is an array of arguments.

You can also pass a couple of flags to the CApp object's constructor. The IgnoreUnknownOptions flag forces the parser to just ignore unknown command-line switches. The IgnoreUnknownArgs flag forces the parser to ignore unexpected command line arguments. Normally in both cases a syntax error would be generated.

Framework Interface

Listing 2 shows the overall structure of the framework interface. (There's not enough room here to show the full framework, but it, as well as the implementation, is available electronically. See p. 3 for details.) Class CApp contains a nested class CArg, which serves as an abstract base class for all argument types. Each CArg-derived argument object will fulfill two main purposes: 1) It will hold information pertaining to the name of the command-line argument it is meant to process, the usage message to be printed with the argument, the flags that apply, and the environmental variable to be used. 2) The argument object will store the parsed value of the command-line element retrieved. CArg supplies some member functions typically required by all argument types (e.g. is_option), as well as a few virtual functions that must be "filled in" in a derived class.

Below CArg in Listing 2, one of the derived argument types, CCharArg, is shown in its entirety. All the derived types implement the pure virtual functions that were declared in CApp. In addition, each derived type contains a protected member m_value, to store the value of the argument that the user typed in on the command line.

After defining the subclasses of CArg, the CApp class defines some helper functions and declares the static data members that will be used while parsing the command line.

Implementation of the Framework

The CApp class keeps an array (m_args) that specifies the characteristics of each type of command-line element the application is designed to handle. Each element in the array is basically a pointer to a CArg-derived object as mentioned above. This array is created and filled in when the CApp class is instantiated. How this happens is not immediately obvious, so I explain it here in some detail. Listing 1, the main source file for the sample telnet application, shows that the CMyTelnet class is derived from the CApp class. CMyTelnet extends CApp by providing some extra members, which are all objects of classes derived from CApp::CArg. Note that the application creates a CMyTelnet application object, the_app, at global scope. Thus, its constructor will execute before the application enters main.

When the_app is constructed, the compiler first calls the constructor of its base class, CApp. The CApp constructor appears as follows:
CApp::CApp(int a_flags)
{
    m_args_index = 0;
    m_args_max_index = 1;
    m_args = new CApp_CArg_ptr[1];
    m_flags = a_flags;
}
CApp::CApp creates an "array" (m_args) of one element — an unitialized pointer to a CApp::CArg. (CApp_CArg_ptr is a typedef for a CApp::CArg*.) After CApp::CApp is finished, CMyTelnet::CMyTelnet executes, and calls the constructor for each member named in the initialization list.

Each member's constructor is responsible for linking that member into the m_args array. This linking occurs when the member's base class constructor calls CApp::add (Listing 3) . CApp::add first makes sure that m_args has room to hold a pointer to the member; if not, it extends the array, then links in the member. Each member's constructor also initializes that member, either to the value specified in the initialization list, or to a default value.

As mentioned previously, the first function that must be called inside of main is CApp::parse. When CApp::parse is called, it steps through each of the elements in the application's argv array, picking up the command-line elements that were supplied to the application when it was invoked. For each command-line element, parse iterates through the array of argument specifiers (m_args) and asks each specifier if it recognizes the command-line element. If it recognizes the element, the specifier object parses the element and sets its member m_value to the value indicated by the element. (The type of m_value varies from argument specifier to argument specifier. For example, CApp::CIntArg's m_value member is an int; CApp::CCharArg's is a char.) In case of an error, each argument specifier prints its own diagnostic message.

To make the framework extensible for new argument types, I've designed CApp to know very little about particular argument types. CApp's interface to arguments is through its nested CArg class, which is an abstract class. I've abstracted all the functionality of CArg into a number of virtual function members. In addition, I've designed the virtual function interface to make adding a new argument type really simple. CArg contains only four such virtual functions, and each one is usually about one or two lines of code. The class CCharArg in Listing 2 shows that a typical subclass of CArg can be implemented in about 25-30 lines.

Limitations and Further Work

The framework as presented here has a few limitations:

The list of argument specifiers currently resides in a static data structure, so you should create only one CApp object per application.

To simplify the code I have removed the support for the fl_multiple flag. This feature would have allowed for multicharacter switches (e.g. -foo), typing in a switch and following argument with no intervening space, and concatenating multiple switches. None of these features work here, so you must type mytelnet -c -r instead of mytelnet -cr, for example [1] .

Although the current design will accommodate the syntax of most of the command line utilities, there is still syntax it does not support. For instance, the framework does not support situations in which one argument depends on another (e.g. -t only if -v is also present). This is an area where the framework could be extended.

Some libraries must be initialized from the command-line arguments. Typically these libraries remove the elements they understand from the command line. The X11 libraries are a typical example. If you use such libraries you should first initialize the libraries and then call the CApp::parse method. If you are writing this sort of library yourself, you could add support to this framework to support these situations.

Conclusion

This article demonstrates how to create an extensible framework by combining two uses of inheritance. The first use of inheritance occurs when the framework subclasses an abstract application class to handle specific command-line arguments. The second use occurs when concrete argument types are formed by subclassing an abstract base argument type. While this dual inheritance makes the underlying code somewhat complex, it makes life easier for the user of that code. As a result, it should be easy with this code to add a command-line interface to your application, or to extend that interface at a later date.

Notes

[1] Both mytelnet -n foo and mytelnet -nfoo work correctly.

Panos Kougiouris is a software development manager at Healtheon Corporation, a Silicon Valley Internet startup. His interests include distributed systems and component software. In the past he held technical positions with Oracle and Sun Microsystems. He holds an MS degree in Computer Science from the University of Illinois at Urbana-Champaign and a diploma from the University of Patra, Greece. He can be reached at panos@acm.org.