Dr. Maher is the founder and head of Consultix, which specializes in UNIX and C training. He can be contacted at P.O. Box 70563, Seattle, WA 98107-0563, or via UUCP {uunet, uw-beaver}!timji!tim, CompuServe 72460,3050, Internet tim@timji.celestial.com, or ATTMail as tmaher.
UNIX tool developers have been slow to provide "beautifiers" for C++. Consequently, some programmers have resorted to posting C++ reworkings of the BSD indent utility in the Usenet news, violating federal copyright law in the process.
My alternative approach to writing a beautifier was to apply the UNIX "filter" model to the problem. As Figure 1 illustrates, this entailed using a preprocessor to disguise C++ as C, using standard C tools to effect beautification, and then using a post-processor to convert the disguised C++ back to its original form. In this article, I'll describe the programs and the testing procedure, and show sample inputs and outputs.
Listing One (page 27) shows the Bourne-shell driver program, which is linked to the names c++cb and c++indent. It is used like cb or indent, according to its invocation name. The case construct starting on line 6 determines which name was used. If it was c++cb, the appropriate argument parsing is performed in lines 8-14. Line 16 composes the command that will eventually beautify the preprocessed text and place it into the file named in $BEAUT.
If the program is invoked as c++indent, arguments are parsed (lines 18-23) according to the invocation format shown on the 4.2 BSD man page for indent: input file, optional output file, and other processing options. Line 24 composes the appropriate beautification command for this beautifier. Note that most keywords that occur in C++ but not in C are listed in connection with typedef (-T) options, to inform indent of words requiring special treatment. The keyword operator is purposely omitted because it appears only as a function-name prefix, and therefore has no special formatting implications.
Line 29 uses a Bourne-shell "special substitution" to assign a default "symbol" to the environment variable CppSYM, if it is not inherited with a non-null value from the invoking shell. The code in lines 30 and 31 causes an error message and exit if the text to be beautified happens to already contain this symbol, which is reserved for selective introduction by the preprocessor, c++encode. If the program exits here, you should set the environment variable CppSYM to some other valid C identifier at the invoking shell level, and try again.
The preprocessor, c++encode, is called on line 32, followed by the invocation of the beautification command composed previously. The eval prefix is needed to allow the shell to properly recognize metacharacters that appear after substitution for the variable BEAUTIFY has been performed. The disguised code of the original C++ program is then reconstituted in lines 35-36 by a sed command that reverses the effects of the preprocessor. It does this by removing the C-comment "wrappers" that encapsulate inline comments and reconverting the various occurrences of CppSYM to the colon expressions they represent.
Listing Two (page 27) shows the preprocessor program, c++encode.c, prototyped in nawk but converted to C for a 40-fold speed improvement. This program does some elementary parsing of the C++ program, and then selective encoding of C++ sequences that were empirically determined to generate syntax errors after C beautification. The parsing step is required to allow discrimination between the C++ sequences subject to beautification (which need C-language encoding) and literal sequences that occur in quoted strings and comments. (Early experience showed that syntax errors could be introduced during beautification if C++ sequences in quotes or comments were encoded.)
The for loop starting in line 38 examines each character in the current input line according to the current mode (that is, inside a comment, inside a quoted string), and the code in lines 42-63 detects the onset and termination of these special modes. In some cases, special output sequences are generated to disguise the input. For instance, if the onset of an inline comment is detected (line 51), the program generates the initiating sequence for a C comment (line 53). In line 68, strings generated in this manner are sent to the output; then continue is executed to initiate the processing of the next input character.
Processing of unquoted and uncommented text is undertaken in lines 70-82. The C++ scope qualifier (::) is replaced in lines 71-72 by the symbol set in the invoking shell script, and the colon used to express derivations is replaced by a slightly different string based on that symbol (lines 73-81), so that the post-processor can differentiate between these cases. In lines 84-85, any generated text is sent to the output, or else the literal input text is sent out. The C-comment terminator for an encapsulated inline comment is provided in lines 87-88. The terminating sequence is preceded by a hyphen to flag it as eligible for post-processing reconversion to the inline form. At the end of the input file, the program will issue an error message (lines 91-95) if any special mode (single-quotes, for instance) is still in effect.
Example 1(a) shows a sample fragment of (nonsense) C++ code, in dire need of beautification; Example 1(b) shows what can happen if the C-language beautifier cb is directly used to reformat C++ code. What was formerly an inline comment in Example 1(a), line 5, is now three lines of syntactic nonsense (lines 7-9); what was a pair of double-colons is now a pair of single-colons on different lines (14-15), and all other inline comments (lines 19, 22) are ruined by the imposition of a space between their slashes. The havoc wreaked on the last line, an inline comment containing special characters, adds insult to injury.
(a)
$ cat sample.cc
1 class C : virtual public F { public:
2 C(int x):F(x, 100u) { } // note; arg 2: unsigned
3 C func(double* = 0): /* defined below */
4 };
5 C C::func(double*x) { if (x) // commentary
6 { } // inline comment with funny stuff- // : ::
7 }
(b)
$ cb sample.cc
1 class C :
2 virtual public F {
3 public:
4 C(int x):
5 F(x, 100u) {
6 }
7 // note;
8 arg 2:
9 unsigned
10 C func(double * = 0); /* defined below */
11 };
12
13
14 C C:
15 :
16 func(double*x)
17 {
18 if (x)
19 / / commentary
20 {
21 }
22 / / inline comment with funny stuff - / / :
23 :
24 :
25 }
26
27
(c)
$ c++encode sample.cc
1 class C__42 virtual public F { public:
2 C(int x):F(x, 100u) { } /*- note; arg 2: unsigned -*/
3 C func(double* = 0): /* defined below */
4 }:
5 C C 42func(double*x) { if (x) /*- commentary -*/
6 { } /*- inline comment with funny stuff- // : :: -*/
7 }
(d)
$ c++cb sample.cc
1 class C : virtual public F {
2 public:
3 C(int x):
4 F(x, 100u) {
5 } // note: arg 2: unsigned
6 C func(double* = 0); /* defined below */
7 }:
8 C C::func(double*x) {
9 if (x) // commentary
10 {
11 } // inline comment with funny stuff- // : ::
12 }
(e)
$ c++cb -s sample.cc
1 class C : virtual public F {
2 public:
3 C(int x):
4 F(x, 100u) {
5 } // note; arg 2: unsigned
6 C func(double * = 0); /* defined below */
7 };
8
9
10 C C::func(double*x)
11 {
12 if (x) /*- commentary -*/ {
13 } // inline comment with funny stuff- // : ::
14 }
15
16
(f)
$ c++indent sample.cc; cat sample.cc
1 class C : virtual public F {
2 public:
3 C(int x):F(x, 100 u) {
4 } // note; arg 2: unsigned
5 C func(double *= 0); /* defined below */
6 }:
7 C
8 C::func(double *x)
9 {
10 if (x) { // commentary
11 } // inline comment with funny stuff- // : ::
12 }
Example 1(c) shows that c++encode replaced C: virtual by C__42_virtual (line 1), thereby disguising a uniquely C++ sequence as a valid C identifier. In like fashion, the sequence C::func was disguised as C_42func (line 5). Note that the colon in line 2 was not replaced because it was not surrounded by spaces. (To give you control over the results of beautification, I decided that stand-alone colons would be encoded, rendering them invisible to the beautifier, whereas embedded colons would be subject to beautifier manipulations.) The only other change is that all inline comments were converted to C comments. The hyphen following the C-comment initiator (/*) warns indent not to monkey with the enclosed text; the hyphen preceding the comment terminator tells the post-processor to consider reconverting it to inline form.
Example 1(d) shows the effect of running c++cb on the code sample of Example 1(a), which preprocesses using c++encode, beautifies using cb, and then post-processes to reinstate the disguised C++ code using sed (lines 35-36, Listing One). The result is an attractive and syntactically correct C++ program. (If the line-split introduced after the colon of line 3 is not desired, you can put spaces around those colons to prevent splitting, as in line 1.)
Example 1(e) shows the results of running c++cb-s on this code. They are significantly different from those shown in Example l(d) only in line 12. The beautifier chose to place program code (a "{" symbol) behind the C comment on this line, which had been converted from an inline comment by the preprocessor. Because reconverting such a comment to the inline form would effectively comment out the following program code, the post-processing stage (lines 35-36, Listing One) uses the simple strategy of only reconverting inlined comments that end at the line's end. Example 1(f) shows the results of running c++indent on this code. They are cosmetically quite different from the results produced by c++cb. Unfortunately, there are also material differences--syntax errors were introduced on lines 3 and 5 (discussed presently).
I tested both c++cb and c++indent under SunOS 4.1.1b using the NIH Class Library--367 files containing over 55,000 lines of code. (c++cb was also tested under System V R3.2, which lacks indent.) The c++cb program was run both with and without the strict processing option (-s), and c++indent was run without any added options.
Although most of the files were successfully beautified without trouble, some problems did arise. Specifically, in one case c++cb without options separated the components of the C++ insertion operator ("<<"). Manual patching was required to correct the problem. For its part, indent (invoked by c++indent) always inserted a space between the components of new-style constants; that is, 100u became 100 u, as in Example 1(f), line 3.
Another problem was that in one case a pointer operator was erroneously attached to a following equal sign, as in Example 1(f) line 5. This problem can easily be avoided through routine use of dummy variable names--changing double* = 0 to double *x = O. Manual patching was required to permit compilation.
For the 55,000 lines comprising the NIH Class Library, the manual post-beautification cleanup process amounted to two lines for c++indent, and one line for c++cb. Thus, although the technique is not perfect, good results can be achieved with typical code, and the mistakes that do occur are easily identified and corrected.
Some final notes on the use of c++indent are in order. First of all, it is no accident that beautification via indent produced both the most attractive and the most troublesome results. It is precisely because indent takes so many liberties in reformatting the (disguised) code for maximum visual appeal that it sometimes introduces errors. Nonetheless, if my experience with the NIH Class Library is representative, and I think it is, problems with c++indent are likely to be few and far between.
Secondly, certain options available in some versions of indent must be avoided. In particular, the -troff option should not be used, because it is fundamentally incompatible with the decoding technique; as an alternative, beautify first, and run vgrind afterwards. The -st option ("standard output") must also be avoided, because it is incompatible with the design of c++indent.
Finally, because no attempt was made to test all of the myriad processing options of indent, c++indent users should be wary of unexpected results when straying from the defaults.
_A C++ BEAUTIFIER_ by Tim Maher[LISTING 1]
1 :
2 # @(#) c++cb,c++indent - driver program for C++ beautification
3 # Tim Maher, CONSULTIX, 11/9/91. (206) 781-UNIX
4 #
5 ENCODED=/tmp/c++encode_$$ BEAUT=/tmp/beaut_$$ # temp files
6 case "$0" in # use cb or indent, depending on invocation name
7 *c++cb) OUT="" # for cb, all arguments are optional
8 while test $# -gt 0 # separate options from filenames
9 do case "$1" in # "l" option takes argument
10 -[!l]) OPTS="$OPTS $1"; shift ;;
11 -l) OPTS="$OPTS $1 $2"; shift 2 ;;
12 *) break # end of options
13 esac
14 done # if no filename, copy stdin, and set filename arg
15 test -z "$*" && cat > /tmp/c++cb_$$ && set /tmp/c++cb_$$
16 BEAUTIFY="cb $OPTS $ENCODED > $BEAUT"; INPUT=$* ;;
17 *c++indent) # not a filter; needs filename arg
18 INPUT=${1:?"Usage: $0 infile [outfile] [flags]"}; shift
19 case "$1" in # next arg would be output filename or flag
20 "") ;; # no second arg is okay too
21 [!-]*) OUT=$1; shift # set output filename
22 esac
23 OPTS="$*"; : ${OUT:=$INPUT} # if no outfile, use input name
24 BEAUTIFY="indent $ENCODED $BEAUT $OPTS -Tasm -Tbool -Tcatch
25 -Tclass -Tconst -Tdelete -Tdo -Tfriend -Tinline -Tnew
26 -Tprivate -Tprotected -Tpublic -Tsigned -Ttemplate -Tthrow
27 -Ttry -Tvirtual -Tvolatile" # -Toperator omitted
28 esac
29 : ${CppSYM:=_42}; export CppSYM; # set up disguising string
30 name=`grep -l "$CppSYM" $INPUT` && # exit if symbol in input
31 { echo Error- \"$CppSYM\" appears in $name >&2; exit 100; }
32 c++encode $INPUT > $ENCODED || exit 100
33 if eval $BEAUTIFY # beautify, leaving output in $BEAUT
34 then trap "" 2 3 15; # reconstruct C++ after beautification
35 sed -e 's+/\*-\(.*\) -\*/$+//\1+' -e "s/_${CppSYM}_/ : /g" \
36 -e "s/$CppSYM/::/g" < $BEAUT > $OUT
37 else echo "$0: error code $? from $BEAUTIFY" >&2; exit 200
38 fi
39 rm -f /tmp/*_$$; # clean-up temp files
[LISTING 2]
1 /* @(#) c++encode.c: Tim Maher, 11/9/91, tim@Timji.Celestial.Com
2 C++ syntax disguiser, allowing beautification via cb, indent */
3 #include <stdio.h>
4 #include <string.h> /* Use this line for AT&T systems */
5 /* #include <strings.h> /* Use this line for BSD systems */
6 char *getenv();
7 #define MAX 512 /* Maximum length for C++ input line */
8 #define F 0 /* FALSE */
9 #define T 1 /* TRUE */
10
11 main (argc, argv) int argc; char *argv[]; {
12 int filenum, linenum, cnum, modes, exit();
13 int sq, dq, cc, ic, ch, nx, max, knt;
14 char line[MAX], out[10], *sym;
15 FILE *handle;
16
17 if (argc < 2) {
18 fprintf(stderr, "Usage: %s file1 [file2 . . .]\n",
19 argv[0]); exit(1);
20 }
21 /* use default "sym" string if none supplied */
22 if ((sym=getenv("CppSYM")) == (char *)0) sym="_42";
23 for (filenum = 1; filenum < argc; filenum++) {
24 handle = (filenum == 1 ? fopen(argv[filenum], "r") :
25 freopen(argv[filenum], "r", handle));
26 if (handle == (FILE *)0) {
27 fprintf(stderr, "%s: fopen() error\n",argv[0]);
28 exit (50);
29 }
30 linenum = 0; sq = dq = cc = ic = F;
31 while (fgets(line, MAX, handle) != (char *) 0) {
32 max=strlen(line)-2; /* ignore NL; 0-based index */
33 if (max == MAX - 3) {
34 fprintf(stderr,"%s: increase MAX\n", argv[0]);
35 exit(100);
36 }
37 linenum++;
38 for (cnum = 0; cnum <= max; cnum++) {
39 ch = line[cnum];
40 sprintf(out,"%c",ch); /* default out = ch */
41 nx = (cnum == max) ? '\0' : line[cnum+1];
42 if (cc) { /* in C comment mode */
43 if ('*' == ch && nx == '/') {
44 /* C comment ends */
45 cc = F; strcpy(out, "*/"); cnum++;
46 }
47 } else if ('/' == ch && nx == '*' && !(sq||dq)) {
48 /* C comment starts */
49 cc = T; strcpy(out, "/*"); cnum++;
50 } else if (ic) ; /* in inline comment */
51 else if ('/' == ch && nx == '/' && !sq && !dq) {
52 /* inline comment starts */
53 ic = T; sprintf(out, "/*-"); cnum++;
54 } else if ('\\' == ch) { /* quote by BS */
55 if (cnum != max) { /* next char is quoted */
56 sprintf(out, "%c%c", ch, nx); cnum++;
57 }
58 } else if (dq) { /* in double-quotes */
59 if ('"' == ch) dq = F; /* DQ string ends */
60 } else if (sq) { /* in single-quotes */
61 if (ch == '\'') sq = F; /* SQ string ends */
62 } else if ('\'' == ch) sq = T; /* SQ starts */
63 else if ('"' == ch) dq = T; /* DQ starts */
64 else strcpy(out,""); /* no output created */
65 /* OUTPUT OF QUOTED AND COMMENTED TEXT */
66 if (strcmp(out,"") != 0) { /* non-null string */
67 /* print and then process next char */
68 printf("%s", out); continue;
69 } /* PROCESS UNQUOTED AND UNCOMMENTED TEXT */
70 /* process scope qualifier; :: -> sym */
71 if (ch == ':' && nx == ':') {
72 strcpy(out, sym); cnum++;
73 } else if (ch == ' ' && nx == ':') {
74 /* handle derivation; SP:SP OR SP:NL */
75 if (cnum == max) { /* line ends with : */
76 sprintf(out, "%s%s%s", "_", sym, "_");
77 cnum += 1; /* SP:NL -> _sym_NL */
78 } else if (line[cnum+2] == ' ') {
79 sprintf(out, "%s%s%s", "_", sym, "_");
80 cnum += 2; /* SP:SP -> _sym_ */
81 }
82 } /* FOLLOWING SECTION DOES ALL NORMAL OUTPUT */
83 /* print prepared string, or literal input ch */
84 if (strcmp(out, "") != 0) printf("%s", out);
85 else printf("%c", ch);
86 } /* END OF FOR-CNUM LOOP; FINISHED WITH LINE */
87 if (ic) { /* inlines end with EOL, so turn off */
88 ic = F; printf(" -*/\n");
89 } else printf("\n"); /* NL to end this line */
90 } /* END OF WHILE LOOP */
91 if (sq || dq || ic || cc) {
92 fprintf(stderr,"%s: %s%s; sq=%d dq=%d ic=%d cc=%d\n",
93 argv[0], "ERROR- altered mode at EOF for file ",
94 argv[filenum], sq,dq,ic,cc); exit (200);
95 }
96 } /* END OF FOR FILENUM LOOP */
97 exit (0);
98 }
Example 1:
(a)
$ cat sample.cc
1 class C : virtual public F { public:
2 C(int x):F(x, 100u) { } // note; arg 2: unsigned
3 C func(double* = 0); /* defined below */
4 };
5 C C::func(double*x) { if (x) // commentary
6 { } // inline comment with funny stuff- // : ::
7 }
(b)
$ cb sample.cc
1 class C :
2 virtual public F {
3 public:
4 C(int x):
5 F(x, 100u) {
6 }
7 / / note;
8 arg 2:
9 unsigned
10 C func(double * = 0); /* defined below */
11 };
12
13
14 C C:
15 :
16 func(double*x)
17 {
18 if (x)
19 / / commentary
20 {
21 }
22 / / inline comment with funny stuff - / / :
23 :
24 :
25 }
26
27
(c)
$ c++encode sample.cc
1 class C__42_virtual public F { public:
2 C(int x):F(x, 100u) { } /*- note; arg 2: unsigned -*/
3 C func(double* = 0); /* defined below */
4 };
5 C C_42func(double*x) { if (x) /*- commentary -*/
6 { } /*- inline comment with funny stuff- // : :: -*/
7 }
(d)
$ c++cb sample.cc
1 class C : virtual public F {
2 public:
3 C(int x):
4 F(x, 100u) {
5 } // note; arg 2: unsigned
6 C func(double* = 0); /* defined below */
7 };
8 C C::func(double*x) {
9 if (x) // commentary
10 {
11 } // inline comment with funny stuff- // : ::
12 }
(e)
$ c++cb -s sample.cc
1 class C : virtual public F {
2 public:
3 C(int x):
4 F(x, 100u) {
5 } // note; arg 2: unsigned
6 C func(double * = 0); /* defined below */
7 };
8
9
10 C C::func(double*x)
11 {
12 if (x) /*- commentary -*/ {
13 } // inline comment with funny stuff- // : ::
14 }
15
16
(f)
$ c++indent sample.cc; cat sample.cc
1 class C : virtual public F {
2 public:
3 C(int x):F(x, 100 u) {
4 } // note; arg 2: unsigned
5 C func(double *= 0); /* defined below */
6 };
7 C
8 C::func(double *x)
9 {
10 if (x) { // commentary
11 } // inline comment with funny stuff- // : ::
12 }
Copyright © 1992, Dr. Dobb's Journal