Features


Word Counting

L.J.G. Schermerhorn


L. J. G. Schermerhorn is a retired professor of geochemistry. A recent convert to C++, he is involved in scientific programming. Readers may contact him at Camminghalaan 10, 3981 GH, Bunnik, The Netherlands.

Counting words is one of the chores that would-be authors frequently undertake. Editors, when reviewing papers submitted for publication, generally want to know the number of words in the paper. Occasionally, they also want to know the number of printable characters. In pre-computer days, before submitting a typescript you would count the words on an average page by hand and extrapolate. Modern word processors usually include a word counting utility, so it is easy to arrive at the exact number of words in a given text. Well, maybe not always exactly.

After upgrading to WordPerfect 5.1, I noticed that sometimes the word counting utility would count words and sometimes it would not, for no discernible reason. So I wrote my own word counting utility in C++, using the Zortech compiler (version 2.1). The utility also counts lines, printable non-whitespace characters, and bytes. See Listing 1.

The code is practically self-explanatory. The algorithm defines a word as a sequence of printable characters followed by whitespace, a control character, or end-of-file. Here, whitespace is defined as a blank space, horizontal tab, vertical tab, line feed, carriage return, or form feed. When the program encounters whitespace, a control character, or EOF, it checks the preceding character. If the preceding character is not whitespace or a control character, the program advances the word counter by one. Control codes at the beginning or at the end of a file tend to upset the score, so they are filtered out.
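The counting rule just described can be sketched roughly as follows. This is not Listing 1 but a minimal illustration in present-day C++; the names Counts, count_stream, and count_text are invented here. The sketch assumes that lines are tallied by counting line feeds and that std::isgraph, in the default C locale, is an acceptable stand-in for "printable non-whitespace".

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Tallies kept by the sketch (hypothetical structure, not from Listing 1).
struct Counts {
    long bytes = 0;      // every character read
    long lines = 0;      // line feeds seen (assumed definition of a line)
    long words = 0;      // runs of printable characters
    long printable = 0;  // printable non-whitespace characters
};

Counts count_stream(std::istream& in) {
    Counts c;
    bool prev_printable = false;  // was the preceding character part of a word?
    char ch;
    while (in.get(ch)) {
        ++c.bytes;
        if (ch == '\n') ++c.lines;
        if (std::isgraph(static_cast<unsigned char>(ch))) {
            ++c.printable;
            prev_printable = true;
        } else {
            // Whitespace or control character: a word ends here only if
            // the preceding character was printable.
            if (prev_printable) ++c.words;
            prev_printable = false;
        }
    }
    if (prev_printable) ++c.words;  // EOF also terminates a word
    return c;
}

// Convenience wrapper for counting an in-memory string.
Counts count_text(const std::string& text) {
    std::istringstream in(text);
    return count_stream(in);
}
```

Under these assumptions, count_text("one two\nthree") reports 13 bytes, 1 line, 3 words, and 11 printable characters. Because a control character is scored only when it follows a printable one, stray control codes at the start or end of the input do not add words.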

You call the program from the command line by typing wcount <file name>. You could easily recode this in C, taking into account that I/O in C++ relies on the istream and ostream classes.

I urgently needed a word counter, but after I had written and used wcount, I realized that I might have unintentionally reinvented the wheel. In fact, K&R [1] present a program that flags words and counts their first characters. Leafing through the Zortech Compiler Reference, I came across the wc.c utility, a K&R-type program, which only needs compiling (but is not quite accurate). Turbo Algorithms [2] shows a word counting utility that counts words by adding leading and trailing spaces to the text string, and then counting how many times a space character is followed by a non-space character.
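The two alternative approaches might look something like the sketch below. The first function follows the familiar K&R in-word/out-of-word flag, treating blank, newline, and tab as separators. The second is only a guess at the Turbo Algorithms technique based on the description above, not the book's code. Both function names are invented for this illustration.

```cpp
#include <cstddef>
#include <string>

// K&R-style: a flag records whether we are inside a word, and the
// counter advances on the first character of each word.
long count_words_flag(const std::string& text) {
    bool in_word = false;
    long words = 0;
    for (char ch : text) {
        if (ch == ' ' || ch == '\n' || ch == '\t') {
            in_word = false;
        } else if (!in_word) {
            in_word = true;
            ++words;
        }
    }
    return words;
}

// Padded-string style (assumed reconstruction): add a leading and a
// trailing space, then count each space followed by a non-space.
long count_words_padded(const std::string& text) {
    const std::string padded = " " + text + " ";
    long words = 0;
    for (std::size_t i = 0; i + 1 < padded.size(); ++i) {
        if (padded[i] == ' ' && padded[i + 1] != ' ') ++words;
    }
    return words;
}
```

On plain text separated by blanks the two agree: both return 3 for "one two  three". They part ways as soon as another separator appears. For "a\tb" the flag version reports two words and the padded version only one, because the latter recognizes nothing but the space character.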

Each program this article mentions has a different definition for words in a text string: a sequence of non-space characters that is (1) unbroken (K&R), (2) preceded by a blank (Turbo Algorithms), or (3) followed by a blank or EOF (as presented in this article).

Zortech's wc and my wcount produce identical line and byte tallies, but wc is sometimes one word off the mark. (In files starting with a non-whitespace character, wc fails to score the first word. In other files it may register one word extra.) WordPerfect's word counting efforts are a different story. Taking the file of a recent column on dynamic file management, WordPerfect counted 3,302 words, wcount counted 3,840 words, and wc counted 3,841 words.

These discrepancies worried me, so I devised a program, textgen, to create text files of known size in order to test wcount. See Listing 2.

In C++, new allocates memory (here to the character array text), which is later freed by delete. (In C, calloc and free would carry out these tasks.)
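For comparison, the two allocation styles could be written along these lines. The helper names are hypothetical and are not taken from Listing 2. Note that current C++ requires the array form delete[] for memory obtained with new[], whatever syntax the Zortech compiler of the day accepted.

```cpp
#include <cstddef>
#include <cstdlib>

// C++ style: new[] allocates, delete[] frees. The trailing () asks for
// zero-initialized storage; one extra byte is reserved for a terminator.
char* make_buffer_cpp(std::size_t length) {
    return new char[length + 1]();
}

void release_cpp(char* text) {
    delete[] text;
}

// C style: calloc allocates zero-filled storage, free releases it.
// calloc returns a null pointer on failure, which the caller must check.
char* make_buffer_c(std::size_t length) {
    return static_cast<char*>(std::calloc(length + 1, sizeof(char)));
}

void release_c(char* text) {
    std::free(text);
}
```

The two families must not be mixed: a buffer from make_buffer_cpp goes back through release_cpp, and one from make_buffer_c through release_c.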

The random number generator, rand, is tricked into yielding only numbers between 33 and 126 inclusive, to be used as ASCII codes. With this compiler, rand returns an int that ranges from 0 to 32,767. The division rand()/32768.0 therefore produces values from 0.0 up to, but not including, 1.0. rand is not truly random, merely pseudo-random. (Knuth [3] elaborates on methods for generating pseudo-random sequences.) This program produces a sequence of ASCII codes that reappears unchanged every time the program is started.
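The scaling step can be sketched as below. The function name random_printable is invented here, and the sketch divides by RAND_MAX + 1.0 instead of the literal 32768.0, since RAND_MAX is larger than 32,767 on many current compilers. With a RAND_MAX of 32,767 the two divisors are identical.

```cpp
#include <cstdlib>

// Returns a pseudo-random ASCII code in the range 33..126 inclusive.
int random_printable() {
    // Scale rand() into the half-open interval [0.0, 1.0).
    const double unit = std::rand() / (RAND_MAX + 1.0);
    // There are 126 - 33 + 1 = 94 codes; truncation yields 0..93.
    return 33 + static_cast<int>(unit * 94);
}
```

Because the fraction never reaches 1.0, the product never reaches 94, so the result cannot exceed 126. If std::srand is never called, or is always called with the same seed, the sequence of codes repeats from run to run.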

This program is sufficient for creating a text string of known length. This string is segmented into words of varying length by calling rand() a second time, now constrained to inserting blanks at irregular intervals, and counting the blanks. The result is very cryptic, but serves the purpose. The program textgen.cpp can also be adjusted to insert control codes in a file.
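One way to picture the segmentation is sketched below. It is not Listing 2: the rule for placing blanks (an independent chance at every position, given by the hypothetical parameter blank_probability) is an assumption made purely for illustration, as are the names Generated and generate_text.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Generated text plus the number of blanks inserted (hypothetical type).
struct Generated {
    std::string text;
    long blanks = 0;
};

// Scale rand() into [0.0, 1.0).
static double unit_random() {
    return std::rand() / (RAND_MAX + 1.0);
}

// Builds exactly `length` bytes. Each position holds a blank with the
// given probability (assumed rule), otherwise a printable code 33..126.
Generated generate_text(std::size_t length, double blank_probability) {
    Generated out;
    out.text.reserve(length);
    while (out.text.size() < length) {
        if (unit_random() < blank_probability) {
            out.text += ' ';
            ++out.blanks;
        } else {
            out.text += static_cast<char>(33 + static_cast<int>(unit_random() * 94));
        }
    }
    return out;
}
```

A call such as generate_text(10000, 1.0 / 6.0) yields a 10,000-byte string whose blank count is known in advance of any word counting. Since this rule allows adjacent blanks, the number of words can be smaller than the number of blanks, but never larger than blanks plus one.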

Using textgen, I created two text files: one 10,000 bytes long with 1,672 blank spaces, and the other 50,000 bytes long with 8,306 blanks. According to wcount, they contained 1,516 and 7,544 words, respectively. According to wc, they contained 1,515 and 7,543 words. WordPerfect, after I recast the files in its format, tallied 2,336 and 11,627 words, respectively. Obviously, a text cannot contain more words than blanks plus one. I have no explanation for this inconsistency. The moral, I suppose, is never to take for granted anything the computer tells you, in hard numbers or otherwise.

References

[1] Kernighan, Brian W. & Ritchie, Dennis M., 1978, The C Programming Language. Prentice-Hall.

[2] Weiskamp, Keith, Shammas, Namir & Pronk, Ron, 1989, Turbo Algorithms, A Programmer's Reference. John Wiley & Sons.

[3] Knuth, Donald E., 1981, Seminumerical Algorithms, 2nd Ed., Vol. 2 of The Art of Computer Programming. Addison-Wesley.