Columns


Questions & Answers

Identifiers

Kenneth Pugh


Kenneth Pugh, a principal in Pugh-Killeen Associates, teaches C and C++ language courses for corporations. He is the author of C Language for Programmers and All On C, and was a member of the ANSI C committee. He also does custom C programming for communications, graphics, image databases, and hypertext. His address is 4201 University Dr., Suite 102, Durham, NC 27707. You may fax questions for Ken to (919) 489-5239. Ken also receives email at kpugh@dukemvs.ac.duke.edu (Internet) and on Compuserve 70125,1142.

Q

I share your oft-repeated conviction that longer, more suggestive identifiers are a good thing in source code, and I would like to see more people adopting them. However, what you recommend in the February 1993 issue carries it too far. There is a real risk that what you advise would result in verbose, long-winded text.

A good writer of natural language, like a good writer of source code, has a sense of what an ordinary reader can be presumed to know, either from context or otherwise, and from that can write more concisely. In natural language it is an effective practice to describe some specifics on introduction of a term or concept and from then on to use a short, concise identifier. For example, at the start of a report one may speak of "the size of the engine implied by the consumer rating index" and thereafter say "size" to mean the same thing. The traditional source code equivalent is to use a concise identifier and comment on it when first introduced or whenever necessary, and we have long been told that this identifier should be suggestive and pronounceable.

In contrast, you wish to put a lot more of the specifics into the identifier name. For example you recommend the more descriptive names when given a choice between

virtual int get_engine_size_by_model_name(...);
virtual int get_engine_size_by_consumer_rating(...);
versus

virtual int getsizeM(...);
virtual int getsizeCr(...);
No doubt you have come across source code with loads of comments in the right margin that just state the obvious and do not enhance understanding, but the reader still has to look at all the comments for fear of missing something important. It seems to me that what you are recommending carries a risk of making this same mistake, but with the comments now contained in the identifiers. The resulting source code could be like a sort of legalese, where the reader must digest large quantities of high-chaff verbiage. It would not be an appetizing meal. Humans are good at remembering the specifics when the identifier is suggestive. When you were reading David Copperfield, how often did you have to look up the same word twice? Not too often, I'll bet.

Nowadays when you do have to look something up, it takes just a few keystrokes. And you will get more reliable information than you could get from a lengthy name. You say that having to look something up distracts from the flow of the plot. But you don't mention that it is also distracting to be presented with extraneous information, for example, to be told by the member function name that the engine size is from the consumer rating in a subroutine that just needs an engine size. More troubling, you would have the comment repeated again wherever that member function name was invoked, including when the parameter name was making it redundant. That would drive the reader to distraction. Since class member functions are always backward references and may be invoked frequently, a concise but suggestive name accompanied by a true comment in the class definition is more readable, and closer to how it's done in natural language.

Lastly, overloaded operators are an effective means of concise expression in natural language (look at metaphors) and they hold out the promise of doing the same thing in source code. You won't see metaphors in legalese, but you will see a lot of redundancies, because lawyers prefer no unwritten inferences. What you propose is more like legalese than the natural language. You gave your correspondent David A. Dennerline a bad rap.

Sean Furlong
New York, NY

A

We agree that reasonable identifiers are a good thing. I may go to the long extreme for emphasis. I agree with your statement that a program that is too wordy is almost as bad as one that is not descriptive enough. To paraphrase Mark Twain, we've identified the problem, we're just trying to agree on the details.

Let me respond to some of your points. I might suggest that the length of names could be inversely proportional to their frequency of appearance. To use a short name for a function that may be seldom used simply requires the reader to look up its meaning. The association between name and meaning may be easily forgotten. I also feel my associative memory is starting to fill up, so associations should only be used if there is a distinct advantage in reading speed.

On the other hand, a name such as

get_engine_size_of_car_by_model_name_specified_by_a_character_string(my_model_name);
is definitely too long and repetitive. We generally assume that a name is going to be either a char * or a String type. And the of_car part is redundant, since the function is a member of class Car. That would be like having to describe Mark Twain in the first paragraph as a humorist/writer. (However, this may need to be stated for readers unfamiliar with American literature).
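To make the point concrete, here is a minimal sketch of how such a member function might sit in a class. The Car class body, the lookup table, and the sample values are all invented for illustration; only the member function name comes from the discussion above.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical Car class; the table contents are invented sample data.
class Car {
public:
    Car() {
        engine_size_by_model_name_["roadster"] = 2000;
        engine_size_by_model_name_["wagon"] = 1600;
    }
    virtual ~Car() {}

    // "of_car" is implied by the class, and "model_name" already suggests a
    // string argument, so neither needs to be spelled out in the name.
    // Returns -1 when the model name is unknown.
    virtual int get_engine_size_by_model_name(const std::string &model_name) const {
        std::map<std::string, int>::const_iterator found =
            engine_size_by_model_name_.find(model_name);
        return found == engine_size_by_model_name_.end() ? -1 : found->second;
    }

private:
    std::map<std::string, int> engine_size_by_model_name_;
};
```

A call such as car.get_engine_size_by_model_name(my_model_name) then reads without any further qualification, because the class supplies the rest of the context.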

If the name of the arguments that were passed to a function identified the function sufficiently, then additional description on the name is redundant, as you suggest. For example,

engine_size = car.get_engine_size_by_model_name(my_model_name);
could be shortened to something like:

engine_size = car.get_engine_size_by_mn(my_model_name);
Unfortunately the creator of the function has no control over its user. So if the user codes:

ensz = car.get_engine_size_by_mn(md);
then it is a bit more difficult for the reader to understand its meaning from the context.

As I suggested, if you use a standard abbreviation list for all function and member names, then the name could be shortened even more. For example, if you had a file called abbrev.doc with

 es engine_size
 mn model_name
 cr consumer_rating
then the function might be called with:

 es = car.get_es_by_mn(mn);
My guideline for this usage is that these abbreviations should be used everywhere. For example, in all references to the same type of object, such as engine size, es should be used; not eng_size in one place or engsz in another. A comment for the function name is not even needed in this case, since the true meaning of the function simply requires knowing a standard set of abbreviations.

The user of the function should use the exact same abbreviations in the call. If it were called with:

 ensz = car.get_es_by_mn(mdnm);
then the reader would not only have to recall the meaning of the standard abbreviation, but also remember the meaning of the caller's abbreviations.

The only danger in using short abbreviations is that the code space between them decreases and therefore the possibility of misreading increases. What I mean by code space is the number of characters that need to be changed to transform one abbreviation into another. Say el and et were used for engine_length and engine_torque. The abbreviations have only a one letter difference, while the long names have a six letter difference.
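The code space between two names can be measured mechanically. The following is a small sketch, not a standard metric: the function name code_space is made up here, and counting the extra characters of a longer name as differences is an assumption added to handle names of unequal length.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts the character positions at which two names differ. Extra
// characters in the longer name are counted as differences too (an
// assumption; the text only discusses names of equal length).
std::size_t code_space(const std::string &first, const std::string &second)
{
    const std::string &shorter = first.size() < second.size() ? first : second;
    const std::string &longer = first.size() < second.size() ? second : first;
    std::size_t differences = longer.size() - shorter.size();
    for (std::size_t i = 0; i < shorter.size(); ++i)
        if (shorter[i] != longer[i])
            ++differences;
    return differences;
}
```

With this measure, code_space("el", "et") is 1 and code_space("engine_length", "engine_torque") is 6, matching the counts given above.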

I agree with you that the seemingly redundant language that is seen in complex legal documents seems a bit overbearing. The wording of "consumer readable" contracts is closer to what I would suggest as a reasonable model for program specification. Having a standard place for definitions of terms (e.g. the meanings of abbreviations) and using those terms throughout the program helps make it understandable, as you state.

The opportunity for conciseness of overloaded operators is definitely an advantage. However, conciseness can be a disadvantage. You probably have heard of one-line APL programs that are not maintained, but rewritten, as the original coder's intent was hidden in its conciseness. I do not know any people who would write

 i = multiply(add(1,2),add(3,4));
instead of

 i = (1 + 2) * (3 + 4);
When I give my "That's an Object, I Object" presentation to C++ students, only one person has ever objected to being able to code:

 String a_string, b_string, c_string;
 a_string = b_string + c_string;
However, almost all (except for one group) have objected to overloading the - (hyphen) operator. The meaning of the following could not be agreed upon:

 b_string = a_string - c_string;
Does it delete the value of c_string from a_string if it finds it at the end of a_string or anywhere within a_string? If the latter, does it take out the first instance, the last instance, or all instances?

The one group that did agree had a common basis for agreement. The VMS operating system uses - (hyphen) to designate removing c_string from a_string if it occurs at the end of a_string. If an entire group of programmers has a common basis for abbreviating or overloading, then this collective knowledge can be used for shorter programs.
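For a group that shares that convention, the end-of-string interpretation might be sketched as follows. The String class here is a hypothetical stand-in built on std::string, not any particular library's class, and the choice to return the left operand unchanged when there is no match is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical String type, standing in for whatever string class is in use.
class String {
public:
    String(const char *text = "") : text_(text) {}
    String(const std::string &text) : text_(text) {}
    const std::string &text() const { return text_; }

    friend String operator+(const String &left, const String &right) {
        return String(left.text_ + right.text_);
    }

    // One possible meaning of '-': strip 'right' only when it is the tail
    // of 'left'; otherwise return 'left' unchanged (an assumption).
    friend String operator-(const String &left, const String &right) {
        const std::string &whole = left.text_;
        const std::string &tail = right.text_;
        if (tail.size() <= whole.size() &&
            whole.compare(whole.size() - tail.size(), tail.size(), tail) == 0)
            return String(whole.substr(0, whole.size() - tail.size()));
        return left;
    }

private:
    std::string text_;
};
```

A group without that common basis would probably be better served by a named member function that states which occurrence is removed.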

I don't think I gave David Dennerline a bad rap. He was arguing for the same points that I made — emphasize readability over short code.

As for David Copperfield, I can't remember how often I looked up the same words. I think my associative memory had quite a bit of spare storage then. It took me some time to read the book (a whole summer). I'm sure I often looked up the same word in separate sittings.

Modem Strings

Taking CUJ at its word, I am faxing this question to you in hopes that you'll be able to answer it or point me in the right direction.

I am a developer working on modem communication programs. The company I work for has a large installed base of users with an extremely wide variety of modems. I have a very difficult time trying to support my programs when the user knows absolutely nothing about the modem and worse yet does not even have the manual that came with the modem to help me solve their problem(s). Each time I hear, "Well Procomm has modem installation" it makes me more desperate to find what I have been looking for, namely, a comprehensive, single source of modem commands and S-register settings for most if not all of the so called Hayes-compatible modems out there.

Any help you could give me would be greatly appreciated (along with all the other help you have given me in your column.)

Andrew W. Ackard

I wish what you request were available. I have not seen such an item, but perhaps one of our readers has. I just went through the pains of a similar project. A central computer was used to collect data files from client computers. We had no control over the type of modems used for the client computers. The client programs had to be set up so that they were completely automatic: start them up and leave them alone until the files were transferred.

A set-up program was used to customize the initialization string for each client's modem. Of the items on the string, about two-thirds were the same for all modems, the rest were different. Some were necessary simply because the default settings were different on each modem. Unfortunately, once you get past the simple commands, to the ones that were introduced with speeds higher than 2,400 baud, differences emerge.
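The split between common and modem-specific commands might be sketched like this. It is only an illustration of the idea, not the set-up program described above: the function name, the modem identifiers, and the table are hypothetical. The common part uses the basic commands E0 (echo off), Q0 (result codes on), and V1 (word responses); the per-modem entries must come from each modem's manual.

```cpp
#include <cassert>
#include <map>
#include <string>

// Commands assumed to be common to all the modems being supported.
const char *const COMMON_INIT = "ATE0Q0V1";

// Builds the initialization string for one client. 'modem_extras' maps a
// hypothetical modem identifier to the commands peculiar to that model.
// A modem that is not listed gets the common part only.
std::string build_init_string(const std::map<std::string, std::string> &modem_extras,
                              const std::string &modem_id)
{
    std::string init = COMMON_INIT;
    std::map<std::string, std::string>::const_iterator found =
        modem_extras.find(modem_id);
    if (found != modem_extras.end())
        init += found->second;
    return init + "\r";   // a command line ends with a carriage return
}
```

Keeping the per-modem part in a table means that supporting another modem is a matter of adding one entry rather than changing the client program.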

The S-registers appear to be different among just the three modems whose manuals I have handy. For example, S13 is used as a UART-status, code-guard time by one, but is reserved or undefined by the other two. By combing through ten or so manuals and doing similar comparisons, you might come up with a reasonable list.

On a side note, you might have heard that Hayes is suing other modem manufacturers for patent infringement. It is concentrating on the use of the escape sequence (silence followed by a string +++ followed by silence) to return to command mode. Although it wants to sue on duplication of the remainder of the commands, it supposedly has less of a leg to stand on. Standardization would be a boon to programmers and users, but such suits tend to discourage vendors from exact replication.

File truncation

I noted in your response to Jeff Dragovic in "Q?/A!", (CUJ, January 1993, p. 107) an implication that the only way to truncate a file under MS-DOS was to copy the file up to the desired point, close the copy, and discard the original.

Under MS-DOS, writing zero bytes will truncate the file being written at the current file position. This is documented under MS-DOS function 40H, which is usually the system call underlying most MS-DOS fwrite and write functions.

Not only that, if you seek beyond EOF and then write zero bytes, the file will be extended to the new file position, but without any actual data being written to the file. It's a great way to create really big files really fast when preallocating for any reason.

While there is, of course, no guarantee that an ANSI implementation will actually honor a request to write zero bytes, I venture to suggest that under MS-DOS it will always be so, simply because of this functionality.

Hope this helps!

Karl Auer
Canberra ACT
Australia

Thank you for your letter. I also received a similar response from Lloyd at Lloyd@debug.cuc.ab.ca. I don't have a last name (or is it the first name?), since the trailer was lost in electronic garbage land. He mentioned that Borland C has a chsize function that calls function 40H. Borland also has a set of _open, _read, _write, and _close calls that call 40H directly. Listing 1 shows their use in truncating a file.

The corresponding program using open, write, and close does not truncate the file. That would have made the functions disagree with their generally accepted operation.

Seeking beyond the end of a file and writing extends the file in both MS-DOS and UNIX. Under MS-DOS the additional disk blocks in the gap are physically allocated at this time, as you mentioned. With UNIX, the blocks are not allocated. If you seek and write to them at a later time, they will then be allocated.

This poses an interesting twist if one is porting between the two systems. Under MS-DOS, you would be sure you had disk space for the file if the write worked after the seek. Under UNIX, you will not be sure until all the blocks in the intervening gap are written. This difference might be hidden by another glue routine which might look like:

enum e_file_extend {FROM_BEGINNING, FROM_CURRENT, FROM_END};
int file_extend(int file_descriptor, long size_in_bytes, enum e_file_extend);
With MS-DOS, it would perform the seek as you described. Under UNIX, it would actually need to do the writes. (KP)
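A UNIX version of such a glue routine might look roughly like the sketch below. The prototype above leaves the semantics open, so several things here are assumptions: the third parameter is named origin, size_in_bytes is taken as the distance from that origin to the desired end of file, the routine never shrinks a file, and it returns 0 on success and -1 on failure. It uses only the POSIX calls lseek, fstat, and write.

```cpp
#include <cassert>
#include <cstring>
#include <fcntl.h>      // open() flags, for callers
#include <sys/stat.h>
#include <unistd.h>

enum e_file_extend {FROM_BEGINNING, FROM_CURRENT, FROM_END};

// Assumed semantics: grow the file so that it ends size_in_bytes past the
// chosen origin, writing real zero bytes so that every block in the gap is
// allocated. Returns 0 on success, -1 on error (including a full disk).
// The file position is left at the end of the file.
int file_extend(int file_descriptor, long size_in_bytes, enum e_file_extend origin)
{
    struct stat info;
    off_t current = lseek(file_descriptor, 0, SEEK_CUR);
    if (current == (off_t)-1 || fstat(file_descriptor, &info) != 0)
        return -1;
    off_t end = info.st_size;
    if (lseek(file_descriptor, 0, SEEK_END) == (off_t)-1)
        return -1;

    off_t base = (origin == FROM_BEGINNING) ? 0
               : (origin == FROM_CURRENT) ? current
               : end;
    off_t target = base + size_in_bytes;

    char zeros[4096];
    memset(zeros, 0, sizeof zeros);
    while (end < target) {
        size_t chunk = sizeof zeros;
        if ((off_t)chunk > target - end)
            chunk = (size_t)(target - end);
        ssize_t written = write(file_descriptor, zeros, chunk);
        if (written <= 0)
            return -1;          // disk full or other write failure
        end += written;
    }
    return 0;
}
```

Because every byte in the gap is actually written, a successful return gives the same assurance about disk space that the seek-and-write approach gives under MS-DOS, at the cost of the time spent writing.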