LETTERS

It's Binary Trees, Not B-trees

Dear DDJ,

The article, "Setting Precedence," September 1989, misused the term "B-tree" in place of binary tree. A naive reader might understand these references to be to binary search trees instead of what they actually are. B-trees and their variants (B+ trees) are balanced multiway trees used to form indexes for large data files. The term B-tree was first used in 1972 by its developers, Bayer and McCreight, to describe a way of forming a multiway tree from the bottom up so that it was always balanced. Bayer and McCreight have never revealed the origin of the name; the B could stand for balanced, broad, Boeing (their employer at the time), Bayer, or could be from some inside joke. By the way, an exceptionally clear presentation of the B-tree concept is in Folk and Zoellick's File Structures: A Conceptual Toolkit. This presentation also includes a large amount of C code.

Terry Johnson

Stillwater, Oklahoma

Dear DDJ,

In "Setting Precedence" by Mark Peterson (September 1989), the author uses the term "B-tree" when he means "binary tree." Real B-trees are rarely binary and are always balanced. A definition of B-tree can be found in Donald Knuth's The Art of Computer Programming, Volume 3/Sorting and Searching, p. 473: A B-tree of order m is a tree which satisfies the following properties:

i) Every node has <= m sons.

ii) Every node, except for the root and leaves, has >=m/2 sons.

iii) The root has at least 2 sons (unless it is a leaf).

iv) All leaves appear on the same level, and carry no information.

v) A nonleaf node with k sons contains k-1 keys.

Further on in the article are the description and figure of a full 4-bit binary tree. The author says there are only two such trees, when it seems to me there are 16. This stems from the fact that there must be one leaf at a level lower than all others, which is shown in Figure 3 as the node numbered 0. However, that lower leaf could be any value from 0 to 15.

While maintaining balanced trees can improve efficiency, one should not jump at the chance in every case. With random input, ordered trees tend to stay remarkably balanced by themselves. In an application such as compiler symbol tables, additional work for balancing would normally be wasted. Each application must be evaluated individually to determine if maintaining balance is worthwhile.

Tim Paterson

Renton, Washington

CP/M Lives!

Dear DDJ,

In his letter printed in your September issue, Arpad Elo Jr. said that he had decided to write a TECO editor for CP/M. Please allow me to save him the trouble: I wrote such a beast about 12 years ago. Called TED, this program could be considered a superset of TECO editors found on TOPS-10 and RT-11 systems (except for the screen-oriented features). Among other things, it has extended Q-register functions, including the ability to use any Q-register as the editing buffer.

TED occupies about 13K of code space. Although it was at first a commercial offering, it has long been freely available through bulletin board systems and on conferencing systems such as BIX. One caveat: it only runs on Z-80 based CP/M systems.

Mark E. Mallett

Litchfield, New Hampshire

PS: Though I remember TECO fondly, I'm a longtime EMACS convert.

A* Heuristics

Dear DDJ,

I have just finished reading Randy Nevin's article "Autorouting with the A* Algorithm" in the September 1989 issue of DDJ. I am very pleased to see heuristic search and its applications receiving coverage in your fine journal.

I would, however, like to add some comments on A*, and additionally, its heuristic function, H(x). First, readers may be interested in knowing that the BFS (Breadth-First Search) algorithm is actually a special case of A*, obtained by setting H(x) = 0 and distance(pred(x),x) = 1 for all nodes x (the notation is as defined in Figures 1 and 3 of Nevin's article). Second, the A* algorithm always terminates with a solution whenever one exists (such an algorithm is said to be "complete"). This property holds regardless of whether A* is applied to finite or infinite graphs. Finally, if the heuristic function, H(x), used by A* is optimistic, that is, if H(x) always underestimates (or exactly estimates) the cheapest cost of a path going from node x to a goal node (such a heuristic is typically labeled "admissible"), then A* is guaranteed to find the optimal solution path whenever a solution path exists. Further constraints on H(x) yield even stronger results pertaining to the number of nodes expanded by (and hence the run-time of) A*. The interested reader is referred to Heuristics: Intelligent Search Strategies for Computer Problem Solving, by Judea Pearl (Addison-Wesley: Reading, Mass., 1984).

Andrew R. Spillane

University of Virginia

Charlottesville, Virginia

CA for the Rest of Us

Dear DDJ,

This is just a note to update your readers on the current state of cellular automation simulators (as mentioned in the sidebar to the article on "Simulated Annealing" by Michael McLaughlin in your September 1989 issue).

Autodesk Inc. is now shipping CA Lab, also known as Rudy Rucker's Cellular Automata Laboratory. CA Lab runs on a PC clone with CGA, and updates 320 x 200 pixels several times a second. CA Labs supports either bits per pixel, and new rules can be programmed in C, Pascal, or Basic.

Rudy Rucker

Mathenaut, Autodesk

Sausalito, California

APL Deserves Some Respect...

Dear DDJ,

In general, I agree with most of the technical articles in journals such as yours. However, I believe the column by Jeff Duntemann in the August 1989 issue contains at least one misleading statement.

APL is my favorite language, and I don't think it's weird. In fact, it is more powerful and concise than any language mentioned in your journal. At one of our SIG/APL meetings some time ago, we were told by an IBM employee that APL is used there internally for almost everything, then converted to Cobol, Fortran, etc. as needed for the outside.

W. S. (Bill) Cook

Marina Del Rey, California

.... And So Does Oberon

Dear DDJ,

In his July 1989 "Structured Programming" column, Jeff Duntemann mentioned Niklaus Wirth's new language, Oberon. Considering the quality of Wirth's previous languages, I think that any new language he develops should be examined. I believe the chances are that it will be found to be an excellent language.

I would like to see a discussion of it in DDJ, or at least a mention of how more information about it can be obtained.

Brian Jedrick

Nutley, New Jersey

Another Country Heard From

Dear DDJ,

I am writing from distant country, from Poland. In my country there are subscribers of Dr. Dobb's Journal too. We take pleasure in informing you that in our opinion DDJ is a journal on very high programming level and full of tips and tricks.

I am a student and a member of a student computer circle. We have AT and XT compatibles. We are very interested in C language and, for this reason, in Al Steven's C column. (Among other programs, we have the incredible Turbo C 2.0). Al has developed the communication program Smallcom. In our country, private communication via modems is rising now (there are some mailboxes like Fido net). His program is ideal for us (the cheapest, but powerful!). We have tried it to run. But here the troubles have arisen. Unfortunately we don't have some of the source code for libraries TWRP, context sensitive help, and windows (DDJ September to December 1988 issues). Could you do us a favor? Can you send us copies (reprints) of that source code from DDJ Sept. - Dec. 1988 with Al Stevens's articles? It is the only way for us to obtain needed source code and his explanation to it (our DDJ is a gift subscription).

That's all for now. I sincerely hope you will be able to help us in these matters. We wish you to keep the DDJ as it is! Best wishes to you all the editors of DDJ! I am looking forward to hearing from you.

Artur Terech

Krakow, Poland

DDJ: The source code and copies of the articles you need are in the mail.

Finite State Machines Are Fine by Me

Dear DDJ,

I found the article on procedure tables by Tim Berens in DDJ #154 (August 1989) quite interesting. I have to confess that I have not used pointers to functions in C as much as I probably should because I found the syntax of pointers a bit messy and confusing, but they are a powerful tool that I should have in my programming toolbox. One application for an array of pointers to functions is to implement a finite state machine.

For the benefit of those not familiar with the finite state machine I should explain that the concept is that the program has a limited (finite) number of states or modes and behaves differently in each state or mode. For example I am using the vi text editor to write this letter. At the moment it is in text entry mode and if I were to type a capital M, as I have just done, it enters it in the buffer as part of the document that I am preparing. If I press the escape key it changes to the command mode and then if I type a capital M it moves the cursor to the beginning of the line at the middle of the screen. Whether or not the author of vi thought of it as a finite state machine I do not know. Each state or mode does its thing but it also must watch for indications that it should turn matters over to some other state and initiate the transfer when it is indicated.

I know of at least three ways of programming a finite state machine and there are probably others that I never heard of. A finite state machine can be created using gotos, a switch, or a procedure table.

It is said that the goto method can be the most efficient, but it is usually difficult to understand the code and therefore difficult to maintain unless it is done in a very disciplined manner. The code for each state is labeled and whenever a transition to a different state is needed a go to the label for that state's code is executed.

I have usually used the switch method for my finite state machines. This requires an integer variable called state or mode or whatever is appropriate for that application, and a loop containing a switch controlled by this variable. There is a case for each state and either the code for that state is put directly into the switch or it is written as a function and called within that case. When it is necessary to transfer to a different state, the code within the case can assign a different value to the state variable. Putting the code directly in the switch cases avoids the overhead of a function call and is therefore slightly more efficient; however, it can result in an undesirably long switch. It is possible to put the code directly in the switch only for the simpler states and use function calls for the states that require larger amounts of code. Unless efficiency is of great importance, it is probably better to use function calls and have a shorter, easier to understand switch.

The procedure table method produces what is probably the most compact and easiest code to understand. To set up a finite state machine using a procedure table, a function is needed for each state and a state variable to index the procedure table in addition to the procedure table itself. The functions should return an integer type and, besides carrying out the work of that state, the function must watch for indications that a different state is required and then return the number for the next state. The procedure table is an array of function pointers, and by pointing them at the state functions a number is assigned to each function. The functions can then be called by number. The computer is perfectly happy with numbers but we humans find it easier to use names, and of course the functions themselves must have names. The plan that I use is to name each state, use the name of the state in lower case for the function name, and assign the same name in uppercase to the number for that state by using a preprocessor define.

If you have one of the newer compilers this might be a good place to use the new enumeration feature of the C language. A loop is established containing a call to one of the state functions by means of the procedure table indexed by the state variable and returning a new value of the state variable. This one statement is really all that is necessary in the loop, although in some cases it may be desirable to add some error detecting code. Unless it is desired to have the finite state machine run until somebody turns off the computer of pulls the plug (and this is sometimes done for a dedicated process control computer), there has to be some way to shut down the finite state machine. There are at least two ways to do this that I know of and probably others that I am not familiar with. There could be a terminal state whose function tidies things up (closes files, removes temporary files, etc.), and then calls exit( ) and so never returns, or there could be a value of the state variable not corresponding to any function, which would cause the loop to end. Even if the terminal state method is used, it is unwise to use a forever loop. If the state variable ever gets out of range, then memory outside the range of the pointer array would be used as a function pointer and almost anything could happen when code was entered at some random point or data was interpreted as code! The only sure thing about such a situation would be that it would be a disaster. The easiest way to guard against this is to use a while or for loop with the condition for continuing being that the state variable is in range. Don't forget that the state variable must be initialized to one of the states before the loop is entered. Usually the finite state machine always starts with the same state, or it can be made to start in the same state by having an initialization state which, often among other duties, has the task of selecting the starting state and transferring to it.

If such a state is used, it is never entered from any of the other states so that it executes only at the startup of the finite state machine. It is even possible to do it all in one for statement. It might go something like:

for(state=0;state>=0&&state < NSTATES;state=(*proctbl[state])( )fp));

While that is the most compact form, it might be clearer as:

state = 0 while(state .= 0 && state < NSTATES) state = (*proctbl[state]) (fp);

Both of these assume that the starting state is in the zero slot of the procedure table. It would probably be clearer and less restrictive to use the name defined as the starting state in place of zero for the initialization of the state variable. The zero in the comparison for the while condition should remain zero, however. NSTATES is, of course, the number of states assigned slots in the procedure table, and therefore the number of slots in the procedure table. Since these slots are numbered zero through NSTATES-1 the check on the upper limits is state < NSTATES.

When the functions are written, there are some special considerations and problems pertaining to finite state machines that should be kept in mind. Care should be exercised in writing the code that determines whether the state should continue or transfer to another state lest a situation should arise that would be rejected by all of the states. This could result in an endless loop of transfers from state to state with none of the states doing anything about it. Perhaps a special state named "CONFUSION" would be needed to deal with these unresolvable situations.

Another problem that often comes up when programming finite state machines is the situation where a state starts to process an item before it discovers, and possibly before it can discover, that the item requires a different state. Avoid having the state take any irreversible action before it is certain that the item being processed is its business. Frequently this problem occurs in connection with input. A state will read in an item and then find that the item should have been read and processed by a different state. If the input is being read in one character at a time then the answer is to use the ungetc( ) function. When larger chunks are being read, such as a line or a record, there is more of a problem. If it is certain that input is coming from a random access file then it is possible to use ftell( ) to record the position in the file before each read and when necessary to use fseek( ) to "push back" what was read. Unfortunately this does not work for input from most devices or from pipes.

A more general method is to preread. This works where the unit read is always the same, one line or one record or whatever. A global buffer, accessible to all the functions, is provided and the first item is read into it either before the finite state machine is started, or by the initialization state if that is used. The processing loop within each state would be a do loop that would process the item in the buffer, read a new item into the buffer, and decide whether to loop to process the new item or transfer to another state.

There is always the question of whether or not a finite state machine is the right program structure to use for a given application. This depends on what is meant by "right." At one extreme are those that program for efficiency no matter if that trick that gives some tiny advantage in efficiency makes the code hard to understand and maintain. At the opposite extreme are those that seem to have contempt for any considerations of efficiency and for the sake of understandability and maintainability they follow a rigid set of rules that permit only a few structures that have been blessed by the high priests of structured programming. Neither of these are likely to use the finite state machine concept. One rejects it because any formal structure is nonsense to him, and the other because it is not one of the sacred structures. Most programmers are somewhere in between. They realize that some structure is needed and take the pragmatic approach of if the structure fits the application use it and if it does not then find one that will fit.

Some say that a finite state machine, even when implemented using the switch or the procedure table, is really goto programming in disguise. To some extent this may be true but if the assembly language used to implement the favorites of the structured programmers, such as the while statement, is examined you will find gotos. Yet structured programming with its hidden gotos does have definite advantages. Therefore there must be merit in hiding the gotos. The merit in hiding the gotos lies in concealing confusing details so that the overall picture can be seen and comprehended more easily. But hiding the details does not do much good if what does show is confusing and hard to understand.

The finite state machine concept is a powerful tool, but like any tool, the more powerful it is the better the results if used skillfully, and the worse the results if misused. The finite state machine structure is quite general. It can be used to program almost anything, but it is the best choice for only a small fraction of what it can be used for programming. Earlier I described how a finite state machine could be implemented with gotos. Unfortunately this process can be reversed and any bit of goto programming can be translated to a finite state machine. Simply write each labeled hunk of code as a state function with gotos and places where the program would simply fall through to another hunk of code replaced by return statements returning the number for the function containing the code that would be reached. The resulting finite state machine could then be implemented by either the switch or the procedure table method. The result should superficially resemble structured programming, but bowl of spaghetti programming is bad programming no matter how it is implemented.

This letter is much too long. You probably will not find room to publish my ramblings. Anyway, writing out my ideas helps me get clear in my own mind so it is not a total waste of paper.

David S. Tilton

Manchester, New Hampshire

DDJ: Thanks for your explanation of finite state machines, David. Be sure to check our October 1989 issue for more info on FSMs.


Copyright © 1989, Dr. Dobb's Journal