Alan Cline has a B.S. in mathematics and an M.S. in physics. His 19 years experience includes systems design, telecommunications, applications development, and project management. He is currently working in object-oriented design and GUI projects. He can be reached at his systems consulting company, Carolla Development in Columbus, Ohio (614-764-8923).
Introduction
Software development has always been a tedious and time-consuming task, situated somewhere between an art and a science on the productivity spectrum. In many cases programmers can shorten development time and convert abstract designs into workable code more concisely, almost mechanically, by using state transitional automatons, especially if they "layer" their automations in a way that reflects the natural layers of abstraction within the problem.The technique described in this article helps the programmer design and code an application using a language-independent and easy-to-follow, step-by-step method based on state transition theory. Programmers, who can describe their problem solution as a series of independent states and transitions, can better organize their data and functions. Each design step of the method follows from the preceding one. The application is easily maintainable because it is composed of small, reusuable modules, and often only header files need to be changed.
This article will show how to develop applications with the the usual state transition theory and explain the problems encountered. The initial examples solve the problem at a very low level of abstraction. The closing section shows how the solution can be improved by constructing a separate automation that operates at a higher level of abstraction. Each point is illustrated with specific C source code examples, and the code for the automaton and its support routines are shown at the end.
State Transition Method
Historically, state transition theory has been most commonly applied to parsing applications. I will use a simple parsing example to illustrate the following five steps of developing a state transition application.1. Find a parsing rule for the input.
2. Draw a state transition diagram.
3. Construct a table of states and target tokens (Sparse Table).
4. Normalize the Sparse Table into a State Transition Table.
5. Optimize and generalize the State Transition Table to improve efficiency for implementation.
I'll illustrate the method with the somewhat trivial example of a routine to parse and validate a standard MS-DOS filename. Later I show how the same technique can be modified to handle more diverse applications.
1. Find a Parsing Rule
Define an MS-DOS filename as a letter followed by zero or more alphanumeric characters (up to a limit of seven characters), followed by an optional period or dot, followed by zero to three alphanumeric characters.The parsing rule is written as a series of single-character tokens:
a x* [.x [x [x] ] ] {e}using the notation:
a := any alphabetic character A-Z or a-z; x := any alphanumeric character A-Z, a-z, 0-9; * := zero or more occurrences of the preceding token; + := one or more occurrences of the preceding token; [ ] := encloses optional tokens; {e} := termination of input, or terminating character.The possible values of a and x, the dot and the terminator, comprise the alphabet of our input.According to the parsing rule, if a dot is present, at least one alphanumeric character must follow. Under the parsing rule freddy.dat, R2D2.rbt, foo.bar and just m are all acceptable filenames.
Note that no information is contained in the parsing rule about how many x's are permitted. This information is handled better in the transition from state to state.
2. Draw a State Transition Diagram
The parsing rule is analyzed syntactically, that is, letter by letter, to produce a State Transition Diagram (shown in Figure 1) .Starting at initial state A, the state machine jumps to state B only when an alphabetic character a is read. The state machine gets another character and cycles back to state B if the character was a dot, or to state C if it was an alphanumeric x, or jumps out if no more input is encountered (or the terminator is found). As more input characters are read, the diagram's paths are traversed accordingly.
A path of all possible allowable states, representing all possible allowable filenames, can be traced along the routes from the initial state to the final OUT state. If an input occurs such that no path is available, then the input stream contains as error.
State Transition Rules
To build a proper State Transition Diagram (STD), the following rules must be obeyed:Rule 1: Each path from a node must be unique (deterministic).
Rule 2: All possible valid paths must be shown; paths not shown represent input parsing errors.
Rule 3: Each transition must be determinable by a Boolean test.
I find it easiest to build STDs by first creating a "backbone" from the longest path and then adding the optional paths and variations. Figure 1 shows a backbone path of a x .x x x {e} from initial to final state. I based all other paths in the STD on this backbone. Check the STD against the rules above. Each transition occurs as the result of an unambiguous test of an input character; all possible valid filenames are represented (invalid filenames are blocked by being omitted from any path); and each node connection results from a unique Boolean test for the target character, i.e., the proper member of the alphabet is input at the proper time.
3. Build the Sparse Table
To represent the states and transitional flow as a table, chart the token alphabet against the possible states. Each position in the table represents the next state that the input token would produce. For the MS-DOS filename example, the STD yields the sparse table in Table 1.From the table, the state machine jumps from state A only to state B, only if an alphabetic token a is received. However, state B can transition to states B, C, or OUT, if the token was x, dot (.), or {e}, respectively. These and all other paths of the STD in Figure 1 are reflected in the sparse table of Table 1.
4. Normalize the Sparse Table
Each state of the sparse table may precede one or more states depending on the token received. Each jumped-to state can be treated as a substate of the jumped-from state. In other words, state A has one substate (B); state B has three substates (B, C, and OUT); state E has two substates (F and OUT), etc.Each state has an associated target token, a substate to jump to if the received token matches the target token (True transition), and a substate to jump to if the received token does not match (False transition). With each substate transition, an action can be performed on that token. In Table 2, the sparse table has been rearranged to show the substates and the target to form a "normalized" sparse table.
The path of state transitions always starts at substate zero and progresses through valid states until reaching OUT, or an error, which is indicated in the normalized sparse table as state ERR. For example, only an alpha token is allowed from state A, substate zero. If a numeral is received, the input is unacceptable and no further state transitions can take place.
Build the State Transition Table
The state machine, or Deterministic Finite Automaton (DFA), performs actions at each state transition. An action is associated with each of the True and False transitions. The actions are performed typically on the token last received. The DFA also contains a routine to provide tokens to be tested, and perhaps the testing routines as well. The DFA will be discussed in more detail later.At this point in the application development process, the programmer should associate token-consuming functions with the particular states. For our example, assume four routines that comprise a filename validation program:
Assume the DFA contains a routine to remove one character from the input (call it "tokenizer"), a routine that jumps from state to state ("movestate"), and any routines needed to check if a received token is the target token.
- pushchar: Concatenate the last token into the work buffer.
- savetoken: Move the concatenated tokens of the work buffer into an output area, considering it fully validated. Reset the buffer for new input.
- errmsg: Output an error message describing an invalid filename.
- skip: Jump to next state without getting a new token.
Add the token-consuming functions to the normalized sparse table to get the state transition table (STT) of Table 3. The error message function errmsg is associated with each ERR state, and skip is called each time the current token must be tested for a different state.
5. Optimize and Generalize
Table 3 shows that substates 6, 8, and 9 have the same target, the same True and False transition states, and the same True and False routines. These substates are identical, and can be optimized into a single transition substate. The new optimized STT is shown in Table 4.Substates 3 and 6 are not identical, and cannot be removed, because they have different Action routines. The Boolean nature of the STD and target tests tends to grow the number of substates in the STT's exponentially. However, because of redundancy in the tables, and allowing generalizations among routines, optimized STT's reduce to a manageable size.
Optimizing and generalizing STT's are common sense, a matter of searching for duplicates and eliminating them. Optimized and unoptimized state tables are topologically the same, as would be expected.
Implementation
In C, the STT can be a header file, a text file read in at runtime, explicit in the code, or linked with the DFA. The names of the targets, True routines, and False routines become pointers to functions that the DFA will evoke when it sees the proper substate. The next example illustrates these details.
Further Example: Restricted MS-DOS Pathname
A more complicated example shows some of the other details that affect application development. Suppose you need to parse a string into a validated drive letter, MS-DOS path segment(s), filename, and file extension. The file path is the concatenation of one or more path segments, and occurs between the drive colon and the filename. A path segment is the concatenation of letters delimited by backslashes. The pathname is called "restricted" because MS-DOS allows more flexibility than shown in the parsing rule, but the rule is sufficient as an example. In the rule, a disk drive letter followed by a colon is required, and a backslash as part of the filename is required. Legal restricted MS-DOS filenames are a:\sample3.dat, c:\doc\article.tex and c:\doc\tbls\stt3.tbl.The Parsing Rule for restricted MS-DOS pathnames is:
a: (\ a x*)+ [.x [ x [ x ] ] ] {e}using the notation as before.
Tables and Diagrams for Restricted MS-DOS Pathname
The State Transition Diagram for the Parsing rule is shown in Figure 2. It is similar to Figure 1 with one important exception: each time a backslash is encountered, the state machine jumps to state D to begin a new path segment or the filename. When the dot is seen, the state machine moves on; no lookahead is required to know if the machine is parsing a path segment or a filename as it processes.With the addition of the backslash to our alphabet, and a few more states, we can parse a much more complicated string. The sparse table for the restricted pathname (and Figure 2) is shown in Table 5.
The State Transition Table, shown in Table 6, uses the same action routines as Table 3, with the addition of:
(The previous routine "savetoken" was replaced with the more specific save routines.)
- savedrive: Copy the work buffer to the drive name variable.
- reset: Clear the output buffer and reset to beginning.
- appendpath: Append the current string of tokens to the path.
- savefile: Copy the work buffer to the filename variable.
- saveext: Copy the work buffer to the file extension variable.
- cleanup: Display the parsed input.
Substates 10 and 12 are identical and can be optimized away, but substate 7 uses a different action routine and must remain.
Calling the DFA from the Application
The state table is implemented as a function pointer table dostbl1.h (Listing 1) , and the application calling the state machine that uses the table is shown in syndos.c (Listing 2) .I declared a state machine syndos_dfa of type DFA_CLASS, which is defined in dfa.h (Listing 5) . Once the input file is open, I loop through each record of the file, parsing and storing the validated MS-DOS filenames. InitStateMachine must be passed the address of the particular instance of the DFA, the address of the state table, the tokenizer function pointer, the input file pointer, and the string of delimiters the tokenizer will use. In this case, NULL is passed so the default tokenizer get_token will be used.
The automaton StateMachine is passed only the address of the DFA declared. The function StateMachine will return ERR or OUT, as defined in the state transition table.
The tokenizer get_token is shown below the main program, along with each of the target and action functions.
More Powerful Techniques
The state transition application can be more powerful if the target and action functions are abstracted up to the semantic level. Certain ambiguity problems inherent in the syntactical level also disappear. Recasting the MS-DOS filename parsing rule semantically yields
[drive] [root] [path]* fname [ext]where
drive := a ':' root : = '\' path := ax* '\' fname := ax* ext := .x [x [x]] {e} := '\n'This parsing rule is equivalent to the syntactical one used earlier, but allows the target and action routines to be more substantial. Fewer states result in a more efficient application.The state transition diagram for the semantic case is shown in Figure 3, and its corresponding sparse table in Table 7. The state transition table and header implementation dostbl2.h is shown in Table 8 and Listing 3, respectively.
The program, semdos.c (Listing 4) , which calls the state machine, is the same as the one used for the syntactic case (Listing 2) , with the exception of the DFA variable and the state table name used in InitStateMachine and StateMachine. Different target and action functions are included, of course.
Implementing the State Machine
The state machine and constants are defined in dfa.h (Listing 5) . The state machine source code is shown in dfa.c (Listing 6) . The state machine's standard set of target functions and action functions are shown in dfafcns.c (Listing 7) .A sample input file and its corresponding output is shown in Figure 4.
Conclusion
Usually a state machine application can be written by specifying a group of small target and action routines, perhaps a customized tokenizer, and initializing and invoking the state machine. Application development using this technique is methodical, involves writing only small routines, and is easily modifiable within the state transition table. Using this technique, I have implemented applications in half the time of other techniques. State machine technique, or finite state automatons, are especially well-suited to applications that involve data-flow tranformations (e.g, parsers or data converters), or ones where control flow is tightly sequenced (e.g., transaction processing applications).Variations on the technique, e.g., non-Boolean testing and selective searching, yield more power and speed for application development. Even more power can be coaxed from the DFA by tweaking some of the built-in parameters, but those are topics for a different article.