Features


A Natural Language Processor

Russell Suereth


Russell Suereth has been consulting for over 12 years in the New York City and Boston areas. He started coding and designing systems on IBM mainframes and now also builds PC software systems. You can write to Russell at 84 Old Denville Rd, Boonton, NJ 07005, or call him at (201) 334-0051.

Computers are now sold with CD-ROMs, sound boards, and speech synthesizers. These components allow the user and the computer to interact in a more human manner. This interaction can be extended further with human language. The natural language processor presented in this article is a practical application of the use of human language.

Processing Human Language

A natural language processor takes a human statement, analyzes it, and responds in a way that appears human. But this appearance isn't a mystery. A natural language processor performs certain processes based on the words in the input sentence. In many ways, processing human language is just another data-processing function. The input sentence is the input transaction record. The words in the sentence are fields in the input record. The dictionary is a master file of word information. The meaning is derived from a combination of information in the input sentence, the dictionary, and program code. The generated response to the sentence is the output.

Some simple natural language processors process one-word commands that don't require much analysis, such as find, update, and delete. This type of processor uses a small dictionary to identify the commands. When used with a database, commands to such a processor could execute functions to find, update, and delete records. A similar processor can be connected to a remote-controlled car and used with the commands forward, reverse, and stop.
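A command processor of this kind amounts to a table lookup. The following sketch is illustrative only; the table and function names are not taken from the author's listing:

```c
#include <string.h>

/* Hypothetical one-word command dispatcher: the "dictionary"
   is just a small table of valid command words. */
static const char *commands[] = { "find", "update", "delete" };
#define NUM_COMMANDS (sizeof(commands) / sizeof(commands[0]))

/* Return the command's index in the table, or -1 when the
   word is not a known command. */
int lookup_command(const char *word)
{
    int i;
    for (i = 0; i < (int)NUM_COMMANDS; i++)
        if (strcmp(word, commands[i]) == 0)
            return i;
    return -1;
}
```

A caller would branch on the returned index to execute the matching database function.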

Other natural language processors are more sophisticated and can process multiple input sentences. A larger dictionary contains information about the word such as whether it is a noun, a preposition, or a verb. This complex processor can analyze a sentence, identify parts of the sentence, derive meaning from the sentence, and generate an appropriate response or action. Theoretically, a complex processor may parse through a book and derive meanings and themes from the sentences. It may then generate a response based on these meanings and themes.

Building a Processor

The natural language processor presented here is a miniature version of the complex processor just described. It can process several English input sentences such as Jim is running in the house, Bill is walking on the street, and Sue is going to a store. It can also process sentences such as Where was Bill walking? and generate a response that states Bill was walking on the street. An actual session with the processor is shown in Figure 1. This miniature processor, however, has a limited capability. It can only be used with the kinds of input sentences shown in the figure. But this capability can be increased by expanding existing routines and adding new ones. Some areas of possible expansion will be indicated throughout this article.

A natural language processor must perform some amount of analysis on the sentence and the words in it. The analysis may be simple and only identify that the words are valid by matching them with the dictionary. Or, it may be complex and identify structures, phrases, meanings, and themes in the sentence. The analysis this particular processor performs identifies phrase structures and meanings.

This processor uses transformational grammar to identify underlying phrase structures in the input sentence. Each word in the sentence has a type such as noun or verb. Specific combinations of these types make a certain phrase structure. For example, the words a, the, and these are determiners and are located right before a noun. A determiner-noun combination such as the house is a kind of noun phrase. Phrase structures must be identified in order to identify sentence elements such as subject, action, and place. Figure 2 shows some abbreviated phrase structures used in transformational grammar.
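One way to represent word types and a phrase structure in C is an enumeration and a fixed sequence of expected types. The names below are illustrative, not identifiers from Listing 1:

```c
/* Illustrative word-type codes; the article's dictionary
   stores these as text such as NOUN or VERB. */
typedef enum {
    T_NONE, T_DET, T_NOUN, T_VERB, T_AUX, T_PREP, T_NAME, T_WH
} WordType;

/* A phrase structure is a fixed sequence of types.
   DET-NOUN, for example, is one kind of noun phrase. */
static const WordType noun_phrase[] = { T_DET, T_NOUN };

/* Check whether a run of word types matches a structure. */
int matches_structure(const WordType *words,
                      const WordType *structure, int len)
{
    int i;
    for (i = 0; i < len; i++)
        if (words[i] != structure[i])
            return 0;
    return 1;
}

/* Convenience wrapper: is this two-word combination a
   DET-NOUN noun phrase such as "the house"? */
int is_det_noun(WordType a, WordType b)
{
    WordType w[2];
    w[0] = a;
    w[1] = b;
    return matches_structure(w, noun_phrase, 2);
}
```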

There are a limited number of underlying phrase structures in transformational grammar. But there are an unlimited number of possible sentences. The relationship of structures to possible sentences is shown with the two sentences Jim is running in the house and Sue is going to a store. The underlying structure (name-auxiliary-verb-preposition-determiner-noun) is the same in these two examples, although the words make the two sentences different. The limited number of structures can be coded in the program without coding the many combinations of words that may occur.

Transformational grammar also defines how words are used together or used in a certain context. For example, the woman drove the car is correct but the table drove the car is not. Words have restrictions to determine how they can be used together properly. The inanimate object table cannot be used with the action verb drove. Only humans can drive so a HUMAN restriction should be assigned to the verb drove. Additional restrictions of MAN, WOMAN, PERSON, and TEENAGER should also be assigned. The natural language processor in this article uses only one restriction labeled ING for words such as going, running, and walking. An expanded natural language processor may use many additional restrictions. One of these is verb tense so that Jim was runs in the house is recognized as an incorrect sentence.

Processing the Sentence

Listing 1 contains the natural language processor. The main routine first calls the initialize routine. This routine initializes some variables that are used throughout the program. Each entry in the subjects, actions, and places arrays is initialized. The first entry contains the subject, action, and place of the first input sentence. The second entry contains this information for the second input sentence. The arrays have 20 entries and can contain information from 20 input sentences. Next, the main routine opens the dictionary file named diction. If the dictionary opened successfully, the main control flow of the program is entered.

The main control flow of the program is a while statement that loops once for each input sentence. Several processes occur within the while statement for each sentence. From a broad view, the program extracts the sentence words and matches them with the dictionary, which contains information about the word. The program loads this word information into arrays that it analyzes to determine the underlying structure of the sentence. The program identifies the subject, action, and place words in the sentence and copies these words to their arrays, which contain an entry for each input sentence. The program determines the appropriate response and then generates that response. When there are no more input sentences the program closes the dictionary and then ends.

From a detailed view, the while loop executes until the input sentence is null, that is, when only Enter is pressed on the keyboard. First, the program calls the reset_sentence routine. This routine initializes variables that are used for each input sentence. reset_sentence initializes the word_ct variable, which will contain the number of words in the sentence. reset_sentence also initializes array entries. These arrays contain an entry for each word in the sentence, allowing up to 10 words in an input sentence. The type_array contains five additional entries for each word, allowing up to five possible types for the word. Examples of types are noun, preposition, and verb. The main routine then parses through the input sentence to extract each sentence word. For each word in the sentence main calls the get_record routine and increments word_ct.
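The word-extraction step can be sketched with the standard strtok function, assuming simple space-separated input; this is a stand-in for the parsing in main, not the listing's code:

```c
#include <string.h>

/* Split a sentence into words, as main does before calling
   get_record for each one. Modifies the buffer in place and
   returns the word count, up to max words. */
int split_words(char *sentence, char *words[], int max)
{
    int n = 0;
    char *tok = strtok(sentence, " ");
    while (tok != NULL && n < max) {
        words[n++] = tok;
        tok = strtok(NULL, " ");
    }
    return n;
}
```

After this call, n plays the role of word_ct and each words entry is passed to the dictionary lookup.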

The get_record routine reads each record in the dictionary and calls the match_record routine to determine whether the dictionary word matches the sentence word. If they match, then the types variable is incremented to accommodate another type for the same word. The dictionary can contain multiple types for the same word. The word jump, for instance, can be a noun or a verb and would have a dictionary record for each type. When the end of the dictionary file has been reached, the types variable is checked to see whether the word was found. If the word wasn't found and the first character of the word is uppercase, then the word is a name. Names such as Jim and Sue aren't kept in the dictionary. The word is then copied to word_array for later processing. The routine can be modified to find the word faster by changing the dictionary to an indexed file. The dictionary appears in Listing 2. It is the same dictionary used in the session shown in Figure 1.
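The lookup can be sketched without file I/O by treating the dictionary as an in-memory table of word/type records; as in the listing, one word may appear in several records. The record layout here is a guess at the spirit of the diction file, not its exact format:

```c
#include <string.h>

/* Hypothetical in-memory stand-in for the diction file: one
   record per word/type pair, so "jump" appears twice. */
struct dict_rec { const char *word; const char *type; };

static const struct dict_rec dict[] = {
    { "jump",  "NOUN" },
    { "jump",  "VERB" },
    { "house", "NOUN" },
    { "in",    "PREP" },
};
#define DICT_SIZE (sizeof(dict) / sizeof(dict[0]))

/* Collect every type recorded for a word, as get_record does
   by scanning the whole file. Returns the number found. */
int get_types(const char *word, const char *types[], int max)
{
    int i, n = 0;
    for (i = 0; i < (int)DICT_SIZE && n < max; i++)
        if (strcmp(dict[i].word, word) == 0)
            types[n++] = dict[i].type;
    return n;
}

/* Convenience wrapper returning only the count of types. */
int count_types(const char *word)
{
    const char *types[5];
    return get_types(word, types, 5);
}
```

A count of zero combined with an uppercase first letter would mark the word as a name, as the article describes.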

get_record calls the match_record routine for each record in the dictionary. match_record compares the passed sentence word with the word in the current dictionary record. match_record extracts the word from the dictionary with the extract_word routine, then it matches the extracted dictionary word with the passed word. If the match is successful, then the type is extracted from the dictionary record and copied to type_array.

If the type is a verb, then the root is extracted from the dictionary with the extract_root routine and copied to root_array. In a group of similar words such as run, runs, ran, and running, the word run is the root. Each verb in the dictionary has a root. The root will later identify a group of similar words that may be used in a generated response sentence. An expanded natural language processor that generates many different responses would find the root invaluable. For example, given the input sentence Jim is running in the house, a generated response may be Why does Jim run in the house? or It appears that Jim runs often. The response words run and runs can be identified in the dictionary through the common root run.

The Underlying Structure

The check_underlying routine identifies underlying phrase structures in the input sentence. The code shows two specific underlying structures that can be identified. The first underlying structure is a question sentence that starts with a WH word. A WH word is a word such as where or what that starts with the letters wh. The next word in the input sentence must be an auxiliary, which is labeled AUX. The next word must be a name, and the last word must be a verb. This underlying structure has the types: WH-AUX-NAME-VERB. It can be used for many similar sentences such as Where was Bill walking? and Where was Sue going?

The check_underlying routine calls check_type which compares the passed type with the possible types in type_array. The type_array variable holds the possible types for each input sentence word. If the first input sentence word has a WH type, then it matches the structure for the first word. Each word in the input sentence is checked to see if it matches the type in the structure. If all the input sentence words match, then the sentence has the underlying structure WH-AUX-NAME-VERB.
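The idea behind check_underlying and check_type can be sketched as follows: a word matches a slot in the structure when any of its possible types matches. The array sizes mirror the ones described earlier, but the code itself is illustrative rather than the listing's:

```c
#include <string.h>

#define MAX_WORDS 10
#define MAX_TYPES 5

/* type_array[w] holds up to MAX_TYPES possible types for
   word w; unused slots stay as empty strings. */
static char type_array[MAX_WORDS][MAX_TYPES][10];

/* Helper to install one possible type for word w. */
void set_type(int w, int slot, const char *type)
{
    strcpy(type_array[w][slot], type);
}

/* Does word w have the passed type among its possibilities? */
int check_type(const char *type, int w)
{
    int t;
    for (t = 0; t < MAX_TYPES; t++)
        if (strcmp(type_array[w][t], type) == 0)
            return 1;
    return 0;
}

/* Does the sentence match the WH-AUX-NAME-VERB structure? */
int is_wh_question(int word_ct)
{
    static const char *structure[] = { "WH", "AUX", "NAME", "VERB" };
    int w;
    if (word_ct != 4)
        return 0;
    for (w = 0; w < 4; w++)
        if (!check_type(structure[w], w))
            return 0;
    return 1;
}

/* Demo: Where was Bill walking -> WH AUX NAME VERB. */
int demo_wh_question(void)
{
    set_type(0, 0, "WH");
    set_type(1, 0, "AUX");
    set_type(2, 0, "NAME");
    set_type(3, 0, "VERB");
    return is_wh_question(4);
}
```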

Once the underlying structure is matched, the correct type is copied to prime_types. The prime_types array identifies that the word jump, for example, is a noun rather than a verb. Verbs refer to actions and nouns refer to places. This type identification will be used later to identify words in the sentence that refer to an action or a place.

Next, the kind of phrase is assigned. Auxiliaries and verbs are assigned to verb phrases, determiners and nouns are assigned to noun phrases, and prepositions are assigned to prepositional phrases. Phrases are combinations of specific, adjacent word types. For instance, a noun phrase has a DET-NOUN combination. Phrase identification will be needed later to locate words in the sentence that identify a place. For example, a place can be identified by the prepositional phrase in the house. In an expanded natural language processor, phrase identification would be increased to several processes. What initially looks like a noun phrase such as the house may be a prepositional phrase such as in the house after additional sentence analysis.

The second underlying structure that can be identified has the types NAME-AUX-VERB-PREP-DET-NOUN. This structure can be used for sentences such as Bill is walking on the street and Sue is going to a store. The two coded underlying structures will only accept input sentences such as Sue is going to a store and Where was Sue going? Other kinds of sentences can be processed when other underlying structures are coded.

Additional underlying structures that can be coded are shown in Figure 2. Notice that the two coded structures aren't shown. These two structures were created only for explanatory purposes instead of the more lengthy code required for all the underlying structures. In an expanded natural language processor, the transformational grammar structures would be coded. For example, one coded structure would be DET-NOUN to identify a kind of noun phrase.

Identifying Sentence Elements

The elements of subject, action, and place help convey the meaning in the input sentence. Subject identifies who or what the sentence is about, action identifies the activity that is performed, and place identifies where the activity occurred. In the input sentence Sue is going to a store, the subject is Sue, the action is going, and the place is to a store. Without the identification of these elements, the sentence would be merely composed of meaningless words, types, and phrases.

Three routines in the processor identify the sentence elements. The check_subject routine looks at each word in the input sentence. If the word is a name, then check_subject copies the word to the subjects array, which contains a subject entry for each input sentence. The check_action routine also looks at each word in the input sentence. If the word is a verb, then check_action copies the root of the word to the actions array, which contains an action entry for each input sentence. The root will be useful when expanding the processor. It will allow the processor to determine that Jim ran on the street and Jim runs on the track are similar actions. It will also allow an appropriate form of run to be used in a response statement. For example, Jim ran should be used to describe past tense and Jim runs to describe present tense. The check_place routine looks at each word in the input sentence too. If the word is in a prepositional phrase, then check_place concatenates the word to the places array, which contains a place entry for each input sentence. Each word in the prepositional phrase refers to a place and will be concatenated to the places array.

With this information, a simulated understanding of the sentence can be derived. The processor does this by matching the subject and action words in the current input sentence with information in previous sentences. For example, one input sentence can be Jim is running in the house. The processor will place Jim in the subjects array, run in the actions array, and in the house in the places array. A later input sentence can be a question that asks, Where was Jim running? The processor will identify Jim as the subject and run as the action. Since this is a question, the processor will also search the subjects array and actions array for the words Jim and running. When a match is found, the corresponding places array will be used to create a response that states, Jim was running in the house. If a person saw only the input sentences and responses as shown in Figure 1, then the processor would appear to have some degree of understanding. But this is only an appearance. The processor generates a canned response of words that are based on a combination of input sentence words and information in the arrays.
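The lookup behind this simulated understanding can be sketched as a search over parallel arrays. The names mirror the article's subjects, actions, and places arrays, but the code is illustrative:

```c
#include <string.h>

#define MAX_SENT 20

/* Parallel arrays: entry i holds the elements of input
   sentence i, as described in the article. */
static char subjects[MAX_SENT][20];
static char actions[MAX_SENT][20];
static char places[MAX_SENT][40];
static int sentence_ct = 0;

/* Record the elements of one input sentence. */
void store_sentence(const char *subj, const char *act,
                    const char *place)
{
    if (sentence_ct < MAX_SENT) {
        strcpy(subjects[sentence_ct], subj);
        strcpy(actions[sentence_ct], act);
        strcpy(places[sentence_ct], place);
        sentence_ct++;
    }
}

/* Find the place where the subject performed the action, as
   make_response does; NULL means the processor doesn't know. */
const char *find_place(const char *subj, const char *act)
{
    int i;
    for (i = 0; i < sentence_ct; i++)
        if (strcmp(subjects[i], subj) == 0 &&
            strcmp(actions[i], act) == 0)
            return places[i];
    return NULL;
}
```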

An expanded natural language processor can contain code for a number of responses. It can also contain the routines check_manner and check_time to identify how and when something occurred. These two routines should allow prepositional phrases to identify elements of place such as in my house as well as elements of manner and time such as in my joy and in the morning.

Generating a Response

The response is the output statement from the natural language processor. After reading and processing an input sentence, the natural language processor must generate an appropriate response in acknowledgement. When a person speaks, the listener responds to let the speaker know the words were heard and understood. The response may be a simple nod of the head or several sentences that explain the listener's understanding. Two considerations that determine the kind of response are the listener's knowledge of the spoken information, and whether the spoken words were a question or a statement.

The make_response routine uses these two considerations to generate responses to the input sentence. First, it checks to see whether the first word in the input sentence is Where. If it is, then the input sentence is a question. The second consideration is whether the processor has knowledge of information in the input sentence. The processor has this knowledge when the information exists in sentence elements from previous sentences. In the question Where was Jim running, the subject is Jim, and the action is running. Since it's a where-question, it's asking for a location associated with the words Jim and run, the root of running. The processor keeps information from previous sentences in the subjects, actions, and places arrays. The places array contains locations. The make_response routine searches the subjects and actions arrays for a match with Jim and run. When it finds a match the associated entry in the places array will contain the information to generate a response that states where Jim is running.

When the input sentence is a where-question, and make_response does not find Jim and run, the processor doesn't have enough information to indicate Jim's location. The routine then moves the statement I don't know to the response variable. When the input sentence is not a question, it is simply a statement of fact. The routine then moves Ok to the response variable. Other kinds of responses can be coded and generated. For example, an array of You don't say, Please go on, and Tell me more statements can be coded and used as responses.

When the input sentence is a where-question, and make_response finds Jim and run, the make_response routine calls the make_answer routine. make_answer creates an answer by placing the associated subject, action, and place words together in the response variable.

The make_answer routine is passed an array index that relates the appropriate entries in the subjects and places arrays. First, the routine copies the appropriate subject to the response variable giving Jim. It then concatenates was to the variable to give Jim was. Next, it calls the get_verb_ing routine to retrieve the ING version of the action word. The ING version is a word such as running or walking. The ING verb must be used in the response because alternatives such as Jim was runs and Jim was ran are incorrect. The get_verb_ing routine reads each record in the dictionary file. It calls the match_verb_ing routine to determine whether the record contains the correct ING verb. The correct ING verb has an ING restriction. The correct ING verb also has a root that matches the action in the input sentence. If match_verb_ing finds the correct ING verb, the get_verb_ing routine concatenates it to the response giving Jim was running. Finally, the make_answer routine concatenates the appropriate place words to the response resulting in Jim was running in the house.
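The assembly of the response can be sketched with the standard string functions. The get_ing function here is a stub standing in for the dictionary scan that get_verb_ing performs; its table is an assumption for illustration:

```c
#include <string.h>

/* Stub for get_verb_ing: map a root to its ING form. The
   real program finds this by scanning the dictionary for a
   record with a matching root and an ING restriction. */
const char *get_ing(const char *root)
{
    if (strcmp(root, "run") == 0)  return "running";
    if (strcmp(root, "walk") == 0) return "walking";
    if (strcmp(root, "go") == 0)   return "going";
    return root;
}

/* Build a response such as "Jim was running in the house"
   from the stored sentence elements. */
void make_answer(char *response, const char *subject,
                 const char *action, const char *place)
{
    strcpy(response, subject);
    strcat(response, " was ");
    strcat(response, get_ing(action));
    strcat(response, " ");
    strcat(response, place);
}
```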

Expanding the Processor

There are several ways this natural language processor can be expanded. The simplest would be to add words to the dictionary. Adding words would enable more words to be used in the coded, underlying structures. The number of generated responses may increase when words are added to the dictionary. More words allow more word combinations, which in turn allow more possible responses. The additional words, though, will require word restrictions that define how the words can go together properly. When expanding the processor in this way, expect to add restrictions to the dictionary and to enhance the program code that processes the restrictions.

Another way to expand the processor is to consider the sentence tense. Basically, a sentence refers to something in the past, present, or future. Sentence tense affects the words used in the sentence as well as its context and meaning. In this miniature processor, the auxiliary that helps define the tense is ignored. The generated response is even hard-coded with the auxiliary was. The first step in this expansion would be to add a tense restriction in the dictionary for auxiliary and verb words. For example, the word is would have a present-tense restriction and will a future-tense restriction. Next, the code would have to be expanded to accommodate the several kinds of auxiliary structures. For example, auxiliaries can occur as have, could have, or could have been. An overall auxiliary tense would have to be derived from these individual auxiliary words. A verb with the appropriate tense can be retrieved from the dictionary after the overall auxiliary tense has been determined.
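A first step toward tense could be sketched as a small table mapping auxiliary words to a tense code, much as a tense restriction would be added to the dictionary. The codes and table below are assumptions, not part of the listing:

```c
#include <string.h>

typedef enum {
    TENSE_UNKNOWN, TENSE_PAST, TENSE_PRESENT, TENSE_FUTURE
} Tense;

/* Hypothetical tense restrictions for auxiliary words, as
   they might be recorded in an expanded dictionary. */
static const struct { const char *aux; Tense tense; } aux_table[] = {
    { "was",  TENSE_PAST },
    { "is",   TENSE_PRESENT },
    { "will", TENSE_FUTURE },
};
#define AUX_SIZE (sizeof(aux_table) / sizeof(aux_table[0]))

/* Look up the tense restriction for a single auxiliary. An
   overall tense for structures such as "could have been"
   would still have to be derived from the individual words. */
Tense aux_tense(const char *aux)
{
    int i;
    for (i = 0; i < (int)AUX_SIZE; i++)
        if (strcmp(aux_table[i].aux, aux) == 0)
            return aux_table[i].tense;
    return TENSE_UNKNOWN;
}
```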

Expanding this natural language processor will provide additional human language capabilities. But not all processors require the same capabilities. The capabilities that are needed for a given processor depend, in part, on the kinds of input sentences and words that are expected. This natural language processor, presented in its current form, doesn't read a book or answer questions about the book. But expanding this processor will make that capability possible.

(C) 1993 by Russ Suereth.