March 1990/An Introduction To Speech Recognition

Artificial Intelligence

An Introduction To Speech Recognition

B.J. Gleason

B.J. Gleason is an Assistant Professor at Upsala College in New Jersey, and holds a master's degree in computer science. He is currently working on an Ed.D. in computer education at Nova University. Contact Mr. Gleason care of the Computer Science Department, Upsala College, East Orange, NJ 07019.
"Open the pod bay door, Hal."
"I'm sorry, Dave, I'm afraid I can't do that." —
2001, A Space Odyssey
"Shields Open." —Batman
One of the friendliest user interfaces should be voice. At some point in the future, we will be able to talk to our computer system and have it understand us. In the two examples above, the computer understood the spoken word. In Hal's case, he could also lip-read (An article for this is currently in progress.). With the Batmobile, the speech recognition is more realistic, if not the automatic pilot in the car.
In the above paragraph, I use the word "should" to describe voice as a friendly interface. It should be, but at this time, it isn't. Not for lack of trying, however. Speech recognition (SR) is not yet in the home although it has been around for about 20 years. SR is still unreliable and not easy to use. Yet.
Speech recognition is a wide field that must be broken down if it is to be understood. There are two portions to a speech recognition (SR) system. The first is the recognition portion. This is almost easy. The person says something and the recognizer returns the words that were spoken. The second part, the understanding of what was said, is much harder. The second portion falls into the area of Natural Language Processing. In the Batman example, the understand portion is easy. In the 2001 example, it is harder. For example, Hal must not only know how to open the door, but why the door is to be opened. Hal understands this, and realizes that he must not open the door for Dave.
In this article, I will describe the first portion, the recognition procedure. It is beyond the present scope to describe Natural Language Processing. We will take a look at the current techniques used in SR and I will provide you with everything you need in order to experiment with SR on your own.

Intro To Speech Recognition
Most SR systems belong to one of two fundamental classes: speaker dependent or speaker independent.
With speaker dependent systems, the user must first "train" the system to recognize his or her voice. During training the system displays a word, the user pronounces it, and the system saves the resulting voice pattern. Once a sample of all the required words is stored, the user may begin using the system. When the user speaks, the system will take the unknown voice pattern and compare it to the patterns stored in memory. The word associated with the closest matching pattern is returned as being the word the system believes the user spoke. This is the most common technique.
Speaker dependent SR units are capable of 91% to 99% accuracy. While the training is time consuming, this technique is economical. You can pick up one of these units for less than 100 dollars. If you are electronically inclined, you can build one for less than 50 dollars.
A speaker independent system requires no training. It should recognize words as soon as it is turned on. Speaker independence is much harder to accomplish. The voice pattern is broken down and analyzed for certain features — features chosen to distinguish among the words to be recognized.
Speaker independent systems are very attractive, and will become more so in the future. At this time however, most speaker independent units are expensive, achieve accuracies of only 85% to 95%, and often recognize fewer words as well. Speaker independent systems require more processing power to do the analysis, require dedicated processors and start in the 500 dollar range.
Radio Shack is now selling a speaker independent isolated word recognizer chip for about 10 dollars. Build an amplifier circuit for it and it will recognize nine different words. This particular chip is used in a number of voice activated toys.

Specifications
How many words do we need to recognize? The ultimate dream, of course, is the voice typewriter which would have to be speaker independent and to recognize about 50,000 or so words. But many applications can get away with much less. A simple voice calculator for example, would need to recognize only 10 digits, four operations, a decimal point, and an equal sign — only 16 words. Several companies use voice inventory systems. These require the digits and a few commands, again around 16 or so words.
Many applications need only a few words. A small vocabulary has important advantages. Larger vocabularies require more memory and more processing time to find a match. With larger vocabularies, accuracy will typically decrease. Given the current state of the art, it would be best to limit the vocabulary to the smallest possible number of words.
Speech habits also affect recognition systems. Most humans talk in what is known as continuous speech. Most SR units depend upon isolated speech. Research has shown that it is very difficult to separate the words in continuous speech. The pauses between words in continuous speech are very short — sometimes even shorter than the natural pauses that occur within words. The emphasis and accents common to continous speech create additional problems. There is typically a marked difference between the pronunciation of a sentence and a collection of words.
Dictating to an isolated word recognizer for dictation takes some getting used to. You must pause, typically for one tenth of a second or more, between words. Until you get used to it, it can be frustrating.

The Hardware
In order to talk to your computer, we need a device to translate voice waves into binary values. Figure 1 shows a block diagram of a typical SR unit. The microphone is fed into a preamp circuit. The output of the preamp is fed into two bandpass filters. The lowband filter has a range of 150Hz to 900Hz. The highband filter has a range of 1Khz to about 4Khz. The output of the bandpass filters is a sinewave whose frequency falls within the range of the associated filter.
Using these bandpass filters we can isolate the high and low frequencies of the utterance. More sophisticated SR units might have more bandpass filters to include midranges as well.
The output of the bandpass filters goes into a zero crossing detector (ZCD) and a rectifier/averager (R/A) circuit. The ZCD is a comparator that tests its input against a reference voltage. When the signal is above the reference voltage, the output is one; when less than the reference voltage, it is zero. The timing output of the ZCD will approximate the input frequency.
The R/A circuit converts the average RMS AC signal to its DC equivalent. This signal is then inverted and fed into a comparator set to slightly less than the reference voltage. With no speech, the normal output would be zero. When speech enters the system, the output of the comparator will go to one, indicating that voice input is present.
These four outputs are fed into a parallel port and read by the computer.

Speaker Independent Systems
Now that we can get the speech into the computer system, we need to process it. There are several different techniques for speaker independent speech. The technique I will describe is phonetic analysis.
The sound that we produce to form words can be broken down into several categories:

Pure Voice Vowels (V): a,e,i,o,u,uh, aa,ee,er,uu,ar,aw

Nasal (N): m,n,ng

Voice Fricative (VF): z,zh,v,dh

Unvoiced Fricative (UF): s,sh,f,th

Plosive (P): b,d,g,p,t,k,h

Glide (G): r,w,l,y
Our speaker independent board and program would first take the speech utterance and break it down into these categories. For example, vowels sounds are continuous and generate low frequency energy, whereas fricatives are continuous but generate high frequency energy. The software must look for the characteristics of each group in the input speech. It would then generate a sequence of these phonetic grouping and look up the word in a dictionary:

0 VF-V-G-V YES G-V-UF 1 G-V-N NO N-V 2 P-V START UF-P-V-P 3 UF-G-V STOP UF-P-V-P 4 UF-V-G 5 UF-V-VF 6 UF-V-P-UF 7 UF-V-VF-N 8 V-P 9 N-V-N
This small table starts to illustrate some of the problems with this technique. As we add words, correctly grouping the phonemes becomes a parsing nightmare. Notice that the phonetic makeup for START and STOP are the same! How can we tell the difference? We can't. Not easily anyway. Larger vocabularies well generate even more "collisions". In a 20,000 word dictionary, there are only about 2000 unique phonic combinations. 10,000 of the words are five phonetic symbols or less.
Another technique, template averaging, reads in several copies of a word and then finds the major features of the word. This technique is a mix between speaker dependent and independent. In the training (similar to speaker dependent) phase, five people save the same set of words. The program finds the major features and uses these collective patterns as the model. The system is then speaker independent and will recognize almost any voice pronouncing the word.

Speaker Dependent Systems
Speaker dependent systems are reasonably straightforward. They all work on the same principle. In the training phase, the sounds are stored in a reference template. In the speech phase, the unknown sound is compared against the known sounds for the closest match.
I have prepared a program and data files that will allow you to experiment with an isolated word, speaker independent system. The program appears in Listing 1. The data files are available on the code disk for this issue.

How It Works
If we are training or analyzing the speech, the same process goes on when a person talks into the microphone.
To capture sound and reproduce any waveform, we must sample it at twice the highest frequency (the Nyquist rate). For speech we must capture and store 8000 four-byte samples per second. At this rate, we would need 32K of storage for a single word. Most SR units don't really sample at the Nyquist rate, since they don't reproduce the sound. SR units commonly capture only 100 four-byte samples per second. Assuming a maximum of 1.5 seconds of speech per utterance, each word will require less than 600 bytes of storage.
The comparator circuit converts the output of the bandpass filters into square waves at a relative frequency. The software counts the number of square waves in both the low and the high bandpass filters during each 0.01 second interval, producing a one-byte approximation to the frequency in each band. These numbers, along with the values of high and low energy (four values all together), are stored in a "raw speech buffer". Up to 150 samples can be stored, enough for about 1.5 seconds of speech. The beginning of speech is indicated by at least 0.1 seconds of sound, and the end of speech is at least 0.1 seconds of silence.
The "raw" bytes produced by this process are still not very useful, as they may still be "distorted" by "time warping." If you say the word ONE twice and each time say it slightly slower, each repetition will appear to be a longer word. To compensate for this variance, we time-normalize the word so that it can be matched against faster or slower pronunciations.
Once the end of speech is detected, the buffer is broken down into 16 evenly spaced points, based on the total length of the word. At each of the 16 points, a value is taken from the high and low energy, as well as the two zero crossing detectors. The final result is a 64-byte template of the utterance.
Time normalization forces all the templates to be the same size, making it easier to compare against one another.
A template constructed during the training phase would be stored in an array, indexed to the number of a word. A template constructed while analyzing the speech is instead compared to all the known templates in the array.
During the recognition phase, the speech input is reduced to a template which is compared against the stored templates. As the unknown template is compared against each stored template, a difference value — the minimum difference — is computed.
During the matching process, each minimum difference is also compared against a rejection limit; if the program can't find a match with a minimum difference at least as small as the rejection limit, it will respond "unknown word" instead of making a wild guess. Thus, the rejection limit sets the degree of "confidence" required to announce a match. Higher rejection limits will result in more erroneous matches; lower limits will result in more "unknown word" responses.
Calculating all the minimum differences is time consuming. We can avoid some of these calculations by abandoning the calculation as soon as the partial result exceeds the rejection limit. We can get even better results by "remembering" the best minimum difference calculated so far and abandoning the calculation as soon as the partial result exceeds this limit. Both optimizations can be folded into one test if during the search the "rejection limit" is adjusted dynamically so that it is either the programmed rejection limit or the best minimum difference seen during the current matching attempt.
We can enhance the rejection process by using the delta difference technique. As we calculate the differences of all the words to the unknown sample, we may get two words that are very close to each other but both under the rejection limit. The delta difference technique requires that the correct word "beat" the other words by a certain amount. If the delta difference is set to 10, and one template difference is 124 and the other is 129, both candidates will be rejected. In general, the greater the delta difference between the two best choices, the better.
Once a speaker dependent system is trained, it is tested. If a particular word generates a large number of errors, the user may re-train on that particular word.

Operation Considerations
To get the best results with any voice recognition circuit, remember a few simple guidelines: speak slowly and clearly; try to repeat the words as consistently as possible; operate it in a quiet environment; be careful when choosing your word lists, avoiding words that sound alike; and hold the microphone close to your lips.
The smaller the list of words, the greater the accuracy of the system. For example, for a game of Hi/Lo, only four words are needed: higher, lower, yes, and no. By clearing the other templates, your recognition of these should be 100 percent.
One can also increase recognition by storing two templates (derived from different training sessions) in the vocabulary. This will give you two chances to match the word.

The Program
The main program, called SPEECH.C, is the simplest version of the speech recognition algorithm. This program will work quickly with reasonable accuracy. The code was written in Turbo C v2.0 and has been written for clarity rather than speed or compactness.
The program is menu-driven. It has options to load and train a set of voice (data) files. With the "Train" option you may vary the voice files individually. The "Debug" option (normally on) will display the waveforms of the words and indicate the elements being extracted for the templates. During the "Perform" choice, the debug option will also display the minimum difference table along with the delta difference. With the debug option off, the system will just display which word is recognized.

Data Files
The disk includes data files for the digits zero through none and a telephone number (with area code). These files are taken from the raw speech buffer. These files are included so that you can experiment with voice recognition algorithms without having the hardware. The function getvoice() on my system will wait for a word to be spoken and place the raw speech into an array. On your system, it will open up and read the data file and place the raw speech into an array.
If the debugging feature is turned on, a character-based plot of the wave forms will be generated on the screen in a vertical fashion as they are read in. If this drawing is too crude for your tastes, you can import these files into a spreadsheet program, such as Lotus, and plot them.
You can use the digit files to train the SR system, and then use it to recognize the phone number.

Alternate Techniques
I include two sets of digit data, each set captured during a separate speech session. You can use this extra data to experiment with techniques that may increase the accuracy of the system:
Duplicate Entries — Train the system with both sets, so you have twenty templates. During recognition, if the template matched is greater than nine, subtract 10 to get the "real" number.
Template Averaging — Take two samples of each digit, average them together and use the result as the template.
Input Averaging — Average each point on each band to the point directly ahead and behind. This will help to smooth the band and eliminate noise that can creep into the system.
Amplitude Normalization — On some system, the volume of the sound can greatly effect the signals coming from the speech board. You can eliminate most of this effect with normalization. Calculate the average of an entire band, compare this to a standard amplitude factor (for example, 20), then multiply this factor by each element of that band. This will have a sliding effect on the band. Alternatively, you can just subtract the average from each element.
Interpolation — When extracting the 16 samples from the raw speech buffer, I calculated the precise position, but used the nearest element. For a more exact representation of the wave form, a new sample should be interpolated from the adjacent two samples.
Number of Samples — I have used 16 normalized samples. You can vary this number up or down and test the results. The larger values save more information but increase the processing time. Smaller values increase processing speed, but make for closer delta differences.

Speech Recognition With Natural Language
Many databases now have "natural" language interfaces so that you can ask questions such as "What is the highest mountain in New Jersey?"
While many of these interfaces seem natural, most of them have a hidden structure behind them. For example, in the Batmobile after the word "Shields", the system will probably accept "Open", "Close" and nothing. If only the word "Shields" is spoken, the shields will close by default.
The SR unit would have the words Shields, Open, Close, Stop, and a wide (we assume) list of other words. To help cut down on processing time, we can eliminate illogical words as we parse the sentence. "Shields Stop" would be illogical. When we identify the word "Shields", we compare the next utterance to "Open" and "Close" only. If no word is spoken, or the word is rejected, we close the shields.
Natural language processing also helps with the "To, Too and Two" problem of similar sounding words. If we said "I want to go too", the NL processor would be able to sort out the correct usage based on the context in which the word was said.

Acknowledgements
The author would like to thank Larry Eckelkamp and Nuala Murphy for their help with this article.

Bibliography
Ainsworth, W. A., Mechanisms of Speech Recognition, Elmsford, NY: Pergamon Press, 1976.
Carter, John P., Electronically Hearing: Computer Speech Recognition, Indianapolis, IN: Howard W. Sams & Co., Inc., 1984.
Staugaard, Jr. Andrew C., Robotics and AI, Englewood Cliffs, NJ: Prentice-Hall, Inc., 1987.