Computational Models of American Speech M. Margaret Withgott and Francine R. Chen University of Chicago Press, 1993 143 pp. $15.95 ISBN 0-937073-98-9
Long before HAL made his film debut in 2001: A Space Odyssey, interacting with computers through speech was among the holiest of computer grails. Unfortunately, it's also been one of the toughest computer-science nuts to crack.
That doesn't mean computer scientists have given up on speech recognition. Far from it. Apple, for example, recently introduced a pair of new Macintosh computers--the Centris 660AV and Quadra 840AV--that recognize rudimentary voice commands. You can tell the 840AV, for instance, to "open Microsoft Word" and it will launch the appropriate word processing program, or you can "train" it to perform a variety of other tasks.
Still, the problems confronting speech recognitionists are formidable, and current applications remain more in the realm of novelty than practicality. On the human side, we pronounce words inaccurately and inconsistently, our dialects and languages are confusing (half the time I can't come close to figuring out what my teenage son is saying), and even a relatively small thing, like your having a little case of the sniffles, can bring a high-powered speech-recognition system to its knees. On the computer side, the processing power to collect, analyze, and interpret speech has been generally prohibitive in cost and availability (digital speech processing is starting to change this, however), the data has been inconsistent and difficult to collect, and the algorithms often inappropriate. These challenges notwithstanding, the allure of verbally commanding a computer to act remains.
Margaret Withgott and Francine Chen, authors of Computational Models of American Speech, present these problems (and others) and propose what they consider to be workable solutions. As its title suggests, this book is scholarly in tone and presentation. Still, Withgott (who's a researcher at Interval Research Corporation) and Chen (who holds a similar position at Xerox PARC) have written a monograph that's interesting and readable for anyone dabbling in speech recognition theory--and required reading for anyone doing serious work in the field.
Withgott and Chen correctly point out that much of the speech recognition work done by programmers to date has relied on small collections of data drawn from their own limited knowledge of, and experience with, language. Speech-recognition specialists, on the other hand, historically have manipulated large amounts of data for training systems without worrying about the pronunciation details in the data. Withgott and Chen attempt to close this gap by examining and developing "probabilistic and rule-based computational models of transcription data using conditioning factors drawn from theory," claiming that their work represents "one of the first attempts to bring together theoretical concerns with an analysis of a large American English database." Their ultimate goal is to create a computational system that handles "the kind of variant pronunciations one observes in large collections of transcribed American speech."
After (predictably) presenting a brief historical background on approaches to speech recognition (ranging from the mid-'70s Dragon system to the more recent Carnegie-Mellon Sphinx system), the authors zero in on this speech database. Although unfamiliar to me, the speech database "TIMIT," jointly developed in the mid '80s by Texas Instruments and the Massachusetts Institute of Technology, is central to much of today's speech-recognition research. This database, available from the National Institute of Standards and Technology on CD-ROM, contains 6300 sentences produced by 630 speakers, recorded by TI and transcribed by MIT. (When Withgott and Chen wrote their book, the CD-ROM only contained 4300 utterances; it's since been updated. For more information on TIMIT, see the accompanying textbox entitled "The TIMIT Speech Database.") Although the authors fault TIMIT because the speakers read sentences instead of speaking spontaneously, they don't question its value for speech-recognition exploration. In addition to providing speech, the database provides visual spectral representations of the utterances so that you can watch patterns on the computer screen.
Withgott and Chen spend a fair amount of time examining spoken language data structures (phonetic structures, probabilistic pronunciation networks, and the like) and rules (context descriptors) for predicting possible pronunciations. In doing so, they describe how they apply these rules to TIMIT data primarily using a rule interpreter (implemented by Steven Bagley).
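To give a feel for what a context-sensitive pronunciation rule looks like, here is a toy sketch of one well-known rule of American English, "flapping" (a /t/ or /d/ between a stressed and an unstressed vowel is often realized as a flap). This is purely illustrative; it is not the authors' rule interpreter (implemented by Steven Bagley), and the phone symbols are a simplified ARPAbet-style set chosen for the example.

```python
# Illustrative only: one toy context-dependent phonological rule,
# not the rule interpreter described in the book.
# Rule (flapping): /t/ or /d/ between a stressed vowel and an
# unstressed vowel is realized as a flap, written here as "dx".

VOWELS = {"aa", "ae", "ah", "eh", "ih", "iy", "uw", "er", "ax"}

def apply_flapping(phones):
    """Return a copy of `phones` with the flapping rule applied.

    `phones` is a list of (symbol, stressed) pairs, e.g.
    [("b", False), ("ah", True), ("t", False), ("er", False)]
    for a dictionary pronunciation of "butter".
    """
    out = list(phones)
    for i in range(1, len(phones) - 1):
        sym, _ = phones[i]
        prev_sym, prev_stressed = phones[i - 1]
        next_sym, next_stressed = phones[i + 1]
        if (sym in ("t", "d")
                and prev_sym in VOWELS and prev_stressed
                and next_sym in VOWELS and not next_stressed):
            out[i] = ("dx", False)   # realized as a flap in this context
    return out

butter = [("b", False), ("ah", True), ("t", False), ("er", False)]
print(apply_flapping(butter))
# The /t/ in "butter" surfaces as the flap "dx".
```

A real system would, of course, carry dozens of such rules and richer context descriptors; the point is only that each rule conditions a rewrite on the surrounding phonetic context, which is exactly the kind of knowledge the authors encode and then test against TIMIT transcriptions.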
At the heart of the book is an algorithm for computing "context trees"--decision trees in which the number of groups of contextual factor values is determined from the data. (Judging from the extensive reference section at the end of the book, this is a topic Chen, in particular, has spent several years investigating.)
A context tree is an n-ary decision tree which provides a representation for modeling the relationship between contextual factors and the variant pronunciations of a dictionary symbol in different contexts. Decision trees have been used for both interpretation and classification of data. Given a data set, decision trees partition the data set and can be formed automatically. The resulting trees can be converted to rules, which is convenient if one wishes to analyze the phonological rules encoded in a tree or compare them with a hand-derived set of rules.
In short, context trees are a specialization of the familiar decision tree. They organize the values of contextual factors, providing a representation of the relationship between a context and a data element for classification and prediction purposes. Figure 1 (adapted from the book) illustrates a context tree. The authors describe the figure this way:
Each non-terminal node of a context tree is labeled with a contextual factor. In the figure, node 1 corresponds to the contextual factor word-boundary-type. The branches from a node are labeled with mutually exclusive sets of values of the contextual factor and each branch leads from the parent node to a child node. The top branch of node 1 represents the values initial and initial-and-final of the factor word-boundary-type. The middle branch corresponds to the mutually exclusive value final. The context of a terminal node is defined by the contextual factor values encountered in traversing the tree from the root node to the terminal node. For example, terminal node 5 represents the context word-final with primary or secondary stress. Each terminal node of a context tree encodes the distribution of phonetic elements in each context. In general, more than one phonetic element occurs in a context because realizations of a dictionary symbol are not deterministic. Rather than predicting only the most likely phonetic element in a context, the probability of each of the different possible phonetic elements is enumerated.
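The structure the authors describe can be sketched in a few lines of code. The following toy context tree follows the shape of their description--non-terminal nodes labeled with a factor, branches labeled with mutually exclusive value sets, terminal nodes holding a distribution over phonetic realizations--but the factor names, values, and probabilities here are invented for illustration; they are not taken from the book's TIMIT-derived trees.

```python
# A toy context tree for one dictionary symbol (say, /t/), shaped like
# the description quoted above. All numbers are made up for illustration.

context_tree = {
    "factor": "word-boundary-type",
    "branches": [
        # Each branch is labeled with a mutually exclusive set of values.
        ({"initial", "initial-and-final"},
         {"distribution": {"t": 0.85, "tcl t": 0.15}}),   # terminal node
        ({"final"},
         {"factor": "stress",                             # nested factor
          "branches": [
              ({"primary", "secondary"},
               {"distribution": {"t": 0.60, "dx": 0.40}}),
              ({"unstressed"},
               {"distribution": {"dx": 0.70, "t": 0.30}}),
          ]}),
    ],
}

def predict(tree, context):
    """Traverse the tree using `context`, a factor -> value mapping,
    and return the distribution of phonetic elements at the terminal
    node -- the full distribution, not just the most likely element."""
    while "distribution" not in tree:
        value = context[tree["factor"]]
        for values, child in tree["branches"]:
            if value in values:
                tree = child
                break
        else:
            raise KeyError("no branch for %s=%s" % (tree["factor"], value))
    return tree["distribution"]

print(predict(context_tree, {"word-boundary-type": "final",
                             "stress": "primary"}))
# -> {'t': 0.6, 'dx': 0.4}
```

Note that the lookup returns a probability distribution rather than a single phone, mirroring the authors' point that realizations of a dictionary symbol are not deterministic. The hard part--which this sketch omits entirely--is their algorithm for inducing such trees from data, including how factor values get grouped onto branches.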
Obviously, a complete description of the context tree algorithm is beyond the scope of this article. The algorithm does, however, seem to have more general use than just speech recognition.
It comes as no surprise that many of the problems inherent in speech recognition are similar to those faced in other types of recognition. As we found out last year with the DDJ Handwriting Recognition Contest, a valid collection of data samples is critical. (GO Corporation, for instance, has perhaps one of the most extensive databases of handwriting samples in the industry--and they guard it almost as closely as their recognition algorithms.)
Computational Models of American Speech is not the place to start if you're new to speech recognition; it's just too narrowly focused. However, if you're really serious about speech recognition, it's a book you'd be well advised to pick up.
The TIMIT speech database is designed to provide speech data for the acquisition of acoustic-phonetic knowledge, and for the development and evaluation of speech recognition systems.
TIMIT contains speech from 630 speakers from eight major dialects of American English, each speaking 10 phonetically rich sentences. The TIMIT system includes time-aligned orthographic, phonetic, and word transcriptions as well as speech waveform data for each sentence-utterance. The project was a joint effort among the Massachusetts Institute of Technology, SRI International, and Texas Instruments. The speech was recorded at TI using a Sennheiser head-mounted microphone in a quiet environment, digitizing the speech at a 20 kHz sampling rate and then downsampling to 16 kHz for distribution. The data was transcribed at MIT, and then verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST).
All of the phonetic transcriptions have been hand verified and approximately 2 percent of the phonetic labels have been changed from earlier releases. New test and training subsets have been selected and specified. These subsets are balanced for phonetic and dialectal coverage. The directory structure has been simplified and the speech waveform files are formatted with the NIST SPHERE header structure. A revised version of the SPHERE speech file header software is also included. Online documentation provides a description of the tabular computer-searchable information.
Copies of the TIMIT database are available on CD-ROM through the National Technical Information Service (NTIS). Specify the NIST Speech Disc 1-1.1, NTIS Order No. PB91-505065. The domestic price is $100.00 (international price, $300.00). Contact NTIS, Springfield, VA 22161, 703-487-4650, or the Linguistics Data Consortium, University of Pennsylvania, 609 Williams Hall, Philadelphia, PA 19104.
--J.E.
Copyright © 1993, Dr. Dobb's Journal