Dr. Dobb's Digest November 2009
Python is well known among programmers and system administrators alike for its powerful libraries, ranging from web frameworks and image processing to automated workflows and gaming. A lesser-known yet extremely powerful Python library is the Natural Language Toolkit. Natural Language Processing with Python demonstrates how to leverage this toolkit to create sophisticated NLP applications. Read on for my review.
Ever since I programmed my first interactive application in BASIC on a TRS-80 nearly 25 years ago, I have dreamed of fluid, natural conversations with computers, à la science fiction stories like Arthur C. Clarke's 2001: A Space Odyssey and Philip K. Dick's Do Androids Dream of Electric Sheep? The computing world has evolved by leaps and bounds since then, though we have yet to attain that elusive vision of ubiquitous, natural conversational interaction with a computer. Text-to-speech engines are fast approaching the sound and inflection of a convincing human voice, and voice recognition has also greatly improved (my Android phone is highly accurate with short phrases - not quite 100%, but far better than what voice recognition was like even a few years ago). Still, an intelligent back-end is required to hook all these technologies together into a cohesive, effortless user experience. While Natural Language Processing with Python didn't quite attain this lofty goal, it educated me on the nuances of NLP and the difficult computing problems that need to be resolved before this futuristic vision can become commonplace.
The book starts off with the terms and concepts behind NLP and introduces the free, open source Natural Language Toolkit (NLTK), followed by installing the toolkit, downloading the NLTK book data collection, and running some simple Python scripts to show off the NLTK's functions, including a measure of lexical diversity. A fun exercise is running nltk.chat.chatbots(), which shows how NLP can interact with users in a not-quite-there Turing Test sort of way. The next 10 chapters delve into all things NLP: accessing and processing large bodies of text (both text corpora and raw formats), a quick Python primer oriented toward NLP in Chapter 4 (complete with Matplotlib and NumPy data visualization examples), and using a part-of-speech tagger and automating such tagging via regular expressions, lookups and n-gram tagging. Text and sequence classification and recognizing textual entailment (e.g., predicting the true/false relationships of text within a statement) are covered in Chapter 6. Also covered are decision trees, information gain (a measure of "how much more organized the input values become when we divide them up using a given feature"), naive Bayes classifiers ("every feature gets a say in determining which label should be assigned to a given input value"), and other techniques: zero counts, smoothing, maximum entropy classifiers, linguistic pattern modeling, information extraction architecture for unstructured data, chunking, chinking and tag patterns, tree traversal, named entity recognition (NER), relation extraction and more. Chapter 8 covers sentence structure analysis (i.e., dealing with ubiquitous ambiguity) as well as context-free, dependency and weighted grammars, with feature-based grammars discussed in Chapter 9. All this dense background comes together in Chapter 10, which applies a natural language interface to an underlying SQL-structured data source using propositional and first-order logic.
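The information gain measure quoted above can be illustrated with a minimal sketch in plain Python. It does not use the NLTK; the function names and the toy name/gender data are hypothetical, chosen only to show how splitting on a useful feature reduces entropy while splitting on a useless one does not.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(examples, feature):
    """Entropy reduction from splitting (features, label) pairs on one feature."""
    labels = [label for _, label in examples]
    # Group the labels by the value each example has for the chosen feature.
    groups = {}
    for features, label in examples:
        groups.setdefault(features[feature], []).append(label)
    # Weighted average entropy of the groups after the split.
    remainder = sum(
        len(group) / len(examples) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - remainder


# Hypothetical toy data: each example is (feature dict, label).
toy = [
    ({"last_letter": "a", "length": "short"}, "female"),
    ({"last_letter": "a", "length": "long"}, "female"),
    ({"last_letter": "k", "length": "short"}, "male"),
    ({"last_letter": "k", "length": "long"}, "male"),
]

gain_last_letter = information_gain(toy, "last_letter")  # separates labels fully
gain_length = information_gain(toy, "length")            # tells us nothing
print(gain_last_letter, gain_length)
```

In this toy set, splitting on "last_letter" leaves each group with a single label, so the inputs become fully "organized", while splitting on "length" leaves each group as mixed as before. A decision tree learner would therefore prefer the first feature.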
Chapter 10 also covers understanding the semantics of English sentences via the Principle of Compositionality, the lambda calculus, quantified NPs, transitive verbs and discourse representation structures (DRS). The final chapter, on managing linguistic data from various sources such as the web, Word document files and spreadsheets, demonstrates its techniques on the TIMIT corpus (named for Texas Instruments and the Massachusetts Institute of Technology) and concludes with an extended welcome to the Open Language Archives Community (OLAC). The book closes with an Afterword that engages the reader in the various computational challenges in state-of-the-art NLP systems, lays out the NLTK roadmap and extends a bold invitation to "build new language technologies to better serve the needs of the information society, and ultimately as a pathway into deeper understanding of the vast riches of human language." Who could turn down such an offer?!
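To give a flavor of compositional semantics with lambda expressions, here is a small sketch in plain Python rather than the NLTK's own grammar and logic machinery. The toy world, the word meanings and the helper names (name, vp, every, some) are all hypothetical; the point is only that a sentence's truth value is computed from the meanings of its parts.

```python
# Hypothetical toy model: a handful of entities and the facts that hold.
ENTITIES = {"angus", "irene", "cyril"}
PEOPLE = {"angus", "irene"}
DOGS = {"cyril"}
BARKERS = {"cyril"}
CHASES = {("angus", "cyril"), ("irene", "cyril")}  # (subject, object) pairs

# Nouns and intransitive verbs denote one-place predicates.
person = lambda x: x in PEOPLE
dog = lambda x: x in DOGS
barks = lambda x: x in BARKERS

# A transitive verb is curried: it takes the object, then the subject.
chases = lambda obj: lambda subj: (subj, obj) in CHASES

# Noun phrases take a predicate and return a truth value.
def name(entity):
    return lambda pred: pred(entity)

def every(noun):
    return lambda pred: all(pred(x) for x in ENTITIES if noun(x))

def some(noun):
    return lambda pred: any(pred(x) for x in ENTITIES if noun(x))

def vp(verb, object_np):
    """Combine a transitive verb with its object NP into a one-place predicate."""
    return lambda subj: object_np(lambda obj: verb(obj)(subj))

# Sentence meaning = subject NP applied to the verb phrase.
every_dog_barks = every(dog)(barks)
every_person_chases_some_dog = every(person)(vp(chases, some(dog)))
some_dog_chases_angus = some(dog)(vp(chases, name("angus")))

print(every_dog_barks, every_person_chases_some_dog, some_dog_chases_angus)
```

Treating every noun phrase, including proper names, as a function over predicates is one common design choice; it lets quantified NPs and names combine with verbs in exactly the same way.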
Each chapter concludes with a series of exercises of varying difficulty; unfortunately, answers to the exercises are nowhere to be found, not even on the book's website. Some of the more public-facing examples of NLP in action are popular web sites such as ask.com and wolframalpha.com. The authors do not point readers to commercial entities that have successfully incorporated the NLTK into their back-end data processing applications, but such websites no doubt employ the principles discussed in the book.
In summary, Natural Language Processing with Python delivers a solid education for any computing professional interested in the complexity and current state of the art in NLP systems. Python programmers will find the book, and the NLTK's implementation and use of NLP principles, especially Pythonic. While my dream of having an intelligent spoken-word conversation with my computer may have to wait for another 25 years of computing evolution, this book helped me understand the complexities of the problem and ways to get closer to the solution.