Optical character recognition (OCR) gets much press. Unfortunately, anyone who has spent time running typical documents through a commercial OCR package quickly realizes how far the technology is from acceptability. This predicament has spawned many ingenious algorithms, ranging from matrix matching to machine-learned fragment analysis, contour analysis, and neural networks.

The problem, however, is much bigger than these solutions suggest. There are four phases to OCR. First, a region detector cuts the page into regions, filters out dirt, and ignores white space. Second, an isolator isolates shapes inside a region, then decides whether they are candidates for character recognition or just ambiguous graphic garbage. Third, a recognizer operates on each detached shape and generates a list of best character guesses. Finally, a page reconstructor takes the raw list of characters and their locations on the page and merges them into a reasonable facsimile of the original page. This last phase usually incorporates a dictionary comparator that permutes the recognizer's guesses into reasonable words.
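The four phases can be sketched as a simple pipeline. Every name below is illustrative only, invented for this sketch; it is not the API of any real OCR package, and each phase is reduced to a placeholder that shows where the real work would go.

```python
# Sketch of the four OCR phases described above. All names are
# hypothetical -- this is not a real OCR library's API.

def detect_regions(page):
    """Phase 1: cut the page into regions, dropping white-space-only ones."""
    return [region for region in page if region.strip()]

def isolate_shapes(region):
    """Phase 2: isolate candidate shapes inside a region."""
    return region.split()

def recognize(shape):
    """Phase 3: return a ranked list of character guesses.
    A real recognizer would score many candidates; this placeholder
    returns the shape itself as its single best guess."""
    return [shape]

def reconstruct_page(guesses):
    """Phase 4: merge the best guesses back into a page.
    A real reconstructor would also consult a dictionary comparator."""
    return "".join(g[0] for g in guesses if g)

def ocr(page):
    """Run all four phases in sequence over a list of text regions."""
    chars = []
    for region in detect_regions(page):
        for shape in isolate_shapes(region):
            chars.append(recognize(shape))
    return reconstruct_page(chars)
```

Against a toy "page" of two regions, `ocr(["a b", "c"])` yields `"abc"`: the detector keeps both regions, the isolator splits the first into two shapes, and the reconstructor stitches the top guesses back together.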
Character recognition, phase 3, is the sexy phase of the procedure and, as such, gets all the attention. But the other phases are just as essential for effective character recognition. Although it sounds sequential, the process is neither discrete nor linear. For example, the isolator may pull a blob from the region and pass it to the recognizer, which says, "I don't know this at all," and passes it back. The isolator takes a second glance, decides it sees a ligature joining multiple characters, chops them apart, and sends the pieces back to the recognizer. Any effective solution must use this kind of feedback to correct discrimination errors between phases.
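The isolator/recognizer feedback loop might look like the following sketch. The `KNOWN` and `LIGATURES` tables are stand-in data invented for illustration; a real system would use a trained model and image-level shape splitting rather than string lookups.

```python
# Illustrative sketch of the feedback loop between the isolator and
# the recognizer. KNOWN and LIGATURES are hypothetical stand-in data.

KNOWN = {"f", "i", "l"}            # shapes the recognizer "knows"
LIGATURES = {"fi": ["f", "i"]}     # joined shapes the isolator can split

def recognize(shape):
    """Return a guess, or None when the shape is unrecognizable."""
    return shape if shape in KNOWN else None

def isolate_and_recognize(blob):
    """Pass a blob to the recognizer; on rejection, feed it back to the
    isolator, which chops a suspected ligature apart and retries."""
    guess = recognize(blob)
    if guess is not None:
        return [guess]
    # Feedback path: the recognizer rejected the blob, so the isolator
    # takes a second glance and splits it into its suspected parts.
    parts = LIGATURES.get(blob, [blob])
    return [recognize(part) for part in parts]
```

Here `isolate_and_recognize("fi")` fails on the first pass, splits the ligature, and recognizes both pieces on the second, returning `["f", "i"]`.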