Graphics


Detecting Text Regions Using Cellular Automata

David Weber


David Weber has been writing software for the last 22 years and managing projects for a decade. He is currently an independent algorithm designer hiding out in rural Colorado.

Cellular automata are a powerful tool. You will find them used in a wide range of disciplines, including physics, simulation modeling, statistical mechanics, and aerodynamics. Any problem expressed as local effects between neighboring cells is a candidate for cellular automata. In this article I present an algorithm for finding text regions on a bitmapped image of a page using cellular automata.

This algorithm takes advantage of the fact that Western languages are essentially horizontal. Nothing in the algorithm rules it out for a vertical language such as Chinese; you just need to make a few modifications. I will note these changes where they apply.

The Algorithm

I adjusted the rules of the original cellular automaton, J.H. Conway's game of Life (see the box "How Cellular Automata Works"), to find text regions. First, because I want rectangular regions, I ignore the cell neighbors along the diagonals. Instead of an octet of neighbors, I have a quadrant: the cells above, below, left, and right of the cell. This helps square out the regions. It also enhances speed by reducing the number of calculations. Second, I throw out the rule that causes death by overcrowding. Overcrowded, filled-in regions are what I want, so I impose no penalty for them. Third, I encourage fertility, making it easy to birth new cells and fill the interstices. Finally, I don't worry about the edges. Edge effects in this variation are self-canceling; they do not change the outcome.
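The modified rule set can be sketched in a few lines of C. The birth and survival thresholds below are illustrative guesses, not the values from zone.c; what matters is the shape of the rules: a quadrant neighborhood, no death by overcrowding, and easy birth.

```c
#define W 8
#define H 8

/* One generation of the modified automaton.  The neighborhood is the
 * quadrant (above, below, left, right); diagonals are ignored.
 * Illustrative thresholds:
 *   birth:    a dead cell with two or more live orthogonal neighbors
 *   survival: a live cell with at least one live orthogonal neighbor
 *   (there is no death by overcrowding)
 */
static void step(unsigned char cur[H][W], unsigned char next[H][W])
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            int n = 0;
            if (y > 0     && cur[y-1][x]) n++;
            if (y < H - 1 && cur[y+1][x]) n++;
            if (x > 0     && cur[y][x-1]) n++;
            if (x < W - 1 && cur[y][x+1]) n++;
            if (cur[y][x])
                next[y][x] = (n >= 1);   /* lonely cells die; crowding never kills */
            else
                next[y][x] = (n >= 2);   /* easy birth fills the interstices */
        }
}
```

A dead cell caught between two live neighbors is born, which is exactly the interstice-filling behavior that merges scattered text samples into solid regions.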

The CUJ online source code includes the full code for a demonstration exercise. Like most real projects, the bulk of the software deals with the user interface. zone.h (Listing 1) and zone.c (Listing 2) contain the actual guts of the machine. I took two liberties in the example code. I turned off the vertical-line cutter (see the box called "How OCR Works") so people would not wonder why pictures were being chopped. I also sampled the image while I scaled it for display. Usually, you would read the entire image into memory and sample it from there. The memory image is then available for future operations like filtering or OCR. Since images of pages are quite large, this means storing the page in EMS/XMS memory or using a 286/386 extender. I wanted the demonstration program to run on anything from CGA to SVGA and not be constrained by memory requirements. So I sampled the page on the fly.

You may see where judicious use of pointers, instead of array indexing, will lead to speed. I saw some of them, too, but leaned toward clarity of expression rather than raw results. In a commercial implementation, I would sacrifice at the altar of performance, but keep a clear exposition of the algorithm in my bottom drawer.

Detecting Text Regions

Region detection, the first phase of the OCR process, presents several problems. Text page format, in Western languages, consists of regions of horizontal lines separated by vertical white space called column breaks. You need to find these column breaks and use them to differentiate regions. Text varies widely in size, typically from six to 24 points. To complicate matters, many pages include a random spattering of pictures, logos, and graphics across the page, conditions that can seriously hamper OCR.

Forms introduce more problems for the region detector. Sometimes lines and boxes separate regions of text. The lines are convenient for the human eye, drawing attention to restricted regions. The computer finds them a hindrance. They are just noise in the way of the characters.

There are five steps to detecting text regions: page sampling, removing vertical lines, blocking text regions with cellular automata, extracting rectangular zones, and sequencing the zones.

Page Sampling

Proper sampling is the key to effective OCR. Mess up the sample and everything else falls apart. Sample too frequently and the regions spread out beyond the capacity of cellular automata to coalesce them. Sample too infrequently and everything collapses into one big region. I developed sample_page from empirical testing of a range of typical pages. The exact sizes of sample_page's parameters are important but not critical; the algorithm is flexible enough to adapt to a variety of inputs without breaking down.

Looking at the sampling intervals tells you two things. First, the size of the parameters depends on the DPI (dots per inch) density of the original page. For demonstration purposes, I hardcoded midrange values for a 200 DPI page (see Listing 1). For a 300 DPI page, you would scale up these parameters. Similarly, for a 100 DPI page, you would decrease them. If the source page has a non-square density, like a 200 x 100 DPI FAX, the sampling values you choose should be proportionally out of square.

Second, sampling rates are not identical in both directions. The horizontal axis uses a greater sampling frequency (ten times) than the vertical axis (eight times), because Western text is primarily horizontal. In this application, I emphasize horizontal phenomena. If the target language were Chinese, the vertical sampling would be more frequent than the horizontal.

For more efficient page sampling, I double the sampling rate in the Y direction and OR pairs together, instead of just increasing the Y sample frequency (see Listing 2). This technique prevents text from going unnoticed while doing a low frequency scan down the page.
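The OR-paired sampling might be sketched like this. The function name follows the article, but the signature, the one-byte-per-pixel input, and the step values (200 DPI midrange: every tenth pixel across, every eighth row down) are assumptions for a compact illustration; the real code in Listing 2 differs in detail.

```c
#define X_STEP 10
#define Y_STEP 8

/* Sample a page buffer (one byte per pixel: 0 = white, nonzero = black)
 * into a coarse grid.  For each output row, two source rows half a step
 * apart are read and ORed, so a thin line of text lying between coarse
 * sample rows still registers.
 */
static void sample_page(const unsigned char *img, int w, int h,
                        unsigned char *out, int ow, int oh)
{
    for (int sy = 0; sy < oh; sy++) {
        int y0 = sy * Y_STEP;
        int y1 = y0 + Y_STEP / 2;          /* doubled rate: the in-between row */
        if (y1 >= h) y1 = y0;
        for (int sx = 0; sx < ow; sx++) {
            int x = sx * X_STEP;
            /* OR the pair so a mark on either row is caught */
            out[sy * ow + sx] = img[y0 * w + x] | img[y1 * w + x];
        }
    }
}
```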

By choosing bit testing or byte testing, depending on the sample rate, I can minimize the chance of missing a piece of text and maximize the width of my sampling glance without overlapping sequential samples. If the sample rate is below eight, I look at bits; otherwise, I look at bytes.
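As a sketch, with a 1-bit scan line packed eight pixels per byte, most significant bit first (an assumed but common layout):

```c
/* Test one sample position on a packed 1-bit scan line.  For fine steps
 * (under eight pixels) an individual bit is tested; for coarse steps a
 * whole-byte test gives the widest glance that still cannot overlap the
 * next sample.
 */
static int sample_dark(const unsigned char *row, int x, int step)
{
    if (step < 8)                              /* fine step: test one bit */
        return (row[x >> 3] >> (7 - (x & 7))) & 1;
    else                                       /* coarse step: any ink in the byte */
        return row[x >> 3] != 0;
}
```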

Lastly, I store each sample in the buffer as a byte, not as a bit. This wastes some space, but enhances speed. It also allows me to use the extra bits as generation flags when iterating over the page. A typical letter-size page scanned at 200 x 200 DPI has a raw image size of about half a megabyte. The sampled image, on the other hand, fits in a 46KB buffer. It is possible to do cellular automation over the original image at the original size but it is abysmally slow. The sampled image gives the same results as the full page but with much greater speed.

Removing Vertical Lines

With a sampled page in the buffer, the next step removes vertical lines. cut_vertical_lines removes boxes enclosing areas of text. A box has both horizontal and vertical lines, but only vertical lines need removing. Once the vertical lines disappear, the outline breaks, exposing the text to cellular combination. Trying to cut the horizontal lines of a box is tricky. After sampling, small text (six-point) looks much like a line. Any attempt to cut the horizontal lines will probably remove some text along with the lines. Chinese, of course, is the opposite: it requires cutting horizontal lines and leaving vertical lines alone.

Before removing the lines you must decide on the minimum length of a vertical line suitable for cutting. Choose a line too short and you cut up large text fonts. Make it too long and you miss small boxes. The value I chose is arbitrary. It seems to function well over a spectrum of forms.
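A minimal cutter in this spirit might look like the following. The minimum-run threshold is a parameter here, whereas the article's own value is hardcoded in the listings.

```c
/* Erase long vertical runs of black samples from the sampled buffer
 * (one byte per cell, row-major).  Any unbroken vertical run at least
 * min_len cells long is assumed to be a box edge and is whited out;
 * shorter runs, which may be large type, are left alone.
 */
static void cut_vertical_lines(unsigned char *buf, int w, int h, int min_len)
{
    for (int x = 0; x < w; x++) {
        int run = 0;
        for (int y = 0; y <= h; y++) {           /* y == h flushes the last run */
            int black = (y < h) && buf[y * w + x];
            if (black) {
                run++;
            } else {
                if (run >= min_len)              /* erase the run that just ended */
                    for (int k = y - run; k < y; k++)
                        buf[k * w + x] = 0;
                run = 0;
            }
        }
    }
}
```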

The optional step of removing vertical lines has its hazards. A picture with lots of dark space will look much like a cluster of vertical lines. Therefore, the cutter will chop it up. If you want to extract graphics from the page, pull them first before cutting the vertical lines. For OCR purposes, it doesn't matter since the recognizer discards pictures anyway.

Blocking Text Regions with Cellular Automata

After sampling the page and possibly cutting out vertical lines, you need to block text regions with cellular automata. block_zones is the heart of the algorithm I am presenting. To achieve accurate region detection you make a single pass over the sample area with a cellular automaton filter, then traverse it with a coalescing filter that fills the gaps.

The cellular automaton pass filters out dirt and enhances text blocks. Multiple cellular iterations will improve the image, but you pay with time. Like image convolutions, cellular automata make many visits to each pixel. In this algorithm we encounter each pixel five times, either as a neighbor or as a candidate cell. People with faster processors or more patience might want to experiment with multiple iterations.

Rather than having two toggle buffers for cellular automation, I use a single buffer. Bit 7 and bit 0 distinguish between current and future states of life and death. After processing the sample, I take another pass to bury the pixels slated to die and deliver all pixels destined for birth.
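The single-buffer trick can be sketched as follows. The bit assignments (bit 0 for the current state, bit 7 for the future state) follow the text; the rule thresholds are again illustrative, not taken from zone.c.

```c
#define ALIVE 0x01   /* bit 0: current generation */
#define NEXT  0x80   /* bit 7: state in the next generation */

/* One generation over a single buffer.  The rule pass reads only the
 * ALIVE bit, so cells already marked for birth do not disturb their
 * neighbors' counts; a second pass then buries the dead and delivers
 * the births by promoting bit 7 down to bit 0.
 */
static void generation(unsigned char *buf, int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int n = 0;
            if (y > 0     && (buf[(y-1)*w + x] & ALIVE)) n++;
            if (y < h - 1 && (buf[(y+1)*w + x] & ALIVE)) n++;
            if (x > 0     && (buf[y*w + x - 1] & ALIVE)) n++;
            if (x < w - 1 && (buf[y*w + x + 1] & ALIVE)) n++;
            int alive = buf[y*w + x] & ALIVE;
            if (alive ? (n >= 1) : (n >= 2))    /* illustrative thresholds */
                buf[y*w + x] |= NEXT;
        }
    for (int i = 0; i < w * h; i++)             /* promote future to present */
        buf[i] = (buf[i] & NEXT) ? ALIVE : 0;
}
```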

After cellular automation comes coalescence. A user input parameter, coarseness, ranging from 0 to 5, establishes how pixel clusters coalesce into regions. A coarseness of zero produces many small regions while a coarseness of five creates a few large regions. Coalescence works by looking at neighbors some distance from a pixel. For example, a coarseness of 3 closes horizontal and vertical gaps of three or fewer pixels. The coalescence process takes two passes. Horizontal coalescence goes first, then vertical. Why not save time and combine these into a single pass? Again, our need for horizontal preference requires separate passes. Chinese, once more, inverts the order, having the vertical pass first and the horizontal second.
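A sketch of the two-pass coalescence, assuming one byte per sample cell:

```c
/* Fill any gap of up to `coarseness` white cells lying between two black
 * cells along one line (a row or a column, selected by the stride).
 */
static void close_gaps_1d(unsigned char *cell, int n, int stride, int coarseness)
{
    int last_black = -1;
    for (int i = 0; i < n; i++) {
        if (cell[i * stride]) {
            if (last_black >= 0 && i - last_black - 1 <= coarseness)
                for (int k = last_black + 1; k < i; k++)
                    cell[k * stride] = 1;       /* fill the small gap */
            last_black = i;
        }
    }
}

/* Horizontal pass first, then vertical: the horizontal-preference order
 * for Western text.  A Chinese-oriented version would swap the passes.
 */
static void coalesce(unsigned char *buf, int w, int h, int coarseness)
{
    for (int y = 0; y < h; y++)
        close_gaps_1d(buf + y * w, w, 1, coarseness);
    for (int x = 0; x < w; x++)
        close_gaps_1d(buf + x, h, w, coarseness);
}
```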

Extracting Rectangular Zones

Extracting rectangular zones from the blocked out regions is the next step of the algorithm. extract_zone starts in the upper left corner of the page and proceeds downwards in raster scan fashion. When it hits a black region, it walks the region's perimeter, and saves the minimum and maximum X/Y values of the perimeter. This defines a rectangular zone. If the enclosed region is significantly large, the program saves the region on a zone list. extract_zone discards any small fry. A region, once extracted, should not be reconsidered. To prevent this redundancy, the program whites out the region after extraction. It repeats this process down the page, gathering the zones onto a list.
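For illustration, here is the same extract-and-white-out idea using a small explicit-stack flood fill in place of the article's perimeter walk. The perimeter walk touches fewer cells on large regions, but the flood fill is shorter to show; the Zone layout is an assumption.

```c
typedef struct { int x0, y0, x1, y1; } Zone;    /* minima and maxima */

/* Find the bounding box of the black region containing (sx, sy), whiting
 * out each cell as it is visited so the region is never rediscovered.
 * The fixed-size stack is adequate for a small demonstration buffer.
 */
static Zone fill_zone(unsigned char *buf, int w, int h, int sx, int sy)
{
    Zone z = { sx, sy, sx, sy };
    int stack[256], top = 0;
    stack[top++] = sy * w + sx;
    buf[sy * w + sx] = 0;
    while (top > 0) {
        int i = stack[--top], x = i % w, y = i / w;
        if (x < z.x0) z.x0 = x;
        if (x > z.x1) z.x1 = x;
        if (y < z.y0) z.y0 = y;
        if (y > z.y1) z.y1 = y;
        int nb[4] = { i - w, i + w, i - 1, i + 1 };
        if (x == 0)     nb[2] = -1;             /* don't wrap across rows */
        if (x == w - 1) nb[3] = -1;
        for (int k = 0; k < 4; k++)
            if (nb[k] >= 0 && nb[k] < w * h && buf[nb[k]] && top < 256) {
                buf[nb[k]] = 0;                 /* white out as we go */
                stack[top++] = nb[k];
            }
    }
    return z;
}
```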

Overlapping zones can cause problems. You do not want to perform OCR on the same area of the page more than once. There are several ways to solve this. This application uses a somewhat simplistic approach. If two or more zones overlap, overlap_zones combines them and makes one larger zone. The new X/Y minima for the combined zone come from the lesser of the source zones' minima. The maximum, likewise, comes from the greater of the maxima. If you define a zone as a polygon instead of a rectangle, you can combine two rectangular zones into a larger polygonal region. These polygons will more truly reflect the layout of the page, compared with trying to force it into rectangles. However, most page formats follow rectangular grid lines. The complexity of polygonal region boundaries may produce an insignificant reward.
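The min/max merge is simple enough to state directly; the Zone field layout here is an assumption:

```c
typedef struct { int x0, y0, x1, y1; } Zone;    /* minima and maxima */

/* Two rectangles overlap when neither lies wholly past the other
 * on either axis. */
static int zones_overlap(Zone a, Zone b)
{
    return a.x0 <= b.x1 && b.x0 <= a.x1 &&
           a.y0 <= b.y1 && b.y0 <= a.y1;
}

/* Combine two zones: lesser of the minima, greater of the maxima. */
static Zone merge_zones(Zone a, Zone b)
{
    Zone m;
    m.x0 = a.x0 < b.x0 ? a.x0 : b.x0;
    m.y0 = a.y0 < b.y0 ? a.y0 : b.y0;
    m.x1 = a.x1 > b.x1 ? a.x1 : b.x1;
    m.y1 = a.y1 > b.y1 ? a.y1 : b.y1;
    return m;
}
```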

Sequencing the Zones

The last step in resolving text regions is sequencing the zones. When you visually scan a page in an English-language magazine, the flow of the page follows one of two patterns. The text either moves in columns down the page and then across, which is termed column major ordering, or it moves first across the page and then down, which is row major order. sequence_zones takes a user-provided type parameter that chooses either column- or row-major order. The sequencer reorganizes the zones on the list to reflect their order on the page. Its operation is quite simple: it just compares the X/Y ordering of the zones and swaps any that are out of place.

Sequencing is heavily language-dependent. I have mentioned Chinese, which naturally requires a different sequencer. It affects other languages also. Hebrew reads right to left: the horizontal sequence runs opposite from English while the vertical sequence remains the same, so the Hebrew sequencer needs to flip the horizontal numbering. Hebrew presents even more problems for OCR systems. Although it reads right to left, it can contain embedded English words or phrases that scan left to right. This means the scanning engine and page reconstructor depend on linguistic context. This can lead to lots of late-night coding entertainment.
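The sequencer can be sketched as a simple exchange sort keyed on the zones' top-left corners; the field and parameter names here are assumed, not taken from zone.h.

```c
typedef struct { int x0, y0, x1, y1; } Zone;

enum { ROW_MAJOR, COLUMN_MAJOR };

/* Reading-order comparison: row major goes down then across the keys
 * (Y first, then X); column major is the reverse. */
static int zone_before(Zone a, Zone b, int order)
{
    if (order == ROW_MAJOR)
        return (a.y0 != b.y0) ? a.y0 < b.y0 : a.x0 < b.x0;
    else
        return (a.x0 != b.x0) ? a.x0 < b.x0 : a.y0 < b.y0;
}

/* Exchange sort: compare neighboring zones and swap any out of place. */
static void sequence_zones(Zone *z, int n, int order)
{
    for (int i = 0; i < n - 1; i++)
        for (int j = 0; j < n - 1 - i; j++)
            if (!zone_before(z[j], z[j + 1], order)) {
                Zone t = z[j]; z[j] = z[j + 1]; z[j + 1] = t;
            }
}
```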

Extensions of the Algorithm

As it stands, the algorithm requires two user input parameters, coarseness and sequence ordering. How can I automate these? Placing the algorithm in a feedback loop will remove the coarseness parameter. If the results have many small regions, the coarseness is too low. Rerun the function with a higher coarseness. If there is only one large region, retry with a lower coarseness value. Removing the sequence ordering parameter is difficult. It requires knowledge of text content. Everyone has seen a poorly laid out magazine page where it took a second to find the continuation of a column. Think how much fancy processing you are doing when your eye and brain link to the next column. Completely automating column linkage requires a higher level of knowledge about the page.
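The coarseness feedback loop might be reduced to something like this, where counts[c] stands in for the number of regions a full blocking pass would report at coarseness c, and the region-count limit is illustrative:

```c
/* Pick the first coarseness (0..5) whose blocking pass yields more than
 * one region but no more than max_regions.  counts[c] is assumed to come
 * from rerunning the block/extract phases at each coarseness value.
 */
static int auto_coarseness(const int counts[6], int max_regions)
{
    for (int c = 0; c <= 5; c++)
        if (counts[c] > 1 && counts[c] <= max_regions)
            return c;                 /* neither fragmented nor collapsed */
    return 5;                         /* fall back to the coarsest setting */
}
```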

Dot-matrix print has always been a hurdle for OCR. The separated dots cause havoc for the character isolator. Filtering with cellular automata fills in the spaces between the dots, simplifying the problem. It is a good preprocessor for dot-matrix input.

The techniques discussed here apply to monochrome images. How do you handle color? This is a subtle obstacle. It requires remapping the color space to two values, foreground and background. You then extract the foreground elements from the background. You might have to vary the definition of foreground and background colors across different parts of the page.
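A sketch of the remapping for the simple case where one global threshold suffices, using the common BT.601 luma weights; as noted above, a real page might need the threshold varied across its parts.

```c
/* Reduce packed RGB pixels (three bytes each) to foreground/background.
 * Luminance is approximated with integer BT.601 weights; anything darker
 * than the threshold is called foreground (1), the rest background (0).
 */
static void to_monochrome(const unsigned char *rgb, int npix,
                          unsigned char *mono, int threshold)
{
    for (int i = 0; i < npix; i++) {
        int lum = (77 * rgb[3*i] + 151 * rgb[3*i+1] + 28 * rgb[3*i+2]) >> 8;
        mono[i] = (lum < threshold) ? 1 : 0;   /* dark = foreground */
    }
}
```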

Conclusion

Cellular automata are more than just a source of entertainment. They are a powerful tool, useful for simplifying difficult tasks. And simplification is the essence of intelligent software design. When deciding whether or not to use cellular automata, keep these few rules in mind. Can the problem be divided into cellular space? Does the problem involve local effects, not action at a distance? Can cellular evolution take the problem from a complex state to a simpler one? If these rules apply, give cellular automation a try.