David Weber has been programming since 1971 and is currently living in rural Colorado, where he is an organic gardener by daylight and a software by night.
The Problem
Algorithms live in a world far removed from textbooks. An unspoken assumption throughout my reference shelf is that CPU resources are unlimited. Time, of course, is analyzed in great detail. Long proofs result in a "big O" relationship that carefully defines what I should expect for average performance. However, there is no equivalent "big S" explanation that covers space requirements. True, such an analysis is complicated by CPU architecture, but that doesn't make it any less essential in the real world. I was reminded of this one day when I answered the phone, talked over a problem, and took a contract.The marketing folks at this firm decided at the tail end of a project that "send fax" capabilities were necessary. The product was well evolved and all 640 KB of memory real estate had been oversubscribed, argued for, and staked out with pit-bull tenacity. They needed a way to scale an arbitrary image to fax dimensions and they needed to do it without resources. The project specifications required a scaling technique that met these criteria:
My personal algorithm library yielded no solution. I had lots of scaling techniques, but they demanded dynamic lookup tables or matrices for a state machine. So I went back to the basics. There are three fundamental types of scalers. These use either floating-point multiplication, incremental error accumulation, or sampling.
- It must scale up and down with a dynamic range varying from 0.1 to 10.
- It has to generate lines that are exactly the fax width (216 bytes). This demands clipping lines that are too long and whiting out the tails of those that are too short.
- Horizontal and vertical scaling factors must be independent. This is because faxes have two resolutions, high (200x200 dpi) and standard (200x100 dpi).
- The buffer control has to reside in the calling function.
- It needs to be 80x86 compatible and run on an 80186 or better processor.
- It cannot use floating-point math.
- It has to have a precision of 5 per cent or better.
- It must maintain the image quality. In particular it cannot corrupt dithering, cannot introduce discontinuities, and should preserve the darkness and contrast of the image.
- Of course, it must be as fast as possible and use an absolute minimum of space.
The first choice was out by specification. I could use a fixed-point multiplier, but the target CPU was not very fast at integer multiplication and I couldn't guarantee scaling ratios that were amenable to shift-add techniques. The second alternative, error accumulation, produces beautifully scaled images. It works like the Bresenahm line algorithm in that it carries forward an error accumulator that determines whether or not to darken the next pixel. It handles half toning with grace but its performance is miserable without adequate room for bit-counting lookup tables. This left sampling as the lone answer.
Sampling Scalers
There are probably more sampling scalers in use today than any other kind. The reason is simplicity. And simplicity in concept leads to small fast code. Look at Figure 1, which is a mechanical analog of sampling. As the gear rotates along the input line, it punches out pixels, which gather on the output line. The ratio of teeth to the circumference, in this case 8/16, determines the scale ratio. This ratio varies in direct proportion to the number of teeth and the size of the gear. Ideally, equal spacing between the teeth produces the best samples. Realistically, odd numbers of teeth do not fit well in a base-2 universe. So some unevenness is to be expected.Because sampling ignores parts of the input line those pixels falling between the teeth it is susceptible to certain forms of error. In Figure 2 you can see two identical gears operating on two identical input lines, yet producing exact opposite results. The only difference is the starting offset.
The input line pattern in this case is not unusual. It is a 50 per cent gray half tone. As a consequence this form of image corruption happens quite often while scaling. Other artifacts also occur. Moire patterns, named for the wave-like sheen of silk, appear in an image when a beat frequency sets up between the sampler and the pixels. They create synthetic designs that never existed in the original.
To a greater or lesser extent, all sampling methods fall victim to these problems. Sampling scalers are among the worst offenders, while error accumulation scalers handle them best. When it is acceptable, you can avoid many of the effects by avoiding even sampling ratios. When asked to do a 50 per cent scale, try 49 per cent or even 47 per cent. The user usually won't notice the slight difference in image size. In my case, the scaler was passing the output to a fax card that, being a noisy channel, would mangle the image well beyond anything the scaler could do. My requirement was to keep the error introduced by the sampler insignificant in the final fax.
Scaling Up
So far the scalers presented in this discussion scale one way, and that is down. By using a gear that is all teeth and no spaces, the highest attainable ratio is 16:16 or 100 per cent. The question naturally arises, how can I use this method to increase the size of an image? Return to Figure 1 and assume 150 per cent scaling. This scale factor means that for every two pixels in the input line there are three pixels in the output. Redefine the meaning of the teeth and space on the gear. Every time a tooth passes over a pixel, duplicate it. Every time a space passes over, replicate it. This gives the desired 3/2 proportion.Let's look at the logic from another angle. If the tooth on the gear is considered a 1 and a space is a 0, we can define an additive basis for the gear. In the 150 per cent example, the basis is 1. When a tooth passes over, create (basis+1) copies of the pixel. When a space passes over, create (basis+0) copies. By changing the basis and the teeth on the gear we can scale up to any ratio.
Scaling down is just a specific case when the basis is 0. Then a tooth means copy (basis+l = 1) and a space means ignore (basis+0 = 0). Like scaling down, scaling up can produce undesirable effects on an image. Since pixel patterns replicate in two dimensions, there is a tendency to form blocks and emphasize aliasing. The more you scale the picture, the greater this tendency. Fortunately, this effect becomes significant only when the scaling ratios are large.
Driving a Sampler
There are two ways to drive a sampler. You can push the input line through, operating on each source pixel. Or you can pull the output line, operating on each destination pixel. The preferred technique depends on the typical use. If, on the average, the input line is much longer than the output line, it is better to iterate over the output line. This decreases the loop count and speeds the scaler.An example of this is scaling for CRT displays. They have low resolution, which means most scaling is below 100 per cent. Therefore you should loop on the display pixels while sampling the source line. If, however, there is approximate parity between the line sizes, or the input width is less than the output, then iterating over the input line is the better choice. Also, if the source line has exploitable regularity, like long stretches of white space, then looping on the source is the better choice. This way you can optimize performance.
Seeing the teeth and spaces on the gear as ones and zeros should ring some digital bells. They translate directly into bits in a machine-word rotator. Listing 1 is a first cut at casting this mechanical system into C. The gear is a 32-bit rotator. The use of 32 meets the 5 per cent precision requirements and is a comfortable choice for the CPU instruction set. By changing a typedef and a #define, you could use this function with a rotator that is 8, 16, 32, or 64 bits.
The scaler interface has two function calls. The first, init_fax_scale, takes horizontal and vertical scale factors as well as the byte width of the source line. It sets up the basis and lays out the teeth of the rotator. Note that the destination byte width is not needed. This is because we know in advance that the output lines will always be fax width, 216 bytes. The more general case would also handle the destination width. There is nothing profound in this function. It just does a straightforward scaling, taking roundoff into account.
The second function, scale_fax_line, does the work. It walks the bytes in the source line, iterates the bits in the bytes, and finally spins the gear. An accumulator with a sentinel bit handles the output. This function returns a line-repeat value varying from 0 up to the maximum scaling factor (32). This keeps the buffer control in the caller. If scale_fax_line did the line copying, it could easily overrun a buffer. Instead it is the caller's responsibility to replicate the scaled line. Listing 2 shows a sample calling procedure that does this copying.
Improving Performance
Listing 1 does the job but is not particularly fast. A quick look at most faxes shows that they have a lot of white area. Listing 3 is the scale_fax_line function with white-space optimization. On the average, it runs twice as fast as the generic code in Listing 1. But we pay for it with space. This is always the trade off space and time at opposite poles tugging at the programmer in between. With sufficient room for hefty lookup tables, I could rip out the inner loops and make the algorithm fly. But that option is foreclosed. The factor of two speed increase in this case does, however, merit the slight increase in space. This type of judgment call always crops up in algorithm development and should make you pause and consider.Any program hoping to a claim to minimalism will likely end up in assembly language. In fact, that goal influenced the design from the beginning. But C is the natural starting point. I always validate an algorithm first in C before migrating it to assembly language. In C I can easily see the pattern and flow. I can also make quick changes and try out variants. Assembly language has an inertia that tells you not to mess with it, or it will shatter. This is not conducive to experimentation.
C also provides an excellent template for writing the assembly language, as Listing 4 shows. I take the C original and convert it entirely into comments with an editor macro, then interleave the assembly language between the C statements. The C code provides documentation and enforces structure.
A few changes in the assembly language stray from the initial C source. The C code rotates to the right while the assembly language rotates left. Speed is the reason. Shifting with the instruction add reg,reg beats shl reg,1 every time, especially on older processors. The assembly language also makes extensive use of the carry bit as a decision flag, an option not available with C.
Note also that I store the static data in the code segment. Some processors take great exception to data in an executable segment. Or they cannot see changes to the data because their Harvard architecture does read-only caching of the code. The design intent in this case justifies the use of the executable segment. The fax scaler is part of a device driver that loads into and executes from the data segment. As such, it is like a tiny model program where everything must fit into one segment.
Looking through scale_fax_line, you will see that it carefully uses every CPU register. This register parsimony is essential when converting from C to assembly language. As a rule, once you fall out of the register set and start allocating variables on the stack, performance will suffer. This is why compiler optimizers spend so much time color mapping registers to their variables. It is also the primary reason why RISC processors with their large register set are often faster than CISC. While doing the original C design, strive to minimize the number of variables and, thereby, the number of registers.
The assembly language weighs in at 395 executable bytes, 10 bytes of constant data, 15 bytes of modifiable data, and 12 bytes of stack above and beyond the function call. In an embedded processor, the constant data and the code would fit in ROM, leaving only 15 bytes of temporary demand on RAM space. The C equivalent to the assembly language is twice the size and one third the speed. This is a fairly typical ratio. On a 20 MHz 386 DX the scaling rate is about one megapixel per second.
Code Availability
All the software in this article, a test bench and some sample images are available from the usual outlets. Refer to the fine print on page 5 of this magazine for their locations. Feel free to download this package and try the scaler. Even with dithered pictures the output is quite good. I would be very interested in seeing other people's solutions to this problem. I, for one, do not underestimate the quality of programmers who read this journal. I know there are some slick scaling tricks out there.Keep in mind the limitations imposed by the specifications. Many of my first speed optimizations ended up doing ugly things to the picture at certain scaling ratios. In particular, they either modified the black density (the darkness) or caused white lines to appear on a black background when they skipped over pixels. Send your ideas to me in care of C/C++ Users Journal. With enough feedback, we might even do a "best of" compilation in a future article.