LETTERS

Hashing It All Out

Dear DDJ,

I really enjoyed your recent article on Bloom Filters, "An Existential Dictionary" by Edwin T. Floyd (November 1990); studying hashing techniques is one of my favorite forms of recreation. I was, however, so astonished by Floyd's assertion that a 32-bit CRC function was not an adequate hashing function that I had to duplicate his collision test for myself. Sure enough, I also saw "thousands of collisions." But then I realized that I was using a 16-bit counter to produce the "10-digit ASCII numbers" used in the test; most of my "collisions" were really the result of my counter wrapping at 64K and producing duplicate keys! With that little problem solved I measured 12 collisions in a test with 373,380 keys; Floyd's equation for expected collisions predicts 17. I have no idea if Floyd's test suffered from the same bug as mine, but I learned long ago that when a test produces results that are at great variance with theory, you should take a real good look at the test before you start to doubt the theory.

Some ten hours of crunching on Knuth's "Algorithm S (Percentage Points for Collision Test)" reveals that with probability 0.99 there will be at most 60 collisions in Floyd's collision test; Floyd's data shows that he exceeded this value in seven of 30 tests. The odds against that happening with a good hash function are something like 50 million to one! There is also a probability of 0.22 that there will be at most 39 collisions, yet Floyd's tests never got a value that low. He consistently gets about ten too many collisions. There seem to be three possibilities: 1. Floyd's hashing algorithm doesn't do a very good job of turning his word list into a list of random numbers; 2. his word list has about 48,000 more words than he thinks it does; 3. his word list contains about ten duplicate words.

While there is nothing wrong with Floyd's scheme of hashing a key to produce a seed and then inserting the seed into the Bloom Filter, the same results are obtainable with much less work. A CRC generator really does make a pretty good hashing function, and it's very easy to generate a string of hash values by appending zeros to the end of the key. The first hash value is the CRC of the key, the second is the CRC of the key with a zero appended, the third is the CRC of the key with two zeros appended, etc. In practice you don't actually recompute the CRC of the key each time, you simply crank one more byte through the CRC process. Given a function that accepts a pointer to a string of bytes, a byte count, and an initial value, and returns a CRC, the code fragment in Example 1 inserts a key into a Bloom Filter.

Example 1

  hash = crc ( key, strlen ( key ), 0 );
  setBit ( hash % hashMod );
  for( i = 1; i < 14; ++i ) {
     hash = crc( "\0", 1, hash );
     setBit( hash % hashMod );
  }

When I used this method on Floyd's "practical test" I measured 39 false drops on a list of 93,345 words; the number predicted by theory is 40. As usual, simplicity brings speed; this technique, all coded in C, inserts 280 keys per second on an old 6-MHz AT, 590 on a 10-MHz 286, and 1678 on a 20-MHz 386. The simple optimization in Example 2, which moves the incremental CRC calculation in-line, brings those numbers up to 340, 700, and 1995, respectively.

Example 2

  hash = crc ( key, strlen ( key ), 0 );
  setBit ( hash % hashMod );
  for( i = 1; i < 14; ++i ) {
     hash = ( crcTable [ hash & 0xFF ] ) ^ ( hash >> 8 );
     setBit( hash % hashMod );
  }

Finally, I must mention that Doug McIlroy was not only aware of Bloom Filters when he wrote the spelling checker mentioned in the article, he was improving on a spelling checker that used a Bloom Filter! The solution Floyd proposes takes 25 percent more memory than McIlroy's; more memory, in fact, than was addressable on the PDP-11 it was written for.

John A. Murphy

Performance Technology

San Antonio, Texas

Edwin replies: Mr. Murphy is correct; there was a bug in my 32-bit CRC test, though not the one he describes. After I corrected the CRC routine (by inserting a 1-byte instruction: CLD) the collision test showed 41 collisions where theory predicts 45, and the practical test showed 46 false drops where theory predicts 47. Encouraged, I rewrote the algorithms in assembler and reran the benchmarks. My implementation now inserts 850 keys per second on an 8-MHz V-20 and 4941 on a 20-MHz 386!

I have to say I was troubled from the beginning by the complexity of my hashing algorithm, but it was the first one that worked with anywhere near the predicted test results. Thanks to John, I believe we now have a better, faster hashing algorithm. My faith in the power of publication to improve our art is renewed. I've improved DICT.PAS and the assembler source, BLOOM.ASM, for the high-speed CRC and bit set/test routines; this code is available in the DDJ Forum on CompuServe and on M&T's Telepath online service.

Finally, I realize that Doug McIlroy must have been aware of Bloom Filters though I didn't know his spelling checker was an improvement on one. It's amusing that I chose his "improvement" to illustrate the technique. My solution does take more space than Doug's, as I pointed out in the article, but with that space we buy the ability to update the dictionary instantaneously. It's a classic trade-off.

Following Up on Software Patents

Dear DDJ,

The November 1990 Dr. Dobb's Journal article "Software Patents," by The League for Programming Freedom, brings to mind the saying that you never can appreciate the problems of others until they become your own. The present woeful state of software patents is nothing new to engineers and scientists who have dealt with the patent "system" in the past. The article's conclusion that patents should not be granted to any form of software hits the nail on the head.

It is well known that the patent office, and in many cases the same examiner, will issue duplicate patents for the exact same invention by different applicants within a few months or even weeks of each other, making the title "patent examiner" an oxymoron. It is common knowledge that at least half of the patents granted, in all areas, are of questionable integrity, since they bear strong ties to the obvious and/or prior art. With this past record as prologue, that the U.S. Patent Office can pretend to delve deeply into the superficial layers of an application's obviousness and conflict with prior art strains credulity. That the U.S. Patent Office is incompetent is a moot point, since the office will never be competent at any level of staffing. The U.S. Patent Office can only be a rubber stamp, issuing entry visas into the U.S. legal system and guaranteeing job security for lawyers and patent examiners. What better excuse to raise taxes?

Why is this? Patents are applied for by two types of applicants. The first is genuine and believes the idea or invention to be new and original (which it may or may not be), while the second, a "patent system parasite," simply applies for patents strategically wherever it is believed one may be granted, even when the application is known to be based on prior art and/or to be obvious. The only people who can be considered competent to review a patent application in a specific area are those who work, or have worked, in areas related to it. This type of peer review, while impossible to achieve in the present patent system, would still have obvious limits. Yet we have something better: a system of illiterate stone-age cave-person judges, juries, and lawyers. Can we ever escape the dark ages?

The U.S. Constitution not only sets forth the law for patents, it is the basis of all law, and all things technical or mundane fall under the law. Aye, there lies the rub. Law, lawyers, and the legal "system" (read: casino) are the final controllers of all our lives and, as things progress, our thoughts. This fact flies in the face of a responsible community of engineers, scientists, and citizens who naturally expect more of the U.S. Government. What we have is an intellectual dichotomy as vast and penal as the dark ages. U.S. law and the legal system are not interested in scientific progress, freedom, or anything except the propagation of laws and the power of government, right or wrong. Unfortunately this sad tale has repeated itself for thousands of years and hundreds of generations, being the Achilles' heel of the most intelligent mammals on this planet. The more things change, the more they stay the same; or what is this thing we call "civilization" anyway?

To argue that the patent system "protects" inventors is like saying the Mafia protects small family businesses. Sure it does, but at what cost? The granter of patents, the U.S. Constitution, also leaves an answer to the problem in the key word: freedom. Freedom, that modern anachronism, usually gets short shrift when the fittest surviving creature, able to squelch any innovation that threatens an agency or a fee, is the U.S. bureaucracy. To look to Congress to solve this problem is to ask the largest body of conflict of interest to cut its own throat; granted, an exaggeration, but today one's wallet is anatomically connected to the throat. Talk about virtual realities!

Too bad the Eastern Europeans are turning to the U.S. for legal advice. Let's hope they learn from our mistakes instead of repeating them. Step right up and place your bets in the legal casino of life.

Tari Taricco, President

Taricco Corp.

San Pedro, California

Dear DDJ,

It's great to see DDJ return to its historic visionary role with that software patents article. We get absorbed in the minutiae of what we do so easily, and we really need to pay attention to the big world around us. As a fundamentalist of the old-time religion of assembly language, I'm as guilty as they get on that point.

There are other issues just as worthy of some hell raising: The collapse of hardware standards, the ownership of the dominant operating system by a secretive and erratic private company, the next-year's-Chevyism of minimal upgrades.... So keep it up.

Instead of refusing to cooperate in patent applications, we might try the inundation approach. Small companies could require every programmer to put each week's work into a patent application -- sort of like kids stockpiling snowballs for a fight. The resulting applications would have no less merit than most of the existing patents. In fact, that's such a good idea I should probably try to patent it.

John Sprung

Viacom Productions

Universal City, California

Dear DDJ,

Your article on patents (November 1990) was very timely and informative. As a software developer in a small startup company, I find the potential restrictions imposed by arbitrary patents disconcerting, to say the least. I propose that much of the problem stems from a misinterpretation of what software is in the mind of the public -- a misinterpretation reflected in the minds of the bureaucracies and legislative bodies. As pointed out in the article, patents apply specifically to things, not abstract entities. The confusion seems to arise because software is so intimately tied to a machine, which is definitely a patentable thing. The (mistaken) perception that software is itself a patentable thing derives from this close association.

Perhaps the better way to think of software is as a literature equivalent: it is not built, it is written. The end result is not a process or a widget, but something that is read by a machine to produce widgets and objects (even if they are on a screen). We can copyright something that is written, but we can't copyright or patent the principles and techniques by which it is written.

For instance, if the field of music were suddenly a brand new discipline, and in the evolution of this discipline the notation principles of black notes on staves of five lines each (with all the sharp and flat signatures, etc.) were developed by an individual or an organization, would that individual or organization be entitled to a patent or even a copyright? Could someone patent a B minor chord? Or a cadence leading to a resolution? What would happen to the field of music if such practices were allowed?

Similarly, in the field of literature, would it be possible to patent a construct such as a poem? Iambic pentameter? Chapters? A table of contents? Clearly, these are the fundamental concepts and techniques which are indigenous to the field of literature and make its advance possible: the analogy to the algorithm is neither obscure nor accidental.

There is certainly protection for authors who create a specific piece of literature, music, or art through the copyright process. This has worked well (mostly) in these fields and, if the analogy between software and literature can be accepted, would work equally well for it.

Software has the unique capability of producing further literature, to be consumed by machines or humans, and if this literature can be specifically described, it should be subject to copyright rules as well. The key here is specifically described. This would include (and is specifically intended to provide for) menu and user interface screens. All menu and UIF constructs are finite state -- there are only so many specific combinations possible, allowing, perhaps, for rules that let certain menus and windows be repositioned. Under this construct, a software company should be able to render complete descriptions of every state of its UIF and, if it meets the tests of specificity, be granted a copyright for that UIF. This should make companies like Apple and Lotus happy, without rendering proprietary the techniques by which those screens were created, which are, in fact, part of the software development discipline.

In short, I am agreeing with the statements and definitions provided in the article, but suggest that software be considered as a new type of literature. This should be easily provable: no software can produce anything without a machine to read it. This position would retain the tenets adopted by the authors, but provide a graspable analogy for the nontechnically minded officials and legislators who are, fortunately or otherwise, the ones that have to be convinced.

Rick Berger

Sedona Software

San Diego, California

An Accidental Tourist...

Dear DDJ,

I am not a programmer, amateur or professional, but a semi-retired physicist, electronischer, inventor, and patent buff who is much more at ease bashing tin or slinging solder than writing the strictly ordered poetry of a computer program. Oh, all right, I do cook up some Basic, with a few lines of machine code thrown in, to do donkey-work (modeling the on-axis performance of any horizontal-axis windmill using a Sinclair ZX-81 and only 16K; or home-brew memory-mapped I/O, using another Sinclair to control and log a long-term life test) which I would otherwise have to do manually. To me, a computer is just another power tool, like a sabre saw.

So why on earth am I a subscriber to DDJ? It was an accident. I subscribed to a "computer" magazine; it went belly-up, and offered a choice of others to take its place. I selected one, but that one also died promptly, and offered a choice of still others. One was described as "highly technical" and I gladly chose it, fearing at the same time that I might be a Jonah. It proved to be highly technical, but in a field which was (and is) very strange and wonderful. Your accidental subscriber has found Dr. Dobb's.

But why have I continued to subscribe? I have asked myself that question many times, but just when I resolve to let the subscription expire, I find some articles or letters which are of such great general interest, or simply so superbly written, that they make the task of reading an unalloyed pleasure. Even though I am nearly illiterate in the languages of programming, I can enjoy explorations of neural nets, fulminations about Ada, and always the trenchant comments in "Swaine's Flames." My interest seems more aesthetic than technical, but that may be a common human trait: I thoroughly enjoy opera, even though I know little German and French, and even less Italian.

Gurdon Abell

Woodstock, Connecticut

... And an Accidental Turing

Dear DDJ,

In his November 1990 "Programming Paradigms" column, Michael Swaine presented portions of the critique of connectionism advanced by Fodor and Pylyshyn. These authors have made major contributions to the field of cognitive science, and their analysis of the connectionist approach to cognition raises many important issues.

One of Fodor's lines of reasoning goes something like this:

  1. Cognition (thinking) involves the manipulation of symbols. A symbol must have semantic content. Therefore something that thinks (e.g., a mind) must deal with semantic elements.
  2. "Neural nets" deal in the weights of interactions among, and levels of excitation of, simple processing units. These weights and excitation levels are not semantic elements. Therefore a neural net cannot think.

Although Fodor and Pylyshyn present their case very elegantly, it is unsatisfactory for several reasons. The most striking of these is that their argument refutes itself.

Let us suppose that the human mind is capable of cognition. Let us further suppose that the human mind is implemented in hardware which we will call the brain (to suppose otherwise is to invoke dualism). It is generally agreed that the brain consists of neurons joined by inhibitory and excitatory connections, and that the level of excitation of these neurons defines the state of the brain at any moment. In short, the brain is a neural net, albeit a far more complicated and capricious one than any artificial neural net to date.

However, according to Fodor and Pylyshyn, neural nets cannot support cognition. Therefore human beings cannot think ("I knew it all along!" you say ...). If we assume that Fodor and Pylyshyn are human beings, this conclusion applies to them as well. From this we must infer that they derived their arguments without resort to cognitive processes.

In closing, Michael Swaine states that a neural net is "the [computational] equal of a Turing machine." Given this premise, and the premise that a Turing machine is capable of semantic manipulation, then a neural net must be similarly capable. Why does he assert that a neural net can support semantic processing only if used to implement a Turing machine, which then does the real work? Does a neural net stop being a neural net as soon as it replicates the function of a Turing machine?

Although a Turing machine can be programmed to emulate some cognitive processes, my suggestion is that most of what passes for human thought (including thoughts generated by Fodor, Pylyshyn, and Swaine) arises without the intermediary of a Turing machine.

Suppose for the moment that Fodor and Pylyshyn were correct, and neural nets were incapable of cognition. What use, then, are they? Biological neural nets, even very simple ones, solve countless life-and-death problems daily, reliably and in real time, with a limited amount of hardware, apparently without resorting to semantic manipulation or cognition. Consider the ability of flying insects to take off, navigate, and land, making adjustments as necessary in a fraction of a second. Show me the program that performs a similar function, and then show me the nonbiological hardware that implements it as quickly and as well as the nonsemantic fly! Better yet, show it to Boeing or the folks at DARPA, and watch the bucks roll in.

Ted Carnevale

Stony Brook, New York

RAM Disk for the Rest of Us

Dear DDJ,

Thanks for the article "RAM Disk Driver for Unix" (Jeff Reagen, October 1990). I was able to compile and install the driver on my Microport System V/386. Driver code was unchanged but the kernel rebuild was a bit different from the procedure outlined in the article. Anyway, it was educational (only somewhat painful!) and took me a few places in the Unix manuals where I don't usually go. Again, thanks, and keep up the good work.

James Littlefield

CompuServe 71611,2121


Copyright © 1991, Dr. Dobb's Journal