LETTERS

Hashing It All Out

Dear DDJ,

I really enjoyed your recent article on Bloom Filters, "An Existential Dictionary" by Edwin T. Floyd (November 1990); studying hashing techniques is one of my favorite forms of recreation. I was, however, so astonished by Floyd's assertion that a 32-bit CRC function was not an adequate hashing function that I had to duplicate his collision test for myself. Sure enough, I also saw "thousands of collisions." But then I realized that I was using a 16-bit counter to produce the "10-digit ASCII numbers" used in the test; most of my "collisions" were really the result of my counter wrapping at 64K and producing duplicate keys! With that little problem solved I measured 12 collisions in a test with 373,380 keys; Floyd's equation for expected collisions predicts 17. I have no idea if Floyd's test suffered from the same bug as mine, but I learned long ago that when a test produces results that are at great variance with theory, you should take a real good look at the test before you start to doubt the theory.

Some ten hours of crunching on Knuth's "Algorithm S (Percentage Points for Collision Test)" reveals that with probability 0.99 there will be at most 60 collisions in Floyd's collision test; Floyd's data shows that he exceeded this value in seven of 30 tests. The odds against that happening with a good hash function are something like 50 million to one! There is also a probability of 0.22 that there will be at most 39 collisions, yet Floyd's tests never got a value that low. He consistently gets about ten too many collisions. There seem to be three possibilities: 1. Floyd's hashing algorithm doesn't do a very good job of turning his word list into a list of random numbers; 2. his word list has about 48,000 more words than he thinks it does; 3. his word list contains about ten duplicate words.

While there is nothing wrong with Floyd's scheme of hashing a key to produce a seed and then inserting the seed into the Bloom Filter, the same results are obtainable with much less work. A CRC generator really does make a pretty good hashing function, and it's very easy to generate a string of hash values by appending zeros to the end of the key. The first hash value is the CRC of the key, the second is the CRC of the key with a zero appended, the third is the CRC of the key with two zeros appended, etc. In practice you don't actually recompute the CRC of the key each time, you simply crank one more byte through the CRC process. Given a function that accepts a pointer to a string of bytes, a byte count, and an initial value, and returns a CRC, the code fragment in Example 1 inserts a key into a Bloom Filter.

Example 1

  hash = crc ( key, strlen ( key ), 0 );
  setBit ( hash % hashMod );
  for( i = 1; i < 14; ++i ) {
     hash = crc( "\0", 1, hash );
     setBit( hash % hashMod );
  }

When I used this method on Floyd's "practical test" I measured 39 false drops on a list of 93,345 words; the number predicted by theory is 40. As usual, simplicity brings speed; this technique, all coded in C, inserts 280 keys per second on an old 6-MHz AT, 590 on a 10-MHz 286, and 1678 on a 20-MHz 386. The simple optimization in Example 2, which moves the incremental CRC calculation in-line, brings those numbers up to 340, 700, and 1995, respectively.

Example 2

  hash = crc ( key, strlen ( key ), 0 );
  setBit ( hash % hashMod );
  for( i = 1; i < 14; ++i ) {
     hash = ( crcTable [ hash & 0xFF ] ) ^ ( hash >> 8 );
     setBit( hash % hashMod );
  }

Finally, I must mention that Doug McIlroy was not only aware of Bloom Filters when he wrote the spelling checker mentioned in the article, he was improving on a spelling checker that used a Bloom Filter! The solution Floyd proposes takes 25 percent more memory than McIlroy's; more memory, in fact, than was addressable on the PDP-11 it was written for.

John A. Murphy

Performance Technology

San Antonio, Texas

Edwin replies: Mr. Murphy is correct; there was a bug in my 32-bit CRC test, though not the one he describes. After I corrected the CRC routine (by inserting a 1-byte instruction: CLD) the collision test showed 41 collisions where theory predicts 45, and the practical test showed 46 false drops where theory predicts 47. Encouraged, I rewrote the algorithms in assembler and reran the benchmarks. My implementation now inserts 850 keys per second on an 8-MHz V-20 and 4941 on a 20-MHz 386!

I have to say I was troubled from the beginning by the complexity of my hashing algorithm, but it was the first one that worked with anywhere near the predicted test results. Thanks to John, I believe we now have a better, faster hashing algorithm. My faith in the power of publication to improve our art is renewed. I've improved DICT.PAS and the assembler source, BLOOM.ASM, for the high-speed CRC and bit set/test routines; this code is available in the DDJ Forum on CompuServe and on M&T's Telepath online service.

Finally, I realize that Doug McIlroy must have been aware of Bloom Filters though I didn't know his spelling checker was an improvement on one. It's amusing that I chose his "improvement" to illustrate the technique. My solution does take more space than Doug's, as I pointed out in the article, but with that space we buy the ability to update the dictionary instantaneously. It's a classic trade-off.

Following Up on Software Patents

Dear DDJ,

The November 1990 Dr. Dobb's Journal article "Software Patents," by The League for Programming Freedom, brings to mind the saying that you never can appreciate the problems of others until they become your own. The present woeful state of software patents is nothing new to engineers and scientists who have dealt with the patent "system" in the past. The article's conclusion that patents should not be granted to any form of software hits the nail on the head.

It is well known that the patent office, and in many cases the same examiner, will issue duplicate patents for the exact same invention by different applicants within a few months or even weeks of each other, making the title "patent examiner" an oxymoron. It is common knowledge that at least half of the patents granted, in all areas, are of questionable integrity, since they bear strong ties to the obvious and/or prior art. With this past record as prologue, that the U.S. Patent Office can pretend to delve deeply into the superficial layers of an application's obviousness and conflict with prior art strains credulity. That the U.S. Patent Office is incompetent is a moot point, since the office will never be competent at any level of staffing. The U.S. Patent Office can only be a rubber stamp, issuing entry visas into the U.S. legal system and guaranteeing job security for lawyers and patent examiners. What better excuse to raise taxes?

Why is this? Patents are applied for by two types of applicants. The first is genuine and believes the idea or invention to be new and original (which it may or may not be), while the second, a "patent system parasite," simply applies for patents strategically wherever it is believed one may be granted, even when the application is known to be based on prior art and/or to be obvious. The only people who can be considered competent to review a patent application in a specific area are those who work, or have worked, in areas related to it. This type of peer review, while impossible to achieve in the present patent system, would still have obvious limits. Yet we have something better: a system of illiterate stone-age cave-person judges, juries, and lawyers. Can we ever escape the dark ages?

The U.S. Constitution not only sets forth the law for patents, it is the basis of all law, and all things technical or mundane fall under the law. Aye, there lies the rub. Law, lawyers, and the legal "system" (read: casino) are the final controllers of all our lives and, as things progress, our thoughts. This fact flies in the face of a responsible community of engineers, scientists, and citizens who naturally expect more of the U.S. Government. What we have is an intellectual dichotomy as vast and penal as the dark ages. U.S. law and the legal system are not interested in scientific progress, freedom, or anything except the propagation of laws and the power of government, right or wrong. Unfortunately this sad tale has repeated itself for thousands of years and hundreds of generations, being the Achilles' heel of the most intelligent mammals on this planet. The more things change, the more they stay the same; or what is this thing we call "civilization" anyway?

To argue that the patent system "protects" inventors is like saying the Mafia protects small family businesses. Sure it does, but at what cost? The granter of patents, the U.S. Constitution, also leaves an answer to the problem in the key word: freedom. Freedom, that modern anachronism, usually gets short shrift when the fittest surviving creature, able to squelch any innovation that threatens an agency or a fee, is the U.S. bureaucracy. To look to Congress to solve this problem is to ask the largest body of conflict of interest to cut its own throat; granted, an exaggeration, but today one's wallet is anatomically connected to the throat. Talk about virtual realities!

Too bad the Eastern Europeans are turning to the U.S. for legal advice. Let's hope they learn from our mistakes instead of repeating them. Step right up and place your bets in the legal casino of life.

Tari Taricco, President

Taricco Corp.

San Pedro, California

Dear DDJ,

It's great to see DDJ return to its historic visionary role with that software patents article. We get absorbed in the minutiae of what we do so easily, and we really need to pay attention to the big world around us. As a fundamentalist of the old-time religion of assembly language, I'm as guilty as they get on that point.

There are other issues just as worthy of some hell raising: The collapse of hardware standards, the ownership of the dominant operating system by a secretive and erratic private company, the next-year's-Chevyism of minimal upgrades.... So keep it up.

Instead of refusing to cooperate in patent applications, we might try the inundation approach. Small companies could require every programmer to put each week's work into a patent application -- sort of like kids stockpiling snowballs for a fight. The resulting applications would have no less merit than most of the existing patents. In fact, that's such a good idea I should probably try to patent it.

John Sprung

Viacom Productions

Universal City, California

Dear DDJ,

Your article on patents (November 1990) was very timely and informative. As a software developer in a small startup company, I find the potential restrictions imposed by arbitrary patents disconcerting, to say the least. I propose that much of the problem stems from a misinterpretation of what software is in the mind of the public -- a misinterpretation reflected in the minds of the bureaucracies and legislative bodies. As pointed out in the article, patents apply specifically to things, not abstract entities. The confusion seems to arise because software is so intimately tied to a machine, which is definitely a patentable thing. The (mistaken) perception that software is itself a patentable thing derives from this close association.

Perhaps the better way to think of software is as a literature equivalent: it is not built, it is written. The end result is not a process or a widget, but something that is read by a machine to produce widgets and objects (even if they are on a screen). We can copyright something that is written, but we can't copyright or patent the principles and techniques by which it is written.

For instance, if the field of music were suddenly a brand new discipline, and in the evolution of this discipline the notation principles of black notes on staves of five lines each (with all the sharp and flat signatures, etc.) were developed by an individual or an organization, would that individual or organization be entitled to a patent or even a copyright? Could someone patent a B minor chord? Or a cadence leading to a resolution? What would happen to the field of music if such practices were allowed?

Similarly, in the field of literature, would it be possible to patent a construct such as a poem? Iambic pentameter? Chapters? A table of contents? Clearly, these are the fundamental concepts and techniques which are indigenous to the field of literature and make its advance possible: the analogy to the algorithm is neither obscure nor accidental.

There is certainly protection for authors who create a specific piece of literature, music, or art through the copyright process. This has worked well (mostly) in these fields and, if the analogy between software and literature can be accepted, would work equally well for it.

Software has the unique capability of producing further literature, to be consumed by machines or humans, and if this literature can be specifically described, it should be subject to copyright rules as well. The key here is specifically described. This would include (and is specifically intended to provide for) menu and user interface screens. All menu and UIF constructs are finite state -- there are only so many specific combinations possible, allowing, perhaps, for rules that let certain menus and windows be repositioned. Under this construct, a software company should be able to render complete descriptions of every state of its UIF and, if it meets the tests of specificity, be granted a copyright for that UIF. This should make companies like Apple and Lotus happy, without rendering proprietary the techniques by which those screens were created, which are, in fact, part of the software development discipline.

In short, I am agreeing with the statements and definitions provided in the article, but suggest that software be considered as a new type of literature. This should be easily provable: no software can produce anything without a machine to read it. This position would retain the tenets adopted by the authors, but provide a graspable analogy for the nontechnically minded officials and legislators who are, fortunately or otherwise, the ones that have to be convinced.

Rick Berger

Sedona Software

San Diego, California

An Accidental Tourist...

Dear DDJ,

I am not a programmer, amateur or professional, but a semi-retired physicist, electronischer, inventor, and patent buff who is much more at ease bashing tin or slinging solder than writing the strictly ordered poetry of a computer program. Oh, all right, I do cook up some Basic, with a few lines of machine code thrown in, to do donkey-work (modeling the on-axis performance of any horizontal-axis windmill using a Sinclair ZX-81 and only 16K; or home-brew memory-mapped I/O, using another Sinclair to control and log a long-term life test) which I would otherwise have to do manually. To me, a computer is just another power tool, like a sabre saw.

So why on earth am I a subscriber to DDJ? It was an accident. I subscribed to a "computer" magazine; it went belly-up, and offered a choice of others to take its place. I selected one, but that one also died promptly, and offered a choice of still others. One was described as "highly technical" and I gladly chose it, fearing at the same time that I might be a Jonah. It proved to be highly technical, but in a field which was (and is) very strange and wonderful. Your accidental subscriber has found Dr. Dobb's.

But why have I continued to subscribe? I have asked myself that question many times, but just when I resolve to let the subscription expire, I find some articles or letters which are of such great general interest, or simply so superbly written, that they make the task of reading an unalloyed pleasure. Even though I am nearly illiterate in the languages of programming, I can enjoy explorations of neural nets, fulminations about Ada, and always the trenchant comments in "Swaine's Flames." My interest seems more aesthetic than technical, but that may be a common human trait: I thoroughly enjoy opera, even though I know little German and French, and even less Italian.

Gurdon Abell

Woodstock, Connecticut

... And an Accidental Turing

Dear DDJ,

In his November 1990 "Programming Paradigms" column, Michael Swaine presented portions of the critique of connectionism advanced by Fodor and Pylyshyn. These authors have made major contributions to the field of cognitive science, and their analysis of the connectionist approach to cognition raises many important issues.

One of Fodor's lines of reasoning goes something like this:

  1. Cognition (thinking) involves the manipulation of symbols. A symbol must have semantic content. Therefore something that thinks (e.g., a mind) must deal with semantic elements.
  2. "Neural nets" deal in the weights of interactions among, and levels of excitation of, simple processing units. These weights and excitation levels are not semantic elements. Therefore a neural net cannot think.

Although Fodor and Pylyshyn present their case very elegantly, it is unsatisfactory for several reasons. The most striking of these is that their argument refutes itself.

Let us suppose that the human mind is capable of cognition. Let us further suppose that the human mind is implemented in hardware which we will call the brain (to suppose otherwise is to invoke dualism). It is generally agreed that the brain consists of neurons joined by inhibitory and excitatory connections, and that the level of excitation of these neurons defines the state of the brain at any moment. In short, the brain is a neural net, albeit a far more complicated and capricious one than any artificial neural net to date.

However, according to Fodor and Pylyshyn, neural nets cannot support cognition. Therefore human beings cannot think ("I knew it all along!" you say ...). If we assume that Fodor and Pylyshyn are human beings, this conclusion applies to them as well. From this we must infer that they derived their arguments without resort to cognitive processes.

In closing, Michael Swaine states that a neural net is "the [computational] equal of a Turing machine." Given this premise, and the premise that a Turing machine is capable of semantic manipulation, then a neural net must be similarly capable. Why does he assert that a neural net can support semantic processing only if used to implement a Turing machine, which then does the real work? Does a neural net stop being a neural net as soon as it replicates the function of a Turing machine?

Although a Turing machine can be programmed to emulate some cognitive processes, my suggestion is that most of what passes for human thought (including thoughts generated by Fodor, Pylyshyn, and Swaine) arises without the intermediary of a Turing machine.

Suppose for the moment that Fodor and Pylyshyn were correct, and neural nets were incapable of cognition. What use, then, are they? Biological neural nets, even very simple ones, solve countless life-and-death problems daily, reliably and in real time, with a limited amount of hardware, apparently without resorting to semantic manipulation or cognition. Consider the ability of flying insects to take off, navigate, and land, making adjustments as necessary in a fraction of a second. Show me the program that performs a similar function, and then show me the nonbiological hardware that implements it as quickly and as well as the nonsemantic fly! Better yet, show it to Boeing or the folks at DARPA, and watch the bucks roll in.

Ted Carnevale

Stony Brook, New York

RAM Disk for the Rest of Us

Dear DDJ,

Thanks for the article "RAM Disk Driver for Unix" (Jeff Reagen, October 1990). I was able to compile and install the driver on my Microport System V/386. Driver code was unchanged but the kernel rebuild was a bit different from the procedure outlined in the article. Anyway, it was educational (only somewhat painful!) and took me a few places in the Unix manuals where I don't usually go. Again, thanks, and keep up the good work.

James Littlefield

CompuServe 71611,2121


Copyright © 1991, Dr. Dobb's Journal