Security & Pseudorandom Number Generators

Dr. Dobb's Journal March 2004

Sometimes weak randomness is better than strong

By Ben Laurie

Ben is the technical director of A.L. Digital Ltd. and The Bunker, a director of the Apache Software Foundation, core team member of the OpenSSL Group, coauthor of Apache: The Definitive Guide and author of Apache-SSL. He can be contacted at ben@algroup.co.uk.

The Apache Portable Runtime (APR) project is a portable runtime library implemented by the Apache Software Foundation (ASF) that is used in the Apache HTTP Server, the Subversion version-control system, and other projects. As the APR (http://apr.apache.org/) was extended to meet the needs of the HTTP Server, discussions arose about which kind of randomness should be the default. As an ASF security guru, I argued that the default should be the strongest available source of randomness.

Granted, this can be problematic. The strongest source of randomness might not provide randomness quickly enough, so things can be delayed while waiting for more randomness. This is fine. Casual users are safe from their own ignorance and won't get hacked to shreds as a result of sacrificing security for convenience. On the other hand, knowledgeable users can make their own decisions, change the source of randomness, find more entropy, or do whatever they think best. With this in mind, the APR used strong randomness.

A year or two later, however, the Subversion guys wanted us to adopt weak randomness. It seems that the APR's randomness was causing problems by slowing some functions. But why would they want weak randomness, I wondered? If you don't care about strength, why not just use rand(), or even start at one and work upwards? The answer surprised me: They wanted to be able to generate Universally Unique IDs (UUIDs), which don't care about unpredictability, but do care about collisions. Why? Because UUIDs are not secret, so unpredictability doesn't matter, although it is very bad if they collide.

That got me thinking—and I realized that I was wrong.

Entropy

In the Handbook of Applied Cryptography, by Alfred Menezes et al. (CRC Press, 1996), entropy is defined in the following way: "The entropy of X is...the uncertainty about the outcome before an observation of X." In other words, entropy is a measure of the amount of unpredictable information there is in a data source. The name of the game in producing useful randomness is to have sufficient entropy that the randomness really is random. Once you have enough entropy, you can use it as a source for all future random needs by generating pseudorandom numbers from a cryptographic pseudorandom number generator (PRNG). A cryptographic PRNG is one that cannot have its internal state recovered from its output by any method more efficient than brute force.

How much is enough? It's enough that attackers cannot deduce the original entropy by brute force (cryptographers tend to use the term "entropy" to mean the actual random data, as well as a measure of its randomness); that is, by running the PRNG over candidate values of the entropy and comparing the resulting output with the output observed from the actual PRNG. I refer to this number of bits as "E." Although E is generally agreed to be 128 bits, I use a placeholder because future advances in computation may require an increase in E.

Unpredictability And Collision Resistance

Why is unpredictability different from collision resistance? It all has to do with knowledge. I can have a great source of entropy, but if attackers know all about that source, then it gives me no unpredictability at all. However, knowledge doesn't damage collision resistance in the slightest.

In other words, there is a difference between the inherent unpredictability of a source of randomness and its practical unpredictability given an adversary's access to the source. Cryptographers tend to assume that adversaries can divine information about some sources of randomness. Hence, cryptographers either reduce their opinion of the value of those sources or avoid using them entirely. Luckily, there is a formal definition of the difference between these two types of entropy. The entropy a data source has given any knowledge the adversary has (or might have) is known as "conditional entropy," and the entropy ignoring any such knowledge, I call (to avoid confusion) "unconditional entropy." In this notation, the entropy of a data source X is written H(X), and the conditional entropy of X given some knowledge Y is written H(X|Y). It is known that H(X|Y)<=H(X), with equality occurring if (and only if) Y is independent of X. This ties in well with intuition: If you know something about X, then it has less entropy. If what you know has nothing to do with X, then the entropy is unchanged.

Therefore, I use "unconditional entropy" to mean the amount of entropy a data source has, ignoring any access attackers may have to it; and "conditional entropy" to mean the amount it has given an attacker's potential knowledge. You have to be careful with these terms, because they work in the opposite sense to the related concepts of conditional and unconditional security. However, since they all have widely accepted formal definitions, there is little sense in trying to reconcile them. Cryptographers want conditional entropy, but for UUIDs (and other applications), what's needed is unconditional entropy. What examples are there of things with plenty of unconditional entropy but little conditional entropy? For many purposes, the system time has plenty of unconditional entropy because it is usually different when generating different UUIDs. However, system time tends to have little conditional entropy, because attackers can often guess with a good deal of accuracy when the time was sampled by their victim.

This was used to great effect in an attack on Netscape's PRNG, for example, where reliance on time as an entropy source caused a complete breach of security in the SSL implementation (see "Randomness and the Netscape Browser," by Ian Goldberg and David Wagner, DDJ, January 1996).

Forward Security

Paranoid security gurus sometimes worry about forward security. What they mean by this is that, if your state is compromised at some point, it may be desirable that the state does not stay compromised forever. That is, if attackers somehow manage to determine the current PRNG seed data, you would probably like the attackers not to be able to reconstruct all past and future randomness, since that would likely compromise all past and future security. Preventing this type of attack is known as "forward security." Forward security is easily achieved simply by reseeding the PRNG periodically or, better yet, mixing fresh entropy into the internal state. Mixing is better because it increases the total amount of entropy, so long as the mix is done well. Reseeding throws away the entropy from the previous seed.
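The mixing idea can be sketched as follows. This is an illustration only: a real implementation would pass the old state and the fresh entropy through a cryptographic hash (for example, new state = H(old state || fresh entropy)); the 64-bit finalizer below is a non-cryptographic stand-in of my own choosing, used so the shape of the operation is visible:

```c
#include <stdint.h>

/* Non-cryptographic 64-bit mixing function (illustrative only). */
static uint64_t mix64(uint64_t x) {
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

/* Mixing rather than reseeding: the new state depends on BOTH the
 * old state and the fresh entropy, so the old entropy is kept, and
 * an attacker who captured the old state loses track of it as soon
 * as entropy unknown to them is mixed in. */
static uint64_t mix_entropy(uint64_t state, uint64_t fresh_entropy) {
    return mix64(state ^ mix64(fresh_entropy));
}
```

Reseeding, by contrast, would be state = mix64(fresh_entropy), discarding whatever entropy the old state held.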

PRNG Seeding and Mixing

It is a really bad idea to seed or mix fewer than E bits of entropy into your PRNG. In the case of seeding, this exposes your PRNG to a brute-force attack. In the case of mixing, if there has been a total or partial state compromise, then a brute-force attack starting from the compromised state is possible, meaning the new state is then also compromised.
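To see why a small seed is fatal, consider a toy generator seeded with only 16 bits. The attack below (all constants illustrative, not from any real PRNG) recovers the seed by exhaustive search from just two consecutive outputs, after which every past and future output is known:

```c
#include <stdint.h>

/* Toy linear congruential generator. */
static uint32_t lcg_next(uint32_t *s) {
    *s = *s * 1103515245u + 12345u;
    return *s;
}

/* Brute-force a 16-bit seed from two consecutive outputs:
 * only 65536 candidates to try, which is instantaneous. With a
 * full E-bit (e.g. 128-bit) seed, this search is infeasible. */
static int recover_seed(uint32_t out1, uint32_t out2) {
    for (uint32_t seed = 0; seed < 0x10000; seed++) {
        uint32_t s = seed;
        if (lcg_next(&s) == out1 && lcg_next(&s) == out2)
            return (int)seed;
    }
    return -1;
}
```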

PRNG API

Discussing these conflicting requirements made me realize that there currently isn't a PRNG API that meets everyone's needs. That's why we developed the PRNG API (available at http://apr.apache.org/).

For a standard PRNG, you need E bits of conditional entropy for the seed; from then on, you're okay. The API therefore provides a family of calls:

- void manualforwardsecureprng(void *out, int nbytes) provides nbytes of output from such a PRNG.
- void prngbarrier(void) ensures that E bits of conditional entropy are mixed into the PRNG before any further randomness is extracted, for the points at which you want forward security.
- void forwardsecureprng(void *out, int nbytes) is for those too lazy to figure out when to call prngbarrier(); it calls the barrier periodically for you.
- void lazyforwardsecureprng(void *out, int nbytes) provides forward security, but only when sufficient entropy happens to be available to permit it. In general, this is the call that programmers who don't want to think too hard about what they are doing should use.
- void trueprng(void *out, int nbytes) is for very high-value randomness (for instance, the key the Bank of England uses to sign electronic pound notes), where you might want as many bits of practical entropy as you are expecting for output, or at least some proportion of the number of bits in the output, rather than merely E.
- void insecureprng(void *out, int nbytes) uses E bits of unconditional entropy. Yes, this is the one Subversion wants.
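As an illustration of the least demanding of these calls, here is a hypothetical stub of insecureprng() (my sketch, not the actual APR implementation), built on a fast non-cryptographic xorshift generator. Consecutive outputs never repeat within the generator's period, which gives the collision resistance that UUID generation needs, while being entirely predictable to anyone who learns the state:

```c
#include <stdint.h>
#include <string.h>

/* State would be seeded once with E bits of unconditional entropy;
 * the constant here is an arbitrary nonzero placeholder. */
static uint64_t insecure_state = 0x9e3779b97f4a7c15ULL;

/* Fill out[0..nbytes-1] from a xorshift64 stream: fast, never
 * blocks, collision resistant in practice, NOT unpredictable. */
static void insecureprng(void *out, int nbytes) {
    unsigned char *p = out;
    while (nbytes > 0) {
        insecure_state ^= insecure_state << 13;
        insecure_state ^= insecure_state >> 7;
        insecure_state ^= insecure_state << 17;
        int take = nbytes < 8 ? nbytes : 8;
        memcpy(p, &insecure_state, take);
        p += take;
        nbytes -= take;
    }
}
```

A caller generating the random portion of a UUID would simply request 16 bytes and rely on two successive requests never colliding.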

Although insecureprng() doesn't require a cryptographic PRNG, there's really no reason not to use one, since the problem with the traditional approach is the seeding, not the PRNG. Using one also has the advantage of permitting state sharing. This function would use entropy gathered differently from the others, since it can use sources of entropy normally considered too dangerous for cryptographic use, or can allow higher estimates of the total entropy from each source.

Error Handling

The API just discussed does nothing about errors. A real API should probably let an application (at least optionally) handle errors, though it is hard to see what can be done of value when randomness fails, other than produce better diagnostics (or apologize to users). There may also be an argument for producing versions of the APIs that allow timeouts, so that programs with UIs can inform users appropriately.

One thing that is widely provided (but seems ill advised) is a facility to return less randomness than was asked for if sufficient randomness is not available. Clearly, you should ask for what you need, and getting less must surely be a security or reliability risk. Of course, it is a general feature of this API that, if it can return any randomness at all, it can return a large amount of it.

Practical Considerations

In practice, one of the problems with PRNGs is that they tend to be implemented in libraries or applications; hence, each application instance consumes at least E bits of entropy. This is bad, because many systems are chronically short of entropy. It is much better if the PRNGs are shared across all applications, which either means building them into the operating system or having them in daemons—the latter leading to its own interesting security problems, of course.

Another question arises about how much state should be shared between the various APIs. If you assume the PRNG is secure, then this seems to be easily resolved—they can all share all the state, except insecureprng(), which requires less conditional entropy. Once there is sufficient entropy for the other APIs to start working, then even insecureprng() can share their state.

One thing to note when sharing state is that there is generally considered to be a "safe" amount of randomness that can be extracted from a PRNG before its state can be divined from the output, or at least some useful attack mounted on it. This amount is typically of the order of 2^(E/2) bits—a very large number. Nevertheless, implementations should ensure that state is not shared between secure and insecure PRNGs once this limit has been exceeded, until new entropy has been mixed in. What this would mean in practice is that a point could (at least in theory) be reached where insecureprng() splits its state from the other PRNGs, and continues to produce output, while the other PRNGs block, waiting for more entropy.

Another pitfall that can occur if the PRNGs are not implemented in a daemon or device driver is that, if a process forks, then the new process produces exactly the same output as the old one. There's not really a good solution to this problem, but if you must implement in-process, then check for forks and mix new entropy after they occur.
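The fork check might look like the following in an in-process implementation (an assumption on my part, not APR code): record the PID at seeding time and, before producing any output, compare it with the current PID.

```c
#include <sys/types.h>
#include <unistd.h>

/* PID recorded when the PRNG was seeded; 0 means "never seeded". */
static pid_t prng_pid = 0;

/* Returns 1 if the process has forked since the PID was last
 * recorded (or was never recorded), in which case the caller must
 * mix fresh entropy before emitting output; otherwise parent and
 * child would produce identical streams. Returns 0 if no fork. */
static int prng_forked(void) {
    pid_t now = getpid();
    if (now != prng_pid) {
        prng_pid = now;
        return 1;
    }
    return 0;
}
```

Note that this detects only PID changes; it does not cover exotic cases such as PID reuse, which is one reason the daemon or kernel approach is preferable.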

Finally, the data used to seed PRNGs should not include any data that should be kept private. This is because a dictionary attack on the private data coupled with the (now insufficient) other seed data could reveal the private data. It is worth noting that serial numbers or MAC addresses are often considered private because they can be used to correlate otherwise unrelated data. This point particularly applies to the insecure PRNG, because other data used for entropy may be easily guessable, making the dictionary attack much easier.

Acknowledgments

Thanks to David Wagner for pointing out major flaws in my original ideas on this subject and providing the basis of a substantially simplified API. Robert T. Johnson and Tina Bird also provided useful comments on early versions of this paper.

DDJ