High-Speed Cryptography with the RSA Algorithm

Dr. Dobb's Journal February 2000

By Michael J. Wiener

Michael is a senior cryptologist with Entrust Technologies. He can be reached at wiener@entrust.com.

In 1978, Ronald Rivest, Adi Shamir, and Leonard Adleman published the RSA public-key cryptosystem (see "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems," by R. Rivest, A. Shamir, and L. Adleman, Communications of the ACM, February 1978). Since then, public-key cryptography has become a critical technology for confidentiality and trust in messaging and online transactions on the Internet. There are many choices for symmetric cryptosystems, but only a few public-key algorithms are available. The two main contenders right now are RSA and elliptic-curve cryptography. Here, we focus on implementing RSA.

RSA can be implemented in hardware, but great performance can also be achieved in software running on general-purpose processors. In fact, many (but not all) hardware implementations of RSA are actually much slower than RSA on a PC. In this article, I'll explain some of the key optimizations (with source code examples) that can be made to make RSA as fast as possible.

The speed of RSA varies greatly from one implementation to the next. It is not unusual to see RSA code that is 100 times slower than the best available for a given platform. There are many tricks that boost performance by a few percent, but the most important ones make the code up to four times faster. Combining these ideas can make a speed difference of one to two orders of magnitude. Here, I'll discuss optimizations that apply to the underlying large integer math, the encryption and decryption operations, and key generation. I'll also give RSA performance figures for what can be achieved on an Intel PC.

RSA Math Background

Creating an RSA key pair begins with selecting two large prime numbers p and q at random. A typical size for each of these primes is 512 bits. The product of these primes is called the RSA modulus n=p×q. To say that we are using 1024-bit RSA means that the RSA modulus is 1024 bits long.

Next, a public exponent e is chosen. A common choice is e=2^16+1=65537. Then the private exponent d is computed such that ed-1 is divisible by both p-1 and q-1. This makes e and d inverses of each other, in a sense to be discussed shortly.

To encrypt a message (which is often a symmetric key used to encrypt the real message), it must first be encoded as an integer m between 0 and n-1. The ciphertext is computed with a modular exponentiation operation: c=m^e mod n. The special form of the aforementioned private exponent d makes it possible to recover the message from the ciphertext: m=c^d mod n.

The quantities e and n (but not p and q) can be made public, allowing anyone to encrypt a message, but only the owner of the private exponent d can decrypt the message. For attackers to find d, they must factor n into the primes p and q. For a 1024-bit modulus n, this is infeasible using the best factoring method known, the general number field sieve (see The Development of the Number Field Sieve, Lecture Notes in Math. 1554, edited by A. Lenstra and H. Lenstra, Jr., Springer-Verlag, 1993).

If the goal is to compute a digital signature on a message (or hash of a message) rather than to encrypt it, the owner of the private exponent computes s=m^d mod n. Then anyone can recover m and verify that the signature must have been created by the owner of d by computing m=s^e mod n.
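These operations can be exercised end to end with toy-sized numbers. The sketch below (my own illustration, with helper names of my choosing) uses 64-bit integers and the textbook parameters p=61, q=53, n=3233, e=17, d=2753; real RSA needs 512-bit primes, a big-integer library, and proper message padding.

```cpp
#include <cstdint>

// Modular exponentiation by repeated squaring. Toy sketch: assumes
// n is small enough that intermediate products fit in 64 bits.
uint64_t mod_pow(uint64_t base, uint64_t exp, uint64_t n)
{
    uint64_t result = 1;
    base %= n;
    for (; exp > 0; exp >>= 1) {
        if (exp & 1)
            result = result * base % n;  // multiply for each set bit
        base = base * base % n;          // square for each bit
    }
    return result;
}

// Toy key pair: p=61, q=53, so n=3233; e=17 and d=2753 satisfy
// e*d = 1 (mod lcm(p-1, q-1)).
uint64_t rsa_encrypt(uint64_t m) { return mod_pow(m, 17, 3233); }
uint64_t rsa_decrypt(uint64_t c) { return mod_pow(c, 2753, 3233); }
```

Encrypting m=65 gives c=2790, and decrypting 2790 returns 65; signing and verifying are the same two exponentiations applied in the opposite order.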

For a more complete mathematical description of RSA, see Handbook of Applied Cryptography, by A. Menezes, P. van Oorschot, and S. Vanstone (CRC Press, 1997).

Targets of Optimization

The main target of optimization is the modular exponentiation operation that consists of raising one number to some power modulo a third number: m^x mod n. A naive approach is to start with m, multiply by m x-1 times, divide by n, and take the remainder. This would be fine for very small numbers, but for RSA the private exponent d is very large, and doing d-1 multiplies is completely infeasible. A faster way can be illustrated using the exponent 19: m^19=(((m^2)^2)^2×m)^2×m.

Instead of 18 multiplies, you need only four squares and two multiplies. The next problem is that m^x can be extremely large. This is dealt with by dividing by n and taking the remainder after every square and multiply instead of waiting until the end.
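To make the exponent-19 example concrete, here is the same chain with a reduction after every step (a toy sketch with 64-bit integers standing in for big integers; the function name is mine):

```cpp
#include <cstdint>

// m^19 mod n via the chain (((m^2)^2)^2 * m)^2 * m:
// four squares and two multiplies, reducing mod n after each
// step so intermediate values never grow beyond n^2.
uint64_t pow19_mod(uint64_t m, uint64_t n)
{
    uint64_t r = m % n;
    r = r * r % n;  // m^2
    r = r * r % n;  // m^4
    r = r * r % n;  // m^8
    r = r * m % n;  // m^9
    r = r * r % n;  // m^18
    r = r * m % n;  // m^19
    return r;
}
```

Six modular operations instead of 18 multiplies of ever-growing numbers; for a full-size private exponent, this idea is what makes decryption feasible at all.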

Of the five further RSA optimizations described in the following, four apply to modular exponentiation and one is specific to prime number generation.

Large Integer Representation

A frequent operation in modular exponentiation is multiplying two integers. However, these integers are generally much larger than the maximum integer size supported by compilers. Some strategy is needed for breaking the problem down.

To multiply large numbers, children are taught to break the problem down into many products of single-digit numbers and to add all the partial products. The same can be done with computers, except that you use bytes or words instead of digits; see Figure 1(a). Each large integer is represented as an array of words. The words are small enough that you can easily compute the product of two words. To multiply two large integers, consisting of three words each, would require 3×3=9 small products to be computed and summed appropriately to produce the final result consisting of six words (r0 to r5).
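A minimal sketch of this word-by-word multiplication (names mine), using 16-bit words so that every small product fits in an unsigned 32-bit integer:

```cpp
#include <cstdint>
#include <cstddef>

// Schoolbook multiply of two little-endian arrays of 16-bit words.
// r must have room for 2*len words. Each a[i]*b[j] is a "small
// product"; the carries propagate just as in pencil-and-paper math.
void multiply_words(const uint16_t *a, const uint16_t *b,
                    uint16_t *r, size_t len)
{
    for (size_t i = 0; i < 2 * len; i++)
        r[i] = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t carry = 0;
        for (size_t j = 0; j < len; j++) {
            uint32_t t = (uint32_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint16_t)t;   // low 16 bits stay in place
            carry = t >> 16;          // high 16 bits carry upward
        }
        r[i + len] = (uint16_t)carry;
    }
}
```

For len=3 this performs exactly the 3×3=9 small products described above, producing the six result words r0 to r5.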

For C/C++ programmers, it is tempting to use 16-bit words so that the small products are 32 bits long and fit in an unsigned long int. However, most 32-bit machines have an instruction that takes two 32-bit unsigned integers and computes the 64-bit result. The problem is that for many compilers, the upper half of the result is not accessible. To get the full 64-bit product requires writing a small amount of assembly code. Example 1 shows inline assembly for Microsoft's Visual C++ on an Intel platform.

Using assembly code here is well worth it. Going from a large integer representation based on 16-bit words to 32-bit words cuts down the number of small products by a factor of four. This is possible on most 32-bit processors. Depending on the particular platform, this can save between a factor of two and four in the run time of all RSA operations. This is the only place in the RSA software where dramatic speed improvement is possible using assembly code. Intel's Merced chip promises further improvement by offering an integer multiply and add instruction that takes two 64-bit unsigned integers and produces the full 128-bit product (see "IA-64 Application Developer's Architecture Guide," May 1999, http://developer.intel.com/design/ia64/architecture.htm).
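Example 1 is compiler- and platform-specific inline assembly. On compilers that do expose a 64-bit integer type (standard since C99/C++11 as uint64_t, and available as a vendor extension in many compilers of this era), the full product can be written portably; this sketch is mine, not Example 1 itself:

```cpp
#include <cstdint>

// Full 32x32 -> 64-bit product. hi receives the upper half that
// plain 32-bit C arithmetic would silently discard; the low half
// is returned.
uint32_t mul32_full(uint32_t a, uint32_t b, uint32_t &hi)
{
    uint64_t p = (uint64_t)a * b;   // compiler emits one MUL on x86
    hi = (uint32_t)(p >> 32);
    return (uint32_t)p;
}
```

A good compiler turns this into the same single multiply instruction that the inline assembly uses, so the portable form costs nothing where a 64-bit type exists.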

Faster Squaring

The method used for modular exponentiation involves both multiplying and squaring large integers. The squaring can be done simply by using the multiply function with the arguments equal, but there is a faster way. In Figure 1(b), you see that three of the small products appear twice (for instance, a×b=b×a). Instead of computing each one twice, you can compute them once and add them twice to the final result. This does not save much for the small example in the figure, but for the larger numbers used in RSA, this can make squaring almost twice as fast as multiplication.

Divide and Conquer

To decrypt or digitally sign, the owner of the private exponent computes r=y^d mod n, where y is either a ciphertext to be decrypted or a message (or hash of a message) to be signed, and r is the resulting plaintext message or digital signature. A clever observation is that because n=p×q, the Chinese Remainder Theorem says that you can compute r mod p and r mod q, and then combine them to get r mod n (see J.J. Quisquater and C. Couvreur's "Fast Decipherment Algorithm for RSA Public-Key Cryptosystem," Electronics Letters, October 1982).

To take advantage of this performance improvement, a 1024-bit RSA private key is not really stored simply as two 1024-bit values (d,n) as described earlier, but is stored as five 512-bit values (dp,dq,p,q,k). Three of these values require explanation. When a new RSA key pair is created, these values are computed as follows: dp=d mod (p-1), dq=d mod (q-1), and k is computed such that kq-1 is divisible by p. This quantity k is used in the final stage of constructing r. The process of computing r=y^d mod n begins with computing rp=y^dp mod p and rq=y^dq mod q. These are then combined to form r as follows:

r=((rp+p-(rq mod p))×k mod p)×q+rq

The added p before subtracting rq is there simply to avoid having to deal with negative numbers. It is more efficient to deal with only unsigned quantities. Normally, the owner of the private key also stores the public exponent e and possibly n (or he may choose to compute n=p×q when necessary to save space).

It is natural to ask at this point whether all this is really worth the trouble. The short answer is yes. The time required to perform a modular exponentiation is proportional to the cube of the size of the numbers involved. This means that each of the half-size exponentiations mod p and mod q take one-eighth as long as a big one mod n. The time required to do the final step of combining partial results is small by comparison. The overall improvement is almost a factor of four. Computing the extra information for a private key (dp,dq,p,q,k) is negligible and happens only at the time of key generation.
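With toy values standing in for 512-bit ones: p=11, q=13 (n=143), e=7, d=43 gives dp=3, dq=7, and k=6 (6×13-1=77 is divisible by 11). The sketch below (helper names mine) decrypts with two half-size exponentiations plus the recombination step:

```cpp
#include <cstdint>

// Modular exponentiation helper (toy sizes; products fit in 64 bits).
static uint64_t mod_pow(uint64_t b, uint64_t e, uint64_t n)
{
    uint64_t r = 1;
    b %= n;
    for (; e; e >>= 1) {
        if (e & 1) r = r * b % n;
        b = b * b % n;
    }
    return r;
}

// CRT decryption: dp = d mod (p-1), dq = d mod (q-1), and k is the
// inverse of q mod p. Two half-size exponentiations, then recombine:
//   r = ((rp + p - (rq mod p)) * k mod p) * q + rq
uint64_t crt_decrypt(uint64_t c, uint64_t p, uint64_t q,
                     uint64_t dp, uint64_t dq, uint64_t k)
{
    uint64_t rp = mod_pow(c, dp, p);
    uint64_t rq = mod_pow(c, dq, q);
    return (rp + p - rq % p) * k % p * q + rq;
}
```

For the toy key, the ciphertext 81 (which is 42^7 mod 143) decrypts back to 42, matching a direct computation of 81^43 mod 143.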

Sliding Window Method

I'll now examine a way to reduce the number of multiplies required to carry out a modular exponentiation. In the example earlier, we showed how to efficiently compute m^19 by taking m, squaring it three times (to make m^8), multiplying by m, squaring, and finally multiplying by m again to get m^19. Looking at the binary representation of 19 (10011) suggests a general approach that works for any exponent. First, find the most significant set bit in the exponent and set the current result to be m. Then, for each successively less significant bit of the exponent, square the current result and if the exponent bit was set, multiply the result by m. This is a fairly simple approach that requires a square for each exponent bit (after the first one) and a multiply for each set exponent bit (after the first one).
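In code, this bit-scanning loop looks like the following (a 64-bit sketch with a name of my choosing; an RSA implementation would run the same loop over big integers):

```cpp
#include <cstdint>

// Left-to-right binary exponentiation: after the leading set bit,
// one square per exponent bit and one extra multiply per set bit.
// Toy sketch: assumes e >= 1 and n small enough that products
// fit in 64 bits.
uint64_t mod_pow_ltr(uint64_t m, uint64_t e, uint64_t n)
{
    int top = 63;
    while (top > 0 && !((e >> top) & 1))
        top--;                            // most significant set bit
    m %= n;
    uint64_t result = m;                  // current result starts at m
    for (int i = top - 1; i >= 0; i--) {
        result = result * result % n;     // square for every bit
        if ((e >> i) & 1)
            result = result * m % n;      // multiply when bit is set
    }
    return result;
}
```

For e=19 (binary 10011) this performs exactly the four squares and two multiplies of the earlier example.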

The number of squares cannot be reduced much, but the number of multiplies required by the approach just described can be reduced significantly for large exponents using the sliding window method. This method deals with several exponent bits at a time instead of handling them one by one. You begin by selecting a window size w. For RSA-1024, w=6 is most efficient. Modular exponentiation proceeds in three stages described in the following steps:

1. Table computation. Compute all the odd powers of m up to m^(2^w-1) using a square and 2^(w-1)-1 multiplies. For w=6, this means computing m^1, m^3, ..., m^63 and storing them in a table.

2. Initialize exponent scanning. Next, we take the top window of w bits of the exponent and compute that power of m and assign it to the current result. If the w exponent bits form an odd value, then the appropriate table entry can be assigned to the current result. If the w exponent bits form an even value, then we can form that power of m by squaring a table entry or multiplying two entries and assigning the product to the current result. In Example 2, the initial exponent bits are 110, meaning that we have to initialize the result to m^6 (by squaring m^3 from the table).

3. Main Loop. Repeat the following until the exponent is used up. Starting from the end of the last window processed, scan for the next set bit in the exponent and square the result for each zero bit encountered along the way. When a "1" bit is found, take the next window of w exponent bits (including the first set bit) if there are that many. Otherwise, just take as many exponent bits as are left. Suppose that this string of exponent bits ends in z zero bits and begins with y bits that form an odd value. Then square the result y times, multiply by the table entry corresponding to the y bits, and finally square the result z times. This trick eliminates the need to store even powers in the table. In Example 2, after setting the result to m^6, you have two zero bits each requiring a squaring, and then a window of 110 that ends in z=1 zero bits and whose initial y=2 bits form the odd value 3. This means that you square twice, multiply by m^3, and then square once more.

Listing One implements the sliding window method. For 1024-bit RSA, this method does not significantly affect the number of squaring operations required, but it reduces the number of multiplies by more than a factor of two.

Sieving

A common way to generate the prime numbers needed for an RSA key pair is to generate random numbers of the desired size and perform a primality test on each one until a prime number is found. The most efficient primality tests are probabilistic; they can say that a number might be prime with some probability or that it is definitely not prime. If a given number passes the test enough times, then the probability of it not being prime is so low that we can declare it a highly probable prime and use it for RSA. A good probabilistic primality test is the Miller-Rabin test (again, Handbook of Applied Cryptography, by A. Menezes, P. van Oorschot, and S. Vanstone, CRC Press, 1997).

A first thought on this approach to generating prime numbers is that it makes no sense to test whether an even number is prime. If you generate only large odd candidates to test, the process should run twice as fast. Taking this a step further, you could check whether the candidate is divisible by 3 or 5 before running the Miller-Rabin test. This can be extended to checking that the candidate is not divisible by any prime below some bound. For a bound of 4096, you save a factor of 15 in the number of primality tests performed. However, the cost of checking each candidate for divisibility by the primes up to the bound can become expensive. Much of this cost can be eliminated using a technique called "sieving," where you seek the first prime number after a random starting candidate by efficiently eliminating multiples of small primes.

An efficient approach based on sieving begins with generating a random odd candidate c. Create an array of Boolean flags, where element 0 corresponds to c, element 1 corresponds to c+2, and in general, element i corresponds to c+2i. Initially, set all flags to 1 indicating that the corresponding integer c+2i might be prime. For each small prime s less than the bound, find the smallest value i such that c+2i is divisible by s and turn off flags i, i+s, i+2s, up to the end of the array. This eliminates all multiples of s as prime candidates. After this process is done for all small primes less than the bound, the flags that are still on correspond to possible primes that can be tested with the Miller-Rabin test. If none of the remaining candidates are prime, you can repeat with c+2l in place of c, where l is the length of the array. Listing Two implements sieving.

Performance

The performance figures in Table 1 are based on the RSA implementation by Entrust. Additional information is available at either http://developer.entrust.com/ or http://developer.entrust.ch/. The run times are based on a 400-MHz Pentium II, a public exponent of e=65537, and do not include overhead such as hashing a message.

For the currently recommended RSA key size of 1024 bits, the operations of encrypting, decrypting, signing, and verifying signatures are very fast. Even key pair generation, which is not required frequently, takes less than a second. Digital certificates, which enforce the usable lifetime of a key pair, most frequently specify lifetimes of at least a year, and so a fraction of a second for key pair generation is a small overhead.

But what if you are forced to use larger key sizes in the future because attackers will have faster computers as Moore's law marches on? RSA-2048 is about a billion times stronger than the already strong RSA-1024, and even today's RSA-2048 operations are quite fast. If attackers have faster computers in the future, then so will users who are protecting their data. Moore's law favors the cryptographer, not the cryptanalyst.

Conclusion

When RSA was first introduced, it ran very slowly on the computers that existed at the time. But, the continuous drive to faster and faster computers has brought us to a point where RSA is very fast and practical to use. As computers become faster still, attackers will have more power for trying to break RSA, but this can be countered easily by increasing key sizes. Doubling the key size to 2048 bits costs the cryptographer less than a factor of eight in RSA signing time, but costs the cryptanalyst about a billion times more work to attack RSA. This makes it quite easy for the cryptographer to stay ahead.

DDJ

Listing One

// To use this code, a BigInt class to handle big integers
// is needed along with the following 5 functions.

// set up an exponent to point to the top set bit for next_exponent_bit()
void init_exponent(BigInt exponent);

// get next exponent bit
unsigned long next_exponent_bit();

// return TRUE if exponent used up
int exponent_finished();

// result = result*result % modulus
void modular_square(BigInt &result, BigInt modulus);

// result = result*x % modulus
void modular_multiply(BigInt &result, BigInt x, BigInt modulus);

#define WINDOW 6
#define TABLE_LEN 32
BigInt table[TABLE_LEN];

void fill_table(BigInt message, BigInt modulus)
{
    BigInt sq;
    long i;
    sq = message;
    modular_square(sq, modulus);
    table[0] = message;
    for (i = 1; i < TABLE_LEN; i++)
    {
        table[i] = table[i - 1];
        modular_multiply(table[i], sq, modulus);
    }
}
BigInt modular_exponentiation(BigInt message, BigInt exponent, BigInt modulus)
{
    BigInt result;
    long started, num_pending_bits, num_zero_bits, i;
    unsigned long pending_bits, bit;

    fill_table(message, modulus);
    init_exponent(exponent);
    if (exponent_finished())
       return 1;
    started = 0;
    num_pending_bits = 0;
    pending_bits = 0;
    do
    {
        while (!exponent_finished() && ((bit = next_exponent_bit()) == 0))
            modular_square(result, modulus);
        num_pending_bits = pending_bits = bit;
        while (!exponent_finished() && (num_pending_bits < WINDOW))
        {
            pending_bits = (pending_bits << 1) + next_exponent_bit();
            num_pending_bits++;
        }
        if (num_pending_bits > 0)
        {
            if (!started)
            {
                if (pending_bits & 1)
                    result = table[pending_bits >> 1];
                else if (pending_bits & 2)
                {
                    result = table[pending_bits >> 2];
                    modular_square(result, modulus);
                }
                else
                {
                    result = table[(pending_bits-1) >> 1];
                    modular_multiply(result, message, modulus);
                }
                started = 1;
            }
            else
            {
                for (num_zero_bits = 0; !(pending_bits & 1); num_zero_bits++)
                    pending_bits >>= 1;
                for (i = num_zero_bits; i < num_pending_bits; i++)
                    modular_square(result, modulus);
                modular_multiply(result, table[pending_bits >> 1], modulus);
                for (i = 0; i < num_zero_bits; i++)
                    modular_square(result, modulus);
            }
        }
    } 
    while (!exponent_finished());
    return result;
}


Listing Two

// To use this code, a BigInt class to handle big integers
// is needed along with the following function.

// return true if candidate is a highly-probable prime
int miller_rabin_test(BigInt candidate);

#define SMALL_PRIME_BOUND_DIV2 2048
#define SQRT_BOUND 64
char small_prime_flags[SMALL_PRIME_BOUND_DIV2];

// sieve to find small primes
// small_prime_flags[i] == 1 means 2*i+1 is prime
void generate_small_primes()
{
    unsigned long i, sieve_val;
    small_prime_flags[0] = 0; // 1 is not prime
    for (i = 1; i < SMALL_PRIME_BOUND_DIV2; i++)
        small_prime_flags[i] = 1;
    // for each odd number, throw out its multiples
    for (sieve_val = 3; sieve_val <= SQRT_BOUND; sieve_val += 2)
        for (i = sieve_val + (sieve_val >> 1); i < SMALL_PRIME_BOUND_DIV2; 
              i += sieve_val)
            small_prime_flags[i] = 0;
}
#define SIEVE_LEN 2048
BigInt generate_large_prime(BigInt start)
{
    unsigned long small_prime, i, sp;
    BigInt candidate;  // prime candidate must be a big integer
    char sieve_array[SIEVE_LEN];
    generate_small_primes();
    start |= 1;  // force starting point odd
    for (;; start += 2*SIEVE_LEN)
    {
        for (i = 0; i < SIEVE_LEN; i++)
            sieve_array[i] = 1;
        for (sp = 0; sp < SMALL_PRIME_BOUND_DIV2; sp++)
            if (small_prime_flags[sp])
        {
            small_prime = 2*sp + 1;  // next prime to sieve with
            // magic to find i such that small_prime divides start+2*i
            i = (small_prime - 1) - ((start - 1) % small_prime);
            if (i & 1)
                i += small_prime;
            i /= 2;
            // remove multiples of small_prime
            for (; i < SIEVE_LEN; i += small_prime)
                sieve_array[i] = 0;
        }
        // test primality of remaining candidates
        for (i = 0; i < SIEVE_LEN; i++)
            if (sieve_array[i])
        {
            candidate = start + 2*i;
            if (miller_rabin_test(candidate))
                return candidate;
       }
    }
}
