April 27, 2006

The problem with random number generators

Author: Sam 'RavidgeMole' Bisbee

It is a commonly accepted fact that computers by themselves cannot generate truly random numbers, and so most software relies on pseudo random numbers. This means that encryption and other applied uses of "random" numbers may not be as secure as users think. However, it is possible to expand the boundaries of generating random numbers using a computer. The most common way to do that is to use indirect input, such as the system clock. Other methods rely on direct human input.


Google provides an interesting approach to generating random numbers through indirect input. Since the Internet is constantly changing, with Web pages and domains being continuously created, modified, and deleted, databases such as Google provide an extremely large dataset from which to seed a random number generator. In this context, a seed is a number input as a starting point for the generator's algorithms.

Many patterns in the data could be used to generate a seed, but they must vary enough to produce true random numbers and the generation must be simple. Since Google's GUI is too uniform (the best seed is constantly changing), it is preferable to find a common denominator within the page results that the search engine returns, from which to generate the seed.

Every page of Google query results is different, as Google filters out duplicate pages, thereby providing a dataset of unique elements. Most Web pages contain text, and this text differs from page to page, so pages returned by Google should be an excellent source for extracting seeds. If this assumption is incorrect, then stepping outside of the algorithm would provide an even more random seed. One numeric pattern found in text is how many words there are. By executing a word count on the text of each page resulting from a Google search, you can theoretically produce a random seed.

However, Google does not return randomly chosen result pages, but instead produces search results based on its Page Rank equation. If the same query were to be submitted twice, it would most likely produce the same order of pages, which would make our generated seeds too predictable. A better algorithm might look at the lowest page of results and/or search for different terms on each execution.

To further avoid predictability, the algorithm should only make unique queries, which would require a queue of search terms. Extracting one word from each of the result pages and appending it to a word list easily creates this.

There are several problems with this method. First, more commonly used words will populate the list in larger numbers and are likely to be about the same topics. Second, storing the queue on the computer would make it simpler to calculate what number will be generated next (a large implementation flaw). Third, dependency on the Internet would prevent non-connected devices from generating random numbers, and the speed with which connected devices could generate results would be dependent on their Internet connections (to compensate for slow page hosts, a timeout mechanism could be employed).

These flaws mean that implementation is more of a neat idea than a feasible tool. The Google Random Number Generator that I wrote provides a simple example of this method of generating random numbers.

User input

Users have to type to interact with most computers. Typing provides a consistent medium from which to generate seeds, if the seed generator were to calculate the time difference between keystrokes and store it as the seed. Since every user must log into the system, logins provide a perfect choke point to generate seeds.

The process could easily recalculate a seed each time a user logs on (by replacing the old seed, adding the two seeds, or some other method). This would also allow the seed to be generated before the random number generator requires it. Similar systems are already in use, though they often combine other pseudo random sources such as the system clock.

Then what?

Once you've generated a seed, by whatever means, you must store it somewhere on the system so that it can be called upon to initiate the random number generator (seeds are not generated at the same time random numbers are). It would not be sound to store the seed on any data storage device, including removable media, as other users could access the seed, allowing them to predict the random number that is generated. Therefore, the seed should be stored in protected memory.

In order to access the protected memory and write the seed to it, a seed generator must run with system level access, which is easily done by writing a Linux or BSD system kernel module. Once the kernel has stored the seed, it's a simple matter for a client program to request the seed from the kernel and then pass it to the random number generator, which will run the seed through its algorithm.

If one cannot use a kernel module due to insufficient access or the inability to write a module, one could do the same job with a constantly running process (service) that acts like a server, whose sole purpose is storing and returning the seed when it is called by another program. While running with a UID of 0, the client would use interprocess communication mechanisms to request the seed from the seed service. The seed service could be started at system boot, most likely through init.

Final thoughts

Pseudo random numbers may be good enough for some purposes, but applications that need to be secure, like SSH or GnuPG, must have truly random seeds.

It would be a simple matter to employ one of the above algorithms, or some other one, to generate truly random seeds in the operating system. Even establishing a standard hardware device to assist in the generation of random numbers would be preferable to the pseudo random mechanisms in place today. While a hardware approach would likely be less cost-effective for both the computer manufacturer and consumer, most of the computer friendly user market would likely accept the tradeoff.

Click Here!