Quickly generating billion-record synthetic databases

TLDR

Evaluating database systems requires synthetic databases, and comparing designs often requires several of them. As sizes grow to terabytes, generating these databases often takes longer than evaluating them. This paper presents several generation techniques: parallelism, congruential generators, special-case discrete logarithms for building indices, and modifications of the generators that yield non-uniform distributions. The discussion is framed around C programs that generate billion-record SQL databases on a shared-nothing system of a hundred processors and a thousand disks.

Abstract

Evaluating database system performance often requires generating synthetic databases—ones having certain statistical properties but filled with dummy information. When evaluating different database designs, it is often necessary to generate several databases and evaluate each design. As database sizes grow to terabytes, generation often takes longer than evaluation. This paper presents several database generation techniques. In particular it discusses: (1) Parallelism to get generation speedup and scaleup. (2) Congruential generators to get dense unique uniform distributions. (3) Special-case discrete logarithms to generate indices concurrent to the base table generation. (4) Modification of (2) to get exponential, normal, and self-similar distributions. The discussion is in terms of generating billion-record SQL databases using C programs running on a shared-nothing computer system consisting of a hundred processors, with a thousand discs. The ideas apply to smaller databases, but large databases present the more difficult problems.
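
Technique (2) can be illustrated with a minimal sketch. It assumes a multiplicative congruential generator of the form x -> (g * x) mod p, where p is a prime larger than the record count n and g is a primitive root modulo p; values above n are skipped, so each key in 1..n appears exactly once. The function names, the toy constants (n = 8, p = 11, g = 2), and the 64-bit overflow assumption are illustrative choices, not details taken from the abstract.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: advance x -> (g * x) mod p, skipping values above n.
 * Assumes p is prime, g is a primitive root mod p, and g * x fits in 64 bits. */
static uint64_t next_key(uint64_t x, uint64_t g, uint64_t p, uint64_t n)
{
    do {
        x = (g * x) % p;
    } while (x > n);            /* discard values outside 1..n */
    return x;
}

/* Hypothetical helper: fill out[0..n-1] with a permutation of 1..n.
 * Because g generates every nonzero residue mod p, the walk visits each
 * value in 1..n exactly once before returning to the seed. */
static void fill_unique_keys(uint64_t *out, uint64_t n,
                             uint64_t g, uint64_t p, uint64_t seed)
{
    assert(n >= 1 && n < p);
    assert(seed >= 1 && seed <= n);

    uint64_t x = seed;
    for (uint64_t i = 0; i < n; i++) {
        x = next_key(x, g, p, n);
        out[i] = x;
    }
}
```

Calling `fill_unique_keys(keys, 8, 2, 11, 1)` fills `keys` with 2, 4, 8, 5, 7, 3, 6, 1: the values 10 and 9 are produced by the generator but skipped because they exceed n. The generator needs only its current value as state, so under the same assumptions each worker in a parallel run could start from a different seed and produce its own slice of the sequence.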
