Publication | Closed Access
Quickly generating billion-record synthetic databases
Citations: 349
References: 12
Year: 1994
Unknown Venue
Engineering, Probabilistic Databases, Data Generation, Database Benchmarking, Data Science, Management, Data Integration, Billion-record Synthetic Databases, Parallel Computing, Database Construction, Data Management, Base Table Generation, Parallel Database, Data Modeling, Very Large Database, Knowledge Discovery, Computer Science, Database Technology, Query Optimization, Relational Queries, Synthetic Data, Database System Performance, Generation Speedup, Big Data
Synthetic databases are needed to evaluate database systems, and comparing designs often requires generating several of them. As sizes grow to terabytes, generation often takes longer than evaluation. The authors describe C programs that use parallelism, congruential generators, special-case discrete logarithms for index creation, and distribution-modifying techniques to generate billion-record SQL databases on a shared-nothing system of 100 processors and 1,000 disks.
Evaluating database system performance often requires generating synthetic databases, ones having certain statistical properties but filled with dummy information. When evaluating different database designs, it is often necessary to generate several databases and evaluate each design. As database sizes grow to terabytes, generation often takes longer than evaluation. This paper presents several database generation techniques. In particular, it discusses: (1) Parallelism to get generation speedup and scaleup. (2) Congruential generators to get dense unique uniform distributions. (3) Special-case discrete logarithms to generate indices concurrently with the base table generation. (4) Modification of (2) to get exponential, normal, and self-similar distributions. The discussion is in terms of generating billion-record SQL databases using C programs running on a shared-nothing computer system consisting of a hundred processors, with a thousand discs. The ideas apply to smaller databases, but large databases present the more difficult problems.
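Technique (2) can be illustrated with a minimal sketch. It assumes a multiplicative congruential generator modulo a prime, which is one common way to obtain a dense, unique sequence; the abstract does not give the authors' code or constants. The names `N`, `P`, `G`, `next_unique`, and `fill_permutation` are hypothetical, and the constants are tiny illustrative values rather than billion-record ones. Because `G` is a primitive root modulo the prime `P`, repeated multiplication visits every value in 1..P-1 exactly once per cycle, so discarding values above `N` leaves each key in 1..N exactly once, in a scrambled order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustrative constants (not from the paper):
   P is a prime just above N, and G is a primitive root modulo P,
   so the powers of G cycle through every value in 1..P-1. */
enum { N = 8, P = 11, G = 2 };

/* Advance the generator from state x, skipping values outside 1..N. */
static uint64_t next_unique(uint64_t x)
{
    assert(x >= 1 && x < P);      /* state must be a nonzero residue */
    do {
        x = (x * G) % P;          /* multiplicative congruential step */
    } while (x > N);              /* discard values above the key range */
    return x;
}

/* Fill out[0..N-1] with a permutation of 1..N, starting from state 1. */
static void fill_permutation(uint64_t *out)
{
    uint64_t x = 1;
    for (int i = 0; i < N; i++) {
        x = next_unique(x);
        out[i] = x;
    }
}
```

Calling `fill_permutation` on an array of `N` elements yields the keys 2, 4, 8, 5, 7, 3, 6, 1 for these constants: every value from 1 to 8 appears once, with no duplicate check and no stored state beyond the current value. At realistic sizes the product `x * G` must not overflow 64 bits, so `P` and `G` would need to be chosen with that limit in mind.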