tpchgen-rs

Blazing fast TPC-H benchmark data generator, in pure Rust with zero dependencies.

Features

Blazing Speed 🚀
Obsessively Tested 📋
Fully parallel, streaming, constant memory usage 🧠

Try it now!

Install Using Python

Install this tool with Python:

pip install tpchgen-cli

# create Scale Factor 10 (3.6GB, 8 files, 60M rows in lineitem) in 5 seconds on a modern laptop
tpchgen-cli -s 10 --format=parquet

Install Using Rust

Install Rust and this tool:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
cargo install tpchgen-cli

# create Scale Factor 10 (3.6GB, 8 files, 60M rows in lineitem) in 5 seconds on a modern laptop
tpchgen-cli -s 10 --format=parquet

Or watch this awesome demo recorded by @alamb and read the companion blog post on the DataFusion blog.

Examples

# Create a scale factor 10 dataset in the native table format.
tpchgen-cli -s 10 --output-dir sf10

# Create a scale factor 1 dataset in Parquet format.
tpchgen-cli -s 1 --output-dir sf1-parquet --format=parquet

# Create a scale factor 1 (default) partitioned dataset for the region, nation, orders
# and customer tables.
tpchgen-cli --tables region,nation,orders,customer --output-dir sf1-partitioned --parts 10 --part 2

# Create a scale factor 1 dataset partitioned into separate folders.
#
# Each folder will hold a single partition of rows; the partition size depends on the scale
# factor. For tables that have fewer rows than the minimum partition size, like "nation" or
# "region", the generator will produce the same file in each part.
#
# $ md5sum part-*/{nation,region}.tbl
# 2f588e0b7fa72939b498c2abecd9fbbe  part-1/nation.tbl
# 2f588e0b7fa72939b498c2abecd9fbbe  part-2/nation.tbl
# c235841b00d29ad4f817771fcc851207  part-1/region.tbl
# c235841b00d29ad4f817771fcc851207  part-2/region.tbl
for PART in $(seq 1 2); do
  mkdir -p "part-$PART"
  tpchgen-cli --tables region,nation,orders,customer --output-dir "part-$PART" --parts 10 --part "$PART"
done
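
The loop above generates only the first two of ten parts. As a minimal sketch, assuming the same tables and the same part-N directory layout, a small helper can generate every part. The function name generate_all_parts is hypothetical; it uses only the --tables, --output-dir, --parts and --part flags shown in the examples.

```shell
# Hypothetical helper: generate all N parts, one directory per part.
# Usage: generate_all_parts 10
generate_all_parts() {
  parts=$1
  part=1
  while [ "$part" -le "$parts" ]; do
    # -p keeps the helper re-runnable if the directory already exists.
    mkdir -p "part-$part"
    tpchgen-cli --tables region,nation,orders,customer \
      --output-dir "part-$part" --parts "$parts" --part "$part"
    part=$((part + 1))
  done
}
```

Calling generate_all_parts 10 creates part-1 through part-10 in the current directory. As noted above, small tables such as nation and region are repeated in each part.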

Performance

Scale Factor	`tpchgen-cli`	DuckDB	DuckDB (proprietary)
1	`0:02.24`	`0:12.29`	`0:10.68`
10	`0:09.97`	`1:46.80`	`1:41.14`
100	`1:14.22`	`17:48.27`	`16:40.88`
1000	`10:26.26`	N/A (OOM)	N/A (OOM)

DuckDB (proprietary) is the time required to create TPC-H data using the proprietary DuckDB format.
Creating Scale Factor 1000 data in DuckDB requires 647 GB of memory, which is why no time is reported for it in the table above.

Times to create TPC-H tables in Parquet format using tpchgen-cli and DuckDB for various scale factors.

tpchgen-cli is more than 10x faster than the next fastest TPC-H generator we know of. On a 2023 Mac M3 Max laptop, it easily generates data faster than it can be written to SSD. See BENCHMARKS.md for more details on performance and benchmarking.
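
To see how the speedup falls out of the table above, the sketch below converts each M:SS.ss time to seconds and divides the DuckDB (Parquet) time by the tpchgen-cli time. The speedup helper is hypothetical and exists only for this illustration; the inputs are copied from the table.

```shell
# Hypothetical helper: print how many times slower the second time is
# than the first. Both arguments use the table's M:SS.ss format.
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, ":"); split(b, y, ":")
    # minutes * 60 + seconds for each side, then the ratio
    printf "%.1f\n", (y[1] * 60 + y[2]) / (x[1] * 60 + x[2])
  }'
}

speedup 0:02.24 0:12.29    # scale factor 1
speedup 0:09.97 1:46.80    # scale factor 10
speedup 1:14.22 17:48.27   # scale factor 100
```

On these numbers the ratio grows with the scale factor: roughly 5.5x at scale factor 1, 10.7x at scale factor 10, and 14.4x at scale factor 100.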

Testing

This crate has extensive tests to ensure correctness and produces exactly the same, byte-for-byte output as the original dbgen implementation. We compare the output of this crate with dbgen as part of every check-in. See TESTING.md for more details on the testing methodology.
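
As an illustration of what a byte-for-byte check involves (a minimal sketch, not the project's actual test harness, which is described in TESTING.md), the hypothetical helper below compares every .tbl file in a reference directory, for example one produced by dbgen, against the same-named file in a second directory.

```shell
# Hypothetical helper: compare each .tbl file in a reference directory
# with the same-named file in a candidate directory.
# Prints the first mismatching file name and returns non-zero on a mismatch.
compare_tbl_dirs() {
  ref_dir=$1
  new_dir=$2
  for ref in "$ref_dir"/*.tbl; do
    name=$(basename "$ref")
    # cmp -s is silent and exits 0 only when the files are byte-identical.
    if ! cmp -s "$ref" "$new_dir/$name"; then
      echo "mismatch: $name"
      return 1
    fi
  done
  echo "all files identical"
}
```

For example, compare_tbl_dirs dbgen-out sf1 would check the files in a hypothetical dbgen-out directory against the sf1 directory. A missing file in the second directory is also reported as a mismatch.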

Crates

tpchgen: the core data generator logic for TPC-H. It has no dependencies and is easy to embed in other Rust projects.
tpchgen-arrow: generates TPC-H data in Apache Arrow format. It depends on the arrow-rs library.
tpchgen-cli: a dbgen-compatible CLI tool that generates benchmark datasets using multiple processes.

Contributing

Pull requests are welcome. For major changes, please open an issue first for discussion. See our contributors guide for more details.

Architecture

Please see the architecture guide for details on how the code is structured.

License

The project is licensed under the Apache 2.0 license.

References

The TPC-H Specification: see the specification page.
The original dbgen implementation: you must submit an official request on the official website to access the dbgen software.
