Note This repo started from this DuckDB vs Pandas comparison article and has since been expanded to include polars.
Comparison of file conversion between duckdb, polars and pandas.
On average, duckdb
π¦ is ~3x faster than pandas
πΌ
however...
Running [100] iterations of the simulation...
converter_type duration
mean std min max median
0 duckdb 0.638409 0.017433 0.627250 0.738895 0.633449
1 pandas 1.842512 0.027998 1.818738 2.007856 1.836374
2 polars 0.190853 0.014131 0.171845 0.281984 0.186804
Results saved to [data/simulation_results.json]
Plot saved to [img/plot.png]
Complete! Total time: [268.05s]
Simulation results based on 100 iterations and 100k rows sampled from the machine learning subset of the MMLU dataset lighteval/mmlu
.
pip install -r requirements.txt
Standalone script - also includes checking if the system has the necessary dependencies.
./file-converter.sh <input_file> -f <output_format> [-o <output_file>] [-v]
python file_converter.py <input_file> -f <output_format> [-o <output_file>] [-v]
Standalone script - also includes checking if the system has the necessary dependencies.
Thanks to @Nasdin for the polars script! π€
./polars-converter.sh <input_file> -f <output_format> [-o <output_file>] [-v]
To run the simulation:
# Download dataset
python download_dataset.py
# Run simulation
python 3way_simulation.py
- This is not intended to be an exhaustive benchmark, and originally started from the
./file-converter.sh
script. - The
3way_simulation.py
script is a more controlled and isolated comparison of file conversion between duckdb, polars and pandas. - It also generates a plot of the results as seen above.