-
Notifications
You must be signed in to change notification settings - Fork 2
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #50 from TidierOrg/larger-than-mem-ex
- Loading branch information
Showing
4 changed files
with
49 additions
and
2 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
# While using the DuckDB backend, TidierDB's lazy intferace enables querying datasets larger than your available RAM. | ||
|
||
# To illustrate this, we will recreate the [Hugging Face x Polars](https://huggingface.co/docs/dataset-viewer/en/polars) example. The final table results are shown below and in this [Hugging Face x DuckDB example](https://huggingface.co/docs/dataset-viewer/en/duckdb) | ||
|
||
# First we will load TidierDB, set up a local database and then set the URLs for the 2 training datasets from huggingface.co | ||
# ```julia | ||
# using TidierDB | ||
# db = connect(duckdb()) | ||
|
||
# urls = ["https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet", | ||
# "https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0001.parquet"]; | ||
# ``` | ||
|
||
# Here, we pass the vector of URLs to `db_table`, which will not copy them into memory. Since these datasets are so large, we will also set `stream = true` in `@collect` to stream the results. | ||
# ```julia | ||
# @chain db_table(db, urls) begin | ||
# @group_by(horoscope) | ||
# @summarise(count = n(), avg_blog_length = mean(length(text))) | ||
# @arrange(desc(count)) | ||
# @aside @show_query _ | ||
# @collect(stream = true) | ||
# end | ||
# ``` | ||
# Placing `@aside @show_query _` before `@collect` above lets us see the SQL query and collect it to a local DataFrame at the same time. | ||
# ``` | ||
# SELECT horoscope, COUNT(*) AS count, AVG(length(text)) AS avg_blog_length | ||
# FROM read_parquet(['https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet', 'https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0001.parquet']) | ||
# GROUP BY horoscope | ||
# ORDER BY avg_blog_length DESC | ||
# 12×3 DataFrame | ||
# Row │ horoscope count avg_blog_length | ||
# │ String? Int64? Float64? | ||
# ─────┼────────────────────────────────────── | ||
# 1 │ Aquarius 49568 1125.83 | ||
# 2 │ Cancer 63512 1097.96 | ||
# 3 │ Libra 60304 1060.61 | ||
# 4 │ Capricorn 49402 1059.56 | ||
# 5 │ Sagittarius 50431 1057.46 | ||
# 6 │ Leo 58010 1049.6 | ||
# 7 │ Taurus 61571 1022.69 | ||
# 8 │ Gemini 52925 1020.26 | ||
# 9 │ Scorpio 56495 1014.03 | ||
# 10 │ Pisces 53812 1011.75 | ||
# 11 │ Virgo 64629 996.684 | ||
# 12 │ Aries 69134 918.081 | ||
# ``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters