Commit
update outofmemex.jl
drizk1 authored Aug 7, 2024
1 parent 989babf commit 043bfe1
Showing 1 changed file with 14 additions and 4 deletions.
docs/examples/UserGuide/outofmemex.jl (14 additions, 4 deletions)
@@ -2,18 +2,28 @@

# To illustrate this, we will recreate the [Hugging Face x Polars](https://huggingface.co/docs/dataset-viewer/en/polars) example. The final table results are shown below and in this [Hugging Face x DuckDB example](https://huggingface.co/docs/dataset-viewer/en/duckdb).

# First, we will load TidierDB and set up a local database.
# ```julia
# using TidierDB
# db = connect(duckdb())
# ```
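# If you want the database to persist on disk between sessions, DuckDB can be backed by a file instead of memory. This is a minimal sketch using DuckDB.jl's standard `DBInterface.connect` method directly; the file name `blogs.duckdb` is just an example, and whether TidierDB's own `connect` accepts a path is not shown here.
# ```julia
# using DuckDB
# db = DBInterface.connect(DuckDB.DB, "blogs.duckdb") # file-backed database
# ```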
# To run queries on larger-than-RAM files, we will configure our `db` as DuckDB outlines [here](https://duckdb.org/2024/07/09/memory-management.html).
# ```julia
# DBInterface.execute(db, "SET memory_limit = '2GB';");
# DuckDB.execute(db, "SET temp_directory = '/tmp/duckdb_swap';");
# DuckDB.execute(db, "SET max_temp_directory_size = '100GB';")
# ```

# Executing queries directly against the remote files is slower, so we will copy the tables into our database.
# ```julia
# urls = ["https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet",
# "https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0001.parquet"];
# copy_to(db, urls, "astro");
# ```
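# As a quick check that the copy succeeded, we can count the rows in the new table with plain SQL. This sketch assumes the table name `astro` from the `copy_to` call above:
# ```julia
# DBInterface.execute(db, "SELECT COUNT(*) AS n FROM astro;")
# ```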

# Now we can query and collect the table. `db_table` references the `astro` table lazily, without copying it into memory. Since these datasets are so large, we will also set `stream = true` in `@collect` to stream the results.
# ```julia
# @chain db_table(db, "astro") begin
# @group_by(horoscope)
# @summarise(count = n(), avg_blog_length = mean(length(text)))
# @arrange(desc(count))
@@ -43,4 +53,4 @@
# 10 │ Pisces 53812 1011.75
# 11 │ Virgo 64629 996.684
# 12 │ Aries 69134 918.081
# ```
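# For results too large to collect at all, one option is to write the query output straight back to disk. This is a sketch using DuckDB's documented `COPY ... TO` syntax; the output file name `horoscope_counts.parquet` is an example:
# ```julia
# DBInterface.execute(db, "COPY (SELECT horoscope, COUNT(*) AS count FROM astro GROUP BY horoscope) TO 'horoscope_counts.parquet' (FORMAT PARQUET);")
# ```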
