Merge pull request #50 from TidierOrg/larger-than-mem-ex
drizk1 authored Aug 3, 2024
2 parents 569fc2e + c12b0c7 commit 989babf
Showing 4 changed files with 49 additions and 2 deletions.
docs/examples/UserGuide/databricks.jl (1 addition & 1 deletion)
@@ -2,7 +2,7 @@

# ## Connecting
# Connection is established with the `connect` function as shown below. Connection requires five items, passed as strings:
- # - account instance : [how do to find your instance](https://docs.databricks.com/en/workspace/workspace-details.html)
+ # - Account Instance : [how to find your instance](https://docs.databricks.com/en/workspace/workspace-details.html)
# - OAuth token : [how to generate your token](https://docs.databricks.com/en/dev-tools/auth/pat.html)
# - Database Name
# - Schema Name
docs/examples/UserGuide/outofmemex.jl (46 additions & 0 deletions)
@@ -0,0 +1,46 @@
# While using the DuckDB backend, TidierDB's lazy interface enables querying datasets larger than your available RAM.

# To illustrate this, we will recreate the [Hugging Face x Polars](https://huggingface.co/docs/dataset-viewer/en/polars) example. The final results are shown below and in this [Hugging Face x DuckDB example](https://huggingface.co/docs/dataset-viewer/en/duckdb).

# First, we will load TidierDB, set up a local DuckDB database, and then set the URLs for the two training datasets from huggingface.co:
# ```julia
# using TidierDB
# db = connect(duckdb())

# urls = ["https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet",
# "https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0001.parquet"];
# ```

# Here, we pass the vector of URLs to `db_table`, which registers the files without copying them into memory. Since these datasets are large, we also set `stream = true` in `@collect` to stream the results.
# ```julia
# @chain db_table(db, urls) begin
#     @group_by(horoscope)
#     @summarise(count = n(), avg_blog_length = mean(length(text)))
#     @arrange(desc(avg_blog_length))
#     @aside @show_query _
#     @collect(stream = true)
# end
# ```
# Placing `@aside @show_query _` before `@collect` lets us view the generated SQL query and collect the results into a local DataFrame in a single chain.
# ```
# SELECT horoscope, COUNT(*) AS count, AVG(length(text)) AS avg_blog_length
# FROM read_parquet(['https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet', 'https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0001.parquet'])
# GROUP BY horoscope
# ORDER BY avg_blog_length DESC
# 12×3 DataFrame
# Row │ horoscope count avg_blog_length
# │ String? Int64? Float64?
# ─────┼──────────────────────────────────────
# 1 │ Aquarius 49568 1125.83
# 2 │ Cancer 63512 1097.96
# 3 │ Libra 60304 1060.61
# 4 │ Capricorn 49402 1059.56
# 5 │ Sagittarius 50431 1057.46
# 6 │ Leo 58010 1049.6
# 7 │ Taurus 61571 1022.69
# 8 │ Gemini 52925 1020.26
# 9 │ Scorpio 56495 1014.03
# 10 │ Pisces 53812 1011.75
# 11 │ Virgo 64629 996.684
# 12 │ Aries 69134 918.081
# ```
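
# Finally, as a sketch rather than part of the original example (the table name `blog_stats` and the use of `DBInterface.execute` here are illustrative assumptions), the aggregated results can be materialized into a local DuckDB table once, so repeated queries avoid re-reading the remote parquet files over the network:
# ```julia
# DBInterface.execute(db, """
#     CREATE TABLE blog_stats AS
#     SELECT horoscope, COUNT(*) AS count, AVG(length(text)) AS avg_blog_length
#     FROM read_parquet(['https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet',
#                        'https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0001.parquet'])
#     GROUP BY horoscope
# """)
#
# @chain db_table(db, :blog_stats) begin
#     @arrange(desc(avg_blog_length))
#     @collect
# end
# ```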
docs/examples/UserGuide/s3viaduckdb.jl (1 addition & 1 deletion)
@@ -5,7 +5,7 @@
# You can also use `DBInterface.execute` to set up any DuckDB database connection you need and then use that connection to query with TidierDB.

# ```julia
- # Using TidierDB
+ # using TidierDB
#
# #Connect to Google Cloud via DuckDB
# #google_db = connect(duckdb(), :gbq, access_key="string", secret_key="string")
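# #
# # Sketch only: the `s3_db` name and all S3 values below are placeholder assumptions,
# # showing how an S3 connection can likewise be configured by hand with DBInterface.execute
# #s3_db = connect(duckdb())
# #DBInterface.execute(s3_db, "INSTALL httpfs; LOAD httpfs;")
# #DBInterface.execute(s3_db, "SET s3_region='us-east-1';")
# #DBInterface.execute(s3_db, "SET s3_access_key_id='YOUR_KEY';")
# #DBInterface.execute(s3_db, "SET s3_secret_access_key='YOUR_SECRET';")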
docs/mkdocs.yml (1 addition & 0 deletions)
@@ -124,4 +124,5 @@ nav:
- "Using Snowflake" : "examples/generated/UserGuide/Snowflake.md"
- "Using Databricks" : "examples/generated/UserGuide/databricks.md"
- "Writing Functions/Macros with TidierDB Chains" : "examples/generated/UserGuide/functions_pass_to_DB.md"
- "Working With Larger than RAM Datasets" : "examples/generated/UserGuide/outofmemex.md"
- "Reference" : "reference.md"
