Merge pull request #50 from TidierOrg/larger-than-mem-ex

TidierOrg · Aug 3, 2024 · 989babf
2 parents 569fc2e + c12b0c7
commit 989babf

Showing 4 changed files with 49 additions and 2 deletions.
diff --git a/docs/examples/UserGuide/databricks.jl b/docs/examples/UserGuide/databricks.jl
@@ -2,7 +2,7 @@
 
 # ## Connecting 
 # Connection is established with the `connect` function as shown below. Connection requires 5 items as strings
-# - account instance : [how do to find your instance](https://docs.databricks.com/en/workspace/workspace-details.html)
+# - Account Instance : [how to find your instance](https://docs.databricks.com/en/workspace/workspace-details.html)
 # - OAuth token : [how to generate your token](https://docs.databricks.com/en/dev-tools/auth/pat.html)
 # - Database Name
 # - Schema Name

diff --git a/docs/examples/UserGuide/outofmemex.jl b/docs/examples/UserGuide/outofmemex.jl
@@ -0,0 +1,46 @@
+# While using the DuckDB backend, TidierDB's lazy interface enables querying datasets larger than your available RAM.
+
+# To illustrate this, we will recreate the [Hugging Face x Polars](https://huggingface.co/docs/dataset-viewer/en/polars) example. The final table results are shown below and in this [Hugging Face x DuckDB example](https://huggingface.co/docs/dataset-viewer/en/duckdb).
+
+# First, we load TidierDB, set up a local database, and then set the URLs for the two training datasets from huggingface.co.
+# ```julia
+# using TidierDB
+# db = connect(duckdb())
+
+# urls = ["https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet",
+#  "https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0001.parquet"];
+# ```
+
+# Here, we pass the vector of URLs to `db_table`, which will not copy the files into memory. Because these datasets are large, we also set `stream = true` in `@collect` to stream the results.
+# ```julia
+# @chain db_table(db, urls) begin
+#     @group_by(horoscope)
+#     @summarise(count = n(), avg_blog_length = mean(length(text)))
+#     @arrange(desc(avg_blog_length))
+#     @aside @show_query _
+#     @collect(stream = true)
+# end
+# ```
+# Placing `@aside @show_query _` before `@collect` above prints the generated SQL query and still collects the result into a local DataFrame in the same chain.
+# ```
+# SELECT horoscope, COUNT(*) AS count, AVG(length(text)) AS avg_blog_length
+#         FROM read_parquet(['https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet', 'https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0001.parquet'])
+#         GROUP BY horoscope  
+#         ORDER BY avg_blog_length DESC
+# 12×3 DataFrame
+#  Row │ horoscope    count   avg_blog_length 
+#      │ String?      Int64?  Float64?        
+# ─────┼──────────────────────────────────────
+#    1 │ Aquarius      49568         1125.83
+#    2 │ Cancer        63512         1097.96
+#    3 │ Libra         60304         1060.61
+#    4 │ Capricorn     49402         1059.56
+#    5 │ Sagittarius   50431         1057.46
+#    6 │ Leo           58010         1049.6
+#    7 │ Taurus        61571         1022.69
+#    8 │ Gemini        52925         1020.26
+#    9 │ Scorpio       56495         1014.03
+#   10 │ Pisces        53812         1011.75
+#   11 │ Virgo         64629          996.684
+#   12 │ Aries         69134          918.081
+# ```
diff --git a/docs/examples/UserGuide/s3viaduckdb.jl b/docs/examples/UserGuide/s3viaduckdb.jl
@@ -5,7 +5,7 @@
 # You can also use `DBInterface.execute` to set up any DuckDB database connection you need and then use that db to query with TidierDB
 
 # ```julia
-# Using TidierDB
+# using TidierDB
 # 
 # #Connect to Google Cloud via DuckDB
 # #google_db = connect(duckdb(), :gbq, access_key="string", secret_key="string")

diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
@@ -124,4 +124,5 @@ nav:
   - "Using Snowflake" : "examples/generated/UserGuide/Snowflake.md"
   - "Using Databricks" : "examples/generated/UserGuide/databricks.md"
   - "Writing Functions/Macros with TidierDB Chains" : "examples/generated/UserGuide/functions_pass_to_DB.md"
+  - "Working With Larger than RAM Datasets" : "examples/generated/UserGuide/outofmemex.md"
   - "Reference" : "reference.md"