Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Roadmap to whats next #1

Open
48 of 50 tasks
drizk1 opened this issue Feb 16, 2024 · 25 comments
Open
48 of 50 tasks

Roadmap to whats next #1

drizk1 opened this issue Feb 16, 2024 · 25 comments

Comments

@drizk1
Copy link
Member

drizk1 commented Feb 16, 2024

This is a map as guide for what is to come and what others might want to contribute if they feel so inclined.

Macros

  • mutate
  • select
  • arrange
  • distinct
  • filter
    • autodetects when to use WHERE vs HAVING, even when receiving mixed multiple conditions that are not all having/where
  • slice_min , slice_max, @slice_sample
  • group_by
  • count
  • summarize / summarise
  • left_join - will try to sort out syntax differences from tidier.jl.
  • right_join
  • inner_join
  • rename
  • window_order
  • window_frame
  • head
  • relocate
  • fill
  • pivot_longer and pivot_wider
    • pivots are going to be challenging. Unclear how to access values when data is in a database to build case whens...
  • collect
  • show_query
  • @union
  • @setdiff
  • @intersect
    Functions
  • rewrite sql parsing expr_to_sql to use Macrotools.postwalk
    • realized @capture can't recognize function names w _ so i made a workaround function for this bug, still a few holdover parses to convert
  • start_query_meta - allows query to begin and stores metadata to enable tidy selection
  • case_when, if_else
  • starts_with, ends_with, contains, str_detect
  • is_missing()
  • replace_missing and missing_if
  • as_integer, as_float, as_string
  • SQL aggregate functions: MAX, MIN, SUM, AVG, COUNT as n() in summarize
  • universalize metadata df so that it works for postgres, sqlite and others as they get added.
  • across in summarize
  • across in mutate

For Postgres/DuckDB specific parser

  • dates functions (lubridate)
    • done: year, month, day, hour, min, second, floor_date
    • difftime needed for postgres and mysql
  • stringr functions
    • done: str_detect, str_replace/_all, str_remove/_all
  • stats aggregate functions
    • done: corr, cov, std, var
  • collect/start_query_meta for postgres/duckdb
  • Fix MySQL date parsing
    Features
  • tidy selection (works with select, group_by, distinct, across inside of summarize. Mutate and summarize add new columns and update column tracking.)
  • documentation
  • interpolation -
    so i fixed it again to avoid using eval, but it just requires a hidden context and an add function. it works, but its a few extra keystrokes for the user.
    • implemented in most macros. works for vectors of variables as strings, or individual variables as strings, as well as values and inside of across. Need to add the escaped of tidier data
  • docstrings
  • copy_to added support for SQLite and DuckDB
  • other connect functions?
    • this is now done as connect()

For duckdb directly read in paths/ https without local intermediary

  • .csv
  • .parquet
  • .json
  • .arrow

missing something? drop a comment/new issue

Backends currently supported:

@drizk1
Copy link
Member Author

drizk1 commented Mar 9, 2024

Hey @kdpsingh I know you're super busy, but real quick, I've been going on a bug finding mission trying to use increasingly complex queries to find bugs. I've been able to replicate nearly all of the answers on this site I found, and I now have a postgres parser that I have been testing here and it works (with ability to switch back and forth) but I was wondering if you might have an idea of place with more complex queries that I can use to further bug search and test.

thank you!

here as an example of one of the more involved queries ive been able to replicate.

@chain start_query_meta(db, :athlete_events) begin
    @rename(tf = NOC)
    @mutateGold = if_else(Medal == "Gold", 1, 0),
               Silver = if_else(Medal == "Silver", 1, 0),
               Bronze = if_else(Medal == "Bronze", 1, 0))
    @mutate(total_medals = Gold + Silver + Bronze)
    @left_join(:noc_regions, NOC, tf)
    @group_by(region, Sport)
    @summarize(total_sum = sum(total_medals))
    @filter(region == "India" )
    @slice_max(total_sum)
   # @show_query
    @collect
end
WITH cte_1 AS (
SELECT ID, Name, Sex, Age, Height, Weight, Team, NOC AS tf, Games, Year, Season, City, Sport, Event, Medal
        FROM athlete_events),
cte_3 AS (
SELECT  ID, Name, Sex, Age, Height, Weight, Team, tf, Games, Year, Season, City, Sport, Event, Medal, CASE WHEN Medal = 'Gold' THEN 1 ELSE 0 END AS Gold, CASE WHEN Medal = 'Silver' THEN 1 ELSE 0 END AS Silver, CASE WHEN Medal = 'Bronze' THEN 1 ELSE 0 END AS Bronze
        FROM cte_1),
cte_5 AS (
SELECT  ID, Name, Sex, Age, Height, Weight, Team, tf, Games, Year, Season, City, Sport, Event, Medal, Gold, Silver, Bronze, Gold + Silver + Bronze AS total_medals
        FROM cte_3),
cte_6 AS (
SELECT  cte_5.*, noc_regions.*
        FROM cte_5
        LEFT JOIN noc_regions ON noc_regions.NOC = cte_5.tf),
cte_7 AS (
SELECT region, Sport, SUM(total_medals) AS total_sum
        FROM cte_6
        GROUP BY region, Sport 
        HAVING region = 'India'),
cte_8 AS (
SELECT *, RANK() OVER (
        ORDER BY total_sum DESC) AS rank_col
        FROM cte_7),
cte_9 AS (
SELECT  *
        FROM cte_8
        WHERE rank_col <= 1)  
SELECT *
        FROM cte_9  
1×4 DataFrame
 Row │ region  Sport   total_sum  rank_col 
     │ String  String  Int64      Int64    
─────┼─────────────────────────────────────
   1 │ India   Hockey        173         1

@kdpsingh
Copy link
Member

This looks amazing!!! Congrats on all the progress. Will give this a look soon and see where we should go next. My sense is that all the ingredients are here to packagify this.

@drizk1
Copy link
Member Author

drizk1 commented Mar 10, 2024

Thank you! Sounds great. I'm open to any direction. This has just been fun to try building

I'll work on getting docstrings / documentation sorted

@drizk1
Copy link
Member Author

drizk1 commented Mar 27, 2024

just an fyi so u hopefully don't duplicate work and can avoid the joy of docstrings, but I went ahead and got the docstrings actually running here ( i had to get rid of all the show_query examples and use collect for most examples because of hidden space characters i could not find)

@kdpsingh
Copy link
Member

kdpsingh commented Apr 6, 2024

Thank you for moving this to TidierOrg - was just about to suggest this.

I have an idea for how to use this alongside TidierData for now that doesn't require a big rewrite of TidierData.

@drizk1
Copy link
Member Author

drizk1 commented Apr 6, 2024

Of course. I'm curious what that will look like, because avoiding a rewrite of either would be nice.

I tried a few things to no avail for getting them to dispatch nicely together. (Ie creating a third macro that auto detects and then dispatches the correct version. )

I know it's not ideal but I'd be open to having it be @mutate.db for example so that way TidierDB and TidierData could be easily used in the same chains.

@kdpsingh
Copy link
Member

kdpsingh commented Apr 6, 2024

Here's what I would suggest. When we use TidierDB with TidierData, we shouldn't do using TidierDB. Instead, we should do import TidierDB as DB.

Then, all of the TidierDB macros will be available as DB.@mutate and so on, and the TidierData equivalent would be @mutate.

@kdpsingh
Copy link
Member

kdpsingh commented Apr 6, 2024

Also, can we rename start_query_meta() to db_table() since it's pointing to a table in a database?

@drizk1
Copy link
Member Author

drizk1 commented Apr 6, 2024

Yes absolutely I've been wanting to rename that because it was too long.

I love that import method. And then it would allow TidierData to be called as normal too? I think that's great

@kdpsingh
Copy link
Member

kdpsingh commented Apr 6, 2024

Yes, TidierData would work as normal so could use them together.

@drizk1
Copy link
Member Author

drizk1 commented Apr 6, 2024

Awesome. I think one of the last things then is getting the docs to build. I have all the files in place, but I'm not sure how to enable it to build

@kdpsingh
Copy link
Member

kdpsingh commented Apr 6, 2024

I haven't gotten the chance to play with it yet. Did you have to install SQLite and/or DuckDB to get the code to work? Or was installing the relevant Julia packages for those sufficient?

Just thinking about what the issue may be.

If you can't get it working, I'll look at it tonight or tomorrow.

@kdpsingh
Copy link
Member

kdpsingh commented Apr 6, 2024

Simpler question: do the docs build locally?

@drizk1
Copy link
Member Author

drizk1 commented Apr 6, 2024

I think they build locally. I changed the default to duckdb instead of SQLite and export 2 functions for opening the db and connecting. (Using other databases means the user has to write in using using clickhouse)

I think this has something to do with the package set up (ie git actions and I didn't set up for documenter. ) because there are only 2 green checks not 3 like other packages when pushing the new commits

@kdpsingh
Copy link
Member

kdpsingh commented Apr 6, 2024

Ok let me check on this - will get to it tonight or tomorrow at the latest.

Once the docs build, we can get this on the way to the registry.

@kdpsingh
Copy link
Member

kdpsingh commented Apr 6, 2024

Also I would go ahead and add the Arrow dependency to make it work with DuckDB.

@kdpsingh
Copy link
Member

kdpsingh commented Apr 6, 2024

I'm also considering allowing TidierData to work directly with Arrow but that will require a bit of work to ensure that we don't fully instantiate Arrow tables as DataFrames.

This way via DuckDB might be a nice alternative.

@drizk1
Copy link
Member Author

drizk1 commented Apr 6, 2024

Sweet. I'll get that added a little later tn and change the name to db_table

@drizk1
Copy link
Member Author

drizk1 commented Apr 6, 2024

I realized The documenter.yml just needs to be added and then it should work. So I should be able to get both this and tidierfiles working tn

@kdpsingh
Copy link
Member

kdpsingh commented Apr 6, 2024

There's one setting you'll have to change in Settings to make the documentation actually show up. If tests pass, happy to help.

@drizk1
Copy link
Member Author

drizk1 commented Apr 7, 2024

So i have reached a bit of an impasse.

I have been unable to get the docs to build and continue getting this error error

I have tried various things including making it so no code runs because everything is a comment in the docs, and no matter what i get this error. I am wondering if its something you have ever experienced?

ERROR: LoadError: AssertionError: length(exs) == 2 * length(results)
Stacktrace:
  [1] parseblock(code::String, doc::Documenter.Documents.Document, file::String; skip::Int64, keywords::Bool, raise::Bool, linenumbernode::LineNumberNode)
    @ Documenter.Utilities ~/.julia/packages/Documenter/bYYzK/src/Utilities/Utilities.jl:156
  [2] parseblock
    @ ~/.julia/packages/Documenter/bYYzK/src/Utilities/Utilities.jl:123 [inlined]
  [3] #parseblock#1
    @ ~/.julia/packages/Documenter/bYYzK/src/Documents.jl:72 [inlined]
  [4] runner(::Type{Documenter.Expanders.ExampleBlocks}, x::Markdown.Code, page::Documenter.Documents.Page, doc::Documenter.Documents.Document)
    @ Documenter.Expanders ~/.julia/packages/Documenter/bYYzK/src/Expanders.jl:625
  [5] dispatch(::Type{Documenter.Expanders.ExpanderPipeline}, ::Markdown.Code, ::Vararg{Any})
    @ Documenter.Utilities.Selectors ~/.julia/packages/Documenter/bYYzK/src/Utilities/Selectors.jl:170
  [6] expand(doc::Documenter.Documents.Document)
    @ Documenter.Expanders ~/.julia/packages/Documenter/bYYzK/src/Expanders.jl:42
  [7] runner(::Type{Documenter.Builder.ExpandTemplates}, doc::Documenter.Documents.Document)
    @ Documenter.Builder ~/.julia/packages/Documenter/bYYzK/src/Builder.jl:226
  [8] dispatch(::Type{Documenter.Builder.DocumentPipeline}, x::Documenter.Documents.Document)
    @ Documenter.Utilities.Selectors ~/.julia/packages/Documenter/bYYzK/src/Utilities/Selectors.jl:170
  [9] #2
    @ ~/.julia/packages/Documenter/bYYzK/src/Documenter.jl:273 [inlined]
 [10] cd(f::Documenter.var"#2#3"{Documenter.Documents.Document}, dir::String)
    @ Base.Filesystem ./file.jl:112
 [11] #makedocs#1
    @ ~/.julia/packages/Documenter/bYYzK/src/Documenter.jl:272 [inlined]
 [12] top-level scope
    @ ~/work/TidierDB.jl/TidierDB.jl/docs/make.jl:11

@kdpsingh
Copy link
Member

kdpsingh commented Apr 7, 2024

I haven't experienced this exact error but let me investigate today.

@drizk1
Copy link
Member Author

drizk1 commented Apr 7, 2024

sounds good, the docs seem to run locally on my end so curious what the issue is.

julia> include("make.jl")
┌ Warning: DocTestSetup already set for module TidierDB. Overwriting.
└ @ Documenter.DocMeta ~/.julia/packages/Documenter/bYYzK/src/DocMeta.jl:81
[ Info: SetupBuildDirectory: setting up build directory.
[ Info: Doctest: skipped.
[ Info: ExpandTemplates: expanding markdown templates.
[ Info: CrossReferences: building cross-references.
[ Info: CheckDocument: running document checks.
[ Info: Populate: populating indices.
[ Info: RenderDocument: rendering document.
[ Info: MarkdownWriter: rendering Markdown pages.
┌ Warning: Documenter could not auto-detect the building environment Skipping deployment.
└ @ Documenter ~/.julia/packages/Documenter/bYYzK/src/deployconfig.jl:75

@rdboyes
Copy link
Member

rdboyes commented Apr 8, 2024

LoadError: AssertionError: length(exs) == 2 * length(results)

I've had this error - it seems to happen when there is an example that does not output anything

@kdpsingh
Copy link
Member

kdpsingh commented Apr 8, 2024

That's exactly what the issue was. Now fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants