
Functions to write to a database #16

Closed
vituri opened this issue Apr 26, 2024 · 13 comments
Comments

@vituri (Contributor) commented Apr 26, 2024

We now have tools to collect data from databases using the TidierData language, but what about writing to databases?

In R, I had to create a function that saves a temporary CSV file and then uploads it to MariaDB, and I was able to do the same in Julia. This approach has the advantage of letting you choose whether to insert with REPLACE or to IGNORE duplicate keys.

Do you plan to add these kinds of functions to TidierDB? I can help with the MySQL/MariaDB part.

@drizk1 (Member) commented Apr 26, 2024

I have thought about whether and how to add write functions. dbplyr does have a set of functions that do so, which would probably shape the method/syntax we use if we add them. So right now the plan is TBD, I'd say.

SQLStrings is an interesting package I have looked at as an example of how to safely write to the underlying database.

@drizk1 (Member) commented Apr 26, 2024

For MySQL and MariaDB, is there no way to do it without a CSV intermediary? I'm just curious.

@vituri (Contributor, Author) commented Apr 26, 2024

In R, the DBI package has some functions to write tables to databases, but they are extremely slow; they probably write line by line. I don't know of any equivalent approach in Julia.

@drizk1 (Member) commented Oct 2, 2024

With a tiny foray into writing with @create_view, and most key functionality present for querying, it might be time to do something here. @vituri, if this is still something you're interested in, would you be able to drop a workflow example so we can figure out how to get it done in a way that's more efficient than line by line?

I think compute is one way to create a table from a query directly in R.

@vituri (Contributor, Author) commented Oct 3, 2024

Hello, @drizk1!

I am AFK now, but I can paste here from my phone a draft of what I use in production at my job. Basically, for each dataframe I save a CSV file with some odd formatting (to avoid clashes with CSV newlines) and then upload it to MariaDB. It works really fast and has the IGNORE and REPLACE options, which are good enough for me. Don't mind the type piracies along the way.

using MySQL
using CSV
using DataFrames, DataFramesMeta  # needed below for DataFrame and @select
using Chain                       # needed below for @chain
import DBInterface as DBI         # assuming `DBI` here is an alias for DBInterface.jl

# Join a collection of strings with `sep`, wrapped between `borda1` and `borda2`.
function collapse(x, sep = ",", borda1 = "", borda2 = "")
    borda1 * reduce((a, b) -> a * sep * b, x) * borda2
end

# Render column names as a backtick-quoted SQL column list: (`a`,`b`,...).
collapse_sql(x) = collapse(string.(x), "`,`", "(`", "`)")

# Write a CSV with unusual delimiters so field contents cannot clash with them.
function write_special_csv(df; delim = "|!|", newline = "|#|", header = false, filename = "temp.csv")
    CSV.write(filename, df, delim = delim, newline = newline, header = header, missingstring = "\\N")
end


"""
    escreve_mariadb(df, nome_tabela, con; comando = "IGNORE")

Write `df` to the MariaDB table `nome_tabela` over connection `con`.

# Arguments
- `df`: the DataFrame to upload
- `nome_tabela`: the target table name
- `con`: the database connection
- `comando`: "REPLACE" or "IGNORE" (default), controlling how duplicate keys are handled.
"""
function escreve_mariadb(df, nome_tabela, con; comando = "IGNORE")
    db_names = names(con, nome_tabela)
    df_names = names(df)
    common_names = intersect(db_names, df_names)

    if length(common_names) == 0 
        @warn "No columns in common! Returning nothing"
        return nothing
    end

    temp_file = tempname()

    df2 = DataFramesMeta.@select(df, $common_names)

    write_special_csv(df2, filename = temp_file)

    query = """
LOAD DATA LOCAL INFILE '$(temp_file)' $comando INTO TABLE `$nome_tabela`
CHARACTER SET 'utf8'
COLUMNS TERMINATED BY '|!|'
LINES TERMINATED BY '|#|'
$(collapse_sql(common_names));
"""

    output = DBI.execute(con, query)    

    Base.Filesystem.rm(temp_file)

    output
end

# Type piracy: extends `names` for a connection, listing a table's column names.
function Main.names(con, table_name)
    query_columns = """SHOW COLUMNS FROM `$(table_name)`"""

    colunas_db = @chain begin
        DBI.execute(con, query_columns)
        DataFrame
        _.Field
    end

    colunas_db
end
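As a quick illustration, the column-list helper above can be exercised without any database connection (definitions repeated so the snippet stands alone):

```julia
# Repeating the helpers from the snippet above so this demo is self-contained.
function collapse(x, sep = ",", borda1 = "", borda2 = "")
    borda1 * reduce((a, b) -> a * sep * b, x) * borda2
end

collapse_sql(x) = collapse(string.(x), "`,`", "(`", "`)")

# Produces the backtick-quoted column list used in the LOAD DATA statement:
println(collapse_sql(["id", "name"]))  # (`id`,`name`)
```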

Do you have any suggestions on how to do that in a more "universal" manner (i.e., for other databases)?

@drizk1 (Member) commented Oct 3, 2024

As far as database compatibility goes, from what I've read and seen working on TidierDB, most databases use pretty similar patterns for writing, so that shouldn't be an issue.

I'm curious: how come you opted for a CSV as opposed to a dataframe and something like MySQL.load(table, conn, table_name)?

As far as workflows: are these local files that you need to put on the database? Or is it something you've wrangled from the db but want to make into a permanent table?

@vituri (Contributor, Author) commented Oct 5, 2024

I came up with this method in R because the write-table function for MariaDB there is terribly slow (the driver writes line by line, and there is no option to upsert). When migrating some packages to Julia, I just translated the code, and it seemed to work really well.

As for the CSV file: it is a temporary file created where the code runs and then uploaded to the MariaDB server (that is, it is not a file created directly on the MariaDB server). (Maybe I did not understand your question.)

@vituri (Contributor, Author) commented Oct 5, 2024

I found the discussion here:

r-dbi/RMariaDB#162

@drizk1 (Member) commented Oct 5, 2024

Ohh, OK, I really like this.

My question was unclear initially, but it's essentially: are you writing a table from a SQL query, for example

Table already in db -> SQL query wrangle -> write as new table

or

Local dataframe -> Julia wrangle -> write as new table

Because if it's the latter of the two, I think it makes sense to use the temp CSV version you've got.

@vituri (Contributor, Author) commented Oct 5, 2024

It's the second case: a table in Julia memory -> CSV -> database.

If the table is already in MariaDB, we can always collect it first into Julia memory as a dataframe.

@drizk1 (Member) commented Oct 6, 2024

Ahh, OK, perfect. So this sounds quite similar to copy_to(conn, df_or_path, "name"), which is implemented for DuckDB and SQLite. Expanding to another backend is great.

My thought would be to implement a copy_to for MySQL/MariaDB (and add the create and replace arguments to it as well).

Then, later on, we can implement a @create_table or @compute (like dbplyr) for creating a table from a query.

How does that sound to you? If it sounds good, would you be able to throw in a PR adding copy_to for the MySQL extension? (I reformatted my computer and haven't set up Docker again to test with quite yet.)
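For such a MySQL/MariaDB copy_to, the statement-building step of the LOAD DATA approach from earlier in this thread could be isolated as a pure function. A minimal sketch; the function name and signature are hypothetical, not part of TidierDB's API:

```julia
# Hypothetical helper (not TidierDB API): builds the LOAD DATA LOCAL INFILE
# statement that a MySQL/MariaDB copy_to could execute, mirroring the
# escreve_mariadb query earlier in this thread.
function load_data_sql(file, table, cols; mode = "IGNORE")
    col_list = "(`" * join(cols, "`,`") * "`)"
    """
    LOAD DATA LOCAL INFILE '$file' $mode INTO TABLE `$table`
    CHARACTER SET 'utf8'
    COLUMNS TERMINATED BY '|!|'
    LINES TERMINATED BY '|#|'
    $col_list;
    """
end

println(load_data_sql("/tmp/upload.csv", "clients", ["id", "name"]))
```

Keeping the SQL generation separate from the execution would also make the backend easy to unit-test without a live server.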

@vituri (Contributor, Author) commented Oct 10, 2024

Is this enough? I tested it on my MariaDB and it went fine.

#72

@drizk1 (Member) commented Oct 18, 2024

I think for now I will close this. Not every database is supported yet, but opening fresh issues as they are needed should be fine.

@drizk1 drizk1 closed this as completed Oct 18, 2024