-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add generate_hmdb_tbl #10
base: S4
Are you sure you want to change the base?
Conversation
jorainer
commented
Oct 23, 2017
- Add the generate_hmdb_tbl function to create a simple compound tibble from an xml file from HMDB.
- Add related test files and unit tests.
- Add documentation.
- Add the simple_parse_hmdb_xml function to parse the HMDB xml file(s). - Add respective unit tests. - Add testthat and unit tests.
- Add the generate_hmdb_tbl function. - Add related documentation and unit tests.
Is there an advantage to parsing the XML instead of the SDF? |
Reason was that I had a script to retrieve all compounds individually from HMDB, because the release files were not really up-to-date. If you query them online (e.g. http://www.hmdb.ca/metabolites/HMDB0000001.xml) you get the xml, that's basically why. |
After implementing the SDF parse function too one advantage of the xml parsing is speed. But in the end it's good to have both in place. Thanks for the suggestion! |
- Add support to import compound data from HMDB SDF file (addresses comment from @stanstrup). - Add sample file in SDF format and related unit tests. - Add functionality to write compound data into a SQLite database.
OK, hmdb SDF support is in. |
I did also add some first code to write the
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This is very cool. I assume now is not the time for me to pull the PR if you are working on other stuff?
#' cmps <- generate_hmdb_tbl(fl) | ||
#' | ||
#' ## Create a metadata data.frame for the compounds. | ||
#' metad <- data.frame(name = c("source", "url", "source_version", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data_frame to create tibble? (always drop=FALSE and no stringasfactor nonsense)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense - I'm still hesitant to switch completely to the tidyverse :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once you do, you will never go back!
dplyr/tidyr/purrr is the holy trinity.
R/createCompoundDbPackage.R
Outdated
.valid_metadata(metadata) | ||
.valid_compound(x) | ||
db_file <- paste0(path, .db_file_from_metadata(metadata), ".sqlite") | ||
con <- dbConnect(dbDriver("SQLite"), dbname = db_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more consistent with tidyverse to use src_sqlite
?
Instead of SQL select you can also use dplyr::select
.
Example from the pubchem I am working on:
pubchem_db <- src_sqlite(path=sqlite_path, create = TRUE) # create or connect
db_insert_into(pubchem_db$con, table="pubchem_tbl", values=pubchem_tbl) #insert rows
pubchem_all <- tbl(pubchem_db$con, "pubchem_tbl") %>% collect() # read into tibble
rm(pubchem_db) # release the file
gc() # release the file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'll switch to the tidyverse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uh, tried to use the src_sqlite
, but then I have problems sending SQL queries to the connection (e.g. when I create the indices on certain database columns or count distinct elements etc). I'll need to do that a lot and prefer to have direct control over the query to send.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmpf. I assumed dbConnect con was exactly the same as $con from src_sqlite. If not then "as you were".
Trying to make sqlite work in parallel I was using
dbClearResult(dbSendQuery(db.pubchem_db$con, "PRAGMA busy_timeout=5000;"));
the other day.
That seemed to work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had some thoughts about internal tidyverse. For these, let's say, low level stuff I would like to stick to the low level packages (DBI
) with the dbConnect
, dbGetQuery
, dbWriteTable
etc because they do their stuff, and the dplyr
(or dbplyr
?) essentially just wraps them up. For examples, vignettes etc I'm OK with using tidyverse, but then we should consider to add dplyr
to DEPENDS
not only IMPORTS
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I am fine with that.
Right, no need to make the pull request right now - better to wait, but good that you start looking at the code, otherwise it will be too much to look at ;) |
R/generate_cmp_tbl.R
Outdated
collapse = ", ")) | ||
## Check if we've got an xml file or a sdf file... | ||
if (any(tolower(file_ext(file)) == "xml")) { | ||
res <- as.tbl(do.call(rbind, lapply(file, simple_parse_hmdb_xml))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternative:
file %>%
map(simple_parse_hmdb_xml) %>%
bind_rows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the benefit from that? Is it faster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very few I think :D
Only the tidyverse ethos of reading forward instead of inside out.
It is not important really.
It is slightly safer and according to the doc also more efficient.
bind_rows doesn't allow columns of different types and binds by column name instead of position.
I think you could even use map_dfr
here which would make it a single operation.
res <- map_dfr(file, simple_parse_hmdb_xml)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. I'll check that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. I like that. But I had to add a as.tbl
.
- Add the code to create a CompoundDb database. - Add constructor function and first low level methods for CompoundDb. - Add related unit tests and documentation. - Address @stanstrup's comment to use map_dfr instead of do.call(rbind, lapply
- Add the createCompoundDbPackage function to create R-packages for a CompoundDb SQLite database. - Add related unit tests and documentation. - Add a vignette describing how to create compound annotation resources. - Update DESCRIPTION: add @jotsetung as author.
- Add functionality to access a CompoundDb. - Add related unit tests and documentation.
OK, now I have all of the core stuff in place:
Have a look at it @stanstrup and let me know if it's OK or if you'd like changes. |
Thanks! This is awesome. Just a few Qs for now
|
Yes, sorry that I added so much ;) - I just wanted to make sure it is at a stage where it might be useful. And for the largest part it's documentation, comments and unit tests. Re Qs:
|