
Add generate_hmdb_tbl #10

Open — wants to merge 15 commits into base: S4

Conversation

jorainer

  • Add the generate_hmdb_tbl function to create a simple compound tibble from an xml file from HMDB.
  • Add related test files and unit tests.
  • Add documentation.

- Add the simple_parse_hmdb_xml function to parse the HMDB xml file(s).
- Add respective unit tests.
- Add testthat and unit tests.
- Add the generate_hmdb_tbl function.
- Add related documentation and unit tests.
@jorainer jorainer mentioned this pull request Oct 23, 2017
@stanstrup
Owner

Is there an advantage to parsing the XML instead of the SDF? ChemmineR::datablock2ma is very convenient for getting all the info.

@jorainer
Author

The reason was that I had a script to retrieve all compounds individually from HMDB, because the release files were not really up to date. If you query them online (e.g. http://www.hmdb.ca/metabolites/HMDB0000001.xml) you get the XML; that's basically why.

@jorainer
Author

After implementing the SDF parsing function too, one advantage of the XML parsing turns out to be speed. But in the end it's good to have both in place. Thanks for the suggestion!

- Add support to import compound data from HMDB SDF file (addresses comment from
  @stanstrup).
- Add sample file in SDF format and related unit tests.
- Add functionality to write compound data into a SQLite database.
@jorainer
Author

OK, HMDB SDF support is in. generate_hmdb_tbl can now be used with file being the file name of an HMDB file in either XML or SDF format.

@jorainer
Author

I did also add some first code to write the tbl into a SQLite database (including metadata). Next things will be:

  1. Implement the CompoundDb object, that can be used to interface the database.
  2. Implement the code to create an R package containing the annotation.
  3. Implement all required methods to use the CompoundDb. The simplest one will be to extract all data in the form of a tbl so it can be used directly with your code.

@stanstrup (Owner) left a comment:

Thanks! This is very cool. I assume now is not the time for me to pull the PR if you are working on other stuff?

#' cmps <- generate_hmdb_tbl(fl)
#'
#' ## Create a metadata data.frame for the compounds.
#' metad <- data.frame(name = c("source", "url", "source_version",
Owner:

data_frame to create a tibble? (always drop = FALSE and no stringsAsFactors nonsense)
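A minimal base-R sketch of the two data.frame quirks the comment refers to (the column names and values are made up for illustration):

```r
## data.frame converts character columns to factors (pre-R-4.0 default,
## made explicit here) and drops single-column selections to a vector.
df <- data.frame(compound_id = c("HMDB0000001", "HMDB0000002"),
                 exactmass = c(169.085, 74.084),
                 stringsAsFactors = TRUE)

class(df$compound_id)                           # "factor", not "character"
is.data.frame(df[, "exactmass"])                # FALSE: dropped to a numeric vector
is.data.frame(df[, "exactmass", drop = FALSE])  # TRUE: drop = FALSE keeps the data.frame
```

A tibble does neither: character columns stay character and single-column `[` indexing keeps the tibble.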

Author:

Makes sense - I'm still hesitant to switch completely to the tidyverse :)

Owner (@stanstrup), Oct 23, 2017:

Once you do, you will never go back!
dplyr/tidyr/purrr is the holy trinity.

.valid_metadata(metadata)
.valid_compound(x)
db_file <- paste0(path, .db_file_from_metadata(metadata), ".sqlite")
con <- dbConnect(dbDriver("SQLite"), dbname = db_file)
Owner:

More consistent with the tidyverse to use src_sqlite?
Instead of an SQL select you can also use dplyr::select.

Example from the pubchem package I am working on:

pubchem_db <- src_sqlite(path = sqlite_path, create = TRUE)                  # create or connect
db_insert_into(pubchem_db$con, table = "pubchem_tbl", values = pubchem_tbl)  # insert rows

pubchem_all <- tbl(pubchem_db$con, "pubchem_tbl") %>% collect()              # read into a tibble

rm(pubchem_db) # release the file
gc()           # release the file

Author:

OK, I'll switch to the tidyverse.

Author:

Uh, I tried to use src_sqlite, but then I had problems sending SQL queries to the connection (e.g. when I create the indices on certain database columns, or count distinct elements etc.). I'll need to do that a lot and prefer to have direct control over the query that is sent.

Owner:

Hmpf. I assumed the dbConnect con was exactly the same as $con from src_sqlite. If not, then "as you were".

Trying to make sqlite work in parallel I was using

dbClearResult(dbSendQuery(db.pubchem_db$con, "PRAGMA busy_timeout=5000;"));

the other day.
That seemed to work.

Author:

Had some thoughts about internal tidyverse use. For this, let's say, low-level stuff I would like to stick to the low-level packages (DBI) with dbConnect, dbGetQuery, dbWriteTable etc., because they do their job directly and dplyr (or dbplyr?) essentially just wraps them. For examples, vignettes etc. I'm OK with using the tidyverse, but then we should consider adding dplyr to Depends, not only Imports.
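As a sketch of that low-level DBI style (an in-memory database; the table and column names are made up for illustration, and RSQLite is assumed as the backend):

```r
library(DBI)

## Connect through the plain DBI API; in-memory database for illustration.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

## Write a (made-up) compound table and create an index on its id column;
## DDL statements like this are where direct control over the SQL is needed.
cmp <- data.frame(compound_id = c("HMDB0000001", "HMDB0000002"),
                  name = c("1-Methylhistidine", "1,3-Diaminopropane"))
dbWriteTable(con, "compound", cmp)
dbExecute(con, "CREATE INDEX compound_id_idx ON compound (compound_id)")

## Arbitrary SQL, e.g. counting distinct elements.
n <- dbGetQuery(con, "SELECT count(DISTINCT compound_id) AS n FROM compound")$n

dbDisconnect(con)
```

dplyr's src_sqlite cannot express the CREATE INDEX step, which is the kind of query the comment above is about.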

Owner:

OK. I am fine with that.

@jorainer
Author

Right, no need to pull the PR right now - better to wait. But it's good that you start looking at the code, otherwise it will be too much to look at ;)

collapse = ", "))
## Check if we've got an xml file or a sdf file...
if (any(tolower(file_ext(file)) == "xml")) {
res <- as.tbl(do.call(rbind, lapply(file, simple_parse_hmdb_xml)))
Owner:

Alternative:

file %>%
    map(simple_parse_hmdb_xml) %>%
    bind_rows()

Author:

What's the benefit from that? Is it faster?

Owner (@stanstrup), Oct 23, 2017:

Very few I think :D
Only the tidyverse ethos of reading forward instead of inside out.
It is not important really.

It is also slightly safer and, according to the docs, more efficient: bind_rows doesn't allow columns of different types and binds by column name instead of position.

I think you could even use map_dfr here which would make it a single operation.

res <- map_dfr(file, simple_parse_hmdb_xml)
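The name-vs-position difference can be demonstrated with base R alone (toy data; bind_rows's behavior is described in a comment rather than executed, since it would need dplyr):

```r
a <- data.frame(compound_id = "HMDB0000001", exactmass = 169.085)
b <- data.frame(compound_id = "HMDB0000002")   # no exactmass column

## do.call(rbind, ...) errors out on the column mismatch ...
res <- tryCatch(do.call(rbind, list(a, b)),
                error = function(e) "column mismatch")
res  # "column mismatch"
## ... whereas dplyr::bind_rows(a, b) (and thus purrr::map_dfr()) would
## return both rows, filling exactmass with NA for the second one.
```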

Author:

Cool. I'll check that.

Author:

Nice. I like that. But I had to add an as.tbl.

- Add the code to create a CompoundDb database.
- Add constructor function and first low level methods for CompoundDb.
- Add related unit tests and documentation.
- Address @stanstrup's comment to use map_dfr instead of do.call(rbind, lapply(...)).
- Add the createCompoundDbPackage function to create R-packages for a CompoundDb
  SQLite database.
- Add related unit tests and documentation.
- Add a vignette describing how to create compound annotation resources.
- Update DESCRIPTION: add @jotsetung as author.
- Add functionality to access a CompoundDb.
- Add related unit tests and documentation.
@jorainer
Author

OK, now I have all of the core stuff in place:

  • CompoundDb S4 object to provide access to the related SQLite database files.
  • createCompoundDb function to create a CompoundDb SQLite database file from a compound tbl (such as created by e.g. generate_hmdb_tbl).
  • createCompoundDbPackage function to create an annotation package containing the CompoundDb SQLite database file (see e.g. https://github.com/jotsetung/CompoundDb.Hsapiens.HMDB.4.0 for such a package).
  • compounds function to extract compound data from a CompoundDb object.
  • src_compdb function to allow accessing the CompoundDb data in dplyr style (see ?src_compdb).
  • A vignette describing how to use the createCompoundDb and createCompoundDbPackage functions.
  • Unit tests for all internal and exported functions.

Have a look at it @stanstrup and let me know if it's OK or if you'd like changes.

@stanstrup
Owner

Thanks! This is awesome.
And gulp! There is a lot to look at. It might take me some days.

Just a few Qs for now

  1. Is it better to have one package for each DB rather than one with a collection?
  2. Are you sure the HMDB license allows putting up the db? On the website it is indicated that you need permission. I have contacted them to hear what we can do.

@jorainer
Author

Yes, sorry that I added so much ;) - I just wanted to make sure it is at a stage where it might be useful. And for the largest part it's documentation, comments and unit tests.

Re Qs:

  1. I think yes. Reasons to keep the resources separate are:
  • Easier to maintain, generate and provide.
  • Different resources will also require different licenses, and one package can only have one license.
  • Different resources will have different release cycles. Having each in a separate package allows tagging them with the correct version of the original resource. That is crucial for reproducible research. That's the way people do it for gene annotations too: you can have NCBI, Ensembl or UCSC based annotations, but each is provided in its own package.
  2. No, I'm pretty sure that's not the correct license, but as long as the data is not used for commercial purposes it should be OK. They state:

  "Use and re-distribution of the data, in whole or in part, for commercial purposes requires explicit permission of the authors and explicit acknowledgment of the source material (HMDB) and the original publication (see below)"

  Once they reply I'll have to fix the license. In the end we will probably place a license file specific to each annotation resource into the annotation package.
