
Add generate_hmdb_tbl #10

Open — wants to merge 15 commits into base: S4

Conversation

jorainer

  • Add the generate_hmdb_tbl function to create a simple compound tibble from an xml file from HMDB.
  • Add related test files and unit tests.
  • Add documentation.

- Add the simple_parse_hmdb_xml function to parse the HMDB xml file(s).
- Add respective unit tests.
- Add testthat and unit tests.
- Add the generate_hmdb_tbl function.
- Add related documentation and unit tests.
@jorainer jorainer mentioned this pull request Oct 23, 2017
@stanstrup
Owner

Is there an advantage to parsing the XML instead of the SDF? ChemmineR::datablock2ma is very convenient for getting all the info.

@jorainer
Author

The reason was that I had a script to retrieve all compounds individually from HMDB, because the release files were not really up to date. If you query them online (e.g. http://www.hmdb.ca/metabolites/HMDB0000001.xml) you get the XML; that's basically why.

@jorainer
Author

After implementing the SDF parsing function too, one advantage of the XML parsing turns out to be speed. But in the end it's good to have both in place. Thanks for the suggestion!

- Add support to import compound data from HMDB SDF file (addresses comment from
  @stanstrup).
- Add sample file in SDF format and related unit tests.
- Add functionality to write compound data into a SQLite database.
@jorainer
Author

OK, HMDB SDF support is in. generate_hmdb_tbl can now be used with file being the file name of an HMDB file in either XML or SDF format.

@jorainer
Author

I did also add some first code to write the tbl into a SQLite database (including metadata). Next things will be:

  1. Implement the CompoundDb object, that can be used to interface the database.
  2. Implement the code to create an R package containing the annotation.
  3. Implement all required methods to use the CompoundDb. The simplest one will be to extract all data in the form of a tbl so it can be used directly with your code.

@stanstrup (Owner) left a comment:

Thanks! This is very cool. I assume now is not the time for me to pull the PR if you are working on other stuff?

#' cmps <- generate_hmdb_tbl(fl)
#'
#' ## Create a metadata data.frame for the compounds.
#' metad <- data.frame(name = c("source", "url", "source_version",
Owner:

data_frame to create a tibble? (always drop = FALSE and no stringsAsFactors nonsense)
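A minimal base-R sketch of the two data.frame quirks the comment refers to (the column names and values are made up for illustration):

```r
## data.frame converts character columns to factors (pre-R-4.0 default,
## made explicit here) and drops single-column selections to a vector.
df <- data.frame(compound_id = c("HMDB0000001", "HMDB0000002"),
                 exactmass = c(169.085, 74.084),
                 stringsAsFactors = TRUE)

class(df$compound_id)                           # "factor", not "character"
is.data.frame(df[, "exactmass"])                # FALSE: dropped to a numeric vector
is.data.frame(df[, "exactmass", drop = FALSE])  # TRUE: drop = FALSE keeps the data.frame
```

A tibble does neither: character columns stay character and single-column `[` indexing keeps the tibble.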

Author:

Makes sense - I'm still hesitant to switch completely to the tidyverse :)

Owner (@stanstrup), Oct 23, 2017:

Once you do, you will never go back!
dplyr/tidyr/purrr is the holy trinity.

.valid_metadata(metadata)
.valid_compound(x)
db_file <- paste0(path, .db_file_from_metadata(metadata), ".sqlite")
con <- dbConnect(dbDriver("SQLite"), dbname = db_file)
Owner:

More consistent with the tidyverse to use src_sqlite?
Instead of an SQL select you can also use dplyr::select.

Example from the pubchem package I am working on:

pubchem_db <- src_sqlite(path = sqlite_path, create = TRUE)                  # create or connect
db_insert_into(pubchem_db$con, table = "pubchem_tbl", values = pubchem_tbl)  # insert rows

pubchem_all <- tbl(pubchem_db$con, "pubchem_tbl") %>% collect()              # read into a tibble

rm(pubchem_db) # release the file
gc()           # release the file

Author:

OK, I'll switch to the tidyverse.

Author:

Uh, I tried to use src_sqlite, but then I had problems sending SQL queries to the connection (e.g. when I create the indices on certain database columns, or count distinct elements etc.). I'll need to do that a lot and prefer to have direct control over the query that is sent.

Owner:

Hmpf. I assumed the dbConnect con was exactly the same as $con from src_sqlite. If not, then "as you were".

Trying to make sqlite work in parallel I was using

dbClearResult(dbSendQuery(db.pubchem_db$con, "PRAGMA busy_timeout=5000;"));

the other day.
That seemed to work.

Author:

Had some thoughts about internal tidyverse use. For this, let's say, low-level stuff I would like to stick to the low-level packages (DBI) with dbConnect, dbGetQuery, dbWriteTable etc., because they do their job directly and dplyr (or dbplyr?) essentially just wraps them. For examples, vignettes etc. I'm OK with using the tidyverse, but then we should consider adding dplyr to Depends, not only Imports.
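As a sketch of that low-level DBI style (an in-memory database; the table and column names are made up for illustration, and RSQLite is assumed as the backend):

```r
library(DBI)

## Connect through the plain DBI API; in-memory database for illustration.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

## Write a (made-up) compound table and create an index on its id column;
## DDL statements like this are where direct control over the SQL is needed.
cmp <- data.frame(compound_id = c("HMDB0000001", "HMDB0000002"),
                  name = c("1-Methylhistidine", "1,3-Diaminopropane"))
dbWriteTable(con, "compound", cmp)
dbExecute(con, "CREATE INDEX compound_id_idx ON compound (compound_id)")

## Arbitrary SQL, e.g. counting distinct elements.
n <- dbGetQuery(con, "SELECT count(DISTINCT compound_id) AS n FROM compound")$n

dbDisconnect(con)
```

dplyr's src_sqlite cannot express the CREATE INDEX step, which is the kind of query the comment above is about.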

Owner:

OK. I am fine with that.

@jorainer
Author

Right, no need to pull the PR right now - better to wait. But it's good that you start looking at the code, otherwise it will be too much to look at ;)

collapse = ", "))
## Check if we've got an xml file or a sdf file...
if (any(tolower(file_ext(file)) == "xml")) {
res <- as.tbl(do.call(rbind, lapply(file, simple_parse_hmdb_xml)))
Owner:

Alternative:

file %>%
    map(simple_parse_hmdb_xml) %>%
    bind_rows()

Author:

What's the benefit from that? Is it faster?

Owner (@stanstrup), Oct 23, 2017:

Very few I think :D
Only the tidyverse ethos of reading forward instead of inside out.
It is not important really.

It is also slightly safer and, according to the docs, more efficient: bind_rows doesn't allow columns of different types and binds by column name instead of position.

I think you could even use map_dfr here which would make it a single operation.

res <- map_dfr(file, simple_parse_hmdb_xml)
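The name-vs-position difference can be demonstrated with base R alone (toy data; bind_rows's behavior is described in a comment rather than executed, since it would need dplyr):

```r
a <- data.frame(compound_id = "HMDB0000001", exactmass = 169.085)
b <- data.frame(compound_id = "HMDB0000002")   # no exactmass column

## do.call(rbind, ...) errors out on the column mismatch ...
res <- tryCatch(do.call(rbind, list(a, b)),
                error = function(e) "column mismatch")
res  # "column mismatch"
## ... whereas dplyr::bind_rows(a, b) (and thus purrr::map_dfr()) would
## return both rows, filling exactmass with NA for the second one.
```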

Author:

Cool. I'll check that.

Author:

Nice. I like that. But I had to add an as.tbl.

- Add the code to create a CompoundDb database.
- Add constructor function and first low level methods for CompoundDb.
- Add related unit tests and documentation.
- Address @stanstrup's comment to use map_dfr instead of do.call(rbind, lapply(...)).
- Add the createCompoundDbPackage function to create R-packages for a CompoundDb
  SQLite database.
- Add related unit tests and documentation.
- Add a vignette describing how to create compound annotation resources.
- Update DESCRIPTION: add @jotsetung as author.
- Add functionality to access a CompoundDb.
- Add related unit tests and documentation.
@jorainer
Author

OK, now I have all of the core stuff in place:

  • CompoundDb S4 object to provide access to the related SQLite database files.
  • createCompoundDb function to create a CompoundDb SQLite database file from a compound tbl (such as created by e.g. generate_hmdb_tbl).
  • createCompoundDbPackage function to create an annotation package containing the CompoundDb SQLite database file (see e.g. https://github.com/jotsetung/CompoundDb.Hsapiens.HMDB.4.0 for such a package).
  • compounds function to extract compound data from a CompoundDb object.
  • src_compdb function to allow accessing the CompoundDb data in dplyr style (see ?src_compdb).
  • A vignette describing how to use the createCompoundDb and createCompoundDbPackage functions.
  • Unit tests for all internal and exported functions.

Have a look at it @stanstrup and let me know if it's OK or if you'd like changes.

@stanstrup
Owner

Thanks! This is awesome.
And gulp! There is a lot to look at. It might take me some days.

Just a few Qs for now

  1. Is it better to have one package for each DB rather than one with a collection?
  2. Are you sure the HMDB license allows putting up the db? On the website it is indicated that you need permission. I have contacted them to hear what we can do.

@jorainer
Author

Yes, sorry that I added so much ;) - I just wanted to make sure it is at a stage where it might be useful. And for the largest part it's documentation, comments and unit tests.

Re Qs:

  1. I think yes. Reasons to keep the resources separate are:
  • Easier to maintain, generate and provide.
  • Different resources will also require different licenses, and one package can only have one license.
  • Different resources will have different release cycles. Having each in a separate package allows tagging them with the correct version of the original resource. That is crucial for reproducible research. That's the way people do it for gene annotations too: you can have NCBI, Ensembl or UCSC based annotations, but each is provided in its own package.
  2. No, I'm pretty sure that's not the correct license, but as long as the data is not used for commercial purposes it should be OK. They state:

  "Use and re-distribution of the data, in whole or in part, for commercial purposes requires explicit permission of the authors and explicit acknowledgment of the source material (HMDB) and the original publication (see below)"

  Once they reply I'll have to fix the license. In the end we will probably place a license file specific to each annotation resource into the annotation package.
