- plot_xref_counts() and plot_branch_counts() now use ROBOT instead of pyDOID for data extraction.
- append_to_url() updated with new PMC URL.
- suggest_regex() takes a list of strings and suggests a regex pattern that will match them all.
- is_invariant() now works for more than just character & numeric vectors, with new list and data.frame methods and a default method that should be able to handle more cases (and replaces the character method).
- sandwich_text() revised with new add_dup argument to control whether placeholders are added when they already exist at the start and end of strings.
- plot_citedby() argument color_set now requires names and one color for each of the 7 possible publication types when specifying colors manually.
  - NEW retracted argument added to specify how retracted articles should be managed.
- tidy_sparql() arguments as_curies and lgl_NA_false replaced with new tidy_what argument, and the set of optional tidy procedures has increased to include cleaning headers, unnesting list columns, converting URIs to CURIEs, outputting data as a tibble, replacing logical missing values with FALSE, and removing invariant language tags.
- robot_query() completely revised to:
  - use DO.utils::robot() instead of a direct system call to a system-wide robot, to improve consistency.
  - replace rq argument with query, which now accepts query text directly, in addition to file paths. It also automatically handles different SPARQL query types (i.e. SELECT, UPDATE, CONSTRUCT, etc.).
  - replace save argument with output, which now outputs results to the R console by default when possible, with the option to save them to a file when given a file path.
  - allow for additional arguments to be passed to ROBOT with ... .
  - NEW tidy_what argument to use tidy_sparql() on results without the need for a separate function call.
  - NEW col_types argument to specify column types of results.
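The robot_query() notes above mention accepting either query text or a file path and handling different SPARQL query types. A minimal base R sketch of that idea follows. The helper name sparql_type_sketch() and its keyword-based detection are assumptions made for illustration only; they are not taken from DO.utils and do not describe how robot_query() is actually implemented.

```r
# Hypothetical helper (NOT part of DO.utils): guess the SPARQL query form
# from either query text or a path to a query file.
sparql_type_sketch <- function(query) {
  # A single string naming an existing file is treated as a path.
  is_path <- length(query) == 1 && file.exists(query)
  txt <- if (is_path) readLines(query, warn = FALSE) else query
  txt <- paste(txt, collapse = "\n")

  # Naive cleanup: drop <...> IRIs so keywords inside IRIs are not matched.
  txt <- gsub("<[^>]*>", "", txt)

  # The first query-form keyword found decides the type.
  pattern <- "\\b(SELECT|CONSTRUCT|ASK|DESCRIBE|INSERT|DELETE)\\b"
  kw <- regmatches(txt, regexpr(pattern, txt, ignore.case = TRUE, perl = TRUE))
  if (length(kw) == 0) return(NA_character_)

  kw <- toupper(kw)
  # INSERT/DELETE requests are grouped here as "UPDATE".
  if (kw %in% c("INSERT", "DELETE")) "UPDATE" else kw
}

sparql_type_sketch("SELECT ?s WHERE { ?s ?p ?o }")
```

The keyword scan is deliberately simple and can be fooled by keywords inside comments or string literals; a real implementation would need proper parsing.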
- read_omim() and inventory_omim() now use the preferred "MIM:" prefix in their output instead of "OMIM:". inventory_omim() has been modified to accept input with either prefix. This coincides with changes in the Human Disease Ontology (see DiseaseOntology/HumanDiseaseOntology#1323).
- read_omim() now additionally:
  - parses official API-key requiring phenotypicSeries.txt downloads and may be able to handle additional API-key requiring downloads.
  - recognizes X-linked inheritance from OMIM.
- collapse_col() updated to support negative selection and tidyselect selectors with dplyr versions >= 1.0, which broke these approaches.
- download_omim() downloads official API-key requiring files directly from OMIM (e.g. mim2gene.txt, phenotypicSeries.txt, etc.).
- extract_ordo_mappings() extracts mappings from the Orphanet Rare Disease Ontology, in native format as oboInOwl:hasDbXref with Orphanet's text-based predicate modifiers, or as SKOS (supplemented with filler doid: predicates where SKOS predicates don't exist).
- dplyr > v1.1.0 now required.
- elucidate() (generic) describes the data in a given object. Currently, only omim_inventory has a defined method.
- read_omim() now additionally parses official OMIM downloads of search results and phenotypic series titles.
  - Includes omim_official attribute to indicate if the source was an official download.
  - If input is an official source, the output class will indicate the type.
  - keep_mim arg can be used to filter OMIM search results.
- tidy_sparql() now removes ? from column names and has the new argument lgl_NA_false for specifying whether NA values should be replaced with FALSE in logical columns.
- write_gs.omim_inventory() now has a datestamp method.
- Aligned read_delim_auto() more closely with readr::read_delim() so it could handle compressed input.
- Broadened unique_if_invariant(): it no longer uses its own methods and instead relies on base::unique(). This may have some unintended consequences, particularly where custom methods of unique() are defined, but it works for more inputs, better matching expectations.
- format_html() txt argument has been renamed to text to align it more fully with expectations.
- lexiclean() processes text for improved text matching.
- round_zero() to round numbers toward zero.
- round_down() to round numbers down; more flexible than base::floor().
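To make the difference between the two rounding directions concrete, here is a small base R sketch. The function names, the digits argument, and the scale-then-truncate approach are assumptions for illustration; the actual signatures of round_zero() and round_down() in DO.utils may differ.

```r
# Hypothetical re-implementations (NOT the DO.utils code).

# Round toward zero at a given number of decimal places.
round_zero_sketch <- function(x, digits = 0) {
  scale <- 10^digits
  trunc(x * scale) / scale
}

# Round down (toward -Inf) at a given number of decimal places;
# unlike base::floor(), the precision can be chosen.
round_down_sketch <- function(x, digits = 0) {
  scale <- 10^digits
  floor(x * scale) / scale
}

# Negative values show the difference between the two directions.
round_zero_sketch(c(1.239, -1.239), digits = 2)
round_down_sketch(c(1.239, -1.231), digits = 2)
```

Because the values are scaled in floating point, inputs that sit exactly on a rounding boundary (for example 0.29 at two digits) can land one step lower than expected; a production version would need a tolerance.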
- Fix make_use_case_html() to have case-insensitive sorting.
- plot_citedby()
  - Now accepts manually-defined color sets, in addition to color sets provided by DO.utils.
  - Default plot size changed to better fit new position on disease-ontology.org statistics page.
- robot() .path argument renamed to .robot_path to avoid use in other functions without changing the name.
  - Now informs when testing and caching a ROBOT executable for future use.
- onto_missing() was poorly designed and has been deprecated. To determine which OMIM entries are present in the DO as mappings, use inventory_omim() instead.
- NEW FUNCTIONALITY to speed up curation of OMIM mappings!!!
- read_omim() reads data copied or downloaded from OMIM.
  - Previously an internal function with limited capability to handle a specific copy/paste operation from OMIM.
  - Now expanded to read data downloaded from omim.org phenotypic series pages using the "Download as" button and to handle all copy/paste of tabular data from omim.org without the need for manual corrections.
  - No longer returns tidy_label and provisional columns, as these were not particularly useful, and instead includes omim and geno_inheritance columns to help with curation.
- inventory_omim() compares OMIM entry records against mappings in the DO and reports whether they exist, with accompanying DO class data when they do.
- write_gs() (generic) writes data from DO.utils created classes to Google Sheets. omim_inventory is the first method.
- read_ga() to read Google Analytics data exported to .csv file.
  - Eliminates the need for time-consuming corrections to get a GA exported file into a tidy format for further use.
  - Can optionally read multiple tables from a single file (most exports have two).
  - Can NOT merge GA data that has been split over multiple files due to size. These must be merged manually but this should be trivial.
- onto_missing() & tidy_sparql() output now has improved column types (no longer all character vectors).
- append_to_url() has a new named URL option, "DO_website", for direct link to disease info on disease-ontology.org.
- citedby_scopus() will no longer retain responses with zero results and gained a new argument no_results to control how these are signaled to the user, making it more consistent with citedby_pubmed().
- Fix dplyr code error in onto_missing().
- Adds onto_missing() and character length-sorting functions.
- More support for creating links from CURIEs.
- Includes fixes to eliminate warnings from use of tidyverse (#15) and errors due to updates in some tidyverse packages.
- length_sort(): Sorts vector elements by character length.
- length_order(): Sorts data.frames by character length of elements in specified column(s).
- iff_all_vals(): Tests if all values are present in a vector and ONLY those values are present.
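The behavior described for length_sort() and iff_all_vals() can be approximated in a few lines of base R. The sketch below uses hypothetical function names and argument names; it illustrates the described behavior and is not the DO.utils source.

```r
# Hypothetical sketches (NOT the DO.utils implementations).

# Sort the elements of a character vector by their number of characters.
length_sort_sketch <- function(x, decreasing = FALSE) {
  x[order(nchar(x), decreasing = decreasing)]
}

# TRUE only when every value in `values` occurs in `x`
# AND `x` contains nothing outside `values`.
iff_all_vals_sketch <- function(x, values) {
  all(values %in% x) && all(x %in% values)
}

length_sort_sketch(c("ccc", "a", "bb"))
iff_all_vals_sketch(c("a", "b", "a"), c("a", "b"))
iff_all_vals_sketch(c("a", "b", "c"), c("a", "b"))
```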
- drop_blank(): Now a generic with character and list methods.
- vctr_to_string(): Now always returns NA when only input is NA, even when na.rm = FALSE; previously returned "NA".
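The NA handling described for vctr_to_string() can be pictured with a short base R sketch. The function name, the delim argument, and its "|" default are assumptions for illustration; only the NA behavior is taken from the note above.

```r
# Hypothetical sketch (NOT the DO.utils implementation).
vctr_to_string_sketch <- function(x, delim = "|", na.rm = FALSE) {
  # When the input holds nothing but NA, return a true missing value
  # instead of the string "NA", regardless of na.rm.
  if (all(is.na(x))) return(NA_character_)
  if (na.rm) x <- x[!is.na(x)]
  paste(x, collapse = delim)
}

vctr_to_string_sketch(NA)
vctr_to_string_sketch(c("a", NA, "b"), na.rm = TRUE)
```

Note that in this sketch an empty input also returns NA, because all() of an empty logical vector is TRUE.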
- is_curie(): Tests for CURIEs in character vectors, according to a specified definition that always conforms to W3C CURIE Syntax 1.0.
- onto_missing(): Compares tsv/csv data with data in the ontology to identify data that may be missing. Optionally returns data that is present.
- robot() errors are now signaled in R, and robot() no longer specifies a max heap size when using a robot.jar file.
- to_curie() / to_uri() now appropriately remove brackets from URIs and handle delimited input.
- tidy_sparql() has the new as_curies argument and converts IRIs to CURIEs by default.
- ns_prefix now includes more namespace prefixes, including those for MeSH and UniProt.
- append_to_url() is now vectorized and can append to additional prefixes, including anything in ns_prefix and URLs commonly used on disease-ontology.org for cross-references.
- format_url() no longer uses NA in the txt argument as text input.
- build_hyperlink() takes advantage of the updates to append_to_url() and format_url().
- reticulate updated to >= v1.28 in an effort to resolve python package installation issues; see #12.
- stringr updated to >= 1.5.0, for access to new str_escape() function.
- igraph added for new tidygraph/graphml functions. This change doesn't affect much since tidygraph is already a dependency that itself depends on igraph.
- Now SUGGESTS the keyring package for API key management.
- Migrated DO.utils repository to the DiseaseOntology organization.
- DO.utils documentation is now available on the web at https://diseaseontology.github.io/DO.utils/ with significant updates supporting citation-based assessment of use workflow.
- extract_as_tidygraph(): Extracts nodes and relationships identified in a SPARQL query from RDF/XML file and returns it as a tidygraph.
- write_graphml(): Writes a graph object (tidygraph/iGraph) to a .graphml file.
- [BREAKING CHANGE] robot() wrapper function updated to make it easier to use when programming.
- plot_branch_counts() now:
  - Identifies the count of classes in each branch that are asserted or inferred.
  - Uses data directly from a local copy of the HumanDiseaseOntology repo instead of manually copied release notes.
  - Gained the aspect_ratio argument.
- plot_xref_counts() now uses data directly from a local copy of the HumanDiseaseOntology repo instead of manually copied release notes.
- plot_citedby() gained the color_set argument to permit more flexible color choice.
- DO_pubs updated with 2023 complex disease paper.
  - NOTE: The 2023 Database paper describing the 'Assessing Resource Use' workflow was NOT added because it is not a publication describing use of the DO.
- DO_colors now includes accent colors generated as part of the DO-KB addition to the website. These colors are available in standard, "_mid", and "_light" versions.
- 'Assessing Resource Use: Obtaining Use Records' tutorial/vignette added. Describes how to set up DO.utils and execute functions to support the 'Assessing Resource Use' workflow.
- tidy_pub_records(): creates a tibble with more limited information from Scopus and PubMed references; includes only the columns: first_author, title, journal, pub_date, doi, pmid, scopus_eid, pub_type, and added_dt.
- set_scopus_keys(): makes Scopus API key and/or insttoken available for use during an R session.
- set_entrez_key(): makes Entrez Utils API key available for use during an R session; imported from rentrez package.
- as_tibble() methods for publication results now include the added_dt column in output that standardizes how record timestamps are created.
- tidy_pubmed_summary() is now soft deprecated in favor of tidy_pub_records().
- to_range() now returns NA when passed empty vectors.
- citedby_scopus() has a new insttoken argument.
- collapse_col() gained all the methods of collapse_col_flex(), along with an na.rm argument that can be used by all methods.
- append_to_url() and build_hyperlink() no longer add a trailing slash to the end of URLs when there is not one. Also, a new sep argument has been added to provide greater control.
- [BREAKING CHANGE] format_hyperlink() preserve_NA argument removed and replaced with preserve argument. With this change, the output value when a URL is missing will be either the URL (i.e. NA) or the text passed to txt. This allows more flexibility in the output to support more use cases.
- format_hyperlink() now warns when values are passed to ... when as does not equal "html" to reduce the likelihood of losing arguments silently.
Change license to CC0 1.0 Universal to match standard for the Human Disease Ontology project and in preparation for use in resource use assessment publication.
Website
- make_user_list_html() is deprecated because the user/use case information on disease-ontology.org was moved from the 'Collaborators' page to the new 'Use Cases' page. Replaced by make_use_case_html().
Website
- plot_*() no longer include the datestamp in saved file names.
Formatters
- format_doid() parameters changed:
  - allow_bare renamed to convert_bare
  - validate_input added to allow invalid input to pass-through without modification.
Cited by / Search
- read_pubmed_txt() now parses IDs (PMID, PMCID, & DOI) from citations and returns a data.frame that includes a record number, IDs, and full citations instead of a vector of citations.
Cited by / Search
- extract_pmid() updated to recognize PubMed IDs that are 1 to 8 digits long, which should cover the whole set of actual PMIDs; previously limited to recognizing 8-digit PMIDs.
- as_tibble.esummary_list() fixed error due to reduced data output (fewer columns of information) from pubmed_summary() caused by API changes.
- match_citations() now matches DOIs in a case-insensitive manner, bringing it into compliance with the DOI spec.
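Two of the changes above lend themselves to a short illustration: recognizing PMIDs of 1 to 8 digits and matching DOIs without regard to case. The sketch below uses hypothetical helper names and made-up example identifiers; the regular expression is an assumption and is not necessarily the pattern used inside DO.utils.

```r
# Hypothetical sketches (NOT the DO.utils implementations).

# A bare PMID is assumed here to be a string of 1 to 8 digits.
is_pmid_like <- function(x) {
  grepl("^[0-9]{1,8}$", x)
}

# DOIs are case-insensitive, so compare them after lowercasing both sides.
# Returns the position in `ref` of each element of `x` (NA when absent).
match_doi_sketch <- function(x, ref) {
  match(tolower(x), tolower(ref))
}

is_pmid_like(c("1", "34755882", "123456789", "PMC123"))
match_doi_sketch("10.1000/ABC123", c("10.1000/xyz", "10.1000/abc123"))
```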
URLs
- append_to_url():
  - Gained a new parameter preserve_NA, which allows NA values to pass through instead of being appended to the URL.
  - Added 'github' and 'orcid' as named URLs that might be appended to (via get_url()).
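The effect of a preserve_NA option is easiest to see side by side. The following base R sketch uses a hypothetical function name and argument order; it only mimics the behavior described above and is not the actual append_to_url() code.

```r
# Hypothetical sketch (NOT the DO.utils implementation).
append_to_url_sketch <- function(x, url, preserve_NA = TRUE) {
  out <- paste0(url, x)
  # paste0() turns NA into the text "NA"; restore real missing values
  # when the caller asks for them to pass through.
  if (preserve_NA) out[is.na(x)] <- NA_character_
  out
}

ids <- c("0000-0000-0000-0000", NA)  # placeholder identifier for illustration
append_to_url_sketch(ids, "https://orcid.org/")
append_to_url_sketch(ids, "https://orcid.org/", preserve_NA = FALSE)
```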
Datasets
- More prefixes in ns_prefix and new prefix subsets added:
  - not_obo_prefix: Subset of ns_prefix with everything except OBO ontology prefixes (e.g. 'dc', 'terms', 'skos', 'owl', etc.).
  - obo_prefix: Subset of ns_prefix with standard OBO ontology prefixes and namespaces.
  - obo_prop_prefix: New set of prefixes created to represent frequently used OBO ontology property prefixes. NOTE: There is one per ontology but these may not exist or may not be the actual property prefix used by the stated ontology... use with caution.
General Utilities
- unique_to_string() / vctr_to_string(): added sort and other arguments for control of sorting.
Ontology Extracters/Modifiers
- extract_*_axiom() family: Extract equivalentClass ('eq'), subClassOf ('subclass'), or both ('class') logical axioms.
- queue_xref_split(): Creates a 'curation queue' of diseases that may need to be split because they have multiple cross-references from the same source.
- tidy_sparql(): Tidies SPARQL query results.
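As a rough picture of what tidying SPARQL results can involve, the base R sketch below performs two of the steps this changelog attributes to tidy_sparql(): removing the leading ? from column names and replacing NA with FALSE in logical columns. The function name and the example data are invented for illustration, and the real function does more than this.

```r
# Hypothetical sketch (NOT the DO.utils implementation).
tidy_sketch <- function(df, lgl_NA_false = TRUE) {
  # SPARQL variables are often exported with a leading "?".
  names(df) <- sub("^\\?", "", names(df))

  if (lgl_NA_false) {
    is_lgl <- vapply(df, is.logical, logical(1))
    for (nm in names(df)[is_lgl]) {
      df[[nm]][is.na(df[[nm]])] <- FALSE
    }
  }
  df
}

# Made-up query result for demonstration only.
res <- data.frame(
  "?id" = c("DOID:4", "DOID:7"),
  "?obsolete" = c(TRUE, NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
tidy_sketch(res)
```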
Website
- update_website_count_tables(): Updates counts in tables on 'DO Imports' and 'DO Slims' pages with data from a specified release of doid-merged.owl. Updates data in place.
- make_use_case_html(): Produces the html for the new 'Use Cases' page, split into 3 files, 1 per section for: Ontologies, Resources, and Methodologies.
  - Does not update data in place. HTML for the rows & cells must be copied and pasted over the HTML for each section in the 'Use Cases' file.
  - Content is sorted alphabetically by column.
- make_contributor_html(): Produces HTML list of contributors as <li> elements for disease-ontology.org, including links to Github and ORCID, as available.
URLs
- format_hyperlink(): Converts URLs into hyperlinks for Google Sheets, Excel, or HTML.
- build_hyperlink(): Shorthand for common append_to_url() plus format_hyperlink() combination.
Cited by / Search
- pub_id_match (DATA): A named character vector of regexes to identify/extract publication IDs (currently PMID, PMCID, DOI, & Scopus EID).
General Utilities
- sandwich_text(): Pastes text around strings.
- wrap_onscreen(): Wraps messages to be printed on the screen.
- invert_sublists(): Swaps the list elements between depths 2 & 3, essentially inverting the grouping.
- lengthen_col(): Splits column(s) by a delimiter and lengthens the data.frame so each value is in its own row.
  - NOTES:
    - This is the reverse of collapse_col() but will not recreate the original data.frame after round trip in most cases.
    - Uses unnest_cross() internally so the results will always be the cartesian product of lengthened columns.
- count_delim(): Counts the values in delimited columns; essentially the combination of lengthen_col() and dplyr::count().
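The split-then-count idea behind count_delim() can be sketched in base R without dplyr. The function name, the "|" delimiter default, and the output column names below are assumptions for illustration, not the DO.utils interface.

```r
# Hypothetical sketch (NOT the DO.utils implementation).
count_delim_sketch <- function(x, delim = "|") {
  # "Lengthen": split every delimited string into its individual values.
  vals <- unlist(strsplit(x, delim, fixed = TRUE))
  # "Count": tabulate how often each value occurs.
  as.data.frame(
    table(value = vals),
    responseName = "n",
    stringsAsFactors = FALSE
  )
}

count_delim_sketch(c("a|b", "b|c", "b"))
```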
Type Predicates
- is_valid_obo(): Tests whether the elements of a character vector are 'valid' OBO Foundry IDs (based on formatting, not actual existence).
Formatters
- format_obo(): Formats OBO Foundry IDs.
- format_axiom(): Formats OWL functional syntax EQ/SubClassOf axioms to be more human readable, similar to that of Protege.
Testing
- is_boolean(): T/F type predicate.
- write_access(): Tests whether a file exists with write access.
Data Conversion
- to_curie() & to_uri(): Convert between URI & CURIEs.
- to_range(): Convert vector of values to ranges (output as single string).
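Collapsing values into ranges can be done by grouping consecutive numbers. The base R sketch below shows one way to do that; the function name, the separators, and the exact output format are assumptions, and only the NA result for empty input follows a note elsewhere in this changelog.

```r
# Hypothetical sketch (NOT the DO.utils implementation).
to_range_sketch <- function(x, sep = ",", range_sep = "-") {
  if (length(x) == 0) return(NA_character_)
  x <- sort(unique(x))

  # Start a new group whenever the gap to the previous value is not 1.
  grp <- cumsum(c(1, diff(x) != 1))

  # Each group becomes "first-last", or just the value when it stands alone.
  parts <- tapply(x, grp, function(v) {
    if (length(v) == 1) {
      as.character(v)
    } else {
      paste(v[1], v[length(v)], sep = range_sep)
    }
  })
  paste(parts, collapse = sep)
}

to_range_sketch(c(1, 2, 3, 5, 7, 8))
```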
Data.frame Manipulation
- append_empty_col(): Add empty columns to a data.frame.
- unnest_cross(): Unnest list columns in data.frame always creating the cartesian product.
  - Useful for expanding list columns produced by some SPARQL queries.
Datasets
- ST_pubs: Information about official publications describing the Symptom (SYMP) and/or Pathogen Transmission (TRANS) ontologies.
  - Currently contains only one publication.
- ns_prefix: Named character vector of common namespace-prefix pairs used in DO and/or OBO ontologies.
- cast_to_string renamed to collapse_to_string()
- DO_pubs now includes lens_id with lens.org identifiers for each publication.
- Imports 'tidyselect', which is explicitly required for unnest_cross but is used throughout DO.utils to enable tidyverse-style semantics (via dplyr).
- citedby_pubmed(), equivalent to citedby_scopus(), now available (uses citedby_pmid() and pubmed_summary() internally).
- match_citations() improvements:
  - BREAKING CHANGE -- add_col argument changed from boolean to NULL (replaces FALSE) or the name of the column (replaces TRUE).
  - Bug fixes & message improvements.
- extract_pmid() has a new argument, no_result, which provides control over the condition signaled when no results are found (error, warning, message, or none); previously a simple error was signaled. For any condition signaled, there is now an additional class no_result to improve condition handling.
  - For the elink method the default for no_result is still "error" while for elink_list the default is now a warning.
- Fixed as_tibble.esummary_list() bug that occurred when some results have no data; it caused errors in as_tibble.esummary_list_nested() and was precipitated by the tidyverse's move to more strict vector merging.
IMPORTANT NOTE:
The "cited by" functionality of DO.utils
may no longer be improved because
recent review of Lens.org results suggests it may be a good replacement for this
PubMed + Scopus search & merge strategy.
If improvements are made they will likely facilitate one or more of:
- Merging PubMed and Scopus "cited by" results, probably using standardize() (or similar).
- Reducing data requested from the APIs by implementing timeframe parameters in citedby_*().
- Bug fix to correct error in format_subtree() when the subtree did not have any classes with multi-parentage. (Error label: "no fill needed")
- Create a text-based subtree/hierarchy.
  - extract_subtree() extracts the data from doid.owl including all descendants and their relationships below a specified DOID.
  - format_subtree() arranges them in a dataframe as a text-based hierarchy mirroring disease-ontology.org.
    - Primarily designed for creating high quality "tree view" graphics similar to EBI's Ontology Lookup Service.
- Manage DOIDs.
  - is_valid_doid() tests whether inputs are valid DOIDs. Note that multiple formats are considered valid.
  - format_doid() converts between valid DOID formats.
- DOrepo() and owl_xml() no longer fail silently when a file/directory does not exist. pyDOID now verifies file paths on instantiation of the underlying objects.
- Suggests tidygraph, which is required for format_subtree().
- Added DO Nucleic Acids Research 2022 publication data to DO_pubs.
- R packages:
  - reticulate >= v1.23 required.
  - tidyr v1.2.0 introduced breaking changes by requiring "safe" type casting (implemented via vctrs).
    - replace_na.list() is now deprecated (requires tidyr <= 1.1.4).
- Python dependency: pyDOID
- Created collapse_col_flex() to collapse data frame columns more flexibly.
  - Adds two new methods beyond "unique": "first" & "last".
  - Adds the ability to collapse columns using different methods.
- Created wrapper functions for pyDOID classes.
  - DOrepo() wraps the pyDOID.repo.DOrepo class
  - owl_xml() wraps the pyDOID.owl.xml class
- Updated:
  - plot_citedby() to a stacked bar chart showing publication types.
  - DO_colors to include saturated versions (names prefixed with sat_).
  - Functions generating html have been updated to match html style guide standards.
- Created:
  - theme_DO(), a ggplot2 plotting theme for DO.
  - plot_def_src() to display the number of times a source is used to support disease definitions in the ontology (designed for disease-ontology.org/about/statistics).
- Renamed:
  - match_citations_fz() to match_fz()
  - concat_pm_citation() to read_pubmed_txt()
- Updated match_citations() to utilize Scopus EIDs.
- Created:
  - pmc_summary(), a parallel to pubmed_summary() that works for PubMed Central.
  - hoist_ArticleIds() (internal) - tidies PubMed/PMC identifiers
  - tidy_ArticleId_set() (internal)
  - as_tibble(), method esummary_list_nested
- Created:
  - Functions to read the doid-edit.owl file and extract URLs (+ helpers).
    - read_doid_edit()
    - extract_doid_url()
  - Functions designed for URL validation.
    - validate_url() + helpers
    - NOTE: Helpers for robots.txt respectful validation remain INCOMPLETE and care should be taken not to overwhelm web servers with requests.
- Setup package to wrap python via reticulate package.
- Added functionality to predict mappings using GILDA grounding (a type of lexical string matching/named entity recognition) via python and the python modules pyobo, indra, and gilda. New functions:
  - pyobo_map() to create the predicted mappings.
  - parse_mapping() to parse the python.gilda.ScoredMatch results object to a list of data frames with matches (1 df/input term).
  - unnest_mapping() to unnest a list column generated by pyobo_map() inside a dplyr::mutate() call; wraps parse_mapping().
- Added DEPENDENCIES on ggplot2, googlesheets4, and glue.
- Renamed match_citations_fz() to match_fz().
- Added cast_to_string(), a more generalized version of vctr_to_string() that accepts multiple inputs (similar to paste()).
- Added function to partition() vectors into groups with n elements per group.
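Splitting a vector into groups of n elements is a one-liner in base R. The sketch below uses a hypothetical function name and is meant only to illustrate the idea behind partition(), not to reproduce its implementation or return format.

```r
# Hypothetical sketch (NOT the DO.utils implementation).
partition_sketch <- function(x, n) {
  # Elements 1..n go to group 1, n+1..2n to group 2, and so on;
  # the final group holds whatever is left over.
  split(x, ceiling(seq_along(x) / n))
}

partition_sketch(1:5, n = 2)
```

Batching like this is useful, for example, when an API limits how many identifiers can be sent per request.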
- Added latest official DO publication to DO_pubs.
- Added official DO_colors.
- Added functions to create statistics graphs: plot_citedby(), plot_terms_def_counts(), plot_branch_counts(), plot_xref_counts().
- Added make_user_list_html() to create rows of table in Community > Collaborators > Users of the Disease Ontology from the DO team's curated "Uses" Google sheet.
- Added to_character(), helper for cast_to_string(), to reduce lists and data frames to character vectors while limiting data loss.
- Added html_in_rows(), helper for make_user_list_html(), to format html elements in rows (with optional row & cell attributes).
- Added Google sheets identifiers for programmatic access.
- Added a NEWS.md file to track changes to the package.
- Created/updated various helpers to manage data (get it, make it easier to save/track with version control):
  - download_file() to flexibly handle multiple downloads.
    - Created download_status Ref Class to manage downloads based on exit code/status.
  - confine_list() / release_list() to reversibly convert a list column to a character vector (using json).
  - is_invariant() to test for vectors with only 1 value; with methods for character & numeric vectors.
  - unique_to_string() to collapse vectors to strings.
  - unique_if_invariant() to conditionally collapse vectors with only 1 value.
    - Added na.rm argument to all vctr_to_scalar functions.
  - replace_*()
    - NA, method for lists.
    - NULL, to replace NULL values in lists recursively.
    - blank, to replace "" values.
  - collapse_col() to collapse 1 or more specified columns in a data frame by concatenating unique values together, while preserving unique values in all other columns.
- Created download_obo_ontology() to download 1 or more ontologies maintained by the OBO Foundry.
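The column-collapsing behavior described for collapse_col() above can be approximated with base R's aggregate(). In the sketch below the function name, the "|" delimiter, and the example data are assumptions for illustration; it is not the DO.utils implementation, which is built on the tidyverse.

```r
# Hypothetical sketch (NOT the DO.utils implementation).
collapse_col_sketch <- function(df, col, delim = "|") {
  # Group by every column that is not being collapsed.
  by <- setdiff(names(df), col)
  aggregate(
    df[col],
    by = df[by],
    FUN = function(v) paste(unique(v), collapse = delim)
  )
}

# Made-up mapping data for demonstration only.
xrefs <- data.frame(
  id = c("DOID:1", "DOID:1", "DOID:2"),
  xref = c("MIM:1", "MIM:2", "MIM:3"),
  stringsAsFactors = FALSE
)
collapse_col_sketch(xrefs, "xref")
```

One caveat of this approach: aggregate() silently drops rows whose grouping columns contain NA, so it is not a drop-in replacement when missing values matter.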
- Added more to DO publication info.
- Added OBO Foundry metadata.
- Made it possible to count Alliance terms for a subset of DOIDs.
- Increased record type count options (arg: record_lvl) --> "full_record", "disease-object", "disease", "object"
- Created citedby_scopus() to get cited by publication data from Scopus API, along with s3 classes & methods to manage Scopus data.
  - Updated to capture datetime citedby data is first retrieved.
- Renamed citedby_pubmed() to citedby_pmid() to reflect its output.
  - NEED new citedby_pubmed() that combines citedby_pmid() & pubmed_summary().
- Modified pubmed_summary() to accept PMID lists as input.
  - Uses new extract_pmid() elink_list method.
- Changed citedby tidy() methods to as_tibble() methods & added new methods.
- Created truncate_authors() to shorten long PubMed author lists.
- Created get_url() & append_to_url() to build DOI, PubMed, and PMC URLs for individual publications.