Skip to content

CFDE Data Distillery Partnership Project IDG contributions

License

Notifications You must be signed in to change notification settings

unmtransinfo/cfde-distillery

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CFDE Distillery Project - IDG

The CFDE Data Distillery Partnership Project, led by the CFDE HuBMAP, SenNet, and Kids First teams, from Pitt and Children's Hospital of Philadelphia (CHOP) has developed a Data Distillery Knowledge Graph (DDKG), which distills Common Fund data with semantic interoperability, to support integrative biomedical research questions and scientific use cases.

The IDG-DCC team brings to this project a strong record of research in KG design, analytics and KG-based ML. This repo represents the contributions of the IDG team at the University of New Mexico. The IDG team has focused on including IDG datasets into the DDKG, and developing IDG scientific use cases combining datasets from IDG and multiple other Common Fund programs. In addition, the IDG team has developed the Condensed KG, derived from the DDKG, but subsetted and scoped for IDG use cases, and simplified for interpretability and usability,

Condensed-KG

Hosting the comprehensive DDKG requires significant computational resources, and for IDG applications, a small fraction of the data is required, and a lightweight subset is advantageous. To address this challenge, we developed a localized, cloud-ready, containerized KG. Drawing inspiration from the LINCS-DCC (Maayan Lab) project, we created a "Condensed-KG" derived and subsetted from the latest DDKG. This Condensed-KG, consisting of 8 distinct node labels and 1042 distinct relationship types, is designed to simplify and enhance interpretability, specifically tailored to IDG use cases.

KG2ML

The KG2ML module was developed to transform the KG into vector embeddings and an ML-ready dataset, designed to predict genes/proteins associated with a given disease/condition but lacking explicit links within the KG. Built upon our prior works in KG-based ML and positive-unlabeled (PU) learning, KG2ML estimates the proportion of genes/proteins thereby associated indirectly with a disease/condition, providing well-calibrated predictions on the association between genes/proteins and a specified disease/condition.

DDKG IDG Nodes and Edges files

How to load data from TSV files to Neo4J KG

  • Download:
  • If TCRD/DrugCentral databases changed, run compound_activity.sql and drug_central_data.sql present in the "sql" folder to get input data from TCRD and DrugCentral databases.
  • Update the input and output files path in drugcentral_nodes.py, drugcentral_edges.py, nodes.py, and edges.py and run these codes.
  • Move all output TSV files for nodes and edges to the "import" folder of the Neo4J database you created.
  • In Neo4J database settings, set dbms.memory.heap.max_size=8G. If it gives error, increase the value to a bigger number (e.g. 16gb).
  • Run all Cypher queries present in the "cql" folder.

Links

About

CFDE Data Distillery Partnership Project IDG contributions

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published