A collection of interesting biological datasets for computational biology analysis and machine learning model development
Work in Progress - Last updated: January 10, 2021
-
- Description: Benchmark of candidate Enhancer-Gene Interactions (BENGI), which integrates the Registry of cCREs with experimentally derived 3D chromatin interactions, genetic interactions, and CRISPR/dCas9 perturbations in 21 datasets across 13 biosamples
- Date: Jan 2020
- Link: GitHub, ENCODE and various databases
- Size: 162,000 unique cCRE-gene pairs across the 13 biosamples
-
- Description: In vitro proteasome digestion of 55 synthetic polypeptides followed by peptide product identification with mass spectrometry
- Date: May 2020
- Link: Mendeley and PRIDE PXD016782 (raw mass spectrometry data)
- Size: 15,028 spliced and 7,305 non-spliced peptide products
-
Benchmarking single-cell RNA-sequencing protocols for cell atlas projects
- Description: Benchmark dataset of 13 commonly used scRNA-seq protocols on samples consisting of human PBMC (60%), mouse colon cells (30%), and various cell lines
- Date: Apr 2020
- Link: GEO
- Size: Transcriptomics data for ~3,000 cells per protocol (raw sequencing data)
-
Prediction of drug combination effects with a minimal set of experiments
- Description: A compendium of 23,595 drug combination matrices tested in various cancer cell lines and malaria and Ebola infection models
- Date: Dec 2019
- Link: GitHub, web tool, various publications and databases
- Size: 23,595 drug combination matrices tested in cancer cell lines and malaria and Ebola infection models
-
- Description: Data and competition results from AstraZeneca's DREAM challenge to predict the effects of drug combinations
- Date: June 2019
- Link: Synapse, AstraZeneca, various databases
- Size: 11,576 experiments from 910 drug combinations across 85 cancer cell lines
-
Extending the small-molecule similarity principle to all levels of biology with the Chemical Checker
- Description: A collection of structural, chemical, and biological properties of ~800,000 small molecules
- Date: May 2020
- Link: ChemicalChecker and GEO
- Size: 25 properties of up to 778,460 small molecules (some features are not available for all small molecules)
-
Predicting drug–protein interaction using quasi-visual question answering system
- Description: An end-to-end deep learning approach for predicting drug-protein interaction. The paper also describes three public datasets.
- Date: Feb 2020
- Link: DUD-E, BindingDB, and Negative samples
- Size: Combined 62,392 positive interactions and > 1.4M negative interactions
-
- Description: Patterns of somatic passenger mutations detected in whole genome sequencing (WGS) of 2,606 tumours representing 24 common cancer types, produced by the PCAWG Consortium
- Date: Feb 2020
- Link: PCAWG Consortium, various databases
- Size: Genetic mutations of 2,606 tumours from 24 cancer types
-
Multi-omic and multi-view clustering algorithms: review and cancer benchmark
- Description: Benchmarking of methods that integrate multi-omics data (gene expression, miRNA expression, and DNA methylation) to identify distinct cancer patient groups in the TCGA dataset
- Date: Nov 2018
- Link: Ron Shamir's Lab
- Size: 170-621 patients each from 10 cancer types (processed data)
-
- Description: Chest x-ray images released by NIH. Disease annotations were automatically mined from radiologist reports. Around 1,000 images also contain bounding boxes indicating the location of lesions. Files are provided in processed .png format.
- Date: 2017
- Link: Kaggle, also available from many other clouds
- Size: 112,120 chest x-ray images from 30,805 patients, with 14 disease class labels
-
MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs
- Description: Chest x-ray images from Beth Israel Deaconess Medical Center. Disease annotations were automatically mined from radiologist reports. Files are provided in processed .jpg format. Full-resolution DICOM files and radiologist reports are also available.
- Date: 2019
- Link: PhysioNet, access must be requested
- Size: 377,110 chest x-ray images, with 14 disease class labels
-
CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison
- Description: Chest x-ray images from Stanford University Medical Center. Disease annotations in the training set were automatically mined from radiologist reports. Disease annotations in the validation and test sets were obtained from experts. This is one of the datasets used to train NLP tools that automatically extract disease labels from radiologist reports.
- Date: 2019
- Link: Stanford's AIM
- Size: 224,316 chest x-ray images from 65,240 patients, with 14 disease class labels
-
VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations
-
BRAX, Brazilian labeled chest x-ray dataset
- Description: Chest x-ray images from Hospital Israelita Albert Einstein in Brazil. Disease annotations in the training set were automatically mined from radiologist reports (in Portuguese). Both processed .png files and full-resolution DICOM files are available.
- Date: 2022
- Link: PhysioNet
- Size: 40,967 chest x-ray images from 19,351 patients, with 14 disease class labels.
-
PadChest: A large chest x-ray image dataset with multi-label annotated reports
- Description: Chest x-ray images from San Juan Hospital in Spain. Extensive annotations covering 174 radiographic findings, 19 differential diagnoses, and 104 anatomic locations, organized as a hierarchical taxonomy, are provided. Clinicians assigned 27% of the labels; the rest were predicted by an NLP model.
- Date: 2020
- Link: BIMCV
- Size: 160,000 chest x-ray images from 67,000 patients.
- Extra: A COVID-19 extension of this dataset is also available.
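
Note on using the chest x-ray datasets for model development: several of them contain multiple images per patient (for example, 112,120 images from 30,805 patients), so a random split by image can place the same patient in both training and test sets. Below is a minimal sketch of a patient-level split. It is not taken from any of the datasets' own tooling; the `(patient_id, image_path)` record layout, the function name `split_by_patient`, and the toy file names are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split (patient_id, image_path) records so no patient appears in both sets.

    `records` is a hypothetical layout; adapt it to each dataset's metadata file.
    """
    # Group image paths by patient.
    by_patient = defaultdict(list)
    for patient_id, image_path in records:
        by_patient[patient_id].append(image_path)

    # Shuffle patients (not images) with a fixed seed for reproducibility.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)

    # Hold out a fraction of patients, at least one.
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])

    train = [(p, img) for p in patient_ids if p not in test_ids for img in by_patient[p]]
    test = [(p, img) for p in patient_ids if p in test_ids for img in by_patient[p]]
    return train, test


# Toy records with made-up identifiers, for illustration only.
records = [
    ("p1", "a.png"), ("p1", "b.png"),
    ("p2", "c.png"),
    ("p3", "d.png"), ("p3", "e.png"),
    ("p4", "f.png"),
    ("p5", "g.png"),
]
train, test = split_by_patient(records)
```

With real data, the patient identifier would come from each dataset's metadata file. Libraries such as scikit-learn offer group-aware splitters that serve the same purpose.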