This GitHub repository contains the code used to process raw data to final results for the study Lape, et al. "After the Infection: A Survey of Pathogens and Non-communicable Human Disease" (2025). See below for citation information. It has been updated to include an analysis using Phecodes instead of ICD10 codes as the outcome variable.
This code, together with the explanatory flowcharts below, is provided to enable replication of the results reported in the associated manuscript. UK Biobank and TriNetX data must be obtained from the respective organizations.
There are many well-established relationships between pathogens and human disease, but far fewer when focusing on non-communicable diseases (NCDs). We leverage data from The UK Biobank and TriNetX to perform a systematic survey across 20 pathogens and 426 diseases, primarily NCDs. To this end, we assess the association between disease status and infection history proxies. We identify 206 pathogen-disease pairs that replicate in both cohorts. We replicate many established relationships, including Helicobacter pylori with several gastroenterological diseases and connections between Epstein-Barr virus with multiple sclerosis and lupus. Overall, our approach identified evidence of association for 15 pathogens and 96 distinct diseases, including a currently controversial link between human cytomegalovirus (CMV) and ulcerative colitis (UC). We validate this connection through two orthogonal analyses, revealing increased CMV gene expression in UC patients and enrichment for UC genetic risk signal near human genes that have altered expression upon CMV infection. Collectively, these results form a foundation for future investigations into mechanistic roles played by pathogens in NCDs. All results are easily accessible on our website, https://tf.cchmc.org/pathogen-disease.
All patient identifiers are generic and do not correspond to actual identifiers from either UK Biobank (UKB) or TriNetX (TNX). They are included to make the code, its inputs, and its outputs easier to follow.
- R 4.x
- Python 3.x
R Libraries | Python Libraries |
---|---|
DT | numpy |
MASS | pandas |
argparse | scipy |
data.table | sklearn |
dplyr | statsmodels |
glue | matplotlib |
logistf | seaborn |
openxlsx | tabulate |
performance | tqdm |
progress | xlrd |
pryr | |
readxl | |
stringr | |
tidyr | |
vroom | |
writexl | |
PheWAS v0.99.6.1 | |
- GNU Parallel v20220122
Tange, O. (2022, January 22). GNU Parallel 20220122 ('20 years'). Zenodo. https://doi.org/10.5281/zenodo.5893336
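Before running the pipeline, it can help to confirm that the Python libraries listed in the table above are available in the active environment. The following is a minimal sketch, not part of the repository: the function name `find_missing` and the list `REQUIRED_PYTHON_PACKAGES` are hypothetical, and the import names are assumed to match the table entries exactly (for example, `sklearn` is the import name of scikit-learn).

```python
import importlib.util

# Import names copied from the "Python Libraries" column of the table above.
# Assumption: each table entry is also the module's import name.
REQUIRED_PYTHON_PACKAGES = [
    "numpy", "pandas", "scipy", "sklearn", "statsmodels",
    "matplotlib", "seaborn", "tabulate", "tqdm", "xlrd",
]


def find_missing(packages):
    """Return the names in `packages` that cannot be located for import."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_PYTHON_PACKAGES)
    if missing:
        print("Missing Python packages: " + ", ".join(missing))
    else:
        print("All listed Python packages are importable.")
```

The check only locates each module without importing it, so it runs quickly and does not verify package versions.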
Code from this repository may be cited as:
Michael Lape, _et al_. (2023). WeirauchLab/pathogen_ncd: Preprint release
(preprint). Zenodo. https://doi.org/10.5281/zenodo.8423556
The associated manuscript is pending publication. In the meantime, you may cite the preprint on medRxiv:
After the Infection: A Survey of Pathogens and Non-communicable Human Disease
Michael Lape, _et al_. medRxiv 2023.09.14.23295428;
doi: https://doi.org/10.1101/2023.09.14.23295428
Please contact Dr. Matthew Weirauch via email with any questions or concerns.
Name | Institution | Remarks |
---|---|---|
Mike Lape | University of Cincinnati | primary author |
Source code is ©2023 Cincinnati Children's Hospital Medical Center and Mike Lape.
Released under the terms of the GNU General Public License, Version 3. See
LICENSE.txt