After the Infection: A Survey of Pathogens and Non-communicable Human Disease

Study Design

Study Overview


Repository Description

This GitHub repository contains the code used to process the raw data into the final results for the study by Lape et al., "After the Infection: A Survey of Pathogens and Non-communicable Human Disease" (2025). See "How to Cite" below for citation information. The repository has since been updated to include an analysis that uses Phecodes instead of ICD10 codes as the outcome variable.

This code is made available, along with the explanatory flowcharts below, to enable replication of the results reported in the associated manuscript. UK Biobank and TriNetX data must be obtained directly from the respective organizations.

Manuscript Abstract

There are many well-established relationships between pathogens and human disease, but far fewer when focusing on non-communicable diseases (NCDs). We leverage data from The UK Biobank and TriNetX to perform a systematic survey across 20 pathogens and 426 diseases, primarily NCDs. To this end, we assess the association between disease status and infection history proxies. We identify 206 pathogen-disease pairs that replicate in both cohorts. We replicate many established relationships, including Helicobacter pylori with several gastroenterological diseases and connections between Epstein-Barr virus with multiple sclerosis and lupus. Overall, our approach identified evidence of association for 15 pathogens and 96 distinct diseases, including a currently controversial link between human cytomegalovirus (CMV) and ulcerative colitis (UC). We validate this connection through two orthogonal analyses, revealing increased CMV gene expression in UC patients and enrichment for UC genetic risk signal near human genes that have altered expression upon CMV infection. Collectively, these results form a foundation for future investigations into mechanistic roles played by pathogens in NCDs. All results are easily accessible on our website, https://tf.cchmc.org/pathogen-disease.

General Notes

All patient identifiers shown are generic placeholders and do not correspond to actual identifiers from either UK Biobank (UKB) or TriNetX (TNX). They are included to make the code, along with its inputs and outputs, easier to follow.

Software Versions

Languages utilized

  • R 4.x
  • Python 3.x

Additional Libraries

| R Libraries | Python Libraries |
|---|---|
| DT | numpy |
| MASS | pandas |
| argparse | scipy |
| data.table | sklearn |
| dplyr | statsmodels |
| glue | matplotlib |
| logistf | seaborn |
| openxlsx | tabulate |
| performance | tqdm |
| progress | xlrd |
| pryr | |
| readxl | |
| stringr | |
| tidyr | |
| vroom | |
| writexl | |
| PheWAS v0.99.6.1 | |
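
Before running the Python portions of the pipeline, it may help to confirm that the Python libraries listed above can be imported. The following is a minimal sketch and not part of the repository: the helper name `missing_libraries` is hypothetical, and the check only tests whether each import name resolves. It does not check versions, since none are stated here for the Python libraries.

```python
from importlib.util import find_spec

# Import names taken from the "Python Libraries" column above.
# Note: "sklearn" is the import name of the scikit-learn package.
PYTHON_LIBRARIES = [
    "numpy", "pandas", "scipy", "sklearn", "statsmodels",
    "matplotlib", "seaborn", "tabulate", "tqdm", "xlrd",
]


def missing_libraries(names=PYTHON_LIBRARIES):
    """Return the import names that cannot be found in this environment.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_libraries()
    print("Missing libraries:", ", ".join(missing) if missing else "none")
```

Any name reported as missing would need to be installed with your usual package manager before the Python scripts are run.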

Other 3rd party software

  • GNU Parallel v20220122

      Tange, O. (2022, January 22). GNU Parallel 20220122 ('20 years').
      Zenodo. https://doi.org/10.5281/zenodo.5893336
    

Flowcharts for primary analysis using diagnoses and serology data

Key for Diagrams

[Figures: Color Key, Shape Key]

ICD10 Analysis

UK Biobank

Data Prep

UKB ICD Data Prep


Analysis

UKB ICD analysis


Permutations and Empirical P-values

UKB ICD Permutations


UKB ICD Permutations Continued
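
For readers unfamiliar with the idea behind this step, the sketch below shows one common way to turn permutation results into an empirical p-value. It is an illustration under stated assumptions, not the repository's implementation: the function names, the mean-difference statistic, the two-sided comparison, and the "+1" correction are all choices made here for the example, and the toy data are invented.

```python
import random
from statistics import mean


def empirical_p_value(observed_stat, permuted_stats):
    """Two-sided empirical p-value with the common +1 correction.

    Counts permuted statistics at least as extreme as the observed one.
    (Assumed formula for illustration; the repository may differ.)
    """
    n_extreme = sum(1 for s in permuted_stats if abs(s) >= abs(observed_stat))
    return (n_extreme + 1) / (len(permuted_stats) + 1)


def mean_difference(values, is_case):
    """Toy statistic: mean of cases minus mean of controls."""
    cases = [v for v, c in zip(values, is_case) if c]
    controls = [v for v, c in zip(values, is_case) if not c]
    return mean(cases) - mean(controls)


def permutation_test(values, is_case, n_permutations=1000, seed=0):
    """Shuffle case/control labels to build a null distribution."""
    rng = random.Random(seed)
    observed = mean_difference(values, is_case)
    labels = list(is_case)  # copy so the caller's labels are untouched
    permuted = []
    for _ in range(n_permutations):
        rng.shuffle(labels)  # breaks any link between values and labels
        permuted.append(mean_difference(values, labels))
    return observed, empirical_p_value(observed, permuted)


if __name__ == "__main__":
    # Invented toy data: a measurement for 4 cases, then 4 controls.
    values = [5, 6, 7, 8, 1, 2, 3, 4]
    is_case = [True] * 4 + [False] * 4
    observed, p = permutation_test(values, is_case)
    print(f"observed difference = {observed}, empirical p = {p:.4f}")
```

The "+1" in the numerator and denominator keeps the estimate from ever being exactly zero, which matters when only a finite number of permutations is run.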


TriNetX

Data Prep

TNX ICD Data Prep


Analysis

TNX ICD Analysis


Results Post-processing

ICD Post-processing


Phecode Analysis

UK Biobank

Data Prep

UKB Phecode Data Prep


Analysis

UKB Phecode analysis


TriNetX

Data Prep

TNX Phecode Data Prep


Analysis

TNX Phecode Analysis


Results Post-processing

Phecode Post-processing


How to Cite

Code from this repository may be cited as:

Michael Lape, _et al_. (2023). WeirauchLab/pathogen_ncd: Preprint release
(preprint). Zenodo. https://doi.org/10.5281/zenodo.8423556

The associated manuscript is pending publication. In the meantime, you may cite the preprint on medRxiv:

After the Infection: A Survey of Pathogens and Non-communicable Human Disease
Michael Lape, _et al_. medRxiv 2023.09.14.23295428;
doi: https://doi.org/10.1101/2023.09.14.23295428

Feedback

Please contact Dr. Matthew Weirauch via email with any questions or concerns.

Contributors

| Name | Institution | Remarks |
|---|---|---|
| Mike Lape | University of Cincinnati | primary author |

License

Source code is ©2023 Cincinnati Children's Hospital Medical Center and Mike Lape.

Released under the terms of the GNU General Public License, Version 3. See LICENSE.txt
