Co-Expression Network Inference from Chromatin Conformation Data through Graph Embedding

The repository contains the source code for the link prediction problem of the paper:
Varrone, Nanni et al., "Co-Expression Network Inference from Chromatin Conformation Data through Graph Embedding".

Abstract

Motivation

The relationship between gene coexpression and chromatin conformation is of great biological interest. Thanks to high-throughput chromosome conformation capture technologies (Hi-C), researchers are gaining insights on the tri-dimensional organization of the genome. Given the high complexity of Hi-C data and the difficult definition of gene coexpression networks, the development of proper computational tools to investigate such relationship is rapidly gaining the interest of the research community. One of the most fascinating questions in this context is how chromatin topology correlates with coexpression profiles of genes and which physical interaction patterns are most predictive of coexpression relationships.

Results

To address these questions, we developed a computational framework for the prediction of coexpression networks from chromatin conformation data. We first define a gene chromatin interaction network where each gene is associated to its physical interaction profile; then we apply two graph embedding techniques to extract a low-dimensional vector representation of each gene from the interaction network; finally, we train a Random Forest classifier on pairs of gene embeddings to predict if they are coexpressed.
Both graph embedding techniques outperform previous methods based on manually designed topological features, highlighting the need for more advanced strategies to encode chromatin information. We also establish that the most recent technique, based on random walks, is superior. Overall, our results demonstrate that chromatin conformation and gene regulation share a non-linear relationship and that gene topological embeddings encode relevant information, which could be used also for downstream analysis.

Pipeline

Guidelines

The order of execution of the pipeline is proprocessing -> network building -> link prediction.

Preprocessing

The pipeline is compatible with gene expression data downloaded from the Xena Browser TCGA Hub. The absolute path to the HiSeqV2 file is required in 01_gene_expression.py as the --input parameter and a name for the dataset can be chosen to identify the specific dataset through the parameter --dataset.

Since the original Hi-C datasets have different structure, we developed custom preprocessing scripts for the MCF-7 breast cancer (Barutcu et al. (2015)) and 22Rv1 prostate cancer (Rhie et al. (2019)) cell lines.

Network building

The first part of the pipeline is devoted to the building of the coexpression network from the gene expression data.
If no chromosome is specified in the --chr-src and --chr-tgt parameters of 01_compute_coexpression.py, the coexpression will be computed for all the chromosomes at once for the single model setting and the intra+inter-chromosomal setting.

In 02_coexpression_network.py, we can obtain networks based of which of the three analysis to perform:

different model for each chromosome: pass the --single-chrom parameter and select the percentile for the threshold with --perc-intra PERCENTILE
a single model for all the intra-chromosomal co-expression: select the percentile for the threshold with --perc-intra PERCENTILE
a single model for all the chromosomes considering also inter-chromosomal co-expression: select the two different thresholds (for the intra-chromosomal and the inter-chromosomal co-expression) through the --perc-intra PERCENTILE and --perc-inter PERCENTILE parameters.

In 03_hic_gene_selection.py builds an Hi-C matrix of interactions between genes instead of bins.
The --genome-wide parameter allows to build an network that includes both intra- and inter-chromosomal interactions.

The 05_chromatin_network.py script allows to build the Gene Chromatin Network by applying a threshold to the interactions.
The principle for obtaining the three different types of network is equal to the co-expression network case.
It is possible to select inter-chromosomal interactions with different resolution through the --resolution-inter RESOLUTION parameter.

Link prediction

The 01_matrix_factorization.py and 01_random_walk.py scripts allow to perform node embeddings based on the desired methods.
The methods require the the BioNEV library to be installed.
For both the scripts, the --chromatin-network FILE specifies the name of the network to embed, present in the data/DATASET/chromatin_network folder after building the network. Note that only the name is required (not the path) and without specifying the file format.

The files 02_link_prediction_chromosome.py, 03_link_prediction_intra.py, 04_link_prediction_genomewide.py perform link prediction on, respectively, the separated chromosome, single intra-chromosomal model, single intrachromosomal+interchromosomal settings.
The results of the prediction are saved in the results folder as a pickle dictionary with the acc, f1, roc, precision, recall keys depending on the type of performance measure.

The 05_results_comparison.py script generates the plot for comparing the performances between the different models.
With the --embs parameter it is possible to select which type of models to compare by listing the name of the results used to produce the prediction results. It's important replace the chromosome number with "{}" to compare runs over multiple chromosomes.
For example, to compare the topological measure method with node2vec, pass --embs rf_topological_primary_observed_ICE_{}_{}_40000_3.65_avg_l1 rf_node2vec_primary_observed_ICE_{}_{}_40000_3.59_es16_nw10_wl80_p1.0_q1.0_hadamard_0.38

Example script

Full pipeline

A script to automatically execute the whole pipeline for the single-chromosome and shared intra-chromosomal using the breast cancer dataset is provided through the run_breast_cancer.sh file.
Both the expression and the Hi-C data are automatically downloaded, but the preprocessing of the Hi-C data requires the presence of the juicer tools jar available at https://github.com/aidenlab/juicer/wiki/Juicer-Tools-Quick-Start. The version used in the paper is juicer_tools_1.13.02.jar. To execute the pipeline using a different version it is sufficient to change the value of the JUICER_PATH variable.
WARNING: verify that the machine used has the resources required to execute the pipeline.

Building a coexpression network

An example script for generating a coexpression network from the initial RNA-seq data for chromosome 1 and saving the plot of both the coexpression matrix and the adjacency matrix of the coexpression network is provided through the run_coexpression_breast_cancer.sh file.

By changing the value of the COEXP_PERCENTILE variable it is possible to select the threshold to consider when a pair of genes is coexpressed ror not.

Since the threshold percentile is compute across all the chromosome. The method must compute the coexpression matrices for all the chromosomes first.

Building a chromatin network

An example script for generating the chromatin network from the Hi-C data for chromosome 1 and saving the plot of both the Hi-C matrix and the adjacency matrix of the chromatin network is provided through the run_chromatin_breast_cancer_sh file.

It is possible to change the resolution of the Hi-C matrix to extract the links from by changing the value of the RESOLUTION variable (IMPORTANT: choose a value for ORIGINAL_RESOLUTION that is the highest among the options avaiable from juicer that are lower or equal to RESOLUTION). Similarly to the coexpression case you can change the thresholding value thorugh the HIC_PERCENTILE variable.

Since the threshold percentile is compute across all the chromosome. The method must compute the Hi-C matrices for all the chromosomes first.

Predicting the coexpression using the node2vec embeddings

An example script for predicting the coexpression of chromosome 1 using node embedding extracted with the node2vec algorithm and combined into edge embeddings using the hadamard operator is provieded through the run_node2vec_chr1.sh file.

The script contains all the variable required to control the characteristics of the networks to be used, the hyperaparameters of node2vec and the link prediction parameters.

Name		Name	Last commit message	Last commit date
Latest commit History 163 Commits
data		data
src		src
.gitignore		.gitignore
README.md		README.md
juicer_tools_1.13.02.jar		juicer_tools_1.13.02.jar
pipeline.png		pipeline.png
run_breast_cancer.sh		run_breast_cancer.sh
run_chromatin_breast_cancer.sh		run_chromatin_breast_cancer.sh
run_coexpression_breast_cancer.sh		run_coexpression_breast_cancer.sh
run_node2vec_chr1.sh		run_node2vec_chr1.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Co-Expression Network Inference from Chromatin Conformation Data through Graph Embedding

Abstract

Motivation

Results

Pipeline

Guidelines

Preprocessing

Network building

Link prediction

Example script

Full pipeline

Building a coexpression network

Building a chromatin network

Predicting the coexpression using the node2vec embeddings

About

Releases

Packages

Languages

marcovarrone/gene-expression-chromatin

Folders and files

Latest commit

History

Repository files navigation

Co-Expression Network Inference from Chromatin Conformation Data through Graph Embedding

Abstract

Motivation

Results

Pipeline

Guidelines

Preprocessing

Network building

Link prediction

Example script

Full pipeline

Building a coexpression network

Building a chromatin network

Predicting the coexpression using the node2vec embeddings

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages