pip install git+https://github.com/jorgeavilacartes/panricci.git
or (recommended for developer) create a conda environment with the library in edition mode
git clone git@github.com:jorgeavilacartes/panricci.git
cd panricci
conda env create -f panricci.yml
conda activate panricci
panricci --help
panricci
is a library that deals with pangenome graphs (variation graphs and sequence graphs)
with tools from Riemannian Geometry.
A pangenome graph in the panricci
universe is a manifold
This manifold is evolved over time by using the Ricci-Flow, an algorithm that leverages the notion of curvature of edges to modify its weights until the curvature is constant (equal to 0).
Once the graph reach the state of constant curvature, we can use it to perform alignment of two graphs by definining coordinates with respect to source and sink nodes in our (directed) pangenome graphs.
NOTE Each pangenome graph (input file, and manifold) is assumed to be one single connected component.
panricci
works with pangenome graphs (variation graphs or sequence graphs) in .gfa
format.
- You need to evolve the metric of the graph, this is randomnly initialized.
- Once you have two graphs with the metrics after Ricci-Flow, we can align them.
- Check the results.
$ panricci --help
Usage: panricci [OPTIONS] COMMAND [ARGS]...
๐ฑ Welcome to PanRicci : Alignment of pangenome graphs with Ricci-Flow
โญโ Options โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ --install-completion Install completion for the current shell. โ
โ --show-completion Show completion for the current shell, to copy it or โ
โ customize the installation. โ
โ --help Show this message and exit. โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
โญโ Commands โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ ricci-flow apply ricci flow to a graph. โ
โ align Alignment of ricci graphs. โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
We will use one graph from the data
folder
$ panricci ricci-flow \
--gfa data/test5.gfa \
--iterations 1000 \
--tol-curvature 1e-15 \
--outdir output/test5
- this tells
panricci
to apply the discrete Ricci-Flow algorithm for at most 1000 iterations, or until all curvatures are smaller than 1e-15 (our goal is to stop when all curvatures are 0) - the results for each iteration (weights and curvature for each edge of the graph) will be saved in the
output/test5
directory - the option
--undirected
allows the computation of the Wasserstein distance over the local subgraph of two nodes avoiding the directions, which is needed to keep the Wasserstein distance as a metric, so we recommend to keep this option. - by default node distributions are defined for a variation graph, i.e. paths are considered. If you have a sequence graph, where paths are not part of the .gfa file, then you can add the option
--sequence-graph
The alignment consists on finding a mapping between nodes of two graphs.
You can either create the ricci graph for another gfa in the folder, or we can use the same ricci graph to be align against itself, which we will do now.
Notice that in the output/test5
folder you will find a lot (one for each iteration), we will take the last one, which should be the file test5-ricciflow-52.edgelist
$ panricci align \
--ricci-graph1 output/test5/test5-ricciflow-52.edgelist \
--ricci-graph2 output/test5/test5-ricciflow-52.edgelist \
--path-save output/alignment.tsv
you should get a csv file with the following information inside
edge cost_alignment node1 node2
0 ['4-1', '4-2'] 0.0 4 4
1 ['3-1', '3-2'] 0.0 3 3
2 ['2-1', '2-2'] 0.0 2 2
3 ['1-1', '1-2'] 0.0 1 1
the columns,
node1
: the identifier of the node in the ricci graph 1node2
: the identifier of the node in the ricci graph 2edge
: are tuples of strings of the form '<node_id>-<graph_id>'cost_alignment
: the cost of aligningnode1
andnode2
(euclidean distance between their relative node representations)
as we can see from this result, the first row says that node 4 of the graph1 was aligned to node 4 of graph2, with a cost alignment of 0, which means that they have the same relative node representation.
Additionally, you can ask to add node metadata (label and node depth) to the output by providing the original .gfa files as follows
$ panricci align \
--ricci-graph1 output/test5/test5-ricciflow-52.edgelist \
--ricci-graph2 output/test5/test5-ricciflow-52.edgelist \
--path-save output/alignment.tsv \
--gfa1 data/test5.gfa \
--gfa2 data/test5.gfa
The output should look like this
edge cost_alignment node1 node2 label1 label2 node_depth1 node_depth2
0 ['4-1', '4-2'] 0.0 4 4 CAACGTTTTTTT CAACGTTTTTTT 0.5 0.5
1 ['3-1', '3-2'] 0.0 3 3 CTAGA CTAGA 1.0 1.0
2 ['2-1', '2-2'] 0.0 2 2 A A 0.5 0.5
3 ['1-1', '1-2'] 0.0 1 1 C C 1.0 1.0
The cost of alinging two nodes is the euclidean distance between the relative node representation of each node w.r.t. source and sink nodes in the graph. There is the option of including as well a cost that penalizes nodes with different labels
$ panricci align \
--ricci-graph1 output/test5/test5-ricciflow-52.edgelist \
--ricci-graph2 output/test5/test5-ricciflow-52.edgelist \
--path-save output/alignment.tsv \
--weight-node-labels 0.5 \
--gfa1 data/test5.gfa \
--gfa2 data/test5.gfa
from panricci.ricci_flow import RicciFlow
from panricci.utils import GFALoader
from panricci.node_distributions.variation_graph import DistributionNodes
# load variation graph
gfa_loader = GFALoader(undirected=False)
G = gfa_loader(gfa)
# compute distribution for each node of the graph
distribution_nodes = DistributionNodes(G, alpha=0.5)
# Initialize Ricci-Flow
ricci_flow = RicciFlow(G,
distribution=distribution_nodes, # the distribution over the 1-hop neighborhood for each node in the graph
save_last=False, # will overwrite the results in each iteration to keep the last one: outfile will be "{name}-ricciflow-{it}.edgelist"
save_intermediate_graphs=True, # will save results for all iteration: outfile will be "{name}-ricciflow.edgelist"
dirsave_graphs="outdir", # directory to save results
tol_curvature=1e-15, # tolerance of minimum curvature to stop Ricci-Flow
overwrite=False, # If the dirsave_graphs directory exists, it will raise an Exception
)
# apply Ricci-Flow
G_ricci = ricci_flow.run(
iterations=1000, # maximum number of iterations to run Ricci-Flow
name="id-graph" # some identifier to store results
)
from pathlib import Path
import networkx as nx
from panricci.utils import GFALoader
from panricci.alignment import GraphAlignment, parse_alignment
dirsave=Path(path_save)
dirsave.parent.mkdir(exist_ok=True, parents=True)
# initialize aligner
aligner = GraphAlignment(
ricci_embedding = True,
weight_node_labels = 0.5, # set to 0 if you only want to consider embedding cost
path_save_bipartite = "bipartite-graph.edgelist", # store bipartite graph used for alignment
)
# load ricci graphs
g1 = nx.read_edgelist(ricci_graph1, data=True, create_using=nx.DiGraph)
g2 = nx.read_edgelist(ricci_graph2, data=True, create_using=nx.DiGraph)
gfa_loader = GFALoader(undirected=False)
graph1 = gfa_loader(gfa1)
for edge, d in g1.edges.items():
graph1.edges[edge]["weight"] = d["weight"]
graph1.edges[edge]["curvature"] = d["curvature"]
graph2 = gfa_loader(gfa2)
for edge, d in g2.edges.items():
graph2.edges[edge]["weight"] = d["weight"]
graph2.edges[edge]["curvature"] = d["curvature"]
alignment = aligner(g1, g2, name="alignment")
df_alignment = parse_alignment(alignment, graph1, graph2)
df_aligment.to_csv("alignment.tsv", sep="\t")