Use graph analytics to predict relationships between entities
Project Background:
This repo implements the link prediction concepts described in Chapter 8 of the "Graph Algorithms" book by Mark Needham and Amy E. Hodler. A free copy of this excellent book can be downloaded at https://neo4j.com/graph-algorithms-book/. In the book, the authors implement link prediction using PySpark ML algorithms.
In this repo I share an implementation of link prediction using the popular Sklearn and Pandas Python libraries (instead of PySpark), and also demonstrate how Neo4j (graph database) connectors can be used to send data and commands between Python and a Neo4j server.
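As a rough illustration of the Python-to-Neo4j round trip, here is a minimal sketch that runs a Cypher query and collects the rows into a Pandas DataFrame. It assumes the official `neo4j` Python driver; the helper names (`query_to_frame`, `fetch_coauthor_pairs`), the `Author` label, the `CO_AUTHOR` relationship type, and the `name` property are illustrative assumptions, not necessarily what this repo uses.

```python
import pandas as pd


def query_to_frame(session, query, params=None):
    """Run a Cypher query on an open session and return the rows as a DataFrame."""
    result = session.run(query, params or {})
    # Each record exposes .data(), a plain dict keyed by the RETURN aliases.
    return pd.DataFrame([record.data() for record in result])


def fetch_coauthor_pairs(uri, user, password, limit=5):
    """Hypothetical helper: connect to a Neo4j server and pull a few author pairs."""
    # Imported lazily so this module still loads when the driver is not installed.
    from neo4j import GraphDatabase

    # Label, relationship type and property name are assumptions for illustration.
    query = (
        "MATCH (a:Author)-[:CO_AUTHOR]->(b:Author) "
        "RETURN a.name AS node1, b.name AS node2 LIMIT $limit"
    )
    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            return query_to_frame(session, query, {"limit": limit})
    finally:
        driver.close()
```

With a local server running, a call such as `fetch_coauthor_pairs("bolt://localhost:7687", "neo4j", "<password>")` would return a DataFrame with `node1` and `node2` columns. Keeping the query logic in `query_to_frame` means the same helper can be reused for every Cypher query in the workflow.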
We’ll use a paper citation dataset containing information about research papers, authors, author relationships, and citation relationships. We'll create a coauthors graph based on authors who have collaborated on papers and then predict future collaborations between pairs of authors. We’re only interested in collaborations between authors who haven’t collaborated before; we’re not concerned with multiple collaborations between the same pair of authors. (See the Dataset section below for more details.)
Technologies used in the repo:
Python (Sklearn, Pandas and Neo4j libraries)
Neo4j Desktop Edition
Cypher Query Language
Prerequisites:
Graph Theory Concepts (presented very well in the book)
Familiarity with Neo4j Graph Database
Basic knowledge of Cypher query language
Knowledge of Sklearn and Pandas libraries
Dataset (info taken from page 200 of the "Graph Algorithms" book)
The data comes from the Citation Network Dataset, a research dataset extracted from DBLP, ACM, and MAG. The dataset is described in the paper “ArnetMiner: Extraction and Mining of Academic Social Networks” by J. Tang et al. The dataset version used in the book (DBLP-Citation-network V10) contains 3,079,007 papers, 1,766,547 authors, 9,437,718 author relationships, and 25,166,994 citation relationships. We’ll be working with a subset focused on articles that appeared in the following publications:
• Lecture Notes in Computer Science
• Communications of the ACM
• International Conference on Software Engineering
• Advances in Computing and Communications
Our resulting dataset contains 51,956 papers, 80,299 authors, 140,575 author relationships, and 28,706 citation relationships.
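To make the idea of a coauthors graph concrete, here is a small pure-Python sketch that derives coauthor pairs from paper records and keeps the year of each pair's first joint paper. The record layout (`year` and `authors` keys), the function name `first_collaborations`, and the toy data are assumptions for illustration; in this project the graph itself lives in Neo4j.

```python
from itertools import combinations


def first_collaborations(papers):
    """Map each unordered author pair to the year of their first joint paper."""
    first_year = {}
    for paper in papers:
        # Sort the de-duplicated authors so every pair has one canonical ordering.
        authors = sorted(set(paper["authors"]))
        for pair in combinations(authors, 2):
            year = paper["year"]
            if pair not in first_year or year < first_year[pair]:
                first_year[pair] = year
    return first_year


# Toy records (hypothetical, not taken from the real dataset).
papers = [
    {"title": "P1", "year": 2004, "authors": ["Ana", "Bo"]},
    {"title": "P2", "year": 2001, "authors": ["Bo", "Ana", "Cy"]},
    {"title": "P3", "year": 2010, "authors": ["Cy", "Dee"]},
]
coauthors = first_collaborations(papers)
```

Because repeat collaborations collapse into a single pair, this matches the focus on first-time collaborations: each pair appears once, tagged with the year it first occurred.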
Process Overview (see Python Code for details)
Load data into Neo4j
Create balanced data and split the samples into Pandas DataFrames for training and testing
Implement ML methods for link prediction
Train and evaluate several versions of the ML prediction model, starting with basic graph features and then adding more graph algorithm features extracted using Neo4j
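The balancing, feature, and training steps above can be sketched roughly as follows. This is a minimal illustration on a toy graph, not the repo's actual code: the column names, the helper names (`downsample_negatives`, `add_basic_features`), the choice of a random forest with these hyperparameters, and the in-memory `neighbors` dictionary (standing in for values that would come from Neo4j) are all assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

FEATURES = ["common_neighbors", "pref_attachment", "total_neighbors"]


def downsample_negatives(df, seed=42):
    """Keep every positive pair plus an equally sized random sample of negatives.

    Assumes there are at least as many negative pairs as positive ones.
    """
    pos = df[df["label"] == 1]
    neg = df[df["label"] == 0].sample(n=len(pos), random_state=seed)
    return pd.concat([pos, neg]).reset_index(drop=True)


def add_basic_features(df, neighbors):
    """Add common-neighbor, preferential-attachment and total-neighbor scores."""
    out = df.copy()
    sets = [(neighbors.get(a, set()), neighbors.get(b, set()))
            for a, b in zip(out["node1"], out["node2"])]
    out["common_neighbors"] = [len(x & y) for x, y in sets]
    out["pref_attachment"] = [len(x) * len(y) for x, y in sets]
    out["total_neighbors"] = [len(x | y) for x, y in sets]
    return out


# Toy coauthor graph as adjacency sets (hypothetical data).
neighbors = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"}, "e": {"d"}, "f": set(),
}
columns = ["node1", "node2", "label"]
train_pairs = pd.DataFrame(
    [("a", "d", 1), ("b", "d", 1), ("a", "e", 0),
     ("b", "e", 0), ("a", "f", 0), ("e", "f", 0)],
    columns=columns,
)
test_pairs = pd.DataFrame([("c", "e", 1), ("d", "f", 0)], columns=columns)

train = add_basic_features(downsample_negatives(train_pairs), neighbors)
test = add_basic_features(test_pairs, neighbors)

model = RandomForestClassifier(n_estimators=30, max_depth=10, random_state=0)
model.fit(train[FEATURES], train["label"])
predictions = model.predict(test[FEATURES])
print("accuracy:", accuracy_score(test["label"], predictions))
```

Downsampling the negative class matters because the vast majority of author pairs never collaborate, so an unbalanced sample would let a model score well by always predicting "no link". Richer features (for example triangle counts or community membership computed in Neo4j) could be appended to `FEATURES` in the same way.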