This project analyzes the results of various Knowledge Graph Embedding models for Link Prediction on Knowledge Graphs. It can be used to replicate the results reported in our work "Knowledge Graph Embeddings for Link Prediction: A Comparative Analysis".
We include 16 models, representative of various families of architectural choices. For each model we used the best-performing implementation available:
- DistMult
- ComplEx-N3
- ANALOGY
- SimplE
- HolE
- TuckER
- TransE
- STransE
- CrossE
- TorusE
- RotatE
- ConvE
- ConvKB
- ConvR (implementation kindly shared by the authors privately)
- CapsE
- RSN
We also employ the rule-based model AnyBURL as a baseline.
The project is completely written in Python 3 and relies on the following libraries:
- numpy
- matplotlib
- seaborn
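As an optional, quick sanity check (not part of the project itself), you can verify that these libraries are importable in your environment:

```python
# Optional sanity check: verify that the required libraries are installed.
import numpy
import matplotlib
import seaborn

print("numpy:", numpy.__version__)
print("matplotlib:", matplotlib.__version__)
print("seaborn:", seaborn.__version__)
```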
The project is structured as a set of Python scripts, each of which can be run separately from the others:
- folder `efficiency` contains the scripts to visualize our results on the efficiency of LP models:
  - Our findings for training times can be replicated by running script `barchart_training_times.py`
  - Our findings for prediction times can be replicated by running script `barchart_prediction_times.py`
- folder `effectiveness` contains the scripts to obtain our results on effectiveness:
  - folder `performances_by_peers` contains various scripts that show how the predictive performances of LP models vary depending on the number of source and target peers of test facts.
  - folder `performances_by_paths` contains various scripts that show how the predictive performances of LP models vary depending on the Relational Path Support of test facts.
  - folder `performances_by_relation_properties` contains various scripts that show how the predictive performances of LP models vary depending on the properties of the relations of test facts.
  - folder `performances_by_reified_relation_degree` contains various scripts that show how the predictive performances of LP models vary depending on the degree of the original reified relation in Freebase.
- folder `dataset_analysis` contains various scripts to analyze the structural properties of the original datasets featured in our analysis (e.g. for computing the source peers and target peers of each test fact, or its Relational Path Support, etc.). We share the results we obtained using these scripts in ... (An illustrative sketch of the peer-counting idea is shown right after this list.)
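For readers unfamiliar with the notion of peers, the following is a minimal, illustrative sketch (not the project's actual code) of counting, for a test fact ⟨h, r, t⟩, how many distinct training heads co-occur with the same (r, t) and how many distinct training tails co-occur with the same (h, r); the scripts in `dataset_analysis` implement the exact definitions used in the paper.

```python
from collections import defaultdict

def count_peers(training_facts, test_fact):
    """Illustrative peer counting: training_facts is an iterable of (head, relation, tail)
    triples, test_fact is a single (h, r, t) triple."""
    heads_seen_with = defaultdict(set)  # (relation, tail) -> heads observed in training
    tails_seen_with = defaultdict(set)  # (head, relation) -> tails observed in training
    for h, r, t in training_facts:
        heads_seen_with[(r, t)].add(h)
        tails_seen_with[(h, r)].add(t)

    h, r, t = test_fact
    head_peers = len(heads_seen_with[(r, t)])  # training heads sharing the same (r, t)
    tail_peers = len(tails_seen_with[(h, r)])  # training tails sharing the same (h, r)
    return head_peers, tail_peers

# Example: two training heads are already known for ("plays_for", "Juventus")
train = [("Ronaldo", "plays_for", "Juventus"), ("Dybala", "plays_for", "Juventus")]
print(count_peers(train, ("Chiellini", "plays_for", "Juventus")))  # (2, 0)
```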
In each of these folders, the scripts to run in order to replicate the results of our paper are contained in the folders named `paper`.

We note that:
- In WN18RR, as reported by the authors of the dataset, a small percentage of test facts feature entities that do not appear in the training set, so no meaningful predictions can be obtained for these facts. A few implementations (e.g. Ampligraph, ComplEx-N3) actively skip such facts in their evaluation pipelines; since the large majority of systems keep them, we have all models include them in order to provide the fairest possible setting.
- In YAGO3-10 we observe that a few entities appear in two different versions, depending on HTML escaping policies or on capitalisation. In these cases, models would treat each version as a separate, independent entity; to solve this issue we have performed deduplication manually (a minimal sketch of this kind of normalisation is shown after the list below). The duplicate entities we have identified are:
  - `Brighton_&_Hove_Albion_F.C.` and `Brighton_&amp;_Hove_Albion_F.C.`
  - `College_of_William_&_Mary` and `College_of_William_&amp;_Mary`
  - `Maldon_&_Tiptree_F.C.` and `Maldon_&amp;_Tiptree_F.C.`
  - `Alaska_Department_of_Transportation_&_Public_Facilities` and `Alaska_Department_of_Transportation_&amp;_Public_Facilities`
  - `Turing_award` and `Turing_Award`
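The following is a minimal sketch of the kind of normalisation involved (the actual deduplication was done manually; this helper is purely hypothetical): variants of the same entity are mapped to a single canonical name by undoing HTML escaping and applying an explicit mapping for capitalisation differences.

```python
import html

# Hypothetical canonicalisation map for variants that differ only by capitalisation.
CANONICAL = {
    "Turing_award": "Turing_Award",
}

def normalize_entity(name):
    """Map an entity-name variant to a single canonical form."""
    unescaped = html.unescape(name)            # e.g. "&amp;" -> "&"
    return CANONICAL.get(unescaped, unescaped)

print(normalize_entity("Brighton_&amp;_Hove_Albion_F.C."))  # Brighton_&_Hove_Albion_F.C.
print(normalize_entity("Turing_award"))                      # Turing_Award
```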
To set up and run the project:

- Open a terminal shell.
- Create a new folder named `comparative_analysis` in your filesystem by running command: `mkdir comparative_analysis`
- Download the `datasets` folder and the `results` folder from our storage, and move them into the `comparative_analysis` folder. Be aware that the files to download occupy around 100GB overall.
- Clone this repository under the same `comparative_analysis` folder with command: `git clone https://github.com/merialdo/research.lpca.git analysis`
- Open the project in folder `comparative_analysis/analysis` (using a Python IDE is suggested).
- Access file `comparative_analysis/analysis/config.py` and update the `ROOT` variable with the absolute path of your `comparative_analysis` folder (an illustrative excerpt is shown after this list).
- In order to replicate the plots and experiments performed in our work, just run the corresponding Python scripts in the `paper` folders mentioned above. By default, these experiments are run on dataset `FB15K`. In order to change the dataset on which to run an experiment, just change the value of variable `dataset_name` in the script you wish to launch. Acceptable values are `FB15K`, `FB15K_237`, `WN18`, `WN18RR` and `YAGO3_10`.
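As an illustration of the two settings mentioned above (the actual file contents may differ), `config.py` and the launched script would look roughly like this:

```python
# comparative_analysis/analysis/config.py -- illustrative excerpt only
ROOT = "/absolute/path/to/comparative_analysis"  # set this to your own folder

# In the script you wish to launch -- illustrative excerpt only
dataset_name = "FB15K_237"  # one of: FB15K, FB15K_237, WN18, WN18RR, YAGO3_10
```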
Please note that the data in folders `datasets` and `results` are required in order to launch most scripts in this repository. These data can also be obtained by running the various scripts in folder `dataset_analysis`, which we include for the sake of completeness.
The global performances of all models under both the `min` and `avg` tie policies can be printed on screen by running the script `print_global_performances.py`.
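As background, the following is a minimal sketch (not the project's actual code, and assuming higher scores are better) of how the `min` and `avg` tie policies are commonly defined when the correct answer ties in score with other candidates:

```python
def rank_with_tie_policy(target_score, all_scores, policy="avg"):
    """Rank of the correct answer among all candidate scores under a tie policy."""
    higher = sum(1 for s in all_scores if s > target_score)      # strictly better candidates
    tied = sum(1 for s in all_scores if s == target_score) - 1   # other candidates with the same score
    if policy == "min":
        return higher + 1               # optimistic: best rank within the tied group
    return higher + 1 + tied / 2.0      # "avg": mean rank within the tied group

scores = [0.9, 0.7, 0.7, 0.3]           # the correct answer scores 0.7 and ties with one other candidate
print(rank_with_tie_policy(0.7, scores, policy="min"))  # 2
print(rank_with_tie_policy(0.7, scores, policy="avg"))  # 2.5
```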