Natural Entity Types

This repository contains code to select the most natural type for Wikidata entities. For each entity, we consider the types connected to it in Wikidata via an instance-of/subclass-of* path. That is, the set of candidate types consists of all types reachable from the entity via a chain of relations that starts with a single instance-of relation, followed by zero or more subclass-of relations. We employ various methods, such as a Gradient Boost Regressor model and a feed-forward neural network, to score each candidate type and select the most natural one.
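The candidate set described above can be illustrated with a small breadth-first traversal. This is only a sketch and not the repository's implementation: the function name `candidate_types` and the two in-memory dictionaries are assumptions made for illustration, and the toy edges are simplified and may not match the live Wikidata graph. In Wikidata terms, instance-of is property P31 and subclass-of is property P279, so the path corresponds to the SPARQL property path `wdt:P31/wdt:P279*`.

```python
from collections import deque

# Hypothetical toy graph standing in for the Wikidata mappings.
# The edges are simplified for illustration only.
INSTANCE_OF = {"Q937": ["Q5"]}       # Albert Einstein -> human
SUBCLASS_OF = {
    "Q5": ["Q215627"],               # human -> person
    "Q215627": ["Q35120"],           # person -> entity
}


def candidate_types(entity, instance_of, subclass_of):
    """Return all types reachable from `entity` via exactly one
    instance-of edge followed by zero or more subclass-of edges,
    in breadth-first order and without duplicates."""
    # Direct types (one instance-of edge, zero subclass-of edges).
    order = list(dict.fromkeys(instance_of.get(entity, [])))
    seen = set(order)
    queue = deque(order)
    while queue:
        current = queue.popleft()
        for parent in subclass_of.get(current, []):
            if parent not in seen:   # also guards against cycles
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


candidates = candidate_types("Q937", INSTANCE_OF, SUBCLASS_OF)
print(candidates)
```

Each candidate in the returned list would then be scored by one of the models, and the highest-scoring one selected as the natural type.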

Setup with Docker

Get the code and build the docker image:

git clone https://github.com/ad-freiburg/natural-entity-types.git
cd natural-entity-types
docker build -t natural-entity-types .

Run the docker container:

docker run -it -v $(pwd)/data/:/home/data -v $(pwd)/models/:/home/models -v $(pwd)/benchmarks/:/home/benchmarks natural-entity-types

Make sure the mounted directories are writable from within the docker container, e.g. by running:

chmod a+rw -R data/ models/ benchmarks/

Inside the docker container, get the data by running:

make download_all

Alternatively, if you want the most up-to-date data, generate it by running:

make generate_all

This will download Wikidata mappings using the QLever API, generate databases from them for quick access, and compute type properties from these mappings, which the models use as features. This can take a couple of hours.

Or, if you only want to use the model that does not depend on precomputed features, run:

make generate_wikidata_mappings

This will only download the Wikidata mappings and generate the databases from them. The model that does not depend on precomputed features (models/nn.no_precomp.512_sigmoid_d02_32_adam00001.70k.p) yields similar results to the full model: in our experiments, its accuracy@1 is only 0.8 percentage points lower.

You can now train and evaluate models, or use trained models to generate natural type triples for all entities in Wikidata, as described in the next sections.

Training and Evaluation

To train and/or evaluate a model, adjust the following command according to your needs:

python3 scripts/evaluate.py -m <gbr|nn|gpt|oracle> --save_model <model_file> -b <benchmark_file> -train <training_file>

The -m option specifies the model to be evaluated. You can choose from gbr (Gradient Boost Regressor), nn (Feed Forward Neural Network), gpt (GPT-4), and oracle. oracle is a model that always predicts the ground truth type of the benchmark entity if it is among the candidate types. The oracle evaluation results represent the upper bound of what a model that relies on the candidate types can achieve.
<model_file> is the path where the trained model will be saved (optional).
<benchmark_file> is the path to the benchmark file on which the model will be evaluated, e.g. benchmarks/mini_benchmark.test.tsv. The expected format of the benchmark is a tsv file with one line per entity, with the entity QID in the first column and the space-separated QIDs of the ground truth types in the second column.
<training_file> is the path to the file that contains the training data. The expected format is the same as for the benchmark file.
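To make the benchmark format and the oracle upper bound concrete, here is a minimal sketch that parses such a file and computes accuracy@1. It is not the code in scripts/evaluate.py: the helper names `read_benchmark`, `oracle_predictions`, and `accuracy_at_1` are hypothetical, the sample rows and candidate sets are made up for illustration, and accuracy@1 is assumed to mean the fraction of entities whose single predicted type is among the ground truth types.

```python
import io


def read_benchmark(lines):
    """Parse lines of the form 'entity_qid<TAB>type_qid type_qid ...'
    into a dict mapping each entity QID to its ground truth type QIDs."""
    benchmark = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        entity, types = line.split("\t", 1)
        benchmark[entity] = types.split()
    return benchmark


def oracle_predictions(benchmark, candidates):
    """Predict a ground truth type whenever one is among the candidates,
    otherwise predict nothing (None)."""
    predictions = {}
    for entity, truth in benchmark.items():
        cand = set(candidates.get(entity, []))
        predictions[entity] = next((t for t in truth if t in cand), None)
    return predictions


def accuracy_at_1(predictions, benchmark):
    """Fraction of entities whose predicted type is a ground truth type."""
    if not benchmark:
        return 0.0
    hits = sum(1 for e, truth in benchmark.items()
               if predictions.get(e) in truth)
    return hits / len(benchmark)


# Made-up sample data; in practice, open a benchmark file instead.
sample = io.StringIO("Q937\tQ5\nQ64\tQ515 Q1549591\n")
benchmark = read_benchmark(sample)
# Hypothetical candidate sets: the second entity's ground truth types
# are not reachable, so even the oracle cannot get it right.
candidates = {"Q937": ["Q5", "Q215627"], "Q64": ["Q486972"]}
upper_bound = accuracy_at_1(oracle_predictions(benchmark, candidates), benchmark)
print(upper_bound)
```

In this toy setup the oracle scores below 1.0 because one entity's ground truth types are missing from its candidate set, which is exactly the gap the oracle evaluation is meant to expose.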

Once you have evaluated a model for the first time (assuming you used the --save_model option), you can replace the --save_model option with the --load_model option to load the previously saved model from the specified file without having to train it again.

Generate Natural Type Triples

To generate natural type triples for all entities that have an instance-of or subclass-of relation in Wikidata, run:

make triples

This will generate a file data/results/natural_types.ttl that contains the natural type triples in TTL format. You can adjust which model and feature set are used by changing the MODEL and FEATURES variables in the Makefile. By default, the model models/nn.no_precomp.512_sigmoid_d02_32_adam00001.70k.p is used, which does not depend on precomputed features. If you want to use the full model, run:

make triples FEATURES=all MODEL=models/nn.512_sig_d04_64_adam00001.70k.pt
