A single-file, lean framework for continuous replication of external data sources into Wikidata. Unlike on Wikipedia, almost all one-off tasks involving mass Wikidata edits can be performed with a combination of WDQS, a spreadsheet and QuickStatements. You only need this framework if you plan to run your bot periodically. It is intended for experienced Wikidata users who are able to:
- write data extractors in Python 3.9+
- write SPARQL queries (with WDQS-specific extensions)
- understand the Wikibase Data Model
Source code of real-life bots based on this framework is available (see below).
In order to run a script in any IDE, one has to specify the login and password as the first and second command-line arguments. Required dependencies can be installed with `pip install -r requirements.txt`.
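For illustration, here is a minimal sketch of that invocation pattern. The script path and argument handling are assumptions; only the argument order (login first, password second) comes from the text above.

```python
# Hypothetical local run:  python src/simbad_id.py <bot_login> <bot_password>
# (script path assumed; see the list of bots below)
import sys

login, password = sys.argv[1], sys.argv[2]  # first and second command-line arguments
print(f'Will edit Wikidata as {login}')
```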
If you want to run the scripts on Toolforge, do the following:
- Open an ssh session to your tool on Toolforge and run the following 3 commands:
  `git init`
  `git remote add origin https://github.com/ghuron/wdpy.git`
  `git pull origin master`
- Create a BotPassword with a limited set of permissions. Normally, besides basic, I request `editpage`, `createeditmovepage` and `highvolume` (see https://www.wikidata.org/wiki/Special:ListGrants for details).
- Create a one-line file `src/toolforge/.credentials` containing your login and password separated by a space (a parsing sketch follows this list). The scripts will use it to make edits in Wikidata. Don't forget to `chmod 440 src/toolforge/.credentials` to make sure other Toolforge users cannot see your credentials.
- Run `toolforge-jobs run bootstrap-venv --command "cd $PWD && src/bootstrap_venv.sh" --image tf-python39 --wait` in order to initialize the Python virtual environment and install the necessary packages.
- Load the jobs schedule via `toolforge-jobs load src/toolforge/jobs.yaml` or run the jobs individually (see https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework)
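The `.credentials` file is a single line, so parsing it is trivial. A minimal sketch, assuming the framework simply splits the line on whitespace (the actual reading code may differ):

```python
# Minimal sketch: read "login password" from the one-line credentials file
with open('src/toolforge/.credentials') as f:
    login, password = f.read().split()
```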
The framework consists of 4 basic classes:
- `Wikidata` class is a thin wrapper around Wikidata API v1 and the Wikidata Query Service
- `Model` class is a set of methods that help to work with values and snaks
- `Claim` class represents a statement, with a focus on the list of references
- `Element` class represents items that can be created from (or updated with) a list of snaks
More detailed documentation is included in the source code. There are also some tests that might help with understanding the expected behaviour.
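As a rough illustration of what the `Wikidata` class abstracts away, the snippet below talks to the two underlying services directly with plain `requests` rather than through the framework; the property and item used are arbitrary examples:

```python
# Illustrative only: direct calls to the two services wrapped by the Wikidata class
import requests

HEADERS = {'User-Agent': 'wdpy-readme-example/0.1'}

# Wikidata Query Service: a few items holding a P3083 (SIMBAD ID) statement
sparql = 'SELECT ?item ?id WHERE { ?item wdt:P3083 ?id } LIMIT 5'
rows = requests.get('https://query.wikidata.org/sparql',
                    params={'query': sparql, 'format': 'json'}, headers=HEADERS).json()
for binding in rows['results']['bindings']:
    print(binding['item']['value'], binding['id']['value'])

# Wikidata API: full JSON representation of a single item
entity = requests.get('https://www.wikidata.org/w/api.php',
                      params={'action': 'wbgetentities', 'ids': 'Q90', 'format': 'json'},
                      headers=HEADERS).json()
print(entity['entities']['Q90']['labels']['en']['value'])  # "Paris"
```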
You can learn how to use the framework by looking into real-life implementations of my bots (from the simplest to the most complex):
- `simbad_id.py` is intended to keep P3083 values in Wikidata up to date.
  SIMBAD aggressively changes primary identifiers, usually keeping "redirects" for some time (but not for long).
  It is important to identify P3083 statements with redirects and "resolve" them. In order to do so, we are:
  - obtaining 10000 statements via WDQS
  - checking which of them are redirects (running simple ADQL queries using TAP, see the TAP sketch after this list)
  - updating the affected statements via the `Claim` class
- `arxiv.py` combines two tasks in one file:
  - When run as a standalone bot, it helps to fill in missing P356 values when P818 is known and vice versa.
    It uses slow bulk-metadata OAI-PMH harvesting, 2 SPARQL queries and the `Claim` class.
  - It implements the bare minimum on top of the `Model` and `Element` classes to make sure that `arxiv.Element.get_by_id()` is able to create a missing item. Information is extracted via the regular arXiv API (see the arXiv API sketch after this list).
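The TAP-based redirect check from the `simbad_id.py` item above can be approximated with a single synchronous ADQL query. This is a standalone sketch, not the bot's actual code; the endpoint and the `ident`/`basic` tables follow SIMBAD's public TAP schema as I understand it, and the identifier is an arbitrary example:

```python
# Hypothetical redirect check against SIMBAD TAP (verify the endpoint and schema before relying on it)
import requests

identifier = 'HD 1'  # an identifier as it could appear in a P3083 statement
adql = ('SELECT basic.main_id FROM ident JOIN basic ON basic.oid = ident.oidref '
        f"WHERE ident.id = '{identifier}'")
response = requests.get('https://simbad.cds.unistra.fr/simbad/sim-tap/sync',
                        params={'REQUEST': 'doQuery', 'LANG': 'ADQL',
                                'FORMAT': 'csv', 'QUERY': adql})
lines = response.text.splitlines()
main_id = lines[1].strip('"') if len(lines) > 1 else None
if main_id and main_id != identifier:
    print(f'{identifier} is a redirect, the primary identifier is now {main_id}')
```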
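Similarly, the lookup behind `arxiv.Element.get_by_id()` can be sketched against the public arXiv API directly, without the framework's classes. The arXiv ID below is an arbitrary example, and the `arxiv:doi` element is only present when the authors have supplied a journal DOI:

```python
# Illustrative arXiv API lookup: find the DOI (P356) for a known arXiv ID (P818)
import requests
import xml.etree.ElementTree as ET

arxiv_id = '1706.03762'  # arbitrary example
feed = requests.get('http://export.arxiv.org/api/query', params={'id_list': arxiv_id}).text
ns = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}
entry = ET.fromstring(feed).find('atom:entry', ns)
if entry is not None:
    doi = entry.find('arxiv:doi', ns)
    print(entry.find('atom:title', ns).text.strip())
    print(doi.text if doi is not None else 'no journal DOI in the arXiv metadata')
```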