
In-depth Installation Instructions

Ryan Johnson edited this page Oct 24, 2017 · 1 revision

Setting up an environment for OpenRefine reconciliation with Anaconda

OpenRefine is a crucial tool for linked data reconciliation (AKA "from strings to things"). Because semantics are involved, reconciliation tends to be automated by a machine first and then manually reviewed by a human, which makes OpenRefine's combination of a GUI and scripting support ideal. Unfortunately, this functionality does not always come "out of the box", and many of the scripts used to set up "local" reconciliation services require specific Python libraries and/or versions of Python itself, which can lead to code errors or, worse, dependency hell.


To avoid this, we will rely on a Python environment I have made with Anaconda. Anaconda creates Python virtual environments so that you can tailor one to a specific task or program (thus avoiding the dependency hell mentioned above). With a dedicated environment, we can switch into it whenever we want to reconcile data in OpenRefine, and switch back out when we want to do other Python work (there's lots of other cool Python and other programming stuff you can do!).

If you're curious which libraries are included: they mainly cover SPARQL for querying linked data sources, common libraries for interacting with APIs, and libraries that attempt "fuzzy matching".

Once Anaconda is installed, setting up the environment comes down to copying and pasting one line on the command line.

NOTE: For this particular repo, we are assuming a Windows user. This is the use case I am faced with, but the setup should generally work on other systems if you follow those systems' instructions (and will in fact require fewer steps). This environment has been successfully built on Windows 7 and Windows 10 machines.

Pre-requisite installation of Microsoft C++ Build Tools

One prerequisite first: Windows needs some special "build tools" from Microsoft to work with certain Python libraries. Visit the Visual C++ Build Tools page and select the "Download Visual C++ Build Tools 2015" option. Let the installer run, and go get some coffee, because it might take a while.

Pre-requisite installation of Anaconda3

  • Uninstall previous versions of Anaconda on your computer, if they exist.
  • Download the latest version of Anaconda3 for Windows, selecting the 64-bit Python 3.x installer.
  • Run the installer with the default settings, but make sure the boxes about adding conda to your PATH are checked.
  • When finished installing, close the installer, and open "Anaconda Navigator" (you can search Programs for "Anaconda"). This is Anaconda's GUI.
  • Click on the "Environments" tab on the left. You should see an environment called root. This is where a bunch of popular python libraries have been installed. We will now create a new environment.
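Before moving on, it can help to confirm that conda is reachable from the shell you plan to use. The snippet below is a minimal sketch for a bash-style shell such as Git Bash; the helper name check_conda is made up for this example and is not part of conda.

```shell
# Hypothetical helper (not part of conda): prints the conda version,
# or returns a non-zero status when conda is not on PATH.
check_conda() {
    if command -v conda >/dev/null 2>&1; then
        conda --version
    else
        echo "conda not found on PATH" >&2
        return 1
    fi
}

# Run the check and print a hint if it fails.
check_conda || echo "Re-run the installer with the PATH boxes checked, or use the shell that ships with Anaconda."
```

If the version prints, the PATH boxes in the installer did their job and the command-line steps that follow should work from this shell.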

Set up the new refine conda environment

Anaconda Navigator has an intuitive GUI for looking at your conda environments, but unfortunately it is not the right tool for setting up a very specific conda environment like the one we need. So it's off to the command line! But first:

  • Download the YAML (.yml) file from this repo or clone it... however you normally grab repositories from GitHub.

NOTE: For the command line work in the rest of these instructions, Git Shell (which gets installed with GitHub Desktop) or Git Bash will work nicely. Or you can use "Anaconda Prompt", which you got when you installed Anaconda.

  • Open your shell, and navigate (via cd) to where you downloaded the YAML file from this repo. Alternatively, if you have the Git Bash program installed, you can right-click in the directory with the YAML file and select the "Git Bash here" option from the context menu.
  • Type:
$ conda env create -f openrefine.yml
  • The environment will now be installed, and it should then show up in the Anaconda Navigator "Environments" tab as refine3.

  • Now go back to your shell. We want to switch into our new environment. In conda, we do that by typing:

$ source activate refine3

If your shell complains, instead drop the source part:

$ activate refine3
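To double-check the result from the command line rather than from Anaconda Navigator, something like the following sketch can be used in a bash-style shell. The helper names env_exists and active_env are invented for this example. It assumes that conda env list prints one environment per line starting with its name, and that activation sets the CONDA_DEFAULT_ENV variable. Newer conda releases (4.4 and later) also accept conda activate refine3 in place of the commands above.

```shell
ENV_NAME="refine3"   # environment name used in the steps above

# Hypothetical helper: succeeds if `conda env list` has a line starting with the given name.
env_exists() {
    conda env list | grep -q "^$1[[:space:]]"
}

# Hypothetical helper: prints the active environment name, or "none" if nothing is activated.
active_env() {
    echo "${CONDA_DEFAULT_ENV:-none}"
}

if command -v conda >/dev/null 2>&1 && env_exists "$ENV_NAME"; then
    echo "$ENV_NAME exists; active environment: $(active_env)"
else
    echo "$ENV_NAME not found; re-run the conda env create step"
fi
```

If the environment exists but the active one is reported as something else, run the activate command again in the same shell window.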