While it is possible to examine changes using just the git
version control tool and the "raw" data repository,
this is often overwhelming (see for example the changes between release 4.2.1 and 4.1).
So unless one is interested in particular commits and their commit comments, which may give some context for particular changes,
it is typically better to examine changes in the aggregated CLDF dataset
(e.g. the changes between 4.2.1 and 4.1).
In some cases, this may still be "too much". In the following we describe how to "condense" changes between two
Glottolog releases down to just a list of changed languoid names. We'll use the shell, aka command line
(see the software carpentry lesson for an excellent introduction),
the tools of the csvkit package and, optionally, git.
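Before starting, it may help to confirm that the tools mentioned above are available on the `PATH`. The following is a minimal sketch; the helper name `check_tools` is made up for this example, and it only reports what it finds without installing anything.

```shell
# check_tools is a hypothetical helper, not part of csvkit or git.
# It reports, for each tool used in this walk-through, whether the
# shell can find it on the PATH.
check_tools() {
    for tool in csvcut csvjoin csvsql git; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}

check_tools
```

If one of the csvkit tools is reported as missing, the csvkit package is probably not installed in the current environment.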
- Retrieve the table of languoids for the two releases we want to compare
  - either by downloading and unpacking the two releases,
  - or from GitHub: `cldf/languages.csv` at 4.2.1 and `cldf/languages.csv` at 4.1,
  - or via `git` and a local clone of https://github.com/glottolog/glottolog-cldf:

        $ git show v4.2.1:cldf/languages.csv > languages-4.2.1.csv
        $ git show v4.1:cldf/languages.csv > languages-4.1.csv
- Prune the two languoid tables to only the `ID` and `Name` columns using `csvcut`:

      $ csvcut -c ID,Name languages-4.2.1.csv > languages-4.2.1-pruned.csv
      $ csvcut -c ID,Name languages-4.1.csv > languages-4.1-pruned.csv
- Merge the two files into one, with two columns for the names in the two releases, using `csvjoin`:

      $ csvjoin -c ID -q '"' languages-4.2.1-pruned.csv languages-4.1-pruned.csv > combined.csv

  The merged file looks like this:

      $ head -n 3 combined.csv
      ID,Name,Name2
      kond1302,Konda-Yahadian,Konda-Yahadian
      ...

  i.e. `Name` is the column with names in release 4.2.1, and `Name2` the one for release 4.1.
- Narrow the list down to just the languoids with changed names using `csvsql`:

      $ csvsql --query "select id, name, name2 from combined where name != name2 order by id" combined.csv
      abkh1244,Abkhaz,Abkhazian
      alta1277,Altai-Kizi,Altai Proper
      amur1242,Amur-West Sakhalin Nivkh,Amur
      arti1237,Artial,Artialic
      babi1235,Witsuwit'en-Babine,Babine
      baga1275,Pukur,Baga Mboteni-Binari
      bari1283,Barian,Bari-Kakwa-Mandari
      bayr1238,Badre'i,Bayray
      ...
And if we are using bash, we can exploit the fact that all `csvkit` tools are built for pipes and
put all steps in one command:

    $ csvjoin -c ID -q '"' \
        <(git show v4.2.1:cldf/languages.csv | csvcut -c ID,Name) \
        <(git show v4.1:cldf/languages.csv | csvcut -c ID,Name) \
      | csvsql --query "select id, name, name2 from combined where name != name2 order by id" --tables combined
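To get a feel for what the join and filter steps do before running them on a full release, the following sketch applies them to two tiny made-up tables. The file names, IDs and names are invented for illustration, and the name column of the second table is called `OldName` (a hypothetical header) so that the query does not depend on how `csvjoin` renames duplicate columns. It assumes `csvkit` is installed.

```shell
# Work in a throwaway directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two made-up "releases" of a languoid table (IDs and names are invented).
cat > new.csv <<'EOF'
ID,Name
aaaa1111,Alpha
bbbb2222,Beta Proper
cccc3333,Gamma
EOF

cat > old.csv <<'EOF'
ID,OldName
aaaa1111,Alpha
bbbb2222,Beta
dddd4444,Delta
EOF

# Join the two tables on ID. csvjoin performs an inner join by default,
# so rows whose ID occurs in only one of the files are dropped.
csvjoin -c ID new.csv old.csv > combined.csv

# Keep only the rows whose name differs between the two tables.
# csvsql names the table after the input file, hence "combined".
csvsql --query "select ID, Name, OldName from combined where Name != OldName order by ID" combined.csv > changed.csv

cat changed.csv
```

Apart from the header line, the only row printed should be `bbbb2222,Beta Proper,Beta`. The rows for `cccc3333` and `dddd4444` disappear at the join step, which also means that this approach only reports renamed languoids and not ones that were added or removed between the two tables.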