Examining changes across Glottolog releases

While it is possible to examine changes using just the git version control tool and the "raw" data repository, the result is often overwhelming (see, for example, the changes between releases 4.2.1 and 4.1). So unless one is interested in particular commits and their commit messages, which may give some context for particular changes, it is typically better to examine changes in the aggregated CLDF dataset (e.g. the changes between 4.2.1 and 4.1).

In some cases, this may still be "too much". In the following we describe how to "condense" the changes between two Glottolog releases down to just a list of changed languoid names. We'll use the shell, aka command line (see the software carpentry lesson for an excellent introduction), the tools of the csvkit package, and optionally git.

Retrieve the table of languoids for the two releases we want to compare:
- either by downloading and unpacking the two releases,
- or from GitHub (cldf/languages.csv at 4.2.1 and cldf/languages.csv at 4.1),
- or via git and a local clone of https://github.com/glottolog/glottolog-cldf:
```
$ git show v4.2.1:cldf/languages.csv > languages-4.2.1.csv
$ git show v4.1:cldf/languages.csv > languages-4.1.csv
```

Prune the two languoid tables to only the ID and Name columns using csvcut:

```
$ csvcut -c ID,Name languages-4.2.1.csv > languages-4.2.1-pruned.csv
$ csvcut -c ID,Name languages-4.1.csv > languages-4.1-pruned.csv
```

Merge the two files into one, with two columns for the names in the two releases using csvjoin:
```
$ csvjoin -c ID -q '"' languages-4.2.1-pruned.csv languages-4.1-pruned.csv > combined.csv
```
The merged file looks like this:
```
$ head -n 3 combined.csv 
ID,Name,Name2
kond1302,Konda-Yahadian,Konda-Yahadian
...
```
That is, Name holds the names from release 4.2.1 and Name2 the names from release 4.1.

Narrow the list down to just the languoids with changed names using csvsql:

```
$ csvsql --query "select id, name, name2 from combined where name != name2 order by id" combined.csv
abkh1244,Abkhaz,Abkhazian
alta1277,Altai-Kizi,Altai Proper
amur1242,Amur-West Sakhalin Nivkh,Amur
arti1237,Artial,Artialic
babi1235,Witsuwit'en-Babine,Babine
baga1275,Pukur,Baga Mboteni-Binari
bari1283,Barian,Bari-Kakwa-Mandari
bayr1238,Badre'i,Bayray
...
```
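
Note that csvjoin performs an inner join by default, so the list above only covers languoids whose ID occurs in both releases; languoids that were added or removed do not show up. A minimal sketch of how such IDs could be listed with standard tools follows. It assumes that ID is the first column of the pruned tables and that IDs contain no commas or quotes. The toy files and their rows are invented stand-ins so that the snippet runs on its own; in practice one would use languages-4.2.1-pruned.csv and languages-4.1-pruned.csv instead.

```shell
# Toy stand-ins for the pruned tables (invented rows, not real Glottolog data).
printf 'ID,Name\naaaa0001,Alpha\nbbbb0002,Beta\ncccc0003,Gamma\n' > new-pruned.csv
printf 'ID,Name\naaaa0001,Alpha\nbbbb0002,Beta old\ndddd0004,Delta\n' > old-pruned.csv

# Extract the ID column (assumed to be the first one), skip the header, and sort.
tail -n +2 new-pruned.csv | cut -d, -f1 | sort > new-ids.txt
tail -n +2 old-pruned.csv | cut -d, -f1 | sort > old-ids.txt

# comm compares two sorted files line by line:
# -23 keeps lines found only in the first file, -13 lines found only in the second.
comm -23 new-ids.txt old-ids.txt > added-ids.txt     # IDs only in the newer table
comm -13 new-ids.txt old-ids.txt > removed-ids.txt   # IDs only in the older table
```

With the toy data, added-ids.txt contains the one ID that is only in the newer table and removed-ids.txt the one that is only in the older table.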

If we are using bash, we can exploit the fact that all csvkit tools are built for pipes and combine the csvkit steps into a single command using process substitution:

```
$ csvjoin -c ID -q '"' <(git show v4.2.1:cldf/languages.csv | csvcut -c ID,Name) <(git show v4.1:cldf/languages.csv | csvcut -c ID,Name) | csvsql --query "select id, name, name2 from combined where name != name2 order by id" --tables combined
```
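
For shells without process substitution, or to compare other pairs of releases, the same pipeline could be wrapped in a small function that writes the pruned tables to temporary files. This is only a sketch: the function name `name_changes` and its two arguments are hypothetical, and it assumes that it is run inside a local clone of glottolog-cldf with git, csvkit and `mktemp` available.

```shell
# name_changes NEW_TAG OLD_TAG
# Hypothetical helper: prints ID, new name and old name for every languoid
# whose name differs between the two given release tags.
name_changes() {
    new_tag=$1
    old_tag=$2
    tmpdir=$(mktemp -d) || return 1

    # Prune both languoid tables to ID and Name, then join and filter them,
    # using the same csvkit calls as in the individual steps.
    git show "$new_tag:cldf/languages.csv" | csvcut -c ID,Name > "$tmpdir/new.csv" &&
    git show "$old_tag:cldf/languages.csv" | csvcut -c ID,Name > "$tmpdir/old.csv" &&
    csvjoin -c ID -q '"' "$tmpdir/new.csv" "$tmpdir/old.csv" |
        csvsql --query "select id, name, name2 from combined where name != name2 order by id" --tables combined
    status=$?

    # Clean up the temporary files and pass on the exit status of the pipeline.
    rm -rf "$tmpdir"
    return "$status"
}
```

Calling `name_changes v4.2.1 v4.1` would then be expected to produce the same list as the one-line command.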
