Examining changes across Glottolog releases

While it is possible to examine changes using just the git version control tool and the "raw" data repository, the result is often overwhelming (see for example the changes between release 4.2.1 and 4.1). So unless one is interested in particular commits and their commit messages, which may give some context for particular changes, it is typically better to examine changes in the aggregated CLDF dataset (e.g. the changes between 4.2.1 and 4.1).

In some cases, this may still be "too much". In the following we describe how to "condense" the changes between two Glottolog releases down to just a list of changed languoid names. We'll use the shell, aka command line (see the software carpentry lesson for an excellent introduction), the tools of the csvkit package and, optionally, git.

  1. Retrieve the table of languoids (cldf/languages.csv) for the two releases we want to compare, saved here as languages-4.2.1.csv and languages-4.1.csv.
  2. Prune the two languoid tables to only the ID and Name columns using csvcut:
    $ csvcut -c ID,Name languages-4.2.1.csv > languages-4.2.1-pruned.csv
    $ csvcut -c ID,Name languages-4.1.csv > languages-4.1-pruned.csv
    
  3. Merge the two files into one, with two columns for the names in the two releases using csvjoin:
    $ csvjoin -c ID -q '"' languages-4.2.1-pruned.csv languages-4.1-pruned.csv > combined.csv
    
    The merged file looks like this:
    $ head -n 3 combined.csv 
    ID,Name,Name2
    kond1302,Konda-Yahadian,Konda-Yahadian
    ...
    
    i.e. Name holds the names from release 4.2.1 and Name2 the names from release 4.1.
  4. Narrow the list down to just the languoids with changed names using csvsql:
    $ csvsql --query "select id, name, name2 from combined where name != name2 order by id" combined.csv 
    abkh1244,Abkhaz,Abkhazian
    alta1277,Altai-Kizi,Altai Proper
    amur1242,Amur-West Sakhalin Nivkh,Amur
    arti1237,Artial,Artialic
    babi1235,Witsuwit'en-Babine,Babine
    baga1275,Pukur,Baga Mboteni-Binari
    bari1283,Barian,Bari-Kakwa-Mandari
    bayr1238,Badre'i,Bayray
    ...
    

If we are using bash, we can exploit the fact that all csvkit tools are built for pipes and combine all steps into one command, reading both tables directly from git:

$ csvjoin -c ID -q '"' <(git show v4.2.1:cldf/languages.csv | csvcut -c ID,Name) <(git show v4.1:cldf/languages.csv | csvcut -c ID,Name) | csvsql --query "select id, name, name2 from combined where name != name2 order by id" --tables combined
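
Step 1 above gives no command for retrieving the two tables. The following is a minimal sketch that reuses the git show call from the one-liner to write a release's table to disk. The helper name fetch_languoids and the output file naming are hypothetical, and it assumes the current directory is inside a git clone of the Glottolog CLDF dataset whose releases are tagged like v4.1 and v4.2.1.

```shell
# Hypothetical helper (not part of Glottolog or csvkit): write the languoid
# table of one tagged release to languages-<version>.csv in the current
# directory.
# Assumption: we are inside a git clone of the Glottolog CLDF dataset and
# the release is available as a tag such as v4.2.1.
fetch_languoids() {
    tag=$1
    out="languages-${tag#v}.csv"   # v4.2.1 -> languages-4.2.1.csv
    # "git show TAG:PATH" prints the file as it was at that tag.
    if ! git show "$tag:cldf/languages.csv" > "$out"; then
        # Do not leave an empty file behind if the tag or path is missing.
        rm -f "$out"
        return 1
    fi
}
```

Calling fetch_languoids v4.2.1 and then fetch_languoids v4.1 would produce the two input files expected by step 2.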