LauNuts data processing

Installation

The data extraction from Excel files uses external tools (see Converter.java).

On Debian systems, install the APT packages libreoffice (for XLS to XLSX conversion) and gnumeric (for XLSX to CSV conversion).

The csvkit tool (to extract sheet names) can be installed using pip install csvkit.
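Since the conversion depends on these external programs, it can help to check that they are on the PATH before starting. The following is a minimal sketch, not part of the project: the class `ToolCheck` and its methods are hypothetical, and the executable name `libreoffice` is an assumption (some installations only provide `soffice`).

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: checks that the external conversion tools are installed. */
final class ToolCheck {

    // ssconvert and in2csv are named in this document;
    // "libreoffice" as the executable name is an assumption.
    static final List<String> REQUIRED = List.of("libreoffice", "ssconvert", "in2csv");

    /** Returns true if an executable file named 'tool' exists in a directory of pathValue. */
    static boolean isOnPath(String tool, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return false;
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            Path candidate = Path.of(dir, tool);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return true;
            }
        }
        return false;
    }

    /** Returns the required tools that were not found in pathValue. */
    static List<String> missingTools(String pathValue) {
        List<String> missing = new ArrayList<>();
        for (String tool : REQUIRED) {
            if (!isOnPath(tool, pathValue)) {
                missing.add(tool);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingTools(System.getenv("PATH"));
        System.out.println(missing.isEmpty() ? "All tools found." : "Missing tools: " + missing);
    }
}
```

Running `main` prints the tools that still need to be installed, if any.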

Resource files

  • Source files are listed in sources.json. A metadata entry consists of:
    • id: A unique ID.
    • type: e.g. nuts or lau.
    • filetype: e.g. xls or xlsx.
    • sources: At least one download URL.
    • nuts-scheme: The year to be used.
  • To add files to the processing queue, add a corresponding entry to sources.json.
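To illustrate the metadata fields listed above, here is a minimal sketch of how one entry could be represented and validated in Java. The class `SourceEntry`, its method `requireUniqueIds`, and the choice of `String` for nuts-scheme are assumptions made for illustration; the project's actual parsing code may look different.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory view of one entry in sources.json. */
final class SourceEntry {
    final String id;            // unique ID
    final String type;          // e.g. "nuts" or "lau"
    final String filetype;      // e.g. "xls" or "xlsx"
    final List<String> sources; // at least one download URL
    final String nutsScheme;    // year of the NUTS scheme (String type is an assumption)

    SourceEntry(String id, String type, String filetype, List<String> sources, String nutsScheme) {
        if (id == null || id.isBlank()) {
            throw new IllegalArgumentException("id must not be empty");
        }
        if (sources == null || sources.isEmpty()) {
            throw new IllegalArgumentException("sources needs at least one download URL");
        }
        this.id = id;
        this.type = type;
        this.filetype = filetype;
        this.sources = List.copyOf(sources);
        this.nutsScheme = nutsScheme;
    }

    /** Throws if two entries share the same ID. */
    static void requireUniqueIds(List<SourceEntry> entries) {
        Set<String> seen = new HashSet<>();
        for (SourceEntry entry : entries) {
            if (!seen.add(entry.id)) {
                throw new IllegalArgumentException("duplicate id: " + entry.id);
            }
        }
    }
}
```

A new entry could then be checked before it is queued, for example with `new SourceEntry("example-lau", "lau", "xlsx", List.of("https://example.org/lau.xlsx"), "2016")`, where all values are placeholders.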

Download

Data conversion

  • Excel files are converted into CSV files. Currently, the following approach is used:
    • Use LibreOffice to convert XLS into XLSX.
    • Use ssconvert (Gnumeric) to convert XLSX into CSV.
    • Use in2csv (csvkit) to get XLSX sheet names.
  • The converted files are mirrored at the Hobbit FTP server.
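The three conversion steps above can be sketched as external process calls. This is an illustration only and not necessarily how Converter.java invokes the tools: the class `ConversionCommands`, the executable name `libreoffice`, and the chosen command-line options (`--headless --convert-to`, `--export-file-per-sheet`, `--names`) are assumptions about one plausible invocation.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper that builds the command lines for the three conversion steps. */
final class ConversionCommands {

    /** LibreOffice: convert an XLS file to XLSX, written into outDir. */
    static List<String> xlsToXlsx(Path xls, Path outDir) {
        return List.of("libreoffice", "--headless", "--convert-to", "xlsx",
                "--outdir", outDir.toString(), xls.toString());
    }

    /**
     * ssconvert (Gnumeric): convert an XLSX file to one CSV file per sheet.
     * The output name is a template; ssconvert replaces %n with the sheet number.
     */
    static List<String> xlsxToCsv(Path xlsx, Path csvTemplate) {
        return List.of("ssconvert", "--export-file-per-sheet",
                xlsx.toString(), csvTemplate.toString());
    }

    /** in2csv (csvkit): print the sheet names of an XLSX file. */
    static List<String> sheetNames(Path xlsx) {
        return List.of("in2csv", "--names", xlsx.toString());
    }

    /** Runs a command, forwards its output to the console, and returns the exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

For example, `ConversionCommands.run(ConversionCommands.xlsToXlsx(Path.of("input.xls"), Path.of("out")))` would start the first step. Keeping command construction separate from execution makes the command lines easy to inspect without the tools installed.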

Previous data conversion approaches

Apache POI

With Apache POI, the JVM ran out of memory. The previous code is still available: the Java Excel parser and a related test case.

LibreOffice

LibreOffice 7.3 could not open an XLSX file (LAU 2020) because the file exceeded the maximum number of columns LibreOffice supports.

xlsx2csv

xlsx2csv was not able to process at least one large file.

Google Spreadsheets

Google Spreadsheets stated that a file was too large to open.