LauNuts data processing

Installation

The data extraction from Excel files uses external tools (see Converter.java).

On Debian systems, install the APT packages libreoffice (for XLS to XLSX conversion) and gnumeric (for XLSX to CSV conversion).

The csvkit tool (to extract sheet names) can be installed using pip install csvkit.
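
As a quick check after installation, the following minimal sketch reports which of the three tools cannot be found on the PATH. The executable names are assumptions: LibreOffice is commonly exposed as soffice, Gnumeric provides ssconvert, and csvkit provides in2csv. The names actually called by Converter.java may differ.

```python
import shutil

# Assumed executable names (not taken from Converter.java):
# LibreOffice -> soffice, Gnumeric -> ssconvert, csvkit -> in2csv.
REQUIRED_TOOLS = ("soffice", "ssconvert", "in2csv")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All conversion tools found.")
```

The lookup function is a parameter only so that the check can be exercised without the tools being installed.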

Resource files

Source files are listed in sources.json. A metadata entry consists of:
- id: A unique ID.
- type: e.g. nuts or lau.
- filetype: e.g. xls or xlsx.
- sources: At least one download URL.
- nuts-scheme: The year of the NUTS scheme to be used.

To add files to the processing queue, add a corresponding entry to sources.json.
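
To illustrate the fields listed above, here is a sketch of what a single entry might look like, together with a small check for missing keys. The entry is hypothetical: the ID, URL and year are placeholders, and the value types (for example, whether nuts-scheme is a number or a string) are assumptions rather than something taken from sources.json.

```python
import json

# Keys named in the field list of this section.
REQUIRED_KEYS = ("id", "type", "filetype", "sources", "nuts-scheme")

# Hypothetical entry: ID, URL and year are placeholders, not real data.
EXAMPLE_ENTRY = json.loads("""
{
  "id": "example-lau-2020",
  "type": "lau",
  "filetype": "xlsx",
  "sources": ["https://example.org/lau-2020.xlsx"],
  "nuts-scheme": 2016
}
""")


def validate_entry(entry):
    """Return a list of problems found in one metadata entry (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in entry]
    sources = entry.get("sources")
    # "sources" must hold at least one download URL.
    if "sources" in entry and (not isinstance(sources, list) or not sources):
        problems.append("sources must list at least one download URL")
    return problems
```

Calling validate_entry(EXAMPLE_ENTRY) returns an empty list, while an entry without a download URL yields a message naming the problem.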

Download

Data is downloaded from the URLs listed in sources.json.
The files used are mirrored on the Hobbit FTP server.

Data conversion

Excel files are converted into CSV files. Currently, the following approach is used:
- Use LibreOffice to convert XLS into XLSX.
- Use ssconvert (Gnumeric) to convert XLSX into CSV.
- Use in2csv (csvkit) to get XLSX sheet names.

The converted files are mirrored on the Hobbit FTP server.
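
The three steps above could be driven from a script roughly as follows. This is a sketch under assumptions, not the logic of Converter.java: the function names and the output file pattern are made up for illustration, and the project may call the tools with different options. The functions only build the command lines, so nothing is executed until a command is passed to a process runner.

```python
from pathlib import Path


def xls_to_xlsx_command(xls_file, out_dir):
    """LibreOffice in headless mode: convert one XLS file to XLSX in out_dir."""
    return ["soffice", "--headless", "--convert-to", "xlsx",
            "--outdir", str(out_dir), str(xls_file)]


def sheet_names_command(xlsx_file):
    """csvkit: display the sheet names of an XLSX file."""
    return ["in2csv", "--names", str(xlsx_file)]


def xlsx_to_csv_command(xlsx_file, out_dir):
    """Gnumeric: export every sheet to its own CSV file.

    ssconvert substitutes the sheet number for %n in the output name.
    The "<stem>-%n.csv" naming pattern is an illustrative choice.
    """
    pattern = Path(out_dir) / (Path(xlsx_file).stem + "-%n.csv")
    return ["ssconvert", "--export-file-per-sheet", str(xlsx_file), str(pattern)]
```

Each returned list can be handed to subprocess.run(command, check=True), which raises an error if the tool exits with a non-zero status.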

Previous data conversion approaches

Apache POI

With Apache POI, the JVM ran out of memory. The previous code for the Java Excel parser and a related test case is still available.

LibreOffice

LibreOffice 7.3 could not open an XLSX file (LAU 2020) because the file exceeded the maximum number of columns.

xlsx2csv

xlsx2csv was not able to process at least one large file.

Google Spreadsheets

Google Spreadsheets stated that a file was too large to open.
