The data extraction from Excel files uses external tools (see Converter.java).
On Debian systems, please install the APT packages libreoffice (for XLS to XLSX conversion) and gnumeric (for XLSX to CSV conversion).
The csvkit tool (to extract sheet names) can be installed using pip install csvkit.
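
Before running the conversion, it can help to verify that the external tools are actually installed. The following is a minimal sketch, not part of the project: the class name ToolCheck and its methods are hypothetical, and the executable names soffice, ssconvert and in2csv are assumed to be the ones provided by the packages above.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the project): reports missing external tools. */
class ToolCheck {

    /** Executable names assumed for LibreOffice, Gnumeric and csvkit. */
    static final List<String> REQUIRED = List.of("soffice", "ssconvert", "in2csv");

    /** Returns true if an executable file called name exists in a directory of pathValue. */
    static boolean isOnPath(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return false;
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            Path candidate = Path.of(dir, name);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return true;
            }
        }
        return false;
    }

    /** Returns the required tools that were not found in the given PATH value. */
    static List<String> missingTools(String pathValue) {
        List<String> missing = new ArrayList<>();
        for (String tool : REQUIRED) {
            if (!isOnPath(tool, pathValue)) {
                missing.add(tool);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingTools(System.getenv("PATH"));
        System.out.println(missing.isEmpty() ? "All tools found." : "Missing tools: " + missing);
    }
}
```

Running the main method prints either a confirmation or the list of tools that still need to be installed.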
- Source files are listed in sources.json. A metadata entry consists of:
- id: A unique ID.
- type: e.g. nuts or lau.
- filetype: e.g. xls or xlsx.
- sources: At least one download URL.
- nuts-scheme: The year to be used.
- To add further files to the processing queue, add a corresponding entry to sources.json.
- The download URLs listed in sources.json are used for downloading data.
- The files used are mirrored at the Hobbit FTP server.
- Excel files are converted into CSV files. Currently, the following approach is used:
- Use LibreOffice to convert XLS into XLSX.
- Use ssconvert (Gnumeric) to convert XLSX into CSV.
- Use in2csv (csvkit) to get XLSX sheet names.
- The converted files are mirrored at the Hobbit FTP server.
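
The conversion steps listed above can be sketched as command lines built in Java. This is an illustration under assumptions, not the code in Converter.java: the class ConversionPlan, its method names and the file names in main are hypothetical. The tool invocations use the commonly documented forms (soffice --headless --convert-to, ssconvert input output, in2csv --names). The filetype argument corresponds to the filetype field of a sources.json entry.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; the project's actual logic lives in Converter.java. */
class ConversionPlan {

    /** LibreOffice: convert an XLS file into XLSX, written into outDir. */
    static List<String> xlsToXlsx(Path xls, Path outDir) {
        return List.of("soffice", "--headless", "--convert-to", "xlsx",
                "--outdir", outDir.toString(), xls.toString());
    }

    /** csvkit: print the sheet names of an XLSX file. */
    static List<String> sheetNames(Path xlsx) {
        return List.of("in2csv", "--names", xlsx.toString());
    }

    /** Gnumeric: convert an XLSX file into CSV. */
    static List<String> xlsxToCsv(Path xlsx, Path csv) {
        return List.of("ssconvert", xlsx.toString(), csv.toString());
    }

    /** Builds the ordered commands for one source file, depending on its filetype. */
    static List<List<String>> plan(String filetype, Path input, Path outDir) {
        String fileName = input.getFileName().toString();
        int dot = fileName.lastIndexOf('.');
        String base = dot > 0 ? fileName.substring(0, dot) : fileName;

        List<List<String>> commands = new ArrayList<>();
        Path xlsx;
        if ("xls".equalsIgnoreCase(filetype)) {
            // XLS needs the additional LibreOffice step first.
            commands.add(xlsToXlsx(input, outDir));
            xlsx = outDir.resolve(base + ".xlsx");
        } else if ("xlsx".equalsIgnoreCase(filetype)) {
            xlsx = input;
        } else {
            throw new IllegalArgumentException("Unsupported filetype: " + filetype);
        }
        commands.add(sheetNames(xlsx));
        commands.add(xlsxToCsv(xlsx, outDir.resolve(base + ".csv")));
        return commands;
    }

    /** Runs one command and returns its exit code (requires the tools to be installed). */
    static int run(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }

    public static void main(String[] args) {
        // Hypothetical file names; this only prints the commands and executes nothing.
        for (List<String> command : plan("xls", Path.of("lau.xls"), Path.of("out"))) {
            System.out.println(String.join(" ", command));
        }
    }
}
```

Keeping the command construction separate from the execution makes the plan easy to inspect or log before any external tool is started.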
With Apache POI, the JVM ran out of memory. The previous code for the Java Excel parser and a related test case is still available.
LibreOffice 7.3 could not open an XLSX file (LAU 2020) because it exceeded the maximum number of columns.
xlsx2csv was not able to process at least one large file.
Google Spreadsheets stated that a file was too large to open.