The Import - ADLS File Reader provides an easy way to connect to Azure Data Lake Storage (ADLS) and read Parquet and Delta Lake files into SAS Compute or CAS.
It supports reading snappy-compressed Parquet and Delta Lake file formats, including partitioned tables (hierarchical nested subdirectory structures, a very common layout when storing datasets on data lakes). It also supports expression filter push-down on any of the dataset fields, which avoids reading and transferring unnecessary data between the source and the destination (when applied to partition fields this is known as partition pruning).
This custom step helps to work around some of the restrictions that currently exist for working with Parquet files in SAS Viya. Please check the following documentation that lists those restrictions for the latest SAS Viya release:
- Restrictions for Parquet File Features for the LIBNAME engine (SAS Compute Server)
- Azure Data Lake Storage Data Source (SAS Cloud Analytic Services)
- Path-Based Data Source Types and Options – which has a footnote for Parquet (SAS Cloud Analytic Services)
This custom step depends on having a Python environment configured with the following libraries installed:
- pandas
- saspy
- azure-identity
- pyarrow
- adlfs
Tested on SAS Viya Stable 2023.03 with Python 3.8.13 and the following library versions:
- pandas == 1.5.2
- saspy == 4.3.3
- azure-identity == 1.12.0
- pyarrow == 10.0.1
- adlfs == 2023.1.0
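For reproducibility, the tested versions above can be pinned in a `requirements.txt` (a suggested setup, not shipped with the step):

```
pandas==1.5.2
saspy==4.3.3
azure-identity==1.12.0
pyarrow==10.0.1
adlfs==2023.1.0
```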
- Version 2.0.1 (25APR2023)
  - Fixed issues related to reading non-partitioned Parquet and Delta Lake files
- Version 2.0 (20APR2023)
  - Added support for reading the Delta Lake file format
  - Removed the pyarrowfs-adlgen2 Python dependency
  - Added the adlfs Python library as the filesystem implementation used to access ADLS
  - Some code refactoring, focusing on an object-oriented implementation for ADLSFileReader
- Version 1.0 (02FEB2023)
  - Initial version