Volt is a library that aims to simplify the life of Data Engineers within the Databricks environment.
Nota Bene: The development is mainly driven by Databricks field engineering. However, the library is not an official product of Databricks.
It provides APIs in SQL, Scala and Python for:
- Retrieving table metadata in bulk, such as a table's footprint on the file system and more
- Cloning catalogs/schemas
Currently we only support Personal Compute, as we use some low-level APIs (like DeltaLog, to get a table snapshot without running a DESCRIBE DETAIL) to retrieve the required information.
Another important limitation is that we cannot reuse stored credentials, so you need to define secrets with read-only access to the cloud storage (see Read files from S3/Azure for more details).
First you need this file:
`volt-<major.minor.patch>.jar`
Then, if required, you can also get these files:
`volt-aws-deps-<major.minor.patch>.jar`
`volt-azure-deps-<major.minor.patch>.jar`
In order to install all the required dependencies you can choose between two init scripts:
- `install_from_volume.sh`, which installs the dependencies from a volume
- `install_from_github.sh`, which installs the dependencies from GitHub
For both options you also need to define the following Spark configuration:
```
spark.sql.extensions com.databricks.volt.sql.SQLExtensions
spark.jars /databricks/jars/volt-${env:VOLT_VERSION}.jar,/databricks/jars/volt-azure-deps-${env:VOLT_VERSION}.jar,/databricks/jars/volt-aws-deps-${env:VOLT_VERSION}.jar
```
If you choose `install_from_volume.sh`, you will need to define two environment variables for it to work:
- `VOLT_VERSION`: the version that you want to use
- `VOLUME_PATH`: the volume path where the jars are downloaded
If you choose `install_from_github.sh`, you will need to define one environment variable for it to work:
- `VOLT_VERSION`: the version that you want to download
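For example, with the volume-based script, the cluster environment variables might look like the following (both the version and the volume path are placeholders, not real values):
```
VOLT_VERSION=1.0.0
VOLUME_PATH=/Volumes/main/default/volt_jars
```
With the GitHub-based script, only `VOLT_VERSION` is needed.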
If you want support for reading files from S3, you also need to install:
`volt-aws-deps-<major.minor.patch>.jar`
Moreover, you need to define a secret scope `aws-s3-credentials` and add the following keys with their respective values (a sketch for populating the scopes follows the ADLS section below):
- `client_id`: the AWS access key ID
- `client_secret`: the AWS secret access key
- `session_token` (optional): the AWS session token, if your access needs it
If you want support for reading files from ADLS, you also need to install:
`volt-azure-deps-<major.minor.patch>.jar`
Moreover, you need to define a secret scope `adls-sp-credentials` and add the following keys with their respective values:
- `tenant_id`: the Azure service principal tenant ID
- `client_id`: the Azure service principal client ID
- `client_secret`: the Azure service principal client secret
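As a minimal sketch of how these scopes could be populated, assuming the Databricks Python SDK (`databricks-sdk`) is installed and authenticated against your workspace; the scope and key names come from the sections above, while all values are placeholders:
```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up workspace auth from the environment

# Placeholder values: replace them with your real, read-only credentials.
scopes = {
    "aws-s3-credentials": {
        "client_id": "<aws-access-key-id>",
        "client_secret": "<aws-secret-access-key>",
        # "session_token": "<aws-session-token>",  # optional
    },
    "adls-sp-credentials": {
        "tenant_id": "<azure-sp-tenant-id>",
        "client_id": "<azure-sp-client-id>",
        "client_secret": "<azure-sp-client-secret>",
    },
}

for scope, secrets in scopes.items():
    w.secrets.create_scope(scope=scope)
    for key, value in secrets.items():
        w.secrets.put_secret(scope=scope, key=key, string_value=value)
```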
If you need a feature, please let us know by filing an issue. We're glad to help you.
Volt provides an easy way to deep/shallow clone catalogs and schemas.
No more complicated configurations like this or scripts like this.
The execution of the command will return a report table with the following fields:
- `table_catalog` (StringType)
- `table_schema` (StringType)
- `table_name` (StringType)
- `status` (StringType): either OK or ERROR
- `status_message` (StringType): in case of status ERROR you can get the error message here
In SQL it looks like the following:
```sql
-- syntax
CREATE CATALOG|SCHEMA target_catalog DEEP|SHALLOW CLONE source_catalog [MANAGED LOCATION '<your-location>']

-- N.B. this actually works
CREATE CATALOG as_catalog_clone DEEP CLONE as_catalog;
```
In Scala:
```scala
import com.databricks.volt.apis.CatalogExtensions._

spark.catalog.setCurrentCatalog("as_catalog")
display(spark.catalog.deepCloneCatalog("as_catalog_clone"))
```
In Python:
```python
from volt.apis import *

spark.catalog.setCurrentCatalog("as_catalog")
display(spark.catalog.deepCloneCatalog("as_catalog_clone"))
```
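Since the report is returned as a regular DataFrame (as the `display()` calls above suggest), you can post-process it like any other query result; a minimal sketch, continuing the Python example:
```python
# Keep only the tables that failed to clone, together with their error messages.
report = spark.catalog.deepCloneCatalog("as_catalog_clone")
errors = report.filter("status = 'ERROR'") \
               .select("table_schema", "table_name", "status_message")
display(errors)
```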
How many times have you needed to get the metadata details of (Delta) tables in bulk for a specific catalog or schema?
We have a feature for that, and it works from SQL/Scala/Python.
The result of the invocation is a table with the following columns:
- `table_catalog` (StringType)
- `table_schema` (StringType)
- `table_name` (StringType)
- `table_type` (StringType)
- `data_source_format` (StringType)
- `storage_path` (StringType)
- `created` (TimestampType)
- `created_by` (StringType)
- `last_altered_by` (StringType)
- `last_altered` (TimestampType)
- `liquid_clustering_cols` (ArrayType(StringType))
- `size` (StructType), which has the following fields:
  - `full_size_in_gb` (DoubleType)
  - `full_size_in_bytes` (LongType)
  - `last_snapshot_size_in_gb` (DoubleType)
  - `last_snapshot_size_in_bytes` (LongType)
  - `delta_log_size_in_gb` (DoubleType)
  - `delta_log_size_in_bytes` (LongType)
You can filter on all the columns, but be aware that the filter will be pushed down only for the following fields:
- `table_catalog`
- `table_schema`
- `table_name`
- `table_type`
- `data_source_format`
- `storage_path`
- `created`
- `created_by`
- `last_altered_by`
- `last_altered`
Filtering on:
- `liquid_clustering_cols`
- `size`
will cause a full UC scan, so please don't filter on them right now (or load the original query into a DataFrame and then filter it, as in the Python sketch at the end of this section).
In SQL it looks like the following:
```sql
-- syntax
SHOW TABLES EXTENDED [WHERE <filters>]

-- example
SHOW TABLES EXTENDED WHERE table_catalog = 'as_catalog'
```
In Scala:
```scala
import com.databricks.volt.apis.CatalogExtensions._
import org.apache.spark.sql.functions

display(spark.catalog.showTablesExtended("table_catalog = 'as_catalog'"))
// you can also pass a Column:
// display(spark.catalog.showTablesExtended(functions.col("table_catalog").equalTo("as_catalog")))
```
In Python:
```python
from volt.apis import *
import pyspark.sql.functions as F

# you can also pass a string filter:
# display(spark.catalog.showTablesExtended("table_catalog = 'as_catalog'"))
display(spark.catalog.showTablesExtended(F.col("table_catalog") == "as_catalog"))
```
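And since `size` is not pushed down, a practical pattern is to run a pushdown-friendly filter first and work with the `size` struct only on the returned DataFrame; a minimal sketch, reusing the Python imports above:
```python
# Pushed down to UC: the filter on table_catalog.
df = spark.catalog.showTablesExtended("table_catalog = 'as_catalog'")

# Evaluated on the returned DataFrame, not pushed down:
# the top 10 tables by full size.
top = (df
       .select("table_schema", "table_name",
               F.col("size.full_size_in_gb").alias("full_size_in_gb"))
       .orderBy(F.desc("full_size_in_gb"))
       .limit(10))
display(top)
```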