In addition to the usage information provided in the User Guide, this section provides more strategies for SQL Index and Data Source Cache.
The required dependencies, such as Memkind, Vmemcache and Plasma, can be installed automatically by following the OAP Installation Guide; the corresponding feature jars can be found under $HOME/miniconda2/envs/oapenv/oap_jars.
- Additional Cache Strategies: in addition to the external cache strategy, SQL Data Source Cache also supports 3 other cache strategies: guava, noevict and vmemcache.
- Index and Data Cache Separation: to optimize cache media utilization, SQL Data Source Cache supports separating the data and index caches, using the same or different cache media (DRAM and PMem).
- Cache Hot Tables: Data Source Cache also supports caching specific tables according to the actual situation; these tables are usually hot tables.
- Column Vector Cache: this document uses binary cache as the example for the Parquet file format. If your cluster memory resources are abundant enough, you can choose ColumnVector data cache instead of binary cache for Parquet to save computation time.
- Large Scale and Heterogeneous Cluster Support: introduce an external database to store cache locality info in order to support large-scale and heterogeneous clusters.
The following table compares the features of the 4 cache strategies on PMem.
guava | noevict | vmemcache | external cache |
---|---|---|---|
Uses the memkind lib to operate on PMem and the guava cache strategy when data eviction happens. | Uses the memkind lib to operate on PMem and does not allow data eviction. | Uses the vmemcache lib to operate on PMem and an LRU cache strategy when data eviction happens. | Uses Plasma/dlmalloc to operate on PMem and an LRU cache strategy when data eviction happens. |
Needs the numa patch in Spark for better performance. | Needs the numa patch in Spark for better performance. | Needs the numa patch in Spark for better performance. | Does not need the numa patch. |
Suggest using 2 executors per node to keep aligned with the number of PMem paths and numa nodes. | Suggest using 2 executors per node to keep aligned with the number of PMem paths and numa nodes. | Suggest using 2 executors per node to keep aligned with the number of PMem paths and numa nodes. | Node-level cache, so there is no limitation on the number of executors. |
Cache data is cleaned once executors exit. | Cache data is cleaned once executors exit. | Cache data is cleaned once executors exit. | No data loss when executors exit, so it is friendly to dynamic allocation. However, it currently has higher performance overhead than the other cache solutions. |
- For the guava/noevict cache solutions, make sure the Memkind library is installed on every cluster worker node. If you have finished the OAP Installation Guide, libmemkind will already be installed. Otherwise, manually build and install it following memkind-installation, then place libmemkind.so.0 under /lib64/ on each worker node.
- For the vmemcache/external cache solutions, make sure the Vmemcache library is installed on every cluster worker node. If you have finished the OAP Installation Guide, libvmemcache will already be installed. Otherwise, follow the vmemcache-installation steps and make sure libvmemcache.so.0 exists under the /lib64/ directory on each worker node. A quick check sketch is shown after this list.
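The following is a minimal check sketch, assuming the default /lib64/ install location, to confirm that the shared libraries are present on a worker node:

```
# confirm the cache libraries are present on this worker node
ls -l /lib64/libmemkind.so.0 /lib64/libvmemcache.so.0
# alternatively, query the dynamic linker cache
ldconfig -p | grep -E 'libmemkind|libvmemcache'
```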
If you have followed the OAP Installation Guide, Memkind, Vmemcache and Plasma will be installed automatically. Alternatively, refer to the OAP Developer Guide, which provides a shell script to install these dependencies automatically.
The following are required to configure OAP to use PMem cache.
- PMem hardware is successfully deployed on each node in the cluster.
- Directories exposing the PMem hardware exist on each socket. For example, on a two-socket system the mounted PMem directories should appear as /mnt/pmem0 and /mnt/pmem1. Correctly installed PMem must be formatted and mounted on every cluster worker node. You can use the following commands to destroy the interleaved PMem device which you set in the User Guide:
# destroy interleaved PMem device which you set when using external cache strategy
umount /mnt/pmem
dmsetup remove striped-pmem
echo y | mkfs.ext4 /dev/pmem0
echo y | mkfs.ext4 /dev/pmem1
mkdir -p /mnt/pmem0
mkdir -p /mnt/pmem1
mount -o dax /dev/pmem0 /mnt/pmem0
mount -o dax /dev/pmem1 /mnt/pmem1
In this case file systems are generated for 2 NUMA nodes, which can be checked with "numactl --hardware"; a verification sketch is shown after the persistent-memory.xml template below. For a different number of NUMA nodes, a corresponding number of namespaces should be created to ensure that the file system paths map correctly to NUMA nodes.
For more information, refer to Quick Start Guide: Provision Intel® Optane™ DC Persistent Memory.
- Install numactl to bind the executor to the PMem device on the same NUMA node:
yum install numactl -y
- We strongly recommend using NUMA-patched Spark to achieve a better performance gain for the following 3 cache strategies. With community Spark, two executors may occasionally be bound to the same PMem path.
Build Spark from source to enable NUMA-binding support, referring to Enabling-NUMA-binding-for-PMem-in-Spark.
Create persistent-memory.xml under $SPARK_HOME/conf if it doesn't exist. Use the following template and change the initialPath to your mounted paths for the PMem devices.
<persistentMemoryPool>
<!--The numa id-->
<numanode id="0">
<!--The initial path for Intel Optane DC persistent memory-->
<initialPath>/mnt/pmem0</initialPath>
</numanode>
<numanode id="1">
<initialPath>/mnt/pmem1</initialPath>
</numanode>
</persistentMemoryPool>
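Before moving on, it can help to confirm that the NUMA layout and the PMem mount points in persistent-memory.xml line up. A minimal sketch, assuming the two-socket layout and the /mnt/pmem0 and /mnt/pmem1 paths used above:

```
# list NUMA nodes; the node ids should match the numanode ids in persistent-memory.xml
numactl --hardware
# confirm both PMem file systems are mounted with the dax option
mount | grep pmem
df -h /mnt/pmem0 /mnt/pmem1
```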
Guava cache is based on the memkind library, which is built on top of jemalloc and provides control over memory characteristics. To use it in your workload, follow the prerequisites above to set up the PMem hardware correctly and make sure the memkind library is installed, then apply the configuration below.
NOTE: spark.executor.sql.oap.cache.persistent.memory.reserved.size: when PMem is used as memory through the memkind library, a portion of the space needs to be reserved for memory management overhead, such as memory segmentation. We suggest reserving 20% - 25% of the available PMem capacity to avoid memory allocation failures. Even if an allocation failure occurs, OAP will continue the operation and read data from the original input instead of caching the data block.
# enable numa
spark.yarn.numa.enabled true
spark.executorEnv.MEMKIND_ARENA_NUM_PER_KIND 1
# for Parquet file format, enable binary cache
spark.sql.oap.parquet.binary.cache.enabled true
# for ORC file format, enable binary cache
spark.sql.oap.orc.binary.cache.enabled true
spark.sql.oap.cache.memory.manager pm
spark.oap.cache.strategy guava
# PMem capacity per executor, according to your cluster
spark.executor.sql.oap.cache.persistent.memory.initial.size 256g
# Reserved space per executor, according to your cluster
spark.executor.sql.oap.cache.persistent.memory.reserved.size 50g
# enable SQL Index and Data Source Cache jar in Spark
spark.sql.extensions org.apache.spark.sql.OapExtensions
# absolute path of the jar on your working node, when in Yarn client mode
spark.files $HOME/miniconda2/envs/oapenv/oap_jars/plasma-sql-ds-cache-<version>-with-spark-<version>.jar,$HOME/miniconda2/envs/oapenv/oap_jars/pmem-common-<version>-with-spark-<version>.jar
# relative path to spark.files, just specify jar name in current dir, when in Yarn client mode
spark.executor.extraClassPath ./plasma-sql-ds-cache-<version>-with-spark-<version>.jar:./pmem-common-<version>-with-spark-<version>.jar
# absolute path of the jar on your working node, when in Yarn client mode
spark.driver.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/plasma-sql-ds-cache-<version>-with-spark-<version>.jar:$HOME/miniconda2/envs/oapenv/oap_jars/pmem-common-<version>-with-spark-<version>.jar
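After adding the settings above to $SPARK_HOME/conf/spark-defaults.conf, restart the Spark Thrift Server so they take effect. A minimal sketch, assuming a YARN client-mode Thrift Server deployment:

```
# restart the Thrift Server so the guava cache settings are picked up
$SPARK_HOME/sbin/stop-thriftserver.sh
$SPARK_HOME/sbin/start-thriftserver.sh --master yarn
```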
The memkind library also supports DAX KMEM mode. Refer to Kernel, which guides you through configuring PMem as system RAM, or to Memkind support for KMEM DAX option for more details.
Please note that DAX KMEM mode requires kernel version 5.x and memkind version 1.10 or above. If you choose KMEM mode, change the memory manager from pm to kmem as below.
spark.sql.oap.cache.memory.manager kmem
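The following is a minimal sketch for checking the KMEM prerequisites and onlining a PMem DAX device as system RAM. It assumes daxctl (from the ndctl project) is installed and uses dax0.0 as a placeholder device name; refer to the Kernel and memkind documents above for the authoritative steps.

```
# DAX KMEM mode needs kernel 5.x and memkind >= 1.10
uname -r
# reconfigure a DAX device (device name is an example) so it appears as a
# memory-only NUMA node usable by the kmem memory manager
daxctl reconfigure-device dax0.0 --mode=system-ram
# the new memory-only node should now be listed here
numactl --hardware
```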
The noevict cache strategy is also supported in OAP, based on the memkind library for PMem.
To use it in your workload, follow the prerequisites above to set up the PMem hardware correctly and make sure the memkind library is installed, then apply the configuration below.
# enable numa
spark.yarn.numa.enabled true
spark.executorEnv.MEMKIND_ARENA_NUM_PER_KIND 1
# for Parquet file format, enable binary cache
spark.sql.oap.parquet.binary.cache.enabled true
# for ORC file format, enable binary cache
spark.sql.oap.orc.binary.cache.enabled true
spark.oap.cache.strategy noevict
spark.executor.sql.oap.cache.persistent.memory.initial.size 256g
# Enable OAP extension in Spark
spark.sql.extensions org.apache.spark.sql.OapExtensions
# absolute path of the jar on your working node, when in Yarn client mode
spark.files $HOME/miniconda2/envs/oapenv/oap_jars/plasma-sql-ds-cache-<version>-with-spark-<version>.jar,$HOME/miniconda2/envs/oapenv/oap_jars/pmem-common-<version>-with-spark-<version>.jar
# relative path to spark.files, just specify jar name in current dir, when in Yarn client mode
spark.executor.extraClassPath ./plasma-sql-ds-cache-<version>-with-spark-<version>.jar:./pmem-common-<version>-with-spark-<version>.jar
# absolute path of the jar on your working node, when in Yarn client mode
spark.driver.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/plasma-sql-ds-cache-<version>-with-spark-<version>.jar:$HOME/miniconda2/envs/oapenv/oap_jars/pmem-common-<version>-with-spark-<version>.jar
- Make sure the Vmemcache library has been installed on every cluster worker node if the vmemcache strategy is chosen for the PMem cache. If you have finished the OAP-Installation-Guide, the vmemcache library will be installed automatically by Conda. Otherwise, follow the build/install steps and make sure libvmemcache.so exists in the /lib64 directory on each worker node.
- To use it in your workload, follow the prerequisites above to set up the PMem hardware correctly. Make the following configuration changes in $SPARK_HOME/conf/spark-defaults.conf.
# 2x number of your worker nodes
spark.executor.instances 6
# enable numa
spark.yarn.numa.enabled true
# Enable OAP extension in Spark
spark.sql.extensions org.apache.spark.sql.OapExtensions
# absolute path of the jar on your working node, when in Yarn client mode
spark.files $HOME/miniconda2/envs/oapenv/oap_jars/plasma-sql-ds-cache-<version>-with-spark-<version>.jar,$HOME/miniconda2/envs/oapenv/oap_jars/pmem-common-<version>-with-spark-<version>.jar
# relative path to spark.files, just specify jar name in current dir, when in Yarn client mode
spark.executor.extraClassPath ./plasma-sql-ds-cache-<version>-with-spark-<version>.jar:./pmem-common-<version>-with-spark-<version>.jar
# absolute path of the jar on your working node, when in Yarn client mode
spark.driver.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/plasma-sql-ds-cache-<version>-with-spark-<version>.jar:$HOME/miniconda2/envs/oapenv/oap_jars/pmem-common-<version>-with-spark-<version>.jar
# for parquet file format, enable binary cache
spark.sql.oap.parquet.binary.cache.enabled true
# for ORC file format, enable binary cache
spark.sql.oap.orc.binary.cache.enabled true
# enable vmemcache strategy
spark.oap.cache.strategy vmem
spark.executor.sql.oap.cache.persistent.memory.initial.size 256g
# according to your cluster
spark.executor.sql.oap.cache.guardian.memory.size 10g
The vmem cache strategy is based on libvmemcache (a buffer-based LRU cache), which provides a key-value store API. Follow these steps to enable vmemcache support in Data Source Cache.
- spark.executor.instances: we suggest setting this to 2X the number of worker nodes when NUMA binding is enabled. Each worker node runs two executors; each executor is bound to one of the two sockets and accesses the corresponding PMem device on that socket.
- spark.executor.sql.oap.cache.persistent.memory.initial.size: configure this to the available PMem capacity to be used as data cache per executor.
NOTE: If "PendingFiber Size" (on spark web-UI OAP page) is large, or some tasks fail with "cache guardian use too much memory" error, set spark.executor.sql.oap.cache.guardian.memory.size
to a larger number as the default size is 10GB. The user could also increase spark.sql.oap.cache.guardian.free.thread.nums
or decrease spark.sql.oap.cache.dispose.timeout.ms
to free memory more quickly.
- After finishing the configuration, restart the Spark Thrift Server for the configuration changes to take effect. Start at step 2 of the Use DRAM Cache guide to verify that the cache is working correctly.
- Verify the NUMA binding status by confirming that keywords like numactl --cpubind=1 --membind=1 appear in the executor launch command.
- Check the PMem cache size by checking disk space with df -h. For the vmemcache strategy, disk usage will reach the initial cache size once the PMem cache is initialized and will not change during workload execution. For the guava/noevict strategies, the command will show disk space usage increasing along with workload execution. A verification sketch is shown after this list.
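A small sketch of the checks above; the executor backend class name and grep pattern are assumptions about a typical YARN deployment, so adjust them to match your environment:

```
# confirm executors were launched under numactl with per-socket binding
ps -ef | grep CoarseGrainedExecutorBackend | grep -o 'numactl --cpubind=[0-9] --membind=[0-9]'
# watch PMem cache usage on the mounted paths
df -h /mnt/pmem0 /mnt/pmem1
```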
SQL Index and Data Source Cache now supports different cache strategies for DRAM and PMem. To optimize cache media utilization, you can enable cache separation of data and index with the same or different cache media. When sharing the same media, the data cache and index cache use different fiber cache ratios.
Here we list 4 different kinds of configuration for index/data cache separation. If you choose one of them, please add the corresponding configuration to spark-defaults.conf.
- DRAM as cache media, guava strategy as index & data cache backend.
spark.sql.oap.index.data.cache.separation.enabled true
spark.oap.cache.strategy mix
spark.sql.oap.cache.memory.manager offheap
For the rest of the configuration, refer to Use DRAM Cache.
- PMem as cache media, external strategy as index & data cache backend.
spark.sql.oap.index.data.cache.separation.enabled true
spark.oap.cache.strategy mix
spark.sql.oap.cache.memory.manager tmp
spark.sql.oap.mix.data.cache.backend external
spark.sql.oap.mix.index.cache.backend external
For the rest of the configuration, refer to the configurations for PMem Cache and External cache.
- DRAM (offheap)/guava as index cache media and backend, PMem (tmp)/external as data cache media and backend.
spark.sql.oap.index.data.cache.separation.enabled true
spark.oap.cache.strategy mix
spark.sql.oap.cache.memory.manager mix
spark.sql.oap.mix.data.cache.backend external
# 2x number of your worker nodes
spark.executor.instances 6
# enable numa
spark.yarn.numa.enabled true
spark.memory.offHeap.enabled false
spark.sql.oap.dcpmm.free.wait.threshold 50000000000
# according to your executor core number
spark.executor.sql.oap.cache.external.client.pool.size 10
# equal to the size of executor.memoryOverhead
spark.executor.sql.oap.cache.offheap.memory.size 50g
# according to the resource of cluster
spark.executor.memoryOverhead 50g
# for ORC file format
spark.sql.oap.orc.binary.cache.enabled true
# for Parquet file format
spark.sql.oap.parquet.binary.cache.enabled true
- DRAM (offheap)/guava as index cache media and backend, PMem (pm)/guava as data cache media and backend.
spark.sql.oap.index.data.cache.separation.enabled true
spark.oap.cache.strategy mix
spark.sql.oap.cache.memory.manager mix
# 2x number of your worker nodes
spark.executor.instances 6
# enable numa
spark.yarn.numa.enabled true
spark.executorEnv.MEMKIND_ARENA_NUM_PER_KIND 1
spark.memory.offHeap.enabled false
# PMem capacity per executor
spark.executor.sql.oap.cache.persistent.memory.initial.size 256g
# Reserved space per executor
spark.executor.sql.oap.cache.persistent.memory.reserved.size 50g
# equal to the size of executor.memoryOverhead
spark.executor.sql.oap.cache.offheap.memory.size 50g
# according to the resource of cluster
spark.executor.memoryOverhead 50g
# for ORC file format
spark.sql.oap.orc.binary.cache.enabled true
# for Parquet file format
spark.sql.oap.parquet.binary.cache.enabled true
Data Source Cache also supports caching specific tables by configuring the items below according to the actual situation; these tables are usually hot tables.
To enable caching of specific hot tables, add the configuration below to spark-defaults.conf.
# enable table lists fiberCache
spark.sql.oap.cache.table.list.enabled true
# Table lists using fiberCache actively
spark.sql.oap.cache.table.list <databasename>.<tablename1>;<databasename>.<tablename2>
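For example, a hypothetical setup that caches two hot tables from a database named tpcds could be appended to spark-defaults.conf as follows (the database and table names are placeholders):

```
# hypothetical hot tables; replace with your own <databasename>.<tablename> entries
cat >> $SPARK_HOME/conf/spark-defaults.conf <<'EOF'
spark.sql.oap.cache.table.list.enabled true
spark.sql.oap.cache.table.list tpcds.store_sales;tpcds.item
EOF
```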
This document uses binary cache for Parquet as the example because binary cache improves cache space utilization compared to ColumnVector cache. When your cluster memory resources are abundant enough, you can choose ColumnVector cache instead to save computation time.
To enable ColumnVector data cache for the Parquet file format, add the configuration below to spark-defaults.conf.
# for parquet file format, disable binary cache
spark.sql.oap.parquet.binary.cache.enabled false
# for parquet file format, enable ColumnVector cache
spark.sql.oap.parquet.data.cache.enabled true
NOTE: This feature only works with the external cache strategy.
OAP influences Spark to schedule tasks according to cache locality info. This info can become very large in a large-scale cluster, and scheduling tasks in a heterogeneous cluster (some nodes with PMem, some without) can also be challenging.
We introduce an external DB to store cache locality info. If no cache is available, Spark will fall back to scheduling according to HDFS locality. Currently we support Redis as the external DB service. Please download and launch a redis-server before running Spark with OAP; a minimal sketch is shown below.
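A minimal sketch for getting a redis-server running on the node whose address you will configure below; the download URL and bind address are assumptions, so adjust them to your environment:

```
# build Redis from the stable tarball and start it in the background
wget https://download.redis.io/redis-stable.tar.gz
tar xzf redis-stable.tar.gz
cd redis-stable && make
# bind to the address that spark.sql.oap.external.cache.metaDB.address will point at
./src/redis-server --daemonize yes --bind 10.1.2.12
```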
Please add the following configurations to spark-defaults.conf.
spark.sql.oap.external.cache.metaDB.enabled true
# Redis-server address
spark.sql.oap.external.cache.metaDB.address 10.1.2.12
spark.sql.oap.external.cache.metaDB.impl org.apache.spark.sql.execution.datasources.RedisClient
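To confirm that Spark executors will be able to reach the metaDB, you can ping the Redis server from a worker node; a quick sanity check, assuming redis-cli was built alongside redis-server:

```
# should print PONG if the Redis metaDB is reachable from this node
redis-cli -h 10.1.2.12 ping
```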