
Data Transformation Using Apache Spark


Apache Spark is capable of handling several petabytes of data at a time, distributed across a cluster of thousands of cooperating physical or virtual servers.

Spark supports the following resource/cluster managers (a minimal example of selecting one follows the list):

  • Spark Standalone – a simple cluster manager included with Spark
  • Apache Mesos – a general cluster manager that can also run Hadoop applications
  • Apache Hadoop YARN – the resource manager in Hadoop 2
  • Kubernetes – an open source system for automating deployment, scaling, and management of containerized applications
  • Local mode – the driver and executors run as threads on your computer instead of on a cluster
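
Whichever manager is used, the choice is made through the master URL passed to the session builder. Below is a minimal PySpark sketch; all host names and ports are illustrative and depend on your deployment:

```python
from pyspark.sql import SparkSession

# The cluster manager is selected through the master URL passed to the builder.
# All host names and ports below are illustrative.
spark = (
    SparkSession.builder
    .appName("cluster-manager-demo")
    .master("local[*]")  # local mode: driver and executors as threads in one JVM
    # Standalone:  .master("spark://master-host:7077")
    # YARN:        .master("yarn")
    # Mesos:       .master("mesos://mesos-master:5050")
    # Kubernetes:  .master("k8s://https://api-server-host:6443")
    .getOrCreate()
)

print(spark.sparkContext.master)  # confirms which manager the session is bound to
spark.stop()
```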

Apart from its rich API, the Spark UI also provides insights into each task's run time and the execution plan. Below is the query execution for the stocks data insertion with a batch size of 10,000, before code optimization.

(Screenshot: example query execution in the Spark UI.)
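
As a rough sketch of the kind of insertion profiled above, assuming a JDBC sink: the connection URL, table name, source path, and credentials below are hypothetical, but the batchsize option is the standard knob for rows sent per JDBC round trip (the default is 1,000):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stocks-insert").getOrCreate()

# Read the stocks data; the path and schema are placeholders.
stocks_df = spark.read.csv("/data/stocks.csv", header=True, inferSchema=True)

# Write over JDBC, sending 10,000 rows per round trip instead of the
# default 1,000. URL, table, and credentials below are hypothetical.
(
    stocks_df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/stocks")
    .option("dbtable", "stock_prices")
    .option("user", "spark_user")
    .option("password", "secret")
    .option("batchsize", 10000)
    .mode("append")
    .save()
)
```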

The current project runs the Spark image in standalone mode with 4 nodes, SPARK_WORKER_CORES=2, and SPARK_EXECUTOR_MEMORY=2G. I used the bitnami/spark Docker image because it is a minimal Debian-based image and easy to configure.
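
A session connecting to such a standalone cluster might be configured as follows; this is a sketch, and the master host name ("spark-master") is an assumption that should match your Docker service name:

```python
from pyspark.sql import SparkSession

# Connecting to the standalone cluster described above. The master host name
# ("spark-master") is an assumption; it should match your Docker service name.
spark = (
    SparkSession.builder
    .appName("standalone-demo")
    .master("spark://spark-master:7077")
    .config("spark.executor.memory", "2g")  # mirrors SPARK_EXECUTOR_MEMORY=2G
    .config("spark.executor.cores", "2")    # one executor per 2-core worker
    .getOrCreate()
)
```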
