
Data Transformation Using Apache Spark


Apache Spark is capable of handling several petabytes of data at a time, distributed across a cluster of thousands of cooperating physical or virtual servers.

Spark supports the following resource/cluster managers (a minimal example of selecting one follows the list):

  • Spark Standalone – a simple cluster manager included with Spark
  • Apache Mesos – a general cluster manager that can also run Hadoop applications
  • Apache Hadoop YARN – the resource manager in Hadoop 2
  • Kubernetes – an open source system for automating deployment, scaling, and management of containerized applications
  • Local mode – the driver and executors run as threads on your computer instead of on a cluster
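
Whichever manager is used, the choice is made through the master URL passed to the session builder. Below is a minimal PySpark sketch; all host names and ports are illustrative and depend on your deployment:

```python
from pyspark.sql import SparkSession

# The cluster manager is selected through the master URL passed to the builder.
# All host names and ports below are illustrative.
spark = (
    SparkSession.builder
    .appName("cluster-manager-demo")
    .master("local[*]")  # local mode: driver and executors as threads in one JVM
    # Standalone:  .master("spark://master-host:7077")
    # YARN:        .master("yarn")
    # Mesos:       .master("mesos://mesos-master:5050")
    # Kubernetes:  .master("k8s://https://api-server-host:6443")
    .getOrCreate()
)

print(spark.sparkContext.master)  # confirms which manager the session is bound to
spark.stop()
```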

Apart from its rich API, the Spark UI also provides insights into each task's run time and the execution plan. Below is the query execution for the stocks data insertion with a batch size of 10,000, before code optimization.

(Screenshot: example query execution in the Spark UI.)
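
As a rough sketch of the kind of insertion profiled above, assuming a JDBC sink: the connection URL, table name, source path, and credentials below are hypothetical, but the batchsize option is the standard knob for rows sent per JDBC round trip (the default is 1,000):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stocks-insert").getOrCreate()

# Read the stocks data; the path and schema are placeholders.
stocks_df = spark.read.csv("/data/stocks.csv", header=True, inferSchema=True)

# Write over JDBC, sending 10,000 rows per round trip instead of the
# default 1,000. URL, table, and credentials below are hypothetical.
(
    stocks_df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/stocks")
    .option("dbtable", "stock_prices")
    .option("user", "spark_user")
    .option("password", "secret")
    .option("batchsize", 10000)
    .mode("append")
    .save()
)
```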

The current project runs the Spark image in standalone mode with 4 nodes, SPARK_WORKER_CORES=2, and SPARK_EXECUTOR_MEMORY=2G. I used the bitnami/spark Docker image because it is a minimal Debian-based image and easy to configure.
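
A session connecting to such a standalone cluster might be configured as follows; this is a sketch, and the master host name ("spark-master") is an assumption that should match your Docker service name:

```python
from pyspark.sql import SparkSession

# Connecting to the standalone cluster described above. The master host name
# ("spark-master") is an assumption; it should match your Docker service name.
spark = (
    SparkSession.builder
    .appName("standalone-demo")
    .master("spark://spark-master:7077")
    .config("spark.executor.memory", "2g")  # mirrors SPARK_EXECUTOR_MEMORY=2G
    .config("spark.executor.cores", "2")    # one executor per 2-core worker
    .getOrCreate()
)
```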
