Stratis Ioannidis edited this page Feb 13, 2019 · 13 revisions

Running Spark on the Discovery Cluster

Overview

Apache Spark is available as a module on the Discovery cluster. You can currently run Spark on Discovery in either local mode or standalone mode. In local mode, Spark jobs run on a single machine and are executed in parallel using multi-threading; this restricts parallelism to (at most) the number of cores on that machine.

By launching a standalone cluster, you can run your jobs over multiple machines simultaneously. However, configuring a standalone cluster requires a bit more effort than running Spark in local mode: local mode runs practically "out-of-the-box" on Discovery, while launching a standalone cluster requires installing a few extra scripts. Both methods are covered below; if you only wish to run Spark in local mode, you can skip the standalone-cluster instructions.

Spark Execution Quick Reference

|                 | Local Mode                                 | Cluster Mode                                               |
| Parameters      | N: number of CPU cores, e.g., 10           | MASTER_IP: IP address of cluster master, e.g., 10.100.8.52 |
| Interpreter     | pyspark --master local[N]                  | pyspark --master spark://MASTER_IP:7077                    |
| Run myscript.py | spark-submit --master local[N] myscript.py | spark-submit --master spark://MASTER_IP:7077 myscript.py   |
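
The two command forms in the quick reference differ only in the value passed to --master. As a rough illustration, the Python sketch below builds that value for each mode and assembles the matching spark-submit command. The helper names (local_master, standalone_master, spark_submit_command) are hypothetical and not part of Spark or of the Discovery setup; the port 7077 and the example values are taken from the table above.

```python
import shlex

# Port of the standalone master, as used in the quick reference.
SPARK_MASTER_PORT = 7077


def local_master(n_cores):
    """Return the --master value for local mode with n_cores threads."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return f"local[{n_cores}]"


def standalone_master(master_ip, port=SPARK_MASTER_PORT):
    """Return the --master value for a standalone cluster master."""
    return f"spark://{master_ip}:{port}"


def spark_submit_command(master, script):
    """Return the spark-submit invocation as an argument list."""
    return ["spark-submit", "--master", master, script]


if __name__ == "__main__":
    # Example values from the quick reference: N = 10, MASTER_IP = 10.100.8.52.
    for master in (local_master(10), standalone_master("10.100.8.52")):
        # shlex.join quotes arguments so the line can be pasted into a shell.
        print(shlex.join(spark_submit_command(master, "myscript.py")))
```

Returning an argument list rather than a single string keeps the command usable with subprocess.run without further shell quoting.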

Spark GUI

Both local and standalone modes in Spark offer a graphical user interface (GUI) from which you can monitor execution. To access this GUI, you first need to find the IP address of the machine on which the driver of your Spark application is running. Find this machine (e.g., c3096) via squeue, then type:

traceroute c3096

Output beginning with a line like the following will be printed:

traceroute to c3096 (10.99.252.65), 30 hops max, 60 byte packets

This means that the IP address of machine c3096 is 10.99.252.65. Copying and pasting the corresponding URL (here, http://10.99.252.65:4040) into a web browser will bring up the webpage containing the GUI. You need to be within Northeastern's network to access the GUI; if you are off-campus, you first need to connect through a VPN.
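
The step from the traceroute header to the GUI address can also be scripted. The sketch below is only an illustration: the function names (ip_from_traceroute, gui_url) are hypothetical, and it assumes the header line has the same shape as the sample output shown above. It extracts the IPv4 address from that line and appends port 4040.

```python
import re

# Port of the Spark GUI, as used in the URL above.
SPARK_UI_PORT = 4040

# Matches a header such as "traceroute to c3096 (10.99.252.65), 30 hops max, ..."
# and captures the host name and the IPv4 address in parentheses.
_TRACEROUTE_HEADER = re.compile(
    r"^traceroute to (\S+) \((\d{1,3}(?:\.\d{1,3}){3})\)"
)


def ip_from_traceroute(header_line):
    """Extract the IPv4 address from the first line of traceroute output."""
    match = _TRACEROUTE_HEADER.match(header_line.strip())
    if match is None:
        raise ValueError(f"unrecognized traceroute header: {header_line!r}")
    return match.group(2)


def gui_url(ip, port=SPARK_UI_PORT):
    """Build the URL of the Spark GUI served by the driver machine."""
    return f"http://{ip}:{port}"


if __name__ == "__main__":
    header = "traceroute to c3096 (10.99.252.65), 30 hops max, 60 byte packets"
    print(gui_url(ip_from_traceroute(header)))
```

On a machine that can resolve cluster host names, socket.gethostbyname("c3096") from the Python standard library would be another way to obtain the address without parsing traceroute output.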
