- Lecture notebooks are organized by class, each in its own folder.
- The `requirements.txt` file can be used to install all the necessary packages for all the notebooks in this course.
Note: As detailed below, multiple steps are necessary to run PySpark locally. We recommend trying out Vocareum before setting up locally, since the environment is already up and running there.
- Install Docker from https://docs.docker.com/get-docker/. MAKE SURE TO DOWNLOAD THE APPROPRIATE ONE FOR YOUR OPERATING SYSTEM.
- Open the Docker Desktop application (you do NOT need to sign in to Docker).
- Also open a Terminal and run the following command to download and install the PySpark image on your personal device:
  `docker pull jupyter/pyspark-notebook:x86_64-ubuntu-22.04`
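  If you want to confirm the download finished, one way is to list the images Docker knows about; the repository name is the one from the pull command above.

  ```bash
  # Optional: confirm the PySpark image is available locally.
  # You should see the x86_64-ubuntu-22.04 tag in the output.
  docker images jupyter/pyspark-notebook
  ```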
- Make sure you have `git` installed and an account with GitHub. Download this repository via `git` (https://github.com/ucsd-dsc232r-s24/lecture-notebooks.git). You can run the following command in a Terminal at the location where you want to clone the repository:
  `git clone https://github.com/ucsd-dsc232r-s24/lecture-notebooks.git`
- After the image is downloaded and installed, define your `HOSTDIR` in the Terminal. This is the absolute path in your personal computer's filesystem to the lecture notebooks, i.e., the location where you downloaded/cloned this GitHub repository. For example, if you clone this repository onto your Desktop, your path will be `~/Desktop/lecture-notebooks/`, i.e., `HOSTDIR=~/Desktop/lecture-notebooks/`. Make sure that there are no spaces in your path; see the sketch below.
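  As an illustration (assuming the repository was cloned to your Desktop; adjust the path to wherever you actually cloned it), you can set and sanity-check `HOSTDIR` like this:

  ```bash
  # Assumed location: replace with the path where you actually cloned the repository.
  HOSTDIR=~/Desktop/lecture-notebooks/

  # Print the path and list its contents to confirm it exists and contains no spaces.
  echo "$HOSTDIR"
  ls "$HOSTDIR"
  ```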
- Finally, run the Docker image using the following command:
  `docker run -p 127.0.0.1:8888:8888 -v $HOSTDIR:/home/jovyan docker.io/jupyter/pyspark-notebook:x86_64-ubuntu-22.04`
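  For reference, this is what the two flags do (an annotated copy of the same command, not a different one):

  ```bash
  # -p 127.0.0.1:8888:8888   publish the container's Jupyter port 8888 on localhost only
  # -v $HOSTDIR:/home/jovyan mount your lecture-notebooks folder as the container user's home directory
  docker run -p 127.0.0.1:8888:8888 -v $HOSTDIR:/home/jovyan docker.io/jupyter/pyspark-notebook:x86_64-ubuntu-22.04
  ```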
- This will generate a URL of the form:
  `http://127.0.0.1:8888/lab?token=<unique-token>`
  where your Jupyter server will be running. Just copy and paste that URL into your browser and it will launch JupyterLab.
- Open a Terminal in JupyterLab and run the following command:
  `pip install -r requirements_docker.txt`
  This should install all the necessary packages.
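  As an optional sanity check, you can confirm from that same Terminal that PySpark is importable (it comes preinstalled in this image):

  ```bash
  # Optional: this should print the installed PySpark/Spark version without errors.
  python -c "import pyspark; print(pyspark.__version__)"
  ```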
- After you are done with your work, you can stop the container from the Docker Desktop application. Similarly, you can start the container again next time from Docker Desktop.
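  If you prefer the command line, a rough equivalent uses the standard Docker commands below; the container name or ID shown by `docker ps` will differ on your machine:

  ```bash
  # List containers to find your container's name or ID (add -a to include stopped ones).
  docker ps -a

  # Stop the container when you are done (replace the placeholder with the name/ID from above).
  docker stop <container-name-or-id>

  # Start it again later, then check the logs to find the Jupyter URL with its token.
  docker start <container-name-or-id>
  docker logs <container-name-or-id>
  ```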
To run PySpark locally, install Java, Spark, and Python first if you don't have them already; you need both Java and Python for PySpark to work locally. Follow the steps here (https://www.datacamp.com/tutorial/installation-of-pyspark) based on your operating system to download and install Java and set up the corresponding environment variables.
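As a rough sketch of what that setup ends up with on macOS/Linux (the paths and version below are placeholders, not actual values; use the locations from your own installation, and on Windows set the equivalent variables through the Environment Variables dialog):

```bash
# Example environment variables for a local Spark installation (paths are placeholders).
export JAVA_HOME=/path/to/your/jdk                    # wherever your JDK is installed
export SPARK_HOME=/path/to/spark-3.x.x-bin-hadoop3    # wherever you unpacked Spark
export PATH="$SPARK_HOME/bin:$PATH"                   # puts spark-submit and pyspark on your PATH
export PYSPARK_PYTHON=python3                         # the Python interpreter PySpark should use
```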
After you have done the above, proceed to create a Python environment for the course from the `requirements.txt` file given here. You're free to do this any other way you're comfortable with, but we recommend the steps below:
- Install Anaconda from https://www.anaconda.com/download. You can refer to https://docs.anaconda.com/free/anaconda/install/index.html for installation instructions. MAKE SURE TO DOWNLOAD THE APPROPRIATE ONE FOR YOUR OPERATING SYSTEM.
- Once Anaconda is installed and the `conda` command is working in your terminal (refer to the installation instructions), confirm that the `conda activate base` command works (this should activate your `base` environment).
- Run the following (replace `<your-environment-name>` with a name for your environment; I used `DSC232R`):
  `conda create -n <your-environment-name> python=3.7.5`
  `conda activate <your-environment-name>`
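  Once the environment is activated, a quick check that you are in the right place (the version should match the one in the create command above):

  ```bash
  # With the new environment activated, this should print Python 3.7.5.
  python --version

  # The active environment is marked with an asterisk in this list.
  conda env list
  ```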
- Make sure you have `git` installed and an account with GitHub. Download this repository via `git` (https://github.com/ucsd-dsc232r-s24/lecture-notebooks.git). You can run the following command in a Terminal at the location where you want to clone the repository:
  `git clone https://github.com/ucsd-dsc232r-s24/lecture-notebooks.git`
- Navigate to the folder where you downloaded this repository. Then, run
  `pip install -r requirements.txt`
  This should install all the necessary packages into your environment.
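  To verify that the local Java + Spark + Python setup works end to end, a minimal smoke test (run inside the activated environment) is to start a local `SparkSession` and do a trivial computation; the numbers here are arbitrary:

  ```bash
  # Minimal end-to-end check: starts a local Spark session, sums 1..10 on a small RDD, then shuts down.
  python -c "from pyspark.sql import SparkSession; spark = SparkSession.builder.master('local[*]').getOrCreate(); print('Spark version:', spark.version); print('Sum of 1..10 =', spark.sparkContext.parallelize(range(1, 11)).sum()); spark.stop()"
  ```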