During my time as a UC Berkeley grad student, we were taught R and Python, so that is what I'm familiar with, and SystemML runs with both. Being new to computer science and wanting to jump straight into the data, I don't have much time to hack into Spark and figure out how to write high-level math with big data. With SystemML you can write the math no matter how big the data is! And because I can access algorithms from files, it's easier to go from formulas and Python code to big data problems.
While I may still be very new to the tech world and all of its wonderful tutorials, an issue that I have consistently noticed is the long list of assumptions made in any step-by-step guide, particularly when it comes to setting up your environment. Many developers, data scientists, and researchers are so advanced that they have forgotten what it's like to be new! When writing tutorials, they assume that everything is set up and ready to go, but that's not always the case. No need to worry with SystemML: I am here to help. Below is my very own step-by-step guide to running Apache SystemML in a Jupyter notebook (with little to no assumptions).
*If you are just starting out, please read the following "setting up your environment" step. If you aren't just starting out, please skip to "run SystemML", but make sure to install SystemML first!
Setting up your environment

If you're on a Mac, you'll want to install Homebrew (http://brew.sh) if you haven't already:
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
If you're on Linux, Linuxbrew provides the equivalent:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Linuxbrew/install/master/install)"
Spark runs on the JVM, so you'll also need Java; Homebrew's cask extension can install it:

brew tap caskroom/cask
brew install Caskroom/cask/java
To install something with Homebrew, all you need to do is type brew install followed by the name of the package. We'll use it to grab Spark 1.6 and Python (both 2 and 3), plus the Python packages this walkthrough needs:
brew tap homebrew/versions
brew install apache-spark16
brew install python
pip install jupyter matplotlib numpy
brew install python3
pip3 install jupyter matplotlib numpy
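If you want to be sure the Python side took, here is a quick sanity check (a minimal sketch; run it with python3, or python for the 2.x install) that should print versions with no import errors:

# Minimal sanity check: these should all import cleanly after the pip installs.
import sys
import numpy
import matplotlib
print(sys.version)             # e.g. 3.x if you launched python3
print(numpy.__version__)
print(matplotlib.__version__)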
First, use vim to create/edit your bash profile. Not sure what vim is? Check https://www.linux.com/learn/vim-101-beginners-guide-vim.
We are going to add the path where SystemML is stored to our bash profile, which will make it easier to access. First type:
vim .bash_profile
i
Pressing i puts vim into insert mode. Now add the line that points to SystemML. Note: /Documents is where I saved my SystemML, so be sure that your file path is accurate.
export SYSTEMML_HOME=/Users/stc/Documents/systemml-0.10.0-incubating
Press Esc to leave insert mode, then save and quit:

:wq
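Once you open a new terminal window (or reload the profile), you can confirm the variable is visible. A quick check from a Python prompt, assuming nothing else about your setup:

# Quick check that SYSTEMML_HOME made it into the environment.
import os
print(os.environ.get("SYSTEMML_HOME"))  # should print your SystemML path, not None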
Run SystemML

In your browser, if you go to http://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide.html#jupyter-pyspark-notebook-example---poisson-nonnegative-matrix-factorization, you will see a block of code under "Nonnegative Matrix Factorization".
Take a look at that page if you want to understand the code more deeply, but we only need part of it. In your terminal, type the following (it's all one command):
PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master local[*] --driver-class-path $SYSTEMML_HOME/target/SystemML.jar --jars $SYSTEMML_HOME/target/SystemML.jar --conf "spark.driver.memory=12g" --conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128 --conf spark.default.parallelism=100
Jupyter should have launched, and you should now be running a Jupyter notebook backed by Spark and SystemML!
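Before pasting in the real code, it doesn't hurt to confirm the notebook is actually wired up to Spark. The pyspark launcher pre-creates a SparkContext for you named sc, so a first cell like this should just work:

# `sc` is created for us by pyspark; this should print the Spark version (1.6.x here).
print(sc.version)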
%load_ext autoreload
%autoreload 2
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 6)
%%sh
curl -O http://snap.stanford.edu/data/amazon0601.txt.gz
gunzip amazon0601.txt.gz
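If you're curious what we just grabbed: amazon0601.txt is the Amazon product co-purchasing network from SNAP, a tab-separated edge list whose comment lines start with #. A quick peek at the raw file:

# Peek at the file format: '#' lines are comments, data lines are "prod_i<TAB>prod_j".
with open("amazon0601.txt") as f:
    for _ in range(5):
        print(f.readline().rstrip())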
import pyspark.sql.functions as F
dataPath = "amazon0601.txt"
X_train = (sc.textFile(dataPath)                             # read raw lines
    .filter(lambda l: not l.startswith("#"))                 # drop comment lines
    .map(lambda l: l.split("\t"))                            # split on tabs
    .map(lambda prods: (int(prods[0]), int(prods[1]), 1.0))  # (prod_i, prod_j, 1.0)
    .toDF(("prod_i", "prod_j", "x_ij"))                      # convert to a DataFrame
    .filter("prod_i < 500 AND prod_j < 500")                 # keep a small subset
    .cache())
# product IDs are 0-based, so the total count is the largest ID plus one
max_prod_i = X_train.select(F.max("prod_i")).first()[0]
max_prod_j = X_train.select(F.max("prod_j")).first()[0]
numProducts = max(max_prod_i, max_prod_j) + 1
print("Total number of products: {}".format(numProducts))
from SystemML import MLContext
ml = MLContext(sc)
pnmf = """
X = read($X)
X = X+1 # change product IDs to be 1-based, rather than 0-based
V = table(X[,1], X[,2])
size = ifdef($size, -1)
if(size > -1) {
V = V[1:size,1:size]
}
max_iteration = as.integer($maxiter)
rank = as.integer($rank)
n = nrow(V)
m = ncol(V)
range = 0.01
W = Rand(rows=n, cols=rank, min=0, max=range, pdf="uniform")
H = Rand(rows=rank, cols=m, min=0, max=range, pdf="uniform")
losses = matrix(0, rows=max_iteration, cols=1)
i=1
while(i <= max_iteration) {
  # multiplicative updates for W and H
  H = (H * (t(W) %*% (V/(W%*%H))))/t(colSums(W))
  W = (W * ((V/(W%*%H)) %*% t(H)))/t(rowSums(H))
  # compute loss
  losses[i,] = -1 * (sum(V*log(W%*%H)) - as.scalar(colSums(W)%*%rowSums(H)))
  i = i + 1;
}
write(losses, $lossout)
write(W, $Wout)
write(H, $Hout)
"""
ml.reset()
outputs = ml.executeScript(pnmf, {"X": X_train, "maxiter": 100, "rank": 10}, ["W", "H", "losses"])
losses = outputs.getDF(sqlContext, "losses")
xy = losses.sort(losses.ID).map(lambda r: (r[0], r[1])).collect()
x, y = zip(*xy)
plt.plot(x, y)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('PNMF Training Loss')
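The loss should drop steeply over the first iterations and then flatten out. If you also want the learned factors themselves, they were registered as outputs in the executeScript call above; assuming the getDF accessor works for them the same way it does for losses (an assumption on my part, but a natural one), something like this should bring them back as DataFrames:

# Assumed follow-up: retrieve the factor matrices the same way as the losses.
W_df = outputs.getDF(sqlContext, "W")
H_df = outputs.getDF(sqlContext, "H")
W_df.show(3)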