Machine Learning Overview with Spark ML

First lab of the Scalable Machine Learning course of the EIT Digital data science master at KTH

Problem Statement

This project studies the basics of regression and classification in Spark. It is divided into two parts:

First part: a guided exercise whose objective is to predict the median housing value in the California Housing Data (1990) dataset. It involves analyzing and transforming the attributes of the dataset (e.g., one-hot encoding, string indexing, normalization). After that, four regression models are implemented: linear regression, decision tree, random forest, and gradient-boosted tree regression. Finally, the dataset is split into train and test sets, the models are trained, and their hyperparameters are tuned.
Second part: aims to classify default payments of credit card customers using the Default of Credit Card Clients Dataset. First, an exploratory analysis is performed on the data, followed by the implementation and training of three classification models (logistic regression, decision tree, and random forest). Finally, the models are compared, with a brief discussion of which one performs best for the task.
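
The workflow of the first part (index and one-hot encode the categorical attribute, scale the features, fit a regressor, and tune it on a train/test split) can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes Spark 3.x, the column names are assumptions taken from the commonly distributed version of the housing CSV, StandardScaler stands in for the normalization step, and the parameter grid values are arbitrary. Only linear regression is shown.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.{OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.DataFrame

object HousingRegressionSketch {
  // Assumed column names; adjust them to the actual schema of the dataset.
  val categoricalCol = "ocean_proximity"
  val numericCols = Array("housing_median_age", "total_rooms", "population", "households", "median_income")
  val labelCol = "median_house_value"

  // RMSE on the label column; lower is better.
  val evaluator: RegressionEvaluator = new RegressionEvaluator()
    .setLabelCol(labelCol)
    .setPredictionCol("prediction")
    .setMetricName("rmse")

  // Builds a cross-validated pipeline: indexer -> encoder -> assembler -> scaler -> regressor.
  def buildTuner(): CrossValidator = {
    // Map the string category to a numeric index.
    val indexer = new StringIndexer()
      .setInputCol(categoricalCol)
      .setOutputCol("category_index")
      .setHandleInvalid("keep")

    // Turn the index into a one-hot vector.
    val encoder = new OneHotEncoder()
      .setInputCols(Array("category_index"))
      .setOutputCols(Array("category_vec"))

    // Collect numeric and encoded columns into a single feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(numericCols :+ "category_vec")
      .setOutputCol("raw_features")

    // Scale each feature to unit standard deviation.
    val scaler = new StandardScaler()
      .setInputCol("raw_features")
      .setOutputCol("features")

    val lr = new LinearRegression()
      .setFeaturesCol("features")
      .setLabelCol(labelCol)

    val pipeline = new Pipeline()
      .setStages(Array[PipelineStage](indexer, encoder, assembler, scaler, lr))

    // Illustrative hyperparameter grid: 3 x 2 = 6 candidate settings.
    val grid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
      .addGrid(lr.elasticNetParam, Array(0.0, 0.5))
      .build()

    new CrossValidator()
      .setEstimator(pipeline)
      .setEstimatorParamMaps(grid)
      .setEvaluator(evaluator)
      .setNumFolds(3)
  }

  // Splits the data, tunes on the training set, and returns the RMSE on the held-out test set.
  def trainAndEvaluate(housing: DataFrame): Double = {
    val Array(train, test) = housing.randomSplit(Array(0.8, 0.2), seed = 42L)
    val model = buildTuner().fit(train)
    evaluator.evaluate(model.transform(test))
  }
}
```

Given a DataFrame `housing` loaded from the CSV, `HousingRegressionSketch.trainAndEvaluate(housing)` would return the test RMSE of the best model found by cross-validation. Swapping `LinearRegression` for `DecisionTreeRegressor`, `RandomForestRegressor`, or `GBTRegressor` (with a grid over their own parameters) covers the other three models; the second part can follow the same pattern with a classifier and a classification evaluator.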

Tools

Both parts of the assignment are implemented in the Scala programming language with the Apache Spark machine learning library (Spark ML). In addition, Databricks was used to train more efficiently on a cluster, so the source takes the form of a Scala notebook. The implementation can be found at src/ and the notebook preview at https://angeligareta.com/machine-learning-spark/.

Authors

Serghei Socolovschi serghei@kth.se
Angel Igareta angel@igareta.com

