Skip to content

Data Analytics pipeline using Apache Spark | Build multi-class classification models | Test the model using test data and compute accuracy of each method

Notifications You must be signed in to change notification settings

prettyma/Feature-Engineering-Multi-class-Classification

Repository files navigation

REPORT

Environment Details:

Following are the environment chosen for this Lab:

  1. Apache Spark run on Virtual Machine (provided for Lab2 - Hadoop)
  2. Scripts were written in Python

Steps to run the program:

Note: Our file directory for the python script is /spark/lab3/src/script.py and for the data is /spark/lab3/data

  1. From the terminal, traverse to the folder spark/
  2. Run the script by spark-submit /lab3/src/script.py
  3. You can see the accuracy returned by the Logistic Regression and Naive Bayes on the console

Documentation

  1. Collect data
    Data was collected using NYTimesArticles-API (as used in lab2) for the following categories:
  • Business : Stock, Economy, Finance
  • Sports : NBA, BFL, Golf
  • Politics : President, Trump, Election
  • Entertainment : Met Gala, TMZ, SNL
  1. Dataframe
    Read all articles from each of the category and appended it into a spark-dataframe along with a column specifying the category name.
  2. Feature Engineering
  • Tokenizer : Tokenize each article into words using space delimiter. Used RegexTokenizer API from pyspark
  • Stop Words : Removed commonly used words unrelated to the categories. Used StopWordsRemover API from pyspark
  • Count Vectors : Used CountVectorizer to count the frequency of each word occurring in an article (Similar to term frequency)
  • String Indexer : Converted every category to an integer label using StringIndexer from pyspark
  • IDF : Used IDF API to calculate the frequency of each word in a category
  1. Splitting of Dataframe
    Used pyspark's random-split API to split the data to training data(80%) and test data(20%)
  2. Multi Class Classification
  • Logistic Regression : Used pyspark's LogisticRegression API to create a LR model using the training data. The labels of the test data were predicted using the trained LR Model.
Logistic Regression Classification pipeline:

alt text

  • Naive Bayes Classification : Used pyspark's NaiveBayes API to create a Naive Bayes model using the training data. The labels of the test data were predicted using the trained Naive Bayes model.
Naive Bayes Classification pipeline:

alt text 6. Accuracy
The accuracy of the classification model was determined using the MulticlassClassificationEvaluator API from pyspark by comparing the predicted labels from the classification model and the test data's labels 7. Testing
An unknown set of labelled data was collected and classified using the steps and the accuracy was determined

Output:

  1. Test Data Accuracy
    • Logistic Regression : 81.92%
    • Naive Bayes : 84.7%
  2. Unknown data Accuracy
    • Logistic Regression : 95.44%
    • Naive Bayes : 92.36%

ScreenShot: alt text

About

Data Analytics pipeline using Apache Spark | Build multi-class classification models | Test the model using test data and compute accuracy of each method

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages