The goal of this classification challenge is to build a classifier that can accurately predict if a song is either “Rock”, “Hip Hop”, or “Pop” based on the lyrics.
Provided with the dataframe: 'data.csv'. It contains 55,000 observations. The first 50,000 observations, called the training set include the lyrics and the Genre("Rock", "Hip Hop", or "Pop"). The last 5,000 observations (id starting with 'TEST_x') are named the testing set. The lyrics are provided for the testing set, but the 'Genre' is missing (value is 'unkown'). Using this, I will build a machine learning model that learns from the training set to predict the 'Genre' of the testing set. The goal is to be as accurate as possible but also to be able to estimate the accuracy well.
- Working with scikit learn in Python
- Selecting the proper classifier/machine learning model
- Using pandas for machine learning classification
- Estimating accuracy of the classifier I built
- How to adjust the model for machine learning
- Gained a better understanding of different models for distribution
The 'ea.csv' file includes the estimated accuracy of the model used for the assignment.