zmusic_code

Code for the EMI Music Data Science Hackathon

http://www.kaggle.com/c/MusicHackathon


DOCUMENTS


REQUIREMENTS


DATA

  • EMI One Million Interview Dataset

    http://musicdatascience.com/emi-million-interview-dataset/

  • ./data/*.csv

    The data files users.csv and words.csv were cleaned and encoded by hand using Unix tools (cat, cut, split, grep, sort, wc, etc.) and a text editor (search and replace).

  • ./data/*.txt

    The files users_.txt and words_.txt document how the text-valued categorical attributes are encoded.
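
The encoding of categorical attributes described above was done by hand, but the same idea can be sketched in Python. This is only an illustration: the column name and values below are hypothetical, the sorted-order code assignment is an assumption, and the actual codes in users_.txt and words_.txt may differ.

```python
def build_encoding(values):
    """Map each distinct text category to an integer code (sorted order, an assumption)."""
    categories = sorted(set(values))
    return {category: code for code, category in enumerate(categories)}


def encode_column(values, encoding, missing=-1):
    """Replace each text value with its code; unseen values get `missing`."""
    return [encoding.get(value, missing) for value in values]


# Hypothetical categorical column, not taken from the real data files.
gender = ["Female", "Male", "Female", "Male", "Female"]

encoding = build_encoding(gender)
codes = encode_column(gender, encoding)

print(encoding)
print(codes)
```

Keeping the mapping in a separate file, as users_.txt and words_.txt do, makes it possible to decode the integer columns later.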


PROGRAMS

  • ./users.py

    Pre-process the users data

  • ./words.py

    Pre-process the words data

  • ./music.py

    Pre-process the music training/test data

  • ./model.py [n]

    Run cross-validation experiments on the training data using a random forest with n trees (n=60 by default)

  • ./submit.py

    Make final predictions on the test data using the random forest with 60 trees

  • ./prepare_libfm.py

    Convert the data into libFM format: train.libfm and test.libfm
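
For readers unfamiliar with the libFM format that prepare_libfm.py targets, each line holds a target value followed by sparse `index:value` pairs. The sketch below is not the script itself: it assumes a one-hot layout in which user indices come first and track indices are offset by the number of users, and the sample rows are made up.

```python
def to_libfm_line(target, features):
    """Format one example as 'target index:value index:value ...', skipping zeros."""
    parts = [f"{target:g}"]
    for index in sorted(features):
        value = features[index]
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


def one_hot_rows(rows, n_users):
    """Yield libFM lines for (user, track, rating) rows (hypothetical layout)."""
    for user, track, rating in rows:
        # Users occupy indices [0, n_users); tracks are shifted past them.
        yield to_libfm_line(rating, {user: 1, n_users + track: 1})


# Made-up (user, track, rating) rows for illustration only.
rows = [(0, 2, 45), (1, 0, 10), (0, 1, 72.5)]
lines = list(one_hot_rows(rows, n_users=2))

for line in lines:
    print(line)
```

Additional attributes (for example the encoded users and words columns) would be appended as further index ranges in the same way.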


PERFORMANCE

Random Forest (n_estimators=60, max_features='sqrt')

  • RMSE = 14.59553 (2-fold cross-validation)
  • RMSE = 13.76513 (public)
  • RMSE = 13.80559 (private)
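
The cross-validation figure above can be understood through a minimal sketch of 2-fold cross-validated RMSE. It uses only the standard library and a mean-predicting stand-in model; with scikit-learn installed, the `fit` callable would presumably wrap `RandomForestRegressor(n_estimators=60, max_features='sqrt')`. Averaging the per-fold RMSE is an assumption, as is the shuffling seed; model.py may compute the score differently.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_rmse(X, y, fit, k=2, seed=0):
    """Average the held-out RMSE over k folds (averaging is an assumption)."""
    scores = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        # `fit` returns a function that predicts one example.
        predict = fit([X[i] for i in train], [y[i] for i in train])
        preds = [predict(X[i]) for i in held_out]
        scores.append(rmse([y[i] for i in held_out], preds))
    return sum(scores) / len(scores)


def fit_mean(X_train, y_train):
    """Stand-in model: always predict the training mean."""
    mean = sum(y_train) / len(y_train)
    return lambda x: mean


# Made-up data for illustration only.
X = [[i] for i in range(8)]
y = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
print(cross_validated_rmse(X, y, fit_mean, k=2))
```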

Factorization Machine (-method mcmc -dim '1,1,100' -init_stdev 0.25 -iter 1000)

  • RMSE = 14.19240 (2-fold cross-validation)
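
The factorization machine settings above are libFM command-line options. A hedged sketch of assembling such a command from Python follows; the binary name `libFM`, the output file name `predictions.txt`, and the regression task flag `-task r` are assumptions not stated in this README, while the remaining options are copied from the line above.

```python
import shlex


def libfm_command(train, test, out, binary="libFM"):
    """Build a libFM argument list; binary path and -task r are assumptions."""
    return [
        binary,
        "-task", "r",            # regression, assumed because RMSE is reported
        "-train", train,
        "-test", test,
        "-out", out,             # hypothetical output file for predictions
        "-method", "mcmc",
        "-dim", "1,1,100",
        "-init_stdev", "0.25",
        "-iter", "1000",
    ]


cmd = libfm_command("train.libfm", "test.libfm", "predictions.txt")
print(shlex.join(cmd))
# To actually run it (requires libFM to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```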

AUTHOR

Dell Zhang (dell.z@ieee.org)