Big Data Technologies (Academic Year 2019-2020)

The Million Song Dataset (MSD)

ETL, data exploration and Gaussian Mixture clustering in Databricks Community

The Million Song Dataset (MSD) is a dataset of popular songs spanning decades of music. It was released to the public for research by The Echo Nest, a company that specializes in music intelligence services. The dataset became more widely known after an analysis competition based on it was hosted on Kaggle.com in 2012. It is about 300 GB in size and contains audio features (loudness, tempo, etc.) and metadata (artist, year, etc.) for a million contemporary popular music tracks.

In this project, we propose a simple Big Data system to process and analyse the Million Song Dataset (MSD) using Spark on the Databricks platform, in its free Community Edition. After computing some descriptive statistics on the dataset, we focus on clustering the songs according to different sets of song attributes.
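
As an illustration of the Gaussian Mixture clustering mentioned above, here is a minimal pure-Python sketch of expectation-maximisation on a single attribute. It is not the project's notebook code, which runs on Spark. The function names (fit_gmm_1d, assign_cluster), the synthetic tempo values and the choice of two components are all assumptions made for the example.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(values, k=2, iterations=50, min_var=1e-6):
    """Fit a k-component 1-D Gaussian mixture with EM.

    Returns (weights, means, variances), one entry per component.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: place the initial means at evenly spaced quantiles.
    means = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    overall_mean = sum(ordered) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in ordered) / n, min_var)
    variances = [overall_var] * k
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300  # guard against division by zero
            resp.append([d / total for d in dens])

        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # empty component: keep its previous parameters
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var_j, min_var)

    return weights, means, variances


def assign_cluster(x, weights, means, variances):
    """Index of the component with the highest weighted density at x."""
    scores = [w * gaussian_pdf(x, m, v)
              for w, m, v in zip(weights, means, variances)]
    return scores.index(max(scores))


# Synthetic stand-in for a song attribute (hypothetical tempo values in BPM),
# NOT data taken from the Million Song Dataset.
rng = random.Random(0)
tempos = ([rng.gauss(90, 5) for _ in range(200)]
          + [rng.gauss(140, 5) for _ in range(200)])

weights, means, variances = fit_gmm_1d(tempos, k=2)
print("weights:", weights)
print("means:", means)
print("cluster of a 85 BPM track:", assign_cluster(85.0, weights, means, variances))
```

On real data the same idea is applied to vectors of several attributes at once, which is what a distributed implementation such as Spark's Gaussian Mixture estimator is meant for.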

Requirements

Create a Databricks (Community Edition) account
Import the Python notebooks (import data, explore data and clustering) into your workspace
Download the original subset (~2 GB) provided by the Million Song Dataset website
Decompress the archive (.gz) and upload all the folders to the Databricks File System (Data > Add Data > Upload File)
In the "Import dataset" notebook, set the variable "ROOT_FOLDER" to your dataset path
- Default folder: "/FileStore/tables/songs"
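
Before uploading, it can help to check that the decompressed folder looks complete. The following is a small sketch of such a check, run locally rather than in Databricks. The helper names (find_track_files, count_per_top_folder) and the example folder name are hypothetical, and the ".h5" suffix assumes the track files are stored as HDF5, so adjust it if your copy differs.

```python
import os
from collections import Counter


def find_track_files(root, suffix=".h5"):
    """Return the sorted paths of all files under root ending with suffix."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(suffix):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def count_per_top_folder(root, files):
    """Count files per first-level folder under root."""
    counts = Counter()
    for path in files:
        relative = os.path.relpath(path, root)
        top_level = relative.split(os.sep)[0]
        counts[top_level] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical local folder holding the decompressed subset.
    local_root = "./songs"
    files = find_track_files(local_root)
    print(len(files), "track files found")
    for folder, count in sorted(count_per_top_folder(local_root, files).items()):
        print(folder, count)
```

The per-folder counts give a quick way to compare what was extracted locally with what ends up under ROOT_FOLDER after the upload.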


Authors (Group 11)

Bronzini Marco

Lazzerini Giacomo

Repository: saturnMars/bigDataTechnologies