- Satvik Praveen
- Matt Palmer
- Jonathan Tong
- Kaitlyn Griffin
- Reed Pafford
This project investigates scalable visualization techniques for big data using distributed systems. The focus is on analyzing diverse data types sourced from the National Hockey League (NHL): massive tabular data, audio recordings, images, and text. Using machine learning techniques and distributed computing frameworks, we generated visualizations that uncover trends, patterns, and relationships in the data. All visual outputs for each data type are stored in the corresponding `Output` folder.
The repository is organized as follows:
```
├── Audio
│   ├── Data            # Raw audio data files
│   ├── Output          # Generated audio visualizations (waveforms, spectrograms, MFCC heatmaps)
│   └── Audio.ipynb     # Jupyter notebook for audio data analysis
├── Data                # General folder for additional datasets
├── Image
│   ├── Data            # Raw images of NHL players in action
│   ├── Output          # Image visualizations (t-SNE plots, clustered images)
│   └── Image.ipynb     # Jupyter notebook for image data analysis
├── Tabular_Massive
│   ├── Output          # Visualizations for massive tabular data (game stats, trends)
│   └── Tabular.ipynb   # Jupyter notebook for tabular data analysis
├── Textual
│   ├── Output          # Visualizations for textual data (word clouds, t-SNE plots, TF-IDF)
│   └── Textual.ipynb   # Jupyter notebook for textual data analysis
├── .gitignore          # Files and folders to be ignored by Git
└── README.md           # This README file
```
The project processes and analyzes the following types of data:
- Tabular (massive): Play-by-play statistics of NHL games from 2013 to 2023, including game events, player statistics, and team performance data.
- Audio: Stadium noise recordings from NHL games, analyzed using waveform, MFCC, and spectrogram visualizations.
- Image: Action images of five NHL players, processed with deep learning models for feature extraction and clustering.
- Textual: Articles related to NHL games, scraped from the web and analyzed with TF-IDF, word clouds, and t-SNE visualizations.
We employed distributed systems to handle the scale and complexity of big data, leveraging:
- Dask: Parallel processing for efficient computation on massive tabular data (a minimal sketch follows this list).
- Google Colab: Cloud resources to perform analysis across large datasets.
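As an illustration, a minimal Dask sketch of the massive tabular workflow might look like the following. The file pattern and column names (`team`, `goals`) are assumptions for demonstration, not the actual play-by-play schema used in `Tabular.ipynb`.

```python
# Minimal Dask sketch (hypothetical file pattern and columns, not the real schema):
# read play-by-play CSVs lazily, aggregate per team in parallel, then plot a trend.
import dask.dataframe as dd
import matplotlib.pyplot as plt

# Lazily read all seasons' play-by-play files as one distributed dataframe
plays = dd.read_csv("Data/play_by_play_*.csv")

# Build the aggregation lazily; .compute() triggers the parallel execution
goals_per_team = (
    plays.groupby("team")["goals"]
    .sum()
    .compute()
    .sort_values(ascending=False)
)

# Simple trend visualization of the aggregated result
goals_per_team.head(10).plot(kind="bar", title="Goals by team (top 10)")
plt.tight_layout()
plt.show()
```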
The following analysis techniques were applied to each data type (illustrative sketches of the audio, image, and textual pipelines follow this list):
- Audio Data: Short-Time Fourier Transform (STFT), Mel-Frequency Cepstral Coefficients (MFCC), waveforms, and spectrograms.
- Image Data: Feature extraction using a pre-trained ResNet-50 model, dimensionality reduction (PCA, t-SNE), and clustering (K-means).
- Textual Data: TF-IDF vectorization, word clouds, t-SNE visualizations, and topic modeling with Plotly dashboards.
- Tabular Data: Statistical analysis and trend visualizations (e.g., bar charts, scatter plots, pie charts).
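A sketch of the audio pipeline, assuming the `librosa` package (not listed in the dependencies below) and a hypothetical file path; `Audio.ipynb` may use different parameters:

```python
# Audio sketch: waveform, spectrogram (via STFT), and an MFCC heatmap.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load a recording at its native sampling rate (path is a placeholder)
y, sr = librosa.load("Audio/Data/stadium_noise.wav", sr=None)

# Short-Time Fourier Transform -> magnitude spectrogram in decibels
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

# Mel-Frequency Cepstral Coefficients
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

fig, axes = plt.subplots(3, 1, figsize=(10, 9))
librosa.display.waveshow(y, sr=sr, ax=axes[0])
axes[0].set_title("Waveform")
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="log", ax=axes[1])
axes[1].set_title("Spectrogram (STFT)")
librosa.display.specshow(mfccs, sr=sr, x_axis="time", ax=axes[2])
axes[2].set_title("MFCC heatmap")
plt.tight_layout()
plt.show()
```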
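A sketch of the image pipeline under similar assumptions: the folder path and file extension are placeholders, and the exact preprocessing and hyperparameters in `Image.ipynb` may differ.

```python
# Image sketch: ResNet-50 features -> PCA -> t-SNE embedding -> K-means clusters.
from pathlib import Path

import torch
from torchvision import models, transforms
from PIL import Image
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

# Pre-trained ResNet-50 with the classification head removed, used as a feature
# extractor (the weights= API assumes torchvision >= 0.13)
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

# Standard ImageNet preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Extract one 2048-d feature vector per image (assumes several images per player)
features = []
with torch.no_grad():
    for path in sorted(Path("Image/Data").glob("*.jpg")):
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        features.append(resnet(img).squeeze(0).numpy())

# PCA before t-SNE, then K-means with one cluster per player
n_images = len(features)
reduced = PCA(n_components=min(50, n_images)).fit_transform(features)
embedded = TSNE(n_components=2, perplexity=min(30, n_images - 1),
                random_state=0).fit_transform(reduced)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(reduced)
```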
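And a sketch of the textual pipeline; the `articles` list is a stand-in for the scraped NHL articles, and the `wordcloud` package is an assumption (it is not in the dependency list below).

```python
# Textual sketch: TF-IDF vectorization, a word cloud, and a 2-D t-SNE projection.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from wordcloud import WordCloud

articles = [
    "Overtime winner lifts the home team in game seven.",
    "Goaltender records a shutout as the playoff race tightens.",
    # ... scraped article texts go here ...
]

# TF-IDF vectorization with English stop words removed
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(articles)

# Word cloud weighted by summed TF-IDF scores across the corpus
weights = dict(zip(vectorizer.get_feature_names_out(),
                   np.asarray(tfidf.sum(axis=0)).ravel()))
cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(weights)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()

# t-SNE projection of article-level TF-IDF vectors (perplexity is a placeholder;
# a real corpus would contain many more articles)
embedded = TSNE(n_components=2, perplexity=min(5, len(articles) - 1),
                random_state=0).fit_transform(tfidf.toarray())
plt.figure()
plt.scatter(embedded[:, 0], embedded[:, 1])
plt.title("t-SNE of article TF-IDF vectors")
plt.show()
```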
To run the analyses:

- Clone this repository:

  ```bash
  git clone https://github.com/kaitlyngrif/dataviz-group5.git
  ```

- Navigate to the folder for the relevant data type (`Audio`, `Image`, `Tabular_Massive`, or `Textual`).
- Open the associated Jupyter notebook in your preferred environment:

  ```bash
  jupyter notebook Audio/Audio.ipynb
  ```
- Execute the notebook cells sequentially to preprocess data, perform analysis, and generate visualizations.
The notebooks rely on the following Python packages:

- `numpy`, `pandas`: Data manipulation and preprocessing
- `matplotlib`, `seaborn`: Visualization libraries
- `scikit-learn`: Machine learning for dimensionality reduction and clustering
- `torch`: Deep learning-based image feature extraction
- `nltk`: Text preprocessing
- `beautifulsoup4`: Web scraping for textual data
- `dask`: Distributed computing
Known limitations:
- Data Scraping Issues: Security protocols on the NHL website prevented automated scraping for some datasets (e.g., images and audio), necessitating manual downloads.
- Video Data Exclusion: Computational limitations prevented the inclusion of video data in the analysis.
- Parallel Processing Challenges: Memory constraints caused inefficiencies during parallel processing of large datasets.
Future directions:
- Incorporate video data into the analysis pipeline.
- Utilize Large Language Models (LLMs) for automated report generation based on visualizations.
- Expand the framework to process additional datasets from other sports.
- Enhance the efficiency of parallel processing and data visualization.
We thank the National Hockey League (NHL) for providing the publicly accessible data used in this project.
This project is licensed under the MIT License. See the `LICENSE` file for more details.