StackOverFlow-MBDProject

About the project

This repository implements an analysis of the Stack Overflow dataset to answer two research questions:

RQ 1. What are the popular tech topics discussed on Stack Overflow, and how have they changed over time?
RQ 2. What are the factors contributing to the quality of questions and answers in Stack Overflow?
- 2.1 How do tags influence the possibility of questions getting answered?
- 2.2 Does user reputation influence the quality of Stack Overflow’s posts?
- 2.3 Which factors drive a question to be answered (and how fast)?

The final report is available here

Raw data can be downloaded from the Stack Exchange Data Dump.
Raw XML data can be converted to Parquet format using this script
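To give a feel for what the raw data looks like before conversion, here is a minimal sketch that reads a few rows in the shape of the dump's Posts.xml using only the Python standard library. It is not the repository's conversion script, and it does not use Spark. The sample rows, the helper names `parse_tags` and `iter_posts`, and the choice of output fields are assumptions made for illustration. It also assumes tags are stored in the angle-bracket form `<tag1><tag2>`.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Hypothetical sample in the shape of Posts.xml (one <row> per post).
SAMPLE_POSTS_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" CreationDate="2020-01-01T10:00:00.000" Score="5" Tags="&lt;python&gt;&lt;apache-spark&gt;" />
  <row Id="2" PostTypeId="2" ParentId="1" CreationDate="2020-01-01T10:30:00.000" Score="3" />
</posts>
"""


def parse_tags(raw):
    """Split '<python><apache-spark>' into ['python', 'apache-spark']."""
    if not raw:
        return []
    return [tag for tag in raw.strip("<>").split("><") if tag]


def iter_posts(source):
    """Stream <row> elements and yield one flat dict per post."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "row":
            continue
        attrs = elem.attrib
        yield {
            "id": int(attrs["Id"]),
            "post_type_id": int(attrs["PostTypeId"]),  # 1 = question, 2 = answer
            "parent_id": int(attrs["ParentId"]) if "ParentId" in attrs else None,
            "creation_date": attrs.get("CreationDate"),
            "score": int(attrs.get("Score", 0)),
            "tags": parse_tags(attrs.get("Tags")),
        }
        elem.clear()  # release memory held by the processed element


records = list(iter_posts(BytesIO(SAMPLE_POSTS_XML)))
for record in records:
    print(record)
```

Streaming with `iterparse` keeps memory use flat, which matters because the real Posts.xml is far too large to load at once. For the full dump, a distributed reader such as the spark-xml package listed under Tools is the more realistic choice.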

Tools:

Apache Spark (PySpark, spark-xml version 2.11-0.9.0)
Python
Jupyter Notebook (for analysis and visualization; key libraries: sklearn, nltk, matplotlib)

Python files are located in the codes/ directory, and different files answer different research questions. Since submitting code dependencies to Spark is not straightforward, we copied some functions between the files, so there may be duplicates.
Each file has its own running instructions (for submitting to Spark). Python notebook files (.ipynb) are designed to run either locally or on UT's Jupyter server.
Some figures (not all) are available in the fig/ directory. The other figures are in the Jupyter notebook for their research question.

cd codes/
Create an environment using conda env create -f stackoverflow_env.yml --name stackoverflow_env
Zip the environment: zip -r stackoverflow_env.zip stackoverflow_env
Then follow the instructions at the top of each Python file.
