Objective: To introduce students to supervised learning, focusing on linear regression, by guiding them through a project that predicts house prices based on a variety of features.
Dataset: We'll use the "Boston Housing Dataset" from the UCI Machine Learning Repository. This dataset contains information about housing in Boston, including features such as the number of rooms, age of the house, and crime rate, along with the target variable, which is the median value of owner-occupied homes.
- Data Exploration and Preprocessing
- Model Building and Training
- Model Evaluation
- Feature Engineering and Improvement
- Collaboration and Version Control
├── data
│ └── boston_housing.csv
├── notebooks
│ ├── EDA.ipynb
│ ├── Data_Preprocessing.ipynb
│ ├── Model_Training.ipynb
│ └── Model_Evaluation.ipynb
├── scripts
│ ├── data_preprocessing.py
│ ├── train_model.py
│ └── evaluate_model.py
├── README.md
├── requirements.txt
└── .gitignore
Notebook: notebooks/EDA.ipynb Steps:
- Load the dataset.
- Explore the data structure, types, and summary statistics.
- Visualize relationships between features and the target variable.
- Identify missing values and outliers.
Notebook: notebooks/Data_Preprocessing.ipynb Steps:
- Handle missing values and outliers.
- Encode categorical variables.
- Normalize/standardize numerical features.
- Split the data into training and testing sets.
- Script: scripts/data_preprocessing.py
Notebook: notebooks/Model_Training.ipynb Steps:
-
Choose appropriate features for the model.
-
Train a linear regression model.
-
Perform hyperparameter tuning (if applicable).
-
Script: scripts/train_model.py
Notebook: notebooks/Model_Evaluation.ipynb Steps:
- Evaluate the model using metrics such as Mean Squared Error (MSE), R-squared.
- Plot residuals to check the assumptions of linear regression.
- Compare model performance with different feature sets or preprocessing steps.
- Script: scripts/evaluate_model.py
Notebook: notebooks/Feature_Engineering.ipynb Steps:
- Create new features that might improve model performance.
- Test different feature combinations.
- Evaluate the impact of new features on model performance.
Steps:
- Regularly merge branches after reviewing and resolving conflicts.
- Use pull requests to review and discuss changes before merging.
- Git Commands: Provide a list of essential Git commands for students to use.
Source: Boston Housing Dataset on UCI Machine Learning Repository
Features:
- CRIM: per capita crime rate by town.
- ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS: proportion of non-retail business acres per town.
- CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise).
- NOX: nitrogen oxides concentration (parts per 10 million).
- RM: average number of rooms per dwelling.
- AGE: proportion of owner-occupied units built prior to 1940.
- DIS: weighted distances to five Boston employment centers.
- RAD: index of accessibility to radial highways.
- TAX: full-value property tax rate per $10,000.
- PTRATIO: pupil-teacher ratio by town.
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of black residents by town.
- LSTAT: percentage of lower status of the population.
- MEDV: median value of owner-occupied homes in $1000s (Target).