This project builds predictive models and performs exploratory data analysis (EDA) on NBA data to evaluate team and player performance. The workflow combines machine learning, statistical modeling, and feature engineering, with a focus on Elo ratings, mutual information for feature selection, and PCA for dimensionality reduction.
- Cross-validation with PCA: Applies dimensionality reduction using PCA to retain 95% of variance and evaluates a Random Forest classifier.
- Elo Rating Calculation: Tracks team strength over time using a custom Elo system without data leakage.
- Historical Statistics Calculation: Computes win percentages and game counts for teams up to the current game.
Tracks team Elo ratings chronologically without looking ahead, ensuring no data leakage.
Steps:
- Initialize all team ratings to a base value (e.g., 1500).
- Adjust team ratings after each game based on the result and the expected score.
- Carry over 75% of the previous season's Elo rating to the next season, if applicable.
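
The carry-over step can be read in more than one way, because the text does not say what fills the remaining 25%. A minimal sketch, assuming the common convention of regressing each rating toward the base value; the helper name `carry_over_season` and the regression target are assumptions, not taken from the project code:

```python
def carry_over_season(team_elos, carry=0.75, base_elo=1500):
    """Blend each team's end-of-season rating with the base rating.

    Assumption: the remaining (1 - carry) share is filled with base_elo,
    so ratings regress toward the mean between seasons.
    """
    return {
        team: carry * elo + (1 - carry) * base_elo
        for team, elo in team_elos.items()
    }


# Hypothetical end-of-season ratings.
end_of_season = {"Team A": 1600.0, "Team B": 1400.0}
next_season = carry_over_season(end_of_season)
```

Under this reading a 1600-rated team starts the next season at 1575, so strong and weak teams keep most of their edge without carrying it over in full.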

Key Parameters:
- `initial_elo`: Starting Elo rating for new teams (default: 1500).
- `k`: Sensitivity factor for Elo updates (default: 20).
- `home_advantage`: Elo boost for the home team (default: 100).

Key Outputs:
- `Elo_Team`: Elo rating of the home team before the game.
- `Elo_Team.1`: Elo rating of the away team before the game.

Code Snippet:
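```python
def calculate_elo_chronologically(data, initial_elo=1500, k=20, home_advantage=100):
    # Current rating per team (season carry-over is omitted from this excerpt)
    team_elos = {}
    for idx, row in data.iterrows():
        home_team = row['TEAM_NAME']
        away_team = row['TEAM_NAME.1']
        home_elo = team_elos.get(home_team, initial_elo)
        away_elo = team_elos.get(away_team, initial_elo)

        # Record pre-game ratings so the features never see the current result
        data.at[idx, 'Elo_Team'] = home_elo
        data.at[idx, 'Elo_Team.1'] = away_elo

        # Expected home score, including the home-court boost
        home_expected = 1 / (1 + 10 ** (-(home_elo - away_elo + home_advantage) / 400))
        home_win = row['Target']

        # Write back from the fetched values so unseen teams do not raise KeyError
        team_elos[home_team] = home_elo + k * (home_win - home_expected)
        team_elos[away_team] = away_elo + k * ((1 - home_win) - (1 - home_expected))
    return data
```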
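
The update rule can be checked by hand on a single game. The sketch below isolates the expected-score formula and the rating update from the snippet above; the helper names `expected_home_score` and `elo_update` are illustrative and do not appear in the project code.

```python
def expected_home_score(home_elo, away_elo, home_advantage=100):
    """Probability of a home win implied by the rating gap plus the home boost."""
    return 1 / (1 + 10 ** (-(home_elo - away_elo + home_advantage) / 400))


def elo_update(home_elo, away_elo, home_win, k=20, home_advantage=100):
    """Return (new_home_elo, new_away_elo) after one game; home_win is 0 or 1."""
    expected = expected_home_score(home_elo, away_elo, home_advantage)
    delta = k * (home_win - expected)
    # The away team's change is the exact negative, so the update is zero-sum.
    return home_elo + delta, away_elo - delta


# Two evenly rated teams; the home team wins.
new_home, new_away = elo_update(1500, 1500, home_win=1)
```

With equal ratings, the 100-point home boost gives the home team an expected score of about 0.64, so a home win moves each rating by roughly 7.2 points in opposite directions.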
Calculates win percentages and game counts for each team up to the current game without data leakage.
Steps:
- Track team statistics (wins, losses, and total games) for each season.
- For each game, store the historical win percentage for both teams before updating their stats.

Key Outputs:
- `home_win_pct`: Home team's win percentage before the game.
- `away_win_pct`: Away team's win percentage before the game.
- `total_games`: Total games played by the home team before the game.

Code Snippet:
```python
def calculate_historical_stats(data):
    # Excerpt: `season_stats` (season -> team -> running record) and `season`
    # are set up in code that is not shown here.
    for idx, row in data.iterrows():
        home_team = row['TEAM_NAME']
        away_team = row['TEAM_NAME.1']
        home_stats = season_stats[season][home_team]
        away_stats = season_stats[season][away_team]

        # Store pre-game values; max(..., 1) avoids division by zero
        data.at[idx, 'home_win_pct'] = home_stats['wins'] / max(home_stats['games'], 1)
        data.at[idx, 'away_win_pct'] = away_stats['wins'] / max(away_stats['games'], 1)
        data.at[idx, 'total_games'] = home_stats['games']
```
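
The key to avoiding leakage here is the order of operations: read the running record first, update it afterwards. A dependency-free sketch of that pattern on plain dictionaries rather than a DataFrame; the function name `add_historical_stats` and the `Season` key are assumptions made for illustration:

```python
from collections import defaultdict


def add_historical_stats(games):
    """Attach pre-game win percentages to each game record, in order.

    `games` is a chronologically sorted list of dicts with the keys
    'Season' (assumed name), 'TEAM_NAME', 'TEAM_NAME.1', and 'Target'.
    """
    # season -> team -> running record
    season_stats = defaultdict(lambda: defaultdict(lambda: {"wins": 0, "games": 0}))
    for game in games:
        home = season_stats[game["Season"]][game["TEAM_NAME"]]
        away = season_stats[game["Season"]][game["TEAM_NAME.1"]]

        # 1) Store features from games played so far (0.0 if none yet).
        game["home_win_pct"] = home["wins"] / max(home["games"], 1)
        game["away_win_pct"] = away["wins"] / max(away["games"], 1)
        game["total_games"] = home["games"]

        # 2) Only then fold the current result into the running record.
        home["games"] += 1
        away["games"] += 1
        home["wins"] += game["Target"]
        away["wins"] += 1 - game["Target"]
    return games


games = add_historical_stats([
    {"Season": 2020, "TEAM_NAME": "A", "TEAM_NAME.1": "B", "Target": 1},
    {"Season": 2020, "TEAM_NAME": "A", "TEAM_NAME.1": "B", "Target": 0},
])
```

In the first game both teams have no history, so both percentages are 0.0. In the second game team A's percentage is 1.0 because only its earlier win is counted, not the loss being recorded.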
Combines all steps into a single function to process the dataset without data leakage.
Steps:
- Sort the dataset chronologically by game date.
- Apply Elo rating calculation.
- Compute historical statistics.

Key Outputs:
- Processed dataset with Elo ratings and historical stats.

Code Snippet:
```python
def process_data_without_leakage(data):
    # Sort first so every later step only sees earlier games
    data = data.sort_values('Date').copy()
    data = calculate_elo_chronologically(data)
    data = calculate_historical_stats(data)
    return data


data_processed = process_data_without_leakage(df)
```
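
Chronological sorting only works if the dates compare correctly. The format of the `Date` column is not stated, so the sketch below assumes MM/DD/YYYY strings purely for illustration and shows why parsing before sorting matters; `sort_chronologically` is a hypothetical helper.

```python
from datetime import datetime


def sort_chronologically(games, date_key="Date", date_format="%m/%d/%Y"):
    """Return games ordered by parsed date rather than by raw string."""
    return sorted(games, key=lambda g: datetime.strptime(g[date_key], date_format))


games = [
    {"Date": "10/22/2019", "game": "opener"},
    {"Date": "1/5/2020", "game": "midseason"},
]
ordered = sort_chronologically(games)

# Sorting the raw strings would put the January 2020 game first.
naive = sorted(games, key=lambda g: g["Date"])
```

In pandas, the analogous step would be converting the column with `pd.to_datetime` before calling `sort_values`. ISO-formatted strings (YYYY-MM-DD) already sort correctly as text.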
This step applies Principal Component Analysis (PCA) and evaluates the model's performance in a cross-validation framework.
Steps:
- Split the data into training and validation sets using K-Fold cross-validation.
- Apply PCA to retain 95% of variance.
- Train a Random Forest classifier on the PCA-reduced data.
- Evaluate validation accuracy and store PCA feature importance.

Key Outputs:
- Validation scores (`cv_scores`)
- Number of components for 95% variance (`cv_n_components`)
- Feature importance from PCA (`cv_feature_importance`)

Code Snippet:
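```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from tqdm import tqdm

# `kf` (a KFold splitter), `X_scaled`, `target`, and `n_splits` are defined earlier
for fold, (train_idx, val_idx) in tqdm(enumerate(kf.split(X_scaled)), total=n_splits, desc="Cross-validation"):
    X_train = X_scaled[train_idx]
    X_val = X_scaled[val_idx]
    y_train = target.iloc[train_idx]
    y_val = target.iloc[val_idx]

    # Fit PCA on the training fold only, then apply it to the validation fold
    pca = PCA(n_components=0.95)
    X_train_pca = pca.fit_transform(X_train)
    X_val_pca = pca.transform(X_val)

    clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
    clf.fit(X_train_pca, y_train)
    val_score = clf.score(X_val_pca, y_val)
```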
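
Passing a float to `PCA(n_components=0.95)` asks scikit-learn to keep the fewest leading components whose cumulative explained variance passes that fraction. A dependency-free sketch of that selection rule, using made-up variance ratios; `components_for_variance` is a hypothetical helper, not a scikit-learn function:

```python
def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Smallest number of leading components whose cumulative ratio exceeds threshold."""
    cumulative = 0.0
    for n, ratio in enumerate(explained_variance_ratio, start=1):
        cumulative += ratio
        if cumulative > threshold:
            return n
    return len(explained_variance_ratio)


# Hypothetical variance ratios, sorted from largest to smallest.
ratios = [0.50, 0.25, 0.15, 0.06, 0.04]
n_components = components_for_variance(ratios)
```

Here the first three components explain 90% of the variance and the fourth brings the total to 96%, so four components are kept. Because the count depends on the training fold, it can differ between folds, which is presumably why it is stored per fold in `cv_n_components`.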
- Logistic Regression
- Random Forest Classifier
- XGBoost Classifier
- Voting Classifier
```python
# Get probabilities from each model
lr_proba = lr_pipeline.predict_proba(test[predictors])
rf_proba = rf.predict_proba(test[predictors])
xgb_proba = xgb.predict_proba(test[predictors])

# Average the probabilities (soft voting)
predictions_prob = (lr_proba + rf_proba + xgb_proba) / 3
predictions = (predictions_prob[:, 1] >= 0.5).astype(int)
```
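
Soft voting averages the class probabilities before thresholding, so one confident model can outweigh two lukewarm ones. A small sketch with plain lists and hypothetical probabilities; `soft_vote` is an illustrative name:

```python
def soft_vote(probabilities_per_model, threshold=0.5):
    """Average per-model [P(away win), P(home win)] rows and threshold the home column."""
    n_models = len(probabilities_per_model)
    n_games = len(probabilities_per_model[0])
    averaged, labels = [], []
    for game in range(n_games):
        p_home = sum(model[game][1] for model in probabilities_per_model) / n_models
        averaged.append([1 - p_home, p_home])
        labels.append(int(p_home >= threshold))
    return averaged, labels


# Hypothetical outputs of three models for two games.
lr = [[0.40, 0.60], [0.70, 0.30]]
rf = [[0.55, 0.45], [0.60, 0.40]]
xgb = [[0.30, 0.70], [0.45, 0.55]]
avg, preds = soft_vote([lr, rf, xgb])
```

In the first game the random forest alone would pick the away team (0.45), but the averaged home probability is about 0.58, so the ensemble predicts a home win.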
- Validates models across multiple NBA seasons.
- Performs calibration using `CalibratedClassifierCV` for improved probability estimates.

```python
def backtest(data, predictors, model, start=3, step=1):
    # Excerpt: construction of `seasons`, `train`, and `test` is not shown
    for i in range(start, len(seasons), step):
        model.fit(train[predictors], train['Home-Team-Win'])
        predictions, predictions_prob = get_predictions(model, test[predictors])
```
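
The excerpt does not show how each iteration splits the data. A minimal sketch of one plausible reading, an expanding window that trains on every season before index `i` and tests on season `i`; this split and the helper name `season_splits` are assumptions:

```python
def season_splits(seasons, start=3, step=1):
    """Yield (train_seasons, test_season) pairs for an expanding-window backtest."""
    ordered = sorted(seasons)
    for i in range(start, len(ordered), step):
        yield ordered[:i], ordered[i]


splits = list(season_splits([2018, 2019, 2020, 2021, 2022]))
```

With five seasons and `start=3`, the model would first train on 2018 to 2020 and test on 2021, then train on 2018 to 2021 and test on 2022. No test season is ever used for training before it is evaluated.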
This section adds a neural network to predict NBA game outcomes using TensorFlow/Keras. The model was trained on preprocessed NBA data, and its performance was evaluated using metrics like accuracy, F1 score, and ROC AUC.
Model Architecture:
- Input: Flattened feature set from preprocessed data.
- Hidden Layers:
  - Layer 1: 512 neurons, `ReLU6` activation, dropout 20%.
  - Layer 2: 256 neurons, `ReLU6` activation, dropout 20%.
  - Layer 3: 128 neurons, `ReLU6` activation.
- Output Layer: 2 units with Softmax activation, which gave better results than Sigmoid for classifying the outcome (home team win or loss).

Key Steps:
- Normalize the input data.
- Train-test split (80/20).
- Compile the model with a small learning rate (`0.00001`) to improve stability.
- Save the best weights during training using the `ModelCheckpoint` callback.
- Monitor validation loss for early stopping.

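
The compile and callback steps might look like the sketch below. The loss function, patience, epoch count, validation split, and checkpoint file name are all assumptions, since the source only gives the learning rate, the `ModelCheckpoint` callback, and the monitored quantity.

```python
TRAINING_CONFIG = {
    "learning_rate": 1e-5,                 # small step size, as in the Key Steps
    "monitor": "val_loss",                 # quantity watched by both callbacks
    "patience": 10,                        # assumed value, not given in the source
    "checkpoint_path": "best.weights.h5",  # assumed file name
}


def compile_and_fit(model, x_train, y_train, config=TRAINING_CONFIG, epochs=50):
    """Compile a Keras model and train it with checkpointing and early stopping."""
    import tensorflow as tf  # imported lazily so the config can be read without TF

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=config["learning_rate"]),
        # Assumed loss: integer labels (0/1) with a two-unit softmax output.
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    callbacks = [
        tf.keras.callbacks.ModelCheckpoint(
            config["checkpoint_path"],
            monitor=config["monitor"],
            save_best_only=True,
            save_weights_only=True,
        ),
        tf.keras.callbacks.EarlyStopping(
            monitor=config["monitor"],
            patience=config["patience"],
            restore_best_weights=True,
        ),
    ]
    return model.fit(
        x_train, y_train,
        validation_split=0.2,  # assumed: hold out part of the training data
        epochs=epochs,         # assumed upper bound; early stopping may end sooner
        callbacks=callbacks,
    )
```

Both callbacks watch the same quantity, so the saved weights and the early-stopping decision stay consistent with each other.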
Key Outputs:
- `Home_Prob`: Predicted probability of a home team win.
- `Target`: Actual game outcome (0 = away team win, 1 = home team win).
- Metrics: Accuracy, F1 score, ROC AUC.

Code Snippet:
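```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten

# Model architecture
model = Sequential([
    Flatten(input_shape=(x_train.shape[1],)),
    Dense(512, activation=tf.nn.relu6),
    Dropout(0.2),
    Dense(256, activation=tf.nn.relu6),
    Dropout(0.2),
    Dense(128, activation=tf.nn.relu6),
    Dense(2, activation=tf.nn.softmax)  # two classes: away win, home win
])
```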
Neural Network Progression:
The calibration curve evaluates the alignment between predicted probabilities and actual outcomes. A well-calibrated model's curve will align closely with the diagonal, indicating that predicted probabilities accurately reflect true probabilities.
Code Snippet:
```python
plot_calibration(results_df)
```
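
`plot_calibration` is defined elsewhere in the project and its internals are not shown. As a rough illustration of what a calibration curve plots, the sketch below bins predictions into equal-width probability ranges and compares the mean predicted probability with the observed win rate in each bin. The function name `calibration_points` and the sample data are hypothetical.

```python
def calibration_points(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, observed win rate) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for outcome, prob in zip(y_true, y_prob):
        index = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 goes in the last bin
        bins[index].append((prob, outcome))
    points = []
    for members in bins:
        if members:
            mean_prob = sum(p for p, _ in members) / len(members)
            win_rate = sum(o for _, o in members) / len(members)
            points.append((mean_prob, win_rate))
    return points


# Hypothetical predictions: Home_Prob values with the matching Target outcomes.
home_prob = [0.1, 0.3, 0.35, 0.7, 0.75, 0.9]
target = [0, 0, 1, 1, 0, 1]
points = calibration_points(target, home_prob)
```

Each point lies on the diagonal when the two numbers match. scikit-learn's `sklearn.calibration.calibration_curve` provides comparable binning for real use.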
Observation:
- The original features produce a smoother calibration curve, aligning well with the diagonal, indicating strong reliability in predicted probabilities.
- The PCA-reduced features yield a noisier curve, likely due to information loss during dimensionality reduction, but still provide reasonable alignment.
- The neural network demonstrates improved calibration over the PCA features, reflecting better probability predictions.
Calibration Curve Visualizations:
This section summarizes the evaluation metrics for models trained using the original features, PCA-reduced features, and the neural network. Metrics include accuracy, F1 score, and ROC AUC.
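Accuracy measures the proportion of correct predictions among all predictions:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP, TN, FP, and FN represent True Positives, True Negatives, False Positives, and False Negatives respectively.

F1 score is the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

where:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

AUC measures the area under the ROC curve, which plots True Positive Rate (TPR) against False Positive Rate (FPR):

$$\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}$$

The AUC can be calculated as:

$$\text{AUC} = \int_0^1 \text{TPR} \, d(\text{FPR})$$

For perfect classification:

$$\text{AUC} = 1$$
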
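
The three metrics can be computed from scratch on a handful of predictions, which makes the definitions concrete. The sketch below uses hypothetical labels and probabilities. The AUC is computed through its pairwise-ranking interpretation (the share of positive/negative pairs ordered correctly), which equals the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, fp, fn):
    """Assumes at least one predicted positive and one actual positive."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [p for t, p in zip(y_true, y_prob) if t == 1]
    negatives = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted home-win probabilities.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.3, 0.7, 0.2]
y_pred = [int(p >= 0.5) for p in y_prob]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
```

Accuracy and F1 depend on the 0.5 threshold used to build `y_pred`, while AUC only looks at how the probabilities are ordered. This is why a model can gain on one metric and lose on another.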
Original Features:
- The model performs reasonably on ranking (AUC 0.619) and balances precision and recall (F1 0.707).
- The calibration curve is smooth and closely aligned with the diagonal, indicating reliable probability predictions.

PCA-Reduced Features:
- Dimensionality reduction lowers accuracy (0.639 to 0.619) and AUC (0.619 to 0.584), likely because some predictive variance is discarded.
- F1 rises only marginally (0.707 to 0.715), so the gain does not offset the loss in ranking ability.

Neural Network:
- The neural network has the highest AUC of the three (0.7055, against 0.619 and 0.584), so it ranks games best.
- Its F1 score (0.7174) is also the highest, though only slightly above the PCA model's 0.715.
- Early stopping halts training once the validation loss stops improving, which limits overfitting.

Model | Accuracy | F1 Score | AUC | Calibration | Notes |
---|---|---|---|---|---|
Original Features | 0.639 | 0.707 | 0.619 | Smooth | Reliable ranking and probability prediction |
PCA-Reduced Features | 0.619 | 0.715 | 0.584 | Noisier | Improved F1 but reduced ranking ability |
Neural Network | 0.6599 | 0.7174 | 0.7055 | Improved | Best overall performance, especially in ranking |