NBA Data Analysis Documentation

Overview

This project involves building predictive models and performing exploratory data analysis (EDA) on NBA data to evaluate team and player performance. The workflow integrates machine learning techniques, statistical modeling, and feature engineering, with a focus on Elo ratings, mutual information, and PCA for feature selection.

Cross-validation with PCA: Applies dimensionality reduction using PCA to retain 95% of variance and evaluates a Random Forest classifier.
Elo Rating Calculation: Tracks team strength over time using a custom Elo system without data leakage.
Historical Statistics Calculation: Computes win percentages and game counts for teams up to the current game.

Code Structure

1. Elo Rating Calculation

Tracks team Elo ratings chronologically without looking ahead, ensuring no data leakage.

Steps:
1. Initialize all team ratings to a base value (e.g., 1500).
2. Adjust team ratings after each game based on the result and the expected score.
3. Carry over 75% of the previous season's Elo rating to the next season, if applicable.
Key Parameters:
- initial_elo: Starting Elo rating for new teams (default: 1500).
- k: Sensitivity factor for Elo updates (default: 20).
- home_advantage: Elo boost for the home team (default: 100).
Key Outputs:
- Elo_Team: Elo rating of the home team before the game.
- Elo_Team.1: Elo rating of the away team before the game.

Code Snippet:

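def calculate_elo_chronologically(data, initial_elo=1500, k=20, home_advantage=100):
    team_elos = {}  # latest rating per team (season carry-over omitted from this excerpt)
    for idx, row in data.iterrows():
        home_team = row['TEAM_NAME']
        away_team = row['TEAM_NAME.1']

        # Pre-game ratings; teams not seen yet start at initial_elo
        home_elo = team_elos.get(home_team, initial_elo)
        away_elo = team_elos.get(away_team, initial_elo)
        data.at[idx, 'Elo_Team'] = home_elo
        data.at[idx, 'Elo_Team.1'] = away_elo

        # Expected home score, including the home-court boost
        home_expected = 1 / (1 + 10 ** (-(home_elo - away_elo + home_advantage) / 400))
        home_win = row['Target']

        # Zero-sum update, applied only after the pre-game ratings are stored
        team_elos[home_team] = home_elo + k * (home_win - home_expected)
        team_elos[away_team] = away_elo + k * ((1 - home_win) - (1 - home_expected))
    return data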
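
To make the update rule concrete, the sketch below works one game through the expected-score formula with the default parameters, then applies a season carry-over. The source states only that 75% of the previous rating is carried over; blending the remaining 25% with initial_elo (regression toward the baseline) is an assumption, and the helper names expected_home_score, update_elo, and carry_over are hypothetical.

```python
def expected_home_score(home_elo, away_elo, home_advantage=100):
    """Expected score (win probability) for the home team."""
    return 1 / (1 + 10 ** (-(home_elo - away_elo + home_advantage) / 400))


def update_elo(home_elo, away_elo, home_win, k=20, home_advantage=100):
    """Return post-game (home, away) ratings; the update is zero-sum."""
    delta = k * (home_win - expected_home_score(home_elo, away_elo, home_advantage))
    return home_elo + delta, away_elo - delta


def carry_over(elo, initial_elo=1500, keep=0.75):
    """ASSUMPTION: keep 75% of the rating and regress the rest to initial_elo."""
    return keep * elo + (1 - keep) * initial_elo


# Two evenly rated teams: the home side is favoured only by home advantage.
p_home = expected_home_score(1500, 1500)
new_home, new_away = update_elo(1500, 1500, home_win=1)
print(round(p_home, 3), round(new_home, 1), round(new_away, 1))

# A 1600-rated team at the start of the next season under the assumed rule.
print(carry_over(1600))
```

With equal ratings the home team's expected score is about 0.64, so a home win moves each rating by roughly 7 points in opposite directions.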

2. Historical Statistics Calculation

Calculates win percentages and game counts for each team up to the current game without data leakage.

Steps:
1. Track team statistics (wins, losses, and total games) for each season.
2. For each game, store the historical win percentage for both teams before updating their stats.
Key Outputs:
- home_win_pct: Home team's win percentage before the game.
- away_win_pct: Away team's win percentage before the game.
- total_games: Total games played by the home team before the game.

Code Snippet:

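from collections import defaultdict

def calculate_historical_stats(data):
    data = data.copy()
    # season -> team -> running totals
    season_stats = defaultdict(lambda: defaultdict(lambda: {'wins': 0, 'games': 0}))
    for idx, row in data.iterrows():
        season = row['Season']  # assumed column name; the excerpt does not show it
        home_team = row['TEAM_NAME']
        away_team = row['TEAM_NAME.1']
        home_stats = season_stats[season][home_team]
        away_stats = season_stats[season][away_team]
        # Store pre-game values first so the current result never leaks in
        data.at[idx, 'home_win_pct'] = home_stats['wins'] / max(home_stats['games'], 1)
        data.at[idx, 'away_win_pct'] = away_stats['wins'] / max(away_stats['games'], 1)
        data.at[idx, 'total_games'] = home_stats['games']
        # Only now update the running totals with this game's outcome
        home_stats['games'] += 1
        away_stats['games'] += 1
        home_stats['wins'] += row['Target']
        away_stats['wins'] += 1 - row['Target']
    return data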
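
The "store before updating" rule is easiest to see on a single team's game log. The following is a minimal pandas sketch with made-up results; the column names win, games_before, wins_before, win_pct_before, and leaky_win_pct are hypothetical and not part of the project's data.

```python
import pandas as pd

# One team's results in chronological order (1 = win, 0 = loss).
log = pd.DataFrame({"win": [1, 0, 1, 1]})

# Games and wins strictly BEFORE each row: an exclusive running total.
log["games_before"] = range(len(log))
log["wins_before"] = log["win"].cumsum() - log["win"]
log["win_pct_before"] = log["wins_before"] / log["games_before"].clip(lower=1)

# For contrast: an inclusive running total already contains the current result.
log["leaky_win_pct"] = log["win"].cumsum() / (log["games_before"] + 1)

print(log)
```

The first row has no history, so its pre-game win percentage is 0, while the leaky column already reads 1.0 because it has seen the game it is supposed to help predict.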

3. Full Data Processing

Combines all steps into a single function to process the dataset without data leakage.

Steps:
1. Sort the dataset chronologically by game date.
2. Apply Elo rating calculation.
3. Compute historical statistics.
Key Outputs:
- Processed dataset with Elo ratings and historical stats.

Code Snippet:

def process_data_without_leakage(data):
    data = data.sort_values('Date').copy()
    data = calculate_elo_chronologically(data)
    data = calculate_historical_stats(data)
    return data

data_processed = process_data_without_leakage(df)
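
Every leakage guarantee above rests on the sort being truly chronological. If the Date column holds strings, sort_values orders them lexicographically, which can differ from calendar order. The sketch below illustrates this with made-up dates; the "%m/%d/%Y" format is an assumption about how the dates are stored.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["10/25/2022", "1/5/2023", "12/1/2022"]})

# As plain strings, "1/5/2023" sorts ahead of "10/25/2022".
string_order = df.sort_values("Date")["Date"].tolist()

# Parse first (the format is an assumption), then sort. Mergesort is stable,
# so games played on the same date keep their original relative order.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df = df.sort_values("Date", kind="mergesort").reset_index(drop=True)

print(string_order)
print(df["Date"].dt.strftime("%Y-%m-%d").tolist())
```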

4. Cross-validation with PCA

This step applies Principal Component Analysis (PCA) and evaluates the model's performance in a cross-validation framework.

Steps:
1. Split the data into training and validation sets using K-Fold cross-validation.
2. Apply PCA to retain 95% of variance.
3. Train a Random Forest classifier on the PCA-reduced data.
4. Evaluate validation accuracy and store PCA feature importance.
Key Outputs:
- Validation scores (cv_scores)
- Number of components for 95% variance (cv_n_components)
- Feature importance from PCA (cv_feature_importance)

Code Snippet:

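from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from tqdm import tqdm

# kf, n_splits, X_scaled, and target are assumed to be defined earlier
cv_scores, cv_n_components = [], []

for fold, (train_idx, val_idx) in tqdm(enumerate(kf.split(X_scaled)), total=n_splits, desc="Cross-validation"):
    X_train = X_scaled[train_idx]
    X_val = X_scaled[val_idx]
    y_train = target.iloc[train_idx]
    y_val = target.iloc[val_idx]

    # Fit PCA on the training fold only, then project the validation fold with it
    pca = PCA(n_components=0.95)
    X_train_pca = pca.fit_transform(X_train)
    X_val_pca = pca.transform(X_val)

    clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
    clf.fit(X_train_pca, y_train)
    val_score = clf.score(X_val_pca, y_val)

    cv_scores.append(val_score)
    cv_n_components.append(pca.n_components_)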
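
The loop above fits PCA inside each fold, but it starts from X_scaled; if the scaler was fitted on all rows, validation rows still influence the scaling statistics. One way to keep every preprocessing step inside the fold is a scikit-learn Pipeline. This is a sketch on synthetic data, not the project's code; X, y, and the step names are stand-ins.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real feature matrix and target are not shown here.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),  # keep enough components for 95% of variance
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Scaler, PCA, and classifier are all refit on the training part of each fold.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X, y, cv=kf, scoring="accuracy")

# Component count from a fit on all rows, for reference only.
n_components = pipeline.fit(X, y).named_steps["pca"].n_components_
print(cv_scores.mean(), n_components)
```

Because games are time-ordered, a chronological splitter such as scikit-learn's TimeSeriesSplit may be closer in spirit to the backtest than shuffled K-Fold; that is a design suggestion, not something the source does.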

5. Machine Learning Models

Models Used:

Logistic Regression
Random Forest Classifier
XGBoost Classifier
Voting Classifier

# Get probabilities from each model
lr_proba = lr_pipeline.predict_proba(test[predictors])
rf_proba = rf.predict_proba(test[predictors])
xgb_proba = xgb.predict_proba(test[predictors])

# Average the probabilities (soft voting)
predictions_prob = (lr_proba + rf_proba + xgb_proba) / 3
predictions = (predictions_prob[:, 1] >= 0.5).astype(int)
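
Averaging works because each predict_proba call returns an array of shape (n_samples, 2) whose columns follow the model's classes_ order. The sketch below uses made-up probabilities for three games to show the arithmetic and to contrast it with a majority (hard) vote; the numbers are illustrative only.

```python
import numpy as np

# Made-up probabilities for three games; columns are [P(away win), P(home win)].
lr_proba = np.array([[0.40, 0.60], [0.55, 0.45], [0.70, 0.30]])
rf_proba = np.array([[0.50, 0.50], [0.55, 0.45], [0.80, 0.20]])
xgb_proba = np.array([[0.30, 0.70], [0.10, 0.90], [0.60, 0.40]])

# Soft voting: average the probabilities, then threshold the home-win column.
predictions_prob = (lr_proba + rf_proba + xgb_proba) / 3
predictions = (predictions_prob[:, 1] >= 0.5).astype(int)

# Hard voting for comparison: each model casts one vote per game.
home = np.stack([lr_proba[:, 1], rf_proba[:, 1], xgb_proba[:, 1]])  # (models, games)
hard_votes = ((home >= 0.5).sum(axis=0) >= 2).astype(int)

print(predictions, hard_votes)
```

In the second game two models lean slightly toward the away team while one is very confident in the home team; soft voting follows the confident model, whereas a majority vote would not. scikit-learn's VotingClassifier with voting="soft" performs the same kind of averaging for estimators it fits itself.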

Backtesting Framework

Validates models across multiple NBA seasons.
Performs calibration using CalibratedClassifierCV for improved probability estimates.

Example Backtest:

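def backtest(data, predictors, model, start=3, step=1):
    seasons = sorted(data['Season'].unique())  # 'Season' column name is an assumption
    for i in range(start, len(seasons), step):
        train = data[data['Season'] < seasons[i]]   # every season before the test season
        test = data[data['Season'] == seasons[i]]   # the single held-out season
        model.fit(train[predictors], train['Home-Team-Win'])
        predictions, predictions_prob = get_predictions(model, test[predictors])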
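
The excerpt does not show how the training and test sets are formed for each iteration. A common reading of range(start, len(seasons), step) is an expanding window: train on every season before index i and test on season i. The sketch below illustrates that reading only; season_splits is a hypothetical helper and the season labels are made up.

```python
def season_splits(seasons, start=3, step=1):
    """Yield (train_seasons, test_season) pairs in chronological order.

    ASSUMPTION: expanding window, i.e. all earlier seasons form the training set.
    """
    seasons = sorted(seasons)
    for i in range(start, len(seasons), step):
        yield seasons[:i], seasons[i]


splits = list(season_splits([2018, 2019, 2020, 2021, 2022]))
for train_seasons, test_season in splits:
    print(f"train on {train_seasons} -> test on {test_season}")
```

With start=3 the first three seasons are never tested on, so the first model always has at least three seasons of history.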

6. Neural Network Implementation

This section adds a neural network to predict NBA game outcomes using TensorFlow/Keras. The model was trained on preprocessed NBA data, and its performance was evaluated using metrics like accuracy, F1 score, and ROC AUC.

Model Architecture:

Input: Flattened feature set from preprocessed data.
Hidden Layers:
- Layer 1: 512 neurons, ReLU6 activation, dropout 20%.
- Layer 2: 256 neurons, ReLU6 activation, dropout 20%.
- Layer 3: 128 neurons, ReLU6 activation.
Output Layer: 2 neurons with softmax activation, which gave better results than a sigmoid output for classifying the game outcome (home team win or loss).
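
For reference, a two-unit softmax and a single sigmoid unit describe the same family of probabilities: the softmax probability of class 1 equals the sigmoid of the difference between the two logits. Any gap the author observed would therefore most plausibly come from training dynamics rather than from expressiveness. The NumPy sketch below checks the identity on made-up logits and also spells out ReLU6; the function names and numbers are illustrative.

```python
import numpy as np


def relu6(x):
    """ReLU capped at 6: min(max(x, 0), 6)."""
    return np.minimum(np.maximum(x, 0.0), 6.0)


def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


logits = np.array([[0.2, 1.1], [2.0, -0.5]])  # made-up [away, home] logits
home_prob = softmax(logits)[:, 1]             # the value Home_Prob would hold

print(home_prob, sigmoid(logits[:, 1] - logits[:, 0]))
print(relu6(np.array([-1.0, 3.0, 9.0])))
```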

Key Steps:

Normalize the input data.
Train-test split (80/20).
Compile the model with a small learning rate (0.00001) to improve stability.
Save the best weights during training using the ModelCheckpoint callback.
Monitor validation loss for early stopping.

Key Outputs:

Home_Prob: Predicted probability of a home team win.
Target: Actual game outcome (0 = away team win, 1 = home team win).
Metrics: Accuracy, F1 score, ROC AUC.

Code Snippet:

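    import tensorflow as tf
    from tensorflow.keras.layers import Dense, Dropout, Flatten
    from tensorflow.keras.models import Sequential

    # Model architecture: three ReLU6 hidden layers, dropout after the first two
    model = Sequential([
        Flatten(input_shape=(x_train.shape[1],)),
        Dense(512, activation=tf.nn.relu6),
        Dropout(0.2),
        Dense(256, activation=tf.nn.relu6),
        Dropout(0.2),
        Dense(128, activation=tf.nn.relu6),
        Dense(2, activation=tf.nn.softmax)  # columns: [away win, home win]
    ])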
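
The Key Steps (small learning rate, checkpointing the best weights, early stopping on validation loss) could be wired up roughly as follows. This is a sketch on synthetic data: the Adam optimizer, the loss function, the patience value, the batch size, and the checkpoint file name are all assumptions, since the source does not state them.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential

# Synthetic stand-in for the normalized features and 0/1 targets.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(256, 8)).astype("float32")
y_train = (x_train[:, 0] > 0).astype("int32")

model = Sequential([
    tf.keras.Input(shape=(x_train.shape[1],)),
    Dense(512, activation=tf.nn.relu6),
    Dropout(0.2),
    Dense(256, activation=tf.nn.relu6),
    Dropout(0.2),
    Dense(128, activation=tf.nn.relu6),
    Dense(2, activation="softmax"),
])

# Integer labels with a 2-unit softmax pair with sparse categorical cross-entropy.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),  # optimizer is an assumption
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

callbacks = [
    # Hypothetical file name; keeps only the weights with the lowest val_loss.
    ModelCheckpoint("best.weights.h5", monitor="val_loss",
                    save_best_only=True, save_weights_only=True),
    # The patience value is an assumption.
    EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
]

# A few epochs only, to keep the sketch fast; validation_split stands in
# for whatever validation data the project uses.
history = model.fit(x_train, y_train, validation_split=0.2,
                    epochs=3, batch_size=32, callbacks=callbacks, verbose=0)

home_prob = model.predict(x_train, verbose=0)[:, 1]  # column 1 = home win
```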

Neural Network Progression:

7. Visualization and Evaluation

Calibration Curve

The calibration curve evaluates the alignment between predicted probabilities and actual outcomes. A well-calibrated model's curve will align closely with the diagonal, indicating that predicted probabilities accurately reflect true probabilities.

Code Snippet:

plot_calibration(results_df)
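
plot_calibration is the project's own helper and its body is not shown. As a rough illustration of what a calibration plot is built from, the sketch below bins predicted probabilities and compares each bin's mean prediction with the observed win rate. reliability_table is a hypothetical function and the data is made up.

```python
import numpy as np


def reliability_table(y_true, y_prob, n_bins=5):
    """Return (mean predicted prob, observed win rate, count) per non-empty bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index per prediction, using the interior edges only so that
    # probabilities of exactly 1.0 fall into the last bin.
    bin_ids = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            rows.append((y_prob[mask].mean(), y_true[mask].mean(), int(mask.sum())))
    return rows


y_prob = [0.10, 0.15, 0.45, 0.55, 0.90, 0.95]
y_true = [0, 0, 0, 1, 1, 1]
rows = reliability_table(y_true, y_prob)
for mean_pred, observed, count in rows:
    print(f"predicted {mean_pred:.3f}  observed {observed:.3f}  n={count}")
```

Plotting observed rate against mean prediction gives the curve; points on the diagonal mean the two agree. scikit-learn's calibration_curve in sklearn.calibration returns a comparable pair of arrays.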

Observation:

The original features produce a smoother calibration curve, aligning well with the diagonal, indicating strong reliability in predicted probabilities.
The PCA-reduced features yield a noisier curve, likely due to information loss during dimensionality reduction, but still provide reasonable alignment.
The neural network demonstrates improved calibration over the PCA features, reflecting better probability predictions.

Calibration Curve Visualizations:

Original Features:
PCA-Reduced Features:
NN Model (Original Features):

8. Performance Metrics

This section summarizes the evaluation metrics for models trained using the original features, PCA-reduced features, and the neural network. Metrics include accuracy, F1 score, and ROC AUC.

Accuracy

Accuracy measures the proportion of correct predictions among all predictions.

$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $$

where TP, TN, FP, and FN represent True Positives, True Negatives, False Positives, and False Negatives respectively.

F1 Score

F1 score is the harmonic mean of precision and recall:

$$ F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} $$

where:

$$ Precision = \frac{TP}{TP + FP} $$

$$ Recall = \frac{TP}{TP + FN} $$
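
The formulas above can be checked on a small confusion matrix. The counts below are made up and classification_metrics is a hypothetical helper; the second call shows what happens when a model predicts a home win for every game in a sample where 60% of games are home wins.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts for 100 games.
m = classification_metrics(tp=50, tn=20, fp=20, fn=10)

# Always predicting "home win" on a sample with 60 home wins and 40 away wins.
always_home = classification_metrics(tp=60, tn=0, fp=40, fn=0)

print(m)
print(always_home)
```

When the positive class is the majority, F1 can sit well above accuracy even for a trivial predictor, so F1 is best read alongside the base rate of home wins.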

AUC (Area Under the ROC Curve)

AUC measures the area under the ROC curve, which plots True Positive Rate (TPR) against False Positive Rate (FPR):

$$ TPR = Recall = \frac{TP}{TP + FN} $$

$$ FPR = \frac{FP}{FP + TN} $$

The AUC can be calculated as:

$$ AUC = \int_{0}^{1} TPR(FPR) \, dFPR $$

For perfect classification: $AUC = 1$
For random classification: $AUC = 0.5$
For worse than random: $AUC < 0.5$
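
The integral has an equivalent ranking interpretation: AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The pure-Python sketch below computes it that way on made-up scores; roc_auc is a hypothetical helper, not the project's evaluation code.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # A correctly ordered pair scores 1, a tie scores 0.5, a reversed pair 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


auc_mixed = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
auc_perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
auc_reversed = roc_auc([0, 0, 1, 1], [0.9, 0.8, 0.2, 0.1])
print(auc_mixed, auc_perfect, auc_reversed)
```

The pairwise form is quadratic in the number of examples, so it suits small checks like this one rather than full-season evaluation.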

Results & Analysis

Original Features

The model ranks games better than chance (AUC 0.619) and reaches an F1 score of 0.707 at an accuracy of 0.639.
The calibration curve is smooth and closely aligned with the diagonal, indicating reliable probability predictions.

PCA-Reduced Features

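Dimensionality reduction lowers accuracy (0.639 to 0.619) and AUC (0.619 to 0.584), likely because the discarded components carried some predictive signal.
The F1 score improves slightly (0.707 to 0.715), suggesting that the precision/recall balance on the positive class (home wins) is preserved despite the weaker ranking.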

Neural Network

The neural network outperforms both the original and PCA-reduced feature models on AUC (0.7055 versus 0.619 and 0.584), so it ranks games more reliably.
Its F1 score (0.7174) is the highest of the three, though only marginally above the PCA-reduced model (0.715).
Early stopping on validation loss halts training once the loss stops improving, which limits overfitting; the training and validation losses stabilize rather than diverge.

Summary of Results

Model	Accuracy	F1 Score	AUC	Calibration	Notes
Original Features	0.639	0.707	0.619	Smooth	Reliable ranking and probability prediction
PCA-Reduced Features	0.619	0.715	0.584	Noisier	Improved F1 but reduced ranking ability
Neural Network	0.6599	0.7174	0.7055	Improved	Best overall performance, especially in ranking

anramz29/NBA_Machine_Learning
