The objective was to look at data from the Indian Premier League, analyze the data and come up with the following:
- Top Batsmen and their performance metrics
- Top Bowlers and their performance metrics
- Build a model to predict the winner of an IPL match
IPL is a treasure trove of cricket data, so analyzing IPL results could well prove interesting. It's played in the exciting Twenty20 (T20) format. The 12th edition of the IPL concluded on May 12, 2019 with the Mumbai Indians winning the title.
IPL data on Kaggle was available only through 2017, so I turned to Cricsheet and downloaded a zip file covering all 12 seasons. The data, however, was in YAML format. Google Colab was used for its GPU, and the Kaggle API was used to download data from within Colab. There are 3 files in the GitHub repo:
- `README.md` (this file)
- `ipl_yaml_data_processing.ipynb` - converts the raw YAML data files into 2 csv files providing summary and detailed information
- `ipl_analysis.ipynb` - processes the 2 csv files into meaningful data, analysis, insights and a predictive model
- 756 YAML files were processed - one for each IPL match
- YAML file processing - `ruamel.yaml`
- Data Processing - `pandas`, `numpy`, `datetime`, `joblib`, `timeit`
- Data Visualization - `matplotlib.pyplot`, `matplotlib.gridspec`, `seaborn`
- Folder Operations - `pathlib`
- Modeling - `sklearn` (`model_selection`, `linear_model`, `tree`, `ensemble`, `neural_network`, `pipeline`, `preprocessing`, `impute`, `metrics`, `decomposition`)
- 2 csv files were created
  - one with match summaries, i.e. who played, who won, venue etc.
  - one with details - for each delivery, who the bowler and batsman were, how many runs were scored etc.
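
The split into a summary file and a delivery-level file can be sketched roughly as below. This is an illustration only, not the notebook's actual code: the `match` dict stands in for one parsed YAML file, and its key names (`info`, `innings`, `deliveries`, `runs`, `wicket`) are assumptions about the Cricsheet layout. The helper names `summarize` and `flatten_deliveries` are hypothetical.

```python
import csv
import io

# Stand-in for one parsed match file. In practice this dict would come from a
# YAML loader such as ruamel.yaml; the key names here are assumptions.
match = {
    "info": {
        "teams": ["Team A", "Team B"],
        "venue": "Example Stadium",
        "outcome": {"winner": "Team A"},
    },
    "innings": [
        {"1st innings": {"team": "Team A", "deliveries": [
            {0.1: {"batsman": "P1", "bowler": "B1",
                   "runs": {"batsman": 4, "extras": 0, "total": 4}}},
            {0.2: {"batsman": "P1", "bowler": "B1",
                   "runs": {"batsman": 0, "extras": 0, "total": 0},
                   "wicket": {"player_out": "P1", "kind": "bowled"}}},
        ]}},
    ],
}


def summarize(match_id, match):
    """Build one summary row (who played, who won, venue) for a match."""
    info = match["info"]
    team_1, team_2 = info["teams"]
    return {
        "match_id": match_id,
        "team_1": team_1,
        "team_2": team_2,
        "venue": info.get("venue"),
        # A match without a result has no winner, hence the chained .get calls.
        "winner": info.get("outcome", {}).get("winner"),
    }


def flatten_deliveries(match_id, match):
    """Build one row per delivery: bowler, batsman, runs, dismissal."""
    rows = []
    for innings_no, innings in enumerate(match["innings"], start=1):
        # Each innings entry is a one-key mapping, e.g. {"1st innings": {...}}.
        (body,) = innings.values()
        for delivery in body["deliveries"]:
            # Each delivery is a one-key mapping from ball label to details.
            ((ball, detail),) = delivery.items()
            rows.append({
                "match_id": match_id,
                "innings": innings_no,
                "batting_team": body["team"],
                "ball": ball,
                "batsman": detail["batsman"],
                "bowler": detail["bowler"],
                "runs_batsman": detail["runs"]["batsman"],
                "runs_total": detail["runs"]["total"],
                "player_out": detail.get("wicket", {}).get("player_out"),
            })
    return rows


summary_row = summarize(1, match)
delivery_rows = flatten_deliveries(1, match)

# Write the delivery rows as CSV (to an in-memory buffer for this sketch).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(delivery_rows[0]))
writer.writeheader()
writer.writerows(delivery_rows)
details_csv = buffer.getvalue()
```

Running the two helpers over every match file and concatenating the rows would give the two csv files described above.
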
Various features were created to enrich the raw data such as
- mapping names of various categories,
- Batting performance metrics such as batting averages, strike rates, top performances - 200s, 100s, 50s etc., boundary_rate, farm_rate, dot_rate, comparison across multiple time periods (i.e. all 12 seasons, the last 3 seasons or just the last 1 year)
- Bowling performance metrics such as bowling averages, strike rates, top performances - # wickets in a match, comparisons across time periods etc.
- Each Innings was divided into 4 quarters to see how momentum plays a part in victories
- The top 10 run-makers were identified
- Features such as Batting Strike Rates, Batting Economy Rates
- Who is the most valuable batsman in the IPL?
- Batsmen were analyzed along multiple parameters
- Batting performances were then analyzed over other time periods such as the last 3 years and only 2019 to identify top batsmen
- The top 10 wicket-takers were identified
- Features such as Bowling Strike Rates, Bowling Economy Rates
- Bowlers were analyzed along multiple parameters
- Is Malinga the best bowler in the IPL?
- Bowling performances were then analyzed over other time periods such as the last 3 years and only 2019 to identify top bowlers
- Similarly, players with the most catches, stumpings and run-outs were then identified
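
As a rough illustration of the batting and bowling metrics listed above, the sketch below computes them from a handful of made-up deliveries. The formulas for average, strike rate and economy are the standard cricket definitions. The definitions used for `boundary_rate` and `dot_rate` (share of balls faced that went for 4 or 6, or for no runs) are my assumption, as are the `Ball` record and the function names. Extras and run-outs are ignored to keep it short.

```python
from collections import namedtuple

# Hypothetical delivery record: runs are runs off the bat, wicket marks a
# dismissal of the batsman that is credited to the bowler.
Ball = namedtuple("Ball", "batsman bowler runs wicket")

balls = [
    Ball("P1", "B1", 4, False),
    Ball("P1", "B1", 0, False),
    Ball("P1", "B2", 6, False),
    Ball("P1", "B2", 0, True),
    Ball("P2", "B1", 1, False),
    Ball("P2", "B1", 0, False),
]


def ratio(numerator, denominator):
    """Divide, returning None when the metric is undefined (zero denominator)."""
    return numerator / denominator if denominator else None


def batting_metrics(balls, batsman):
    faced = [b for b in balls if b.batsman == batsman]
    runs = sum(b.runs for b in faced)
    outs = sum(b.wicket for b in faced)
    n = len(faced)
    return {
        "runs": runs,
        "average": ratio(runs, outs),            # runs per dismissal
        "strike_rate": ratio(100 * runs, n),     # runs per 100 balls faced
        "boundary_rate": ratio(sum(b.runs in (4, 6) for b in faced), n),
        "dot_rate": ratio(sum(b.runs == 0 for b in faced), n),
    }


def bowling_metrics(balls, bowler):
    bowled = [b for b in balls if b.bowler == bowler]
    runs = sum(b.runs for b in bowled)
    wickets = sum(b.wicket for b in bowled)
    n = len(bowled)
    return {
        "wickets": wickets,
        "average": ratio(runs, wickets),     # runs conceded per wicket
        "strike_rate": ratio(n, wickets),    # balls bowled per wicket
        "economy": ratio(6 * runs, n),       # runs conceded per over
    }


p1 = batting_metrics(balls, "P1")
b2 = bowling_metrics(balls, "B2")

# Rank batsmen by total runs and keep the top 10.
batsmen = {b.batsman for b in balls}
top_run_makers = sorted(
    batsmen, key=lambda name: batting_metrics(balls, name)["runs"], reverse=True
)[:10]
```

Restricting `balls` to a given season range before calling these functions would give the "last 3 seasons" or "2019 only" comparisons.
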
The base predictor would be 50% (you either win or lose a match, irrespective of how many runs you scored)
- How many runs were scored?
- How many wickets were lost?
- Could we break up an inning to see the performance by each quarter?
- The number of deliveries that couldn't be scored off (dot balls) could be another parameter
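
The questions above could translate into per-quarter features along the lines of the sketch below. It assumes a quarter is a block of 5 overs in a 20-over innings, which is my reading and may not be how the notebook splits an innings. The function name and the `(over, runs, is_wicket)` tuple layout are hypothetical.

```python
OVERS_PER_QUARTER = 5  # assumption: 20 overs split into four equal blocks


def quarter_features(deliveries):
    """Aggregate runs, wickets and dot balls for each quarter of an innings.

    deliveries: iterable of (over, runs, is_wicket) with 0-based over numbers.
    """
    features = {
        f"q{q}_{name}": 0
        for q in range(1, 5)
        for name in ("runs", "wickets", "dots")
    }
    for over, runs, is_wicket in deliveries:
        # Clamp to the 4th quarter in case an over number exceeds 19.
        q = min(over // OVERS_PER_QUARTER, 3) + 1
        features[f"q{q}_runs"] += runs
        features[f"q{q}_wickets"] += int(is_wicket)
        features[f"q{q}_dots"] += int(runs == 0)  # delivery not scored off
    return features


innings = [
    (0, 4, False),
    (0, 0, False),
    (4, 1, False),
    (5, 0, True),
    (12, 6, False),
    (19, 0, False),
    (19, 2, True),
]
features = quarter_features(innings)
```

Each innings then contributes twelve numeric columns (runs, wickets and dot balls for four quarters) to the modeling table.
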
No categorical variables were used in the model. Any missing values were imputed with SimpleImputer.
- 7 different classifiers were tested:
  - Logistic Regression
  - Support Vector Machines
  - Decision Tree
  - AdaBoost
  - RandomForest
  - Gradient Boost
  - MultiLayerPerceptron
- These models were put through a Pipeline with the following steps:
  - Simple Imputer to impute values for NaN
  - Standard Scaler to scale the values
  - PCA to reduce the variables to principal components that explain over 99% of the variance
  - Classifier, which was then grid searched to find the right model hyperparameters
- The models were then saved as pickle files
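
A minimal sketch of such a pipeline for one of the classifiers is shown below. It is not the notebook's code: the data is synthetic (`make_classification` with some values blanked out), the hyperparameter grid is a placeholder, and saving with `joblib.dump` is my assumption about how the pickle files were written.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the match features, with ~5% of values set to NaN.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=6, random_state=0
)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # fill NaN values
    ("scale", StandardScaler()),                  # zero mean, unit variance
    # A float n_components keeps enough components to explain that share
    # of the variance; svd_solver="full" supports this mode.
    ("pca", PCA(n_components=0.99, svd_solver="full")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Placeholder grid: the "clf__" prefix routes the parameter to the classifier step.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)

# Persist the best fitted pipeline and load it back.
model_path = os.path.join(tempfile.mkdtemp(), "logreg_pipeline.pkl")
joblib.dump(search.best_estimator_, model_path)
reloaded = joblib.load(model_path)
```

Swapping the `clf` step and its grid for each of the other classifiers would repeat the same procedure. Keeping the imputer, scaler and PCA inside the pipeline means they are refit on each cross-validation fold, so the held-out data does not influence the preprocessing.
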
A 75% test accuracy was obtained with very minor differences between the classifiers. Compared to the base predictor of 50%, a 75% accurate model is a big improvement.