Regression Analyses with Cricket Data

In week 1, we took a brief look at the match statistics of the Indian Premier League in 2018 (the IPL2018teams dataset). This week, we will look at player-level statistics. In particular, we are interested in whether player performance affects salaries.

Import useful libraries

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm

Import cricket data

In our data repository, there is a data set “IPL18Player.csv” which contains performance statistics as well as salary information of cricket players in the Indian Premier League in 2018.

In [ ]:
IPLPlayer=pd.read_csv("../../Data/Week 4/IPL18Player.csv")
IPLPlayer.head()

Data Exploration and Preparation

In [ ]:
IPLPlayer.shape

Missing Values

In [ ]:
IPLPlayer.info()

There are missing values in the salary variable. We will drop observations with missing values.

In [ ]:
IPLPlayer=IPLPlayer.dropna()
IPLPlayer.shape
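
Note that dropna() with no arguments removes every row that has a missing value in any column, not only in the salary column. If only missing salaries should be excluded, the subset argument restricts the check to that column. The following is a minimal sketch on a small made-up data frame; the player names and all numbers are hypothetical and are not taken from IPL18Player.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from IPL18Player.csv
toy = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "Salary": [100.0, np.nan, 250.0, 80.0],
    "runs": [320.0, 45.0, np.nan, 10.0],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)

# Drop rows with a missing value in ANY column (the default behaviour of dropna())
dropped_any = toy.dropna()

# Drop only the rows where Salary is missing
dropped_salary = toy.dropna(subset=["Salary"])

print(dropped_any.shape, dropped_salary.shape)
```

Comparing the two shapes shows how many observations are lost because of missing values in columns other than salary.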

Create useful variables

Create dummy variables to indicate the role of the players.

  • Create a variable to indicate whether a player had played as a batsman.

    The variable "innings" indicates how many innings a player had batted in.

In [ ]:
IPLPlayer['batsman']=np.where(IPLPlayer['innings']> 0, 1, 0)
IPLPlayer['batsman'].describe()
  • Create a variable to indicate whether a player had played as a bowler.
In [ ]:
IPLPlayer['bowler']=np.where(IPLPlayer['matches_bowled']> 0, 1, 0)
IPLPlayer['bowler'].describe()

The last type of player, not captured by either batsman or bowler, is the wicket keeper. In the dataset, the variable "matches_keeper" indicates the number of matches in which a player kept wicket.
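
A wicket keeper dummy could be built the same way as the batsman and bowler dummies. The sketch below uses a made-up data frame; only the column name matches_keeper comes from the dataset, while the new column name keeper and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column name matches_keeper comes from the dataset
toy = pd.DataFrame({
    "player": ["A", "B", "C"],
    "matches_keeper": [0, 14, 3],
})

# 1 if the player kept wicket in at least one match, 0 otherwise
# ("keeper" is a hypothetical column name)
toy["keeper"] = np.where(toy["matches_keeper"] > 0, 1, 0)
print(toy)
```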

Performance Measures

  1. batting average = runs / number of outs
  2. batting strike rate = (runs * 100) / balls faced
  3. bowling average = runs conceded / wickets taken
  4. bowling strike rate = number of balls bowled / wickets taken

Notice that if a batsman has scored runs but has never been dismissed, his batting average is technically infinite. Similarly, if a player did not face any balls, his batting strike rate would be undefined (zero runs divided by zero balls), and if a bowler did not take any wickets, his bowling average and bowling strike rate would be infinite or undefined.

We will not be able to run a regression when our variables have some infinite values.
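
To see the problem concretely, the sketch below divides runs by outs for three made-up players (all values are hypothetical). In pandas, dividing a positive number by zero yields inf and dividing zero by zero yields NaN, whereas adding 1 to the denominator keeps every value finite.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second player was never dismissed,
# the third player never batted
toy = pd.DataFrame({
    "runs": [300, 50, 0],
    "outs": [10, 0, 0],
})

# Plain ratio: 50/0 gives inf and 0/0 gives NaN
raw_average = toy["runs"] / toy["outs"]
print(raw_average.tolist())

# Adding 1 to the denominator keeps every value finite
adjusted_average = toy["runs"] / (toy["outs"] + 1)
print(adjusted_average.tolist())

# Check whether any infinite or missing values remain
print(np.isfinite(adjusted_average).all())
```

The same check with np.isfinite can be applied to any constructed ratio before it is used in a regression.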

There are two alternatives we will consider to deal with this issue.

  1. Add 1 to the number of outs, balls faced, and wickets taken in calculating the above variables.
  2. Instead of creating the above measures, we can simply include total runs, total number of outs, and balls faced to measure a batsman's performance, and include runs conceded, number of balls bowled, and wickets taken to measure a bowler's performance.
In [ ]:
IPLPlayer['outs']=np.where(IPLPlayer['batsman']==1, IPLPlayer['innings']-IPLPlayer['not_outs'], 0)
IPLPlayer['outs'].describe()

Create batting average, batting strike rate, bowling average, and bowling strike rate variables. Add 1 to the number of outs, balls faced, and wickets taken in calculating these variables.

In [ ]:
IPLPlayer['batting_average']=IPLPlayer['runs']/(IPLPlayer['outs']+1)
IPLPlayer['batting_strike']=IPLPlayer['runs']/(IPLPlayer['balls_faced']+1)*100
IPLPlayer['bowling_average']=IPLPlayer['runs_conceded']/(IPLPlayer['wickets']+1)
IPLPlayer['bowling_strike']=IPLPlayer['balls_bowled']/(IPLPlayer['wickets']+1)
In [ ]:
IPLPlayer['batting_average'].describe()
In [ ]:
IPLPlayer['batting_strike'].describe()
In [ ]:
IPLPlayer['bowling_average'].describe()
In [ ]:
IPLPlayer['bowling_strike'].describe()

Regression Analyses

First, let's run a regression of salary on the type of player: batsman, bowler, and all-rounder (a player who is both a batsman and a bowler).

In [ ]:
# batsman:bowler is the interaction term only; it is 1 for all-rounders
reg_IPL1=sm.ols(formula = 'Salary ~ batsman + bowler + batsman:bowler', data= IPLPlayer, missing="drop").fit()
print(reg_IPL1.summary())
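
As an aid to reading this kind of output: when the regressors are two dummies and their interaction, the OLS coefficients are simply differences in group means. The sketch below illustrates this with NumPy least squares rather than statsmodels, on made-up salaries; all numbers are hypothetical and are not from the dataset.

```python
import numpy as np

# Hypothetical salaries for four groups of players (two players per group)
batsman = np.array([0, 0, 1, 1, 0, 0, 1, 1])
bowler = np.array([0, 0, 0, 0, 1, 1, 1, 1])
salary = np.array([100., 200., 300., 500., 200., 400., 600., 1000.])

# Design matrix: intercept, batsman, bowler, batsman x bowler
X = np.column_stack([np.ones_like(salary), batsman, bowler, batsman * bowler])
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)

# Group means for comparison
mean_neither = salary[(batsman == 0) & (bowler == 0)].mean()
mean_bat = salary[(batsman == 1) & (bowler == 0)].mean()
mean_bowl = salary[(batsman == 0) & (bowler == 1)].mean()
mean_both = salary[(batsman == 1) & (bowler == 1)].mean()

print(coef)
print(mean_neither,
      mean_bat - mean_neither,
      mean_bowl - mean_neither,
      mean_both - mean_bat - mean_bowl + mean_neither)
```

The two printed lines agree up to floating-point error: the intercept is the mean salary of players who are neither batsman nor bowler, each dummy coefficient is the difference from that baseline, and the predicted salary of an all-rounder is the sum of all four coefficients.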

Next, we will focus on the performance of batsmen.

We will first simply use the total number of runs, number of not outs, and number of balls faced to measure players’ performance.

In [ ]:
reg_IPL2=sm.ols(formula = 'Salary ~ runs', data= IPLPlayer).fit()
print(reg_IPL2.summary())
In [ ]:
reg_IPL3=sm.ols(formula = 'Salary ~ runs+not_outs', data= IPLPlayer).fit()
print(reg_IPL3.summary())
In [ ]:
reg_IPL4=sm.ols(formula = 'Salary ~ runs+not_outs+balls_faced', data= IPLPlayer).fit()
print(reg_IPL4.summary())

In the next regressions, we will use the modified batting average and batting strike variables to measure player performance.

In [ ]:
reg_IPL5=sm.ols(formula = 'Salary ~ batting_average', data= IPLPlayer).fit()
print(reg_IPL5.summary())
In [ ]:
reg_IPL6=sm.ols(formula = 'Salary ~ batting_average+batting_strike', data= IPLPlayer).fit()
print(reg_IPL6.summary())

We will now turn to bowlers' performance.

Again, we will first use number of runs conceded, number of balls bowled, and number of wickets taken to measure bowlers' performance.

In [ ]:
reg_IPL7=sm.ols(formula = 'Salary ~ runs_conceded', data= IPLPlayer).fit()
print(reg_IPL7.summary())
In [ ]:
reg_IPL8=sm.ols(formula = 'Salary ~ runs_conceded+balls_bowled', data= IPLPlayer).fit()
print(reg_IPL8.summary())
In [ ]:
reg_IPL9=sm.ols(formula = 'Salary ~ runs_conceded+balls_bowled+wickets', data= IPLPlayer).fit()
print(reg_IPL9.summary())

In the next regression, we will use the modified bowling average and bowling strike variables to measure player performance.

In [ ]:
reg_IPL10=sm.ols(formula = 'Salary ~ bowling_average+bowling_strike', data= IPLPlayer).fit()
print(reg_IPL10.summary())

Lastly, we will incorporate performance measures of both batsmen and bowlers in the same regression.

We will first use the original variables in the regression: total number of runs, number of not outs, number of balls faced, number of runs conceded, number of balls bowled, and number of wickets.

In [ ]:
reg_IPL11=sm.ols(formula = 'Salary ~ runs+not_outs+balls_faced+runs_conceded+balls_bowled+wickets', data= IPLPlayer).fit()
print(reg_IPL11.summary())

We will also use the modified batting average, batting strike, bowling average, and bowling strike variables to measure the player performance.

In [ ]:
reg_IPL12=sm.ols(formula = 'Salary ~ batting_average+batting_strike+bowling_average+bowling_strike', data= IPLPlayer).fit()
print(reg_IPL12.summary())

Self Test

  • Run a regression of salary as a function of the interaction of batsman and runs and the interaction of bowler and wickets taken.
  • Interpret your regression results.
In [ ]: