Using Summary Statistics to Examine the "Hot Hand"¶

Import useful libraries and the updated shot log data¶

import pandas as pd
import numpy as np
import datetime as dt

Shotlog=pd.read_csv("../../Data/Week 6/Shotlog1.csv")
Player_Stats=pd.read_csv("../../Data/Week 6/Player_Stats1.csv")
Player_Shots=pd.read_csv("../../Data/Week 6/Player_Shots1.csv")
Player_Game=pd.read_csv("../../Data/Week 6/Player_Game1.csv")
Shotlog.head()

Conditional Probability¶

We can first calculate the probability of making the current shot conditional on having made the previous shot. $$\text{Conditional Probability}=\frac{\text{Probability of Making Consecutive Shots}}{\text{Probability of Making Previous Shot}}$$

We will need to create a variable that indicates a player made consecutive shots.

Shotlog['conse_shot_hit'] = np.where((Shotlog['current_shot_hit']==1)&(Shotlog['lag_shot_hit']==1), 1, 0) 
Shotlog.head()

We can create a player-level data frame. The average of the variable "conse_shot_hit" is the joint probability of making both the current and the previous shot. We will also calculate the average of "lag_shot_hit", which is the probability of making the previous shot.

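# Select multiple columns with a list (double brackets); tuple-style selection fails in recent pandas
Player_Prob=Shotlog.groupby(['shoot_player'])[['conse_shot_hit','lag_shot_hit']].mean()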
Player_Prob=Player_Prob.reset_index()
Player_Prob.rename(columns={'lag_shot_hit':'average_lag_hit'}, inplace=True)
Player_Prob.head()

Calculate conditional probability for each player¶

We can calculate the conditional probability by dividing the joint probability by the probability of making the previous shot.

Player_Prob['conditional_prob']=Player_Prob['conse_shot_hit']/Player_Prob['average_lag_hit']
Player_Prob.head()
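
As a quick sanity check on the formula, here is a minimal sketch on a made-up shot log for a single hypothetical player "A" (the data are invented for illustration, not taken from the shot log). It shows that dividing the joint probability by the probability of making the previous shot gives the same number as the hit rate computed only over shots that followed a made shot.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: eight shots by one made-up player
toy = pd.DataFrame({
    "shoot_player": ["A"] * 8,
    "current_shot_hit": [1, 0, 1, 1, 0, 1, 1, 0],
    "lag_shot_hit":     [0, 1, 0, 1, 1, 0, 1, 1],
})

# Same indicator as in the text: 1 only when both shots were made
toy["conse_shot_hit"] = np.where(
    (toy["current_shot_hit"] == 1) & (toy["lag_shot_hit"] == 1), 1, 0
)

joint = toy["conse_shot_hit"].mean()   # P(current made and previous made)
prev = toy["lag_shot_hit"].mean()      # P(previous made)
ratio = joint / prev                   # conditional probability via the formula

# Direct route: hit rate among shots taken right after a made shot
direct = toy.loc[toy["lag_shot_hit"] == 1, "current_shot_hit"].mean()

print(ratio, direct)
```

Both routes give the same value because the denominator of the ratio is exactly the share of rows kept by the filter.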

We can merge the "Player_Prob" data frame with the "Player_Stats" data frame we created earlier to compare the conditional probability with the unconditional probability. If the two probabilities are the same, or almost the same, then we fail to find evidence that making the current shot depends on making the previous shot.

Player_Stats=pd.merge(Player_Prob, Player_Stats, on=['shoot_player'])
Player_Stats.head(10)
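
Note that pd.merge() performs an inner join by default, so players that appear in only one of the two data frames are dropped. The sketch below uses two small made-up data frames (the player labels and values are hypothetical) to show this, and uses the "indicator" option of an outer merge to list the players that would not match.

```python
import pandas as pd

# Hypothetical player-level tables with partially overlapping players
prob = pd.DataFrame({
    "shoot_player": ["A", "B", "C"],
    "conditional_prob": [0.50, 0.40, 0.45],
})
stats = pd.DataFrame({
    "shoot_player": ["A", "B", "D"],
    "average_hit": [0.48, 0.42, 0.39],
})

# Default merge (how="inner") keeps only players found in both tables
merged = pd.merge(prob, stats, on=["shoot_player"])

# An outer merge with indicator=True flags where each row came from
checked = pd.merge(prob, stats, on=["shoot_player"], how="outer", indicator=True)
unmatched = checked.loc[checked["_merge"] != "both", "shoot_player"].tolist()

print(len(merged), unmatched)
```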

Let's first take a quick look at our "Player_Stats" data frame.

Player_Stats.info()

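Note that when we created the "conditional_prob" variable, some observations may have a missing value, since the "average_lag_hit" variable may be zero. A player who never made a previous shot also has a joint probability of zero, so the ratio is 0/0, which pandas records as missing (NaN). We will delete the observations with missing values in conditional probability.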

Player_Stats=Player_Stats[pd.notnull(Player_Stats["conditional_prob"])]
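
To see where the missing values come from, here is a small sketch with two made-up players, one of whom never made a previous shot. It assumes the same column names as above; dropna() with a "subset" argument is shown as an equivalent way to write the filter.

```python
import pandas as pd

# Hypothetical player-level data: player "B" never made a previous shot
toy = pd.DataFrame({
    "shoot_player": ["A", "B"],
    "conse_shot_hit": [0.25, 0.0],
    "average_lag_hit": [0.625, 0.0],
})

# 0.0 / 0.0 does not raise an error in pandas; it yields NaN
toy["conditional_prob"] = toy["conse_shot_hit"] / toy["average_lag_hit"]

# Equivalent to toy[pd.notnull(toy["conditional_prob"])]
kept = toy.dropna(subset=["conditional_prob"])

print(toy["conditional_prob"].isna().tolist(), len(kept))
```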

We can first check which players have the highest conditional probability, i.e., which players are most likely to make a shot right after making the previous one.

Let's sort the data by conditional probability.

Player_Stats.sort_values(by=['conditional_prob'], ascending=[False]).head(10)

Comparing the "conditional_prob" variable and the "average_hit" variable, some players have a slightly higher conditional probability but some also have a lower conditional probability.

We can sort the data by the value of difference between conditional and unconditional probabilities.

Player_Stats['diff_prob']=Player_Stats['conditional_prob']-Player_Stats['average_hit']
Player_Stats=pd.merge(Player_Stats, Player_Shots, on=['shoot_player'])
Player_Stats.sort_values(by=['diff_prob'], ascending=[False]).head(10)

We can see that Lamar Patterson has the largest difference between the two probabilities, at about 30 percentage points. However, the sample size for Patterson is quite small. For Joe Young and Damjan Rudez, we have about 80 observations each, and the difference in the probabilities is about 20 percentage points.

T-test for statistical significance on the difference¶

More rigorously, we can use a t-test to test whether the players' unconditional probability of making a shot is statistically significantly different from their conditional probability.

We need to choose a significance level before we perform the test. If the test produces a p-value less than the chosen significance level, then we say that there is a statistically significant difference between the two probabilities; otherwise, we fail to find evidence to support that the two probabilities are statistically significantly different from each other.

The most commonly used significance level is 0.05.

To perform a t-test, we need to import a new library, "scipy.stats."¶

import scipy.stats as sp

We can use the ttest_ind() function to calculate the test statistic.¶

# "sp" already refers to scipy.stats, so call ttest_ind on it directly
sp.ttest_ind(Player_Stats['conditional_prob'], Player_Stats['average_hit'])

The first number is the t-statistic and the second number is the p-value.

Note that the p-value for the t-test is about 0.10, which is higher than the conventional significance level of 0.05. Thus the conditional probability is not statistically significantly different from the average success rate. In other words, in the analysis of conditional probability, we fail to find evidence to support the "hot hand".
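
The sketch below shows how the two numbers returned by ttest_ind() can be read by name and compared with the significance level. The data are randomly generated stand-ins, not the actual player statistics, so the printed values say nothing about the hot hand. Because each player contributes one value to each column, a paired test (scipy.stats.ttest_rel) is a possible alternative; it is included here only as an assumption about what one might also try, not as the test used in the text.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data for 50 hypothetical players (not the real shot log)
rng = np.random.default_rng(0)
average_hit = rng.uniform(0.35, 0.55, size=50)
conditional_prob = average_hit + rng.normal(0.0, 0.05, size=50)

alpha = 0.05  # chosen significance level

# Independent two-sample t-test, as used in the text
ind = stats.ttest_ind(conditional_prob, average_hit)
print("independent:", ind.statistic, ind.pvalue, ind.pvalue < alpha)

# Paired t-test on the same players (alternative, shown for comparison only)
rel = stats.ttest_rel(conditional_prob, average_hit)
print("paired:", rel.statistic, rel.pvalue, rel.pvalue < alpha)
```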

Autocorrelation Coefficient¶

We can calculate the autocorrelation coefficient by calculating the correlation coefficient between the “current_shot_hit” variable and the “lag_shot_hit” variable.

Note: in pandas, you could use “autocorr(lag=1)” on a Series to calculate the first-order autocorrelation coefficient. This method is not very useful in our case, since we want to look at the autocorrelation coefficient within each game. Applied to the full shot log, it would treat the last shot of one game and the first shot of the next game as a pair.

Shotlog['current_shot_hit'].corr(Shotlog['lag_shot_hit'])

As we can see, the autocorrelation coefficient is negative, and its magnitude is very close to zero.
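
To illustrate why the lag should be built within each game, here is a minimal sketch on made-up data. The column name "game" is a hypothetical game identifier introduced for this example; the lagged variable in the shot log was created in an earlier step, and this sketch only assumes it was built in a similar within-game way.

```python
import pandas as pd

# Hypothetical toy data: one player, two games of three shots each
toy = pd.DataFrame({
    "shoot_player": ["A"] * 6,
    "game": [1, 1, 1, 2, 2, 2],   # made-up game identifier
    "current_shot_hit": [1, 1, 0, 0, 1, 1],
})

# Lag within each player-game: the first shot of every game has no previous shot
toy["lag_within_game"] = (
    toy.groupby(["shoot_player", "game"])["current_shot_hit"].shift(1)
)

# Correlation using only within-game pairs (rows with a missing lag are ignored)
within = toy["current_shot_hit"].corr(toy["lag_within_game"])

# autocorr(lag=1) also pairs the last shot of game 1 with the first shot of game 2
across = toy["current_shot_hit"].autocorr(lag=1)

n_within = int(toy["lag_within_game"].notna().sum())
print(n_within, within, across)
```

The two coefficients differ because the second calculation includes one extra pair that spans two games.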

Some players may have a “hot hand”, and hence a strong correlation between the outcomes of adjacent shots, while others may not. We can therefore also calculate the autocorrelation coefficient for each player.

Shotlog.groupby('shoot_player')[['current_shot_hit','lag_shot_hit']].corr().head(10)

We may not want to print out a 2 by 2 matrix for every player. We can use the "unstack()" command to reshape the data.¶

Autocorr_Hit=Shotlog.groupby('shoot_player')[['current_shot_hit','lag_shot_hit']].corr().unstack()
Autocorr_Hit.head()

Note that now each row represents a single player. But we still have duplicate information in the columns.

We can use the ".iloc" command to select the columns that we need.¶

In the iloc[,] command, we first specify the rows we want to select, then the columns, i.e., [rows, columns]
We want to select all rows, so we will have iloc[:,]
We only want to select the second column, which is indexed 1 (first column would be indexed 0, etc.)
So we will use the command iloc[:,1]

Lastly, we will also reset the index so that the player names would become a variable.

Autocorr_Hit=Shotlog.groupby('shoot_player')[['current_shot_hit','lag_shot_hit']].corr().unstack().iloc[:,1].reset_index()
Autocorr_Hit.head()

Notice that we still have two levels of variable names.

We can use the "get_level_values" command to reset the variable name to the first level (index 0).¶

Autocorr_Hit.columns=Autocorr_Hit.columns.get_level_values(0)
Autocorr_Hit.head()

Let's rename the variable capturing autocorrelation coefficient.

Autocorr_Hit.rename(columns={'current_shot_hit':'autocorr'}, inplace=True)
Autocorr_Hit.head()
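
The same player-level table can also be built in one step, without unstacking a correlation matrix. The sketch below is an alternative under stated assumptions, not the approach used above: it applies a small helper function (hypothetically named lag_corr) to each player's rows on made-up data.

```python
import pandas as pd

# Hypothetical toy data: two made-up players with four shots each
toy = pd.DataFrame({
    "shoot_player": ["A"] * 4 + ["B"] * 4,
    "current_shot_hit": [1, 0, 1, 0, 1, 1, 0, 0],
    "lag_shot_hit":     [0, 1, 0, 1, 1, 1, 0, 0],
})

def lag_corr(group):
    """Correlation between current and previous shot outcomes for one player."""
    return group["current_shot_hit"].corr(group["lag_shot_hit"])

# One correlation per player, returned as a flat two-column data frame
autocorr = (
    toy.groupby("shoot_player")[["current_shot_hit", "lag_shot_hit"]]
    .apply(lag_corr)
    .reset_index(name="autocorr")
)

print(autocorr)
```

This avoids the two-level column names, at the cost of calling a Python function once per player.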

How informative the autocorrelation coefficient is also depends on the number of shots per game for each player. Let's add the average number of shots per game to the autocorrelation data frame and sort the data by the size of the autocorrelation coefficient.¶

Player_Game_Shot=Player_Game.groupby(["shoot_player"])['shot_per_game'].mean().reset_index(name='avg_shot_game')
Player_Game_Shot.head()

Autocorr_Hit=pd.merge(Autocorr_Hit, Player_Game_Shot, on=['shoot_player'])
Autocorr_Hit.sort_values(by=['autocorr'], ascending=[False]).head(10)

We will merge the Player_Game_Shot data frame into the Player_Shots data frame, since both are measured at the player level and both contain information on the number of shots.

Player_Shots=pd.merge(Player_Shots, Player_Game_Shot, on=['shoot_player'])
Player_Shots.head()

Save updated data¶

Shotlog.to_csv("../../Data/Week 6/Shotlog2.csv", index=False)
Player_Stats.to_csv("../../Data/Week 6/Player_Stats2.csv", index=False)
Player_Shots.to_csv("../../Data/Week 6/Player_Shots2.csv", index=False)