We will use the 2016-2017 basketball shot log data to demonstrate how to test the hot hand.¶

Import useful libraries and the shot log data¶

Please note that the three lecture notebooks for this week must be run in order, because each later notebook relies on the output of the previous one.¶

import pandas as pd
import numpy as np

Shotlog=pd.read_csv("../../Data/Week 6/Shotlog_16_17.csv")
Shotlog.head()

Shotlog.shape

Data Preparation¶

Missing Values¶

Shotlog.info()

Let’s create some useful variables.¶

Create a dummy variable to indicate whether the current shot was a hit or a miss. The matching variable for the previous shot is created later as a lagged variable.

Shotlog['current_shot_hit'] = np.where(Shotlog['current_shot_outcome']=="SCORED", 1, 0)
Shotlog.head()
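As a minimal sketch of how the dummy is built, the toy example below uses made-up outcome labels. Only "SCORED" comes from the notebook; the "MISSED" label and the toy series are assumptions for illustration. It also shows that a boolean comparison cast to integer gives the same result as "np.where".

```python
import numpy as np
import pandas as pd

# Hypothetical outcomes; "MISSED" is an assumed label, only "SCORED" is used in the notebook
outcome = pd.Series(["SCORED", "MISSED", "SCORED", "MISSED"])

# Same pattern as in the notebook: 1 if the shot scored, 0 otherwise
hit_where = np.where(outcome == "SCORED", 1, 0)

# Equivalent alternative: cast the boolean comparison to integers
hit_astype = (outcome == "SCORED").astype(int)
```

Either form works; "np.where" is convenient when the two output values are not simply 1 and 0.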

Make sure the variable "date" is stored as a date type variable.

import datetime as dt
Shotlog['date']=pd.to_datetime(Shotlog['date'])

Convert the variable "time" to a timedelta type variable.
1. We will first prepend the hour (00) to the time variable, so that it is in the format 'HH:MM:SS';
2. We will use "to_timedelta", which works with variables that contain only time information (no date).

Shotlog['time'] = pd.to_timedelta('00:'+ Shotlog['time'])
Shotlog['time'].describe()
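The conversion can be illustrated on a few made-up clock strings. This is only a sketch: the values below are hypothetical and assume the raw column is in 'MM:SS' form, as the prepended hour suggests.

```python
import pandas as pd

# Hypothetical clock strings in 'MM:SS' form (not taken from the real data)
clock = pd.Series(["11:42", "05:07", "00:30"])

# Prepending '00:' yields 'HH:MM:SS', which pd.to_timedelta can parse
elapsed = pd.to_timedelta("00:" + clock)

# Timedeltas compare and sort as durations; here they are expressed in seconds
seconds = elapsed.dt.total_seconds()
```

Working with timedeltas rather than strings means that sorting by "time" orders shots by duration instead of alphabetically.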

Create lagged variable to indicate the result of the previous shot by the same player in the same game.
1. We will first sort the shot outcome by the quarter and time in the game;
2. We will group the data by player and game (date) and use the "shift" command to create a lag variable.

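# Sort by quarter and time, then shift within each player-game group.
# The shifted values are aligned back to Shotlog by index, so Shotlog itself is not reordered.
Shotlog['lag_shot_hit'] = (
    Shotlog.sort_values(by=['quarter', 'time'], ascending=[True, True])
    .groupby(['shoot_player', 'date'])['current_shot_hit']
    .shift(1)
)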
Shotlog.head()
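To see why the lag lands on the right rows even though the data frame itself is not reordered, here is a small sketch on an invented, deliberately shuffled toy data frame. The column names follow the notebook; the rows are assumptions.

```python
import pandas as pd

# Toy shot log with rows deliberately out of chronological order
toy = pd.DataFrame({
    "shoot_player": ["A", "B", "A", "A", "B"],
    "date": pd.to_datetime(["2016-10-25"] * 5),
    "quarter": [2, 1, 1, 1, 2],
    "time": pd.to_timedelta(
        ["00:03:00", "00:01:00", "00:05:00", "00:09:00", "00:02:00"]
    ),
    "current_shot_hit": [1, 0, 0, 1, 1],
})

# Sort, group by player and game, then shift by one shot within each group
lagged = (
    toy.sort_values(by=["quarter", "time"])
    .groupby(["shoot_player", "date"])["current_shot_hit"]
    .shift(1)
)

# Assignment aligns on the original index, so each row gets its own lag
toy["lag_shot_hit"] = lagged
```

Player A's shots in sorted order are rows 2, 3, and 0, so row 2 has no previous shot, row 3 inherits the miss from row 2, and row 0 inherits the hit from row 3.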

We can sort the shot log data by player, game(date), quarter, and time of the shot.¶

Shotlog.sort_values(by=['shoot_player', 'date', 'quarter', 'time'], ascending=[True, True, True, True])

Notice that for each player's first shot of a game, the lagged outcome variable has a missing value, because there is no previous shot in that game.

Let's create a dataframe for average success rate of players over the season.¶

Since the "current_shot_hit" variable is a dummy variable (=1 if hit, =0 if miss), the average of this variable would indicate the success rate of the player over the season.

Player_Stats=Shotlog.groupby(['shoot_player'])['current_shot_hit'].mean()
Player_Stats=Player_Stats.reset_index()
Player_Stats.head()
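As a quick check of the claim that the mean of a 0/1 dummy equals the success rate, the sketch below compares the group mean with hits divided by attempts on invented data.

```python
import pandas as pd

# Toy data: player A hits 3 of 4 shots, player B hits 1 of 2
toy = pd.DataFrame({
    "shoot_player": ["A", "A", "A", "A", "B", "B"],
    "current_shot_hit": [1, 0, 1, 1, 0, 1],
})

by_player = toy.groupby("shoot_player")["current_shot_hit"]

# Mean of the dummy per player
rate_from_mean = by_player.mean()

# Hits divided by attempts per player
rate_from_counts = by_player.sum() / by_player.count()
```

Both series give 0.75 for player A and 0.5 for player B.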

Let's rename the "current_shot_hit" variable in the newly created data frame as "average_hit".

Player_Stats.rename(columns={'current_shot_hit':'average_hit'}, inplace=True)

We will use the player statistics to analyze the hot hand, so we will merge the average player statistics dataframe back into the shot log dataframe.¶

Shotlog=pd.merge(Shotlog, Player_Stats, on=['shoot_player'])
Shotlog.head()

Create a variable to indicate the total number of shots recorded in the dataset for each player.

Player_Shots=Shotlog.groupby(['shoot_player']).size().reset_index(name='shot_count')

Player_Shots.sort_values(by=['shot_count'], ascending=[False]).head()

We should also note that players have different numbers of shots in each individual game. We will need to treat the data differently for a player who had only two shots in a game compared to a player who attempted 30 in a game.

Create a variable to indicate the number of shots in each game by each player.

Player_Game=Shotlog.groupby(['shoot_player','date']).size().reset_index(name='shot_per_game')
Player_Game.head()

We will merge the shot count data frames back to the shot log dataframe.¶

Shotlog=pd.merge(Shotlog, Player_Shots, on=['shoot_player'])
Shotlog=pd.merge(Shotlog, Player_Game, on=['shoot_player','date'])
display(Shotlog)
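One way to guard against accidental row duplication in merges like these is the "validate" argument of the pandas merge function. The sketch below runs on an invented toy shot log, not the course data.

```python
import pandas as pd

# Toy shot log: A takes two shots on one date and one on another; B takes one
shots = pd.DataFrame({
    "shoot_player": ["A", "A", "A", "B"],
    "date": pd.to_datetime(
        ["2016-10-25", "2016-10-25", "2016-10-27", "2016-10-25"]
    ),
    "current_shot_hit": [1, 0, 1, 1],
})

per_player = shots.groupby("shoot_player").size().reset_index(name="shot_count")
per_game = (
    shots.groupby(["shoot_player", "date"]).size().reset_index(name="shot_per_game")
)

# validate="many_to_one" raises an error if the right-hand keys are not unique
merged = shots.merge(per_player, on="shoot_player", validate="many_to_one")
merged = merged.merge(
    per_game, on=["shoot_player", "date"], validate="many_to_one"
)
```

Because each count table has exactly one row per key, the merged data frame keeps the same number of rows as the original shot log.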

We will sort the data again after merging.¶

# Assign the result back; sort_values returns a new dataframe and does not sort in place
Shotlog = Shotlog.sort_values(by=['shoot_player', 'date', 'quarter', 'time'], ascending=[True, True, True, True])

We will treat the "points" and "quarter" variables as objects.¶

Shotlog['points'] = Shotlog['points'].astype(object)
Shotlog['quarter'] = Shotlog['quarter'].astype(object)

Missing values¶

Drop observations with a missing value in the lagged variable.

Shotlog=Shotlog[pd.notnull(Shotlog["lag_shot_hit"])]
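A simple sanity check on this step is that the number of dropped rows should equal the number of player-game combinations, since only the first shot in each one lacks a lag. The sketch below illustrates this on invented data and shows that "dropna" with a "subset" gives the same result as the "notnull" filter.

```python
import pandas as pd

# Toy shot log, already in chronological order within each player-game
toy = pd.DataFrame({
    "shoot_player": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2016-10-25", "2016-10-25", "2016-10-27", "2016-10-25", "2016-10-25"]
    ),
    "current_shot_hit": [1, 0, 1, 0, 1],
})

keys = ["shoot_player", "date"]
toy["lag_shot_hit"] = toy.groupby(keys)["current_shot_hit"].shift(1)

# One missing lag per player-game combination
n_player_games = toy.groupby(keys).ngroups
n_missing = int(toy["lag_shot_hit"].isna().sum())

# Two equivalent ways to drop the rows with a missing lag
kept_notnull = toy[pd.notnull(toy["lag_shot_hit"])]
kept_dropna = toy.dropna(subset=["lag_shot_hit"])
```

Here three rows are dropped, one for each of the three player-game combinations, leaving two rows.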

Let's take a quick look at the number of variables and the number of observations in our clean dataframe.¶

Shotlog.shape

Save our updated data¶

Shotlog.to_csv("../../Data/Week 6/Shotlog1.csv", index=False)
Player_Stats.to_csv("../../Data/Week 6/Player_Stats1.csv", index=False)
Player_Shots.to_csv("../../Data/Week 6/Player_Shots1.csv", index=False)
Player_Game.to_csv("../../Data/Week 6/Player_Game1.csv", index=False)