In this week, we will use basketball data downloaded from NBA.com to demonstrate how to import data into Python, how to clean up data before conducting any data analyses, as well how to describe and summarize data.
import pandas as pd
import numpy as np
NBA_Teams=pd.read_csv("../../Data/Week 2/nba_teams.csv")
We can take a quick look at the data we imported by displaying the dataset.
display(NBA_Teams)
This dataset provides some basic information of the NBA teams.
For a dataset, each row represents an observation, i.e., a team in this dataset and each column represents a variable which contains information of a characteristics of the observation. A variable can take different values in different situations. The number of observation in a dataset represents the size of our sample and the number of variables represents the richness of information in our dataset.
NBA_Teams.shape
We can see that there are 30 observations (rows) and 8 variables (columns).
Inplace parameter
The first variable is unnamed, let's rename it to be "TEAM_NUMBER"; let's also rename "ID" to "TEAM_ID."
NBA_Teams.rename(columns={'Unnamed: 0':'TEAM_NUMBER'}, inplace=True)
NBA_Teams.rename(columns={'ID':'TEAM_ID'}, inplace=True)
display(NBA_Teams)
#Your Code Here
The variable "TEAM_NUMBER" has little meaning, let's drop it.
NBA_Teams.drop(['TEAM_NUMBER'], axis=1, inplace=True)
display(NBA_Teams)
Import the game level dataset from our data repository.
Games=pd.read_csv("../../Data/Week 2/basketball_games.csv")
Games.head()
Upon importing the game data, we notice that the first five games are not NBA games, instead, they are WNBA games. Indeed, this dataset contains NBA games, WNBA games, NBA 2K (simulation video) games.
Games.drop([0], axis=0, inplace=True)
Games.head()
For example, Las Vegas Aces is a women’s basketball team. If we are only going to focus on men’s basketball games, we will drop all the games played by Las Vegas Aces. In this case, we don’t have to use the “drop” function. We can specify our TEAM_NAME variables to be not equal to “Las Vegas Aces.”
Games=Games[Games.TEAM_NAME !="Las Vegas Aces"]
Games.head()
#Your Code Here
NBA_Games=pd.merge(NBA_Teams, Games, on=['TEAM_ID', 'TEAM_NAME'])
NBA_Games.head()
NBA_Games.columns
The variable “ABBREVIATION” AND “TEAM_ABBREVIATION” carry the same information and it is not necessary to keep both of them.
NBA_Games.drop(['ABBREVIATION'], axis=1, inplace=True, errors='ignore')
#Your Code Here
The merged dataset is sorted by the criteria we use to merge the datasets. Thus, the NBA_Games dataset is currently sorted by "TEAM_ID." We may be interested to sort the data by other criteria, for example, the date of the game.
In our dataset, "GAME_ID" is created based on the date of the game. We can sort the games by “GAME_ID” and display the 20 most recent games.
NBA_Games.sort_values(by=['GAME_ID'], ascending=[False]).head(20)
NBA_Games.info()
_The total number of rows is 18956, so there are missing values in variable WL, FG_PCT, FG3_PCT, and FTPCT.
We can use the isnull() function and the notnull() function to detect where the missing values are.
NBA_Games.notnull()
NBA_Games=NBA_Games[pd.notnull(NBA_Games["FG_PCT"])]
NBA_Games.shape
NBA_Games=NBA_Games.fillna(NBA_Games.mean())
NBA_Games.info()
We can create a variable equals to the total number of goals made.
NBA_Games['GM']=NBA_Games['FGM']+NBA_Games['FG3M']+NBA_Games['FTM']
#Your Code Here
For example, we can create a variable "RESULT" that equals to 1 if the team won the game and 0 otherwise. The result of the game can be captured in the points of the team receive, whether it was positive or negative.
NBA_Games['RESULT'] = np.where(NBA_Games['PLUS_MINUS']>0, 'W', 'L')
We will now drop this newly created "RESULT" variable.
NBA_Games.drop(['RESULT'], axis=1, inplace=True)
In the dataset, each game has two observations, one represents the statistics of the home team, one represents those of the away team. Both observations have the same GAME_ID. We can create a variable "POINT_DIFF" that equals the difference between the points earned by the two teams.
We will first sort the data not only by the "GAME_ID" but also by the result "WL".
NBA_Games.sort_values(['GAME_ID','WL'], inplace=True)
NBA_Games["POINT_DIFF"]=NBA_Games.groupby(["GAME_ID"])["PTS"].diff()
The "POINT_DIFF" variable only has the point difference for the winning team, we need to impute the point difference for the losing team as well.
NBA_Games['POINT_DIFF'] = NBA_Games['POINT_DIFF'].fillna(NBA_Games.groupby('GAME_ID')['POINT_DIFF'].transform('mean'))
NBA_Games=NBA_Games.dropna()
NBA_Games.shape
NBA_Team_Stats=NBA_Games.groupby(['TEAM_ID', 'SEASON_ID'])['PTS','FGM','FGA','FG_PCT','FG3M','FG3A','FG3_PCT','FTM','FTA','FT_PCT','OREB','DREB','REB','AST','STL','BLK','TOV','PF','PLUS_MINUS'].sum()
display(NBA_Team_Stats)
Notice that the newly created dataset has two levels of index, the "TEAM_ID" and "SEASON_ID"
NBA_Team_Stats=NBA_Team_Stats.reset_index()
display(NBA_Team_Stats)
NBA_Game_Count=NBA_Games.groupby(['TEAM_ID','SEASON_ID']).size().reset_index(name='GAME_COUNT')
display(NBA_Game_Count)
NBA_Games.to_csv("../../Data/Week 2/NBA_Games.csv", index=False)