Why Python?¶
- Open source, poweful, free
- Good for datascraping
- has a lot of libraries
Intro¶
I already know the basics of Pandas, so I haven’t explained here.
Variable Types¶
Qualitative | Quantitative | |
---|---|---|
Data Type | Categorical | Numerical |
Example | Gender - M, F Country Sport - Football, Basketball League - NFL, NBA | Number of goals, Points, Age, Salary |
Can be discrete/continuous random variables |
Convert categorical to dummy variable¶
One-Hot Encoding
dummy = (
pd
.get_dummies(
df,
columns = ['WL']
)
.rename(columns={
"WL_W": "Win"
})
)
df = (
df
.concat(
[df, dummy['Win']],
axis = 1
)
)
Summary Statistics¶
flowchart TB
ct[Central Tendancy] -->
Mean & Median & Mode
Variation -->
Variance & sd[Standard Deviation] & cv[Coefficient of Variation]
Coefficient of variation¶
- \(CV = \frac{\sigma}{\mu}\)
- \(0 \le CV \le 1\)
- useful for comparing variations of different measurement scale
Correlation Analyses¶
Helps understand relationship between 2 variables
Correlation \(\ne\) causal relationship.
Covariance¶
Measure of the joint variability of 2 random variables. $$ \sigma_{xy} = \text{cov}(x, y) = E\Bigg[ \Big( x-E(x) \Big) \Big( y-E(y) \Big) \Bigg] $$ The sign of the covariance shows the tendency of the linear relationship between the variables. We do not analyze the magnitude of covariance, as it depends on the unit of measurement.
Correlation coefficient¶
Summarizes the relationship between 2 variables. But doesn’t show the exact value change. $$ r = \frac{ \sigma_{x y} }{ \sigma_x \sigma_y } \
-1 \le r \le 1 $$
\(\sigma_{xy}\) or \(r\) | Conclusion |
---|---|
0 | linear-relationship non-existent some other relationship may/may not exist |
>0 | \(x \propto y\) |
<0 | \(x \propto \frac{1}{y}\) |