Why Python?¶

Open source, powerful, free
Good for data scraping
Has a lot of libraries

Intro¶

I already know the basics of Pandas, so I haven’t explained them here.

Variable Types¶

	Qualitative	Quantitative
Data Type	Categorical	Numerical
Example	Gender (M, F); Country; Sport (Football, Basketball); League (NFL, NBA)	Number of goals; Points; Age; Salary
		Can be discrete/continuous random variables
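
As a rough illustration of the table above, the sketch below separates qualitative and quantitative columns in pandas. The `players` DataFrame, its column names, and its values are made up for this example; they are not from any real dataset.

```python
import pandas as pd

# Hypothetical toy data: column names and values are invented for illustration.
players = pd.DataFrame({
    "league": ["NFL", "NBA", "NBA"],                    # qualitative / categorical
    "sport": ["Football", "Basketball", "Basketball"],  # qualitative / categorical
    "age": [27, 31, 24],                                # quantitative, discrete
    "salary": [1.5e6, 2.25e6, 0.9e6],                   # quantitative, continuous
})

# Store the qualitative columns with pandas' categorical dtype.
players = players.astype({"league": "category", "sport": "category"})

# Split the column names by kind of variable.
categorical_cols = players.select_dtypes(include="category").columns.tolist()
numerical_cols = players.select_dtypes(include="number").columns.tolist()

print(categorical_cols)
print(numerical_cols)
```

Note that pandas cannot tell discrete from continuous by itself: an integer dtype is only a hint, so that distinction still has to come from knowing the data.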

Convert categorical to dummy variable¶

One-Hot Encoding

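import pandas as pd

# One-hot encode the 'WL' column (creates 'WL_L' and 'WL_W'),
# then rename the win indicator to 'Win'.
dummy = (
  pd
  .get_dummies(
    df,
    columns = ['WL']
  )
  .rename(columns={
    "WL_W": "Win"
  })
)

# concat is a top-level pandas function, not a DataFrame method.
# Append only the 'Win' indicator to the original DataFrame.
df = (
  pd
  .concat(
    [df, dummy['Win']],
    axis = 1
  )
)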
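
To see what the encoding produces, here is a minimal self-contained sketch on a made-up game log. The `Team` column and the four rows are assumptions for illustration; only the `WL` and `Win` names come from the snippet above. Passing `dtype=int` is a choice made here so that the indicator is 0/1 rather than boolean.

```python
import pandas as pd

# Hypothetical game log; 'WL' mirrors the column name used above.
df = pd.DataFrame({
    "Team": ["A", "B", "A", "B"],
    "WL": ["W", "L", "L", "W"],
})

# Encode only the 'WL' column; dtype=int yields 0/1 instead of True/False.
dummy = (
    pd.get_dummies(df["WL"], prefix="WL", dtype=int)
    .rename(columns={"WL_W": "Win"})
)

# Keep the original 'WL' column and append the 0/1 'Win' indicator.
df = pd.concat([df, dummy["Win"]], axis=1)

print(df)
```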

Summary Statistics¶

flowchart TB

ct[Central Tendency] -->
Mean & Median & Mode

Variation -->
Variance & sd[Standard Deviation] & cv[Coefficient of Variation]
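
The measures in the flowchart can be computed directly with pandas. This is a small sketch on made-up numbers (goals scored in seven hypothetical matches), not a prescribed workflow.

```python
import pandas as pd

# Hypothetical sample: goals scored in seven matches (made-up numbers).
goals = pd.Series([2, 3, 3, 5, 7, 3, 5])

# Central tendency
mean = goals.mean()
median = goals.median()
mode = goals.mode().tolist()  # mode() returns a Series because ties are possible

# Variation: pandas defaults to the sample versions (ddof=1)
variance = goals.var()
std_dev = goals.std()

print(mean, median, mode, variance, std_dev)
```

Pass `ddof=0` to `var()` and `std()` if the population versions are wanted instead.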

Coefficient of variation¶

$CV = \frac{\sigma}{\mu}$
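$CV \ge 0$ when the mean is positive; it is not capped at 1 (for example, the values 0, 0, 0, 10 give $CV \approx 1.73$ with the population standard deviation).
Useful for comparing the variation of quantities measured on different scales, because it is unitless.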
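
A short sketch of why the ratio helps across measurement scales: the same made-up heights expressed in centimetres and in metres have very different standard deviations but the same CV. The helper name `coefficient_of_variation` and the data are assumptions for illustration, and the population standard deviation is used to match $\sigma$.

```python
import numpy as np

def coefficient_of_variation(values):
    """Return sigma / mu using the population standard deviation (ddof=0)."""
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean()

# Hypothetical data: the same heights in centimetres and in metres.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
heights_m = [h / 100 for h in heights_cm]

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

# The standard deviations differ by a factor of 100, but the CVs agree.
print(np.std(heights_cm), np.std(heights_m))
print(cv_cm, cv_m)
```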

Correlation Analyses¶

Helps us understand the relationship between 2 variables.

Correlation $\ne$ causal relationship.

Covariance¶

Measure of the joint variability of 2 random variables.

$$ \sigma_{xy} = \text{cov}(x, y) = E\Bigg[ \Big( x-E(x) \Big) \Big( y-E(y) \Big) \Bigg] $$

The sign of the covariance shows the tendency of the linear relationship between the variables. We do not analyze the magnitude of covariance, as it depends on the unit of measurement.
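
The expectation formula translates almost literally into NumPy. The sketch below uses made-up paired observations and the population form (dividing by $n$); it also rescales `x` to show how the magnitude depends on the unit.

```python
import numpy as np

# Hypothetical paired observations (made-up numbers).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Direct translation of E[(x - E(x)) (y - E(y))]: the population covariance.
cov_manual = np.mean((x - x.mean()) * (y - y.mean()))

# np.cov returns the 2x2 covariance matrix; ddof=0 matches the population formula.
cov_numpy = np.cov(x, y, ddof=0)[0, 1]

# Rescaling x by 100 (say, metres to centimetres) rescales the covariance too.
x_scaled = 100 * x
cov_rescaled = np.mean((x_scaled - x_scaled.mean()) * (y - y.mean()))

print(cov_manual, cov_numpy, cov_rescaled)
```

Note that `np.cov` and `DataFrame.cov` default to the sample covariance (dividing by $n - 1$), so the numbers differ slightly unless `ddof` is set.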

Correlation coefficient¶

Summarizes the strength and direction of the linear relationship between 2 variables, but does not show how much one variable changes when the other changes.

$$ r = \frac{ \sigma_{x y} }{ \sigma_x \sigma_y }, \qquad -1 \le r \le 1 $$
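
As a sketch of the formula, $r$ can be built from the covariance and the two standard deviations and checked against `np.corrcoef`. The data are made up; the last line rescales `x` to show that, unlike the covariance, $r$ does not depend on the unit.

```python
import numpy as np

# Hypothetical paired observations (made-up numbers).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# r = cov(x, y) / (sigma_x * sigma_y), using population quantities throughout.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r_manual = cov_xy / (x.std() * y.std())

# np.corrcoef returns the 2x2 correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]

# Changing the unit of x leaves r unchanged.
r_rescaled = np.corrcoef(100 * x, y)[0, 1]

print(r_manual, r_numpy, r_rescaled)
```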

$\sigma_{xy}$ or $r$	Conclusion
0	No linear relationship; some other (non-linear) relationship may or may not exist
>0	$y$ tends to increase as $x$ increases
<0	$y$ tends to decrease as $x$ increases
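
The zero row of the table is easy to misread, so here is a minimal made-up example where $y$ is completely determined by $x$ and yet the correlation is zero, because the dependence is not linear.

```python
import numpy as np

# Hypothetical data: x is symmetric around 0 and y depends on x exactly.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2  # a perfect, but non-linear, relationship

r = np.corrcoef(x, y)[0, 1]

# r is (numerically) zero even though knowing x tells us y exactly.
print(r)
```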
