
Why Python?

  • Open source, powerful, free
  • Good for data scraping
  • Has a lot of libraries

Intro

I already know the basics of Pandas, so I haven’t covered them here.

Variable Types

| | Qualitative | Quantitative |
|---|---|---|
| Data Type | Categorical | Numerical |
| Example | Gender (M, F); Country; Sport (Football, Basketball); League (NFL, NBA) | Number of goals, Points, Age, Salary; can be discrete/continuous random variables |

Convert categorical to dummy variable

One-Hot Encoding

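import pandas as pd

# One-hot encode the 'WL' column; the 'WL_W' indicator is renamed to 'Win'
dummy = (
  pd
  .get_dummies(
    df,
    columns = ['WL']
  )
  .rename(columns={
    "WL_W": "Win"
  })
)

# concat is a top-level pandas function, not a DataFrame method
df = (
  pd
  .concat(
    [df, dummy['Win']],
    axis = 1
  )
)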
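
To see what the encoding produces, here is a minimal sketch on a made-up game log. The `games` DataFrame and its values are assumptions for illustration only; `WL` mirrors the column name used above.

```python
import pandas as pd

# Hypothetical game log; 'WL' holds the result of each game
games = pd.DataFrame({
    "Team": ["A", "B", "A", "B"],
    "WL": ["W", "L", "L", "W"],
})

# get_dummies replaces 'WL' with one 0/1 indicator column per category
encoded = pd.get_dummies(games, columns=["WL"], dtype=int).rename(
    columns={"WL_W": "Win", "WL_L": "Loss"}
)

# One indicator is enough for a two-level variable: 'Loss' is just 1 - 'Win'
games["Win"] = encoded["Win"]
print(games)
```

Keeping a single indicator avoids carrying two perfectly redundant columns into a later regression.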

Summary Statistics

flowchart TB

ct[Central Tendency] -->
Mean & Median & Mode

Variation -->
Variance & sd[Standard Deviation] & cv[Coefficient of Variation]

Coefficient of variation

  • \(CV = \frac{\sigma}{\mu}\)
  • \(CV \ge 0\) when \(\mu > 0\); it has no upper bound and exceeds 1 whenever \(\sigma > \mu\)
  • Unitless, so it is useful for comparing variation across different measurement scales
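
The measures above can be computed in Pandas as in the sketch below. The `points` values are made up, and note one assumption: Pandas' `var()` and `std()` default to the sample versions (`ddof=1`), so pass `ddof=0` if the population \(\sigma\) is wanted.

```python
import pandas as pd

# Hypothetical points scored in five games
points = pd.Series([10, 12, 12, 15, 21])

# Central tendency
mean = points.mean()
median = points.median()
mode = points.mode().tolist()  # mode() returns a Series; ties give several values

# Variation (sample versions, ddof=1 by default in pandas)
variance = points.var()
std = points.std()
cv = std / mean  # coefficient of variation, unitless

print(mean, median, mode, variance, std, cv)
```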

Correlation Analyses

Helps understand the relationship between 2 variables.

Correlation \(\ne\) causal relationship.

Covariance

Measure of the joint variability of 2 random variables.

$$ \sigma_{xy} = \text{cov}(x, y) = E\Bigg[ \Big( x-E(x) \Big) \Big( y-E(y) \Big) \Bigg] $$

The sign of the covariance shows the tendency of the linear relationship between the variables. We do not analyze the magnitude of covariance, as it depends on the unit of measurement.
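
A minimal sketch of the definition in NumPy, using made-up `x` and `y` values. It computes the population covariance (dividing by \(n\)) and shows that rescaling the units of `x` changes the magnitude but not the sign.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Population covariance straight from the definition E[(x - E[x]) (y - E[y])]
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# np.cov returns the 2x2 covariance matrix; ddof=0 matches the population formula
cov_np = np.cov(x, y, ddof=0)[0, 1]

# Measuring x in different units (here multiplied by 100) scales the covariance by 100
x_scaled = 100 * x
cov_scaled = np.mean((x_scaled - x_scaled.mean()) * (y - y.mean()))

print(cov_xy, cov_np, cov_scaled)
```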

Correlation coefficient

Summarizes the strength and direction of the linear relationship between 2 variables, but does not show how much one variable changes when the other changes.

$$ r = \frac{ \sigma_{x y} }{ \sigma_x \sigma_y }, \qquad -1 \le r \le 1 $$

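| \(\sigma_{xy}\) or \(r\) | Conclusion |
|---|---|
| 0 | No linear relationship; some other relationship may or may not exist |
| >0 | Positive linear relationship: \(y\) tends to increase as \(x\) increases |
| <0 | Negative linear relationship: \(y\) tends to decrease as \(x\) increases |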
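
The sketch below computes \(r\) from the formula on made-up data. `pearson_r` is a helper written for this example, not a library function. The quadratic case illustrates the first row of the table: \(r = 0\) even though \(y\) is fully determined by \(x\).

```python
import numpy as np

# Hypothetical data, symmetric around zero
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_linear = 3.0 * x + 1.0   # exact positive linear relationship
y_quadratic = x ** 2       # exact, but non-linear, relationship


def pearson_r(a, b):
    """r = cov(a, b) / (sd(a) * sd(b)), using population (ddof=0) formulas."""
    cov_ab = np.mean((a - a.mean()) * (b - b.mean()))
    return cov_ab / (a.std() * b.std())


r_linear = pearson_r(x, y_linear)
r_quadratic = pearson_r(x, y_quadratic)

# np.corrcoef gives the same value as the hand-written helper
r_check = np.corrcoef(x, y_linear)[0, 1]

print(r_linear, r_quadratic, r_check)
```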

Notebooks