04 Sample Regression Function

Sample¶

Subset of Population

Useful, as we can never have access to the entire population

We will only have limited values of \(y\) for given value(s) of \(x\); we don’t have access to the true distribution of \(y\) for different values of \(x\)

SRF¶

Sample Regression Function

\[ \begin{aligned} \hat y_i &= \hat \beta_0 + \hat \beta_1 x_i \\ y_i &= \hat \beta_0 + \hat \beta_1 x_i + \hat u_i \end{aligned} \]

We try to make each of these hyperparameters close to their PRF counter-parts

\(\hat \beta_0\) is an estimator of \(\beta_0\) from the PRF
\(\hat \beta_1\) is an estimator of \(\beta_1\) from the PRF
\(\hat u_i\) is an approximation of \(u_i\) from the PRF
\(\hat y_i\) is the predicted value of dependent variable from sample data
\(y_i\) is the true value of dependent variable

We can use the SRF to approximate a variable’s distribution, by using statistical distributions, such as Poisson, \(t\), Normal, \(\chi^2\)

Problem¶

Over-estimation SRF estimate may be higher than the PRF estimate
Under-estimation SRF estimate may be lower than the PRF estimate

Issue with Sample Data¶

We will get different SRF for each sample

Perfect fit is impossible, due to Sample Fluctuations

The tendancy of each sample to be different from each other
It is basically impossible to avoid this

There is no way to say what best represents the PRF

Sample can be mis-representing You have to be careful about interpreting the results

Nature of random sample may cause SRF to over-estimate/under-estimate
Biased choosing of sample
- Researcher chooses a particular sample
- For eg
- I chose Ronaldo for 2^nd year study project
- Choosing students of high attendance only

Sampling Distribution¶

The distribution of \(\hat \beta_0, \hat \beta_1, \hat u_i\) for different samples

Why is it important?¶

Helps us understand which pair of \(\hat \beta_0, \hat \beta_1\) is the closest to the PRF \(\beta_0, \beta_1\)

Techniques to identify SRF¶

You don’t have to memorize the formula, but you should know what is happening; that way you can debug any errors

OLS
Maximum Likelihood Estimation

Autocorrelation¶

Correlation between values of the same variable

Usually used for time series data

We use ARIMA(AutoRegressive Integrated Moving Averages) Model It’s just values based on lagged values

Last Updated: 2023-01-25 ; Contributors: AhmedThahir