On this page, we discuss the details of linear regression. Simple linear regression is the most commonly used method for fitting a linear relationship between an observed response $y$ and an independent variable $x$. Suppose we observe $n$ data points $(x_i, y_i)$. In general, regression seeks a hypothesis $f$ such that

$$ y_i = f(x_i) + \epsilon_i, \qquad i = 1, \ldots, n, $$

where $\epsilon_i$ is an irreducible random error with mean $0$. It is commonly assumed to be Gaussian, a modeling choice often motivated by the central limit theorem.

In simple linear regression, we assume a linear form:

$$ y_i = \beta_0 + \beta_1 x_i + \epsilon_i. $$

Thus the regression function is a parametrized model $f(x; \beta_0, \beta_1)$.

For multiple predictors $x_{1i}, x_{2i}, \ldots, x_{pi}$, the multiple linear regression model becomes

$$ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \epsilon_i. $$

In vector form:

$$ y = X\beta + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I), $$

where $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times (p+1)}$, and $\beta \in \mathbb{R}^{p+1}$.
The intercept $\beta_0$ corresponds to the first column of ones:

$$ X = \begin{bmatrix} \mathbf{1} & x_1 & \cdots & x_p \end{bmatrix}. $$

Simple linear regression is a special case of this model.
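As a concrete illustration, here is a minimal NumPy sketch that builds such a design matrix (with the leading column of ones) and simulates data from the model; the coefficient and noise values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2
x = rng.normal(size=(n, p))            # p predictor columns
X = np.column_stack([np.ones(n), x])   # design matrix: leading column of ones
beta = np.array([1.0, 2.0, -0.5])      # hypothetical true coefficients
eps = rng.normal(scale=0.1, size=n)    # mean-zero Gaussian noise
y = X @ beta + eps                     # y = X beta + eps
```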


Assumptions of Linear Regression

Question

The true relationship between $y$ and $x$ is linear.

Run linear regression and plot fitted values vs. residuals. The residuals should be randomly scattered around $0$; a visible pattern indicates nonlinearity.

Question

Errors are independent: $\epsilon_i$ independent of $\epsilon_j$ for $i \ne j$.

Check for autocorrelation in the residuals, e.g. by plotting residuals against lagged residuals (or, for time-ordered data, a Durbin–Watson test).

Question

Homoscedasticity: constant variance of errors.

Plot fitted values vs. residuals. The spread should be roughly constant, giving a "tube" of even width around zero.

Question

Errors are Gaussian: $\epsilon_i \sim \mathcal{N}(0,\sigma^2)$.

Check histograms or QQ-plots of residuals.

Question

Errors are independent of predictors (no endogeneity).

Plot residuals or fitted values vs. the columns of $X$.

Question

No perfect collinearity among predictors.

Check condition numbers or variance inflation factors (VIF).
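Several of these checks can be probed numerically as well as visually. Below is a minimal sketch in NumPy on simulated data (illustrative only) that checks the residual-vs-fitted relationship and computes a variance inflation factor by hand:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
resid = y - fitted

# Residuals are orthogonal to the fitted values by construction,
# so their correlation is ~0; a large value would signal a broken fit.
print(np.corrcoef(fitted, resid)[0, 1])

# Variance inflation factor for predictor column j (j >= 1; skip the intercept):
# VIF_j = 1 / (1 - R_j^2), where R_j^2 is from regressing column j on the rest.
def vif(X, j):
    others = np.delete(X, j, axis=1)
    coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    ss_res = np.sum((X[:, j] - others @ coef) ** 2)
    ss_tot = np.sum((X[:, j] - X[:, j].mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 / (1.0 - r2)

print(vif(X, 1), vif(X, 2))  # near 1: the simulated predictors are uncorrelated
```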

These assumptions are implicitly conditioned on $X$.
With these in place, we now discuss how to solve linear regression.


Ordinary Least Squares

Regression seeks the “best” linear fit, commonly interpreted as minimizing the residual sum of squares:

$$ \mathrm{RSS}(\beta) = \| y - X\beta \|^2. $$

This yields the ordinary least squares (OLS) estimator.


Question
Show that the OLS solution is $\hat{\beta} = (X^\top X)^{-1} X^\top y$.
Solution

We minimize

$$ \min_{\beta} \| y - X\beta \|^2. $$

Assuming no perfect collinearity, $X$ has full column rank and $X^\top X$ is positive definite, hence invertible. Compute the gradient:

$$ \nabla_\beta \| y - X\beta \|^2 = -2X^\top(y - X\beta). $$

Setting to zero:

$$ X^\top X \beta = X^\top y, $$

so the unique minimizer is

$$ \hat{\beta} = (X^\top X)^{-1} X^\top y. $$

The minimal RSS is

$$ \| (I - X(X^\top X)^{-1}X^\top )y \|^2. $$
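A quick numerical sanity check of the closed form, using NumPy on simulated data (coefficients chosen arbitrarily): solving the normal equations and calling a generic least-squares routine yield the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([0.5, 1.0, -2.0, 3.0]) + 0.1 * rng.normal(size=n)

# Solve the normal equations X^T X beta = X^T y directly
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable in practice: least squares via SVD
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_ne, beta_ls))  # True: same minimizer
```

In practice one prefers `lstsq` (or a QR factorization) over forming $X^\top X$, which squares the condition number of the problem.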


Solution of Simple Linear Regression

In the simple linear case $X = [\mathbf{1} \;\; x]$, we compute:

$$ X^\top X = \begin{bmatrix} n & n\bar{x} \\ n\bar{x} & \sum_{i} x_i^2 \end{bmatrix}, $$

and its inverse:

$$ (X^\top X)^{-1} = \frac{1}{n \sum_i x_i^2 - n^2 \bar{x}^2} \begin{bmatrix} \sum_i x_i^2 & -n\bar{x} \\ -n\bar{x} & n \end{bmatrix}. $$

Compute $X^\top y$:

$$ X^\top y = \begin{bmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{bmatrix}. $$

Thus, we have

$$ \hat{\beta}_1 = \frac{n\sum_i x_i y_i - (\sum_i x_i)(\sum_i y_i)}{n\sum_i x_i^2 - (\sum_i x_i)^2} = \frac{s_{xy}}{s_x^2}, $$

where $s_{xy}$ is the sample covariance of $x$ and $y$ and $s_x^2$ is the sample variance of $x$,

and

$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. $$
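These closed-form expressions are easy to verify numerically. A minimal sketch on simulated data (true slope and intercept chosen arbitrarily), compared against `np.polyfit`:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 10.0, size=100)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=100)  # true intercept 2, slope 1.5

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)  # s_xy / s_x^2
b0 = ybar - b1 * xbar

slope, intercept = np.polyfit(x, y, 1)  # same least-squares line
```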


Simple Linear Regression: Mean Centering

Mean-centering replaces a variable by its deviation from its mean, e.g. $x' = x - \bar{x}$. Its effect on the fitted intercept (the slope $\hat{\beta}_1$ is unchanged):

  • Center $x$ only: $\hat{\beta}_0 = \bar{y}$.
  • Center $y$ only: $\hat{\beta}_0 = -\hat{\beta}_1 \bar{x}$.
  • Center both: $\hat{\beta}_0 = 0$.

Moreover, once $x$ is centered, the columns of $X$ are orthogonal, so $X^\top X$ is diagonal and

$$ \mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) = 0. $$
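The centering identities above can be checked directly. A small NumPy sketch with simulated data (parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=5.0, size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)

def fit(x, y):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return y.mean() - b1 * x.mean(), b1

b0, b1 = fit(x, y)
b0_cx, _ = fit(x - x.mean(), y)              # center x only:  intercept = ybar
b0_cy, _ = fit(x, y - y.mean())              # center y only:  intercept = -b1 * xbar
b0_cc, _ = fit(x - x.mean(), y - y.mean())   # center both:    intercept = 0
```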


Infinite Data Case

Linear regression solves

$$ \min_{a,b} \; \mathbb{E}[(Y - (aX + b))^2]. $$

The solution is

$$ \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)} \\ \mathbb{E}[Y] - a \mathbb{E}[X] \end{bmatrix}. $$

This corresponds to the finite-sample formulas when sample moments converge to population moments.
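A quick Monte Carlo check of this correspondence: with a large simulated sample standing in for "infinite data", the moment formulas recover the population coefficients (true values $a = 3$, $b = 1$ chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000                                  # large n stands in for "infinite data"
X = rng.normal(loc=1.0, scale=2.0, size=n)
Y = 1.0 + 3.0 * X + rng.normal(size=n)       # population a = 3, b = 1

a = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)   # Cov(X, Y) / Var(X)
b = Y.mean() - a * X.mean()                  # E[Y] - a E[X]
```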

Miscellaneous

Question
OLS minimizes the average squared vertical distance in the $y$-direction. In what situations would one want to replace this by the perpendicular distance, i.e. the geometric distance to the fitted line?
Solution
  • OLS theoretical basis:
  1. Geometrically, OLS projects $y$ onto the column space of $X$; the residual is orthogonal to the fitted subspace.

  2. Under the standard assumptions $\mathbb{E}[\varepsilon \mid X] = 0$ and $\mathrm{Var}(\varepsilon \mid X)=\sigma^2 I$, the Gauss–Markov theorem says that OLS is the best linear unbiased estimator (BLUE).

  3. If the noise is Gaussian, $\varepsilon \sim \mathcal{N}(0,\sigma^2 I)$, then OLS is also the maximum likelihood estimator. Minimizing squared vertical error is therefore justified when the randomness is modeled as noise in $y$, while the predictors $x$ are treated as fixed or measured without error.

We should replace vertical distance by perpendicular distance when this modeling assumption is no longer appropriate. This happens when:

  1. both $x$ and $y$ contain measurement error,
  2. no clear distinction between predictor and response,
  3. the goal is geometric line fitting rather than prediction of $y$ from $x$.

In these cases, an OLS regression of $y$ on $x$ gives a different line than a regression of $x$ on $y$. Minimizing perpendicular distances instead gives a total least squares (TLS) formulation, which is symmetric in the variables.

  • Example: fitting a physical law from noisy instrument measurements, where both coordinates are contaminated. In this case, minimizing vertical error can bias the slope, and perpendicular distance methods are more appropriate. If the noise levels in $x$ and $y$ are known but unequal, we can use Deming regression instead of TLS.
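The OLS-vs-TLS contrast can be demonstrated numerically. In the sketch below (simulated errors-in-variables data with equal noise levels in both coordinates, true slope $2$), the OLS slope is attenuated toward zero, while the TLS slope, obtained from the leading principal direction of the mean-centered data, stays near the true value:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
t = rng.normal(size=n)                    # latent variable
x = t + 0.3 * rng.normal(size=n)          # observed with noise in x
y = 2.0 * t + 0.3 * rng.normal(size=n)    # observed with noise in y (true slope 2)

# OLS slope (vertical distances): attenuated toward 0 by the noise in x
b_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# TLS slope (perpendicular distances): leading principal direction
# of the mean-centered data, via the SVD
Z = np.column_stack([x - x.mean(), y - y.mean()])
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
vx, vy = Vt[0]
b_tls = vy / vx

print(b_ols, b_tls)  # b_ols is biased below 2; b_tls stays close to 2
```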
Question
Other possible regression methods on top of linear regression (OLS)
Solution
  1. generalized linear models: Poisson regression, generalized additive models
  2. decision tree
  3. principal component regression
  4. group lasso
  5. Bayesian regression
Question
Show that the linear regression coefficients are independent of the variance estimator.
Solution

Consider the residual

$$ y - \hat{y} = (I-X(X^{\top}X)^{-1}X^{\top})y =: Qy, $$

where $Q$ is the orthogonal projection onto $\text{Null}(X^\top) = \text{Range}(X)^\perp$. Since $QX = 0$,

$$ Qy = Q(X\beta+\epsilon) = QX\beta + Q\epsilon = Q\epsilon, \qquad \text{so}\quad \mathrm{RSS} = \|y-\hat{y}\|^2 = \epsilon^\top Q \epsilon, $$

using that $Q$ is symmetric and idempotent.

Compute now

$$ \text{Cov}(\hat{\beta},\, y-\hat{y}) = \text{Cov}\big((X^\top X)^{-1}X^\top y,\ Q\epsilon\big). $$

We recall that $\text{Cov}(AU, BV) = A\,\text{Cov}(U,V)\,B^\top$; hence

$$ \text{Cov}(\hat{\beta},\, y-\hat{y}) = (X^\top X)^{-1}X^\top \text{Cov}(y, \epsilon)\, Q^\top = \sigma^2 (X^\top X)^{-1} X^\top Q^\top = 0, $$

because $\text{Cov}(y,\epsilon) = \sigma^2 I$ and $X^\top Q^\top = (QX)^\top = 0$. Here uncorrelated implies independent, since $\hat{\beta}$ and $y - \hat{y}$ are jointly normal.[^1]

Since $\hat{\beta}$ is independent of the residuals, it is independent of any function of them, including $\mathrm{RSS} = \|y - \hat{y}\|^2$, and hence of the variance estimator $s^2 = \mathrm{RSS}/(n - p - 1)$.[^2]
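The independence claim can be illustrated by simulation: across repeated draws of the Gaussian noise, the fitted slope and the residual sum of squares should be empirically uncorrelated. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 30, 5000
X = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
beta = np.array([1.0, 2.0])

slopes, rss = [], []
for _ in range(reps):
    y = X @ beta + rng.normal(size=n)        # fresh Gaussian noise each draw
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    slopes.append(b[1])
    rss.append(np.sum((y - X @ b) ** 2))

print(np.corrcoef(slopes, rss)[0, 1])  # near 0: slope and RSS are uncorrelated
```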


[^1]: This follows from the fact that jointly Gaussian random variables are independent if and only if they are uncorrelated.

[^2]: If $X$ and $Y$ are independent, then $f(X)$ and $g(Y)$ are independent for any Borel functions $f, g$.