Regression is a set of methods for estimating the relationship between an outcome variable and features.

When our input consists of $d$ features, we express our prediction $\hat{y}$ as

$$\hat{y} = w_1 x_1 + \dots + w_d x_d + b$$

Collecting all features into a vector $\mathbf{x} \in \mathbb{R}^d$ and all weights into a vector $\mathbf{w} \in \mathbb{R}^d$, we can express our model compactly using a dot product:

$$\hat{y} = \mathbf{w}^T\mathbf{x} + b$$
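As a minimal sketch with NumPy and made-up numbers (the values of $\mathbf{w}$, $\mathbf{x}$, and $b$ below are arbitrary), the explicit weighted sum and the dot-product form give the same prediction:

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])   # weights w_1, ..., w_d (illustrative values)
x = np.array([1.0, 3.0, 4.0])    # features x_1, ..., x_d of one example
b = 0.7                          # bias

# Explicit weighted sum: w_1 x_1 + ... + w_d x_d + b
y_hat_sum = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

# Equivalent dot-product form: w^T x + b
y_hat_dot = w @ x + b

assert np.isclose(y_hat_sum, y_hat_dot)
```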

The vector $\mathbf{x}$ corresponds to the features of a single data example. To represent the whole dataset we use $\mathbf{X} \in \mathbb{R}^{n\times d}$. Here $\mathbf{X}$ contains one row for every example and one column for every feature. The predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed as:

$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b$$
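In code, this means a single matrix-vector product predicts all $n$ examples at once. A sketch with synthetic data (the shapes and values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)          # made-up data for illustration
n, d = 4, 3
X = rng.normal(size=(n, d))             # design matrix: one row per example
w = np.array([2.0, -1.0, 0.5])          # weight vector
b = 0.7                                 # bias

# One matrix-vector product yields all n predictions;
# the scalar b is broadcast across the n entries.
y_hat = X @ w + b                       # shape (n,)
```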

The loss function quantifies the distance between the real and predicted values of the target. Using squared error for each example and summing over the dataset (with the bias $b$ absorbed into $\mathbf{w}$ by appending a constant feature of $1$ to each example), the loss is:

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$$
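A minimal sketch of computing this loss, again assuming NumPy and synthetic values for $\mathbf{X}$, $\mathbf{w}$, and $\mathbf{y}$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))             # made-up design matrix
w = np.array([2.0, -1.0, 0.5])          # made-up weights
y = rng.normal(size=4)                  # made-up targets

# Squared-error loss over the whole dataset: ||y - Xw||^2
residuals = y - X @ w
loss = np.sum(residuals ** 2)           # equivalently: np.linalg.norm(residuals) ** 2
```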

We need to find the $\mathbf{w}$ that minimizes the loss function. Setting the gradient of the loss with respect to $\mathbf{w}$ to zero yields the analytic solution:

$$\mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
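A sketch of the analytic solution on synthetic data (the generating weights and noise level below are assumptions). Note that solving the linear system $\mathbf{X}^T\mathbf{X}\,\mathbf{w} = \mathbf{X}^T\mathbf{y}$ is numerically preferable to forming the explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])          # ground-truth weights (assumed)
y = X @ w_true + 0.01 * rng.normal(size=n)   # targets with a little noise

# Normal equations: w = (X^T X)^{-1} X^T y,
# solved as a linear system rather than via an explicit inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(w_hat)   # close to w_true
```

On nearly noise-free data like this, `w_hat` recovers the generating weights almost exactly; with real data the solution is the least-squares fit.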