Least Mean Squares

Cost function

\[J(\theta) = \frac{1}{2} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) ^2\]
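
As a point of reference, here is a minimal NumPy sketch of this cost (the function name lms_cost is illustrative; it assumes $X$ is an $m \times (n+1)$ design matrix whose first column is all ones, so that $x_0 = 1$ absorbs the intercept):

```python
import numpy as np

def lms_cost(theta, X, y):
    """Half the sum of squared residuals, i.e. J(theta) above.

    Assumes X has shape (m, n+1) with a leading column of ones,
    y has shape (m,), and theta has shape (n+1,).
    """
    residuals = X @ theta - y           # h_theta(x^(i)) - y^(i) for every example
    return 0.5 * np.sum(residuals ** 2)
```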

Update rule:

\[\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)\]

Working out the partial derivative for the case of a single training example $(x, y)$:

\[\begin{aligned} \frac{\partial}{\partial \theta_j} J(\theta) &= \frac{\partial}{\partial \theta_j} \frac{1}{2} (h_\theta(x) - y)^2\\ &= 2 \cdot \frac{1}{2} (h_\theta(x) - y) \cdot \frac{\partial}{\partial \theta_j} (h_\theta(x) - y)\\ &= (h_\theta(x) - y) \cdot \frac{\partial}{\partial \theta_j} \bigg( \sum_{i=0}^n \theta_i x_i - y \bigg) \\ &= (h_\theta(x) - y) \, x_j \end{aligned}\]

This gives the LMS update rule for a single training example:

\[\theta_j := \theta_j - \alpha(h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}\]
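
A hedged sketch of running this update repeatedly over the training set (stochastic gradient descent); the step size alpha, the epoch count, and the zero initialization are illustrative choices, not prescribed here:

```python
import numpy as np

def lms_sgd(X, y, alpha=0.01, n_epochs=50):
    """Apply the single-example LMS update to each (x^(i), y^(i)) in turn."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in range(m):
            error = X[i] @ theta - y[i]        # h_theta(x^(i)) - y^(i)
            theta = theta - alpha * error * X[i]
    return theta
```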

Regularization

With $L_2$ regularization, $\theta_0$ is updated without the penalty term, while every other $\theta_j$ picks up an extra $\frac{\lambda}{m}\theta_j$:

\[\theta_0 := \theta_0 - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}\]

\[\theta_j := \theta_j - \alpha \bigg[ \bigg(\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \bigg) + \frac{\lambda}{m}\theta_j \bigg] \qquad (j \ge 1)\]
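
A sketch of a single batch step under the regularized rule; the split between $\theta_0$ and the remaining parameters mirrors the two equations above (lam stands for $\lambda$, and both it and alpha are illustrative values):

```python
import numpy as np

def regularized_step(theta, X, y, alpha=0.01, lam=1.0):
    """One batch gradient step with an L2 penalty; theta_0 is left unpenalized."""
    m = X.shape[0]
    errors = X @ theta - y                 # h_theta(x^(i)) - y^(i), shape (m,)
    grad = (X.T @ errors) / m              # (1/m) * sum_i error_i * x_j^(i), for each j
    penalty = (lam / m) * theta
    penalty[0] = 0.0                       # theta_0 is updated without the lambda/m term
    return theta - alpha * (grad + penalty)
```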

The normal equation

Alternatively, we can minimize $J$ in closed form by explicitly taking its derivatives with respect to the $\theta_j$'s and setting them to zero:

\[\theta = (X^T X)^{-1} X^T y\]

When implementing the normal equation, we want to use the pinv function rather than inv. The pinv function will give a value of $\theta$ even if $X^T X$ is not invertible.
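
A one-line NumPy sketch, using np.linalg.pinv in place of a plain matrix inverse as suggested above:

```python
import numpy as np

def normal_equation(X, y):
    """theta = pinv(X^T X) X^T y, with the Moore-Penrose pseudoinverse."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```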

If $X^T X$ is noninvertible, the common causes are:

  • Redundant features, where two features are very closely related (i.e. they are linearly dependent)
  • Too many features (e.g. $m \le n$, i.e. fewer training examples than features). In this case, delete some features or use regularization.

Evaluation Metrics

$R^2$ score:

\[R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^m (y_i - \hat{y_i})^2}{\sum_{i=1}^m (y_i - \bar{y})^2}\]

The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of $y$, disregarding the input features, would get an $R^2$ score of 0.0.
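
A small sketch of the formula above; in practice scikit-learn's sklearn.metrics.r2_score computes the same quantity:

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination R^2 as defined above."""
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# A constant prediction at the mean of y scores exactly 0.0, as noted above.
y = np.array([1.0, 2.0, 3.0, 4.0])
print(r2_score(y, np.full_like(y, y.mean())))  # 0.0
```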