# Linear regression

## Premise

- $X$ is a `matrix` with `m` rows and `n` columns (i.e. an $m \times n$ matrix); it represents the training set.
- $\theta$ is a $1 \times n$ `vector` of hypothesis parameters.
- $y$ is an $m \times 1$ `vector` of the training set's real (target) values.
- $\alpha$ is the `learning rate`, which controls the descent speed.
- $S(X_j)$ denotes the standard deviation of the $j$-th feature of the training set.

# 1. Hypothesis

\[h_{\theta}(X) = X \times \theta^T\]

The hypothesis maps the training set to predicted values.
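A minimal sketch of the hypothesis in NumPy (the notes don't specify a language; NumPy is an assumption), keeping the conventions above: `X` is $m \times n$ and `theta` is a $1 \times n$ row vector:

```python
import numpy as np

def hypothesis(X, theta):
    """h_theta(X) = X * theta^T -> an (m, 1) vector of predictions."""
    return X @ theta.T

# Hypothetical example: m = 3 samples, n = 2 features
# (first column is the usual bias term of ones)
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
theta = np.array([[0.5, 1.0]])   # 1 x n row vector
print(hypothesis(X, theta))      # (3, 1) column of predictions
```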

# 2. Cost

\[Cost(X^{(i)},y^{(i)})=[h_{\theta}(X^{(i)}) - y^{(i)}]^2\]

The cost for a single training example.

# 3. Cost function

\[\begin{aligned} J(\theta) &=\left(\frac{1}{2m}\right)\sum_{i=1}^m Cost(X^{(i)},y^{(i)}) \\ &=\left(\frac{1}{2m}\right)\sum_{i=1}^m[h_{\theta}(X^{(i)}) - y^{(i)}]^2 \end{aligned}\]

The cost function averages the single-example cost over the whole training set.
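The cost function above can be written vectorized in NumPy (a sketch, reusing the $h_{\theta}(X) = X\theta^T$ hypothesis; the example data is made up):

```python
import numpy as np

def cost_function(X, theta, y):
    """J(theta) = (1/2m) * sum((X theta^T - y)^2)."""
    m = X.shape[0]
    errors = X @ theta.T - y        # (m, 1) vector of prediction errors
    return (errors ** 2).sum() / (2 * m)

X = np.array([[1.0, 1.0], [1.0, 2.0]])
theta = np.array([[0.0, 1.0]])
y = np.array([[1.0], [2.0]])
print(cost_function(X, theta, y))  # predictions equal y here, so J = 0.0
```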

# 4. Get optimized parameter

Learn from training set to get optimized parameter for proposed algorithm.

### Gradient Descent

\[\begin{aligned} grad(j) &= \frac{\partial}{\partial \theta_j} J(\theta) \\ &= \frac{1}{m}\sum_{i=1}^m[(h_{\theta}(X^{(i)}) - y^{(i)}) X_{j}^{(i)}] \end{aligned}\] \[\theta_j := \theta_j - \alpha \times grad(j) \quad \text{Repeat many times}\]

More complicated to implement, but suitable for any scenario, including large training sets.
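The update rule above, sketched in NumPy (all $\theta_j$ are updated simultaneously in one vectorized step; `alpha` and the iteration count are assumed values, not from the notes):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Repeat theta_j := theta_j - alpha * grad(j) for every j at once."""
    m, n = X.shape
    theta = np.zeros((1, n))
    for _ in range(iterations):
        errors = X @ theta.T - y     # (m, 1): h_theta(X^(i)) - y^(i)
        grad = (errors.T @ X) / m    # (1, n): grad(j) for all j, vectorized
        theta = theta - alpha * grad
    return theta

# Hypothetical data following y = 1 + 2x (first column is the bias term)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([[1.0], [3.0], [5.0]])
print(gradient_descent(X, y))  # converges toward [[1., 2.]]
```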

### Normal equation

\[\theta = (X^{T}X)^{-1}X^{T}y\]

Convenient, but performance degrades once `m` grows beyond roughly 100,000.

It also fails when $X^{T}X$ is non-invertible.
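A one-line NumPy sketch of the normal equation. Using `np.linalg.pinv` (the pseudo-inverse) instead of a plain inverse is an assumption on my part; it sidesteps the non-invertible case mentioned above. The result is transposed to match the $1 \times n$ convention for $\theta$:

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y; pinv also copes with singular X^T X."""
    return (np.linalg.pinv(X.T @ X) @ X.T @ y).T  # 1 x n row vector

# Same hypothetical data as before: y = 1 + 2x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([[1.0], [3.0], [5.0]])
print(normal_equation(X, y))  # [[1., 2.]] up to rounding
```

Unlike gradient descent, this needs no learning rate and no iteration, which is why it is convenient for small `m`.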

## Feature scaling

\[X_j:=\frac{X_j - \mu}{a}\] \[\begin{aligned} a &= \max(X_j)-\min(X_j) \\ &\quad\text{or} \\ &= S(X_j) \end{aligned}\]

Here $\mu$ is the mean of feature $j$. Use feature scaling to normalize the training set; it makes gradient descent converge much faster.
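A sketch of the second variant above, $a = S(X_j)$, in NumPy (the example matrix is made up; note a constant bias column of ones must not be scaled, since its standard deviation is zero):

```python
import numpy as np

def scale_features(X):
    """Scale each feature to (X_j - mu) / S(X_j), column by column."""
    mu = X.mean(axis=0)     # per-feature mean
    sigma = X.std(axis=0)   # per-feature standard deviation S(X_j)
    return (X - mu) / sigma

# Two features on very different scales
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
print(scale_features(X))  # each column now has mean 0 and std 1
```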