Overview
1. LMS Algorithm
The Ordinary Least Squares Regression Model:
$$h_\theta(x) = \theta^T x$$

Cost Function:
$$J(\theta) = \frac{1}{2}\sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Gradient Descent Algorithm:
$$\theta := \theta - \alpha \frac{\partial}{\partial\theta} J(\theta)$$

LMS (least mean squares) update rule (also called the Widrow-Hoff learning rule):
$$\theta_j := \theta_j + \alpha \sum_{i=1}^m \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$$

Batch Gradient Descent vs. Stochastic Gradient Descent
```
# BGD
Repeat until convergence {
    theta = theta + alpha * sum_i((y_i - h_i) * x_i)
}

# SGD
Loop {
    for i = 1 to m {
        theta = theta + alpha * (y_i - h_i) * x_i
    }
}
```
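As a concrete illustration, here is a minimal NumPy sketch of both update rules; the toy data, learning rates, and iteration counts are illustrative assumptions, not from the notes:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.005, n_iters=2000):
    """BGD: each update uses the gradient summed over all m examples."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residual = y - X @ theta               # (y_i - h_i) for every example
        theta = theta + alpha * X.T @ residual
    return theta

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=50):
    """SGD: update theta after looking at each single example."""
    theta = np.zeros(X.shape[1])
    m = X.shape[0]
    for _ in range(n_epochs):
        for i in range(m):
            residual = y[i] - X[i] @ theta
            theta = theta + alpha * residual * X[i]
    return theta

# Illustrative data: y ≈ 1 + 2*x plus small Gaussian noise (assumed example)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.1, 100)
print(batch_gradient_descent(X, y))        # both should approach [1.0, 2.0]
print(stochastic_gradient_descent(X, y))
```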
Normal Equation Solution:
$$\theta = (X^TX)^{-1}X^TY$$
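A minimal NumPy sketch of the closed-form solution; the design matrix and targets below are an assumed toy example, and `np.linalg.solve` is used instead of an explicit inverse for numerical stability:

```python
import numpy as np

# Assumed toy design matrix X (intercept column included) and targets y
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Normal equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # ~ [intercept, slope]
```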
2. Probabilistic Interpretation
Predictive Probability Assumption: a Gaussian Distribution
$$p(y|x;\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y-\theta^Tx)^2}{2\sigma^2}\right), \qquad y\,|\,x;\theta \sim \mathcal{N}(\theta^Tx,\ \sigma^2)$$

Likelihood Function of $\theta$: the probability of the observed data $y$ (under the i.i.d. assumption)
$$L(\theta) = \prod_{i=1}^m p(y^{(i)}|x^{(i)};\theta) = \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)$$

Maximum Likelihood Method: choose $\theta$ to maximize $L(\theta)$, or equivalently the log-likelihood $\ell(\theta)$:
$$\ell(\theta) = \log L(\theta) = m\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^m \left(y^{(i)}-\theta^Tx^{(i)}\right)^2$$

$$\theta = \arg\max_\theta \ell(\theta) = \arg\min_\theta \frac{1}{2}\sum_{i=1}^m \left(y^{(i)}-\theta^Tx^{(i)}\right)^2$$
Hence the least-squares regression model corresponds to the maximum-likelihood estimate of $\theta$ under a Gaussian noise assumption on the data.
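To make the correspondence concrete, here is a small sketch (using SciPy; the toy data and fixed `sigma` are illustrative assumptions) that maximizes the Gaussian log-likelihood numerically and recovers the same $\theta$ as the normal-equation / least-squares solution:

```python
import numpy as np
from scipy.optimize import minimize

# Assumed toy data: y = 1 + 2*x + Gaussian noise
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(60), rng.uniform(-1, 1, 60)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.3, 60)
sigma = 0.3

def neg_log_likelihood(theta):
    # -log L(theta) under the Gaussian noise model (constant terms kept for clarity)
    resid = y - X @ theta
    return 0.5 * np.sum(resid ** 2) / sigma ** 2 + len(y) * np.log(np.sqrt(2 * np.pi) * sigma)

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)      # least-squares / normal equation

print(theta_mle, theta_ls)   # the two estimates agree (up to optimizer tolerance)
```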
3. Locally Weighted Linear Regression
Motivation: sidestep the problem of feature selection (a poor choice of features leads to underfitting or overfitting)
Parametric vs. Non-parametric learning algorithms (LWR is non-parametric: the entire training set must be kept around to make predictions, rather than a fixed set of parameters)
LWR algorithm:
When querying a certain point $x$:

- Fit $\theta$ to minimize $\sum_i w^{(i)}\left(y^{(i)}-\theta^Tx^{(i)}\right)^2$, where $w^{(i)} = \exp\left(-\frac{(x^{(i)}-x)^2}{2\tau^2}\right)$
- Output $\theta^Tx$

Hence, the (errors on) training examples close to the query point $x$ are given a much higher weight in determining $\theta$ (local linearity).
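A minimal NumPy sketch of a single LWR query on an assumed toy 1-D dataset; the data, bandwidth `tau`, and helper name `lwr_predict` are illustrative assumptions:

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Locally weighted linear regression prediction at one query point.

    X is an (m, n) design matrix (intercept column included), y an (m,)
    target vector; tau is the bandwidth controlling how fast weights decay.
    """
    # Gaussian weights: examples near x_query dominate the weighted fit
    diffs = X - x_query
    w = np.exp(-np.sum(diffs ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equation: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Illustrative data: a noisy sine curve (assumed example, not from the notes)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, 80))
X = np.column_stack([np.ones_like(x), x])
y = np.sin(x) + rng.normal(0, 0.1, 80)

print(lwr_predict(np.array([1.0, 3.0]), X, y, tau=0.3))   # ~ sin(3.0)
```

Note that a fresh local fit of $\theta$ is performed for every query point, which is why the full training set has to be retained.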