Overview
Gradient Descent
- (1) Concept
- (2) Notation
- (3) Gradient Descent in Linear Regression
- (4) Python with Gradient Descent
- (5) Summary
(1) Concept
It turns out that gradient descent is an algorithm that you can use to try to minimize any function.
Gradient descent is not limited to Linear Regression, nor to two dimensions: it can be applied to any (differentiable) function to find a minimum, local or global. Of course, how to find the global minimum efficiently is still an open research direction, and how to escape from a local minimum is another.
As the cost-function plot of f = wx + b in 【ML02】Cost Function shows, you descend step by step from an arbitrary starting point until you reach the point where the cost function is smallest, and that gives you suitable parameter values.
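To make this concrete, here is a minimal sketch of gradient descent on a one-variable toy function; the function J(w) = (w - 3)², the starting point, and the learning rate are illustrative choices, not anything from the course labs.

# Minimal sketch: gradient descent on the toy function J(w) = (w - 3)**2
def J(w):
    return (w - 3) ** 2

def dJ_dw(w):
    return 2 * (w - 3)            # derivative of (w - 3)**2

w = 10.0                          # arbitrary starting point
alpha = 0.1                       # learning rate (illustrative value)
for _ in range(50):
    w = w - alpha * dJ_dw(w)      # step downhill, against the derivative

print(w)                          # ends up very close to 3, the minimizer of J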
(2) Notation
$$w = w - \alpha\frac{d}{dw} J(w,b) \tag{1}$$
$$b = b - \alpha\frac{d}{db} J(w,b) \tag{2}$$
Let's look at the notation in the update for w piece by piece:
Learning rate
α: a small positive number between 0 and 1 that controls how big a step you take downhill.
α is the learning rate, usually a value between 0 and 1. It controls the size of each step: if you think of gradient descent as walking down a mountain, the learning rate is the length of each stride, so a larger α means a bigger drop per step and a smaller α means a smaller one.
Note that the learning rate should be neither too small nor too large. Too small, and gradient descent becomes painfully slow; too large, and a step can overshoot, as if momentum carried you back up the other side of the hill, so you never find the lowest point of the cost function J.
Derivative term
$\frac{d}{dw} J(w,b)$: the derivative term of the function J, which tells you in which direction to take a baby step.
$\frac{d}{dw} J(w,b)$ is called the derivative term of the function J. Derivatives immediately bring slopes and tangent lines to mind. The function J here is simply the cost function, and the main job of its derivative is to indicate whether the tangent slope is positive or negative, and therefore in which direction you should take a small step.
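As a quick illustration of both points, the sign of the derivative picking the direction and α setting the step size, here is a minimal sketch on a toy cost J(w) = w²; the function, the starting point, and the two learning rates are illustrative choices, not values from the course.

# One update step w = w - alpha * dJ/dw, repeated on the toy cost J(w) = w**2
def dJ_dw(w):
    return 2 * w                 # derivative of w**2; its sign points uphill

w0 = 5.0                         # start to the right of the minimum at w = 0:
                                 # the derivative is positive, so the update moves w left (downhill)
for alpha in (0.1, 1.1):         # a reasonable learning rate vs. one that is too large
    w = w0
    for _ in range(10):
        w = w - alpha * dJ_dw(w)
    print(f"alpha={alpha}: w after 10 steps = {w:.3f}")

# With alpha = 0.1 the iterate shrinks toward the minimum at 0;
# with alpha = 1.1 every step overshoots and |w| keeps growing (the "pushed back up the hill" case).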
(3) Gradient Descent in Linear Regression
3.1 Linear Regression Model
$$f_{w,b}(x) = wx + b$$
3.2 Cost Function
$$J(w,b) = \frac{1}{2m} \sum\limits_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$$
3.3 Gradient Descent Algorithm
$$w = w - \alpha\frac{d}{dw} J(w,b) = w - \alpha\frac{1}{m} \sum\limits_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}$$
$$b = b - \alpha\frac{d}{db} J(w,b) = b - \alpha\frac{1}{m} \sum\limits_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$
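These two gradient expressions come straight from applying the chain rule to the cost function in 3.2:

$$\frac{d}{dw} J(w,b) = \frac{1}{2m} \sum\limits_{i=1}^{m} 2\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)\cdot x^{(i)} = \frac{1}{m} \sum\limits_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}$$

$$\frac{d}{db} J(w,b) = \frac{1}{2m} \sum\limits_{i=1}^{m} 2\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)\cdot 1 = \frac{1}{m} \sum\limits_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$

since $\frac{d}{dw} f_{w,b}(x^{(i)}) = x^{(i)}$ and $\frac{d}{db} f_{w,b}(x^{(i)}) = 1$; the factor of 2 cancels the 1/2 in the cost, which is exactly why that convenience factor is included in J in the first place.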
3.4 Batch gradient descent
“Batch”: Each step of gradient descent uses all the training examples.
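To emphasize that a single step really does touch all m training examples, here is a minimal vectorized sketch of one batch update with NumPy; the function name batch_step, the tiny dataset, and the learning rate are illustrative assumptions, and the loop-based version appears in section (4) below.

import numpy as np

def batch_step(x, y, w, b, alpha):
    # One batch gradient-descent step: the gradient is averaged over ALL m examples
    m = x.shape[0]
    err = (w * x + b) - y            # prediction error for every training example at once
    dj_dw = (err * x).sum() / m      # (1/m) * sum(err_i * x_i)
    dj_db = err.sum() / m            # (1/m) * sum(err_i)
    return w - alpha * dj_dw, b - alpha * dj_db

# Illustrative call on a two-example dataset
x = np.array([1.0, 2.0])
y = np.array([300.0, 500.0])
w, b = batch_step(x, y, w=0.0, b=0.0, alpha=1.0e-2)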
(4) Python with Gradient Descent
4.1 Importing packages and the dataset
import math, copy
import numpy as np
import matplotlib.pyplot as plt
x_train = np.array([1.0, 2.0]) #features
y_train = np.array([300.0, 500.0]) #target value
4.2 Computing the cost function
$$J(w,b) = \frac{1}{2m} \sum\limits_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$$
# Function to calculate the cost
def compute_cost(x, y, w, b):
    m = x.shape[0]
    cost = 0
    for i in range(m):
        f_wb = w * x[i] + b                # model prediction for example i
        cost = cost + (f_wb - y[i])**2     # accumulate the squared error
    total_cost = 1 / (2 * m) * cost        # average over m examples, with the 1/2 factor
    return total_cost
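A quick sanity check of compute_cost on the two training points from 4.1; the parameter values tried here are illustrative, and the costs in the comments follow directly from the formula above.

print(compute_cost(x_train, y_train, w=100, b=100))   # 12500.0: predictions 200 and 300 miss the targets
print(compute_cost(x_train, y_train, w=200, b=100))   # 0.0: the line 200x + 100 passes through both points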
4.3 Derivatives of the cost function
$$\frac{d}{dw} J(w,b) = \frac{1}{m} \sum\limits_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}$$
$$\frac{d}{db} J(w,b) = \frac{1}{m} \sum\limits_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$
def compute_gradient(x, y, w, b):    # x: training-set features, y: training-set target values
    m = x.shape[0]
    dj_dw = 0
    dj_db = 0
    for i in range(m):
        f_wb = w * x[i] + b
        dj_dw_i = (f_wb - y[i]) * x[i]   # contribution of example i to dJ/dw
        dj_db_i = f_wb - y[i]            # contribution of example i to dJ/db
        dj_db += dj_db_i
        dj_dw += dj_dw_i
    dj_dw = dj_dw / m
    dj_db = dj_db / m
    return dj_dw, dj_db
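A quick check of compute_gradient at an illustrative point, w = 100 and b = 100; the values in the comment follow from the formulas above.

dj_dw, dj_db = compute_gradient(x_train, y_train, w=100, b=100)
print(dj_dw, dj_db)   # -250.0 -150.0: both are negative, so the next update will increase w and b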
4.4 Running gradient descent
$$w = w - \alpha\frac{d}{dw} J(w,b) = w - \alpha\frac{1}{m} \sum\limits_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}$$
$$b = b - \alpha\frac{d}{db} J(w,b) = b - \alpha\frac{1}{m} \sum\limits_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$
def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function):
    w = copy.deepcopy(w_in)   # work on a copy so the caller's value is untouched
    J_history = []            # cost after each iteration (for plotting later)
    p_history = []            # parameters [w, b] after each iteration
    b = b_in
    w = w_in
    for i in range(num_iters):
        # Compute the gradient and update both parameters
        dj_dw, dj_db = gradient_function(x, y, w, b)
        b = b - alpha * dj_db
        w = w - alpha * dj_dw
        # Save cost J and parameters at each iteration (capped to bound memory use)
        if i < 100000:
            J_history.append(cost_function(x, y, w, b))
            p_history.append([w, b])
        # Print progress about 10 times over the run
        if i % math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e}  ",
                  f"w: {w: 0.3e}, b:{b: 0.5e}")
    return w, b, J_history, p_history
Printed results (data source: Andrew Ng's Machine Learning course, Lab05 Gradient Descent).
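For reference, a minimal sketch of how the function above might be called; the initial values, learning rate, and iteration count here are illustrative assumptions rather than the exact lab settings.

w_init = 0.0
b_init = 0.0
alpha = 1.0e-2        # learning rate (illustrative)
iterations = 10000    # number of gradient-descent steps (illustrative)

w_final, b_final, J_hist, p_hist = gradient_descent(
    x_train, y_train, w_init, b_init, alpha, iterations,
    compute_cost, compute_gradient)

# On this two-point dataset the run should approach w ≈ 200, b ≈ 100,
# the line that passes exactly through (1.0, 300.0) and (2.0, 500.0).
print(f"w: {w_final:.2f}, b: {b_final:.2f}")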
(5) Summary
In the end, the whole point of gradient descent is to make the value of the cost function as small as possible, and a smaller cost means the model fits the data better and more accurately.
Go climb a mountain once in a while, and think about what gradient descent means on the way down; just make sure to stay safe.