Overview
A derivation of Upper-Confidence-Bound Action Selection from Chapter 2 (bandit algorithms) of Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto.
Preliminaries
Markov Inequality
For any r.v. (random variable) $X$ and constant $a > 0$,
$$P(|X| \ge a) \le \frac{E|X|}{a}.$$
Prf:
$E|X| \ge E\big[|X| \cdot \mathbf{1}_{|X| \ge a}\big] \ge a\, E\big[\mathbf{1}_{|X| \ge a}\big] = a\, P(|X| \ge a)$.
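As a quick numerical sanity check (not part of the original derivation), the inequality can be verified on simulated samples; here we assume an Exp(1) distribution, for which $E|X| = 1$:

```python
import random

# Sanity check of Markov's inequality on X ~ Exp(1) (E|X| = 1):
# for every a > 0 the tail P(X >= a) must not exceed E|X| / a.
random.seed(0)
samples = [random.expovariate(1.0) for _ in range(100_000)]
mean = sum(samples) / len(samples)  # empirical E|X|, close to 1

for a in [1.0, 2.0, 4.0]:
    tail = sum(s >= a for s in samples) / len(samples)
    bound = mean / a
    assert tail <= bound  # Markov: P(X >= a) <= E|X| / a
    print(f"a={a}: P(X>=a) ~ {tail:.4f} <= E|X|/a ~ {bound:.4f}")
```

Note that the bound is loose: for $a = 2$ the true tail is $e^{-2} \approx 0.135$ while the bound is $0.5$.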
Chebyshev's Inequality
Let $X$ have mean $\mu$ and variance $\sigma^2$. Then for any $a > 0$,
$$P(|X - \mu| \ge a) \le \frac{\sigma^2}{a^2}.$$
Prf:
Apply the Markov Inequality to the non-negative r.v. $(X - \mu)^2$:
$P(|X - \mu| \ge a) = P\big((X - \mu)^2 \ge a^2\big) \le \frac{E[(X - \mu)^2]}{a^2} = \frac{\sigma^2}{a^2}$.
Chernoff's Inequality
For any r.v. $X$ and constants $a > 0$ and $t > 0$,
$$P(X \ge a) \le \frac{E[e^{tX}]}{e^{ta}}.$$
Prf:
$P(X \ge a) = P(e^{tX} \ge e^{ta}) \le \frac{E[e^{tX}]}{e^{ta}}$, by the Markov Inequality (the map $x \mapsto e^{tx}$ is increasing and non-negative).
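A small illustrative check (an assumed example, not from the original text): for a standard normal $X$ one has $E[e^{tX}] = e^{t^2/2}$, and minimising $e^{t^2/2 - ta}$ over $t > 0$ (at $t = a$) gives the optimised Chernoff bound $P(X \ge a) \le e^{-a^2/2}$:

```python
import math
import random

# Compare the optimised Chernoff bound e^{-a^2/2} for X ~ N(0, 1)
# against an empirical tail probability.
random.seed(1)
samples = [random.gauss(0.0, 1.0) for _ in range(200_000)]

a = 2.0
tail = sum(s >= a for s in samples) / len(samples)  # true value ~ 0.0228
chernoff = math.exp(-a * a / 2)                     # e^{-2} ~ 0.1353
assert tail <= chernoff
print(f"P(X>={a}) ~ {tail:.4f} <= {chernoff:.4f}")
```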
Hoeffding Lemma
Let $X$ be a r.v. with $E[X] = 0$ and $a \le X \le b$, where $a, b$ are constants. Then for any $t > 0$,
$$E[e^{tX}] \le \exp\!\left(\frac{t^2 (b-a)^2}{8}\right).$$
Tools:
1. Jensen's Inequality: for a convex function $f$ and any $\lambda \in [0, 1]$,
$f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y)$.
2. Taylor's Theorem: if all derivatives of $f(x)$ exist at point $a$, then for some $\xi$ between $a$ and $x$,
$f(x) = f(a) + f'(a)(x - a) + \frac{f''(\xi)}{2}(x - a)^2$.
Prf:
1. $e^{tx}$ is a convex function, and for any $x \in [a, b]$ we can write $x = \lambda a + (1-\lambda) b$ by setting $\lambda = \frac{b - x}{b - a}$, so
$$e^{tx} \le \frac{b - x}{b - a}\, e^{ta} + \frac{x - a}{b - a}\, e^{tb}.$$
Taking expectations and using $E[X] = 0$:
$$E[e^{tX}] \le \frac{b}{b - a}\, e^{ta} - \frac{a}{b - a}\, e^{tb}.$$
We need to find a function $\varphi(h)$, s.t. the right-hand side equals $e^{\varphi(h)}$ with $h = t(b - a)$. Setting $p = \frac{-a}{b - a}$ (so $1 - p = \frac{b}{b - a}$, $ta = -ph$, $tb = (1 - p)h$), the right-hand side is $(1 - p + p e^h)\, e^{-ph}$, so
$$\varphi(h) = -ph + \ln\!\big(1 - p + p e^h\big).$$
2. $\varphi(0) = 0$, and
$$\varphi'(h) = -p + \frac{p e^h}{1 - p + p e^h}, \qquad \varphi'(0) = 0.$$
So:
$$\varphi''(h) = \frac{p e^h (1 - p)}{(1 - p + p e^h)^2} = \frac{uv}{(u + v)^2} \le \frac{1}{4}, \qquad u = p e^h,\ v = 1 - p.$$
3. By Taylor's Theorem, for some $\xi$ between $0$ and $h$,
$$\varphi(h) = \varphi(0) + \varphi'(0)\, h + \frac{\varphi''(\xi)}{2}\, h^2 \le \frac{h^2}{8} = \frac{t^2 (b - a)^2}{8}.$$
So the Hoeffding Lemma is proved.
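A numeric check of the lemma (an illustrative example, assuming a centred Bernoulli variable): take $X = B - \frac{1}{2}$ with $B \sim \mathrm{Bernoulli}(\frac{1}{2})$, so $E[X] = 0$, $a = -\frac{1}{2}$, $b = \frac{1}{2}$, and the lemma claims $E[e^{tX}] \le e^{t^2/8}$:

```python
import math

# X = B - 1/2 with B ~ Bernoulli(1/2): E[X] = 0 and -1/2 <= X <= 1/2,
# so b - a = 1 and the Hoeffding Lemma gives E[e^{tX}] <= exp(t^2 / 8).
for t in [0.5, 1.0, 2.0, 4.0]:
    mgf = 0.5 * (math.exp(-t / 2) + math.exp(t / 2))  # exact E[e^{tX}]
    bound = math.exp(t * t / 8)
    assert mgf <= bound
    print(f"t={t}: E[e^(tX)] = {mgf:.4f} <= {bound:.4f}")
```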
Hoeffding Bound
Let the random variables be independent with , for each , where a,b are constants. Then for any ,
Prf:
1.
Because independent, so independent.
And:
,
, to make .
By Hoeffding Lemma,
Construct a function:
.
so get the minial value at . .
,And
So: .
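A Monte-Carlo sketch of the bound (an assumed example with Bernoulli variables): for $n$ i.i.d. $\mathrm{Bernoulli}(p)$ variables in $[0, 1]$, the bound specialises to $P(\hat{p} - p \ge \varepsilon) \le e^{-2 n \varepsilon^2}$:

```python
import math
import random

# Check P(sample mean - p >= eps) <= exp(-2 n eps^2) empirically
# for n Bernoulli(p) draws per trial.
random.seed(2)
n, p, eps, trials = 100, 0.5, 0.1, 20_000

hits = 0
for _ in range(trials):
    mean = sum(random.random() < p for _ in range(n)) / n
    hits += mean - p >= eps
empirical = hits / trials
bound = math.exp(-2 * n * eps * eps)  # e^{-2} ~ 0.1353
assert empirical <= bound
print(f"P(mean - p >= {eps}) ~ {empirical:.4f} <= {bound:.4f}")
```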
Upper Confidence Bound (UCB)
Def: a confidence interval for a parameter $\theta$ is an interval $[L, U]$, computed from the observed data, such that
$$P(L \le \theta \le U) \ge 1 - \delta$$
for a chosen failure probability $\delta$.
Suppose a coin is tossed $N$ times with unknown heads probability $p$. Use the $N$ outcomes $X_1, \dots, X_N \in \{0, 1\}$ to estimate the range of $p$ via $\hat{p} = \frac{1}{N} \sum_{i=1}^N X_i$:
Then $P(\hat{p} - p \ge \varepsilon) \le e^{-2N\varepsilon^2}$ and $P(p - \hat{p} \ge \varepsilon) \le e^{-2N\varepsilon^2}$, by the Hoeffding Bound (each $X_i \in [0, 1]$, and $\sum_i (X_i - p) \ge N\varepsilon$ is the event $\hat{p} - p \ge \varepsilon$).
So: $P(|\hat{p} - p| < \varepsilon) \ge 1 - 2e^{-2N\varepsilon^2}$, i.e. $[\hat{p} - \varepsilon,\ \hat{p} + \varepsilon]$ is a confidence interval for $p$.
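Solving the two-sided bound for $\varepsilon$ gives a concrete interval half-width: setting $2e^{-2N\varepsilon^2} = \delta$ yields $\varepsilon = \sqrt{\ln(2/\delta) / (2N)}$. A small sketch (the helper name is ours):

```python
import math

def hoeffding_halfwidth(n: int, delta: float) -> float:
    """Half-width eps with P(|p_hat - p| < eps) >= 1 - delta,
    from the two-sided Hoeffding bound 2 * exp(-2 n eps^2) = delta."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

# e.g. after N = 1000 flips, a 95% interval (delta = 0.05) around p_hat
# has half-width ~ 0.043; quadrupling N halves the width.
print(hoeffding_halfwidth(1000, 0.05))
print(hoeffding_halfwidth(4000, 0.05))
```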
Bandit UCB Action Selection
We use
$$A_t = \operatorname*{argmax}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right].$$
Prf:
1. $Q_t(a)$ is the sample average of the rewards received when action $a$ was taken,
where $N_t(a)$ is the number of times action $a$ has been selected prior to time $t$, and $Q_t(a)$ estimates the true action value $q_*(a) = E[R_t \mid A_t = a]$ (rewards assumed to lie in $[0, 1]$).
2. Hoeffding Bound
Let $\delta = e^{-2 N_t(a)\, U_t(a)^2}$; by the Hoeffding Bound,
$$P\big(q_*(a) \ge Q_t(a) + U_t(a)\big) \le \delta.$$
So: $U_t(a) = \sqrt{\frac{\ln(1/\delta)}{2 N_t(a)}}$.
Make $\delta = t^{-2c^2}$, then $U_t(a) = c \sqrt{\frac{\ln t}{N_t(a)}}$.
3. With probability at least $1 - t^{-2c^2}$, $q_*(a) < Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}}$: this term is an upper confidence bound on the true action value.
4. $A_t = \operatorname*{argmax}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$.
As $t$ grows, the failure probability $t^{-2c^2}$ shrinks, so the upper bound becomes more reliable. When $N_t(a) = 0$, the second term is infinite, so action $a$ is certain to be selected (each action is tried at least once). The constant $c$ controls the degree of exploration: the larger $c$ is, the more likely rarely selected actions are to be chosen.
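The selection rule above can be sketched as a minimal agent on a Bernoulli bandit (the arm probabilities and step count below are illustrative assumptions, not from Sutton & Barto):

```python
import math
import random

def ucb_select(Q, N, t, c):
    """A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ].
    Untried actions get an infinite bonus, so each is tried once first."""
    scores = [Q[a] + c * math.sqrt(math.log(t) / N[a]) if N[a] > 0
              else float("inf") for a in range(len(Q))]
    return scores.index(max(scores))

def run_ucb(arm_probs, steps=5000, c=2.0, seed=3):
    rng = random.Random(seed)
    k = len(arm_probs)
    Q = [0.0] * k  # incremental sample-average value estimates
    N = [0] * k    # pull counts per arm
    for t in range(1, steps + 1):
        a = ucb_select(Q, N, t, c)
        reward = 1.0 if rng.random() < arm_probs[a] else 0.0
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]  # incremental mean update
    return Q, N

Q, N = run_ucb([0.2, 0.5, 0.8])
print("pulls per arm:", N)  # the 0.8 arm should dominate
```

The incremental update `Q[a] += (reward - Q[a]) / N[a]` keeps the running sample average without storing past rewards, matching step 1 of the derivation.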