Deriving the Upper-Confidence-Bound Action Selection rule from Chapter 2 (Bandit algorithms) of Reinforcement Learning by Richard S. Sutton and Andrew G. Barto.
Preliminaries
Markov Inequality
For any nonnegative r.v. (random variable) $X$ and constant $a > 0$,
$$P(X \ge a) \le \frac{E[X]}{a}.$$
Prf:
$$E[X] \ge E\big[X\,\mathbf{1}\{X \ge a\}\big] \ge a\,E\big[\mathbf{1}\{X \ge a\}\big] = a\,P(X \ge a),$$
and dividing both sides by $a$ gives the inequality.
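As a quick numerical check (an illustrative sketch added here, not part of the original derivation; the exponential distribution and the threshold are arbitrary choices), the bound can be compared with an empirical tail probability:

```python
import math
import random

# Monte Carlo check of Markov's inequality for X ~ Exponential(1),
# which is nonnegative with E[X] = 1.
random.seed(0)
n, a = 100_000, 3.0
samples = [random.expovariate(1.0) for _ in range(n)]

empirical = sum(x >= a for x in samples) / n  # estimate of P(X >= a)
markov_bound = 1.0 / a                        # E[X] / a

print(f"P(X >= {a}) ~ {empirical:.4f}  <=  Markov bound {markov_bound:.4f}")
print(f"exact tail exp(-a) = {math.exp(-a):.4f}")
```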
Chebyshev's Inequality
Let $X$ have mean $\mu$ and variance $\sigma^2$. Then for any $a > 0$,
$$P\big(|X - \mu| \ge a\big) \le \frac{\sigma^2}{a^2}.$$
Prf: Apply the Markov Inequality to the nonnegative random variable $(X - \mu)^2$:
$$P\big(|X - \mu| \ge a\big) = P\big((X - \mu)^2 \ge a^2\big) \le \frac{E[(X - \mu)^2]}{a^2} = \frac{\sigma^2}{a^2}.$$
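The same kind of sanity check works here (again an illustrative sketch with arbitrary parameters, not from the original post):

```python
import random

# Monte Carlo check of Chebyshev's inequality for X ~ Uniform(0, 1):
# mean mu = 0.5, variance sigma^2 = 1/12.
random.seed(0)
n, a = 100_000, 0.4
mu, var = 0.5, 1.0 / 12.0

samples = [random.random() for _ in range(n)]
empirical = sum(abs(x - mu) >= a for x in samples) / n
chebyshev_bound = var / a**2

print(f"P(|X - mu| >= {a}) ~ {empirical:.4f}  <=  Chebyshev bound {chebyshev_bound:.4f}")
```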
Chernoff's Inequality
For any r.v. $X$ and constants $a > 0$ and $t > 0$,
$$P(X \ge a) \le e^{-ta}\,E\big[e^{tX}\big].$$
Prf: Since $x \mapsto e^{tx}$ is increasing and $e^{tX} \ge 0$,
$$P(X \ge a) = P\big(e^{tX} \ge e^{ta}\big) \le e^{-ta}\,E\big[e^{tX}\big],$$
by the Markov Inequality.
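To see how the free parameter $t$ is used (this block is an added illustration; the standard normal example is not in the original), one can minimize the Chernoff bound over $t$ and compare it with the exact tail:

```python
import math

# Chernoff bound for X ~ N(0, 1): E[exp(tX)] = exp(t^2 / 2), so
# P(X >= a) <= exp(-t*a + t^2 / 2), which is minimized over t > 0 at t = a.
a = 2.0
ts = [x / 100.0 for x in range(1, 501)]
chernoff = min(math.exp(-t * a + t * t / 2.0) for t in ts)
exact_tail = 0.5 * math.erfc(a / math.sqrt(2.0))  # exact P(X >= a)

print(f"Chernoff bound minimized over t: {chernoff:.4f}")
print(f"closed form exp(-a^2/2):        {math.exp(-a * a / 2.0):.4f}")
print(f"exact tail P(X >= a):           {exact_tail:.4f}")
```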
Hoeffding Lemma
Let $X$ be a random variable with $E[X] = 0$ and $a \le X \le b$. Then for any $t > 0$,
$$E\big[e^{tX}\big] \le \exp\!\left(\frac{t^2 (b - a)^2}{8}\right).$$
Tools:
1. Jensen's Inequality: for a convex function $f$ and any $\lambda \in [0, 1]$,
$$f\big(\lambda x + (1 - \lambda) y\big) \le \lambda f(x) + (1 - \lambda) f(y).$$
2. Taylor's Theorem: if $f$ is twice differentiable, then for any $x$ and $x_0$,
$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(\xi)}{2}(x - x_0)^2$$
for some $\xi$ between $x_0$ and $x$; only this second-order form is needed below.
Prf:
1. $e^{tx}$ is a convex function, and any $x \in [a, b]$ can be written as $x = \lambda a + (1 - \lambda) b$ by setting $\lambda = \frac{b - x}{b - a} \in [0, 1]$. By Jensen's Inequality,
$$e^{tx} \le \frac{b - x}{b - a}\,e^{ta} + \frac{x - a}{b - a}\,e^{tb}.$$
Taking expectations and using $E[X] = 0$,
$$E\big[e^{tX}\big] \le \frac{b}{b - a}\,e^{ta} - \frac{a}{b - a}\,e^{tb}.$$
We need to find a function $\varphi(u)$ s.t. the right-hand side equals $e^{\varphi(u)}$ with $\varphi(u) \le \frac{u^2}{8}$.
2. Let $\theta = -\frac{a}{b - a}$ and $u = t(b - a)$ (note $a \le 0 \le b$ because $E[X] = 0$, so $\theta \in [0, 1]$). Then
$$\frac{b}{b - a}\,e^{ta} - \frac{a}{b - a}\,e^{tb} = (1 - \theta)\,e^{-\theta u} + \theta\,e^{(1 - \theta) u} = e^{-\theta u}\big(1 - \theta + \theta e^{u}\big) = e^{\varphi(u)},$$
where $\varphi(u) = -\theta u + \ln\!\big(1 - \theta + \theta e^{u}\big)$.
So: $E\big[e^{tX}\big] \le e^{\varphi(u)}$.
3. $\varphi(0) = 0$, $\varphi'(u) = -\theta + \frac{\theta e^{u}}{1 - \theta + \theta e^{u}}$, so $\varphi'(0) = 0$, and
$$\varphi''(u) = \frac{\theta e^{u}\,(1 - \theta)}{\big(1 - \theta + \theta e^{u}\big)^2} = s(1 - s) \le \frac{1}{4}, \qquad s = \frac{\theta e^{u}}{1 - \theta + \theta e^{u}} \in [0, 1].$$
By Taylor's Theorem, for some $\xi$ between $0$ and $u$,
$$\varphi(u) = \varphi(0) + \varphi'(0)\,u + \frac{\varphi''(\xi)}{2}\,u^2 \le \frac{u^2}{8} = \frac{t^2 (b - a)^2}{8}.$$
So the Hoeffding Lemma is proved.
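As a small numerical illustration of the lemma (added here; the uniform example is an assumption, not from the original), take $X \sim \mathrm{Uniform}(-1, 1)$, which has mean 0 with $a = -1$, $b = 1$ and moment generating function $E[e^{tX}] = \sinh(t)/t$:

```python
import math

# Check Hoeffding's Lemma for X ~ Uniform(-1, 1): E[X] = 0, a = -1, b = 1.
# The lemma claims E[exp(tX)] <= exp(t^2 * (b - a)^2 / 8) = exp(t^2 / 2).
for t in [0.5, 1.0, 2.0, 4.0]:
    mgf = math.sinh(t) / t            # E[exp(tX)] for Uniform(-1, 1)
    bound = math.exp(t * t / 2.0)
    print(f"t = {t:>3}:  E[e^(tX)] = {mgf:10.4f}  <=  bound {bound:12.4f}")
```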
Hoeffding Bound
Let the random variables $X_1, \dots, X_n$ be independent with $E[X_i] = \mu_i$ and
$$a \le X_i \le b$$
for each $i$, where $a, b$ are constants. Then for any $\varepsilon > 0$,
$$P\!\left(\frac{1}{n}\sum_{i=1}^{n}\big(X_i - \mu_i\big) \ge \varepsilon\right) \le \exp\!\left(-\frac{2 n \varepsilon^2}{(b - a)^2}\right).$$
Prf:
1. By Chernoff's Inequality, for any $t > 0$,
$$P\!\left(\sum_{i=1}^{n}\big(X_i - \mu_i\big) \ge n\varepsilon\right) \le e^{-t n \varepsilon}\,E\!\left[e^{\,t \sum_{i}(X_i - \mu_i)}\right].$$
Because the $X_i$ are independent, so the $X_i - \mu_i$ are independent, and therefore
$$E\!\left[e^{\,t \sum_{i}(X_i - \mu_i)}\right] = \prod_{i=1}^{n} E\!\left[e^{\,t (X_i - \mu_i)}\right].$$
And: $E[X_i - \mu_i] = 0$ and $a - \mu_i \le X_i - \mu_i \le b - \mu_i$, with $(b - \mu_i) - (a - \mu_i) = b - a$, to make the Hoeffding Lemma applicable.
By the Hoeffding Lemma, $E\big[e^{\,t(X_i - \mu_i)}\big] \le \exp\!\left(\frac{t^2 (b - a)^2}{8}\right)$, so
$$P\!\left(\sum_{i=1}^{n}\big(X_i - \mu_i\big) \ge n\varepsilon\right) \le \exp\!\left(-t n \varepsilon + \frac{n t^2 (b - a)^2}{8}\right).$$
Construct a function:
$$g(t) = -t n \varepsilon + \frac{n t^2 (b - a)^2}{8}, \qquad g'(t) = -n \varepsilon + \frac{n t (b - a)^2}{4},$$
so $g$ gets the minimal value at
$$t^{*} = \frac{4 \varepsilon}{(b - a)^2}, \qquad g(t^{*}) = -\frac{4 n \varepsilon^2}{(b - a)^2} + \frac{2 n \varepsilon^2}{(b - a)^2} = -\frac{2 n \varepsilon^2}{(b - a)^2}.$$
So:
$$P\!\left(\frac{1}{n}\sum_{i=1}^{n}\big(X_i - \mu_i\big) \ge \varepsilon\right) \le \exp\!\left(-\frac{2 n \varepsilon^2}{(b - a)^2}\right).$$
And applying the same argument to the $-X_i$ gives the two-sided form
$$P\!\left(\left|\frac{1}{n}\sum_{i=1}^{n}\big(X_i - \mu_i\big)\right| \ge \varepsilon\right) \le 2\exp\!\left(-\frac{2 n \varepsilon^2}{(b - a)^2}\right).$$
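The following Monte Carlo sketch (an added illustration with made-up parameters) compares the empirical tail probability of a Bernoulli sample mean with the Hoeffding Bound:

```python
import math
import random

# Empirical check of the Hoeffding Bound for X_i ~ Bernoulli(0.5), so a = 0, b = 1.
random.seed(0)
n, eps, trials = 100, 0.1, 20_000

exceed = 0
for _ in range(trials):
    sample_mean = sum(random.random() < 0.5 for _ in range(n)) / n
    if sample_mean - 0.5 >= eps:
        exceed += 1

empirical = exceed / trials
bound = math.exp(-2 * n * eps * eps)
print(f"P(mean - 0.5 >= {eps}) ~ {empirical:.4f}  <=  Hoeffding bound {bound:.4f}")
```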
Upper Confidence Bound (UCB)
Def: Confidence interval. A $1 - \delta$ confidence interval for a parameter $\theta$ is an interval $[L, U]$, computed from the observed data, s.t.
$$P\big(L \le \theta \le U\big) \ge 1 - \delta.$$
Suppose a coin is flipped $N$ times, the probability of heads is $p$, and $p$ is unknown; we use the $N$ outcomes to estimate a range for $p$.
Let $X_i \in \{0, 1\}$ be the outcome of flip $i$ and $\hat{p} = \frac{1}{N}\sum_{i=1}^{N} X_i$. Then, since $b - a = 1$,
$$P\big(|\hat{p} - p| \ge \varepsilon\big) \le 2\exp\!\big(-2 N \varepsilon^2\big)$$
by the Hoeffding Bound.
So: setting $\delta = 2\exp(-2 N \varepsilon^2)$, i.e. $\varepsilon = \sqrt{\frac{\ln(2/\delta)}{2N}}$, we get that with probability at least $1 - \delta$,
$$p \in \left[\hat{p} - \sqrt{\frac{\ln(2/\delta)}{2N}},\ \hat{p} + \sqrt{\frac{\ln(2/\delta)}{2N}}\right].$$
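A minimal sketch of this interval calculation (the coin bias, sample size, and $\delta$ below are made-up values for illustration):

```python
import math
import random

# Hoeffding-based confidence interval for the heads probability of a coin.
random.seed(0)
true_p, N, delta = 0.7, 1000, 0.05

flips = [random.random() < true_p for _ in range(N)]
p_hat = sum(flips) / N
eps = math.sqrt(math.log(2.0 / delta) / (2.0 * N))

print(f"p_hat = {p_hat:.3f}")
print(f"with probability >= {1 - delta}: p in [{p_hat - eps:.3f}, {p_hat + eps:.3f}]")
```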
Bandit UCB Action Selection
We use
$$A_t \doteq \arg\max_{a}\left[\,Q_t(a) + c\,\sqrt{\frac{\ln t}{N_t(a)}}\,\right].$$
Prf:
1. $Q_t(a)$ is the sample average of the rewards received so far from action $a$, and $N_t(a)$ is the number of times action $a$ has been selected prior to time $t$. Where $q_*(a) = E[R_t \mid A_t = a]$ is the true (unknown) value of action $a$.
2. Hoeffding Bound: assume the rewards are bounded in $[0, 1]$, so $b - a = 1$. Let $\varepsilon > 0$. Then
$$P\big(q_*(a) - Q_t(a) \ge \varepsilon\big) \le \exp\!\big(-2 N_t(a)\,\varepsilon^2\big)$$
by the Hoeffding Bound.
So: denote the right-hand side by $\delta = \exp\!\big(-2 N_t(a)\,\varepsilon^2\big)$.
Make $\delta = t^{-2c^2}$, a failure probability that shrinks as $t$ grows; solving for $\varepsilon$ gives
$$\varepsilon = \sqrt{\frac{\ln(1/\delta)}{2 N_t(a)}} = c\,\sqrt{\frac{\ln t}{N_t(a)}}.$$
3. Hence, with probability at least $1 - \delta$,
$$q_*(a) \le Q_t(a) + c\,\sqrt{\frac{\ln t}{N_t(a)}},$$
i.e. the quantity maximized by the UCB rule is an upper confidence bound on the true action value.
4. As $t$ grows, $\delta = t^{-2c^2} \to 0$, so this upper bound holds with higher and higher confidence, and each time $a$ is selected $N_t(a)$ increases and the interval shrinks. When $N_t(a) = 0$, the second term is taken to be infinite, so action $a$ is considered a maximizing action and is sure to be selected. The constant $c$ controls the degree of exploration: the larger $c$ is, the larger the bonus for rarely selected actions and the more likely they are to be chosen.
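To make the rule concrete, here is a minimal Python sketch of UCB action selection on a stationary 10-armed bandit (the Gaussian testbed, the arm count, and the values of c and the step budget are illustrative assumptions, not taken from the book or the original post):

```python
import math
import random

# Minimal UCB action selection on a stationary multi-armed bandit.
# Each arm a gives Gaussian rewards around a hidden true value q_star[a].
random.seed(0)
K, c, steps = 10, 2.0, 2000

q_star = [random.gauss(0.0, 1.0) for _ in range(K)]  # true action values
Q = [0.0] * K   # sample-average estimates Q_t(a)
N = [0] * K     # selection counts N_t(a)

total_reward = 0.0
for t in range(1, steps + 1):
    # An action with N_t(a) = 0 is treated as maximizing (its bonus is "infinite").
    untried = [a for a in range(K) if N[a] == 0]
    if untried:
        action = untried[0]
    else:
        action = max(range(K), key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))

    reward = random.gauss(q_star[action], 1.0)
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]  # incremental sample average
    total_reward += reward

best = max(range(K), key=lambda a: q_star[a])
print(f"best arm: {best}, chosen {N[best]} / {steps} times")
print(f"average reward: {total_reward / steps:.3f}, best arm value: {q_star[best]:.3f}")
```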