Overview
Online Learning 4: Upper Confidence Bound (UCB) algorithm
- 1 Introduction
- 1.1 Key features
- 1.2 Anytime concentration
- 1.3 UCB intuition
- 2 UCB algorithm
- ***The Upper Confidence Bound Algorithm p102 (book)***
- 2.1 Settings
- 2.2 Algorithm
- 2.3 Regret
- 3 KL-UCB Bernoulli
- 3.1 KL-divergence
- 3.2 Algorithm
- KL-UCB algorithm (notebook)
- 3.3 Regret
- UCB
- KL-UCB
- Comparison
1 Introduction
1.1 Key features
- All ETC-like algorithms (ETC, DB, ε-greedy, elimination) separate exploration samples from exploitation samples; the number of exploration rounds is fixed (or has a fixed expectation).
- UCB characteristics:
  - The number of explorations of each arm is dynamic; there is no separation between exploration and exploitation, i.e., all samples are used to improve the estimates.
  - Technical challenge: the empirical estimates of the arms depend on a random number of samples, so the usual sub-Gaussian concentration bounds do not hold. We need anytime concentration.
1.2 Anytime concentration
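A sketch of the standard statement (a union bound over the random sample count, for a 1-subgaussian arm with empirical mean $\hat{\mu}_s$ after $s$ samples):

$$\mathbb{P}\left(\hat{\mu}_s - \mu \ge \sqrt{\frac{2\ln(1/\delta)}{s}}\right) \le \delta \quad \text{(fixed sample size } s\text{)}$$

$$\mathbb{P}\left(\exists\, s \le n:\ \hat{\mu}_s - \mu \ge \sqrt{\frac{2\ln(1/\delta)}{s}}\right) \le n\delta \quad \text{(anytime, union bound over } s\text{)}$$

The extra factor of $n$ is why the confidence level $\delta$ must shrink with the horizon.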
1.3 UCB intuition
UCB relies on the principle of optimism: use the "most reasonable" optimistic estimate of each arm, and optimize your decision at each time step using this optimistic estimate.
Taking an optimistic view of an unknown local cuisine (say, when choosing a restaurant in a new town) leads to exploration because, without data, it could be amazing. After trying the new option a few times, you can update your statistics and make a more informed decision.
On the other hand, taking a pessimistic view of the new option discourages exploration, and you may suffer significant regret if the local options are delicious.
For bandits, the optimism principle means using the data observed so far to assign to each arm a value, called the upper confidence bound, which with high probability is an overestimate of the unknown mean.
The intuitive reason why this leads to sublinear regret is simple.
Assuming the upper confidence bound assigned to the optimal arm is indeed an overestimate,
then another arm can only be played if its upper confidence bound is larger than that of the optimal arm, which in turn is larger than the mean of the optimal arm.
And yet this cannot happen too often because the additional data provided by playing a suboptimal arm means that the upper confidence bound for this arm will eventually fall below that of the optimal arm.
2 UCB algorithm
The Upper Confidence Bound Algorithm p102 (book)
2.1 Settings
2.2 Algorithm
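As a rough illustration of the index rule (a minimal Python sketch, not the book's pseudocode; `run_ucb`, the Bernoulli reward model, and the $\delta = 1/n^2$ confidence level are my assumptions):

```python
import numpy as np

def run_ucb(means, n, rng=None):
    """Minimal UCB sketch on a k-armed Bernoulli bandit with horizon n."""
    rng = np.random.default_rng() if rng is None else rng
    k = len(means)
    counts = np.zeros(k)   # T_i(t): number of pulls of arm i so far
    sums = np.zeros(k)     # cumulative reward collected from arm i
    rewards = np.zeros(n)
    for t in range(n):
        if t < k:
            arm = t        # play each arm once to initialise the estimates
        else:
            mu_hat = sums / counts
            # optimistic index: empirical mean + sqrt(2 ln(1/delta) / T_i),
            # with delta = 1/n^2 so that ln(1/delta) = 2 ln n
            bonus = np.sqrt(2.0 * np.log(n * n) / counts)
            arm = int(np.argmax(mu_hat + bonus))
        reward = rng.binomial(1, means[arm])
        counts[arm] += 1
        sums[arm] += reward
        rewards[t] = reward
    return rewards, counts
```

Every observed reward updates the estimate of the arm that produced it, so there is no separate exploration phase; exploration happens only through the bonus term.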
2.3 Regret
The growth rate is $\log n$, but the bound depends on $\Delta$:
$$\mathbb{E}[T_i(n)] = \frac{\ln\frac{1}{\delta}}{\Delta_i^2}\left[1-(n+1)\delta\right] + n(n+1)\delta + 1$$
$$R_n(\pi,\nu) = \sum_{i=1}^{k} \Delta_i\, \mathbb{E}[T_i(n)]$$

Choosing $\delta = 1/n^2$ makes $n(n+1)\delta = O(1)$ and $\ln\frac{1}{\delta} = 2\ln n$, so each suboptimal arm contributes $O\!\left(\frac{\ln n}{\Delta_i}\right)$ to the regret.
The worst-case (gap-free) growth rate is $\sqrt{nk\log n}$, which does not depend on $\Delta$.
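A quick empirical check of the logarithmic growth, reusing the `run_ucb` sketch from Section 2.2 (the arm means here are made up for illustration):

```python
means = [0.5, 0.45, 0.4]   # one optimal arm; gaps 0.05 and 0.1
n = 20000
rewards, counts = run_ucb(means, n)
regret = n * max(means) - rewards.sum()
print(f"pulls per arm: {counts}, empirical regret: {regret:.1f}")
```

The suboptimal arms should be pulled on the order of $\frac{\ln n}{\Delta_i^2}$ times, with the smaller-gap arm pulled more often.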
3 KL-UCB Bernoulli
3.1 KL-divergence
$$KL(p,q) = d(p,q) = p\ln\frac{p}{q} + (1-p)\ln\frac{1-p}{1-q}$$
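In Python this is nearly a one-liner modulo the boundary cases (a sketch; `kl_bernoulli` is my name, and clipping $q$ away from $\{0,1\}$ plus the convention $0\ln 0 = 0$ are my choices):

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """d(p, q) = p ln(p/q) + (1-p) ln((1-p)/(1-q)) for Bernoulli means."""
    q = min(max(q, eps), 1.0 - eps)   # keep q away from 0 and 1
    result = 0.0
    if p > 0:
        result += p * math.log(p / q)
    if p < 1:
        result += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return result
```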
3.2 Algorithm
KL-UCB algorithm (notebook)
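The notebook is not reproduced here; a minimal sketch of the index computation, assuming the standard KL-UCB index $U_i(t) = \max\{q \ge \hat{\mu}_i : T_i(t)\, d(\hat{\mu}_i, q) \le \ln t\}$, solved by bisection since $d(\hat{\mu}_i, q)$ is increasing in $q$ on $[\hat{\mu}_i, 1]$ (uses `kl_bernoulli` above):

```python
def kl_ucb_index(mu_hat, pulls, t, iters=30):
    """Largest q >= mu_hat with pulls * d(mu_hat, q) <= ln(t), by bisection."""
    budget = math.log(max(t, 2)) / pulls
    lo, hi = mu_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if kl_bernoulli(mu_hat, mid) <= budget:
            lo = mid   # mid is still feasible: the index lies above it
        else:
            hi = mid
    return lo
```

Each round, the algorithm plays the arm with the largest index, exactly as UCB does with its simpler bonus.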
3.3 Regret
UCB
$$\lim_{n\to\infty}\frac{R_n}{\ln n} \le \sum_{i}\frac{1}{2\Delta_i}$$
KL-UCB
$$\lim_{n\to\infty}\frac{R_n}{\ln n} \le \sum_{i}\frac{2\mu_i(1-\mu_i)}{\Delta_i}$$
Comparison
If the coin is exactly fair, so zeros and ones are equally likely, then KL-UCB gives no asymptotic improvement over UCB.
In general, for a biased coin, $\mu$ can be much closer to either zero or one, in which case you get huge improvements over UCB. That is the KL-UCB algorithm.
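A quick check of the two constants (my arithmetic): for a fair coin, $\mu_i = \frac{1}{2}$, the KL-UCB coefficient is $\frac{2 \cdot \frac{1}{2} \cdot \frac{1}{2}}{\Delta_i} = \frac{1}{2\Delta_i}$, exactly matching UCB; for $\mu_i = 0.01$ it is $\frac{2 \cdot 0.01 \cdot 0.99}{\Delta_i} \approx \frac{0.02}{\Delta_i}$, roughly 25 times smaller than UCB's $\frac{0.5}{\Delta_i}$.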