Overview
Online Learning 4: Upper Confidence Bound (UCB) algorithm
- 1 Introduction
- 1.1 Key features
- 1.2 Anytime concentration
- 1.3 UCB intuition
- 2 UCB algorithm
- ***The Upper Confidence Bound Algorithm p102 (book)***
- 2.1 Settings
- 2.2 Algorithm
- 2.3 Regret
- 3 KL-UCB Bernoulli
- 3.1 KL-divergence
- 3.2 Algorithm
- KL-UCB algorithm (notebook)
- 3.3 Regret
- UCB
- KL-UCB
- Comparison
1 Introduction
1.1 Key features
- All ETC-like algorithms (ETC, DB, ε-greedy, elimination) separate exploration samples from exploitation samples; the number of exploration rounds is fixed (or has a fixed expectation).
- UCB characteristics:
  - The number of explorations of each arm is dynamic; there is no separation between exploration and exploitation, i.e., all samples are used to improve the estimates.
  - Technical challenge: the empirical estimates of the arms depend on a random number of samples, so the usual sub-Gaussian concentration bounds do not hold. We need anytime concentration.
1.2 Anytime concentration
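A sketch of the standard statement (a union bound over the random sample count, for a 1-subgaussian arm with empirical mean $\hat{\mu}_s$ after $s$ samples):

$$\mathbb{P}\left(\hat{\mu}_s - \mu \ge \sqrt{\frac{2\ln(1/\delta)}{s}}\right) \le \delta \quad \text{(fixed sample size } s\text{)}$$

$$\mathbb{P}\left(\exists\, s \le n:\ \hat{\mu}_s - \mu \ge \sqrt{\frac{2\ln(1/\delta)}{s}}\right) \le n\delta \quad \text{(anytime, union bound over } s\text{)}$$

The extra factor of $n$ is why the confidence level $\delta$ must shrink with the horizon.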
1.3 UCB intuition
UCB relies on the principle of optimism: use the "most reasonable" optimistic estimate of each arm, and optimize your decision at each time step using this optimistic estimate.
Taking an optimistic view of an unknown local cuisine (say, when choosing a restaurant in a new town) leads to exploration because, without data, it could be amazing. After trying the new option a few times, you can update your statistics and make a more informed decision.
On the other hand, taking a pessimistic view of the new option discourages exploration, and you may suffer significant regret if the local options are delicious.
For bandits, the optimism principle means using the data observed so far to assign to each arm a value, called the upper confidence bound, which with high probability is an overestimate of the unknown mean.
The intuitive reason why this leads to sublinear regret is simple.
Assuming the upper confidence bound assigned to the optimal arm is indeed an overestimate,
then another arm can only be played if its upper confidence bound is larger than that of the optimal arm, which in turn is larger than the mean of the optimal arm.
And yet this cannot happen too often because the additional data provided by playing a suboptimal arm means that the upper confidence bound for this arm will eventually fall below that of the optimal arm.
2 UCB algorithm
The Upper Confidence Bound Algorithm p102 (book)
2.1 Settings
2.2 Algorithm
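As a rough illustration of the index rule (a minimal Python sketch, not the book's pseudocode; `run_ucb`, the Bernoulli reward model, and the $\delta = 1/n^2$ confidence level are my assumptions):

```python
import numpy as np

def run_ucb(means, n, rng=None):
    """Minimal UCB sketch on a k-armed Bernoulli bandit with horizon n."""
    rng = np.random.default_rng() if rng is None else rng
    k = len(means)
    counts = np.zeros(k)   # T_i(t): number of pulls of arm i so far
    sums = np.zeros(k)     # cumulative reward collected from arm i
    rewards = np.zeros(n)
    for t in range(n):
        if t < k:
            arm = t        # play each arm once to initialise the estimates
        else:
            mu_hat = sums / counts
            # optimistic index: empirical mean + sqrt(2 ln(1/delta) / T_i),
            # with delta = 1/n^2 so that ln(1/delta) = 2 ln n
            bonus = np.sqrt(2.0 * np.log(n * n) / counts)
            arm = int(np.argmax(mu_hat + bonus))
        reward = rng.binomial(1, means[arm])
        counts[arm] += 1
        sums[arm] += reward
        rewards[t] = reward
    return rewards, counts
```

Every observed reward updates the estimate of the arm that produced it, so there is no separate exploration phase; exploration happens only through the bonus term.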
2.3 Regret
The growth rate is $\log n$, but the bound depends on $\Delta$:
$$\mathbb{E}[T_i(n)] = \frac{\ln\frac{1}{\delta}}{\Delta_i^2}\left[1-(n+1)\delta\right] + n(n+1)\delta + 1$$
$$R_n(\pi,\nu) = \sum_{i=1}^{k} \Delta_i\, \mathbb{E}[T_i(n)]$$

Choosing $\delta = 1/n^2$ makes $n(n+1)\delta = O(1)$ and $\ln\frac{1}{\delta} = 2\ln n$, so each suboptimal arm contributes $O\!\left(\frac{\ln n}{\Delta_i}\right)$ to the regret.
The worst-case (gap-free) growth rate is $\sqrt{nk\log n}$, which does not depend on $\Delta$.
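A quick empirical check of the logarithmic growth, reusing the `run_ucb` sketch from Section 2.2 (the arm means here are made up for illustration):

```python
means = [0.5, 0.45, 0.4]   # one optimal arm; gaps 0.05 and 0.1
n = 20000
rewards, counts = run_ucb(means, n)
regret = n * max(means) - rewards.sum()
print(f"pulls per arm: {counts}, empirical regret: {regret:.1f}")
```

The suboptimal arms should be pulled on the order of $\frac{\ln n}{\Delta_i^2}$ times, with the smaller-gap arm pulled more often.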
3 KL-UCB Bernoulli
3.1 KL-divergence
$$KL(p,q) = d(p,q) = p\ln\frac{p}{q} + (1-p)\ln\frac{1-p}{1-q}$$
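In Python this is nearly a one-liner modulo the boundary cases (a sketch; `kl_bernoulli` is my name, and clipping $q$ away from $\{0,1\}$ plus the convention $0\ln 0 = 0$ are my choices):

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """d(p, q) = p ln(p/q) + (1-p) ln((1-p)/(1-q)) for Bernoulli means."""
    q = min(max(q, eps), 1.0 - eps)   # keep q away from 0 and 1
    result = 0.0
    if p > 0:
        result += p * math.log(p / q)
    if p < 1:
        result += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return result
```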
3.2 Algorithm
KL-UCB algorithm (notebook)
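The notebook is not reproduced here; a minimal sketch of the index computation, assuming the standard KL-UCB index $U_i(t) = \max\{q \ge \hat{\mu}_i : T_i(t)\, d(\hat{\mu}_i, q) \le \ln t\}$, solved by bisection since $d(\hat{\mu}_i, q)$ is increasing in $q$ on $[\hat{\mu}_i, 1]$ (uses `kl_bernoulli` above):

```python
def kl_ucb_index(mu_hat, pulls, t, iters=30):
    """Largest q >= mu_hat with pulls * d(mu_hat, q) <= ln(t), by bisection."""
    budget = math.log(max(t, 2)) / pulls
    lo, hi = mu_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if kl_bernoulli(mu_hat, mid) <= budget:
            lo = mid   # mid is still feasible: the index lies above it
        else:
            hi = mid
    return lo
```

Each round, the algorithm plays the arm with the largest index, exactly as UCB does with its simpler bonus.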
3.3 Regret
UCB
$$\lim_{n\to\infty}\frac{R_n}{\ln n} \le \sum_{i}\frac{1}{2\Delta_i}$$
KL-UCB
$$\lim_{n\to\infty}\frac{R_n}{\ln n} \le \sum_{i}\frac{2\mu_i(1-\mu_i)}{\Delta_i}$$
Comparison
If the coin is exactly fair, so zeros and ones are equally likely, then KL-UCB gives no asymptotic improvement over UCB.
In general, for a biased coin, $\mu$ can be much closer to either zero or one, in which case you get huge improvements over UCB. That is the KL-UCB algorithm.
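A quick check of the two constants (my arithmetic): for a fair coin, $\mu_i = \frac{1}{2}$, the KL-UCB coefficient is $\frac{2 \cdot \frac{1}{2} \cdot \frac{1}{2}}{\Delta_i} = \frac{1}{2\Delta_i}$, exactly matching UCB; for $\mu_i = 0.01$ it is $\frac{2 \cdot 0.01 \cdot 0.99}{\Delta_i} \approx \frac{0.02}{\Delta_i}$, roughly 25 times smaller than UCB's $\frac{0.5}{\Delta_i}$.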