Upper-Confidence-Bound (UCB) Action Selection


Background

In the ε-greedy method, non-greedy actions are chosen at random for exploration, but indiscriminately, with no preference for those that are nearly greedy or particularly uncertain.
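For contrast with what follows, here is a minimal sketch of ε-greedy selection. The function name `epsilon_greedy_select` and its parameters are illustrative assumptions, and the exploratory branch samples uniformly over all actions (a common variant), which makes the "indiscriminate" nature of the exploration explicit.

```python
import random

def epsilon_greedy_select(q_values, epsilon, rng=random):
    """Pick an action index from value estimates q_values (hypothetical helper)."""
    # With probability epsilon, explore: every action is equally likely,
    # regardless of how good or how uncertain its estimate is.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Otherwise exploit: take the action with the highest current estimate.
    return max(range(len(q_values)), key=q_values.__getitem__)
```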

Upper-Confidence-Bound

To take into account both how close the non-greedy actions' estimates are to being maximal and the uncertainties in those estimates, one effective way is to select actions according to: $A_t \doteq \underset{a}{\arg\max}\left[Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\right]$

$N_t(a)$ denotes the number of times that action $a$ has been selected prior to time $t$. If $N_t(a)=0$, then $a$ is considered to be a maximizing action.
$c > 0$ controls the degree of exploration and determines the confidence level.
The use of the natural logarithm $\ln t$ means that the increases get smaller over time but are unbounded, so all actions will eventually be selected. Actions with lower value estimates, or that have already been selected frequently, will be selected with decreasing frequency over time.
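The selection rule and the three notes above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the name `ucb_select`, the list-based arguments, and the default `c=2.0` are assumptions made for the example.

```python
import math

def ucb_select(q_values, counts, t, c=2.0):
    """Return the action index chosen by the UCB rule at time step t (t >= 1).

    q_values[a] plays the role of Q_t(a); counts[a] plays the role of N_t(a).
    """
    # An action that has never been selected is treated as maximizing.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    # Value estimate plus the exploration bonus c * sqrt(ln t / N_t(a)).
    scores = [q + c * math.sqrt(math.log(t) / n)
              for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With a large `c`, a rarely tried action can win even if its estimate is slightly lower; with a small `c`, the rule behaves almost greedily.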

The idea of UCB action selection is that the square-root term $c\sqrt{\frac{\ln t}{N_t(a)}}$ is a measure of the uncertainty or variance in the estimate of $a$'s value. The quantity being maximized over is thus a sort of upper bound on the possible true value of action $a$. Each time $a$ is selected, $N_t(a)$ increases and the uncertainty is reduced. Conversely, each time an action other than $a$ is selected, $t$ increases while $N_t(a)$ does not, so the uncertainty estimate for $a$ increases.
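To make the two opposing effects concrete, the bonus term can be evaluated on its own. The helper name `ucb_bonus` and the sample values of `t` and `n` below are assumptions chosen only for illustration.

```python
import math

def ucb_bonus(t, n, c=2.0):
    """Exploration bonus c * sqrt(ln t / n) for an action selected n times by step t."""
    return c * math.sqrt(math.log(t) / n)

# Selecting the action more often (larger n, same t) shrinks the bonus.
shrinks = ucb_bonus(t=100, n=10) < ucb_bonus(t=100, n=5)

# Letting time pass without selecting it (larger t, same n) grows the bonus,
# but only logarithmically in t.
grows = ucb_bonus(t=200, n=5) > ucb_bonus(t=100, n=5)
```

Both `shrinks` and `grows` evaluate to `True`, matching the description above.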

Pros & Cons

UCB is more difficult than the ε-greedy method to extend beyond bandit problems.
UCB has difficulties in dealing with large state spaces and nonstationary problems…