
Overview

A derivation of Upper-Confidence-Bound Action Selection from the bandit algorithms in Chapter 2 of Reinforcement Learning by Richard S. Sutton and Andrew G. Barto.

Preliminaries

Markov's Inequality

For any r.v. (random variable) X and constant a > 0,

P(|X| \geq a) \leq \frac{E[|X|]}{a}

Prf:

P(|X| \geq a) = \int_{|x| \geq a} f(x)\,dx \leq \int_{|x| \geq a} \frac{|x|}{a} f(x)\,dx \leq \int \frac{|x|}{a} f(x)\,dx = \frac{1}{a}\int |x| f(x)\,dx = \frac{1}{a} E[|X|]
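
As a quick numeric sanity check (not part of the derivation), the sketch below compares both sides of Markov's Inequality on a sample; the Exp(1) distribution and the thresholds are arbitrary illustrative choices:

```python
import numpy as np

# Empirical check of P(|X| >= a) <= E[|X|] / a for X ~ Exp(1), so |X| = X.
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)

for a in [1.0, 2.0, 4.0]:
    lhs = np.mean(x >= a)   # empirical P(|X| >= a)
    rhs = x.mean() / a      # E[|X|] / a
    print(f"a={a}: P={lhs:.4f} <= bound={rhs:.4f}")
```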

Chebyshev's Inequality

Let X have mean \mu and variance \sigma^2. Then for any a > 0,

P(|X - \mu| \geq a) \leq \frac{\sigma^2}{a^2}

Prf:

P(|X - \mu| \geq a) = \int_{|x-\mu| \geq a} f(x)\,dx \leq \int \frac{|x-\mu|^2}{a^2} f(x)\,dx = \frac{1}{a^2}\mathrm{Var}(X) = \frac{\sigma^2}{a^2}
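
The same kind of empirical check works for Chebyshev's Inequality; the normal distribution with \mu = 3, \sigma = 2 below is an arbitrary illustrative choice:

```python
import numpy as np

# Empirical check of P(|X - mu| >= a) <= sigma^2 / a^2 for X ~ N(3, 2^2).
rng = np.random.default_rng(0)
mu, sigma = 3.0, 2.0
x = rng.normal(loc=mu, scale=sigma, size=100_000)

for a in [2.0, 4.0, 6.0]:
    lhs = np.mean(np.abs(x - mu) >= a)  # empirical P(|X - mu| >= a)
    rhs = sigma**2 / a**2               # sigma^2 / a^2
    print(f"a={a}: P={lhs:.4f} <= bound={rhs:.4f}")
```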

Chernoff's Inequality

For any r.v. X and constants a>0 and t>0,

P(X \geq a) \leq \frac{E[e^{tX}]}{e^{ta}}

Prf:

P(X \geq a) = P(e^{tX} \geq e^{ta}) \leq \frac{E[e^{tX}]}{e^{ta}}, by Markov's Inequality (since t > 0, x \geq a \Leftrightarrow e^{tx} \geq e^{ta}, and e^{tX} is a nonnegative r.v.).
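
Because the bound holds for every t > 0, one usually minimizes it over t. A small sketch, assuming a standard normal X (whose MGF is E[e^{tX}] = e^{t^2/2}, so the optimized bound is e^{-a^2/2} at t = a):

```python
import numpy as np

# Chernoff bound for X ~ N(0, 1): P(X >= a) <= min_t e^{t^2/2 - t a} = e^{-a^2/2}.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

a = 2.0
ts = np.linspace(0.1, 5.0, 500)
bounds = np.exp(ts**2 / 2 - ts * a)   # the bound as a function of t
print(f"P(X >= {a}) ~= {np.mean(x >= a):.4f}")
print(f"best Chernoff bound = {bounds.min():.4f} at t = {ts[np.argmin(bounds)]:.2f}")
```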

Hoeffding's Lemma

Let X be a r.v. with E[X] = 0 and a \leq X \leq b. Then for any \lambda \in \mathbb{R},

E[e^{\lambda X}] \leq e^{\frac{1}{8}\lambda^2 (b-a)^2}

Tools:

1. Jensen's Inequality:

For a convex function f and \lambda_1, \lambda_2 \geq 0 with \lambda_1 + \lambda_2 = 1,

f(\lambda_1 x_1 + \lambda_2 x_2) \leq \lambda_1 f(x_1) + \lambda_2 f(x_2).

2. Taylor's Theorem:

If all derivatives of f(x) exist at a point a,

f(x) = f(a) + f'(a)(x-a) + \frac{1}{2}f''(a)(x-a)^2 + \dots + \frac{1}{k!}f^{(k)}(a)(x-a)^k + \dots

Prf:

1. f(x) = e^{\lambda x} is a convex function, so for any \alpha \in (0, 1),

f(\alpha a + (1-\alpha) b) \leq \alpha f(a) + (1-\alpha) f(b) = \alpha e^{\lambda a} + (1-\alpha) e^{\lambda b}

For x \in [a, b], take \alpha = \frac{b-x}{b-a}, so that x = \alpha a + (1-\alpha) b and

f(x) = e^{\lambda x} \leq \frac{b-x}{b-a} e^{\lambda a} + \frac{x-a}{b-a} e^{\lambda b}.

So:

E[e^{\lambda X}] \leq E\left[\frac{b-X}{b-a}\right] e^{\lambda a} + E\left[\frac{X-a}{b-a}\right] e^{\lambda b} = \frac{b}{b-a} e^{\lambda a} - \frac{a}{b-a} e^{\lambda b},

using E[X] = 0 (we can always arrange E[X] = 0 by shifting X and adjusting a, b accordingly).

We need to find a function \varphi s.t.

E[e^{\lambda X}] \leq \frac{b}{b-a} e^{\lambda a} - \frac{a}{b-a} e^{\lambda b} = e^{\varphi(t)}\big|_{t=?}

2.

\varphi(t) = -\theta t + \ln(1 - \theta + \theta e^t), where \theta = -\frac{a}{b-a} \geq 0 (since E[X] = 0 forces a \leq 0 \leq b).

e^{\varphi(t)}\big|_{t=\lambda(b-a)} = e^{-\theta t}\left[1 - \theta + \theta e^t\right]\Big|_{t=\lambda(b-a)} = e^{-\theta\lambda(b-a)}\left[1 - \theta + \theta e^{\lambda(b-a)}\right]\Big|_{\theta=-\frac{a}{b-a}}

= e^{\lambda a}\left[\frac{b}{b-a} - \frac{a}{b-a} e^{\lambda(b-a)}\right] = \frac{b}{b-a} e^{\lambda a} - \frac{a}{b-a} e^{\lambda b},

So:

E[e^{\lambda X}] \leq e^{\varphi(\lambda(b-a))} \leq e^{\frac{1}{8}\lambda^2(b-a)^2} [by Taylor's Theorem, shown in step 3]

3.

By Taylor's Theorem with the Lagrange remainder, for some \xi between 0 and t,

\varphi(t) = \varphi(0) + \varphi'(0)\,t + \frac{1}{2}\varphi''(\xi)\,t^2.

\varphi(0) = 0,

\varphi'(t) = -\theta + \frac{\theta e^t}{1 - \theta + \theta e^t},

\varphi'(0) = 0,

\varphi''(t) = \frac{(1-\theta)\theta e^t}{(1 - \theta + \theta e^t)^2}.

Since \frac{xy}{(x+y)^2} \leq \frac{1}{4} for x, y \geq 0 (by AM-GM, (x+y)^2 \geq 4xy), set x = 1 - \theta and y = \theta e^t:

\varphi''(t) \leq \frac{1}{4},

\varphi(t) = 0 + 0 + \frac{1}{2}\varphi''(\xi)\,t^2 \leq \frac{1}{8}t^2,

E[e^{\lambda X}] \leq e^{\varphi(t)}\big|_{t=\lambda(b-a)} \leq e^{\frac{1}{8}\lambda^2(b-a)^2}

This proves Hoeffding's Lemma.
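
As a quick numeric sanity check of the lemma (not part of the proof), take a two-point r.v. with mean zero; the endpoints a = -0.3, b = 0.7 are illustrative assumptions:

```python
import numpy as np

# X takes the values a = -0.3 and b = 0.7; P(X = b) is chosen so that E[X] = 0.
a, b = -0.3, 0.7
p = -a / (b - a)   # P(X = b); then E[X] = (1 - p) * a + p * b = 0

for lam in [0.5, 1.0, 2.0, 4.0]:
    mgf = (1 - p) * np.exp(lam * a) + p * np.exp(lam * b)  # E[e^{lam X}]
    bound = np.exp(lam**2 * (b - a) ** 2 / 8)              # Hoeffding's Lemma
    print(f"lambda={lam}: E[e^(lam X)]={mgf:.4f} <= bound={bound:.4f}")
```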

Hoeffding Bound

Let the random variables X_1, X_2, \dots, X_n be independent with E[X_i] = \mu and a \leq X_i \leq b for each i = 1, 2, \dots, n, where a, b are constants. Then for any \varepsilon \geq 0,

P\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| > \varepsilon\right) \leq 2 e^{\frac{-2n\varepsilon^2}{(b-a)^2}}

Prf:

1.

z_i = X_i - E[X_i], \quad z = \frac{1}{n}\sum_{i=1}^{n} z_i = \frac{1}{n}\sum_{i=1}^{n} X_i - \mu

For arbitrary \lambda > 0, by Chernoff's Inequality, P(z \geq \varepsilon) \leq \frac{E[e^{\lambda z}]}{e^{\lambda\varepsilon}} = \frac{E\left[e^{\frac{\lambda}{n}\sum_{i=1}^{n} z_i}\right]}{e^{\lambda\varepsilon}}

Because X_1, X_2, \dots, X_n are independent, z_1, z_2, \dots, z_n are independent, so the expectation of the product factorizes:

= \frac{\prod_{i=1}^{n} E\left[e^{\frac{\lambda}{n} z_i}\right]}{e^{\lambda\varepsilon}} = e^{-\lambda\varepsilon} \prod_{i=1}^{n} E\left[e^{\frac{\lambda}{n} z_i}\right]

And:

E[z_i] = E[X_i - E[X_i]] = 0, so E\left[\frac{1}{n} z_i\right] = \frac{1}{n} E[z_i] = 0.

Since a \leq X_i \leq b, we have a - \mu \leq z_i \leq b - \mu: each z_i lies in an interval of the same width b - a and has mean zero, so Hoeffding's Lemma applies.

By Hoeffding's Lemma, E\left[e^{\frac{\lambda}{n} z_i}\right] \leq e^{\frac{1}{8}\lambda^2\left(\frac{b-a}{n}\right)^2} = e^{\frac{1}{8n^2}\lambda^2(b-a)^2}

P(z \geq \varepsilon) \leq e^{-\lambda\varepsilon} \cdot e^{\frac{1}{8n}\lambda^2(b-a)^2} (this holds for any \lambda > 0, so we may minimize over \lambda)

Construct a function: g(\lambda) = \frac{1}{8n}\lambda^2(b-a)^2 - \lambda\varepsilon

g'(\lambda) = \lambda\frac{(b-a)^2}{4n} - \varepsilon = 0 \Rightarrow \lambda = \frac{4n\varepsilon}{(b-a)^2}.

g''(\lambda) = \frac{(b-a)^2}{4n} > 0,

so g(\lambda) attains its minimal value at \lambda = \frac{4n\varepsilon}{(b-a)^2}, namely g\left(\frac{4n\varepsilon}{(b-a)^2}\right) = -\frac{2n\varepsilon^2}{(b-a)^2}.
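
A quick grid search confirms the closed-form minimizer; the values of n, \varepsilon, a, b below are arbitrary illustrative choices:

```python
import numpy as np

# Numeric check that g(lam) = lam^2 (b-a)^2 / (8n) - lam * eps is minimized
# at lam = 4 n eps / (b-a)^2 with minimum -2 n eps^2 / (b-a)^2.
n, eps, a, b = 50, 0.2, 0.0, 1.0
lam = np.linspace(0.01, 100.0, 200_000)
g = lam**2 * (b - a) ** 2 / (8 * n) - lam * eps

print(f"numeric argmin: {lam[np.argmin(g)]:.3f}")
print(f"closed form:    {4 * n * eps / (b - a) ** 2:.3f}")
print(f"g min: {g.min():.4f} vs closed form {-2 * n * eps**2 / (b - a) ** 2:.4f}")
```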

P(z \geq \varepsilon) \leq e^{g(\lambda)}\big|_{\lambda=\frac{4n\varepsilon}{(b-a)^2}} = e^{-\frac{2n\varepsilon^2}{(b-a)^2}}, i.e. P\left(\frac{1}{n}\sum z_i \geq \varepsilon\right) \leq e^{-\frac{2n\varepsilon^2}{(b-a)^2}}. Applying the same argument to -z_i gives P(z \leq -\varepsilon) \leq e^{-\frac{2n\varepsilon^2}{(b-a)^2}}.

So: P(|z| \geq \varepsilon) \leq 2 e^{-\frac{2n\varepsilon^2}{(b-a)^2}}.
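
A Monte Carlo sketch of the final bound, assuming X_i ~ Uniform[0, 1] (so \mu = 0.5, a = 0, b = 1):

```python
import numpy as np

# Empirical check of P(|mean - mu| > eps) <= 2 e^{-2 n eps^2 / (b-a)^2}.
rng = np.random.default_rng(0)
n, trials, eps = 100, 50_000, 0.1

x = rng.uniform(0.0, 1.0, size=(trials, n))
deviation = np.abs(x.mean(axis=1) - 0.5)   # |sample mean - mu| per trial
empirical = np.mean(deviation > eps)
bound = 2 * np.exp(-2 * n * eps**2)        # (b - a)^2 = 1 here
print(f"empirical={empirical:.4f} <= bound={bound:.4f}")
```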

Upper Confidence Bound (UCB)

Def (confidence interval): A 1-\delta confidence interval for a parameter p is an interval [\bar{p} - \epsilon, \bar{p} + \epsilon] s.t.

P(p \in [\bar{p} - \epsilon, \bar{p} + \epsilon]) \geq 1 - \delta \Leftrightarrow P(|\bar{p} - p| \leq \epsilon) \geq 1 - \delta.

Suppose we flip a coin N times, where the heads probability p is unknown, and use the N outcomes to estimate a range for p: \bar{p} = \frac{1}{N}(X_1 + X_2 + \dots + X_N).

Then P(|\bar{p} - p| \geq \epsilon) \leq \delta \Leftrightarrow P(|\bar{p} - p| \leq \epsilon) \geq 1 - \delta. By the Hoeffding Bound (here a = 0, b = 1, n = N), set P(|\bar{p} - p| \geq \epsilon) \leq 2 e^{-\frac{2N\epsilon^2}{(b-a)^2}} = 2 e^{-2N\epsilon^2} = \delta.

So: \epsilon = \sqrt{\frac{\ln(\frac{2}{\delta})}{2N}}; \epsilon is determined by N and \delta.

P\left(p \in \left[\bar{p} - \sqrt{\frac{\ln(\frac{2}{\delta})}{2N}},\ \bar{p} + \sqrt{\frac{\ln(\frac{2}{\delta})}{2N}}\right]\right) \geq 1 - \delta.
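
In code, the half-width \epsilon for a given N and \delta is a one-liner; the sample sizes and \delta = 0.05 below are illustrative choices:

```python
import math

# eps = sqrt(ln(2 / delta) / (2 N)): half-width of the 1 - delta
# confidence interval for the coin's heads probability after N flips.
def hoeffding_eps(n_flips: int, delta: float) -> float:
    return math.sqrt(math.log(2 / delta) / (2 * n_flips))

for n in [100, 1_000, 10_000]:
    print(f"N={n}: eps = {hoeffding_eps(n, delta=0.05):.4f}")  # 95% interval
```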

Bandit UCB Action Selection

A_t = \arg\max_a\left[Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\right], where Q_t(a) is the estimated action value and q^*(a) is the true action value.

We use Q_t(a) as the estimate, i.e. Q_t(a) \approx q^*(a).

Prf:

1.

A_t = \arg\max_a q^*(a) (ideal, but q^*(a) is unknown),

A_t = \arg\max_a Q_t(a) (greedy on the estimates),

A_t = \arg\max_a\left[Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\right] (UCB),

where Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i=a}} = \frac{R_1^a + R_2^a + \dots + R_{N_t(a)}^a}{N_t(a)}

2. Hoeffding Bound

Suppose L \leq R_i^a \leq M, and define d = M - L. By the (one-sided) Hoeffding Bound,

P(q^*(a) - Q_t(a) \geq \varepsilon) \leq e^{-\frac{2N_t(a)\varepsilon^2}{d^2}} = \delta.

So: \epsilon = \sqrt{\frac{d^2\ln(\frac{1}{\delta})}{2N_t(a)}} = \frac{d}{\sqrt{2}}\sqrt{\frac{\ln(\frac{1}{\delta})}{N_t(a)}},

Setting \delta = t^{-4} gives \epsilon = d\sqrt{2}\sqrt{\frac{\ln t}{N_t(a)}}.

3.

P\left(q^*(a) - Q_t(a) \geq c\sqrt{\frac{\ln t}{N_t(a)}}\right) \leq t^{-4} \Rightarrow q^*(a) \leq Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}} with confidence 1 - t^{-4}, where the constant d\sqrt{2} has been absorbed into c.

4.

A_t = \arg\max_a\left[Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\right]. As t grows, this upper bound becomes more and more accurate. When N_t(a) = 0, the second term is infinite, so action a is certain to be selected; this guarantees every action is tried at least once. c controls the degree of exploration: the larger c is, the more likely rarely-selected actions are to be chosen.
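
To make the rule concrete, here is a minimal runnable sketch of UCB on a 10-armed testbed; the Gaussian reward model and c = 2 are illustrative assumptions, not prescribed by the derivation:

```python
import numpy as np

# UCB action selection: A_t = argmax_a [Q_t(a) + c * sqrt(ln t / N_t(a))].
rng = np.random.default_rng(0)
k, steps, c = 10, 2000, 2.0
q_star = rng.standard_normal(k)   # true action values q*(a), an assumption

Q = np.zeros(k)                   # action-value estimates Q_t(a)
N = np.zeros(k, dtype=int)        # selection counts N_t(a)

for t in range(1, steps + 1):
    if np.any(N == 0):
        a = int(np.argmin(N))     # N_t(a) = 0: the bonus is infinite, try arm once
    else:
        a = int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
    r = q_star[a] + rng.standard_normal()  # reward ~ N(q*(a), 1)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]     # incremental sample-average update

print(f"best arm: {int(np.argmax(q_star))}, most-pulled arm: {int(np.argmax(N))}")
```

With these settings the most-pulled arm typically coincides with the true best arm, since the exploration bonus shrinks like \sqrt{\ln t / N_t(a)} as an arm is sampled more often.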

