Overview
Games have begun to attract the attention of AI researchers (for example Libratus recently beating top human Texas Hold'em players [1], AlphaGo, and so on). The reason is straightforward: many games invented by humans involve logical reasoning, probability, and game-theoretic computation, and researchers hope to use computers to challenge or simulate mental abilities once thought to be exclusively human. We will not debate here whether these attempts and innovations truly amount to a new AI revolution; instead we discuss some common problems that arise when solving games with computers, and present a few solutions that can serve as references.
The exploration & exploitation problem. Its simplest setting might look like this: you move to a new neighborhood with more than 20 restaurants to choose from (ramen, local food, Spanish, Japanese, and so on). If you order a dish you like, your mood for the day rises by up to +5 (the more you like it, the bigger the boost); if you order a dish you dislike, your mood drops by as much as -5 (again depending on how much you dislike it). Assuming every order costs the same, you obviously want each meal to be your favorite food, yet you know nothing about any restaurant's dishes or flavors. How should you balance trying new restaurants and dishes against sticking with known ones, so that over the long run you stay in a good mood and keep that mood as high as possible?
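Stated slightly more formally (this framing is mine, not the original post's): writing $r_t \in [-5, 5]$ for the mood change on day $t$ and $a_t$ for the restaurant/dish chosen that day, the goal is to choose the sequence of $a_t$ so as to maximize the expected cumulative mood over a long horizon $T$:

$$ \max_{a_1, \dots, a_T} \; \mathbb{E}\Big[\sum_{t=1}^{T} r_t(a_t)\Big] $$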
Strategies
The simplest strategy is to pick restaurants and dishes completely at random until you hit your favorite, and from then on always order that dish from that restaurant. (This is a bit unrealistic, since you would eventually get tired of it, and there is a deeper problem: you never know how good the best dish could make you feel, so how do you make sure you keep sampling at random until you have actually found it?) We call this the random strategy (Random strategy).
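In the code at the end of this post, this is realized as "explore uniformly for a fixed number of rounds, then commit". Written out (my notation, with $T_{\text{switch}}$ matching the `T_switch` parameter of the `StrategyRandom` class below and $\hat{R}(a)$ the total reward collected so far from option $a$):

$$ a_t \sim \mathrm{Uniform}(\{1,\dots,N\}) \ \text{for } t \le T_{\text{switch}}, \qquad a_t = \arg\max_{a} \hat{R}(a) \ \text{for } t > T_{\text{switch}} $$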
Another suggestion is to choose based on the history of good and bad moods each restaurant has given us, i.e., its empirical expectation: each time we pick the restaurant with the highest expectation and keep going until it has been thoroughly explored, we are tired of it, and our mood starts to drop, at which point we switch to the restaurant with the next highest expectation. This strategy has problems of its own, for example how to avoid getting tired of a reasonably good restaurant too early: as long as a restaurant is generally good its expectation stays positive, so we keep going back until we are bored of it, and only then do we switch and perhaps find the dish we actually like best. We call this the greedy strategy (Greedy strategy); it is somewhat similar in spirit to the Gittins index [2].
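As a selection rule (my notation), with $n_t(a)$ the number of visits to restaurant $a$ and $\hat{R}_t(a)$ the total mood change it has produced so far:

$$ \hat{Q}_t(a) = \frac{\hat{R}_t(a)}{n_t(a)}, \qquad a_t = \arg\max_{a} \hat{Q}_t(a) $$

(Note that the `StrategyGreedy` class at the end of the post actually ranks by the total $\hat{R}_t(a)$ rather than the per-visit average $\hat{Q}_t(a)$.)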
To address the problems above, some suggest making the probability of choosing a restaurant proportional to your expectation of it (i.e., your tasting history). That way you still get chances to check whether some other dish out there is actually your favorite, while keeping your mood reasonably high.
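The corresponding selection rule (my notation), where $w_t(a)$ is the accumulated score of restaurant $a$ and the weights must be non-negative, which is why the `StrategyExpection` class below starts every score at 1 and passes the scores through a `normalize` helper:

$$ P(a_t = a) = \frac{w_t(a)}{\sum_{a'} w_t(a')} $$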
To compare the practical usefulness of these three strategies in a realistic setting, and to make the results generally applicable, we borrow the classic multi-armed bandit problem [3]. Casinos in the US and Japan commonly have slot machines (also called one-armed bandits). Suppose a gambler walks into a casino with a row of slot machines and knows that each machine has a different payout distribution. If the gambler wants to maximize his winnings, how should he choose which machines to play?
Let us simplify the problem with a program. Suppose the casino has N = 20 machines and each machine's possible rewards lie in Ri ∈ [−10, 10], i.e., a pull can pay anywhere from +10 down to −10 (+10 means the gambler wins $10, −10 means he has to pay $10). Suppose the gambler knows that each machine has L = 10 possible reward outcomes, but that every machine's reward distribution is different and unknown. (The problem is simplified here; a follow-up post will cover the case where each machine has multiple internal states.) Under the three strategies above, how will the gambler's final returns differ?
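The code excerpt below only contains the strategy classes, not the simulated machines. A minimal sketch of the environment we assume (the class name `BanditMachine` and its interface are illustrative, not from the original code):

```python
import random

class BanditMachine(object):
    """One simulated slot machine: L possible integer rewards in [low, high],
    each paid out with a fixed (but unknown to the gambler) probability."""

    def __init__(self, L=10, low=-10, high=10):
        # L reward values and a random probability distribution over them
        self.rewards = [random.randint(low, high) for _ in range(L)]
        weights = [random.random() for _ in range(L)]
        total = sum(weights)
        self.probs = [w / total for w in weights]

    def pull(self):
        """Sample one reward according to this machine's distribution."""
        p, acc = random.random(), 0.0
        for prob, reward in zip(self.probs, self.rewards):
            acc += prob
            if acc >= p:
                return reward
        return self.rewards[-1]

    def expectation(self):
        """True expected reward of this machine (used only for analysis)."""
        return sum(p * r for p, r in zip(self.probs, self.rewards))
```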
Experimental results
Below are the returns produced by the three strategies. The total number of rounds is 10,000,000; the random strategy switches to exploiting its best-known machine after 0.1 × 10,000,000 = 1,000,000 random rounds. The final return curves are shown in the figure below.
In the figure, the random strategy is blue, the greedy strategy light green, and the expectation-proportional strategy red; the y-axis is on a log scale. Final returns: random strategy 34,709,726; expectation-proportional strategy 32,454,248; greedy strategy 17,197,353.
Conclusions
- The random strategy performs best. Looking at the experiment data this is understandable: the total unknown space is 20 × 10 = 200 (machine, outcome) pairs, so the first 1,000,000 random pulls explore it thoroughly and the pure exploitation phase afterwards pays off. If the number of possible outcomes per machine were so large that 1,000,000 pulls could not cover it, the result would certainly look different (interested readers can try this themselves).
- The expectation-proportional strategy comes second; its return is close to that of the optimal policy (see the posterior data analysis below) and it is stable. How it behaves when the search space grows dramatically, however, is a separate question.
Posterior data analysis
Below is part of the machine state data from the experiment: each machine's probability distribution over its reward values and the resulting expectation. Given this state data, the optimal policy (always playing the best machine) would yield 3.7384 × 10,000,000 = 37,384,000; a quick check of this figure appears after the table.
Probability distribution | Rewards | Expectation |
--- | --- | --- |
0.0465,0.0127,0.1704,0.0561,0.0542,0.0800,0.1424,0.2025,0.0685,0.1667 | 10,-3,5,-4,-10,-4,-3,4,-4,4 | 0.9681 |
0.0989,0.1604,0.0439,0.1364,0.0529,0.1556,0.0655,0.0975,0.1501,0.0388 | -2,9,-3,-1,0,3,10,7,-4,-5 | 1.9876 |
0.0949,0.1092,0.1596,0.1322,0.0951,0.0948,0.0654,0.0886,0.1458,0.0144 | 6,5,-1,-2,-8,-6,5,10,10,6 | 2.1192 |
0.0088,0.2207,0.1977,0.1189,0.0535,0.0305,0.0232,0.0608,0.1841,0.1017 | 2,10,-4,5,5,-8,3,6,1,2 | 2.8737 |
0.1355,0.0153,0.1410,0.0974,0.1604,0.1586,0.1174,0.0607,0.0740,0.0398 | -6,-1,-8,-2,8,0,-8,-3,-3,-8 | -2.5296 |
0.0681,0.1103,0.1173,0.0912,0.0845,0.1179,0.0889,0.0991,0.1356,0.0871 | 10,4,4,-6,-4,-9,4,8,9,-3 | 1.7526 |
0.0042,0.0712,0.0910,0.2096,0.1495,0.0741,0.0396,0.1423,0.0575,0.1610 | 4,-1,4,-10,4,-1,-1,-4,2,6 | -0.7903 |
0.0059,0.1588,0.2242,0.1104,0.1474,0.0863,0.0486,0.0005,0.0986,0.1192 | 1,10,-3,4,10,10,7,-9,-6,1 | 3.5632 |
0.0672,0.1684,0.1088,0.0554,0.1513,0.0590,0.1459,0.0721,0.1126,0.0593 | 6,7,9,-6,-1,8,4,7,3,-4 | 3.7384 |
0.0656,0.0038,0.0823,0.0049,0.1432,0.0959,0.1555,0.1625,0.1639,0.1225 | -10,-5,10,4,4,10,7,9,1,-6 | 3.6793 |
0.1055,0.0119,0.0090,0.1513,0.1544,0.0567,0.1174,0.0929,0.1040,0.1970 | -3,-4,-1,-4,7,2,2,3,1,-3 | 0.2424 |
0.1010,0.1884,0.0644,0.0676,0.1187,0.1186,0.1203,0.0263,0.1589,0.0357 | 7,5,-1,5,-5,-4,-4,1,-6,-4 | -0.6964 |
0.1511,0.1606,0.0113,0.0898,0.0454,0.0911,0.0944,0.1051,0.1329,0.1184 | 1,3,6,5,3,2,4,3,2,-2 | 2.19 |
0.0992,0.0025,0.1160,0.0849,0.0053,0.0587,0.2315,0.1699,0.0123,0.2197 | 9,6,-3,-5,10,2,-2,-8,4,10 | 0.7297 |
0.1161,0.0412,0.1207,0.1272,0.0006,0.1194,0.1049,0.1108,0.1319,0.1272 | 0,1,-9,5,4,0,-9,-1,2,-9 | -2.3426 |
0.0296,0.0868,0.0511,0.0028,0.1839,0.0294,0.1794,0.0947,0.1595,0.1827 | 1,-4,9,10,10,9,-1,-1,6,-6 | 1.8606 |
0.1351,0.1553,0.1885,0.0131,0.1694,0.1132,0.0086,0.0795,0.1117,0.0256 | -1,4,-3,7,-9,9,-8,-7,-1,10 | -0.9745 |
0.0657,0.1514,0.1413,0.1079,0.0896,0.0913,0.1265,0.0517,0.0357,0.1389 | 7,-5,4,7,-1,-9,2,4,1,9 | 1.8577 |
0.1809,0.1780,0.1220,0.1791,0.0578,0.0477,0.0860,0.0526,0.0206,0.0753 | -6,7,4,-1,-1,10,2,8,8,1 | 1.7216 |
0.0179,0.1741,0.1225,0.0242,0.1265,0.1408,0.0834,0.0973,0.1052,0.1081 | -7,-5,-9,7,7,-2,-9,7,-8,10 | -1.1551 |
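As a quick sanity check of the optimal-return figure quoted above, the expectation of the best machine (the row with expectation 3.7384) can be recomputed directly from the table:

```python
# Probabilities and rewards of the best machine, copied from the row above
probs   = [0.0672, 0.1684, 0.1088, 0.0554, 0.1513, 0.0590, 0.1459, 0.0721, 0.1126, 0.0593]
rewards = [6, 7, 9, -6, -1, 8, 4, 7, 3, -4]

expectation = sum(p * r for p, r in zip(probs, rewards))
print(expectation)                             # -> 3.7384
print(round(expectation * 10 * 1000 * 1000))   # ~ 37384000: return of always playing this machine
```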
Code (excerpt)
# -*- encoding: utf-8 -*-
import re
import sys
import random
# normalize() used below is assumed to come from the author's utils module and
# to rescale a list of non-negative weights so that they sum to 1.
from utils import *
class Strategy(object):
    """Base class for a bandit strategy: keeps the available actions,
    the interaction history and the total reward collected so far.
    Actions are assumed to be the integers 0..N-1."""
    def __init__(self, actions):
        self.actions = actions
        self.length = len(self.actions)
        self.history = ""
        self.rewards = 0

    def __str__(self):
        return "baseStrategy"
class StrategyRandom(Strategy):
    """Random strategy: explore uniformly at random for T_switch rounds,
    then always play the action with the largest accumulated reward."""
    def __init__(self, actions, T_switch=1000000):
        self.actions = actions
        self.length = len(self.actions)
        self.history = ""
        self.rewards = 0
        # accumulated reward per action
        self.action2rewards = dict()
        for i in actions:
            self.action2rewards[i] = 0
        self.T_switch = T_switch
        self.T = 0

    def __str__(self):
        return "randomStrategy"

    def nextStep(self):
        action = 0
        if self.T < self.T_switch:
            # exploration phase: pick an action uniformly at random
            action = random.randint(0, self.length - 1)
        else:
            # exploitation phase: pick the action with the largest accumulated
            # reward (start from -inf so all-negative totals are handled too)
            cur_max = float("-inf")
            for ele in self.action2rewards:
                if self.action2rewards[ele] >= cur_max:
                    action = ele
                    cur_max = self.action2rewards[ele]
        self.T += 1
        return action

    def updateHistory(self, action, rewards):
        """Record the reward obtained by playing `action`."""
        self.rewards += rewards
        self.action2rewards[action] += rewards

    def updateStrategy(self):
        pass
class StrategyGreedy(Strategy):
    """Greedy strategy: always play the action whose accumulated reward is
    currently the largest (note: total reward, not the per-pull average)."""
    def __init__(self, actions):
        self.history = ""
        self.actions = actions
        self.rewards = 0
        # historyState[action] = [accumulated reward, number of pulls]
        self.historyState = dict()
        for action in self.actions:
            self.historyState[action] = [0, 0]

    def __str__(self):
        return "strategyGreedy"

    def nextStep(self):
        # pick the action with the largest accumulated reward
        greedy_action = 0
        greedy_reward = float("-inf")
        for action in self.historyState:
            if self.historyState[action][0] >= greedy_reward:
                greedy_action = action
                greedy_reward = self.historyState[action][0]
        return greedy_action

    def updateHistory(self, action, rewards):
        """Record the reward obtained by playing `action`."""
        self.historyState[action][0] += rewards
        self.historyState[action][1] += 1
        self.rewards += rewards

    def updateStrategy(self):
        pass
class StrategyExpection(Strategy):
    """Expectation-proportional strategy: play each action with probability
    proportional to its accumulated reward (scores start at 1 so that every
    action has a non-zero chance of being tried)."""
    def __init__(self, actions):
        self.history = ""
        self.actions = actions
        self.rewards = 0
        # historyState[action] = [accumulated reward (starting at 1), number of pulls]
        self.historyState = dict()
        for action in self.actions:
            self.historyState[action] = [1, 1]

    def __str__(self):
        return "strategyExpection"

    def nextStep(self):
        # build a selection distribution proportional to the accumulated rewards
        # (normalize from utils is assumed to rescale the weights to sum to 1)
        prob = []
        for action in self.actions:
            prob.append(self.historyState[action][0])
        prob = normalize(prob)
        # sample an action from that distribution
        p = random.random()
        _v = 0
        for i in range(len(prob)):
            _v += prob[i]
            if _v >= p:
                return self.actions[i]
        return self.actions[0]

    def updateHistory(self, action, rewards):
        """Record the reward obtained by playing `action`."""
        self.historyState[action][0] += rewards
        self.historyState[action][1] += 1
        self.rewards += rewards

    def updateStrategy(self):
        pass
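The driver that produced the numbers above is not part of the excerpt. A minimal sketch of how it might look (the function name `run_experiment` and the `BanditMachine` class come from my earlier sketch, not from the original code; it assumes actions are simply the machine indices 0..N-1 and uses the parameters from the experiment description):

```python
def run_experiment(machines, strategy, rounds=10 * 1000 * 1000):
    """Let `strategy` play the given machines for `rounds` pulls
    and return the total reward it collected.
    10M rounds as in the experiment above; use a smaller number for a quick test."""
    for _ in range(rounds):
        action = strategy.nextStep()        # which machine to pull
        reward = machines[action].pull()    # observe the reward
        strategy.updateHistory(action, reward)
        strategy.updateStrategy()
    return strategy.rewards

N = 20
machines = [BanditMachine(L=10) for _ in range(N)]
actions = list(range(N))

for strategy in (StrategyRandom(actions, T_switch=1000000),
                 StrategyGreedy(actions),
                 StrategyExpection(actions)):
    total = run_experiment(machines, strategy)
    print("%s: %d" % (strategy, total))
```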
TO BE CONTINUED
Obviously, a machine with a single fixed distribution does not match most real-world situations (a machine may have multiple internal states, with transitions between them), and the values of the distribution are unknown as well (i.e., L itself is unknown). These topics will be discussed in follow-up posts.