RNN Walkthrough: The RNN and Its Code Flow


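I am 紧张枕头, a blogger on 靠谱客. This post walks through the RNN and its code flow; I am sharing it here in the hope that it serves as a useful reference.

RNN and Its Code Flow

This post focuses on the overall flow of an RNN rather than on the derivation of backpropagation (BP).

What is an RNN

RNN stands for Recurrent Neural Network.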


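Why do we need RNNs?

An ordinary (feedforward) neural network handles each input in isolation: the previous input and the next one have no connection to each other. Some tasks, however, involve sequences, where earlier inputs and later inputs are related.

For example, to understand a sentence it is not enough to understand each word in isolation; we have to process the whole sequence that the words form. Likewise, when processing video we cannot analyze each frame on its own; we have to analyze the sequence that the frames form.

A simple example: part-of-speech tagging

Input: "我吃苹果" ("I eat apples")
Output: "我/n 吃/v 苹果/n"
Clearly, the "苹果" that follows "吃" is more likely to be a noun than a verb.
Conclusion: the information at one position is influenced by the information before it.
An RNN is a neural network that can retain information from earlier time steps.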

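Basic structure of an RNN (recurrent neural network)

Note that in the figure below, the self-looping model on the left is the actual RNN structure. The model on the right is the same RNN unrolled over time steps, which makes it easier to follow.

[Figure: an RNN drawn as a self-loop (left) and unrolled over time steps (right)]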

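The hidden layer retains the information from the previous time step and passes it on to the next one.
Look at the right half of the figure above:
- $x_t$ is the input at time step $t$. $s_t$ is the hidden state kept at time step $t$, that is, the information retained from the earlier time steps; it is passed on to time step $t + 1$.
- $o_t$ is the output at time step $t$.
- $W$, $U$, $V$ are weight matrices.
- $o_t = g\left(V \cdot s_t\right)$
- $s_t = f\left(U \cdot x_t + W \cdot s_{t-1}\right)$
Now back to the left half:
- A time step is simply the first, second, ..., $t$-th time the model runs.
- Each time the model runs, it receives an input $x$ together with the information left over from earlier runs, i.e. the hidden state $s$. It then updates the hidden state with $s \leftarrow f(U \cdot x + W \cdot s)$, which adds the current input to the past information, and passes the result on to the next time step.
- The hidden state at the current step combines the earlier information with the current input, so it can determine the output. The output of the current step is $o = g(V \cdot s)$.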
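
The two update equations above can be sketched in a few lines of NumPy. This is only an illustration: the choice of `tanh` for $f$, the identity for $g$, the layer sizes, and the helper name `rnn_step` are all assumptions made for the example, not something fixed by the equations.

```python
import numpy as np

def rnn_step(x_t, s_prev, U, W, V):
    """One time step: update the hidden state, then compute the output."""
    s_t = np.tanh(U @ x_t + W @ s_prev)  # s_t = f(U x_t + W s_{t-1}), with f = tanh (assumed)
    o_t = V @ s_t                        # o_t = g(V s_t), with g = identity (assumed)
    return s_t, o_t

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2  # illustrative sizes

U = rng.normal(scale=0.1, size=(hidden_size, input_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(output_size, hidden_size))

# A toy sequence of five input vectors
xs = [rng.normal(size=(input_size, 1)) for _ in range(5)]

s = np.zeros((hidden_size, 1))  # initial hidden state
outputs = []
for x_t in xs:
    # The same U, W, V are reused at every step; only x_t and s change
    s, o_t = rnn_step(x_t, s, U, W, V)
    outputs.append(o_t)
```

The `for` loop is the "unrolled" right half of the figure, while `rnn_step` on its own is the self-looping left half.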

Testing the model (a sampling function that uses the model to generate sentences)

The sample function
```
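# Relies on module-level names: np (NumPy), vocab_size, Wxh, Whh, Why, bh, by
def sample(h, seed_ix, n):
  """Sample a sequence of n character indices from the model."""
  # h is the hidden state, i.e. the information left by earlier time steps
  # vocab_size is the number of distinct characters. Each character owns one
  # position in x: that position is 1 if the character is present, 0 otherwise
  x = np.zeros((vocab_size, 1))
  x[seed_ix] = 1 # seed_ix is a single integer index
  print("seed_ix:%s" % seed_ix)
  ixes = []
  for t in range(n): # draw n indices in total
    h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh) # update the hidden state
    y = np.dot(Why, h) + by # scores for the next character
    p = np.exp(y) / np.sum(np.exp(y)) # softmax: scores -> probabilities
    ix = np.random.choice(range(vocab_size), p=p.ravel()) # sample an index
    x = np.zeros((vocab_size, 1)) # one-hot encode the sample as the next input
    x[ix] = 1
    ixes.append(ix)
  return ixes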
```
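- Purpose: given the index of one character, use the current RNN to generate a whole sentence starting from that character, and return the list of indices of the generated characters
- Inputs
  - h: the hidden state, i.e. the information left by earlier time steps
  - seed_ix: an index, namely the index of the character we feed in
  - n: the sentence length (the number of character indices to generate)
- Walkthrough 1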
```
  x = np.zeros((vocab_size, 1))
  x[seed_ix] = 1 # one-hot vector of the character at the input index
  print("seed_ix:%s" % seed_ix)
  ixes = []
```
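  - vocab_size is the number of distinct characters. Each character is turned into a vector: every character owns one position in x, which is 1 if that character is present and 0 otherwise
  - In other words, this is a one-hot encoding
- Walkthrough 2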
```
  for t in range(n): # draw n indices in total
    h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
    y = np.dot(Why, h) + by
    p = np.exp(y) / np.sum(np.exp(y))
    ix = np.random.choice(range(vocab_size), p=p.ravel())
    x = np.zeros((vocab_size, 1))
    x[ix] = 1
    ixes.append(ix) # store the sampled index in the ixes list
```
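  - h is the hidden state
  - Note: h differs at each t, because it is updated on every iteration
  - y: the score vector; each entry is the score of the character whose index matches that entry's position
  - p: softmax turns the scores into probabilities
  - ix: an index drawn at random according to the probabilities in p
  - x is reset to the one-hot encoding of the sampled character, ready for the next time step
  - Together, these steps implement the sampling loop: feed in a character, sample the next one, and feed that one back in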
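
The three building blocks of this loop (one-hot encoding, softmax, and sampling from the resulting distribution) can be tried in isolation. The sketch below is a standalone illustration; the helper names `one_hot` and `softmax` and the toy score vector are assumptions made for the example. It also subtracts the maximum score before exponentiating, which leaves the probabilities unchanged but avoids overflow when scores are large; the `sample` function in this post does not do this.

```python
import numpy as np

def one_hot(index, size):
    """Column vector of zeros with a single 1 at position `index`."""
    v = np.zeros((size, 1))
    v[index] = 1
    return v

def softmax(y):
    """Turn a column of scores into probabilities (numerically stable form)."""
    e = np.exp(y - np.max(y))  # shifting by the max does not change the result
    return e / np.sum(e)

rng = np.random.default_rng(0)

scores = np.array([[2.0], [1.0], [0.1]])  # toy score vector for 3 characters
p = softmax(scores)                       # probabilities that sum to 1
ix = rng.choice(len(p), p=p.ravel())      # draw one index according to p
x_next = one_hot(ix, len(p))              # input for the next time step

# Very large scores would overflow in exp() without the shift
p_big = softmax(np.array([[1000.0], [1000.0]]))
```

Because the index is sampled rather than taken as the arg-max, repeated calls with the same seed character can produce different sentences.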

lossFun (the loss function)

[Figure missing in the source: it showed the inputs [a, b, c, d] and their targets [a', b', c', d']]

lossFun
- This section walks through the flow in detail; it does not go into the gradient computations of BP
```
# Relies on module-level names: np (NumPy), vocab_size, Wxh, Whh, Why, bh, by
def lossFun(inputs, targets, hprev):
  """
  inputs,targets are both list of integers.
  hprev is Hx1 array of initial hidden state
  returns the loss, gradients on model parameters, and last hidden state
  """
  xs, hs, ys, ps = {}, {}, {}, {}
  hs[-1] = np.copy(hprev)
  loss = 0
  # forward pass
  for t in range(len(inputs)): # inputs is a list of ints
    # encode the element as a vector
    xs[t] = np.zeros((vocab_size,1)) # start from the zero vector
    # print("input: %s" % inputs[t])
    # print("type: %s" % type(inputs[t]))
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
    ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
    loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)

  # backward pass (BP): compute the gradients
  dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
  dbh, dby = np.zeros_like(bh), np.zeros_like(by)
  dhnext = np.zeros_like(hs[0])
  for t in reversed(range(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1 # backprop into y. see http://cs231n.github.io/neural-networks-case-study/#grad if confused here
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)
  for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
  return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]
```
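- Takes a sequence of characters to train on, and returns the loss, the gradients of all the parameters, and the last hidden state
- Inputs:
  - inputs: the input side of the training data, i.e. [a, b, c, d] in the figure above
  - targets: the correct results the model should produce for inputs, i.e. [a', b', c', d'] in the figure above
  - hprev: an H-by-1 vector holding the initial hidden state for this sequence, one value per hidden unit
- Walkthrough 1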
```
  xs, hs, ys, ps = {}, {}, {}, {}
  hs[-1] = np.copy(hprev)
  loss = 0
```
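  - Each input character trains the model at a different time step. For example:
    - inputs = [a, b, c, d]
    - a trains the model at t = 1
    - b trains the model at t = 2
    - c trains the model at t = 3
    - d trains the model at t = 4
  - This piece of code only does initialization: hs[-1] stores the incoming hidden state, so that hs[t-1] is already defined at the first time step
- Walkthrough 2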
```
for t in range(len(inputs)): # inputs is a list of ints
    # encode the element as a vector
    xs[t] = np.zeros((vocab_size,1)) # start from the zero vector
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
    ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
    loss += -np.log(ps[t][targets[t],0]) # probability of the correct character: the higher it is, the smaller the loss
```
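  - Note: while iterating over inputs, the loop index doubles as the time step. hs[t-1] changes on every iteration; it holds the information from the earlier time steps
  - Iterate over each input character
  - Build the one-hot encoding of the character
  - Compute the hidden state at the current time step
  - Note: the parameter matrices are the same at every time step. In essence a single model is being trained; only the input character and the hidden state differ
  - Compute the output (score vector) at each time step, i.e. the output for each character
  - Use softmax to turn the score vector into a probability vector
  - Accumulate the loss contributed by each character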
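
Two facts about this loss are easy to check numerically. First, a model that assigns equal probability to every character pays `log(vocab_size)` per character, which is where the initial `smooth_loss` value in the main loop comes from. Second, the gradient of the loss with respect to the scores is the probability vector with 1 subtracted at the target index, which is what `dy[targets[t]] -= 1` in `lossFun` computes. The sketch below is a standalone check; the helper names and the toy sizes are assumptions made for the example.

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / np.sum(e)

def cross_entropy(y, target):
    """Loss of one time step: minus the log-probability of the target index."""
    return -np.log(softmax(y)[target, 0])

vocab_size, seq_length = 5, 25  # illustrative sizes

# 1) All-zero scores give uniform probabilities, so each step costs log(vocab_size)
uniform_loss = sum(cross_entropy(np.zeros((vocab_size, 1)), 0)
                   for _ in range(seq_length))

# 2) Gradient with respect to the scores: p - one_hot(target)
rng = np.random.default_rng(0)
y = rng.normal(size=(vocab_size, 1))
target = 2

analytic = softmax(y)
analytic[target] -= 1  # same operation as dy[targets[t]] -= 1

# Central finite differences as an independent check
eps = 1e-6
numeric = np.zeros_like(y)
for i in range(vocab_size):
    y_plus, y_minus = y.copy(), y.copy()
    y_plus[i] += eps
    y_minus[i] -= eps
    numeric[i] = (cross_entropy(y_plus, target)
                  - cross_entropy(y_minus, target)) / (2 * eps)
```

Here `uniform_loss` agrees with `seq_length * np.log(vocab_size)`, and `analytic` agrees with `numeric` up to finite-difference error.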

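The main loop

Within one training pass, the parameter matrices are the same at every time step, because in essence a single model is trained on inputs that arrive in temporal order
Each iteration trains the model on a chunk of seq_length characters
The detailed steps are marked in the code comments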
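
The training loop in this section uses several module-level names that the post never defines: `data`, `char_to_ix`, `ix_to_char`, `vocab_size`, `hidden_size`, `seq_length`, `learning_rate`, and the parameters `Wxh`, `Whh`, `Why`, `bh`, `by`. A minimal setup sketch follows. The toy corpus, the hyperparameter values, and the weight scale are assumptions chosen for illustration; a real run would read a text file instead.

```python
import numpy as np

# Hypothetical toy corpus standing in for a real text file
data = "hello world, hello rnn. " * 20
chars = sorted(set(data))
data_size, vocab_size = len(data), len(chars)

# Lookup tables between characters and integer indices
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

# Hyperparameters (illustrative values, not taken from this post)
hidden_size = 100     # number of hidden units
seq_length = 25       # characters per training chunk
learning_rate = 1e-1

# Model parameters: small random weights, zero biases
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # input -> hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden
Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output
bh = np.zeros((hidden_size, 1))                         # hidden bias
by = np.zeros((vocab_size, 1))                          # output bias

# One training chunk starting at position p: targets are inputs shifted by one
p = 0
inputs = [char_to_ix[ch] for ch in data[p:p + seq_length]]
targets = [char_to_ix[ch] for ch in data[p + 1:p + seq_length + 1]]
```

Since `targets` is `inputs` shifted by one character, the model is trained at every position to predict the character that comes next.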

```
n, p = 0, 0
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by) # memory variables for Adagrad
smooth_loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0

while True:
  if p + seq_length + 1 >= len(data) or n == 0: # reset at the very start and whenever the data runs out
    hprev = np.zeros((hidden_size,1)) # reset the hidden state
    p = 0 # point back at the start of the data

  # targets is inputs shifted forward by one character
  inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]] # a list of indices
  targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]] # a list of indices

  # sampling is only a spot check of the model, done every 100 iterations
  if n % 100 == 0:
    sample_ix = sample(hprev, inputs[0], 200) # indices of the sampled characters
    txt = ''.join(ix_to_char[ix] for ix in sample_ix) # map the indices back to characters
    print('----\n %s \n----' % (txt, ))

  # compute the loss and the gradients on one chunk of seq_length characters
  loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
  smooth_loss = smooth_loss * 0.999 + loss * 0.001
  if n % 100 == 0: print('iter %d, loss: %f' % (n, smooth_loss)) # report every 100 iterations

  # parameter update
  for param, dparam, mem in zip([Wxh, Whh, Why, bh, by],
                                [dWxh, dWhh, dWhy, dbh, dby],
                                [mWxh, mWhh, mWhy, mbh, mby]):
    mem += dparam * dparam # element-wise square; mem is the running sum of squared gradients

    # the effective learning rate shrinks as training goes on
    param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update

  p += seq_length # move p to the start of the next chunk
  n += 1 # iteration counter
```
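
The Adagrad update at the end of the loop can be seen in isolation on a toy problem. The sketch below minimizes $f(w) = \frac{1}{2}\lVert w \rVert^2$, whose gradient is simply $w$; the helper name `adagrad_update`, the starting point, and the number of steps are assumptions made for the example.

```python
import numpy as np

def adagrad_update(param, grad, mem, learning_rate=0.1, eps=1e-8):
    """In-place Adagrad step, same formula as in the main loop."""
    mem += grad * grad                                   # running sum of squared gradients
    param += -learning_rate * grad / np.sqrt(mem + eps)  # per-element scaled step

w = np.array([[1.0], [-2.0]])  # toy parameter vector
mem = np.zeros_like(w)         # Adagrad memory, one entry per parameter

step_sizes = []
for _ in range(50):
    grad = w.copy()            # gradient of 0.5 * ||w||^2 is w itself
    before = w.copy()
    adagrad_update(w, grad, mem)
    step_sizes.append(float(np.abs(w - before).max()))
```

Both updates use `+=`, so the arrays are modified in place. That is why the main loop can iterate over `zip(...)` and still change `Wxh`, `Whh`, and the other parameters: the loop variable `param` refers to the same array object. In this toy run the entries of `step_sizes` shrink from one iteration to the next, which matches the comment that the effective learning rate gets smaller as training goes on.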