1. Introduction
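This system implements acoustic and language modeling for speech recognition on top of deep learning frameworks. The acoustic models include CNN-CTC, GRU-CTC, and CNN-RNN-CTC; the language models include transformer and CBHG; and the data comes from four datasets: stc, primewords, Aishell, and thchs30.
A mini speech recognition system has already been trained for this project. To try it, download the project locally, download the thchs dataset and extract it into data, then run test.py.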
2. Acoustic model
The acoustic model is built with CTC. CNN-CTC, GRU-CTC, FSMN, and other models are provided under model_speech, with keras as the implementation framework.
3. Language model
A language model based on the self-attention structure has been added: model_language/transformer.py
- Paper: https://arxiv.org/abs/1706.03762
A language model based on the CBHG structure: model_language/cbhg.py
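For readers new to the self-attention structure mentioned above, here is a minimal NumPy sketch of scaled dot-product attention as defined in the linked paper. It is not taken from this project's transformer module; the function name, the array shapes, and the random inputs are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Compute softmax(q k^T / sqrt(d_k)) v for 2-D arrays of shape (seq_len, d)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                  # (len_q, len_k) similarity scores
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))       # 5 tokens with 8-dim embeddings (illustrative sizes)
out, attn = scaled_dot_product_attention(x, x, x)    # self-attention: q = k = v
print(out.shape, attn.shape)
```

Each row of `attn` sums to 1, so every output position is a weighted average of all input positions. A full transformer adds learned projections, multiple heads, and positional information on top of this step.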
4. Datasets
The project covers four datasets, stc, primewords, Aishell, and thchs30, about 430 hours in total. Related links: http://www.openslr.org/resources.php
| Name | train | dev | test |
| --- | --- | --- | --- |
| aishell | 120098 | 14326 | 7176 |
| primewords | 40783 | 5046 | 5073 |
| thchs-30 | 10000 | 893 | 2495 |
| st-cmd | 10000 | 600 | 2000 |
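As a quick check on how much data each configuration provides, the sketch below sums the per-split counts from the table. The numbers are copied from the table; the helper name `split_totals` and the dictionary layout are illustrative assumptions, not part of the project.

```python
# Per-split counts copied from the dataset table.
counts = {
    "aishell": {"train": 120098, "dev": 14326, "test": 7176},
    "primewords": {"train": 40783, "dev": 5046, "test": 5073},
    "thchs-30": {"train": 10000, "dev": 893, "test": 2495},
    "st-cmd": {"train": 10000, "dev": 600, "test": 2000},
}

def split_totals(table, selected):
    """Sum train/dev/test counts over the selected datasets."""
    totals = {"train": 0, "dev": 0, "test": 0}
    for name in selected:
        for split, n in table[name].items():
            totals[split] += n
    return totals

all_totals = split_totals(counts, counts.keys())
two_set_totals = split_totals(counts, ["thchs-30", "aishell"])
print(all_totals)      # all four datasets
print(two_set_totals)  # thchs-30 and aishell only
```

With all four datasets the training split holds 180881 entries; with only thchs-30 and aishell it holds 130098.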
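The data labels are collected under the data path; primewords and st-cmd are not yet split into training and test sets.
To use all of the datasets, extract them into one common path and set data_path in utils.py to that path.
The data-related parameters are in utils.py:
- data_type: train, test, dev
- data_path: path to the extracted data
- thchs30, aishell, prime, stcmd: whether to use each dataset
- batch_size: batch size
- data_length: I set this to a small value to check results quickly in my own experiments; for normal use set it to None
- shuffle: whether to shuffle the training order; set to True for normal training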
import tensorflow as tf  # tf.contrib is only available in TensorFlow 1.x


def data_hparams():
    params = tf.contrib.training.HParams(
        # dataset selection and batching options
        data_type='train',
        data_path='data/',
        thchs30=True,
        aishell=True,
        prime=False,
        stcmd=False,
        batch_size=1,
        data_length=None,
        shuffle=False)
    return params
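To make the effect of batch_size, data_length, and shuffle concrete, here is a small pure-Python sketch. `make_batches` is a hypothetical helper written for illustration; it is not the project's data generator, and details such as keeping the final partial batch and the `seed` argument are assumptions.

```python
import random

def make_batches(samples, batch_size=1, data_length=None, shuffle=False, seed=0):
    """Hypothetical helper showing how the three options could shape one epoch."""
    items = list(samples)
    if data_length is not None:
        items = items[:data_length]         # keep only the first data_length samples
    if shuffle:
        random.Random(seed).shuffle(items)  # randomise the order of the samples
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]  # the last batch may be smaller

small_run = list(make_batches(range(10), batch_size=4, data_length=9))
shuffled_run = list(make_batches(range(10), batch_size=1, shuffle=True))
print(small_run)
```

Setting data_length to a small number gives a quick smoke test on a few samples, which matches the note above about using it for experiments; None keeps every sample.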
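5. Configuration
Use train.py to train the models.
The acoustic model can be cnn-ctc or gru-ctc. To switch, change the import path and keep only one of these lines:
from model_speech.cnn_ctc import Am, am_hparams
from model_speech.gru_ctc import Am, am_hparams
The language model can be transformer or cbhg. Again, keep only one of these lines:
from model_language.transformer import Lm, lm_hparams
from model_language.cbhg import Lm, lm_hparams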
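If editing import lines by hand is inconvenient, the choice can also be made at run time. The following is a sketch only: the module paths and the names Am, am_hparams, Lm, and lm_hparams come from the import lines above, while the registry dictionaries and the functions `resolve_module`, `load_am`, and `load_lm` are hypothetical and do not exist in the project.

```python
import importlib

# Module paths taken from the import lines; the dictionary keys are illustrative.
AM_MODULES = {"cnn-ctc": "model_speech.cnn_ctc", "gru-ctc": "model_speech.gru_ctc"}
LM_MODULES = {"transformer": "model_language.transformer", "cbhg": "model_language.cbhg"}

def resolve_module(name, registry):
    """Map a model name to its module path, failing loudly on unknown names."""
    try:
        return registry[name]
    except KeyError:
        raise ValueError(f"unknown model {name!r}; choose from {sorted(registry)}") from None

def load_am(name):
    """Import the chosen acoustic model module and return (Am, am_hparams)."""
    module = importlib.import_module(resolve_module(name, AM_MODULES))
    return module.Am, module.am_hparams

def load_lm(name):
    """Import the chosen language model module and return (Lm, lm_hparams)."""
    module = importlib.import_module(resolve_module(name, LM_MODULES))
    return module.Lm, module.lm_hparams
```

From the project root, `Am, am_hparams = load_am("gru-ctc")` would then stand in for the second import line.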
Use test.py to check the model's recognition results.
Layer (type)                 Output Shape              Param #
the_inputs (InputLayer)      (None, None, 200, 1)      0
conv2d_11 (Conv2D)           (None, None, 200, 32)     320
batch_normalization_11 (Batc (None, None, 200, 32)     128
conv2d_12 (Conv2D)           (None, None, 200, 32)     9248
batch_normalization_12 (Batc (None, None, 200, 32)     128
max_pooling2d_4 (MaxPooling2 (None, None, 100, 32)     0
conv2d_13 (Conv2D)           (None, None, 100, 64)     18496
batch_normalization_13 (Batc (None, None, 100, 64)     256
conv2d_14 (Conv2D)           (None, None, 100, 64)     36928
batch_normalization_14 (Batc (None, None, 100, 64)     256
max_pooling2d_5 (MaxPooling2 (None, None, 50, 64)      0
conv2d_15 (Conv2D)           (None, None, 50, 128)     73856
batch_normalization_15 (Batc (None, None, 50, 128)     512
conv2d_16 (Conv2D)           (None, None, 50, 128)     147584
batch_normalization_16 (Batc (None, None, 50, 128)     512
max_pooling2d_6 (MaxPooling2 (None, None, 25, 128)     0
conv2d_17 (Conv2D)           (None, None, 25, 128)     147584
batch_normalization_17 (Batc (None, None, 25, 128)     512
conv2d_18 (Conv2D)           (None, None, 25, 128)     147584
batch_normalization_18 (Batc (None, None, 25, 128)     512
conv2d_19 (Conv2D)           (None, None, 25, 128)     147584
batch_normalization_19 (Batc (None, None, 25, 128)     512
conv2d_20 (Conv2D)           (None, None, 25, 128)     147584
batch_normalization_20 (Batc (None, None, 25, 128)     512
reshape_2 (Reshape)          (None, None, 3200)        0
dense_3 (Dense)              (None, None, 256)         819456
dense_4 (Dense)              (None, None, 230)         59110
Total params: 1,759,174
Trainable params: 1,757,254
Non-trainable params: 1,920

loading acoustic model…
loading language model…
INFO:tensorflow:Restoring parameters from logs_lm/model
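The parameter counts in the summary can be reproduced by hand, which is a useful sanity check when changing the architecture. The sketch below assumes 3x3 convolution kernels with biases. The summary does not state the kernel size; it is inferred from the counts (for example 320 = (3*3*1 + 1) * 32). The helper functions are illustrative and not part of the project.

```python
def conv2d_params(c_in, c_out, kernel=(3, 3)):
    """Weights plus one bias per output channel (3x3 kernel is an assumption)."""
    return (kernel[0] * kernel[1] * c_in + 1) * c_out

def batchnorm_params(channels):
    """gamma and beta (trainable) plus moving mean and variance (non-trainable)."""
    return 4 * channels

def dense_params(n_in, n_out):
    """Weight matrix plus one bias per output unit."""
    return (n_in + 1) * n_out

# (input channels, output channels) for conv2d_11 through conv2d_20 in the summary.
conv_channels = [(1, 32), (32, 32), (32, 64), (64, 64), (64, 128), (128, 128)]
conv_channels += [(128, 128)] * 4

conv_total = sum(conv2d_params(c_in, c_out) for c_in, c_out in conv_channels)
bn_total = sum(batchnorm_params(c_out) for _, c_out in conv_channels)
dense_total = dense_params(25 * 128, 256) + dense_params(256, 230)  # reshape gives 25 * 128 = 3200

total = conv_total + bn_total + dense_total
non_trainable = bn_total // 2   # half of each batch-norm layer's parameters
print(total, total - non_trainable, non_trainable)  # 1759174 1757254 1920
```

The three printed values match the Total, Trainable, and Non-trainable lines of the summary.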
## Using the speech recognition system
for i in range(5):
    print('\n the ', i, 'th example.')
    # Load the trained model and run recognition
    inputs, outputs = next(am_batch)
    x = inputs['the_inputs']
    y = inputs['the_labels'][0]
    result = am.model.predict(x, steps=1)
    # Convert the numeric result into text
    _, text = decode_ctc(result, train_data.am_vocab)
    text = ' '.join(text)
    print('Text result:', text)
    print('Reference result:', ' '.join([train_data.am_vocab[int(idx)] for idx in y]))
    with sess.as_default():
        _, y = next(lm_batch)
        text = text.strip('\n').split(' ')
        # Map each pinyin token to its index in the language model vocabulary
        x = np.array([train_data.pny_vocab.index(pny) for pny in text])
        x = x.reshape(1, -1)
        preds = sess.run(lm.preds, {lm.x: x})
        got = ''.join(train_data.han_vocab[idx] for idx in preds[0])
        print('Reference characters:', ''.join(train_data.han_vocab[idx] for idx in y[0]))
        print('Recognition result:', got)
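The loop above relies on decode_ctc to turn the acoustic model's per-frame output into a symbol sequence. As an illustration of what a CTC decoding step does, here is a minimal greedy (best-path) decoder in NumPy. It is a sketch, not the project's decode_ctc, which may use a different algorithm. The function `greedy_ctc_decode`, the toy vocabulary, and the choice of the last class as the blank label are all assumptions.

```python
import numpy as np

def greedy_ctc_decode(probs, vocab, blank=None):
    """Best-path CTC decoding for probs of shape (time, num_classes).

    Takes the most likely class per frame, merges consecutive repeats,
    then drops blanks. Returns (indices, symbols).
    """
    if blank is None:
        blank = probs.shape[-1] - 1      # assumption: the last class is the blank
    best = probs.argmax(axis=-1)
    indices = []
    prev = None
    for idx in best:
        if idx != prev and idx != blank:
            indices.append(int(idx))
        prev = idx
    return indices, [vocab[i] for i in indices]

vocab = ["ni3", "hao3"]                  # toy pinyin vocabulary; class 2 is the blank
probs = np.array([
    [0.8, 0.1, 0.1],   # ni3
    [0.7, 0.2, 0.1],   # ni3 (repeat, merged)
    [0.1, 0.1, 0.8],   # blank
    [0.1, 0.8, 0.1],   # hao3
    [0.2, 0.7, 0.1],   # hao3 (repeat, merged)
    [0.1, 0.1, 0.8],   # blank
])
indices, symbols = greedy_ctc_decode(probs, vocab)
print(" ".join(symbols))  # ni3 hao3
```

The space-joined output has the same form as the text passed to the language model in the loop, where each pinyin token is looked up in pny_vocab.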
