将多层循环神经网络堆叠在一起，通过对几个简单层的组合，产生一个灵活的机制。其中的数据可能与不同层的堆叠有关。

9.3.1 函数依赖关系

将深度架构中的函数依赖关系形式化，第 \(l\) 个隐藏层的隐状态表达式为：

\[\boldsymbol{H}^{(l)}_t=\phi_l(\boldsymbol{H}^{(l-1)}_t\boldsymbol{W}^{(l)}_{xh}+\boldsymbol{H}^{(l)}_{t-1}\boldsymbol{W}^{(l)}_{hh}+\boldsymbol{b}^{(l)}_h) \]

参数字典：

\(\phi_l\) 表示第 \(l\) 个隐藏层的激活函数
\(\boldsymbol{X}_t\in\R^{n\times d}\) 表示小批量输入
- \(n\) 表示样本个数
- \(d\) 表示输入个数
\(\boldsymbol{H}^{(l)}_{t}\in\R^{n\times h}\) 表示 \(l^{th}\) 隐藏层 \((l=1,\dots,L)\) 的隐状态
- \(h\) 表示隐藏单元个数
- 设置 \(\boldsymbol{H}^{(0)}_{t}=\boldsymbol{X}_{t}\)
\(\boldsymbol{O}_{t}\in\R^{n\times q}\) 表示输出层变量
- \(q\) 表示输出数
\(\boldsymbol{W}^{(l)}_{xh},\boldsymbol{W}^{(l)}_{hh}\in\R^{h\times h}\) 表示第 \(l\) 个隐藏层的权重参数
\(\boldsymbol{b}^{(l)}_h\in\R^{1\times h}\) 表示第 \(l\) 个隐藏层的偏重参数

最后，输出层的计算仅基于第 \(l\) 个隐藏层最终的隐状态：

\[\boldsymbol{O}_t=\boldsymbol{H}^{L}_t\boldsymbol{W}_{hq}+\boldsymbol{b}_q \]

其中 \(\boldsymbol{W}_{hq}\in\R^{h\times q}\) 和 \(\boldsymbol{b}_q\in\R^{1\times q}\) 表示输出层的模型参数

9.3.2 简洁实现

手撸多层循环神经网络有点过于麻烦了，在此仅简单实现。

import torch
from torch import nn
from d2l import torch as d2l

batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)

vocab_size, num_hiddens, num_layers = len(vocab), 256, 2  # 用 num_layers 来设定隐藏层数
num_inputs = vocab_size
device = d2l.try_gpu()
lstm_layer = nn.LSTM(num_inputs, num_hiddens, num_layers)
model = d2l.RNNModel(lstm_layer, len(vocab))
model = model.to(device)

9.3.3 训练与预测

num_epochs, lr = 500, 2
d2l.train_ch8(model, train_iter, vocab, lr*1.0, num_epochs, device)  # 多了一层后训练速度大幅下降

perplexity 1.0, 116173.5 tokens/sec on cuda:0
time travelleryou can show black is white by argument said filby
travelleryou can show black is white by argument said filby

练习

（1）基于我们在 8.5 节中讨论的单层实现，尝试从零开始实现两层循环神经网络。

batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)

def get_params_bilayer(vocab_size, num_hiddens, device):
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return torch.randn(size=shape, device=device) * 0.01

    # 隐藏层1参数
    W_xh1 = normal((num_inputs, num_hiddens))
    W_hh1 = normal((num_hiddens, num_hiddens))
    b_h1 = torch.zeros(num_hiddens, device=device)
    # 新增隐藏层2参数
    W_hh2 = normal((num_hiddens, num_hiddens))
    b_h2 = torch.zeros(num_hiddens, device=device)
    # 输出层参数
    W_hq = normal((num_hiddens, num_outputs))
    b_q = torch.zeros(num_outputs, device=device)
    # 附加梯度
    params = [W_xh1, W_hh1, b_h1, W_hh2, b_h2, W_hq, b_q]
    for param in params:
        param.requires_grad_(True)
    return params

def init_rnn_state_bilayer(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device),
            torch.zeros((batch_size, num_hiddens), device=device))  # 新增第二个隐状态初始化张量

def rnn_bilayer(inputs, state, params):  # inputs的形状：(时间步数量，批量大小，词表大小)
    W_xh1, W_hh1, b_h1, W_hh2, b_h2, W_hq, b_q = params  # 新增第二层参数
    H1, H2 = state
    outputs = []
    for X in inputs:  # X的形状：(批量大小，词表大小) 前面转置是为了这里遍历
        H1 = torch.tanh(torch.mm(X, W_xh1) + torch.mm(H1, W_hh1) + b_h1)  # 计算隐状态1
        H2 = torch.tanh(torch.mm(H1, W_hh2) + b_h2)  # 计算隐状态2
        Y = torch.mm(H2, W_hq) + b_q  # 计算输出
        outputs.append(Y)
    return torch.cat(outputs, dim=0), (H1, H2)  # 沿时间步拼接

num_hiddens = 512
net_rnn_bilayer = d2l.RNNModelScratch(len(vocab), num_hiddens, d2l.try_gpu(), get_params_bilayer,
                      init_rnn_state_bilayer, rnn_bilayer)
num_epochs, lr = 500, 1
d2l.train_ch8(net_rnn_bilayer, train_iter, vocab, lr, num_epochs, d2l.try_gpu())

perplexity 1.0, 63514.3 tokens/sec on cuda:0
time travelleryou can show black is white by argument said filby
travelleryou can show black is white by argument said filby

（2）在本节训练模型中，比较使用门控循环单元替换长短期记忆网络后模型的精确度和训练速度。

vocab_size, num_hiddens, num_layers = len(vocab), 256, 2  # 用 num_layers 来设定隐藏层数
num_inputs = vocab_size
device = d2l.try_gpu()
# lstm_layer = nn.LSTM(num_inputs, num_hiddens, num_layers)
# model = d2l.RNNModel(lstm_layer, len(vocab))
gru_layer = nn.GRU(num_inputs, num_hiddens)
model_gru = d2l.RNNModel(gru_layer, len(vocab))
model_gru = model_gru.to(device)

num_epochs, lr = 500, 2
d2l.train_ch8(model_gru, train_iter, vocab, lr*1.0, num_epochs, device)  # 换 gru 后更快了

perplexity 1.0, 230590.6 tokens/sec on cuda:0
time traveller for so it will be convenient to speak of himwas e
travelleryou can show black is white by argument said filby