A Detailed Explanation of the Decoder in the Transformer Model (Attention Is All You Need)


The decoder in detail

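In the Transformer model, the decoder resembles the encoder: it is a stack of $N = 6$ identical layers. The decoder has one more sub-layer than the encoder, however: a multi-head attention over the encoder's output. Each decoder layer therefore has three sub-layers: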

Masked multi-head self-attention
Multi-head attention over the encoder's output
Position-wise fully connected feed-forward network

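Each sub-layer is wrapped in a residual connection followed by layer normalization, which helps keep gradients stable and makes the deep stack easier to train. The mask, together with the output embeddings being offset by one position, ensures that the prediction for position $i$ can depend only on the known outputs at positions before $i$.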
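To illustrate the one-position offset, here is a minimal sketch of how the decoder input could be built from the target sequence during training. The `<bos>` start token and the `shift_right` helper are hypothetical names chosen for this example; they are not defined in the article.

```python
BOS = "<bos>"  # hypothetical start-of-sequence token (an assumption of this sketch)


def shift_right(target_tokens, bos=BOS):
    """Return the decoder input: the target shifted one position to the right."""
    return [bos] + list(target_tokens[:-1])


target = ["y1", "y2", "y3"]
decoder_input = shift_right(target)

# decoder_input[i] is the newest token the decoder may see when predicting target[i].
for i, (seen, predicted) in enumerate(zip(decoder_input, target)):
    print(i, seen, "->", predicted)
```

With this offset, the token to be predicted at position $i$ is never part of the decoder input at position $i$; the mask then hides every later position as well.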

Structure and operation of the decoder

1. Masked multi-head self-attention

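Masked multi-head self-attention ensures that, while the sequence is being generated, each time step attends only to itself and to earlier time steps and never sees future information. This is implemented with a mask matrix that sets the attention scores of future positions to $-\infty$ before the softmax.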

2. Multi-head attention over the encoder's output

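This sub-layer takes its keys and values from the encoder output and its queries from the output of the decoder's self-attention sub-layer. Every decoder position can therefore attend to all of the encoder's hidden states, which captures the dependencies between the input sequence and the output sequence.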

3. Position-wise fully connected feed-forward network

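This sub-layer applies the same two linear transformations, with a ReLU activation between them, to each position independently, further transforming the features.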

Step-by-step walkthrough with an example

Suppose we have an input sequence $\text{Input Sequence} = [23.1, 24.3, 22.8, 23.5]$ and a partially generated output sequence $\text{Generated Sequence} = [y_1, y_2, y_3]$.

1. Embedding and positional encoding

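Embed the input sequence and the partial output sequence. To keep the example readable, each value occupies only the first dimension of its embedding and the remaining dimensions are shown as 0: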

$\text{Input Embedding} = \begin{bmatrix} 23.1 & 0 & \ldots & 0 \\ 24.3 & 0 & \ldots & 0 \\ 22.8 & 0 & \ldots & 0 \\ 23.5 & 0 & \ldots & 0 \end{bmatrix}$
$\text{Output Embedding} = \begin{bmatrix} y_1 & 0 & \ldots & 0 \\ y_2 & 0 & \ldots & 0 \\ y_3 & 0 & \ldots & 0 \end{bmatrix}$

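Positional encoding (position $pos$ contributes $\sin(pos)$ in the first dimension; the remaining dimensions are again shown as 0 for simplicity):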

$PE_{\text{input}} = \begin{bmatrix} 0 & 0 & \ldots & 0 \\ \sin(1) & 0 & \ldots & 0 \\ \sin(2) & 0 & \ldots & 0 \\ \sin(3) & 0 & \ldots & 0 \end{bmatrix}$
$PE_{\text{output}} = \begin{bmatrix} 0 & 0 & \ldots & 0 \\ \sin(1) & 0 & \ldots & 0 \\ \sin(2) & 0 & \ldots & 0 \end{bmatrix}$

Add the positional encodings to the embeddings:

$\text{Input + Position} = \begin{bmatrix} 23.1 & 0 & \ldots & 0 \\ 25.1415 & 0 & \ldots & 0 \\ 23.7093 & 0 & \ldots & 0 \\ 23.6411 & 0 & \ldots & 0 \end{bmatrix}$
$\text{Output + Position} = \begin{bmatrix} y_1 & 0 & \ldots & 0 \\ y_2 + \sin(1) & 0 & \ldots & 0 \\ y_3 + \sin(2) & 0 & \ldots & 0 \end{bmatrix}$
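The first column above can be checked with a short sketch of the sinusoidal encoding from the paper, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})$. The function name `sinusoidal_pe` and the choice `d_model=512` are assumptions made for this illustration. Note that the full formula also fills the other dimensions (for example, dimension 1 at position 0 is $\cos(0) = 1$), so the zeros in the matrices above are a simplification.

```python
import math


def sinusoidal_pe(pos, dim, d_model):
    """Sinusoidal positional encoding for position `pos` and dimension index `dim`."""
    # Even dimensions use sin, odd dimensions use cos, and each pair shares a frequency.
    angle = pos / (10000 ** ((2 * (dim // 2)) / d_model))
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)


inputs = [23.1, 24.3, 22.8, 23.5]

# Add the encoding of dimension 0 to each input value, as in the worked example.
first_column = [
    round(x + sinusoidal_pe(pos, 0, d_model=512), 4)
    for pos, x in enumerate(inputs)
]
print(first_column)  # [23.1, 25.1415, 23.7093, 23.6411]
```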

2. Masked multi-head self-attention

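Form the query, key, and value matrices (the learned projections $W^Q$, $W^K$, $W^V$ and the split into multiple heads are omitted in this simplified example):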

$Q_{\text{masked}} = K_{\text{masked}} = V_{\text{masked}} = \text{Output + Position}$

Apply the mask $M$, which blocks attention to future positions:

$M = \begin{bmatrix} 0 & -\infty & -\infty \\ 0 & 0 & -\infty \\ 0 & 0 & 0 \end{bmatrix}$
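A mask of this shape can be generated for any sequence length. The following is a minimal NumPy sketch; the helper name `causal_mask` is an assumption made for this example.

```python
import numpy as np


def causal_mask(n):
    """Additive mask: 0 on and below the diagonal, -inf strictly above it."""
    mask = np.zeros((n, n))
    # Row i must not attend to columns j > i, i.e. the strict upper triangle.
    mask[np.triu_indices(n, k=1)] = -np.inf
    return mask


print(causal_mask(3))
```

Because the mask is added to the scores before the softmax, every $-\infty$ entry becomes an attention weight of exactly 0.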

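Compute the attention weights. The scores are scaled by $\sqrt{d_k}$, the dimension of the keys (in this simplified single-head example without projections, $d_k = d_{\text{model}}$), and the mask $M$ is added before the softmax:

$\text{Attention Weights}_{\text{masked}} = \text{softmax}\left(\frac{Q_{\text{masked}} K_{\text{masked}}^T}{\sqrt{d_k}} + M\right)$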

Take the weighted sum of the values:

$\text{Attention Output}_{\text{masked}} = \text{Attention Weights}_{\text{masked}} \cdot V_{\text{masked}}$
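The whole step can be sketched in a few lines of NumPy. This is a single-head illustration under the same simplification as the example (queries, keys, and values are all the position-encoded output, with no learned projections); the function names and the random test input are assumptions, not part of the original.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def masked_self_attention(x):
    """Single-head masked self-attention with Q = K = V = x (no projections)."""
    n, d_k = x.shape
    # -inf strictly above the diagonal, 0 elsewhere.
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    scores = x @ x.T / np.sqrt(d_k) + mask
    weights = softmax(scores)
    return weights @ x, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))  # 3 output positions, 4 dimensions (illustrative sizes)
output, weights = masked_self_attention(x)
print(np.round(weights, 3))
```

The printed weights are lower triangular: the first position can only attend to itself, so its output equals its own value vector.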

3. Multi-head attention over the encoder's output

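Form the query, key, and value matrices. The queries come from the decoder's self-attention sub-layer (in the full model, after the residual connection and layer normalization of step 5), while the keys and values both come from the encoder output:

$Q = \text{Attention Output}_{\text{masked}}, \quad K = V = \text{Encoder Output}$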

Compute the attention weights. No mask is needed here, because the decoder may look at the entire input sequence:

$\text{Attention Weights} = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right)$

Take the weighted sum of the values:

$\text{Attention Output} = \text{Attention Weights} \cdot V$
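As a hedged illustration, the sketch below computes this attention for 3 decoder positions and 4 encoder positions. It is single-head and omits the learned projections; the function names, array sizes, and random inputs are assumptions made for the example.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def cross_attention(decoder_states, encoder_output):
    """Queries from the decoder; keys and values from the encoder output."""
    d_k = encoder_output.shape[-1]
    scores = decoder_states @ encoder_output.T / np.sqrt(d_k)
    weights = softmax(scores)  # shape: (decoder length, encoder length)
    return weights @ encoder_output, weights


rng = np.random.default_rng(0)
decoder_states = rng.normal(size=(3, 4))  # 3 generated positions
encoder_output = rng.normal(size=(4, 4))  # 4 input positions
attn_output, attn_weights = cross_attention(decoder_states, encoder_output)
print(attn_weights.shape, attn_output.shape)  # (3, 4) (3, 4)
```

Each row of the weight matrix is a distribution over the four input positions, which is how every output position can consult the whole input sequence.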

4. Position-wise fully connected feed-forward network

Compute the feed-forward output $y$, applied to each position independently:

$y = \max(0, \text{Attention Output} \cdot W_1 + b_1) \cdot W_2 + b_2$
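A minimal NumPy sketch of this network follows. The paper uses $d_{\text{model}} = 512$ and an inner dimension of 2048; the small sizes, random weights, and the name `position_wise_ffn` below are assumptions chosen to keep the example readable.

```python
import numpy as np


def position_wise_ffn(x, w1, b1, w2, b2):
    """Two linear maps with a ReLU in between, applied to every row (position) of x."""
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2


rng = np.random.default_rng(0)
d_model, d_ff = 4, 16  # illustrative sizes only
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(3, d_model))  # 3 positions
ffn_out = position_wise_ffn(x, w1, b1, w2, b2)

# "Position-wise": processing one row alone gives the same result as processing it in the batch.
single = position_wise_ffn(x[1:2], w1, b1, w2, b2)
print(np.allclose(ffn_out[1], single[0]))
```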

5. Residual connections and layer normalization

For the masked multi-head self-attention sub-layer:

$\text{Self-Attention Output} = \text{LayerNorm}(\text{Output + Position} + \text{Attention Output}_{\text{masked}})$

For the multi-head attention over the encoder's output:

$\text{Cross-Attention Output} = \text{LayerNorm}(\text{Self-Attention Output} + \text{Attention Output})$

For the feed-forward sub-layer, where $y$ denotes the feed-forward output from step 4:

$\text{FFN Output} = \text{LayerNorm}(\text{Cross-Attention Output} + y)$
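All three formulas share the pattern $\text{LayerNorm}(x + \text{Sublayer}(x))$. The sketch below shows that pattern in NumPy. It is a simplification: the learnable gain and bias of layer normalization are left out, and the names `layer_norm` and `add_and_norm`, the value of `eps`, and the random inputs are assumptions made for this example.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each position (row) to zero mean and unit variance.

    The learnable gain and bias are omitted in this sketch.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer_output)


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))                # input to a sub-layer
sublayer_output = rng.normal(size=(3, 4))  # stand-in for an attention or FFN output
normed = add_and_norm(x, sublayer_output)
print(np.round(normed.mean(axis=-1), 6))
```

Because the sub-layer output is added to its own input, each sub-layer only has to learn a correction to the signal passing through it, which is one reason residual connections ease the training of deep stacks.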

Summary

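Walking through the decoder's structure and each of its sub-layers shows how it generates the output sequence by combining masked multi-head self-attention, multi-head attention over the encoder's output, and a position-wise feed-forward network. Each sub-layer is wrapped in a residual connection and layer normalization, which stabilizes training. Together, these steps let the decoder generate each symbol using only the symbols generated before it, while still consulting the full encoder output.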