Seq2Seq + Attention: Study Notes

Seq2Seq

Overview

Structure

Attention Mechanism

Seq2Seq

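Overview

Seq2Seq (Sequence to Sequence) is a model that, given an input sequence, generates another sequence through a specific generation procedure. The architecture is also called the Encoder-Decoder model. Seq2Seq handles tasks whose input and output sequences differ in length, and it is classically built from RNNs.

For example, in machine translation the input "hello" maps to the output "你好": the input is one English word and the output is two Chinese characters. In a dialogue system, we ask a question (input) and the machine generates an answer (output). In both cases the input and output are sequences whose lengths are not fixed in advance.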

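Structure

Seq2Seq consists of three parts: an encoder, a context vector C, and a decoder.

Encoding: the encoder compresses the input text sequence into a fixed-size state vector C. This context vector C can be regarded as a summary of the input sequence's meaning, and it serves as the input to the decoder.

Decoding: the decoder generates the target sequence from the context vector. There are two common ways to feed C into the decoder.

In the first, C is passed to the decoder RNN as its initial state, and the decoder produces the output sequence from there. C takes part only as the initial state; later time steps do not use it directly.

In the second, C is an input to the decoder at every time step, so it takes part in the computation at all time steps.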
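The two ways of feeding C into the decoder can be sketched with a toy tanh RNN in NumPy. This is only an illustration under stated assumptions: the sizes, the random weights, and the function names `decode_init_state` and `decode_every_step` are hypothetical and do not come from the code later in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_steps = 4, 8, 3   # toy sizes (assumed)

# Randomly initialised toy parameters; a real model would learn these.
W_x = rng.normal(size=(n_hidden, n_in)) * 0.1       # input   -> hidden
W_h = rng.normal(size=(n_hidden, n_hidden)) * 0.1   # hidden  -> hidden
W_c = rng.normal(size=(n_hidden, n_hidden)) * 0.1   # context -> hidden (variant 2 only)

C = rng.normal(size=n_hidden)           # context vector produced by the encoder
xs = rng.normal(size=(n_steps, n_in))   # decoder inputs, one row per time step


def decode_init_state(C, xs):
    """Variant 1: C is used only as the initial hidden state."""
    h, states = C, []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return np.stack(states)


def decode_every_step(C, xs):
    """Variant 2: C enters the state update at every time step."""
    h, states = np.zeros(n_hidden), []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + W_c @ C)
        states.append(h)
    return np.stack(states)


states_1 = decode_init_state(C, xs)
states_2 = decode_every_step(C, xs)
print(states_1.shape, states_2.shape)  # (3, 8) (3, 8)
```

In variant 1 the influence of C on later steps is carried only by the hidden state, which is one reason long inputs are hard to summarise in a single vector.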

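Attention Mechanism

Because the context vector C is the only link between the encoder and the decoder, the information passed along is compressed lossily, and every input word contributes equally to the current output regardless of its position. Attention is introduced to address both problems.

With attention, the model no longer squeezes the whole input sequence into one fixed-length context vector C. Instead, a new context vector $C_i$ is computed for each word being generated, so the decoder receives a different C at every time step. The figure below shows an Encoder-Decoder model with attention: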

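Attention Function

An attention function can be described as mapping a query and a set of key-value pairs to an output.

Compute the similarity between the query and each key to obtain a score; common similarity functions include the dot product, concatenation, and a perceptron (a small feed-forward network).
Normalize the scores with a softmax function to obtain the weights.
Take the weighted sum of the corresponding values to obtain the final attention output.

In formula form, with $\mathrm{sim}$ the chosen similarity function: $\mathrm{Attention}(q, K, V) = \sum_i \mathrm{softmax}_i\big(\mathrm{sim}(q, k_i)\big)\, v_i$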
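The three steps (score, softmax, weighted sum) can be written out directly. The following is a minimal NumPy sketch that assumes dot-product similarity; the function name `attention` and the toy query, keys, and values are made up for illustration.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()


def attention(query, keys, values):
    scores = keys @ query        # step 1: dot-product similarity with every key
    weights = softmax(scores)    # step 2: normalise the scores into weights
    return weights @ values, weights   # step 3: weighted sum of the values


query = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])

context, weights = attention(query, keys, values)
print(weights)   # the first and third keys match the query best
print(context)
```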

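Self-Attention

Self-attention is not attention between Target and Source. It is attention among the elements inside the Source, or among the elements inside the Target; it can also be seen as the special case Target = Source. The computation is the same as for ordinary attention; only the objects involved change, since the queries, keys, and values are all derived from the same sequence.

Suppose the input sequence is "Thinking Machines". Then $x_1$ and $x_2$ are the word vectors of "Thinking" and "Machines" after positional encoding has been added. Each word vector is multiplied by three weight matrices $W^{Q}$, $W^{K}$, and $W^{V}$ to produce the Query, Key, and Value vectors needed to compute attention.

In practice, each sequence is fed in as a matrix. In the figure above, X is the matrix formed by the word vectors of "Thinking" and "Machines", and Q, K, and V are obtained from it by these transformations. If the word vectors are 512-dimensional, X has shape (2, 512); $W^{Q}$, $W^{K}$, and $W^{V}$ each have shape (512, 64), so Query, Keys, and Values each have shape (2, 64).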

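With Q, K, and V in hand, the attention values are computed as follows.

Compute the relevance scores between words with the dot product: $score = Q \cdot K^{T}$, where $score$ is a (2, 2) matrix.
Scale the scores to stabilize the gradients during training: $score = \frac{score}{\sqrt{d_k}}$, where $d_k$ is the dimension of K; in the example above, $d_k = 64$.
Apply the softmax function to turn each row of scores into a probability distribution with entries in [0, 1].
Weight the Values by this distribution: $Z = \mathrm{softmax}(score) \cdot V$. V has shape (2, 64), so (2, 2) x (2, 64) gives a Z of shape (2, 64).

For the detailed algorithm, see: 自注意力机制(Self-Attention)-CSDN博客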
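The shapes in this walkthrough can be checked with a short NumPy sketch. The word vectors and weight matrices below are random stand-ins (an assumption made for illustration), so only the shapes and the row-wise normalisation are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64

X = rng.normal(size=(2, d_model))                         # two word vectors
W_Q = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)  # random stand-ins
W_K = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
W_V = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # each (2, 64)

score = Q @ K.T                        # (2, 2) relevance scores
score = score / np.sqrt(d_k)           # scale by sqrt(d_k)

# Row-wise softmax: each row becomes a probability distribution.
score = score - score.max(axis=-1, keepdims=True)
weights = np.exp(score)
weights = weights / weights.sum(axis=-1, keepdims=True)

Z = weights @ V                        # (2, 2) x (2, 64) -> (2, 64)
print(Q.shape, weights.shape, Z.shape)  # (2, 64) (2, 2) (2, 64)
```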

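Multi-Head Attention

Multi-head attention initializes not just one set of Q, K, V weight matrices but several.

Query, Key, and Value first pass through a linear transformation and are then fed into scaled dot-product attention. This is done h times, once per head, and each head uses its own parameters W for the linear transformations of Q, K, and V. The h attention results are then concatenated and passed through one more linear transformation, whose output is the result of multi-head attention.

For an input matrix X, each set of Q, K, and V yields one output matrix Z, as shown in the figure below: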
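A minimal NumPy sketch of the multi-head computation follows. The number of heads h = 8, the output projection name `W_O`, and the random weights are assumptions chosen for illustration; the notes themselves do not fix these values.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V


rng = np.random.default_rng(0)
n_tokens, d_model, h = 2, 512, 8      # h = 8 heads is an assumed value
d_k = d_model // h                     # 64 dimensions per head

X = rng.normal(size=(n_tokens, d_model))

heads = []
for _ in range(h):
    # Every head has its own projection matrices.
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
                     for _ in range(3))
    heads.append(scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V))

# Concatenate the h outputs and apply one more linear transformation.
W_O = rng.normal(size=(h * d_k, d_model)) / np.sqrt(h * d_k)
Z = np.concatenate(heads, axis=-1) @ W_O

print(heads[0].shape, Z.shape)  # (2, 64) (2, 512)
```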

References

Seq2Seq模型和Attention机制 - machine-learning-notes (gitbook.io)

【Attention机制讲解】-CSDN博客

Papers on seq2seq:
(1) The original model: https://arxiv.org/pdf/1406.1078.pdf, "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation", proposed by Cho et al. in 2014.

(2) The improved model: https://arxiv.org/pdf/1409.3215.pdf, "Sequence to Sequence Learning with Neural Networks".

(3) The model with attention: https://arxiv.org/pdf/1409.0473.pdf, "Neural Machine Translation by Jointly Learning to Align and Translate".

Seq2Seq+Attention Code

Task: sentence translation with Seq2Seq and an attention mechanism

Date: 2023/11/22

Reference: ChengJunkai @github.com/Cheng0829 // Tae Hwan Jung(Jeff Jung) @graykod

Imports

import numpy as np
import torch, time, os, sys
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt

# S: start symbol, the first input fed to the decoder.
# E: end symbol, which closes the decoder output.
# P: padding symbol, used to fill a sequence that is shorter than the number of time steps.

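Word Embedding Preprocessing

The pre_process() function:

Tokenizes the input sentences into a sequence of words.
Builds word_list with duplicates removed, so that each word's index (its position in the embedding vector) is unique.
Builds two dictionaries: word_dict (word to index) and number_dict (index to word).
Returns the word list, the word dictionary, the number dictionary, and the number of word classes (the size of the dictionary).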

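def pre_process(sentences):
    # Tokenize: join all sentences and split on whitespace
    word_sequence = " ".join(sentences).split()
    # Remove duplicates while keeping first-seen order
    word_list = []
    '''
    Deduplicating with list(set(word_sequence)) would give a list in arbitrary order
    (sets are unordered), so the dictionary could differ between runs and a model saved
    in an earlier run might not work in this one
    (e.g. the earlier model learned to answer 我:0 with i:7, but if i now sits at
    index 8 the model can no longer output the right word).
    '''
    for word in word_sequence:
        if word not in word_list:
            word_list.append(word)
    word_dict = {w:i for i, w in enumerate(word_list)}
    number_dict = {i:w for i, w in enumerate(word_list)}
    # Vocabulary size, which is also the dimension of the one-hot vectors
    n_class = len(word_dict)  # 12
    return word_list, word_dict, number_dict, n_class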
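To see what the vocabulary looks like for the three sentences used later in these notes, the same order-preserving deduplication can be reproduced in a few lines. Using `dict.fromkeys` here is a swapped-in shortcut (dictionaries keep insertion order in Python 3.7+), not what pre_process itself does.

```python
sentences = ['我 想 喝 啤 酒 P', 'S i want a beer', 'i want a beer E']

word_sequence = " ".join(sentences).split()
# dict.fromkeys keeps the first occurrence of each word, in order.
word_list = list(dict.fromkeys(word_sequence))
word_dict = {w: i for i, w in enumerate(word_list)}
number_dict = {i: w for i, w in enumerate(word_list)}
n_class = len(word_dict)

print(n_class)                                           # 12
print(word_dict['我'], word_dict['S'], word_dict['E'])   # 0 6 11
```

Because the order depends only on the input sentences, the indices are identical on every run, which is what makes a saved model reusable.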

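The make_batch() function:

Converts the sentence data into a format the neural network can process.
Uses word_dict to map each word to its index, then uses the list of indices to select rows of an identity matrix, which yields one one-hot vector per word.
Converts the input, output, and target batches to PyTorch tensors and moves them to the specified device.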

'''Build the one-hot vectors of the tokens from the sentence data'''
def make_batch(sentences,word_dict):
    # [1, 6, 12] [batch size, input sentence length, one-hot dimension (number of word classes)]
    input_batch = [np.eye(n_class)[[word_dict[n] for n in sentences[0].split()]]]
    # [1, 5, 12] [batch size, output sentence length, one-hot dimension (number of word classes)]
    output_batch = [np.eye(n_class)[[word_dict[n] for n in sentences[1].split()]]]
    # [1, 5] [batch size, output sentence length]
    target_batch = [[word_dict[n] for n in sentences[2].split()]]

    input_batch = torch.FloatTensor(np.array(input_batch)).to(device)
    output_batch = torch.FloatTensor(np.array(output_batch)).to(device)
    target_batch = torch.LongTensor(np.array(target_batch)).to(device)

    return input_batch, output_batch, target_batch
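The `np.eye(n_class)[indices]` idiom used in make_batch selects one row of the identity matrix per word, which is exactly a one-hot encoding. A small standalone sketch, with a toy five-word dictionary assumed for illustration:

```python
import numpy as np

# Toy dictionary (assumed); the real one comes from pre_process.
word_dict = {'S': 0, 'i': 1, 'want': 2, 'a': 3, 'beer': 4}
n_class = len(word_dict)

indices = [word_dict[w] for w in 'a beer i want'.split()]   # [3, 4, 1, 2]
one_hot = np.eye(n_class)[indices]    # row k of the identity matrix encodes index k
batch = np.array([one_hot])           # add the batch dimension

print(one_hot)
print(batch.shape)  # (1, 4, 5)
```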

Building the Model

The __init__ method: defines the encoder (encoder_cell) and decoder (decoder_cell) RNNs, the linear layer used for attention (attn), and the output linear layer (out).

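class Attention(nn.Module):
    def __init__(self):
        super(Attention, self).__init__()
        # Note: nn.RNN applies dropout only between stacked layers, so with the
        # default num_layers=1 it has no effect here (PyTorch emits a warning).
        self.encoder_cell = nn.RNN(input_size=n_class, hidden_size=n_hidden, dropout=0.5)
        self.decoder_cell = nn.RNN(input_size=n_class, hidden_size=n_hidden, dropout=0.5)
        # Linear layer used to compute the attention scores
        self.attn = nn.Linear(n_hidden, n_hidden)
        # Output layer over [decoder state; context], hence 2*n_hidden inputs
        self.out = nn.Linear(2*n_hidden, n_class)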

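The forward() method:

Transposes the inputs into the layout the network expects.
Runs encoder_cell on the encoder inputs to get its outputs and final hidden state.
Allocates an empty tensor output to hold the decoder outputs.
Creates an empty list trained_attn to hold the attention weights computed at each step.
Runs the decoder steps.
Returns the output sequence and its attention weights.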

    '''Usage: output, _ = model(input_batch, hidden_0, output_batch)'''
    def forward(self, encoder_inputs, hidden_0, decoder_inputs):
        # [1,6,12] -> [6,1,12] [input sentence length (n_step), batch size, one-hot dimension]
        encoder_inputs = encoder_inputs.transpose(0, 1)
        # [1,5,12] -> [5,1,12] [output sentence length (n_step), batch size, one-hot dimension]
        decoder_inputs = decoder_inputs.transpose(0, 1)
        # print(encoder_inputs.shape, decoder_inputs.shape)

        '''Encoder'''
        # encoder_outputs : [n_step, batch_size, num_directions(=1)*n_hidden] # [6,1,128]
        # encoder_states : [num_layers*num_directions, batch_size, n_hidden] # [1,1,128]
        '''encoder_states is the hidden state at the last time step; it equals the last element of encoder_outputs'''
        encoder_outputs, encoder_states = self.encoder_cell(encoder_inputs, hidden_0)
        # print(encoder_outputs.shape, encoder_states.shape)
        n_step = len(decoder_inputs) # 5
        # torch.empty returns an uninitialized tensor whose entries are arbitrary values
        output = torch.empty([n_step, 1, n_class]).to(device) # [5,1,12]

        '''Attention weights between all decoder hidden states and all encoder hidden states.
        Two weighted sums are involved, one via bmm and one via dot, matching the two for loops.
        '''
        trained_attn = []

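At each decoder time step:

Pass the current decoder input and the previous decoder hidden state to decoder_cell to get the current decoder output and hidden state.
Compute the attention weights with the get_attn_one_to_all method.
Append the computed attention weights to the trained_attn list.
Compute the context vector from the attention weights and the encoder outputs.
Concatenate the current decoder output with the context vector and pass the result through the output linear layer (out) to produce one element of the output sequence.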

        for i in range(n_step): # 5
            '''Decoder'''
            '''decoder_inputs[i] is the decoder input at time step i only; nn.RNN needs a 3-D input, so unsqueeze adds a dimension'''
            decoder_input_one = decoder_inputs[i].unsqueeze(0) # [1,12] -> [1,1,12]
            '''decoder_output_one and encoder_states hold the same values, because decoder_cell runs a single time step.
            From here on, encoder_states carries the decoder's hidden state.'''
            decoder_output_one, encoder_states = self.decoder_cell(decoder_input_one, encoder_states)
            '''attn_weights: attention weights between one decoder time step's hidden state and the whole encoder'''
            # attn_weights : [1, 1, n_step] # [1,1,6]
            attn_weights = self.get_attn_one_to_all(decoder_output_one, encoder_outputs)

            '''squeeze(): [1,1,6] -> [6,]; detach(): drop the gradient graph; cpu().numpy(): convert to a 1-D NumPy array'''
            trained_attn.append(attn_weights.squeeze().detach().cpu().numpy())
            # attn_weights is built from a CPU tensor, so move it to the model's device before bmm
            attn_weights = attn_weights.to(device)
            """a.bmm(b) is the same as torch.bmm(a, b)
                a: (z, x, y)
                b: (z, y, c)
                result = torch.bmm(a, b) has shape (z, x, c)
            """
            '''The attention-weighted sum of the encoder hidden states gives the context vector'''
            # context: [1,1,n_step(=6)] x [1,n_step(=6),n_hidden(=128)] = [1,1,128]
            context = attn_weights.bmm(encoder_outputs.transpose(0, 1))
            # decoder_output_one : [batch_size(=1), num_directions(=1)*n_hidden]
            decoder_output_one = decoder_output_one.squeeze(0) # [1,1,128] -> [1,128]
            # [1, num_directions(=1)*n_hidden] # [1,128]
            context = context.squeeze(1)
            '''Concatenate the context vector with the decoder hidden state to get the attention-aware model output'''
            # dim=1 in torch.cat concatenates along the second dimension, so [1,128]+[1,128] -> [1,256]
            output[i] = self.out(torch.cat((decoder_output_one, context), 1))
        # output: [5,1,12] -> [1,5,12] -> [5,12]
        return output.transpose(0, 1).squeeze(0), np.array(trained_attn)
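The shape bookkeeping of one decoder step (context vector, concatenation, output layer) can be checked in isolation. The NumPy sketch below uses random stand-in tensors and uniform attention weights, both assumptions made so that the result is easy to verify; batched `@` plays the role of `torch.bmm`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_enc_steps, n_hidden, n_class = 6, 128, 12

encoder_outputs = rng.normal(size=(n_enc_steps, 1, n_hidden))   # [6,1,128]
decoder_output_one = rng.normal(size=(1, 1, n_hidden))          # [1,1,128]

# Uniform weights stand in for the softmax output: [1,1,6]
attn_weights = np.full((1, 1, n_enc_steps), 1.0 / n_enc_steps)

# Batched matmul: [1,1,6] x [1,6,128] -> [1,1,128]
context = attn_weights @ encoder_outputs.transpose(1, 0, 2)

# Drop the singleton dimensions and concatenate: [1,128] + [1,128] -> [1,256]
combined = np.concatenate(
    [decoder_output_one.squeeze(0), context.squeeze(1)], axis=1)

# Random stand-in for the output layer: [1,256] x [256,12] -> [1,12]
W_out = rng.normal(size=(2 * n_hidden, n_class)) * 0.01
logits = combined @ W_out

print(context.shape, combined.shape, logits.shape)  # (1, 1, 128) (1, 256) (1, 12)
```

With uniform weights the context vector is simply the mean of the encoder outputs; trained weights shift that average toward the encoder steps most relevant to the current decoder step.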

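The get_attn_one_to_all() function:

Computes the attention weights of one decoder time step over all encoder time steps.
For each encoder time step, calls get_attn_one_to_one to compute the score between the decoder state and that encoder output.
Normalizes the scores with the softmax function so that the weights sum to 1.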

    '''Attention weights between one decoder time step's hidden state and all encoder hidden states'''
    def get_attn_one_to_all(self, decoder_output_one, encoder_outputs):
        n_step = len(encoder_outputs) # 6
        attn_scores = torch.zeros(n_step)  # attn_scores : [n_step,] -> [6,]

        '''Score the decoder state against every encoder time step'''
        for i in range(n_step):
            encoder_output_one = encoder_outputs[i]
            attn_scores[i] = self.get_attn_one_to_one(decoder_output_one, encoder_output_one)

        """
        F.softmax(input, dim) normalizes the scores into weights in the range 0 to 1
        softmax(x_i) = exp(x_i) / ( exp(x_1) + ··· + exp(x_n) )
        attn_scores is 1-D, so the softmax runs over dim=0
        (omitting dim is deprecated and triggers a warning)
        """
        # .view(1,1,-1) reshapes the 1-D tensor into a 3-D one with all elements on the last dimension
        return F.softmax(attn_scores, dim=0).view(1, 1, -1) # [6,] -> [1,1,6]
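The per-step loop can also be expressed as a single matrix product. The NumPy sketch below checks that both forms give the same scores; `W` and `b` are random stand-ins for the weight and bias of the attn linear layer (which computes x W^T + b), and the hidden size is reduced to 16 for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_step, n_hidden = 6, 16

encoder_outputs = rng.normal(size=(n_step, n_hidden))
decoder_state = rng.normal(size=n_hidden)
W = rng.normal(size=(n_hidden, n_hidden)) * 0.1   # stand-in for attn.weight
b = rng.normal(size=n_hidden) * 0.1               # stand-in for attn.bias


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


# Loop form: one dot product per encoder time step.
scores_loop = np.array([decoder_state @ (W @ h + b) for h in encoder_outputs])

# Vectorised form: transform all encoder outputs at once, then one matrix-vector product.
scores_vec = (encoder_outputs @ W.T + b) @ decoder_state

weights = softmax(scores_vec)
print(np.allclose(scores_loop, scores_vec))  # True
print(weights.shape)                          # (6,)
```

For a sequence this short the loop is perfectly adequate; the vectorised form mainly matters for longer inputs and batched training.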

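The get_attn_one_to_one() function:

Computes the attention score between the decoder at one time step and the encoder at one time step.
Applies the linear layer attn to the encoder hidden state, then takes the dot product of the decoder hidden state with the result to obtain the score.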

    '''Attention score between one encoder time step's hidden state and one decoder time step's hidden state'''
    def get_attn_one_to_one(self, decoder_output_one, encoder_output_one):
        # decoder_output_one : [n_step(=1), batch_size(=1), num_directions(=1)*n_hidden] # [1,1,128]
        # encoder_output_one : [batch_size, num_directions(=1)*n_hidden] # [1,128]
        # score : [batch_size, n_hidden] -> [1,128]
        score = self.attn(encoder_output_one)
        '''X.view(shape)
        >>> X = torch.ones((3,2))
        >>> X = X.view(2,3) # X now has shape (2,3)
        >>> X = X.view(-1) # X is flattened to 1-D
        '''
        # Flatten both to 1-D and take the dot product
        return torch.dot(decoder_output_one.view(-1), score.view(-1))  # inner product gives a scalar
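Taken together, get_attn_one_to_one, get_attn_one_to_all, and the bmm step in forward amount to the following computation. The symbols are notation introduced here for illustration: $h_t$ is the decoder hidden state at step $t$, $\bar{h}_s$ the encoder output at step $s$, and $W$, $b$ the weight and bias of the attn linear layer.

```latex
e_{t,s} = h_t^{\top}\,\bigl(W \bar{h}_s + b\bigr), \qquad
\alpha_{t,s} = \frac{\exp(e_{t,s})}{\sum_{s'} \exp(e_{t,s'})}, \qquad
c_t = \sum_{s} \alpha_{t,s}\, \bar{h}_s
```

The score $e_{t,s}$ is a bilinear form in the two hidden states, $\alpha_{t,s}$ are the softmax weights returned by get_attn_one_to_all, and $c_t$ is the context vector that is concatenated with $h_t$ before the output layer.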

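The translate() function: translating a sentence

First prepare the data and the initial decoder input, then use greedy search to predict one word at a time. Each prediction is based on the model's previous outputs and the current time step's input, and it is used to update the input of the next step.
Finally, run the model over the whole input sequence and pick the highest-probability word at each position. These indices are converted to words and collected in the decoded list.
Remove special symbols: every training target ends with E, so the model output should also end with E; find the position of E and drop it together with everything after it. All 'P' symbols (the padding symbol) are removed as well.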

def translate(sentences):
    input_batch, output_batch, target_batch = make_batch(sentences,word_dict)
    # Initial decoder input: S followed by padding
    blank_batch = [np.eye(n_class)[[word_dict[n] for n in 'SPPPP']]]
    # test_batch: [1,5,12] [batch_size,len_sen,dict_size]
    test_batch = torch.FloatTensor(np.array(blank_batch)).to(device)
    dec_inputs = torch.FloatTensor(np.array(blank_batch)).to(device)

    '''Greedy search'''
    for i in range(len(test_batch[0]) - 1):
        # predict: [len_sen, dict_size] [5,12]
        predict, trained_attn = model(input_batch, hidden_0, dec_inputs)
        predict = predict.data.max(1, keepdim=True)[1] # [5,1] [sen_len,1]
        # Feed the word predicted at step i as the decoder input at step i+1,
        # overwriting the padding symbol there
        dec_inputs[0][i + 1][word_dict['P']] = 0
        dec_inputs[0][i + 1][predict[i][0]] = 1

    predict, trained_attn = model(input_batch, hidden_0, dec_inputs)
    predict = predict.data.max(1, keepdim=True)[1] # [5,1] [sen_len,1]
    decoded = [word_list[i] for i in predict]
    real_decoded = decoded # keep the output with its special symbols

    '''Remove special symbols'''
    '''Every training target ends with E, so the model output should also end with E'''
    if 'E' in decoded:
        end = decoded.index('E')
        decoded = decoded[:end] # drop the end symbol and everything after it
    else:
        raise ValueError('no end symbol E in the model output')
    # Drop the padding symbols
    decoded = [w for w in decoded if w != 'P']

    # Join the list elements into strings
    translated = ' '.join(decoded)
    real_output = ' '.join(real_decoded)
    return translated, real_output
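The greedy loop and the clean-up step can be illustrated without PyTorch. In the pure-Python sketch below, `toy_predict` is a hypothetical stand-in for the trained model, and the sketch assumes that the word predicted at step i becomes the decoder input at step i+1, as the description of translate() states.

```python
def greedy_decode(predict_fn, max_len, start='S', pad='P'):
    """Fill the decoder input left to right, one predicted word per pass."""
    dec_inputs = [start] + [pad] * (max_len - 1)
    for i in range(max_len - 1):
        predicted = predict_fn(dec_inputs)   # one predicted word per position
        dec_inputs[i + 1] = predicted[i]     # step i's word feeds step i+1
    return predict_fn(dec_inputs)            # final pass over the completed input


def strip_special(tokens, end='E', pad='P'):
    """Cut at the first end symbol, then drop padding symbols."""
    if end not in tokens:
        raise ValueError('no end symbol in the output')
    return [t for t in tokens[:tokens.index(end)] if t != pad]


# Hypothetical stand-in for the trained model: position i is predicted correctly
# only if the decoder input at position i is the correct previous word.
TARGET = ['i', 'want', 'a', 'beer', 'E']
EXPECTED_INPUT = ['S'] + TARGET[:-1]


def toy_predict(dec_inputs):
    return [TARGET[i] if dec_inputs[i] == EXPECTED_INPUT[i] else 'P'
            for i in range(len(dec_inputs))]


decoded = greedy_decode(toy_predict, max_len=5)
print(' '.join(strip_special(decoded)))  # i want a beer
```

The toy model makes the dependency explicit: each pass can only get one more position right, which is why the decoder input has to be filled step by step.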

Main Program

if __name__ == '__main__':
    # n_step = 5 # number of cells(= number of Step)
    chars = 30 * '*'
    n_hidden = 128 # number of hidden units in one cell
    '''Why the GPU can be slower than the CPU here:
    moving data between host and GPU has a large overhead,
    and the GPU's advantage in matrix computation does not show in a network this small
    '''
    device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
    sentences = ['我 想 喝 啤 酒 P', 'S i want a beer', 'i want a beer E']

    '''1. Data preprocessing'''
    word_list, word_dict, number_dict, n_class = pre_process(sentences)
    input_batch, output_batch, target_batch = make_batch(sentences,word_dict)
    # hidden_0 : [num_layers(=1) * num_directions(=1), batch_size, n_hidden]
    hidden_0 = torch.zeros(1, 1, n_hidden).to(device) # [1,1,128]

    '''2. Build the model'''
    model = Attention()
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    if os.path.exists('model_param.pt'):
        # Load the saved parameters into the model
        model.load_state_dict(torch.load('model_param.pt', map_location=device))

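Training

Feed the data into the encoder and then into the decoder.
Take the index of the largest value in the model output and look up the corresponding word in the dictionary; that word is the model's prediction for the next word.
Compare the model output with the ground truth to compute the loss, then update the parameters with the Adam optimizer.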
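The prediction and loss steps, together with the early-stopping rule used by the training loop below (stop once the loss has stayed under a threshold for 30 consecutive epochs), can be sketched in NumPy. The function names, the toy logits, and the synthetic loss curve are assumptions made for illustration; `cross_entropy` mirrors what nn.CrossEntropyLoss computes with its default mean reduction.

```python
import numpy as np


def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target classes under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


def stop_epoch(losses, threshold=1e-4, patience=30):
    """First epoch (1-based) after `patience` consecutive losses below `threshold`."""
    streak = 0
    for epoch, loss in enumerate(losses, start=1):
        streak = streak + 1 if loss < threshold else 0
        if streak == patience:
            return epoch
    return None


logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 3.0, 0.3]])     # one row of scores per output position
targets = np.array([0, 1])               # correct class index per position

predicted = logits.argmax(axis=1)        # index of the largest score
loss = cross_entropy(logits, targets)
print(predicted, loss)

# Synthetic loss curve: a single spike at epoch 61 resets the streak.
losses = [0.1] * 50 + [1e-5] * 10 + [0.1] + [1e-5] * 100
print(stop_epoch(losses))  # 91
```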

    '''3. Training'''
    print('{}\nTrain\n{}'.format('*'*30, '*'*30))
    loss_record = []
    for epoch in range(1000):
        optimizer.zero_grad()
        output, trained_attn = model(input_batch, hidden_0, output_batch)
        output = output.to(device)
        loss = criterion(output, target_batch.squeeze(0)) # .squeeze(0): [1,5] -> [5]
        loss.backward()
        optimizer.step()

        if loss >= 0.0001: # stop early once the loss stays below 0.0001 for 30 consecutive epochs
            loss_record = []
        else:
            loss_record.append(loss.item())
            if len(loss_record) == 30:
                torch.save(model.state_dict(), 'model_param.pt')
                break

        if (epoch + 1) % 100 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'Loss = {:.6f}'.format(loss.item()))
            torch.save(model.state_dict(), 'model_param.pt')

    '''4. Testing'''
    print('{}\nTest\n{}'.format('*'*30, '*'*30))
    input = sentences[0]
    # translate() calls make_batch, which indexes sentences[0..2], so pass the full list
    output, real_output = translate(sentences)
    print(sentences[0].replace(' P', ''), '->', output)

Output:

******************************
Test
******************************
我想喝啤酒 -> beer

    '''5. Visualize the attention weight matrix'''
    trained_attn = trained_attn.round(2)
    fig = plt.figure(figsize=(len(input.split()), len(real_output.split()))) # (6,5)
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(trained_attn, cmap='viridis')
    # SimSun (宋体) is used so that the Chinese tick labels render correctly
    ax.set_xticklabels([''] + input.split(), \
        fontdict={'fontsize': 14}, fontproperties='SimSun')
    ax.set_yticklabels([''] + real_output.split(), \
        fontdict={'fontsize': 14}, fontproperties='SimSun')
    plt.show()

The attention weight matrix is visualized below: