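I. Data Overview
The experiments use a movie-review dataset that is already word-segmented and labeled. There are two labels: 0 for a positive review and 1 for a negative review. The dataset is summarized below:
- Training set: 19,998 reviews (half positive, half negative);
- Test set: 369 reviews (182 positive, 187 negative);
- Validation set: 5,629 reviews (2,817 positive, 2,812 negative);
- Pre-trained word vectors: Chinese Wikipedia word vectors, wiki_word2vec_50.bin.
If the text has not been segmented, the first step is to segment the review text into words.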
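The loading code later in this article splits each line on a tab and then on spaces, so each record is assumed to look like `label<TAB>word word word`. A minimal sketch of reading one such record follows; the sample sentence and the helper name `parse_line` are made up for illustration and are not part of the original project.

```python
def parse_line(line):
    """Split one 'label<TAB>segmented text' record into (label, words)."""
    label, text = line.strip().split("\t", 1)
    return int(label), text.split(" ")


# Hypothetical record: label 0 (positive) followed by space-separated words.
sample = "0\t这部 电影 非常 好看\n"
label, words = parse_line(sample)
print(label, words)
```

If the raw reviews were not segmented yet, a word segmenter would have to produce the space-separated form first.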
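II. Models
1. Bidirectional LSTM
A bidirectional LSTM can be understood as two LSTMs trained at the same time, running in opposite directions with separate parameters. The hidden state $h_t$ at the current time step is the concatenation of the two $h_t$ vectors produced by the two directions. This lets the model capture features from both the past and the future of time step $t$; the whole network is trained with backpropagation.
Key point when building the model:
This is a sentiment classification task, so only a representation of the whole sentence needs to be classified. What gets concatenated here is therefore sentence-level information: the final hidden states of the forward LSTM and the backward LSTM from the top layer.
Difference between the outputs of a unidirectional and a bidirectional LSTM:
- In a bidirectional LSTM, $h_t$ at each time step is the concatenation of the two directions' $h_t$ vectors, so the per-step output has width 2 * hidden_size. For the final states, with the two stacked layers used here (num_layers = 2), the forward direction contributes hidden states of shape [2, batch, hidden_size], the backward direction contributes another [2, batch, hidden_size], and together h_n has shape [4, batch, hidden_size].
The model code is: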
import torch
import torch.nn as nn
import torch.nn.functional as F
class LSTMModel(nn.Module):
    def __init__(
        self,
        input_size,
        hidden_size,
        num_layers,
        dropout,
        bidirectional,
        batch_first,
        classes,
        pretrained_weight,
        update_w2v
    ):
        """
        :param input_size: number of features of the input x, i.e. the embedding size
        :param hidden_size: size of the hidden state
        :param num_layers: number of LSTM layers (stacked LSTM)
        :param dropout: if non-zero, adds a Dropout layer on the output of every LSTM layer except the last
        :param classes: number of classes
        :param batch_first: if True, input and output tensors are (batch, seq, feature)
        :param bidirectional: if True, use a bidirectional LSTM
        :param pretrained_weight: pre-trained word vectors (float tensor)
        :param update_w2v: whether the word vectors are updated during training
        """
        super(LSTMModel, self).__init__()
        # Embedding layer: maps word indices to word vectors.
        # freeze=not update_w2v lets update_w2v actually control fine-tuning
        # (the original always set requires_grad = True and ignored the flag).
        self.embedding = nn.Embedding.from_pretrained(pretrained_weight, freeze=not update_w2v)
        # Encoder layer
        self.encoder = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=batch_first,
            dropout=dropout,
            bidirectional=bidirectional
        )
        # Decoder layers: forward() concatenates the first and last time steps,
        # so the input width is 2 * D * hidden_size (D = 2 if bidirectional else 1).
        if bidirectional:
            self.decoder1 = nn.Linear(hidden_size * 4, hidden_size)
            self.decoder2 = nn.Linear(hidden_size, classes)
        else:
            self.decoder1 = nn.Linear(hidden_size * 2, hidden_size)
            self.decoder2 = nn.Linear(hidden_size, classes)

    def forward(self, x):
        """
        Forward pass
        :param x: word indices, shape (batch, seq_len); assumes batch_first=True
        :return: class scores, shape (batch, classes)
        """
        # Embedding layer
        # x: (batch, seq_len); embedding table: (num_embeddings, embedding_dim)
        # => emb: (batch, seq_len, embedding_dim)
        emb = self.embedding(x)
        # Encoder layer
        state, hidden = self.encoder(emb)
        # state: (batch, seq_len, D*hidden_size), D=2 if bidirectional else 1 => [64,75,256]
        # hidden: (h_n, c_n); h_n / c_n: (D*num_layers, batch, hidden_size) => [4,64,128]
        # This looks like concatenating output-layer results, but it carries the final
        # hidden states of both directions: state[:, -1, :hidden_size] is the forward
        # LSTM's last step and state[:, 0, hidden_size:] is the backward LSTM's last step.
        # The other two halves are each direction's first-step output.
        encoding = torch.cat([state[:, 0, :], state[:, -1, :]], dim=1)
        # Decoder layers
        # encoding: (batch, 2*D*hidden_size) => [64,512]
        outputs = self.decoder1(encoding)
        outputs = self.decoder2(outputs)  # (batch, classes) => [64,2]
        return outputs
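To check the shapes discussed above, here is a minimal sketch on random data. The sizes are arbitrary and chosen only for illustration. It also shows how, for `nn.LSTM` with `batch_first=True`, the top layer's final forward and backward states in `h_n` correspond to slices of the per-step output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, emb_dim, hidden, layers = 4, 7, 50, 16, 2  # illustrative sizes

lstm = nn.LSTM(emb_dim, hidden, num_layers=layers,
               batch_first=True, bidirectional=True)
x = torch.randn(batch, seq_len, emb_dim)
output, (h_n, c_n) = lstm(x)

# output: (batch, seq_len, 2 * hidden); h_n: (2 * layers, batch, hidden)
print(tuple(output.shape), tuple(h_n.shape))

# h_n is ordered layer by layer, forward then backward within each layer,
# so the last two entries belong to the top layer.
fwd_last = h_n[-2]  # forward direction, after reading the last token
bwd_last = h_n[-1]  # backward direction, after reading the first token

# The same vectors appear inside the per-step output.
same_fwd = torch.allclose(output[:, -1, :hidden], fwd_last)
same_bwd = torch.allclose(output[:, 0, hidden:], bwd_last)

# A sentence vector built only from the two final states.
sentence_vec = torch.cat([fwd_last, bwd_last], dim=1)  # (batch, 2 * hidden)
```

Using `h_n` directly gives a 2 * hidden_size sentence vector, whereas the article's model concatenates the full first and last output steps and therefore feeds 4 * hidden_size features to its first linear layer.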
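2. LSTM+Attention
With static attention, the network works as follows:
$h_t$ is the hidden state of each word. The vector $\overline{h_s}$ is randomly initialized and then updated iteratively by gradient descent, using the gradient $\frac{\partial Loss}{\partial \overline{h_s}}$ obtained through backpropagation.
In this classification task, the attention score is computed as:
$$score(h_t,\overline{h_s}) = v_a^{T}\tanh(W_a[h_t;\overline{h_s}])$$
Each score is a scalar. The scores of all words in a sentence are concatenated and passed through softmax to obtain weights that sum to 1. The hidden states are then averaged with these weights to produce a single sentence vector, which goes through a classification layer and a softmax to give the score of each class.
The attention mechanism here learns, through training, to give important words a large weight and unimportant words a small weight.
The model code is: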
class LSTM_attention(nn.Module):
    def __init__(self,
                 input_size,
                 hidden_size,
                 num_layers,
                 dropout,
                 bidirectional,
                 batch_first,
                 classes,
                 pretrained_weight,
                 update_w2v,
                 ):
        """
        :param input_size: number of features of the input x, i.e. the embedding size
        :param hidden_size: size of the hidden state
        :param num_layers: number of LSTM layers (stacked LSTM)
        :param dropout: if non-zero, adds a Dropout layer on the output of every LSTM layer except the last
        :param classes: number of classes
        :param batch_first: if True, input and output tensors are (batch, seq, feature)
        :param bidirectional: if True, use a bidirectional LSTM
        :param pretrained_weight: pre-trained word vectors (float tensor)
        :param update_w2v: whether the word vectors are updated during training
        """
        super(LSTM_attention, self).__init__()
        # Embedding layer: maps word indices to word vectors.
        # freeze=not update_w2v lets update_w2v actually control fine-tuning.
        self.embedding = nn.Embedding.from_pretrained(pretrained_weight, freeze=not update_w2v)
        # Encoder layer
        self.encoder = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=batch_first,
            dropout=dropout,
            bidirectional=bidirectional
        )
        # Width of the LSTM output per time step. The original hard-coded
        # 2 * hidden_size, which only matches the bidirectional case.
        feat_size = (2 if bidirectional else 1) * hidden_size
        # nn.Parameter registers these tensors as trainable, so they are
        # updated during training along with the rest of the model.
        self.weight_W = nn.Parameter(torch.empty(feat_size, feat_size))
        self.weight_proj = nn.Parameter(torch.empty(feat_size, 1))
        # Initialize the attention parameters
        nn.init.uniform_(self.weight_W, -0.1, 0.1)
        nn.init.uniform_(self.weight_proj, -0.1, 0.1)
        # Decoder layers
        self.decoder1 = nn.Linear(feat_size, hidden_size)
        self.decoder2 = nn.Linear(hidden_size, classes)

    def forward(self, x):
        """
        Forward pass
        :param x: word indices, shape (batch, seq_len); assumes batch_first=True
        :return: class scores, shape (batch, classes)
        """
        # Embedding layer
        # x: (batch, seq_len); embedding table: (num_embeddings, embedding_dim)
        # => emb: (batch, seq_len, embedding_dim)
        emb = self.embedding(x)
        # Encoder layer
        state, hidden = self.encoder(emb)
        # state: (batch, seq_len, D*hidden_size), D=2 if bidirectional else 1 => [64,75,256]
        # hidden: (h_n, c_n); h_n / c_n: (D*num_layers, batch, hidden_size) => [4,64,128]
        # Attention: weight_proj^T * tanh(weight_W * state)
        # (batch, seq_len, D*hidden_size) => (batch, seq_len, D*hidden_size)
        u = torch.tanh(torch.matmul(state, self.weight_W))
        # (batch, seq_len, D*hidden_size) => (batch, seq_len, 1)
        att = torch.matmul(u, self.weight_proj)
        # Normalize over the sequence dimension so the weights of one sentence sum to 1
        att_score = F.softmax(att, dim=1)
        # Weight every hidden state and sum over time steps
        scored_x = state * att_score
        encoding = torch.sum(scored_x, dim=1)
        # Decoder layers
        # encoding: (batch, D*hidden_size) => [64,256]
        outputs = self.decoder1(encoding)
        outputs = self.decoder2(outputs)  # (batch, classes) => [64,2]
        return outputs
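The attention pooling step can also be looked at in isolation. The sketch below uses random tensors in place of real LSTM outputs and plain tensors in place of the trained parameters (all sizes are arbitrary assumptions). It only illustrates that the softmax is taken over the sequence dimension and that the weighted sum collapses the sequence into one vector per sentence.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, feat = 2, 5, 8  # illustrative sizes

state = torch.randn(batch, seq_len, feat)         # stand-in for the hidden states h_t
W = torch.empty(feat, feat).uniform_(-0.1, 0.1)   # plays the role of weight_W
v = torch.empty(feat, 1).uniform_(-0.1, 0.1)      # plays the role of weight_proj

u = torch.tanh(state @ W)               # (batch, seq_len, feat)
scores = u @ v                          # (batch, seq_len, 1): one scalar per word
weights = F.softmax(scores, dim=1)      # normalized across the words of a sentence
pooled = (state * weights).sum(dim=1)   # (batch, feat): weighted average of h_t

print(weights.squeeze(-1))
print(tuple(pooled.shape))
```

Taking the softmax over `dim=1` matters: normalizing over any other dimension would not give per-sentence weights that sum to 1.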
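3. TextCNN
The model structure follows the TextCNN paper.
The convolution kernel illustrated there covers two adjacent word vectors (2-gram). Features with other window sizes can be extracted by using kernels of different heights; for example, kernels of height 2, 3 and 4 extract 2-gram, 3-gram and 4-gram information respectively.
The core of TextCNN is to apply convolution kernels of different sizes to the word vectors, pass each output through a pooling layer, concatenate the pooled results into one feature vector, and classify it with a fully connected layer.
The model code is: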
class TextCNNModel(nn.Module):
    def __init__(self,
                 num_filters,
                 kernel_sizes,
                 embedding_dim,
                 dropout,
                 classes,
                 pretrained_weight,
                 update_w2v):
        """
        Build the TextCNN model
        :param num_filters: number of output channels per kernel size
        :param kernel_sizes: heights of the convolution kernels, e.g. [2,3,4]
        :param embedding_dim: width of the convolution kernels
        :param dropout: dropout rate
        :param classes: number of classes
        :param pretrained_weight: pre-trained word vectors (float tensor)
        :param update_w2v: whether the word vectors are updated during training
        """
        super(TextCNNModel, self).__init__()
        # Embedding layer: load the pre-trained word vectors.
        # The original set requires_grad on weight.data, which has no effect
        # on the parameter; freeze=not update_w2v applies the flag correctly.
        self.embedding = nn.Embedding.from_pretrained(pretrained_weight, freeze=not update_w2v)
        # One convolution layer per kernel size: 2-gram, 3-gram, 4-gram, ...
        self.convs = nn.ModuleList([nn.Conv2d(1, num_filters, (K, embedding_dim)) for K in kernel_sizes])
        # Dropout layer
        self.dropout = nn.Dropout(dropout)
        # Fully connected layer
        self.fc = nn.Linear(len(kernel_sizes) * num_filters, classes)

    def forward(self, x):
        """
        Forward pass
        :param x: word indices, shape (batch, seq_len)
        :return: class scores, shape (batch, classes)
        """
        # (batch, seq_len) => (batch, seq_len, emb_size)
        x = self.embedding(x)
        # (batch, seq_len, emb_size) => (batch, 1, seq_len, emb_size)
        x = x.unsqueeze(1)
        # conv: (batch, 1, seq_len, emb_size) => (batch, num_filters, seq_len - kernel_size + 1, 1)
        # squeeze(3) => (batch, num_filters, seq_len - kernel_size + 1)
        x = [F.relu(conv(x)).squeeze(3) for conv in self.convs]
        # Max over time: (batch, num_filters, seq_len - kernel_size + 1) => (batch, num_filters)
        x = [F.max_pool1d(line, line.size(2)).squeeze(2) for line in x]
        # [(batch, num_filters)] * len(kernel_sizes) => (batch, len(kernel_sizes) * num_filters)
        x = torch.cat(x, 1)
        x = self.dropout(x)
        # (batch, len(kernel_sizes) * num_filters) => (batch, classes)
        logit = self.fc(x)
        return logit
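The shape comments in the forward pass can be traced with a short sketch on random input. The batch size and sequence length below are arbitrary assumptions. The kernel heights, filter count and embedding width match the values this article uses in its configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, emb_dim, num_filters = 3, 10, 50, 6  # batch and seq_len are illustrative
kernel_sizes = [2, 3, 4]

# Already embedded input with a channel dimension: (batch, 1, seq_len, emb_dim)
x = torch.randn(batch, 1, seq_len, emb_dim)

fmap_lengths = []
pooled = []
for k in kernel_sizes:
    conv = nn.Conv2d(1, num_filters, (k, emb_dim))
    # The kernel spans the full embedding width, so the last dimension becomes 1.
    fmap = F.relu(conv(x)).squeeze(3)   # (batch, num_filters, seq_len - k + 1)
    fmap_lengths.append(fmap.size(2))
    # Max over time keeps the strongest response of each filter.
    pooled.append(F.max_pool1d(fmap, fmap.size(2)).squeeze(2))  # (batch, num_filters)

features = torch.cat(pooled, dim=1)     # (batch, len(kernel_sizes) * num_filters)
print(fmap_lengths, tuple(features.shape))
```

Because of the max-over-time pooling, the final feature width depends only on the number of kernel sizes and filters, not on the sentence length.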
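III. Implementing the Sentiment Classification Task
Implementing a deep learning task generally requires the following modules:
- Data preprocessing
- Data loading
- Model building
- Training, validation and testing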
#!/usr/bin/env python
# -*- coding:utf-8 -*-
"""
@author: liujie
@file: Config.py
@time: 2022/08/29
@desc: central configuration of all parameters
"""
class MyConfig:
    num_filters = 6  # number of CNN output channels
    kernel_sizes = [2, 3, 4]  # heights of the CNN kernels
    update_w2v = True  # whether to update the word vectors during training
    n_class = 2  # number of classes: pos and neg
    max_sen_len = 75  # maximum sentence length
    embedding_dim = 50  # word vector dimension
    batch_size = 64  # batch size
    hidden_dim = 128  # number of hidden units
    n_epoch = 50  # number of epochs, i.e. passes over the whole training set
    lr = 0.0001  # learning rate; not needed if opt='adadelta'
    # Dropout setting, described by the author as the keep ratio. Note that
    # nn.Dropout and nn.LSTM treat their argument as the drop probability.
    drop_keep_prob = 0.2
    num_layers = 2  # number of LSTM layers
    seed = 2022
    batch_first = True
    bidirectional = True  # whether to use a bidirectional LSTM
    model_dir = "./model"
    stopword_path = "./data/stopword.txt"
    train_path = "./data/train.txt"
    val_path = "./data/validation.txt"
    test_path = "./data/test.txt"
    pre_path = "./data/pre.txt"
    word2id_path = "./word2vec/word2id.txt"
    pre_word2vec_path = "./word2vec/wiki_word2vec_50.bin"
    corpus_word2vec_path = "./word2vec/word_vec.txt"
    model_state_dict_path = "./model/sen_model.pkl"
    best_model_path = "./model/sen_model_best.pkl"
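1. Data preprocessing
The preprocessing pipeline is:
- Load the training, validation and test sets and the stop-word list
- Build the word2index and index2word mapping dictionaries
- Use the pre-trained word2vec vectors to build the vectors for the corpus vocabulary; the row number of each vector is the index of its word
- Convert text to index form: map every word of the raw text (labels and text) to its index in word2id and return the result as an array
The code, dataProcess.py, is: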
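The steps above can be sketched on a toy corpus without any external files. Everything in this sketch is made up for illustration: the sentences, the stop-word set, the 4-dimensional "pre-trained" vector and the helper `to_array`. Padding on the right with index 0 is also an assumption, not something this article confirms.

```python
import numpy as np

corpus = [["电影", "好看"], ["剧情", "无聊", "电影"]]  # toy, already segmented
stop_words = {"的"}

# Step 2: word -> index, with index 0 reserved for padding, plus the reverse map.
word2id = {"_PAD_": 0}
for sent in corpus:
    for w in sent:
        if w not in stop_words and w not in word2id:
            word2id[w] = len(word2id)
id2word = {i: w for w, i in word2id.items()}

# Step 3: embedding matrix whose row i is the vector of word i.
# Words missing from the pre-trained table keep their random initialization.
rng = np.random.default_rng(0)
dim = 4
pretrained = {"电影": np.ones(dim)}  # stand-in for a loaded word2vec model
vectors = rng.uniform(-1, 1, (len(word2id), dim))
for w, i in word2id.items():
    if w in pretrained:
        vectors[i] = pretrained[w]

# Step 4: sentences -> fixed-length index array (truncate, then pad with 0).
def to_array(sentences, seq_len):
    arr = np.zeros((len(sentences), seq_len), dtype=np.int64)
    for row, sent in enumerate(sentences):
        ids = [word2id.get(w, 0) for w in sent if w not in stop_words][:seq_len]
        arr[row, :len(ids)] = ids
    return arr

data = to_array(corpus, seq_len=4)
print(word2id)
print(data)
```

Row `i` of `vectors` can then be passed, as a float tensor, to an embedding layer so that index `i` looks up the vector of `id2word[i]`.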
#!/usr/bin/env python
# -*- coding:utf-8 -*-
"""
@author: liujie
@file: dataProcess.py
@time: 2022/08/29
@desc:
Data preprocessing pipeline:
1. Load the training, validation and test sets and the stop-word list
2. Build the word2index and index2word mapping dictionaries
3. Use the pre-trained word2vec vectors to build the vectors for the corpus vocabulary; the row number of each vector is the index of its word
4. Convert text to index form: map every word of the raw text (labels and text) to its index in word2id and return the result as an array
"""
import re
import codecs
import gensim
import numpy as np
from Config import MyConfig
class Dataprocess:
    def __init__(self):
        self.stopWords = self.stopWordList_Load(MyConfig.stopword_path)
        self.word2id = self.bulid_word2index(MyConfig.word2id_path)  # build word2id
        self.id2word = self.bulid_index2word(self.word2id)  # build id2word
        self.w2vec = self.bulid_word2vec(MyConfig.pre_word2vec_path, self.word2id,
                                         MyConfig.corpus_word2vec_path)  # build word2vec
        # Build the training, validation and test arrays
        self.result = self.prepare_data(self.word2id,
                                        train_path=MyConfig.train_path,
                                        val_path=MyConfig.val_path,
                                        test_path=MyConfig.test_path,
                                        seq_lenth=MyConfig.max_sen_len)
    def org_data_load(self, file_path):
        """
        Load the labels and texts of the raw dataset
        :param file_path: path of the file
        :return: list of labels and list of texts
        """
        lable = []
        text = []
        with codecs.open(file_path, "r", encoding="utf-8") as f:
            for line in f.readlines():
                # Split "label<TAB>text"; avoid shadowing the built-in str
                parts = line.strip().split("\t")
                lable.append(parts[0])
                text.append(parts[1])
        return lable, text

    def stopWordList_Load(self, filepath):
        """
        Load the stop-word list
        :param filepath: path of the file
        :return: the stop words, as a set for fast membership tests
        """
        stopWordList = set()
        with codecs.open(filepath, "r", encoding="utf-8") as f:
            for line in f.readlines():
                line = line.strip()
                stopWordList.add(line)
        return stopWordList
    def bulid_word2index(self, file_path):
        """
        Build the word2index dictionary and write it to a file
        :param file_path: path of the output file
        :return: the word2id dictionary
        """
        # Files the vocabulary is built from
        path = [MyConfig.train_path, MyConfig.val_path]
        word2id = {"_PAD_": 0}
        for _path in path:
            with codecs.open(_path, 'r', encoding="utf-8") as f:
                for line in f.readlines():
                    output = []
                    words = line.strip().split("\t")[1].split(" ")
                    for word in words:
                        if word not in self.stopWords:
                            # Keep the first run of one or more Chinese characters
                            rt = re.findall("[\u4E00-\u9FA5]+", word)
                            if len(rt) == 0:
                                continue
                            else:
                                output.append(rt[0])
                    for word in output:
                        if word not in word2id:
                            word2id[word] = len(word2id)
        # Write word2id to the file
        with codecs.open(file_path, 'w', encoding="utf-8") as f:
            for word, index in word2id.items():
                f.write(word + "\t" + str(index) + '\n')
        return word2id

    def bulid_index2word(self, word2id):
        """
        Build the id2word dictionary
        :param word2id: word -> index dictionary
        :return: index -> word dictionary
        """
        id2word = {}
        for word, index in word2id.items():
            id2word[index] = word
        return id2word
    def bulid_word2vec(self, fname, word2id, save_to_path=None):
        """
        Use the pre-trained word2vec vectors to build the vectors for the corpus vocabulary;
        the row number of each vector is the index of its word
        :param fname: path of the pre-trained model
        :param word2id: word -> index dictionary
        :param save_to_path: file in which to store the corpus word vectors
        :return: array of shape (n_words, vector_size)
        """
        n_words = max(word2id.values()) + 1  # vocabulary size
        # Load the pre-trained word2vec model
        model = gensim.models.KeyedVectors.load_word2vec_format(fname, binary=True)
        # Randomly initialize the word vectors
        words_vec = np.random.uniform(-1, 1, [n_words, model.vector_size])
        for word in word2id.keys():
            # Out-of-vocabulary words keep their random initialization
            try:
                words_vec[word2id[word]] = model[word]
            except KeyError:
                pass
        if save_to_path:
            with codecs.open(save_to_path, 'w', encoding="utf-8") as f:
                for vec in words_vec:
                    vec = [str(w) for w in vec]
                    f.write(",".join(vec))
                    f.write("\n")
        return words_vec
    def text_of_array(self, word2id, seq_lenth, path):
        """
        Convert text to index form: map every word of the raw text (labels and text)
        to its index in word2id and return the result as an array
        :param word2id: dict, vocabulary of the corpus
        :param seq_lenth: int, fixed length of each sequence
        :param path: str, raw text dataset to process
        :return: the dataset converted to index arrays (array) and the labels (list)
        """
        labels = []
        i = 0
        sens = []
        # First pass: collect the sentences to get their count
        with codecs.open(path, encoding="utf-8") as f:
            for line in f.readlines():
                words = line.strip().split("\t")[1].split(" ")
                new_sen = [word2id.get(word, 0) for word in words if word not in self.stopWords]
                new_sen_vec = np.array(new_sen).reshape(1, -1)
                sens.append(new_sen_vec)
        # Second pass: convert the raw text to word indices and fill the output array
        with codecs.open(path, encoding="utf-8") as f:
            sentences_array = np.zeros(shape=(len(sens), seq_lenth))
            for line in f.readlines():
                words = line.strip().split("\t")[1].split(" ")
                new_sen = [word2id.get(word, 0) for word in words if word not in self.stopWords]
                new_sen_vec = np.array