A SQuAD Question Answering System Based on a Pretrained BERT Model
step-1 Run the example
Following Hugging Face's pytorch_transformers, download and run the example script run_squad.py.
Run parameters:
python run_squad.py
--model_type bert
--model_name_or_path bert-base-uncased
--do_train
--do_eval
--do_lower_case
--train_file ../../SQUAD_DIR/train-v1.1.json
--predict_file ../../SQUAD_DIR/dev-v1.1.json
--per_gpu_train_batch_size 4
--learning_rate 3e-5
--num_train_epochs 2.0
--max_seq_length 384
--doc_stride 128
--output_dir ../../SQUAD_DIR/OUTPUT
--overwrite_output_dir
--gradient_accumulation_steps 3
The model config (config.json):
{'attention_probs_dropout_prob': 0.1, 'finetuning_task': None, 'hidden_act': 'gelu', 'hidden_dropout_prob': 0.1, 'hidden_size': 768, 'initializer_range': 0.02, 'intermediate_size': 3072, 'layer_norm_eps': 1e-12, 'max_position_embeddings': 512, 'num_attention_heads': 12, 'num_hidden_layers': 12, 'num_labels': 2, 'output_attentions': False, 'output_hidden_states': False, 'pruned_heads': {}, 'torchscript': False, 'type_vocab_size': 2, 'vocab_size': 30522}
tokenizer_config.json
{'do_lower_case': True, 'max_len': 512, 'init_inputs': []}
Results:
{"exact": 79.2715, "f1": 86.96, "total": 10570, "HasAns_exact": 79.27, "HasAns_f1": 86.96, "HasAns_total": 10570}
Evaluation:
python evaluate-v1.1.py dev-v1.1.json OUTPUT/predictions_.json
{"exact_match": 79.27152317880795, "f1": 86.96144570829648}
step-2 Code walkthrough
1. SQuAD data format
SQuAD: a JSON file
The JSON file has dict_keys(['data', 'version']), where:
1. 'version' is 1.1, indicating the SQuAD version.
2. 'data' is the training data: a <class 'list'> containing 442 items, each of which is a dict.
data[11] has dict_keys(['title', 'paragraphs']) ## taking data[11] as an example
2.1 'title' is the title of the article.
2.2 'paragraphs' contains the article's paragraphs: a <class 'list'> with n items (148 here), each of which is a dict.
paragraphs[0] has dict_keys(['context', 'qas'])
2.2.1 'context' is the text of the paragraph.
2.2.2 'qas' is a list of the questions about this paragraph (6 here), each of which is a dict.
Some questions in qas are quite similar: different phrasings of the same question, all sharing the same answer.
qas[0] has dict_keys(['answers', 'question', 'id'])
2.2.2.1 answers: a list whose elements each have dict_keys(['answer_start', 'text'])
2.2.2.2 question: the question text
2.2.2.3 id: the question id
For example (a small snippet for inspecting this structure follows the samples below):
{'answers': [{'answer_start': 0, 'text': 'New York'}],
'question': 'What city in the United States has the highest population?',
'id': '56ce304daab44d1400b8850e'}
{'answers': [{'answer_start': 0, 'text': 'New York'}],
'question': 'In what city is the United Nations based?',
'id': '56ce304daab44d1400b8850f'}
{'answers': [{'answer_start': 0, 'text': 'New York'}],
'question': 'What city has been called the cultural capital of the world?',
'id': '56ce304daab44d1400b88510'}
{'answers': [{'answer_start': 0, 'text': 'New York'}],
'question': 'What American city welcomes the largest number of legal immigrants?',
'id': '56ce304daab44d1400b88511'}
{'answers': [{'answer_start': 22, 'text': 'New York City'}],
'question': 'The major gateway for immigration has been which US city?',
'id': '56cf5d41aab44d1400b89130'}
{'answers': [{'answer_start': 22, 'text': 'New York City'}],
'question': 'The most populated city in the United States is which city?',
'id': '56cf5d41aab44d1400b89131'}
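As a quick sanity check, the structure above can be inspected with a few lines of Python. This is only a minimal sketch; the file path is the one used in the run command above and may differ on your machine.

import json

# hypothetical path -- adjust to wherever your SQuAD files live
with open('../../SQUAD_DIR/train-v1.1.json', 'r', encoding='utf-8') as f:
    squad = json.load(f)

print(squad.keys())           # dict_keys(['data', 'version'])
print(squad['version'])       # 1.1
print(len(squad['data']))     # 442 articles
article = squad['data'][11]   # one article: {'title', 'paragraphs'}
paragraph = article['paragraphs'][0]   # one paragraph: {'context', 'qas'}
print(paragraph['context'][:80])
for qa in paragraph['qas'][:2]:        # each qa: {'answers', 'question', 'id'}
    print(qa['question'], '->', qa['answers'][0]['text'])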
2. Reading and converting the data
2-1. Source data structure: SquadExample
SquadExample holds one parsed example from the raw SQuAD data.
class SquadExample(object):
"""
A single training/test example for the squad dataset.
For examples without an answer, the start and end position are -1.
"""
def __init__(self, qas_id, question_text, doc_tokens,
orig_answer_text=None, start_position=None, end_position=None,
is_impossible=None):
self.qas_id = qas_id
self.question_text = question_text
self.doc_tokens = doc_tokens
self.orig_answer_text = orig_answer_text
self.start_position = start_position
self.end_position = end_position
self.is_impossible = is_impossible
Here qas_id is the example ID, question_text is the question text, doc_tokens is the reading passage (as a list of words), orig_answer_text is the original answer text, start_position and end_position are the word positions where the answer starts and ends in the passage, and is_impossible is the negative (unanswerable) flag used in SQuAD 2.0 (it can be ignored here).
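For illustration, a hand-built instance might look like the following. The question and id are taken from the sample above; doc_tokens and the positions are made up, since the full context is not shown here.

example = SquadExample(
    qas_id='56ce304daab44d1400b8850e',
    question_text='What city in the United States has the highest population?',
    doc_tokens=['New', 'York', 'City', 'is', 'the', 'most', 'populous', 'city'],  # made-up fragment
    orig_answer_text='New York',
    start_position=0,   # index of 'New' in doc_tokens
    end_position=1,     # index of 'York' in doc_tokens
    is_impossible=False)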
2-2. Reading the source data: read_squad_examples
def read_squad_examples(input_file, is_training, version_2_with_negative):
#read a SQuAD json file into a list of SquadExample
with open(input_file, "r", encoding='utf-8') as reader:
input_data = json.load(reader)['data']
def is_whitespace(c):
if c==' ' or c=='\t' or c=='\r' or c=='\n' or ord(c)==0x202F:
return True
return False
examples = []
for entry in input_data: ## entry is a dict {title, paragraphs}; we mainly process paragraphs here
## entry['paragraphs'] is a list; each item is a dict {context, qas}
for paragraph in entry['paragraphs']: ## paragraph is a dict {context, qas}
# step-1: process paragraph['context'] to get doc_tokens and char_to_word_offset
paragraph_text = paragraph['context']
doc_tokens = [] # the words of the context, in order; equivalent to paragraph['context'].split()
char_to_word_offset = [] # for each character of the context, the index of the word it belongs to
prev_is_whitespace = True
for c in paragraph_text: ## walk over the text character by character
## If the previous character was whitespace, doc_tokens.append(c) starts a new token.
## Otherwise, the character c is appended to the last element doc_tokens[-1] (a str).
## The result is that doc_tokens is a list of words, equivalent to paragraph_text.split().
## (It is unclear why this is done instead of simply calling split(); a small demo follows after this function.)
## char_to_word_offset maps each character of paragraph_text to the index of its word in doc_tokens.
## Note that char_to_word_offset also covers whitespace; whitespace keeps the index of the previous word.
if is_whitespace(c):
prev_is_whitespace = True
else:
if prev_is_whitespace:
doc_tokens.append(c)
else:
doc_tokens[-1] += c
prev_is_whitespace = False
char_to_word_offset.append(len(doc_tokens) - 1)
# step-2: process paragraph['qas'] to get qas_id, question_text, orig_answer_text, start_position, end_position, is_impossible
for qa in paragraph['qas']: ## qa is a dict {answers, question, id}
qas_id = qa['id']
question_text = qa['question']
start_position = None
end_position = None
orig_answer_text = None
is_impossible = False
if is_training:
if version_2_with_negative: ## SQuAD 2.0 additionally contains negative (unanswerable) questions
is_impossible = qa['is_impossible']
if (len(qa['answers']) !=1) and (not is_impossible): # an answerable training question must have exactly one answer
raise ValueError('For training, each question should have exactly 1 answer.')
if not is_impossible: ## answerable question: extract the answer span
answer = qa['answers'][0] ## qa['answers'] is a list with a single element, a dict {text, answer_start}
orig_answer_text = answer['text']
answer_offset = answer['answer_start']
answer_length = len(orig_answer_text) ## length of the answer string in characters
start_position = char_to_word_offset[answer_offset] # word index where the answer starts in the document
end_position = char_to_word_offset[answer_offset + answer_length - 1] # word index where the answer ends in the document
## Only keep answers whose text can be found exactly in the document.
## If answer['text'] cannot be found, it is usually due to odd encoding/whitespace issues, and the example is skipped.
## This means that in training mode not every example is guaranteed to be kept.
actual_text = " ".join(doc_tokens[start_position:(end_position + 1)]) # the answer span as it appears in the document
cleaned_answer_text = " ".join(whitespace_tokenize(orig_answer_text))
## whitespace_tokenize returns text.split(); joining with " " normalizes the whitespace between words
if actual_text.find(cleaned_answer_text) == -1: # if the cleaned answer cannot be found in the actual span, skip the example; otherwise keep it
logger.warning("Could not find answer: '%s' vs '%s'", actual_text, cleaned_answer_text)
continue
else:
start_position = -1
end_position = -1
orig_answer_text = ''
example = SquadExample(qas_id=qas_id,
question_text=question_text,
doc_tokens=doc_tokens,
orig_answer_text=orig_answer_text,
start_position=start_position,
end_position=end_position,
is_impossible=is_impossible)
examples.append(example)
return examples
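Here is the small demo referred to in the comments above: it runs the same character loop on a toy context and prints doc_tokens and char_to_word_offset (a standalone sketch, not part of run_squad.py):

paragraph_text = "New York City"
doc_tokens, char_to_word_offset = [], []
prev_is_whitespace = True
for c in paragraph_text:
    if c in ' \t\r\n' or ord(c) == 0x202F:   # same whitespace test as is_whitespace above
        prev_is_whitespace = True
    else:
        if prev_is_whitespace:
            doc_tokens.append(c)             # start a new word
        else:
            doc_tokens[-1] += c              # extend the current word
        prev_is_whitespace = False
    char_to_word_offset.append(len(doc_tokens) - 1)

print(doc_tokens)           # ['New', 'York', 'City']
print(char_to_word_offset)  # [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2]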
2-3. The feature structure for converted data: InputFeatures
InputFeatures holds the input features obtained by converting the raw SQuAD data into BERT model inputs.
class InputFeatures(object):
"""A single set of features of data"""
def __init__(self, unique_id, example_index, doc_span_index,
tokens, token_to_orig_map, token_is_max_context,
input_ids, input_mask, segment_ids, cls_index,
p_mask, paragraph_len, start_position=None, end_position=None,
is_impossible=None):
self.unique_id = unique_id
self.example_index = example_index
self.doc_span_index = doc_span_index
self.tokens = tokens
self.token_to_orig_map = token_to_orig_map
self.token_is_max_context = token_is_max_context
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.cls_index = cls_index
self.p_mask = p_mask
self.paragraph_len = paragraph_len
self.start_position = start_position
self.end_position = end_position
self.is_impossible = is_impossible
Here unique_id is the unique id of the feature; example_index is the index of the example, linking the feature back to its example;
doc_span_index is the index of this feature's doc span: a long passage must be cut into several chunks (doc_spans), each chunk goes into its own feature, so every feature carries a doc_span_index.
tokens is the token sequence of this feature; token_to_orig_map maps each token to its index in the original doc_tokens; token_is_max_context records, for each position, whether the current span gives that token its fullest context.
Function: _check_is_max_context(doc_spans, cur_span_index, position)
Because long passages are split into multiple doc_spans with a sliding window, a single token can appear in more than one doc_span.
E.g.
Doc: the man went to the store and bought a gallon of milk
Span A: the man went to the
Span B: to the store and bought
Span C: and bought a gallon of
The word 'bought' will get two scores, one from span B and one from span C. We only want to consider the span with 'maximum context'.
What is 'maximum context'? The span in which the word has the most surrounding content on its left and right.
How is the 'score' computed? The minimum of the left and right context lengths plus 0.01 times the span length: min(len_of_left_context, len_of_right_context) + 0.01 * doc_span.length
For the example above, the maximum context for 'bought' is span C, because in C 'bought' has
1 token of left context and 3 tokens of right context, while in B it has 4 tokens of left context but 0 tokens of right context.
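A sketch of _check_is_max_context implementing exactly this scoring rule (it mirrors the helper used by run_squad.py; doc_spans are the (start, length) namedtuples built during conversion):

def _check_is_max_context(doc_spans, cur_span_index, position):
    """Return True if the token at `position` has its maximal context in span `cur_span_index`."""
    best_score, best_span_index = None, None
    for (span_index, doc_span) in enumerate(doc_spans):
        end = doc_span.start + doc_span.length - 1
        if position < doc_span.start or position > end:
            continue  # token not inside this span
        num_left_context = position - doc_span.start
        num_right_context = end - position
        score = min(num_left_context, num_right_context) + 0.01 * doc_span.length
        if best_score is None or score > best_score:
            best_score, best_span_index = score, span_index
    return cur_span_index == best_span_index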
input_ids are the token ids converted from tokens, used as the model input; input_mask is the attention mask (masking out the padding); segment_ids distinguish the question segment from the passage segment; is_impossible marks unanswerable spans.
start_position and end_position are the answer positions within the current token sequence (unlike in the example above, these are not positions within the whole context). Note that if the answer does not lie in the current span, start_position and end_position are both 0.
2-4. Converting source data to features: convert_examples_to_features
The main thing to note is the input feature format: [CLS] question_text tokens [SEP] doc_tokens [SEP]
i.e. [CLS] question [SEP] passage chunk [SEP]
def convert_examples_to_features(examples, tokenizer, max_seq_length,
doc_stride, max_query_length, is_training,
cls_token_at_end=False,
cls_token='[CLS]', sep_token='[SEP]', pad_token=0,
sequence_a_segment_id=0, sequence_b_segment_id=1,
cls_token_segment_id=0, pad_token_segment_id=0,
mask_padding_with_zero=True):
""" Loads a data file into a list of 'InputBatch's. """
unique_id = 1000000000
# cnt_pos, cnt_neg = 0, 0
# max_N, max_M = 1024, 1024
# f = np.zero((max_N, max_M), dtype=np.float32)
features = []
for (example_index, example) in enumerate(examples):
# if example_index % 100 == 0:
# logger.info('Converting %s/%s pos %s neg %s', example_index, len(examples), cnt_pos, cnt_neg)
query_tokens = tokenizer.tokenize(example.question_text)
if len(query_tokens) > max_query_length:
query_tokens = query_tokens[0:max_query_length]
tok_to_orig_index = []
orig_to_tok_index = []
all_doc_tokens = []
for (i, token) in enumerate(example.doc_tokens): ## example.doc_tokens is the list of words of the paragraph
orig_to_tok_index.append(len(all_doc_tokens))
sub_tokens = tokenizer.tokenize(token) ## re-tokenize each paragraph word with the model-specific tokenizer
for sub_token in sub_tokens:
tok_to_orig_index.append(i) ## index of the original word that this sub_token comes from
all_doc_tokens.append(sub_token) ## the tokens after sub-word tokenization
## The block below computes the answer position within the sub-word tokenized sequence
tok_start_position = None
tok_end_position = None
if is_training and example.is_impossible:
tok_start_position = -1
tok_end_position = -1
if is_training and not example.is_impossible:
tok_start_position = orig_to_tok_index[example.start_position]
if example.end_position < len(example.doc_tokens) - 1:
tok_end_position = orig_to_tok_index[example.end_position + 1] - 1
else:
tok_end_position = len(all_doc_tokens) - 1
## refine the answer span within the tokenized document
(tok_start_position, tok_end_position) = _improve_answer_span(
all_doc_tokens, tok_start_position, tok_end_position,
tokenizer, example.orig_answer_text)
# The -3 accounts for [CLS], [SEP] and [SEP], since the sequence format is [CLS] A [SEP] B [SEP]
max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
# Documents may be longer than the maximum sequence length. To deal with this,
# we use a sliding window that takes chunks of up to the maximum length (a small demo follows after this function).
_DocSpan = collections.namedtuple( # pylint: disable=invalid-name
"DocSpan", ["start", "length"])
doc_spans = []
start_offset = 0
while start_offset < len(all_doc_tokens):
length = len(all_doc_tokens) - start_offset
if length > max_tokens_for_doc:
length = max_tokens_for_doc
doc_spans.append(_DocSpan(start=start_offset, length=length))
if start_offset + length == len(all_doc_tokens):
break
start_offset += min(length, doc_stride)
for (doc_span_index, doc_span) in enumerate(doc_spans):
tokens = []
token_to_orig_map = {}
token_is_max_context = {}
segment_ids = []
# p_mask: mask with 1 for tokens that cannot be in the answer (0 for tokens which can be in an answer)
# The original TF implementation also keeps the classification token (set to 0)
p_mask = []
# CLS token at the beginning
if not cls_token_at_end:
tokens.append(cls_token)
segment_ids.append(cls_token_segment_id)
p_mask.append(0)
cls_index = 0
# Query
for token in query_tokens:
tokens.append(token)
segment_ids.append(sequence_a_segment_id)
p_mask.append(1)
# SEP token
tokens.append(sep_token)
segment_ids.append(sequence_a_segment_id)
p_mask.append(1)
# Paragraph
for i in range(doc_span.length):
split_token_index = doc_span.start + i
token_to_orig_map[len(tokens)] = tok_to_orig_index[split_token_index]
is_max_context = _check_is_max_context(doc_spans, doc_span_index,
split_token_index)
token_is_max_context[len(tokens)] = is_max_context
tokens.append(all_doc_tokens[split_token_index])
segment_ids.append(sequence_b_segment_id)
p_mask.append(0)
paragraph_len = doc_span.length
# SEP token
tokens.append(sep_token)
segment_ids.append(sequence_b_segment_id)
p_mask.append(1)
# CLS token at the end
if cls_token_at_end:
tokens.append(cls_token)
segment_ids.append(cls_token_segment_id)
p_mask.append(0)
cls_index = len(tokens) - 1 # index of classification token
input_ids = tokenizer.convert_tokens_to_ids(tokens)
## The mask has 1 for real tokens and 0 for padding tokens.
## Only real tokens are attended to.
input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
## Zero-pad up to the sequence length (add padding)
while len(input_ids) < max_seq_length:
input_ids.append(pad_token)
input_mask.append(0 if mask_padding_with_zero else 1)
segment_ids.append(pad_token_segment_id)
p_mask.append(1)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
span_is_impossible = example.is_impossible
start_position = None
end_position = None
if is_training and not span_is_impossible:
# For training, if our document chunk doesn't contain an annotation
# we throw it out, since there is nothing to predict
doc_start = doc_span.start
doc_end = doc_span.start + doc_span.length - 1
out_of_span = False
if not (tok_start_position >= doc_start and
tok_end_position <= doc_end):
out_of_span = True
if out_of_span:
start_position = 0
end_position = 0
span_is_impossible = True
else:
doc_offset = len(query_tokens) + 2
start_position = tok_start_position - doc_start + doc_offset
end_position = tok_end_position - doc_start + doc_offset
if is_training and span_is_impossible:
start_position = cls_index
end_position = cls_index
if example_index < 20:
logger.info("*** example ***")
logger.info(" unique_id : %s" % (unique_id))
logger.info(" example_index: %s" % (example_index))
logger.info(" doc_span_index: %s" % (doc_span_index))
logger.info(" tokens: %s" % " ".join(tokens))
logger.info(" token_to_orig_map: %s" % " ".join([
"%d:%d" % (x, y) for (x, y) in token_to_orig_map.items()]))
logger.info(" token_is_max_context: %s" % " ".join([
"%d:%s" % (x, y) for (x, y) in token_is_max_context.items()
]))
logger.info(" input_ids: %s" % " ".join([str(x) for x in input_ids]))
logger.info(" input_mask: %s" % " ".join([str(x) for x in input_mask]))
logger.info(" segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
if is_training and span_is_impossible:
logger.info(" impossible example")
if is_training and not span_is_impossible:
answer_text = " ".join(tokens[start_position:(end_position + 1)])
logger.info(" start_position: %d" % (start_position))
logger.info(" end_position: %d" % (end_position))
logger.info(" answer: %s" % (answer_text))
features.append(
InputFeatures(
unique_id=unique_id,
example_index=example_index,
doc_span_index=doc_span_index,
tokens=tokens,
token_to_orig_map=token_to_orig_map,
token_is_max_context=token_is_max_context,
input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
cls_index=cls_index,
p_mask=p_mask,
paragraph_len=paragraph_len,
start_position=start_position,
end_position=end_position,
is_impossible=span_is_impossible))
unique_id += 1
return features
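As mentioned above, here is a small standalone demo of the sliding-window splitting into doc_spans. The helper name is just for this demo; the toy numbers are chosen to match the run parameters (max_seq_length 384, doc_stride 128, a 13-token question).

import collections

_DocSpan = collections.namedtuple("DocSpan", ["start", "length"])

def split_into_doc_spans(num_doc_tokens, max_tokens_for_doc, doc_stride):
    doc_spans, start_offset = [], 0
    while start_offset < num_doc_tokens:
        length = min(num_doc_tokens - start_offset, max_tokens_for_doc)
        doc_spans.append(_DocSpan(start=start_offset, length=length))
        if start_offset + length == num_doc_tokens:
            break
        start_offset += min(length, doc_stride)
    return doc_spans

# a 1000-sub-token document with max_tokens_for_doc = 384 - 13 - 3 = 368:
print(split_into_doc_spans(1000, 368, 128))
# 6 overlapping spans starting at 0, 128, 256, 384, 512 and 640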
2-5. Loading the data and features
load_and_cache_examples
## input X
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
## input Y
all_start_positions = torch.tensor([f.start_position for f in features], dtype=torch.long)
all_end_positions = torch.tensor([f.end_position for f in features], dtype=torch.long)
input_ids, input_mask and segment_ids together form the model input X, while start_positions and end_positions form the labels Y. Knowing Y means knowing the answer position; mapping it back into the passage context yields the answer text.
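For reference, a sketch of how these tensors are typically bundled for training (load_and_cache_examples in run_squad.py also stores cls_index and p_mask and caches the features to disk; this shows only the core idea):

import torch
from torch.utils.data import TensorDataset, RandomSampler, DataLoader

dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
                        all_start_positions, all_end_positions)
dataloader = DataLoader(dataset, sampler=RandomSampler(dataset), batch_size=4)

for input_ids, input_mask, segment_ids, start_positions, end_positions in dataloader:
    # input_ids / input_mask / segment_ids: (batch_size, max_seq_length)
    # start_positions / end_positions: (batch_size,)
    break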
3. Building the model and the loss
BertForQuestionAnswering adds a linear head on top of the BERT model: the last-layer hidden states (batch_size, seq_len, hidden_size) are linearly projected to logits of shape (batch_size, seq_len, 2), where num_labels is 2 and the two values correspond to the start_logits and end_logits of the answer. CrossEntropyLoss is computed separately for the start and end positions, and the model loss is the average of the start and end losses (a minimal usage sketch follows the class below).
class BertForQuestionAnswering(BertPreTrainedModel):
def __init__(self, config):
super(BertForQuestionAnswering, self).__init__(config)
self.num_labels = config.num_labels
self.bert = BertModel(config)
self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
self.init_weights()
def forward(self, input_ids, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None,
start_positions=None, end_positions=None):
outputs = self.bert(input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask)
sequence_output = outputs[0]
logits = self.qa_outputs(sequence_output)
start_logits, end_logits = logits.split(1, dim=-1)
start_logits = start_logits.squeeze(-1)
end_logits = end_logits.squeeze(-1)
outputs = (start_logits, end_logits,) + outputs[2:]
if start_positions is not None and end_positions is not None:
# If we are on multi-GPU, split add a dimension
if len(start_positions.size()) > 1:
start_positions = start_positions.squeeze(-1)
if len(end_positions.size()) > 1:
end_positions = end_positions.squeeze(-1)
# sometimes the start/end positions are outside our model inputs, we ignore these terms
ignored_index = start_logits.size(1)
start_positions.clamp_(0, ignored_index)
end_positions.clamp_(0, ignored_index)
loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
start_loss = loss_fct(start_logits, start_positions)
end_loss = loss_fct(end_logits, end_positions)
total_loss = (start_loss + end_loss) / 2
outputs = (total_loss,) + outputs
return outputs # (loss), start_logits, end_logits, (hidden_states), (attentions)
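A minimal inference sketch using this head (it assumes the pytorch_transformers package; note that loading bert-base-uncased initializes the qa_outputs layer randomly, so without fine-tuning the predicted span is meaningless):

import torch
from pytorch_transformers import BertTokenizer, BertForQuestionAnswering

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
model.eval()

question = "Who wrote Hamlet?"
context = "Hamlet is a tragedy written by William Shakespeare."
tokens = ['[CLS]'] + tokenizer.tokenize(question) + ['[SEP]'] \
         + tokenizer.tokenize(context) + ['[SEP]']
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
first_sep = tokens.index('[SEP]') + 1
# segment 0 for [CLS] + question + [SEP], segment 1 for context + [SEP]
token_type_ids = torch.tensor([[0] * first_sep + [1] * (len(tokens) - first_sep)])

with torch.no_grad():
    start_logits, end_logits = model(input_ids, token_type_ids=token_type_ids)
start = start_logits.argmax(dim=-1).item()
end = end_logits.argmax(dim=-1).item()
print(' '.join(tokens[start:end + 1]))  # the predicted answer span (only meaningful after fine-tuning)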
4. Train and evaluation
In the train function, the main things to pay attention to are the optimizer and the schedule:
The optimizer here is AdamW, often credited with very fast convergence when training neural networks. It corrects how weight decay is applied in Adam and thereby addresses concerns about Adam's convergence guarantees. (It was later argued that the issue was largely under-tuned hyperparameters: well-tuned Adam can also perform well, and Adam + L2 regularization is usable too, but it generally does not match AdamW's decoupled weight decay.) Reference: the post "AdamW优化算法+超级收敛".
#Prepare optimizer and schedule (linear warmup and decay)
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
{'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
{'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
Weight decay: preventing overfitting
(There are many ways to avoid overfitting: early stopping, data augmentation, regularization including L1 and L2 (L2 regularization is also called weight decay), and dropout.)
The purpose of L2 regularization is to decay the weights to smaller values, which reduces overfitting to some extent; that is why weight decay is also called L2 regularization. L2 regularization simply adds a regularization term to the cost function:
Here C0 is the original cost function, and the added term is the L2 regularization term: the sum of the squares of all weights w, divided by the training-set size n. λ is the regularization coefficient, which balances the regularization term against C0. The extra factor 1/2 is purely for convenience: differentiating the squared term produces a 2 that cancels it. λ is the weight-decay coefficient. Why does this decay the weights? Differentiating the cost function (see the formulas below) shows that the L2 term has no effect on the update of b, but it does affect the update of w.
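Written out, the regularized cost, its gradients, and the resulting gradient-descent update described above are:

C = C_0 + \frac{\lambda}{2n}\sum_w w^2

\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}w, \qquad \frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b}

w \;\rightarrow\; w - \eta\frac{\partial C_0}{\partial w} - \frac{\eta\lambda}{n}w \;=\; \Big(1 - \frac{\eta\lambda}{n}\Big)w - \eta\frac{\partial C_0}{\partial w}, \qquad b \;\rightarrow\; b - \eta\frac{\partial C_0}{\partial b}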
Without L2 regularization, the coefficient of w in the update rule is 1; with it, the coefficient becomes 1 − ηλ/n. Since η, λ and n are all positive, 1 − ηλ/n is less than 1, so the effect is to shrink w; this is where the name weight decay comes from. Of course, because of the gradient term, w may still end up larger or smaller after an update. Note also that for mini-batch stochastic gradient descent the update formulas for w and b differ slightly from the ones above: the gradient term becomes the sum of the gradients over the mini-batch, multiplied by η and divided by m, where m is the number of samples in a mini-batch.
Weight decay (L2 regularization) helps prevent overfitting. The L2 term pushes w towards smaller values, and this prevents overfitting for two reasons. (1) Model complexity: smaller weights correspond, in a sense, to a lower-complexity network that fits the data better (Occam's razor); in practice, L2 regularization usually outperforms no regularization. (2) A mathematical view: when a model overfits, the coefficients of the fitted function tend to be very large, because the function has to accommodate every data point and therefore oscillates strongly; in some small intervals the function changes sharply, so its derivatives are very large in absolute value, and since the inputs can be large or small, only large coefficients can produce such derivatives. Regularization constrains the norm of the parameters so they cannot grow too large, which reduces overfitting to some degree.
Learning rate decay
Learning rate decay is a way to balance the trade-off between a large learning rate (fast progress but unstable training) and a small one (stable but slow). The basic idea: gradually decrease the learning rate as training proceeds.
There are two basic implementations: (1) step-wise decay, e.g. halve the learning rate every 5 epochs; (2) exponential decay, e.g. the learning rate decays automatically as the number of iterations grows, such as multiplying it by 0.9998 every 5 epochs. The formula is: decayed_learning_rate = learning_rate * decay_rate^(global_step / decay_steps)
Here decayed_learning_rate is the learning rate actually used at each optimization step, learning_rate is the preset initial learning rate, decay_rate is the decay coefficient, and decay_steps controls how fast the decay happens.
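A one-line sketch of this exponential decay (the names follow the formula above; note that run_squad.py itself uses WarmupLinearSchedule, i.e. linear warmup followed by linear decay, not exponential decay):

def exponential_decay(learning_rate, decay_rate, global_step, decay_steps):
    # decayed_learning_rate = learning_rate * decay_rate^(global_step / decay_steps)
    return learning_rate * decay_rate ** (global_step / decay_steps)

print(exponential_decay(3e-5, 0.96, global_step=1000, decay_steps=500))  # = 3e-5 * 0.96**2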
def train(args, train_dataset, model, tokenizer):
""" Train the model """
if args.local_rank in [-1, 0]:
tb_writer = SummaryWriter()
args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset) ## shuffle the training data
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
if args.max_steps > 0:
t_total = args.max_steps
args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
else:
t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
{'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, ## apply weight decay (L2) to every parameter except bias and LayerNorm.weight
{'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
if args.fp16:
try:
from apex import amp
except ImportError:
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
# multi-gpu training (should be after apex fp16 initialization)
if args.n_gpu > 1:
model = torch.nn.DataParallel(model)
# Distributed training (should be after apex fp16 initialization)
if args.local_rank != -1:
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
output_device=args.local_rank,
find_unused_parameters=True)
# Train!
logger.info("***** Running training *****")
logger.info(" Num examples = %d", len(train_dataset))
logger.info(" Num Epochs = %d", args.num_train_epochs)
logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d",
args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1))
logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
logger.info(" Total optimization steps = %d", t_total)
global_step = 0
tr_loss, logging_loss = 0.0, 0.0
model.zero_grad()
train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
set_seed(args) # Added here for reproductibility (even between python 2 and 3)
for _ in train_iterator:
epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
for step, batch in enumerate(epoch_iterator):
model.train()
batch = tuple(t.to(args.device) for t in batch)
inputs = {'input_ids': batch[0],
'attention_mask': batch[1],
'token_type_ids': None if args.model_type == 'xlm' else batch[2],
'start_positions': batch[3],
'end_positions': batch[4]}
if args.model_type in ['xlnet', 'xlm']:
inputs.update({'cls_index': batch[5],
'p_mask': batch[6]})
outputs = model(**inputs)
loss = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc)
if args.n_gpu > 1:
loss = loss.mean() # mean() to average on multi-gpu parallel (not distributed) training
if args.gradient_accumulation_steps > 1:
loss = loss / args.gradient_accumulation_steps
if args.fp16:
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
else:
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
tr_loss += loss.item()
if (step + 1) % args.gradient_accumulation_steps == 0:
optimizer.step()
scheduler.step() # Update learning rate schedule
model.zero_grad()
global_step += 1
if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
# Log metrics
if args.local_rank == -1 and args.evaluate_during_training: # Only evaluate when single GPU otherwise metrics may not average well
results = evaluate(args, model, tokenizer)
for key, value in results.items():
tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step)
tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step)
logging_loss = tr_loss
if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
# Save model checkpoint
output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
if not os.path.exists(output_dir):
os.makedirs(output_dir)
model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
torch.save(args, os.path.join(output_dir, 'training_args.bin'))
logger.info("Saving model checkpoint to %s", output_dir)
if args.max_steps > 0 and global_step > args.max_steps:
epoch_iterator.close()
break
if args.max_steps > 0 and global_step > args.max_steps:
train_iterator.close()
break
if args.local_rank in [-1, 0]:
tb_writer.close()
return global_step, tr_loss / global_step
Reference posts:
权重衰减(weight decay)与学习率衰减(learning rate decay)
神经网络学习率(learning rate)的衰减
正则化方法:L1和L2 regularization、数据集扩增、dropout