["Transformers Quick Start" Study Notes 1] What happens behind a pipeline call?

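Behind the scenes, a pipeline performs three steps:
Preprocessing: convert the raw text into an input format the model accepts;
Pass the processed inputs to the model;
Postprocessing: convert the model's output into a format that is easy for humans to read.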

Preprocessing

The preprocessing step converts text into numbers the model can understand. It is carried out by the tokenizer that belongs to each model.

Use the AutoTokenizer class and its from_pretrained() method:

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
# Call the tokenizer directly to tokenize the whole batch
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

Output:

{
    'input_ids': tensor([
        [  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172, 2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]
    ]), 
    'attention_mask': tensor([
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    ])
}

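The output contains input_ids and attention_mask. input_ids is the list of numeric IDs that the tokens map to after tokenization, and attention_mask marks which tokens are padding (here 1 means a real token from the original text and 0 means a padding token).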
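To make the role of padding=True more concrete, here is a minimal pure-Python sketch of how a batch could be padded to its longest sequence and how the matching attention mask could be built. The helper pad_batch and the constant PAD_ID are made up for this note; they are not part of the Transformers API, and the real tokenizer does considerably more (truncation, tensor conversion, and so on).

```python
# Toy illustration only: pad_batch and PAD_ID are hypothetical names,
# not functions or constants from the transformers library.
PAD_ID = 0  # matches the zeros that fill the shorter sequence above


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the tokenizer output above, without the padding.
batch = pad_batch([
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
     17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
])
print(batch["attention_mask"][1])
```

The second sentence has 8 tokens and the first has 16, so the second row gets 8 padding positions, which is the pattern seen in the real output.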

Passing the preprocessed inputs to the model

Use the AutoModel class and its from_pretrained() method.

Download and load the distilbert-base checkpoint:

from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

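The Transformers library wraps many different architectures. Common ones include:

Model (returns the hidden states)
ForCausalLM (for causal language modeling)
ForMaskedLM (for masked language modeling)
ForMultipleChoice (for multiple-choice tasks)
ForQuestionAnswering (for question answering tasks)
ForSequenceClassification (for text classification tasks)
ForTokenClassification (for token classification tasks such as NER)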

For example, a sentiment analysis task clearly needs a model with a text classification head. So in practice we do not use the AutoModel class here, but AutoModelForSequenceClassification:

from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)

Output:

torch.Size([2, 2])

As you can see, for each sample in the batch the model outputs a two-dimensional vector (one dimension per label, positive or negative).
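To see why the shape is [2, 2], it can help to picture a classification head as a layer that reduces each sentence's hidden states to one score per label. The following is only a toy sketch under that assumption: the numbers, the single linear layer, and the choice of the first token as the sentence summary are all made up for illustration, and the actual head in the library is more elaborate.

```python
# Toy sketch: turn per-token hidden states into one logit vector per sample.
# All numbers and names below are hypothetical, not taken from DistilBERT.

def linear(vector, weights, bias):
    """Compute weights @ vector + bias using plain Python lists."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


# Pretend hidden states: batch of 2 samples, 3 tokens each, hidden size 4.
hidden_states = [
    [[0.1, 0.2, 0.3, 0.4], [0.0, 0.1, 0.0, 0.1], [0.5, 0.5, 0.5, 0.5]],
    [[0.4, 0.3, 0.2, 0.1], [0.1, 0.0, 0.1, 0.0], [0.2, 0.2, 0.2, 0.2]],
]

# Hypothetical head parameters: 2 labels x hidden size 4.
weights = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
bias = [0.0, 0.0]

# Use the first token's hidden state as a summary of each sentence.
toy_logits = [linear(sample[0], weights, bias) for sample in hidden_states]

# Prints the batch size and the number of labels.
print(len(toy_logits), len(toy_logits[0]))
```

Whatever the sequence length, the result has one row per sample and one column per label, which is the same layout as outputs.logits.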

Self reminder: the tokenizer produces the model's input, and model(**inputs) produces the output.

Postprocessing the model output

The model outputs raw numbers that are hard for a human to interpret, so they need further processing.
For the code above, running

print(outputs.logits)

prints

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)

Note that these values are logits, not probabilities. To get probabilities, apply the softmax function, for example:

import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

Output:

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)

The model's predictions are now probabilities, which are easy to interpret.
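As a sanity check, the same probabilities can be reproduced without PyTorch. This is a minimal sketch of softmax written by hand with the standard library; the function name softmax and the variable names are my own, and the logits are copied from the printed tensor above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the printed outputs.logits above.
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
probs = [softmax(row) for row in logits]

for row in probs:
    print([round(p, 4) for p in row])
```

Each row sums to 1, and the values agree with the tensor printed by torch.nn.functional.softmax up to rounding.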
To get the matching labels, read the id2label attribute from the model's config:

print(model.config.id2label)

{0: 'NEGATIVE', 1: 'POSITIVE'}

So we arrive at the final predictions:

First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005
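The last step, combining the probabilities with id2label, can be sketched in a few lines of plain Python. The helper names describe and top_label are hypothetical, and the probabilities and label mapping are hard-coded from the outputs above rather than read from a live model.

```python
# Values copied from the printed outputs above; in real code they would
# come from predictions.tolist() and model.config.id2label.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probabilities = [
    [4.0195e-02, 9.5980e-01],
    [9.9946e-01, 5.4418e-04],
]


def describe(probs, id2label):
    """Format every label with its probability, rounded to 4 decimals."""
    return ", ".join(f"{id2label[i]}: {p:.4f}" for i, p in enumerate(probs))


def top_label(probs, id2label):
    """Return the label with the highest probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return id2label[best]


for probs in probabilities:
    print(describe(probs, id2label), "->", top_label(probs, id2label))
```

The first sentence comes out as POSITIVE and the second as NEGATIVE, which is what a pipeline call would report as the top label.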