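In this article, we show how to build several common applications with large language models (LLMs) from Hugging Face: summarization, sentiment analysis, translation, zero-shot classification, and few-shot learning. We explore existing open-source and proprietary models and show how to apply them directly to a range of use cases. Along the way, we introduce basic prompt engineering and show how to configure LLM pipelines with the Hugging Face APIs.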
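Learning objectives:
- Build common applications with a variety of existing models.
- Understand basic prompt engineering.
- Understand the search and sampling methods used in LLM inference.
- Get familiar with the main Hugging Face abstractions: datasets, pipelines, tokenizers, and models.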
Environment setup
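The original does not list the packages this section needs. As an assumption inferred from the calls used later in the article (`load_dataset`, the pipeline tool, and pandas), the examples likely need `datasets`, `transformers`, a backend such as `torch`, and `pandas`. A minimal sketch that reports which of these can be imported in the current environment:

```python
import importlib.util

# Assumed package list, inferred from the calls used later in this article;
# the original text does not specify it.
REQUIRED_PACKAGES = ["datasets", "transformers", "torch", "pandas"]


def check_packages(names):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, available in check_packages(REQUIRED_PACKAGES).items():
    status = "ok" if available else f"missing (try: pip install {name})"
    print(f"{name}: {status}")
```

Any package reported as missing can usually be installed with `pip install`, although the exact versions the article was written against are not known.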
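Common LLM applications
This section gives you a feel for several common LLM applications and shows how easy it is to get started with LLMs.
As you go through the examples, note the datasets, models, APIs, and options used. These simple examples can serve as starting points for building your own applications.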
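Summarization
Summarization comes in two forms:
- Extractive: selects representative excerpts from the text to serve as the summary.
- Abstractive: forms the summary by generating new text.
In this article, we use an abstractive summarization model.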
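To make the distinction concrete, here is a toy extractive summarizer in plain Python. It is only an illustrative sketch, not the method used in this article: the function name `extractive_summary` and the word-frequency scoring are assumptions chosen for simplicity. The key property is that the result is copied verbatim from the input, whereas an abstractive model generates new wording.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Toy extractive summarizer: return the highest-scoring sentences verbatim.

    Each sentence is scored by the summed document-wide frequency of its words.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_counts = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        return sum(word_counts[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    # Pick the top sentences, then emit them in their original order.
    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return " ".join(s for s in sentences if s in top)


text = (
    "Heavy rain flooded the town. "
    "The rain closed roads and the rain damaged homes in the town. "
    "Officials met on Monday."
)
print(extractive_summary(text))
```

Because the score is a plain sum, this sketch favors long sentences; practical extractive methods normalize for length and use stronger notions of importance.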
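Background reading: The Hugging Face summarization task page lists the model architectures that support summarization. The summarization chapter provides a detailed walkthrough.
This section uses the following:
- Dataset: the xsum dataset, which provides a collection of BBC news articles with corresponding summaries.
- Model: the t5-small model, which has 60 million parameters (242MB for PyTorch). T5 is an encoder-decoder model created by Google that supports multiple tasks, including summarization, translation, question answering, and text classification. For more details, see Google's blog post, the code on GitHub, or the research paper.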
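The parameter count and the file size quoted above are related by simple arithmetic. The sketch below assumes the weights are stored as 32-bit floats (4 bytes each), which the article does not state; `model_size_mb` is a hypothetical helper written for this illustration.

```python
def model_size_mb(num_parameters, bytes_per_parameter=4):
    """Approximate weight size in megabytes (1 MB = 10**6 bytes)."""
    return num_parameters * bytes_per_parameter / 1e6


# 60 million parameters stored as float32 (assumed precision).
print(model_size_mb(60_000_000))  # 240.0
```

Under that assumption, the rounded figure of 60 million parameters gives about 240 MB, close to the 242MB quoted; the small gap is consistent with the exact parameter count being slightly above 60 million.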
Loading the dataset
```python
from datasets import load_dataset

xsum_dataset = load_dataset(
    "xsum", version="1.2.0", cache_dir="/root/home/LLMs/week1"
)  # Note: We specify cache_dir to use predownloaded data.
xsum_dataset  # The printed representation of this object shows the `num_rows` of each dataset split.

# Output:
# DatasetDict({
#     train: Dataset({
#         features: ['document', 'summary', 'id'],
#         num_rows: 204045
#     })
#     validation: Dataset({
#         features: ['document', 'summary', 'id'],
#         num_rows: 11332
#     })
#     test: Dataset({
#         features: ['document', 'summary', 'id'],
#         num_rows: 11334
#     })
# })
```
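The dataset provides three columns:
- document: the text of the BBC article.
- summary: a "ground-truth" summary. Note that ground-truth summaries are subjective and may differ from the summary you would write. This is a good example of how many LLM applications have no obviously "correct" answer.
- id: a unique identifier for the article.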
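```python
# Take the first 10 training examples as a small working sample.
xsum_sample = xsum_dataset["train"].select(range(10))
display(xsum_sample.to_pandas())  # `display` is provided by notebook environments
```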
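Next, we use the Hugging Face pipeline tool to load a pretrained model. The constructor of an LLM (large language model) pipeline takes the following parameters:
- task: The first argument specifies the main task. See the Hugging Face task documentation for more information.
- model: The name of a pretrained model to load from the Hugging Face Hub.
- min_length, max_length: The minimum and maximum length, in tokens, of the generated summary.
- truncation: Some input articles may be too long for the LLM to process. Most LLMs have a fixed limit on the length of the input sequence. Setting this option tells the pipeline to truncate the input when needed.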
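The calls that follow use a `summarizer` object whose construction is not shown in the text. A minimal sketch of how it could be built from the parameters listed above is given here. The model name `t5-small` comes from the article, but the `min_length` and `max_length` values are assumptions, and `SUMMARIZER_CONFIG` and `build_summarizer` are hypothetical names introduced for this illustration.

```python
# Hypothetical settings: the article names the parameters but not the values used.
SUMMARIZER_CONFIG = {
    "task": "summarization",
    "model": "t5-small",
    "min_length": 20,    # assumed lower bound on summary length, in tokens
    "max_length": 40,    # assumed upper bound on summary length, in tokens
    "truncation": True,  # cut over-long inputs down to the model's input limit
}


def build_summarizer(config=SUMMARIZER_CONFIG):
    """Create a Hugging Face summarization pipeline from the given settings."""
    # Imported lazily so the settings can be inspected without transformers installed.
    from transformers import pipeline

    return pipeline(**config)


# Usage (downloads the model on first run):
# summarizer = build_summarizer()
# summarizer("Some long article text ...")
```

Keeping the settings in a dictionary makes it easy to try other length bounds without touching the construction code. A generated summary can end mid-sentence when it reaches `max_length`.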
```python
import pandas as pd

# `summarizer` is the summarization pipeline configured with the parameters described above.

# Apply to 1 article
summarizer(xsum_sample["document"][0])

# Apply to a batch of articles
results = summarizer(xsum_sample["document"])

# Display the generated summary side-by-side with the reference summary and original document.
# We use Pandas to join the inputs and outputs together in a nice format.
display(
    pd.DataFrame.from_dict(results)
    .rename({"summary_text": "generated_summary"}, axis=1)
    .join(pd.DataFrame.from_dict(xsum_sample))[
        ["generated_summary", "summary", "document"]
    ]
)
```
Output of `results`:

```
[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . a flood alert remains in place across the'},
 {'summary_text': 'a fire alarm went off at the Holiday Inn in Hope Street on Saturday . guests were asked to leave the hotel . the two buses were parked side-by-side in'},
 {'summary_text': 'Sebastian Vettel will start third ahead of team-mate Kimi Raikkonen . stewards only handed Hamilton a reprimand after governing body said "n'},
 {'summary_text': 'the 67-year-old is accused of committing the offences between March 1972 and October 1989 . he denies all the charges, including two counts of indecency'},
 {'summary_text': 'a man receiving psychiatric treatment at the clinic threatened to shoot himself and others . the incident comes amid tension in Istanbul following several attacks on the reina nightclub .'},
 {'summary_text': 'Gregor Townsend gave a debut to powerhouse wing Taqele Naiyaravoro . the dragons gave first starts of the season to wing a'},
 {'summary_text': 'Veronica Vanessa Chango-Alverez, 31, was killed and another man injured in the crash . police want to trace Nathan Davis, 27, who has links to the Audi .'},
 {'summary_text': 'the 25-year-old was hit by a motorbike during the Gent-Wevelgem race . he was riding for the Wanty-Gobert team and was taken'},
 {'summary_text': 'gundogan will not be fit for the start of the premier league season at Brighton on 12 august . the 26-year-old says his recovery time is now being measured in "week'},
 {'summary_text': 'the crash happened about 07:20 GMT at the junction of the A127 and Progress Road in leigh-on-Sea, Essex . the man, aged in his 20s'}]
```