Why are general-use language models so important?
In May 2020, a team of researchers at OpenAI released a landmark AI model called GPT-3. GPT-3 is a language model trained on 570 gigabytes of textual data; as of August 2020, it is the most massive publicly released language model in terms of both training data and generative capabilities. Users have reported that the text generated by GPT-3 is often indistinguishable from text a human would write, and that the model can also play the role of a search engine, writer, or programmer, depending on what the user prompts it to do.
In the near future, it is possible that models similar to GPT-3 in both size and capabilities will be made available for public use. Their development and availability raise the question: how will language models impact our everyday lives? To explore this subject, I believe it is crucial to analyze the potential everyday uses of such AI-based language models and to pinpoint the measures necessary to prevent misuse of these self-learning algorithms.
What are the applications of such general-use language models?
Although uses of these models are seemingly limitless, I think their applications can be grouped into four categories:
1. Search Engine
Because the training data for general-use language models comes from large datasets of varied information, these models learn to weigh the importance of general topics and their specific details. For instance, if asked "What is the tallest mountain on Earth?", a language model might have no difficulty giving the correct answer, "Mt. Everest": it is well trained to analyze sentence structure and determine the key topics of interest, and it has also absorbed such facts through its training. This makes these models potentially applicable for everyday consumer use.
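To make the idea concrete, here is a minimal sketch of how a search-style front end might phrase a user's question as a few-shot prompt before handing it to a language model. Everything here is a hypothetical illustration: the example pairs are made up, and the actual model call is omitted.

```python
# Hypothetical few-shot examples that prime the model to answer factual
# questions in a short "Q:/A:" format.
FEW_SHOT_EXAMPLES = [
    ("What is the largest ocean on Earth?", "The Pacific Ocean."),
    ("What is the capital of France?", "Paris."),
]

def build_qa_prompt(question):
    """Assemble a few-shot Q&A prompt ending where the model should answer."""
    lines = []
    for q, a in FEW_SHOT_EXAMPLES:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {question}")
    lines.append("A:")  # the model's completion would follow this marker
    return "\n".join(lines)

prompt = build_qa_prompt("What is the tallest mountain on Earth?")
print(prompt)
```

The prompt itself does the heavy lifting: the examples establish the question-answering pattern, and the trailing "A:" invites the model to complete it with the answer.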
2. Text Generation
Since general-use language models are trained on large datasets of varied knowledge, they have no difficulty generating coherent sentences on a specific subtopic. This could allow a wide array of consumers, from fiction writers to journalists, from lawyers to educators, from researchers to managers, to effortlessly produce cohesive and valuable text solely by asking the model to write based on a relatively short description. For instance, an insurance agent could ask such an algorithm to write lengthy terms of agreement for car insurance, adapting the wording to car-accident statistics in New York City for a client living in that region, and then write another document adapted to San Francisco for a client living in that Californian city.
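The insurance example above can be sketched as a small prompt-templating step: the tool fills in region-specific details before asking the model to draft the document. The "accident pattern" data and the prompt wording below are invented for illustration; a real system would pass the resulting prompt to an actual language model.

```python
# Hypothetical region-specific facts a real tool might pull from a database.
REGION_FACTS = {
    "New York City": "dense urban traffic and frequent low-speed collisions",
    "San Francisco": "steep hills and fog that reduces visibility",
}

def build_policy_prompt(city):
    """Build a region-adapted drafting prompt for a language model."""
    facts = REGION_FACTS[city]
    return (
        f"Write the terms of agreement for a car-insurance policy for a "
        f"client living in {city}, adapting the wording to local accident "
        f"patterns such as {facts}."
    )

print(build_policy_prompt("New York City"))
print(build_policy_prompt("San Francisco"))
```

The same template produces two differently targeted documents with no extra writing effort, which is exactly the appeal described above.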
3. Personalized Conversational Tools
The large training datasets for these models can also yield insight on how to respond in conversation with an end user. For instance, if an end user liked to talk in a style similar to the characters of Mark Twain, the language model could learn to converse in such a manner and carry on a discussion that seems meaningful and personalized to the end user. The model could comfortably analyze the user's data and well-being, respond with the information that best suits the user's personal demands, or connect the user with external services fine-tuned to their needs.
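One plausible way to implement such personalization is to prepend a persona description and the running conversation history to each prompt. The sketch below is an assumption about how a conversational wrapper might work, not a description of any real product; the persona line and history format are illustrative.

```python
def build_chat_prompt(persona, history, user_message):
    """Assemble a persona-conditioned conversational prompt.

    `history` is a list of (speaker, text) pairs from earlier turns.
    """
    lines = [f"The assistant converses in the style of {persona}."]
    for speaker, text in history:
        lines.append(f"{speaker}: {text}")
    lines.append(f"User: {user_message}")
    lines.append("Assistant:")  # the model completes the next turn here
    return "\n".join(lines)

history = [
    ("User", "Good evening!"),
    ("Assistant", "Well now, a fine evening it is!"),
]
print(build_chat_prompt("a Mark Twain character", history, "Tell me about the river."))
```

Because the history grows with each exchange, the model's replies can stay consistent with both the chosen persona and what the user has already said.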
4. Software Generation
Why stop at writing text? GPT-3 supposedly has the capacity to write code in a few programming languages and frameworks, including Python and React. Language models could therefore be used as no-code tools that effortlessly generate computer programs from plain descriptions, saving valuable time and resources in producing efficient, streamlined software.
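A no-code tool of this kind would likely just wrap the user's plain-English description into a code-generation prompt. The sketch below shows that wrapping step only; the prompt format is an assumption for illustration, and the model call that would complete it is omitted.

```python
def build_codegen_prompt(description, language="Python"):
    """Wrap a plain-English task description into a code-generation prompt."""
    return (
        f"# Task: write a {language} program that does the following:\n"
        f"# {description}\n"
        f"# Code:\n"
    )

prompt = build_codegen_prompt(
    "Read a CSV file and print the sum of its 'price' column."
)
print(prompt)
```

The comment-style framing nudges the model to continue with actual source code after the "# Code:" marker, which is the behavior the early GPT-3 code-writing demos relied on.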
What are the potential pitfalls?
Following a critical analysis of the applications of general-use language models, multiple disadvantages or dangers can be identified.
1. Data Bias
Any bias in the datasets, including discriminatory bias based on race, gender, nationality or beliefs, can influence the end results, and unintentionally lead to discrimination or bias against targeted groups.
2. Fake Replicates
General-use language models can mimic authentic texts or reports and generate highly similar ones. As a result, users could manipulate this capacity to generate fake texts and share them online in an attempt to spread misinformation.
3. Miscellaneous Misuse
Users can apply language models in many ways to cause harm to others. It’s not necessary to speculate on how these models could be used to attain such a result: the takeaway is that such a possibility is a legitimate concern.
4. Data Privacy
These models can save user responses and store personal data, much as tech corporations already track data to streamline product usage.
How Can these Models Be Used?
One interesting approach is OpenAI's own with GPT-3. To prevent misuse of the language model, OpenAI did not provide direct open-source access to GPT-3; instead, it is setting up an API endpoint that is accessible after registration and monitored for any potential misuse.
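The gated-access idea can be sketched in a few lines: instead of releasing the model weights, the provider exposes a wrapper that checks registration and logs every request so misuse can be audited later. Every name below is a hypothetical placeholder, including the stand-in model; this is not OpenAI's actual API.

```python
# Hypothetical registry of approved users and an audit trail of requests.
REGISTERED_KEYS = {"key-123"}
AUDIT_LOG = []

def fake_model(prompt):
    """Stand-in for the real, privately hosted language model."""
    return f"[model output for: {prompt}]"

def guarded_query(api_key, prompt):
    """Serve a model query only to registered users, logging it for review."""
    if api_key not in REGISTERED_KEYS:
        raise PermissionError("unregistered API key")
    AUDIT_LOG.append((api_key, prompt))  # retained for misuse auditing
    return fake_model(prompt)

print(guarded_query("key-123", "Summarize today's news."))
```

Keeping the model behind such a wrapper is what lets the provider revoke access or inspect usage patterns, which open-sourcing the weights would make impossible.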
Another approach to countering the pitfalls of general-use language models is to create a complementary set of models that scan textual content, judge whether it was generated by a machine, and flag the potential biases it might contain.
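Real detector models are trained classifiers, but a toy heuristic conveys the flavor of the idea: score a text by its lexical variety (the ratio of distinct words to total words) and flag suspiciously repetitive text. The heuristic and the threshold below are illustrative assumptions only, and would be far too crude for production use.

```python
def lexical_variety(text):
    """Ratio of distinct words to total words; lower means more repetitive."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def looks_machine_generated(text, threshold=0.5):
    """Crude stand-in for a trained detector: flag low-variety text."""
    return lexical_variety(text) < threshold

repetitive = "the cat sat the cat sat the cat sat the cat sat"
varied = "economic indicators improved sharply after unexpected policy changes"
print(looks_machine_generated(repetitive))  # True
print(looks_machine_generated(varied))      # False
```

A production detector would instead be trained on labeled human-written and machine-generated corpora, but the interface, text in, judgment out, would look much the same.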
What's the main takeaway?
With the publication of OpenAI's GPT-3, it is likely that a plethora of similar language models will become public in the years to come. The arrival of these language models will allow users to access information more efficiently, complete tasks in ways that save time and energy, and better analyze everyday data to improve the well-being of the everyday user. However, there are potential downsides such as data bias and model misuse. As a result, I believe that large-scale general-use language models should be released on a cautious, gradual basis and regularly monitored for any misuse.
In the end, it is worth stating that these models promise an exciting near future, one in which a massive quantity of beneficial everyday applications will likely become commonplace.
Thank you for reading, I hope you enjoyed this article!