Bias in A.I. Data and Algorithms
Night after night, Fien de Meulder sat in front of her Linux computer flagging names of people, places, and organizations in sentences pulled from Reuters newswire articles. De Meulder and her colleague, Erik Tjong Kim Sang, worked in language technology at the University of Antwerp. It was 2003, and a 60-hour workweek was typical in academic circles. She chugged Coke to stay awake.
The goal: develop an open source dataset to help machine learning (ML) models learn to identify and categorize entities in text. At the time, the field of named-entity recognition (NER), a subset of natural language processing, was beginning to gain momentum. It hinged on the idea that training A.I. to identify people, places, and organizations would be a key to A.I. being able to glean the meaning of text. So, for instance, a system trained on these types of datasets that is analyzing a piece of text including the names “Mary Barra,” “General Motors,” and “Detroit” may be able to infer that the person (Mary Barra) is associated with the company (General Motors) and either lives or works in the named place (Detroit).
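To see what that inference looks like in practice, here is a minimal sketch using the open source spaCy library. spaCy is just a convenient modern NER toolkit chosen for illustration, not the system the Antwerp researchers built, and the sentence is invented:

```python
# Illustrative only: spaCy stands in here for a generic NER system; it is not
# the software described in this story. Requires the small English model
# (`python -m spacy download en_core_web_sm`).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mary Barra said General Motors will keep its headquarters in Detroit.")

for ent in doc.ents:
    # Expected output along the lines of:
    #   Mary Barra      PERSON
    #   General Motors  ORG
    #   Detroit         GPE   (spaCy's label for a geopolitical place)
    print(ent.text, ent.label_)
```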
In 2003, the entire process centered on supervised machine learning, or ML models trained on data that previously had been annotated by hand. To “learn” how to make these classifications, the A.I. had to be “shown” examples categorized by humans, and categorizing those examples involved a lot of grunt work.
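That grunt work produced files of hand-tagged tokens. The snippet below is a rough illustration in the style of the CoNLL-2003 column format, where each word is followed by its part-of-speech tag, syntactic chunk tag, and named-entity tag; the tiny loop is only a sketch of how training code reads those labels:

```python
# A sketch in the style of the CoNLL-2003 columns: word, POS tag, chunk tag,
# named-entity tag. The sentence mirrors the example used in the shared task
# description; the parsing loop is illustrative, not the original tooling.
SAMPLE = """\
U.N.      NNP  I-NP  I-ORG
official  NN   I-NP  O
Ekeus     NNP  I-NP  I-PER
heads     VBZ  I-VP  O
for       IN   I-PP  O
Baghdad   NNP  I-NP  I-LOC
.         .    O     O"""

# Training code typically keeps only the (word, entity-tag) pairs.
for line in SAMPLE.splitlines():
    word, pos, chunk, ner = line.split()
    print(f"{word:10s}{ner}")
```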
Tjong Kim Sang and de Meulder didn’t think much about bias as they worked — at the time, few research teams were thinking about representation in datasets. But the dataset they were creating — known as CoNLL-2003 — was biased in an important way: The roughly 20,000 news wire sentences they annotated contained many more men’s names than women’s names, according to a recent experiment by data annotation firm Scale AI shared exclusively with OneZero.
CoNLL-2003 would soon become one of the most widely used open source datasets for building NLP systems. Over the past 17 years, it’s been cited more than 2,500 times in research literature. It’s difficult to pin down the specific commercial algorithms, platforms, and tools CoNLL-2003 has been used in — “Companies tend to be tight-lipped about what training data specifically they’re using to build their models,” says Jacob Andreas, PhD, an assistant professor at the Massachusetts Institute of Technology and part of MIT’s Language and Intelligence Group — but the dataset is widely considered to be one of the most popular of its kind. It has often been used to build general-purpose systems in industries like financial services and law.
Only this past February did someone bother to quantify its bias.
Using its own labeling pipeline — the process and tech used to teach humans to classify data that’ll then be used to train an algorithm — Scale AI found that, by the company’s own categorization, “male” names were mentioned almost five times more than “female” names in CoNLL-2003. Less than 2% of names were considered “gender-neutral.”
When Scale AI tested a model trained using CoNLL-2003 on a separate set of names, it was 5% more likely to miss a new woman’s name than a new man’s name (a notable discrepancy). When the company tested the algorithm on U.S. Census data — the 100 most popular men’s and women’s names for each year — it performed “significantly worse” on women’s names “for all years of the census,” according to the report.
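Scale AI has not published its exact evaluation code, but the gap it describes can be pictured as a simple per-group recall check. The sketch below is hypothetical: spaCy stands in for a model trained on CoNLL-2003, the template sentence is invented, and the name lists are placeholders for held-out, Census-style names:

```python
# Hypothetical sketch of a per-group recall check; this is not Scale AI's
# published methodology. spaCy stands in for a CoNLL-2003-trained tagger.
import spacy

nlp = spacy.load("en_core_web_sm")

def person_recall(names):
    """Fraction of names tagged as PERSON when dropped into a simple template."""
    hits = 0
    for name in names:
        doc = nlp(f"{name} spoke to reporters on Tuesday.")
        hits += any(ent.label_ == "PERSON" and name in ent.text for ent in doc.ents)
    return hits / len(names)

womens_names = ["Emma", "Olivia", "Ava", "Sophia"]   # placeholder held-out names
mens_names = ["Liam", "Noah", "Oliver", "James"]     # placeholder held-out names

gap = person_recall(mens_names) - person_recall(womens_names)
print(f"Recall gap (men's minus women's names): {gap:+.1%}")
```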
All of this means that a model trained on CoNLL-2003 wouldn’t just fall short when it comes to identifying the current names included in the dataset — it would fall short in the future, too, and likely perform worse over time. It would have more trouble with women’s names, but it would also likely be worse at recognizing names more common to minorities, immigrants, young people, and any other group that wasn’t regularly covered in the news two decades ago.
“It’s only after the fact, if the systems are used on different datasets, that the bias will become apparent.”
To this day, CoNLL-2003 is relied on as an evaluation tool to validate some of the most-used language systems — “word embedding” models that translate words into meaning and context that A.I. can understand — including fundamental models like BERT, ELMo, and GloVe. Everything influenced by CoNLL-2003 has, in turn, had its own ripple effects (for instance, GloVe has been cited more than 15,000 times in literature on Google Scholar).
Alexandr Wang, founder and CEO of Scale AI, describes ML as a “house of cards” of sorts, in that things are built atop each other so quickly that it’s not always apparent whether there’s a sturdy foundation underneath.
The dataset’s ripple effects are immeasurable. So are those of its bias.
Imagine a ruler, slightly bent, that’s seen as the universal standard for measurement.
In interviews, industry experts consistently referred to CoNLL-2003 with wording that reflects its influence: Benchmark. Grading system. Yardstick. For almost two decades, it’s been used as a building block or sharpening tool for countless algorithms.
“If people invent a new machine learning system,” Tjong Kim Sang says, “one of the datasets they will… test it on is this CoNLL-2003 dataset. That is the reason why it has become so popular. Because if people make something new, if it’s in 2005, 2010, 2015, or 2020, they will use this dataset.”
If an algorithm performs well after being run on CoNLL-2003, meaning the way it classified entities closely matches how humans classified them, then it’s viewed as successful — a seminal work in the sector. But in actuality, passing a test like this with flying colors is concerning: It means the model has been built to reinforce some of the dataset’s initial bias. And what about the next model that comes along? If the new one outperforms the old, then it’s likely even more aligned with the dataset’s initial bias.
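In practice, “performs well” here usually means a high entity-level F1 score, a measure of how closely a model’s tags match the human annotations. The snippet below shows how that comparison is commonly computed today, using the open source seqeval package as a stand-in for the shared task’s original scoring script; the tags are made up:

```python
# Entity-level F1 between human ("gold") tags and model predictions.
# seqeval is a stand-in for the shared task's original scoring tool;
# the tag sequences below are invented for illustration.
from seqeval.metrics import f1_score

gold = [["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O",     "O", "B-LOC"]]  # the model misses the ORG

print(f"Entity-level F1: {f1_score(gold, pred):.2f}")
```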
“I consider ‘bias’ a euphemism,” says Brandeis Marshall, PhD, data scientist and CEO of DataedX, an edtech and data science firm. “The words that are used are varied: There’s fairness, there’s responsibility, there’s algorithmic bias, there’s a number of terms… but really, it’s dancing around the real topic… A dataset is inherently entrenched in systemic racism and sexism.”
In interviews with OneZero, the primary creators of CoNLL-2003 didn’t object to the idea that their dataset was biased.
De Meulder, Tjong Kim Sang, and Walter Daelemans, PhD (the team’s supervisor at the time) don’t recall considering bias much back then, especially since they created the dataset for a specific “shared task” — an exercise allowing different groups to test their algorithms’ performance on the same data — ahead of a conference in Canada. “It’s only after the fact, if the systems are used on different datasets, that the bias will become apparent,” writes de Meulder in an interview follow-up.
That’s exactly what happened.
The bias of a system trained on CoNLL-2003 could be as simple as your virtual assistant misreading your instructions to “call Dakota” as dialing a place rather than a person, or not recognizing which artist you’d like to stream via Spotify or Google Play. Maybe you’re looking up a famous actress, artist, or athlete, and a dedicated panel doesn’t pop up in your search results — costing them opportunities and recognition. It’s “exactly the kind of subtle, pervasive bias that can creep into many real-world systems,” writes James Lennon, who led the study at Scale AI, in his report.
“If you can’t recognize people’s names, then those people become invisible to all kinds of automated systems that are really important,” Andreas says. “Making it harder to Google people; making it harder to pull them out of one’s own address books; making it hard to build these nice, specialized user interfaces for people.”
This kind of bias can also lead to problems stemming from lack of recognition or erasure. Many algorithms analyze news coverage, social media posts, and message boards to determine public opinion on a topic or identify emerging trends for decision-makers and stock traders.
“Let’s say there were investors that identified companies to invest in based on ‘social media buzz,’ the number of mentions of that company or any of the senior executives of the company on social media,” writes Graham Neubig, PhD, an associate professor at Carnegie Mellon University’s Language Technology Institute, in an email to OneZero. “In this case, if an NER system failed to identify the name of any of the senior executives, then this ‘buzz’ would not register, and thus the company would be less likely to attract investment attention.”
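A hedged sketch of the “buzz” idea Neubig describes might look like the following, where spaCy again stands in for an NER system and the posts are invented. Any executive the tagger fails to recognize simply never appears in the tally:

```python
# Hypothetical "buzz" counter: tallies how often an NER system spots people
# and organizations in a stream of posts. spaCy is a stand-in model; the
# posts are invented examples.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

posts = [
    "Mary Barra laid out General Motors' EV roadmap today.",
    "Barra's keynote had everyone talking.",
]

buzz = Counter()
for doc in nlp.pipe(posts):
    for ent in doc.ents:
        if ent.label_ in ("PERSON", "ORG"):
            buzz[ent.text] += 1

# Names the tagger misses never show up in `buzz` at all.
print(buzz.most_common())
```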
Daelemans sees it as “a bit of laziness” that people are still using his team’s dataset as a benchmark. Computational linguistics has progressed, but CoNLL-2003 still provides an easy way out in proving a new model to be the latest and greatest. Building a better dataset means dedicating human labor to the unglamorous task of labeling sentences by hand, but today it can be done more quickly, and with fewer examples, than in 2003.
“It would not take that much energy to do a new, more balanced dataset as a benchmark,” Daelemans says. “But the focus is really on getting the next best model, and it’s highly competitive, so I don’t think a lot of research groups will want to invest time in doing a better version.”
Then there’s the question of what building a better dataset actually looks like.
Scale AI’s analysis of CoNLL-2003’s bias, for instance, isn’t without its own problems. When it comes to asking how recognition accuracy compares between the name categories, “that question itself is a whole can of worms,” Andreas says. “Because what does it mean to be a female name, and who are the annotators that are judging… and what about all the people in the world who are not males or females but identify with some other category and who’d maybe even be left out of an analysis like this?” (OneZero has chosen to refer to Scale AI’s “male” and “female” categories as “men’s names” and “women’s names.”)
“If you can’t recognize people’s names, then those people become invisible to all kinds of automated systems that are really important.”
To complete its analysis of CoNLL-2003’s bias, instead of using surrounding pronouns to infer gender, Scale AI used societal notions about the names themselves. The humans who tagged the data assumed, for example, that Tiffany must be a woman, John must be a man, and Alex goes in the gender-neutral category. An ML model that assigns gender externally based on any characteristic is “in complete contradiction with the idea that gender is something that people define for themselves,” says Rachel Thomas, PhD, director of the University of San Francisco’s Center for Applied Data Ethics.
Scale AI’s interest in conducting this experiment is partly propelled by its business model, which involves clients using the company’s labeling pipeline to comb through their own datasets, or the open source data they’re using, to gauge bias. The company created a new open source dataset, called CoNLL-Balanced, after adding more than 400 additional “women’s” names to the initial data. Scale AI’s preliminary results suggest the new algorithm performs comparably on both categories of names.
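Scale AI’s report doesn’t spell out exactly how those names were folded in. One common approach, shown here purely as a hypothetical sketch rather than the company’s documented method, is to copy annotated sentences and swap the tagged person tokens for names from the under-represented group:

```python
# Hypothetical name-swap augmentation; not Scale AI's documented procedure.
# Each (token, tag) sentence with a PER entity is copied with the person's
# name replaced by one drawn from an illustrative list of additions.
import random

EXTRA_NAMES = ["Amara", "Priya", "Yuki", "Fatima"]  # illustrative additions

def swap_person(sentence):
    """Return a copy of a tagged sentence with the PER entity replaced."""
    swapped = []
    for token, tag in sentence:
        if tag == "B-PER":
            token = random.choice(EXTRA_NAMES)
        elif tag == "I-PER":
            continue  # drop surname tokens to keep this sketch simple
        swapped.append((token, tag))
    return swapped

sentence = [("John", "B-PER"), ("Smith", "I-PER"), ("joined", "O"), ("Reuters", "B-ORG")]
print(swap_person(sentence))
```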
But this still may not solve the fundamental problem. In interview after interview, experts made it clear that increasing representation in datasets is merely a bandage — in many ways, the tech community wants to “find a tech solution for a social problem,” Marshall says. When it comes to shifting power into the hands of women, BIPOC, and LGBTQ+ individuals, there’s a lot of work still to be done — and reevaluating datasets alone isn’t going to change things. According to Marshall and Andreas, moving forward will take interdisciplinary work: bringing together leaders in machine learning with those in fields like anthropology, political science, and sociology.
“Representation in datasets is important,” Thomas says. “I worry that too many people think that’s just the sole issue — like once you’ve balanced your dataset, then you’re good — whereas bias really also involves all these questions… People [are] moving more towards talking about how different machine learning models shift power.”
That power mismatch can stem from the representation gap between the people creating these tools and those who could be affected by them. It comes down to the importance of bringing members of marginalized groups into the conversation and development of these tools, in a significant way, so they can think through dangers and potential misuse cases down the line.
“The academic community’s been playing with these datasets for decades, and we know that there are some human errors in the datasets — we know that there’s some bias,” says Xiang Ren, PhD, an assistant professor at the University of Southern California and part of USC’s NLP group. “But I think most of the time, people just kind of follow the popular evaluation protocols.”
Some experts think we’re beginning to see a reckoning for how ML models are evaluated — which, eventually, could lead to the retirement of datasets like CoNLL-2003.
The entire community is now “staring real closely at the datasets and thinking about… our whole scientific apparatus,” Andreas says. “The way in which we judge the effectiveness of systems is largely built around datasets that are like CoNLL-2003.”
Translated from: https://onezero.medium.com/the-troubling-legacy-of-a-biased-data-set-2967ffdd1035