Meta creates new, 'inclusive' AI training dataset so bots can be fair

It could be a solid step against inaccurate, racist, and sexist responses from the likes of OpenAI's ChatGPT and Google's Bard.
Sade Agard
Meta's European head office

Derick Hudson/iStock 

Meta hopes to assist AI researchers in making their tools and procedures more universally inclusive, with the launch of Casual Conversations v2, according to a statement from the firm on March 9.

The vast new dataset, which includes face-to-face video clips from a broad spectrum of human participants across varied geographic, cultural, racial, and physical demographics, serves as an upgrade to its 2021 AI audio-visual training dataset.

The initiative may address some of the concerns about AI-powered programs like OpenAI's ChatGPT and Google's Bard, particularly those relating to data consent and algorithmically enforced racial and socio-political biases.

The AI discrimination problem

With 26,467 video monologues recorded in seven nations and provided by 5,567 paid participants from Brazil, India, Indonesia, Mexico, Vietnam, the Philippines, and the United States, v2 is described by Meta as "a more inclusive dataset to measure fairness." These participants also provided self-identified attributes such as age, gender, and physical appearance.

For an industry long plagued by AI products providing inaccurate, racist, and sexist responses, algorithmic bias is a critical barrier. How algorithms are developed and made available to developers accounts for a large part of the problem.

"The consent-driven dataset... was informed and shaped by a comprehensive literature review around relevant demographic categories," stated Meta. 

By calling the dataset 'consent-driven,' Meta makes clear that the information was collected directly from the participants and not from a covert source. That is, not from your Facebook data or Instagram photos.

"To our knowledge, it's the first open-source dataset with videos collected from multiple countries using highly accurate and detailed demographic information to help test AI models for fairness and robustness," Meta added. 

Still, while Meta trumpets Casual Conversations v2 as a significant advancement, some experts remain cautious. 

Kristian Hammond, a professor of computer science at Northwestern University and director of the school's Center for Advancing the Safety of Machine Intelligence, told PopSci that this is a space where almost anything is an improvement.

He sees Meta's new dataset as "a solid step" for the company, especially in light of earlier privacy issues. He added that the company's emphasis on user permission and on compensating research participants is significant.

"But an improvement is not a full solution. Just a step," Hammond warned.

According to Hammond, much remains unknown about how the researchers selected participants to create Casual Conversations v2.

"Having gender and ethnic diversity is great, but you also have to consider the impact of income and social status and more fine-grained aspects of ethnicity," he wrote. "There is bias that can flow from any self-selecting population."
