EDIT: That is all the time we have for today! Thank you everyone for the thoughtful questions. We'll hop back on tomorrow if there are any big, lingering questions still out there, and feel free to keep following our coverage of AI here: https://www.washingtonpost.com/technology/innovations/?itid=nb_technology_artificial-intelligence&utm_campaign=wp_main&utm_medium=social&utm_source=reddit.com

The Washington Post set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI’s training data.

To look inside this black box, we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA. (OpenAI does not disclose what datasets it uses to train the models backing its popular chatbot, ChatGPT).

The Post worked with researchers at the Allen Institute for AI on this investigation and categorized the websites using data from Similarweb, a web analytics company.

Read more of our analysis here, and skip the paywall with email registration:


Comments: 21 • Responses: 6

PeanutSalsa (8 karma)

How does ChatGPT know if the data it's using to give you an answer is correct or not?

washingtonpost (17 karma)

From Nitasha Tiku:

Excellent question! The large language models that power chatbots like ChatGPT are given a simple objective: predict the next word in a sentence or piece of text. Factual accuracy is not part of that goal. However, with models like ChatGPT that have been fine-tuned to better meet a user’s expectations, companies like OpenAI have done work to improve accuracy during the final stages of the training process, where human evaluators offer feedback on the model’s responses. OpenAI offers some background in its blog post about ChatGPT, noting some of the limitations.
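
To make the "predict the next word" objective concrete, here is a toy sketch. This is not how ChatGPT works internally (real models use neural networks over huge corpora); it is a minimal bigram model, with made-up training text, that shows how the objective rewards the statistically common continuation rather than the true one.

```python
from collections import Counter, defaultdict

# Toy illustration (not a real LLM): a bigram model that predicts the next
# word purely from how often it followed the previous word in its training
# text. The objective is likelihood, not factual accuracy.
corpus = (
    "the moon is made of rock . "
    "the moon is made of cheese . "
    "the moon is made of cheese ."
).split()

# Count, for each word, how often each other word follows it.
next_word_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_word_counts[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word` in the training text."""
    return next_word_counts[word].most_common(1)[0][0]

# The model "prefers" cheese simply because it saw that sequence more often.
print(predict_next("of"))  # -> cheese
```

The point carries over to large models: a confident-sounding continuation is one that was common in the training data, which is exactly why the fine-tuning and human-feedback stages mentioned above are needed to push models toward accuracy.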

washingtonpost (4 karma)

(More from Nitasha)

Efforts to get large language models to produce factually correct responses are an industry-wide challenge, and companies can test their models on “truthfulness” benchmarks to see how their products measure up. If you’re interested in learning more about how OpenAI went about this effort, the company offers more detail in its paper on InstructGPT, a precursor to ChatGPT. For InstructGPT, OpenAI also put out a “model card,” a sort of nutrition label for AI models that was brought up as a potential transparency and accountability measure in today’s congressional hearing on AI oversight.

cegallego (4 karma)

Does ChatGPT give all sources equal weight or does it give more importance to more credible sources?

washingtonpost (14 karma)

From Nitasha Tiku, Szu Yu Chen and Kevin Schaul:

The dataset we explored was curated by a nonprofit called CommonCrawl. We examined just one snapshot taken by the organization from 2019. OpenAI has declined to share any information about the training data for ChatGPT, which was developed using the base models GPT-3.5 and GPT-4. However, we know that for GPT-3, OpenAI’s training data began with at least 41 such snapshots from CommonCrawl. That organization told us that it does try to give more credible websites a higher prevalence when it scrapes the web.

But it’s important to note that companies are really cagey about this entire training process, which can be really complex. (For instance, GPT-3’s training dataset also includes something called WebText2, built from web pages linked in Reddit posts with three or more karma points!!) So there is also a filtering process applied to the training data, which could theoretically be used to give more weight to credible sources. It would be great if there were additional transparency around this process as well.

Taivas_Varjele (3 karma)

Do you think it’s feasible to expect legislation limiting AI, or at least requiring more transparency, to be discussed at a high level in the near future? As we’ve seen with Crypto and meme-stocks, it feels like any sort of control or legislation over novel tech is always incredibly lagging behind.

washingtonpost (10 karma)

From Nitasha Tiku:

Another great q! I think comparing generative AI to crypto and meme-stocks is not a bad analogy. When it comes to fast-moving and fast-changing novel technology, legislators have been slow to act because they’re afraid of being accused of inhibiting innovation and aren’t always sure they know the best way to intervene. In some instances, inaction at the federal level has prompted state regulators to step up.
Today’s Congressional hearing on AI oversight is probably a good harbinger of what’s to come. It seems like there was a lot of trust between the senators and OpenAI CEO Sam Altman to steward this technology. And historically if industry has a say in writing the laws, the public gets transparency in name only.

ktprry (2 karma)

How long did it take you to analyze such a large dataset? What did you use to analyze it?

washingtonpost (6 karma)

From Nitasha Tiku, Szu Yu Chen and Kevin Schaul:

The data analysis for this story took a few weeks — mostly for cleaning and categorization. Allen Institute researchers gave us all 15.7M domains in Google’s C4 dataset. We joined that with categorization data from analytics firm Similarweb.

We used R Markdown for cleaning and analysis, creating updateable web pages we could share with everyone involved. Similarweb’s categories were useful, but too niche for us. So we spent a lot of time recategorizing and redefining the groupings. We used the token count for each website — how many words or phrases — to measure its importance in the overall training data.
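
The core of that step — joining domain-level token counts to category labels, then computing each site's share of all tokens — can be sketched as follows. The original work was done in R Markdown; this is a hypothetical Python equivalent, and the domain names, categories, and token counts below are all made up for illustration.

```python
# Token counts per domain, as the Allen Institute-style data might look
# (illustrative numbers only).
domain_tokens = {
    "patents.google.com": 750_000_000,
    "wikipedia.org": 290_000_000,
    "example-blog.net": 1_200_000,
}

# Similarweb-style category labels (illustrative; the real labels differ,
# and the Post team recategorized them by hand).
similarweb_category = {
    "patents.google.com": "Law & Government",
    "wikipedia.org": "Reference",
}

total_tokens = sum(domain_tokens.values())

# "Left join": keep every domain, even ones without a category label.
merged = [
    {
        "domain": d,
        "category": similarweb_category.get(d, "Uncategorized"),
        # A site's share of all tokens is a proxy for its weight
        # in the overall training data.
        "share": tokens / total_tokens,
    }
    for d, tokens in domain_tokens.items()
]

for row in sorted(merged, key=lambda r: r["share"], reverse=True):
    print(f"{row['domain']:20s} {row['category']:16s} {row['share']:.2%}")
```

At the real scale of 15.7M domains this would be done with a data-frame library rather than plain dicts, but the join-then-normalize logic is the same.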

It turns out the internet has a lot of very bad content on it! Editors at The Post did not want us to publish all of the domain names uncensored. So we spent days combing through offensive domain names, including racial slurs, obscenities and pornographic content. We did our best to mask specific words from readers in our searchable database, but those sites are still used to train chatbots.

Here’s a little more background on the process: https://twitter.com/PostGraphics/status/1648784141813440513

bugoid (2 karma)

Do you know which LLMs (e.g., ChatGPT, Bard, Llama) use C4 as their training data?

Do you have any insights into whether how some of these AI teams might be filtering out some of the more problematic C4 data prior to training?

Have you been able to confirm the degree to which problematic C4 data is actually represented in the models (e.g., prompting the models to summarize that data)?

washingtonpost (3 karma)

From Nitasha Tiku:

We know that C4 was used to train Google’s influential T5 model and Facebook’s LLaMA, as well as the open-source model RedPajama. C4 is a very cleaned-up version of a scrape of the internet from the nonprofit CommonCrawl taken in 2019. OpenAI’s model GPT-3 used a training dataset that began with 41 scrapes of the web from CommonCrawl from 2016 to 2019, so I think it’s safe to say that something akin to C4 was part of GPT-3. (The researchers who originally looked into C4 argue that these issues are common to all web-scraped datasets.)

When we reached out to OpenAI and Google for comment, both companies emphasized that they undertake extensive efforts to weed out potentially problematic data from their training sets. But within the industry, C4 is known as a heavily filtered dataset, and it has been criticized, in fact, for eliminating content related to LGBTQ+ identities because of its reliance on a heavy-handed blocklist. (https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words)
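
To show why that kind of blocklist filtering is heavy-handed, here is a minimal sketch of the drop-the-whole-page-on-any-match approach C4 is described as using. The two-word blocklist and the example pages are made up; the real LDNOOBW list linked above is long and, as noted, includes terms tied to LGBTQ+ identities.

```python
import re

# Minimal sketch of blocklist filtering: drop an entire page if it
# contains ANY word on the blocklist, with no regard for context.
# This tiny list stands in for the real (much longer) LDNOOBW list.
blocklist = {"badword", "lesbian"}  # illustrative only

def keep_page(text: str) -> bool:
    """Return False if any blocklisted word appears anywhere in the page."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return not (words & blocklist)

# The heavy-handed part: a benign identity-related page is dropped
# right along with a genuinely abusive one.
print(keep_page("Resources for lesbian and gay teens"))  # False - dropped
print(keep_page("A recipe for sourdough bread"))         # True - kept
```

Because the filter operates on single words with no context, it cannot distinguish a health resource from pornography, which is exactly the criticism researchers have raised about C4.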

We are working on some reporting to try to address your last and very crucial question, but it’s an open area of research and one that even AI developers are struggling to answer.