Do you know which LLMs (e.g., ChatGPT, Bard, Llama) use C4 as their training data?
Do you have any insights into how some of these AI teams might be filtering out some of the more problematic C4 data prior to training?
Have you been able to confirm the degree to which problematic C4 data is actually represented in the models (e.g., by prompting the models to summarize that data)?
bugoid2 karma