Highest Rated Comments


bugoid 2 karma

Do you know which LLMs (e.g., ChatGPT, Bard, Llama) use C4 as their training data?

Do you have any insights into how some of these AI teams might be filtering out some of the more problematic C4 data prior to training?

Have you been able to confirm the degree to which problematic C4 data is actually represented in the models (e.g., by prompting the models to summarize that data)?

bugoid 1 karma

Thank you! I can't wait to see your next report!