Comment Re: Why? (Score 1) 37
From what I understand, if you keep training an LLM on AI-generated data you eventually get model collapse. And if you feed those LLMs data from the post-ChatGPT internet, it's going to be increasingly riddled with AI output. From Reddit to Wikipedia, people are going to post whatever AI spits out.
Hardly anybody seems to be solving this problem or even talking about it, and it's crucial to scaling LLMs with more data. So it looks like you need to stick to data from up to 2022; everything newer is "contaminated".
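
To make the "data until 2022" idea concrete, here's a rough sketch of a date-cutoff filter. Everything in it is my own assumption for illustration: the exact cutoff (I picked ChatGPT's public release, 30 November 2022), the `published` field, and the `keep_pre_cutoff` helper are all hypothetical, not something any particular training pipeline is known to use.

```python
from datetime import datetime, timezone

# Assumed cutoff: ChatGPT's public release date (30 November 2022).
CUTOFF = datetime(2022, 11, 30, tzinfo=timezone.utc)

def keep_pre_cutoff(docs, cutoff=CUTOFF):
    """Return only documents published strictly before the cutoff.

    Each document is a dict with a timezone-aware 'published' datetime
    (a hypothetical schema, used here only for illustration).
    """
    return [d for d in docs if d["published"] < cutoff]

# Toy corpus: one document from before the cutoff, one from after.
docs = [
    {"id": "a", "published": datetime(2019, 5, 1, tzinfo=timezone.utc)},
    {"id": "b", "published": datetime(2023, 2, 14, tzinfo=timezone.utc)},
]
kept = keep_pre_cutoff(docs)
```

Obviously this only works if the timestamps are trustworthy, and it throws away all the human-written stuff posted after the cutoff too.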