LAION-5B Dataset Removed After Discovery of Child Sexual Abuse Material (404media.co)
LAION told 404 Media on Tuesday that out of “an abundance of caution,” it was taking down its datasets temporarily “to ensure they are safe before republishing them.”
According to a new study by the Stanford Internet Observatory shared with 404 Media ahead of publication, the researchers found the suspected instances of CSAM through a combination of perceptual and cryptographic hash-based detection and analysis of the images themselves.
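Cryptographic hash-based detection of the kind mentioned above works by computing a digest of each image's raw bytes and checking it against a reference set of digests for known material. The sketch below is purely illustrative and is not the researchers' actual pipeline; the file names, byte strings, and `known_hashes` set are all hypothetical, and SHA-256 is used only as a representative cryptographic hash. Note that an exact-match digest catches only byte-identical copies, which is why perceptual hashing is used alongside it to catch resized or re-encoded variants.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of raw file bytes."""
    return hashlib.sha256(data).hexdigest()

def flag_known_hashes(files: dict, known_hashes: set) -> list:
    """Return the names of files whose cryptographic digest appears in a
    reference set of known digests. Exact-match only: changing a single
    byte yields a completely different digest."""
    return [name for name, data in files.items()
            if sha256_hex(data) in known_hashes]

# Hypothetical data: one "known" digest and two candidate files.
known = {sha256_hex(b"example-known-image-bytes")}
candidates = {
    "a.jpg": b"example-known-image-bytes",  # byte-for-byte match -> flagged
    "b.jpg": b"some-other-image-bytes",     # no match -> not flagged
}
print(flag_known_hashes(candidates, known))  # -> ['a.jpg']
```

In practice such matching runs against curated digest lists maintained by child-safety organizations rather than a local set, but the lookup logic is the same.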
“We find that having possession of a LAION-5B dataset populated even in late 2023 implies the possession of thousands of illegal images—not including all of the intimate imagery published and gathered nonconsensually, the legality of which is more variable by jurisdiction,” the paper says. “While the amount of CSAM present does not necessarily indicate that the presence of CSAM drastically influences the output of the model above and beyond the model’s ability to combine the concepts of sexual activity and children, it likely does still exert influence. The presence of repeated identical instances of CSAM is also problematic, particularly due to its reinforcement of images of specific victims.”
The finding highlights the danger of largely indiscriminate scraping of the internet for the purposes of generative artificial intelligence.