EleutherAI releases massive AI training dataset of licensed and open domain text | TechCrunch
Briefly

EleutherAI has released The Common Pile v0.1, an 8-terabyte dataset of licensed and open-domain text assembled over two years with AI startups and academic partners. The dataset is meant to offer a legally sound basis for training AI models, and it was used to train EleutherAI's Comma v0.1-1T and Comma v0.1-2T models. With lawsuits over AI training practices still ongoing, EleutherAI argues that the litigation has reduced transparency and made collaboration harder, and that transparency is crucial to progress in AI research.
[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in.
The Common Pile v0.1, which can be downloaded from Hugging Face's AI dev platform and GitHub, was created in consultation with legal experts.
Read at TechCrunch