EleutherAI releases massive AI training dataset of licensed and open domain text

"[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in."

"The Common Pile v0.1, which can be downloaded from Hugging Face's AI dev platform and GitHub, was created in consultation with legal experts."

EleutherAI has released The Common Pile v0.1, an 8-terabyte dataset of licensed and open-domain text built over two years with AI startups and academic partners. The dataset is meant to offer a legally sound alternative for training AI models, and EleutherAI used it to train two models of its own, Comma v0.1-1T and Comma v0.1-2T. Amid ongoing lawsuits over AI training practices, EleutherAI argues that the litigation has made companies less transparent and collaboration harder, and that transparency is crucial to the progress of AI research.

#ai-research #data-licensing #transparency #eleutherai #common-pile

Read at TechCrunch

