Researchers suggest OpenAI trained AI models on paywalled O'Reilly books | TechCrunch

"The new paper, out of the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O'Reilly and economist Ilan Strauss, draws the conclusion that OpenAI likely trained its GPT-4o model on paywalled books from O'Reilly Media. O'Reilly doesn't have a licensing agreement with OpenAI, the paper says."
"GPT-4o, OpenAI's more recent and capable model, demonstrates strong recognition of paywalled O'Reilly book content [...] compared to OpenAI's earlier model GPT-3.5 Turbo."
"While a number of AI labs including OpenAI have begun embracing AI-generated data to train AI as they exhaust real-world sources, few have eschewed real-world data entirely."
"The paper used a method called DE-COP, first introduced in an academic paper in 2024, designed to detect copyrighted content in language models' training data."
The article reports on a new paper from the AI Disclosures Project, which concludes that OpenAI likely trained its GPT-4o model on paywalled books from O'Reilly Media. According to the paper, O'Reilly has no licensing agreement with OpenAI. The researchers used DE-COP, a method introduced in 2024 to detect copyrighted content in language models' training data, and found that GPT-4o shows strong recognition of paywalled O'Reilly book content compared with the earlier GPT-3.5 Turbo. The article also notes that although a number of AI labs, including OpenAI, have begun training on AI-generated data as real-world sources run out, few have abandoned real-world data entirely. The findings raise questions about the use of copyrighted material without a license in AI training.
Read at TechCrunch