Ptechhub

Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft

By Wired
December 18, 2024


In addition to the trove of books, the Institutional Data Initiative is also working with the Boston Public Library to scan millions of articles from different newspapers now in the public domain, and it says it’s open to forming similar collaborations down the line. The exact way the books dataset will be released is not settled. The Institutional Data Initiative has asked Google to work together on public distribution, but the details are still being hammered out. In a statement, Kent Walker, Google’s president of global affairs, said the company was “proud to support” the project.

However the IDI’s dataset is released, it will be joining a host of similar projects, startups, and initiatives that promise to give companies access to substantial and high-quality AI training materials without the risk of running into copyright issues. Firms like Calliope Networks and ProRata have emerged to issue licenses and manage compensation schemes designed to get creators and rights holders paid for providing AI training data.

There are also other new public-domain projects. Last spring, the French AI startup Pleias rolled out its own public-domain dataset, Common Corpus, which contains an estimated 3 to 4 million books and periodical collections, according to project coordinator Pierre-Carl Langlais. Backed by the French Ministry of Culture, the Common Corpus has been downloaded more than 60,000 times this month alone on the open source AI platform Hugging Face. Last week, Pleias announced that it is releasing its first set of large language models trained on this dataset, which Langlais told WIRED constitute the first models “ever trained exclusively on open data and compliant with the [EU] AI Act.”

Efforts are underway to create similar image datasets as well. AI startup Spawning released its own this summer called Source.Plus, which contains public-domain images from Wikimedia Commons as well as a variety of museums and archives. Several significant cultural institutions have long made their own archives accessible to the public as standalone projects, like the Metropolitan Museum of Art in New York.

Ed Newton-Rex, a former executive at Stability AI who now runs a nonprofit that certifies ethically trained AI tools, says the rise of these datasets shows that there’s no need to steal copyrighted materials to build high-performing, high-quality AI models. OpenAI previously told lawmakers in the United Kingdom that it would be “impossible” to create products like ChatGPT without using copyrighted works. “Large public domain datasets like these further demolish the ‘necessity defense’ some AI companies use to justify scraping copyrighted work to train their models,” Newton-Rex says.

But he still has reservations about whether the IDI and projects like it will actually change the AI training status quo. “These datasets will only have a positive impact if they’re used, probably in conjunction with licensing other data, to replace scraped copyrighted work. If they’re just added to the mix, one part of a dataset that also includes the unlicensed life’s work of the world’s creators, they’ll overwhelmingly benefit AI companies,” he says.

Updated 12/12/24 11:18am ET: This story has been updated with comment from Google.




Tags: Artificial Intelligence, copyright, machine learning, Microsoft, open source, OpenAI
Copyright © 2025 | Powered By Porpholio
