Harvard University has announced the release of a substantial dataset comprising nearly one million public-domain books, aimed at enhancing access to high-quality data for artificial intelligence training. This initiative is led by Harvard’s Institutional Data Initiative, which received funding from both Microsoft and OpenAI. The dataset consists of books scanned as part of the Google Books project that are no longer under copyright protection.
Roughly five times the size of the controversial Books3 dataset used to train AI models such as Meta’s Llama, the collection spans a diverse range of genres and languages and includes works by notable authors such as Shakespeare, Charles Dickens, and Dante. Greg Leppert, executive director of the initiative, said the project aims to “level the playing field” in the AI sector by giving smaller players and researchers access to resources typically monopolized by major tech companies.
Leppert compared the potential impact of this public-domain database to that of Linux, suggesting it could serve as a foundational element for various AI applications. He noted that companies would still need to incorporate additional licensed data to differentiate their models from competitors’.
Microsoft’s vice president, Burton Davis, said the company’s involvement aligns with its mission to create accessible data pools for AI startups, emphasizing that the goal is not to replace all existing training data with public-domain materials but to complement it. OpenAI’s Tom Rubin expressed enthusiasm for supporting the venture.
The legal landscape surrounding AI training data is currently fraught with challenges, as numerous lawsuits over the use of copyrighted material work their way through the courts. Depending on the outcomes, AI companies may have to change how they build their models. In light of these uncertainties, initiatives like Harvard’s anticipate continued demand for public-domain resources, which could provide a more sustainable path forward for the AI industry.
Alongside the book dataset, the Institutional Data Initiative is collaborating with the Boston Public Library to digitize millions of articles that are now in the public domain. While exactly how the dataset will be distributed has yet to be determined, discussions with Google about public distribution are underway.
This release adds to a growing list of projects and startups focused on providing high-quality, legally compliant data for AI development, highlighting a significant shift toward leveraging public-domain materials. For instance, the French AI startup Pleias has already introduced the Common Corpus dataset, which features an extensive collection of public-domain works.
These efforts challenge the narrative that AI development requires copyrighted content, as emerging datasets offer a route to building models without infringing on intellectual property rights. Nonetheless, the real impact of such datasets will depend on how widely they are used and on the industry’s willingness to forgo scraped copyrighted material in favor of these legitimate resources.