
EleutherAI Introduces One of the Largest Licensed Datasets for AI Training

June 7, 2025

EleutherAI is shaking things up with its new dataset, The Common Pile v0.1. Developed over two years in collaboration with partners like Poolside, Hugging Face, and leading academic institutions, this 8-terabyte trove of openly licensed and public-domain text promises to make AI training both more robust and legally sound. If you’ve ever wrestled with finding high-quality, compliant data for training, this news might resonate with you.

The Common Pile v0.1 is already powering EleutherAI’s Comma v0.1-1T and Comma v0.1-2T models, which hold their own against models trained on unlicensed content. In a recent blog post on Hugging Face, EleutherAI executive director Stella Biderman pointed out that while litigation hasn’t stopped companies from sourcing their data, it has sharply eroded the transparency many once practiced. Crafted with thorough legal oversight, the dataset draws on a diverse mix of sources, from 300,000 public-domain books housed at the Library of Congress and the Internet Archive to audio transcriptions generated with OpenAI’s Whisper model. This marks a welcome shift from EleutherAI’s earlier approach with The Pile, which included copyrighted materials. For anyone looking to navigate AI training with both quality and compliance in mind, this initiative provides a clear pathway.
