
OpenAI Models and Copyright: What You Need to Know

April 5, 2025

Hey there! If you’ve been following the world of AI, you might’ve heard some buzz about OpenAI’s models and how they handle copyrighted content. A recent study sheds light on this, and it’s definitely worth a closer look.

Researchers from the University of Washington, the University of Copenhagen, and Stanford have come up with a way to check whether AI models, like those from OpenAI, are memorizing copyrighted material a bit too well. It's not just a techy issue—it's stirring up legal debates, with authors and developers claiming their work was used in training without credit or permission.

OpenAI stands by a fair use defense, but critics aren't convinced. U.S. copyright law doesn't clearly say whether using copyrighted works to train AI is okay, which makes things a bit sticky.

So, how did the researchers dig into this? They focused on something called "high-surprisal" words—rare words that are statistically unlikely in their context, like "radar" in an unexpected sentence. By stripping these words from fiction and news snippets, they tested whether models like GPT-4 and GPT-3.5 could fill in the blanks. If a model could, that hinted at memorization, since a model that had never seen the text would struggle to guess such an unusual word.
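To make the idea concrete, here's a simplified sketch of that probing setup. The actual study estimates surprisal from a language model's own token probabilities; this toy version uses a basic unigram frequency model instead (so the word counts, smoothing, and the `____` blank format are all illustrative assumptions, not the paper's method):

```python
import math
from collections import Counter

def high_surprisal_words(text, corpus_counts, corpus_total, top_k=1):
    """Pick the top_k words in `text` with the highest surprisal,
    i.e. -log2 P(word), under a unigram model with add-one smoothing.
    Rare words have low probability and thus high surprisal."""
    vocab_size = len(corpus_counts)

    def surprisal(word):
        count = corpus_counts.get(word.lower(), 0) + 1  # add-one smoothing
        return -math.log2(count / (corpus_total + vocab_size))

    return sorted(text.split(), key=surprisal, reverse=True)[:top_k]

def mask_words(text, targets):
    """Blank out each target word; the model under test must fill these in."""
    return " ".join("____" if w in targets else w for w in text.split())

# Toy reference corpus for the frequency model.
corpus = "the cat sat on the mat the dog sat on the rug".split()
counts = Counter(corpus)

snippet = "the radar sat on the mat"
rare = high_surprisal_words(snippet, counts, len(corpus), top_k=1)
probe = mask_words(snippet, set(rare))
# "radar" never appears in the corpus, so it is the high-surprisal pick,
# and the probe becomes: "the ____ sat on the mat"
```

A model that reliably restores the exact rare word in probes like this is a candidate for having memorized the source text, rather than merely guessing from context.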

The results? GPT-4, for example, showed it had memorized chunks of popular books, including titles from a dataset of copyrighted ebooks known as BookMIA. Memorization of New York Times articles was less frequent, but still noticeable.

Abhilasha Ravichander, a doctoral candidate and study co-author, highlighted the importance of transparency in AI training data. She believes that for these models to be trustworthy, they need to be open to scientific probing and auditing.

OpenAI is pushing for more relaxed rules on using copyrighted data for AI, even though they do have some licensing agreements and opt-out options. They’re also advocating for clearer fair use policies with governments worldwide.
