ChatGPT maker says ‘it would be impossible’ to train models without violating copyright

In a Monday blog post, OpenAI — the company behind ChatGPT — published a lengthy response to the New York Times lawsuit filed against the company in late December. 

The lawsuit alleges rampant copyright infringement in both the input and output of ChatGPT, which the Times argued represents a significant threat to its business. 

OpenAI’s position, however, is that it is already collaborating with other news organizations; copyright-infringing output is a “rare bug” and the company is working on reducing its frequency; training is fair use; and the Times is “not telling the full story.” 

Related: Copyright expert predicts result of NY Times lawsuit against Microsoft, OpenAI

At the core of the dispute between OpenAI and the Times are two different interpretations of the “fair use” doctrine, a component of copyright law that permits the limited use of otherwise copyrighted work. 

The U.S. Copyright Office, which said in August it is undertaking a study of the law to better understand where generative AI fits in, declined to comment on the Times’ lawsuit. 

OpenAI’s argument is that training its models on the internet at large is fair use.

“We view this principle as fair to creators, necessary for innovators and critical for U.S. competitiveness,” the company said in a statement. 

It is a view shared by many technologists, including computer scientist Andrew Ng, who recently said that, just as humans are allowed to learn from information on the internet, “AI should be allowed to do so, too.” 

If training on the open internet were deemed fair use, Ng said, “society will be better off.” He did not elaborate on that point. 

On the topic of AI training on copyrighted data, many people have echoed the argument made by Andrew Ng below. But it would be interesting to think about what copyright law would be like if humans had the ability to memorize entire books and recite them when prompted to do so.

— Melanie Mitchell (@MelMitchell1) January 8, 2024

But the issue is less about disallowing training on publicly available information and more about requiring companies to license the content that powers commercial models, which are so far generating enormous returns for investors. 

OpenAI, which was founded in 2015, is now valued at a minimum of $86 billion and is reportedly in talks to raise funds at a valuation of $100 billion. Microsoft, its top investor, has a market cap of nearly $3 trillion and has poured $13 billion into OpenAI. 

“The AI companies are working in a mental space where putting things into technology blenders is always okay,” copyright expert and Cornell professor of digital and information law James Grimmelmann told TheStreet. “The media companies have never fully accepted that. They’ve always taken the view that ‘if you’re training or doing something with our works that generates value we should be entitled to part of it.'”

Related: Think tank director warns of the danger around ‘non-democratic tech leaders deciding the future’

OpenAI: “It would be impossible” to train without violating copyright

OpenAI, according to the Daily Telegraph, submitted a statement to the House of Lords communications and digital committee explaining that, since copyright covers everything from blog posts to pictures and government documents, “it would be impossible to train today’s leading AI models without using copyrighted materials.” 

Rough Translation: We won’t get fabulously rich if you don’t let us steal, so please don’t make stealing a crime!

Don’t make us pay *licensing* fees, either!

Sure Netflix might pay billions a year in licensing fees, but *we* shouldn’t have to!

More money for us, moar!

— Gary Marcus (@GaryMarcus) January 8, 2024

“Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.” 

But OpenAI said that, even though it believes training ought to be fair use, it offers an opt-out process, meaning the default is that all content on the internet is up for grabs to train OpenAI’s models. 

The process prevents a website from being crawled by OpenAI, but does not erase past crawling done by the company. Indeed, the lawsuit alleges that OpenAI’s bots are trained on millions of Times articles. 
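The opt-out described above works through the standard robots.txt mechanism: OpenAI documents a crawler user agent, GPTBot, that site operators can block. A minimal sketch of what a publisher would add to its robots.txt file (the exact directives a given site needs depend on its setup):

```text
# robots.txt at the site root — ask OpenAI's crawler not to index any pages
User-agent: GPTBot
Disallow: /
```

As noted, this only stops future crawling; it has no effect on pages the crawler has already collected.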

“OpenAI’s lobbying campaign, simply put, is based on a false dichotomy (give everything to us free or we will die) — and also a threat: either we get to use all the existing IP we want for free, or you won’t get to generative AI anymore,” AI researcher Gary Marcus said. “But the argument is hugely flawed.”

I run a sandwich shop. There’s no way I could make a living if I had to pay for all my ingredients. The cost of cheese alone would put me out of business.

— Craig Cowling (@ccowling) January 8, 2024

Marcus added that nobody is suggesting such companies train only on public domain works. The suggestion is instead to license those works. OpenAI has already signed licensing agreements with the Associated Press and Axel Springer, which publishes Business Insider. The details of these agreements have not been disclosed. 

The Information recently reported that OpenAI was offering media publishers between $1 million and $5 million annually in content licensing fees for training. 

Related: The ethics of artificial intelligence: A path toward responsible AI

OpenAI: Negotiations fell apart

OpenAI said that it had been engaged in negotiations with the Times through Dec. 19, focused on creating a “high-value partnership” with attribution in ChatGPT. The company called the lawsuit a “surprise and disappointment.” 

OpenAI added that the Times’ dozens of examples of copyright-infringing output, also known as regurgitation, don’t tell the full story. 

“It seems they intentionally manipulated prompts, often including lengthy excerpts of articles, in order to get our model to regurgitate,” OpenAI said, adding that it is constantly making progress in making its systems more resistant to such infringing attempts. 

This comes in the wake of a new paper, published by Marcus and artist Reid Southen, which highlighted copious examples of copyright-infringing output in image-generation models. 

“Both OpenAI and Midjourney are fully capable of producing materials that appear to infringe on copyright and trademarks,” the paper reads. “These systems do not inform users when they do so. They do not provide any information about the provenance of the images they produce. Users may not know, when they produce an image, whether they are infringing.”

OpenAI did not respond to a request for comment. 

New polling from the Artificial Intelligence Policy Institute (AIPI), meanwhile, found that nearly 60% of U.S. voters believe AI companies should not be allowed to use copyrighted content to train models; 70% said that AI companies ought to compensate outlets like the Times if they want to use their content to train models. 

Nearly 70% of voters support federal legislation that would require AI companies to form licensing agreements with news organizations before training models on their content. 

“This is a landmark case in what tech companies are allowed to do with the data they collect and extract,” Daniel Colson, executive director of the AIPI, said in a statement. “Companies are starting to realize that AI models are a huge threat to the value of their intellectual property, and support restrictions on how AI can be trained.”

“The New York Times is taking the lead and making sure the deployment of generative AI doesn’t repeat the ‘move fast and break things’ approach of Facebook and social media platforms.”

Contact Ian with AI stories via email or Signal 732-804-1223.

Related: Human creativity persists in the era of generative AI