The Intercept is one of the many media outlets that have sued OpenAI and Microsoft over the past year for using journalists’ work to train ChatGPT without permission or credit. The case, which OpenAI and Microsoft are trying to get tossed from federal court, shows why digital news outlets are particularly exposed to AI grifters.
To be clear, it’s not just outlets like this one that are at risk. Actor Scarlett Johansson on Monday accused OpenAI of mimicking her voice in its new virtual assistant even though she had reportedly twice rejected offers from CEO Sam Altman. Larger publications have also raised questions about OpenAI’s approach to human labor. But unlike Hollywood stars and print publications, digital outlets face some unique hurdles in protecting their work.
In a lawsuit filed in February, The Intercept alleged that OpenAI and Microsoft violated a federal law, the Digital Millennium Copyright Act, by stripping basic authorship information from copyrighted stories and using them to train ChatGPT without paying any licensing fees to publishers. (Full disclosure: In addition to doing my own reporting for The Intercept, I am also one of its attorneys.)
“OpenAI and Microsoft have the economic incentive to vacuum up the hard work of online news outlets, ignoring the training, curation, research, and resources those organizations devote to making sure the public is informed with timely, accurate news,” said David Bralow, The Intercept’s general counsel. “They would like us to get lost in the algorithm so that they can continue to free ride.”
Just as OpenAI denies casting an actor with a voice “eerily similar” to Johansson’s, OpenAI and Microsoft have attempted to shrug off The Intercept’s lawsuit. Last month, they filed motions to dismiss the case, which will be argued before a federal judge in Manhattan on June 3.
In the past year, OpenAI has inked deals with many press outlets to license their content, including the Associated Press, Le Monde, the Financial Times, and Axel Springer, the German publisher that owns Politico and Business Insider.
A slew of other outlets have sued OpenAI for various flavors of copyright misfeasance. The New York Times sued in December, followed last month by the Chicago Tribune, the New York Daily News, and six other daily papers owned by Alden Global Capital. Digital outlets Raw Story and AlterNet, represented by the same firm as The Intercept, filed a separate lawsuit in February.
All plaintiffs — traditional and digital alike — noted in court filings that their websites appear prominently in OpenAI’s own lists of which pages it had scraped to train earlier versions of ChatGPT. The Intercept’s website is on OpenAI’s list of “the top 1,000 domains present” in data used to train GPT-2; per OpenAI’s description, one of the datasets contains text scraped from more than 6,400 separate pages from The Intercept’s domain.
But OpenAI and Microsoft have urged the district court to dismiss The Intercept’s claims on numerous grounds, including that The Intercept cannot point to every article that was ever fed into ChatGPT.
In a brief filed last week, OpenAI argued that The Intercept failed to identify “a single work from which OpenAI supposedly removed copyright management information.”
As The Intercept countered, only OpenAI and Microsoft could possibly know which specific articles are in the ChatGPT training sets, unless the court allows the case to proceed into discovery.
Because of how modern copyright protections work, the New York Times and other print publications have much more straightforward claims than The Intercept and other digital outlets. To qualify for bread-and-butter copyright infringement damages, authors must register their works with the U.S. Copyright Office. It is relatively straightforward to register print news articles in bulk; using an online portal, publications can register an entire month’s worth of print issues at once.
But there is no similar bulk process for online-only outlets, which must register each article individually with the Copyright Office. Earlier this year, the Copyright Office floated a new registration process for news websites, which is still under consideration. But the current registration requirements are costly and time intensive, and thus impractical for budget-constrained nonprofits like The Intercept.
Unable to invoke traditional copyright infringement claims, The Intercept turned to somewhat novel arguments under the DMCA, which Congress passed in 1998. As the Copyright Office summarizes it, the DMCA was meant to “move the nation’s copyright law into the digital age.”
Under the DMCA, it is illegal to intentionally remove “copyright management information” such as a work’s title and author as well as to distribute that work knowing the information was removed. The Intercept and other plaintiffs allege that OpenAI and Microsoft violated both of these provisions by training ChatGPT on journalists’ articles without this attribution information.
“The Intercept is not the first to challenge this technology through claims under the Digital Millennium Copyright Act’s provision concerning removal of copyright management information,” Microsoft’s attorneys wrote in their brief, calling The Intercept’s lawsuit “the skimpiest of the lot of these challenges.”
Next month, the district court will consider whether The Intercept’s lawsuit will proceed.
If the case is dismissed, OpenAI can continue to train ChatGPT to regurgitate words that are “eerily similar” to the work of digital outlets like The Intercept without paying for that work.