The Internet Archive faces a new threat: Wary publishers who opt out to stop scraping by AI bots

Ruins of the Library of Pantainos in Athens, Greece. Photo (cc) 2018 by Michael Kogan.

Has the Internet Archive reached the end of the line? The 30-year-old nonprofit, which has saved and made searchable more than a trillion webpages, has proved itself to be of enormous value over the years.

I’ve used it to track changes in reporting, including for a blog post about The New York Times’ shifting coverage of an explosion at Ahli Arab Hospital in Gaza City in the days after Hamas’ October 2023 terrorist attack on Israel. The Times and other news organizations initially reported that Israeli forces had bombed the hospital, but they later had to walk back that unverified claim.

The Internet Archive is also home to The Boston Phoenix’s online digital and print archives thanks to an agreement that it made with Northeastern University, which acquired the Phoenix’s intellectual property after the legendary alt-weekly went out of business in 2013. (Note: I was a longtime staff columnist for the Phoenix, and I helped arrange the donation to Northeastern.)

Now, though, the Internet Archive and its Wayback Machine, which reproduces web content from years past, are facing an existential threat. News organizations ranging from the Times to USA Today are inserting code into their sites that blocks the Archive from crawling their content, mainly to prevent AI companies from accessing their journalism without permission.

As Kate Knibbs reports for Wired, the irony is that USA Today recently published an important piece of investigative journalism documenting ICE detention statistics that wouldn’t have been possible without the Archive. Knibbs writes:

According to analysis by the artificial-intelligence-detection startup Originality AI, 23 major news sites are currently blocking ia_archiverbot, the web crawler commonly used by the Internet Archive for the Wayback project. The social platform Reddit is too. Other outlets are limiting the project in different ways: The Guardian does not block the crawler, but it excludes its content from the Internet Archive API and filters out articles from the Wayback Machine interface, which makes it harder for regular people to access archived versions of its articles.

The Electronic Frontier Foundation, which is helping to lead a signature drive in support of the Archive, compares the publishers’ actions to “a newspaper publisher announcing it will no longer allow libraries to keep copies of its paper,” according to a recent EFF article by Joe Mullin, who writes:

For nearly three decades, historians, journalists, and the public have relied on the Internet Archive to preserve news sites as they appeared online. Those archived pages are often the only reliable record of how stories were originally published. In many cases, articles get edited, changed, or removed—sometimes openly, sometimes not. The Internet Archive often becomes the only source for seeing those changes. When major publishers block the Archive’s crawlers, that historical record starts to disappear.

This is not the first time the Archive has run into legal problems. One major challenge was of its own making: a project begun during the COVID pandemic to make books available for free without permission and without any compensation to publishers or authors. Not surprisingly, the Archive lost that case in a federal appeals court in 2024. As I wrote in describing that decision: “The Archive claimed that it was in compliance with copyright law because it limited e-book borrowing to correspond with physical books that it had in its collection or that was owned by one of its partner libraries. That’s not the way it works, though.”

The current threat involves publishers’ right to make their content available as they see fit, which the law allows. They are under no obligation to let the Internet Archive repurpose it. Ideally, they will come to understand the incalculable damage they are doing.

As EFF’s Mullin puts it: “There are real disputes over AI training that must be resolved in courts. But sacrificing the public record to fight those battles would be a profound, and possibly irreversible, mistake.”

