The Internet Archive faces a new threat: Wary publishers who opt out to stop scraping by AI bots

Ruins of the Library of Pantainos in Athens, Greece. Photo (cc) 2018 by Michael Kogan.

Has the Internet Archive reached the end of the line? The 30-year-old nonprofit, which has saved and made searchable more than a trillion webpages, has proved itself to be of enormous value over the years.

I’ve used it to track changes in reporting, including this blog post about The New York Times’ shifting coverage of an explosion at Ahli Arab Hospital in Gaza City in the days after Hamas’ October 2023 terrorist attack on Israel. The Times and other news organizations initially reported that Israeli forces had bombed the hospital, but they later had to walk back that unverified claim.

The Internet Archive is also home to The Boston Phoenix’s online digital and print archives thanks to an agreement that it made with Northeastern University, which acquired the Phoenix’s intellectual property after the legendary alt-weekly went out of business in 2013. (Note: I was a longtime staff columnist for the Phoenix, and I helped arrange the donation to Northeastern.)

Now, though, the Internet Archive and its Wayback Machine, which reproduces web content from years past, are facing an existential threat. News organizations ranging from the Times to USA Today are inserting code into their sites that blocks the Archive from crawling their content, mainly to prevent AI companies from accessing their journalism without permission.

As Kate Knibbs reports for Wired, the irony is that USA Today recently published an important piece of investigative journalism documenting ICE detention statistics that wouldn’t have been possible without the Archive. Knibbs writes:

According to analysis by the artificial-intelligence-detection startup Originality AI, 23 major news sites are currently blocking ia_archiverbot, the web crawler commonly used by the Internet Archive for the Wayback project. The social platform Reddit is too. Other outlets are limiting the project in different ways: The Guardian does not block the crawler, but it excludes its content from the Internet Archive API and filters out articles from the Wayback Machine interface, which makes it harder for regular people to access archived versions of its articles.
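
For readers curious about what that blocking looks like in practice, here is a minimal sketch in Python. It assumes the block is expressed as a robots.txt rule, which is one common way to turn crawlers away; the reporting quoted above does not say which mechanism each publisher uses. The crawler name comes from the Originality AI analysis, while the robots.txt contents, the `news.example` address and the `is_allowed` helper are hypothetical and purely illustrative.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary news site (an assumption for
# illustration only). It turns away the crawler named in the analysis
# quoted above and leaves every other crawler alone.
ROBOTS_TXT = """\
User-agent: ia_archiverbot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://news.example/2026/investigation.html"  # hypothetical URL
    for agent in ("ia_archiverbot", "SomeOtherBot"):
        print(agent, "allowed:", is_allowed(agent, story))
```

A rule like this is a request rather than a lock: it works only because well-behaved crawlers check the file and honor it, which is why a publisher's decision to add one is enough to create a gap in the archived record.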

The Electronic Frontier Foundation, which is helping to lead a signature drive in support of the Archive, compares the publishers’ actions to “a newspaper publisher announcing it will no longer allow libraries to keep copies of its paper,” according to a recent EFF article by Joe Mullin, who writes:

For nearly three decades, historians, journalists, and the public have relied on the Internet Archive to preserve news sites as they appeared online. Those archived pages are often the only reliable record of how stories were originally published. In many cases, articles get edited, changed, or removed—sometimes openly, sometimes not. The Internet Archive often becomes the only source for seeing those changes. When major publishers block the Archive’s crawlers, that historical record starts to disappear.

This is not the first time the Archive has run into legal problems. One major challenge was of its own making: a project begun during the COVID pandemic to make books available for free without permission and without any compensation to publishers or authors. Not surprisingly, the Archive lost that case in a federal appeals court in 2024. As I wrote in describing that decision: “The Archive claimed that it was in compliance with copyright law because it limited e-book borrowing to correspond with physical books that it had in its collection or that was owned by one of its partner libraries. That’s not the way it works, though.”

The current threat involves the right of publishers to make their content available as they see fit, which they are legally entitled to do. They are under no obligation to let the Internet Archive repurpose it. Ideally, they will come to understand the incalculable damage they are doing.

As EFF’s Mullin puts it: “There are real disputes over AI training that must be resolved in courts. But sacrificing the public record to fight those battles would be a profound, and possibly irreversible, mistake.”

A Muzzle to Waltham’s local access outlet for trying to silence citizen journalists

Postcard c. 1930-1945

According to its “About” page, Waltham Community Access Corp., which operates two local access stations for the benefit of cable subscribers, “is funded by a percentage of the gross revenues from Comcast and RCN cable.” This is a typical arrangement, mandated by state law. And though WCAC describes itself as an “independent nonprofit corporation,” the revenues that access channels receive from cable providers are generally passed through to them by local government. What’s more, the cable providers themselves are licensed by each city and town.

In other words, local access outlets like WCAC may not be part of the government, but they certainly have a relationship with the government. Which is why the actions taken by WCAC last September, just before a city election, were especially pernicious. According to a lawsuit filed last week in U.S. District Court by a citizen journalism group known as Channel 781 News, WCAC filed a complaint with YouTube claiming copyright infringement because Channel 781 had made use of clips of government meetings. Again, as is typical of local access operations, WCAC carries some municipal meetings in full and then posts them online. According to a press release from the Electronic Frontier Foundation, which filed the suit on Channel 781’s behalf, WCAC violated Channel 781’s rights under the “fair use” exception to copyright law:

The Waltham Community Access Corp.’s misrepresentation of copyright claims under the Digital Millennium Copyright Act (DMCA) led YouTube to temporarily deactivate Channel 781, making its work disappear from the internet last September just five days before an important municipal election, the suit says. 

“WCAC knew it had no right to stop people from using video recordings of public meetings, but asked YouTube to shut us down anyway,” Channel 781 cofounder Josh Kastorf said. “Democracy relies on an informed public, and there must be consequences for anyone who abuses the DMCA to silence journalists and cut off people’s access to government.”

WCAC’s actions — which have earned it a New England Muzzle Award — resulted in the temporary shutdown of Channel 781, according to a story from last September in The Justice, the student newspaper at Brandeis University. At that time, Justice reporter Lea Zaharoni wrote that WCAC did not respond to a request for comment. But Zaharoni found that the president of WCAC’s board also served as a city official, and observed that Channel 781 had reported critically on yet another organization that particular official was involved with.

Adam Gaffin of Universal Hub, who has published a comprehensive account of the lawsuit, found a statement posted by WCAC executive director Maria Sheehan that has since been taken down:

Our station is a private nonprofit that does not receive taxpayer funding. Over recent years, photographs from our news department, and video from the MAC channel, have been reproduced without our permission. We know this is a reality of the world we live in, but we put copyright disclaimers on our media for a reason. Some have used our content to score political points under the veil of anonymity. Others have used it to encourage residents to hate. This practice can damage reputations and spread misinformation and we do not want to be a part of that. So as we head into a contentious election season, I’m asking the public to respect people who work hard to create our original content. In the interest of transparency, we will entertain requests to reuse our content for free, but misuse is wrong, and it is illegal. Moving forward, the Waltham Channel will take whatever legal steps necessary to protect our content.

According to the EFF, “WCAC sent three copyright infringement notices to YouTube referencing 15 specific Channel 781 videos, leading YouTube to deactivate the account and render all of its content inaccessible. YouTube didn’t restore access to the videos until two months later, after a lengthy intervention by EFF.”

In its lawsuit, the EFF asks that the court issue an order to prevent WCAC from targeting Channel 781. Damages and attorney’s fees are being sought as well.

MuckRock.com and the potential power of crowdfunding

This interview was previously published at the Nieman Journalism Lab.

The first time I heard of Michael Morisy and MuckRock.com was in 2010, after the site was targeted by a bureaucrat working for Massachusetts Governor Deval Patrick.

It seems that MuckRock, using the state’s open records law, had obtained information about how food stamps were being used in grocery stores. The data, which did not name any individual food-stamp recipients, had been lawfully requested and lawfully obtained. But that didn’t stop said bureaucrat from threatening Morisy and his tech partner, Mitchell Kotler, with fines and even imprisonment if they refused to remove the documents from their site.

They refused. And the bureaucrat said it had all been a mistake.

Now Morisy is preparing to expand MuckRock’s mission of filing freedom-of-information requests with various government agencies and posting them online for all to see. The just-launched Freedom of the Press Foundation has identified MuckRock as one of four news organizations that will benefit from its system of crowdsourced donations. The best-known of the four is WikiLeaks.

The foundation’s board is a who’s who of media activists, including Pentagon Papers whistleblower Daniel Ellsberg, Electronic Frontier Foundation co-founder John Perry Barlow, Josh Stearns of Free Press and the journalist Glenn Greenwald, now with the Guardian.

“The Freedom of the Press Foundation can be a first step away from the edge of a cliff,” writes Dan Gillmor, author of “We the Media” and “Mediactive.” “But it needs to be recognized and used by as many people as possible, as fast as possible. And journalists, in particular, need to offer their support in every way. This is ultimately about their future, whether they recognize it or not. But it’s more fundamentally about all of us.”

What follows is a lightly edited email interview I conducted with Morisy about MuckRock, the Freedom of the Press Foundation, and what comes next.

Q: Tell me a little bit about MuckRock and its origins.

A: I’d been really frustrated that we hadn’t seen much innovation in newsgathering generated by journalistic organizations. You see lots of innovations in how stories are told, but they’ve been generated by companies like Twitter, Facebook, and Instagram — all wonderful organizations, but ones which generate news as a byproduct, and where the journalistic function is by far secondary to business considerations. My co-founder and I wanted to create a startup where creating news was a core part of the business, and where the news was both user-generated and -directed as well as verified.

Since requests on MuckRock come from — and are paid for by — our users, we are able to align our business and editorial goals almost perfectly. We don’t sell advertising, we don’t put up paywalls. We just help people investigate the issues they want to, and then share those results with the world.

We’ve now been growing as a business and as an editorial operation for three years, with a part-time news editor and two fantastic interns.

Q: What sorts of projects are you involved in today?

A: Our biggest project to date is a partnership with the Electronic Frontier Foundation (EFF) called the Drone Census, which has broken a lot of major stories around the country. We let anyone submit an agency’s information and then we follow up with a public records request. So far we’ve submitted 263 requests to state, local, and federal agencies, the vast majority of which were suggested by the public. And it’s helped shed more light on a program that police departments and drone manufacturers are very purposefully keeping quiet.

We’ve also gotten to cover some really interesting local stories, such as getting the late Boston mayor Kevin White’s FBI file and taking an inside look at the timing of a drug raid, as well as national stories.

Q: What is the nature of your relationship with the Boston Globe?

A: MuckRock was invited to be part of the Globe Lab’s incubator program a little over a year ago. We’ve received free office space and, most important, a good mailbox to receive the dozens of responses we get back every day. It’s also given us a chance to bounce ideas back and forth with their technology and editorial teams, and we’re in the early stages of a collaborative project with them.

They also recently launched The Hive, a section focused on startups in the Boston area. Given my experience running one and my editorial background, when they were looking for someone to manage and report for that section, I was a natural fit and thrilled to be invited to cover startups in the area. It’s a dream job, and it means I now have two desks, and often wear two hats inside the same building.

Q: How did you get involved in the Freedom of the Press Foundation?

A: Trevor Timm has been our main point of contact with the EFF working on the drone project, and he’s been absolutely great to work with. He reached out to us about a week ago and said that he was working on a new venture to help crowdfund investigative journalism projects, and we were honored to be thought of. It turns out he is the executive director of the Freedom of the Press Foundation, so we got lucky to be working with the right people.

Q: Do you have a goal for how much money you’re hoping to raise through the foundation? What kinds of projects would you like to fund if you’re successful?

A: We’re kind of going into this with an open mind and a hopeful heart. Any amount raised is greatly appreciated, but this will help jumpstart several new projects similar in size and scope to the drone effort, which has had an amazing response, including nods from the New York Times and many other outlets. It may also give us the flexibility to fund important stories that maybe are not as sexy. We were really interested in funding an investigation into MBTA price jumps for the disabled, for example, but our crowdfunding efforts on Spot.us were essentially dead on arrival. Having a reserve will allow us to take gambles on stories like that without having to choose between making rent and breaking news.