The Internet Archive faces a new threat: Wary publishers who opt out to stop scraping by AI bots

Ruins of the Library of Pantainos in Athens, Greece. Photo (cc) 2018 by Michael Kogan.

Has the Internet Archive reached the end of the line? The 30-year-old nonprofit, which has saved and made searchable more than a trillion webpages, has proved itself to be of enormous value over the years.

I’ve used it to track changes in reporting, including this blog post about The New York Times’ shifting coverage of an explosion at Ahli Arab Hospital in Gaza City in the days after Hamas’ October 2023 terrorist attack on Israel. The Times and other news organizations initially reported that Israeli forces had bombed the hospital, but they later had to walk back that unverified claim.

Follow my Bluesky newsfeed for additional news and commentary. And please join my Patreon for just $6 a month. You’ll receive a supporters-only newsletter every Thursday.

The Internet Archive is also home to The Boston Phoenix’s digital and print archives thanks to an agreement with Northeastern University, which acquired the Phoenix’s intellectual property after the legendary alt-weekly went out of business in 2013. (Note: I was a longtime staff columnist for the Phoenix, and I helped arrange the donation to Northeastern.)

Now, though, the Internet Archive and its Wayback Machine, which reproduces web content from years past, are facing an existential threat. News organizations ranging from the Times to USA Today are inserting code into their sites that blocks the Archive from crawling their content, mainly to prevent AI companies from accessing their journalism without permission.

As Kate Knibbs reports for Wired, the irony is that USA Today recently published an important piece of investigative journalism documenting ICE detention statistics that wouldn’t have been possible without the Archive. Knibbs writes:

According to analysis by the artificial-intelligence-detection startup Originality AI, 23 major news sites are currently blocking ia_archiverbot, the web crawler commonly used by the Internet Archive for the Wayback project. The social platform Reddit is too. Other outlets are limiting the project in different ways: The Guardian does not block the crawler, but it excludes its content from the Internet Archive API and filters out articles from the Wayback Machine interface, which makes it harder for regular people to access archived versions of its articles.
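Blocking of this kind is typically done through the robots exclusion protocol: a site lists the crawler’s user-agent in its robots.txt file and disallows it. Here’s a minimal sketch of what such an entry might look like — the user-agent token shown, ia_archiver, is the Archive’s longtime crawler name, but the exact token and rules vary from publisher to publisher:

```text
# Hypothetical robots.txt entry telling the Internet Archive's
# crawler not to fetch any pages on the site
User-agent: ia_archiver
Disallow: /
```

A couple of lines like these are all it takes to drop a site out of the Wayback Machine’s future record.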

The Electronic Frontier Foundation, which is helping to lead a signature drive in support of the Archive, compares the publishers’ actions to “a newspaper publisher announcing it will no longer allow libraries to keep copies of its paper,” according to a recent EFF article by Joe Mullin, who writes:

For nearly three decades, historians, journalists, and the public have relied on the Internet Archive to preserve news sites as they appeared online. Those archived pages are often the only reliable record of how stories were originally published. In many cases, articles get edited, changed, or removed—sometimes openly, sometimes not. The Internet Archive often becomes the only source for seeing those changes. When major publishers block the Archive’s crawlers, that historical record starts to disappear.

This is not the first time the Archive has run into trouble. One major legal challenge was of its own making: a project begun during the COVID pandemic to make books available for free without permission and without any compensation to publishers or authors. Not surprisingly, the Archive lost that case in a federal appeals court in 2024. As I wrote in describing that decision: “The Archive claimed that it was in compliance with copyright law because it limited e-book borrowing to correspond with physical books that it had in its collection or that was owned by one of its partner libraries. That’s not the way it works, though.”

The current threat involves publishers’ legal right to make their content available as they see fit. They are under no obligation to let the Internet Archive repurpose it. Ideally, they will come to understand the incalculable damage they are doing.

As EFF’s Mullin puts it: “There are real disputes over AI training that must be resolved in courts. But sacrificing the public record to fight those battles would be a profound, and possibly irreversible, mistake.”

The Boston Globe ends its use of the AI tool Nota after Poynter reports that it plagiarizes

Photo (cc) 2018 by Dan Kennedy.

Angela Fu of Poynter Online published a story on Thursday that’s been rocketing around media circles. Her lead: “Artificial intelligence company Nota — whose clients include organizations like The Boston Globe and the Institute for Nonprofit News — is scrapping its network of local news sites after learning that they contained dozens of instances of plagiarism.”

You should read Fu’s story in full. The gist of it is that the AI tool was supposed to scrape press releases and official information but had been grabbing news content as well. “Poynter found more than 70 stories dating back to October that included reporting, writing and photography from local journalists without attribution,” she writes. “Some of the copied material came from outlets owned by Nota’s own clients.”

Earlier today, several trusted sources passed along a memo sent to the Globe’s newsroom assuring the staff that the paper was not part of the specific experiment at issue but telling everyone to stop using Nota anyway.

Here is the text of the email, which is from editor Brian McGrory; Shira Center, vice president for innovation and strategic initiatives; Cynthia Needham, deputy managing editor for innovation and strategy; Matt Karolian, vice president of platforms and AI; and Heather Ciras, deputy managing editor for audience.

Poynter published a report yesterday about Nota, an AI tool used by the Globe and many other newsrooms across the country. The story said that a Nota experiment involving AI-driven hyperlocal news resulted in stories that were clearly plagiarized from other local news organizations.

The Globe was not part of this experiment, which was aimed at small counties in other states. We’ve worked with Nota on SEO, headline recommendations, related metadata, and social platform suggestions for Globe stories. The Globe’s contract with Nota prohibits it from using our journalism to train its AI model.

That said, what happened here does not fit with our values, and we are asking everyone to stop using this product while we wait for Nota to turn off the service and end our contract. We have other strong options for this work that we’re exploring.

My media ethics students express some surprisingly skeptical views about AI and journalism

1930 photo (cc) via the German Federal Archives.

My colleagues and I are engaged in the convoluted, ever-shifting process of figuring out how to use artificial intelligence in journalism in ways that are both productive and ethical. Somewhere between “Let students use AI to write their stories” and “We should forbid all uses of AI,” there is a reasonable approach, and we’re all trying to figure out what that is.

Our students learn from us. We learn from our students. Keep in mind, though, that we have not yet seen what you might call “AI natives” in our classrooms. Young people in their late teens and early 20s were part of the before times. In the not-too-distant future, though, we’ll start seeing students who can’t remember a world without ChatGPT, Claude and the rest.

Recently I devoted a class to AI in my graduate ethics seminar. It’s a small group of five students, one of whom is an advanced undergrad. I was surprised to learn that they are as skeptical of AI as I am.

Read the rest at Poynter Online.

Local doesn’t scale: How community publishers can survive and thrive in the AI era

The New Haven Independent newsroom. Photo (cc) 2021 by Dan Kennedy.

Folks who work at finding solutions to the local news crisis are understandably frustrated at what a difficult slog it can be. Earlier this week, Elizabeth Hansen Shapiro, the former executive director of the National Trust for Local News, gave Richard J. Tofel a preview of a report she’s written for Press Forward and said, “I think the challenges now are so systemic that the only way to do responsible, impactful funding going forward is to look at system solutions rather than newsroom-based ones.”


I’m looking forward to reading Hansen Shapiro’s report. (She’s featured in our book, “What Works in Community News,” and has been on our podcast.) And yet there really is no substitute for solving this problem one community at a time. For all the talk you hear about scale, that’s really not the way to go unless you’re talking about obvious things like finding a common tech platform so that every local news publisher doesn’t have to reinvent the wheel — or, in this case, the content management system. In the early days of the hyperlocal news movement, a group of publishers got together and formed an organization called Authentically Local. Its spot-on message: “Local Doesn’t Scale.”

Continue reading “Local doesn’t scale: How community publishers can survive and thrive in the AI era”

On the latest ‘Beat the Press,’ we look at war coverage, a Trump-friendly media monopoly, AI and more


On the brand-new edition of “Beat the Press with Emily Rooney,” we analyze media coverage of the war against Iran.

In other topics, we examine the implications of Paramount’s acquisition of Warner Bros. Discovery, which will put CNN in the hands of Trump-friendly executives Larry and David Ellison, and the failure of Bari Weiss — who may soon be running CNN in addition to CBS News — to hang on to a Jeffrey Epstein associate. We also give the hairy eyeball to AI’s ongoing encroachment into journalism and weigh in with our Rants and Raves.

“Beat the Press” is hosted by Scott Van Voorhis’ newsletter, Contrarian Boston. With Emily, Scott, Lylah Alphonse of The Boston Globe and me, expertly produced by Tonia Magras of Hull Bay Productions.

Dale Anglin tells us how Press Forward is leveraging local news to build community

Dale Anglin at the recent Knight Media Forum in Miami.

On the latest “What Works” podcast, Ellen Clegg and I talk with Dale Anglin, the inaugural executive director of Press Forward, a philanthropic effort that is dedicated to funding local news initiatives nationwide.

Before she was named leader of Press Forward, Anglin served as a vice president for grantmaking at the Cleveland Foundation, where she also led the foundation’s journalism strategy. Then and now, she focuses on local news and information as a way to restore a sense of community.

I’ve got a Quick Take on The Baltimore Banner, one of the most prominent nonprofit digital startups. It looks like readers of The Washington Post who live in the DC area may not be deprived of local news and sports after all despite the recent deep cuts ordered by its billionaire owner, Jeff Bezos. The Banner is expanding, and it’s part of executive editor Audrey Cooper’s mission to build civic engagement through community journalism.

Ellen’s Quick Take is on a bill in New York state that attempts to put some guardrails around the use of artificial intelligence in newsrooms. Among other things, it would require disclosures and mandate supervision and fact-checking by actual human editors. It received a hearty endorsement from journalism industry unions. But there’s a lot of catching up to do to rein in the robots.

You can listen to our conversation here, or you can subscribe through your favorite podcast app.

An AI-boosting editor blasts j-schools as being out of touch. The reality is more complex.

1914 photo via Cleveland Historical.

A prominent editor has unleashed a scathing attack on journalism schools for what he claims is their retrograde attitude toward artificial intelligence. Since the editor, Chris Quinn of Cleveland.com and The Plain Dealer, is invading my turf, I thought I’d take a look at what he has to say and offer some context.


Quinn begins his recent “Letter from the Editor” column with an anecdote about a recent college graduate who turned down a job because of the way Quinn’s publications use AI. Increasingly, they ask reporters to do nothing but report, turning over their notes to be transformed into news stories by AI, with human editors looking them over to make sure the final product is accurate and coherent.

Continue reading “An AI-boosting editor blasts j-schools as being out of touch. The reality is more complex.”

A copyright expert’s big idea: Force Google and other AI companies to pay news publishers

Photo (cc) 2014 by Anthony Quintano.

Journalism faces yet another tech-driven crisis: AI-powered Google search deprives news publishers of as much as 30% to 40% of their web traffic as users stay on Google rather than following the links. What’s more, users of other AI chatbots, such as ChatGPT and Claude, can get the news without clicking through as well. Now an expert on copyright and licensing has come up with a possible solution.


Paul Gerbino, president of Creative Licensing International, writes that publishers need to move away from negotiating one-time deals with AI companies to scrape their content for training purposes. Instead, Gerbino says, they should push for a system by which they will be compensated for the use of their content on a recurring basis, whether through per-use fees or subscriptions. As Gerbino puts it:

Training is a singular, non-recurring event that offers only a front-loaded burst of revenue. It possesses no capacity to scale or recur at the level required to effectively sustain the complex and costly operation of the publishing industry….

The singular, non-negotiable strategic imperative for every publisher is to execute a complete and fundamental pivot from the outdated mindset of “sell content once” to the forward-looking, sustainable model of “monetize access forever.”

It’s a fascinating idea, although we should be cautious given that forcing Google and other platforms to pay for the news they repurpose hasn’t gone much of anywhere over the years. When such schemes have been implemented, they’ve been hampered by unexpected consequences, such as threats to remove all links to news sources. It’s not clear why Google would suddenly flip because it’s now using AI.

Gerbino acknowledges this, arguing that publishers should negotiate with the AI companies collectively, observing: “Individual publishers operating alone possess negligible leverage against the behemoths of the AI industry. Collective frameworks represent the only viable path to successful negotiation.” But that may require passage of a law so that the publishers don’t run afoul of antitrust law.

Gerbino also says that publishers need to develop paywalls that are impervious to AI. Not all of them are.

The possibility that a substantial part of the news audience will never move beyond AI-generated results — no matter how wrong they may be — represents a significant threat to publishers, who are already dealing with the challenge of finding a path to sustainability in a post-advertising world.

Gerbino has laid out some interesting proposals on how to extract revenues from AI companies, which may represent the biggest threat to news since the internet flickered into view more than 30 years ago. It remains to be seen, though, whether his ideas will form the basis for action — or if, instead, they will simply fade into the ether.

A new lawsuit takes aim at Google’s ad monopoly just as the AI train is leaving the station

Photo (cc) 2014 by Anthony Quintano.

There’s an old saying — no doubt you’ve heard it — that justice delayed is justice denied. And so it is with the news business’ longstanding lament that Google engages in monopolistic practices aimed at driving down the value of digital advertising. Gilad Edelman, writing for The Atlantic, describes it this way:

If the story of journalism’s 21st-century decline were purely a tale of technological disruption — of print dinosaurs failing to adapt to the internet — that would be painful enough for those of us who believe in the importance of a robust free press. The truth hurts even more. Big Tech platforms didn’t just out-compete media organizations for the bulk of the advertising-revenue pie. They also cheated them out of much of what was left over, and got away with it.

The Atlantic is among a number of media organizations that filed suit against Google this month. I’m kind of stunned that they are only suing now, because the issue they’ve identified goes back many years. As Charlotte Tobitt reports for the Press Gazette, the federal lawsuit was brought earlier this month by The Atlantic as well as Penske Media Corp., which owns Rolling Stone and She Media; Advance Publications, whose holdings include Condé Nast; Vox Media, owner of The Verge; and the newspaper chain McClatchy, whose papers include the Miami Herald, The Kansas City Star and The Sacramento Bee.

Continue reading “A new lawsuit takes aim at Google’s ad monopoly just as the AI train is leaving the station”

My Northeastern ethics students offer some ideas on practicing journalism in the AI era

Photo by Carlos López via Pixabay.

The Society of Professional Journalists’ Code of Ethics encompasses four broad principles:

    • Seek Truth and Report It
    • Minimize Harm
    • Act Independently
    • Be Accountable and Transparent

Each principle is accompanied by multiple bullet points, which in turn link to background information. But those are the starting points, and I think they provide a good rough guide for how to practice ethical journalism.

Whenever I teach one of our ethics classes, I ask my students to come up with a fifth principle as well as some explanatory material. This semester, I’m teaching our graduate ethics seminar. It’s a small class — five grad students and one undergrad. Last week I divided them into three teams of two and put them to work. Here’s what they came up with. (Longtime readers of Media Nation will recognize this exercise.) I’ve done a little editing, mainly for parallel construction.

Practice Digital Diligence

  • Utilize AI for structural purposes such as transcribing interviews, searching for sources and entering data.
  • Disclose the use of AI software when publishing artificial creations.
  • Give credit by providing hyperlinks to other journalistic sources.
  • Gain verification status on social platforms for credibility purposes.
  • Do not engage with negative comments on social media posts.
  • Engage with subscribers who might use social media to ask questions about a story.
  • Apply AP style to social media posts.
  • Give credit to any artists whose work you might borrow. Respect copyright law.

Use Modern Resources Responsibly

  • Use social media and other digital tools, such as comment sections, to crowdsource information, connect with others and distribute news in a more accessible way.
  • Do not use these tools to engage in ragebait or to get tangled in messy and unproductive discourse online.
  • Acceptable uses of AI include gathering information, reformatting your reporting, transcribing interviews and similar non-public-facing tasks.
  • AI should be used to guide your reporting rather than to replace it.

Be Compassionate

  • Treat sources and communities with empathy and care.
  • Avoid misleading sources or providing false hope — for instance, don’t promise someone who is suffering that you’ll be able to give them assistance.
  • Do not exploit a source’s lack of media training. Provide a detailed explanation of your reporting methods when warranted.
  • Avoid using jargon both in interacting with sources and in producing a story.
  • Be a human first. If that clashes with your role as a journalist, the journalism should come second.

***

In addition to their work on extending the Code of Ethics, I asked them on the first day of class to name one significant ethical issue that they think faces journalism. What follows is my attempt to summarize a longer conversation that we had in class.

► Stand up for our independence as journalists

► Explore and define the role of AI and truth in journalism

► Make sure we include a range of perspectives

► Push back against fake news, ragebait, etc.

► Avoid passive voice that evades responsibility

► Move beyond our preconceptions in pursuit of the truth

I hope you’ll agree that this is good, thought-provoking stuff. I can’t wait to see how the rest of the semester will go.
