The Internet Archive faces a new threat: Wary publishers who opt out to stop scraping by AI bots

Ruins of the Library of Pantainos in Athens, Greece. Photo (cc) 2018 by Michael Kogan.

Has the Internet Archive reached the end of the line? The 30-year-old nonprofit, which has saved and made searchable more than a trillion webpages, has proved itself to be of enormous value over the years.

I’ve used it to track changes in reporting, including this blog post about The New York Times’ shifting coverage of an explosion at Ahli Arab Hospital in Gaza City in the days after Hamas’ October 2023 terrorist attack on Israel. The Times and other news organizations initially reported that Israeli forces had bombed the hospital, but they later had to walk back that unverified claim.

The Internet Archive is also home to The Boston Phoenix’s digital and print archives, thanks to an agreement with Northeastern University, which acquired the Phoenix’s intellectual property after the legendary alt-weekly went out of business in 2013. (Note: I was a longtime staff columnist for the Phoenix, and I helped arrange the donation to Northeastern.)

Now, though, the Internet Archive and its Wayback Machine, which reproduces web content from years past, are facing an existential threat. News organizations ranging from the Times to USA Today are inserting code into their sites that blocks the Archive from crawling their content, mainly to prevent AI companies from accessing their journalism without permission.
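For those curious about the mechanics: the blocking code is typically a robots.txt file at the root of the site. A minimal sketch, assuming a publisher wants to turn away the Archive’s crawler by the ia_archiver user-agent token it has traditionally announced, looks like this:

User-agent: ia_archiver
Disallow: /

Crawlers that honor the robots exclusion protocol will then skip the site entirely, though it’s a request rather than a technical barrier.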

As Kate Knibbs reports for Wired, the irony is that USA Today recently published an important piece of investigative journalism documenting ICE detention statistics that wouldn’t have been possible without the Archive. Knibbs writes:

According to analysis by the artificial-intelligence-detection startup Originality AI, 23 major news sites are currently blocking ia_archiverbot, the web crawler commonly used by the Internet Archive for the Wayback project. The social platform Reddit is too. Other outlets are limiting the project in different ways: The Guardian does not block the crawler, but it excludes its content from the Internet Archive API and filters out articles from the Wayback Machine interface, which makes it harder for regular people to access archived versions of its articles.

The Electronic Frontier Foundation, which is helping to lead a signature drive in support of the Archive, compares the publishers’ actions to “a newspaper publisher announcing it will no longer allow libraries to keep copies of its paper,” according to a recent EFF article by Joe Mullin, who writes:

For nearly three decades, historians, journalists, and the public have relied on the Internet Archive to preserve news sites as they appeared online. Those archived pages are often the only reliable record of how stories were originally published. In many cases, articles get edited, changed, or removed—sometimes openly, sometimes not. The Internet Archive often becomes the only source for seeing those changes. When major publishers block the Archive’s crawlers, that historical record starts to disappear.

This is not the first time the Archive has run into legal problems. One major challenge was of its own making: a project begun during the COVID pandemic to make books available for free without permission and without any compensation to publishers or authors. Not surprisingly, the Archive lost that case in a federal appeals court in 2024. As I wrote in describing that decision: “The Archive claimed that it was in compliance with copyright law because it limited e-book borrowing to correspond with physical books that it had in its collection or that were owned by one of its partner libraries. That’s not the way it works, though.”

The current threat involves publishers’ right to make their content available as they see fit, which they are legally entitled to do. They are under no obligation to let the Internet Archive repurpose it. Ideally, they will come to understand the incalculable damage they are doing.

As EFF’s Mullin puts it: “There are real disputes over AI training that must be resolved in courts. But sacrificing the public record to fight those battles would be a profound, and possibly irreversible, mistake.”

The Boston Globe ends its use of the AI tool Nota after Poynter reports that Nota’s news sites plagiarized local journalism

Photo (cc) 2018 by Dan Kennedy.

Angela Fu of Poynter Online published a story on Thursday that’s been rocketing around media circles. Her lead: “Artificial intelligence company Nota — whose clients include organizations like The Boston Globe and the Institute for Nonprofit News — is scrapping its network of local news sites after learning that they contained dozens of instances of plagiarism.”

You should read Fu’s story in full. The gist of it is that the AI tool was supposed to scrape press releases and official information but had been grabbing news content as well. “Poynter found more than 70 stories dating back to October that included reporting, writing and photography from local journalists without attribution,” she writes. “Some of the copied material came from outlets owned by Nota’s own clients.”

Earlier today, several trusted sources passed along a memo sent to the Globe’s newsroom assuring the staff that the paper was not part of the specific experiment at issue and directing everyone to stop using Nota.

Here is the text of the email, which is from editor Brian McGrory; Shira Center, vice president for innovation and strategic initiatives; Cynthia Needham, deputy managing editor for innovation and strategy; Matt Karolian, vice president of platforms and AI; and Heather Ciras, deputy managing editor for audience.

Poynter published a report yesterday about Nota, an AI tool used by the Globe and many other newsrooms across the country. The story said that a Nota experiment involving AI-driven hyperlocal news resulted in stories that were clearly plagiarized from other local news organizations.

The Globe was not part of this experiment, which was aimed at small counties in other states. We’ve worked with Nota on SEO, headline recommendations, related metadata, and social platform suggestions for Globe stories. The Globe’s contract with Nota prohibits it from using our journalism to train its AI model.

That said, what happened here does not fit with our values, and we are asking everyone to stop using this product while we wait for Nota to turn off the service and end our contract. We have other strong options for this work that we’re exploring.

My media ethics students express some surprisingly skeptical views about AI and journalism

1930 photo (cc) via the German Federal Archives.

My colleagues and I are engaged in the convoluted, ever-shifting process of figuring out how to use artificial intelligence in journalism in ways that are both productive and ethical. Somewhere between “Let students use AI to write their stories” and “We should forbid all uses of AI,” there is a reasonable approach, and we’re all trying to figure out what that is.

Our students learn from us. We learn from our students. Keep in mind, though, that we have not yet seen what you might call “AI natives” in our classrooms. Young people in their late teens and early 20s were part of the before times. In the not-too-distant future, though, we’ll start seeing students who can’t remember a world without ChatGPT, Claude and the rest.

Recently I devoted a class to AI in my graduate ethics seminar. It’s a small group of five students, one of whom is an advanced undergrad. I was surprised to learn that they are as skeptical of AI as I am.

Read the rest at Poynter Online.

How Claude AI helped improve the look and legibility of Media Nation

Public domain illustration via Pixabay.

For quite a few years I used WordPress’ indent feature for blockquotes rather than the actual blockquote command. The reason was that blockquotes in the theme that I use (Twenty Sixteen) were ugly, with type larger than the regular text (the opposite of what you would see in a book or a printed article) and in italics.

But then I noticed that indents didn’t show up at all in posts that went out by email, leading to confusion among my subscribers — that is, my most engaged readers. I decided to find out if I could modify the blockquote feature. WordPress allows you to add custom CSS to your theme, but I know very little about how to use CSS. I could have asked in a WordPress forum, but I tried to see if I could get an answer from AI instead.

Northeastern has given us all access to the enterprise version of Claude, Anthropic’s AI platform. It’s a mixed blessing, although I’ve found that it’s very good as a search engine — often better than Google, which is now also glopped up by AI. I simply make sure I ask Claude to add the underlying links to its answer so I don’t get taken in by hallucinations. But Claude is also known for being quite good at coding. What I needed was low-level, so I thought maybe it could help.

Indeed it could. I began by asking, “In the Twenty Sixteen WordPress theme, how can I change the CSS so that blockquotes do not appear in italics?” Claude provided me with several options; I chose the simplest one, which was a short bit of custom CSS that I could add to my theme:

blockquote {
     font-style: normal; /* override the theme’s italic blockquote styling */
}

It worked. A subsequent query enabled me to make the blockquote type smaller. Then, just last week, I noticed that any formatting inside blockquotes was stripped out. For instance, a recent memo from Boston Globe Media CEO Linda Henry contained boldface and italicized text, which did not appear when I reproduced her message. The formatting code was there; it just wasn’t visible. Claude produced CSS rules that overrode the theme’s styling. You can see the results here, with bold and italic type just as Henry had it in her message.
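For anyone who wants to try something similar, here is a minimal sketch of the kind of custom CSS involved. The selectors and values are illustrative assumptions on my part, not Claude’s exact output:

blockquote {
     font-style: normal;  /* no italics */
     font-size: 0.9em;    /* slightly smaller than body text */
}

/* re-assert inline emphasis that a theme’s own rules can suppress */
blockquote em, blockquote i {
     font-style: italic;
}
blockquote strong, blockquote b {
     font-weight: bold;
}

In WordPress, rules like these go under Appearance → Customize → Additional CSS, where they are applied on top of the theme’s stylesheet.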

I make some light use of AI in my other work. When I need to transcribe an audio interview, I use Otter, which is powered by AI. I’ve experimented with using AI to compile summaries from transcripts and even (just for my own use) an actual news story. Very occasionally I’ve used AI to produce illustrations for this blog, which seems to draw more objections than other AI applications, probably because it’s right in people’s faces.

Just the other day, someone complained to me on social media that she was not going to visit a local news outlet I had mentioned because she had encountered an AI-produced illustration there. When I asked why, she replied that it was because AI relies on plagiarism. Oh, I get it. Sometime this year I’m hoping to receive $3,000 as my share of a class-action lawsuit against Anthropic because one of my books, “The Return of the Moguls,” was used to train Claude.

And let’s not overlook the massive amounts of energy required to power AI. On a recent New York Times podcast, Ezra Klein and his guests observed that AI is deeply unpopular with the public (sub. req.), even though people are using it, because all they really know is that it’s going to take away jobs and drive up electricity costs.

But AI isn’t going anywhere, and if we’re going to use it (and we are, even if we try to avoid it), we need to find ways to do so ethically and responsibly.

In a lawsuit against Meta, the state’s highest court will rule on the limits of Section 230

Attorney General Andrea Campbell. Photo (cc) 2022 by Dan Kennedy.

Section 230 of the Communications Decency Act of 1996 protects website owners from liability over third-party content. The classic example would be an anonymous commenter who libels someone. The offended party would be able to sue the commenter but not the publishing platform, although the platform might be required to turn over information that would help identify the commenter.

But where is the line between passively hosting third-party content and actively promoting certain types of material in order to boost engagement and, thus, profitability? That question will go before the Massachusetts Supreme Judicial Court on Friday, reports Jennifer Smith of CommonWealth Beacon.

At issue is a lawsuit brought against Meta by 42 state attorneys general, including Andrea Campbell of Massachusetts. Meta operates Facebook, Instagram, Threads and other social media platforms, and it has long been criticized for using algorithms and other tactics that keep users hooked on content that, in some cases, provokes anger and depression, even suicide. Smith writes:

The Massachusetts complaint alleges that Meta violated state consumer protection law and created a public nuisance by deliberately designing Instagram with features like infinite scroll, autoplay, push notifications, and “like” buttons to addict young users, then falsely represented the platform’s safety to the public. The company has also been reckless with age verification, the AG argues, and allowed children under 13 years old to access its content.

Meta and its allies counter that Section 230 protects not just the third-party content they host but also how Facebook et al. display that content to their users.

In an accompanying opinion piece, attorney Megan Iorio of the Electronic Privacy Information Center, computer scientist Laura Edelson of Northeastern University and policy analyst Yaël Eisenstat of Cybersecurity for Democracy argue that Section 230 was not designed to protect website operators from putting their thumbs on the scales to favor one type of third-party content over another. As they put it in describing the amicus brief they have filed:

Our brief explains how the platform features at the heart of the Commonwealth’s case — things like infinite scroll, autoplay, the timing and batching of push notifications, and other tactics borrowed from the gambling industry — have nothing to do with content moderation; they are designed to elicit a behavior on the part of the user that furthers the company’s own business goals.

As Smith makes clear, this is a long and complex legal action, and the SJC is being asked to rule only on the narrow question of whether Campbell can move ahead with the lawsuit to which she has lent the state’s support. (Double disclosure: I am a member of CommonWealth Beacon’s editorial advisory board as well as a fellow Northeastern professor.)

I’ve long argued (as I did in this GBH News commentary from 2020) that, just as a matter of logic, favoring some types of content over others is a publishing activity that goes beyond the mere passive hosting of third-party content, and thus website operators should be liable for whatever harm those decisions create. That argument has not found much support in the courts, however. It will be interesting to see how this plays out.

How Margaret Sullivan’s slip of the tongue became (briefly) an AI-generated ‘fact’

Paul Krugman and Margaret Sullivan. Photo via Paul Krugman’s newsletter.

Media critic Margaret Sullivan made an error recently. No big deal — we all do it. But her account of what happened next is worth thinking about.

First, the error. Sullivan writes in her newsletter, American Crisis, that she recently appeared on economist Paul Krugman’s podcast and said that Los Angeles Times owner Patrick Soon-Shiong was among the billionaires who joined Donald Trump at his second inauguration earlier this year, along with the likes of Mark Zuckerberg, Jeff Bezos and Elon Musk. “I was wrong about that,” she notes, although she adds that Soon-Shiong “has been friendly to Trump in other ways.” Then she writes:

But — how’s this for a cautionary tale about the dubious accuracy of artificial intelligence? — a Google “AI overview,” in response to a search, almost immediately took my error and spread it around: “Yes, Dr. Patrick Soon-Shiong attended Donald Trump’s inauguration in 2025. He was seen there alongside other prominent figures like Mark Zuckerberg and Jeff Bezos.” It cited Krugman’s and my conversation. Again, I was wrong and I regret the error.

It does appear that the error was corrected fairly quickly. I asked Google this morning and got this from AI: “Patrick Soon-Shiong did not attend Donald Trump’s second inauguration. Earlier reports and AI overviews that claimed he did were based on an error by a journalist who later issued a correction.” It links to Sullivan’s newsletter.

Unlike Google, Claude makes no mention of Sullivan’s original mistake, concluding, accurately: “While the search results don’t show Patrick Soon-Shiong listed among the most prominent billionaires seated in the Capitol Rotunda (such as Musk, Bezos, Zuckerberg, and others who received extensive coverage), the evidence suggests he was engaged with the inauguration events and has maintained a relationship with Trump’s administration.”

And here’s the verdict from ChatGPT: “I found no credible public evidence that Patrick Soon-Shiong attended Donald Trump’s second inauguration.”

You might cite my findings as evidence that AI corrects mistakes quickly, and in this case it did. (By the way, the error has not yet been corrected at Krugman’s site.) But a less careful journalist than Sullivan might have let the original error hang out there, and it would soon have become part of the established record of who did and didn’t pay homage to Trump on that particular occasion.

In other words: always follow your queries back to the source.

Surveillance cameras in Brookline, Mass., raise serious questions about civil liberties

Photo (cc) 2014 by Jay Phagan.

The surveillance state has come to Brookline, Massachusetts. Sam Mintz reports for Brookline.News that Chestnut Hill Realty will set up license-plate readers on Independence Drive near Hancock Village, located in South Brookline, on the Boston border. The readers are made by Flock Safety, which is signing an agreement allowing the Brookline Police Department to use the data; the data will also be made available to the Boston Police.

Two months ago I wrote about a campaign to keep Flock out of the affluent community of Scarsdale Village, New York. The story was covered by Scarsdale 10583, a local news startup, and after months of rising opposition the contract was canceled. Unfortunately, Scarsdale Village is the exception: Flock Safety, a $7.5 billion company, has a presence in 5,000 communities in 49 states as well as a reputation for secretive dealings with local officials.

Adam Gaffin of Universal Hub reports that the state’s Supreme Judicial Court ruled in 2020 that automated license-plate readers are legal in Massachusetts. Gaffin also notes that, early this year, police in Johnson County, Texas, used data from 83,000 Flock cameras across the U.S. in a demented quest to track down a woman they wanted to arrest for a self-induced abortion. Presumably Texas authorities could plug into the Brookline network with Flock’s permission.

Mintz notes in his Brookline.News story that Flock recently opened an office in Boston and that its data has been used by police in dozens of Massachusetts communities. He also quotes Kade Crockford of the ACLU of Massachusetts as saying that though such uses of Flock data as identifying stolen cars or assisting with Amber Alerts aren’t a problem, “Unregulated, this technology facilitates the mass tracking of every single person’s movements on the road.”

The cameras could also be used by ICE in its out-of-control crackdown on undocumented (and, in some cases, documented) immigrants. This is just bad news all around; it’s hard to imagine that members of the public would support it if they knew about it.

Google appears to be throttling AI searches about Trump’s obviously addled mental state

Be careful what you search for.

Google appears to be throttling AI searches related to Donald Trump’s obviously addled mental state. Jay Peters reports (sub. req.) in The Verge:

There’s been a lot of coverage of the mental acuity of both President Trump and President Biden, who are the two oldest presidents ever, so it’s reasonable to expect that people might query Google about it. The company may be worried about accurately presenting information on a sensitive subject, as AI overviews remain susceptible to delivering incorrect information. But in this case, it may also be worried about the president’s response to such information. Google agreed this week to pay $24.5 million to settle a highly questionable lawsuit about Trump’s account being banned from YouTube.

I wanted to see if I could reproduce Peters’ results, and sure enough, Google is still giving Trump special treatment, even though Peters’ embarrassing story was published two days ago. I searched “is trump showing signs of dementia” in Google’s “All” tab, which these days will generally give you an AI-generated summary before getting to the links. Instead, I got nothing but links. The same thing happened when I switched to “AI Mode.”

Next I searched for “is biden showing signs of dementia” in the “All” tab. As with Trump, I got nothing but links — no AI summary at the top. But when I switched to “AI Mode,” I got a detailed AI summary that begins:

In response to concerns and observations about President Joe Biden’s cognitive abilities, a range of opinions and reports have emerged. It’s important to note that diagnosing dementia or cognitive decline requires a formal medical assessment by qualified professionals.

I have mixed feelings about AI searches, though, like many people, I make use of them — always checking the citations to make sure I’m getting accurate information. But as Peters observes, it looks like Google is flinching.

Seems like old times: Facebook is once again inflicting harm on the rest of us, this time using AI

This AI image of “Big sis Billie” was generated by Meta AI at the prompting of a Reuters journalist.

There was a time when it seemed like every other week I was writing about some terrible thing we had learned about Facebook or one of Meta’s other platforms.

There was Facebook’s complicity in the genocide of the Rohingya people in Myanmar. Or the Cambridge Analytica scandal, in which the personal data of millions of people on Facebook was hoovered up so that Steve Bannon could target political ads to them. Or Instagram’s ties to depression among teenage girls.

Now Jeff Horwitz, who uncovered much of Facebook’s nefarious behavior when he was at The Wall Street Journal, is back with an in-depth report for Reuters on how Meta’s use of artificial intelligence led to the accidental death of a mentally disabled man and how it’s being used to seduce children as well.

The man, a 76-year-old stroke survivor named Thongbue Wongbandue, suffered fatal injuries when he fell while running for a train so that he could meet his AI-generated paramour, “Big sis Billie,” who had repeatedly assured Wongbandue in their online encounters that she was real.

As for interactions with children, Horwitz writes:

An internal Meta policy document seen by Reuters as well as interviews with people familiar with its chatbot training show that the company’s policies have treated romantic overtures as a feature of its generative AI products, which are available to users aged 13 and older.

“It is acceptable to engage a child in conversations that are romantic or sensual,” according to Meta’s “GenAI: Content Risk Standards.” The standards are used by Meta staff and contractors who build and train the company’s generative AI products, defining what they should and shouldn’t treat as permissible chatbot behavior. Meta said it struck that provision after Reuters inquired about the document earlier this month.

Yes, the Zuckerborg’s strategy going back many years now is to back off when caught — and then move on to some other antisocial business practice.

Ever since Elon Musk bought Twitter, and especially during Musk’s brief, chaotic stint in the Trump White House, Mark Zuckerberg has gotten something of a free pass. Just this week it was announced that Threads, a Meta product launched for users who were fleeing Twitter, now has 400 million monthly active users, making it about two-thirds as large as Twitter/X. (An independent alternative, Bluesky, trails far behind.)

Well, Zuckerberg is still out there wreaking havoc, and AI has given him (and Musk and all the rest) a new toy with which to make money while harming the rest of us.

Remember that ‘drunk Pelosi’ video? AI-powered deepfakes are making disinformation much more toxic

Should we be worried about deepfake videos? Well, sure. But I’ve tended to think that some skepticism is warranted.

My leading example is a 6-year-old video of then-House Speaker Nancy Pelosi in which we are told that she appears to be drunk. I say “we are told” because the video was simply slowed to 75% of its normal speed, and the right-wing audience for whom it was intended took that crude alteration as proof that she was loaded. Who needs deepfakes when gullible viewers will be fooled by such crap? People believe what they want to believe.

But the deepfakes are getting better. This morning I want to call your attention to a crucially important story in The New York Times (gift link) showing that deepfakes powered by artificial intelligence are causing toxic damage to the political and cultural environment around the world.

“The technology has amplified social and partisan divisions and bolstered antigovernment sentiment, especially on the far right, which has surged in recent elections in Germany, Poland and Portugal,” write reporters Steven Lee Myers and Stuart A. Thompson. A few examples:

  • Romania had to redo last year’s presidential election after a court ruled that AI-driven manipulation in favor of one of the candidates may have changed the result.
  • An AI-generated TikTok video falsely showed Donald Trump endorsing a far-right candidate in Poland.
  • Another fake video from last year’s U.S. election tied to Russia falsely showed Kamala Harris saying that Trump refused to “die with dignity.”

As with the Pelosi video, fakes have been polluting the media environment for a long time. So I was struck by something that Isabelle Frances-Wright of the Institute for Strategic Dialogue told the Times: Before AI, “you had to pick between scale or quality — quality coming from human troll farms, essentially, and scale coming from bots that could give you that but were low quality. Now, you can have both, and that’s really scary territory to be in.”

In other words, disinformation is improving in quality even as it explodes in quantity. Given that, it’s unlikely we’ll see any more Russian-generated memes of a satanic Hillary Clinton boxing with Jesus, a particularly inept example of Russian propaganda from 2016. Next time, you’ll see a realistic video of a politician pledging their eternal soul to the Dark Lord.

And since I still have a few gift links to give out before the end of the month, here’s a Times quiz with 10 videos, some of which are AI fakes and some of which are real. Can you tell the difference? I didn’t do very well.

So what can we do to protect our political discourse? I’m sure we can all agree that it’s already in shockingly bad shape, dominated by lies from Trump and his allies that are amplified on Fox News and social media. As I said, people are going to believe what they want to believe. But AI-generated deepfake videos are only going to make things that much worse.