Top Websites Take a Stand Against Apple’s AI Scraping Practices

Less than three months after Apple released a tool that lets publishers opt their content out of AI training, many prominent news organizations and social platforms have taken it up. Among them are Facebook, Instagram, Craigslist, Tumblr, The New York Times, The Financial Times, The Atlantic, Vox Media, the USA Today network, and Condé Nast, the parent company of WIRED.

The tool, Applebot-Extended, is an extension of Apple's existing web crawler that lets website owners tell Apple not to use their content for AI training. Its arrival signals a substantial shift in how web-crawling technologies, which are essential for gathering AI training data, are perceived and used: they have become a focal point in debates over intellectual property and the future direction of the internet.

Applebot-Extended is a way to respect publishers' rights, says Apple spokesperson Nadine Haija. Blocking it does not stop the original Applebot from crawling a site, which would affect how the site's content appears in Apple search experiences such as Siri and Spotlight; it only restricts that data from being used to develop Apple's AI projects, including its large language models. The tool is, in effect, a way to customize how data gathered by Apple's bots is used.

Publishers can block Applebot-Extended by updating the robots.txt file on their sites, the text file behind a long-standing protocol that dictates how bots may interact with a website's content. Robots.txt has become pivotal in setting the terms of AI web scraping, and a significant number of publishers have already updated theirs to shut out AI bots deployed by companies such as OpenAI and Anthropic.
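In practice, the opt-out is a short addition to that file. The excerpt below is a minimal illustration of what a publisher's entry might look like, assuming the goal is to withhold everything from AI training while leaving ordinary crawling untouched; the User-agent token is the name Apple assigned to the new bot:

    User-agent: Applebot-Extended
    Disallow: /

Because the original Applebot is addressed by its own separate token, an entry like this keeps a site's content out of Apple's AI training without affecting how it surfaces in Siri or Spotlight.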

Bots are not legally required to obey robots.txt directives, and historically most have honored them, but compliance is not guaranteed. A recent WIRED investigation found that the AI company Perplexity was ignoring these exclusions and covertly scraping data.

Because Applebot-Extended is still so new, only a small fraction of websites have started to block it. Originality AI, an AI-detection firm based in Ontario, Canada, found that roughly 7 percent of 1,000 high-traffic websites it surveyed block Applebot-Extended. A similar analysis by Dark Visitors found that about 6 percent of a separate set of 1,000 high-traffic sites had blocked the bot.
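To give a rough sense of how such audits work, the sketch below fetches a site's robots.txt and asks whether a given crawler may access its homepage, using Python's standard robotparser module. The domains are placeholders, and this is only an illustration, not the methodology that Originality AI, Dark Visitors, or Welsh actually use:

    # Illustrative only: check whether sample sites block a given crawler in robots.txt.
    from urllib import robotparser

    SITES = ["https://www.example.com", "https://news.example.org"]  # hypothetical domains
    BOT = "Applebot-Extended"

    for site in SITES:
        rp = robotparser.RobotFileParser()
        rp.set_url(site + "/robots.txt")
        try:
            rp.read()  # fetch and parse the file
        except Exception as exc:
            print(f"{site}: could not fetch robots.txt ({exc})")
            continue
        # RobotFileParser matches User-agent tokens loosely (substring match),
        # so treat the result as an approximation.
        allowed = rp.can_fetch(BOT, site + "/")
        print(f"{site}: {'allows' if allowed else 'blocks'} {BOT}")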

Separate research by data journalist Ben Welsh found that roughly 25 percent of the 1,167 US-based, English-language news websites he examined were blocking Applebot-Extended. That is notably lower than the 53 percent blocking OpenAI's bot and the 43 percent blocking Google-Extended, Google's newly introduced AI-specific crawler. Welsh says awareness, and with it blocking, has been "gradually moving" upward.

Welsh has an ongoing project monitoring how news outlets approach major AI agents. “A bit of a divide has emerged among news publishers about whether or not they want to block these bots,” he says. “I don’t have the answer to why every news organization made its decision. Obviously, we can read about many of them making licensing deals, where they’re being paid in exchange for letting the bots in—maybe that’s a factor.”

Last year, The New York Times reported that Apple was attempting to strike AI deals with publishers. Since then, competitors like OpenAI and Perplexity have announced partnerships with a variety of news outlets, social platforms, and other popular websites. “A lot of the largest publishers in the world are clearly taking a strategic approach,” says Originality AI founder Jon Gillham. “I think in some cases, there’s a business strategy involved—like, withholding the data until a partnership agreement is in place.”

There is some evidence supporting Gillham’s theory. For example, Condé Nast websites used to block OpenAI’s web crawlers. After the company announced a partnership with OpenAI last week, it unblocked the company’s bots. (Condé Nast declined to comment on the record for this story.) Meanwhile, Buzzfeed spokesperson Juliana Clifton told WIRED that the company, which currently blocks Applebot-Extended, puts every AI web-crawling bot it can identify on its block list unless its owner has entered into a partnership—typically paid—with the company, which also owns the Huffington Post.

Because robots.txt needs to be edited manually, and there are so many new AI agents debuting, it can be difficult to keep an up-to-date block list. “People just don’t know what to block,” says Dark Visitors founder Gavin King. Dark Visitors offers a freemium service that automatically updates a client site’s robots.txt, and King says publishers make up a big portion of his clients because of copyright concerns.

Robots.txt was once purely the domain of webmasters, but with the rise of AI, media executives have taken a keen interest: according to information obtained by WIRED, the CEOs of two major media companies are now directly involved in deciding which bots to block.

Some media entities say plainly that they block AI scrapers because no commercial partnership is in place. “Across all Vox Media properties, we’re blocking Applebot-Extended, as we’ve done with other AI scrapers when there’s no agreement in place,” says Lauren Starke, senior vice president of communications at Vox Media. “It’s crucial for us to protect the value of our content.”

In contrast, others state their reasons less specifically but just as firmly. “At this time, we see no benefit in allowing Applebot-Extended to access our content,” explains Lark-Marie Antón, chief communications officer at Gannett.

The New York Times, which is suing OpenAI for copyright infringement, objects to the opt-out model that bots like Applebot-Extended impose. “Under the law and our own terms, using our content commercially without prior permission is forbidden,” says Charlie Stadtlander, director of external communications at the Times, who adds that the paper continually updates its block list as it identifies unauthorized bots. “Copyright laws are enforced regardless of technical blocks, and owners are not required to opt out from copyright theft,” he says.


It remains unclear whether Apple is close to finalizing any agreements with publishers. If deals do materialize, the effects of data licensing or sharing arrangements may show up in robots.txt files, as blocks are lifted, before any public announcement.

“I find it fascinating that one of the most consequential technologies of our era is being developed, and the battle for its training data is playing out on this really obscure text file, in public for us all to see,” says Gillham.
