Perplexity AI accused of scraping content against websites’ will with unlisted IP ranges

Perplexity, an AI search startup, has been spotted trying to disguise its content-scraping bots while flouting websites' no-crawl directives.

According to Cloudflare, a network infrastructure company that recently entered the bot-gatekeeping business, Perplexity's bots don't take no for an answer when websites say they don't want to be scraped.

"Although Perplexity initially crawls from their declared user agent, when they are presented with a network block, they appear to obscure their crawling identity in an attempt to circumvent the website’s preferences," said Cloudflare engineers Gabriel Corral, Vaibhav Singhal, Brian Mitchell, and Reid Tatoris in a Monday blog post.

"We see continued evidence that Perplexity is repeatedly modifying their user agent and changing their source ASNs to hide their crawling activity, as well as ignoring — or sometimes failing to even fetch — robots.txt files."

A robots.txt file is a way for websites to tell web crawlers – automated client software – which resources, if any, they may access. It's part of the Robots Exclusion Protocol, originally drafted by Martijn Koster in 1994. Compliance is voluntary, and growing disregard for the protocol has led companies like Cloudflare to offer defensive technology to publishers.
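For illustration, a site that wanted to refuse Perplexity's declared crawlers while still admitting others could publish something like the following. The directives are standard Robots Exclusion Protocol syntax; the site and the private path are hypothetical:

```
# robots.txt served at https://example.com/robots.txt (hypothetical site)

# Refuse Perplexity's declared crawlers entirely
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

# Everyone else may crawl, except a private area
User-agent: *
Disallow: /private/
```

Nothing enforces this: a compliant crawler fetches the file and honors it, while a non-compliant one simply ignores it, which is the behavior Cloudflare says it observed.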

The Cloudflare engineers say customers reported that their sites were still being crawled by Perplexity bots even after they disallowed them via robots.txt directives and set up web application firewall rules to block Perplexity's declared crawlers, PerplexityBot and Perplexity-User.
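Blocking a declared crawler by its user agent is straightforward. As a sketch of the general idea – not necessarily the rules Cloudflare's customers deployed – an equivalent nginx rule might look like this:

```
# Inside an nginx server block: deny any request whose User-Agent
# declares either of Perplexity's named crawlers.
# This only works while the bot identifies itself honestly.
if ($http_user_agent ~* "(PerplexityBot|Perplexity-User)") {
    return 403;
}
```

This is exactly the kind of rule a user-agent-switching bot defeats, which is the crux of Cloudflare's complaint.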

The stealth bots, they said, operated outside Perplexity's official IP range, using addresses from different ASNs (Autonomous System Numbers, which identify the networks that announce blocks of IP addresses) to evade address-based blocking. When blocked, they switched to a generic user agent impersonating Google Chrome on macOS, and were seen making millions of site data requests daily.
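This is why address-based verification matters: a site can check whether a visitor claiming to be a known crawler actually arrives from that crawler's published IP ranges. A minimal Python sketch of that check, using placeholder documentation ranges rather than Perplexity's real ones:

```python
import ipaddress

# Hypothetical published ranges for a crawler; real operators publish
# their own lists (the article notes Perplexity has an official range).
PUBLISHED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),     # TEST-NET-1, placeholder
    ipaddress.ip_network("198.51.100.0/24"),  # TEST-NET-2, placeholder
]

def is_verified_crawler(remote_ip: str) -> bool:
    """Return True only if the request's source IP falls inside the
    crawler's published ranges. A bot that rotates to fresh ASNs, as
    Cloudflare says Perplexity's stealth crawlers did, fails this check."""
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in PUBLISHED_RANGES)

# A request claiming to be a known bot from an unlisted address is suspect:
print(is_verified_crawler("192.0.2.10"))   # True  (inside published range)
print(is_verified_crawler("203.0.113.7"))  # False (unlisted range/ASN)
```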

Perplexity did not respond to a request for comment.

Anthropic faced similar accusations last year and, in June this year, was sued by Reddit for content scraping alleged to violate the site's user agreement and California competition law. According to Cloudflare, OpenAI's bots have been following best practices lately, and its ChatGPT Agent has been signing HTTP requests using Web Bot Auth, a proposed standard for managing bot behavior.
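Web Bot Auth builds on HTTP Message Signatures (RFC 9421): the bot cryptographically signs selected parts of each request so the receiving site can verify who sent it, regardless of source IP or user agent string. A rough, hypothetical shape of such a request – the header values below are invented for illustration, not taken from OpenAI's traffic:

```
GET /article HTTP/1.1
Host: example.com
User-Agent: ExampleAgent/1.0
Signature-Agent: "https://agent-directory.example"
Signature-Input: sig1=("@authority" "signature-agent");created=1754000000;keyid="EXAMPLE-KEY-ID"
Signature: sig1=:BASE64-ENCODED-SIGNATURE-PLACEHOLDER:
```

The point of the scheme is that spoofing a signed identity requires the bot operator's private key, unlike spoofing a User-Agent header, which requires nothing at all.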

Initially, site-crawling bots were a mixed blessing. They consumed computing resources, but sometimes provided a benefit in return. Being visited by the Google Search crawler, for example, meant a site might appear in the Google Search index and thus be more visible to searchers, some of whom could be expected to visit and perhaps generate ad revenue.

Lately, however, that arrangement has become more lopsided. AI crawlers have proliferated while search referral traffic has plummeted. Bots are taking more and returning less, and that's due mainly to the data demands of AI companies, whose business model has become reselling the internet's non-consensually gathered data as an API or cloud computing service.

In June, bot-blocking biz TollBit published its Q1 2025 State of the Bots report, which found an 87 percent increase in scraping during the quarter. It also found that the share of bots ignoring robots.txt files increased from 3.3 percent to 12.9 percent during the quarter. In March 2025, the firm said, 26 million AI scrapes bypassed robots.txt files.

AI crawlers don't necessarily index websites the way search crawlers do. They may use site content for model training or for Retrieval Augmented Generation (RAG), a way to access content not captured in a model's training data. Google's AI Overviews and Perplexity Search, for example, rely on RAG to fetch current information in response to a user query or prompt.
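As a toy illustration of the RAG pattern – not any vendor's actual pipeline – fetched page text is stored, the passages most relevant to the user's query are retrieved, and those passages are packed into the model prompt alongside the question. The scoring below is naive keyword overlap; production systems use vector embeddings and send the assembled prompt to an LLM:

```python
# Toy RAG sketch: retrieve the most relevant scraped passages for a
# query, then build a prompt around them. Purely illustrative.

def score(passage: str, query: str) -> int:
    """Naive relevance: count query words that appear in the passage."""
    words = set(query.lower().split())
    return sum(1 for w in words if w in passage.lower())

def retrieve(corpus: list[str], query: str, k: int = 2) -> list[str]:
    """Return the k passages with the highest overlap score."""
    return sorted(corpus, key=lambda p: score(p, query), reverse=True)[:k]

def build_prompt(corpus: list[str], query: str) -> str:
    context = "\n".join(retrieve(corpus, query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Passages as an AI crawler might have scraped them (invented examples):
corpus = [
    "Cloudflare offers bot management tools to web publishers.",
    "The Robots Exclusion Protocol dates back to 1994.",
    "RAG fetches current content at query time rather than training time.",
]
print(build_prompt(corpus, "When was the Robots Exclusion Protocol created?"))
```

The pattern explains the traffic shift TollBit describes next: training scrapes happen once per corpus refresh, while RAG scrapes recur with every relevant user query.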

According to TollBit, RAG-oriented scraping has surpassed training-oriented scraping. "From Q4 2024 to Q1 2025, RAG bot scrapes per site grew 49 percent, nearly 2.5X the rate of training bot scrapes (which grew by 18 percent)," the firm's report says. "This is a clear signal that AI tools require continuous access to content and data for RAG vs. for training."

The problem for web publishers is that this is a revenue-threatening parasitic relationship. When an AI bot gathers data and presents a summary through an AI company's tool or interface, that imposes a compute cost on the source while offering no compensation for the harvested content.

TollBit's report indicates that, on the sites it monitors, Bing's ratio of scrapes to referred human site visits was 11:1. For AI-only apps, the ratios were: OpenAI, 179:1; Perplexity, 369:1; and Anthropic, 8,692:1.

AI firms are aware that their bots have worn out their welcome on the web. Perplexity last year launched its Publisher Program to pay participating partners. And various AI companies have struck deals with major publishers that grant access to their content. Reddit, keeper of valuable user-created content, has seen its business improve as a result.

Most websites, however, haven't been invited to the table to negotiate with the likes of Amazon, Anthropic, Google, Meta, OpenAI, and Microsoft. Thus, aspiring intermediaries like Cloudflare and TollBit are offering publishers a technical negotiation method: a paywall.
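Cloudflare's approach, for instance, leans on the long-dormant HTTP 402 Payment Required status code: a crawler without a payment arrangement gets a 402 instead of the page. A minimal sketch of that behavior using Python's standard library – the bot list is hypothetical and real deployments sit in edge infrastructure, not a toy server:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

PAYWALLED_BOTS = ("PerplexityBot", "GPTBot", "ClaudeBot")  # hypothetical list

class PaywallHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if any(bot in ua for bot in PAYWALLED_BOTS):
            # 402 Payment Required: defined in HTTP but rarely used until now.
            self.send_response(402)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"Crawling this site requires a paid agreement.\n")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Hello, human reader.\n")

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), PaywallHandler).serve_forever()
```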

It remains to be seen whether the cloud giants like Amazon, Google, and Microsoft, which make money from any AI usage on their infrastructure, need long-tail content enough to pay for it. And it's also unclear whether AI firms that aren't trying to incentivize data center usage can survive the impact of paywalls.

But at some point, either a business model that works for both AI firms and publishers will take shape, publishing will retreat behind subscription walls and the free web will become a sea of synthetic AI slop, or the AI bubble will collapse under the weight of unrequited capex. ®
