Companies alert as along come AI web spiders
AI crawlers are computer programs that collect data from websites to train large language models. Enterprises are increasingly blocking AI web crawlers due to performance issues, security threats, and violation of content guidelines. Unlike tradit...

AI crawlers are computer programs that gather data from websites to train large language models. With the growing use of AI search and the need to collect training data, the internet is seeing many new web scrapers such as Bytespider, PerplexityBot, ClaudeBot and GPTBot.
Until 2022, web crawling was dominated by conventional search engine crawlers such as GoogleBot, AppleBot and BingBot, which had for decades followed the principles of ethical content scraping and scheduling.

On the other hand, the aggressive AI bots are not only violating content guidelines but also degrading the performance of websites, adding overhead costs and posing security threats. Many websites and content portals are implementing anti-scraping measures or bot restriction technologies to counter this.
According to Cloudflare, a leading content delivery network provider, nearly 40% of the top 10 internet domains accessed by 80% of AI bots are moving to block AI crawlers.
Reuben Koh, director of security technology and strategy at content delivery network company Akamai Technologies, said, “Scraping poses a significant overhead and impacts the performance of a website. It does this by intensively interacting with the site, attempting to scrape every single piece of content. This results in a performance penalty.”
According to Cloudflare’s analysis of the top 10,000 internet domains, three AI bots accessed the highest share of websites: Bytespider, operated by TikTok’s Chinese parent ByteDance (40.40%), GPTBot, operated by OpenAI (35.46%), and ClaudeBot, run by Anthropic (11.17%). Although these AI bots follow the rules, Cloudflare customers overwhelmingly opt to block them, it said. Meanwhile, CCBot, developed by Common Crawl, scrapes the web to create an open-source dataset that anyone can use.
What sets AI crawlers apart
Traditionally, web scraper bots follow the robots.txt protocol as a guiding principle on what can be indexed. Traditional search engine bots such as GoogleBot and BingBot adhere to this and stay away from intellectual property. However, AI bots have been found to violate the rules set out in robots.txt on multiple occasions.
“Google and Bing do not overwhelm websites because they follow a predictable and transparent indexing schedule. For instance, Google is clear about how often it indexes a particular domain, allowing companies to anticipate and manage the potential performance impact,” Koh said. “With newer and more aggressive crawlers, like those driven by AI, the situation is less predictable. These crawlers don’t necessarily operate on a fixed schedule, and their scraping activities can be much more intensive.”
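To make the robots.txt mechanism concrete, here is a minimal sketch in Python of how a well-behaved crawler might check a site's rules before fetching a page. The robots.txt contents, the example.com URLs and the helper names `build_parser` and `is_allowed` are illustrative assumptions, not taken from any real site. The parser itself is `urllib.robotparser` from Python's standard library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: it turns away two AI crawlers
# entirely and keeps every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    page = "https://example.com/articles/1"
    for agent in ("GPTBot", "CCBot", "Googlebot"):
        print(agent, is_allowed(rules, agent, page))
```

The check is purely voluntary: robots.txt only states the site's wishes, and a crawler that never runs a check like this is not stopped by the file itself.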
Can’t block them all
However, experts said blocking AI crawlers outright cannot be the ultimate solution, because websites still need to be discovered. If AI search is set to become the new search practice, websites will need to show up there, just as they do in commercial search engine results, to be discovered and gain customers, they said. “Enterprises are going to be concerned if we are blocking legitimate revenue generating crawl activity or bot activity. Or are we allowing too many malicious activities to happen on our website? It’s a very fine balance, they need to understand,” Koh said.
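The balance Koh describes can be pictured as a per-crawler policy rather than a blanket ban. The sketch below, in Python, is one hypothetical way a site might express that. The crawler names come from this article, but the `POLICY` table, the allow/block choices in it and the `classify` function are assumptions made for illustration. It matches on the User-Agent header alone, which a crawler can falsify, so a real deployment would likely need further verification.

```python
# Hypothetical policy table: which crawlers a site chooses to admit.
# The allow/block choices here are illustrative assumptions only.
POLICY = {
    "googlebot": "allow",
    "bingbot": "allow",
    "applebot": "allow",
    "perplexitybot": "allow",   # assumed: the site wants AI search visibility
    "bytespider": "block",
    "gptbot": "block",
    "claudebot": "block",
    "ccbot": "block",
}


def classify(user_agent: str) -> str:
    """Return 'allow', 'block', or 'unknown' for a User-Agent header value."""
    ua = user_agent.lower()
    for crawler_name, decision in POLICY.items():
        if crawler_name in ua:
            return decision
    # Anything not listed (browsers, unlisted bots) is left to other checks.
    return "unknown"


if __name__ == "__main__":
    for header in (
        "Mozilla/5.0 (compatible; Googlebot/2.1)",
        "Mozilla/5.0 (compatible; GPTBot/1.0)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    ):
        print(classify(header), "<-", header)
```

Changing one entry in the table, for example flipping an AI search crawler from "block" to "allow", is the kind of trade-off between visibility and load that the experts point to.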