Companies alert as along come AI web spiders

AI crawlers are computer programs that collect data from websites to train large language models. Enterprises are increasingly blocking AI web crawlers due to performance issues, security threats, and violation of content guidelines. Unlike tradit...

By Himanshi Lohchab, ETtech | Dec 15, 2024, 06.00 AM IST

Enterprises are increasingly resorting to blocking artificial intelligence (AI) web crawlers and spiders, which are scraping the web bit by bit and hampering the performance of websites, according to industry executives and experts.

AI crawlers are computer programs that gather data from websites to train large language models. With the increased use of AI search and the need to collect training data, the internet is seeing many new web scrapers such as Bytespider, PerplexityBot, ClaudeBot and GPTBot.

Until 2022, the web was crawled mainly by conventional search engine crawlers such as GoogleBot, AppleBot and BingBot, which for decades had obeyed the principles of ethical content scraping and scheduling.

On the other hand, the aggressive AI bots are not only violating content guidelines but also degrading the performance of websites, adding overhead costs and posing security threats. Many websites and content portals are implementing anti-scraping measures or bot restriction technologies to counter this.
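One of the simplest bot restriction measures is to inspect the User-Agent header of each incoming request. The Python sketch below is illustrative only: the function name `is_ai_crawler` and the token list are assumptions built from the crawler names mentioned in this article, not any vendor's actual product. Because a client can send any User-Agent string it likes, a check of this kind only stops crawlers that identify themselves honestly.

```python
# Illustrative sketch: flag requests whose User-Agent names a known AI crawler.
# The token list is an assumption taken from the bots named in this article.
AI_CRAWLER_TOKENS = ("bytespider", "gptbot", "claudebot", "perplexitybot", "ccbot")


def is_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a known AI crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in AI_CRAWLER_TOKENS)


def response_status(user_agent: str) -> int:
    """Hypothetical policy: refuse AI crawlers with 403, serve everyone else."""
    return 403 if is_ai_crawler(user_agent) else 200
```

A site would typically call a check like this in its web server or CDN rules before any page is rendered, so that refused requests cost as little as possible.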

According to Cloudflare, a leading content delivery network provider, nearly 40% of the top 10 internet domains accessed by 80% of AI bots are moving to block AI crawlers.

India’s apex technology body Nasscom said these crawlers are especially damaging to news publishers if they use authored content without attribution. “If the use of copyrighted data for AI model training qualifies as fair use is moot,” Raj Shekhar, Responsible AI lead at Nasscom, told ET. “The legal dispute between ANI Media and OpenAI is a wake-up call for AI developers to heed IP (intellectual property) laws when collecting training data. Developers, therefore, must exercise caution and consult IP experts to ensure compliant data practices and avoid potential liabilities.”

Reuben Koh, director of security technology and strategy at content delivery network company Akamai Technologies, said, “Scraping poses a significant overhead and impacts the performance of a website. It does this by intensively interacting with the site, attempting to scrape every single piece of content. This results in a performance penalty.”

According to Cloudflare’s analysis of the top 10,000 internet domains, three AI bots had the highest share of websites accessed: Bytespider, operated by ByteDance, the Chinese parent of TikTok (40.40%); GPTBot, operated by OpenAI (35.46%); and ClaudeBot, run by Anthropic (11.17%). Although these AI bots follow the rules, Cloudflare customers overwhelmingly opt to block them, it said. Meanwhile, CCBot, developed by Common Crawl, scrapes the web to create an open-source dataset that can be used by anyone.

What sets AI crawlers apart

AI crawlers differ from conventional crawlers in that they target high-quality text, images and videos that can enhance training datasets. AI-powered crawlers are more intelligent than conventional search engine crawlers, “which just crawl, gather data, and stop there”, said Akamai’s Koh. “Their intelligence is not only used for data selection but also for data classification and prioritisation. This means that even after they crawl, index and scrape all the data, they can process what the data is going to be used for,” he said.

Traditionally, web scraper bots follow the robots.txt protocol as a guiding principle on what can be indexed. Traditional search engine bots such as GoogleBot and BingBot adhere to this and stay away from intellectual property. However, AI bots have been found to violate the principles of robots.txt in multiple instances. “Google and Bing do not overwhelm websites because they follow a predictable and transparent indexing schedule. For instance, Google is clear about how often it indexes a particular domain, allowing companies to anticipate and manage the potential performance impact,” Koh said. “With newer and more aggressive crawlers, like those driven by AI, the situation is less predictable. These crawlers don’t necessarily operate on a fixed schedule, and their scraping activities can be much more intensive.”
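To make the robots.txt mechanism concrete, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt rules, the `example.com` URLs and the `/private/` path are hypothetical and chosen only for illustration. The file is purely advisory: it tells a crawler what the site owner permits, and it is up to the crawler to run a check like this and honour the result.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: refuse two AI crawlers outright, and keep
# every other bot out of /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the rules without any network call

story_url = "https://example.com/news/story"      # hypothetical page
private_url = "https://example.com/private/data"  # hypothetical page

# A well-behaved crawler asks before fetching each URL.
gptbot_allowed = parser.can_fetch("GPTBot", story_url)
googlebot_allowed = parser.can_fetch("Googlebot", story_url)
googlebot_private = parser.can_fetch("Googlebot", private_url)

print(gptbot_allowed, googlebot_allowed, googlebot_private)
```

Under these rules GPTBot is refused everywhere, while a search crawler such as Googlebot may fetch the story but not the private path. Nothing in the protocol enforces the answer, which is why sites that find the file ignored turn to blocking at the server or CDN.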

Koh cautioned about a third category of crawlers that are malicious in nature and misuse data for fraud. According to Akamai’s State of the Internet research, more than 40% of all internet traffic is from bots and about 65% of that is from malicious bots.

Can't Block Them All

However, experts said that eliminating AI crawlers cannot be the ultimate solution because websites need to be discovered. If AI search is set to become the new search practice, websites will need to show up in its results, just as they do in commercial search engine results, to be found and to gain customers, they said. “Enterprises are going to be concerned if we are blocking legitimate revenue generating crawl activity or bot activity. Or are we allowing too many malicious activities to happen on our website? It’s a very fine balance, they need to understand,” opined Koh.
