Cloudflare identifies Web’s hungriest large language models

ByteDance is the busiest of the tech giants when it comes to AI training, but for what purpose remains to be seen
Image: Stockfresh

9 July 2024

New figures from security firm Cloudflare have shed light on how intensively companies are grazing the Web to train their large language models.

Cloudflare developed a system that allows its customers to keep AI crawlers out of their websites. More than 80% of its customers use that free option, which should be taken as a strong signal that the vast majority of the online community does not want its content used to train AI models.

TikTok parent company ByteDance appears to be far and away the busiest player, with Amazon also picking up momentum, behind ChatGPT developer OpenAI and Anthropic, the company behind Claude.

It is not clear what exactly the Chinese company is working on for the international market, but domestically it is developing a ChatGPT-like chatbot called Doubao.

Amazon, logically, wants to take its ubiquitous digital assistant Alexa to the next level, which explains its increased activity.

Some site administrators state their objection explicitly by adding a few lines of text to the so-called robots.txt file. Crawlers commonly read that file first to see what they are and are not allowed to do on a server. GPTBot (OpenAI), CCBot (Common Crawl) and Google are the crawlers most frequently addressed there. Site administrators tend to overlook Bytespider and ClaudeBot. Consequently, these get all the space they need to gobble up text, images and sound.
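To illustrate how a robots.txt file that names only some crawlers leaves the others free to proceed, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the example.com URL and the helper name check_crawlers are assumptions made for illustration; only the bot names come from the article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks two commonly named AI crawlers
# but says nothing about Bytespider or ClaudeBot.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""


def check_crawlers(robots_txt, agents, url="https://example.com/article"):
    """Return a dict mapping each user agent to whether it may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = check_crawlers(
    ROBOTS_TXT, ["GPTBot", "CCBot", "Bytespider", "ClaudeBot"]
)
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

In this sketch, GPTBot and CCBot are reported as blocked, while Bytespider and ClaudeBot are reported as allowed simply because no rule mentions them. Note that robots.txt is a request, not an enforcement mechanism: it only works if the crawler chooses to honour it.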

It is not yet well understood how often bots still crawl sites where they are not wanted. Photo archives and publishers like the New York Times are suing the tech companies for copyright infringement. However, Axel Springer and News Corp. have taken a more pragmatic approach by striking licensing deals for the use of their content.
