The name Anthropic should ring a bell, and in case it doesn't, we'd like to shed some light on that front.
The popular AI firm has been a serious source of concern for many obvious reasons, and the biggest of them all has to do with its ability to scrape data from wherever and whenever it likes.
By now, you'd assume that websites would have been able to block the firm from doing this, but the reality is far from it. Why, you might ask?
The answer is simple: AI firms like Anthropic are giving rise to more and more scraping bots, and the rate at which these are multiplying is staggering. Websites might think they're blocking the right bots, but that's often not the case. Combined with failed attempts to keep pace with the growing number of crawlers, this is making matters worse.
Did we mention that most companies keep launching newer bots with unique names, and those only get blocked if site owners update their robots.txt files?
In Anthropic's case, sites are mostly blocking two bots that the firm no longer uses. The real troublemaker, the crawler Anthropic actually runs today, isn't being targeted at all, and as long as it remains unblocked, it keeps doing the damage.
The anonymous operator of Dark Visitors, a company that tracks the agents of various AI firms, mentioned how many web scrapers keep getting updated, which prevents them from being detected and blocked. The site has been seeing massive popularity as more and more website owners try to prevent AI from using their material.
See, experts realize that the ecosystem never stays still. It changes constantly, and website owners are finding it very hard to keep up. Owners keep updating their sites' robots.txt files, which contain instructions telling bots whether they have permission to crawl certain pages.
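To make the mechanics concrete, here's a minimal sketch in Python using the standard library's urllib.robotparser. The robots.txt content and the agent names (ANTHROPIC-AI and CLAUDE-WEB as older identifiers, ClaudeBot as the crawler reportedly in use today) are illustrative assumptions based on public reporting, not a definitive blocklist any site should copy.

```python
# Sketch: why blocking outdated agent names leaves the active crawler untouched.
# Agent names and the sample URL are illustrative assumptions, not a vetted blocklist.
import urllib.robotparser

# robots.txt a site might have published back when the older agent names were current
ROBOTS_TXT = """\
User-agent: ANTHROPIC-AI
Disallow: /

User-agent: CLAUDE-WEB
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in ("ANTHROPIC-AI", "CLAUDE-WEB", "ClaudeBot"):
    allowed = parser.can_fetch(agent, "https://example.com/articles/some-page")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")

# The two legacy names come back blocked, while ClaudeBot is still allowed,
# because robots.txt rules only apply to the user agents they explicitly name.
```

The fix is simple in principle: add a User-agent: ClaudeBot entry with its own Disallow rule alongside the old ones. The hard part, as the reporting suggests, is knowing which names to add and keeping that list current as crawlers change.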
Time and time again, certain sites also resort to blunt blocking methods that sweep up more crawlers than they should. In some cases, the effort leads to unwanted side effects, like restricting search engines, tools used for archiving material, or crawlers linked to academic research, even when that was never the owner's intention.
Take Anthropic, for instance: the robots.txt files of popular websites such as Reuters and Condé Nast block scraper bots that were once used by the company behind the Claude AI chatbot but are no longer in service. As a result, these websites, and many more like them, were unknowingly failing to block Anthropic at all.
Many site owners are tired and don't know what to do next. Take the repair guide website iFixit, for instance, which mentioned that a crawler from Anthropic hit its site close to one million times in a single day. That's a lot of pages to access when you come to think of it.
Experts suggest it's time AI firms were more respectful of the pages they crawl. Otherwise, they risk being blocked by many sites for abuse, no matter what ethical norms the industry settles on.
Image: DIW-Aigen
Read next: Mark Zuckerberg Drops F-Bomb While Discussing His Excitement For Closed vs Open-Source AI At Meta
by Dr. Hura Anwar via Digital Information World