AI Web Crawler Bots Gone Wild! e.g. ClaudeBot, DotBot, PetalBot 🤖

In the middle of the night I woke to a RED ALERT from UptimeDoctor.com. For fun, I literally configured a Star Trek red alert mp3 in the Pushover app on my phone, though it’s only humorous after the issue is resolved! :grinning: It’s quite surreal, especially in a half-dream state, when that old familiar sound represents an actual emergency and I feel like I’m on the bridge of the Enterprise about to get vaporized for reals. :flushed:

Lucee was unresponsive, and initially my only clue was a Java heap out-of-memory error.

All was well after restarting Lucee. Then about an hour of analyzing resource logs also revealed high CPU usage from Lucee/Java, whose baseline is normally a mere 1%.

After a few more hours of sleep and a review of the Apache logs, I figured out it was AI BOTS GONE WILD. I also found I’m far from the only one affected. It turns out the bots are having a huge global impact due to their use of Amazon AWS.

Then I worked on how best to stop them. Fortunately they’re at least nice enough to identify themselves in their User-Agent strings, but the usual means of robots.txt or Apache config didn’t work for me, because mod_proxy sends the requests to Lucee before those are processed.
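
For context, here’s roughly what a typical Apache-to-Lucee mod_proxy connector looks like (a sketch based on common Lucee installs; the port and pattern may differ on your server), which shows why matching requests get handed to Lucee so early in Apache’s request cycle:

<IfModule mod_proxy.c>
    ProxyPreserveHost On
    # forward all .cfc/.cfm requests straight to the Lucee/Tomcat instance
    ProxyPassMatch ^/(.+\.cf[cm])(/.*)?$ http://127.0.0.1:8888/$1$2
    ProxyPassReverse / http://127.0.0.1:8888/
</IfModule>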

One option could have been for my Lucee apps to deal with it themselves. Not only would that require extra coding, but more importantly, even when the bots are blocked at the application level, every request still consumes Java resources.
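
Just to illustrate what application-level blocking would look like (a minimal sketch with a made-up bot list; not what I ended up using):

// Application.cfc (sketch): reject known bots before any page logic runs
component {
	this.name = "myApp";

	boolean function onRequestStart(required string targetPage) {
		// illustrative list; the real one would be much longer
		var badBots = ["ClaudeBot", "DotBot", "PetalBot"];
		for (var bot in badBots) {
			if (cgi.http_user_agent.findNoCase(bot)) {
				cfheader(statusCode=403, statusText="Forbidden");
				abort;
			}
		}
		return true;
	}
}

Even this still spins up a Java thread per bot request, which is exactly the overhead I wanted to avoid.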

mod_security to the rescue! Since I use the OWASP Core Rule Set (CRS), after hours of digging and plenty of frustration, I finally found a simple text file to which I appended the list of bots I needed to block:

/etc/apache2/conf.d/modsec_vendor_configs/OWASP3/rules/crawlers-user-agents.data

(location is likely different on non-cPanel servers)
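
For those not running the full CRS, I believe a standalone ModSecurity rule can match against that same data file; a minimal sketch (the rule id is arbitrary):

# block any request whose User-Agent matches an entry in the data file
SecRule REQUEST_HEADERS:User-Agent "@pmFromFile crawlers-user-agents.data" \
    "id:1000001,phase:1,deny,status:403,log,msg:'Bad bot blocked'"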

I spent a total of nearly NINE HOURS on this quite aggravating incident and my purpose for posting is to hopefully help others feeling the same pressure to keep their servers and apps running smoothly in the face of the AI invasion!

:robot: :robot: :robot:

5 Likes

Yes, crawlers are becoming more and more aggressive. Anyway, I solved it using robots.txt. Bots are kind enough to respect what you ask of them :sweat_smile:

Something like this:

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: Omgili
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: ImagesiftBot 
Disallow: /

User-agent: cohere-ai
Disallow: /
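
By the way, if I understand the robots.txt spec correctly, several User-agent lines can also share a single Disallow group, which keeps the file shorter:

User-agent: CCBot
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /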
1 Like

Are you incrementally blocking bots when you detect them? Or is this from a list you have somewhere? Just curious.

5 Likes

Ahh, well there you go :smiley:

@Roberto_Marzialetti, there have been reports that these bots ignore robots.txt. Maybe they are respecting it now, since it’s working for you, but I would still expect better performance from a web application firewall that prevents further processing by the web server.

My next sysadmin project will be a migration to Coraza, a drop-in replacement for ModSecurity that’s claimed to be 100X faster (both are now maintained by OWASP), plus the Coraza WAF Ratelimit Plugin to automate temporarily blocking offending IP addresses, not just their requests (no need for a separate installation of Fail2ban).

1 Like

That list doesn’t include DotBot or PetalBot. I’ve seen a LOT of the latter. But that list does have some I’ll add to mine. I’m not sure I want to block ChatGPT-User, Google-Extended, and GoogleOther, unless those user agents are imposters?

@bennadel I use a combination of the list defined by OWASP, plus others I gathered from various other sources, plus the three new offenders I’ve mentioned in this post.

Here are the contents of my crawlers-user-agents.data file:

# Search engine crawlers and other bots
# crawler
# https://80legs.com/
80legs
# site ripper
# http://www.softbytelabs.com/en/BlackWidow/
black widow
blackwidow
# crawler
# 2006
prowebwalker
# generic crawler
pymills-spider/
# SEO
# https://ahrefs.com/robot
AhrefsBot
# people database
# https://pipl.com/bot/
PiplBot
# advertising targeting
# https://www.grapeshot.com/crawler/
GrapeshotCrawler/2.0
grapeFX
# SEO
# http://www.searchmetrics.com/searchmetricsbot/
SearchmetricsBot
# SEO
# https://www.semrush.com/bot/
SemrushBot
# SEO
# https://moz.com/help/guides/moz-procedures/what-is-rogerbot
rogerbot
# SEO
# http://www.majestic12.co.uk/projects/dsearch/mj12bot.php
MJ12bot
# news service
Owlin bot
# misbehaving spider
Lingewoud-550-Spyder
# https://www.wappalyzer.com/
Wappalyzer

# THESE ADDED BY KENRIC
ADmantX
AdsBot-Google
AlphaBot
Amazonbot
anthropic-ai
Applebot
AwarioRssBot
AwarioSmartBot
Baiduspider
BLEXBot
Buzzbot
Bytespider
CCBot
ChatGPT-User
Claude-Web
ClaudeBot
coccocbot-image
cohere-ai
DataForSeoBot
Diffbot
DotBot
FacebookBot
FriendlyCrawler
GPTBot
Heritrix
ImagesiftBot
img2dataset
magpie-crawler
Mail.Ru
MaxPointCrawler
Meltwater
Nutch
omgili
omgilibot
peer39_crawler
PerplexityBot
PetalBot
PHPCrawl
PiplBot
scoop.it
Seekr
seoscanners
SeznamBot
YouBot
ZoominfoBot
ZumBot

I’m also interested in Cloudflare Bot Management to stop bots before they even reach my server, but with my still limited budget and their Contact Sales button instead of transparent pricing … I’m afraid to ask!

1 Like

Having Cloudflare is helpful even on the free plan.
You can create a custom rule under Security > WAF to block (or challenge) bots by their User-Agent.
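
For example, a custom rule expression along these lines (the bot names are just examples):

(http.user_agent contains "ClaudeBot") or (http.user_agent contains "PetalBot")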

1 Like

Apparently Cloudflare is no longer offering that service for free, as I’m getting this response both with my existing account and a new test account I created just today:

Oddly their plan comparison still indicates it is included with the free account:

So I will inquire with them about that discrepancy.

Implementing Cloudflare will also require an investment in my time which I don’t have an abundance of at the moment, despite the appearance that I do since I’m using way too many multi-syllable words to convey my message. :joy:

But I did need more than what I’d already done, because new bots keep popping up all the time and it’s a pain keeping up with them all.

So another thing I’ve done is require user authentication in various sections of my sites that don’t need to be catalogued by Internet search services, with a login page that’s protected by my very own fork of Captcheck (not to be confused with CAPTCHA), which is hosted on my own server.

Though pretty soon I’ll be switching to Cloudflare Turnstile, which I’ve confirmed is still free … for now?
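
For reference, Turnstile’s server-side check is just one HTTP call to Cloudflare’s siteverify endpoint; a minimal Lucee sketch (the secret key is a placeholder, and the widget posts its token in the cf-turnstile-response form field):

// verify the Turnstile token submitted with the login form
cfhttp(method="post", url="https://challenges.cloudflare.com/turnstile/v0/siteverify", result="res") {
	cfhttpparam(type="formfield", name="secret", value="YOUR_SECRET_KEY");
	cfhttpparam(type="formfield", name="response", value=form["cf-turnstile-response"]);
}
passed = deserializeJSON(res.fileContent).success;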

Displaying the login page for each bot request still consumes resources, but far less than serving pages that query the database and do various other specialized things.

Account-level WAF is indeed paid. But if you first select a domain, you can configure the WAF per domain for free.
I’ve been using Turnstile as well for about half a year with no complaints; it’s better than reCAPTCHA.
Hope this helps with teaching bots good manners.

1 Like

Done! Or at least I’ve got two User Agent names active for a test domain. I’ll create a script to convert my text file to the expression that’s used by Cloudflare. Also, enabling the basic Cloudflare services for a domain was a lot easier than I feared, including SSL/TLS! Of course I still have a lot to learn about everything they offer, but already off to a great start. Thanks @Alexei_Beloglazov!

1 Like

Great. Another game of whack-a-mole. I suspect eventually we will have to implement some AI to block AI.

1 Like

Indeed. And though Cloudflare is blocking some bots, their definition of “bad bot” may not always match mine, so I’ll keep periodically monitoring my Apache logs for request totals grouped by User-Agent.
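
To keep this on topic, that tally can even be scripted in Lucee; a rough sketch, assuming Apache’s “combined” log format where the User-Agent is the final quoted field (the log path is illustrative):

logPath = "/var/log/apache2/access_log";
counts = {};
f = FileOpen(logPath, "read");
while (!FileIsEOF(f)) {
	line = FileReadLine(f);
	// capture the last quoted field on the line, i.e. the User-Agent
	m = reFind('"([^"]*)"\s*$', line, 1, true);
	if (m.pos[1] == 0) continue;
	ua = mid(line, m.pos[2], m.len[2]);
	counts[ua] = (counts[ua] ?: 0) + 1;
}
FileClose(f);
// print user agents by request count, busiest first
for (ua in structSort(counts, "numeric", "desc")) echo(counts[ua] & "  " & ua & chr(10));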

Also from the Apache logs I saw some bots, particularly Amazonbot, ignoring proper routing and hitting my origin server directly, so I enabled Authenticated Origin Pulls to block any traffic not routed through Cloudflare, with an exception in my Apache config for local cron jobs and cfhttp() calls.
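
On the Apache side, Authenticated Origin Pulls is essentially mutual TLS; a minimal sketch, assuming you’ve downloaded Cloudflare’s published origin-pull CA certificate (paths illustrative):

<VirtualHost *:443>
    SSLEngine on
    # only accept clients presenting a certificate signed by Cloudflare's
    # origin-pull CA, i.e. traffic proxied through Cloudflare
    SSLVerifyClient require
    SSLCACertificateFile /etc/ssl/cloudflare/authenticated_origin_pull_ca.pem
</VirtualHost>

Local cron and cfhttp() traffic can then target a separate localhost-only vhost that skips the client-certificate check.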

And here’s my Lucee script (keeping this topic still relevant to Lucee haha) for converting OWASP’s crawlers-user-agents.data to the Cloudflare WAF custom rule expression:

// path may be different on your server
pathUserAgents = "/etc/apache2/conf.d/modsec_vendor_configs/OWASP3/rules/crawlers-user-agents.data";
str = "";
F = FileOpen(pathUserAgents, "read");
while (!FileIsEOF(F)) parseLine(FileReadLine(F).trim());
FileClose(F);
echo(str);
function parseLine(line) {
	// ignore empty lines and comments
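	// (note: "##" is just an escaped "#" inside a CFML string literal)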
	if (line == "" || line.left(1) == "##") return;
	if (str != "") str &= " or ";
	str &= '(lower(http.user_agent) contains "#line.lcase()#")';
}
/* Example output:
(lower(http.user_agent) contains "claude-web")
or (lower(http.user_agent) contains "claudebot")
or (lower(http.user_agent) contains "dataforseobot")
[...]
*/

:exploding_head:

21,090 bad bot requests blocked in 24 hours!!!

2 Likes