- Battling high load on CGSpace from bots a few times over the past few weeks
- I think I need to rework the bot filtering somehow
- I extracted all the unique IPs from the past ~45 days:
# zcat -f /var/log/nginx/api-access.log.* | awk '{print $1}' | sort | uniq > /tmp/ips.txt
# wc -l /tmp/ips.txt
449405 /tmp/ips.txt
- It’s an impressive 450,000 IPs!
- I will run them through my script to tease out the large offenders
- I had to split the file into segments of 100,000 lines to get the script to run!
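The notes don't say how the file was split, so the following is only a sketch of one way to do it, assuming the standard `split` utility. The temporary directory, the fake documentation-range IPs, and the `ips-part-` prefix are all made up for illustration, and the chunk size is scaled down from 100,000 to 100 so the demo runs instantly:

```shell
# Hypothetical demo input: 250 fake addresses standing in for /tmp/ips.txt
workdir=$(mktemp -d)
seq 1 250 | sed 's/^/192.0.2./' > "$workdir/ips.txt"

# Split into chunks of at most 100 lines each (the real run would use 100000).
# Output files get alphabetic suffixes: ips-part-aa, ips-part-ab, ...
split -l 100 "$workdir/ips.txt" "$workdir/ips-part-"

# Show how many lines ended up in each chunk
wc -l "$workdir"/ips-part-*
```

Against the real file this would look something like `split -l 100000 /tmp/ips.txt /tmp/ips-part-`, which for 449,405 lines should produce five chunks (four full ones and a shorter final one), each of which can then be fed to the script separately.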
- Found some new datacenter networks to add to the list:
- AS32934 Facebook (I’m surprised this wasn’t already there, as I know I’ve blocked it in the past)
- AS209366 Semrush, which does use a proper user agent, but they don’t respect robots.txt and give us no value
- AS13150 Datacenter ISP (CATON - CATO NETWORKS LTD, IL)
- AS20473 AS-VULTR, US
- AS13335 CLOUDFLARENET, US … wtf is Cloudflare doing? They made requests from 2,600 IPs in the last month???
- 133499 (HOSTROYALETECHNOLOGIES-AS-AP HostRoyale Technologies Pvt Ltd, NZ)
- 134450 (HOSTROYALETECHNOLOGIES-AS-AP HostRoyale Technologies Pvt Ltd, IN)
- 207990 (HR-CUSTOMER - HostRoyale Technologies Pvt Ltd, IN)
- 39486 (HOSTROYALE - HostRoyale Technologies Pvt Ltd, IN)
- 44144 (OMEGASOFT - HostRoyale Technologies Pvt Ltd, IN)
- 202496 (PRIENAI-SCIENCES-INTERNET - trafficforce, UAB, LT)
- 19084 (COLOUP, US)
- 21769 (AS-COLOAM, US)
- 17252 (AS2-COLOAM, US)
- 205964 (CODE200-180 - UAB code200, LT)
- 203061 (ITPROXIMUS - UAB code200, LT)
- 205659 (CODE200-ISP2 - UAB code200, LT)
- 394474 (WHITELABELCOLO393, US)
- 18779 (EGIHOSTING, US)
- The names above mostly come from