CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

February, 2025

2025-02-21

  • Battling with high load on CGSpace from bots a few times the past few weeks
    • I think I need to re-work the bot filtering some how
  • I extracted all the unique IPs from the past ~45 days:
# zcat -f /var/log/nginx/api-access.log.* | awk '{print $1}' | sort | uniq > /tmp/ips.txt
# wc -l /tmp/ips.txt
449405 /tmp/ips.txt
  • It’s an impressive 450,000 IPs!
  • I will run them through my resolve_addresses_geoip2.py script to tease out the large offenders
    • I had to split the file into 100,000 segments to get the script to run!
    • Found some new datacenter networks to add to the list:
      • AS32934 Facebook (I’m surprised this wasn’t already there, as I know I’ve blocked it in the past)
      • AS209366 Semrush, which does user a proper user agent, but they don’t respect robots.txt and give us no value
      • AS13150 Datacenter ISP (CATON - CATO NETWORKS LTD, IL)
      • AS20473 AS-VULTR, US
      • AS13335 CLOUDFLARENET, US … wtf is Cloudflare doing? They made requests from 2,600 IPs in the last month???
      • 133499 (HOSTROYALETECHNOLOGIES-AS-AP HostRoyale Technologies Pvt Ltd, NZ)
      • 134450 (HOSTROYALETECHNOLOGIES-AS-AP HostRoyale Technologies Pvt Ltd, IN)
      • 207990 (HR-CUSTOMER - HostRoyale Technologies Pvt Ltd, IN)
      • 39486 (HOSTROYALE - HostRoyale Technologies Pvt Ltd, IN)
      • 44144 (OMEGASOFT - HostRoyale Technologies Pvt Ltd, IN)
      • 133944 (TRAFFICFORCE-INTERNET-SERVICES - trafficforce, UAB, LT)
      • 202496 (PRIENAI-SCIENCES-INTERNET - trafficforce, UAB, LT)
      • 19084 (COLOUP, US)
      • 21769 (AS-COLOAM, US)
      • 17252 (AS2-COLOAM, US)
      • 205964 (CODE200-180 - UAB code200, LT)
      • 203061 (ITPROXIMUS - UAB code200, LT)
      • 205659 (CODE200-ISP2 - UAB code200, LT)
      • 36352 (AS-COLOCROSSING, US)
      • 394474 (WHITELABELCOLO393, US)
      • 18779 (EGIHOSTING, US)
  • The names above mostly come from asn