CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

November, 2024

2024-11-11

  • An IP in India is making a huge number of requests this morning with a normal-looking user agent:
# awk '{print $1}' /var/log/nginx/api-access.log | sort | uniq -c | sort -h | tail -n 40
...
513743 49.207.196.249
  • They are using this user agent:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.3
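The same per-IP tally can be sketched in Python, which is handy when the counts need further processing. This is only a rough equivalent of the awk pipeline above: it assumes the client IP is the first whitespace-separated field of each log line, and the function name `top_client_ips` is made up for illustration.

```python
from collections import Counter


def top_client_ips(lines, n=40):
    """Return the n busiest client IPs as (ip, count) pairs, busiest first.

    Assumes the client IP is the first whitespace-separated field of
    each line, as in the awk '{print $1}' command above.
    """
    counts = Counter()
    for line in lines:
        fields = line.split(None, 1)
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts.most_common(n)


# Usage (log path taken from the awk command above):
# with open("/var/log/nginx/api-access.log") as log:
#     for ip, hits in top_client_ips(log):
#         print(hits, ip)
```

Note that `most_common` lists the busiest IP first, whereas `sort | tail` prints it last.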

2024-11-16

  • I switched CGSpace to Node.js v20 since I’ve been using it in dev and test for months

2024-11-18

  • I see a bot (188.34.177.10) on Hetzner has made 35,000 requests this morning and is pretending to be Googlebot, GoogleOther, etc
    • Google also publishes its IP ranges: https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot
    • Our nginx config doesn’t rate limit the API but perhaps that needs to change…
    • In DSpace 4/5/6 the API was separate from the user interface, so we didn’t enforce rate limits there; we encouraged people to use the API instead of scraping the UI
    • In DSpace 7 the frontend itself uses the API, so perhaps the API should get the same IP- and UA-based rate limiting
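One way to catch a client that only claims to be Googlebot is to test its IP against the published ranges. The sketch below assumes the range list has already been downloaded and parsed into a dict with a `prefixes` list whose entries carry an `ipv4Prefix` or `ipv6Prefix` key; that shape is my assumption about the published JSON and should be checked against the real file. The function name is hypothetical and the sample ranges are reserved documentation networks, not Google's.

```python
import ipaddress


def ip_in_published_ranges(ip, ranges_doc):
    """Return True if ip falls inside any CIDR prefix in ranges_doc.

    ranges_doc is assumed to look like:
        {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]}
    """
    addr = ipaddress.ip_address(ip)
    for prefix in ranges_doc.get("prefixes", []):
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        # An IPv4 address is never "in" an IPv6 network, and vice versa
        if cidr and addr in ipaddress.ip_network(cidr):
            return True
    return False


# Placeholder ranges for illustration only (RFC 5737 / RFC 3849
# documentation networks), NOT Google's real ranges
example_ranges = {
    "prefixes": [
        {"ipv4Prefix": "192.0.2.0/24"},
        {"ipv6Prefix": "2001:db8::/32"},
    ]
}

print(ip_in_published_ranges("188.34.177.10", example_ranges))
```

A request with a Googlebot user agent from an IP that fails this check would be a candidate for blocking or stricter rate limiting.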

2024-11-19

  • I notice 10,000 requests by a new bot yesterday:
20.38.174.208 - - [18/Nov/2024:07:02:50 +0100] "GET /server/oai/request?verb=ListRecords&resumptionToken=oai_dc%2F2024-10-18T13%3A00%3A49Z%2F%2F%2F400 HTTP/1.1" 503 190 "-" "Laminas_Http_Client"
  • The user agent appears to be the HTTP client from Laminas, a PHP framework (formerly Zend Framework)
  • Yesterday one IP in Argentina made nearly 1,000,000 requests using a normal user agent: 181.4.143.40
  • 188.34.177.10 ended up making 700,000 requests using various Googlebot, GoogleOther, and even normal Chrome user agents
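To see which user agents a single IP cycles through, the log lines can be grouped by IP and user agent. This is a minimal sketch that assumes the nginx "combined" layout seen in the Laminas_Http_Client log line above; the regular expression and the name `user_agents_by_ip` are illustrative, not part of any existing tooling.

```python
import re
from collections import Counter, defaultdict

# ip - user [time] "request" status bytes "referer" "user agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def user_agents_by_ip(lines):
    """Map each client IP to a Counter of the user agents it sent."""
    agents = defaultdict(Counter)
    for line in lines:
        match = LOG_RE.match(line)
        if match:  # silently skip lines in any other format
            agents[match.group("ip")][match.group("ua")] += 1
    return agents


# Usage, given any iterable of log lines:
# for ua, hits in user_agents_by_ip(lines)["188.34.177.10"].most_common():
#     print(hits, ua)
```

An IP that sends several unrelated user agents, as 188.34.177.10 did, stands out immediately in this view.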