2024-11-11
- An IP in India is making a huge number of requests this morning with a normal-looking user agent:
# awk '{print $1}' /var/log/nginx/api-access.log | sort | uniq -c | sort -h | tail -n 40
...
513743 49.207.196.249
- They are using this user agent:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.3
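- A minimal sketch of how the user agents for a single IP could be tallied, assuming the log uses nginx's "combined" format (the function name and regular expression are my own, not part of any existing tooling):

```python
import re
from collections import Counter

# Assumes nginx "combined" format: client IP, identity, user, [timestamp],
# "request", status, bytes, "referer", "user agent".
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def user_agents_for_ip(lines, ip):
    """Count the user agents sent by one client IP (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        # Lines that do not parse as "combined" format are skipped silently
        if match and match.group("ip") == ip:
            counts[match.group("ua")] += 1
    return counts
```

- Passing it an open file handle for /var/log/nginx/api-access.log and the IP above, then calling most_common() on the result, would list that client's user agents by request count.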
2024-11-16
- I switched CGSpace to Node.js v20 since I’ve been using it in dev and test for months
2024-11-18
- I see a bot (188.34.177.10) on Hetzner has made 35,000 requests this morning and is pretending to be Googlebot, GoogleOther, etc.
- Google also publishes the IP ranges its crawlers use, which makes it possible to verify real Googlebot traffic: https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot
- Our nginx config doesn’t rate limit the API but perhaps that needs to change…
- In DSpace 4/5/6 the API was separate from the user interface, so we didn’t need to enforce rate limits on it because we encouraged people to use the API instead of scraping the UI
- In DSpace 7 the API is used by the frontend, so perhaps it should have the same IP- and UA-based rate limiting
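- A rough sketch of checking a claimed Googlebot IP against the ranges Google publishes. It assumes the published JSON has a top-level "prefixes" list whose entries carry either an "ipv4Prefix" or an "ipv6Prefix" key, and the function names are hypothetical. The sample data is hand-written for illustration and is not the current list:

```python
import ipaddress

# Illustrative sample only; the real list should be downloaded from Google.
SAMPLE_RANGES = {
    "prefixes": [
        {"ipv4Prefix": "66.249.64.0/27"},
        {"ipv6Prefix": "2001:4860:4801:10::/64"},
    ]
}


def googlebot_networks(ranges):
    """Turn the assumed JSON structure into a list of network objects."""
    networks = []
    for prefix in ranges["prefixes"]:
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def is_listed_ip(ip, networks):
    """Return True if the IP falls inside any of the given networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


if __name__ == "__main__":
    networks = googlebot_networks(SAMPLE_RANGES)
    # The Hetzner IP claiming to be Googlebot is not in the sample ranges
    print(is_listed_ip("188.34.177.10", networks))
```

- A static list goes stale as Google updates its ranges, so the JSON would need to be refreshed periodically. The same page also describes a reverse DNS check as an alternative.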
2024-11-19
- I notice 10,000 requests by a new bot yesterday:
20.38.174.208 - - [18/Nov/2024:07:02:50 +0100] "GET /server/oai/request?verb=ListRecords&resumptionToken=oai_dc%2F2024-10-18T13%3A00%3A49Z%2F%2F%2F400 HTTP/1.1" 503 190 "-" "Laminas_Http_Client"
- Seems to be the HTTP client from Laminas, a PHP framework (the successor to Zend Framework)
- Yesterday one IP in Argentina made nearly 1,000,000 requests using a normal user agent: 181.4.143.40
- 188.34.177.10 ended up making 700,000 requests using various Googlebot, GoogleOther, and even normal Chrome user agents