CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

January, 2024

2024-01-02

  • Work on preparation of new server for DSpace 7 migration
    • I’m not quite sure what we need to do for the Handle server
    • For now I just ran the dspace make-handle-config script and diffed it with the one from DSpace 6
    • I sent the bundle to the Handle admins to make sure it’s OK before we do the migration
  • Continue testing and debugging the cgspace-java-helpers on DSpace 7
  • Work on IFPRI ISNAR archive cleanup

2024-01-03

  • I haven’t heard from the Handle admins so I’m preparing a backup solution using nginx streams
  • This seems to work in my simple tests (this must be outside the http {} block):
stream {
    upstream handle_tcp_9000 {
       server 188.34.177.10:9000;
    }

    server {
        listen 9000;
        proxy_connect_timeout 1s;
        proxy_timeout 3s;
        proxy_pass handle_tcp_9000;
    }
}
  • Here I forwarded a test TCP port 9000 from one server to another and was able to retrieve a test HTML page that was being served on the target
    • I will have to do TCP and UDP on port 2641, and TCP/HTTP on port 8000.
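    • A rough sketch of what that could look like with nginx streams, reusing the host from the test above (untested, so the real config will likely differ):
stream {
    upstream handle_tcp_2641 {
        server 188.34.177.10:2641;
    }

    upstream handle_udp_2641 {
        server 188.34.177.10:2641;
    }

    upstream handle_tcp_8000 {
        server 188.34.177.10:8000;
    }

    server {
        listen 2641;
        proxy_pass handle_tcp_2641;
    }

    server {
        listen 2641 udp;
        proxy_pass handle_udp_2641;
    }

    # TCP/HTTP, so this could alternatively be a normal proxy_pass in the http {} block
    server {
        listen 8000;
        proxy_pass handle_tcp_8000;
    }
}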
  • I did some more minor work on the IFPRI ISNAR archive
    • I got some PDFs from the UMN AgEcon search and fixed some metadata
    • Then I did some duplicate checking and found five items already on CGSpace

2024-01-04

  • Upload 692 items for the ISNAR archive to CGSpace: https://cgspace.cgiar.org/handle/10568/136192
  • Help Peter proof and upload 252 items from the 2023 Gender conference to CGSpace
  • Meeting with IFPRI to discuss their migration to CGSpace
    • We agreed to add two new fields, one for IFPRI project and one for IFPRI publication ranking
    • Most likely we will use cg.identifier.project as a general field and consolidate other project fields there
    • Not sure which field to use for the publication rank…

2024-01-05

  • Proof and upload 51 items in bulk for IFPRI
  • I did a big cleanup of user groups in anticipation of complaints about slow workflow tasks, etc., in DSpace 7
    • I removed ILRI editors from all the dozens of CCAFS community and collection groups, and I should do the same for the other CRPs since they have been closed for two years now

2024-01-06

  • Migrate CGSpace to DSpace 7

2024-01-07

  • High load on the server and UptimeRobot saying the frontend is flapping
    • I noticed tons of logs from pm2 in the systemd journal, so I disabled those in the systemd unit because they are available from pm2’s log directory anyway
    • I also noticed the same for Solr, so I disabled stdout for that systemd unit as well
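    • A drop-in override is one way to do that, something like this (unit names here are guesses):
# systemctl edit dspace-angular (and similarly for the Solr unit)
[Service]
StandardOutput=null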
  • I spent a lot of time bringing back the nginx rate limits we used in DSpace 6 and it seems to have helped
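    • For reference, the general shape of the limits (the zone name, rate, ports, and burst values here are illustrative, not our exact settings):
http {
    # one request-rate bucket per client IP
    limit_req_zone $binary_remote_addr zone=dspace_limit:16m rate=5r/s;

    server {
        listen 80;

        location / {
            # allow short bursts, then answer anything faster with HTTP 429
            limit_req zone=dspace_limit burst=20 nodelay;
            limit_req_status 429;
            proxy_pass http://localhost:4000;
        }
    }
}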
  • I see some client doing weird HEAD requests to search pages:
47.76.35.19 - - [07/Jan/2024:00:00:02 +0100] "HEAD /search/?f.accessRights=Open+Access%2Cequals&f.actionArea=Resilient+Agrifood+Systems%2Cequals&f.author=Burkart%2C+Stefan%2Cequals&f.country=Kenya%2Cequals&f.impactArea=Climate+adaptation+and+mitigation%2Cequals&f.itemtype=Brief%2Cequals&f.publisher=CGIAR+System+Organization%2Cequals&f.region=Asia%2Cequals&f.sdg=SDG+12+-+Responsible+consumption+and+production%2Cequals&f.sponsorship=CGIAR+Trust+Fund%2Cequals&f.subject=environmental+factors%2Cequals&spc.page=1 HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.2504.63 Safari/537.36"
  • I will add their network blocks (AS45102) and regenerate my list of bot networks:
$ wget https://asn.ipinfo.app/api/text/list/AS16276 \
     https://asn.ipinfo.app/api/text/list/AS23576 \
     https://asn.ipinfo.app/api/text/list/AS24940 \
     https://asn.ipinfo.app/api/text/list/AS13238 \
     https://asn.ipinfo.app/api/text/list/AS14061 \
     https://asn.ipinfo.app/api/text/list/AS12876 \
     https://asn.ipinfo.app/api/text/list/AS55286 \
     https://asn.ipinfo.app/api/text/list/AS203020 \
     https://asn.ipinfo.app/api/text/list/AS204287 \
     https://asn.ipinfo.app/api/text/list/AS50245 \
     https://asn.ipinfo.app/api/text/list/AS6939 \
     https://asn.ipinfo.app/api/text/list/AS45102 \
     https://asn.ipinfo.app/api/text/list/AS21859
$ cat AS* | sort | uniq | wc -l
4897
$ cat AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
$ wc -l /tmp/networks.txt
2017 /tmp/networks.txt
  • I’m surprised to see the number of networks reduced from my current ones… hmmm.
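    • Part of that is probably just mapcidr -a merging adjacent ranges into larger prefixes, for example:
$ printf '192.0.2.0/25\n192.0.2.128/25\n' | ~/go/bin/mapcidr -a
192.0.2.0/24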
  • I will also update my list of Bing networks:
$ ./ilri/bing-networks-to-ips.sh
$ ~/go/bin/mapcidr -a < /tmp/bing-ips.txt > /tmp/bing-networks.txt
$ wc -l /tmp/bing-networks.txt
250 /tmp/bing-networks.txt

2024-01-08

  • Export a list of publishers for Peter to select a subset to use as a controlled vocabulary:
localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "dcterms.publisher", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 178 GROUP BY "dcterms.publisher" ORDER BY count DESC) to /tmp/2024-01-publishers.csv WITH CSV HEADER;
COPY 4332
  • Address some feedback on DSpace 7 from users, including filing some issues on GitHub
  • The Alliance TIP team was having issues posting to one collection via the legacy DSpace 6 REST API
    • In the DSpace logs I see the same issue that they had last month:
ERROR unknown unknown org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.

2024-01-09

  • I restarted Tomcat to see if it helps the REST issue
  • After talking with Peter about publishers we decided to get a clean list of the top ~100 publishers and then make sure all CGIAR centers, Initiatives, and Impact Platforms are there as well
    • I exported a list from PostgreSQL, filtered it by count > 40 in OpenRefine, and extracted the metadata values:
$ csvcut -c dcterms.publisher ~/Downloads/2024-01-09-publishers4.csv | sed -e 1d -e 's/"//g' > /tmp/top-publishers.txt
  • Export a list of ORCID identifiers from PostgreSQL to look them up on ORCID and update our controlled vocabulary:
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=247) to /tmp/2024-01-09-orcid-identifiers.txt;
localhost/dspace7= ☘ \q
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/2024-01-09-orcid-identifiers.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2024-01-09-orcids.txt
$ ./ilri/resolve_orcids.py -i /tmp/2024-01-09-orcids.txt -o /tmp/2024-01-09-orcids-names.txt -d
  • Then I updated existing ORCID identifiers in CGSpace:
$ ./ilri/update_orcids.py -i /tmp/2024-01-09-orcids-names.txt -db dspace -u dspace -p bahhhh
  • Bizu seems to be having issues due to belonging to too many groups
    • I see some messages from Solr in the DSpace log:
2024-01-09 06:23:35,893 ERROR unknown unknown org.dspace.authorize.AuthorizeServiceImpl @ Failed getting getting community/collection admin status for bahhhhh@cgiar.org The search error is: Error from server at http://localhost:8983/solr/search: org.apache.solr.search.SyntaxError: Cannot parse 'search.resourcetype:Community AND (admin:eef481147-daf3-4fd2-bb8d-e18af8131d8c OR admin:g80199ef9-bcd6-4961-9512-501dea076607 OR admin:g4ac29263-cf0c-48d0-8be7-7f09317d50ec OR admin:g0e594148-a0f6-4f00-970d-6b7812f89540 OR admin:g0265b87a-2183-4357-a971-7a5b0c7add3a OR admin:g371ae807-f014-4305-b4ec-f2a8f6f0dcfa OR admin:gdc5cb27c-4a5a-45c2-b656-a399fded70de OR admin:ge36d0ece-7a52-4925-afeb-6641d6a348cc OR admin:g15dc1173-7ddf-43cf-a89a-77a7f81c4cfc OR admin:gc3a599d3-c758-46cd-9855-c98f6ab58ae4 OR admin:g3d648c3e-58c3-4342-b500-07cba10ba52d OR admin:g82bf5168-65c1-4627-8eb4-724fa0ea51a7 OR admin:ge751e973-697d-419c-b59b-5a5644702874 OR admin:g44dd0a80-c1e6-4274-9be4-9f342d74928c OR admin:g4842f9c2-73ed-476a-a81a-7167d8aa7946 OR admin:g5f279b3f-c2ce-4c75-b151-1de52c1a540e OR admin:ga6df8adc-2e1d-40f2-8f1e-f77796d0eecd OR admin:gfdfc1621-382e-437a-8674-c9007627565c OR admin:g15cd114a-0b89-442b-a1b4-1febb6959571 OR admin:g12aede99-d018-4c00-b4d4-a732541d0017 OR admin:gc59529d7-002a-4216-b2e1-d909afd2d4a9 OR admin:gd0806714-bc13-460d-bedd-121bdd5436a4 OR admin:gce70739a-8820-4d56-b19c-f191855479e4 OR admin:g7d3409eb-81e3-4156-afb1-7f02de22065f OR admin:g54bc009e-2954-4dad-8c30-be6a09dc5093 OR admin:gc5e1d6b7-4603-40d7-852f-6654c159dec9 OR admin:g0046214d-c85b-4f12-a5e6-2f57a2c3abb0 OR admin:g4c7b4fd0-938f-40e9-ab3e-447c317296c1 OR admin:gcfae9b69-d8dd-4cf3-9a4e-d6e31ff68731 OR ... admin:g20f366c0-96c0-4416-ad0b-46884010925f)': too many boolean clauses The search resourceType filter was: search.resourcetype:Community
  • There are 1,805 OR clauses in the full log!
    • We previously had this issue in 2020-01 and 2020-02 with DSpace 5 and DSpace 6
    • At the time the solution was to increase maxBooleanClauses in Solr and to disable access rights awareness, but I don’t think we want to do the latter now
    • I saw many users of Solr in other applications increasing this to obscenely high numbers, so I think we should be OK to increase it from 1024 to 2048
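    • A sketch of that change, assuming we raise Solr’s global cap via the solr.max.booleanClauses property in the Solr include script (the exact path depends on the install):
# e.g. in /etc/default/solr.in.sh
SOLR_OPTS="$SOLR_OPTS -Dsolr.max.booleanClauses=2048"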
  • Re-visiting the DSpace user groomer to delete inactive users
    • In 2023-08 I noticed that this was now possible in DSpace 7
    • As a test I tried to delete all users who have been inactive for six years (since January 9, 2018):
$ dspace dsrun org.dspace.eperson.Groomer -a -b 01/09/2018 -d
  • I tested it on DSpace 7 Test and it worked… I am debating running it on CGSpace…
    • I see we have almost 9,000 users:
$ dspace user -L > /tmp/users-before.txt
$ wc -l /tmp/users-before.txt
8943 /tmp/users-before.txt
  • I decided to do the same on CGSpace and it worked without errors
  • I finished working on the controlled vocabulary for publishers

2024-01-10

  • I spent some time deleting old groups on CGSpace
  • I looked into the use of the cg.identifier.ciatproject field and found there are only a handful of uses, with some even seeming to be a mistake:
localhost/dspace7= ☘ SELECT DISTINCT text_value AS "cg.identifier.ciatproject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 232 GROUP BY "cg.identifier.ciatproject" ORDER BY count DESC;
 cg.identifier.ciatproject │ count
───────────────────────────┼───────
 D145                      │     4
 LAM_LivestockPlus         │     2
 A215                      │     1
 A217                      │     1
 A220                      │     1
 A223                      │     1
 A224                      │     1
 A227                      │     1
 A229                      │     1
 A230                      │     1
 CLIMATE CHANGE MITIGATION │     1
 LIVESTOCK                 │     1
(12 rows)

Time: 240.041 ms
  • I think we can move those to a new cg.identifier.project if we create one
  • The cg.identifier.cpwfproject field is similarly sparse, but the CCAFS ones are widely used

2024-01-12

  • Export a list of affiliations to do some cleanup:
localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 211 GROUP BY "cg.contributor.affiliation" ORDER BY count DESC) to /tmp/2024-01-affiliations.csv WITH CSV HEADER;
COPY 11719
  • I first did some clustering and editing in OpenRefine; then I’ll import the corrections back into CGSpace and do another export
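    • To apply the corrections I’d use the usual metadata value fix script from the ilri scripts, along these lines (script name, flags, and CSV path from memory, so double-check them):
$ ./ilri/fix_metadata_values.py -i /tmp/2024-01-affiliations-fixed.csv -f cg.contributor.affiliation -t correct -m 211 -db dspace -u dspace -p 'fuuu'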
  • Troubleshooting the statistics pages that aren’t working on DSpace 7
    • On a hunch, I queried for Solr statistics documents that did not have an id matching the 36-character UUID pattern:
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&rows=0'
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"-id:/.{36}/",
      "rows":"0"}},
  "response":{"numFound":800167,"start":0,"numFoundExact":true,"docs":[]
  }}
  • They seem to come mostly from 2023 and 2024, with a smaller number from 2015, 2016, and 2020:
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&facet.range=time&facet=true&facet.range.start=2010-01-01T00:00:00Z&facet.range.end=NOW&facet.range.gap=%2B1YEAR&rows=0'
{
  "responseHeader":{
    "status":0,
    "QTime":13,
    "params":{
      "facet.range":"time",
      "q":"-id:/.{36}/",
      "facet.range.gap":"+1YEAR",
      "rows":"0",
      "facet":"true",
      "facet.range.start":"2010-01-01T00:00:00Z",
      "facet.range.end":"NOW"}},
  "response":{"numFound":800168,"start":0,"numFoundExact":true,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{},
    "facet_ranges":{
      "time":{
        "counts":[
          "2010-01-01T00:00:00Z",0,
          "2011-01-01T00:00:00Z",0,
          "2012-01-01T00:00:00Z",0,
          "2013-01-01T00:00:00Z",0,
          "2014-01-01T00:00:00Z",0,
          "2015-01-01T00:00:00Z",89,
          "2016-01-01T00:00:00Z",11,
          "2017-01-01T00:00:00Z",0,
          "2018-01-01T00:00:00Z",0,
          "2019-01-01T00:00:00Z",0,
          "2020-01-01T00:00:00Z",1339,
          "2021-01-01T00:00:00Z",0,
          "2022-01-01T00:00:00Z",0,
          "2023-01-01T00:00:00Z",653736,
          "2024-01-01T00:00:00Z",144993],
        "gap":"+1YEAR",
        "start":"2010-01-01T00:00:00Z",
        "end":"2025-01-01T00:00:00Z"}},
    "facet_intervals":{},
    "facet_heatmaps":{}}}
  • They seem to come from 2023-08 until now (so they started well before we migrated to DSpace 7):
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&facet.range=time&facet=true&facet.range.start=2023-01-01T00:00:00Z&facet.range.end=NOW&facet.range.gap=%2B1MONTH&rows=0'
{
  "responseHeader":{
    "status":0,
    "QTime":196,
    "params":{
      "facet.range":"time",
      "q":"-id:/.{36}/",
      "facet.range.gap":"+1MONTH",
      "rows":"0",
      "facet":"true",
      "facet.range.start":"2023-01-01T00:00:00Z",
      "facet.range.end":"NOW"}},
  "response":{"numFound":800168,"start":0,"numFoundExact":true,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{},
    "facet_ranges":{
      "time":{
        "counts":[
          "2023-01-01T00:00:00Z",1,
          "2023-02-01T00:00:00Z",0,
          "2023-03-01T00:00:00Z",0,
          "2023-04-01T00:00:00Z",0,
          "2023-05-01T00:00:00Z",0,
          "2023-06-01T00:00:00Z",0,
          "2023-07-01T00:00:00Z",0,
          "2023-08-01T00:00:00Z",27621,
          "2023-09-01T00:00:00Z",59165,
          "2023-10-01T00:00:00Z",115338,
          "2023-11-01T00:00:00Z",96147,
          "2023-12-01T00:00:00Z",355464,
          "2024-01-01T00:00:00Z",125429],
        "gap":"+1MONTH",
        "start":"2023-01-01T00:00:00Z",
        "end":"2024-02-01T00:00:00Z"}},
    "facet_intervals":{},
    "facet_heatmaps":{}}}
  • I see that we had 31,744 statistic events yesterday, and 799 have no id!
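    • For reference, a query along these lines shows yesterday’s documents without an id (the date math is a sketch):
$ curl -s -G 'http://localhost:8983/solr/statistics/select' \
    --data-urlencode 'q=-id:/.{36}/ AND time:[NOW-1DAY/DAY TO NOW/DAY]' \
    --data-urlencode 'rows=0'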
  • I asked about this on Slack and will file an issue on GitHub if someone else also finds such records
    • Several people said they have them, so it’s a bug of some sort in DSpace, not our configuration

2024-01-13

  • Yesterday alone we had 37,000 unique IPs making requests to nginx
    • I looked up the ASNs and found 6,000 IPs from this network in Amazon Singapore: 47.128.0.0/14
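    • One way to count those, assuming the day’s unique client IPs are saved in a file like /tmp/ips.txt (illustrative):
$ grepcidr 47.128.0.0/14 /tmp/ips.txt | wc -l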

2024-01-15

  • Investigating the CSS selector warning that I’ve seen in PM2 logs:
0|dspace-ui  | 1 rules skipped due to selector errors:
0|dspace-ui  |   .custom-file-input:lang(en)~.custom-file-label -> unmatched pseudo-class :lang
  • Checking how many unique IPs made requests to nginx over the past week:
# zcat -f /var/log/nginx/*access.log /var/log/nginx/*access.log.1 /var/log/nginx/*access.log.2.gz /var/log/nginx/*access.log.3.gz /var/log/nginx/*access.log.4.gz /var/log/nginx/*access.log.5.gz /var/log/nginx/*access.log.6.gz | awk '{print $1}' | sort -u | tee /tmp/ips.txt | wc -l
196493
  • Looking these IPs up, I see 18,000 coming from Comcast, 10,000 from AT&T, 4,110 from Charter, 3,500 from Cox, and dozens of other residential ISPs
    • I highly doubt these are home users browsing CGSpace… seems super fishy
    • Also, over 1,000 IPs from SpaceX Starlink in the last week. RIGHT
    • I will temporarily add a few new datacenter ISP network blocks to our rate limit:
      • 16509 Amazon-02
      • 701 UUNET
      • 8075 Microsoft
      • 15169 Google
      • 14618 Amazon-AES
      • 396982 Google Cloud
    • The load on the server immediately dropped
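    • Fetching and aggregating those new blocks would follow the same pattern as before, roughly (the output file name is illustrative):
$ wget https://asn.ipinfo.app/api/text/list/AS16509 \
     https://asn.ipinfo.app/api/text/list/AS701 \
     https://asn.ipinfo.app/api/text/list/AS8075 \
     https://asn.ipinfo.app/api/text/list/AS15169 \
     https://asn.ipinfo.app/api/text/list/AS14618 \
     https://asn.ipinfo.app/api/text/list/AS396982
$ cat AS* | sort -u | ~/go/bin/mapcidr -a > /tmp/rate-limit-networks.txt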

2024-01-17

  • It turns out AS701 (UUNET) is Verizon Business, which is used as an ISP for many staff at IFPRI
    • This was causing them to see HTTP 429 “too many requests” errors on CGSpace
    • I removed this ASN from the rate limiting

2024-01-18

  • Start looking at Solr stats again
    • I found one statistics record that has 22,000 of the same collection in owningColl and 22,000 of the same community in owningComm
    • The record is from 2015 and I think it would be easier to delete it than to fix it:
$ curl http://localhost:8983/solr/statistics/update -H "Content-type: text/xml" --data-binary '<delete><query>uid:3b4eefba-a302-4172-a286-dcb25d70129e</query></delete>'
  • Looking again, there are at least 1,000 of these, so I will need to come up with an actual solution to fix them
  • I’m noticing we have 1,800+ links to defunct resources on bioversityinternational.org in the cg.link.permalink field
    • I should ask Alliance if they have any plans to fix those, or upload them to CGSpace

2024-01-22

2024-01-23

  • Meeting with IWMI about ORCID integration and the DSpace API for use with WordPress
  • IFPRI sent me a list of their author ORCIDs to add to our controlled vocabulary
    • I joined them with our current list and resolved their names on ORCID and updated them in our database:
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml ~/Downloads/IFPRI\ ORCiD\ All.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2024-01-23-orcids.txt
$ ./ilri/resolve_orcids.py -i /tmp/2024-01-23-orcids.txt -o /tmp/2024-01-23-orcids-names.txt -d
$ ./ilri/update_orcids.py -i /tmp/2024-01-23-orcids-names.txt -db dspace -u dspace -p fuuu
  • This adds about 400 new identifiers to the controlled vocabulary
  • I consolidated our various project identifier fields for closed programs into one cg.identifier.project:
    • cg.identifier.ccafsproject
    • cg.identifier.ccafsprojectpii
    • cg.identifier.ciatproject
    • cg.identifier.cpwfproject
  • I prefixed the existing 2,644 metadata values with “CCAFS”, “CIAT”, or “CPWF” so we can figure out where they came from if need be, and deleted the old fields from the metadata registry
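    • One way to do the prefixing directly in PostgreSQL, sketched for the CIAT field (232, from the query above); the exact prefix text and the metadata_field_id of the new cg.identifier.project field (253 here) are illustrative:
localhost/dspace7= ☘ UPDATE metadatavalue SET text_value = 'CIAT ' || text_value WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 232;
localhost/dspace7= ☘ UPDATE metadatavalue SET metadata_field_id = 253 WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 232;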

2024-01-26

  • Minor work on dspace-angular to clean up component styles
  • Add cg.identifier.publicationRank to CGSpace metadata registry and submission form

2024-01-29

  • Rework the nginx bot and network limits slightly, dropping some old patterns and networks as well as Google
    • The Google Scholar team contacted me to ask why their requests were timing out (well…)