CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

May, 2024

2024-05-01

  • I dumped all the CGSpace DOIs and resolved them with my crossref_doi_lookup.py script
    • Then I did some work to add missing abstracts (about 900!), volumes, issues, licenses, publishers, and types, etc
Read more →

January, 2024

2024-01-02

  • Work on preparation of new server for DSpace 7 migration
    • I’m not quite sure what we need to do for the Handle server
    • For now I just ran the dspace make-handle-config script and diffed it with the one from DSpace 6
    • I sent the bundle to the Handle admins to make sure it’s OK before we do the migration
  • Continue testing and debugging the cgspace-java-helpers on DSpace 7
  • Work on IFPRI ISNAR archive cleanup
Read more →

December, 2023

2023-12-01

  • There is still high load on CGSpace and I don’t know why
    • I don’t see a high number of sessions compared to previous days in the last few weeks
$ for file in dspace.log.2023-11-[23]*; do echo "$file"; grep -a -oE 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
dspace.log.2023-11-20
22865
dspace.log.2023-11-21
20296
dspace.log.2023-11-22
19688
dspace.log.2023-11-23
17906
dspace.log.2023-11-24
18453
dspace.log.2023-11-25
17513
dspace.log.2023-11-26
19037
dspace.log.2023-11-27
21103
dspace.log.2023-11-28
23023
dspace.log.2023-11-29
23545
dspace.log.2023-11-30
21298
  • Even the number of unique IPs is not very high compared to the last week or so:
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort | uniq | wc -l
17023
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.2.gz | sort | uniq | wc -l
17294
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.3.gz | sort | uniq | wc -l
22057
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.4.gz | sort | uniq | wc -l
32956
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.5.gz | sort | uniq | wc -l
11415
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.6.gz | sort | uniq | wc -l
15444
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.7.gz | sort | uniq | wc -l
12648
  • It doesn’t make any sense so I think I’m going to restart the server…
    • After restarting the server the load went down to normal levels… who knows…
  • I started trying to see how I’m going to generate the fake statistics for the Alliance bitstream that was replaced
    • I exported all the statistics for the owningItem now:
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/stats-export.json -f 'owningItem:b5862bfa-9799-4167-b1cf-76f0f4ea1e18' -k uid
  • Importing them into DSpace Test didn’t show the statistics in the Atmire module, but I see them in Solr…

2023-12-02

  • Export CGSpace to check for missing Initiative collection mappings
  • Start a harvest on AReS

2023-12-04

  • Send a message to Altmetric support because the item IWMI highlighted last month still doesn’t show the attention score for the Handle after I tweeted it several times weeks ago
  • Spent some time writing a Python script to fix the literal MaxMind City JSON objects in our Solr statistics
    • There are about 1.6 million of these, so I exported them using solr-import-export-json with the query city:com* but ended up finding many that have missing bundles, container bitstreams, etc:
city:com* AND -bundleName:[* TO *] AND -containerBitstream:[* TO *] AND -file_id:[* TO *] AND -owningItem:[* TO *] AND -version_id:[* TO *]
  • (Note the negation to find fields that are missing)
  • I don’t know what I want to do with these yet

2023-12-05

  • I finished the fix_maxmind_stats.py script and fixed 1.6 million records and imported them on CGSpace after testing on DSpace 7 Test
  • Altmetric said there was a glitch regarding the Handle and DOI linking and they successfully re-scraped the item page and linked them
    • They sent me a list of current production IPs and I notice that some of them are in our nginx bot network list:
$ for network in $(csvcut -c network /tmp/ips.csv | sed 1d | sort -u); do grepcidr $network ~/src/git/rmg-ansible-public/roles/dspace/files/nginx/bot-networks.conf; done
108.128.0.0/13 'bot';
46.137.0.0/16 'bot';
52.208.0.0/13 'bot';
52.48.0.0/13 'bot';
54.194.0.0/15 'bot';
54.216.0.0/14 'bot';
54.220.0.0/15 'bot';
54.228.0.0/15 'bot';
63.32.242.35/32     'bot';
63.32.0.0/14 'bot';
99.80.0.0/15 'bot'
  • I will remove those for now so that Altmetric doesn’t have any unexpected issues harvesting

2023-12-08

  • Finalized the script to generate Solr statistics for Alliance research Mirjam
    • The script is ilri/generate_solr_statistics.py
    • I generated ~3,200 statistics based on her records of the download statistics of that item and imported them on CGSpace
  • Did some work on the DSpace 7 submission form
  • Peter asked for lists of affiliations, investors, and publishers to do some cleanups
    • I generated a list from a CSV export instead of doing it based on a SQL dump…
$ csvcut -c 'cg.contributor.affiliation[en_US]' /tmp/initiatives.csv       \
  | sed -e 1d -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' -e '/^$/d'            \
  | sort | uniq -c | sort -hr                                              \
  | awk 'BEGIN { FS = "^[[:space:]]+[[:digit:]]+[[:space:]]+" } {print $2}'\
  | sed -e '1i cg.contributor.affiliation' -e 's/^\(.*\)$/"\1"/'           \
  > /tmp/2023-12-08-initiatives-affiliations.csv
  • Export a list of authors as well:
localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "dc.contributor.author", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 3 GROUP BY "dc.contributor.author" ORDER BY count DESC) to /tmp/2023-12-08-authors.csv WITH CSV HEADER;
COPY 102435

2023-12-11

  • Work on OpenRXV dependencies and podman a bit
  • Peter noticed that the statistics for this month are very very low on CGSpace
    • I don’t know what is going on, perhaps it is related to me adjusting the nginx config last week?
    • Ah, it’s probably because of the spider patterns I updated on 2023-11

2023-12-16

  • Export CGSpace to check for missing Initiative collection mappings
  • Start a harvest on AReS

2023-12-17

  • Pull latest master branch for OpenRXV and deploy on the server
    • I threw away some changes in the tree regarding the Angular base ref, and it broke AReS
    • So note to self: we need to set the base ref in frontend/Dockerfile before building!
  • Now Salem fixed the country map

2023-12-18

  • Work a bit on the IFPRI-ISNAR archive from Leigh
  • More work on the DSpace 7 home page

2023-12-19

  • More work on the DSpace 7 home page
  • The Alliance TIP team is testing deposits to the DSpace 7 REST API and getting an HTTP 500 error
    • In the DSpace logs I see this after they log in, create the item, and update the metadata:
2023-12-19 17:49:28,022 ERROR unknown unknown org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.

2023-12-20

  • The Alliance guys said that submitting via REST works now… sigh, so that’s just some old DSpace 5/6 REST API bug
  • I lowercased all our AGROVOC keywords in dcterms.subject in SQL:
dspace=# BEGIN;
BEGIN
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 462
dspace=*# COMMIT;
COMMIT

2023-12-25

  • Looking into Solr backups
    • Since we are not running in Solr Cloud mode we need to use the replication endpoint for Solr standalone
    • This works:
$ curl 'http://localhost:8983/solr/statistics/replication?command=backup'
{
  "responseHeader":{
    "status":0,
    "QTime":26},
  "status":"OK"}
  • Then I saw the size of the snapshot reach the size of the index…
# du -sh /var/solr/data/configsets/statistics/data/*
22G     /var/solr/data/configsets/statistics/data/index
16G     /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K    /var/solr/data/configsets/statistics/data/snapshot_metadata
# du -sh /var/solr/data/configsets/statistics/data/*
22G     /var/solr/data/configsets/statistics/data/index
20G     /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K    /var/solr/data/configsets/statistics/data/snapshot_metadata
# du -sh /var/solr/data/configsets/statistics/data/*
22G     /var/solr/data/configsets/statistics/data/index
21G     /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K    /var/solr/data/configsets/statistics/data/snapshot_metadata
# du -sh /var/solr/data/configsets/statistics/data/*
22G     /var/solr/data/configsets/statistics/data/index
22G     /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K    /var/solr/data/configsets/statistics/data/snapshot_metadata
  • Then I deleted the core and restored from the snapshot backup:
$ curl http://localhost:8983/solr/statistics/update -H "Content-type: text/xml" --data-binary '<delete><query>*:*</query></delete>'
$ curl http://localhost:8983/solr/statistics/update -H "Content-type: text/xml" --data-binary '<commit />'
$ curl 'http://localhost:8983/solr/statistics/replication?command=restore&name=statistics'
  • Interestingly the import worked fine, but created a new data index:
# du -sh /var/solr/data/configsets/statistics/data/*
4.0K    /var/solr/data/configsets/statistics/data/index.properties
22G     /var/solr/data/configsets/statistics/data/restore.20231225154626463
4.0K    /var/solr/data/configsets/statistics/data/snapshot_metadata
22G     /var/solr/data/configsets/statistics/data/snapshot.statistics
  • Not sure the implications of that—Solr uses the data just fine
  • I can surely use this for atomic Solr backups

2023-12-27

  • Delete duplicate metadata as described in my DSpace issue from last year: https://github.com/DSpace/DSpace/issues/8253
  • Do some other metadata cleanups on CGSpace
    • I also looked up our DOIs on Crossref to get some missing abstracts and correct licenses and dates
  • Some minor work on the CGSpace DSpace 7 theme to fix the navbar on mobile
  • Some work on the IFPRI ISNAR archive

2023-12-28

  • I started porting the cgspace-java-helpers to DSpace 7
  • Some work on the IFPRI ISNAR archive
    • I ended up going through most of the PDFs to get better dates and abstracts

2023-12-29

  • I created a new Hetzner server to replace the current DSpace 6 CGSpace next week when we migrate to DSpace 7
  • Interesting, I haven’t checked for content pointing to legacy domains in several years (!)
    • inurl:mahider.cgiar.org: 0 results on Google!
    • inurl:mahider.ilri.org: 2,100 results on Google
    • inurl:mahider.ilri.org inurl:https: 2 results on Google (!)
    • inurl:dspace.ilri.org: 1,390 results on Google
    • inurl:dspace.ilri.org inurl:https: 0 results on Google (!)
  • So it seems I can do away with the HTTPS virtual hosts finally
    • Well my current certificates expired on 2021-02-13 and nobody noticed… so…
Read more →

November, 2023

2023-11-01

  • Work a bit on the ETL pipeline for the CGIAR Climate Change Synthesis
    • I improved the filtering and wrote some Python using pandas to merge my sources more reliably

2023-11-02

  • Export CGSpace to check missing Initiative collection mappings
  • Start a harvest on AReS
Read more →

October, 2023

2023-10-02

  • Export CGSpace to check DOIs against Crossref
    • I found that Crossref’s metadata is in the public domain under the CC0 license
    • One interesting thing is the abstracts, which are copyrighted by the copyright owner, meaning Crossref cannot waive the copyright under the terms of the CC0 license, because it is not theirs to waive
    • We can be on the safe side by using only abstracts for items that are licensed under Creative Commons
Read more →

September, 2023

2023-09-02

  • Export CGSpace to check for missing Initiative collection mappings
  • Start a harvest on AReS
Read more →

August, 2023

2023-08-03

  • I finally got around to working on Peter’s cleanups for affiliations, authors, and donors from last week
    • I did some minor cleanups myself and applied them to CGSpace
  • Start working on some batch uploads for IFPRI
Read more →