CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

November, 2023

2023-11-01

  • Work a bit on the ETL pipeline for the CGIAR Climate Change Synthesis
    • I improved the filtering and wrote some Python using pandas to merge my sources more reliably

2023-11-02

  • Export CGSpace to check missing Initiative collection mappings
  • Start a harvest on AReS
  • IFPRI contacted us about importing their Slideshare presentations to CGSpace
    • There are ~1,700 of them and date back to as early as 2008
    • I did a quick cleanup of the metadata export from Slideshare (including tagging with some AGROVOC in OpenRefine) and uploaded to DSpace Test

2023-11-03

  • A little bit of work on the CGIAR Climate Change Synthesis
  • Discuss some CGSpace migration plans with Leigh from IFPRI
    • For their Slideshare content we agreed:
      • Exclude private
      • Exclude deleted
      • Exclude non presentation types
      • Exclude duplicates within the collection for now until we can sort them out
    • That leaves about 1,500 items out of the 1,700
  • I did a duplicate check against CGSpace and found 44 items with 1.0 similarity so I removed those

2023-11-04

  • Export CGSpace to check for missing Initiative collection mappings
  • I ran through the list of potential duplicates on the IFPRI Slideshare presentations

2023-11-05

  • Work with Salem to migrate AReS to the new version

2023-11-07

  • DSpace 7 Test went down and there is very high load on the server
    • I saw very high load from Java but didn’t have time to check exactly what was wrong so I just rebooted the host
    • A few hours after restarting the system went down again, with very high load from Java again
    • I see lots of messages like this in the Tomcat log:
tomcat9[732]: [9955.662s][info   ][gc] GC(6291) Pause Full (G1 Compaction Pause) 4085M->4080M(4096M) 677.251ms
tomcat9[732]: [9955.662s][info   ][gc] GC(6290) Concurrent Mark Cycle 677.558ms
tomcat9[732]: [9955.666s][info   ][gc] GC(6292) To-space exhausted
  • I see some messages in dspace.log about heap space:
Caused by: java.lang.OutOfMemoryError: Java heap space
  • I will increase Tomcat’s heap from 4096m to 5120m
    • A few hours later it happened again, so I increased the heap from 5120m to 6144m
    • Not sure what’s going on today…
  • I tested moving the CGIAR Fund Council community to the CGIAR historic archive on DSpace Test:
$ dspace community-filiator -r -p 10568/83389 -c 10947/2516
$ dspace community-filiator -s -p 10947/2515 -c 10947/2516
$ dspace index-discovery -r 10947/2516
$ dspace index-discovery -r 10947/2515
$ dspace index-discovery -r 10568/83389
$ dspace index-discovery
  • I think this is the minimal we can do to avoid a full Discovery reindex which is very expensive
  • I helped Maria resize some massive PDFs for upload to CGSpace using GhostScript prepress mode as I had done before in September, 2023,

2023-11-08

  • DSpace 7 Test has very high load again and I see more Java heap space errors in the log
# grep -c 'Caused by: java.lang.OutOfMemoryError: Java heap space' /home/dspace7/log/dspace.log-2023-11-07 
35
# grep -c 'Caused by: java.lang.OutOfMemoryError: Java heap space' /home/dspace7/log/dspace.log
7
  • I don’t know what is happening… I will increase the heap size from 6144m to 7168m again…
  • I did some work on the value mappings in AReS
    • I wanted to test the import/export feature, and found that I could get a JSON and convert it to CSV for manipulation in OpenRefine
    • Importing duplicates records, so I deleted and re-created the index in Elasticsearch first
    • Then I started a new harvest on AReS to make sure the mappings are applied

2023-11-09

  • Ryan asked me for help uploading a large PDF to CGSpace
    • I tried my usual GhostScript preprint invocation and found the size decrease significantly, but some minor artifacts appeared in the images
    • Interestingly, the GhostScript docs mention that prepress doesn’t give the best results:

Please be aware that the /prepress setting does not indicate the highest quality conversion. Using any of these presets will involve altering the input, and as such may result in a PDF of poorer quality (compared to the input) than simply using the defaults. The ‘best’ quality (where best means closest to the original input) is obtained by not setting this parameter at all (or by using /default).

$ gs -sOutputFile=137166-default-dct.pdf -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH -dPDFSETTINGS=/default -c "<< /ColorACSImageDict << /VSamples [ 1 1 1 1 ] /HSamples [ 1 1 1 1 ] /QFactor 0.08 /Blend 1 >> /ColorImageDownsampleType /Bicubic /ColorConversionStrategy /LeaveColorUnchanged >> setdistillerparams" -f 137166.pdf
  • This looks much better, and is still much smaller than the original
    • Also, I used pdfimages to extract all the images from the original and the one above and found:
$ du -sh images-*
886M    images-default-dct
1012M   images-original

2023-11-10

  • I finished checking the IFPRI Slideshare records and added some tagging of countries, regions, and CRPs and then uploaded them to CGSpace

2023-11-11

  • Salem fixed a bug on OpenRXV that was splitting country values by “,” before matching them with ISO countries
  • I exported CGSpace to check for missing Initiative collection mappings
  • Start a fresh harvest on AReS

2023-11-16

  • Discuss mapping ICARDA outputs from Initiatives to ICARDA collections on CGSpace
    • I added MEL’s CGSpace user to the administrator group of a handful of collections
    • I also did a batch mapping of 274 existing Initiative outputs from ICARDA to the relevant collections

2023-11-18

  • Export CGSpace to check for missing Initiative collection mappings
  • Start a harvest on AReS

2023-11-22

2023-11-23

  • Brian King was asking me how many PDFs we had in CGSpace so I got a rough estimate using this SQL query:
localhost/dspace7= ☘ SELECT COUNT(uuid) FROM bitstream WHERE bitstream_format_id=(SELECT bitstream_format_id FROM bitstreamformatregistry WHERE mimetype='application/pdf');
 count 
───────
 47818
(1 row)
  • It’s been some time since I looked at our Solr statistics to find new bots
  • I ran my old check-spider-hits.sh script with a list of bots from our local overrides to purge hits from Solr:
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 30 hits from ubermetrics in statistics
Purging 59 hits from curb in statistics
Purging 36 hits from bitdiscovery in statistics
Purging 87 hits from omgili in statistics
Purging 47 hits from Vizzit in statistics
Purging 109 hits from Java\/17-ea in statistics
Purging 40 hits from AdobeUxTechC4-Async in statistics
Purging 21 hits from ZaloPC-win32-24v473 in statistics
Purging 21 hits from nbertaupete95 in statistics
Purging 52 hits from Scoop\.it in statistics
Purging 16 hits from WebAPIClient in statistics
Purging 241 hits from RStudio in statistics
Purging 1255 hits from ^MEL in statistics
Purging 47850 hits from GuzzleHttp in statistics
Purging 8714 hits from Owler in statistics
Purging 1083 hits from newspaperjs in statistics
Purging 369 hits from ^Chrome$ in statistics
Purging 1474 hits from curl in statistics

Total number of bot hits purged: 61504
  • I also noticed 35,000 requests over the past few years from lowercase user agents, which is definitely weird, for example:
    • mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/89.0.4389.90 safari/537.36
    • mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/90.0.4430.93 safari/537.36
  • I’m gonna add those to our overrides and purge them:
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 35816 hits from ^mozilla in statistics

Total number of bot hits purged: 35816

2023-11-30

  • Minor updates to our OAI MODS crosswalk
    • Stefano found a minor markup issue with our alternative titles (<titleInfo> tag)
  • Very high load on CGSpace since after lunch
    • I killed some locks that had been stuck for a few hours