CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

June, 2023

2023-06-02

  • Spend some time testing my post_bitstreams.py script to update thumbnails for items on CGSpace
    • Interestingly I found an item with a JFIF thumbnail and another with a WebP thumbnail…
  • Meeting with Valentina, Stefano, and Sara about MODS metadata in CGSpace
    • They have experience with improving the MODS interface in MELSpace’s OAI-PMH for use with AGRIS and were curious if we could do the same in CGSpace
    • From what I can see we need to upgrade the MODS schema from 3.1 to 3.7 and then just add a bunch of our fields to the crosswalk

2023-06-04

  • Upgrade CGSpace to Ubuntu 22.04
    • The upgrade was mostly normal, but I had to unhold the openjdk package in order for do-release-upgrade to run:
# apt-mark unhold openjdk-8-jdk-headless:amd64 openjdk-8-jre-headless:amd64
  • In 2022-11 an upstream Java update broke the DSpace 6 Handle server so we will have to pin this again after the upgrade to Ubuntu 22.04
  • After the upgrade I made sure CGSpace was working, then proceeded to upgrade PostgreSQL from 12 to 14, like I did on DSpace Test in 2023-03
  • Then I had to downgrade OpenJDK to fix the Handle server using the ones I had previously downloaded for Ubuntu 20.04 because they no longer exist on Launchpad:
# dpkg -i openjdk-8-j*8u342-b07*.deb
  • Export CGSpace to fix missing Initiative collection mappings
  • Start a harvest on AReS
  • Work on the DSpace 7 migration a bit more
    • I decided to rebase and drop all the submission form edits because they conflict every time upstream changes!

2023-06-06

  • Fix some incorrect ORCID identifiers for an Alliance author on CGSpace
  • Export our list of ORCID identifiers, resolve them, and update the records in CGSpace:
$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml 2022-09-22-add-orcids.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2023-06-06-orcids.txt
$ ./ilri/resolve_orcids.py -i /tmp/2023-06-06-orcids.txt -o /tmp/2023-06-06-orcids-names.txt -d
$ ./ilri/update_orcids.py -i /tmp/2023-06-06-orcids-names.txt -db dspacetest -u dspace -p 'ffff' -m 247
  • Start working on updating the MODS schema in CGSpace from 3.1 to 3.8 based on Stefano and Salem’s work last year
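For reference, the grep and sort step used above to build the ORCID list can be sketched in Python. This is only a rough equivalent under a few assumptions: the function name `extract_orcids` is hypothetical, the identifiers in the usage example are made up, and the sort order follows Python string comparison rather than the shell locale.

```python
import re

# Same pattern as the grep -oE expression in the notes: four groups of
# four uppercase alphanumerics separated by hyphens.
ORCID_RE = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_orcids(text):
    """Return the unique ORCID-like identifiers in text, sorted (like sort -u)."""
    return sorted(set(ORCID_RE.findall(text)))


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    sample = (
        '<node label="Example Author: 0000-0000-0000-000X"/>\n'
        "Another Author: 0000-0000-0000-0001\n"
        "Example Author: 0000-0000-0000-000X\n"
    )
    for orcid in extract_orcids(sample):
        print(orcid)
```

The pattern only checks the shape of the identifier, not the ORCID checksum, so the resolve step is still needed to confirm that each one exists.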

2023-06-08

  • Continue working on the MODS schema mapping
  • Export CGSpace to check and update dcterms.extent fields
    • I normalized about 1,500 to use either “p. 1-6” or “5 p.” format
    • Also, I used this GREL expression to extract missing pages from the citation field: cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*(pp?\.\s?\d+[-–]\d+).*/)[0]
    • This was over 4,000 items with a format like “p. 1-6” and “pp. 1-6” in the citation
    • I used another GREL expression to extract another 5,000: cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*?(\d+\s+?[Pp]+\.).*/)[0]
    • This was for the format like “1 p.” (note we had to protect against the greedy .* in the beginning)
  • I also did some work to capture a handful of missing DOIs and ISSNs, but it was only about 100 items and I will have to wait until the 10,000+ above finish importing
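The two GREL expressions used above for pulling page extents out of citations can be approximated in Python. This is a minimal sketch, not what was actually run in OpenRefine: the function name `extract_extent` and the sample citations are hypothetical. It uses `re.search`, which returns the leftmost match and so avoids the greedy leading `.*` problem mentioned above; the first GREL expression, with its greedy `.*`, would instead pick the last range in the citation.

```python
import re

# Page ranges like "p. 1-6" or "pp. 1-6" (hyphen or en dash),
# taken from the first GREL expression.
RANGE_RE = re.compile(r"pp?\.\s?\d+[-–]\d+")
# Page counts like "5 p." or "12 pp.", taken from the second GREL expression.
COUNT_RE = re.compile(r"\d+\s+?[Pp]+\.")


def extract_extent(citation):
    """Return the first page range, else the first page count, else None."""
    for pattern in (RANGE_RE, COUNT_RE):
        match = pattern.search(citation)
        if match:
            return match.group(0)
    return None


if __name__ == "__main__":
    # Made-up citations, for illustration only.
    print(extract_extent("Author, A. 2020. Title. Journal 5(2): p. 1-6."))
    print(extract_extent("Author, A. 2020. Report title. Nairobi: ILRI. 125 p."))
```

Because the search is leftmost, a count such as "125 p." is captured whole rather than as "5 p.", which is the failure mode of a greedy `.*` prefix.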

2023-06-09

  • I see there are ~200 users in CGSpace that have registered with their CGIAR email address using a password as opposed to using Active Directory:
SELECT * FROM eperson WHERE email LIKE '%cgiar.org' AND netid IS NOT NULL AND password IS NOT NULL;
  • I am wondering if I should delete their passwords and tell them to log in using LDAP
    • As an initial test I will reset a few accounts including my own that have passwords and salts:
UPDATE eperson SET password=DEFAULT,salt=DEFAULT,digest_algorithm=DEFAULT WHERE netid IN ('axxxx', 'axxxx', 'bxxxx');
  • I also decided to reset passwords/salts for CGIAR accounts that have not been active since 2021 (1.5 years ago):
UPDATE eperson SET password=DEFAULT,salt=DEFAULT,digest_algorithm=DEFAULT WHERE email LIKE '%cgiar.org' AND netid IS NOT NULL AND password IS NOT NULL AND salt IS NOT NULL AND last_active < '2022-01-01'::date;
  • This was about 100 accounts…
    • I will wait some more time before I decide what to do about the more current ones
  • Add a few more ORCID identifiers to my list and tag them on CGSpace

2023-06-10

  • Export CGSpace to check for missing Initiative mappings
    • Start a harvest on AReS

2023-06-11

  • File an issue on DSpace for the Content-Disposition bug causing images to get downloaded instead of opened inline

2023-06-12

  • Export CGSpace to do some more work extracting volume and issue from citations for items where they are missing
    • I found and fixed over 7,000!
    • Then I found and extracted another 7,000 items with no extents (pages)
    • Then I replaced all occurrences of en dashes in page ranges with regular hyphens
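A small Python sketch of the en dash cleanup, assuming the goal is to touch only dashes that sit directly between two digits. The helper name `normalize_page_range` is hypothetical and this is not the transformation that was actually applied to the export.

```python
import re

# An en dash (U+2013) with a digit on each side, as in "p. 1–6".
EN_DASH_IN_RANGE = re.compile(r"(?<=\d)–(?=\d)")


def normalize_page_range(extent):
    """Replace en dashes between digits with plain hyphens."""
    return EN_DASH_IN_RANGE.sub("-", extent)


if __name__ == "__main__":
    print(normalize_page_range("p. 1–6"))
    print(normalize_page_range("5 p."))
```

Restricting the match to digit neighbours leaves en dashes in ordinary text untouched.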

2023-06-13

  • Last night I finally figured out how to do basic overrides to the simple item view in Angular
  • Add a handful of new ORCID identifiers to my list and tag them on CGSpace
  • Extract a list of all the proposed actions for CG Core output types and create a new issue for them on CG Core’s GitHub repository
  • Extract a list of all the proposed actions for CG Core output types for MARLO and create a new issue for them on MARLO’s GitHub repository
  • Meeting with Indira, Ryan, and Abenet to discuss plans for the DSpace 7 focus group

2023-06-14

2023-06-15

  • A lot more work on DSpace 7
    • I tested some pull requests and worked on the style of the item view and homepage

2023-06-16

2023-06-17

  • Export CGSpace to check for missing Initiative collection mappings
    • I also spent some time doing sanity checks on countries, regions, DOIs, and more
  • I lowercased all our AGROVOC keywords in dcterms.subject:
dspace=# BEGIN;
BEGIN
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 2392
dspace=*# COMMIT;
COMMIT
  • Start a harvest on AReS

2023-06-19

  • Today I started getting an error on DSpace 7 Test
    • The page loads, and then when it is almost done it goes blank (white) with this in the console:
ERROR DOMException: CSSStyleSheet.cssRules getter: Not allowed to access cross-origin stylesheet
  • I restarted Angular, but it didn’t fix it
  • The yarn test:rest script shows everything OK, and I haven’t changed anything recently…
  • I re-compiled the Angular UI using the default theme and it was the same…
  • I tried in Firefox Nightly and it works…
    • So it must be something related to the browser
    • I tried clearing all the session storage / cookies and refreshing and it worked
  • I switched back to the CGSpace theme and it happened again
    • I had a hunch it might be due to the GDPR cookie plugin in my browser, so I disabled that and then refreshed and it worked… hmmm
  • Upload thumbnails for about 42 IITA Journal Articles after resolving their DOIs and making sure they were not CC ND
    • I fixed a few bugs in get_scihub_pdfs.py in the process

2023-06-21

  • Stefano got back to me about the MODS OAI-PMH schema test on DSpace Test

2023-06-25

  • Export CGSpace to check for missing Initiative collection mappings
  • I wanted to start a harvest on AReS but I’ve seen the load on the server high for a few days and I’m not sure what it is
    • I decided to run all updates and reboot it since it’s Sunday anyway

2023-06-26

  • Since the new DSpace 7 will respect newlines in metadata fields I am curious to see how many of our abstracts have poor newlines
    • I exported CGSpace and used a custom text facet with this GREL expression in OpenRefine to count the number of newlines in each cell:
value.split('\n').length()
  • Also useful to check for general length of the text in the cell to make sure it’s a reasonably long string
    • I spent some time trying to find a pattern that I could use to identify “easy” targets, but there are so many exceptions that it will have to be done manually
    • I fixed a few dozen
  • Do a bit of work on thumbnails on CGSpace
  • I’m trying to troubleshoot the Discovery error I get on DSpace 7:
java.lang.NullPointerException: Cannot invoke "org.dspace.discovery.configuration.DiscoverySearchFilterFacet.getIndexFieldName()" because the return value of "org.dspace.content.authority.DSpaceControlledVocabularyIndex.getFacetConfig()" is null
  • I reverted to the default submission-forms.xml and the getFacetConfig() error goes away…
  • Kill some long-held locks on CGSpace PostgreSQL, as some users are complaining of slowness in archiving
  • I did some testing of the LDAP login issue related to groupmaps
  • I spent some time working on Angular and I figured out how to add a custom Angular component to show the UN SDG Goal icons on DSpace 7
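For the abstract newline check earlier in this entry, a rough Python analogue looks like the sketch below. The helper name `newline_stats` and the sample text are hypothetical. It counts newline characters directly, whereas the GREL facet counts split segments, so the numbers will not necessarily match what OpenRefine shows.

```python
def newline_stats(value):
    """Return (number of newline characters, total length) for a cell value."""
    return value.count("\n"), len(value)


if __name__ == "__main__":
    # Made-up abstract with hard line breaks in the middle of a sentence.
    abstract = "This abstract was pasted\nfrom a PDF and has\nhard line breaks."
    newlines, length = newline_stats(abstract)
    print(newlines, length)
```

Looking at both numbers together helps separate short cells with a stray newline from long abstracts with deliberate paragraph breaks.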

2023-06-27

  • I debugged the NullPointerException and somehow it disappeared
    • It seems to be related to the external controlled vocabularies in the submission form
    • I removed them all, then added them all back, and now the issue is solved… hmmmm
    • Oh no, now they are gone again, sigh…

2023-06-28

  • Spent a lot of time debugging the browse indexes
    • Looking at the DSpace 7 demo API I see the four default browse indexes from dspace.cfg and the one default srsc one that gets automatically enabled from the <vocabulary>srsc</vocabulary> in the submission-forms.xml
    • The same API call on my test DSpace 7 configuration results in the HTTP 500 I’ve been seeing for some time, and I am pretty sure it’s due to the automagic configuration of hierarchical browses based on the submission form
    • Yes, if I remove them all from my submission form then this works: http://localhost:8080/server/api/discover/browses
    • I went through each of our vocabularies and tested them one by one:
      • dcterms-subject: OK
      • dc-contributor-author: NO
      • cg-creator-identifier: NO
      • cg-contributor-affiliation: OK (and with facetType: "affiliation" in API response?!)
      • cg-contributor-donor: OK (facetType: "sponsorship")
      • cg-journal: NO
      • cg-coverage-subregion: NO
      • cg-species-breed: NO
    • Now I need to figure out what it is about those five that causes them to not work!
    • Ah, after debugging with someone on the DSpace Slack, I realized that DSpace expects these vocabularies to have corresponding indexes configured in discovery.xml, and they must be added as search filters AND sidebar facets.

2023-06-29

  • Lowercase the AGROVOC keywords in dcterms.subject again:
dspace=# BEGIN;
BEGIN
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 53
dspace=*# COMMIT;
  • After more discussion about the NullPointerException related to browse options, I filed an issue

2023-06-30

  • I added another custom component to display CGIAR Impact Area icons in the DSpace 7 test