CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

October, 2023

2023-10-02

  • Export CGSpace to check DOIs against Crossref
    • I found that Crossref’s metadata is in the public domain under the CC0 license
    • One interesting thing is the abstracts, which are copyrighted by the copyright owner, meaning Crossref cannot waive the copyright under the terms of the CC0 license, because it is not theirs to waive
    • We can be on the safe side by using only abstracts for items that are licensed under Creative Commons

September, 2023

2023-09-02

  • Export CGSpace to check for missing Initiative collection mappings
  • Start a harvest on AReS

August, 2023

2023-08-03

  • I finally got around to working on Peter’s cleanups for affiliations, authors, and donors from last week
    • I did some minor cleanups myself and applied them to CGSpace
  • Start working on some batch uploads for IFPRI

July, 2023

2023-07-01

  • Export CGSpace to check for missing Initiative collection mappings
  • Start harvesting on AReS

2023-07-02

  • Minor edits to the crossref_doi_lookup.py script while running some checks on 22,000 CGSpace DOIs

2023-07-03

  • I analyzed the licenses declared by Crossref and found with high confidence that ~400 of ours were incorrect
    • I took the more accurate ones from Crossref and updated the items on CGSpace
    • I took a few hundred ISBNs as well for where we were missing them
    • I also tagged ~4,700 items with missing licenses as “Copyrighted; all rights reserved” based on their Crossref license status being TDM, mostly from Elsevier, Wiley, and Springer
    • Checking a dozen or so manually, I confirmed that if Crossref only has a TDM license then it’s usually copyrighted (could still be open access, but we can’t tell via Crossref)
    • I would be curious to write a script to check the Unpaywall API for open access status…
    • In the past I found that their license status was not very accurate, but the open access status might be more reliable
  • More minor work on the DSpace 7 item views
    • I learned some new Angular template syntax
    • I created a custom component to show Creative Commons licenses on the simple item page
    • I also decided that I don’t like the Impact Area icons as a component because they don’t have any visual meaning
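The TDM-only rule from the Crossref license analysis earlier in this entry could be sketched roughly as below. This is only an illustration, not the logic in crossref_doi_lookup.py: it assumes a work record from the Crossref REST API whose `license` list carries `URL` and `content-version` keys, and the function name `summarize_licenses` and the category labels are made up here.

```python
def summarize_licenses(work: dict) -> str:
    """Classify a Crossref work record by its declared licenses.

    Returns one of "creative-commons", "tdm-only", "other", or "none".
    The categories are hypothetical labels for this sketch.
    """
    licenses = work.get("license", [])
    if not licenses:
        return "none"

    urls = [entry.get("URL", "") for entry in licenses]
    versions = {entry.get("content-version", "unspecified") for entry in licenses}

    # A Creative Commons URL on any license entry wins over everything else
    if any("creativecommons.org" in url for url in urls):
        return "creative-commons"

    # Every declared license applies only to text and data mining
    if versions == {"tdm"}:
        return "tdm-only"

    return "other"
```

A record that comes back as "tdm-only" would be a candidate for the “Copyrighted; all rights reserved” tag, subject to the same manual spot checks described above, since the item could still be open access.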

2023-07-04

  • Focus group meeting with CGSpace partners about DSpace 7
  • I added a themed file selection component to the CGSpace theme
    • It displays the bitstream description instead of the file name, just like we did in DSpace 6 XMLUI
  • I added a custom component to show share icons

2023-07-05

  • I spent some time trying to update OpenRXV from Angular 9 to 10 to 11 to 12 to 13
    • Most things work but there are some minor bugs it seems
  • Mishell from CIP emailed me to say she was having problems approving an item on CGSpace
    • Looking at PostgreSQL, I saw a dozen or so locks that were several hours old, and some over a day old, so I killed those processes and told her to try again

2023-07-06

  • Types meeting
  • I wrote a Python script to check Unpaywall for some information about DOIs
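The script itself is not shown in these notes. As a rough idea of what such a lookup can involve, here is a minimal sketch against the public Unpaywall v2 endpoint. The helper names `fetch_unpaywall` and `summarize_open_access` are hypothetical, and the email address passed to the API is a placeholder the caller must supply.

```python
import json
import urllib.parse
import urllib.request

UNPAYWALL_API = "https://api.unpaywall.org/v2/"


def fetch_unpaywall(doi: str, email: str) -> dict:
    """Fetch the Unpaywall record for one DOI (performs a network request)."""
    query = urllib.parse.urlencode({"email": email})
    url = f"{UNPAYWALL_API}{urllib.parse.quote(doi)}?{query}"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def summarize_open_access(record: dict) -> dict:
    """Reduce an Unpaywall record to the fields of interest for access status."""
    # best_oa_location is null for closed items, so fall back to an empty dict
    location = record.get("best_oa_location") or {}
    host_type = location.get("host_type")
    version = location.get("version")
    return {
        "is_oa": bool(record.get("is_oa")),
        "oa_status": record.get("oa_status"),
        "host_type": host_type,
        "version": version,
        # Published version hosted by the publisher itself
        "publisher_published": host_type == "publisher" and version == "publishedVersion",
    }
```

Keeping the summary step separate from the request makes it possible to re-run the analysis on saved responses without querying the API again.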

2023-07-07

  • Continue exploring Unpaywall data for some of our DOIs
    • In the past I’ve found their licensing information to not be very reliable (preferring Crossref), but I think their open access status is more reliable, especially when the provider is listed as being the publisher
    • Even so, sometimes the version can be “acceptedVersion”, which is presumably the author’s version, as opposed to the “publishedVersion”, which means it’s available as open access on the publisher’s website
    • I did some quality assurance and found ~100 that were marked as Limited Access, but should have been Open Access, and fixed a handful of licenses
  • Delete duplicate metadata as described in my DSpace issue from last year: https://github.com/DSpace/DSpace/issues/8253
  • Start working on some statistics on AGROVOC usage for my presentation next week
    • I used the following SQL query to dump values from all subject fields and lower case them:
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119)) to /tmp/2023-07-07-cgspace-subjects.csv WITH CSV HEADER;
COPY 26443
Time: 2564.851 ms (00:02.565)
  • Then I extracted the subjects and looked them up against AGROVOC:
$ csvcut -c subject /tmp/2023-07-07-cgspace-subjects.csv | sed '1d' > /tmp/2023-07-07-cgspace-subjects.txt
$ ./ilri/agrovoc_lookup.py -i /tmp/2023-07-07-cgspace-subjects.txt -o /tmp/2023-07-07-cgspace-subjects-results.csv
  • I did some more tests with Angular 13 on OpenRXV and found out why the repository type dropdown wasn’t working
    • It was because of a missing 1-line JSON file in the data directory, which is runtime data, not code
    • I copied the data directory from the production server and rebuilt, and the site is working well now
    • I did a full harvest with plugins and it worked!
    • So it seems Angular 13.4.0 will work, yay

2023-07-08

  • Export CGSpace to check for missing Initiative collection mappings
    • Start a harvest on AReS
  • The AGROVOC lookup finished, so I checked the number of matches:
$ csvgrep -c 'match type' -r '^.+$' ~/Downloads/2023-07-07-cgspace-subjects-resolved.csv | sed 1d | wc -l
12528
  • So that’s 12,528 out of 26,443 unique terms (47.4%)
  • I did a LOT of work on the OpenRXV frontend build dependencies to bring them more in line with Angular 13
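The AGROVOC match count and percentage earlier in this entry can also be computed in one pass. The sketch below assumes the results file has one row per looked-up term and a `match type` column that is empty when nothing matched, as the csvgrep pattern suggests. The function name `match_rate` is hypothetical.

```python
import csv


def match_rate(csv_file, column="match type"):
    """Return (matched, total) for rows whose `column` is non-empty."""
    matched = total = 0
    for row in csv.DictReader(csv_file):
        total += 1
        # Same idea as csvgrep -r '^.+$': any non-empty value counts as a match
        if row.get(column):
            matched += 1
    return matched, total


def as_percentage(matched, total):
    """Format a match rate with one decimal place."""
    return f"{100 * matched / total:.1f}%"
```

For example, `as_percentage(12528, 26443)` formats the figures reported above.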

2023-07-10

  • I did a lot more work on OpenRXV to test and update dependencies
  • I deployed the latest version on the production server

2023-07-12

  • CGSpace upgrade meeting with Americas and Africa group

2023-07-13

  • Michael Victor asked me to help Aditi extract some information from CGSpace
    • She was interested in journal articles published between 2018 and 2023 with a range of subjects related to drought, flooding, resilience, etc
    • I used an advanced query with some AGROVOC terms:
dcterms.issued:[2018 TO 2023] AND dcterms.type:"Journal Article" AND (dcterms.subject:flooding OR dcterms.subject:flood OR dcterms.subject:"extreme weather events" OR dcterms.subject:drought OR dcterms.subject:"drought resistance" OR dcterms.subject:"drought tolerance" OR dcterms.subject:"soil salinity" OR dcterms.subject:"pests of plants" OR dcterms.subject:pests OR dcterms.subject:heat OR dcterms.subject:fertilizers OR dcterms.subject:"fertilizer technology" OR dcterms.subject:"rice fields" OR dcterms.subject:"landscape conservation" OR dcterms.subject:"landscape restoration" OR dcterms.subject:livestock)
  • Interestingly, some variations of this same exact query produce no search results, and I see this error in the DSpace log:
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dcterms.issued:[2018 TO 2023] AND dcterms.type:"Journal Article" AND (dcterms.subject:flooding OR dcterms.subject:flood OR dcterms.subject:"extreme weather events" OR dcterms.subject:drought OR dcterms.subject:"drought resistance" OR dcterms.subject:"drought tolerance" OR dcterms.subject:"soil salinity" OR dcterms.subject:"pests of plants" OR dcterms.subject:pests OR dcterms.subject:heat OR dcterms.subject:fertilizers OR dcterms.subject:"fertilizer technology" OR dcterms.subject:"rice fields" OR dcterms.subject:livestock OR dcterms.subject:"landscape conservation" OR dcterms.subject:"landscape restoration\"\)': Lexical error at line 1, column 617.  Encountered: <EOF> after : "\"landscape restoration\\\"\\)"
  • It seems to be when there is a quoted search term at the end of the parenthesized group
    • For what it’s worth this same query worked fine on DSpace 7.6
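Since the failure only seemed to appear when a quoted term came last in the parenthesized group, one possible stopgap is to generate the group so that a single-word (unquoted) term always ends it. The sketch below is a workaround based on that observation only, not a fix for the underlying parsing problem. The function names are hypothetical, and terms containing double quotes are not escaped.

```python
def quote_term(term: str) -> str:
    """Wrap multi-word terms in double quotes, leave single words bare."""
    return f'"{term}"' if " " in term else term


def build_subject_query(terms, field="dcterms.subject"):
    """Build an OR group with multi-word terms first and single words last."""
    # sorted() is stable, so the relative order within each group is kept
    ordered = sorted(terms, key=lambda term: " " not in term)
    clauses = [f"{field}:{quote_term(term)}" for term in ordered]
    return "(" + " OR ".join(clauses) + ")"
```

If every term is a multi-word phrase the group will still end with a quoted term, so this only helps when at least one single-word subject is part of the search.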

2023-07-15

  • Export CGSpace to fix missing Initiative collection mappings
  • Start a harvest on AReS

2023-07-17

  • Rasika had sent me a list of new ORCID identifiers for new IWMI staff so I combined them with our existing list and ran resolve_orcids.py to refresh the names in our database
    • I updated the list, updated names in the database, and tagged new authors with missing identifiers in existing items
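Combining the new identifiers with the existing list is mostly a matter of normalizing and de-duplicating. A minimal sketch of that step follows, assuming each input line contains at most one ORCID identifier somewhere in it (bare, or inside a URL or a "Name: ID" pair). The checksum test uses the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers. The function names are hypothetical and this is not resolve_orcids.py.

```python
import re

ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def has_valid_checksum(orcid: str) -> bool:
    """Verify the final character using ISO 7064 MOD 11-2."""
    digits = orcid.replace("-", "")
    total = 0
    for char in digits[:15]:
        total = (total + int(char)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return digits[15] == expected


def merge_orcids(*lists):
    """Merge several iterables of lines into a sorted list of unique ORCID iDs."""
    merged = set()
    for lines in lists:
        for line in lines:
            match = ORCID_PATTERN.search(line.upper())
            # Drop anything that is not a well-formed identifier
            if match and has_valid_checksum(match.group()):
                merged.add(match.group())
    return sorted(merged)
```

Rejecting identifiers with a bad checksum before querying ORCID avoids spending lookups on typos in a hand-compiled list.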

2023-07-18

  • Meeting with IWMI, IRRI, and IITA colleagues about CGSpace upgrade plans
  • Maria from the Alliance mentioned having some submissions stuck on CGSpace
    • I looked and found a number of locks stuck for nineteen, eighteen, and more hours…
    • I killed them and told her to try again
$ psql < locks-age.sql | less -S
$ psql < locks-age.sql | grep -E " (19|18|17|16|12):" | awk -F"|" '{print $10}' | sort -u | xargs kill

2023-07-19

  • I had to kill a bunch more locked processes in PostgreSQL, I’m not sure what’s going on
  • After some discussion about an advanced search bug with Tim on Slack, I filed an issue on GitHub

2023-07-20

  • I added a new metadata field for CGIAR Impact Platforms (cg.subject.impactPlatform) to CGSpace

2023-07-22

  • Export CGSpace to fix missing Initiative collections
  • Start a harvest on AReS

2023-07-24

  • Test Salem’s new JavaScript-based DSpace Statistics API and send him some feedback
  • I noticed a few times that the Solr service on my DSpace 7 instance is getting OOM killed
    • I had been using a 4g Solr heap, but maybe we don’t need that much
    • Tomcat is also using 4.6GB, and then there’s PostgreSQL… so perhaps it’s all a bit much on this system now

2023-07-25

  • Start testing exporting DSpace 6 Solr cores to import on DSpace 7:
$ chrt -b 0 dspace solr-export-statistics -i statistics
  • I’m curious how long it takes and how much data there will be
    • The size of the Solr data directory is currently 82GB
    • The export took about 2.5 hours and created 6,000 individual CSVs, one for each day of Solr stats
    • The size of the exported CSVs is about 88GB
    • I will copy just a few years to import on the DSpace 7 test server
    • So importing these is going to require removing the Atmire custom fields:
$ dspace solr-import-statistics -i statistics
Exception: Error from server at http://localhost:8983/solr/statistics: ERROR: [doc=1a92472e-e39d-4602-9b4d-da022df8f233] unknown field 'containerCommunity'
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/statistics: ERROR: [doc=1a92472e-e39d-4602-9b4d-da022df8f233] unknown field 'containerCommunity'
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:681)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:266)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
        at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
        at org.dspace.util.SolrImportExport.importIndex(SolrImportExport.java:465)
        at org.dspace.util.SolrImportExport.main(SolrImportExport.java:148)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:277)
        at org.dspace.app.launcher.ScriptLauncher.handleScript(ScriptLauncher.java:133)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:98)
  • I will try using solr-import-export-json, which I’ve used in the past to skip Atmire custom fields in Solr:
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2022.json -f 'time:[2022-01-01T00\:00\:00Z TO 2022-12-31T23\:59\:59Z]' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,geoIpCountryCode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId,core_update_run_nb
  • Some users complained that CGSpace was slow and I found a handful of locks that were hours and days old…
    • I killed those and told them to try again
  • After importing the Solr statistics into DSpace 7 I realized that my DSpace Statistics API will work fine
    • I made some minor modifications to the Ansible infrastructure scripts to make sure it is enabled and then activated it on DSpace 7 Test

2023-07-26

  • Debugging lock issues on CGSpace
    • I see the blocking PIDs for some long-held locks are “idle in transaction”:
$ ps auxw | grep -E "(1864132|1659487)"
postgres 1659487  0.0  0.5 3269900 197120 ?      Ss   Jul25   0:03 postgres: 14/main: cgspace cgspace 127.0.0.1(61648) idle in transaction
postgres 1864132  0.1  0.7 3275704 254528 ?      Ss   07:27   0:08 postgres: 14/main: cgspace cgspace 127.0.0.1(36998) idle in transaction
postgres 1880388  0.0  0.0   9208  2432 pts/3    S+   08:48   0:00 grep -E (1864132|1659487)
  • I used some other scripts and found that those processes were executing the following statement:
select nextval ('public.tasklistitem_seq')
  • I don’t know why these can get blocked for hours without resolution, but for now I just killed them
  • I wrote a slightly longer regex to match locks that have been stuck for more than 1 hour based on the output of the locks-age.sql script and killed them:
$ psql < locks-age.sql | awk -F"|" '/ [[:digit:]][1-9]:[[:digit:]]{2}:[[:digit:]]{2}\./ {print $10}' | sort -u | xargs kill
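One caveat with the pattern above: `[[:digit:]][1-9]:` requires the second digit of the hours to be 1 to 9, so an age such as 10:05:00 or 20:15:00 would not match. A sketch of an alternative is to parse the age as a duration and compare it with a threshold. It assumes pipe-separated psql output with the age in the ninth column and the PID in the tenth, as in the awk commands in these notes, and ages in PostgreSQL's default interval style. The function names are hypothetical, and the sketch only returns PIDs, it does not kill anything.

```python
import re
from datetime import timedelta

# Matches "11:30:48.598436" and "1 day 02:03:04" style intervals
AGE_PATTERN = re.compile(
    r"^(?:(?P<days>\d+) days? )?"
    r"(?P<hours>\d+):(?P<minutes>\d{2}):(?P<seconds>\d{2}(?:\.\d+)?)$"
)


def parse_age(text: str):
    """Convert an interval string to a timedelta, or None if it does not parse."""
    match = AGE_PATTERN.match(text.strip())
    if not match:
        return None
    return timedelta(
        days=int(match.group("days") or 0),
        hours=int(match.group("hours")),
        minutes=int(match.group("minutes")),
        seconds=float(match.group("seconds")),
    )


def stale_pids(psql_output, threshold=timedelta(hours=1), age_column=8, pid_column=9):
    """Return sorted unique PIDs whose lock age exceeds the threshold.

    Column indexes are zero-based, so 8 and 9 are the ninth and tenth columns.
    """
    pids = set()
    for line in psql_output.splitlines():
        columns = [column.strip() for column in line.split("|")]
        if len(columns) <= max(age_column, pid_column):
            continue  # header separator, row count footer, or blank line
        age = parse_age(columns[age_column])
        if age is not None and age > threshold and columns[pid_column].isdigit():
            pids.add(int(columns[pid_column]))
    return sorted(pids)
```

Printing the PIDs first and reviewing them before piping to kill is a small safety margin that the one-liner does not have.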

2023-07-27

  • Export CGSpace to check countries, regions, types, and Initiatives
    • There were a few minor issues in countries and regions, and I noticed 186 items without types!
    • Then I ran the file through csv-metadata-quality to make sure items with countries have appropriate regions
  • Brief discussion about OpenRXV bugs and fixes with Moayad
  • I was toying with the idea of using an expanded whitespace check/fix based on ESLint’s no-irregular-whitespace rule in csv-metadata-quality
    • I found 176 items in CGSpace with such whitespace in their titles alone
    • I compared the results of removing these characters and replacing them with a space
    • In most cases removing it is the correct thing to do, for example “Pesticides : une arme à double tranchant” → “Pesticides: une arme à double tranchant”
    • But in some items it is tricky, for example “L’environnement juridique est-il propice à la gestion” → “L’environnement juridique est-il propice àla gestion”
    • I guess it would really need some good heuristics or a human to verify…
  • I upgraded OpenRXV to Angular v14
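The whitespace comparison described earlier in this entry (removing each irregular character versus replacing it with a space) could be sketched as below. The character set is modelled on ESLint's no-irregular-whitespace rule and should be checked against the rule's documentation. The function name is hypothetical and this is not code from csv-metadata-quality.

```python
import re

# Characters modelled on ESLint's no-irregular-whitespace rule (an assumption)
IRREGULAR_WHITESPACE = (
    "\u000b\u000c\u0085\u00a0\u1680\u180e\ufeff"
    "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b"
    "\u2028\u2029\u202f\u205f\u3000"
)

REMOVE_TABLE = str.maketrans("", "", IRREGULAR_WHITESPACE)
REPLACE_TABLE = str.maketrans({char: " " for char in IRREGULAR_WHITESPACE})


def whitespace_candidates(value: str):
    """Return both candidate fixes, or None if the value has no irregular whitespace."""
    if not any(char in IRREGULAR_WHITESPACE for char in value):
        return None
    removed = value.translate(REMOVE_TABLE)
    # Replacing can leave two spaces in a row, so collapse runs of spaces
    replaced = re.sub(r" {2,}", " ", value.translate(REPLACE_TABLE))
    return {"removed": removed, "replaced": replaced}
```

Emitting both candidates side by side leaves the final choice to a human reviewer or a later heuristic, which matches the conclusion above that neither fix is right in every case.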

2023-07-28

Exception in thread "main" org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/statistics: ERROR: [doc=0008a7c1-e552-4a4e-93e4-4d23bf39964b] Error adding field 'workflowItemId'='0812be47-1bfe-45e2-9208-5bf10ee46f81' msg=For input string: "0812be47-1bfe-45e2-9208-5bf10ee46f81"
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:745)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:259)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:234)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:102)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:69)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:82)
        at it.damore.solr.importexport.App.insertBatch(App.java:295)
        at it.damore.solr.importexport.App.lambda$writeAllDocuments$10(App.java:276)
        at it.damore.solr.importexport.BatchCollector.lambda$accumulator$0(BatchCollector.java:71)
        at java.base/java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
        at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
        at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
        at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
        at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
        at it.damore.solr.importexport.App.writeAllDocuments(App.java:252)
        at it.damore.solr.importexport.App.main(App.java:150)
isInternal,workflowItemId,containerCommunity,containerCollection,containerItem,containerBitstream,dateYear,dateYearMonth,filterquery,complete_query,simple_query,complete_query_search,simple_query_search,ngram_query_search,ngram_simplequery_search,text,storage_statistics_type,storage_size,storage_nb_of_bitstreams,name,first_name,last_name,p_communities_id,p_communities_name,p_communities_map,p_group_id,p_group_name,p_group_map,group_id,group_name,group_map,parent_count,bitstreamId,bitstreamCount,actingGroupId,actorMemberGroupId,actingGroupParentId,rangeDescription,range,version_id,file_id,cua_version,core_update_run_nb,orphaned
  • I will combine it with the other fields I was skipping above and try the export again:
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2020.json -f 'time:[2020-01-01T00\:00\:00Z TO 2020-12-31T23\:59\:59Z]' -k uid -S actingGroupId,actingGroupParentId,actorMemberGroupId,author_mtdt,author_mtdt_search,bitstreamCount,bitstreamId,complete_query,complete_query_search,containerBitstream,containerCollection,containerCommunity,containerItem,core_update_run_nb,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,file_id,filterquery,first_name,geoipcountrycode,geoIpCountryCode,group_id,group_map,group_name,ip_ngram,ip_search,isArchived,isInternal,iso_mtdt,iso_mtdt_search,isWithdrawn,last_name,name,ngram_query_search,ngram_simplequery_search,orphaned,parent_count,p_communities_id,p_communities_map,p_communities_name,p_group_id,p_group_map,p_group_name,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,referrer_ngram,referrer_search,simple_query,simple_query_search,solr_update_time_stamp,storage_nb_of_bitstreams,storage_size,storage_statistics_type,subject_mtdt,subject_mtdt_search,text,userAgent_ngram,userAgent_search,version_id,workflowItemId
  • Export a list of affiliations from the Initiatives community for Peter:
$ dspace metadata-export -i 10568/115087 -f /tmp/2023-07-28-initiatives.csv
$ csvcut -c 'cg.contributor.affiliation[en_US]' ~/Downloads/2023-07-28-initiatives.csv \
  | sed -e 1d -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' -e '/^$/d'            \
  | sort | uniq -c | sort -hr                                              \
  | awk 'BEGIN { FS = "^[[:space:]]+[[:digit:]]+[[:space:]]+" } {print $2}'\
  | sed -e '1i cg.contributor.affiliation' -e 's/^\(.*\)$/"\1"/'           \
  > /tmp/2023-07-28-initiatives-affiliations.csv
  • This is a method I first used in 2023-01 to export affiliations ONLY used in items in the Initiatives community
    • I did the same for authors and investors
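For comparison, the csvcut/sed/sort/awk pipeline above can be expressed as a short script. This is a sketch under the same assumptions as the shell version: multiple values in a cell are separated by `||`, and the output is a one-column CSV with every line quoted, ordered from most to least used. The function names are hypothetical.

```python
import csv
from collections import Counter


def count_values(csv_file, column, separator="||"):
    """Count how often each value of a multi-value column is used."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        for value in (row.get(column) or "").split(separator):
            if value:  # skip empty cells, like the sed '/^$/d' step
                counts[value] += 1
    return counts


def write_by_frequency(counts, header, out_file):
    """Write the values as a fully quoted one-column CSV, most used first."""
    writer = csv.writer(out_file, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow([header])
    for value, _count in counts.most_common():
        writer.writerow([value])
```

Using the csv module for both reading and writing also copes with values that contain commas or quotes, which the sed-based quoting handles less reliably.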

2023-07-29

  • Export CGSpace to look for missing Initiative collection mappings
  • I found a bunch of locks waiting for many hours and killed them:
$ psql < locks-age.sql | awk -F"|" '$9 ~ / [[:digit:]][1-9]:[[:digit:]]{2}:[[:digit:]]{2}\./ {print $10}' | sort -u | xargs kill
  • This looks for a pattern matching something like 11:30:48.598436 in the age column (not 00:00:00) and kills them
  • Start a harvest on AReS

June, 2023

2023-06-02

  • Spend some time testing my post_bitstreams.py script to update thumbnails for items on CGSpace
    • Interestingly I found an item with a JFIF thumbnail and another with a WebP thumbnail…
  • Meeting with Valentina, Stefano, and Sara about MODS metadata in CGSpace
    • They have experience with improving the MODS interface in MELSpace’s OAI-PMH for use with AGRIS and were curious if we could do the same in CGSpace
    • From what I can see we need to upgrade the MODS schema from 3.1 to 3.7 and then just add a bunch of our fields to the crosswalk

May, 2023

2023-05-03

  • Alliance’s TIP team emailed me to ask about issues authenticating on CGSpace
    • It seems their password expired, which is annoying
  • I continued looking at the CGSpace subjects for the FAO / AGROVOC exercise that I started last week
    • There are many of our subjects that would match if they added a “-” like “high yielding varieties” or used singular…
    • Also I found at least two spelling mistakes, for example “decison support systems”, which would match if it was spelled correctly
  • Work on cleaning, proofing, and uploading twenty-seven records for IFPRI to CGSpace

April, 2023

2023-04-02

  • Run all system updates on CGSpace and reboot it
  • I exported CGSpace to CSV to check for any missing Initiative collection mappings
    • I also did a check for missing country/region mappings with csv-metadata-quality
  • Start a harvest on AReS

March, 2023

2023-03-01

  • Remove cg.subject.wle and cg.identifier.wletheme from CGSpace input form after confirming with IWMI colleagues that they no longer need them (WLE closed in 2021)
  • iso-codes 4.13.0 was released, which incorporates my changes to the common names for Iran, Laos, and Syria
  • I finally got through with porting the input form from DSpace 6 to DSpace 7

February, 2023

2023-02-01

  • Export CGSpace to cross check the DOI metadata with Crossref
    • I want to try to expand my use of their data to journals, publishers, volumes, issues, etc…

January, 2023

2023-01-01

  • Apply some more ORCID identifiers to items on CGSpace using my 2022-09-22-add-orcids.csv file
    • I want to update all ORCID names and refresh them in the database
    • I see we have some new ones that aren’t in our list if I combine with this file: