CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

February, 2023

2023-02-01

  • Export CGSpace to cross check the DOI metadata with Crossref
    • I want to try to expand my use of their data to journals, publishers, volumes, issues, etc…
  • First, extract a list of DOIs for use with crossref-doi-lookup.py:
$ csvcut -c 'cg.identifier.doi[en_US]' ~/Downloads/2023-02-01-cgspace.csv \
  | csvgrep -c 1 -m 'doi.org' \
  | csvgrep -c 1 -m ' ' -i \
  | csvgrep -c 1 -r '.*cifor.*' -i \
  | sed 1d > /tmp/2023-02-01-dois.txt
$ ./ilri/crossref-doi-lookup.py -e a.orth@cgiar.org -i /tmp/2023-02-01-dois.txt -o ~/Downloads/2023-02-01-crossref-results.csv -d
  • Then extract the ID, DOI, journal, volume, issue, publisher, etc. from the CGSpace dump and rename the cg.identifier.doi[en_US] column to doi so we can join on it with the Crossref results file:
$ csvcut -c 'id,cg.identifier.doi[en_US],cg.journal[en_US],cg.volume[en_US],cg.issue[en_US],dcterms.publisher[en_US],cg.number[en_US],dcterms.license[en_US]' ~/Downloads/2023-02-01-cgspace.csv \
  | csvgrep -c 'cg.identifier.doi[en_US]' -r '.*cifor.*' -i \
  | sed -e '1s/cg.identifier.doi\[en_US\]/doi/' \
    -e 's_https://doi.org/__g' \
    -e 's_https://dx.doi.org/__g' \
  > /tmp/2023-02-01-cgspace-doi-metadata.csv
$ csvjoin -c doi /tmp/2023-02-01-cgspace-doi-metadata.csv ~/Downloads/2023-02-01-crossref-results.csv > /tmp/2023-02-01-cgspace-crossref-check.csv
  • And import into OpenRefine for analysis and cleaning
  • I just noticed that Crossref also has types, so we could use that in the future too!
  • I got a few corrections after examining manually, but I didn’t manage to identify any patterns that I could use to do any automatic matching or cleaning
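  • The same kind of mismatch check could also be scripted instead of using OpenRefine facets; a small pandas sketch with illustrative column names (the Crossref column name depends on the output of crossref-doi-lookup.py):
# Flag rows where CGSpace and Crossref disagree on the volume (a sketch; the
# "crossref_volume" column name is illustrative and depends on the output of
# crossref-doi-lookup.py)
import pandas as pd

df = pd.read_csv("/tmp/2023-02-01-cgspace-crossref-check.csv", dtype=str)

mismatches = df[
    df["cg.volume[en_US]"].notna()
    & df["crossref_volume"].notna()
    & (df["cg.volume[en_US]"].str.strip() != df["crossref_volume"].str.strip())
]

print(mismatches[["id", "doi", "cg.volume[en_US]", "crossref_volume"]])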

2023-02-05

  • Normalize text lang attributes in PostgreSQL, run a quick Discovery index, and then export CGSpace to check Initiative mappings and countries/regions
  • Run all system updates on CGSpace (linode18) and reboot it

2023-02-07

  • IFPRI’s web developer Tony managed to get his Drupal harvester to have a useful user agent:
54.x.x.x - - [06/Feb/2023:10:10:32 +0100] "POST /rest/items/find-by-metadata-field?limit=%22100&offset=0 HTTP/1.1" 200 58855 "-" "IFPRI drupal POST harvester"
  • He also noticed that there is no pagination on POST requests to /rest/items/find-by-metadata-field, and that he needs to increase his timeout for requests that return 100+ results, ie:
$ curl -f -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.actionArea", "value":"Systems Transformation", "language": "en_US"}'
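  • For reference, roughly the same request from Python with a longer client-side timeout (a sketch; the requests library and the 120-second value are just an illustration, not what Tony’s harvester uses):
# The same find-by-metadata-field request with a longer client-side timeout
# (the 120-second value is illustrative)
import requests

url = "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field"
payload = {
    "key": "cg.subject.actionArea",
    "value": "Systems Transformation",
    "language": "en_US",
}

response = requests.post(url, json=payload, timeout=120)
response.raise_for_status()
print(len(response.json()), "items returned")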
  • I need to ask on the DSpace Slack about this POST pagination
  • Abenet and Udana noticed that the Handle server was not running
    • Looking in the error.log file I see that the service is complaining about a lock file being present
    • This is because Linode had to do emergency maintenance on the VM host this morning and the Handle server didn’t shut down properly
  • I’m having an issue with poetry update so I spent some time debugging and filed an issue
  • Proof and import nine items for the Digital Innovation Initiative for IFPRI
    • There were only some minor issues in the metadata
    • I also did a duplicate check with check-duplicates.py just in case
  • I did some minor updates on csv-metadata-quality
    • First, to reduce warnings on non-SPDX licenses like “Copyrighted; all rights reserved” and “Other” since they are very common for us and I’m sick of seeing the warnings
    • Second, to skip whitespace and newline fixes on the abstract field, since those are often intentional there

2023-02-08

  • Make some edits to IFPRI records requested by Jawoo and Leigh
  • Help Alessandra upload a last minute report for SAPLING
  • Proof and upload twenty-seven IFPRI records to CGSpace
    • It’s a good thing I did a duplicate check because I found three duplicates!
  • Export CGSpace to update Initiative mappings and country/region mappings
    • Then start a harvest on AReS

2023-02-09

  • Do some minor work on the CSS on the DSpace 7 test

2023-02-10

  • I noticed a large number of PostgreSQL locks from dspaceWeb on CGSpace:
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
   2033 dspaceWeb
  • Looking at the lock age, I see some that are already one day old, including this curious query:
select nextval ('public.registrationdata_seq')
  • I killed all locks that were more than a few hours old
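  • In practice that means terminating the backends holding them; a rough psycopg2 sketch of the cleanup (the connection string, the two-hour cutoff, and the assumption that long-idle transactions are safe to kill are all mine):
# Terminate backends that have been idle in a transaction for hours, which
# releases the locks they hold (cutoff and connection details are illustrative)
import psycopg2

conn = psycopg2.connect("dbname=dspace user=postgres")
conn.autocommit = True

with conn.cursor() as cursor:
    cursor.execute("""
        SELECT pid FROM pg_stat_activity
        WHERE state = 'idle in transaction'
          AND state_change < now() - interval '2 hours'
    """)
    for (pid,) in cursor.fetchall():
        # pg_terminate_backend() ends the session and rolls back its transaction
        cursor.execute("SELECT pg_terminate_backend(%s)", (pid,))

conn.close()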
  • Export CGSpace to update Initiative collection mappings
  • Discuss adding dcterms.available to the submission form
    • I also looked in the dcterms.description field on CGSpace and found ~1,500 items where there is an indication of an online publication date
    • Using some facets in OpenRefine I narrowed down the ones mentioning “online” and then extracted the dates to a new column:
cells['dcterms.description[en_US]'].value.replace(/.*?(\d{1,2}) ([a-zA-Z]+) (\d{4}).*/,"$3-$2-$1")
  • Then to handle formats like “2022-April-26” and “2021-Nov-11” I used some replacement GRELs (note the order: full month names before the abbreviations, so we don’t replace the short patterns inside the longer names prematurely):
value.replace("January","01").replace("February","02").replace("March","03").replace("April","04").replace("May","05").replace("June","06").replace("July","07").replace("August","08").replace("September","09").replace("October","10").replace("November","11").replace("December","12")
value.replace("Jan","01").replace("Feb","02").replace("Mar","03").replace("Apr","04").replace("May","05").replace("Jun","06").replace("Jul","07").replace("Aug","08").replace("Sep","09").replace("Oct","10").replace("Nov","11").replace("Dec","12")
  • This covered about 1,300 items, and then I did about 100 messier ones with some more regex wrangling
    • I removed the dcterms.description[en_US] field from items where I updated the dates
  • Then I added dcterms.available to the submission form and the item view
    • We need to announce this to the editors

2023-02-13

  • Export CGSpace to do some metadata quality checks
    • I added CGIAR Trust Fund as a donor to some new Initiative outputs
    • I moved some abstracts from the description field
    • I moved some version information to the cg.edition field

2023-02-14

  • The PRMS team in Colombia sent some questions about countries on CGSpace
    • I had to fix some that were clearly wrong, but there is also a difference between CGSpace and MEL because we mostly use the iso-codes names, while MEL uses the UN M.49 list
    • Then I re-ran the country code tagger from cgspace-java-helpers, forcing the update on all items in the Initiatives community
  • Remove Alliance research levers from cg.contributor.crp field after discussing with Daniel and Maria
    • This was a mistake on TIP’s part, and there is no direct mapping between research levers and CRPs
  • I exported CGSpace to check Initiative collection mappings, regions, and licenses
    • Peter told me that all CGIAR blog posts for the Initiatives should be CC-BY-4.0, and indeed I see the CC license logo at the bottom of the posts in light gray!
    • I had previously missed that and removed some licenses for blog posts
    • I checked cgiar.org, ifpri.org, icarda.org, iwmi.cgiar.org, irri.org, etc and corrected a handful
  • Start a harvest on AReS

2023-02-15

  • Work on rebasing my local DSpace 7 dev branches on top of the latest 7.5-SNAPSHOT
    • It seems the issues I had with the dspace submission-forms-migrate tool in August, 2022 were fixed
  • I imported a fresh PostgreSQL snapshot from CGSpace, removed the Atmire migrations, and ran the new migrations as I originally noted in March, 2022 and as pointed out in the DSpace 7 upgrade notes
    • Now I get a new error:
localhost/dspace7= ☘ DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');
localhost/dspace7= ☘ DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';
localhost/dspace7= \q
$ ./bin/dspace database migrate ignored
...

CREATE INDEX resourcepolicy_action_idx ON resourcepolicy(action_id)

        at org.flywaydb.core.internal.sqlscript.DefaultSqlScriptExecutor.handleException(DefaultSqlScriptExecutor.java:275)
        at org.flywaydb.core.internal.sqlscript.DefaultSqlScriptExecutor.executeStatement(DefaultSqlScriptExecutor.java:222)
        at org.flywaydb.core.internal.sqlscript.DefaultSqlScriptExecutor.execute(DefaultSqlScriptExecutor.java:126)
        at org.flywaydb.core.internal.resolver.sql.SqlMigrationExecutor.executeOnce(SqlMigrationExecutor.java:69)
        at org.flywaydb.core.internal.resolver.sql.SqlMigrationExecutor.lambda$execute$0(SqlMigrationExecutor.java:58)
        at org.flywaydb.core.internal.database.DefaultExecutionStrategy.execute(DefaultExecutionStrategy.java:27)
        at org.flywaydb.core.internal.resolver.sql.SqlMigrationExecutor.execute(SqlMigrationExecutor.java:57)
        at org.flywaydb.core.internal.command.DbMigrate.doMigrateGroup(DbMigrate.java:377)
        ... 24 more
Caused by: org.postgresql.util.PSQLException: ERROR: relation "resourcepolicy_action_idx" already exists
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2676)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2366)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:356)
        at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:496)
        at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:413)
        at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:333)
        at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:319)
        at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:295)
        at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:290)
        at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:193)
        at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:193)
        at org.flywaydb.core.internal.jdbc.JdbcTemplate.executeStatement(JdbcTemplate.java:201)
        at org.flywaydb.core.internal.sqlscript.ParsedSqlStatement.execute(ParsedSqlStatement.java:95)
        at org.flywaydb.core.internal.sqlscript.DefaultSqlScriptExecutor.executeStatement(DefaultSqlScriptExecutor.java:210)
        ... 30 more
  • I dropped that index and then the migration succeeded:
localhost/dspace7= ☘ DROP INDEX resourcepolicy_action_idx;
localhost/dspace7= ☘ \q
$ ./bin/dspace database migrate ignored
Done.

2023-02-16

  • I found a suspicious number of PostgreSQL locks on CGSpace and decided to investigate:
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
     44 dspaceApi
    372 dspaceCli
    446 dspaceWeb
  • This started happening yesterday and I killed a few locks that were several hours old after inspecting the locks-age.sql output
  • I also checked the locks.sql output, which helpfully lists the blocked PID and the blocking PID, to find one blocking PID that was idle in transaction
    • I killed that process and then all other locks were instantly processed
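  • For the record, PostgreSQL can also report the blocker directly with pg_blocking_pids(); a small psycopg2 sketch of that check, which gives roughly the blocked/blocking PID information that locks.sql shows me:
# List blocked sessions and the PIDs blocking them (requires PostgreSQL 9.6+;
# connection details are illustrative)
import psycopg2

conn = psycopg2.connect("dbname=dspace user=postgres")

with conn.cursor() as cursor:
    cursor.execute("""
        SELECT pid, state, pg_blocking_pids(pid) AS blocked_by, query
        FROM pg_stat_activity
        WHERE cardinality(pg_blocking_pids(pid)) > 0
    """)
    for pid, state, blocked_by, query in cursor.fetchall():
        print(f"{pid} ({state}) blocked by {blocked_by}: {query[:80]}")

conn.close()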
  • I filed a GitHub issue on dspace-angular requesting the item view to use the bitstream description instead of the file name if present
  • Weekly CG Core types meeting
    • I need to go through the actions and remove those items that are only for CGSpace internal use, ie:
      • CD-ROM
      • Manuscript-unpublished
      • Photo Report
      • Questionnaire
      • Wiki
  • Weekly CGIAR Repository Working Group meeting
  • I did some experiments with Crossref dates for about 20,000 DOIs in CGSpace using my crossref-doi-lookup.py script
  • Some things I noted from reading the Crossref API docs and inspecting the records for a few dozen DOIs manually:
    • ["created"]["date-parts"] → Date on which the DOI was first registered (not useful for us)
    • ["published-print"]["date-parts"] → Date on which the work was published in print
    • ["journal-issue"]["published-print"]["date-parts"] → When present, is 99% the same as the above
    • ["published-online"]["date-parts"] → Date on which the work was published online
    • ["journal-issue"]["published-online"]["date-parts"] → Much more rare, and only 50% the same as the above, so unreliable
    • ["issued"]["date-parts"] → Earliest of published-print and published-online (not useful to us)
  • After checking the DOIs manually I decided that, when the published-print date exists, it is usually more accurate than our issued dates
    • I set 12,300 issue dates to those from Crossref
  • I also decided that, when published-online exists, it is usually accurate when I check the publisher page (we don’t have many online dates to compare)
    • I set the available date for ~7,000 items to the published-online date as long as:
      • There was no dcterms.available date already
      • It was different from the issued date, because for now I only want online dates that differ; if it’s an online-only journal the online date may just be the issue date… maybe I’ll revisit that later
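    • In other words, the rule for setting dcterms.available was roughly this (a sketch; the field names are illustrative):
# Decide whether an item gets a dcterms.available date from Crossref
def new_available_date(existing_available, issued, published_online):
    # Only items where Crossref has an online date at all
    if not published_online:
        return None
    # Don't overwrite an existing dcterms.available date
    if existing_available:
        return None
    # For now, only take online dates that differ from the issued date
    if published_online == issued:
        return None
    return published_online

print(new_available_date("", "2021-05-01", "2021-04-12"))  # 2021-04-12
print(new_available_date("", "2021-05-01", "2021-05-01"))  # None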

2023-02-17

  • It seems some (all?) of the changes I applied to dates last night didn’t get saved…
    • I don’t know what happened, so I will run them again after some investigation
    • I submitted the first batch of ~7,600 changes and it took twelve hours!
    • I almost cancelled it because, after applying the changes, there was a lock blocking everything for two hours and it seemed stuck, but I kept checking and saw that the query_start and state_change were being updated despite the state being “idle in transaction”:
$ psql -c 'SELECT * FROM pg_stat_activity WHERE pid=1025176' | less -S
  • I will apply the other changes in smaller batches…
  • Lately I’ve noticed a lot of activity from the country code tagger curation task
    • Looking in the logs I see items being tagged that are very old and should have already been tagged years ago
    • Also, I see a ton of these errors whenever the task is updating an item:
2023-02-17 08:01:00,252 INFO  org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/89020 with status: 0. Result: '10568/89020: added 1 alpha2 country code(s)'
2023-02-17 08:01:00,467 ERROR com.atmire.versioning.ModificationLogger @ Error while writing item to versioning index: a0fe9d9a-6ac1-4b6a-8fcb-dae07a6bbf58 message:missing required field: epersonID
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: missing required field: epersonID
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
        at com.atmire.versioning.ModificationLogger.indexItem(ModificationLogger.java:263)
        at com.atmire.versioning.ModificationConsumer.end(ModificationConsumer.java:134)
        at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:157)
        at org.dspace.core.Context.dispatchEvents(Context.java:455)
        at org.dspace.curate.Curator.visit(Curator.java:541)
        at org.dspace.curate.Curator$TaskRunner.run(Curator.java:568)
        at org.dspace.curate.Curator.doCollection(Curator.java:515)
        at org.dspace.curate.Curator.doCommunity(Curator.java:487)
        at org.dspace.curate.Curator.doSite(Curator.java:451)
        at org.dspace.curate.Curator.curate(Curator.java:269)
        at org.dspace.curate.Curator.curate(Curator.java:203)
        at org.dspace.curate.CurationCli.main(CurationCli.java:220)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
  • This must be related…

2023-02-18

  • I realized why the country-code-tagger was tagging everything: I had overridden the force parameter last week!
  • Start a harvest on AReS

2023-02-20

  • IWMI is concerned that some of their items with top Altmetric attention scores don’t show up in the AReS Explorer
    • I looked into it for one and found that AReS is using the Handle, but Altmetric hasn’t associated the Handle with the DOI
  • Looking into country and region issues for the PRMS team
    • Last week they had some questions about some invalid countries that ended up being typos
    • I realized my cgspace-java-helpers country-code-tagger curation task was not using the latest version, so it was missing Türkiye
    • I compiled the new version and ran it manually, but I have to upload a new version to Maven Central and then update the dependency in dspace/modules/additions/pom.xml ughhhhhh
    • I tagged version 6.2 with the change for Türkiye and uploaded it to Maven Central with mvn clean deploy
  • I’m having second thoughts about switching to UN M.49 for countries because there are just too many tradeoffs
    • I want to find a way to keep our existing list, and codify some rules for it
    • There are several discussions related to the shortcomings of ISO itself and of the iso-codes project
    • I almost want to say fuck it, let’s just use iso-codes and tell everyone to deal with it, but make sure we handle ISO 3166-1 Alpha2 or probably Alpha3 in the future
    • Something like:
      • Prefer common_name if it exists
      • Prefer the shorter of name and official name
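  • A quick sketch of that rule against the iso-codes JSON (assuming the file location from the Debian iso-codes package; just an illustration):
# Apply the proposed naming rule to the ISO 3166-1 list from iso-codes
import json

with open("/usr/share/iso-codes/json/iso_3166-1.json") as f:
    countries = json.load(f)["3166-1"]

def preferred_name(country):
    # Prefer common_name if it exists (e.g. Iran instead of the long official form)
    if "common_name" in country:
        return country["common_name"]
    # Otherwise prefer the shorter of name and official_name
    official = country.get("official_name", country["name"])
    return min(country["name"], official, key=len)

for country in countries:
    print(country["alpha_2"], preferred_name(country))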

2023-02-21

  • Continue working on my parse-iso-codes.py script to parse the iso-codes JSON for ISO 3166-1
    • I also started a spreadsheet to track current CGSpace country names, proposed new names using the compromise above, and UN M.49 names
    • I proposed this to Peter but he wasn’t happy because there are still some stupidly long and political names there
  • I bumped the version of cgspace-java-helpers to 6.2-SNAPSHOT and pushed it to Maven Central because I can’t figure out how to get non-snapshot releases to go there
  • Ouch, grunt 1.6.0 was released a few weeks ago, which relies on Node.js v16, thus breaking the Mirage 2 build in DSpace 6
  • Help Moises from CIP troubleshoot harvesting issues on their WordPress site
    • I see 2,000 requests with the user agent “RTB website BOT” today and they are all HTTP 200
# grep 'RTB website BOT' /var/log/nginx/rest.log | awk '{print $9}' | sort | uniq -c | sort -h
   2023 200
  • Start reviewing and fixing metadata for Sam’s ~250 CAS publications from last year
    • Both Abenet and Peter have already looked at them and Sam has been waiting for months on this

2023-02-22

  • Continue proofing CAS records for Sam
    • I downloaded all the PDFs manually and checked the issue dates for each from the PDF, noting some that had licenses, ISBNs, etc
    • I combined the title, abstract, and system subjects into one column to mine them for AGROVOC terms:
toLowercase(value) + toLowercase(cells["dcterms.abstract"].value) + toLowercase(cells["cg.subject.system"].value.replace("||", " "))
  • Then I extracted a list of AGROVOC terms the same way I did in August, 2022 and used this Jython code to extract matching terms:
import re

with open(r"/tmp/agrovoc-subjects.txt",'r') as f : 
    terms = [name.rstrip().lower() for name in f]

return "||".join([term for term in terms if re.match(r".*\b" + term + r"\b.*", value.lower())])
  • Then I used a second Jython snippet to deduplicate the matched terms:
deduped_list = list(set(value.split("||")))
return '||'.join(map(str, deduped_list))

2023-02-23

  • Tag v0.6.1 of csv-metadata-quality
  • Weekly meeting about CG Core types
    • I need to get some definitions from Peter for some types
  • Peter sent me some of the feedback from Indira about the XMLUI
    • I removed some old facets, limited others to fewer values, and adjusted the recent submissions from 5 to 10

2023-02-24

  • More work on understanding Sam’s CAS publications to prepare for uploading them to CGSpace
    • I need to reconcile the duplicates and Peter’s type re-classifications in the final version of the spreadsheet
    • I flagged all the duplicates by creating a custom text facet matching all their titles like:
or(
  isNotNull(value.match("Evaluation of the CGIAR Research Program on Climate Change, Agriculture and Food Security (CCAFS)")),
  isNotNull(value.match("Report of the IEA Workshop on Development, Use and Assessment of TOC in CGIAR Research, Rome, 12-13 January 2017")),
  isNotNull(value.match("Report of the IEA Workshop on Evaluating the Quality of Science, Rome, 10-11 December 2015")),
  isNotNull(value.match("Review of CGIAR’s Intellectual Assets Principles")),
...
)
  • Annoyingly this seems to miss the ones with parentheses, since match() treats the pattern as a regular expression, so I had to do those manually
    • This matched thirty-seven items, then I flagged them so I can handle them separately after uploading the others
    • Then I used the URL field in the old version of the file to match the items with types Evaluation and Independent Commentary since Peter changed them
    • I added extent, volume, issue, number, and affiliation to a few journal articles
    • Then I did some last minute checks to make sure we’re not uploading files for items marked as having “multiple documents”

2023-02-25

  • Oh nice, my pull request adding common names for Iran, Laos, and Syria to iso-codes was merged
  • I did a test import of the 198 CAS Publications on DSpace Test, then inspected Abenet’s file with Gaia’s “multiple documents” field one more time and decided to do the import on CGSpace
    • Gaia’s “multiple documents” column had some text like “E6” and “F7” that didn’t make any sense, and those files were not even in the SharePoint

2023-02-26

  • Start a harvest on AReS

2023-02-27

  • I found two items for the CAS Publications that were marked as duplicates, but upon second inspection were not, so I uploaded them to CGSpace
    • That makes the total number of items for CAS 200…
  • I did some CSV joining and inspection of the remaining thirty-six duplicates against the metadata of their existing items on CGSpace, and then uploaded them
  • Do some work on the new DSpace 7 submission forms
    • I ended up reverting to the stock configuration so that I can use some of the new techniques like the style and type binds

2023-02-28

  • Keep working on the DSpace 7 submission forms
    • As part of this I asked Maria and Francesca if they are still using the cg.link.permalink (Bioversity publications permalink) and they said no, so we can remove it from the submission form
    • I also removed cg.subject.ccafs since the CRP ended over a year ago and cg.subject.pabra since there have only been a handful of new items in their collection and they seem to be using Alliance subjects instead
  • I filed a bug on DSpace regarding the inability to add freetext values from an input field that uses a vocabulary