CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

January, 2024

2024-01-02

  • Work on preparation of new server for DSpace 7 migration
    • I’m not quite sure what we need to do for the Handle server
    • For now I just ran the dspace make-handle-config script and diffed it with the one from DSpace 6
    • I sent the bundle to the Handle admins to make sure it’s OK before we do the migration
  • Continue testing and debugging the cgspace-java-helpers on DSpace 7
  • Work on IFPRI ISNAR archive cleanup
Read more →

December, 2023

2023-12-01

  • There is still high load on CGSpace and I don’t know why
    • I don’t see a high number of sessions compared to previous days in the last few weeks
$ for file in dspace.log.2023-11-[23]*; do echo "$file"; grep -a -oE 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
dspace.log.2023-11-20
22865
dspace.log.2023-11-21
20296
dspace.log.2023-11-22
19688
dspace.log.2023-11-23
17906
dspace.log.2023-11-24
18453
dspace.log.2023-11-25
17513
dspace.log.2023-11-26
19037
dspace.log.2023-11-27
21103
dspace.log.2023-11-28
23023
dspace.log.2023-11-29
23545
dspace.log.2023-11-30
21298
  • Even the number of unique IPs is not very high compared to the last week or so:
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort | uniq | wc -l
17023
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.2.gz | sort | uniq | wc -l
17294
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.3.gz | sort | uniq | wc -l
22057
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.4.gz | sort | uniq | wc -l
32956
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.5.gz | sort | uniq | wc -l
11415
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.6.gz | sort | uniq | wc -l
15444
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.7.gz | sort | uniq | wc -l
12648
  • It doesn’t make any sense so I think I’m going to restart the server…
    • After restarting the server the load went down to normal levels… who knows…
  • I started looking into how I’m going to generate the fake statistics for the Alliance bitstream that was replaced (see the sketch at the end of this entry)
    • For now I exported all the statistics for the owningItem:
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/stats-export.json -f 'owningItem:b5862bfa-9799-4167-b1cf-76f0f4ea1e18' -k uid
  • Importing them into DSpace Test didn’t show the statistics in the Atmire module, but I see them in Solr…
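  • The rough idea for the fake statistics is to clone the fields from a real download event for that bitstream and write new documents with fresh UUIDs and spread-out timestamps; here is a minimal sketch (not the final script, and the template field values are illustrative placeholders):
#!/usr/bin/env python3
# Minimal sketch of generating fake bitstream download statistics: copy the
# fields from a real "view" event in the export above and vary uid and time.
# The template values below are illustrative placeholders, not the real UUIDs.
import json
import random
import uuid
from datetime import datetime, timedelta, timezone

template = {
    "type": 0,                    # 0 is a bitstream in DSpace's Constants
    "statistics_type": "view",
    "bundleName": "ORIGINAL",
    "owningItem": "b5862bfa-9799-4167-b1cf-76f0f4ea1e18",
    "id": "00000000-0000-0000-0000-000000000000",  # placeholder bitstream UUID
    "isBot": False,
}

docs = []
for _ in range(3200):
    doc = dict(template)
    doc["uid"] = str(uuid.uuid4())
    # spread the fabricated downloads over the past year
    when = datetime.now(timezone.utc) - timedelta(minutes=random.randint(0, 525600))
    doc["time"] = when.strftime("%Y-%m-%dT%H:%M:%S.000Z")
    docs.append(doc)

with open("/tmp/stats-fake.json", "w") as f:
    json.dump(docs, f)
  • The generated file could then presumably be imported with solr-import-export-json the same way as the export above, swapping -a export for -a import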

2023-12-02

  • Export CGSpace to check for missing Initiative collection mappings
  • Start a harvest on AReS

2023-12-04

  • Send a message to Altmetric support because the item IWMI highlighted last month still doesn’t show the attention score for the Handle, even though I tweeted the Handle several times weeks ago
  • Spent some time writing a Python script to fix the literal MaxMind City JSON objects in our Solr statistics (a rough sketch of the fix is at the end of this entry)
    • There are about 1.6 million of these, so I exported them using solr-import-export-json with the query city:com*, but ended up finding many that have missing bundles, container bitstreams, etc.:
city:com* AND -bundleName:[* TO *] AND -containerBitstream:[* TO *] AND -file_id:[* TO *] AND -owningItem:[* TO *] AND -version_id:[* TO *]
  • (Note the negation to find fields that are missing)
  • I don’t know what I want to do with these yet
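  • For reference, the broken city values are the literal MaxMind City object dumped as a string rather than the city name, so the fix is essentially to parse out the English name; a rough sketch (not the final script, and the input format is an assumption):
#!/usr/bin/env python3
# Rough sketch of the MaxMind city fix (not the final fix_maxmind_stats.py).
# Assumes broken values look like:
#   com.maxmind.geoip2.record.City [ {..., "names": {"en": "Nairobi", ...}} ]
import json
import re

# capture the JSON payload between the brackets of the literal City object
CITY_PATTERN = re.compile(r"^com\.maxmind\.geoip2\.record\.City \[ (.+) \]$")

def fix_city(value: str) -> str:
    """Return the English city name if the value is a literal MaxMind City object."""
    match = CITY_PATTERN.match(value)
    if not match:
        return value
    try:
        return json.loads(match.group(1))["names"]["en"]
    except (json.JSONDecodeError, KeyError):
        return value

# The real script would iterate over the documents exported with
# solr-import-export-json, rewrite the city field, and write them back out.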

2023-12-05

  • I finished the fix_maxmind_stats.py script, fixed 1.6 million records, and imported them on CGSpace after testing on DSpace 7 Test
  • Altmetric said there was a glitch with the Handle and DOI linking; they successfully re-scraped the item page and linked them
    • They sent me a list of current production IPs and I notice that some of them are in our nginx bot network list:
$ for network in $(csvcut -c network /tmp/ips.csv | sed 1d | sort -u); do grepcidr $network ~/src/git/rmg-ansible-public/roles/dspace/files/nginx/bot-networks.conf; done
108.128.0.0/13 'bot';
46.137.0.0/16 'bot';
52.208.0.0/13 'bot';
52.48.0.0/13 'bot';
54.194.0.0/15 'bot';
54.216.0.0/14 'bot';
54.220.0.0/15 'bot';
54.228.0.0/15 'bot';
63.32.242.35/32     'bot';
63.32.0.0/14 'bot';
99.80.0.0/15 'bot'
  • I will remove those for now so that Altmetric doesn’t have any unexpected issues harvesting

2023-12-08

  • Finalized the script to generate Solr statistics for Alliance researcher Mirjam
    • The script is ilri/generate_solr_statistics.py
    • I generated ~3,200 statistics based on her records of that item’s downloads and imported them on CGSpace
  • Did some work on the DSpace 7 submission form
  • Peter asked for lists of affiliations, investors, and publishers to do some cleanups
    • I generated a list from a CSV export instead of doing it based on a SQL dump…
$ csvcut -c 'cg.contributor.affiliation[en_US]' /tmp/initiatives.csv       \
  | sed -e 1d -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' -e '/^$/d'            \
  | sort | uniq -c | sort -hr                                              \
  | awk 'BEGIN { FS = "^[[:space:]]+[[:digit:]]+[[:space:]]+" } {print $2}'\
  | sed -e '1i cg.contributor.affiliation' -e 's/^\(.*\)$/"\1"/'           \
  > /tmp/2023-12-08-initiatives-affiliations.csv
  • Export a list of authors as well:
localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "dc.contributor.author", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 3 GROUP BY "dc.contributor.author" ORDER BY count DESC) to /tmp/2023-12-08-authors.csv WITH CSV HEADER;
COPY 102435

2023-12-11

  • Work on OpenRXV dependencies and podman a bit
  • Peter noticed that the statistics for this month are very very low on CGSpace
    • I don’t know what is going on, perhaps it is related to me adjusting the nginx config last week?
    • Ah, it’s probably because of the spider patterns I updated in 2023-11

2023-12-16

  • Export CGSpace to check for missing Initiative collection mappings
  • Start a harvest on AReS

2023-12-17

  • Pull latest master branch for OpenRXV and deploy on the server
    • I threw away some changes in the tree regarding the Angular base ref, and it broke AReS
    • So note to self: we need to set the base ref in frontend/Dockerfile before building!
  • Now Salem fixed the country map

2023-12-18

  • Work a bit on the IFPRI-ISNAR archive from Leigh
  • More work on the DSpace 7 home page

2023-12-19

  • More work on the DSpace 7 home page
  • The Alliance TIP team is testing deposits to the DSpace 7 REST API and getting an HTTP 500 error
    • In the DSpace logs I see this after they log in, create the item, and update the metadata:
2023-12-19 17:49:28,022 ERROR unknown unknown org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.

2023-12-20

  • The Alliance guys said that submitting via REST works now… sigh, so that’s just some old DSpace 5/6 REST API bug
  • I lowercased all our AGROVOC keywords in dcterms.subject in SQL:
dspace=# BEGIN;
BEGIN
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 462
dspace=*# COMMIT;
COMMIT

2023-12-25

  • Looking into Solr backups
    • Since we are not running in Solr Cloud mode we need to use the replication endpoint for Solr standalone
    • This works:
$ curl 'http://localhost:8983/solr/statistics/replication?command=backup'
{
  "responseHeader":{
    "status":0,
    "QTime":26},
  "status":"OK"}
  • Then I saw the size of the snapshot reach the size of the index…
# du -sh /var/solr/data/configsets/statistics/data/*
22G     /var/solr/data/configsets/statistics/data/index
16G     /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K    /var/solr/data/configsets/statistics/data/snapshot_metadata
# du -sh /var/solr/data/configsets/statistics/data/*
22G     /var/solr/data/configsets/statistics/data/index
20G     /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K    /var/solr/data/configsets/statistics/data/snapshot_metadata
# du -sh /var/solr/data/configsets/statistics/data/*
22G     /var/solr/data/configsets/statistics/data/index
21G     /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K    /var/solr/data/configsets/statistics/data/snapshot_metadata
# du -sh /var/solr/data/configsets/statistics/data/*
22G     /var/solr/data/configsets/statistics/data/index
22G     /var/solr/data/configsets/statistics/data/snapshot.20231225074111671
4.0K    /var/solr/data/configsets/statistics/data/snapshot_metadata
  • Then I deleted all documents in the core and restored from the snapshot backup:
$ curl http://localhost:8983/solr/statistics/update -H "Content-type: text/xml" --data-binary '<delete><query>*:*</query></delete>'
$ curl http://localhost:8983/solr/statistics/update -H "Content-type: text/xml" --data-binary '<commit />'
$ curl 'http://localhost:8983/solr/statistics/replication?command=restore&name=statistics'
  • Interestingly the restore worked fine, but created a new index directory referenced by index.properties:
# du -sh /var/solr/data/configsets/statistics/data/*
4.0K    /var/solr/data/configsets/statistics/data/index.properties
22G     /var/solr/data/configsets/statistics/data/restore.20231225154626463
4.0K    /var/solr/data/configsets/statistics/data/snapshot_metadata
22G     /var/solr/data/configsets/statistics/data/snapshot.statistics
  • I’m not sure what the implications of that are, but Solr uses the data just fine
  • I can surely use this for atomic Solr backups
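  • A small sketch of how I might script that, using the replication handler’s name and numberToKeep parameters and polling the details command until the snapshot finishes (untested, so treat it as a starting point):
#!/usr/bin/env python3
# Sketch of a scripted Solr backup via the replication handler (untested).
# Uses a fixed snapshot name so a restore doesn't have to guess the timestamp.
import time

import requests

SOLR = "http://localhost:8983/solr/statistics"

# trigger the backup, keeping only one older snapshot around
requests.get(f"{SOLR}/replication", params={
    "command": "backup",
    "name": "statistics",
    "numberToKeep": 1,
}).raise_for_status()

# poll the details command until the backup reports success
while True:
    details = requests.get(f"{SOLR}/replication", params={"command": "details"}).json()
    backup = details.get("details", {}).get("backup", [])
    if "success" in backup:
        break
    time.sleep(30)

print("Backup finished:", backup)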

2023-12-27

  • Delete duplicate metadata as described in my DSpace issue from last year: https://github.com/DSpace/DSpace/issues/8253 (a rough sketch of the idea is at the end of this entry)
  • Do some other metadata cleanups on CGSpace
    • I also looked up our DOIs on Crossref to get some missing abstracts and correct licenses and dates
  • Some minor work on the CGSpace DSpace 7 theme to fix the navbar on mobile
  • Some work on the IFPRI ISNAR archive
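  • Going back to the duplicate metadata cleanup above: the gist is to delete rows that share the same object, field, and value, keeping the one with the lowest metadata_value_id; a rough psycopg2 sketch of that idea (the exact queries and duplicate criteria are in the GitHub issue, and the connection parameters are placeholders):
#!/usr/bin/env python3
# Rough sketch of the duplicate metadata cleanup (see the GitHub issue above for
# the exact queries). Assumes a duplicate means the same object, field, and text
# value, and keeps the row with the lowest metadata_value_id.
import psycopg2

connection = psycopg2.connect("dbname=dspace user=dspace host=localhost")

with connection:
    with connection.cursor() as cursor:
        cursor.execute(
            """
            DELETE FROM metadatavalue a
            USING metadatavalue b
            WHERE a.metadata_value_id > b.metadata_value_id
              AND a.dspace_object_id = b.dspace_object_id
              AND a.metadata_field_id = b.metadata_field_id
              AND a.text_value = b.text_value
            """
        )
        print(f"Deleted {cursor.rowcount} duplicate metadata values")

connection.close()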

2023-12-28

  • I started porting the cgspace-java-helpers to DSpace 7
  • Some work on the IFPRI ISNAR archive
    • I ended up going through most of the PDFs to get better dates and abstracts

2023-12-29

  • I created a new Hetzner server to replace the current DSpace 6 CGSpace next week when we migrate to DSpace 7
  • Interesting, I haven’t checked for content pointing to legacy domains in several years (!)
    • inurl:mahider.cgiar.org: 0 results on Google!
    • inurl:mahider.ilri.org: 2,100 results on Google
    • inurl:mahider.ilri.org inurl:https: 2 results on Google (!)
    • inurl:dspace.ilri.org: 1,390 results on Google
    • inurl:dspace.ilri.org inurl:https: 0 results on Google (!)
  • So it seems I can do away with the HTTPS virtual hosts finally
    • Well my current certificates expired on 2021-02-13 and nobody noticed… so…
Read more →

November, 2023

2023-11-01

  • Work a bit on the ETL pipeline for the CGIAR Climate Change Synthesis
    • I improved the filtering and wrote some Python using pandas to merge my sources more reliably

2023-11-02

  • Export CGSpace to check missing Initiative collection mappings
  • Start a harvest on AReS
Read more →

October, 2023

2023-10-02

  • Export CGSpace to check DOIs against Crossref
    • I found that Crossref’s metadata is in the public domain under the CC0 license
    • One interesting exception is abstracts, which are copyrighted by their owners, meaning Crossref cannot waive that copyright under the terms of the CC0 license because it is not theirs to waive
    • We can be on the safe side by using only abstracts for items that are licensed under Creative Commons
Read more →

September, 2023

2023-09-02

  • Export CGSpace to check for missing Initiative collection mappings
  • Start a harvest on AReS
Read more →

August, 2023

2023-08-03

  • I finally got around to working on Peter’s cleanups for affiliations, authors, and donors from last week
    • I did some minor cleanups myself and applied them to CGSpace
  • Start working on some batch uploads for IFPRI
Read more →

July, 2023

2023-07-01

  • Export CGSpace to check for missing Initiative collection mappings
  • Start harvesting on AReS

2023-07-02

  • Minor edits to the crossref_doi_lookup.py script while running some checks on 22,000 CGSpace DOIs

2023-07-03

  • I analyzed the licenses declared by Crossref and found with high confidence that ~400 of ours were incorrect
    • I took the more accurate ones from Crossref and updated the items on CGSpace
    • I took a few hundred ISBNs as well for where we were missing them
    • I also tagged ~4,700 items with missing licenses as “Copyrighted; all rights reserved” based on their Crossref license status being TDM, mostly from Elsevier, Wiley, and Springer
    • Checking a dozen or so manually, I confirmed that if Crossref only has a TDM license then it’s usually copyrighted (could still be open access, but we can’t tell via Crossref); see the sketch at the end of this entry
    • I would be curious to write a script to check the Unpaywall API for open access status…
    • In the past I found that their license status was not very accurate, but the open access status might be more reliable
  • More minor work on the DSpace 7 item views
    • I learned some new Angular template syntax
    • I created a custom component to show Creative Commons licenses on the simple item page
    • I also decided that I don’t like the Impact Area icons as a component because they don’t have any visual meaning
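  • For reference, the TDM check against Crossref is conceptually simple: fetch a work’s licenses from the REST API and see whether the only content-version present is "tdm"; a rough sketch of that idea (not the exact logic in crossref_doi_lookup.py, and the email is a placeholder for the polite pool):
#!/usr/bin/env python3
# Rough sketch: does Crossref only declare a "tdm" (text and data mining)
# license for this DOI? Not the exact logic used in crossref_doi_lookup.py.
import sys

import requests

def only_tdm_license(doi: str, email: str = "example@cgiar.org") -> bool:
    """Return True if Crossref lists licenses and all are content-version "tdm"."""
    response = requests.get(
        f"https://api.crossref.org/works/{doi}",
        params={"mailto": email},  # placeholder address for the polite pool
        timeout=30,
    )
    response.raise_for_status()
    licenses = response.json()["message"].get("license", [])
    content_versions = {lic.get("content-version") for lic in licenses}
    return bool(content_versions) and content_versions == {"tdm"}

if __name__ == "__main__":
    print(only_tdm_license(sys.argv[1]))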

2023-07-04

  • Focus group meeting with CGSpace partners about DSpace 7
  • I added a themed file selection component to the CGSpace theme
    • It displays the bitstream description instead of the file name, just like we did in DSpace 6 XMLUI
  • I added a custom component to show share icons

2023-07-05

  • I spent some time trying to update OpenRXV from Angular 9 to 10 to 11 to 12 to 13
    • Most things work but there are some minor bugs it seems
  • Mishell from CIP emailed me to say she was having problems approving an item on CGSpace
    • Looking at PostgreSQL I saw a dozen or so locks that were several hours (and even over one day) old, so I killed those processes and told her to try again

2023-07-06

  • Types meeting
  • I wrote a Python script to check Unpaywall for some information about DOIs
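    • The script is essentially a thin wrapper around the Unpaywall REST API; a minimal sketch of the lookup (the email parameter is a placeholder and the fields I actually keep may differ):
#!/usr/bin/env python3
# Minimal sketch of an Unpaywall lookup for one DOI (the real script reads a
# CSV of DOIs and writes the results out; the email below is a placeholder).
import sys

import requests

def unpaywall_lookup(doi: str, email: str = "example@cgiar.org") -> dict:
    """Return the open access status and best location details for a DOI."""
    response = requests.get(
        f"https://api.unpaywall.org/v2/{doi}", params={"email": email}, timeout=30
    )
    response.raise_for_status()
    data = response.json()
    best = data.get("best_oa_location") or {}
    return {
        "doi": doi,
        "is_oa": data.get("is_oa"),
        "oa_status": data.get("oa_status"),
        "version": best.get("version"),      # e.g. publishedVersion vs acceptedVersion
        "license": best.get("license"),
        "host_type": best.get("host_type"),  # "publisher" is the most convincing
    }

if __name__ == "__main__":
    print(unpaywall_lookup(sys.argv[1]))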

2023-07-07

  • Continue exploring Unpaywall data for some of our DOIs
    • In the past I’ve found their licensing information to not be very reliable (preferring Crossref), but I think their open access status is more reliable, especially when the provider is listed as being the publisher
    • Even so, sometimes the version can be “acceptedVersion”, which is presumably the author’s version, as opposed to the “publishedVersion”, which means it’s available as open access on the publisher’s website
    • I did some quality assurance and found ~100 that were marked as Limited Access, but should have been Open Access, and fixed a handful of licenses
  • Delete duplicate metadata as described in my DSpace issue from last year: https://github.com/DSpace/DSpace/issues/8253
  • Start working on some statistics on AGROVOC usage for my presentation next week
    • I used the following SQL query to dump values from all subject fields and lower case them:
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119)) to /tmp/2023-07-07-cgspace-subjects.csv WITH CSV HEADER;
COPY 26443
Time: 2564.851 ms (00:02.565)
  • Then I extracted the subjects and looked them up against AGROVOC:
$ csvcut -c subject /tmp/2023-07-07-cgspace-subjects.csv | sed '1d' > /tmp/2023-07-07-cgspace-subjects.txt
$ ./ilri/agrovoc_lookup.py -i /tmp/2023-07-07-cgspace-subjects.txt -o /tmp/2023-07-07-cgspace-subjects-results.csv
  • I did some more tests with Angular 13 on OpenRXV and found out why the repository type dropdown wasn’t working
    • It was because of a missing 1-line JSON file in the data directory, which is runtime data, not code
    • I copied the data directory from the production server and rebuilt, and the site is working well now
    • I did a full harvest with plugins and it worked!
    • So it seems Angular 13.4.0 will work, yay

2023-07-08

  • Export CGSpace to check for missing Initiative collection mappings
  • Start a harvest on AReS
  • The AGROVOC lookup finished, so I checked the number of matches:
$ csvgrep -c 'match type' -r '^.+$' ~/Downloads/2023-07-07-cgspace-subjects-resolved.csv | sed 1d | wc -l
12528
  • So that’s 12,528 out of 26,443 unique terms (47.3%)
  • I did a LOT of work on the OpenRXV frontend build dependencies to bring them more in line with Angular 13

2023-07-10

  • I did a lot more work on OpenRXV to test and update dependencies
  • I deployed the latest version on the production server

2023-07-12

  • CGSpace upgrade meeting with Americas and Africa group

2023-07-13

  • Michael Victor asked me to help Aditi extract some information from CGSpace
    • She was interested in journal articles published between 2018 and 2023 with a range of subjects related to drought, flooding, resilience, etc
    • I used an advanced query with some AGROVOC terms:
dcterms.issued:[2018 TO 2023] AND dcterms.type:"Journal Article" AND (dcterms.subject:flooding OR dcterms.subject:flood OR dcterms.subject:"extreme weather events" OR dcterms.subject:drought OR dcterms.subject:"drought resistance" OR dcterms.subject:"drought tolerance" OR dcterms.subject:"soil salinity" OR dcterms.subject:"pests of plants" OR dcterms.subject:pests OR dcterms.subject:heat OR dcterms.subject:fertilizers OR dcterms.subject:"fertilizer technology" OR dcterms.subject:"rice fields" OR dcterms.subject:"landscape conservation" OR dcterms.subject:"landscape restoration" OR dcterms.subject:livestock)
  • Interestingly, some variations of this same exact query produce no search results, and I see this error in the DSpace log:
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dcterms.issued:[2018 TO 2023] AND dcterms.type:"Journal Article" AND (dcterms.subject:flooding OR dcterms.subject:flood OR dcterms.subject:"extreme weather events" OR dcterms.subject:drought OR dcterms.subject:"drought resistance" OR dcterms.subject:"drought tolerance" OR dcterms.subject:"soil salinity" OR dcterms.subject:"pests of plants" OR dcterms.subject:pests OR dcterms.subject:heat OR dcterms.subject:fertilizers OR dcterms.subject:"fertilizer technology" OR dcterms.subject:"rice fields" OR dcterms.subject:livestock OR dcterms.subject:"landscape conservation" OR dcterms.subject:"landscape restoration\"\)': Lexical error at line 1, column 617.  Encountered: <EOF> after : "\"landscape restoration\\\"\\)"
  • It seems to be when there is a quoted search term at the end of the parenthesized group
    • For what it’s worth this same query worked fine on DSpace 7.6

2023-07-15

  • Export CGSpace to fix missing Initiative collection mappings
  • Start a harvest on AReS

2023-07-17

  • Rasika had sent me a list of new ORCID identifiers for new IWMI staff, so I combined them with our existing list and ran resolve_orcids.py to refresh the names in our database
    • I updated the list, updated names in the database, and tagged new authors with missing identifiers in existing items
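    • For context, resolve_orcids.py essentially looks each identifier up on the public ORCID API and formats a display name; a rough sketch of that lookup (not the exact implementation, and the output format may differ from our controlled vocabulary):
#!/usr/bin/env python3
# Rough sketch of resolving an ORCID identifier to a name via the public API
# (not the exact resolve_orcids.py implementation).
import requests

def resolve_orcid(orcid: str) -> str:
    """Return a "Family, Given: identifier" string for an ORCID identifier."""
    response = requests.get(
        f"https://pub.orcid.org/v3.0/{orcid}/person",
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    name = response.json().get("name") or {}
    given = (name.get("given-names") or {}).get("value", "")
    family = (name.get("family-name") or {}).get("value", "")
    return f"{family}, {given}: {orcid}"

if __name__ == "__main__":
    # ORCID's well-known example record (Josiah Carberry)
    print(resolve_orcid("0000-0002-1825-0097"))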

2023-07-18

  • Meeting with IWMI, IRRI, and IITA colleagues about CGSpace upgrade plans
  • Maria from the Alliance mentioned having some submissions stuck on CGSpace
    • I looked and found a number of locks that had been stuck for nineteen, eighteen, or more hours…
    • I killed them and told her to try again
$ psql < locks-age.sql | less -S
$ psql < locks-age.sql | grep -E " (19|18|17|16|12):" | awk -F"|" '{print $10}' | sort -u | xargs kill

2023-07-19

  • I had to kill a bunch more locked processes in PostgreSQL; I’m not sure what’s going on
  • After some discussion about an advanced search bug with Tim on Slack, I filed an issue on GitHub

2023-07-20

  • I added a new metadata field for CGIAR Impact Platforms (cg.subject.impactPlatform) to CGSpace

2023-07-22

  • Export CGSpace to fix missing Initiative collection mappings
  • Start a harvest on AReS

2023-07-24

  • Test Salem’s new JavaScript-based DSpace Statistics API and send him some feedback
  • I noticed a few times that the Solr service on my DSpace 7 instance is getting OOM killed
    • I had been using a 4g Solr heap, but maybe we don’t need that much
    • Tomcat is also using 4.6GB, and then there’s PostgreSQL… so perhaps it’s all a bit much on this system now

2023-07-25

  • Start testing exporting DSpace 6 Solr cores to import on DSpace 7:
$ chrt -b 0 dspace solr-export-statistics -i statistics
  • I’m curious how long it takes and how much data there will be
    • The size of the Solr data directory is currently 82GB
    • The export took about 2.5 hours and created 6,000 individual CSVs, one for each day of Solr stats
    • The size of the exported CSVs is about 88GB
    • I will copy just a few years to import on the DSpace 7 test server
    • So importing these is going to require removing the Atmire custom fields:
$ dspace solr-import-statistics -i statistics
Exception: Error from server at http://localhost:8983/solr/statistics: ERROR: [doc=1a92472e-e39d-4602-9b4d-da022df8f233] unknown field 'containerCommunity'
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/statistics: ERROR: [doc=1a92472e-e39d-4602-9b4d-da022df8f233] unknown field 'containerCommunity'
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:681)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:266)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
        at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
        at org.dspace.util.SolrImportExport.importIndex(SolrImportExport.java:465)
        at org.dspace.util.SolrImportExport.main(SolrImportExport.java:148)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:277)
        at org.dspace.app.launcher.ScriptLauncher.handleScript(ScriptLauncher.java:133)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:98)
  • I will try using solr-import-export-json, which I’ve used in the past to skip Atmire custom fields in Solr:
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2022.json -f 'time:[2022-01-01T00\:00\:00Z TO 2022-12-31T23\:59\:59Z]' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,geoIpCountryCode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId,core_update_run_nb
  • Some users complained that CGSpace was slow and I found a handful of locks that were hours and days old…
    • I killed those and told them to try again
  • After importing the Solr statistics into DSpace 7 I realized that my DSpace Statistics API will work fine
    • I made some minor modifications to the Ansible infrastructure scripts to make sure it is enabled and then activated it on DSpace 7 Test

2023-07-26

  • Debugging lock issues on CGSpace
    • I see the blocking PIDs for some long-held locks are “idle in transaction”:
$ ps auxw | grep -E "(1864132|1659487)"
postgres 1659487  0.0  0.5 3269900 197120 ?      Ss   Jul25   0:03 postgres: 14/main: cgspace cgspace 127.0.0.1(61648) idle in transaction
postgres 1864132  0.1  0.7 3275704 254528 ?      Ss   07:27   0:08 postgres: 14/main: cgspace cgspace 127.0.0.1(36998) idle in transaction
postgres 1880388  0.0  0.0   9208  2432 pts/3    S+   08:48   0:00 grep -E (1864132|1659487)
  • I used some other scripts and found that those processes were executing the following statement:
select nextval ('public.tasklistitem_seq')
  • I don’t know why these can get blocked for hours without resolution, but for now I just killed them
  • I wrote a slightly longer regex to match locks that have been stuck for more than 1 hour based on the output of the locks-age.sql script and killed them:
$ psql < locks-age.sql | awk -F"|" '/ [[:digit:]][1-9]:[[:digit:]]{2}:[[:digit:]]{2}\./ {print $10}' | sort -u | xargs kill
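  • An alternative I’m considering to parsing the locks-age.sql output: ask PostgreSQL directly for backends that have been idle in transaction for more than an hour and terminate them with pg_terminate_backend() instead of OS signals; a sketch (connection parameters are placeholders):
#!/usr/bin/env python3
# Sketch of an alternative to the regex + kill approach: find backends idle in
# transaction for over an hour and terminate them cleanly from PostgreSQL.
import psycopg2

connection = psycopg2.connect("dbname=cgspace user=postgres host=localhost")
connection.autocommit = True

with connection.cursor() as cursor:
    cursor.execute(
        """
        SELECT pid, pg_terminate_backend(pid)
        FROM pg_stat_activity
        WHERE datname = 'cgspace'
          AND state = 'idle in transaction'
          AND xact_start < now() - interval '1 hour'
        """
    )
    for pid, terminated in cursor.fetchall():
        print(f"Terminated PID {pid}: {terminated}")

connection.close()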

2023-07-27

  • Export CGSpace to check countries, regions, types, and Initiatives
    • There were a few minor issues in countries and regions, and I noticed 186 items without types!
    • Then I ran the file through csv-metadata-quality to make sure items with countries have appropriate regions
  • Brief discussion about OpenRXV bugs and fixes with Moayad
  • I was toying with the idea of adding an expanded whitespace check/fix to csv-metadata-quality, based on ESLint’s no-irregular-whitespace rule (a sketch is at the end of this entry)
    • I found 176 items in CGSpace with such whitespace in their titles alone
    • I compared the results of removing these characters and replacing them with a space
    • In most cases removing it is the correct thing to do, for example “Pesticides : une arme à double tranchant” → “Pesticides: une arme à double tranchant”
    • But in some items removing it is wrong, for example “L’environnement juridique est-il propice à la gestion” → “L’environnement juridique est-il propice àla gestion”
    • I guess it would really need some good heuristics or a human to verify…
  • I upgraded OpenRXV to Angular v14
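  • Here is the rough sketch of the whitespace check mentioned above, using (a subset of) the characters from ESLint’s no-irregular-whitespace rule; it replaces them with a normal space rather than removing them, pending better heuristics:
#!/usr/bin/env python3
# Rough sketch of an irregular whitespace fix for csv-metadata-quality, based on
# (a subset of) the characters in ESLint's no-irregular-whitespace rule.
# Replaces with a regular space rather than removing, pending better heuristics.
import re

IRREGULAR_WHITESPACE = re.compile(
    "[\u000b\u000c\u0085\u00a0\u1680\u2000-\u200b\u2028\u2029\u202f\u205f\u3000\ufeff]"
)

def fix_whitespace(value: str) -> str:
    """Replace irregular whitespace characters with a regular space."""
    return IRREGULAR_WHITESPACE.sub(" ", value)

# Example from above: a no-break space between "à" and "la" survives as a space
print(fix_whitespace("L’environnement juridique est-il propice à\u00a0la gestion"))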

2023-07-28

Exception in thread "main" org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/statistics: ERROR: [doc=0008a7c1-e552-4a4e-93e4-4d23bf39964b] Error adding field 'workflowItemId'='0812be47-1bfe-45e2-9208-5bf10ee46f81' msg=For input string: "0812be47-1bfe-45e2-9208-5bf10ee46f81"
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:745)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:259)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:234)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:102)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:69)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:82)
        at it.damore.solr.importexport.App.insertBatch(App.java:295)
        at it.damore.solr.importexport.App.lambda$writeAllDocuments$10(App.java:276)
        at it.damore.solr.importexport.BatchCollector.lambda$accumulator$0(BatchCollector.java:71)
        at java.base/java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
        at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
        at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
        at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
        at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
        at it.damore.solr.importexport.App.writeAllDocuments(App.java:252)
        at it.damore.solr.importexport.App.main(App.java:150)
isInternal,workflowItemId,containerCommunity,containerCollection,containerItem,containerBitstream,dateYear,dateYearMonth,filterquery,complete_query,simple_query,complete_query_search,simple_query_search,ngram_query_search,ngram_simplequery_search,text,storage_statistics_type,storage_size,storage_nb_of_bitstreams,name,first_name,last_name,p_communities_id,p_communities_name,p_communities_map,p_group_id,p_group_name,p_group_map,group_id,group_name,group_map,parent_count,bitstreamId,bitstreamCount,actingGroupId,actorMemberGroupId,actingGroupParentId,rangeDescription,range,version_id,file_id,cua_version,core_update_run_nb,orphaned
  • I will combine it with the other fields I was skipping above and try the export again:
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2020.json -f 'time:[2020-01-01T00\:00\:00Z TO 2020-12-31T23\:59\:59Z]' -k uid -S actingGroupId,actingGroupParentId,actorMemberGroupId,author_mtdt,author_mtdt_search,bitstreamCount,bitstreamId,complete_query,complete_query_search,containerBitstream,containerCollection,containerCommunity,containerItem,core_update_run_nb,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,file_id,filterquery,first_name,geoipcountrycode,geoIpCountryCode,group_id,group_map,group_name,ip_ngram,ip_search,isArchived,isInternal,iso_mtdt,iso_mtdt_search,isWithdrawn,last_name,name,ngram_query_search,ngram_simplequery_search,orphaned,parent_count,p_communities_id,p_communities_map,p_communities_name,p_group_id,p_group_map,p_group_name,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,referrer_ngram,referrer_search,simple_query,simple_query_search,solr_update_time_stamp,storage_nb_of_bitstreams,storage_size,storage_statistics_type,subject_mtdt,subject_mtdt_search,text,userAgent_ngram,userAgent_search,version_id,workflowItemId
  • Export a list of affiliations from the Initiatives community for Peter:
$ dspace metadata-export -i 10568/115087 -f /tmp/2023-07-28-initiatives.csv
$ csvcut -c 'cg.contributor.affiliation[en_US]' ~/Downloads/2023-07-28-initiatives.csv \
  | sed -e 1d -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' -e '/^$/d'            \
  | sort | uniq -c | sort -hr                                              \
  | awk 'BEGIN { FS = "^[[:space:]]+[[:digit:]]+[[:space:]]+" } {print $2}'\
  | sed -e '1i cg.contributor.affiliation' -e 's/^\(.*\)$/"\1"/'           \
  > /tmp/2023-07-28-initiatives-affiliations.csv
  • This is a method I first used in 2023-01 to export affiliations ONLY used in items in the Initiatives community
    • I did the same for authors and investors

2023-07-29

  • Export CGSpace to look for missing Initiative collection mappings
  • I found a bunch of locks waiting for many hours and killed them:
$ psql < locks-age.sql | awk -F"|" '$9 ~ / [[:digit:]][1-9]:[[:digit:]]{2}:[[:digit:]]{2}\./ {print $10}' | sort -u | xargs kill
  • This looks for a pattern matching something like 11:30:48.598436 in the age column (not 00:00:00) and kills them
  • Start a harvest on AReS
Read more →