CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

September, 2017

2017-09-06

  • Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two hours

2017-09-07

  • Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group

August, 2017

2017-08-01

  • Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
  • I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
  • The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
  • This means our Tomcat Crawler Session Valve is working
  • But many of the bots are browsing dynamic URLs like:
    • /handle/10568/3353/discover
    • /handle/10568/16510/browse
  • The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
  • Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
  • It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
  • Also, the bot has to successfully browse the page first so it can receive the HTTP header…
  • We might actually have to block these requests with HTTP 403 depending on the user agent
  • Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
  • This was due to newline characters in the dc.description.abstract column, which caused OpenRefine to choke when exporting the CSV
  • I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
  • Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
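The idea above of answering bots with HTTP 403 on the per-handle /discover and /browse pages can be sketched in Python. This is only an illustration of the decision logic: the crawler substrings, the function names, and the URL pattern are assumptions, and the real rule would presumably be expressed in the web server or Tomcat configuration instead.

```python
import re

# Hypothetical list of crawler substrings; adjust to the bots seen in the logs.
BOT_PATTERN = re.compile(r"googlebot|baiduspider|bingbot|slurp", re.IGNORECASE)

# Matches dynamic per-handle pages such as /handle/10568/3353/discover
DYNAMIC_PATH = re.compile(r"^/handle/\d+/\d+/(discover|browse)(/|\?|$)")


def should_block(path: str, user_agent: str) -> bool:
    """Return True if a known bot is requesting a dynamic discover/browse page."""
    is_dynamic = DYNAMIC_PATH.match(path) is not None
    is_bot = BOT_PATTERN.search(user_agent) is not None
    return is_dynamic and is_bot


def status_for(path: str, user_agent: str) -> int:
    """Return the HTTP status this sketch would answer with."""
    return 403 if should_block(path, user_agent) else 200
```

Regular item pages such as /handle/10568/3353 stay reachable for bots; only the dynamic listing pages are refused.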

July, 2017

2017-07-01

  • Run system updates and reboot DSpace Test

2017-07-04

  • Merge changes for WLE Phase II theme rename (#329)
  • Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
  • We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:

June, 2017

2017-06-01

  • After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes
  • The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes
  • Then we’ll create a new sub-community for Phase II and create collections for the research themes there
  • The current “Research Themes” community will be renamed to “WLE Phase I Research Themes”
  • Tagged all items in the current Phase I collections with their appropriate themes
  • Create pull request to add Phase II research themes to the submission form: #328
  • Add cg.subject.system to CGSpace metadata registry, for subject from the upcoming CGIAR Library migration

2017-06-04

  • After adding cg.identifier.wletheme to 1106 WLE items I can see the field on XMLUI but not in REST!
  • Strangely it happens on DSpace Test AND on CGSpace!
  • I tried to re-index Discovery but it didn’t fix it
  • Run all system updates on DSpace Test and reboot the server
  • After rebooting the server (and therefore restarting Tomcat) the new metadata field is available
  • I’ve sent a message to the dspace-tech mailing list to ask if this is a bug and whether I should file a Jira ticket

2017-06-05

  • Rename WLE’s “Research Themes” sub-community to “WLE Phase I Research Themes” on DSpace Test so Macaroni Bros can continue their testing
  • Macaroni Bros tested it and said it’s fine, so I renamed it on CGSpace as well
  • Working on how to automate the extraction of the CIAT Book chapters, doing some magic in OpenRefine to extract page from–to from cg.identifier.url and dc.format.extent, respectively:
    • cg.identifier.url: value.split("page=", "")[1]
    • dc.format.extent: value.replace("p. ", "").split("-")[1].toNumber() - value.replace("p. ", "").split("-")[0].toNumber()
  • Finally, after some filtering to see which small outliers there were (based on dc.format.extent using “p. 1-14” vs “29 p.”), create a new column with last page number:
    • cells["dc.page.from"].value.toNumber() + cells["dc.format.pages"].value.toNumber()
  • Then create a new, unique file name to be used in the output, based on a SHA1 of the dc.title and with a description:
    • dc.page.to: value.split(" ")[0].replace(",","").toLowercase() + "-" + sha1(value).get(1,9) + ".pdf__description:" + cells["dc.type"].value
  • Start processing 769 records after filtering the following (there are another 159 records that have some other format, or for example they have their own PDF which I will process later), using a modified generate-thumbnails.py script to read certain fields and then pass to GhostScript:
    • cg.identifier.url: value.contains("page=")
    • dc.format.extent: or(value.contains("p. "),value.contains(" p."))
    • Command like: $ gs -dNOPAUSE -dBATCH -dFirstPage=14 -dLastPage=27 -sDEVICE=pdfwrite -sOutputFile=beans.pdf -f 12605-1.pdf
  • 17 of the items have issues with incorrect page number ranges, and upon closer inspection they do not appear in the referenced PDF
  • I’ve flagged them and proceeded without them (752 total) on DSpace Test:
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/93843 --source /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &> /tmp/ciat-books.log
  • I went and did some basic sanity checks on the remaining items in the CIAT Book Chapters and decided they are mostly fine (except one duplicate and the flagged ones), so I imported them to DSpace Test too (162 items)
  • Total items in CIAT Book Chapters is 914, with the others being flagged for some reason, and we should send that back to CIAT
  • Restart Tomcat on CGSpace so that the cg.identifier.wletheme field is available on REST API for Macaroni Bros
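The page-range and file-name steps above were done with GREL in OpenRefine. For comparison, here is a rough Python sketch of the same arithmetic. The function names are hypothetical, the example URL in the usage note is made up, and the [1:9] slice is assumed to match what GREL's get(1,9) returns.

```python
import hashlib
import re


def first_page(url: str) -> int:
    """Extract the starting page from a URL containing 'page=N'."""
    match = re.search(r"page=(\d+)", url)
    if match is None:
        raise ValueError(f"no page= parameter in {url!r}")
    return int(match.group(1))


def page_count(extent: str) -> int:
    """Mirror the GREL above: for 'p. 1-14' return 14 - 1 = 13."""
    start, end = extent.replace("p. ", "").split("-")
    return int(end) - int(start)


def output_name(title: str, doc_type: str) -> str:
    """First word of the title, part of its SHA1, and a description suffix."""
    first_word = title.split(" ")[0].replace(",", "").lower()
    # Assumption: characters 1..8 of the hex digest, like GREL's get(1,9)
    digest = hashlib.sha1(title.encode("utf-8")).hexdigest()[1:9]
    return f"{first_word}-{digest}.pdf__description:{doc_type}"


def gs_command(url: str, extent: str, source_pdf: str, output_pdf: str) -> list:
    """Build a GhostScript argument list like the command shown above."""
    first = first_page(url)
    last = first + page_count(extent)
    return [
        "gs", "-dNOPAUSE", "-dBATCH",
        f"-dFirstPage={first}", f"-dLastPage={last}",
        "-sDEVICE=pdfwrite", f"-sOutputFile={output_pdf}",
        "-f", source_pdf,
    ]
```

With a URL ending in page=14 and an extent of "p. 1-14", this yields -dFirstPage=14 and -dLastPage=27, the same range as the gs command in the notes.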

2017-06-07

  • Testing Atmire’s patch for the CUA Workflow Statistics again
  • Still doesn’t seem to give results I’d expect, like there are no results for Maria Garruccio, or for the ILRI community!
  • Then I’ll file an update to the issue on Atmire’s tracker
  • Created a new branch with just the relevant changes, so I can send it to them
  • One thing I noticed is that there is a failed database migration related to CUA:
+----------------+----------------------------+---------------------+---------+
| Version        | Description                | Installed on        | State   |
+----------------+----------------------------+---------------------+---------+
| 1.1            | Initial DSpace 1.1 databas |                     | PreInit |
| 1.2            | Upgrade to DSpace 1.2 sche |                     | PreInit |
| 1.3            | Upgrade to DSpace 1.3 sche |                     | PreInit |
| 1.3.9          | Drop constraint for DSpace |                     | PreInit |
| 1.4            | Upgrade to DSpace 1.4 sche |                     | PreInit |
| 1.5            | Upgrade to DSpace 1.5 sche |                     | PreInit |
| 1.5.9          | Drop constraint for DSpace |                     | PreInit |
| 1.6            | Upgrade to DSpace 1.6 sche |                     | PreInit |
| 1.7            | Upgrade to DSpace 1.7 sche |                     | PreInit |
| 1.8            | Upgrade to DSpace 1.8 sche |                     | PreInit |
| 3.0            | Upgrade to DSpace 3.x sche |                     | PreInit |
| 4.0            | Initializing from DSpace 4 | 2015-11-20 12:42:52 | Success |
| 5.0.2014.08.08 | DS-1945 Helpdesk Request a | 2015-11-20 12:42:53 | Success |
| 5.0.2014.09.25 | DS 1582 Metadata For All O | 2015-11-20 12:42:55 | Success |
| 5.0.2014.09.26 | DS-1582 Metadata For All O | 2015-11-20 12:42:55 | Success |
| 5.0.2015.01.27 | MigrateAtmireExtraMetadata | 2015-11-20 12:43:29 | Success |
| 5.0.2017.04.28 | CUA eperson metadata migra | 2017-06-07 11:07:28 | OutOrde |
| 5.5.2015.12.03 | Atmire CUA 4 migration     | 2016-11-27 06:39:05 | OutOrde |
| 5.5.2015.12.03 | Atmire MQM migration       | 2016-11-27 06:39:06 | OutOrde |
| 5.6.2016.08.08 | CUA emailreport migration  | 2017-01-29 11:18:56 | OutOrde |
+----------------+----------------------------+---------------------+---------+
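To pick out unusual rows from a migration table like the one above without reading it by eye, a small parser is enough. This is a sketch with hypothetical function names. It compares the State column exactly as printed and makes no claim about what each state means.

```python
def parse_migrations(table: str) -> list:
    """Parse the ASCII table into dicts, skipping border lines and the header."""
    rows = []
    for line in table.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # border lines start with '+'
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 4 or cells[0] == "Version":
            continue  # header row or an unexpected layout
        version, description, installed_on, state = cells
        rows.append({
            "version": version,
            "description": description,
            "installed_on": installed_on,
            "state": state,
        })
    return rows


def unusual_migrations(rows: list, expected=("PreInit", "Success")) -> list:
    """Return rows whose state is not one of the expected values."""
    return [row for row in rows if row["state"] not in expected]
```

Run against the full table above, this would return the four Atmire-related rows at the bottom.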

2017-06-18

  • Redeploy CGSpace with latest changes from 5_x-prod, run system updates, and reboot the server
  • Continue working on ansible infrastructure changes for CGIAR Library

2017-06-20

  • Import Abenet and Peter’s changes to the CGIAR Library CRP community
  • Due to them using Windows and renaming some columns there were formatting, encoding, and duplicate metadata value issues
  • I had to remove some fields from the CSV and rename some back to, i.e., dc.subject[en_US] just so DSpace would detect changes properly
  • Now it looks much better: https://dspacetest.cgiar.org/handle/10947/2517
  • Removing the HTML tags and HTML/XML entities using the following GREL:
    • replace(value,/<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>/,'')
    • value.unescape("html").unescape("xml")
  • Finally import 914 CIAT Book Chapters to CGSpace in two batches:
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &> /tmp/ciat-books.log
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books2.map &> /tmp/ciat-books2.log
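The GREL cleanup used earlier in this entry (strip HTML tags, then unescape HTML/XML entities) can be approximated outside OpenRefine. The sketch below reuses the same regular expression with Python's standard library. The function name is hypothetical, and html.unescape is assumed to be a close enough stand-in for GREL's two unescape calls.

```python
import html
import re

# Same tag pattern as the GREL expression above, written as a Python raw string
TAG_PATTERN = re.compile(
    r"""</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>"""
)


def strip_html(value: str) -> str:
    """Remove HTML tags, then decode HTML/XML character entities."""
    without_tags = TAG_PATTERN.sub("", value)
    return html.unescape(without_tags)
```

As in the GREL version, tags are removed before entities are decoded, so an escaped "&lt;p&gt;" in the data is kept as literal text rather than being stripped.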

2017-06-25

  • WLE has said that one of their Phase II research themes is being renamed from Regenerating Degraded Landscapes to Restoring Degraded Landscapes
  • Pull request with the changes to input-forms.xml: #329
  • As of now it doesn’t look like there are any items using this research theme so we don’t need to do any updates:
dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=237 and text_value like 'Regenerating Degraded Landscapes%';
 text_value
------------
(0 rows)
  • Marianne from WLE asked if they can have both Phase I and II research themes together in the item submission form
  • Perhaps we can add them together in the same question for cg.identifier.wletheme

2017-06-30

  • CGSpace went down briefly, I see lots of these errors in the dspace logs:
Java stacktrace: java.util.NoSuchElementException: Timeout waiting for idle object
  • After looking at the Tomcat logs, Munin graphs, and PostgreSQL connection stats, it seems there is just a high load
  • Might be a good time to adjust DSpace’s database connection settings, like I first mentioned in April, 2017 after reading the 2017-04 DCAT comments
  • I’ve adjusted the following in CGSpace’s config:
    • db.maxconnections 30→70 (the default PostgreSQL config allows 100 connections, so DSpace’s default of 30 is quite low)
    • db.maxwait 5000→10000
    • db.maxidle 8→20 (DSpace default is -1, unlimited, but we had set it to 8 earlier)
  • We will need to adjust this again (as well as the pg_hba.conf settings) when we deploy tsega’s REST API
  • Whip up a test for Marianne of WLE to be able to show both their Phase I and II research themes in the CGSpace item submission form:

[Figures: Test A and Test B for displaying the Phase I and II research themes]


May, 2017

2017-05-01

2017-05-02

  • Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request

2017-05-04

  • Sync DSpace Test with database and assetstore from CGSpace
  • Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server
  • Now I can see the workflow statistics and am able to select users, but everything returns 0 items
  • Megan says there are still some mapped items that have not been appearing since last week, so I forced a full index-discovery -b
  • Need to remember to check tomorrow whether the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSpace Test): https://cgspace.cgiar.org/handle/10568/80731

2017-05-05

  • Discovered that CGSpace has ~700 items that are missing the cg.identifier.status field
  • Need to perhaps try using the “required metadata” curation task to find items missing these fields:
$ [dspace]/bin/dspace curate -t requiredmetadata -i 10568/1 -r - > /tmp/curation.out
  • It seems the curation task dies when it finds an item which has missing metadata

2017-05-06

2017-05-07

  • Testing one replacement for CCAFS Flagships (cg.subject.ccafs), first changed in the submission forms, and then in the database:
$ ./fix-metadata-values.py -i ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
  • Also, CCAFS wants to re-order their flagships to prioritize the Phase II ones
  • Waiting for feedback from CCAFS, then I can merge #320

2017-05-08

  • Start working on CGIAR Library migration
  • We decided to use AIP export to preserve the hierarchies and handles of communities and collections
  • When ingesting some collections I was getting java.lang.OutOfMemoryError: GC overhead limit exceeded, which can be solved by disabling the GC timeout with -XX:-UseGCOverheadLimit
  • Other times I was getting an error about heap space, so I kept bumping the RAM allocation by 512MB each time it crashed (up to 4096m!)
  • These failed imports leave tens of thousands of abandoned files in the assetstore, which need to be cleaned up using dspace cleanup -v, or else you’ll run out of disk space
  • In the end I realized it’s better to use submission mode (-s) to ingest the community object as a single AIP without its children, followed by each of the collections:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m -XX:-UseGCOverheadLimit"
$ [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10568/87775 /home/aorth/10947-1/10947-1.zip
$ for collection in /home/aorth/10947-1/COLLECTION@10947-*; do [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10947/1 $collection; done
$ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager -r -f -u -t AIP -e some@user.com $item; done

2017-05-09

  • The CGIAR Library metadata has some blank metadata values, which leads to ||| in the Discovery facets
  • Clean these up in the database using:
dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
  • I ended up running into issues during data cleaning and decided to wipe out the entire community and re-sync DSpace Test assetstore and database from CGSpace rather than waiting for the cleanup task to clean up
  • Hours into the re-ingestion I ran into more errors, and had to erase everything and start over again!
  • Now, no matter what I do I keep getting duplicate key errors…
Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "handle_pkey"
  Detail: Key (handle_id)=(80928) already exists.
  • I think those errors actually come from me running the update-sequences.sql script while Tomcat/DSpace are running
  • Apparently you need to stop Tomcat!

2017-05-10

  • Atmire says they are willing to extend the ORCID implementation, and I’ve asked them to provide a quote
  • I clarified that the scope of the implementation should be that ORCIDs are stored in the database and exposed via REST / API like other fields
  • Finally finished importing all the CGIAR Library content, final method was:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit"
$ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2517/10947-2517.zip
$ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2515/10947-2515.zip
$ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2516/10947-2516.zip
$ [dspace]/bin/dspace packager -s -t AIP -o ignoreHandle=false -e some@user.com -p 10568/80923 /home/aorth/10947-1/10947-1.zip
$ for collection in /home/aorth/10947-1/COLLECTION@10947-*; do [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10947/1 $collection; done
$ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager -r -f -u -t AIP -e some@user.com $item; done
  • Basically, import the smaller communities using recursive AIP import (with skipIfParentMissing)
  • Then, for the larger collection, create the community, collections, and items separately, ingesting the items one by one
  • The -XX:-UseGCOverheadLimit JVM option helps with some issues in large imports
  • After this I ran the update-sequences.sql script (with Tomcat shut down), and cleaned up the 200+ blank metadata records:
dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';

2017-05-13

  • After quite a bit of troubleshooting with importing cleaned up data as CSV, it seems that there are actually NUL characters in the dc.description.abstract field (at least) on the lines where CSV importing was failing
  • I tried to find a way to remove the characters in vim or OpenRefine, but decided it was quicker to just remove the column temporarily and import it
  • The import was successful and detected 2022 changes, which should likely be the rest that were failing to import before
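For the NUL characters noted above, an alternative to dropping the column (not what was done here) would be to strip the bytes from the file before importing. A minimal sketch, with hypothetical function names and assuming the CSV is small enough to read into memory:

```python
def strip_nul(data: bytes) -> bytes:
    """Remove NUL bytes, leaving every other byte untouched."""
    return data.replace(b"\x00", b"")


def clean_csv(src_path: str, dst_path: str) -> int:
    """Copy src_path to dst_path without NUL bytes; return how many were removed."""
    with open(src_path, "rb") as src:
        data = src.read()
    cleaned = strip_nul(data)
    with open(dst_path, "wb") as dst:
        dst.write(cleaned)
    return len(data) - len(cleaned)
```

Working on raw bytes avoids guessing the file's text encoding, and the returned count gives a quick check of whether the file was affected at all.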

2017-05-15

  • To delete the blank lines that cause issues during import we need to use a regex in vim: g/^$/d
  • After that I started looking in the dc.subject field to try to pull countries and regions out, but there are too many values in there
  • Bump the Academicons dependency of the Mirage 2 themes from 1.6.0 to 1.8.0 because the upstream deleted the old tag and now the build is failing: #321
  • Merge changes to CCAFS project identifiers and flagships: #320
  • Run updates for CCAFS flagships on CGSpace:
$ ./fix-metadata-values.py -i /tmp/ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p 'fuuu'
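The vim command g/^$/d mentioned at the top of this entry deletes every completely empty line. The same cleanup can be scripted. This is a rough Python equivalent with a hypothetical function name. Like the vim command, it also removes empty lines that sit inside quoted multi-line CSV fields.

```python
def drop_blank_lines(text: str) -> str:
    """Remove completely empty lines, keeping all other lines and their endings."""
    kept = [
        line
        for line in text.splitlines(keepends=True)
        if line.strip("\r\n") != ""
    ]
    return "".join(kept)
```

Lines containing only spaces are kept, since the vim pattern ^$ matches only lines with no characters at all.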

April, 2017

2017-04-02

  • Merge one change to CCAFS flagships that I had forgotten to remove last month (“MANAGING CLIMATE RISK”): https://github.com/ilri/DSpace/pull/317
  • Quick proof-of-concept hack to add dc.rights to the input form, including some inline instructions/hints:

[Figure: dc.rights in the submission form]

  • Remove redundant/duplicate text in the DSpace submission license
  • Testing the CMYK patch on a collection with 650 items:
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt

March, 2017

2017-03-01

  • Run the 279 CIAT author corrections on CGSpace

2017-03-02

  • Skype with Michael and Peter, discussing moving the CGIAR Library to CGSpace
  • CGIAR people possibly open to moving content, redirecting library.cgiar.org to CGSpace and letting CGSpace resolve their handles
  • They might come in at the top level in one “CGIAR System” community, or with several communities
  • I need to spend a bit of time looking at the multiple handle support in DSpace and see if new content can be minted in both handles, or just one?
  • Need to send Peter and Michael some notes about this in a few days
  • Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
  • Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
  • Discovered that the ImageMagick filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
  • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
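To check many thumbnails at once, the identify output above can be parsed instead of read by eye. The sketch below assumes the colorspace is the token that follows the bit depth ("8-bit CMYK"), as in the line shown. The function name is hypothetical, and other ImageMagick versions may format the line differently.

```python
import re


def colorspace_from_identify(line: str):
    """Return the token after the 'N-bit' depth field, or None if absent."""
    match = re.search(r"\b\d+-bit\s+(\S+)", line)
    return match.group(1) if match else None
```

Feeding each line of identify output through this function and keeping the ones that return "CMYK" would give a list of thumbnails to regenerate.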

February, 2017

2017-02-07

  • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:
dspace=# select * from collection2item where item_id = '80278';
  id   | collection_id | item_id
-------+---------------+---------
 92551 |           313 |   80278
 92550 |           313 |   80278
 90774 |          1051 |   80278
(3 rows)
dspace=# delete from collection2item where id = 92551 and item_id = 80278;
DELETE 1
  • Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
  • Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name
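Since the double-mapping problem at the start of this entry has happened more than once, the duplicate rows could be found programmatically from the collection2item query result. This is a sketch with a hypothetical function name. It keeps the lowest id for each (collection_id, item_id) pair, which matches the manual fix above where 92550 was kept and 92551 deleted.

```python
from collections import defaultdict


def duplicate_mapping_ids(rows):
    """Given (id, collection_id, item_id) rows, return the ids of redundant mappings."""
    ids_by_pair = defaultdict(list)
    for row_id, collection_id, item_id in rows:
        ids_by_pair[(collection_id, item_id)].append(row_id)

    extras = []
    for ids in ids_by_pair.values():
        # Keep the lowest id of each pair; everything else is a duplicate
        extras.extend(sorted(ids)[1:])
    return sorted(extras)
```

For the three rows shown above, the function returns only 92551. The mapping to collection 1051 is left alone because it is a different collection.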

January, 2017

2017-01-02

  • I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
  • I tested on DSpace Test as well and it doesn’t work there either
  • I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years

December, 2016

2016-12-02

  • CGSpace was down for five hours in the morning while I was sleeping
  • While looking in the logs for errors, I see tons of warnings about Atmire MQM:
2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, ObjectType=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, ObjectType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
  • I see thousands of them in the logs for the last few months, so it’s not related to the DSpace 5.5 upgrade
  • I’ve raised a ticket with Atmire to ask
  • Another worrying error from dspace.log is: