Digital Preservation Tools and Islandora

Incorporating a suite of digital preservation tools into various Islandora workflows has been a long-term goal of mine and of a few other community members, and I'm really happy to see it becoming more and more of a priority.

A couple of years ago, I cut my teeth on contributing to Islandora by creating a FITS plugin for the Drupal 6 version of Islandora. Later this tool was expanded into a standalone module with the restructuring of the Drupal 7 code base of Islandora. The Drupal 7 version of Islandora, along with Tuque, has really opened the door for community contributions over the last year or so. Below is a list and description of Islandora modules with a particular focus on the preservation side of the repository platform.

Islandora Checksum

Islandora Checksum is a module that I developed with Adam Vessey (with special thanks to Jonathan Green and Jordan Dukart, who helped me grok Tuque) that allows repository managers to enable the creation of checksums for all datastreams on objects. If enabled, the repository administrator can choose from the default Fedora Commons checksum algorithms: MD5, SHA-1, SHA-256, SHA-384, SHA-512.
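To make the idea of a datastream checksum concrete, here is a minimal sketch in Python using only the standard library. It is not the module's code (the module is PHP and relies on Fedora Commons to compute checksums); the function name and the mapping from Fedora algorithm names to hashlib names are assumptions made for illustration.

```python
import hashlib

# Fedora Commons algorithm names mapped to Python hashlib names.
# This mapping is an assumption for illustration, not part of Islandora.
FEDORA_TO_HASHLIB = {
    "MD5": "md5",
    "SHA-1": "sha1",
    "SHA-256": "sha256",
    "SHA-384": "sha384",
    "SHA-512": "sha512",
}


def datastream_checksum(content: bytes, algorithm: str = "SHA-1") -> str:
    """Return the hex digest of a datastream's bytes (hypothetical helper)."""
    digest = hashlib.new(FEDORA_TO_HASHLIB[algorithm])
    digest.update(content)
    return digest.hexdigest()


if __name__ == "__main__":
    sample = b"example datastream content"
    for name in FEDORA_TO_HASHLIB:
        print(name, datastream_checksum(sample, name))
```

Whichever algorithm is chosen, the stored digest is what later fixity checks compare against.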

This module is licensed under a GPLv3 license, and is currently going through the Islandora Foundation's Licensed Software Acceptance Procedure. If successful, the module will be a part of the next Islandora release.

[Screenshot: Islandora Checksum admin]

Islandora Checksum Checker

Islandora Checksum Checker is a module by Mark Jordan that extends Islandora Checksum. It verifies "the checksums derived from Islandora object datastreams and adds a PREMIS 'fixity check' entry to the object's audit log for each datastream checked."
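A fixity check boils down to recomputing a digest and comparing it with the one recorded earlier. The sketch below shows that comparison in Python; the function name and the shape of the returned record are hypothetical, and the dictionary only stands in for the PREMIS audit log entry the module actually writes.

```python
import hashlib
from datetime import datetime, timezone


def fixity_check(content: bytes, recorded_checksum: str, algorithm: str = "sha1") -> dict:
    """Recompute a digest and compare it with the recorded one.

    The returned dict is an illustrative stand-in for a PREMIS
    'fixity check' event, not the module's real output format.
    """
    actual = hashlib.new(algorithm, content).hexdigest()
    outcome = "passed" if actual == recorded_checksum.lower() else "failed"
    return {
        "event_type": "fixity check",
        "algorithm": algorithm,
        "outcome": outcome,
        "checked_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    content = b"example datastream content"
    recorded = hashlib.sha1(content).hexdigest()
    print(fixity_check(content, recorded)["outcome"])
    print(fixity_check(content + b"bit rot", recorded)["outcome"])
```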

This module is also licensed under a GPLv3 license.

[Screenshot: Islandora BagIt admin]

Islandora PREMIS

Islandora PREMIS is a module by Mark Jordan, Donald Moses, Paul Pound, and myself. The module produces XML and HTML representations of PREMIS metadata for objects in an Islandora repository on the fly. The module currently documents all fixity checks performed on an object's datastreams, includes configurable 'agent' entries for an institution as well as for the Fedora Commons software, and maps the contents of each object's "rights" elements in the Dublin Core datastream to equivalent PREMIS "rightsExtension" elements. You can view an example here, along with the XML representation.

What we have implemented so far is just the basics, and we are always seeking feedback to make it better. If you're interested in the discussion or would like to provide feedback, feel free to follow along in the Islandora Google Group thread and the GitHub issue queue for the project.

This module is also licensed under a GPLv3 license.

[Screenshot: Islandora PREMIS admin]

Islandora BagIt

Islandora BagIt is also a module by Mark Jordan (actually a fork of his Drupal module) that utilizes Scholars' Lab's BagItPHP, allowing repository administrators to create bags of selected content. Currently there is a wide variety of configuration options for exporting content as Bags, as well as for creating Bags on ingest and/or when objects are modified. The way Mark has structured this module also allows developers to easily extend it by creating additional plugins for it, as well as providing Drush integration.
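For readers new to BagIt, a bag is just a directory with a bagit.txt declaration, a data/ payload directory, and a manifest of payload checksums. The Python sketch below writes a minimal bag by hand to show that layout; it does not use BagItPHP or the module's plugin system, and the function name, the file names, and the choice of BagIt version 0.97 with an MD5 manifest are assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict) -> None:
    """Write payload files (name -> bytes) into a minimal BagIt bag."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)

    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        # Each manifest line is "<checksum>  <path relative to the bag root>".
        manifest_lines.append(f"{hashlib.md5(content).hexdigest()}  data/{name}")

    # Bag declaration; version 0.97 is assumed here.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-md5.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "example_bag"
        # Hypothetical datastream file names.
        make_bag(bag, {"OBJ.warc": b"example payload", "MODS.xml": b"<mods/>"})
        for path in sorted(bag.rglob("*")):
            print(path.relative_to(bag).as_posix())
```

A real deployment would normally lean on an existing BagIt library, which also handles tag files such as bag-info.txt and bag validation.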

This module is also licensed under a GPLv3 license.

[Screenshot: Islandora BagIt admin]

Islandora Preservation Documentation

Documentation! One of the most important aspects of digital preservation.

This is not a full-blown module yet. What it currently is, is the beginnings of a generic set of documentation that can be used by repository administrators. Eventually we hope to use a combination of Default Content/UUID Features and Features to provide a default bundle of preservation documentation in an Islandora installation.

The content in this GitHub repo comes from the documentation and policies we are creating at York University Library, which are derived from the wonderful documentation created by Scholars Portal during their successful ISO 16363 audit.

Islandora Web ARChive SP updates

Community

Some pretty exciting stuff has been happening lately in the Islandora community. Earlier this year, Islandora began the transformation to a federally incorporated, community-driven soliciting non-profit, making it, in my opinion, a much more sustainable project. Thanks to my organization joining on as a member, I've been provided the opportunity to take part in the Roadmap Committee. Since I've joined, we have been hard at work creating transparent policies and processes for software contributions, licenses, and resources. Big thanks to the Hydra community for providing great examples to work from!

I signed my first contributor licence agreement, and initiated the process for making the Web ARChive Solution Pack a canonical Islandora project, subject to the same release management and documentation processes as other Islandora modules. After working through the process, I'm happy to see that the Web ARChive Solution Pack is now a canonical Islandora project.

Project updates

I've been slowly picking off items from my initial todo list for the project, and have solved two big issues: indexing the warcs in Solr for full-text/keyword searching, and creating an index of each warc.

Solr indexing was very problematic at first. I ended up having a lot of trouble getting an XSLT to take the warc datastream and give it to FedoraGSearch, and in turn to Solr. Frustrated, I began experimenting with newer versions of Solr, which thankfully have Apache Tika bundled, thereby allowing Solr to index basically whatever you throw at it.

I didn't think our users wanted to be searching the full markup of a warc file, just the actual text. So, using the Internet Archive's Warctools and @tef's wonderful assistance, I was able to incorporate warcfilter into the derivative creation.

$ warcfilter -H text warc_file > filtered_file

You can view an example of the full-text searching of warcs in action here.

In addition to the full-text searching, I wanted to provide users with a quick overview of what is in a given capture, and was able to do so by also incorporating warcindex into the derivative creation.

$ warcindex warc_file > csv_file

#WARC filename offset warc-type warc-subject-uri warc-record-id content-type content-length
/extra/tmp/yul-113521_OBJ.warc 0 warcinfo None <urn:uuid:588604aa-4ade-4e94-b19a-291c6afa905e> application/warc-fields 514
/extra/tmp/yul-113521_OBJ.warc 797 response dns:yfile.news.yorku.ca <urn:uuid:cbeefcb0-dcd1-466e-9c07-5cd45eb84abb> text/dns 61
/extra/tmp/yul-113521_OBJ.warc 1110 response http://yfile.news.yorku.ca/robots.txt <urn:uuid:6a5d84d1-b548-41e4-a504-c9cf9acfcde7> application/http; msgtype=response 902
/extra/tmp/yul-113521_OBJ.warc 2366 request http://yfile.news.yorku.ca/robots.txt <urn:uuid:363da425-594e-4365-94fc-64c4bb24c897> application/http; msgtype=request 257
/extra/tmp/yul-113521_OBJ.warc 2952 metadata http://yfile.news.yorku.ca/robots.txt <urn:uuid:62ed261e-549d-45e8-9868-0da50c1e92c4> application/warc-fields 149
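As a sketch of how that index could feed a quick overview, the Python below parses lines in the format shown above and counts records by WARC type. It is not part of the solution pack. It assumes fields are whitespace-separated, that the filename, URI, and record ID contain no spaces, and that only the content-type field may (as in "application/http; msgtype=response"); the class and function names are hypothetical.

```python
from collections import Counter
from typing import List, NamedTuple


class IndexEntry(NamedTuple):
    filename: str
    offset: int
    warc_type: str
    subject_uri: str
    record_id: str
    content_type: str
    content_length: int


def parse_index_line(line: str) -> IndexEntry:
    """Parse one warcindex line; content-type may span several tokens."""
    parts = line.split()
    filename, offset, warc_type, subject_uri, record_id = parts[:5]
    # Everything between the record ID and the final length is the content type.
    content_type = " ".join(parts[5:-1])
    return IndexEntry(filename, int(offset), warc_type, subject_uri,
                      record_id, content_type, int(parts[-1]))


def parse_index(text: str) -> List[IndexEntry]:
    """Parse a whole index, skipping blank lines and the '#WARC' header."""
    return [parse_index_line(line) for line in text.splitlines()
            if line.strip() and not line.startswith("#")]


SAMPLE_INDEX = """\
#WARC filename offset warc-type warc-subject-uri warc-record-id content-type content-length
/extra/tmp/yul-113521_OBJ.warc 0 warcinfo None <urn:uuid:588604aa-4ade-4e94-b19a-291c6afa905e> application/warc-fields 514
/extra/tmp/yul-113521_OBJ.warc 1110 response http://yfile.news.yorku.ca/robots.txt <urn:uuid:6a5d84d1-b548-41e4-a504-c9cf9acfcde7> application/http; msgtype=response 902
"""

if __name__ == "__main__":
    entries = parse_index(SAMPLE_INDEX)
    print(Counter(entry.warc_type for entry in entries))
```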

The updated Web ARChive SP datastreams now look like so:

[Screenshot: Warc SP datastreams]

One of my major goals with this project has been integration with a locally running instance of Wayback, and it looks like we are pretty close. This solution might not be the cleanest, but at least it is a start, and hopefully it will get better over time. I've updated the default MODS form for the module so that it better reflects this Library of Congress example. The key item here is the 'url' element whose 'displayLabel' attribute is 'Archived site'.

<location>
  <url displayLabel="Active site">http://yfile.news.yorku.ca/</url>
  <url displayLabel="Archived site">http://digital.library.yorku.ca/wayback/20131226/http://yfile.news.yorku.ca/</url>
</location> 

Wayback accounts for a date in its url structure 'http://digital.library.yorku.ca/wayback/20131226/http://yfile.news.yorku.ca/' and we can use that to link a given capture to its given dissemination point in Wayback. Using some Islandora Solr magic, I should be able to give that link to a user on a given capture page.
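The url structure described above is simple enough to build programmatically. Here is a minimal Python sketch that assembles such a link from a capture date and the original site URL; the function name is hypothetical, and the base URL is taken from the example in this post.

```python
from datetime import date

# Base URL of the Wayback instance, taken from the example above.
WAYBACK_BASE = "http://digital.library.yorku.ca/wayback"


def wayback_url(capture_date: date, original_url: str,
                base: str = WAYBACK_BASE) -> str:
    """Build a Wayback link of the form <base>/<YYYYMMDD>/<original url>."""
    return f"{base.rstrip('/')}/{capture_date.strftime('%Y%m%d')}/{original_url}"


if __name__ == "__main__":
    print(wayback_url(date(2013, 12, 26), "http://yfile.news.yorku.ca/"))
```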

We have automated this in our capture and preserve process: capturing warcs with Heritrix, creating MODS datastreams, and taking screenshots. This allows us to batch import our crawls quickly and efficiently.

Hopefully in the new year we'll have a much more elegant solution!
