Islandora Web ARChive Solution Pack

What is it?

The Islandora Web ARChive Solution Pack is yet another Islandora Solution Pack. This particular solution pack provides the necessary Fedora objects for persisting and disseminating web archive objects (warc files).

What does it do?

Currently, the SP allows a user to upload a warc with an associated MODS form. Once the object is deposited, the associated metadata is displayed along with a download link to the warc file.

You can check out an example here

Can I get the code?

Of course!


If I am doing something obviously wrong, please let me know!

Immediate term:

  1. Incorporate Wayback integration for the DIP. I think this is the best disseminator for the warc files. However, I haven't wrapped my head around how to programmatically provide access to the warc files in the Wayback. I know that I will have two warc objects, an AIP warc and a DIP warc (Big thank you to @sbmarks for being a sounding board today!). Fedora will manage the AIP, and Wayback will manage the DIP. Do I iFrame the Wayback URI for the object, or link out to it?

  2. Drupal 7 module. Drupal 7 versions of Islandora Solution Packs should be on their way shortly (next release, I believe). The caveat to using the Drupal 6 version of this module is the mimetype support. It looks like the Drupal 6 API (file_get_mimetype) doesn't pull the correct mimetype for warc files. I should get 'application/warc' but I am getting 'application/octet-stream', the fallback default for the API.
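One way around a missing extension mapping like this is to register the extension explicitly before guessing. The sketch below is in Python rather than the Drupal 6 PHP API, purely to illustrate the idea. The names `guess_mimetype` and `EXTRA_TYPES` are made up for the example; only the 'application/warc' and 'application/octet-stream' values come from the problem described above.

```python
import mimetypes

# Fallback used when nothing matches (same default the Drupal 6 API falls back to).
FALLBACK = "application/octet-stream"

# Hypothetical table of extra mappings to register before guessing.
EXTRA_TYPES = {".warc": "application/warc"}


def guess_mimetype(filename, extra_types=EXTRA_TYPES):
    """Guess a mimetype, consulting the extra mappings as well as the defaults."""
    db = mimetypes.MimeTypes()
    for extension, mime in extra_types.items():
        db.add_type(mime, extension)
    mime, _encoding = db.guess_type(filename)
    return mime or FALLBACK


print(guess_mimetype("crawl.warc"))
print(guess_mimetype("crawl.warc.gz"))
```

The same pattern (look in a local override table first, fall back to the generic default last) should carry over to whichever mimetype lookup the module ends up using.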

Long term:

  1. Incorporate Islandora microservices. What I would really like to do is allow users to automate this entire process. Basically, just say this is a site I would like to archive. This is the frequency at which I would like it archived, with necessary wget options. This is the default metadata profile for it. Then grab the site, ingest it into Fedora, drop the DIP warc into Wayback, and make it all available.

  2. If you have any idea on how to do the above, or how to do it in a better manner, please let me know!
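To make the automation idea in item 1 a little more concrete, here is a rough sketch of how a single scheduled capture could be described and turned into a wget invocation. It is only one possible shape, not how Islandora microservices work. `ArchiveJob`, `build_wget_command`, the default options and the "default-mods" profile name are all invented for illustration; `--warc-file`, `--mirror` and `--page-requisites` are existing wget options. Ingest into Fedora and the hand-off to Wayback are left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from urllib.parse import urlsplit


@dataclass
class ArchiveJob:
    """Hypothetical description of one site to archive on a schedule."""
    url: str
    frequency: str = "weekly"                 # how often the site should be captured
    wget_options: list = field(
        default_factory=lambda: ["--mirror", "--page-requisites"]
    )
    metadata_profile: str = "default-mods"    # assumed name of a default metadata profile


def build_wget_command(job, when):
    """Build (but do not run) the wget command for one capture of the job."""
    stamp = when.strftime("%Y%m%dT%H%M%SZ")
    host = urlsplit(job.url).netloc
    # wget appends the .warc(.gz) extension to this base name itself.
    warc_name = f"{host}-{stamp}"
    return ["wget", *job.wget_options, f"--warc-file={warc_name}", job.url]


job = ArchiveJob(url="http://example.org/")
command = build_wget_command(job, datetime(2013, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
print(" ".join(command))
```

A scheduler (cron, or a microservice listener) could then run the command for each job at its frequency and pass the resulting warc on to the ingest step.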

Islandora development visualization

Hit a bit of a wall yesterday getting checksums working when ingesting content into Islandora, so I made a Gource video of the Islandora commits in my fork of the git repo.

Music by RipCD (@drichert) and myself.

How'd I do it?

  1. I wanted to use the Gravatars, so I used this handy little perl script.
  2. Hopped into the Islandora git repo, and ran:

    gource --user-image-dir .git/avatar/ -s 3 --auto-skip-seconds 0.1 --file-idle-time 50 --max-files 500 --disable-bloom --stop-at-end --highlight-users --hide mouse --background-colour 111111 --font-size 20 --title "Islandora Development" --output-ppm-stream - --output-framerate 60 | avconv -y -r 60 -f image2pipe -vcodec ppm -i - -b 8192K ~/Videos/islandora-gource.mp4

  3. Then I used OpenShot to add the music and uploaded to YouTube.

FITS and Islandora integration

Digital preservationistas rejoice?
I managed to get FITS integration working in Islandora via a plugin. The plugin will automatically create a FITS xml datastream for an object upon ingest in the Islandora interface for a given solution pack. Right now I have it working with the Basic Image Solution Pack, Large Image Solution Pack, and PDF Solution Pack. You just have to make sure FITS is in your apache user's path (thanks @adr). [UPDATE: Works with the Audio Solution Pack now.]
What I had feared was going to be a pretty insane process turned out to be fairly simple and straightforward, which I'll outline here.

  1. I looked at existing plugins for something similar that I could crib from, and found it in the exiftool plugin, which is used in the audio and video solution packs.
  2. Using the existing plugin, I ran some grep queries to figure out how it is used in the overall codebase (Islandora, and solution packs).
  3. Created a feature branch.
  4. Hammered away until I had something working. (Thanks @mmccollow)
  5. Created an ingest rule for a solution pack. This tells the solution pack to call the plugin.
  6. Test, test, and test.
  7. Merged feature branch with 6.x branch, pushed, and opened up a pull request.

That is basically it. Let me know if you have any questions. Or, if you know of a way to make it even better, patches welcome ;)
[Update #2]
I've added a configuration option to the Islandora admin page to enable FITS datastream creation, and the ability to define a path to FITS. I put it in the advanced section of the admin page, which is not expanded by default. This will probably be problematic, and folks won't notice it. It might be a better idea to collect all the various command line tools Islandora uses, and give them all a section in the admin page to define their paths.
I also have FITS creation working with the Video Solution Pack now. Up next, Islandora Scholar... just have to get that up and running ;)
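For anyone curious what "call FITS on ingest, with a configurable path" boils down to, here is a small sketch in Python (the plugin itself is PHP inside Islandora; this is not its code). It assumes the FITS launcher is called `fits.sh` and takes `-i` for the input file and `-o` for the output file, so check that against your installed version. `find_fits` and `build_fits_command` are made-up helper names.

```python
import shutil


def find_fits(configured_path=None, default_name="fits.sh"):
    """Return a usable path to the FITS launcher, or None if it cannot be found.

    A path set in the admin form wins; otherwise the user's PATH is searched.
    """
    candidate = configured_path or default_name
    return shutil.which(candidate)


def build_fits_command(fits_path, input_file, output_file):
    """Build the command that writes FITS xml for input_file to output_file."""
    return [fits_path, "-i", input_file, "-o", output_file]


fits = find_fits()
if fits is None:
    print("FITS not found; skipping FITS datastream creation")
else:
    print(" ".join(build_fits_command(fits, "object.tif", "object_FITS.xml")))
```

Checking for the executable up front, and skipping the datastream if it is missing, is a cheap way to make a path problem visible instead of failing silently during ingest.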

Right! That hackfest report I should have given...

When I was at Islandora Camp trying to wrap my head around all things Islandora and Fedora, I was thinking ahead about a possible project in archives and research collections - migrating our collection/fonds descriptions and finding aids over to ICA AtoM.
ICA AtoM does some pretty cool stuff in terms of access to collection/fonds descriptions, integrates very nicely with Archivematica with accessioning born digital objects, and associating digital representations of item level objects with their respective collection/fonds. My greedy little brain wanted more! I wanted ICA AtoM to be able to pull in Fedora objects automatically and associate them with their respective collection/fonds. So, this is the hackfest proposal I submitted.
So what happened? What'd we end up doing?
The amazing Peter Van Garderen made absolutely sure Artefactual Systems staff was highly represented at hackfest, and I had two amazing people from Artefactual trying to parse my sleep-deprived-scatter-brained-state reasoning/logic behind what I wanted to do. David Juhasz and Jesús García Crespo, you rock!
We spent the first hour or so working through the Fedora REST API documentation looking for the best way to approach the "problem." After about an hour or so of working through a few conditional queries that would need to be strung together, Jesús jumped in and said, "Why aren't we using SWORD for this!?" Good question!
ICA AtoM can speak SWORD, and Fedora can speak SWORD so long as you can get the module working. As things at hackfest generally go for me, it failed. I could not for the life of me get the module to build. I spent some time going through build.xml, but ant and I just weren't going to be friends that day.
Strike one - don't code conditional Fedora REST API queries - not sharable and scalable
Strike two - I couldn't get the SWORD module to build!
Strike three - ???
While brainstorming for other solutions to our "problem", David was looking for examples in which I could share records from our repository. Duh! OAI-PMH! ICA AtoM can harvest OAI. If we can map OAI sets to ICA AtoM collections/fonds, and set records to individual items in a collection/fonds, we're set. Oh my, another use case of OAI-PMH! Yay!
Did we succeed? Not actually. Turns out the OAI-PMH harvesting code wasn't quite up to snuff at the time, and David, bless his heart, worked on trying to get it up to par before the end of the day. We were not able to pull together a working version, but the framework is there. It was there all along! (Ed, yes we could have and totally should have used atom :P )
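For the curious, the harvesting side of that idea is just an OAI-PMH ListRecords request per set. Here is a small Python sketch of building those requests and pairing each set with a fonds. The base URL, the set name and the fonds identifier are all placeholders, and nothing here is ICA AtoM's actual harvester code; only the `verb`, `metadataPrefix` and `set` parameters come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode


def list_records_url(base_url, set_spec, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request limited to one set."""
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    })
    return f"{base_url}?{query}"


# Hypothetical mapping: OAI set in the repository -> collection/fonds in ICA AtoM.
SET_TO_FONDS = {"wwiap": "fonds-aerial-photographs"}


def harvest_plan(base_url, set_to_fonds):
    """Return {fonds identifier: request URL} for every mapped set."""
    return {
        fonds: list_records_url(base_url, set_spec)
        for set_spec, fonds in set_to_fonds.items()
    }


for fonds, url in harvest_plan("http://example.org/oai", SET_TO_FONDS).items():
    print(fonds, url)
```

Each record returned for a set would then become an item under the matching collection/fonds.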

Fail, Fail, Fail, Success?

This past week I had the privilege of speaking on a panel at Access 2011 about failing, entitled "If you ain't failin', you ain't tryin'!" Amy Buckland moderated the panel, where we each took five minutes to tell a library tech fail story to encourage the audience to share their failure stories. I think it went over great, and was cathartic to say the least.
I shared my story, and afterward I had that familiar feeling of "but, wait! I have even more to say!" There are so many lessons to be learned! So, I'll share the story again here, along with *all* of the lessons learned that I would have shared given the requisite time.
The story
Three years ago I was on an Access panel presentation to speak about a project we had just hit a critical milestone on. Ironically, I spoke at Access 2011 on a fail panel about that same project.
When I started at MPOW I was thrown to the wolves. We had received a Library and Archives Canada grant to digitize a large number of items from our collections and create a thematic, cutting edge, web 2.0 website for it. Think tag clouds a.k.a the mullets of the internet (attribution c4lirc). Guess what? We had no infrastructure. No policies or procedures for digitization. No workflows. No metadata policies. No standards. 
Given the short turnaround time of the grant - 1 year - and the grant requirements, a vendor-based drop-in solution would not cut it. So we did it all live!
We took a month to do some rapid prototyping and pulled off a pretty cool proof of concept with Drupal. It worked, and continued to work. It was the basis of our infrastructure moving forward, and at the time it was perfect!
In the background of working on the PW20C project, we had the foresight to begin creating an overall "repository" to pull content from - Digital Collections @ Mac. A Drupal 5 based repository infrastructure loosely based on best practices and standards at the time. A standard Dublin Core field set created with CCK for records with our own enhanced metadata fields for collections, a hacked-together OAI-PMH module and some really cool timeline visualizations using the SIMILE project.
Flash forward a year, and we have secured another LAC grant for Historical Perspectives on Canadian Publishing; another thematic digital collection site. Time crunch was in effect, and we pulled together another great project with probably 10x more case studies. My heart goes out to our project coordinator on this one for pulling all of those case studies together.
Flash forward another year, and we have what I believed was a pretty solid framework for digital collections. We have a main digital collections site, and two heavily customized thematic sites. We are also about 8 months into a major upgrade of our digital collections infrastructure; migrating everything from Drupal 5 to Drupal 6.
We upped our functional requirements. We wanted to hang with the cool kids: linked data, seamless JPEG2000 support, KML integration, and MediaRSS support. Yeah, MediaRSS.
Here is where the fail comes to fruition. Mistakes were made. Mistakes were made.
There is what I suppose could be called a koan in the Drupal community: "do it the Drupal way." The problem is that the Drupal way changes depending on who you are talking to, what time of day it is, and what version you are on. Heavily customizing Drupal themes is definitely not the Drupal way to do things. Those two thematic sites became an albatross, and have since been put out to pasture on their own virtual machines. (Note: Drupal 5 and PHP 5.3 really don't like each other.)
Lessons learned
Do *not* create custom thematic digital collections sites. To further clarify this, do not create custom thematic digital collections sites if you have limited personnel resources and actually have other *stuff* to do.
Do *not* create policy, procedures, workflows, best practices on the fly. However, given the title of the panel, sometimes you really need to fail to get those best practices down. So, how about, Do *not* create policy, procedures, workflows, best practices on the fly for mission critical projects.
Your data is precious. Think one technology step ahead. For us, that means thinking past Drupal, then thinking past Fedora. We need to be able to move from platform to platform with ease. Thankfully we had the wherewithal to structure our data in such a way that it was pretty painless to extract.
Sometimes when you think you are *not* reinventing the wheel, you are in fact reinventing the wheel. Look to the community around you and get involved. Don't be afraid to ask stupid questions. Some of those questions that I thought were stupid and shouldn't be asked were in fact questions that were begging to be asked.
Also akin to reinventing the wheel is the hit-by-the-bus scenario. If you build a really awesome-homegrown-fantastic-full-of-awesomeness thing and then get hit by a bus, take another job, etc., your place of work is so entirely screwed. At the very least, DOCUMENT, DOCUMENT, DOCUMENT.
The library tech community is pretty rad. We're all doing a lot of similar work that doesn't need to be replicated, or if it does, does not need to be completely reinvented. Again, engage, and interact.
Moving forward, making this fail into a success...
Over the past few months we have taken the time to sit down and write out our digitization/digital collections philosophy with stakeholders. What I thought might be a difficult and painful exercise turned out to be quite wonderful and we came up with a document that I am proud of. 
We also took the time to do a study of what digital preservation means at MPOW, and what we are capable of doing right now, what we can be doing in the near future, and what we should look to achieve in the long-term. This segued nicely into a functional requirements document for our repository infrastructure.
Right now, we are working on creating what I believe to be a solid infrastructure; heavily documented! Something we lacked all along, and what some of my colleagues know me for - that guy who walks around stamping his feet about infrastructure all the time. INFRASTRUCTURE. INFRASTRUCTURE. INFRASTRUCTURE.
Hopefully in a year or two I can come back to Access and present on a panel full of folks turning failures into success!

Node Import fails me | Hack the database!

Over on the dev version of our digital collections site we are working on lots of new features. One of them is JPEG2000 support for our World War I trench maps, World War I aerial photos, and World War II Italian topographical maps. Lightbox2 simply does not cut it when researchers would like to examine these wonderful images. Being that we are pretty short staffed here and don't have the wherewithal to whip up a Drupal module to do this "properly", we have come up with what I think is a pretty creative solution to adding the jp2 images to the records in Drupal.

First off we set up Adore-Djatoka and started converting all of our tifs to jp2 with the utility that comes with Djatoka. We have all of the aerial photos and topographical maps converted. However, we are running into some heap space problems with the trench maps. I am assuming it is because of their sheer size - 400MB-2.5GB. Heap space errors aside, we set up the resolver and the IIPImage Viewer - sample.

Next we set up a CCK IFrame field for each content type and prepped our imports. This is where we ran into a bit of trouble - Node Import does not support CCK IFrame. Problem - time to get creative! We decided to import the records without the jp2 field and update them in the database, which in turn presented us with a couple more problems... err, how about we say quirks. The update was fairly straightforward, just the following two MySQL queries:

UPDATE content_field_jp2 c
  JOIN content_field_identifier d ON (c.nid = d.nid)
  JOIN node n ON (c.nid = n.nid)
SET field_jp2_url = CONCAT('', d.field_identifier_value)
WHERE n.type = 'wwiap';

UPDATE content_field_jp2 c
  JOIN content_field_identifier d ON (c.nid = d.nid)
SET field_jp2_attributes = 'a:4:{s:5:"width";s:4:"100%";s:6:"height";s:3:"768";s:11:"frameborder";s:1:"0";s:9:"scrolling";s:4:"auto";}';

Once the above queries were run, the nodes were not technically updated, because the hooks that actually update them had not been invoked. Or as I like to call them, Drupal's rules, regulations and procedures. Basically we had to batch re-save all of them. Hooray for Views Bulk Operations! However, we noticed a problem after we re-saved all of the nodes from a particular content type: they did not always reflect what we had updated them to. We discovered that the CCK cache was interfering. The solution was to wipe the 'content_cache' table, run our two update queries again, then batch re-save all of the records. The results are pretty nice: we have our embedded jp2 with our metadata. Now just to theme everything!
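A side note on that odd-looking attributes string: it is a PHP-serialized array, where every string is stored with its byte length in front of it (s:5:"width" and so on). If you need to build one by hand for a different set of iframe attributes, the lengths have to be right or PHP will refuse to unserialize it. Here is a small Python sketch that produces the same format for a flat map of strings. `php_serialize_strings` is a made-up helper that only covers this simple case, not the full serialize() format.

```python
def php_serialize_strings(mapping):
    """Serialize a flat {string: string} mapping the way PHP's serialize() would."""

    def serialize_string(value):
        text = str(value)
        # PHP counts bytes, not characters, so measure the UTF-8 encoding.
        return f's:{len(text.encode("utf-8"))}:"{text}";'

    body = "".join(
        serialize_string(key) + serialize_string(value)
        for key, value in mapping.items()
    )
    return f"a:{len(mapping)}:{{{body}}}"


attributes = {
    "width": "100%",
    "height": "768",
    "frameborder": "0",
    "scrolling": "auto",
}
print(php_serialize_strings(attributes))
```

Run against the four attributes above, this reproduces the string used in the second UPDATE query.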

Library day in the life - 5 - Day 5

Here we are at the final day. Friday. Work from home. WIN. VPN, shell, type, type type, forward ports, oh man, email.


Morning soundtrack - Four Tet - Remixes, Plaid - Parts in the Post

Finally finished all of the field merges. Now on to some batch metadata field editing for the World War, 1939-1945, Jewish Underground Resistance Collection. Metadata must be accurate, metadata must be correct! Sorry, no link for this collection for the public yet since it is being populated on the dev version of the site. Hopefully it will be public by some time in the fall. *fingers crossed*

Batch published another set of about 100 theses and dissertations on Digital Commons. Taught my student workers how to publish the theses and dissertations themselves, since graduate studies will be using the same collection in Digital Commons to begin publishing new theses and dissertations.

Afternoon - nada. Flex. Holiday weekend in Canada.

Library day in the life - 5 - Day 4

Wow, day 4 already. This week seems to be going by fast. Worked from home for a bit this morning and then took the train in again. Email this week has been miraculously low. Probably from all the moves. The 5th floor is eerily empty, absolutely bizarre up there now.


Morning soundtrack: BBC World Service podcast, Aphex Twin - Drukqs, Metallica - Master of Puppets. (BTW, did you know Metallica did not produce any records after ...And Justice for All? All the other ones are an urban legend.)

More merges. Down to one standard subject field. Nearly down to one relation field, and coverage field. Should be done with that by the end of the day. Thank you again Views Bulk Operations!

Finished pulling quotes together for the potential Crombie family grant.

Lunch with a colleague. Finally get to collect on my World Cup bet. ¡¡¡ESPAÑA!!!


Afternoon soundtrack: nothing. absolutely nothing.

Thursday afternoons bring me down to the William Ready Division of Archives and Research Collections where I work a reference desk shift once a week. I'm one of the few librarians who still work a help desk shift here. Not saying that it is a bad thing, just still getting used to the idea of librarians not on the help desk.

Same old brief afternoon routine: checked servers for updates, checked Drupal module updates, ran updates. New to the routine: committed all the commits from yesterday and updated the MUALA Bargaining Updates blog.

Read the press release put out by Sky River for their antitrust lawsuit against OCLC. This one may be interesting. Meanwhile, merges were constantly running in the background. Almost done.

Found an autographed first edition of Charles Bukowski's "It catches my heart in its hands : new and selected poems 1955-1963" down in research collections during my shift.

Library day in the life - 5 - Day 3

Day three started off with a Go Train that decided to arrive 20 minutes late. Three cheers for mass transit. The delay was a good thing: it gave me 20 extra minutes, and I was able to finish Calvino's "Six Memos for the Next Millennium."


Morning soundtrack: BBC World Service podcast, Search Engine - Trolling 101, Funkstörung - Appendix

In the trenches of morning emails. ILL requests for theses to be made open access, therefore said theses are made open access. Hooray open access! In the background queued up a few merges. Wait. Wait. Wait. Called Bepress support to work through some workflow issues with electronic submissions of theses and dissertations with graduate studies. Very, Very close to moving toward complete electronic submission of theses and dissertations!!! Lunch in my office, at my desk, as per usual.


Afternoon soundtrack: Funkstörung - Appetite For Discstruction, Plaid - Spokes, Quinoline Yellow - Cyriack Parasol, Telefon Tel Aviv - Map of What is Effortless

More hacking at the Dublin Core html headers. Error. No output. OMG, Output! Not the right output. *FACEPALM* $creators != $creator. Pay attention to your variable names, and sometimes you have to explicitly iterate through your arrays, kids! (Thanks Matt!) Sloppy code below. Checked server logs, ran server updates, and downloaded and installed Drupal module updates on the dev server to round out the afternoon. Chaos Tools had quite a bit of new svn adds. No commits since the svn repository disappeared for a bit with an office move :( ADVERTISEMENT: Check out my significant other's blog if you are interested in what library day in the life is for a public children's librarian.

global $base_url;

// The path of the node
if ($node->path) {
  $node_path = $node->path;
}

$dc[] = '';
$dc[] = '';

// DC TERM - Creator: collect every value, then emit one entry per value
$creators = array();
foreach (element_children($node->field_creator) as $key) {
  $creators[] = $node->field_creator[$key]['value'];
}
foreach ($creators as $creator) {
  $dc[] = '';
}

// DC TERM - Subject
$subjects = array();
foreach (element_children($node->field_subject) as $key) {
  $subjects[] = $node->field_subject[$key]['value'];
}
foreach ($subjects as $subject) {
  $dc[] = '';
}

// DC TERM - Description
$descriptions = array();
foreach (element_children($node->field_description) as $key) {
  $descriptions[] = $node->field_description[$key]['value'];
}
foreach ($descriptions as $description) {
  $dc[] = '';
}

// DC TERM - Publisher
$publishers = array();
foreach (element_children($node->field_publisher) as $key) {
  $publishers[] = $node->field_publisher[$key]['value'];
}
foreach ($publishers as $publisher) {
  $dc[] = '';
}

// DC TERM - Contributor
$contributors = array();
foreach (element_children($node->field_contributor) as $key) {
  $contributors[] = $node->field_contributor[$key]['value'];
}
foreach ($contributors as $contributor) {
  $dc[] = '';
}

// DC TERM - Date
$dates = array();
foreach (element_children($node->field_date) as $key) {
  $dates[] = $node->field_date[$key]['value'];
}
foreach ($dates as $date) {
  $dc[] = '';
}

// DC TERM - Date (second date field)
$date2s = array();
foreach (element_children($node->field_date2) as $key) {
  $date2s[] = $node->field_date2[$key]['value'];
}
foreach ($date2s as $date2) {
  $dc[] = '';
}

// DC TERM - Type
$types = array();
foreach (element_children($node->field_type) as $key) {
  $types[] = $node->field_type[$key]['value'];
}
foreach ($types as $type) {
  $dc[] = '';
}

// DC TERM - Format
$formats = array();
foreach (element_children($node->field_format) as $key) {
  $formats[] = $node->field_format[$key]['value'];
}
foreach ($formats as $format) {
  $dc[] = '';
}

// DC TERM - Identifier
$identifiers = array();
foreach (element_children($node->field_identifier) as $key) {
  $identifiers[] = $node->field_identifier[$key]['value'];
}
foreach ($identifiers as $identifier) {
  $dc[] = '';
}

// DC TERM - Language
$languages = array();
foreach (element_children($node->field_language) as $key) {
  $languages[] = $node->field_language[$key]['value'];
}
foreach ($languages as $language) {
  $dc[] = '';
}

// DC TERM - Relation
$relations = array();
foreach (element_children($node->field_relation) as $key) {
  $relations[] = $node->field_relation[$key]['value'];
}
foreach ($relations as $relation) {
  $dc[] = '';
}

// DC TERM - Source
$sources = array();
foreach (element_children($node->field_source) as $key) {
  $sources[] = $node->field_source[$key]['value'];
}
foreach ($sources as $source) {
  $dc[] = '';
}

// DC TERM - Coverage
$coverages = array();
foreach (element_children($node->field_coverage) as $key) {
  $coverages[] = $node->field_coverage[$key]['value'];
}
foreach ($coverages as $coverage) {
  $dc[] = '';
}

// DC TERM - Rights
$rights = array();
foreach (element_children($node->field_right) as $key) {
  $rights[] = $node->field_right[$key]['value'];
}
foreach ($rights as $right) {
  $dc[] = '';
}

// Created/changed timestamps, in long and date-only forms
$created = strftime("%Y-%m-%d %H:%M:%S +01:00", $node->created);
$changed = strftime("%Y-%m-%d %H:%M:%S +01:00", $node->changed);
$dc_created = strftime("%Y-%m-%d", $node->created);
$dc_changed = strftime("%Y-%m-%d", $node->changed);

if ($created) {
  $dc[] = '';
  $meta[] = '';
}
if ($changed) {
  $dc[] = '';
  $dc[] = '';
  $meta[] = '';
}

// Join everything into the field value, one entry per line
$node_field[0]['value'] = implode("\n", $meta) . "\n" . implode("\n", $dc) . "\n";
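The "explicitly iterate through your arrays" lesson is easier to see in a compact form. Here is a rough Python sketch of the same idea: one header line per value of each multi-valued field. The `<meta name="DC.term" content="...">` markup follows the usual convention for Dublin Core in html headers and is an assumption on my part about the exact tags, and `dc_meta_tags` is a made-up name.

```python
from html import escape


def dc_meta_tags(fields):
    """Return one meta tag per value for every Dublin Core term in fields.

    fields maps a term name (e.g. "creator") to a list of values.
    """
    tags = []
    for term, values in fields.items():
        # Iterate over each value explicitly: the list ($creators) is not
        # the same thing as a single item from it ($creator).
        for value in values:
            content = escape(value, quote=True)
            tags.append(f'<meta name="DC.{term}" content="{content}" />')
    return "\n".join(tags)


record = {
    "creator": ["Smith, A.", "Jones & Co."],
    "subject": ["Maps"],
}
print(dc_meta_tags(record))
```

Driving everything from one mapping of term to values also avoids repeating the same loop once per field.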

Library day in the life - 5 - Day 2

Here goes day 2! Tuesday is generally my first day of the week physically at work, which generally means that I have lots of meetings. Thankfully I did not have an immediate morning meeting.


Morning soundtrack: Software Freedom Law Center - Episode 0x2C: Eben on Software Liability, Adult. - Resuscitation

Spent the entire morning working on more merges and trying to hunt down an expected deliverable from a vendor. Getting close to being complete with all of the merges. Once it is complete, Apache SOLR will be all the more happy, as will I. We'll have some very nice facets set up for our new SOLR powered search on digital collections.

Our dynamic duo of programmers both came through on interesting developments this morning as well. Debbie finished our retroactive date conversions. Lots of regular expressions!!! We were able to convert 10600 records to machine readable date ranges. Absolutely fantastic for SOLR faceting, sorting, and all the other fun stuff you can do with actual machine readable date data!!! Matt continued hacking away at the Dublin Core XML module. He even managed to create a singularity this morning; something that not even user 1 can access. ACCESS DENIED!
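To give a flavour of what a regular-expression date conversion like that involves, here is a tiny Python sketch. It is not Debbie's code, and the input patterns it handles (a bare year, a "ca." year, and a year-year range) are guesses at the kind of values found in the records, not a description of our actual data. `to_date_range` is a made-up name.

```python
import re

# A single year, optionally prefixed with "c." or "ca."
YEAR = re.compile(r"^\s*(?:ca?\.\s*)?(\d{4})\s*$")
# Two years separated by a hyphen, e.g. "1914-1918"
RANGE = re.compile(r"^\s*(\d{4})\s*-\s*(\d{4})\s*$")


def to_date_range(text):
    """Convert a free-text date to (start, end) ISO dates, or None if unrecognised."""
    match = RANGE.match(text)
    if match:
        start, end = match.group(1), match.group(2)
    else:
        match = YEAR.match(text)
        if not match:
            return None
        start = end = match.group(1)
    if int(start) > int(end):
        return None  # reversed ranges need a human to look at them
    return (f"{start}-01-01", f"{end}-12-31")


for raw in ["1914-1918", "ca. 1920", "n.d."]:
    print(raw, "->", to_date_range(raw))
```

Anything the patterns do not recognise comes back as None, so the leftovers can be reviewed by hand rather than converted wrongly.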

Lunch time brought about a union meeting. Hooray for MUALA!!!


Afternoon soundtrack: Venetian Snares - Rossz Csillag Allat Szuletett, GZA/Genius - Liquid Swords

ARL-ACRL webinar entitled, "Transitioning from Subscriptions to Open Access: Article Processing Fees and Combined Licensing/Author’s Rights Approaches." Pretty good, but at the same time preaching to the choir.

Pulled some quotes together for equipment for a potential grant. Fingers crossed!!!

Published another 100 or so open access theses and dissertations, and ran some batch conversions on some images. I <3 mogrify.

mogrify -format jpg *.tif

Hopefully tomorrow I can get back to some wretched coding and finish up a couple of outstanding items. Aphoristical.