drupal

Islandora Web ARChive Solution Pack

What is it?

The Islandora Web ARChive Solution Pack is yet another Islandora Solution Pack. This particular solution pack provides the necessary Fedora objects for persisting and disseminating web archive objects, i.e. warc files.

What does it do?

Currently, the SP allows a user to upload a warc with an associated MODS form. Once the object is deposited, the associated metadata is displayed along with a download link to the warc file.

You can check out an example here

Can I get the code?

Of course!

Todo?

If I am doing something obviously wrong, please let me know!

Immediate term:

  1. Incorporate Wayback integration for the DIP. I think this is the best disseminator for the warc files. However, I haven't wrapped my head around how to programmatically provide access to the warc files in the Wayback. I know that I will have two warc objects, an AIP warc and a DIP warc (Big thank you to @sbmarks for being a sounding board today!). Fedora will manage the AIP, and Wayback will manage the DIP. Do I iFrame the Wayback URI for the object, or link out to it?

  2. Drupal 7 module. Drupal 7 versions of Islandora Solution Packs should be on their way shortly -- next release, I believe. The caveat to using the Drupal 6 version of this module is the mimetype support. It looks like the Drupal 6 api (file_get_mimetype) doesn't pull the correct mimetype for warc files. I should get 'application/warc' but I am getting 'application/octet-stream' -- the fallback default for the api.
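
To make the mimetype problem concrete, here is a minimal sketch in Python (not Drupal's actual code) of an extension-based lookup with a fallback default. The table contents and the function name `guess_mimetype` are assumptions for illustration only. The point is that a table without a warc entry falls through to 'application/octet-stream'.

```python
import os

# Illustrative extension-to-mimetype table; a real one would be far longer.
MIMETYPES = {
    ".pdf": "application/pdf",
    ".jpg": "image/jpeg",
    ".warc": "application/warc",  # the entry a warc lookup depends on
}

# Returned whenever the extension is not in the table.
FALLBACK = "application/octet-stream"


def guess_mimetype(filename, table=MIMETYPES):
    """Return the mimetype for filename, or the fallback if unknown."""
    extension = os.path.splitext(filename)[1].lower()
    return table.get(extension, FALLBACK)
```

With the warc entry present, `guess_mimetype("site.warc")` gives 'application/warc'. Pass a table without that entry and the same call gives the fallback.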

Long term:

  1. Incorporate Islandora microservices. What I would really like to do is allow users to automate this entire process. Basically, just say this is a site I would like to archive. This is the frequency at which I would like it archived, with necessary wget options. This is the default metadata profile for it. Then grab the site, ingest it into Fedora, drop the DIP warc into Wayback, and make it all available.

  2. If you have any idea on how to do the above, or how to do it in a better manner, please let me know!
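
One possible shape for the "grab the site" step, sketched in Python under stated assumptions: the `ArchiveJob` fields and the `build_wget_command` helper are hypothetical names, and scheduling, Fedora ingest, and the Wayback drop are left out entirely. The only real tool detail relied on is wget's `--warc-file` option.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveJob:
    """One site to archive, as a user might describe it (hypothetical)."""
    url: str
    frequency: str           # e.g. "weekly"; the scheduler itself is not shown
    metadata_profile: str    # hypothetical name of a default MODS profile
    wget_options: list = field(default_factory=list)


def build_wget_command(job, warc_basename):
    """Build the argument list for one crawl of the job's site.

    wget adds the warc file extension to warc_basename itself.
    """
    return ["wget", "--warc-file=" + warc_basename, *job.wget_options, job.url]
```

A job for http://example.org/ with `wget_options=["--mirror", "--page-requisites"]` would produce a command list that could be handed to `subprocess.run` on whatever schedule the frequency implies.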

»

Islandora development visualization


Hit a bit of a wall yesterday getting checksums working when ingesting content into Islandora, so I made a Gource video of the Islandora commits in my fork of the git repo.

Music by RipCD (@drichert) and myself.

How'd I do it?

  1. I wanted to use the Gravatars, so I used this handy little perl script.
  2. Hopped into the Islandora git repo, and ran:

    gource --user-image-dir .git/avatar/ -s 3 --auto-skip-seconds 0.1 \
      --file-idle-time 50 --max-files 500 --disable-bloom --stop-at-end \
      --highlight-users --hide mouse --background-colour 111111 \
      --font-size 20 --title "Islandora Development" \
      --output-ppm-stream - --output-framerate 60 \
      | avconv -y -r 60 -f image2pipe -vcodec ppm -i - -b 8192K \
        ~/Videos/islandora-gource.mp4

  3. Then I used OpenShot to add the music and uploaded to YouTube.
»

FITS and Islandora integration

Digital preservationistas rejoice?
 
I managed to get FITS integration working in Islandora via a plugin. The plugin will automatically create a FITS xml datastream for an object upon ingest in the Islandora interface for a given solution pack. Right now I have it working with the Basic Image Solution Pack, Large Image Solution Pack, and PDF Solution Pack. You just have to make sure fits.sh is in your apache user's path (thanks @adr). [UPDATE: Works with the Audio Solution Pack now.]
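
For a rough idea of what the plugin has to do at ingest time, here is a minimal Python sketch (the plugin itself is not written this way). The function names are hypothetical. It assumes the FITS command-line form `fits.sh -i <input> -o <output>` and looks for fits.sh on the search path, which for Islandora would be the apache user's path.

```python
import shutil


def find_fits(search_path=None):
    """Locate fits.sh on the given search path (default: the process PATH)."""
    return shutil.which("fits.sh", path=search_path)


def build_fits_command(fits_executable, input_path, output_path):
    """Build the argv that writes FITS xml for input_path to output_path."""
    return [fits_executable, "-i", input_path, "-o", output_path]
```

If `find_fits()` returns None, fits.sh is not on the path and the datastream step should be skipped rather than failing the whole ingest. Otherwise the command list can be run with `subprocess.run` and the output file stored as the FITS datastream.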
 
What I had feared was going to be a pretty insane process turned out to be fairly simple and straightforward, which I'll outline here.

  1. I looked at existing plugins for something similar that I could crib from, and found just that in the exiftool plugin, which is used in the audio and video solution packs.
  2. Using the existing plugin, I ran some grep queries to figure out how it is used in the overall codebase (Islandora, and solution packs).
  3. Created a feature branch.
  4. Hammered away until I had something working. (Thanks @mmccollow)
  5. Created an ingest rule for a solution pack. This tells the solution pack to call the plugin.
  6. Test, test, and test.
  7. Merged feature branch with 6.x branch, pushed, and opened up a pull request.

That is basically it. Let me know if you have any questions. Or, if you know of a way to make it even better, patches welcome ;)
 
[Update #2]
 
I've added a configuration option to the Islandora admin page to enable FITS datastream creation, and the ability to define a path to fits.sh. I put it in the advanced section of the admin page which is not expanded by default. This will probably be problematic, and folks won't notice it. It might be a better idea to collect all the various command line tools Islandora uses, and give them all a section in the admin page to define their paths.
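
The "one section for all the command line tools" idea could look something like this minimal Python sketch. Everything here is hypothetical: the settings layout, the paths, and the `resolve_tool` helper are illustration only, not how the Islandora admin page stores anything.

```python
# Hypothetical settings, roughly as an admin page might store them.
TOOL_SETTINGS = {
    "fits": {"enabled": True, "path": "/opt/fits/fits.sh"},
    "exiftool": {"enabled": True, "path": ""},  # empty path: rely on PATH
}

# Command name to fall back on when no explicit path is configured.
DEFAULT_COMMANDS = {"fits": "fits.sh", "exiftool": "exiftool"}


def resolve_tool(name, settings=TOOL_SETTINGS):
    """Return the command to run for a tool, or None if it is not enabled."""
    entry = settings.get(name, {})
    if not entry.get("enabled", False):
        return None
    return entry.get("path") or DEFAULT_COMMANDS.get(name, name)
```

Keeping the enable switch and the path together per tool means a plugin only has to ask one question before running anything.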
 
I also have FITS creation working with the Video Solution Pack now. Up next, Islandora Scholar... just have to get that up and running ;)

»

Right! That hackfest report I should have given...

When I was at Islandora Camp trying to wrap my head around all things Islandora and Fedora, I was thinking ahead about a possible project in archives and research collections - migrating our collection/fonds descriptions and finding aids over to ICA AtoM.
 
ICA AtoM does some pretty cool stuff in terms of access to collection/fonds descriptions, integrates very nicely with Archivematica with accessioning born digital objects, and associating digital representations of item level objects with their respective collection/fonds. My greedy little brain wanted more! I wanted ICA AtoM to be able to pull in Fedora objects automatically and associate them with their respective collection/fonds. So, this is the hackfest proposal I submitted.
 
So what happened? What'd we end up doing?
 
The amazing Peter Van Garderen made absolutely sure Artefactual Systems staff was highly represented at hackfest, and I had two amazing people from Artefactual trying to parse my sleep-deprived-scatter-brained-state reasoning/logic behind what I wanted to do. David Juhasz and Jesús García Crespo, you rock!
 
We spent the first hour or so working through the Fedora REST API documentation looking for the best way to approach the "problem." After about an hour or so of working through a few conditional queries that would need to be strung together, Jesús jumped in and said, "Why aren't we using SWORD for this!?" Good question!
 
ICA AtoM can speak SWORD, and Fedora can speak SWORD so long as you can get the module working. As things at hackfest generally go for me, it failed. I could not for the life of me get the module to build. I spent some time going through build.xml, but ant and I just weren't going to be friends that day.
 
Strike one - don't code conditional Fedora REST API queries - not sharable or scalable
Strike two - I couldn't get the SWORD module to build!
Strike three - ???
 
While brainstorming for other solutions to our "problem", David was looking for examples in which I could share records from our repository. Duh! OAI-PMH! ICA AtoM can harvest OAI. If we can map OAI sets to ICA AtoM collections/fonds, and set records to individual items in a collection/fonds, we're set. Oh my, another use case of OAI-PMH! Yay!
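
The set-to-fonds mapping can be sketched in a few lines of Python. This is a sketch under stated assumptions, not what we built at hackfest: the base URL, the set names, the fonds identifiers, and both function names are made up. Only the OAI-PMH request parameters (`verb`, `metadataPrefix`, `set`) come from the protocol.

```python
from urllib.parse import urlencode

# Hypothetical mapping of repository OAI sets to ICA AtoM fonds identifiers.
SET_TO_FONDS = {
    "trench_maps": "fonds-001",
    "aerial_photos": "fonds-002",
}


def list_records_url(base_url, set_spec, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request limited to one set."""
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    })
    return base_url + "?" + query


def harvest_plan(base_url, mapping=SET_TO_FONDS):
    """Pair each target fonds with the request that lists its records."""
    return [(fonds, list_records_url(base_url, set_spec))
            for set_spec, fonds in mapping.items()]
```

Each record returned for a set would then become an item under the paired collection/fonds on the harvesting side.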
 
Did we succeed? Not actually. Turns out the OAI-PMH harvesting code wasn't quite up to snuff at the time, and David, bless his heart, worked on trying to get it up to par before the end of the day. We were not able to pull together a working version, but the framework is there. It was there all along! (Ed, yes we could have and totally should have used atom :P )

»

Fail, Fail, Fail, Success?

This past week I had the privilege of speaking on a panel at Access 2011 about failing entitled, "If you ain't failin', you ain't tryin'!" Amy Buckland moderated the panel where we each took five minutes to tell a library tech fail story to encourage the audience to share their failure stories. I think it went over great, and was cathartic to say the least.
 
I shared my story, and afterward I had that familiar feeling of "but, wait! I have even more to say!" There are so many lessons to be learned! So, I'll share the story again here, along with *all* of the lessons learned that, given the requisite time, I would have shared.
 
The story
 
Three years ago I was on an Access panel presentation to speak about a project we had just hit a critical milestone on. Ironically, I spoke at Access 2011 on a fail panel about that same project.
 
When I started at MPOW I was thrown to the wolves. We had received a Library and Archives Canada grant to digitize a large number of items from our collections and create a thematic, cutting edge, web 2.0 website for it. Think tag clouds, a.k.a. the mullets of the internet (attribution c4lirc). Guess what? We had no infrastructure. No policies or procedures for digitization. No workflows. No metadata policies. No standards. 
 
Given the short turn around time of the grant - 1 year - and the grant requirements, a vendor based drop-in solution would not cut it. So we did it all live! 
 
We took a month to do some rapid prototyping and pulled off a pretty cool proof of concept with Drupal. It worked, and continued to work. It was the basis of our infrastructure moving forward, and at the time it was perfect!
 
In the background of working on the PW20C project, we had the foresight to begin creating an overall "repository" to pull content from - Digital Collections @ Mac. A Drupal 5 based repository infrastructure loosely based on best practices and standards at the time. A standard Dublin Core field set created with CCK for records with our own enhanced metadata fields for collections, a hacked-together OAI-PMH module and some really cool timeline visualizations using the SIMILE project.
 
Flash forward a year, and we have secured another LAC grant for Historical Perspectives on Canadian Publishing; another thematic based digital collection site. Time crunch was in effect, and we pulled together another great project with probably 10x more case studies. My heart goes out to our project coordinator on this one for pulling all of those case studies together. 
 
Flash forward another year, and we have what I believed was a pretty solid framework for digital collections. We have a main digital collections site, and two heavily customized thematic sites. We are also about 8 months into a major upgrade of our digital collections infrastructure; migrating everything from Drupal 5 to Drupal 6. 
 
We upped our functional requirements. We wanted to hang with the cool kids: linked data, seamless JPEG2000 support, KML integration, and MediaRSS support. Yeah, MediaRSS.
 
Here is where the fail comes to fruition. Mistakes were made. Mistakes were made.
 
There is what I suppose could be called a koan in the Drupal community: "do it the Drupal way." Problem is, the Drupal way changes depending on who you are talking to, what time of day it is, and what version you are on. Heavily customizing Drupal themes is definitely not the Drupal way to do things. Those two thematic sites became an albatross, and have since been put out to pasture on their own virtual machines. (Note: Drupal 5 and PHP 5.3 really don't like each other.)
 
Lessons learned
 
Do *not* create custom thematic digital collections sites. To further clarify this, do not create custom thematic digital collections sites if you have limited personnel resources and actually have other *stuff* to do.
 
Do *not* create policy, procedures, workflows, best practices on the fly. However, given the title of the panel, sometimes you really need to fail to get those best practices down. So, how about: do *not* create policy, procedures, workflows, best practices on the fly for mission critical projects.
 
Your data is precious. Think one technology step ahead. For us, that meant thinking past Drupal then, and it means thinking past Fedora now. We need to be able to move from platform to platform with ease. Thankfully we had the wherewithal to structure our data in such a way that it was pretty painless to extract.
 
Sometimes when you think you are *not* reinventing the wheel, you are in fact reinventing the wheel. Look to the community around you and get involved. Don't be afraid to ask stupid questions. Some of those questions that I thought were stupid and shouldn't be asked were in fact questions that were begging to be asked.
 
Also akin to reinventing the wheel: the hit-by-the-bus scenario. You build your really awesome-homegrown-fantastic-full-of-awesomeness thing, then you get hit by a bus, take another job, etc., and your place of work is so entirely screwed. At the very least, DOCUMENT, DOCUMENT, DOCUMENT. 
 
The library tech community is pretty rad. We're all doing a lot of similar work that doesn't need to be replicated, or if it does, does not need to be completely reinvented. Again, engage, and interact.
 
Moving forward, making this fail into a success...
 
Over the past few months we have taken the time to sit down and write out our digitization/digital collections philosophy with stakeholders. What I thought might be a difficult and painful exercise turned out to be quite wonderful and we came up with a document that I am proud of. 
 
We also took the time to do a study of what digital preservation means at MPOW, and what we are capable of doing right now, what we can be doing in the near future, and what we should look to achieve in the long-term. This segued nicely into a functional requirements document for our repository infrastructure.
 
Right now, we are working on creating what I believe to be a solid infrastructure; heavily documented! Something we lacked all along, and what some of my colleagues know me for - that guy who walks around stamping his feet about infrastructure all the time. INFRASTRUCTURE. INFRASTRUCTURE. INFRASTRUCTURE.
 
Hopefully in a year or two I can come back to Access and present on a panel full of folks turning failures into success!

»

Node Import fails me | Hack the database!

Over on the dev version of our digital collections site we are working on lots of new features. One of them is JPEG2000 support for our World War I trench maps, World War I aerial photos, and World War II Italian topographical maps. Lightbox2 simply does not cut it when researchers would like to examine these wonderful images. Being that we are pretty short staffed here and don't have the wherewithal to whip up a Drupal module to do this "properly", we have come up with what I think is a pretty creative solution to adding the jp2 images to the records in Drupal.

»

Library day in the life - 5 - Day 5

Here we are at the final day. Friday. Work from home. WIN. VPN, shell, type, type type, forward ports, oh man, email.

Morning

Morning soundtrack - Four Tet - Remixes, Plaid - Parts in the Post

»

Library day in the life - 5 - Day 4

Wow, day 4 already. This week seems to be going by fast. Worked from home for a bit this morning and then took the train in again. Email this week has been miraculously low. Probably from all the moves. The 5th floor is eerily empty, absolutely bizarre up there now.

Morning

»

Library day in the life - 5 - Day 3

Day three started off with a Go Train that decided to arrive 20 minutes late. Three cheers for mass transit. The delay was a good thing: it gave me 20 extra minutes and I was able to finish Calvino's "Six Memos for the Next Millennium."

Morning

»

Library day in the life - 5 - Day 2

Here goes day 2! Tuesday is generally my first day of the week physically at work, which generally means that I have lots of meetings. Thankfully I did not have an immediate morning meeting.

Morning:

Morning soundtrack: Software Freedom Law Center - Episode 0x2C: Eben on Software Liability, Adult. - Resuscitation

»

Creative Commons license icon
