open source

IIPC Curator Tools Fair: Islandora Web ARChive solution pack

The following is the text for a video that I was asked to record for the 2014 International Internet Preservation Consortium General Assembly Curator Tools Fair, on the Islandora Web ARChive solution pack.


My name is Nick Ruest. I am a librarian at York University, in Toronto, Ontario. I’m going to give a quick presentation on the Islandora Web ARChive solution pack. I only have a few minutes, so I’ll quickly cover what the module does, what areas of the web archiving life cycle it covers, and provide a quick demonstration.

So, what is the Islandora Web ARChive solution pack?

I’ll step back and quickly answer what Islandora is first. "Islandora is an open source digital asset management system based on Fedora Commons, Drupal and a host of additional applications." A solution pack, in Islandora parlance, is a Drupal module that integrates with the main Islandora module and the Tuque library, thereby allowing users to deposit, create derivatives, and interact with a given type of object. We have solution packs for Audio, Video, Large Images, Images, PDFs, paged content, and now web archives.

The Web ARChive solution pack allows users to ingest and retrieve web archives through the Islandora interface. If we think about it in terms of OAIS, we give the repository a SIP, which in the case of this solution pack can be a single warc file and some descriptive metadata, and if available, a screenshot and/or a PDF. From there, the solution pack will create an AIP and DIP.

The AIP will contain: the original warc, MODS descriptive metadata, FITS output (file characterization/technical metadata), web dissemination versions of the screenshots (jpg & thumbnail), PDF, and derivatives of the warc via warctools. Those derivatives are a csv and a filtered warc. The csv -- WARC_CSV -- is a listing of all the files in a given warc. This allows a user/researcher to have a quick glance at the contents of the warc. The filtered warc -- WARC_FILTERED -- is a warc file stripped down as much as possible to the text, and it is used only for search indexing/keyword searching.

The DIP is a JPG/TN of the captured website (if supplied) and download links to the WARC, PDF, WARC_CSV, screenshot, and descriptive metadata. Here a link to the ‘archived site’ can be supplied in the default MODS form. The suggested usage here is to provide a link to the object in a local instance of Wayback, if it exists.
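To give a feel for what a WARC_CSV-style listing contains, here is a minimal sketch in Python. It is not the warctools code and not what the solution pack actually runs; it is a hand-rolled reader that assumes an uncompressed WARC file, and the function names (`iter_warc_headers`, `write_inventory`) and the choice of columns are my own for illustration.

```python
import csv

# Columns for the illustrative inventory; the real WARC_CSV may differ.
FIELDS = ["WARC-Type", "WARC-Target-URI", "WARC-Date", "Content-Length"]


def iter_warc_headers(stream):
    """Yield a dict of WARC headers for each record in an uncompressed WARC.

    `stream` is a binary file-like object. Record bodies are skipped by
    reading exactly Content-Length bytes, so body text is never parsed.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Skip over the record body.
        stream.read(int(headers.get("Content-Length", 0)))
        yield headers


def write_inventory(warc_stream, csv_stream):
    """Write one CSV row per WARC record and return the number of rows."""
    writer = csv.DictWriter(
        csv_stream, fieldnames=FIELDS, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    count = 0
    for headers in iter_warc_headers(warc_stream):
        writer.writerow(headers)
        count += 1
    return count
```

Calling `write_inventory(open("example.warc", "rb"), open("example.csv", "w", newline=""))` (both file names hypothetical) would produce the listing. A gzipped warc would need to be opened with `gzip.open` instead.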

I’ve also been asked to address the following questions:

1) What aspects of the web archiving life cycle model does the tool cover? What aspects of the model would you like to/do intend to build into the tool? What functionality does the tool provide that isn’t reflected in the model?

I’ll address what it does not cover first: appraisal and selection, scoping, and data capture. We allow users to use their own appraisal and selection, scoping, and data capture processes. So, for example, locally, we use Heritrix for cron-based crawls, and our own bash script for one-off crawls.

What does it cover? All of the rest of the steps!

  • Storage and organization: via Fedora Commons & Islandora
  • QA and analysis: via display/DIP -- visualization exposes it!
  • Metadata/description: every web archive object has a MODS descriptive datastream
  • Access/use/reuse: each web archive object has a URI, along with its derivatives. By default warcs are available to download.
  • Preservation: preservation depends on the policies of the repository/institution, but in our case we have a preservation action plan for web archives, and a suite of Islandora preservation modules running (checksum, checksum checker, FITS, and PREMIS) that cover the basics.
  • Risk management: see above.
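
For anybody unfamiliar with what a checksum checker does, the idea can be sketched in a few lines of Python. This is only an illustration of a fixity check in general, not the code of the Islandora checksum modules; the function names and the default of SHA-256 are my own assumptions.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, recorded_checksum, algorithm="sha256"):
    """Compare a file's current checksum with the one recorded at ingest."""
    return file_checksum(path, algorithm) == recorded_checksum.strip().lower()
```

A checksum is recorded when the object is ingested, and a periodic job recomputes it and flags any mismatch as possible corruption.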

2) What resources are committed to the tool’s ongoing development? What are major features in the roadmap? Is the code open source?

I developed the original version, and transferred it to the Islandora Foundation, allowing for community stewardship of the project.

Currently, there is no official roadmap for the project. If anybody has ideas, comments, suggestions, or roadmapish ideas, feel free to send a message to the Islandora mailing list.

...and yes, the code is totally open source. It is available under a GPLv3 license, and the canonical version of the code can be found under the Islandora organization on GitHub.

3) What is the user base for the tool? How environment-specific is the tool as opposed to readily reusable by other organizations?

Not entirely sure. It was recently released as part of the 7.x-1.3 version of Islandora.

Given that it is an Islandora module, it is tied to Islandora. So, you’ll have to have at least a 7.x-1.3 instance of Islandora running, along with the solution pack’s dependencies, to run it.

4) What are the tool’s unique features? What are its shortcomings?

I think some unique features are that it is a part of a digital asset management system (it is the first of its kind that I am aware of), and the utilization of warctools for keyword searching and file inventories.

Shortcomings? That it is a part of a digital asset management system.

Very quick demo time!

Podcasts for the nerd librarian

Every so often people ask me about what podcasts I listen to, and every so often I start listening to something new and get terribly excited about it and have to tell my colleagues all about it. Also, this past semester I taught my first course. It was an LIS course entitled, "Introduction to Technology." Instead of the normal plethora of weekly readings, I toned the readings down a little bit, and added a few podcasts as "suggested listening" as a learning experiment. It went over well, so I figured I'd post something about my favourites, why I like them, and why I think they are relevant. I'm sticking to podcasts that tie into my profession, so no FilmJunk, Quirks and Quarks, or Linux Outlaws this time. If you have any recommendations, please share!

 
Digital Campus - "A biweekly discussion of how digital media and technology are affecting learning, teaching, and scholarship at colleges, universities, libraries, and museums." I stumbled across this podcast late last spring, and have been an eager listener every other week. The podcast is in a way an extension of the work being done at the Center for History and New Media at George Mason University and moreover, it is an excellent source of information regarding projects and trends in the digital humanities. At MPOW we are currently in the infant stages of creating a digital scholarship centre that is going to be integrated with our library, so the podcast is a great way to stay fresh with what is going on in digital humanities.
 
The Changelog - is a weekly podcast that covers new open source software projects. I constantly find myself a little lost with the many different open source projects and trends. This podcast allows me to listen to an episode on a specific project so I can learn more about it. Sometimes it is exactly what I need, other times it is over my head, but hey, that happens.
 
This Week in Law - is a weekly podcast on the TWiT network hosted by Denise Howell and a panel, which covers new issues in technology law. Before I entered the library profession, I had intended on becoming a lawyer. Well, we all see how that went. But, I never lost interest in law, or issues that have always interested me. This podcast, along with the next, allows me to keep up and stay fresh.
 
Free as in Freedom (previously, The Software Freedom Law Center podcast) - is a biweekly podcast featuring Bradley Kuhn and Karen Sandler covering legal issues and topics in the open source and free software world. This podcast again appeals to the legal-issues side of me, but more so in so far as it deals mainly with legal issues around open source and free software. Dan Scott tipped me off to the podcast one day, and I have been a regular listener ever since.
 
The Command Line - is a weekly podcast consisting of a news episode, what the host Thomas Gideon calls a rant episode which is more or less an essay on a given topic, and the occasional interview. 
 
JISC - is an intermittent podcast put out by the Joint Information Systems Committee (JISC) touching on new issues and trends in library and information science. 
 
CBC Spark - is a combination of a weekly podcast, radio show and blog which centres around technology and culture, and is hosted by Nora Young. The show occasionally crosses over into information science territory, but I listen to it more so for the interesting topics each week.
 
TVO Search Engine - is a weekly podcast hosted by Jesse Brown which explores the Internet and technology's impact on culture and politics. This is one of my absolute favourite podcasts, and an interesting take on journalism in my opinion. 
  
 

library clouds in the sky with [diamonds]?

Bacon...

Sorry, had to get that out of the way. Those who know, know. Those who do not, oh well. I will address it later... subtly???

Awhile back we got hit with the perfect-downtime-storm. A RAID controller battery randomly failed, and I was down for quite a few hours. Then a day or two later ... a brown-out occurred. Somehow, some way, this killed the brand new RAID controller on the DB server, and disemboweled the RAID controller on the web server. I was down for almost a week awaiting repairs from vendors and IT. During this period of utter embarrassment and fury, I finally took somebody up on a long-standing offer to put all of my digital collections stuff on a BEEFCAKE server. I ordered my [twin node] BEEFCAKE and decided that high availability and redundancy were the way to go.

So, I began building a proof of concept: Tomax & Xamot [LAMP with a hint of wonderful Tomcat, Java, Solr, and Djatoka for blooming ideas] are my sinister production machines, with Heckle & Jeckle [HAProxy & KeepAlive] providing the load balancing. After many hours, the proof of concept succeeded. Kill apache and/or mysql on Tomax, and Xamot will be right there still fighting for the Cobra Commander.

I've been sitting on BEEFCAKE for a week or so, almost ready to go to production. But for the last week, I have been diligent with my 99-part hearty diet of bacn, Batman, Green Lantern Corp, and Promethea. Combined with the nicotine patch, my head has been in the clouds - in a good way. I was pretty undecided about the Cloud for a long time and Stallman's talk at U. of T. threw me even farther to one side of the fence [GPL loophole], but Fink's idea-machine-brain rambling on about creating a Cloud at LibMac (another possible proof of concept) started turning gears. (Side note, Fink is more rabid about Open Source than I). The collections within the Digital Collections (namely PW20C, Concentration Camp Correspondences, Bertrand Russell, Canadian Publishing, et al) are sitting on a fair chunk of metadata begging for something to be done with it. Add that to the Mass Digitization Project (DC, METS/ALTO, and fingers crossed TEI), and EVERGREEN!!! Oh what to do, what to do???