web archiving

IIPC Curator Tools Fair: Islandora Web ARChive solution pack

The following is the text for a video that I was asked to record for the 2014 International Internet Preservation Consortium General Assembly Curator Tools Fair, on the Islandora Web ARChive solution pack.


My name is Nick Ruest. I am a librarian at York University, in Toronto, Ontario. I’m going to give a quick presentation on the Islandora Web ARChive solution pack. I only have a few minutes, so I’ll quickly cover what the module does, what areas of the web archiving life cycle it covers, and provide a quick demonstration.

So, what is the Islandora Web ARChive solution pack?

I’ll step back and quickly answer what Islandora is first. "Islandora is an open source digital asset management system based on Fedora Commons, Drupal and a host of additional applications." A solution pack, in Islandora parlance, is a Drupal module that integrates with the main Islandora module and the Tuque library, thereby allowing users to deposit, create derivatives, and interact with a given type of object. We have solution packs for Audio, Video, Large Images, Images, PDFs, paged content, and now web archives.

The Web ARChive solution pack allows users to ingest and retrieve web archives through the Islandora interface. If we think about it in terms of OAIS, we give the repository a SIP, which in the case of this solution pack can be a single warc file and some descriptive metadata, and if available, a screenshot and/or a PDF. From there, the solution pack will create an AIP and DIP.

The AIP will contain: the original warc, MODS descriptive metadata, FITS output (file characterization/technical metadata), web dissemination versions of the screenshots (jpg & thumbnail), PDF, and derivatives of the warc via warctools. Those derivatives are a csv and a filtered warc. The csv -- WARC_CSV -- is a listing of all the files in a given warc. This allows a user/researcher to have a quick glance at the contents of the warc. The filtered warc -- WARC_FILTERED -- is a warc file stripped down as much as possible to the text, and it is used only for search indexing/keyword searching.

The DIP is a JPG/TN of the captured website (if supplied) and download links to the WARC, PDF, WARC_CSV, screenshot, and descriptive metadata. Here a link to the ‘archived site’ can be supplied in the default MODS form. The suggested usage here is to provide a link to the object in a local instance of Wayback, if it exists.
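To make the WARC_CSV idea concrete, here is a minimal Python sketch of a file inventory built from a warc. It is not how the solution pack does it (that goes through warctools); it assumes an uncompressed warc held in memory, and the function names, column names, and sample records are all made up for illustration.

```python
import csv
import io


def iter_warc_records(data: bytes):
    """Yield (headers, payload) for each record in an uncompressed WARC."""
    pos = 0
    while pos < len(data):
        # Skip the CRLF pairs that separate one record from the next.
        while data[pos:pos + 2] == b"\r\n":
            pos += 2
        if pos >= len(data):
            break
        head_end = data.index(b"\r\n\r\n", pos)
        lines = data[pos:head_end].decode("utf-8").split("\r\n")
        headers = {}
        for line in lines[1:]:  # lines[0] is the version line, e.g. WARC/1.0
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        start = head_end + 4
        length = int(headers["content-length"])
        yield headers, data[start:start + length]
        pos = start + length


def warc_inventory_csv(data: bytes) -> str:
    """Return a CSV listing with one row per WARC record."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["type", "target_uri", "content_type", "length"])
    for headers, _payload in iter_warc_records(data):
        writer.writerow([
            headers.get("warc-type", ""),
            headers.get("warc-target-uri", ""),
            headers.get("content-type", ""),
            headers.get("content-length", ""),
        ])
    return out.getvalue()


def make_record(warc_type: str, uri: str, content_type: str, payload: bytes) -> bytes:
    """Build a simplified sample record (real records carry more fields,
    such as WARC-Record-ID and WARC-Date)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


# Two invented records standing in for a real crawl.
SAMPLE = (
    make_record("response", "http://example.org/",
                "application/http; msgtype=response",
                b"HTTP/1.1 200 OK\r\n\r\nhello")
    + make_record("resource", "http://example.org/logo.png",
                  "image/png", b"\x89PNG")
)

print(warc_inventory_csv(SAMPLE))
```

Real crawls are usually gzipped (.warc.gz) and far too large to hold in memory, so a production inventory would stream records from disk rather than slice one big byte string.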

I’ve also been asked to address the following questions:

1) What aspects of the web archiving life cycle model does the tool cover? What aspects of the model would you like to/do intend to build into the tool? What functionality does the tool provide that isn’t reflected in the model?

I’ll address what it does not cover first: appraisal and selection, scoping, and data capture. We allow users to use their own appraisal and selection, scoping, and data capture processes. So, for example, locally, we use Heritrix for cron-based crawls, and our own bash script for one-off crawls.

What does it cover? All of the rest of the steps!

  • Storage and organization: via Fedora Commons & Islandora
  • QA and analysis: via display/DIP -- visualization exposes it!
  • Metadata/description: every web archive object has a MODS descriptive datastream
  • Access/use/reuse: each web archive object has a URI, along with its derivatives. By default warcs are available to download.
  • Preservation: preservation depends on the policies of the repository/institution, but in our case we have a preservation action plan for web archives, and a suite of Islandora preservation modules running (checksum, checksum checker, FITS, and PREMIS) that cover the basics.
  • Risk management: see above.

2) What resources are committed to the tool’s ongoing development? What are major features in the roadmap? Is the code open source?

I developed the original version, and transferred it to the Islandora Foundation, allowing for community stewardship of the project.

Currently, there is no official roadmap for the project. If anybody has ideas, comments, suggestions, or roadmapish ideas, feel free to send a message to the Islandora mailing list.

...and yes, the code is totally open source. It is available under a GPLv3 license, and the canonical version of the code can be found under the Islandora organization on GitHub.

3) What is the user base for the tool? How environment-specific is the tool as opposed to readily reusable by other organizations?

Not entirely sure. It was recently released as part of the 7.x-1.3 version of Islandora.

Given that it is an Islandora module, it is tied to Islandora. So, you’ll have to have at least a 7.x-1.3 instance of Islandora running, along with the solution pack’s dependencies, to run it.

4) What are the tool’s unique features? What are its shortcomings?

I think some unique features are that it is a part of a digital asset management system (it is the first of its kind that I am aware of), and that it uses warctools for keyword searching and file inventories.

Shortcomings? That it is a part of a digital asset management system.

Very quick demo time!

The Islandora Web ARChive Solution Pack - Open Repositories 2013

Below is the text and slides of my presentation on the Web ARChive solution pack at Open Repositories 2013.


http://ruebot.net/files/OR2013-0.png

I have a really short amount of time to talk here. So, I am going to focus on the how and why for this solution pack and kinda put it in context of the Web Archiving Life Cycle Model proposed by the Internet Archive earlier this year. Maybe I shouldn't have proposed a 7 minute talk!

http://ruebot.net/files/OR2013-1.png

Context! Almost a year ago, I was in a meeting and was presented with this problem. YFile, a daily university newspaper -- it was previously a paper, now a website -- had been taken over by marketing a while back, and they deleted all their back content. They are an official university publication, so an official university record, and will eventually end up in the archives, so it will eventually be our problem; the library's problem. Plainly put, we live in a reality where official records are born and disseminated via the Internet. Many institutions have a strategy in place for transferring official university records that are print or tactile to university archives, but not much exists strategy-wise for websites. So, I naively decided to tackle it.

http://ruebot.net/files/OR2013-2.png

I tend to just do things. I don't ask permission. I apologize later if I have to. Like maybe taking down the YFile server during the first few crawls. If I make mistakes, that is good, I am learning something! What I am doing isn't new, but then again it kinda is. It is a really weird place. I need to crawl a website every day. The Internet Archive crawler comes around whenever it does. There is no way to give the Internet Archive/Wayback Machine a whole bunch of warc files, and I'm not ready to pay for Archive-It.

http://ruebot.net/files/OR2013-3.jpg

That won't work for me at all when I have some idea how to do it all myself. So, what is the problem? I need to capture and preserve a website every day. I want to provide the best material to a researcher. I want to keep a fine eye on preservation, but not be a digital pack rat, and I need to constantly keep the librarian and archivist in me pleased, which always seems to come down to the item vs. collection debate and which of those gets the most attention.

http://ruebot.net/files/OR2013-4.jpg

How easy is it to grab a website? Pretty damn easy if you're using at least wget 1.14 which has warc support.
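As a rough illustration, here is a small Python sketch that assembles a wget command with WARC output turned on. The flags are standard wget options (the `--warc-*` ones arrived with 1.14); the particular combination, the function name, and the example URL and file name are assumptions, not the exact command from my crawls.

```python
import shlex


def wget_warc_command(url: str, warc_name: str) -> list[str]:
    """Build a wget invocation that mirrors a site and writes a WARC.

    wget adds the .warc.gz extension to warc_name on its own.
    """
    return [
        "wget",
        "--mirror",                    # recursive crawl with timestamping
        "--page-requisites",           # also fetch images, CSS, and scripts
        f"--warc-file={warc_name}",    # write the crawl into a WARC file
        "--warc-cdx",                  # also write a CDX index of the records
        url,
    ]


# Hypothetical site and file name, for illustration only.
cmd = wget_warc_command("http://example.org/", "example-20130701")
print(shlex.join(cmd))
```

To actually run the crawl, the list can be handed to `subprocess.run(cmd, check=True)` on a machine with a recent enough wget.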

http://ruebot.net/files/OR2013-5.jpg

How many people here know what a warc is? WARC stands for Web ARChive. It is an ISO standard. It is basically a file -- that can get massive very quickly -- that aggregates the raw resources you request into a single file, along with crawl metadata and checksums. PROVENANCE!

http://ruebot.net/files/OR2013-6.jpg

This is what the beginning of a warc file looks like.

http://ruebot.net/files/OR2013-7.jpg

And here is a selection from the actual archive portion. That is my brief crash course on warc. We can talk about it more later if you have questions. I need to keep moving along.
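For anyone reading along without the slides, here is a minimal Python sketch of what the start of a warc looks like and how its first record can be read. The sample record is invented (the date, record ID, and software line are placeholders), and the sketch assumes an uncompressed warc; most warcs in the wild are gzipped.

```python
# An invented warcinfo record: a version line, named header fields,
# a blank line, then a content block of Content-Length bytes.
SAMPLE_HEAD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: warcinfo\r\n"
    b"WARC-Date: 2013-07-01T12:00:00Z\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"Content-Type: application/warc-fields\r\n"
    b"Content-Length: 19\r\n"
    b"\r\n"
    b"software: Wget/1.14"
    b"\r\n\r\n"
)


def read_first_record(data: bytes):
    """Return (version, header fields, content block) of the first record."""
    head, _, rest = data.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    fields = {}
    for line in lines[1:]:
        # Split on the first colon only; values such as dates contain colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    block = rest[: int(fields["Content-Length"])]
    return version, fields, block


version, fields, block = read_first_record(SAMPLE_HEAD)
print(version)
for name, value in fields.items():
    print(f"{name}: {value}")
print(block.decode("utf-8"))
```

Every later record in the file follows the same shape, which is what makes the format easy to append to during a crawl.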

http://ruebot.net/files/OR2013-8.jpg

So, warcs are a little weird to deal with on their own. You can disseminate them with the Wayback Machine, but I assume nobody but a few people on this planet wants to see a page full of just warc files. Building something browsable takes a little bit more work. So, I decided to snag a pdf and a screenshot of the front page of the site that I am grabbing, using wkhtmltopdf and wkhtmltoimage. Then I toss this all in a single bash script, and give it to cron.

http://ruebot.net/files/OR2013-9.jpg

So this is what I have come up with. This is how I capture and preserve a website. The pdf/image + xvfb came from Peter Binkley. X virtual framebuffer is an X11 server that performs all graphical operations in memory, not showing any screen output.
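The pdf and screenshot step can be sketched like this in Python. This is not the bash script from the slide; it only shows the shape of the two rendering commands wrapped in xvfb-run so they can run on a server with no display. The function name, URL, and base file name are hypothetical.

```python
def render_commands(url: str, basename: str) -> list[list[str]]:
    """Build headless wkhtmltopdf and wkhtmltoimage invocations.

    xvfb-run starts a throwaway virtual X server for each command;
    -a asks it to pick a free server number automatically.
    """
    xvfb = ["xvfb-run", "-a"]
    return [
        xvfb + ["wkhtmltopdf", url, f"{basename}.pdf"],
        xvfb + ["wkhtmltoimage", url, f"{basename}.png"],
    ]


# Hypothetical site and date-stamped base name, for illustration only.
for command in render_commands("http://example.org/", "example-20130701"):
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)`; a cron entry would then call the wrapping script once a day.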

http://ruebot.net/files/OR2013-10.jpg

I've been running that script on cron since last October. Now what? Like I said before, nobody wants to see a page full of warc files. So, I started working with the tools and platforms that I know. In this case, Drupal, Islandora, and Fedora Commons, and created a solution pack. A solution pack, in Islandora parlance, is a Drupal module that integrates with the main Islandora module and the Tuque API to deposit, create derivatives, and interact with a given type of object. So, we have solution packs for Audio, Video, Large Images, Images, PDFs, and paged content.

http://ruebot.net/files/OR2013-11.jpg

What does it do? It adds all required Fedora objects to allow users to ingest, create derivatives, and retrieve web archives through the Islandora interface. So we have Content Models, Data Stream Composite Models, forms, and collection policies. The current iteration of the module allows one to batch ingest a bunch of objects for a given collection, and it will create all of the derivatives (thumbnail and display image), and index any provided descriptive metadata in Solr, as well as the actual WARC file, since it is mostly text. The WARC indexing is still pretty experimental; it works, but I don't know how useful it is.

http://ruebot.net/files/OR2013-12.jpg

If you want to check out a live demo, and poke around while I am rambling on here, check this site out.

http://ruebot.net/files/OR2013-13.jpg

Collection (in terms of the web archiving life-cycle model). This is an object from the Islandora basic collection solution pack.

http://ruebot.net/files/OR2013-14.jpg

Seed (in terms of the web archiving life-cycle model). This is an object from the Islandora basic collection solution pack.

http://ruebot.net/files/OR2013-15.jpg

Document (in terms of the web archiving life-cycle model). This is an object from the Islandora Web ARChive solution pack.

http://ruebot.net/files/OR2013-16.jpg

Here is what my object looks like. The primary archival object is the WARC file, then we have our associated data streams: PDF (from the crawl), MODS/DC (descriptive metadata), Screenshot (from the crawl), FITS (technical metadata), Thumbnail & Medium JPG (derivative display images).

http://ruebot.net/files/OR2013-17.jpg

Todo! What I am still working on when I have time.

http://ruebot.net/files/OR2013-18.jpg

I want to tie in the Internet Archive's Wayback Machine for playback/dissemination of WARCs. I haven't quite wrapped my head around how best to do the Wayback integration, but I am thinking of using the date field value in the MODS record for an individual crawl.
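One way this might look, as a Python sketch: turn the MODS date into the 14-digit timestamp that Wayback replay URLs use, and build a link from it. This assumes the MODS date is stored as YYYY-MM-DD and that the local Wayback instance follows the usual base/timestamp/URL path pattern; the base URL, function name, and example site are all hypothetical.

```python
from datetime import datetime


def wayback_url(base: str, crawl_date: str, target: str) -> str:
    """Build a Wayback-style replay URL from a crawl date and target URL.

    crawl_date is assumed to be YYYY-MM-DD; with no time of day recorded,
    the timestamp falls back to midnight.
    """
    timestamp = datetime.strptime(crawl_date, "%Y-%m-%d").strftime("%Y%m%d%H%M%S")
    return f"{base.rstrip('/')}/{timestamp}/{target}"


# Hypothetical local Wayback instance and crawled site.
print(wayback_url("http://localhost:8080/wayback", "2013-07-01", "http://example.org/"))
```

Wayback generally resolves a timestamp to the nearest capture it holds, so a date-only value should still land on the right daily crawl.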

http://ruebot.net/files/OR2013-19.jpg

I'm also thinking of incorporating WARC tools into this solution pack. This would be for quick summaries and maybe a little analysis. Think of how R is incorporated into Dataverse, if you are familiar with that.

http://ruebot.net/files/OR2013-20.jpg

I am also working on integrating my silly little bash scripts into the solution pack. That way one could do crawling, dissemination, and preservation in one fell swoop, with a single click when ingesting an object in Islandora.

http://ruebot.net/files/OR2013-21.jpg

Finally, there is a hell of a lot of metadata in each of these warc files begging for something to be done with them. I haven't figured out a way to parse them in an abstract repeatable way, but if I or somebody else does, it will be great!

http://ruebot.net/files/OR2013-22.jpg http://ruebot.net/files/OR2013-23.jpg

Islandora Web ARChive Solution Pack

What is it?

The Islandora Web ARChive Solution Pack is yet another Islandora Solution Pack. This particular solution pack provides the necessary Fedora objects for persisting and disseminating web archive objects: warc files.

What does it do?

Currently, the SP allows a user to upload a warc with an associated MODS form. Once the object is deposited, the associated metadata is displayed along with a download link to the warc file.

You can check out an example here

Can I get the code?

Of course!

Todo?

If I am doing something obviously wrong, please let me know!

Immediate term:

  1. Incorporate Wayback integration for the DIP. I think this is the best disseminator for the warc files. However, I haven't wrapped my head around how to programmatically provide access to the warc files in the Wayback. I know that I will have two warc objects, an AIP warc and a DIP warc (Big thank you to @sbmarks for being a sounding board today!). Fedora will manage the AIP, and Wayback will manage the DIP. Do I iFrame the Wayback URI for the object, or link out to it?

  2. Drupal 7 module. Drupal 7 versions of Islandora Solution Packs should be on their way shortly -- next release, I believe. The caveat to using the Drupal 6 version of this module is the mimetype support. It looks like the Drupal 6 API (file_get_mimetype) doesn't pull the correct mimetype for a warc file. I should get 'application/warc' but I am getting 'application/octet-stream' -- the fallback default for the API.
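The mimetype problem comes down to an extension lookup that falls through to a default. Here is a Python analogy, not Drupal's file_get_mimetype itself: the function name and both mappings are made up, and the only point is that an extension missing from the table gets the fallback until someone adds it.

```python
import os

FALLBACK = "application/octet-stream"


def guess_mimetype(filename: str, mapping: dict[str, str]) -> str:
    """Look up a mimetype by file extension, falling back to a default."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    return mapping.get(extension, FALLBACK)


# A stock table that has never heard of warc files (illustrative only).
stock = {"pdf": "application/pdf", "jpg": "image/jpeg"}
# The same table with the warc extension registered.
extended = {**stock, "warc": "application/warc"}

print(guess_mimetype("crawl.warc", stock))     # falls through to the default
print(guess_mimetype("crawl.warc", extended))  # picks up the registered type
```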

Long term:

  1. Incorporate Islandora microservices. What I would really like to do is allow users to automate this entire process. Basically, just say this is a site I would like to archive. This is the frequency at which I would like it archived, with necessary wget options. This is the default metadata profile for it. Then grab the site, ingest it into Fedora, drop the DIP warc into Wayback, and make it all available.

  2. If you have any idea on how to do the above, or how to do it in a better manner, please let me know!