Below is the text and slides of my presentation on the Web ARChive solution pack at Open Repositories 2013.
I have a really short amount of time to talk here. So, I am going to focus on the how and why for this solution pack and kinda put it in the context of the Web Archiving Life Cycle Model proposed by the Internet Archive earlier this year. Maybe I shouldn’t have proposed a 7-minute talk!
Context! Almost a year ago, I was in a meeting and was presented with this problem. YFile, a daily university newspaper – previously a paper, now a website – had been taken over by marketing a while back, and they deleted all their back content. They are an official university publication, so an official university record, and will eventually end up in the archives, so it will eventually be our problem; the library’s problem. Plainly put, we live in a reality where official records are born and disseminated via the Internet. Many institutions have a strategy in place for transferring official university records that are print or tactile to university archives, but not much exists strategy-wise for websites. So, I naively decided to tackle it.
I tend to just do things. I don’t ask permission. I apologize later if I have to. Like maybe for taking down the YFile server during the first few crawls. If I make mistakes, that is good, I am learning something! What I am doing isn’t new, but then again it kinda is. It is a really weird place. I need to crawl a website every day. The Internet Archive crawler comes around whenever it does. There is no way to give the Internet Archive/Wayback Machine a whole bunch of warc files, and I’m not ready to pay for Archive-It.
That won’t work for me at all when I have some idea how to do it all myself. So, what is the problem? I need to capture and preserve a website every day. I want to provide the best material to a researcher. I want to keep a fine eye on preservation, but not be a digital pack rat, and I need to constantly keep the librarian and archivist in me pleased, which always seems to come down to the item vs. collection debate and which of those gets the most attention.
How easy is it to grab a website? Pretty damn easy if you’re using at least wget 1.14 which has warc support.
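As a rough sketch of what that looks like, here is one way to put together a wget invocation that writes a WARC alongside the normal crawl. The `--mirror`, `--page-requisites`, `--warc-file`, and `--warc-cdx` options are real wget options (the WARC ones arrived in 1.14); the function name, the `site` prefix, the dated file name, and the example URL are assumptions for illustration, not the exact command from my script.

```python
from datetime import date


def build_wget_warc_command(url, crawl_date, prefix="site"):
    """Return the argv for a wget (1.14 or later) crawl that also writes a WARC.

    The prefix and the dated naming scheme are hypothetical choices.
    """
    # wget adds the .warc.gz extension itself (it compresses by default).
    warc_name = f"{prefix}-{crawl_date:%Y%m%d}"
    return [
        "wget",
        "--mirror",                  # recursive crawl of the site
        "--page-requisites",         # also fetch images, CSS, and scripts
        f"--warc-file={warc_name}",  # write everything fetched into a WARC
        "--warc-cdx",                # also write a CDX index of the records
        url,
    ]


if __name__ == "__main__":
    import shlex

    # Print the command instead of running it, so nothing gets crawled here.
    print(shlex.join(build_wget_warc_command("http://example.org/", date(2013, 7, 9))))
```

Handing the returned list to `subprocess.run` would actually start the crawl; printing it first is a cheap way to check the command before pointing it at somebody's server.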
How many people here know what a warc is? Warc stands for web archive. It is an ISO standard (ISO 28500). It is basically a file – that can get massive very quickly – that aggregates the raw resources you request into a single file along with crawl metadata and checksums. PROVENANCE!
This is what the beginning of a warc file looks like.
And here is a selection from the actual archive portion. That is my brief crash course on warc. We can talk about it more later if you have questions. I need to keep moving along.
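To make the record layout a bit more concrete, here is a minimal sketch of reading a single WARC record: a version line, named header fields, a blank line, and then a block of exactly `Content-Length` bytes followed by two CRLFs. The sample record is hand-written with made-up values, not taken from a real crawl, and the reader assumes an uncompressed stream and ignores header continuation lines.

```python
import io


def read_warc_record(stream):
    """Read one uncompressed WARC record; return (version, headers, payload)."""
    version = stream.readline().strip().decode("utf-8")
    if not version.startswith("WARC/"):
        raise ValueError(f"not a WARC record: {version!r}")
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):  # a blank line ends the header block
            break
        # Split on the first colon only, so values such as dates stay intact.
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()
    payload = stream.read(int(headers["Content-Length"]))
    stream.read(4)  # skip the two CRLFs that close the record
    return version, headers, payload


# Hypothetical warcinfo record, written by hand for this example.
BODY = b"software: Wget/1.14\r\nformat: WARC File Format 1.0\r\n"
SAMPLE = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: warcinfo\r\n",
    b"WARC-Date: 2013-07-09T12:00:00Z\r\n",
    b"Content-Type: application/warc-fields\r\n",
    b"Content-Length: " + str(len(BODY)).encode("ascii") + b"\r\n",
    b"\r\n",
    BODY,
    b"\r\n\r\n",
])

if __name__ == "__main__":
    version, headers, payload = read_warc_record(io.BytesIO(SAMPLE))
    print(version, headers["WARC-Type"], len(payload))
```

wget gzips its WARC output by default, so a real file would need to be opened with something like `gzip.open(path, "rb")` before being handed to a reader like this.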
So, warcs are a little weird to deal with on their own. You can disseminate them with the Wayback Machine, and I assume nobody but a few people on this planet want to see a page full of just warc files. Building something browsable takes a little bit more work. So, I decided to snag a pdf and a screenshot of the front page of the site that I am grabbing with wkhtmltopdf and wkhtmltoimage. Then I toss this all in a single bash script, and give it to cron.
So this is what I have come up with. This is how I capture and preserve a website. The pdf/image + xvfb came from Peter Binkley. X virtual framebuffer is an X11 server that performs all graphical operations in memory, not showing any screen output.
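For anybody who can't see the slide, the pdf and screenshot step can be sketched roughly like this. It only builds the two commands; `xvfb-run --auto-servernum`, `wkhtmltopdf`, and `wkhtmltoimage` are the real tools, while the function name, the `site` prefix, and the dated output names are assumptions for illustration and not the actual contents of my bash script.

```python
from datetime import date


def build_snapshot_commands(url, crawl_date, prefix="site"):
    """Return argv lists that render a PDF and a PNG of a page under Xvfb.

    The output file naming is a hypothetical convention.
    """
    stamp = f"{crawl_date:%Y%m%d}"
    # xvfb-run starts a throwaway in-memory X server for the wrapped command,
    # so the renderers can run on a headless machine.
    xvfb = ["xvfb-run", "--auto-servernum"]
    return [
        xvfb + ["wkhtmltopdf", url, f"{prefix}-{stamp}.pdf"],
        xvfb + ["wkhtmltoimage", url, f"{prefix}-{stamp}.png"],
    ]


if __name__ == "__main__":
    import shlex

    # Print the commands rather than running them.
    for cmd in build_snapshot_commands("http://example.org/", date(2013, 7, 9)):
        print(shlex.join(cmd))
```

Putting the date in every file name keeps the daily cron runs from overwriting each other, and makes it easy to match a pdf and a screenshot to the warc from the same day.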
I’ve been running that script on cron since last October. Now what? Like I said before, nobody wants to see a page full of warc files. So, I started working with the tools and platforms that I know. In this case, Drupal, Islandora, and Fedora Commons, and created a solution pack. A solution pack, in Islandora parlance, is a Drupal module that integrates with the main Islandora module and the Tuque API to deposit, create derivatives for, and interact with a given type of object. So, we have solution packs for Audio, Video, Large Images, Images, PDFs, and paged content.
What does it do? It adds all required Fedora objects to allow users to ingest, create derivatives, and retrieve web archives through the Islandora interface. So we have Content Models, Data Stream Composite Models, forms, and collection policies. The current iteration of the module allows one to batch ingest a bunch of objects for a given collection, and it will create all of the derivatives (thumbnail and display image), and index any provided descriptive metadata in Solr, as well as the actual WARC file since it is mostly text. The WARC indexing is still pretty experimental; it works, but I don’t know how useful it is.
If you want to check out a live demo, and poke around while I am rambling on here, check this site out.
Collection (in terms of the web archiving life-cycle model). This is an object from the Islandora basic collection solution pack.
Seed (in terms of the web archiving life-cycle model). This is an object from the Islandora basic collection solution pack.
Document (in terms of the web archiving life-cycle model). This is an object from the Islandora Web ARChive solution pack.
Here is what my object looks like. The primary archival object is the WARC file, then we have our associated data streams: PDF (from the crawl), MODS/DC (descriptive metadata), Screenshot (from the crawl), FITS (technical metadata), Thumbnail & Medium JPG (derivative display images).
Todo! What I am still working on when I have time.
I want to tie in the Internet Archive’s Wayback Machine for playback/dissemination of WARCs. I haven’t quite wrapped my head around how best to do the Wayback integration, but I am thinking of using the date field value in the MODS record for an individual crawl.
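One possible shape for that, purely as a sketch: turn the crawl date from the MODS record into the 14-digit timestamp that Wayback-style replay URLs use, and build a link from it. The base URL, the function name, and the assumption that the MODS date is stored as `YYYY-MM-DD` are all hypothetical; I haven't settled on any of this.

```python
from datetime import datetime


def wayback_url(base, captured, target):
    """Build a Wayback-style replay URL of the form base/<timestamp>/<target>.

    Assumes `captured` is a YYYY-MM-DD string (a guess at how the MODS date
    field would be stored); the time of day defaults to midnight.
    """
    timestamp = datetime.strptime(captured, "%Y-%m-%d").strftime("%Y%m%d%H%M%S")
    return f"{base.rstrip('/')}/{timestamp}/{target}"


if __name__ == "__main__":
    # wayback.example.org is a placeholder, not a real deployment.
    print(wayback_url("http://wayback.example.org/wayback/", "2013-07-09", "http://example.org/"))
```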
I’m also thinking of incorporating WARC tools into this solution pack. This would be for quick summaries and maybe a little analysis. Think of how R is incorporated into Dataverse, if you are familiar with that.
I am also working on integrating my silly little bash scripts into the solution pack. That way one could do the crawling, dissemination, and preservation in one fell swoop, with a single click, when ingesting an object in Islandora.
Finally, there is a hell of a lot of metadata in each of these warc files begging for something to be done with them. I haven’t figured out a way to parse them in an abstract repeatable way, but if I or somebody else does, it will be great!