web archives

AUT & Last Date Modified

There’s been a long-standing, frequently asked question by participants of Archives Unleashed datathons and the cohort program: how do we find out the date of a resource or page? Dates can be really hard to decipher in web archives. As a result, we tend to rely on the crawl date, which is a pretty easy thing to grab out of a WARC since it is a mandatory field. While this is something we’ve always had in aut, it’s not the date of a response or creation of a website, but instead is the date on which the crawl occurred.
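To make the distinction concrete, here is a minimal Python sketch (not how aut itself does it) that reads both dates from one record: the crawl date from the mandatory `WARC-Date` field, and the server-reported `Last-Modified` HTTP header, which may be absent. The sample record, `parse_headers`, and `record_dates` are all hypothetical, written only for illustration.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

# Hypothetical, hand-written WARC response record (not from a real crawl).
SAMPLE_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Date: 2020-06-01T12:00:00Z\r\n"
    "WARC-Target-URI: http://example.com/\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n"
    "Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html></html>"
)


def parse_headers(block):
    """Parse 'Name: value' lines, skipping the first (version/status) line."""
    headers = {}
    for line in block.split("\r\n")[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def record_dates(record):
    """Return (crawl_date, last_modified); last_modified may be None."""
    warc_block, http_block, _body = record.split("\r\n\r\n", 2)
    warc = parse_headers(warc_block)
    http = parse_headers(http_block)

    # WARC-Date records when the crawler captured the resource.
    crawl_date = datetime.strptime(warc["warc-date"], "%Y-%m-%dT%H:%M:%SZ")

    # Last-Modified is whatever the server claimed; it is optional.
    last_modified = None
    if "last-modified" in http:
        parsed = parsedate_to_datetime(http["last-modified"])
        last_modified = parsed.replace(tzinfo=None)

    return crawl_date, last_modified


crawl_date, last_modified = record_dates(SAMPLE_RECORD)
print("crawled:", crawl_date, "| last modified:", last_modified)
```

In this made-up record the two dates are more than twenty-five years apart, which is exactly why the crawl date alone can mislead.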

GeoCities and the spacer.gif

Originally posted here. https://gifcities.org Trevor Owens and Grace Thomas recently had their article, “The invention and dissemination of the spacer gif: implications for the future of access and use of web archives” published in the International Journal of Digital Humanities. It’s a great look at the history of the spacer.gif, how it proliferated in the early web, and a case study of digging into web archives and doing a whole lot of analysis.

The Archives Unleashed Toolkit as a Finding Aid Utility

Originally posted here. I’ve been thinking a lot lately about how the Archives Unleashed Toolkit is a great finding aid utility for web archive collections, and should be in the toolbox for any web archivist. So, a finding aid. How could we create a finding aid for a web archive collection with relatively minimal labour? The Society of American Archivists defines a finding aid as: A tool that facilitates discovery of information within a collection of records.

A Quick Benchmark of Webarchive-Discovery

This past week Compute Canada provided us with resources to set up our Solr Cloud instance for WALK and Archives Unleashed. We were able to get things set up relatively quickly thanks to a bit of preparation and practice on our local machines in the previous weeks. Once everything was set up (5 virtual machines total; 4 Solr Cloud nodes and one indexer – details below), we started benchmarking webarchive-discovery and our Solr Cloud setup with GNU Parallel.

See a Little Warclight

What if you have a few terabytes of web archive data sitting around, and want to shine a little light into them? Well, the good news is that now you can! The British Library’s UK Web Archive initiative has created some great software over the last couple of years to allow you to index your web archive content into Solr, and provide access to it in a discovery interface called Shine. You can check Shine out in action here (for the British Library’s collections) or here (for our Canadian politics one).

The Archives Unleashed Project: Warcbase is dead, long live the Toolkit

by Ian Milligan, Jimmy Lin, and Nick Ruest

We were delighted to be able to announce a few months ago that our project team at the University of Waterloo and York University was awarded a grant from the Andrew W. Mellon Foundation to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past. Since that announcement, we’ve been busy at work on a few different things: modernizing and updating our Warcbase web archiving analytics platform, working on a discovery interface and underlying infrastructure, and laying the administrative groundwork for the project itself.

Islandora Web ARChive SP updates

Community

Some pretty exciting stuff has been happening lately in the Islandora community. Earlier this year, Islandora began the transformation to a federally incorporated, community-driven soliciting non-profit, making it, in my opinion, a much more sustainable project. Thanks to my organization joining on as a member, I’ve been provided the opportunity to take part in the Roadmap Committee. Since I’ve joined, we have been hard at work creating transparent policies and processes for software contributions, licenses, and resources.