Lowering the Barrier to Access: The Archives Unleashed Cloud Project

Jun 19, 2019

Abstract

The Archives Unleashed Project aims to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past. We respond to one of the major issues facing web archiving research: that while tools exist to work with WARC files and to enable computational analysis, they require a considerable level of technical knowledge to deploy, use, and maintain.

Our project uses the Archives Unleashed Toolkit, an open-source platform for analyzing web archives (https://github.com/archivesunleashed/aut). Due to space constraints, we do not discuss the Toolkit at length in this abstract. While the Toolkit can analyze ARC and WARC files at scale, it requires knowledge of the command line and a developer environment. We recognize that this level of technical expertise is beyond that of the average humanities or social sciences researcher, and the approaches discussed in this paper focus on making the underlying technical infrastructure accessible.

This presentation introduces the Archives Unleashed Cloud to researchers and aims to stimulate a conversation about where the work of the researcher begins and the work of the research platform ends. It also discusses the problem of long-term project sustainability. Researchers want services such as the Cloud, but how do we provide them in a cost-effective manner?

Stage One: Learning to WALK

In 2016, we launched the Web Archives for Longitudinal Knowledge (WALK) project thanks to support from Compute Canada’s Research Platforms and Portals competition. Our goal was to bring Canadian Archive-It partners together into one search portal equipped with analytic tools so that researchers could extract datasets of interest. By working with collections from six Canadian universities (Alberta, Dalhousie, Toronto, Victoria, Simon Fraser, and Winnipeg), we were able to develop our infrastructure, explore edge cases such as strange data and errors within ARC and WARC files, and provide public search access to web archive collections through our “Warclight” portal (e.g. http://dalhousie.archivesunleashed.org).

Yet our goal to provide researcher access to under-utilized web collections was only partially successful. First of all, we did not interact directly with researchers or provide a modern interface for them to choose what they wanted to analyze. Secondly, everything relied on manual work from the project team. For the next stage of the project, we wanted to develop a self-service portal for users to interact with web archives.

Stage Two: Enter the Archives Unleashed Cloud

The Archives Unleashed Cloud thus aims to facilitate the analysis of web archives through a modern web-based UI. It bridges the gap between easy-to-use curatorial tools and developer-focused analytics platforms. In short, it is now relatively easy to create a web archive, but it is still too difficult to analyze one. The Cloud grows out of the “Filter-Analyze-Aggregate-Visualize” cycle developed for the Archives Unleashed Toolkit (Lin et al., 2017). It allows users to do the following:

Sync their web archive collections with the Cloud using the Web Archiving Systems API (WASAPI). We currently support Archive-It collections, and the platform can connect to other archival institutions as they adopt WASAPI;
Transfer ARC and WARC files into the Archives Unleashed Cloud;
Process ARC and WARC files and generate:
- Full-text search for text mining;
- Hyperlink networks for network analysis;
- Other statistics on the shape of the collection.

Researchers and institutions can use the canonical Archives Unleashed Cloud at https://cloud.archivesunleashed.org or, as it is an open-source project, run their own local versions.
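
To make the hyperlink-network output listed above more concrete, the following is a minimal Python sketch of the filter and aggregate steps. It is an illustration under stated assumptions, not the Toolkit's or the Cloud's actual code: the function names `domain` and `link_network`, the `keep_domain` parameter, and the input shape (pairs of a page URL and its outgoing links, already extracted from ARC or WARC records) are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def link_network(pages, keep_domain=None):
    """Count domain-to-domain links.

    pages: iterable of (page_url, [outgoing_link_urls]) pairs.
           This input shape is an assumption made for the sketch.
    keep_domain: if given, only pages hosted on this domain are kept
                 (the "filter" step); otherwise all pages are used.
    """
    edges = Counter()
    for page_url, links in pages:
        source = domain(page_url)
        if keep_domain is not None and source != keep_domain:
            continue  # filter: drop pages outside the domain of interest
        for link in links:
            target = domain(link)
            # Skip relative or unparseable links and links within one domain.
            if target and target != source:
                edges[(source, target)] += 1  # aggregate
    return edges


if __name__ == "__main__":
    # Hypothetical sample data standing in for links extracted from a collection.
    sample = [
        ("http://www.dal.ca/news", ["http://ualberta.ca/", "/about"]),
        ("http://example.org/", ["http://dal.ca/"]),
    ]
    for (source, target), count in link_network(sample).items():
        print(source, target, count)
```

The resulting edge counts are the kind of table that a network analysis tool can take as input for the "visualize" step.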

Sustainability

As we develop the working version of the Archives Unleashed Cloud, one of the project team's main concerns is the future of the Cloud after funding ends in 2020. While we are currently exploring whether the Cloud makes sense as a stand-alone non-profit corporation, we are still unsure about its future direction. How do services like this, which meet demonstrated needs, survive in the long run? Our presentation discusses our current strategies and seeks to engage the audience on the state of the field and on how best to reach web archiving practitioners.

Conclusions

Projects and services like WebRecorder.io and Archive-It have made amazing strides in web archive crawling and capture. The Archives Unleashed Cloud seeks to make web archive analysis similarly easy and straightforward. Yet the scale of web archival data makes this a harder problem.

Event

The web that was: archives, traces, reflections RESAW 2019

Location

Amsterdam, Netherlands

Nick Ruest

Associate Librarian