Building the Next Generation of Web Archive Analysis Service

Abstract

Over the past five years, both the Internet Archive (IA) and the Archives Unleashed (AU) team have independently and collaboratively developed a suite of tools to address the challenges and barriers of working with web archives at scale.

Powered by the Archives Unleashed Toolkit (AUT), the Archives Unleashed Cloud was the first attempt at providing a self-service web-based platform for conducting analysis on WARC collections. Drawing on the computational methods and power of AUT, the platform provided familiar click-to-results actions that many of us prefer, eliminating the technical burdens of learning and running the AU Toolkit locally.

At the Internet Archive’s subscription service Archive-It (AIT), pre-defined derivative datasets could be requested for any web archive collection. The Archive-It Research Services (ARS) datasets included WAT (an extended web archive metadata format), WANE (named entities), and LGA (temporal graphs). ARS datasets were generated manually on request by AIT staff, using distributed computing technologies such as Hadoop and Pig.

The next step was to marry the Archives Unleashed approach with that of the Internet Archive.

Enter the Archives Research Compute Hub (ARCH). ARCH is a jointly developed platform that incorporates both of the approaches described above. It is backed by its own Hadoop-based distributed computing cluster, running on new hardware dedicated to ARCH, and by its own distributed storage (HDFS). On top of this infrastructure, the team developed a web interface and advanced job management features to control ARCH jobs.

ARCH supports 16 dataset derivation jobs. This includes all of the jobs from the AU Toolkit and Cloud as well as the 3 ARS jobs, which have all been ported into a modern Spark-compatible framework. The jobs are now much more efficient than either earlier platform, and can be launched by researchers themselves, benefiting from ARCH’s advanced features, such as its queuing system, web preview and dataset sharing options.

ARCH is based on the IA Sparkling library, a Spark-based toolkit used at IA for large-scale data processing jobs, with a focus on web archives but also archival collections more generally. As part of this project, both ARCH and Sparkling have been open-sourced to fulfill the open access policy of this project and support interoperability. Further, we pursue open standards, such as a WASAPI-compatible data endpoint to provide derivative datasets in a standard API format, which is particularly useful for dataset types consisting of more than one file.
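To illustrate why a standard listing format helps with multi-file datasets, here is a minimal Python sketch of a client that walks a paginated, WASAPI-style file listing. The field names (`count`, `next`, `files`, `filename`, `locations`) follow the general shape of the WASAPI data transfer specification as we understand it; the URLs, the sample pages, and the helper names `iter_wasapi_files` and `download_plan` are hypothetical and do not describe ARCH's actual endpoint.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Two hypothetical result pages, keyed by the URL that would return them.
# In a real client these would come from HTTP GET requests.
PAGE_1 = "https://example.org/wasapi/v1/webdata?page=1"
PAGE_2 = "https://example.org/wasapi/v1/webdata?page=2"
SAMPLE_PAGES: Dict[str, dict] = {
    PAGE_1: {
        "count": 3,
        "next": PAGE_2,
        "previous": None,
        "files": [
            {"filename": "part-00000.csv.gz",
             "locations": ["https://example.org/files/part-00000.csv.gz"]},
            {"filename": "part-00001.csv.gz",
             "locations": ["https://example.org/files/part-00001.csv.gz"]},
        ],
    },
    PAGE_2: {
        "count": 3,
        "next": None,
        "previous": PAGE_1,
        "files": [
            {"filename": "part-00002.csv.gz",
             "locations": ["https://example.org/files/part-00002.csv.gz"]},
        ],
    },
}


def iter_wasapi_files(fetch_page: Callable[[str], dict],
                      first_url: str) -> Iterator[dict]:
    """Yield every file entry, following `next` links until they run out."""
    url = first_url
    while url:
        page = fetch_page(url)
        for entry in page.get("files", []):
            yield entry
        url = page.get("next")  # None on the last page ends the loop


def download_plan(fetch_page: Callable[[str], dict],
                  first_url: str) -> List[Tuple[str, str]]:
    """Return (filename, first listed location) for every file in the dataset."""
    return [(f["filename"], f["locations"][0])
            for f in iter_wasapi_files(fetch_page, first_url)]


if __name__ == "__main__":
    # The dictionary lookup stands in for an HTTP request in this sketch.
    for name, location in download_plan(SAMPLE_PAGES.__getitem__, PAGE_1):
        print(name, location)
```

Because the listing is paginated and every entry carries its own locations, a client can retrieve a dataset that is split across many part files without knowing the number of files in advance.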

To account for the distributed infrastructure within IA, which hosts the ARCH service, and to provide easy access to its data, ARCH supports full cross-cluster access. This enables direct access to AIT’s Hadoop cluster, which serves as a long-term cache for AIT collections, so that collections available there can be read efficiently without loading them from IA’s main storage system, Petabox. Otherwise, collections are seamlessly loaded from Petabox and cached by ARCH in a managed space for subsequent job runs, keeping data fresh and available. The AIT and Petabox APIs have been incorporated so that collection statistics are available before files have been completely fetched.
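The lookup order described above can be sketched as a small cache-with-fallback routine. This is an illustrative Python model only, not ARCH's implementation (ARCH itself is built on Spark and Sparkling). The class `CollectionResolver`, its attributes, and the in-memory dictionaries standing in for the AIT Hadoop cluster, the ARCH-managed cache, and Petabox are all assumptions made for the example.

```python
from typing import Callable, Dict, List


class CollectionResolver:
    """Resolve a collection's files: shared cache, then managed cache, then primary storage."""

    def __init__(self,
                 shared_cache: Dict[str, List[str]],
                 load_from_primary: Callable[[str], List[str]]) -> None:
        # Stands in for the long-term cache on the AIT Hadoop cluster.
        self.shared_cache = shared_cache
        # Stands in for a (slow) load from primary storage such as Petabox.
        self.load_from_primary = load_from_primary
        # Stands in for the space managed by the service for later job runs.
        self.managed_cache: Dict[str, List[str]] = {}
        # Counts how often primary storage had to be contacted.
        self.primary_loads = 0

    def files_for(self, collection_id: str) -> List[str]:
        # 1. Prefer the shared long-term cache when the collection is there.
        if collection_id in self.shared_cache:
            return self.shared_cache[collection_id]
        # 2. Otherwise load from primary storage once and keep the result.
        if collection_id not in self.managed_cache:
            self.managed_cache[collection_id] = self.load_from_primary(collection_id)
            self.primary_loads += 1
        # 3. Later runs are served from the managed cache.
        return self.managed_cache[collection_id]


def fake_primary_storage(collection_id: str) -> List[str]:
    """Hypothetical loader returning made-up WARC file names."""
    return [f"{collection_id}-0001.warc.gz", f"{collection_id}-0002.warc.gz"]


if __name__ == "__main__":
    resolver = CollectionResolver({"cached": ["cached-0001.warc.gz"]},
                                  fake_primary_storage)
    print(resolver.files_for("cached"))    # served from the shared cache
    print(resolver.files_for("uncached"))  # loaded from primary storage
    print(resolver.files_for("uncached"))  # served from the managed cache
    print(resolver.primary_loads)
```

The point of the design is that a second job on the same collection never pays the cost of the primary-storage load again, while collections already present in the shared cache never trigger that load at all.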

Together, these design considerations have produced a novel, deeply integrated self-service compute hub at IA.

Date
Location
Marseille, France
Nick Ruest
Associate Librarian