See a little Warclight: building an open-source web archive portal with Project Blacklight


In 2014-15, close collaboration between UK-based researchers and the UK Web Archive led to the launch of the open-source Shine project. It allowed faceted search, trend diagram exploration, and other advanced methods of exploring web archives. It had two limitations, however: it was based on the Play framework, which is relatively obscure, especially within library settings, and development largely languished after the Big UK Domain Data for the Arts and Humanities (BUDDAH) project came to an end.

The idea of Shine is an important one, however, and our project team wanted to explore how we could take this great work and begin to move it into the wider, open-source library community. Hence the idea of a Project Blacklight-based engine for exploring web archives. Blacklight, an open-source library discovery engine, would be familiar to library IT managers and other technical community members. But what if Blacklight could work with WARCs?

The Archives Unleashed team’s first foray towards what we now call “Warclight” (a portmanteau of Blacklight and the ISO-standardized Web ARChive file format) was building a standalone Blacklight Rails application. As we came to realize that a standalone application does little to help those who would like to implement their own, development pivoted to building a Rails Engine, which “allows you to wrap a specific Rails application or subset of functionality and share it with other applications or within a larger packaged application.” Put another way, it allows others to use an existing Warclight template to build their own web archive search application. Drawing inspiration from UKWA’s Shine, Warclight allows faceted full-text search, record view, and other advanced discovery options. It is designed to work with web archive data that is indexed via the UK Web Archive’s webarchive-discovery project.

Webarchive-discovery is a utility that parses ARCs and WARCs and indexes them using Apache Solr, an open source search platform. Once these ARCs and WARCs have been indexed into Solr, the index provides us with searchable fields including title, host, crawl-date, and content type.
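As a rough illustration of how such an index supports faceted full-text search, the sketch below builds a standard Solr select request that asks for facet counts alongside the search results. It is not taken from Warclight itself: the Solr address, the core name "warclight", and the exact field names (host, content_type, crawl_date) are assumptions that should be checked against the schema of the index in use.

```python
from urllib.parse import urlencode

# Assumed field names; verify them against your index's Solr schema.
FACET_FIELDS = ("host", "content_type", "crawl_date")


def build_facet_query(base_url, text, facet_fields=FACET_FIELDS, rows=10):
    """Return a Solr /select URL for a full-text search with field facets."""
    params = [
        ("q", text),             # the user's search expression
        ("rows", rows),          # number of documents to return
        ("wt", "json"),          # ask Solr for a JSON response
        ("facet", "true"),       # turn faceting on
        ("facet.mincount", 1),   # hide facet values with no matches
    ]
    # Solr accepts one facet.field parameter per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


# Hypothetical local Solr core named "warclight".
url = build_facet_query("http://localhost:8983/solr/warclight", "title:archive")
print(url)
```

The function only constructs the request URL, so it can be inspected without a running Solr instance; sending it to a real index would return matching records plus per-field counts of the kind a discovery interface shows as facets.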

One of the biggest strengths of Warclight is that it is based on Blacklight. This opens up a mature open source community, which could allow us to go further, in the spirit of the old proverb: “If you want to go fast, go alone. If you want to go further, go together.”

This presentation will provide an overview of Warclight and its implementation patterns, including the Archives Unleashed at-scale implementation of over 1 billion Solr docs using Apache SolrCloud.

Zagreb, Croatia
Nick Ruest
Associate Librarian