What if you have a few terabytes of web archive data sitting around and want to shine a little light into them?
Well, the good news is that now you can! Over the last couple of years, the British Library’s UK Web Archive initiative has created some great software that lets you index your web archive content into Solr and provide access to it through a discovery interface called Shine. You can check Shine out in action here (for the British Library’s collections) or here (for our Canadian politics one).
Starting in 2016, our research team was interested in trying to provide access to Canadian web archival collections. Our Web Archives for Longitudinal Knowledge project, or WALK, brings together the web archival holdings of a half-dozen Canadian libraries and aims to provide federated search and access to research derivatives.
In doing so, we realized that while Shine was powerful, it really lacked an active development community: we wanted to provide access in the same spirit, but on a different platform. But we are also in the same boat as the Shine developers: grant-funded, with no real sense that three, four, or five years down the road we’d have the time to devote full-time energy to a platform. To keep things going, we would need to leverage a bigger open-source community.
What is Warclight?
Warclight is a Rails engine based on Project Blacklight that supports the discovery of web archives held in the WARC and ARC formats. It allows faceted full-text search, record view, and other advanced discovery options. Future work on the project will include integrating the Blacklight Advanced Search plugin and creating a new plugin to recreate the existing trend search functionality in Shine.
One of the biggest strengths of Warclight is that it is based on Blacklight. This opens up a mature open-source community, which could allow us to go further, in the spirit of the old idiom: “If you want to go fast, go alone. If you want to go further, go together.”
Warclight is designed to work with web archive data that is indexed via the UK Web Archive’s webarchive-discovery project. Warclight currently uses a fork of webarchive-discovery that allows for three additional facets:
collection_number
We’ll be working on getting this functionality into core webarchive-discovery, so that you will not need to use our fork to take advantage of it.
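To make the idea of faceting on a field like collection_number a little more concrete, here is a minimal sketch of how a faceted query against a Solr index might be put together. It only builds the request URL and does not contact a server. The host, the core name ("warclight"), the search term, and the helper function build_facet_query_url are all assumptions made for illustration, not part of Warclight or webarchive-discovery; the query parameters (q, rows, wt, facet, facet.field) are standard Solr ones.

```python
from urllib.parse import urlencode


def build_facet_query_url(base_url, query, facet_fields, rows=10):
    """Build a Solr select URL that asks for facet counts on the given fields.

    Hypothetical helper for illustration only; base_url is assumed to point
    at a Solr core populated by webarchive-discovery.
    """
    params = [
        ("q", query),        # full-text query
        ("rows", rows),      # number of documents to return
        ("wt", "json"),      # response format
        ("facet", "true"),   # turn faceting on
    ]
    # One facet.field parameter per field we want counts for.
    params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url}/select?{urlencode(params)}"


# Assumed local Solr core name; adjust to match your own setup.
url = build_facet_query_url(
    "http://localhost:8983/solr/warclight",
    "pipeline",
    ["collection_number"],
)
print(url)
```

In Warclight itself you would not build these requests by hand, since Blacklight issues the Solr queries on your behalf; the sketch is only meant to show what a facet request amounts to underneath.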
If you’d like to contribute to the project, whether through feedback, use cases, or code contributions, please do not hesitate! We are especially interested in your thoughts on which fields should be displayed on an item’s view, in search results, or as facets. We also have a channel in the Archives Unleashed Slack devoted to Warclight: #warclight.
We can’t wait to work with you all, and bring the rapidly developing Warclight platform to the broader web archiving community!
By Nick Ruest.
Exported from Medium on September 23, 2017.