The Archives Unleashed Project: Warcbase is dead, long live the Toolkit

by Ian Milligan, Jimmy Lin, and Nick Ruest

We were delighted to announce a few months ago that our project team at the University of Waterloo and York University was awarded a grant from the Andrew W. Mellon Foundation to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past.

Since that announcement, we’ve been busy working on a few different things: modernizing and updating our Warcbase web archiving analytics platform, building a discovery interface and its underlying infrastructure, and laying the administrative groundwork for the project itself. We’ll write a bit more next week about the front-end, but for now we wanted to announce the next version of our web archiving toolkit.

Warcbase is dead…long live the Archives Unleashed Toolkit

We’ve been busy working away on a 1.0 release of the Archives Unleashed Toolkit, or AUT. AUT grows out of the analytics functions of Warcbase, which has now officially been deprecated. We’ve left behind the Apache HBase and Wayback functionality, focusing instead on the Apache Spark-based open-source platform for analyzing web archives. Once we left HBase behind, Warcbase no longer made sense as a name. The project really is a toolkit to open up web archives for scholars, hence the “Archives Unleashed Toolkit.”

If you want to take AUT for a spin, you can download the 0.9.0 release jar, set up Apache Spark locally, and work through our tutorial. If you haven’t set up Apache Spark before, we have a helpful “Getting Started” guide. Because we provide the jar, you don’t have to build the toolkit yourself!

What does the technical roadmap look like for AUT moving forward? The 0.9.0 release focused on codebase clean-up (Javadocs too!) and getting the project set up on Sonatype. The next release will move the project to Apache Spark 2.0, which will allow us to adopt Spark SQL and DataFrames. PySpark support is also on the roadmap.

If you’re interested in reading about the history of Warcbase and how it was used to explore collections, feel free to check out this article:

Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives. ACM Journal on Computing and Cultural Heritage, 10(4), Article 22, 2017.

Stay Tuned…

We’ll be back next week to talk about Warclight, our Project Blacklight-based discovery interface for web archives. See you again soon!

Tagged in Apache Spark, Web Archiving, Analytics

Exported from Medium on October 4, 2017.
