Hot Tips To Boost Your Interdisciplinary Web Archive Collaboration!

Nick Ruest
York University

Hamilton, Canada
April 17, 2018



We have a problem facing our collective cultural heritage.

This is a scale that boggles the mind – compare it to the Old Bailey (197,745 trials between 1674 and 1913).




Could one even study the 1990s and beyond without web archives?

…and the 1990s are history (as painful as it is to say).

And we have fears

The decisions we make today will lay the foundations for how we work with born-digital cultural heritage.

Won't be enough - we'll need search and discovery tools.

But what will our search engines look like?

We can't let the black box write our histories.

Our Nightmare

Historians rely uncritically on date-ordered or algorithmically-ranked keyword search results, putting them at the mercy of search algorithms they do not understand.

Some disturbing trends in this area…

The historians who came to the meeting were intelligent, kind, and encouraging. But they didn't seem to have a good sense of how to wield quantitative data to answer questions, didn't have relevant computational skills, and didn't seem to have the time to dedicate to a big multiauthor collaboration. It’s not their fault: these things don't appear to be taught or encouraged in history departments right now.

-Erez Lieberman Aiden and Jean-Baptiste Michel

Right now, to use web archives you have to really want to use them.

i.e., you need to be an expert

We want web archives to be used on page 153 of a random book!

We need interdisciplinary collaboration to tackle this problem!

In a galaxy...

...actually a coffee shop in the centre of the universe.

How do you collaborate?

Be frank.

Be honest.

What do you want and need out of this relationship?

What's your academic currency?

Things evolve.



Web Archives for Longitudinal Knowledge

Canadian Political Parties & Political Interest Group Collection (ARCHIVE-IT/Toronto)

  • 50 Websites
  • All major political parties
  • Many minor political parties
  • Political interest groups
  • Collected quarterly from 2005 to the present

The Current Interface

  • Very limited; simple search engine, some advanced options, and no facets.
  • Great collection, but nobody uses it.


How could we improve?

Apache Solr

UK Web Archive





This is a future of collection development

Institutional Collecting & Research Data




Archives Unleashed

Our Goals

  • Create relatively easy-to-use tools;
  • Create tools that are UNDERSTANDABLE - no black boxes;
  • Create tools that can push forward research in history, library/archives, and computer science;
  • Help people use these tools, and inspire research & creativity with datathons.

We want to help people unleash their web archives

So, how you gonna do that?


Archives Unleashed Toolkit

Or, the software project formerly known as Warcbase


Lowering the barriers to entry so that humanists, librarians, and archivists can interact with large-scale web archive data, in a transparent way.

What do you mean by that?

Do I really need to know all that?



So, what are you doing again to help?

For every public collection that a WALK partner has, we generate derivatives like:

  • Domain information;
  • Full text;
  • Hyperlink graphs
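
To make the derivatives above concrete, here is a minimal sketch in plain Python of what "domain information" and a "hyperlink graph" can look like. It is not the Archives Unleashed Toolkit API; the record layout (`url`, `links`), the function names, and the example domains are all hypothetical, assumed only for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the host of a URL without a leading 'www.' (empty if none)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def domain_counts(records):
    """Domain information: how many captured pages each domain has."""
    return Counter(domain_of(r["url"]) for r in records)


def link_graph(records):
    """Hyperlink graph: count links from one domain to a different domain."""
    edges = Counter()
    for r in records:
        src = domain_of(r["url"])
        for target in r["links"]:
            dst = domain_of(target)
            if dst and dst != src:  # skip links that stay inside one site
                edges[(src, dst)] += 1
    return edges


# Hypothetical captured pages; a real collection would be read from WARC files.
records = [
    {"url": "http://www.example-party.ca/platform",
     "links": ["http://example-group.ca/about",
               "http://www.example-party.ca/news"]},
    {"url": "http://example-group.ca/about",
     "links": ["http://www.example-party.ca/"]},
    {"url": "http://www.example-party.ca/news",
     "links": []},
]

pages_per_domain = domain_counts(records)
edges = link_graph(records)
```

The point of keeping it this small is the "no black boxes" goal: each derivative is just a count you can read and check by hand, and the edge list can be handed to a network visualization tool.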

What can you do with AUT now?

Extract all text

Extract entities

Extract link graphs

AUK - Archives Unleashed Cloud

What if we had periodic gatherings where colleagues could get background information, learn how to use these tools, and work on a small project with other colleagues?

October 2018; Canada West Coast

May 2019; US East Coast

November 2019; US West Coast

There's a really great community out there!

Suddenly web archives aren't boutique.

They speak to a broader audience, and you can imagine how to use them.

So, hopefully, researchers can cite web archives on page 153 of a book without needing to be an expert!


Thank you!