Recent Publications

Content Selection and Curation for Web Archiving: The Gatekeepers vs. the Masses

Desiderata for Exploratory Search Interfaces to Web Archives in Support of Scholarly Activities

An Open-Source Strategy for Documenting Events: The Case Study of the 42nd Canadian Federal Election on Twitter

The Great WARC Adventure: Using SIPs, AIPs, and DIPs to Document SLAPPs

Recent Posts

This past week Compute Canada provided us with resources to set up our Solr Cloud instance for WALK and Archives Unleashed. We were able to get things set up relatively quickly thanks to a bit of preparation and practice on our local machines in the previous weeks. Once everything was set up (5 virtual machines in total: 4 Solr Cloud nodes and one indexer; details below), we started benchmarking webarchive-discovery and our Solr Cloud setup with GNU Parallel.

What if you have a few terabytes of web archive data sitting around, and want to shine a little light on it? Well, the good news is that now you can! The British Library’s UK Web Archive initiative has created some great software over the last couple of years that lets you index your web archive content into Solr and provide access to it through a discovery interface called Shine. You can check Shine out in action here (for the British Library’s collections) or here (for our Canadian politics one).

by Ian Milligan, Jimmy Lin, and Nick Ruest

We were delighted to be able to announce a few months ago that our project team at the University of Waterloo and York University was awarded a grant from the Andrew W. Mellon Foundation to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past. Since that announcement, we’ve been busy at work on a few different things: modernizing and updating our Warcbase web archiving analytics platform, working on a discovery interface and underlying infrastructure, and laying the administrative groundwork for the project itself.

Introduction

List of bots I run, divided up by type.

anon

diffengine

YUDLbots

DPLA bots

Other

#climatemarch tweets

April 19-May 3, 2017

681,668 tweet ids for #climate collected with Documenting the Now’s twarc from January 22-26, 2017.

Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.com/DocNow/twarc):

twarc.py hydrate climatemarch_tweet_ids.txt > climatemarch.json

Dataset, Image montage, Web crawl, Tweet volume, Search API tweet volume, Filter API tweet volume

#MarchForScience tweets

April 12-26, 2017

1,276,220 tweet ids for #MarchForScience collected with Documenting the Now’s twarc from January 22-26, 2017.
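Rehydrating a file of tweet ids means reading one id per line and asking the Twitter API for the full tweets in batches. twarc takes care of this itself; the following is only a minimal sketch of the batching step, assuming a batch size of 100 (the per-request limit of Twitter's v1.1 statuses/lookup endpoint). The helper names `read_ids` and `batches` are hypothetical and are not part of twarc.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield tweet ids from an iterable of lines, skipping blank lines."""
    for line in lines:
        tweet_id = line.strip()
        if tweet_id:
            yield tweet_id


def batches(ids: Iterable[str], size: int = 100) -> Iterator[List[str]]:
    """Group ids into lists of at most `size` items (assumed API limit: 100)."""
    iterator = iter(ids)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Stand-in for the lines of a file such as climatemarch_tweet_ids.txt.
    sample_lines = ["1\n", "2\n", "\n", "3\n"]
    for chunk in batches(read_ids(sample_lines), size=2):
        print(chunk)
```

Each batch would then be sent to the API in a single lookup request; in practice the twarc command shown above does this and writes the resulting JSON for you.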

Projects

Archives Unleashed Project

Archives Unleashed aims to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past. Supported by a grant from the Andrew W. Mellon Foundation, we will be developing web archive search and data analysis tools to enable scholars and librarians to access, share, and investigate recent history since the early days of the World Wide Web.

Web Archives for Historical Research

Our research focuses both on web histories (writing about the recent past as reflected in web archives) and on methodological approaches to understanding these repositories.

Islandora CLAW

Islandora CLAW is the next generation of Islandora.

Fedora Repository

Fedora is the flexible, modular, open source repository platform with native linked data support.
