Recent Publications

Content Selection and Curation for Web Archiving: The Gatekeepers vs. the Masses


Desiderata for Exploratory Search Interfaces to Web Archives in Support of Scholarly Activities


An Open-Source Strategy for Documenting Events: The Case Study of the 42nd Canadian Federal Election on Twitter


The great WARC adventure: Using SIPS, AIPS, and DIPS to document SLAPPs


Recent & Upcoming Talks


Recent Posts


I’ve been collecting tweets to @realDonaldTrump since June 2017. The last time I pulled the dataset together and deduped it, I asked myself, “I wonder how many occurrences of ‘fuck’ are in the dataset.” Or, how many fucks are there to give? Well… The data is updated by running a query on the Standard Search API every five days:

$ twarc search 'to:realdonaldtrump' --log donald_search_$DATE.log > donald_search_$DATE.jsonl

Which yields something like this every five days.
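The dedupe-and-count step can be sketched in a few lines of Python. This is a minimal sketch, not the method used in the post. It assumes each line of the .jsonl output is one tweet object with an `id_str` field and its text in either `full_text` or `text`. The function name `count_term` and the sample tweets are hypothetical, and matching is a plain case-insensitive substring count.

```python
import json


def count_term(lines, term):
    """Count case-insensitive substring matches of `term` across unique tweets."""
    needle = term.lower()
    seen_ids = set()
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        tweet = json.loads(line)
        tweet_id = tweet.get("id_str")  # assumed field name
        if tweet_id is not None:
            if tweet_id in seen_ids:
                continue  # duplicate picked up by overlapping searches
            seen_ids.add(tweet_id)
        # Assumed text fields: extended tweets use full_text, others use text.
        text = tweet.get("full_text") or tweet.get("text") or ""
        total += text.lower().count(needle)
    return total


# Hypothetical sample data; tweet "1" appears twice and is counted once.
sample = [
    '{"id_str": "1", "full_text": "covfefe and more covfefe"}',
    '{"id_str": "1", "full_text": "covfefe and more covfefe"}',
    '{"id_str": "2", "text": "Covfefe!"}',
]
print(count_term(sample, "covfefe"))
```

To run it against a real file, pass an open file handle in place of `sample`, since iterating a file yields one line at a time.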


One feature of Blacklight that I’ve always wanted to set up in Warclight is displaying thumbnails in the results display. Getting this set up is a bit tricky. But, since Warclight is standardizing metadata on webarchive-discovery’s Solr schema.xml, we can avail ourselves of a number of fields for a potential implementation. The url field is the obvious choice, but the problem is that Blacklight out of the box will try to display a thumbnail for every url field value you give to config.


At this past week’s Archives Unleashed datathon, I jokingly created some word clouds of my Co-PIs’ timelines.

Finished my most likely bigly winning #hackarchives project: A Word Cloud of @lintool's timeline!
- nick ruest (@ruebot) April 27, 2018

Or, @ianmilligan1 #HackArchives
- nick ruest (@ruebot) April 27, 2018

Mat Kelly asked about the process this morning, so here is a little how-to of the pipeline:

Requirements: twarc, jq, wordcloud_cli.
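As a rough illustration of the middle of such a pipeline, the text-extraction step can be written in Python. This is a stand-in sketch for that step, not the jq command from the post. It assumes each input line is a tweet object with its text in `full_text` or `text`. The function name `tweet_texts` and the sample tweets are hypothetical.

```python
import json


def tweet_texts(lines):
    """Yield the text of each tweet, with whitespace runs collapsed to one space."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        tweet = json.loads(line)
        # Assumed text fields: extended tweets use full_text, others use text.
        text = tweet.get("full_text") or tweet.get("text") or ""
        yield " ".join(text.split())


# Hypothetical sample timeline, one JSON object per line.
sample = [
    '{"full_text": "web archives\\nare great"}',
    '{"text": "more   web archives"}',
]
corpus = "\n".join(tweet_texts(sample))
print(corpus)
```

The resulting plain text, one tweet per line, is the kind of input a word cloud generator expects.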


This is the text for my presentation at the “National Forum on Ethics and Archiving the Web”. I had the honour of being on an Archiving Trauma panel with some great people. Michael Connor, Chido Muchemwa, Coral Salomón, Tonia Sutherland, and Lauren Work, thank you for sharing your stories! The world is a beautiful and terrible place. Twitter can be beautiful. Twitter is fucking awful. So, capturing traumatic events on Twitter.


This past week Compute Canada provided us with resources to set up our Solr Cloud instance for WALK and Archives Unleashed. We were able to get things set up relatively quickly thanks to a bit of preparation and practice on our local machines in the previous weeks. Once everything was set up (5 virtual machines total; 4 Solr Cloud nodes and one indexer – details below), we started benchmarking webarchive-discovery and our Solr Cloud setup with GNU Parallel.



Archives Unleashed Project

Archives Unleashed aims to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past. Supported by a grant from the Andrew W. Mellon Foundation, we will be developing web archive search and data analysis tools to enable scholars and librarians to access, share, and investigate recent history since the early days of the World Wide Web.

Web Archives for Historical Research

Our research focuses both on web histories - writing about the recent past as reflected in web archives - and on methodological approaches to understanding these repositories.

Islandora CLAW

Islandora CLAW is the next generation of Islandora.

Fedora Repository

Fedora is the flexible, modular, open source repository platform with native linked data support.



#bataclan #charliehebdo #elxn42 #futurelibs #jesuisahmed #jesuischarlie #jesuisjuif #makedonalddrumpfagain #or2013 #panamapapers #paris #parisattacks #porteouverte #womensmarch aaronsw academic-librarians academic-libraries access-2008 access-2011 access-copyright activism alex-alvarez analytics anon apache-solr apache-spark apache2 apeshit-simians archive-team archivematica archives art artefactual-systems bacon bash-scripts benchmarking bepress best-practices big-data blacklight blogvsbook book-scanner book-scanning bots canada claw cloud-computing code comic committee community concentration-camp-correspondences congressedits data data-mining datasets deprofessionalization detroit digital-collections digital-commons digital-curation digital-history digital-humanities digital-library digital-odyssey digital-preservation digital-projects digital-repositories digital-repository digital-technologies digitalcommons digitization discussion djatoka-projects documentation donald-trump dpla dpla-api drupal dublin-core e-journal e-journals electronic-theses-and-dissertations elxn42 emerging-technologies ethics eye-candy faculty-associations fail fedora fedora-commons fits free-software free-software-foundation fucks fuppes future-of-academic-librarianship galleries garage-rock gccaedits geojson gestapo-camps git github google-docs google-groups gource grad-school hackfest hadoop hamilton help-desk hipinkingston historical-perspectives-on-canadian-publishing history history-of-canadian-publishing holocaust hpcanpub ica-atom ideas ie iipc image imageapi imagemagick information-retrieval infrastructure institutional-repository internet-archive internment-camps ir islandora javascript jesuischarlie john-degen journal jpeg2000 json kirtas-2400ra labor labour labour-relations law leadership leaflet.js learning lecture liaison liberation-technology librarianship libraries libraries-are-essential library-2.0 library-and-information-science library-apps library-technologies 
librarydayinthelife linked-data-platform live-off-the-floor love-libraries matterwave mcmaster-university mcmaster-university-library meetings meme merge metadata mets/alto microsoft modules montage muala mungus music mysql nginx nicholas-griffin node-import oai oai-pmh oais oclc ocr ola olita olita-digital-odyssey-2009 open-access open-access-week open-repositories open-source open-source-software openshot opensource panamapapers php podcasts politics presentation project-planning public-libraries pw20c pymarc pypi python renton repositories research-collections research-help-desk rhizome richard-routley richard-stallman richard-sylvan ripcd rob-ford ruby-on-rails satire scholarly-communication semantic-web shirtlesshorde sky-river social-justice solr-cloud soundtrack sparc sql streaming surnom-de-gorille sword ted tedx tedx-librariansto textual-analysis the-achievements the-humans the-potions thehip thought-leader thumbnails topoli toronto torrent trainspotting trauma twarc twarc-report twitter union unionization upnp usability vbo views views-bulk-operations visualization wahr warc warclight wardrobes wayback wayback-machine web-archives web-archives-for-historical-research web-archiving webarchiving wget wget-warc wikipedia wireframes womensmarch wordcloud work-from-home world-wide-web wwii xml. ymmfire zookeeper zorton-and-the-cannibals