Nearly four years ago I decided to start collecting tweets to Donald Trump out of morbid curiosity. If I were a real archivist, I would have planned this out a little better, and started collecting on election night in 2016, or inauguration day 2017. I didn't. Using twarc, I started collecting with the Filter (Streaming) API on May 7, 2017. That process failed, and I pivoted to using the Search API.
Juxta

A couple years ago I wrote about a method for creating a collage out of 1.2M images collected from the 2015 Canadian Federal Election Twitter dataset. That method was very resource-intensive in terms of the temporary disk storage required to create the collage. As the number of images in a given collage increased, the amount of temporary disk space scaled exponentially; 3.5T for 1.2M #elxn42 images, and ~90T for 6.
At this past week’s Archives Unleashed datathon, I jokingly created some word clouds of my Co-PIs’ timelines.
Finished my most likely bigly winning #hackarchives project: A Word Cloud of @lintool's timeline! https://t.co/eK2KPGjaGo
— nick ruest (@ruebot) April 27, 2018
Or, @ianmilligan1 #HackArchives https://t.co/qMxiet0osl
— nick ruest (@ruebot) April 27, 2018
Mat Kelly asked about the process this morning, so here is a little how-to of the pipeline:
Requirements:
twarc
jq
wordcloud_cli
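The pipeline can be sketched as three commands. This is a hedged sketch, not the exact invocation from the datathon: it assumes twarc v1 (with Twitter API keys already configured), the `wordcloud_cli` entry point from the Python word_cloud package, and `lintool` as the example handle.

```shell
# 1. Grab a user's timeline as line-delimited JSON with twarc.
twarc timeline lintool > lintool.jsonl

# 2. Pull the tweet text out of each JSON line with jq.
#    (twarc v1 requests extended tweets, so the text lives in .full_text;
#    non-extended tweets would use .text instead.)
jq -r '.full_text' lintool.jsonl > lintool.txt

# 3. Render the text as a word cloud image.
wordcloud_cli --text lintool.txt --imagefile lintool.png
```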
#healthcanada #NACI #fordnation #medicalfreedom #covid19 #covid19vaccines #protectourfamilies #protectyourchildren #holdtheline tweets

2,661,117 tweet ids for #healthcanada, #NACI, #fordnation, #medicalfreedom, #covid19, #covid19vaccines, #protectourfamilies, #protectyourchildren, and #holdtheline tweets, collected with Documenting the Now’s twarc. Tweets can be “rehydrated” with Documenting the Now’s twarc, or Hydrator.
twarc hydrate tweet-ids.txt tweets.jsonl

ID files are available for all hashtags combined, or for individual hashtags:

covid19-ids.txt
covid19vaccines-ids.txt
fordnation-ids.txt
healthcanada-ids.txt
healthcanada-NACI-fordnation-medicalfreedom-covid19-covid19vaccines-protectourfamilies-protectyourchildren-holdtheline-ids.txt
holdtheline-ids.txt
medicalfreedom-ids.txt
NACI-ids.txt
protectyourchildren-ids.txt

Tweets were collected via the Standard Search API on:

November 18, 2021
November 21, 2021
November 26, 2021
December 1, 2021

Dataset

#elxn44 tweets (44th Canadian Federal Election)

2,075,645 tweet ids for #elxn44 tweets, collected with Documenting the Now's twarc.
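As a sketch, hydrating one of the ID files above looks like this. It assumes twarc v1, which writes hydrated tweets to stdout; `covid19-ids.txt` is one of the per-hashtag files, and the output filename is an example. Deleted and suspended tweets will not come back, so the hydrated count is usually lower than the ID count.

```shell
# Hydrate tweet IDs back into full JSON tweets (twarc v1 prints to stdout).
twarc hydrate covid19-ids.txt > covid19.jsonl

# Count how many tweets were actually recovered.
wc -l < covid19.jsonl
```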
Overview A couple Saturday mornings ago, I was on the couch listening to records and reading a book when Christina Harlow and MJ Suhonos asked me about collecting #WomensMarch tweets. Little did I know at the time #WomensMarch would be the largest volume collection I have ever seen. By the time I stopped collecting a week later, we’d amassed 14,478,518 unique tweet ids from 3,582,495 unique users, and at one point hit around 1 million tweets in a single hour.
On November 13, 2015 I was at the “Web Archives 2015: Capture, Curate, Analyze” conference listening to Ian Milligan give the closing keynote when Thomas Padilla tweeted the following to me:
@ruebot terrible news, possible charlie hebdo connection - https://t.co/SkEusgqgz5
— Thomas Padilla (@thomasgpadilla) November 13, 2015
I immediately started collecting.
When tragedies like this happen, I feel pretty powerless. But, I figure if I can collect something like this, similar to what I did for the Charlie Hebdo attacks, it’s something.
#JeSuisCharlie #JeSuisAhmed #JeSuisJuif #CharlieHebdo

I’ve spent the better part of a month collecting tweets from the #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, and #CharlieHebdo hashtags. Last week, I pulled together all of the collection files, did some cleanup, and some more analysis on the data set (76G of json!). This time I was able to take advantage of Peter Binkley’s twarc-report project. According to the report, the earliest tweet in the data set is from 2015-01-07 11:59:12 UTC, and the last tweet in the data set is from 2015-01-28 18:15:35 UTC.
#JeSuisAhmed

Had some time last night to do some exploratory analysis on some of the #JeSuisAhmed collection. This analysis runs from the first #JeSuisAhmed tweet I was able to harvest through to some time on January 14, 2015, when I copied over the json to experiment with a few of the twarc utilities.
First tweet in data set:
#JeSuisAhmed Reveals the Hero of the Paris Shooting Everyone Needs to Know by @sophie_kleeman http://t.
Using the #JeSuisCharlie data set from January 11, 2015 (Warning! Will turn your browser into a potato for a few seconds), these are the image urls that have more than 1000 occurrences in the data set.
How to create (requires unshrtn):
% twarc.py --query "#JeSuisCharlie"
% ~/git/twarc/utils/deduplicate.py JeSuisCharlie-tweets.json > JeSuisCharlie-tweets-deduped.json
% cat JeSuisCharlie-tweets-deduped.json | utils/unshorten.py > JeSuisCharlie-tweets-deduped-ushortened.json
% ~/git/twarc/utils/image_urls.py JeSuisCharlie-tweets-deduped-ushortened.json >| JeSuisCharlie-20150115-image-urls.txt
% cat JeSuisCharlie-20150115-image-urls.txt | sort | uniq -c | sort -rn > JeSuisCharlie-20150115-image-urls-ranked.
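To pull out just the URLs with more than 1000 occurrences, a small awk filter over the ranked file works. This is a sketch: it assumes the ranked file is named `JeSuisCharlie-20150115-image-urls-ranked.txt` and contains the `count url` lines that `sort | uniq -c` produces.

```shell
# Keep only image URLs that appear more than 1000 times;
# $1 is the count column added by `uniq -c`, $2 is the URL.
awk '$1 > 1000 { print $2 }' JeSuisCharlie-20150115-image-urls-ranked.txt
```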