Twitter

A month of tweets at @realDonaldTrump

Twitter Bots

Twitter Datasets and Derivative data

#paradisepapers tweets November 3-25, 2017

1,797,260 tweet ids for #paradisepapers collected with Documenting the Now’s twarc from November 5-26, 2017. Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.com/DocNow/twarc):

twarc.py hydrate paradisepapers_ids.txt > paradisepapers.json

Or with Documenting the Now’s Hydrator: https://github.com/DocNow/hydrator

Dataset

#climatemarch tweets April 19-May 3, 2017

681,668 tweet ids for #climate collected with Documenting the Now’s twarc from January 22-26, 2017. Tweets can be “rehydrated” with Documenting the Now’s twarc (https://github.

14,478,518 WomensMarch tweets January 12-28, 2017

Overview

A couple of Saturday mornings ago, I was on the couch listening to records and reading a book when Christina Harlow and MJ Suhonos asked me about collecting #WomensMarch tweets. Little did I know at the time that #WomensMarch would be the largest-volume collection I have ever seen. By the time I stopped collecting a week later, we’d amassed 14,478,518 unique tweet ids from 3,582,495 unique users, and at one point hit around 1 million tweets in a single hour.
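The unique tweet and user counts above could be reproduced along these lines. This is a minimal sketch, not the method used for the collection: it assumes line-oriented tweet JSON (one tweet per line, as twarc writes it) with the `id_str` and `user.id_str` fields of the Twitter v1.1 format, and `count_unique` is a hypothetical helper name.

```python
import json


def count_unique(lines):
    """Return (unique tweet ids, unique user ids) for line-oriented tweet JSON.

    Assumes each non-empty line is one tweet object carrying "id_str" and
    a nested "user" object with its own "id_str".
    """
    tweet_ids = set()
    user_ids = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        tweet_ids.add(tweet["id_str"])
        user_ids.add(tweet["user"]["id_str"])
    return len(tweet_ids), len(user_ids)


# Tiny made-up sample: tweet "2" appears twice, two distinct users overall.
sample = [
    '{"id_str": "1", "user": {"id_str": "10"}}',
    '{"id_str": "2", "user": {"id_str": "10"}}',
    '{"id_str": "2", "user": {"id_str": "10"}}',
    '{"id_str": "3", "user": {"id_str": "11"}}',
]
print(count_unique(sample))
```

For a real collection, an open file handle can be passed in place of `sample`. At this volume the two sets hold millions of strings, so memory use is worth watching.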

1,203,867 elxn42 images

Background

Last August, I began capturing the #elxn42 hashtag as an experiment and potential research project with Ian Milligan. Once Justin Trudeau was sworn in as the 23rd Prime Minister of Canada, we stopped collection and began analysing the dataset. We wrote that analysis up for the Code4Lib Journal, which will be published in the next couple of weeks. In the interim, you can check out our pre-print here. Included in that dataset is a line-delimited list of the URL of every embedded image tweeted in the dataset: 1,203,867 images.

A look at 14,939,154 paris Bataclan parisattacks porteouverte tweets

On November 13, 2015, I was at “Web Archives 2015: Capture, Curate, Analyze” listening to Ian Milligan give the closing keynote when Thomas Padilla tweeted the following to me:

@ruebot terrible news, possible charlie hebdo connection - https://t.co/SkEusgqgz5
Thomas Padilla (@thomasgpadilla), November 13, 2015

I immediately started collecting. When tragedies like this happen, I feel pretty powerless. But I figure if I can collect something like this, similar to what I did for the Charlie Hebdo attacks, it’s something.

An Exploratory look at 13,968,293 JeSuisCharlie, JeSuisAhmed, JeSuisJuif, and CharlieHebdo tweets

#JeSuisCharlie #JeSuisAhmed #JeSuisJuif #CharlieHebdo

I’ve spent the better part of a month collecting tweets from the #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, and #CharlieHebdo hashtags. Last week, I pulled together all of the collection files, did some cleanup, and ran some more analysis on the data set (76G of JSON!). This time I was able to take advantage of Peter Binkley’s twarc-report project. According to the report, the earliest tweet in the data set is from 2015-01-07 11:59:12 UTC, and the last tweet in the data set is from 2015-01-28 18:15:35 UTC.
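The earliest and latest timestamps quoted above come from twarc-report. As an illustration only, and not how twarc-report itself works, the same two values could be pulled from line-oriented tweet JSON as sketched below. It assumes the Twitter v1.1 `created_at` format (for example `Wed Jan 07 11:59:12 +0000 2015`). `time_span` and the sample tweets are hypothetical; the sample simply reuses the two timestamps from the report.

```python
import json
from datetime import datetime

# Twitter v1.1 "created_at" layout, e.g. "Wed Jan 07 11:59:12 +0000 2015".
# Day and month names are parsed with the current locale (English assumed).
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def time_span(lines):
    """Return (earliest, latest) tweet datetimes from line-oriented tweet JSON."""
    earliest = None
    latest = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        created = datetime.strptime(
            json.loads(line)["created_at"], CREATED_AT_FORMAT
        )
        if earliest is None or created < earliest:
            earliest = created
        if latest is None or created > latest:
            latest = created
    return earliest, latest


# Made-up sample tweets, deliberately out of order.
sample = [
    '{"created_at": "Wed Jan 28 18:15:35 +0000 2015"}',
    '{"created_at": "Wed Jan 07 11:59:12 +0000 2015"}',
]
first, last = time_span(sample)
print(first.isoformat(), last.isoformat())
```

Tracking a running minimum and maximum keeps memory flat, which matters for a 76G collection.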

An exploratory look at 257,093 JeSuisAhmed tweets

#JeSuisAhmed

I had some time last night to do some exploratory analysis on part of the #JeSuisAhmed collection. This analysis covers the period from the first #JeSuisAhmed tweet I was able to harvest to some time on January 14, 2015, when I copied over the JSON to experiment with a few of the twarc utilities. First tweet in data set: #JeSuisAhmed Reveals the Hero of the Paris Shooting Everyone Needs to Know by @sophie_kleeman http://t.

JeSuisCharlie images

Using the #JeSuisCharlie data set from January 11, 2015 (Warning! Will turn your browser into a potato for a few seconds), these are the image urls that have more than 1000 occurrences in the data set.

How to create (requires unshrtn):

% twarc.py --query "#JeSuisCharlie"
% ~/git/twarc/utils/deduplicate.py JeSuisCharlie-tweets.json > JeSuisCharlie-tweets-deduped.json
% cat JeSuisCharlie-tweets-deduped.json | utils/unshorten.py > JeSuisCharlie-tweets-deduped-ushortened.json
% ~/git/twarc/utils/image_urls.py JeSuisCharlie-tweets-deduped-ushortened.json >| JeSuisCharlie-20150115-image-urls.txt
% cat JeSuisCharlie-20150115-image-urls.txt | sort | uniq -c | sort -rn > JeSuisCharlie-20150115-image-urls-ranked.
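The final `sort | uniq -c | sort -rn` step, together with the "more than 1000 occurrences" cut-off, can also be sketched in Python. This is a rough equivalent rather than part of twarc: `rank_urls` and its `more_than` parameter are hypothetical names, and the input is assumed to be one image URL per line.

```python
from collections import Counter


def rank_urls(urls, more_than=0):
    """Rank URLs by frequency, most frequent first.

    Keeps only URLs that occur more than `more_than` times, mirroring the
    "more than 1000 occurrences" filter when more_than=1000.
    """
    counts = Counter(u.strip() for u in urls if u.strip())
    return [(url, n) for url, n in counts.most_common() if n > more_than]


# Made-up sample standing in for the lines of an image-urls text file.
sample = [
    "http://example.com/a.jpg",
    "http://example.com/b.jpg",
    "http://example.com/a.jpg",
    "http://example.com/c.jpg",
    "http://example.com/a.jpg",
    "http://example.com/b.jpg",
]
for url, n in rank_urls(sample, more_than=1):
    print(n, url)
```

With a real file, `rank_urls(open(path), more_than=1000)` would yield the ranked list directly. Ties are returned in first-seen order, which can differ from the tie order `sort -rn` produces.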

Preliminary stats of JeSuisCharlie, JeSuisAhmed, JeSuisJuif, CharlieHebdo

#JeSuisAhmed

$ wc -l *json
 148479 %23JeSuisAhmed-20150109103430.json
  94874 %23JeSuisAhmed-20150109141746.json
   5885 %23JeSuisAhmed-20150112092647.json
 249238 total
$ du -h
2.7G .

#JeSuisCharlie

$ wc -l *json
3894191 %23JeSuisCharlie-20150109094220.json
1758849 %23JeSuisCharlie-20150109141730.json
 226784 %23JeSuisCharlie-20150112092710.json
     15 %23JeSuisCharlie-20150112092734.json
5879839 total
$ du -h
32G .

#JeSuisJuif

$ wc -l *json
  23694 %23JeSuisJuif-20150109172957.json
  50603 %23JeSuisJuif-20150109173104.json
   5941 %23JeSuisJuif-20150110003450.json
  42237 %23JeSuisJuif-20150112094500.json
   5064 %23JeSuisJuif-20150112094648.json
 127539 total
$ du -h
671M .

#CharlieHebdo

$ wc -l *json
4444585 %23CharlieHebdo-20150109172713.
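Because twarc writes one tweet per line, `wc -l` doubles as a tweet count per collection file. Where `wc` is not available, the same per-file and total counts could be sketched in Python as below. `count_newlines` and `line_report` are hypothetical helpers, and the demo files are throwaway stand-ins for the collection files.

```python
import os
import tempfile


def count_newlines(path, chunk_size=1 << 20):
    """Count newline bytes in a file, which is what `wc -l` reports."""
    total = 0
    with open(path, "rb") as f:
        # Read in fixed-size chunks so multi-gigabyte files fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            total += chunk.count(b"\n")
    return total


def line_report(paths):
    """Return ({path: line count}, grand total) for the given files."""
    counts = {p: count_newlines(p) for p in paths}
    return counts, sum(counts.values())


# Demo on two small temporary files.
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.json")
    b = os.path.join(tmp, "b.json")
    with open(a, "w") as f:
        f.write('{"id": 1}\n{"id": 2}\n')
    with open(b, "w") as f:
        f.write('{"id": 3}\n')
    counts, total = line_report([a, b])
    for path, n in counts.items():
        print(n, os.path.basename(path))
    print(total, "total")
```

For a real directory, `line_report(sorted(glob.glob("*json")))` would cover the same files as the shell glob. These counts include any duplicate tweets across overlapping collection runs.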