A couple Saturday mornings ago, I was on the couch listening to records and reading a book when Christina Harlow and MJ Suhonos asked me about collecting #WomensMarch tweets. Little did I know at the time #WomensMarch would be the largest volume collection I have ever seen. By the time I stopped collecting a week later, we’d amassed 14,478,518 unique tweet ids from 3,582,495 unique users, and at one point hit around 1 million tweets in a single hour.
This put #WomensMarch well over 1% of the overall Twitter stream, which causes dropped tweets if you’re collecting from the Filter API, so I collected with both the Filter and Search APIs. (If you’re curious about learning more about this, check out Kevin Driscoll and Shawn Walker’s “Big Data, Big Questions | Working Within a Black Box: Transparency in the Collection and Production of Big Twitter Data”, and Jiaul H. Paik and Jimmy Lin’s “Do Multiple Listeners to the Public Twitter Sample Stream Receive the Same Tweets?”). I’ve included the search and filter logs in the dataset. If you run
grep "WARNING" WomensMarch_filter.log
or
grep "WARNING" WomensMarch_filter.log | wc -l
you’ll get a sense of the scale of dropped tweets. For a number of hours on January 22, I was seeing around 1.6 million cumulative dropped tweets!
I collected from around 11AM EST on January 21, 2017 to 11AM EST January 28, 2017 with the Filter API, and did two Search API queries. Final count before deduplication looked like this:
$ wc -l WomensMarch_filter.json WomensMarch_search_01.json WomensMarch_search_02.json
   7906847 WomensMarch_filter.json
   1336505 WomensMarch_search_01.json
   9602777 WomensMarch_search_02.json
  18846129 total
Final stats: 14,478,518 tweets in a 104GB json file!
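For anyone following along, the deduplication step between the raw count and the final count can be sketched roughly as below. This is not the script used to build the dataset (twarc ships its own deduplication utility); it is a minimal standalone illustration that assumes each line of the files is one tweet as a JSON object with a numeric `id` field. The function names and the output filename are hypothetical.

```python
import json


def dedupe_tweets(lines):
    """Yield each non-empty line whose tweet id has not been seen yet."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: one tweet per line, with a numeric "id" field.
        tweet_id = json.loads(line)["id"]
        if tweet_id not in seen:
            seen.add(tweet_id)
            yield line


def dedupe_files(in_paths, out_path):
    """Write the unique tweets from in_paths to out_path; return how many were kept."""
    def all_lines():
        for path in in_paths:
            with open(path, encoding="utf-8") as handle:
                yield from handle

    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for line in dedupe_tweets(all_lines()):
            out.write(line + "\n")
            kept += 1
    return kept


# Example call (the output filename is hypothetical):
# dedupe_files(
#     ["WomensMarch_filter.json",
#      "WomensMarch_search_01.json",
#      "WomensMarch_search_02.json"],
#     "WomensMarch_deduped.json",
# )
```

Only the ids are held in memory, so the set stays manageable even at roughly 14 million tweets, while the tweets themselves are streamed line by line.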
Below I’ll give a quick overview of the dataset using utilities from Documenting the Now’s twarc, and utilities described inline. This is the same approach Ian Milligan and I took in our 2016 Code4Lib Journal article, “An Open-Source Strategy for Documenting Events: The Case Study of the 42nd Canadian Federal Election on Twitter.” This is probably all that I’ll have time to do with the dataset. Please feel free to use it in your own research. It’s licensed CC-BY, so please have at it! :-)
…and if you want access to other Twitter datasets to analyse, check out http://www.docnow.io/catalog/.
3,582,495 unique users.
Yes we can.— White House Archived (@ObamaWhiteHouse) January 20, 2017
Yes we did.
Thank you for being a part of the past eight years. pic.twitter.com/mjmr4RkxpV
Thanks for standing, speaking & marching for our values @womensmarch. Important as ever. I truly believe we're always Stronger Together.— Hillary Clinton (@HillaryClinton) January 21, 2017
I'm here today to honor our democracy & its enduring values. I will never stop believing in our country & its future. #Inauguration— Hillary Clinton (@HillaryClinton) January 20, 2017
Congratulations to the women marching today. We must go forward to ensure full reproductive justice for all women. #WomensMarch— Bernie Sanders (@SenSanders) January 21, 2017
Hi everybody! Back to the original handle. Is this thing still on? Michelle and I are off on a quick vacation, then we’ll get back to work.— Barack Obama (@BarackObama) January 20, 2017
| Tweets | Client |
|---|---|
| 7,098,145 | Twitter for iPhone |
| 3,718,467 | Twitter for Android |
| 2,066,773 | Twitter for iPad |
| 634,054 | Twitter Web Client |
| 306,225 | Mobile Web (M5) |
| 54,851 | Tweetbot for iOS |
| 47,556 | Twitter for Windows |
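A client tally like the one above can be sketched as follows. It assumes line-delimited tweet JSON in which each tweet's `source` field holds the client name wrapped in an HTML anchor, as in Twitter's v1.1 tweet objects. The function name is hypothetical, and this is an illustration rather than the utility that produced the table.

```python
import json
import re
from collections import Counter

# Matches any HTML tag, so that '<a href="...">Twitter for iPhone</a>'
# becomes 'Twitter for iPhone'.
TAG = re.compile(r"<[^>]+>")


def client_counts(lines):
    """Count tweets per client name, reading one tweet JSON object per line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: "source" is an HTML anchor around the client name.
        source = json.loads(line).get("source") or ""
        counts[TAG.sub("", source)] += 1
    return counts


# Example: print the most common clients from an open file handle.
# with open("WomensMarch.json", encoding="utf-8") as handle:  # hypothetical filename
#     for name, total in client_counts(handle).most_common(7):
#         print(f"{total:,}\t{name}")
```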
2,403,637 URLs tweeted, with 527,350 of those being unique urls.
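The two figures here, total URLs and unique URLs, can be derived along these lines. This is a sketch under the assumption that each tweet carries its links in `entities.urls`, with the resolved address in `expanded_url` as in Twitter's v1.1 tweet objects. The function names are hypothetical, and the exact numbers depend on which URL field is used and on whether links are unshortened first.

```python
import json


def tweet_urls(lines):
    """Yield every link found in the tweets, one tweet JSON object per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        for url in (tweet.get("entities") or {}).get("urls", []):
            # Prefer the expanded link; fall back to the t.co short link.
            link = url.get("expanded_url") or url.get("url")
            if link:
                yield link


def url_stats(lines):
    """Return (total number of URLs tweeted, set of unique URLs)."""
    total = 0
    unique = set()
    for link in tweet_urls(lines):
        total += 1
        unique.add(link)
    return total, unique
```

Writing the unique set out one URL per line gives a file in the shape of `WomensMarch_urls_uniq.txt`.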
I’ve also set up a little bash script to feed all the unique urls to Internet Archive:
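#!/bin/bash

# One URL per line, already deduplicated.
URLS=/path/to/WomensMarch_urls_uniq.txt
index=0

while IFS= read -r line; do
  # Ask the Wayback Machine to capture the URL; discard the response body.
  curl -s -S "https://web.archive.org/save/$line" > /dev/null
  index=$((index + 1))
  echo "$index/527350 submitted to Internet Archive"
  # Pause between requests to go easy on the service.
  sleep 1
done < "$URLS"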
6,153,894 embedded image URLs tweeted, with 390,298 of those being unique urls.
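The gap between the total and unique image counts comes largely from the same image being shared many times, and a tally makes that visible. The sketch below assumes v1.1-style tweet objects, where photos appear under `extended_entities.media` (or `entities.media` on older tweets) with a `media_url_https` field. The function names are hypothetical, and the counts it gives may differ from the figures above depending on which of those fields the original utility read.

```python
import json
from collections import Counter


def image_urls(tweet):
    """Return the photo URLs embedded in a single tweet dict."""
    # Assumption: extended_entities lists every photo; entities only the first.
    entities = tweet.get("extended_entities") or tweet.get("entities") or {}
    return [
        media["media_url_https"]
        for media in entities.get("media", [])
        if media.get("type") == "photo" and "media_url_https" in media
    ]


def image_counts(lines):
    """Count how often each image URL appears, one tweet JSON object per line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        counts.update(image_urls(json.loads(line)))
    return counts


# total images tweeted: sum(counts.values())
# unique images:        len(counts)
```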
I’ll be creating an image montage for #WomensMarch similar to what I did for #elxn42 and #panamapapers. It’ll take some time, and I have to gather resources to make it happen since we’re looking at about five times as many images for #WomensMarch.