14,478,518 WomensMarch tweets January 12-28, 2017

Overview

A couple of Saturday mornings ago, I was on the couch listening to records and reading a book when Christina Harlow and MJ Suhonos asked me about collecting #WomensMarch tweets. Little did I know at the time that #WomensMarch would be the largest-volume collection I have ever seen. By the time I stopped collecting a week later, we’d amassed 14,478,518 unique tweet ids from 3,582,495 unique users, and at one point hit around 1 million tweets in a single hour.

http://ruebot.net/WomensMarch_tweet_volume.html (Generated with Peter Binkley’s twarc-report)

This put #WomensMarch well over 1% of the overall Twitter stream, which causes dropped tweets if you’re collecting from the Filter API, so I collected with both the Filter and Search APIs. (If you’re curious to learn more about this, check out Kevin Driscoll and Shawn Walker’s “Big Data, Big Questions | Working Within a Black Box: Transparency in the Collection and Production of Big Twitter Data”, and Jiaul H. Paik and Jimmy Lin’s “Do Multiple Listeners to the Public Twitter Sample Stream Receive the Same Tweets?”) I’ve included the search and filter logs in the dataset. If you run grep "WARNING" WomensMarch_filter.log or grep "WARNING" WomensMarch_filter.log | wc -l you’ll get a sense of the scale of dropped tweets. For a number of hours on January 22, I was seeing around 1.6 million cumulative dropped tweets!

http://ruebot.net/WomensMarch_dropped_tweets.png

I collected from around 11AM EST on January 21, 2017 to 11AM EST January 28, 2017 with the Filter API, and did two Search API queries. Final count before deduplication looked like this:

$ wc -l  WomensMarch_filter.json WomensMarch_search_01.json WomensMarch_search_02.json
     7906847 WomensMarch_filter.json
     1336505 WomensMarch_search_01.json
     9602777 WomensMarch_search_02.json
    18846129 total

Final stats after deduplication: 14,478,518 unique tweets in a 104GB JSON file!
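The gap between the raw line count and the final tweet count comes from the same tweet turning up in more than one collection. A minimal sketch of that kind of deduplication is below. It assumes each line of the files is a single tweet as a JSON object carrying an `id_str` field; the `deduplicate` function and the sample lines are illustrative only and are not twarc's own code.

```python
import json


def deduplicate(lines):
    """Yield the first occurrence of each tweet, keyed on its id_str."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        tweet_id = tweet["id_str"]
        if tweet_id not in seen:
            seen.add(tweet_id)
            yield tweet


# Tiny in-memory stand-ins for the filter and search files (hypothetical data).
filter_lines = ['{"id_str": "1", "text": "a"}', '{"id_str": "2", "text": "b"}']
search_lines = ['{"id_str": "2", "text": "b"}', '{"id_str": "3", "text": "c"}']

unique = list(deduplicate(filter_lines + search_lines))
print(len(unique), "unique tweets")
```

For the real files, the same generator could be fed open file handles one after another, since it only keeps the set of ids in memory rather than the tweets themselves.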

This puts us in the same range as what Ryan Gallagher projected in “A Bird’s-Eye View of #WomensMarch.”

Below I’ll give a quick overview of the dataset using utilities from Documenting the Now’s twarc, and utilities described inline. This is the same approach Ian Milligan and I took in our 2016 Code4Lib Journal article, “An Open-Source Strategy for Documenting Events: The Case Study of the 42nd Canadian Federal Election on Twitter.” This is probably all that I’ll have time to do with the dataset. Please feel free to use it in your own research. It’s licensed CC-BY, so please have at it! :-)

…and if you want access to other Twitter datasets to analyse, check out http://www.docnow.io/catalog/.

Users

Tweets  Username
5,375   paparcura
4,703   latinagirlpwr
1,903   ImJacobLadder
1,236   unbreakablepenn
1,212   amForever44
1,178   BassthebeastNYC
1,170   womensmarch
1,017   WhyIMarch
982     TheLifeVote
952     zerocomados


3,582,495 unique users.
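A table like the one above boils down to counting tweets per screen name. Here is a rough sketch, assuming line-oriented tweet JSON in which each tweet has a `user.screen_name` field; `top_users` is a hypothetical helper and the sample data is made up.

```python
import json
from collections import Counter


def top_users(lines, n=10):
    """Return the n most active screen names and the number of unique users."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts[json.loads(line)["user"]["screen_name"]] += 1
    return counts.most_common(n), len(counts)


# Made-up sample: six tweets from three accounts.
sample = [
    json.dumps({"user": {"screen_name": name}})
    for name in ["a", "b", "a", "c", "a", "b"]
]

top, unique_users = top_users(sample, n=2)
for screen_name, tweets in top:
    print(f"{tweets:,}\t{screen_name}")
print(f"{unique_users:,} unique users.")
```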

Retweets

146,370 Retweets
141,111 Retweets
109,865 Retweets
84,161 Retweets
70,600 Retweets
62,591 Retweets
59,366 Retweets
56,365 Retweets
52,125 Retweets
50,944 Retweets
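One way to arrive at a ranking like this is to look at every retweet in the dataset and keep the highest `retweet_count` seen for each original tweet. The sketch below assumes retweets carry a `retweeted_status` object with `id_str` and `retweet_count` fields; `top_retweets` is a hypothetical helper, not the utility used for the numbers above.

```python
import json


def top_retweets(lines, n=10):
    """Return (retweet_count, original_tweet_id) pairs, highest count first."""
    best = {}
    for line in lines:
        if not line.strip():
            continue
        original = json.loads(line).get("retweeted_status")
        if original is None:
            continue  # not a retweet
        tweet_id = original["id_str"]
        # Counts grow over time, so keep the largest value observed.
        best[tweet_id] = max(best.get(tweet_id, 0), original["retweet_count"])
    ranked = sorted(((count, tid) for tid, count in best.items()), reverse=True)
    return ranked[:n]


# Made-up sample: two retweets of tweet 1, one of tweet 2, one original tweet.
sample = [
    json.dumps({"id_str": "10", "retweeted_status": {"id_str": "1", "retweet_count": 5}}),
    json.dumps({"id_str": "11", "retweeted_status": {"id_str": "1", "retweet_count": 7}}),
    json.dumps({"id_str": "12", "retweeted_status": {"id_str": "2", "retweet_count": 3}}),
    json.dumps({"id_str": "13"}),
]

ranking = top_retweets(sample)
for count, tweet_id in ranking:
    print(f"{count:,} Retweets\t{tweet_id}")
```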

Clients

Tweets     Client
7,098,145  Twitter for iPhone
3,718,467  Twitter for Android
2,066,773  Twitter for iPad
634,054    Twitter Web Client
306,225    Mobile Web (M5)
127,622    TweetDeck
59,463     Instagram
54,851     Tweetbot for iOS
47,556     Twitter for Windows
36,404     IFTTT
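In the tweet JSON, the client name lives in the `source` field, wrapped in an HTML anchor tag. A small sketch of pulling the readable name out and counting it follows; `client_counts` is a hypothetical helper and the sample values are illustrative.

```python
import json
import re
from collections import Counter

# Matches any HTML tag, e.g. <a href="..."> or </a>.
TAG = re.compile(r"<[^>]+>")


def client_counts(lines):
    """Count tweets per client, stripping the HTML anchor around the name."""
    counts = Counter()
    for line in lines:
        if line.strip():
            source = json.loads(line).get("source", "")
            counts[TAG.sub("", source)] += 1
    return counts


# Illustrative sample of source values.
sources = [
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
]
sample = [json.dumps({"source": s}) for s in sources]

counts = client_counts(sample)
for client, tweets in counts.most_common(10):
    print(f"{tweets:,}\t{client}")
```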

URLs

Tweets       URL
29,223        https://www.facebook.com/cnn/videos/10155945796281509/
27,435       http://www.cnn.com/2017/01/21/politics/womens-march-donald-trump-inauguration-sizes/index.html?sr=twCNN012117womens-march-donald-trump-inauguration-sizes0205PMStoryGal
24,854       http://www.independent.co.uk/news/world/americas/womens-march-antarctica-donald-trump-inauguration-women-hate-donald-trump-so-much-they-are-even-a7538856.html
21,189       https://twitter.com/kayleighmcenany/status/822979246205403136
20,902       https://twitter.com/mcgregor_ewan/status/823805815488331776
14,857       http://www.cnn.com/2017/01/21/politics/womens-march-donald-trump-inauguration-sizes/index.html?sr=twpol012117womens-march-donald-trump-inauguration-sizes0832PMVODtopLink&linkId=33643748
12,630       https://www.womensmarch.com/sisters
11,244       https://twitter.com/tomilahren/status/822852245532319744
9,761       https://twitter.com/mstharrington/status/823190136200593408
9,585       http://www.cnn.com/2017/01/21/politics/womens-march-protests-live-coverage/index.html?sr=twCNN012117womens-march-protests-live-coverage1208PMVODtop


2,403,637 URLs tweeted, with 527,350 of those being unique URLs.
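Both numbers, total and unique, can come from one pass over the tweets. The sketch below assumes links sit under `entities.urls` with an `expanded_url` field, falling back to the shortened `url`; `url_counts` is a hypothetical helper, and it does no unshortening of its own.

```python
import json
from collections import Counter


def url_counts(lines):
    """Count every URL found in the entities.urls list of each tweet."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        tweet = json.loads(line)
        for url in tweet.get("entities", {}).get("urls", []):
            expanded = url.get("expanded_url") or url.get("url")
            if expanded:
                counts[expanded] += 1
    return counts


# Made-up sample: three links across three tweets, two of them identical.
sample = [
    json.dumps({"entities": {"urls": [{"expanded_url": "https://www.womensmarch.com/sisters"}]}}),
    json.dumps({"entities": {"urls": [{"expanded_url": "https://www.womensmarch.com/sisters"},
                                      {"expanded_url": "http://example.org/"}]}}),
    json.dumps({"entities": {"urls": []}}),
]

counts = url_counts(sample)
total, unique = sum(counts.values()), len(counts)
print(f"{total:,} URLs tweeted, {unique:,} unique")
```

Writing `sorted(counts)` out one URL per line would give a file in the shape of a unique-URL list.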

I’ve also set up a little bash script to feed all the unique URLs to the Internet Archive:

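#!/bin/bash

# Submit every unique URL to the Internet Archive's save endpoint.
URLS=/path/to/WomensMarch_urls_uniq.txt
# Count the URLs up front instead of hard-coding the total.
total=$(wc -l < "$URLS" | tr -d '[:space:]')
index=0

# -r keeps backslashes literal; IFS= preserves leading/trailing whitespace.
while IFS= read -r line; do
  curl -s -S "https://web.archive.org/save/$line" > /dev/null
  index=$((index + 1))
  echo "$index/$total submitted to Internet Archive"
  sleep 1  # pause between requests
done < "$URLS"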

I’ve also set up a crawl with Heritrix, and I’ll make that data available here once it is complete.

Domains

Tweets     Domain
1,219,747  twitter.com
159,087    instagram.com
134,309    cnn.com
68,479     facebook.com
50,561     womensmarch.com
43,219     youtube.com
36,946     nytimes.com
30,201     huffingtonpost.com
21,520     paper.li
21,476     cbsnews.com
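Rolling URLs up into domains is mostly a matter of parsing out the host name. A minimal sketch using Python's standard library is below; stripping a leading `www.` is an assumption made here so that hosts line up with the bare domains in the table, and `domain_counts` is a hypothetical helper.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_counts(urls):
    """Count URLs per host name, with any leading 'www.' removed."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname or ""
        if host.startswith("www."):
            host = host[len("www."):]
        if host:
            counts[host] += 1
    return counts


# A few URLs taken from the URL table above.
urls = [
    "https://www.facebook.com/cnn/videos/10155945796281509/",
    "https://twitter.com/kayleighmcenany/status/822979246205403136",
    "https://twitter.com/mcgregor_ewan/status/823805815488331776",
    "https://www.womensmarch.com/sisters",
]

domains = domain_counts(urls)
for domain, tweets in domains.most_common(10):
    print(f"{tweets:,}\t{domain}")
```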

Embedded Images

Tweets Image
146,442      
81,139       
71,877       
64,149       
59,214       
58,599       
51,439       
44,611       
43,845       
41,436       


6,153,894 embedded image URLs tweeted, with 390,298 of those being unique URLs.

I’ll be creating an image montage for #WomensMarch similar to what I did for #elxn42 and #panamapapers. It’ll take some time, and I have to gather resources to make it happen since we’re looking at about five times the number of images.

Nick Ruest
Associate Librarian
