14,478,518 #WomensMarch tweets January 12-28, 2017


A couple of Saturday mornings ago, I was on the couch listening to records and reading a book when Christina Harlow and MJ Suhonos asked me about collecting #WomensMarch tweets. Little did I know at the time that #WomensMarch would become the largest-volume collection I have ever seen. By the time I stopped collecting a week later, we'd amassed 14,478,518 unique tweet ids from 3,582,495 unique users, and at one point hit around 1 million tweets in a single hour. (Generated with Peter Binkley's twarc-report)

This put #WomensMarch well over 1% of the overall Twitter stream, which causes dropped tweets if you're collecting from the Filter API, so I used the strategy of collecting with both the Filter and Search APIs. (If you're curious to learn more about this, check out Kevin Driscoll and Shawn Walker's "Big Data, Big Questions | Working Within a Black Box: Transparency in the Collection and Production of Big Twitter Data," and Jiaul H. Paik and Jimmy Lin's "Do Multiple Listeners to the Public Twitter Sample Stream Receive the Same Tweets?") I've included the search and filter logs in the dataset. If you run grep "WARNING" WomensMarch_filter.log, or grep "WARNING" WomensMarch_filter.log | wc -l, you'll get a sense of the scale of dropped tweets. For a number of hours on January 22, I was seeing around 1.6 million cumulative dropped tweets!

I collected from around 11AM EST on January 21, 2017 to 11AM EST on January 28, 2017 with the Filter API, and ran two Search API queries. The final count before deduplication looked like this:

$ wc -l  WomensMarch_filter.json WomensMarch_search_01.json WomensMarch_search_02.json 
     7906847 WomensMarch_filter.json
     1336505 WomensMarch_search_01.json
     9602777 WomensMarch_search_02.json
    18846129 total

Final stats: 14,478,518 tweets in a 104GB json file!
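
The combine-and-dedupe step can be sketched in a few lines of Python. This is only a sketch of the idea, not twarc's actual deduplication utility, and the function name is mine; it streams through the files and keeps the first copy of each tweet id:

```python
import json

def deduplicate(paths):
    """Merge line-oriented tweet JSON files, yielding one copy per tweet id."""
    seen = set()
    for path in paths:
        with open(path) as fh:
            for line in fh:
                tweet = json.loads(line)
                if tweet["id_str"] not in seen:
                    seen.add(tweet["id_str"])
                    yield tweet
```

Only the ids are held in memory, so even though the 104GB of json won't fit in RAM, the ~18.8 million id strings comfortably will, and unique tweets can be written back out as they're yielded.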

This puts us in the same range as what Ryan Gallagher projected in "A Bird's-Eye View of #WomensMarch."

Below I'll give a quick overview of the dataset using utilities from Documenting the Now's twarc, along with other utilities described inline. This is the same approach Ian Milligan and I took in our 2016 Code4Lib Journal article, "An Open-Source Strategy for Documenting Events: The Case Study of the 42nd Canadian Federal Election on Twitter." This is probably all that I'll have time to do with the dataset. Please feel free to use it in your own research. It's licensed CC-BY, so please have at it! :-)

...and if you want access to other Twitter datasets to analyse, check out


Tweets Username
5,375        paparcura
4,703        latinagirlpwr
1,903        ImJacobLadder
1,236        unbreakablepenn
1,212        amForever44
1,178        BassthebeastNYC
1,170        womensmarch
1,017        WhyIMarch
982        TheLifeVote
952        zerocomados

3,582,495 unique users.
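
A table like the one above can be produced straight from the line-oriented JSON with a Counter; a minimal sketch (the function is mine, not one of the twarc utilities):

```python
import json
from collections import Counter

def top_users(path, n=10):
    """Count tweets per screen_name in a line-oriented tweet JSON file."""
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            counts[json.loads(line)["user"]["screen_name"]] += 1
    return counts.most_common(n)
```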













Tweets Clients
7,098,145        Twitter for iPhone
3,718,467        Twitter for Android
2,066,773        Twitter for iPad
634,054        Twitter Web Client
306,225        Mobile Web (M5)
127,622        TweetDeck
59,463        Instagram
54,851        Tweetbot for iOS
47,556        Twitter for Windows
36,404        IFTTT
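
The client table comes from each tweet's source field, which Twitter wraps in an HTML anchor (e.g. <a href="...">Twitter for iPhone</a>), so the tags need stripping before tallying. A rough sketch, with a function name of my own:

```python
import json
import re
from collections import Counter

TAG = re.compile(r"<[^>]+>")

def top_clients(path, n=10):
    """Tally tweets by client, stripping the HTML anchor around `source`."""
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            counts[TAG.sub("", json.loads(line)["source"])] += 1
    return counts.most_common(n)
```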


Tweets       URL

2,403,637 URLs tweeted, with 527,350 of those being unique urls.

I've also set up a little bash script to feed all the unique urls to the Internet Archive:



index=0
cat $URLS | while read line; do
  # hit the Wayback Machine's "Save Page Now" endpoint for each URL
  curl -s -S "https://web.archive.org/save/$line" > /dev/null
  let "index++"
  echo "$index/527350 submitted to Internet Archive"
  sleep 1
done
I've also set up a crawl with Heritrix, and I'll make that data available here once it is complete.



Embedded Images


6,153,894 embedded image URLs tweeted, with 390,298 of those being unique urls.
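
Embedded image URLs live in each tweet's extended_entities.media list. A sketch of pulling out the photo URLs (the function name is mine):

```python
import json

def image_urls(path):
    """Yield embedded image URLs (media entities of type 'photo')."""
    with open(path) as fh:
        for line in fh:
            tweet = json.loads(line)
            for m in tweet.get("extended_entities", {}).get("media", []):
                if m.get("type") == "photo":
                    yield m["media_url"]
```

Piping the output through sort | uniq -c, as with users and urls, gives the unique counts reported above.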

I'll be creating an image montage for #WomensMarch, similar to what I did for #elxn42 and #panamapapers. It'll take some time, and I'll have to gather resources to make it happen, since we're looking at about five times the number of images with #WomensMarch.

#panamapapers images April 4-29, 2016

#panamapapers tweet volume April 4-29, 2016
Dataset is available here.

Looking at the #panamapapers capture I've been running, we have 1,424,682 embedded image urls from 3,569,960 tweets. I'm downloading the 1,424,682 images now, and hope to do something similar to what I did with the #elxn42 images. While we're waiting for the images to download, here are the 10 most tweeted embedded image urls:

Tweets Image
1. 10243
2. 8093
3. 6588
4. 5613
5. 5020
6. 4944
7. 4421
8. 3740
9. 3616
10. 3585

1,203,867 #elxn42 images



Last August, I began capturing the #elxn42 hashtag as an experiment and potential research project with Ian Milligan. Once Justin Trudeau was sworn in as the 23rd Prime Minister of Canada, we stopped collection and began analysing the dataset. We wrote that analysis up for the Code4Lib Journal, which will be published in the next couple of weeks. In the interim, you can check out our pre-print here. Included in that dataset is a line-delimited list of a url for every embedded image tweeted in the dataset; 1,203,867 images. So, I downloaded them. It took a couple of days.


cd /path/to/elxn42/images

cat $IMAGES | while read line; do
  wget "$line"
done
Now we can start doing image analysis.

1,203,867 images, now what?

I really wanted to take a macroscopic look at all the images, and looking around, the best tool for the job looked like montage, an ImageMagick command for creating composite images. But it wasn't that simple. 1,203,867 images is a lot of images, and it starts you thinking about what big data is. Is this big data? I don't know. Maybe?

Attempt #1

I can just point montage at a directory and say go to town, right? NOPE.

$ montage /path/to/1203867/elxn42/images/* elxn42.png

Too many arguments! After glancing through the man page, I find that I can pass it a line-delimited text file with the paths to each file.

find `pwd` -type f > images.txt

Now that I have that, I can pass montage that file, and I should be golden, right? NOPE.

$ montage @images.txt elxn42.png

I run out of RAM, and get a segmentation fault. This was on a machine with 80GB of RAM.

Attempt #2

Is this big data? What is big data?

Where can I get a machine with a bunch of RAM really quick? Amazon!

I spin up a d2.8xlarge (36 cores and 244GB RAM) EC2 instance, get my dataset over there, install ImageMagick, and run the command again.

$ montage @images.txt elxn42.png

NOPE. I run out of RAM, and get a segmentation fault. This was on a machine with 244GB of RAM.
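
Some back-of-the-envelope arithmetic explains the segfaults: just holding the final canvas uncompressed in RAM is on the order of 100GB, and montage keeps working copies on top of that. (Pixel dimensions taken from the pngcheck output further down.)

```python
# Final montage canvas: 138112 x 135828 pixels,
# 16-bit (2-byte) samples x 3 RGB channels = 6 bytes per pixel
width, height = 138112, 135828
bytes_per_pixel = 2 * 3
raw_bytes = width * height * bytes_per_pixel
print(f"{raw_bytes / 2**30:.1f} GiB uncompressed")  # 104.8 GiB, before any working copies
```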

Attempt #3

Is this big data? What is big data?

I've failed on two very large machines. Well, what I would consider large machines. So, I start googling and reading more ImageMagick documentation. Somebody has to have done something like this before, right? Astronomers deal with big images, right? How do they do this?

Then I find it: ImageMagick Large Image Support/Terapixel support, and the timing couldn't have been better. Ian and I had recently gotten set up with our ComputeCanada resource allocation. I set up a machine with 8 cores and 12GB RAM, and compiled the latest version of ImageMagick from source: ImageMagick-6.9.3-7.

montage -monitor -define registry:temporary-path=/data/tmp -limit memory 8GiB -limit map 10GiB -limit area 0 @elxn42-tweets-images.txt elxn42.png

Instead of running everything in RAM, which was my issue with this job, I'm able to write all the tmp files ImageMagick creates to disk with -define registry:temporary-path=/data/tmp, and limit my memory usage with -limit memory 8GiB -limit map 10GiB -limit area 0. Knowing this job would probably take a long time, -monitor comes in super handy for providing feedback on how far along the job is.

In the end, it took just over 12 days to run the job. It took up 3.5TB of disk space at its peak, and in the end generated a 32GB png file. You can check it out here.

$ pngcheck elxn42.png
OK: elxn42.png (138112x135828, 48-bit RGB, non-interlaced, 69.6%).

$ exiftool elxn42.png
ExifTool Version Number         : 9.46
File Name                       : elxn42.png
Directory                       : .
File Size                       : 32661 MB
File Modification Date/Time     : 2016:03:30 00:48:44-04:00
File Access Date/Time           : 2016:03:30 10:20:26-04:00
File Inode Change Date/Time     : 2016:03:30 09:14:09-04:00
File Permissions                : rw-rw-r--
File Type                       : PNG
MIME Type                       : image/png
Image Width                     : 138112
Image Height                    : 135828
Bit Depth                       : 16
Color Type                      : RGB
Compression                     : Deflate/Inflate
Filter                          : Adaptive
Interlace                       : Noninterlaced
Gamma                           : 2.2
White Point X                   : 0.3127
White Point Y                   : 0.329
Red X                           : 0.64
Red Y                           : 0.33
Green X                         : 0.3
Green Y                         : 0.6
Blue X                          : 0.15
Blue Y                          : 0.06
Background Color                : 65535 65535 65535
Image Size                      : 138112x135828

Concluding Thoughts

Is this big data? I don't know. I started with 1,203,867 images and made them into a single image. Using 3.5TB of tmp files to create a 32GB image is mind-boggling when you start to think about it. But then it isn't when you think about it more. Do I need a machine with 3.5TB of RAM to run this in memory? Or do I just need to design a job around the resources I have and be patient? There are always trade-offs. But at the end of it all, I'm still sitting here asking myself: what is big data?

Maybe this is big data :-)

A look at 14,939,154 #paris #Bataclan #parisattacks #porteouverte tweets


On November 13, 2015, I was at the "Web Archives 2015: Capture, Curate, Analyze" conference, listening to Ian Milligan give the closing keynote, when Thomas Padilla tweeted the following to me:

I immediately started collecting.

When tragedies like this happen, I feel pretty powerless. But I figure if I can collect something like this, similar to what I did for the Charlie Hebdo attacks, it's something. Maybe these datasets can be used for something positive to come out of all this negative.

When I started collecting, it just so happened that the creator of twarc, Ed Summers, was sitting next to me, and he mentioned some new functionality that was part of the v0.4.0 release of twarc: "Added --warnings flag to log warnings from the Twitter API about dropped tweets during streaming."

What's that mean? Basically, the public streaming API will not stream more than 1% of the total Twitter stream. If you are trying to capture something from the streaming API that exceeds 1% of the total Twitter stream, like, for instance, a hashtag or two related to a terrorist attack, the streaming API will drop tweets and notify you that it has done so. There is a really interesting look at this by Kevin Driscoll and Shawn Walker in the International Journal of Communication.
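
The notifications take the form of "limit" notices interleaved among the tweets in the stream, each carrying a running count of tweets withheld since the connection opened. A minimal sketch of pulling the cumulative dropped count out of a captured stream file (the function name and filename are mine; this mirrors what the --warnings flag logs):

```python
import json

def max_dropped(path):
    """Return the highest cumulative dropped-tweet count seen in the
    'limit' notices interleaved in a captured stream file."""
    dropped = 0
    with open(path) as fh:
        for line in fh:
            msg = json.loads(line)
            if "limit" in msg and "track" in msg["limit"]:
                dropped = max(dropped, msg["limit"]["track"])
    return dropped
```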

Ed fired up the new version of twarc and began streaming as well so we could see what was happening. We noticed that we were getting warnings of around 400 dropped tweets every request, which quickly escalated to over 28k dropped tweets every request. What we were trying to collect was over 1% of the total Twitter stream.


Collection started on November 13, 2015, using both the streaming and search APIs. This is what it looked like:

$ twarc.py --search "#paris OR #Bataclan OR #parisattacks OR #porteouverte" > paris-search.json
$ twarc.py --stream "#paris,#Bataclan,#parisattacks,#porteouverte" > paris-stream.json

I took the strategy of utilizing both the search and streaming APIs for collection because of the 1% limit noted above. The idea was that if I'm hitting the limit with the streaming API, I should theoretically be able to capture any dropped tweets with the search API. The streaming collection ran continuously during the collection period, from November 13, 2015 to December 11, 2015. The search collection was run, and once finished, immediately started back up, over the whole collection period. During the first two weeks of collection, a search API pass would take about a week to finish. In retrospect, I should have made note of the exact times each pass took, to have some more numbers to look at. That said, I'm not confident I was able to grab every tweet related to the hashtags I was collecting on. The only way, I think, I could be confident is by comparing this dataset with a dataset from Gnip. But I am confident that I have a large amount of what was tweeted.
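
One way to see what the search API recovered is simple set arithmetic on the two captures' tweet ids. A sketch (the function names are mine; the filenames are the ones from the collection commands above):

```python
import json

def ids(path):
    """Collect the set of tweet ids in a line-oriented tweet JSON file."""
    with open(path) as fh:
        return {json.loads(line)["id_str"] for line in fh}

def search_only(stream_path, search_path):
    """Ids the search API captured that the rate-limited stream missed."""
    return ids(search_path) - ids(stream_path)
```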

Once I finished collecting, I combined the json files, deduplicated them, and then created a list of tweet ids.

If you want to follow along or do your own analysis with the dataset, you can "hydrate" the dataset with twarc. You can grab the Tweet ids for the dataset from here (Data & Analysis tab).

$ twarc.py --hydrate paris-tweet-ids.txt > paris-tweets.json

The hydration process will take some time; roughly 72,000 tweets/hour. You might want to use something along the lines of GNU Screen, tmux, or nohup, since it'll take about 207.49 hours to completely hydrate.
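
The 207.49-hour estimate is just the dataset size divided by the sustained hydration rate:

```python
total_tweets = 14_939_154
rate_per_hour = 72_000  # sustained hydration throughput under the REST API rate limits
hours = total_tweets / rate_per_hour
print(f"{hours:.2f} hours ({hours / 24:.1f} days)")  # 207.49 hours (8.6 days)
```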

created with Peter Binkley's twarc-report


I'm only going to do a quick analysis of the dataset here, since I want to get the dataset out and allow others to work with it. Tweets with geocoordinates are not covered below, but you can check out a map of tweets here.

There are a number of friendly utilities that come with twarc that allow for a quick exploratory analysis of a given collection. In addition, Peter Binkley's twarc-report is pretty handy for providing a quick overview of a given dataset.


We are able to create a list of the unique Twitter usernames in the dataset using one of the twarc utilities, and then sort them by number of tweets:

$ python ~/git/twarc/utils/ paris-valid-deduplicated.json > paris-users.txt
$ cat paris-users.txt | sort | uniq -c | sort -n > paris-users-unique-sorted-count.txt
$ cat paris-users-unique-sorted-count.txt | wc -l
$ tail paris-users-unique-sorted-count.txt

From the above, we can see that there are 4,636,584 unique users in the dataset, and the top 10 accounts were as follows:

1. 38,883 tweets RelaxInParis
2. 36,504 tweets FrancePeace
3. 12,697 tweets FollowParisNews
4. 12,656 tweets Reduction_Paris
5. 10,044 tweets CNNsWorld
6. 8,208 tweets parisevent
7. 7,296 tweets TheMalyck_
8. 6,654 tweets genx_hrd
9. 6,370 tweets DHEdomains
10. 4,498 tweets paris_attack


We are able to create a list of the most retweeted tweets in the dataset using one of the twarc utilities:

$ python ~/git/twarc/utils/ paris-valid-deduplicated.json > paris-retweets.json
$ python ~/git/twarc/utils/ paris-retweets.json > paris-retweets.txt

1. 53,639 retweets
2. 44,457 retweets
3. 41,400 retweets
4. 39,140 retweets
5. 37,214 retweets
6. 24,955 retweets
7. 22,124 retweets
8. 22,113 retweets
9. 22,069 retweets
10. 21,401 retweets


We were able to create a list of the unique hashtags used in our dataset using one of the twarc utilities:

$ python ~/git/twarc/utils/ paris-valid-deduplicated.json > paris-hashtags.txt
$ cat paris-hashtags.txt | wc -l
$ head paris-hashtags.txt

From the above, we can see that 268,974 unique hashtags were used. The top 10 hashtags used in the dataset were:

1. 6,812,941 tweets #parisattacks
2. 6,119,933 tweets #paris
3. 1,100,809 tweets #bataclan
4. 887,144 tweets #porteouverte
5. 673,543 tweets #prayforparis
6. 444,486 tweets #rechercheparis
7. 427,999 tweets #parís
8. 387,699 tweets #france
9. 341,059 tweets #fusillade
10. 303,410 tweets #isis


We are able to create a list of the unique URLs tweeted in our dataset using one of the twarc utilities, after first unshortening the urls with unshrtn.

$ python ~/git/twarc/utils/ paris-valid-deduplicated-unshortened.json > paris-tweets-urls.txt
$ cat paris-tweets-urls.txt | sort | uniq -c | sort -n > paris-tweets-urls-uniq.txt
$ cat paris-tweets-urls.txt | wc -l
$ cat paris-tweets-urls-uniq.txt | wc -l
$ tail paris-tweets-urls-uniq.txt

From the above, we can see that there were 5,561,037 URLs tweeted, representing 37.22% of total tweets, and 858,401 unique URLs tweeted. The top 10 URLs tweeted were as follows:

1. 46,034 tweets
2. 46,005 tweets
3. 37,509 tweets
4. 35,882 tweets
5. 33,531 tweets
6. 33,039 tweets
7. 24,221 tweets
8. 23,536 tweets
9. 21,237 tweets
10. 21,107 tweets


We are able to create a list of the images tweeted in our dataset using one of the twarc utilities:

$ python ~/git/twarc/utils/ paris-valid-deduplicated.json > paris-tweets-images.txt
$ cat paris-tweets-images.txt | sort | uniq -c | sort -n > paris-tweets-images-uniq.txt
$ cat paris-tweets-images-uniq.txt | wc -l
$ tail paris-tweets-images-uniq.txt

From the above, we can see that there were 6,872,441 total image tweets, representing 46.00% of total tweets, and 660,470 unique images. The top 10 images tweeted were as follows:

  1. 49,051 Occurrences
  2. 43,348 Occurrences
  3. 22,615 Occurrences
  4. 21,325 Occurrences
  5. 20,689 Occurrences
  6. 19,696 Occurrences
  7. 19,597 Occurrences
  8. 19,096 Occurrences
  9. 16,772 Occurrences
  10. 15,364 Occurrences