14,478,518 #WomensMarch tweets January 12-28, 2017


A couple Saturday mornings ago, I was on the couch listening to records and reading a book when Christina Harlow and MJ Suhonos asked me about collecting #WomensMarch tweets. Little did I know at the time #WomensMarch would be the largest volume collection I have ever seen. By the time I stopped collecting a week later, we'd amassed 14,478,518 unique tweet ids from 3,582,495 unique users, and at one point hit around 1 million tweets in a single hour.

http://ruebot.net/WomensMarch_tweet_volume.html (Generated with Peter Binkley's twarc-report)

This put #WomensMarch well over 1% of the overall Twitter stream, which causes dropped tweets if you're collecting from the Filter API, so I used the strategy of collecting with both the Filter and Search APIs. (If you're curious about learning more about this, check out Kevin Driscoll and Shawn Walker's "Big Data, Big Questions | Working Within a Black Box: Transparency in the Collection and Production of Big Twitter Data," and Jiaul H. Paik and Jimmy Lin's "Do Multiple Listeners to the Public Twitter Sample Stream Receive the Same Tweets?") I've included the search and filter logs in the dataset. Running grep "WARNING" WomensMarch_filter.log, or grep "WARNING" WomensMarch_filter.log | wc -l, will give you a sense of the scale of dropped tweets. For a number of hours on January 22, I was seeing around 1.6 million cumulative dropped tweets!


I collected from around 11AM EST on January 21, 2017 to 11AM EST January 28, 2017 with the Filter API, and did two Search API queries. Final count before deduplication looked like this:

$ wc -l  WomensMarch_filter.json WomensMarch_search_01.json WomensMarch_search_02.json 
     7906847 WomensMarch_filter.json
     1336505 WomensMarch_search_01.json
     9602777 WomensMarch_search_02.json
    18846129 total

Final stats: 14,478,518 tweets in a 104GB json file!
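For the curious, the deduplication step boils down to keeping the first copy of each tweet id across the combined filter and search files. A rough Python sketch of that (twarc ships a deduplicate.py utility that does essentially this):

```python
import json

def dedupe_lines(line_sources):
    """Yield each tweet line once, keyed on tweet id, across any number of
    line-oriented JSON sources -- roughly what twarc's deduplicate.py does."""
    seen = set()
    for lines in line_sources:
        for line in lines:
            tweet_id = json.loads(line)["id_str"]
            if tweet_id not in seen:
                seen.add(tweet_id)
                yield line
```

In practice you'd pass it open file handles for the filter and search json files and write the yielded lines to a combined file.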

This puts us in the same range as what Ryan Gallagher projected in "A Bird's-Eye View of #WomensMarch."

Below I'll give a quick overview of the dataset using utilities from Documenting the Now's twarc, plus utilities described inline. This is the same approach Ian Milligan and I took in our 2016 Code4Lib Journal article, "An Open-Source Strategy for Documenting Events: The Case Study of the 42nd Canadian Federal Election on Twitter." This is probably all that I'll have time to do with the dataset. Please feel free to use it in your own research. It's licensed CC-BY, so please have at it! :-)

...and if you want access to other Twitter datasets to analyse, check out http://www.docnow.io/catalog/.


Tweets Username
5,375        paparcura
4,703        latinagirlpwr
1,903        ImJacobLadder
1,236        unbreakablepenn
1,212        amForever44
1,178        BassthebeastNYC
1,170        womensmarch
1,017        WhyIMarch
982        TheLifeVote
952        zerocomados

3,582,495 unique users.
Tweets Clients
7,098,145        Twitter for iPhone
3,718,467        Twitter for Android
2,066,773        Twitter for iPad
634,054        Twitter Web Client
306,225        Mobile Web (M5)
127,622        TweetDeck
59,463        Instagram
54,851        Tweetbot for iOS
47,556        Twitter for Windows
36,404        IFTTT


Tweets       URL
29,223        https://www.facebook.com/cnn/videos/10155945796281509/
27,435       http://www.cnn.com/2017/01/21/politics/womens-march-donald-trump-inauguration-sizes/index.html?sr=twCNN012117womens-march-donald-trump-inauguration-sizes0205PMStoryGal
24,854       http://www.independent.co.uk/news/world/americas/womens-march-antarctica-donald-trump-inauguration-women-hate-donald-trump-so-much-they-are-even-a7538856.html
21,189       https://twitter.com/kayleighmcenany/status/822979246205403136
20,902       https://twitter.com/mcgregor_ewan/status/823805815488331776
14,857       http://www.cnn.com/2017/01/21/politics/womens-march-donald-trump-inauguration-sizes/index.html?sr=twpol012117womens-march-donald-trump-inauguration-sizes0832PMVODtopLink&linkId=33643748
12,630       https://www.womensmarch.com/sisters
11,244       https://twitter.com/tomilahren/status/822852245532319744
9,761       https://twitter.com/mstharrington/status/823190136200593408
9,585       http://www.cnn.com/2017/01/21/politics/womens-march-protests-live-coverage/index.html?sr=twCNN012117womens-march-protests-live-coverage1208PMVODtop

2,403,637 URLs tweeted, with 527,350 of those being unique urls.
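The URL tallies come from extracting each tweet's expanded URLs and counting; a rough Python equivalent of that pipeline (field names per Twitter's tweet JSON):

```python
import json
from collections import Counter

def top_urls(lines, n=10):
    """Count expanded URLs across line-oriented tweet JSON and return the
    n most tweeted, roughly what urls.py plus sort | uniq -c gives you."""
    counts = Counter()
    for line in lines:
        tweet = json.loads(line)
        for u in tweet.get("entities", {}).get("urls", []):
            counts[u.get("expanded_url") or u.get("url")] += 1
    return counts.most_common(n)
```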

I've also set up a little bash script to feed all the unique urls to Internet Archive ($URLS is the line-delimited list of unique urls):

index=0
cat $URLS | while read line; do
  curl -s -S "https://web.archive.org/save/$line" > /dev/null
  let "index++"
  echo "$index/527350 submitted to Internet Archive"
  sleep 1
done
I've also set up a crawl with Heritrix, and I'll make that data available here once it is complete.


Tweets Domain
1,219,747        twitter.com
159,087        instagram.com
134,309        cnn.com
68,479        facebook.com
50,561        womensmarch.com
43,219        youtube.com
36,946        nytimes.com
30,201        huffingtonpost.com
21,520        paper.li
21,476        cbsnews.com

Embedded Images

6,153,894 embedded image URLs tweeted, with 390,298 of those being unique urls.

I'll be creating an image montage for #WomensMarch similar to what I did for #elxn42 and #panamapapers. It'll take some time, and I'll have to gather resources to make it happen, since we're looking at about five times the number of images.

#panamapapers images April 4-29, 2016

#panamapapers tweet volume April 4-29, 2016
Dataset is available here.

Looking at the #panamapapers capture I've been doing, we have 1,424,682 embedded image urls from 3,569,960 tweets. I'm downloading the 1,424,682 images now, and hope to do something similar to what I did with the #elxn42 images. While we're waiting for the images to download, here are the 10 most tweeted embedded image urls:

Tweets Image
1. 10243 http://pbs.twimg.com/media/CfIsEBAXEAA8I0A.jpg
2. 8093 http://pbs.twimg.com/media/Cfdm2RtXIAEbNGN.jpg
3. 6588 http://pbs.twimg.com/tweet_video_thumb/CfJly88WwAAHBZp.jpg
4. 5613 http://pbs.twimg.com/media/CfIuU8hW4AAsafn.jpg
5. 5020 http://pbs.twimg.com/media/CfN2gZcWAAEcptA.jpg
6. 4944 http://pbs.twimg.com/media/CfOPcofUAAAOb3v.jpg
7. 4421 http://pbs.twimg.com/media/CfnqsINWIAAMCTR.jpg
8. 3740 http://pbs.twimg.com/media/CfSpwuhWQAALIS7.jpg
9. 3616 http://pbs.twimg.com/media/CfXYf5-UAAAQsps.jpg
10. 3585 http://pbs.twimg.com/media/CfTsTp_UAAECCg4.jpg

1,203,867 #elxn42 images



Last August, I began capturing the #elxn42 hashtag as an experiment, and a potential research project with Ian Milligan. Once Justin Trudeau was sworn in as the 23rd Prime Minister of Canada, we stopped collection and began analysing the dataset. We wrote that analysis up for the Code4Lib Journal, which will be published in the next couple weeks. In the interim, you can check out our pre-print here. Included in that dataset is a line-delimited list of a url to every embedded image tweeted in the dataset; 1,203,867 images. So, I downloaded them. It took a couple days.


cd /path/to/elxn42/images

cat $IMAGES | while read line; do
  wget "$line"
done
Now we can start doing image analysis.

1,203,867 images, now what?

I really wanted to take a macroscopic look at all the images, and looking around, the best tool for the job looked like montage, an ImageMagick command for creating composite images. But it wasn't that simple. 1,203,867 images is a lot of images, and it starts you thinking about what big data is. Is this big data? I don't know. Maybe?

Attempt #1

I can just point montage at a directory and say go to town, right? NOPE.

$ montage /path/to/1203867/elxn42/images/* elxn42.png

Too many arguments! After glancing through the man page, I find that I can pass it a line-delimited text file with the paths to each file. I run the following command in the directory with all the downloaded images.

find "$(pwd)" -type f > ../images.txt

Now that I have that, I can pass montage that file, and I should be golden, right? NOPE.

$ montage @images.txt elxn42.png

I run out of RAM, and get a segmentation fault. This was on a machine with 80GB of RAM.

Attempt #2

Is this big data? What is big data?

Where can I get a machine with a bunch of RAM really quick? Amazon!

I spin up a d2.8xlarge (36 cores and 244GB RAM) EC2 instance, get my dataset over there, install ImageMagick, and run the command again.

$ montage @images.txt elxn42.png

NOPE. I run out of RAM, and get a segmentation fault. This was on a machine with 244GB of RAM.

Attempt #3

Is this big data? What is big data?

I've failed on two very large machines. Well, what I would consider large machines. So, I start googling, and reading more ImageMagick documentation. Somebody has to have done something like this before, right? Astronomers, they deal with big images right? How do they do this?

Then I find it: ImageMagick Large Image Support/Terapixel support, and the timing couldn't have been better. Ian and I had recently gotten set up with our ComputeCanada resource allocation. I set up a machine with 8 cores and 12GB RAM, and compiled the latest version of ImageMagick from source: ImageMagick-6.9.3-7.

montage -monitor -define registry:temporary-path=/data/tmp -limit memory 8GiB -limit map 10GiB -limit area 0 @elxn42-tweets-images.txt elxn42.png

Instead of running everything in RAM, which was my issue with this job, I'm able to write all the tmp files ImageMagick creates to disk with -define registry:temporary-path=/data/tmp, and limit my memory usage with -limit memory 8GiB -limit map 10GiB -limit area 0. And since the job was probably going to take a long time, -monitor comes in super handy for providing feedback on its progress.

In the end, it took just over 12 days to run the job. It took up 3.5TB of disk space at its peak, and in the end generated a 32GB png file. You can check it out here.

$ pngcheck elxn42.png
OK: elxn42.png (138112x135828, 48-bit RGB, non-interlaced, 69.6%).

$ exiftool elxn42.png
ExifTool Version Number         : 9.46
File Name                       : elxn42.png
Directory                       : .
File Size                       : 32661 MB
File Modification Date/Time     : 2016:03:30 00:48:44-04:00
File Access Date/Time           : 2016:03:30 10:20:26-04:00
File Inode Change Date/Time     : 2016:03:30 09:14:09-04:00
File Permissions                : rw-rw-r--
File Type                       : PNG
MIME Type                       : image/png
Image Width                     : 138112
Image Height                    : 135828
Bit Depth                       : 16
Color Type                      : RGB
Compression                     : Deflate/Inflate
Filter                          : Adaptive
Interlace                       : Noninterlaced
Gamma                           : 2.2
White Point X                   : 0.3127
White Point Y                   : 0.329
Red X                           : 0.64
Red Y                           : 0.33
Green X                         : 0.3
Green Y                         : 0.6
Blue X                          : 0.15
Blue Y                          : 0.06
Background Color                : 65535 65535 65535
Image Size                      : 138112x135828
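As a back-of-the-envelope check on why the in-RAM attempts failed: the final canvas alone, held uncompressed at 48-bit RGB, is over 100GB, before any of ImageMagick's working buffers (which, at its internal quantum depth, can be larger still):

```python
# Dimensions and bit depth from the pngcheck/exiftool output above.
width, height = 138_112, 135_828
bytes_per_pixel = 6  # 48-bit RGB: 3 channels x 16 bits
uncompressed_gb = width * height * bytes_per_pixel / 1e9
print(round(uncompressed_gb, 1))  # roughly 112.6 GB
```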

Concluding Thoughts

Is this big data? I don't know. I started with 1,203,867 images and made them into a single image. Using 3.5TB of tmp files to create a 32GB image is mind boggling when you start to think about it. But then it isn't when you think about it more. Do I need a machine with 3.5TB of RAM to run this in memory? Or do I just need to design a job with the resources I have and be patient? There are always trade-offs. But, at the end of it all, I'm still sitting here asking myself: what is big data?

Maybe this is big data :-)

A look at 14,939,154 #paris #Bataclan #parisattacks #porteouverte tweets


On November 13, 2015, I was at "Web Archives 2015: Capture, Curate, Analyze," listening to Ian Milligan give the closing keynote, when Thomas Padilla tweeted the following to me:

I immediately started collecting.

When tragedies like this happen, I feel pretty powerless. But I figure if I can collect something like this, similar to what I did for the Charlie Hebdo attacks, it's something. Maybe these datasets can be used for something positive to come out of all this negative.

When I started collecting, it just so happened that the creator of twarc, Ed Summers, was sitting next to me, and he mentioned some new functionality that was part of the v0.4.0 release of twarc: "Added --warnings flag to log warnings from the Twitter API about dropped tweets during streaming."

What's that mean? Basically, the public Stream API will not stream more than 1% of the total Twitter stream. If you are trying to capture something from the streaming API that exceeds 1% of the total Twitter stream, like, for instance, a hashtag or two related to a terrorist attack, the streaming API will drop tweets and notify you that it has done so. There is a really interesting look at this by Kevin Driscoll and Shawn Walker in the International Journal of Communication.

Ed fired up the new version of twarc and began streaming as well so we could see what was happening. We noticed that we were getting warnings of around 400 dropped tweets on each request, which quickly escalated to over 28k dropped tweets per request. What we were trying to collect was over 1% of the total Twitter stream.
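Concretely, the streaming API interleaves these notices with the tweets themselves, as JSON messages of the form {"limit": {"track": N}}, where N is (per Twitter's streaming API docs) the cumulative count of tweets dropped since the connection opened. A minimal sketch of spotting them in a stream:

```python
import json

def dropped_count(message):
    """Return the cumulative dropped-tweet count if this streaming message
    is a limit notice, otherwise None. Limit notices look like
    {"limit": {"track": N}}; everything else in the stream is a tweet or
    other event type."""
    data = json.loads(message)
    if "limit" in data:
        return data["limit"].get("track")
    return None
```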


Collection started on November 13, 2015 using both the streaming and search API. This is what it looked like:

$ twarc.py --search "#paris OR #Bataclan OR #parisattacks OR #porteouverte" > paris-search.json
$ twarc.py --stream "#paris,#Bataclan,#parisattacks,#porteouverte" > paris-stream.json

I took the strategy of utilizing both the search and streaming API for collection due to what was noted above about hitting the 1% limit of the total Twitter stream. The idea was that if I'm hitting the limit with stream, theoretically I should be able to capture any dropped tweets with the search API. The stream API collection ran continuously during the collection period, from November 13, 2015 to December 11, 2015. The search API collection was run, then, once finished, immediately started back up over the collection period. During the first two weeks of collection, each search API run would take about a week to finish. In retrospect, I should have made note of the exact times it took to collect, to get some more numbers to look at. That said, I'm not confident I was able to grab every tweet related to the hashtags I was collecting on. The only way, I think, I can be confident is by comparing this dataset with a dataset from Gnip. But I am confident that I have a large amount of what was tweeted.

Once I finished collecting, I combined the json files, deduplicated them with deduplicate.py, and then created a list of tweet ids with ids.py.
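ids.py just pulls the id out of each line of tweet JSON; roughly:

```python
import json

def tweet_ids(lines):
    """Extract the id_str from each line of tweet JSON -- roughly what
    twarc's ids.py does."""
    return [json.loads(line)["id_str"] for line in lines]
```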

If you want to follow along or do your own analysis with the dataset, you can "hydrate" the dataset with twarc. You can grab the Tweet ids for the dataset from here (Data & Analysis tab).

$ twarc.py --hydrate paris-tweet-ids.txt > paris-tweets.json

The hydration process will take some time: roughly 72,000 tweets/hour. You might want to use something along the lines of GNU Screen, tmux, or nohup, since it'll take about 207.49 hours to completely hydrate.
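If I have the REST API limits right (statuses/lookup accepts up to 100 ids per request, at 180 requests per 15-minute window; treat those numbers as assumptions), the 72,000 tweets/hour figure and the total hydration time work out like this:

```python
ids = 14_939_154                 # tweet ids in this dataset
ids_per_request = 100            # statuses/lookup batch size (assumed)
requests_per_window = 180        # per 15-minute rate-limit window (assumed)
tweets_per_hour = ids_per_request * requests_per_window * 4
hours = ids / tweets_per_hour
print(tweets_per_hour, round(hours, 2))  # 72000 207.49
```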

created with Peter Binkley's twarc-report


I'm only going to do a quick analysis of the dataset here since I want to get the dataset out and allow others to work with it. Tweets with geo coordinates are not covered below, but you can check out a map of tweets here.

There are a number of friendly utilities that come with twarc that allow for a quick exploratory analysis of a given collection. In addition, Peter Binkley's twarc-report is pretty handy for providing a quick overview of a given dataset.


We are able to create a list of the unique Twitter usernames in the dataset by using users.py, and additionally sort them by the number of tweets:

$ python ~/git/twarc/utils/users.py paris-valid-deduplicated.json > paris-users.txt
$ cat paris-users.txt | sort | uniq -c | sort -n > paris-users-unique-sorted-count.txt
$ cat paris-users-unique-sorted-count.txt | wc -l
$ tail paris-users-unique-sorted-count.txt

From the above, we can see that there are 4,636,584 unique users in the dataset, and the top 10 accounts were as follows:

1. 38,883 tweets RelaxInParis
2. 36,504 tweets FrancePeace
3. 12,697 tweets FollowParisNews
4. 12,656 tweets Reduction_Paris
5. 10,044 tweets CNNsWorld
6. 8,208 tweets parisevent
7. 7,296 tweets TheMalyck_
8. 6,654 tweets genx_hrd
9. 6,370 tweets DHEdomains
10. 4,498 tweets paris_attack


We are able to create a list of the most retweeted tweets in the dataset by using retweets.py:

$ python ~/git/twarc/utils/retweets.py paris-valid-deduplicated.json > paris-retweets.json
$ python ~/git/twarc/utils/tweet_urls.py paris-retweets.json > paris-retweets.txt

1. 53,639 retweets https://twitter.com/PNationale/status/665939383418273793
2. 44,457 retweets https://twitter.com/MarkRuffalo/status/665329805206900736
3. 41,400 retweets https://twitter.com/NiallOfficial/status/328827440157839361
4. 39,140 retweets https://twitter.com/oreoxzhel/status/665499107021066240
5. 37,214 retweets https://twitter.com/piersmorgan/status/665314980095356928
6. 24,955 retweets https://twitter.com/Fascinatingpics/status/665458581832077312
7. 22,124 retweets https://twitter.com/RGerrardActor/status/665325168953167873
8. 22,113 retweets https://twitter.com/HeralddeParis/status/665327408803741696
9. 22,069 retweets https://twitter.com/Gabriele_Corno/status/484640360120209408
10. 21,401 retweets https://twitter.com/SarahMatt97/status/665383304787529729


We are able to create a list of the unique hashtags used in our dataset by using tags.py.

$ python ~/git/twarc/utils/tags.py paris-valid-deduplicated.json > paris-hashtags.txt
$ cat paris-hashtags.txt | wc -l
$ head paris-hashtags.txt

From the above, we can see that 268,974 unique hashtags were used. The top 10 hashtags used in the dataset were:

1. 6,812,941 tweets #parisattacks
2. 6,119,933 tweets #paris
3. 1,100,809 tweets #bataclan
4. 887,144 tweets #porteouverte
5. 673,543 tweets #prayforparis
6. 444,486 tweets #rechercheparis
7. 427,999 tweets #parís
8. 387,699 tweets #france
9. 341,059 tweets #fusillade
10. 303,410 tweets #isis


We are able to create a list of the unique URLs tweeted in our dataset by using urls.py, after first unshortening the urls with unshorten.py and unshrtn.

$ python ~/git/twarc/utils/urls.py paris-valid-deduplicated-unshortened.json > paris-tweets-urls.txt
$ cat paris-tweets-urls.txt | sort | uniq -c | sort -n > paris-tweets-urls-uniq.txt
$ cat paris-tweets-urls.txt | wc -l
$ cat paris-tweets-urls-uniq.txt | wc -l
$ tail paris-tweets-urls-uniq.txt

From the above, we can see that there were 5,561,037 URLs tweeted, representing 37.22% of total tweets, and 858,401 unique URLs tweeted. The top 10 URLs tweeted were as follows:

1. 46,034 tweets http://www.bbc.co.uk/news/live/world-europe-34815972?ns_mchannel=social&ns_campaign=bbc_breaking&ns_source=twitter&ns_linkname=news_central
2. 46,005 tweets https://twitter.com/account/suspended
3. 37,509 tweets http://www.lefigaro.fr/actualites/2015/11/13/01001-20151113LIVWWW00406-fusillade-paris-explosions-stade-de-france.php#xtor=AL-155-
4. 35,882 tweets http://twibbon.com/support/prayforparis-2/twitter
5. 33,531 tweets http://www.bbc.co.uk/news/live/world-europe-34815972
6. 33,039 tweets https://www.rt.com/news/321883-shooting-paris-dead-masked/
7. 24,221 tweets https://www.youtube.com/watch?v=-Uo6ZB0zrTQ
8. 23,536 tweets http://www.bbc.co.uk/news/live/world-europe-34825270
9. 21,237 tweets https://amp.twimg.com/v/fc122aff-6ece-47a4-b34c-cafbd72ef386
10. 21,107 tweets http://live.reuters.com/Event/Paris_attacks_2?Page=0


We are able to create a list of images tweeted in our dataset by using image_urls.py.

$ python ~/git/twarc/utils/image_urls.py paris-valid-deduplicated.json > paris-tweets-images.txt
$ cat paris-tweets-images.txt | sort | uniq -c | sort -n > paris-tweets-images-uniq.txt
$ cat paris-tweets-images-uniq.txt | wc -l
$ tail paris-tweets-images-uniq.txt

From the above, we can see that there were 6,872,441 embedded images tweeted, representing 46.00% of total tweets, and 660,470 unique images. The top 10 images tweeted were as follows:

  1. 49,051 Occurrences
  2. 43,348 Occurrences
  3. 22,615 Occurrences
  4. 21,325 Occurrences
  5. 20,689 Occurrences
  6. 19,696 Occurrences
  7. 19,597 Occurrences
  8. 19,096 Occurrences
  9. 16,772 Occurrences
  10. 15,364 Occurrences

An Exploratory look at 13,968,293 #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, and #CharlieHebdo tweets

#JeSuisCharlie #JeSuisAhmed #JeSuisJuif #CharlieHebdo

I've spent the better part of a month collecting tweets from the #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, and #CharlieHebdo tweets. Last week, I pulled together all of the collection files, did some clean up, and some more analysis on the data set (76G of json!). This time I was able to take advantage of Peter Binkley's twarc-report project. According to the report, the earliest tweet in the data set is from 2015-01-07 11:59:12 UTC, and the last tweet in the data set is from 2015-01-28 18:15:35 UTC. This data set includes 13,968,293 tweets (10,589,910 retweets - 75.81%) from 3,343,319 different users over 21 days. You can check out a word cloud of all the tweets here.

First tweet in data set (numeric sort of tweet ids):


If you want to experiment/follow along with what I've done here, you can "rehydrate" the data set with twarc. You can grab the Tweet ids for the data set from here (Data & Analysis tab).

% twarc.py --hydrate JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweet-ids-20150129.txt > JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json

The hydration process will take some time. I'd highly suggest using GNU Screen or tmux, and grabbing approximately 15 pots of coffee.


In this data set, we have 133,970 tweets with geo coordinates available. This represents about 0.96% of the entire data set.

The map is available here in a separate page since the geojson file is 83M and will potato your browser while everything loads. If anybody knows how to stream that geojson file to Leaflet.js so the browser doesn't potato, please comment! :-)


These are the top 10 users in the data set.

  1. 35,420 tweets Promo_Culturel
  2. 33,075 tweets BotCharlie
  3. 24,251 tweets YaMeCanse21
  4. 23,126 tweets yakacliquer
  5. 17,576 tweets YaMeCanse20
  6. 15,315 tweets iS_Angry_Bird
  7. 9,615 tweets AbraIsacJac
  8. 9,318 tweets AnAnnoyingTweep
  9. 3,967 tweets rightnowio_feed
  10. 3,514 tweets russfeed

This comes from twarc-report's reportprofiler.py.

$ ~/git/twarc-report/reportprofiler.py -o text JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json


These are the top 10 hashtags in the data set.

  1. 8,597,175 tweets #charliehebdo
  2. 7,911,343 tweets #jesuischarlie
  3. 377,041 tweets #jesuisahmed
  4. 264,869 tweets #paris
  5. 186,976 tweets #france
  6. 177,448 tweets #parisshooting
  7. 141,993 tweets #jesuisjuif
  8. 140,539 tweets #marcherepublicaine
  9. 129,484 tweets #noussommescharlie
  10. 128,529 tweets #afp


These are the top 10 URLs in the data set. 3,771,042 tweets (27.00%) had a URL associated with them.

These are all shortened urls. I'm working through an issue with unshorten.py.

  1. http://bbc.in/1xPaVhN (43,708)
  2. http://bit.ly/1AEpWnE (19,328)
  3. http://bit.ly/1DEm0TK (17,033)
  4. http://nyr.kr/14AeVIi (14,118)
  5. http://youtu.be/4KBdnOrTdMI (13,252)
  6. http://bbc.in/14ulyLt (12,407)
  7. http://europe1.fr/direct-video (9,228)
  8. http://bbc.in/1DxNLQD (9,044)
  9. http://ind.pn/1s5EV8w (8,721)
  10. http://srogers.cartodb.com/viz/123be814-96bb-11e4-aec1-0e9d821ea90d/embed_map (8,581)

This comes from twarc-report's reportprofiler.py.

$ ~/git/twarc-report/reportprofiler.py -o text JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json


These are the top 10 media urls in the data set. 8,141,552 tweets (58.29%) had a media URL associated with them.

  1. 36,753 Occurrences
  2. 35,942 Occurrences
  3. 33,501 Occurrences
  4. 31,712 Occurrences
  5. 29,359 Occurrences
  6. 26,334 Occurrences
  7. 25,989 Occurrences
  8. 23,974 Occurrences
  9. 22,659 Occurrences
  10. 22,421 Occurrences


This comes from twarc-report's reportprofiler.py.

$ ~/git/twarc-report/reportprofiler.py -o text JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json

An exploratory look at 257,093 #JeSuisAhmed tweets


I had some time last night to do some exploratory analysis on some of the #JeSuisAhmed collection. This analysis runs from the first #JeSuisAhmed tweet I was able to harvest to some time on January 14, 2015, when I copied over the json to experiment with a few of the twarc utilities.

First tweet in data set:

Last tweet in data set:


If you want to experiment/follow along with what I've done here, you can "rehydrate" the data set with twarc. You can grab the Tweet ids for #JeSuisAhmed from here (Data & Analysis tab).

% twarc.py --hydrate JeSuisAhmed-ids-20150113.txt > JeSuisAhmed-tweets-20150113.json

The hydration process will take some time. I'd highly suggest using GNU Screen or tmux, and grabbing a cup of coffee.


#JeSuisAhmed tweets with geo coordinates.

In this data set, we have 2,329 tweets with geo coordinates available. This represents about 0.91% of the entire data set (257,093 tweets).

How do you make this?

  • Create the geojson: ~/git/twarc/utils/geojson.py JeSuisAhmed-tweets-dedupe-20150112.json > JeSuisAhmed-tweets-dedupe-20150112.geojson

  • Give the geojson a variable name.

  • Use Leaflet.js to put all the tweets with geo coordinates on a map like this.
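The "give the geojson a variable name" step just means wrapping the file in a JavaScript assignment so the map page can load it with a plain <script src="..."> tag. A minimal sketch (the variable name, and whatever filenames you feed it, are arbitrary):

```python
def geojson_to_js(geojson_text, var_name="tweets"):
    """Wrap raw geojson in a JavaScript assignment so Leaflet.js code on
    the page can reference it directly."""
    return "var " + var_name + " = " + geojson_text + ";"

# e.g. read the .geojson file, wrap it, and write the result out as a .js
# file that the map page includes before its Leaflet code runs.
```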


These are the image urls that have more than 1000 occurrences in the data set.

13,703 Occurrences
10,396 Occurrences
6,088 Occurrences
4,354 Occurrences
3,229 Occurrences
3,124 Occurrences
2,307 Occurrences
2,034 Occurrences
1,949 Occurrences
1,296 Occurrences
1,182 Occurrences
1,100 Occurrences


How do you get the image list (requires unshrtn)?

% ~/git/twarc/utils/image_urls.py JeSuisAhmed-tweets-unshortened-20150112.json > JeSuisAhmed-images-20150112.txt
% cat JeSuisAhmed-images-20150112.txt | sort | uniq -c | sort -rn > JeSuisAhmed-images-ranked-20150112.txt

The ranked url data set can be found here.


What are the three most retweeted tweets in the hashtag?

How do you find the most retweeted tweets in the dataset? This will give you the top 10.

~/git/twarc/utils/retweets.py JeSuisAhmed-tweets-dedupe-20150112.json > JeSuisAhmed-retweets-20150112.json

Top URLs

Top 10 URLs tweeted from #JeSuisAhmed.

  1. http://www.huffingtonpost.ca/2015/01/08/ahmed-merabet-jesuisahmed-charlie-hebdo_n_6437984.html?ncid=tweetlnkushpmg00000067 (2895)
  2. http://limportant.fr/infos-jesuischarlie/76/360460 (1613)
  3. http://mic.com/articles/107988/the-hero-of-the-charlie-hebdo-shooting-we-re-overlooking (1318)
  4. http://www.huffingtonpost.co.uk/2015/01/08/charlie-hebdocharlie-hebdo-attack-jesuisahmed-hashtag-commemorating-ahmed-merabet-takes-off_n_6436528.html?1420731418&ncid=tweetlnkushpmg00000067 (919)
  5. http://www.huffingtonpost.com/2015/01/08/jesuisahmed-twitter-hashtag_n_6438132.html?ncid=tweetlnkushpmg00000067 (632)
  6. http://www.huffingtonpost.com/2015/01/08/jesuisahmed-twitter-hashtag_n_6438132.html?ncid=tweetlnkushpmg00000055 (592)
  7. http://www.dailymail.co.uk/news/article-2901681/Hero-police-officer-executed-street-married-42-year-old-Muslim-assigned-patrol-Paris-neighbourhood-Charlie-Hebdo-offices-located.html (571)
  8. http://blogs.mediapart.fr/blog/joel-villain/070115/il-sappelait-ahmed (555)
  9. http://www.bbc.co.uk/news/blogs-trending-30728491?ocid=socialflow_twitter (471)
  10. http://www.huffingtonpost.com/2015/01/08/jesuisahmed-twitter-hashtag_n_6438132.html?utm_hp_ref=tw (436)

Full list of urls can be found here.

How do you get the list (requires unshrtn)?

% cat JeSuisAhmed-tweets-20150112.json | ~/git/twarc/utils/unshorten.py > JeSuisAhmed-tweets-unshortened-20150112.json
% cat JeSuisAhmed-tweets-unshortened-20150112.json | ~/git/twarc/utils/urls.py| sort | uniq -c | sort -rn > JeSuisAhmed-urls.txt

Twitter Clients

Top 10 Twitter clients used from #JeSuisAhmed.

  1. Twitter for iPhone (85116)
  2. Twitter for Android (58819)
  3. Twitter Web Client (58166)
  4. Twitter for iPad (15304)
  5. Twitter for Websites (6877)
  6. Twitter for Windows Phone (5237)
  7. Twitter for Android Tablets (4420)
  8. TweetDeck (3790)
  9. Mobile Web (M5) (1708)
  10. Tweetbot for iOS (1691)

Full list of clients can be found here.

How do you get the list of Twitter client sources?

% ~/git/twarc/utils/source.py JeSuisAhmed-tweets-20150112.json > JeSuisAhmed-sources-20150112.html

#JeSuisCharlie images

Using the #JeSuisCharlie data set from January 11, 2015 (Warning! Will turn your browser into a potato for a few seconds), these are the image urls that have more than 1000 occurrences in the data set.

How to create (requires unshrtn):

% twarc.py --query "#JeSuisCharlie"
% ~/git/twarc/utils/deduplicate.py JeSuisCharlie-tweets.json > JeSuisCharlie-tweets-deduped.json
% cat JeSuisCharlie-tweets-deduped.json | ~/git/twarc/utils/unshorten.py > JeSuisCharlie-tweets-deduped-unshortened.json
% ~/git/twarc/utils/image_urls.py JeSuisCharlie-tweets-deduped-unshortened.json >| JeSuisCharlie-20150115-image-urls.txt
% cat JeSuisCharlie-20150115-image-urls.txt | sort | uniq -c | sort -rn > JeSuisCharlie-20150115-image-urls-ranked.txt

The ranked url data set can be found here.

11,657 Occurrences
4,764 Occurrences
3,014 Occurrences
2,977 Occurrences
2,840 Occurrences
2,363 Occurrences
2,190 Occurrences
2,015 Occurrences
1,921 Occurrences
1,906 Occurrences
1,832 Occurrences
1,512 Occurrences
1,409 Occurrences
1,348 Occurrences
1,261 Occurrences
1,207 Occurrences
1,152 Occurrences
1,114 Occurrences
1,065 Occurrences
1,055 Occurrences
1,047 Occurrences


Preliminary stats of #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, #CharlieHebdo


$ wc -l *json
    148479 %23JeSuisAhmed-20150109103430.json
     94874 %23JeSuisAhmed-20150109141746.json
      5885 %23JeSuisAhmed-20150112092647.json
    249238 total
$ du -h
2.7G    .


$ wc -l *json
    3894191 %23JeSuisCharlie-20150109094220.json
    1758849 %23JeSuisCharlie-20150109141730.json
     226784 %23JeSuisCharlie-20150112092710.json
         15 %23JeSuisCharlie-20150112092734.json
    5879839 total
$ du -h
32G .


$ wc -l *json
    23694 %23JeSuisJuif-20150109172957.json
    50603 %23JeSuisJuif-20150109173104.json
     5941 %23JeSuisJuif-20150110003450.json
    42237 %23JeSuisJuif-20150112094500.json
     5064 %23JeSuisJuif-20150112094648.json
   127539 total
$ du -h
671M    .


$ wc -l *json
    4444585 %23CharlieHebdo-20150109172713.json
        108 %23CharlieHebdo-20150109172825.json
    1164717 %23CharlieHebdo-20150109172844.json
    1068074 %23CharlieHebdo-20150112094427.json
      69446 %23CharlieHebdo-20150112094446.json
     185263 %23CharlieHebdo-20150112155558.json
    6932193 total
$ du -h
39G     .


Preliminary and non-deduped, we're looking at roughly 74.4G of data and 13,188,809 tweets after 5.5 days of capturing the four hashtags.
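That tweet count is just the sum of the four wc -l totals above:

```python
# Per-hashtag line counts from the wc -l output above:
# JeSuisAhmed, JeSuisCharlie, JeSuisJuif, CharlieHebdo.
totals = [249_238, 5_879_839, 127_539, 6_932_193]
print(sum(totals))  # 13188809
```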

Preliminary look at 3,893,553 #JeSuisCharlie tweets


Last Friday (January 9, 2015) I started capturing #JeSuisAhmed, #JeSuisCharlie, #JeSuisJuif, and #CharlieHebdo with Ed Summers' twarc. I have about 12 million tweets at the time of writing this, and plan on writing up something a little bit more in-depth in the coming weeks. But for now, some preliminary analysis of #JeSuisCharlie, and if you haven't seen these two posts ("A Ferguson Twitter Archive", "On Forgetting and hydration") by Ed Summers, please do check them out.

How fast were the tweets coming in? Just to try and get a sense of this, I did a quick recording of tailing the twarc log for the #JeSuisCharlie capture.


If you checked out both of Ed's posts, you'll have noticed that the Twitter ToS forbid the distribution of tweets, but we can distribute the tweet ids, and from those we can "rehydrate" the data set locally. The tweet ids for each hashtag will be/are available here. I'll update and release the tweet id files as I can.

We're looking at just around 12 million tweets (un-deduped) at the time of writing, so the hydration process will take some time. I'd highly suggest using GNU Screen or tmux.


  • #JeSuisCharlie: % twarc.py --hydrate %23JeSuisCharlie-ids-20150112.txt > %23JeSuisCharlie-tweets-20150112.json
  • #JeSuisAhmed: % twarc.py --hydrate %23JeSuisAhmed-ids-20150112.txt > %23JeSuisAhmed-tweets-20150112.json
  • #JeSuisJuif: % twarc.py --hydrate %23JeSuisJuif-ids-20150112.txt > %23JeSuisJuif-tweets-20150112.json
  • #CharlieHebdo: % twarc.py --hydrate %23CharlieHebdo-ids-20150112.txt > %23CharlieHebdo-tweets-20150112.json


#JeSuisCharlie tweets with geo coordinates.

In this data set, we have 51,942 tweets with geo coordinates available. This represents about 1.33% of the entire data set (3,893,553 tweets).
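If you want to reproduce that count on your own hydrated capture, a rough sketch (filename illustrative; this checks the nullable `coordinates` field of the v1.1 tweet JSON):

```shell
# Sketch: count tweets whose "coordinates" field is non-null in a
# twarc JSON-lines file (filename illustrative).
python3 -c '
import sys, json
total = geo = 0
for line in sys.stdin:
    tweet = json.loads(line)
    total += 1
    if tweet.get("coordinates"):
        geo += 1
print("%d/%d (%.2f%%)" % (geo, total, 100.0 * geo / total))
' < %23JeSuisCharlie-cat-20150115-tweets-deduped.json
```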

How do you make this?

  • Create the geojson % ~/git/twarc/utils/geojson.py %23JeSuisCharlie-cat-20150115-tweets-deduped.json > %23JeSuisCharlie-cat-20150115-tweets-deduped.geojson

  • Give the geojson a variable name.

  • Use Leaflet.js to put all the tweets with geo coordinates on a map like this.
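The middle step (giving the geojson a variable name so a plain script tag can load it and hand it to Leaflet) can be as simple as this sketch, where the output filename and the `tweets` variable name are illustrative:

```shell
# Sketch: wrap the geojson in a JavaScript assignment so a page can
# load it via <script src="tweets.js"> and pass it to L.geoJson().
printf 'var tweets = ' > tweets.js
cat %23JeSuisCharlie-cat-20150115-tweets-deduped.geojson >> tweets.js
printf ';\n' >> tweets.js
```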

Top URLs

Top 10 URLs tweeted from #JeSuisCharlie.

  1. (11220) http://www.newyorker.com/culture/culture-desk/cover-story-2015-01-19?mbid=social_twitter
  2. (2278) http://www.europe1.fr/direct-video
  3. (1615) https://www.youtube.com/watch?v=4KBdnOrTdMI&feature=youtu.be
  4. (1347) https://www.youtube.com/watch?v=-bjbUg9d64g&feature=youtu.be
  5. (1333) http://www.amazon.com/Charlie-Hebdo/dp/B00007LMFU/
  6. (977) http://www.clubic.com/internet/actualite-748637-opcharliehebdo-anonymous-vengeance.html
  7. (934) http://www.maryam-rajavi.com/en/index.php?option=com_content&view=article&id=1735&catid=159&Itemid=506
  8. (810) http://www.lequipe.fr/eStore/Offres/Achat/271918
  9. (771) http://srogers.cartodb.com/viz/123be814-96bb-11e4-aec1-0e9d821ea90d/embed_map
  10. (605) https://www.youtube.com/watch?v=et4fYWKjP_o

The full list of URLs can be found here.

How do you get the list?

  • % cat %23JeSuisCharlie-cat-20150115-tweets-deduped.json | ~/git/twarc/utils/unshorten.py > %23JeSuisCharlie-cat-20150115-tweets-deduped-unshortened.json
  • % cat %23JeSuisCharlie-cat-20150115-tweets-deduped-unshortened.json | ~/git/twarc/utils/urls.py| sort | uniq -c | sort -n > %23JeSuisCharlie-cat-20150115-urls.txt
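The `sort | uniq -c | sort -n` tail of that pipeline is what produces the ranked counts: `uniq -c` needs sorted input to group repeats, and the final `sort -n` orders by count. On toy URLs it behaves like this:

```shell
# Toy demonstration of the counting stage: sort groups the repeats,
# uniq -c prefixes each unique line with its count, and sort -n
# orders the result numerically (most frequent last).
printf 'http://a\nhttp://b\nhttp://a\nhttp://a\n' \
  | sort | uniq -c | sort -n
```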

Twitter Clients

Top 10 Twitter clients used from #JeSuisCharlie.

  1. (1283521) Twitter for iPhone
  2. (951925) Twitter Web Client
  3. (847308) Twitter for Android
  4. (231713) Twitter for iPad
  5. (86209) TweetDeck
  6. (82616) Twitter for Windows Phone
  7. (70286) Twitter for Android Tablets
  8. (44189) Twitter for Websites
  9. (39174) Instagram
  10. (21424) Mobile Web (M5)

Full list of clients can be found here.

How do you get this?

  • % ~/git/twarc/utils/source.py %23JeSuisCharlie-cat-20150115-tweets-deduped.json > %23JeSuisCharlie-cat-20150115-tweets-deduped-source.html

Word cloud

Word cloud from #JeSuisCharlie tweets.

I couldn't get the word cloud to embed nicely, so you'll have to check it out here.

How do you create the word cloud?

  • % ~/git/twarc/utils/wordcloud.py %23JeSuisCharlie-cat-20150115-tweets.json > %23JeSuisCharlie-wordcloud.html

Islandora and nginx



I have been doing a fair bit of scale testing for York University Digital Library over the last couple weeks. Most of it has been focused on horizontal scaling of the traditional Islandora stack (Drupal, Fedora Commons, FedoraGSearch, Solr, and aDORe-djatoka). The stack is traditionally run with Apache2 in front of it, reverse proxying the parts of the stack that are Tomcat webapps. I was curious whether the stack would work with nginx, and whether I would get any noticeable improvements just by switching from Apache2 to nginx. The preliminary good news is that the stack works with nginx (I'll outline the config below). The unsurprising news, according to this, is that I'm not seeing any noticeable improvements. If time permits, I'll do some real benchmarking.

Islandora nginx configurations

Having no experience with nginx, I started searching around and found a config by David StClair that worked. With a few slight modifications, I was able to get the stack up and running with no major issues. The only major item I needed to figure out was reverse proxying aDORe-djatoka so that it would play nice with the default settings for Islandora OpenSeadragon. All this turned out to be was figuring out the nginx equivalents of the ProxyPass and ProxyPassReverse directives, which is very straightforward. With Apache2, we needed:

    #Fedora Commons/Islandora proxying
    ProxyRequests Off
    ProxyPreserveHost On
    <Proxy *>
      Order deny,allow
      Allow from all
    </Proxy>
    ProxyPass /adore-djatoka http://digital.library.yorku.ca:8080/adore-djatoka
    ProxyPassReverse /adore-djatoka http://digital.library.yorku.ca:8080/adore-djatoka

This gives us a nice dog in a hat with Apache2.

With nginx we use the proxy_redirect directive.

  server {
    location /adore-djatoka {
      proxy_pass http://localhost:8080/adore-djatoka;
      proxy_redirect http://localhost:8080/adore-djatoka /adore-djatoka;
    }
  }

This gives us a nice dog in a hat with nginx.

That's really only the major modification that I had to make to get the stack running with nginx. Here is my config adapted from David StClair's example.

  server {

        server_name kappa.library.yorku.ca;
        root /path/to/drupal/install; ## <-- Your only path reference.

        # Enable compression, this will help if you have for instance advagg module
        # by serving Gzip versions of the files.
        gzip_static on;

        location = /favicon.ico {
                log_not_found off;
                access_log off;
        }

        location = /robots.txt {
                allow all;
                log_not_found off;
                access_log off;
        }

        # Very rarely should these ever be accessed outside of your lan
        location ~* \.(txt|log)$ {
                deny all;
        }

        location ~ \..*/.*\.php$ {
                return 403;
        }

        # No no for private
        location ~ ^/sites/.*/private/ {
                return 403;
        }

        # Block access to "hidden" files and directories whose names begin with a
        # period. This includes directories used by version control systems such
        # as Subversion or Git to store control files.
        location ~ (^|/)\. {
                return 403;
        }

        location / {
                # This is cool because no php is touched for static content
                try_files $uri @rewrite;
                proxy_read_timeout 300;
        }

        location /adore-djatoka {
                proxy_pass http://localhost:8080/adore-djatoka;
                proxy_redirect http://localhost:8080/adore-djatoka /adore-djatoka;
        }

        location @rewrite {
                # You have 2 options here
                # For D7 and above:
                # Clean URLs are handled in drupal_environment_initialize().
                rewrite ^ /index.php;
                # For Drupal 6 and below:
                # Some modules enforce no slash (/) at the end of the URL
                # Else this rewrite block wouldn't be needed (GlobalRedirect)
                #rewrite ^/(.*)$ /index.php?q=$1;
        }

        # For Munin
        location /nginx_status {
                stub_status on;
                access_log off;
                deny all;
        }

        location ~ \.php$ {
                fastcgi_split_path_info ^(.+\.php)(/.+)$;
                #NOTE: You should have "cgi.fix_pathinfo = 0;" in php.ini
                include fastcgi_params;
                fastcgi_param SCRIPT_FILENAME $request_filename;
                fastcgi_intercept_errors on;
        }

        # Fighting with Styles? This little gem is amazing.
        # This is for D7 and D8
        location ~ ^/sites/.*/files/styles/ {
                try_files $uri @rewrite;
        }

        location ~* \.(js|css|png|jpg|jpeg|gif|ico)$ {
                expires max;
                log_not_found off;