A look at 14,939,154 paris Bataclan parisattacks porteouverte tweets

Dec 12, 2015 6 min read

On November 13, 2015 I was at the “Web Archives 2015: Capture, Curate, Analyze” listening to Ian Milligan give the closing keynote when Thomas Padilla tweeted the following to me:

@ruebot terrible news, possible charlie hebdo connection - https://t.co/SkEusgqgz5
— Thomas Padilla (@thomasgpadilla) November 13, 2015

I immediately started collecting.

When tragedies like this happen, I feel pretty powerless. But, I figure if I can collect something like this, similar to what I did for the Charlie Hebdo attacks, it’s something. Maybe these datasets can be used for something positive that happened out of all this negative.

When I started collecting, it just so happened that the creator of twarc, Ed Summers, was sitting next to me, and he mentioned some new functionality that was part of the v0.4.0 release of twarc; Added --warnings flag to log warnings from the Twitter API about dropped tweets during streaming.

What’s that mean? Basically, the public Stream API will not stream more the 1% of the total Twitter stream. If you are trying to capture something from the streaming API that exceeds 1% of the total Twitter stream, like for instance a hashtag or two related to a terrorist attack, the streaming API will drop tweets, and notify that it has done so. There is a really interesting look at this by Kevin Driscoll, Shawn Walker in the International Journal of Communication.

Ed fired up the new version of twarc and began streaming as well so we could see what was happening. We noticed that we were getting warnings of around 400 dropped tweets every request (seconds), then it quickly escalated up to over 28k dropped tweets every request. What were were trying to collect was over 1% of the total Twitter stream.

Dataset

Collection started on November 13, 2015 using both the streaming and search API. This is what it looked like:

$ twarc.py --search "#paris OR #Bataclan OR #parisattacks OR #porteouverte" > paris-search.json
$ twarc.py --stream "#paris,#Bataclan,#parisattacks,#porteouverte" > paris-stream.json

I took the strategy of utilizing both the search and streaming API for collection due to what was noted above about hitting the 1% of the total Twitter stream limit. The idea was that if I’m hitting the limit with stream, theoretically I should be able to capture any tweets that were dropped with the search API. The stream API collection ran continuously during the collection period from November 13, 2015 to December 11, 2015. The search API collection was run, then once finished, immediately started back up over the collection period. During the first two weeks of collection, the search API collection would take about a week to finish. In recollection, I should have made note of the exact times it took to collect to get some more numbers to look at. That said, I’m not confident I was able to grab every tweet related to the hashtags I was collecting on. The only way, I think, I can be confident is by comparing this dataset with a dataset from Gnip. But, I am confident that I have a large amount of what was tweeted.

Once I finished collecting, I combined the json files, and deduplicated with deduplicate.py, and then created a list of tweet ids with ids.py.

If you want to follow along or do your own analysis with the dataset, you can “hydrate” the dataset with twarc. You can grab the Tweet ids for the dataset from here (Data & Analysis tab).

$ twarc.py --hydrate paris-tweet-ids.txt > paris-tweets.json

The hydration process will take some time; 72,000 tweets/hour. You might want to use something along the lines of GNU Screen, tmux, or nohup since it’ll take about 207.49 hours to completely hydrate.

created with Peter Binkley’s twarc-report

Overview

I’m only going to do a quick analysis of the dataset here since I want to get the dataset out, and allow others to work with it. Tweets with geocoordinates is not covered below, but you can check out a map of tweets here.

There are a number of friendly utilities that come with twarc that allow for a quick exploratory analysis of a given collection. In addition, Peter Binkley’s twarc-repot is pretty handy for providing a quick overview of a given dataset.

Users

We are able to create a list of the unique Twitter username names in the dataset by using users.py, and additionally sort them by the number of tweets:

$ python ~/git/twarc/utils/users.py paris-valid-deduplicated.json > paris-users.txt
$ cat paris-users.txt | sort | uniq -c | sort -n > paris-users-unique-sorted-count.txt
$ cat paris-users-unique-sorted-count.txt | wc -l
$ tail paris-users-unique-sorted-count.txt

From the above, we can see that there are 4,636,584 unique users in the dataset, and the top 10 accounts were as follows:


1.	38,883 tweets	RelaxInParis
2.	36,504 tweets	FrancePeace
3.	12,697 tweets	FollowParisNews
4.	12,656 tweets	Reduction_Paris
5.	10,044 tweets	CNNsWorld
6.	8,208 tweets	parisevent
7.	7.296 tweets	TheMalyck_
8.	6,654 tweets	genx_hrd
9.	6,370 tweets	DHEdomains
10.	4,498 tweets	paris_attack

Retweets

We are able to create a lit of the most retweeted tweets in the dataset by using retweets.py:

$ python ~/git/twarc/utils/retweets.py paris-valid-deduplicated.json > paris-retweets.json
$ python ~/git/twarc/utils/tweet_urls.py paris-retweets.json > paris-retweets.txt


1.	53,639 retweets	https://twitter.com/PNationale/status/665939383418273793
2.	44,457 retweets	https://twitter.com/MarkRuffalo/status/665329805206900736
3.	41,400 retweets	https://twitter.com/NiallOfficial/status/328827440157839361
4.	39,140 retweets	https://twitter.com/oreoxzhel/status/665499107021066240
5.	37,214 retweets	https://twitter.com/piersmorgan/status/665314980095356928
6.	24,955 retweets	https://twitter.com/Fascinatingpics/status/665458581832077312
7.	22,124 retweets	https://twitter.com/RGerrardActor/status/665325168953167873
8.	22,113 retweets	https://twitter.com/HeralddeParis/status/665327408803741696
9.	22,069 retweets	https://twitter.com/Gabriele_Corno/status/484640360120209408
10.	21,401 retweets	https://twitter.com/SarahMatt97/status/665383304787529729

Hashtags

We were able to create a list of the unique tags using in our dataset by using tags.py.

$ python ~/git/twarc/utils/tags.py paris-valid-deduplicated.json > paris-hashtags.txt
$ cat paris-hashtags.txt | wc -l
$ head elxn42-tweet-tags.txt

From the above, we can see that there were 26,8974 unique hashtags were used. The top 10 hashtags used in the dataset were:


1.	6,812,941 tweets	#parisattacks
2.	6,119,933 tweets	#paris
3.	1,100,809 tweets	#bataclan
4.	887,144 tweets	#porteouverte
5.	673,543 tweets	#prayforparis
6.	444,486 tweets	#rechercheparis
7.	427,999 tweets	#parís
8.	387,699 tweets	#france
9.	341,059 tweets	#fusillade
10.	303,410 tweets	#isis

URLs

We are able to create a list of the unique URLs tweeted in our dataset by using urls.py, after first unshortening the urls with unshorten.py and unshrtn.

$ python ~/git/twarc/utils/urls.py paris-valid-deduplicated-unshortened.json > paris-tweets-urls.txt
$ cat paris-tweets-urls.txt | sort | uniq -c | sort -n > paris-tweets-urls-uniq.txt
$ cat paris-tweets-urls.txt | wc -l
$ cat paris-tweets-urls-uniq.txt | wc -l
$ tail paris-tweets-urls-uniq.txt

From the above, we can see that there were 5,561,037 URLs tweeted, representing 37.22% of total tweets, and 858,401 unique URLs tweeted. The top 10 URLs tweeted were as follows:


1.	46,034 tweets	http://www.bbc.co.uk/news/live/world-europe-34815972?ns_mchannel=social&ns_campaign=bbc_breaking&ns_source=twitter&ns_linkname=news_central
2.	46,005 tweets	https://twitter.com/account/suspended
3.	37,509 tweets	http://www.lefigaro.fr/actualites/2015/11/13/01001-20151113LIVWWW00406-fusillade-paris-explosions-stade-de-france.php#xtor=AL-155-
4.	35,882 tweets	http://twibbon.com/support/prayforparis-2/twitter
5.	33,531 tweets	http://www.bbc.co.uk/news/live/world-europe-34815972
6.	33,039 tweets	https://www.rt.com/news/321883-shooting-paris-dead-masked/
7.	24,221 tweets	https://www.youtube.com/watch?v=-Uo6ZB0zrTQ
8.	23,536 tweets	http://www.bbc.co.uk/news/live/world-europe-34825270
9.	21,237 tweets	https://amp.twimg.com/v/fc122aff-6ece-47a4-b34c-cafbd72ef386
10.	21,107 tweets	http://live.reuters.com/Event/Paris_attacks_2?Page=0

Images

We are able to create a list of images tweeted in our dataset by using image_urls.py.

$ python ~/git/twarc/utils/image_urls.py paris-valid-deduplicated.json > paris-tweets-images.txt
$ cat paris-tweets-images.txt | sort | uniq -c | sort -n > paris-tweets-images-uniq.txt
$ cat paris-tweets-images-uniq.txt | wc -l
$ tail paris-tweets-images-uniq.txt

From the above, we can see that there were 6,872,441 total images tweets, representing 46.00% of total tweets, and 660,470 unique images. The top 10 images tweeted were as follows: