I immediately started collecting.
When tragedies like this happen, I feel pretty powerless. But I figure if I can collect something like this, similar to what I did for the Charlie Hebdo attacks, it’s something. Maybe these datasets can be put to some positive use that comes out of all this negative.
When I started collecting, it just so happened that the creator of twarc, Ed Summers, was sitting next to me, and he mentioned some new functionality that was part of the v0.4.0 release of twarc: “Added --warnings flag to log warnings from the Twitter API about dropped tweets during streaming.”
What’s that mean? Basically, the public streaming API will not stream more than 1% of the total Twitter stream. If you are trying to capture something that exceeds 1% of the total Twitter stream, like, for instance, a hashtag or two related to a terrorist attack, the streaming API will drop tweets and notify you that it has done so. There is a really interesting look at this by Kevin Driscoll and Shawn Walker in the International Journal of Communication.
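As an aside, if you capture the raw stream yourself (twarc’s --warnings flag surfaces these in its log), the dropped-tweet notices are “limit” messages interleaved with the tweets. Here is a rough sketch of tallying them from a line-oriented JSON file of raw streaming output — the function name and filename are just illustrative:

```python
import json

# Tally dropped-tweet ("limit") notices in raw streaming API output.
# A limit notice looks like:
#   {"limit": {"track": 1234, "timestamp_ms": "..."}}
# where "track" is the cumulative number of undelivered tweets since
# the connection was opened, so the largest value seen is the total.

def dropped_tweet_count(path):
    last_track = 0
    for line in open(path):
        try:
            msg = json.loads(line)
        except ValueError:
            continue  # skip blank or partial lines
        if "limit" in msg:
            last_track = max(last_track, msg["limit"]["track"])
    return last_track
```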
Ed fired up the new version of twarc and began streaming as well so we could see what was happening. We noticed warnings of around 400 dropped tweets per request, which quickly escalated to over 28k dropped tweets per request. What we were trying to collect exceeded 1% of the total Twitter stream.
Collection started on November 13, 2015 using both the streaming and search API. This is what it looked like:
$ twarc.py --search "#paris OR #Bataclan OR #parisattacks OR #porteouverte" > paris-search.json
$ twarc.py --stream "#paris,#Bataclan,#parisattacks,#porteouverte" > paris-stream.json
I took the strategy of utilizing both the search and streaming APIs for collection because of the 1% limit noted above. The idea was that if I was hitting the limit with the streaming API, I should theoretically be able to capture any dropped tweets with the search API. The streaming collection ran continuously during the collection period, from November 13, 2015 to December 11, 2015. The search collection was run, and once it finished, immediately started back up, over the same period. During the first two weeks of collection, each search run took about a week to finish. In retrospect, I should have noted the exact times each run took, to have some more numbers to look at. That said, I’m not confident I was able to grab every tweet related to the hashtags I was collecting on. The only way I could be confident, I think, is by comparing this dataset with a dataset from Gnip. But I am confident that I have a large amount of what was tweeted.
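The merge itself is simple enough to sketch: read both harvests, keep the first copy of each tweet id, and skip anything that isn’t a tweet (like limit notices). This is a hypothetical helper, not part of twarc — only the filenames in the usage comment match the commands above:

```python
import json

# Merge the search and stream harvests, dropping duplicate tweets
# keyed on the tweet id. Non-tweet messages (e.g. limit notices)
# have no "id_str" and are skipped.

def merge_deduplicate(paths, out_path):
    seen = set()
    kept = 0
    with open(out_path, "w") as out:
        for path in paths:
            for line in open(path):
                tweet = json.loads(line)
                if "id_str" not in tweet:
                    continue
                if tweet["id_str"] in seen:
                    continue
                seen.add(tweet["id_str"])
                out.write(line)
                kept += 1
    return kept

# e.g. merge_deduplicate(["paris-search.json", "paris-stream.json"],
#                        "paris-valid-deduplicated.json")
```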
If you want to follow along or do your own analysis with the dataset, you can “hydrate” the dataset with twarc. You can grab the Tweet ids for the dataset from here (Data & Analysis tab).
$ twarc.py --hydrate paris-tweet-ids.txt > paris-tweets.json
created with Peter Binkley’s twarc-report
I’m only going to do a quick analysis of the dataset here, since I want to get the dataset out and allow others to work with it. Tweets with geocoordinates are not covered below, but you can check out a map of tweets here.
There are a number of friendly utilities that come with twarc that allow for a quick exploratory analysis of a given collection. In addition, Peter Binkley’s twarc-report is pretty handy for providing a quick overview of a given dataset.
We are able to create a list of the unique Twitter usernames in the dataset by using users.py, and additionally sort them by the number of tweets:
$ python ~/git/twarc/utils/users.py paris-valid-deduplicated.json > paris-users.txt
$ cat paris-users.txt | sort | uniq -c | sort -n > paris-users-unique-sorted-count.txt
$ cat paris-users-unique-sorted-count.txt | wc -l
$ tail paris-users-unique-sorted-count.txt
From the above, we can see that there are 4,636,584 unique users in the dataset, and the top 10 accounts were as follows:
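The users.py / sort / uniq pipeline above can also be sketched in a few lines of plain Python, which is handy if you want the counts in one pass — a rough equivalent, not one of the twarc utilities:

```python
import json
from collections import Counter

# Tally tweets per screen name in one pass over the dataset.

def top_users(path, n=10):
    counts = Counter()
    for line in open(path):
        tweet = json.loads(line)
        if "user" in tweet:
            counts[tweet["user"]["screen_name"]] += 1
    return counts.most_common(n)

# e.g. top_users("paris-valid-deduplicated.json")
```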
We are able to create a list of the most retweeted tweets in the dataset by using retweets.py:
$ python ~/git/twarc/utils/retweets.py paris-valid-deduplicated.json > paris-retweets.json
$ python ~/git/twarc/utils/tweet_urls.py paris-retweets.json > paris-retweets.txt
We are able to create a list of the unique hashtags used in our dataset by using tags.py:
$ python ~/git/twarc/utils/tags.py paris-valid-deduplicated.json > paris-hashtags.txt
$ cat paris-hashtags.txt | wc -l
$ head paris-hashtags.txt
From the above, we can see that 268,974 unique hashtags were used. The top 10 hashtags used in the dataset were:
We are able to create a list of the URLs tweeted in our dataset by using urls.py:

$ python ~/git/twarc/utils/urls.py paris-valid-deduplicated-unshortened.json > paris-tweets-urls.txt
$ cat paris-tweets-urls.txt | sort | uniq -c | sort -n > paris-tweets-urls-uniq.txt
$ cat paris-tweets-urls.txt | wc -l
$ cat paris-tweets-urls-uniq.txt | wc -l
$ tail paris-tweets-urls-uniq.txt
From the above, we can see that there were 5,561,037 URLs tweeted, representing 37.22% of total tweets, and 858,401 unique URLs tweeted. The top 10 URLs tweeted were as follows:
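For what it’s worth, the URL figures above can also be derived in a single pass by walking the URL entities on each tweet. This is a sketch, assuming the standard entities.urls structure in the tweet JSON; the function name is illustrative:

```python
import json

# Count URL entities across the dataset: total URLs, unique URLs,
# and URLs as a percentage of total tweets.

def url_stats(path):
    tweets = 0
    urls = []
    for line in open(path):
        tweet = json.loads(line)
        if "id_str" not in tweet:
            continue  # skip non-tweet messages
        tweets += 1
        for u in tweet.get("entities", {}).get("urls", []):
            urls.append(u.get("expanded_url") or u.get("url"))
    share = 100.0 * len(urls) / tweets if tweets else 0.0
    return len(urls), len(set(urls)), share
```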
We are able to create a list of images tweeted in our dataset by using image_urls.py:
$ python ~/git/twarc/utils/image_urls.py paris-valid-deduplicated.json > paris-tweets-images.txt
$ cat paris-tweets-images.txt | sort | uniq -c | sort -n > paris-tweets-images-uniq.txt
$ cat paris-tweets-images-uniq.txt | wc -l
$ tail paris-tweets-images-uniq.txt
From the above, we can see that there were 6,872,441 images tweeted, representing 46.00% of total tweets, and 660,470 unique images. The top 10 images tweeted were as follows:
- 49,051 Occurrences
- 43,348 Occurrences
- 22,615 Occurrences
- 21,325 Occurrences
- 20,689 Occurrences
- 19,696 Occurrences
- 19,597 Occurrences
- 19,096 Occurrences
- 16,772 Occurrences
- 15,364 Occurrences