#JeSuisCharlie #JeSuisAhmed #JeSuisJuif #CharlieHebdo
I’ve spent the better part of a month collecting tweets from the #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, and #CharlieHebdo hashtags. Last week, I pulled together all of the collection files, did some cleanup, and ran some more analysis on the data set (76G of JSON!). This time I was able to take advantage of Peter Binkley’s twarc-report project. According to the report, the earliest tweet in the data set is from 2015-01-07 11:59:12 UTC, and the last tweet in the data set is from 2015-01-28 18:15:35 UTC. This data set includes 13,968,293 tweets (10,589,910 of them retweets, or 75.81%) from 3,343,319 different users over 21 days. You can check out a word cloud of all the tweets here.
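If you want to sanity check totals like these on your own copy of the data, here is a minimal sketch. It is not how twarc-report computes its numbers; it assumes the file holds one tweet's JSON per line and that retweets carry a `retweeted_status` key. The function name `summarize` is made up for this example.

```python
import json


def summarize(lines):
    """Count tweets, retweets, and distinct users in line-oriented tweet JSON."""
    total = 0
    retweets = 0
    users = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        total += 1
        # Assumption: a retweet is any tweet that embeds the original status.
        if "retweeted_status" in tweet:
            retweets += 1
        users.add(tweet["user"]["id_str"])
    pct = 100.0 * retweets / total if total else 0.0
    return {
        "tweets": total,
        "retweets": retweets,
        "retweet_pct": round(pct, 2),
        "users": len(users),
    }
```

Pass it an open file handle, for example `summarize(open("tweets.json"))`, so the 76G file is read line by line rather than loaded into memory.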
First tweet in data set (numeric sort of tweet ids):
#JESUISCHARLIE pic.twitter.com/4fkcjH0yaz
— Thierry Puget (@titi1960) January 7, 2015
Hydration
If you want to experiment/follow along with what I’ve done here, you can “rehydrate” the data set with twarc. You can grab the Tweet ids for the data set from here (Data & Analysis tab).
% twarc.py --hydrate JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweet-ids-20150129.txt > JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json
The hydration process will take some time. I’d highly suggest using GNU Screen or tmux, and grabbing approximately 15 pots of coffee.
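How long is "some time"? Here is a rough back-of-the-envelope sketch. It assumes hydration is done in batches of 100 ids per request and is limited to 180 requests per 15-minute window. Both numbers are my assumptions about the API limits, not something twarc reports, so adjust them to whatever applies to your credentials.

```python
import math


def estimate_hydration_hours(n_ids, batch_size=100,
                             requests_per_window=180, window_minutes=15):
    """Estimate wall-clock hours to hydrate n_ids under a fixed rate limit.

    batch_size, requests_per_window and window_minutes are assumed values.
    """
    requests = math.ceil(n_ids / batch_size)   # one request per batch of ids
    windows = requests / requests_per_window   # rate-limit windows needed
    return windows * window_minutes / 60


if __name__ == "__main__":
    hours = estimate_hydration_hours(13968293)
    print("%.0f hours, about %.1f days" % (hours, hours / 24))
```

Under those assumed limits, the full id list works out to roughly eight days of continuous running, which is why a detachable terminal session is worth the trouble.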
Map
In this data set, we have 133,970 tweets with geo coordinates available. This represents about 0.96% of the entire data set.
The map is available here in a separate page since the geojson file is 83M and will potato your browser while everything loads. If anybody knows how to stream that geojson file to Leaflet.js so the browser doesn’t potato, please comment! :-)
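For anyone who wants to rebuild a file like this from the hydrated tweets, here is a minimal sketch. It is not necessarily how the map above was produced. It assumes each geotagged tweet carries a `coordinates` object that is already a GeoJSON Point (longitude first), and the `properties` kept on each feature are an arbitrary choice for illustration.

```python
import json


def tweets_to_geojson(lines):
    """Build a GeoJSON FeatureCollection from line-oriented tweet JSON."""
    features = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        point = tweet.get("coordinates")
        if not point:
            continue  # most tweets have no geo coordinates
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point",
                         "coordinates": point["coordinates"]},
            # Keeping properties small helps keep the output file small.
            "properties": {"id": tweet["id_str"],
                           "screen_name": tweet["user"]["screen_name"]},
        })
    return {"type": "FeatureCollection", "features": features}
```

Writing the result with `json.dump(tweets_to_geojson(open("tweets.json")), out)` gives a single file; trimming `properties` further is one cheap way to shrink it before it reaches the browser.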
Users
These are the top 10 users in the data set.
- 35,420 tweets Promo_Culturel
- 33,075 tweets BotCharlie
- 24,251 tweets YaMeCanse21
- 23,126 tweets yakacliquer
- 17,576 tweets YaMeCanse20
- 15,315 tweets iS_Angry_Bird
- 9,615 tweets AbraIsacJac
- 9,318 tweets AnAnnoyingTweep
- 3,967 tweets rightnowio_feed
- 3,514 tweets russfeed
This comes from twarc-report’s reportprofiler.py.
$ ~/git/twarc-report/reportprofiler.py -o text JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json
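If you would rather not install twarc-report, a similar ranking can be sketched with the standard library alone. This is an approximation of the idea, not the profiler's actual code; it assumes one tweet per line and uses the `user.screen_name` field. `top_users` is a name made up for this example.

```python
import json
from collections import Counter


def top_users(lines, n=10):
    """Return the n most frequent screen names as (name, count) pairs."""
    counts = Counter()
    for line in lines:
        if line.strip():
            tweet = json.loads(line)
            counts[tweet["user"]["screen_name"]] += 1
    return counts.most_common(n)
```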
Hashtags
These are the top 10 hashtags in the data set.
- 8,597,175 tweets #charliehebdo
- 7,911,343 tweets #jesuischarlie
- 377,041 tweets #jesuisahmed
- 264,869 tweets #paris
- 186,976 tweets #france
- 177,448 tweets #parisshooting
- 141,993 tweets #jesuisjuif
- 140,539 tweets #marcherepublicaine
- 129,484 tweets #noussommescharlie
- 128,529 tweets #afp
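The hashtag counts above are per tweet and case-insensitive. A minimal sketch of that kind of count follows. It assumes hashtags live under `entities.hashtags[].text` and that a tag repeated inside one tweet should count once; that second point is my assumption, not something the report states.

```python
import json
from collections import Counter


def top_hashtags(lines, n=10):
    """Count how many tweets use each hashtag, ignoring case."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        tweet = json.loads(line)
        hashtags = tweet.get("entities", {}).get("hashtags", [])
        # A set removes duplicates so each tweet counts once per tag.
        tags = {h["text"].lower() for h in hashtags}
        counts.update(tags)
    return counts.most_common(n)
```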
URLs
These are the top 10 URLs in the data set. 3,771,042 tweets (27.00%) had a URL associated with them.
These are all shortened URLs. I’m working through an issue with unshorten.py.
- http://bbc.in/1xPaVhN (43,708)
- http://bit.ly/1AEpWnE (19,328)
- http://bit.ly/1DEm0TK (17,033)
- http://nyr.kr/14AeVIi (14,118)
- http://youtu.be/4KBdnOrTdMI (13,252)
- http://bbc.in/14ulyLt (12,407)
- http://europe1.fr/direct-video (9,228)
- http://bbc.in/1DxNLQD (9,044)
- http://ind.pn/1s5EV8w (8,721)
- http://srogers.cartodb.com/viz/123be814-96bb-11e4-aec1-0e9d821ea90d/embed_map (8,581)
This comes from twarc-report’s reportprofiler.py.
$ ~/git/twarc-report/reportprofiler.py -o text JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json
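To reproduce both the share of tweets carrying a link and the ranked list, here is a sketch. It assumes links sit under `entities.urls[].expanded_url`, and that media links sit under `entities.media[].media_url` if you pass those names instead. The function name and its parameters are made up for this example and are not part of twarc-report.

```python
import json
from collections import Counter


def entity_share(lines, entity="urls", field="expanded_url", n=10):
    """Return (percent of tweets with the entity, top n values by count)."""
    total = 0
    with_entity = 0
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        tweet = json.loads(line)
        total += 1
        items = tweet.get("entities", {}).get(entity, [])
        values = [item[field] for item in items if item.get(field)]
        if values:
            with_entity += 1
        counts.update(values)  # counts every occurrence, not every tweet
    pct = round(100.0 * with_entity / total, 2) if total else 0.0
    return pct, counts.most_common(n)
```

Calling `entity_share(f, "media", "media_url")` on the same file gives the equivalent figures for media.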
Media
These are the top 10 media urls in the data set. 8,141,552 tweets (58.29%) had a media URL associated with them.
36,753 Occurrences
35,942 Occurrences
33,501 Occurrences
31,712 Occurrences
29,359 Occurrences
26,334 Occurrences
25,989 Occurrences
23,974 Occurrences
22,659 Occurrences
22,421 Occurrences
This comes from twarc-report’s reportprofiler.py.
$ ~/git/twarc-report/reportprofiler.py -o text JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json