#JeSuisJuif

An exploratory look at 13,968,293 #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, and #CharlieHebdo tweets

#JeSuisCharlie #JeSuisAhmed #JeSuisJuif #CharlieHebdo

I've spent the better part of a month collecting tweets from the #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, and #CharlieHebdo hashtags. Last week, I pulled together all of the collection files, did some cleanup, and ran some more analysis on the data set (76G of json!). This time I was able to take advantage of Peter Binkley's twarc-report project. According to the report, the earliest tweet in the data set is from 2015-01-07 11:59:12 UTC, and the last tweet in the data set is from 2015-01-28 18:15:35 UTC. This data set includes 13,968,293 tweets (10,589,910 retweets - 75.81%) from 3,343,319 different users over 21 days. You can check out a word cloud of all the tweets here.
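If you want to sanity-check numbers like these yourself, here is a minimal Python sketch. It is not how twarc-report computes its figures; it assumes the data set is line-oriented json (one tweet per line) and that a retweet carries a "retweeted_status" object. The function names read_tweets and summarize are made up for this example.

```python
import json
from typing import Iterable, Iterator


def read_tweets(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one tweet dict per non-empty line of line-oriented json."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(tweets: Iterable[dict]) -> dict:
    """Count tweets, retweets, and distinct users, and track the tweet id range."""
    total = retweets = 0
    users = set()
    first_id = last_id = None
    for tweet in tweets:
        total += 1
        # Assumption: a retweet embeds the original as "retweeted_status".
        if "retweeted_status" in tweet:
            retweets += 1
        users.add(tweet["user"]["id_str"])
        tweet_id = int(tweet["id_str"])  # compare ids numerically, not as strings
        first_id = tweet_id if first_id is None else min(first_id, tweet_id)
        last_id = tweet_id if last_id is None else max(last_id, tweet_id)
    pct = round(100.0 * retweets / total, 2) if total else 0.0
    return {"tweets": total, "retweets": retweets, "retweet_pct": pct,
            "users": len(users), "first_id": first_id, "last_id": last_id}
```

Passing an open file handle to read_tweets and the result to summarize streams the data set line by line, so the 76G file never has to fit in memory (the set of user ids does, though).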

First tweet in data set (numeric sort of tweet ids):


Hydration

If you want to experiment/follow along with what I've done here, you can "rehydrate" the data set with twarc. You can grab the Tweet ids for the data set from here (Data & Analysis tab).

% twarc.py --hydrate JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweet-ids-20150129.txt > JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json

The hydration process will take some time. I'd highly suggest using GNU Screen or tmux, and grabbing approximately 15 pots of coffee.
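Deleted or protected tweets won't come back when you hydrate, so your copy will likely be a bit smaller than mine. A small Python sketch for seeing which ids went missing, assuming the ids file has one id per line and the hydrated file has one json tweet per line (hydrated_ids and missing_ids are hypothetical helper names, not part of twarc):

```python
import json
from typing import Iterable, Set


def hydrated_ids(json_lines: Iterable[str]) -> Set[str]:
    """Collect the id_str of every tweet in line-oriented json."""
    return {json.loads(line)["id_str"] for line in json_lines if line.strip()}


def missing_ids(id_lines: Iterable[str], json_lines: Iterable[str]) -> Set[str]:
    """Return the ids that were requested but are absent from the hydrated output."""
    requested = {line.strip() for line in id_lines if line.strip()}
    return requested - hydrated_ids(json_lines)
```

Both sets live in memory, so for the full 13.9 million ids expect this to need a few gigabytes of RAM.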

Map

In this data set, we have 133,970 tweets with geo coordinates available. This represents about 0.96% of the entire data set.
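For the curious, the share of geotagged tweets can be worked out with something like the Python sketch below. It assumes a geotagged tweet has a non-null "coordinates" or "geo" value in its json; geo_share is a name I made up for the example.

```python
from typing import Iterable, Tuple


def geo_share(tweets: Iterable[dict]) -> Tuple[int, int, float]:
    """Return (tweets with coordinates, total tweets, percentage)."""
    total = with_geo = 0
    for tweet in tweets:
        total += 1
        # Assumption: geotagged tweets have a non-null "coordinates" or "geo" value.
        if tweet.get("coordinates") or tweet.get("geo"):
            with_geo += 1
    percent = round(100.0 * with_geo / total, 2) if total else 0.0
    return with_geo, total, percent
```

With the numbers above, 100 * 133,970 / 13,968,293 rounds to 0.96.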

The map is available here in a separate page since the geojson file is 83M and will potato your browser while everything loads. If anybody knows how to stream that geojson file to Leaflet.js so the browser doesn't potato, please comment! :-)

Users

These are the top 10 users in the data set.

  1. 35,420 tweets Promo_Culturel
  2. 33,075 tweets BotCharlie
  3. 24,251 tweets YaMeCanse21
  4. 23,126 tweets yakacliquer
  5. 17,576 tweets YaMeCanse20
  6. 15,315 tweets iS_Angry_Bird
  7. 9,615 tweets AbraIsacJac
  8. 9,318 tweets AnAnnoyingTweep
  9. 3,967 tweets rightnowio_feed
  10. 3,514 tweets russfeed

This comes from twarc-report's reportprofiler.py.

$ ~/git/twarc-report/reportprofiler.py -o text JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json
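If you just want the top users without running the full report, a rough Python equivalent looks like this. It is a sketch, not twarc-report's code, and assumes each tweet dict has a user.screen_name field; top_users is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_users(tweets: Iterable[dict], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most active screen names with their tweet counts."""
    counts = Counter(tweet["user"]["screen_name"] for tweet in tweets)
    return counts.most_common(n)
```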

Hashtags

These are the top 10 hashtags in the data set.

  1. 8,597,175 tweets #charliehebdo
  2. 7,911,343 tweets #jesuischarlie
  3. 377,041 tweets #jesuisahmed
  4. 264,869 tweets #paris
  5. 186,976 tweets #france
  6. 177,448 tweets #parisshooting
  7. 141,993 tweets #jesuisjuif
  8. 140,539 tweets #marcherepublicaine
  9. 129,484 tweets #noussommescharlie
  10. 128,529 tweets #afp
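The counts above are per tweet and case-insensitive, which is why everything shows up lowercased. A minimal Python sketch of that kind of count, assuming hashtags live under entities.hashtags[].text in each tweet (top_hashtags is a made-up name, and this is not necessarily how twarc-report does it):

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_hashtags(tweets: Iterable[dict], n: int = 10) -> List[Tuple[str, int]]:
    """Count each lowercased hashtag once per tweet and return the top n."""
    counts = Counter()
    for tweet in tweets:
        hashtags = tweet.get("entities", {}).get("hashtags", [])
        # A set keeps a tag repeated within one tweet from being counted twice.
        counts.update({"#" + tag["text"].lower() for tag in hashtags})
    return counts.most_common(n)
```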

URLs

These are the top 10 URLs in the data set. 3,771,042 tweets (27.00%) had a URL associated with them.

These are all shortened urls. I'm working through an issue with unshorten.py.

  1. http://bbc.in/1xPaVhN (43,708)
  2. http://bit.ly/1AEpWnE (19,328)
  3. http://bit.ly/1DEm0TK (17,033)
  4. http://nyr.kr/14AeVIi (14,118)
  5. http://youtu.be/4KBdnOrTdMI (13,252)
  6. http://bbc.in/14ulyLt (12,407)
  7. http://europe1.fr/direct-video (9,228)
  8. http://bbc.in/1DxNLQD (9,044)
  9. http://ind.pn/1s5EV8w (8,721)
  10. http://srogers.cartodb.com/viz/123be814-96bb-11e4-aec1-0e9d821ea90d/embed_map (8,581)

This comes from twarc-report's reportprofiler.py.

$ ~/git/twarc-report/reportprofiler.py -o text JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json

Media

These are the top 10 media urls in the data set. 8,141,552 tweets (58.29%) had a media URL associated with them.

  1. 36,753 occurrences (img)
  2. 35,942 occurrences (img)
  3. 33,501 occurrences (img)
  4. 31,712 occurrences (img)
  5. 29,359 occurrences (img)
  6. 26,334 occurrences (img)
  7. 25,989 occurrences (img)
  8. 23,974 occurrences (img)
  9. 22,659 occurrences (img)
  10. 22,421 occurrences (img)

This comes from twarc-report's reportprofiler.py.

$ ~/git/twarc-report/reportprofiler.py -o text JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json

133,970 #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, and #CharlieHebdo tweets on a map

How do you make this?

  • Create the geojson: % ~/git/twarc/utils/geojson.py JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json > JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-20150129.geojson

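  • Give the geojson a variable name (for example, by prepending something like var tweets = to the file) so the page can load it as a script.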

  • Use Leaflet.js to put all the tweets with geo coordinates on a map like this.
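The first two steps can be sketched in Python roughly as follows. This is not twarc's geojson.py; it is a stand-in that assumes each geotagged tweet has a "coordinates" object that is already a GeoJSON Point in [longitude, latitude] order. The function names and the variable name tweets are hypothetical.

```python
import json
from typing import Iterable


def tweets_to_feature_collection(tweets: Iterable[dict]) -> dict:
    """Build a GeoJSON FeatureCollection from tweets that carry coordinates."""
    features = []
    for tweet in tweets:
        point = tweet.get("coordinates")
        if not point:
            continue  # skip tweets without geo coordinates
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": point["coordinates"]},
            "properties": {
                "id": tweet["id_str"],
                "screen_name": tweet["user"]["screen_name"],
            },
        })
    return {"type": "FeatureCollection", "features": features}


def as_js_variable(collection: dict, name: str = "tweets") -> str:
    """Wrap the GeoJSON in a JavaScript variable assignment."""
    return "var {} = {};".format(name, json.dumps(collection))
```

Writing the output of as_js_variable to a .js file and including it with a script tag makes the variable available to pass to Leaflet's L.geoJson().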

An exploratory look at 257,093 #JeSuisAhmed tweets

#JeSuisAhmed

Had some time last night to do some exploratory analysis on some of the #JeSuisAhmed collection. This analysis runs from the first tweet I was able to harvest from #JeSuisAhmed to some time on January 14, 2015, when I copied over the json to experiment with a few of the twarc utilities.

First tweet in data set:


Last tweet in data set:

Hydration

If you want to experiment/follow along with what I've done here, you can "rehydrate" the data set with twarc. You can grab the Tweet ids for #JeSuisAhmed from here (Data & Analysis tab).

% twarc.py --hydrate JeSuisAhmed-ids-20150113.txt > JeSuisAhmed-tweets-20150113.json

The hydration process will take some time. I'd highly suggest using GNU Screen or tmux, and grabbing a cup of coffee.

Map

#JeSuisAhmed tweets with geo coordinates.

In this data set, we have 2,329 tweets with geo coordinates available. This represents about 0.91% of the entire data set (257,093 tweets).

How do you make this?

  • Create the geojson: % ~/git/twarc/utils/geojson.py JeSuisAhmed-tweets-dedupe-20150112.json > JeSuisAhmed-tweets-dedupe-20150112.geojson

  • Give the geojson a variable name.

  • Use Leaflet.js to put all the tweets with geo coordinates on a map like this.

Images

These are the image urls that have more than 1000 occurrences in the data set.

  1. 13703 occurrences (img)
  2. 10396 occurrences (img)
  3. 6088 occurrences (img)
  4. 4354 occurrences (img)
  5. 3229 occurrences (img)
  6. 3124 occurrences (img)
  7. 2307 occurrences (img)
  8. 2034 occurrences (img)
  9. 1949 occurrences (img)
  10. 1296 occurrences (img)
  11. 1182 occurrences (img)
  12. 1100 occurrences (img)

How do you get the image list (requires unshrtn)?

% ~/git/twarc/utils/image_urls.py JeSuisAhmed-tweets-unshortened-20150112.json > JeSuisAhmed-images-20150112.txt
% cat JeSuisAhmed-images-20150112.txt | sort | uniq -c | sort -rn > JeSuisAhmed-images-ranked-20150112.txt

The ranked url data set can be found here.
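If you'd rather stay in Python than pipe through sort and uniq, the ranking step can be sketched like this. It assumes one image url per line, as in the image_urls.py output above; rank_lines and its more_than parameter are names invented for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_lines(lines: Iterable[str], more_than: int = 0) -> List[Tuple[str, int]]:
    """Rank distinct lines by frequency, like sort | uniq -c | sort -rn."""
    counts = Counter(line.strip() for line in lines if line.strip())
    # Highest count first; ties are broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(value, count) for value, count in ranked if count > more_than]
```

Calling rank_lines(open("JeSuisAhmed-images-20150112.txt"), more_than=1000) would give the urls with more than 1000 occurrences.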

Retweets

What are the three most retweeted tweets in the hashtag?




How do you find the most retweeted tweets in the data set? This will give you the top 10.

% ~/git/twarc/utils/retweets.py JeSuisAhmed-tweets-dedupe-20150112.json > JeSuisAhmed-retweets-20150112.json
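The idea behind finding the most retweeted tweets can be sketched in a few lines of Python. This is my own approximation, not the code in retweets.py: it assumes every retweet embeds the original under "retweeted_status" with a "retweet_count", and keeps the highest count seen for each original. top_retweets is a hypothetical name.

```python
from typing import Iterable, List


def top_retweets(tweets: Iterable[dict], n: int = 10) -> List[dict]:
    """Return the n original tweets with the highest observed retweet_count."""
    best = {}
    for tweet in tweets:
        original = tweet.get("retweeted_status")
        if not original:
            continue  # not a retweet
        tweet_id = original["id_str"]
        seen = best.get(tweet_id)
        # The count grows over time, so keep the largest snapshot we saw.
        if seen is None or original["retweet_count"] > seen["retweet_count"]:
            best[tweet_id] = original
    return sorted(best.values(), key=lambda t: t["retweet_count"], reverse=True)[:n]
```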

Top URLs

Top 10 URLs tweeted from #JeSuisAhmed.

  1. http://www.huffingtonpost.ca/2015/01/08/ahmed-merabet-jesuisahmed-charlie-hebdo_n_6437984.html?ncid=tweetlnkushpmg00000067 (2895)
  2. http://limportant.fr/infos-jesuischarlie/76/360460 (1613)
  3. http://mic.com/articles/107988/the-hero-of-the-charlie-hebdo-shooting-we-re-overlooking (1318)
  4. http://www.huffingtonpost.co.uk/2015/01/08/charlie-hebdocharlie-hebdo-attack-jesuisahmed-hashtag-commemorating-ahmed-merabet-takes-off_n_6436528.html?1420731418&ncid=tweetlnkushpmg00000067 (919)
  5. http://www.huffingtonpost.com/2015/01/08/jesuisahmed-twitter-hashtag_n_6438132.html?ncid=tweetlnkushpmg00000067 (632)
  6. http://www.huffingtonpost.com/2015/01/08/jesuisahmed-twitter-hashtag_n_6438132.html?ncid=tweetlnkushpmg00000055 (592)
  7. http://www.dailymail.co.uk/news/article-2901681/Hero-police-officer-executed-street-married-42-year-old-Muslim-assigned-patrol-Paris-neighbourhood-Charlie-Hebdo-offices-located.html (571)
  8. http://blogs.mediapart.fr/blog/joel-villain/070115/il-sappelait-ahmed (555)
  9. http://www.bbc.co.uk/news/blogs-trending-30728491?ocid=socialflow_twitter (471)
  10. http://www.huffingtonpost.com/2015/01/08/jesuisahmed-twitter-hashtag_n_6438132.html?utm_hp_ref=tw (436)

Full list of urls can be found here.

How do you get the list (requires unshrtn)?

% cat JeSuisAhmed-tweets-20150112.json | ~/git/twarc/utils/unshorten.py > JeSuisAhmed-tweets-unshortened-20150112.json
% cat JeSuisAhmed-tweets-unshortened-20150112.json | ~/git/twarc/utils/urls.py | sort | uniq -c | sort -rn > JeSuisAhmed-urls.txt

Twitter Clients

Top 10 Twitter clients used from #JeSuisAhmed.

  1. Twitter for iPhone (85116)
  2. Twitter for Android (58819)
  3. Twitter Web Client (58166)
  4. Twitter for iPad (15304)
  5. Twitter for Websites (6877)
  6. Twitter for Windows Phone (5237)
  7. Twitter for Android Tablets (4420)
  8. TweetDeck (3790)
  9. Mobile Web (M5) (1708)
  10. Tweetbot for iΟS (1691)

Full list of clients can be found here.

How do you get the list of Twitter client sources?

% ~/git/twarc/utils/source.py JeSuisAhmed-tweets-20150112.json > JeSuisAhmed-sources-20150112.html
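The client name comes from each tweet's "source" field, which is usually an HTML link such as <a href="..." rel="nofollow">Twitter for iPhone</a>. Here is a minimal Python sketch of tallying those, assuming that field layout; it is not source.py itself, and client_name and top_clients are made-up names.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

TAG = re.compile(r"<[^>]+>")  # matches any HTML tag


def client_name(source: str) -> str:
    """Strip the surrounding HTML link and return just the client name."""
    return TAG.sub("", source).strip()


def top_clients(tweets: Iterable[dict], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most common Twitter clients with their tweet counts."""
    counts = Counter(client_name(tweet["source"]) for tweet in tweets)
    return counts.most_common(n)
```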



#JeSuisCharlie images

Using the #JeSuisCharlie data set from January 11, 2015 (Warning! Will turn your browser into a potato for a few seconds), these are the image urls that have more than 1000 occurrences in the data set.

How to create (requires unshrtn):

% twarc.py --query "#JeSuisCharlie"
% ~/git/twarc/utils/deduplicate.py JeSuisCharlie-tweets.json > JeSuisCharlie-tweets-deduped.json
% cat JeSuisCharlie-tweets-deduped.json | ~/git/twarc/utils/unshorten.py > JeSuisCharlie-tweets-deduped-ushortened.json
% ~/git/twarc/utils/image_urls.py JeSuisCharlie-tweets-deduped-ushortened.json >| JeSuisCharlie-20150115-image-urls.txt
% cat JeSuisCharlie-20150115-image-urls.txt | sort | uniq -c | sort -rn > JeSuisCharlie-20150115-image-urls-ranked.txt

The ranked url data set can be found here.

  1. 11657 occurrences (img)
  2. 4764 occurrences (img)
  3. 3014 occurrences (img)
  4. 2977 occurrences (img)
  5. 2840 occurrences (img)
  6. 2363 occurrences (img)
  7. 2190 occurrences (img)
  8. 2015 occurrences (img)
  9. 1921 occurrences (img)
  10. 1906 occurrences (img)
  11. 1832 occurrences (img)
  12. 1512 occurrences (img)
  13. 1409 occurrences (img)
  14. 1348 occurrences (img)
  15. 1261 occurrences (img)
  16. 1207 occurrences (img)
  17. 1152 occurrences (img)
  18. 1114 occurrences (img)
  19. 1065 occurrences (img)
  20. 1055 occurrences (img)
  21. 1047 occurrences (img)

Preliminary stats of #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, #CharlieHebdo

#JeSuisAhmed

$ wc -l *json
    148479 %23JeSuisAhmed-20150109103430.json
     94874 %23JeSuisAhmed-20150109141746.json
      5885 %23JeSuisAhmed-20150112092647.json
    249238 total
$ du -h
2.7G    .

#JeSuisCharlie

$ wc -l *json
    3894191 %23JeSuisCharlie-20150109094220.json
    1758849 %23JeSuisCharlie-20150109141730.json
     226784 %23JeSuisCharlie-20150112092710.json
         15 %23JeSuisCharlie-20150112092734.json
    5879839 total
$ du -h
32G .

#JeSuisJuif

$ wc -l *json
    23694 %23JeSuisJuif-20150109172957.json
    50603 %23JeSuisJuif-20150109173104.json
     5941 %23JeSuisJuif-20150110003450.json
    42237 %23JeSuisJuif-20150112094500.json
     5064 %23JeSuisJuif-20150112094648.json
   127539 total
$ du -h
671M    .

#CharlieHebdo

$ wc -l *json
    4444585 %23CharlieHebdo-20150109172713.json
        108 %23CharlieHebdo-20150109172825.json
    1164717 %23CharlieHebdo-20150109172844.json
    1068074 %23CharlieHebdo-20150112094427.json
      69446 %23CharlieHebdo-20150112094446.json
     185263 %23CharlieHebdo-20150112155558.json
    6932193 total
$ du -h
39G     .

Total

Preliminary and non-deduped, we're looking at roughly 74.4G of data, and 13,188,809 tweets after 5.5 days of capturing the four hashtags.

Preliminary look at 3,893,553 #JeSuisCharlie tweets

Background

Last Friday (January 9, 2015) I started capturing #JeSuisAhmed, #JeSuisCharlie, #JeSuisJuif, and #CharlieHebdo with Ed Summers' twarc. I have about 12 million tweets at the time of writing this, and plan on writing up something a little bit more in-depth in the coming weeks. But for now, here is some preliminary analysis of #JeSuisCharlie. If you haven't seen these two posts ("A Ferguson Twitter Archive", "On Forgetting and hydration") by Ed Summers, please do check them out.

How fast were the tweets coming in? Just to try and get a sense of this, I did a quick recording of tailing the twarc log for the #JeSuisCharlie capture.
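If you want an actual number for the rate rather than a screen recording, one rough approach is to bucket tweets by minute using each tweet's "created_at" timestamp. The Python sketch below assumes the usual Twitter format (e.g. "Wed Jan 07 11:59:12 +0000 2015") and an English/C locale for parsing day and month names; tweets_per_minute and peak_minute are hypothetical names.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Optional, Tuple

# Assumed layout of the "created_at" field in the tweet json.
TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"


def tweets_per_minute(tweets: Iterable[dict]) -> Counter:
    """Count tweets in each calendar minute, keyed as 'YYYY-MM-DD HH:MM'."""
    counts = Counter()
    for tweet in tweets:
        created = datetime.strptime(tweet["created_at"], TWITTER_TIME)
        counts[created.strftime("%Y-%m-%d %H:%M")] += 1
    return counts


def peak_minute(tweets: Iterable[dict]) -> Optional[Tuple[str, int]]:
    """Return the busiest minute and its tweet count, or None if there are no tweets."""
    busiest = tweets_per_minute(tweets).most_common(1)
    return busiest[0] if busiest else None
```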

Hydration

If you checked out both of Ed's posts, you'll have noticed that the Twitter ToS forbids the distribution of tweets, but we can distribute the tweet ids, and based on those we can "rehydrate" the data set locally. The tweet ids for each hashtag will be/are available here. I'll update and release the tweet id files as I can.

We're looking at just around 12 million tweets (un-deduped) at the time of writing, so the hydration process will take some time. I'd highly suggest using GNU Screen or tmux.

Hydrate

  • #JeSuisCharlie: % twarc.py --hydrate %23JeSuisCharlie-ids-20150112.txt > %23JeSuisCharlie-tweets-20150112.json
  • #JeSuisAhmed: % twarc.py --hydrate %23JeSuisAhmed-ids-20150112.txt > %23JeSuisAhmed-tweets-20150112.json
  • #JeSuisJuif: % twarc.py --hydrate %23JeSuisJuif-ids-20150112.txt > %23JeSuisJuif-tweets-20150112.json
  • #CharlieHebdo: % twarc.py --hydrate %23CharlieHebdo-ids-20150112.txt > %23CharlieHebdo-tweets-20150112.json

Map

#JeSuisCharlie tweets with geo coordinates.

In this data set, we have 51,942 tweets with geo coordinates available. This represents about 1.33% of the entire data set (3,893,553 tweets).

How do you make this?

  • Create the geojson: % ~/git/twarc/utils/geojson.py %23JeSuisCharlie-cat-20150115-tweets-deduped.json > %23JeSuisCharlie-cat-20150115-tweets-deduped.geojson

  • Give the geojson a variable name.

  • Use Leaflet.js to put all the tweets with geo coordinates on a map like this.

Top URLs

Top 10 URLs tweeted from #JeSuisCharlie.

  1. (11220) http://www.newyorker.com/culture/culture-desk/cover-story-2015-01-19?mbid=social_twitter
  2. (2278) http://www.europe1.fr/direct-video
  3. (1615) https://www.youtube.com/watch?v=4KBdnOrTdMI&feature=youtu.be
  4. (1347) https://www.youtube.com/watch?v=-bjbUg9d64g&feature=youtu.be
  5. (1333) http://www.amazon.com/Charlie-Hebdo/dp/B00007LMFU/
  6. (977) http://www.clubic.com/internet/actualite-748637-opcharliehebdo-anonymous-vengeance.html
  7. (934) http://www.maryam-rajavi.com/en/index.php?option=com_content&view=article&id=1735&catid=159&Itemid=506
  8. (810) http://www.lequipe.fr/eStore/Offres/Achat/271918
  9. (771) http://srogers.cartodb.com/viz/123be814-96bb-11e4-aec1-0e9d821ea90d/embed_map
  10. (605) https://www.youtube.com/watch?v=et4fYWKjP_o

Full list of urls can be found here.

How do you get the list?

  • % cat %23JeSuisCharlie-cat-20150115-tweets-deduped.json | ~/git/twarc/utils/unshorten.py > %23JeSuisCharlie-cat-20150115-tweets-deduped-unshortened.json
  • % cat %23JeSuisCharlie-cat-20150115-tweets-deduped-unshortened.json | ~/git/twarc/utils/urls.py | sort | uniq -c | sort -n > %23JeSuisCharlie-cat-20150115-urls.txt

Twitter Clients

Top 10 Twitter clients used from #JeSuisCharlie.

  1. (1283521) Twitter for iPhone
  2. (951925) Twitter Web Client
  3. (847308) Twitter for Android
  4. (231713) Twitter for iPad
  5. (86209) TweetDeck
  6. (82616) Twitter for Windows Phone
  7. (70286) Twitter for Android Tablets
  8. (44189) Twitter for Websites
  9. (39174) Instagram
  10. (21424) Mobile Web (M5)

Full list of clients can be found here.

How do you get this?

  • % ~/git/twarc/utils/source.py %23JeSuisCharlie-cat-20150115-tweets-deduped.json > %23JeSuisCharlie-cat-20150115-tweets-deduped-source.html

Word cloud

Word cloud from #JeSuisCharlie tweets.

I couldn't get the word cloud to embed nicely, so you'll have to check it out here.

How do you create the word cloud?

  • % ~/git/twarc/utils/wordcloud.py %23JeSuisCharlie-cat-20150115-tweets.json > %23JeSuisCharlie-wordcloud.html
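A word cloud is driven by word frequencies, so if you want the raw counts behind one, a Python sketch like this will do. It is not what wordcloud.py does internally; it assumes the tweet text is in the "text" field, drops urls, and only counts words of three or more letters. The function name and the stopwords parameter are my own inventions.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

URL = re.compile(r"https?://\S+")
WORD = re.compile(r"[^\W\d_]{3,}")  # runs of 3+ letters, accented ones included


def word_frequencies(tweets: Iterable[dict],
                     stopwords: frozenset = frozenset(),
                     n: int = 100) -> List[Tuple[str, int]]:
    """Return the n most frequent lowercased words across all tweet texts."""
    counts = Counter()
    for tweet in tweets:
        text = URL.sub(" ", tweet["text"]).lower()
        counts.update(word for word in WORD.findall(text) if word not in stopwords)
    return counts.most_common(n)
```

Passing a stopword set such as frozenset({"les", "the", "and"}) keeps filler words from dominating the cloud.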