twarc

Twitter Datasets and Derivative data

#climatemarch tweets April 19-May 3, 2017

681,668 tweet ids for #climatemarch collected with Documenting the Now's twarc from April 19-May 3, 2017. Tweets can be "rehydrated" with Documenting the Now's twarc (https://github.com/DocNow/twarc). twarc.py hydrate climatemarch_tweet_ids.txt > climatemarch.json.

#MarchForScience tweets April 12-26, 2017

1,276,220 tweet ids for #MarchForScience collected with Documenting the Now's twarc from April 12-26, 2017. Tweets can be "rehydrated" with Documenting the Now's twarc (https://github.com/DocNow/twarc). twarc.py hydrate MarchForScience_tweet-ids.txt > MarchForScience.json.

#WomensMarch tweets January 12-28, 2017

14,478,518 tweet ids for #WomensMarch collected with Documenting the Now's twarc from January 21-28, 2017. Tweets can be "rehydrated" with Documenting the Now's twarc (https://github.com/DocNow/twarc). twarc.py --hydrate WomensMarch_tweet_ids.txt > WomensMarch.json. Also included are the log files for the Filter API and Search API queries. The Filter API log captures the cumulative number of dropped tweets.

The fall of Aleppo tweets; Aleppo 2016-12-13 through 2016-12-29

8,595,589 tweet ids for aleppo tweets captured during the fall of Aleppo in December 2016. Tweets can be "rehydrated" with Documenting the Now's twarc (https://github.com/DocNow/twarc). twarc.py --hydrate aleppo_tweet_ids.txt > aleppo.json

Tweet ids for final Tragically Hip concert

228,086 tweet ids for "TheHip, hipinkingston" captured during the Tragically Hip's final concert in Kingston, Ontario in August 2016. Tweets can be "rehydrated" with Documenting the Now's twarc (https://github.com/DocNow/twarc). twarc.py --hydrate th_final_concert_kingston_tweet_ids.txt > th_final_concert_kingston.json

#YMMfire tweets

Tweet ids for #YMMfire tweets. Tweets can be "hydrated" with Ed Summers' twarc (https://github.com/edsu/twarc). twarc.py --hydrate ymmfire-ids.txt > ymmfire-tweets.json. Hydrating will recreate the original tweet(s) in json format, provided the content is still available on Twitter.

#jcdl2016 tweets

Tweet ids for #jcdl2016 tweets. Tweets can be "hydrated" with Ed Summers' twarc (https://github.com/edsu/twarc). twarc.py --hydrate jcdl2016-tweet-ids.txt > jcdl2016-tweets.json. Hydrating will recreate the original tweet(s) in json format, provided the content is still available on Twitter.

#thechalkening tweets

Tweet ids for #thechalkening tweets. Tweets can be "hydrated" with Ed Summers' twarc (https://github.com/edsu/twarc). twarc.py --hydrate thechalkening-ids-20160412.txt > thechalkening-20160412-tweets.json. Hydrating will recreate the original tweet(s) in json format, provided the content is still available on Twitter.

#panamapapers tweets

Tweet ids for #panamapapers tweets. Tweets can be "hydrated" with Ed Summers' twarc (https://github.com/edsu/twarc). twarc.py --hydrate panamapapers-ids-20160413.txt > panamapapers-20160413-tweets.json. Hydrating will recreate the original tweet(s) in json format, provided the content is still available on Twitter.

#NDP2016 tweets

#MakeDonaldDrumpfAgain tweets

Derivative data for #MakeDonaldDrumpfAgain tweets. Tweets can be "hydrated" with Ed Summers' twarc (https://github.com/edsu/twarc). twarc.py --hydrate MakeDonaldDrumpfAgain-tweet-ids.txt > MakeDonaldDrumpfAgain.json. Hydrating will recreate the original tweet(s) in json format, provided the content is still available on Twitter. This dataset is the combination of hydrated http://hdl.handle.net/10864/11310 tweet ids, and http://hdl.handle.net/10864/11270.

#paris #Bataclan #parisattacks #porteouverte tweets

Tweet ids for #paris #Bataclan #parisattacks #porteouverte tweets. Tweets can be "hydrated" with Ed Summers' twarc (https://github.com/edsu/twarc). twarc.py --hydrate paris-tweet-ids.txt > paris-tweets.json. Hydrating will recreate the original tweet(s) in json format, provided the content is still available on Twitter.

#elxn42 tweets (42nd Canadian Federal Election)

Tweet ids for #elxn42 tweets. Tweets can be "hydrated" with Ed Summers' twarc (https://github.com/edsu/twarc). twarc.py --hydrate elxn42-tweet-ids.txt > elxn42-tweets.json. Hydrating will recreate the original tweet(s) in json format, provided the content is still available on Twitter. This dataset is the combination of hydrated http://hdl.handle.net/10864/11310 tweet ids, and http://hdl.handle.net/10864/11270.

#JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, #CharlieHebdo tweets

Tweet ids for #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, #CharlieHebdo tweets. Tweets can be "rehydrated" with Ed Summers' twarc (https://github.com/edsu/twarc). twarc.py --hydrate %23JeSuisCharlie-ids-20150112.txt > %23JeSuisCharlie-tweets-20150112.json

14,478,518 #WomensMarch tweets January 12-28, 2017

Overview

A couple Saturday mornings ago, I was on the couch listening to records and reading a book when Christina Harlow and MJ Suhonos asked me about collecting #WomensMarch tweets. Little did I know at the time #WomensMarch would be the largest volume collection I have ever seen. By the time I stopped collecting a week later, we'd amassed 14,478,518 unique tweet ids from 3,582,495 unique users, and at one point hit around 1 million tweets in a single hour.

http://ruebot.net/WomensMarch_tweet_volume.html (Generated with Peter Binkley's twarc-report)

This put #WomensMarch well over 1% of the overall Twitter stream, which causes dropped tweets if you're collecting from the Filter API, so I used both the Filter and Search APIs for collection. (If you're curious to learn more about this, check out Kevin Driscoll and Shawn Walker's "Big Data, Big Questions | Working Within a Black Box: Transparency in the Collection and Production of Big Twitter Data", and Jiaul H. Paik and Jimmy Lin's "Do Multiple Listeners to the Public Twitter Sample Stream Receive the Same Tweets?".) I've included the search and filter logs in the dataset. If you run grep "WARNING" WomensMarch_filter.log or grep "WARNING" WomensMarch_filter.log | wc -l you'll get a sense of the scale of dropped tweets. For a number of hours on January 22, I was seeing around 1.6 million cumulative dropped tweets!

http://ruebot.net/WomensMarch_dropped_tweets.png

I collected from around 11AM EST on January 21, 2017 to 11AM EST January 28, 2017 with the Filter API, and did two Search API queries. Final count before deduplication looked like this:

$ wc -l  WomensMarch_filter.json WomensMarch_search_01.json WomensMarch_search_02.json 
     7906847 WomensMarch_filter.json
     1336505 WomensMarch_search_01.json
     9602777 WomensMarch_search_02.json
    18846129 total

Final stats: 14,478,518 tweets in a 104GB json file!

This puts us in the same range as what Ryan Gallagher projected in "A Bird's-Eye View of #WomensMarch."

Below I'll give a quick overview of the dataset using utilities from Documenting the Now's twarc, and utilities described inline. This is the same approach Ian Milligan and I took in our 2016 Code4Lib Journal article, "An Open-Source Strategy for Documenting Events: The Case Study of the 42nd Canadian Federal Election on Twitter." This is probably all that I'll have time to do with the dataset. Please feel free to use it in your own research. It's licensed CC-BY, so please have at it! :-)

...and if you want access to other Twitter datasets to analyse, check out http://www.docnow.io/catalog/.

Users

Tweets Username
5,375        paparcura
4,703        latinagirlpwr
1,903        ImJacobLadder
1,236        unbreakablepenn
1,212        amForever44
1,178        BassthebeastNYC
1,170        womensmarch
1,017        WhyIMarch
982        TheLifeVote
952        zerocomados


3,582,495 unique users.

Retweets

146,370 Retweets


141,111 Retweets


109,865 Retweets


84,161 Retweets


70,600 Retweets


62,591 Retweets


59,366 Retweets


56,365 Retweets


52,125 Retweets


50,944 Retweets


Clients

Tweets Clients
7,098,145        Twitter for iPhone
3,718,467        Twitter for Android
2,066,773        Twitter for iPad
634,054        Twitter Web Client
306,225        Mobile Web (M5)
127,622        TweetDeck
59,463        Instagram
54,851        Tweetbot for iOS
47,556        Twitter for Windows
36,404        IFTTT

URLs

Tweets       URL
29,223        https://www.facebook.com/cnn/videos/10155945796281509/
27,435       http://www.cnn.com/2017/01/21/politics/womens-march-donald-trump-inauguration-sizes/index.html?sr=twCNN012117womens-march-donald-trump-inauguration-sizes0205PMStoryGal
24,854       http://www.independent.co.uk/news/world/americas/womens-march-antarctica-donald-trump-inauguration-women-hate-donald-trump-so-much-they-are-even-a7538856.html
21,189       https://twitter.com/kayleighmcenany/status/822979246205403136
20,902       https://twitter.com/mcgregor_ewan/status/823805815488331776
14,857       http://www.cnn.com/2017/01/21/politics/womens-march-donald-trump-inauguration-sizes/index.html?sr=twpol012117womens-march-donald-trump-inauguration-sizes0832PMVODtopLink&linkId=33643748
12,630       https://www.womensmarch.com/sisters
11,244       https://twitter.com/tomilahren/status/822852245532319744
9,761       https://twitter.com/mstharrington/status/823190136200593408
9,585       http://www.cnn.com/2017/01/21/politics/womens-march-protests-live-coverage/index.html?sr=twCNN012117womens-march-protests-live-coverage1208PMVODtop


2,403,637 URLs tweeted, with 527,350 of those being unique urls.

I've also set up a little bash script to feed all the unique URLs to the Internet Archive:

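#!/bin/bash

# Submit every unique URL, one per line, to the Internet Archive's save endpoint.
URLS=/path/to/WomensMarch_urls_uniq.txt
index=0

# Read the file directly (no cat), and keep backslashes and whitespace intact.
while IFS= read -r line; do
  curl -s -S "https://web.archive.org/save/$line" > /dev/null
  index=$((index + 1))
  echo "$index/527350 submitted to Internet Archive"
  sleep 1
done < "$URLS"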

And I've also set up a crawl with Heritrix, and I'll make that data available here once it is complete.

Domains

Tweets Domain
1,219,747        twitter.com
159,087        instagram.com
134,309        cnn.com
68,479        facebook.com
50,561        womensmarch.com
43,219        youtube.com
36,946        nytimes.com
30,201        huffingtonpost.com
21,520        paper.li
21,476        cbsnews.com
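
For anyone wanting to reproduce a domain table like the one above from a list of unshortened URLs, here is a minimal sketch in Python. It assumes one URL per line, and the function name count_domains and the folding of "www." into the bare domain are my own choices for illustration; they are not part of twarc.

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(urls):
    """Count how often each domain appears in an iterable of URLs."""
    counts = Counter()
    for url in urls:
        host = urlparse(url.strip()).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # assumption: fold www.cnn.com into cnn.com
        if host:
            counts[host] += 1
    return counts


sample = [
    "https://twitter.com/mcgregor_ewan/status/823805815488331776",
    "http://www.cnn.com/2017/01/21/politics/womens-march-protests-live-coverage/index.html",
    "https://www.womensmarch.com/sisters",
    "https://twitter.com/tomilahren/status/822852245532319744",
]
for domain, n in count_domains(sample).most_common():
    print(n, domain)
```

On a real run you would pass an open file handle instead of the sample list, since a file iterates line by line.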

Embedded Images

Tweets Image
146,442      
81,139       
71,877       
64,149       
59,214       
58,599       
51,439       
44,611       
43,845       
41,436       


6,153,894 embedded image URLs tweeted, with 390,298 of those being unique urls.

I'll be creating an image montage similar to what I did for #elxn42 and #panamapapers for #WomensMarch. It'll take some time, and I have to gather resources to make it happen since we're looking at about 5 times the amount of images for #WomensMarch.

#panamapapers images April 4-29, 2016

Dataset is available here.

Looking at the #panamapapers capture I've been doing, we have 1,424,682 embedded image URLs from 3,569,960 tweets. I'm downloading the 1,424,682 images now, and hope to do something similar to what I did with the #elxn42 images. While we're waiting for the images to download, here are the 10 most tweeted embedded image URLs:

Tweets Image
1. 10243 http://pbs.twimg.com/media/CfIsEBAXEAA8I0A.jpg
2. 8093 http://pbs.twimg.com/media/Cfdm2RtXIAEbNGN.jpg
3. 6588 http://pbs.twimg.com/tweet_video_thumb/CfJly88WwAAHBZp.jpg
4. 5613 http://pbs.twimg.com/media/CfIuU8hW4AAsafn.jpg
5. 5020 http://pbs.twimg.com/media/CfN2gZcWAAEcptA.jpg
6. 4944 http://pbs.twimg.com/media/CfOPcofUAAAOb3v.jpg
7. 4421 http://pbs.twimg.com/media/CfnqsINWIAAMCTR.jpg
8. 3740 http://pbs.twimg.com/media/CfSpwuhWQAALIS7.jpg
9. 3616 http://pbs.twimg.com/media/CfXYf5-UAAAQsps.jpg
10. 3585 http://pbs.twimg.com/media/CfTsTp_UAAECCg4.jpg

A look at 14,939,154 #paris #Bataclan #parisattacks #porteouverte tweets

On November 13, 2015 I was at the "Web Archives 2015: Capture, Curate, Analyze" listening to Ian Milligan give the closing keynote when Thomas Padilla tweeted the following to me:

I immediately started collecting.


When tragedies like this happen, I feel pretty powerless. But, I figure if I can collect something like this, similar to what I did for the Charlie Hebdo attacks, it's something. Maybe these datasets can be used for something positive that happened out of all this negative.


When I started collecting, it just so happened that the creator of twarc, Ed Summers, was sitting next to me, and he mentioned some new functionality that was part of the v0.4.0 release of twarc: "Added --warnings flag to log warnings from the Twitter API about dropped tweets during streaming."

What does that mean? Basically, the public Stream API will not stream more than 1% of the total Twitter stream. If you are trying to capture something from the streaming API that exceeds 1% of the total Twitter stream, like for instance a hashtag or two related to a terrorist attack, the streaming API will drop tweets, and notify you that it has done so. There is a really interesting look at this by Kevin Driscoll and Shawn Walker in the International Journal of Communication.

Ed fired up the new version of twarc and began streaming as well so we could see what was happening. We noticed that we were getting warnings of around 400 dropped tweets every request (seconds), then it quickly escalated up to over 28k dropped tweets every request. What we were trying to collect was over 1% of the total Twitter stream.
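
As far as I understand it, those warnings come from limit notices that the streaming API interleaves with the tweets. Here is a minimal sketch of spotting them in a stream of JSON lines. It assumes the notice has the shape {"limit": {"track": N}}, with N being the number of undelivered tweets since the connection opened. The function name and the sample lines are hypothetical, and this is not how twarc itself implements --warnings.

```python
import json


def max_dropped(lines):
    """Return the largest undelivered-tweet count seen in limit notices."""
    dropped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # streams can contain blank keep-alive lines
        message = json.loads(line)
        limit = message.get("limit")
        if isinstance(limit, dict):
            # assumption: "track" is a running total, so keep the maximum
            dropped = max(dropped, limit.get("track", 0))
    return dropped


stream = [
    '{"id_str": "1", "text": "a tweet"}',
    '{"limit": {"track": 400}}',
    "",
    '{"limit": {"track": 28000}}',
]
print(max_dropped(stream))
```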

Dataset

Collection started on November 13, 2015 using both the streaming and search API. This is what it looked like:

$ twarc.py --search "#paris OR #Bataclan OR #parisattacks OR #porteouverte" > paris-search.json
$ twarc.py --stream "#paris,#Bataclan,#parisattacks,#porteouverte" > paris-stream.json

I took the strategy of utilizing both the search and streaming API for collection due to what was noted above about hitting the 1% of the total Twitter stream limit. The idea was that if I'm hitting the limit with stream, theoretically I should be able to capture any tweets that were dropped with the search API. The stream API collection ran continuously during the collection period from November 13, 2015 to December 11, 2015. The search API collection was run, then once finished, immediately started back up over the collection period. During the first two weeks of collection, the search API collection would take about a week to finish. In retrospect, I should have made note of the exact times it took to collect to get some more numbers to look at. That said, I'm not confident I was able to grab every tweet related to the hashtags I was collecting on. The only way, I think, I can be confident is by comparing this dataset with a dataset from Gnip. But, I am confident that I have a large amount of what was tweeted.

Once I finished collecting, I combined the json files, and deduplicated with deduplicate.py, and then created a list of tweet ids with ids.py.
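
To make the deduplication and id-extraction steps concrete, here is a minimal sketch of the idea in Python. It assumes line-oriented JSON with one tweet per line and keys on the tweet's id_str field. The function names are mine, and this is not the actual code of deduplicate.py or ids.py.

```python
import json


def deduplicate(lines):
    """Yield each tweet once, keyed on id_str, in first-seen order."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        tweet_id = tweet["id_str"]
        if tweet_id not in seen:
            seen.add(tweet_id)
            yield tweet


def tweet_ids(tweets):
    """Return the list of tweet ids for an iterable of tweet dicts."""
    return [tweet["id_str"] for tweet in tweets]


# The same tweet captured by both the search and the stream collection.
combined = [
    '{"id_str": "665939383418273793", "text": "from search"}',
    '{"id_str": "665329805206900736", "text": "from stream"}',
    '{"id_str": "665939383418273793", "text": "from stream"}',
]
print("\n".join(tweet_ids(deduplicate(combined))))
```

Keeping the seen ids in a set means memory grows with the number of unique tweets, not with the size of the JSON, which matters at this scale.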

If you want to follow along or do your own analysis with the dataset, you can "hydrate" the dataset with twarc. You can grab the Tweet ids for the dataset from here (Data & Analysis tab).

$ twarc.py --hydrate paris-tweet-ids.txt > paris-tweets.json

The hydration process will take some time; 72,000 tweets/hour. You might want to use something along the lines of GNU Screen, tmux, or nohup since it'll take about 207.49 hours to completely hydrate.
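
The estimate above is just the number of tweet ids divided by the hydration rate. A small sketch, assuming the rate stays at a steady 72,000 tweets/hour (the function name is only for illustration):

```python
def hydration_hours(tweet_count, tweets_per_hour=72_000):
    """Estimate hours needed to hydrate tweet_count ids at a fixed rate."""
    return tweet_count / tweets_per_hour


hours = hydration_hours(14_939_154)
print(f"{hours:.2f} hours, or about {hours / 24:.1f} days")
```

Swap in the id count of any other dataset to get a feel for how long a hydration run will tie up a terminal.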

paris-tweet-times
created with Peter Binkley's twarc-report

Overview

I'm only going to do a quick analysis of the dataset here since I want to get the dataset out, and allow others to work with it. Tweets with geocoordinates are not covered below, but you can check out a map of tweets here.

There are a number of friendly utilities that come with twarc that allow for a quick exploratory analysis of a given collection. In addition, Peter Binkley's twarc-report is pretty handy for providing a quick overview of a given dataset.

Users

We are able to create a list of the unique Twitter usernames in the dataset by using users.py, and additionally sort them by the number of tweets:

$ python ~/git/twarc/utils/users.py paris-valid-deduplicated.json > paris-users.txt
$ cat paris-users.txt | sort | uniq -c | sort -n > paris-users-unique-sorted-count.txt
$ cat paris-users-unique-sorted-count.txt | wc -l
$ tail paris-users-unique-sorted-count.txt

From the above, we can see that there are 4,636,584 unique users in the dataset, and the top 10 accounts were as follows:

1. 38,883 tweets RelaxInParis
2. 36,504 tweets FrancePeace
3. 12,697 tweets FollowParisNews
4. 12,656 tweets Reduction_Paris
5. 10,044 tweets CNNsWorld
6. 8,208 tweets parisevent
7. 7,296 tweets TheMalyck_
8. 6,654 tweets genx_hrd
9. 6,370 tweets DHEdomains
10. 4,498 tweets paris_attack
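
The sort | uniq -c | sort -n pipeline above boils down to counting screen names. Here is a minimal Python sketch of the same idea working straight from the tweet JSON. It assumes one tweet per line with the usual user.screen_name field, and top_users is a name I made up for illustration rather than anything users.py provides.

```python
import json
from collections import Counter


def top_users(lines, n=10):
    """Return the n most active screen names from line-oriented tweet JSON."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts[json.loads(line)["user"]["screen_name"]] += 1
    return counts.most_common(n)


tweets = [
    '{"id_str": "1", "user": {"screen_name": "RelaxInParis"}}',
    '{"id_str": "2", "user": {"screen_name": "FrancePeace"}}',
    '{"id_str": "3", "user": {"screen_name": "RelaxInParis"}}',
]
for name, count in top_users(tweets):
    print(count, name)
```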

Retweets

We are able to create a list of the most retweeted tweets in the dataset by using retweets.py:

$ python ~/git/twarc/utils/retweets.py paris-valid-deduplicated.json > paris-retweets.json
$ python ~/git/twarc/utils/tweet_urls.py paris-retweets.json > paris-retweets.txt

1. 53,639 retweets https://twitter.com/PNationale/status/665939383418273793
2. 44,457 retweets https://twitter.com/MarkRuffalo/status/665329805206900736
3. 41,400 retweets https://twitter.com/NiallOfficial/status/328827440157839361
4. 39,140 retweets https://twitter.com/oreoxzhel/status/665499107021066240
5. 37,214 retweets https://twitter.com/piersmorgan/status/665314980095356928
6. 24,955 retweets https://twitter.com/Fascinatingpics/status/665458581832077312
7. 22,124 retweets https://twitter.com/RGerrardActor/status/665325168953167873
8. 22,113 retweets https://twitter.com/HeralddeParis/status/665327408803741696
9. 22,069 retweets https://twitter.com/Gabriele_Corno/status/484640360120209408
10. 21,401 retweets https://twitter.com/SarahMatt97/status/665383304787529729

Hashtags

We are able to create a list of the unique tags used in our dataset by using tags.py.

$ python ~/git/twarc/utils/tags.py paris-valid-deduplicated.json > paris-hashtags.txt
$ cat paris-hashtags.txt | wc -l
$ head paris-hashtags.txt

From the above, we can see that 268,974 unique hashtags were used. The top 10 hashtags used in the dataset were:

1. 6,812,941 tweets #parisattacks
2. 6,119,933 tweets #paris
3. 1,100,809 tweets #bataclan
4. 887,144 tweets #porteouverte
5. 673,543 tweets #prayforparis
6. 444,486 tweets #rechercheparis
7. 427,999 tweets #parís
8. 387,699 tweets #france
9. 341,059 tweets #fusillade
10. 303,410 tweets #isis
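
Here is a minimal sketch of how a hashtag tally like this can be built from the tweet JSON. It assumes each tweet carries its hashtags under entities.hashtags[].text and folds them to lowercase before counting. The function name is hypothetical and this is not the tags.py source. Note that #paris and #parís stay separate, as they do in the list above.

```python
import json
from collections import Counter


def count_hashtags(lines):
    """Count lowercased hashtags across line-oriented tweet JSON."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        tweet = json.loads(line)
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts["#" + tag["text"].lower()] += 1
    return counts


sample = [
    json.dumps({"entities": {"hashtags": [{"text": "Paris"}, {"text": "Bataclan"}]}}),
    json.dumps({"entities": {"hashtags": [{"text": "paris"}]}}),
    json.dumps({"entities": {"hashtags": [{"text": "París"}]}}),
]
for tag, n in count_hashtags(sample).most_common():
    print(n, tag)
```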

URLs

We are able to create a list of the unique URLs tweeted in our dataset by using urls.py, after first unshortening the urls with unshorten.py and unshrtn.

$ python ~/git/twarc/utils/urls.py paris-valid-deduplicated-unshortened.json > paris-tweets-urls.txt
$ cat paris-tweets-urls.txt | sort | uniq -c | sort -n > paris-tweets-urls-uniq.txt
$ cat paris-tweets-urls.txt | wc -l
$ cat paris-tweets-urls-uniq.txt | wc -l
$ tail paris-tweets-urls-uniq.txt

From the above, we can see that there were 5,561,037 URLs tweeted, representing 37.22% of total tweets, and 858,401 unique URLs tweeted. The top 10 URLs tweeted were as follows:

1. 46,034 tweets http://www.bbc.co.uk/news/live/world-europe-34815972?ns_mchannel=social&ns_campaign=bbc_breaking&ns_source=twitter&ns_linkname=news_central
2. 46,005 tweets https://twitter.com/account/suspended
3. 37,509 tweets http://www.lefigaro.fr/actualites/2015/11/13/01001-20151113LIVWWW00406-fusillade-paris-explosions-stade-de-france.php#xtor=AL-155-
4. 35,882 tweets http://twibbon.com/support/prayforparis-2/twitter
5. 33,531 tweets http://www.bbc.co.uk/news/live/world-europe-34815972
6. 33,039 tweets https://www.rt.com/news/321883-shooting-paris-dead-masked/
7. 24,221 tweets https://www.youtube.com/watch?v=-Uo6ZB0zrTQ
8. 23,536 tweets http://www.bbc.co.uk/news/live/world-europe-34825270
9. 21,237 tweets https://amp.twimg.com/v/fc122aff-6ece-47a4-b34c-cafbd72ef386
10. 21,107 tweets http://live.reuters.com/Event/Paris_attacks_2?Page=0

Images

We are able to create a list of images tweeted in our dataset by using image_urls.py.

$ python ~/git/twarc/utils/image_urls.py paris-valid-deduplicated.json > paris-tweets-images.txt
$ cat paris-tweets-images.txt | sort | uniq -c | sort -n > paris-tweets-images-uniq.txt
$ cat paris-tweets-images-uniq.txt | wc -l
$ tail paris-tweets-images-uniq.txt

From the above, we can see that there were 6,872,441 images tweeted, representing 46.00% of total tweets, and 660,470 unique images. The top 10 images tweeted were as follows:

  1. 49,051 Occurrences
    http://pbs.twimg.com/media/CT3jpTNWwAAipNa.jpg
  2. 43,348 Occurrences
    http://pbs.twimg.com/media/CTxT6REUsAAdsEe.jpg
  3. 22,615 Occurrences
    http://pbs.twimg.com/media/CTwvCV3WsAAY_r9.jpg
  4. 21,325 Occurrences
    http://pbs.twimg.com/media/CTu1s_tUEAEj1qn.jpg
  5. 20,689 Occurrences
    http://pbs.twimg.com/media/CTwkRSoWoAEdL6Z.jpg
  6. 19,696 Occurrences
    http://pbs.twimg.com/media/CTu3wKfUkAAhtw_.jpg
  7. 19,597 Occurrences
    http://pbs.twimg.com/media/CTvqliHUkAAf0GH.jpg
  8. 19,096 Occurrences
    http://pbs.twimg.com/ext_tw_video_thumb/665318181603426307/pu/img/KuVYpJVjWfPhbTR7.jpg
  9. 16,772 Occurrences
    http://pbs.twimg.com/media/CTwhk0IWoAAc5qZ.jpg
  10. 15,364 Occurrences
    http://pbs.twimg.com/media/CT4ONVjUEAAOBkS.jpg

#paris #Bataclan #parisattacks #porteouverte tweets with geocoordinates

An Exploratory look at 13,968,293 #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, and #CharlieHebdo tweets

#JeSuisCharlie #JeSuisAhmed #JeSuisJuif #CharlieHebdo

I've spent the better part of a month collecting tweets from the #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, and #CharlieHebdo hashtags. Last week, I pulled together all of the collection files, did some clean up, and some more analysis on the data set (76G of json!). This time I was able to take advantage of Peter Binkley's twarc-report project. According to the report, the earliest tweet in the data set is from 2015-01-07 11:59:12 UTC, and the last tweet in the data set is from 2015-01-28 18:15:35 UTC. This data set includes 13,968,293 tweets (10,589,910 retweets - 75.81%) from 3,343,319 different users over 21 days. You can check out a word cloud of all the tweets here.

First tweet in data set (numeric sort of tweet ids):


Hydration

If you want to experiment/follow along with what I've done here, you can "rehydrate" the data set with twarc. You can grab the Tweet ids for the data set from here (Data & Analysis tab).

% twarc.py --hydrate JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweet-ids-20150129.txt > JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json

The hydration process will take some time. I'd highly suggest using GNU Screen or tmux, and grabbing approximately 15 pots of coffee.

Map

In this data set, we have 133,970 tweets with geo coordinates available. This represents about 0.96% of the entire data set.

The map is available here in a separate page since the geojson file is 83M and will potato your browser while everything loads. If anybody knows how to stream that geojson file to Leaflet.js so the browser doesn't potato, please comment! :-)

Users

These are the top 10 users in the data set.

  1. 35,420 tweets Promo_Culturel
  2. 33,075 tweets BotCharlie
  3. 24,251 tweets YaMeCanse21
  4. 23,126 tweets yakacliquer
  5. 17,576 tweets YaMeCanse20
  6. 15,315 tweets iS_Angry_Bird
  7. 9,615 tweets AbraIsacJac
  8. 9,318 tweets AnAnnoyingTweep
  9. 3,967 tweets rightnowio_feed
  10. 3,514 tweets russfeed

This comes from twarc-report's reportprofiler.py.

$ ~/git/twarc-report/reportprofiler.py -o text JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json

Hashtags

These are the top 10 hashtags in the data set.

  1. 8,597,175 tweets #charliehebdo
  2. 7,911,343 tweets #jesuischarlie
  3. 377,041 tweets #jesuisahmed
  4. 264,869 tweets #paris
  5. 186,976 tweets #france
  6. 177,448 tweets #parisshooting
  7. 141,993 tweets #jesuisjuif
  8. 140,539 tweets #marcherepublicaine
  9. 129,484 tweets #noussommescharlie
  10. 128,529 tweets #afp

URLs

These are the top 10 URLs in the data set. 3,771,042 tweets (27.00%) had a URL associated with them.

These are all shortened URLs. I'm working through an issue with unshorten.py.

  1. http://bbc.in/1xPaVhN (43,708)
  2. http://bit.ly/1AEpWnE (19,328)
  3. http://bit.ly/1DEm0TK (17,033)
  4. http://nyr.kr/14AeVIi (14,118)
  5. http://youtu.be/4KBdnOrTdMI (13,252)
  6. http://bbc.in/14ulyLt (12,407)
  7. http://europe1.fr/direct-video (9,228)
  8. http://bbc.in/1DxNLQD (9,044)
  9. http://ind.pn/1s5EV8w (8,721)
  10. http://srogers.cartodb.com/viz/123be814-96bb-11e4-aec1-0e9d821ea90d/embed_map (8,581)

This comes from twarc-report's reportprofiler.py.

$ ~/git/twarc-report/reportprofiler.py -o text JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json

Media

These are the top 10 media urls in the data set. 8,141,552 tweets (58.29%) had a media URL associated with them.

  1. 36,753 Occurrences
  2. 35,942 Occurrences
  3. 33,501 Occurrences
  4. 31,712 Occurrences
  5. 29,359 Occurrences
  6. 26,334 Occurrences
  7. 25,989 Occurrences
  8. 23,974 Occurrences
  9. 22,659 Occurrences
  10. 22,421 Occurrences

This comes from twarc-report's reportprofiler.py.

$ ~/git/twarc-report/reportprofiler.py -o text JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json

133,970 #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, and #CharlieHebdo tweets on a map

How do you make this?

  • Create the geojson ~/git/twarc/utils/geojson.py JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json > JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-20150129.geojson

  • Give the geojson a variable name.

  • Use Leaflet.js to put all the tweets with geo coordinates on a map like this.
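
The first two steps above can be sketched in Python as follows. This is a rough illustration, not the geojson.py source. It assumes each geotagged tweet has a coordinates field that is already a GeoJSON Point in [longitude, latitude] order, and the variable name tweets in the printed JavaScript is a hypothetical choice.

```python
import json


def tweets_to_geojson(lines):
    """Build a GeoJSON FeatureCollection from tweets that carry coordinates."""
    features = []
    for line in lines:
        if not line.strip():
            continue
        tweet = json.loads(line)
        point = tweet.get("coordinates")
        if not point:
            continue  # most tweets have no geo coordinates
        features.append({
            "type": "Feature",
            "geometry": point,  # a GeoJSON Point: [longitude, latitude]
            "properties": {
                "id": tweet["id_str"],
                "screen_name": tweet["user"]["screen_name"],
            },
        })
    return {"type": "FeatureCollection", "features": features}


sample = [
    json.dumps({"id_str": "1", "user": {"screen_name": "a"},
                "coordinates": {"type": "Point", "coordinates": [2.3522, 48.8566]}}),
    json.dumps({"id_str": "2", "user": {"screen_name": "b"}, "coordinates": None}),
]
collection = tweets_to_geojson(sample)
# Step two: give the geojson a variable name so a page script can load it.
print("var tweets = " + json.dumps(collection) + ";")
```

Leaflet can then take that variable through its L.geoJSON layer, though with a file this size the browser will still have a lot to chew on.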

An exploratory look at 257,093 #JeSuisAhmed tweets

#JeSuisAhmed

Had some time last night to do some exploratory analysis on some of the #JeSuisAhmed collection. This analysis runs from the first tweet I was able to harvest for #JeSuisAhmed to some time on January 14, 2015, when I copied over the json to experiment with a few of the twarc utilities.

First tweet in data set:


Last tweet in data set:

Hydration

If you want to experiment/follow along with what I've done here, you can "rehydrate" the data set with twarc. You can grab the Tweet ids for #JeSuisAhmed from here (Data & Analysis tab).

% twarc.py --hydrate JeSuisAhmed-ids-20150113.txt > JeSuisAhmed-tweets-20150113.json

The hydration process will take some time. I'd highly suggest using GNU Screen or tmux, and grabbing a cup of coffee.

Map

#JeSuisAhmed tweets with geo coordinates.

In this data set, we have 2,329 tweets with geo coordinates available. This represents about 0.91% of the entire data set (257,093 tweets).

How do you make this?

  • Create the geojson ~/git/twarc/utils/geojson.py JeSuisAhmed-tweets-dedupe-20150112.json > JeSuisAhmed-tweets-dedupe-20150112.geojson

  • Give the geojson a variable name.

  • Use Leaflet.js to put all the tweets with geo coordinates on a map like this.

Images

These are the image urls that have more than 1000 occurrences in the data set.

  1. 13703 Occurrences
  2. 10396 Occurrences
  3. 6088 Occurrences
  4. 4354 Occurrences
  5. 3229 Occurrences
  6. 3124 Occurrences
  7. 2307 Occurrences
  8. 2034 Occurrences
  9. 1949 Occurrences
  10. 1296 Occurrences
  11. 1182 Occurrences
  12. 1100 Occurrences

How do you get the image list (requires unshrtn)?

% ~/git/twarc/utils/image_urls.py JeSuisAhmed-tweets-unshortened-20150112.json > JeSuisAhmed-images-20150112.txt
% cat JeSuisAhmed-images-20150112.txt | sort | uniq -c | sort -rn > JeSuisAhmed-images-ranked-20150112.txt
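
If you'd rather do the ranking and the "more than 1000 occurrences" cut in one pass, here is a minimal Python sketch. It assumes image URLs live under entities.media[].media_url in each tweet. The function name, the more_than parameter, and the EXAMPLE URLs are all hypothetical, and this is not the image_urls.py source.

```python
import json
from collections import Counter


def ranked_image_urls(lines, more_than=0):
    """Return (url, count) pairs, most frequent first, above a threshold."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        tweet = json.loads(line)
        for media in tweet.get("entities", {}).get("media", []):
            counts[media["media_url"]] += 1
    return [(url, n) for url, n in counts.most_common() if n > more_than]


sample = [
    json.dumps({"entities": {"media": [{"media_url": "http://pbs.twimg.com/media/EXAMPLE1.jpg"}]}}),
    json.dumps({"entities": {"media": [{"media_url": "http://pbs.twimg.com/media/EXAMPLE1.jpg"}]}}),
    json.dumps({"entities": {"media": [{"media_url": "http://pbs.twimg.com/media/EXAMPLE2.jpg"}]}}),
    json.dumps({"entities": {}}),
]
for url, n in ranked_image_urls(sample, more_than=1):
    print(n, url)
```

For the list above you would call it with more_than=1000.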

The ranked url data set can be found here.

Retweets

What are the three most retweeted tweets in the hashtag?




How do you find out the most retweeted tweets in the dataset? This will give you the top 10.

~/git/twarc/utils/retweets.py JeSuisAhmed-tweets-dedupe-20150112.json > JeSuisAhmed-retweets-20150112.json
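
For a sense of what finding the most retweeted tweets involves, here is a minimal sketch. It assumes each retweet embeds the original tweet under retweeted_status, along with that tweet's retweet_count at capture time, and it keeps the highest count seen per original. The function name is mine, and I'm not claiming this is how retweets.py does it.

```python
import json


def most_retweeted(lines, n=10):
    """Return (retweet_count, original tweet id) pairs, highest count first."""
    best = {}
    for line in lines:
        if not line.strip():
            continue
        original = json.loads(line).get("retweeted_status")
        if original is None:
            continue  # not a retweet
        tweet_id = original["id_str"]
        # The embedded count grows over time, so keep the largest value seen.
        best[tweet_id] = max(best.get(tweet_id, 0), original["retweet_count"])
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [(count, tweet_id) for tweet_id, count in ranked[:n]]


sample = [
    json.dumps({"id_str": "10", "retweeted_status": {"id_str": "1", "retweet_count": 5}}),
    json.dumps({"id_str": "11", "retweeted_status": {"id_str": "1", "retweet_count": 7}}),
    json.dumps({"id_str": "12", "retweeted_status": {"id_str": "2", "retweet_count": 3}}),
    json.dumps({"id_str": "13"}),
]
for count, tweet_id in most_retweeted(sample):
    print(count, tweet_id)
```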

Top URLs

Top 10 URLs tweeted from #JeSuisAhmed.

  1. http://www.huffingtonpost.ca/2015/01/08/ahmed-merabet-jesuisahmed-charlie-hebdo_n_6437984.html?ncid=tweetlnkushpmg00000067 (2895)
  2. http://limportant.fr/infos-jesuischarlie/76/360460 (1613)
  3. http://mic.com/articles/107988/the-hero-of-the-charlie-hebdo-shooting-we-re-overlooking (1318)
  4. http://www.huffingtonpost.co.uk/2015/01/08/charlie-hebdocharlie-hebdo-attack-jesuisahmed-hashtag-commemorating-ahmed-merabet-takes-off_n_6436528.html?1420731418&ncid=tweetlnkushpmg00000067 (919)
  5. http://www.huffingtonpost.com/2015/01/08/jesuisahmed-twitter-hashtag_n_6438132.html?ncid=tweetlnkushpmg00000067 (632)
  6. http://www.huffingtonpost.com/2015/01/08/jesuisahmed-twitter-hashtag_n_6438132.html?ncid=tweetlnkushpmg00000055 (592)
  7. http://www.dailymail.co.uk/news/article-2901681/Hero-police-officer-executed-street-married-42-year-old-Muslim-assigned-patrol-Paris-neighbourhood-Charlie-Hebdo-offices-located.html (571)
  8. http://blogs.mediapart.fr/blog/joel-villain/070115/il-sappelait-ahmed (555)
  9. http://www.bbc.co.uk/news/blogs-trending-30728491?ocid=socialflow_twitter (471)
  10. http://www.huffingtonpost.com/2015/01/08/jesuisahmed-twitter-hashtag_n_6438132.html?utm_hp_ref=tw (436)

Full list of urls can be found here.

How do you get the list (requires unshrtn)?

% cat JeSuisAhmed-tweets-20150112.json | ~/git/twarc/utils/unshorten.py > JeSuisAhmed-tweets-unshortened-20150112.json
% cat JeSuisAhmed-tweets-unshortened-20150112.json | ~/git/twarc/utils/urls.py| sort | uniq -c | sort -rn > JeSuisAhmed-urls.txt

Twitter Clients

Top 10 Twitter clients used from #JeSuisAhmed.

  1. Twitter for iPhone (85116)
  2. Twitter for Android (58819)
  3. Twitter Web Client (58166)
  4. Twitter for iPad (15304)
  5. Twitter for Websites (6877)
  6. Twitter for Windows Phone (5237)
  7. Twitter for Android Tablets (4420)
  8. TweetDeck (3790)
  9. Mobile Web (M5) (1708)
  10. Tweetbot for iΟS (1691)

Full list of clients can be found here.

How do you get the list of Twitter client sources?

% ~/git/twarc/utils/source.py JeSuisAhmed-tweets-20150112.json > JeSuisAhmed-sources-20150112.html
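
The client names come from each tweet's source field, which holds an HTML link with the client name as its text. A minimal sketch of tallying them, assuming that field layout; the function name and the tag-stripping regular expression are my own illustration, not the source.py code (which produces HTML output).

```python
import json
import re
from collections import Counter

TAG = re.compile(r"<[^>]+>")  # matches any HTML tag


def count_clients(lines):
    """Count tweets per client, using the link text of the source field."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        source = json.loads(line).get("source", "")
        counts[TAG.sub("", source)] += 1
    return counts


iphone = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
tweetdeck = '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>'
sample = [
    json.dumps({"source": iphone}),
    json.dumps({"source": tweetdeck}),
    json.dumps({"source": iphone}),
]
for client, n in count_clients(sample).most_common():
    print(n, client)
```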



#JeSuisCharlie images

Using the #JeSuisCharlie data set from January 11, 2015 (Warning! Will turn your browser into a potato for a few seconds), these are the image urls that have more than 1000 occurrences in the data set.

How to create (requires unshrtn):

% twarc.py --query "#JeSuisCharlie"
% ~/git/twarc/utils/deduplicate.py JeSuisCharlie-tweets.json > JeSuisCharlie-tweets-deduped.json
% cat JeSuisCharlie-tweets-deduped.json | utils/unshorten.py > JeSuisCharlie-tweets-deduped-ushortened.json
% ~/git/twarc/utils/image_urls.py JeSuisCharlie-tweets-deduped-ushortened.json >| JeSuisCharlie-20150115-image-urls.txt
% cat JeSuisCharlie-20150115-image-urls.txt | sort | uniq -c | sort -rn > JeSuisCharlie-20150115-image-urls-ranked.txt

The ranked url data set can be found here.

  1. 11657 Occurrences
  2. 4764 Occurrences
  3. 3014 Occurrences
  4. 2977 Occurrences
  5. 2840 Occurrences
  6. 2363 Occurrences
  7. 2190 Occurrences
  8. 2015 Occurrences
  9. 1921 Occurrences
  10. 1906 Occurrences
  11. 1832 Occurrences
  12. 1512 Occurrences
  13. 1409 Occurrences
  14. 1348 Occurrences
  15. 1261 Occurrences
  16. 1207 Occurrences
  17. 1152 Occurrences
  18. 1114 Occurrences
  19. 1065 Occurrences
  20. 1055 Occurrences
  21. 1047 Occurrences

Preliminary stats of #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, #CharlieHebdo

#JeSuisAhmed

$ wc -l *json
    148479 %23JeSuisAhmed-20150109103430.json
     94874 %23JeSuisAhmed-20150109141746.json
      5885 %23JeSuisAhmed-20150112092647.json
    249238 total
$ du -h
2.7G    .

#JeSuisCharlie

$ wc -l *json
    3894191 %23JeSuisCharlie-20150109094220.json
    1758849 %23JeSuisCharlie-20150109141730.json
     226784 %23JeSuisCharlie-20150112092710.json
         15 %23JeSuisCharlie-20150112092734.json
    5879839 total
$ du -h
32G .

#JeSuisJuif

$ wc -l *json
    23694 %23JeSuisJuif-20150109172957.json
    50603 %23JeSuisJuif-20150109173104.json
     5941 %23JeSuisJuif-20150110003450.json
    42237 %23JeSuisJuif-20150112094500.json
     5064 %23JeSuisJuif-20150112094648.json
   127539 total
$ du -h
671M    .

#CharlieHebdo

$ wc -l *json
    4444585 %23CharlieHebdo-20150109172713.json
        108 %23CharlieHebdo-20150109172825.json
    1164717 %23CharlieHebdo-20150109172844.json
    1068074 %23CharlieHebdo-20150112094427.json
      69446 %23CharlieHebdo-20150112094446.json
     185263 %23CharlieHebdo-20150112155558.json
    6932193 total
$ du -h
39G     .

Total

Preliminary and non-deduped, we're looking at roughly 74.4G of data, and 13,188,809 tweets after 5.5 days of capturing the 4 hashtags.