#panamapapers images April 4-29, 2016

#panamapapers tweet volume, April 4-29, 2016
Dataset is available here.

Looking at the #panamapapers capture I've been doing, we have 1,424,682 embedded image URLs from 3,569,960 tweets. I'm downloading the 1,424,682 images now, and hope to do something similar to what I did with the #elxn42 images. While we're waiting for the images to download, here are the 10 most tweeted embedded image URLs:

1. 10,243 tweets http://pbs.twimg.com/media/CfIsEBAXEAA8I0A.jpg
2. 8,093 tweets http://pbs.twimg.com/media/Cfdm2RtXIAEbNGN.jpg
3. 6,588 tweets http://pbs.twimg.com/tweet_video_thumb/CfJly88WwAAHBZp.jpg
4. 5,613 tweets http://pbs.twimg.com/media/CfIuU8hW4AAsafn.jpg
5. 5,020 tweets http://pbs.twimg.com/media/CfN2gZcWAAEcptA.jpg
6. 4,944 tweets http://pbs.twimg.com/media/CfOPcofUAAAOb3v.jpg
7. 4,421 tweets http://pbs.twimg.com/media/CfnqsINWIAAMCTR.jpg
8. 3,740 tweets http://pbs.twimg.com/media/CfSpwuhWQAALIS7.jpg
9. 3,616 tweets http://pbs.twimg.com/media/CfXYf5-UAAAQsps.jpg
10. 3,585 tweets http://pbs.twimg.com/media/CfTsTp_UAAECCg4.jpg
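
For reference, a ranked list like this can be generated from the capture with twarc's image_urls.py utility and some standard shell tools, along the same lines as the commands used for the other datasets below (the panamapapers filenames here are assumptions):

$ python ~/git/twarc/utils/image_urls.py panamapapers-tweets.json > panamapapers-image-urls.txt
$ cat panamapapers-image-urls.txt | sort | uniq -c | sort -rn | head -n 10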

1,203,867 #elxn42 images

Background

Last August, I began capturing the #elxn42 hashtag as an experiment and potential research project with Ian Milligan. Once Justin Trudeau was sworn in as the 23rd Prime Minister of Canada, we stopped collection and began analysing the dataset. We wrote that analysis up for the Code4Lib Journal, which will be published in the next couple of weeks. In the interim, you can check out our pre-print here. Included in that dataset is a line-delimited list of the URL of every embedded image tweeted in the dataset: 1,203,867 images. So, I downloaded them. It took a couple of days.

getTweetImages

IMAGES=/path/to/elxn42-image-urls.txt
cd /path/to/elxn42/images

# fetch each image URL in the list with wget
while read -r url; do
  wget "$url"
done < "$IMAGES"
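
With 1.2 million URLs, a serial loop like this takes a while (a couple of days in this case). One way to speed it up, assuming GNU xargs, is to run several wget processes in parallel; a rough sketch:

$ xargs -a "$IMAGES" -n 1 -P 8 wget -q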

Now we can start doing image analysis.

1,203,867 images, now what?

I really wanted to take a macroscopic look at all the images, and looking around, the best tool for the job looked like montage, an ImageMagick command for creating composite images. But it wasn't that simple. 1,203,867 images is a lot of images, and it starts you thinking about what big data is. Is this big data? I don't know. Maybe?

Attempt #1

I can just point montage at a directory and say go to town, right? NOPE.

$ montage /path/to/1203867/elxn42/images/* elxn42.png

Too many arguments! After glancing through the man page, I find that I can pass it a line-delimited text file with the path to each file.

find "$(pwd)" -type f > images.txt

Now that I have that, I can pass montage that file, and I should be golden, right? NOPE.

$ montage @images.txt elxn42.png

I run out of RAM, and get a segmentation fault. This was on a machine with 80GB of RAM.

Attempt #2

Is this big data? What is big data?

Where can I get a machine with a bunch of RAM really quick? Amazon!

I spin up a d2.8xlarge (36 cores and 244GB RAM) EC2 instance, get my dataset over there, install ImageMagick, and run the command again.

$ montage @images.txt elxn42.png

NOPE. I run out of RAM, and get a segmentation fault. This was on a machine with 244GB of RAM.

Attempt #3

Is this big data? What is big data?

I've failed on two very large machines. Well, what I would consider large machines. So, I start googling and reading more ImageMagick documentation. Somebody has to have done something like this before, right? Astronomers deal with big images, right? How do they do this?

Then I find it: ImageMagick Large Image Support/Terapixel support, and the timing couldn't have been better. Ian and I had recently gotten set up with our ComputeCanada resource allocation. I set up a machine with 8 cores and 12GB RAM, and compiled the latest version of ImageMagick from source: ImageMagick-6.9.3-7.
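
The build itself was the usual configure-and-make routine; roughly something like this (a sketch, assuming the ImageMagick-6.9.3-7 source tarball has already been downloaded):

$ tar xzf ImageMagick-6.9.3-7.tar.gz
$ cd ImageMagick-6.9.3-7
$ ./configure
$ make
$ sudo make install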

montage -monitor -define registry:temporary-path=/data/tmp -limit memory 8GiB -limit map 10GiB -limit area 0 @elxn42-tweets-images.txt elxn42.png

Instead of running everything in RAM, which was my issue with this job, I'm able to write all the temporary files ImageMagick creates to disk with -define registry:temporary-path=/data/tmp, and limit my memory usage with -limit memory 8GiB -limit map 10GiB -limit area 0. Knowing this job would probably take a long time, -monitor comes in super handy for providing feedback on where the job is process-wise.

In the end, it took just over 12 days to run the job, used 3.5TB of disk space at its peak, and generated a 32GB PNG file. You can check it out here.

$ pngcheck elxn42.png
OK: elxn42.png (138112x135828, 48-bit RGB, non-interlaced, 69.6%).

$ exiftool elxn42.png
ExifTool Version Number         : 9.46
File Name                       : elxn42.png
Directory                       : .
File Size                       : 32661 MB
File Modification Date/Time     : 2016:03:30 00:48:44-04:00
File Access Date/Time           : 2016:03:30 10:20:26-04:00
File Inode Change Date/Time     : 2016:03:30 09:14:09-04:00
File Permissions                : rw-rw-r--
File Type                       : PNG
MIME Type                       : image/png
Image Width                     : 138112
Image Height                    : 135828
Bit Depth                       : 16
Color Type                      : RGB
Compression                     : Deflate/Inflate
Filter                          : Adaptive
Interlace                       : Noninterlaced
Gamma                           : 2.2
White Point X                   : 0.3127
White Point Y                   : 0.329
Red X                           : 0.64
Red Y                           : 0.33
Green X                         : 0.3
Green Y                         : 0.6
Blue X                          : 0.15
Blue Y                          : 0.06
Background Color                : 65535 65535 65535
Image Size                      : 138112x135828

Concluding Thoughts

Is this big data? I don't know. I started with 1,203,867 images and made them into a single image. Using 3.5TB of tmp files to create a 32GB image is mind-boggling when you start to think about it. But then it isn't when you think about it more. Do I need a machine with 3.5TB of RAM to run this in memory? Or do I just need to design a job around the resources I have and be patient? There are always trade-offs. But at the end of it all, I'm still sitting here asking myself: what is big data?

Maybe this is big data :-)

A look at 14,939,154 #paris #Bataclan #parisattacks #porteouverte tweets

On November 13, 2015, I was at "Web Archives 2015: Capture, Curate, Analyze" listening to Ian Milligan give the closing keynote when Thomas Padilla tweeted the following to me:

I immediately started collecting.


When tragedies like this happen, I feel pretty powerless. But I figure if I can collect something like this, similar to what I did for the Charlie Hebdo attacks, it's something. Maybe these datasets can be used for something positive to come out of all this negative.


When I started collecting, it just so happened that the creator of twarc, Ed Summers, was sitting next to me, and he mentioned some new functionality that was part of the v0.4.0 release of twarc: "Added --warnings flag to log warnings from the Twitter API about dropped tweets during streaming."

What's that mean? Basically, the public streaming API will not stream more than 1% of the total Twitter stream. If you are trying to capture something from the streaming API that exceeds 1% of the total Twitter stream, like, for instance, a hashtag or two related to a terrorist attack, the streaming API will drop tweets and notify you that it has done so. There is a really interesting look at this by Kevin Driscoll and Shawn Walker in the International Journal of Communication.
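
In practice, turning that on is just a matter of adding the flag to the streaming invocation, with the warnings ending up in twarc's log; a sketch (flag placement as I understand the v0.4.0 release):

$ twarc.py --warnings --stream "#paris,#Bataclan,#parisattacks,#porteouverte" > paris-stream.json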

Ed fired up the new version of twarc and began streaming as well so we could see what was happening. We noticed that we were getting warnings of around 400 dropped tweets per request (requests were seconds apart), which quickly escalated to over 28k dropped tweets per request. What we were trying to collect was over 1% of the total Twitter stream.

Dataset

Collection started on November 13, 2015 using both the streaming and search API. This is what it looked like:

$ twarc.py --search "#paris OR #Bataclan OR #parisattacks OR #porteouverte" > paris-search.json
$ twarc.py --stream "#paris,#Bataclan,#parisattacks,#porteouverte" > paris-stream.json

I took the strategy of utilizing both the search and streaming APIs for collection because of the 1% limit noted above. The idea was that if I'm hitting the limit with the streaming API, I should theoretically be able to capture any dropped tweets with the search API. The streaming API collection ran continuously during the collection period, from November 13, 2015 to December 11, 2015. The search API collection was run, and once a pass finished, it was immediately started back up, throughout the collection period. During the first two weeks of collection, a search API pass would take about a week to finish. In retrospect, I should have made note of the exact collection times to have some more numbers to look at. That said, I'm not confident I was able to grab every tweet related to the hashtags I was collecting on. The only way, I think, I could be confident is by comparing this dataset with a dataset from Gnip. But I am confident that I have a large amount of what was tweeted.

Once I finished collecting, I combined the JSON files, deduplicated with deduplicate.py, and then created a list of tweet ids with ids.py.
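
That step looked roughly like this (a sketch; the combined filename is an assumption, and deduplicate.py and ids.py live in twarc's utils directory):

$ cat paris-search.json paris-stream.json > paris-combined.json
$ python ~/git/twarc/utils/deduplicate.py paris-combined.json > paris-valid-deduplicated.json
$ python ~/git/twarc/utils/ids.py paris-valid-deduplicated.json > paris-tweet-ids.txt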

If you want to follow along or do your own analysis with the dataset, you can "hydrate" the dataset with twarc. You can grab the Tweet ids for the dataset from here (Data & Analysis tab).

$ twarc.py --hydrate paris-tweet-ids.txt > paris-tweets.json

The hydration process will take some time: roughly 72,000 tweets/hour. You might want to use something along the lines of GNU Screen, tmux, or nohup, since it'll take about 207.49 hours to completely hydrate.
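
For example, with nohup the hydration keeps running after you log out; a minimal sketch (the log filename is just an example):

$ nohup twarc.py --hydrate paris-tweet-ids.txt > paris-tweets.json 2> hydrate.log &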

paris-tweet-times chart, created with Peter Binkley's twarc-report

Overview

I'm only going to do a quick analysis of the dataset here since I want to get the dataset out and allow others to work with it. Tweets with geo coordinates are not covered below, but you can check out a map of them here.

There are a number of friendly utilities that come with twarc that allow for a quick exploratory analysis of a given collection. In addition, Peter Binkley's twarc-report is pretty handy for providing a quick overview of a given dataset.

Users

We are able to create a list of the unique Twitter usernames in the dataset by using users.py, and additionally sort them by the number of tweets:

$ python ~/git/twarc/utils/users.py paris-valid-deduplicated.json > paris-users.txt
$ cat paris-users.txt | sort | uniq -c | sort -n > paris-users-unique-sorted-count.txt
$ cat paris-users-unique-sorted-count.txt | wc -l
$ tail paris-users-unique-sorted-count.txt

From the above, we can see that there are 4,636,584 unique users in the dataset, and the top 10 accounts were as follows:

1. 38,883 tweets RelaxInParis
2. 36,504 tweets FrancePeace
3. 12,697 tweets FollowParisNews
4. 12,656 tweets Reduction_Paris
5. 10,044 tweets CNNsWorld
6. 8,208 tweets parisevent
7. 7,296 tweets TheMalyck_
8. 6,654 tweets genx_hrd
9. 6,370 tweets DHEdomains
10. 4,498 tweets paris_attack

Retweets

We are able to create a list of the most retweeted tweets in the dataset by using retweets.py:

$ python ~/git/twarc/utils/retweets.py paris-valid-deduplicated.json > paris-retweets.json
$ python ~/git/twarc/utils/tweet_urls.py paris-retweets.json > paris-retweets.txt

1. 53,639 retweets https://twitter.com/PNationale/status/665939383418273793
2. 44,457 retweets https://twitter.com/MarkRuffalo/status/665329805206900736
3. 41,400 retweets https://twitter.com/NiallOfficial/status/328827440157839361
4. 39,140 retweets https://twitter.com/oreoxzhel/status/665499107021066240
5. 37,214 retweets https://twitter.com/piersmorgan/status/665314980095356928
6. 24,955 retweets https://twitter.com/Fascinatingpics/status/665458581832077312
7. 22,124 retweets https://twitter.com/RGerrardActor/status/665325168953167873
8. 22,113 retweets https://twitter.com/HeralddeParis/status/665327408803741696
9. 22,069 retweets https://twitter.com/Gabriele_Corno/status/484640360120209408
10. 21,401 retweets https://twitter.com/SarahMatt97/status/665383304787529729

Hashtags

We were able to create a list of the unique hashtags used in our dataset by using tags.py.

$ python ~/git/twarc/utils/tags.py paris-valid-deduplicated.json > paris-hashtags.txt
$ cat paris-hashtags.txt | wc -l
$ head paris-hashtags.txt

From the above, we can see that there were 268,974 unique hashtags used. The top 10 hashtags used in the dataset were:

1. 6,812,941 tweets #parisattacks
2. 6,119,933 tweets #paris
3. 1,100,809 tweets #bataclan
4. 887,144 tweets #porteouverte
5. 673,543 tweets #prayforparis
6. 444,486 tweets #rechercheparis
7. 427,999 tweets #parís
8. 387,699 tweets #france
9. 341,059 tweets #fusillade
10. 303,410 tweets #isis

URLs

We are able to create a list of the unique URLs tweeted in our dataset by using urls.py, after first unshortening the urls with unshorten.py and unshrtn.
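
With the unshrtn service running, the unshortening step looks something like this (a sketch following the pattern used for the #JeSuisAhmed dataset later on; the output filename matches the one used below):

$ cat paris-valid-deduplicated.json | python ~/git/twarc/utils/unshorten.py > paris-valid-deduplicated-unshortened.json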

$ python ~/git/twarc/utils/urls.py paris-valid-deduplicated-unshortened.json > paris-tweets-urls.txt
$ cat paris-tweets-urls.txt | sort | uniq -c | sort -n > paris-tweets-urls-uniq.txt
$ cat paris-tweets-urls.txt | wc -l
$ cat paris-tweets-urls-uniq.txt | wc -l
$ tail paris-tweets-urls-uniq.txt

From the above, we can see that there were 5,561,037 URLs tweeted, representing 37.22% of total tweets, and 858,401 unique URLs tweeted. The top 10 URLs tweeted were as follows:

1. 46,034 tweets http://www.bbc.co.uk/news/live/world-europe-34815972?ns_mchannel=social&ns_campaign=bbc_breaking&ns_source=twitter&ns_linkname=news_central
2. 46,005 tweets https://twitter.com/account/suspended
3. 37,509 tweets http://www.lefigaro.fr/actualites/2015/11/13/01001-20151113LIVWWW00406-fusillade-paris-explosions-stade-de-france.php#xtor=AL-155-
4. 35,882 tweets http://twibbon.com/support/prayforparis-2/twitter
5. 33,531 tweets http://www.bbc.co.uk/news/live/world-europe-34815972
6. 33,039 tweets https://www.rt.com/news/321883-shooting-paris-dead-masked/
7. 24,221 tweets https://www.youtube.com/watch?v=-Uo6ZB0zrTQ
8. 23,536 tweets http://www.bbc.co.uk/news/live/world-europe-34825270
9. 21,237 tweets https://amp.twimg.com/v/fc122aff-6ece-47a4-b34c-cafbd72ef386
10. 21,107 tweets http://live.reuters.com/Event/Paris_attacks_2?Page=0

Images

We are able to create a list of images tweeted in our dataset by using image_urls.py.

$ python ~/git/twarc/utils/image_urls.py paris-valid-deduplicated.json > paris-tweets-images.txt
$ cat paris-tweets-images.txt | sort | uniq -c | sort -n > paris-tweets-images-uniq.txt
$ cat paris-tweets-images-uniq.txt | wc -l
$ tail paris-tweets-images-uniq.txt

From the above, we can see that there were 6,872,441 total image tweets, representing 46.00% of total tweets, and 660,470 unique images. The top 10 images tweeted were as follows:

  1. 49,051 Occurrences
    http://pbs.twimg.com/media/CT3jpTNWwAAipNa.jpg
  2. 43,348 Occurrences
    http://pbs.twimg.com/media/CTxT6REUsAAdsEe.jpg
  3. 22,615 Occurrences
    http://pbs.twimg.com/media/CTwvCV3WsAAY_r9.jpg
  4. 21,325 Occurrences
    http://pbs.twimg.com/media/CTu1s_tUEAEj1qn.jpg
  5. 20,689 Occurrences
    http://pbs.twimg.com/media/CTwkRSoWoAEdL6Z.jpg
  6. 19,696 Occurrences
    http://pbs.twimg.com/media/CTu3wKfUkAAhtw_.jpg
  7. 19,597 Occurrences
    http://pbs.twimg.com/media/CTvqliHUkAAf0GH.jpg
  8. 19,096 Occurrences
    http://pbs.twimg.com/ext_tw_video_thumb/665318181603426307/pu/img/KuVYpJVjWfPhbTR7.jpg
  9. 16,772 Occurrences
    http://pbs.twimg.com/media/CTwhk0IWoAAc5qZ.jpg
  10. 15,364 Occurrences
    http://pbs.twimg.com/media/CT4ONVjUEAAOBkS.jpg

An exploratory look at 13,968,293 #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, and #CharlieHebdo tweets

#JeSuisCharlie #JeSuisAhmed #JeSuisJuif #CharlieHebdo

I've spent the better part of a month collecting tweets from the #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, and #CharlieHebdo hashtags. Last week, I pulled together all of the collection files, did some cleanup, and did some more analysis on the data set (76G of JSON!). This time I was able to take advantage of Peter Binkley's twarc-report project. According to the report, the earliest tweet in the data set is from 2015-01-07 11:59:12 UTC, and the last tweet is from 2015-01-28 18:15:35 UTC. This data set includes 13,968,293 tweets (10,589,910 retweets - 75.81%) from 3,343,319 different users over 21 days. You can check out a word cloud of all the tweets here.

First tweet in data set (numeric sort of tweet ids):


Hydration

If you want to experiment/follow along with what I've done here, you can "rehydrate" the data set with twarc. You can grab the Tweet ids for the data set from here (Data & Analysis tab).

% twarc.py --hydrate JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweet-ids-20150129.txt > JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json

The hydration process will take some time. I'd highly suggest using GNU Screen or tmux, and grabbing approximately 15 pots of coffee.

Map

In this data set, we have 133,970 tweets with geo coordinates available. This represents about 0.96% of the entire data set.

The map is available here in a separate page since the geojson file is 83M and will potato your browser while everything loads. If anybody knows how to stream that geojson file to Leaflet.js so the browser doesn't potato, please comment! :-)

Users

These are the top 10 users in the data set.

  1. 35,420 tweets Promo_Culturel
  2. 33,075 tweets BotCharlie
  3. 24,251 tweets YaMeCanse21
  4. 23,126 tweets yakacliquer
  5. 17,576 tweets YaMeCanse20
  6. 15,315 tweets iS_Angry_Bird
  7. 9,615 tweets AbraIsacJac
  8. 9,318 tweets AnAnnoyingTweep
  9. 3,967 tweets rightnowio_feed
  10. 3,514 tweets russfeed

This comes from twarc-report's reportprofiler.py.

$ ~/git/twarc-report/reportprofiler.py -o text JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json

Hashtags

These are the top 10 hashtags in the data set.

  1. 8,597,175 tweets #charliehebdo
  2. 7,911,343 tweets #jesuischarlie
  3. 377,041 tweets #jesuisahmed
  4. 264,869 tweets #paris
  5. 186,976 tweets #france
  6. 177,448 tweets #parisshooting
  7. 141,993 tweets #jesuisjuif
  8. 140,539 tweets #marcherepublicaine
  9. 129,484 tweets #noussommescharlie
  10. 128,529 tweets #afp

URLs

These are the top 10 URLs in the data set. 3,771,042 tweets (27.00%) had a URL associated with them.

These are all shortened URLs; I'm working through an issue with unshorten.py.

  1. http://bbc.in/1xPaVhN (43,708)
  2. http://bit.ly/1AEpWnE (19,328)
  3. http://bit.ly/1DEm0TK (17,033)
  4. http://nyr.kr/14AeVIi (14,118)
  5. http://youtu.be/4KBdnOrTdMI (13,252)
  6. http://bbc.in/14ulyLt (12,407)
  7. http://europe1.fr/direct-video (9,228)
  8. http://bbc.in/1DxNLQD (9,044)
  9. http://ind.pn/1s5EV8w (8,721)
  10. http://srogers.cartodb.com/viz/123be814-96bb-11e4-aec1-0e9d821ea90d/embed_map (8,581)

This comes from twarc-report's reportprofiler.py.

$ ~/git/twarc-report/reportprofiler.py -o text JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json

Media

These are the top 10 media urls in the data set. 8,141,552 tweets (58.29%) had a media URL associated with them.

  1. 36,753 occurrences
  2. 35,942 occurrences
  3. 33,501 occurrences
  4. 31,712 occurrences
  5. 29,359 occurrences
  6. 26,334 occurrences
  7. 25,989 occurrences
  8. 23,974 occurrences
  9. 22,659 occurrences
  10. 22,421 occurrences

This comes from twarc-report's reportprofiler.py.

$ ~/git/twarc-report/reportprofiler.py -o text JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json

An exploratory look at 257,093 #JeSuisAhmed tweets

#JeSuisAhmed

I had some time last night to do some exploratory analysis on some of the #JeSuisAhmed collection. This analysis runs from the first #JeSuisAhmed tweet I was able to harvest to some time on January 14, 2015, when I copied over the JSON to experiment with a few of the twarc utilities.

First tweet in data set:


Last tweet in data set:

Hydration

If you want to experiment/follow along with what I've done here, you can "rehydrate" the data set with twarc. You can grab the Tweet ids for #JeSuisAhmed from here (Data & Analysis tab).

% twarc.py --hydrate JeSuisAhmed-ids-20150113.txt > JeSuisAhmed-tweets-20150113.json

The hydration process will take some time. I'd highly suggest using GNU Screen or tmux, and grabbing a cup of coffee.

Map

#JeSuisAhmed tweets with geo coordinates.

In this data set, we have 2,329 tweets with geo coordinates available. This represents about 0.91% of the entire data set (257,093 tweets).

How do you make this?

  • Create the geojson: ~/git/twarc/utils/geojson.py JeSuisAhmed-tweets-dedupe-20150112.json > JeSuisAhmed-tweets-dedupe-20150112.geojson

  • Give the geojson a variable name (see the sketch after this list).

  • Use Leaflet.js to put all the tweets with geo coordinates on a map like this.
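
Step two, giving the geojson a variable name, can be done with a quick bit of shell that wraps the geojson in a JavaScript variable; a minimal sketch, with the variable name and output filename as assumptions:

% { printf 'var tweets = '; cat JeSuisAhmed-tweets-dedupe-20150112.geojson; printf ';\n'; } > JeSuisAhmed-tweets-geo.js

The map page can then include that .js file and hand the tweets variable to Leaflet's L.geoJson().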

Images

These are the image urls that have more than 1000 occurrences in the data set.

  1. 13,703 occurrences
  2. 10,396 occurrences
  3. 6,088 occurrences
  4. 4,354 occurrences
  5. 3,229 occurrences
  6. 3,124 occurrences
  7. 2,307 occurrences
  8. 2,034 occurrences
  9. 1,949 occurrences
  10. 1,296 occurrences
  11. 1,182 occurrences
  12. 1,100 occurrences

How do you get the image list (requires unshrtn)?

% ~/git/twarc/utils/image_urls.py JeSuisAhmed-tweets-unshortened-20150112.json > JeSuisAhmed-images-20150112.txt
% cat JeSuisAhmed-images-20150112.txt | sort | uniq -c | sort -rn > JeSuisAhmed-images-ranked-20150112.txt

The ranked url data set can be found here.

Retweets

What are the three most retweeted tweets in the hashtag?




How do you find out the most retweeted tweets in the dataset? This will give you the top 10.

~/git/twarc/utils/retweets.py JeSuisAhmed-tweets-dedupe-20150112.json > JeSuisAhmed-retweets-20150112.json

Top URLs

Top 10 URLs tweeted from #JeSuisAhmed.

  1. http://www.huffingtonpost.ca/2015/01/08/ahmed-merabet-jesuisahmed-charlie-hebdo_n_6437984.html?ncid=tweetlnkushpmg00000067 (2895)
  2. http://limportant.fr/infos-jesuischarlie/76/360460 (1613)
  3. http://mic.com/articles/107988/the-hero-of-the-charlie-hebdo-shooting-we-re-overlooking (1318)
  4. http://www.huffingtonpost.co.uk/2015/01/08/charlie-hebdocharlie-hebdo-attack-jesuisahmed-hashtag-commemorating-ahmed-merabet-takes-off_n_6436528.html?1420731418&ncid=tweetlnkushpmg00000067 (919)
  5. http://www.huffingtonpost.com/2015/01/08/jesuisahmed-twitter-hashtag_n_6438132.html?ncid=tweetlnkushpmg00000067 (632)
  6. http://www.huffingtonpost.com/2015/01/08/jesuisahmed-twitter-hashtag_n_6438132.html?ncid=tweetlnkushpmg00000055 (592)
  7. http://www.dailymail.co.uk/news/article-2901681/Hero-police-officer-executed-street-married-42-year-old-Muslim-assigned-patrol-Paris-neighbourhood-Charlie-Hebdo-offices-located.html (571)
  8. http://blogs.mediapart.fr/blog/joel-villain/070115/il-sappelait-ahmed (555)
  9. http://www.bbc.co.uk/news/blogs-trending-30728491?ocid=socialflow_twitter (471)
  10. http://www.huffingtonpost.com/2015/01/08/jesuisahmed-twitter-hashtag_n_6438132.html?utm_hp_ref=tw (436)

Full list of urls can be found here.

How do you get the list (requires unshrtn)?

% cat JeSuisAhmed-tweets-20150112.json | ~/git/twarc/utils/unshorten.py > JeSuisAhmed-tweets-unshortened-20150112.json
% cat JeSuisAhmed-tweets-unshortened-20150112.json | ~/git/twarc/utils/urls.py | sort | uniq -c | sort -rn > JeSuisAhmed-urls.txt

Twitter Clients

Top 10 Twitter clients used from #JeSuisAhmed.

  1. Twitter for iPhone (85116)
  2. Twitter for Android (58819)
  3. Twitter Web Client (58166)
  4. Twitter for iPad (15304)
  5. Twitter for Websites (6877)
  6. Twitter for Windows Phone (5237)
  7. Twitter for Android Tablets (4420)
  8. TweetDeck (3790)
  9. Mobile Web (M5) (1708)
  10. Tweetbot for iOS (1691)

Full list of clients can be found here.

How do you get the list of Twitter client sources?

% ~/git/twarc/utils/source.py JeSuisAhmed-tweets-20150112.json > JeSuisAhmed-sources-20150112.html



#JeSuisCharlie images

Using the #JeSuisCharlie data set from January 11, 2015 (Warning! Will turn your browser into a potato for a few seconds), these are the image urls that have more than 1000 occurrences in the data set.

How to create (requires unshrtn):

% twarc.py --query "#JeSuisCharlie"
% ~/git/twarc/utils/deduplicate.py JeSuisCharlie-tweets.json > JeSuisCharlie-tweets-deduped.json
% cat JeSuisCharlie-tweets-deduped.json | utils/unshorten.py > JeSuisCharlie-tweets-deduped-ushortened.json
% ~/git/twarc/utils/image_urls.py JeSuisCharlie-tweets-deduped-ushortened.json >| JeSuisCharlie-20150115-image-urls.txt
% cat JeSuisCharlie-20150115-image-urls.txt | sort | uniq -c | sort -rn > JeSuisCharlie-20150115-image-urls-ranked.txt

The ranked url data set can be found here.

  1. 11,657 occurrences
  2. 4,764 occurrences
  3. 3,014 occurrences
  4. 2,977 occurrences
  5. 2,840 occurrences
  6. 2,363 occurrences
  7. 2,190 occurrences
  8. 2,015 occurrences
  9. 1,921 occurrences
  10. 1,906 occurrences
  11. 1,832 occurrences
  12. 1,512 occurrences
  13. 1,409 occurrences
  14. 1,348 occurrences
  15. 1,261 occurrences
  16. 1,207 occurrences
  17. 1,152 occurrences
  18. 1,114 occurrences
  19. 1,065 occurrences
  20. 1,055 occurrences
  21. 1,047 occurrences

Preliminary stats of #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, #CharlieHebdo

#JeSuisAhmed

$ wc -l *json
    148479 %23JeSuisAhmed-20150109103430.json
     94874 %23JeSuisAhmed-20150109141746.json
      5885 %23JeSuisAhmed-20150112092647.json
    249238 total
$ du -h
2.7G    .

#JeSuisCharlie

$ wc -l *json
    3894191 %23JeSuisCharlie-20150109094220.json
    1758849 %23JeSuisCharlie-20150109141730.json
     226784 %23JeSuisCharlie-20150112092710.json
         15 %23JeSuisCharlie-20150112092734.json
    5879839 total
$ du -h
32G .

#JeSuisJuif

$ wc -l *json
    23694 %23JeSuisJuif-20150109172957.json
    50603 %23JeSuisJuif-20150109173104.json
     5941 %23JeSuisJuif-20150110003450.json
    42237 %23JeSuisJuif-20150112094500.json
     5064 %23JeSuisJuif-20150112094648.json
   127539 total
$ du -h
671M    .

#CharlieHebdo

$ wc -l *json
    4444585 %23CharlieHebdo-20150109172713.json
        108 %23CharlieHebdo-20150109172825.json
    1164717 %23CharlieHebdo-20150109172844.json
    1068074 %23CharlieHebdo-20150112094427.json
      69446 %23CharlieHebdo-20150112094446.json
     185263 %23CharlieHebdo-20150112155558.json
    6932193 total
$ du -h
39G     .

Total

Preliminary and non-deduped, we're looking at roughly 74.4G of data and 13,188,809 tweets after 5.5 days of capturing the 4 hashtags.

Preliminary look at 3,893,553 #JeSuisCharlie tweets

Background

Last Friday (January 9, 2015) I started capturing #JeSuisAhmed, #JeSuisCharlie, #JeSuisJuif, and #CharlieHebdo with Ed Summers' twarc. I have about 12 million tweets at the time of writing this, and plan on writing up something a little more in-depth in the coming weeks. But for now, some preliminary analysis of #JeSuisCharlie. And if you haven't seen these two posts ("A Ferguson Twitter Archive", "On Forgetting and hydration") by Ed Summers, please do check them out.

How fast were the tweets coming in? Just to try and get a sense of this, I did a quick recording of tailing the twarc log for the #JeSuisCharlie capture.

Hydration

If you checked out both of Ed's posts, you'll have noticed that the Twitter ToS forbids the distribution of tweets, but we can distribute the tweet ids, and based on those we can "rehydrate" the data set locally. The tweet ids for each hashtag will be/are available here. I'll update and release the tweet id files as I can.

We're looking at just around 12 million tweets (un-deduped) at the time of writing, so the hydration process will take some time. I'd highly suggest using GNU Screen or tmux.

Hydrate

  • #JeSuisCharlie: % twarc.py --hydrate %23JeSuisCharlie-ids-20150112.txt > %23JeSuisCharlie-tweets-20150112.json
  • #JeSuisAhmed: % twarc.py --hydrate %23JeSuisAhmed-ids-20150112.txt > %23JeSuisAhmed-tweets-20150112.json
  • #JeSuisJuif: % twarc.py --hydrate %23JeSuisJuif-ids-20150112.txt > %23JeSuisJuif-tweets-20150112.json
  • #CharlieHebdo: % twarc.py --hydrate %23CharlieHebdo-ids-20150112.txt > %23CharlieHebdo-tweets-20150112.json

Map

#JeSuisCharlie tweets with geo coordinates.

In this data set, we have 51,942 tweets with geo coordinates available. This represents about 1.33% of the entire data set (3,893,553 tweets).

How do you make this?

  • Create the geojson: % ~/git/twarc/utils/geojson.py %23JeSuisCharlie-cat-20150115-tweets-deduped.json > %23JeSuisCharlie-cat-20150115-tweets-deduped.geojson

  • Give the geojson a variable name.

  • Use Leaflet.js to put all the tweets with geo coordinates on a map like this.

Top URLs

Top 10 URLs tweeted from #JeSuisCharlie.

  1. (11220) http://www.newyorker.com/culture/culture-desk/cover-story-2015-01-19?mbid=social_twitter
  2. (2278) http://www.europe1.fr/direct-video
  3. (1615) https://www.youtube.com/watch?v=4KBdnOrTdMI&feature=youtu.be
  4. (1347) https://www.youtube.com/watch?v=-bjbUg9d64g&feature=youtu.be
  5. (1333) http://www.amazon.com/Charlie-Hebdo/dp/B00007LMFU/
  6. (977) http://www.clubic.com/internet/actualite-748637-opcharliehebdo-anonymous-vengeance.html
  7. (934) http://www.maryam-rajavi.com/en/index.php?option=com_content&view=article&id=1735&catid=159&Itemid=506
  8. (810) http://www.lequipe.fr/eStore/Offres/Achat/271918
  9. (771) http://srogers.cartodb.com/viz/123be814-96bb-11e4-aec1-0e9d821ea90d/embed_map
  10. (605) https://www.youtube.com/watch?v=et4fYWKjP_o

Full list of urls can be found here.

How do you get the list?

  • % cat %23JeSuisCharlie-cat-20150115-tweets-deduped.json | ~/git/twarc/utils/unshorten.py > %23JeSuisCharlie-cat-20150115-tweets-deduped-unshortened.json
  • % cat %23JeSuisCharlie-cat-20150115-tweets-deduped-unshortened.json | ~/git/twarc/utils/urls.py | sort | uniq -c | sort -n > %23JeSuisCharlie-cat-20150115-urls.txt

Twitter Clients

Top 10 Twitter clients used from #JeSuisCharlie.

  1. (1283521) Twitter for iPhone
  2. (951925) Twitter Web Client
  3. (847308) Twitter for Android
  4. (231713) Twitter for iPad
  5. (86209) TweetDeck
  6. (82616) Twitter for Windows Phone
  7. (70286) Twitter for Android Tablets
  8. (44189) Twitter for Websites
  9. (39174) Instagram
  10. (21424) Mobile Web (M5)

Full list of clients can be found here.

How do you get this?

  • % ~/git/twarc/utils/source.py %23JeSuisCharlie-cat-20150115-tweets-deduped.json > %23JeSuisCharlie-cat-20150115-tweets-deduped-source.html

Word cloud

Word cloud from #JeSuisCharlie tweets.

I couldn't get the word cloud to embed nice, so you'll have to check it out here.

How do you create the word cloud?

  • % ~/git/twarc/utils/wordcloud.py %23JeSuisCharlie-cat-20150115-tweets.json > %23JeSuisCharlie-wordcloud.html

Islandora and nginx

Background

I have been doing a fair bit of scale testing for York University Digital Library over the last couple of weeks. Most of it has been focused on horizontal scaling of the traditional Islandora stack (Drupal, Fedora Commons, FedoraGSearch, Solr, and aDORe-djatoka). The stack is traditionally run with Apache2 in front of it, which reverse proxies the parts of the stack that are Tomcat webapps. I was curious whether the stack would work with nginx, and whether I would get any noticeable improvements just by switching from Apache2 to nginx. The preliminary good news is that the stack works with nginx (I'll outline the config below). The not surprising news, according to this, is that I'm not seeing any noticeable improvements. If time permits, I'll do some real benchmarking.

Islandora nginx configurations

Having no experience with nginx, I started searching around and found a config by David StClair that worked. With a few slight modifications, I was able to get the stack up and running with no major issues. The only major item I needed to figure out was how to reverse proxy aDORe-djatoka so that it would play nice with the default settings for Islandora OpenSeadragon. All this turned out to be was figuring out what the ProxyPass and ProxyPassReverse directive equivalents were for nginx. Turns out it is very straightforward. With Apache2, we needed:

  
  #Fedora Commons/Islandora proxying
  ProxyRequests Off
  ProxyPreserveHost On
  <Proxy *>
    Order deny,allow
    Allow from all
  </Proxy>
  ProxyPass /adore-djatoka http://digital.library.yorku.ca:8080/adore-djatoka
  ProxyPassReverse /adore-djatoka http://digital.library.yorku.ca:8080/adore-djatoka
  

This gives us a nice dog in a hat with Apache2.

With nginx we use the proxy_redirect directive.

  server {
    location /adore-djatoka {
      proxy_pass http://localhost:8080/adore-djatoka;
      proxy_redirect http://localhost:8080/adore-djatoka /adore-djatoka;
    }
  }

This gives us a nice dog in a hat with nginx.

That's really the only major modification I had to make to get the stack running with nginx. Here is my config, adapted from David StClair's example.

  server {

        server_name kappa.library.yorku.ca;
        root /path/to/drupal/install; ## <-- Your only path reference.

        # Enable compression, this will help if you have for instance advagg module
        # by serving Gzip versions of the files.
        gzip_static on;

        location = /favicon.ico {
                log_not_found off;
                access_log off;
        }

        location = /robots.txt {
                allow all;
                log_not_found off;
                access_log off;
        }

        # Very rarely should these ever be accessed outside of your lan
        location ~* \.(txt|log)$ {
                allow 127.0.0.1;
                deny all;
        }

        location ~ \..*/.*\.php$ {
                return 403;
        }

        # No no for private
        location ~ ^/sites/.*/private/ {
                return 403;
        }

        # Block access to "hidden" files and directories whose names begin with a
        # period. This includes directories used by version control systems such
        # as Subversion or Git to store control files.
        location ~ (^|/)\. {
                return 403;
        }
        location / {
                # This is cool because no php is touched for static content
                try_files $uri @rewrite;
                proxy_read_timeout 300;
        }

        location /adore-djatoka {
                proxy_pass http://localhost:8080/adore-djatoka;
                proxy_redirect http://localhost:8080/adore-djatoka /adore-djatoka;
        }

        location @rewrite {
                # You have 2 options here
                # For D7 and above:
                # Clean URLs are handled in drupal_environment_initialize().
                rewrite ^ /index.php;
                # For Drupal 6 and below:
                # Some modules enforce no slash (/) at the end of the URL
                # Else this rewrite block wouldn't be needed (GlobalRedirect)
                #rewrite ^/(.*)$ /index.php?q=$1;
        }

        # For Munin
        location /nginx_status {
                stub_status on;
                access_log off;
                allow 127.0.0.1;
                deny all;
        }

        location ~ \.php$ {
                fastcgi_split_path_info ^(.+\.php)(/.+)$;
                #NOTE: You should have "cgi.fix_pathinfo = 0;" in php.ini
                include fastcgi_params;
                fastcgi_param SCRIPT_FILENAME $request_filename;
                fastcgi_intercept_errors on;
                fastcgi_pass 127.0.0.1:9000;
        }

        # Fighting with Styles? This little gem is amazing.
        # This is for D7 and D8
        location ~ ^/sites/.*/files/styles/ {
                try_files $uri @rewrite;
        }

        location ~* \.(js|css|png|jpg|jpeg|gif|ico)$ {
                expires max;
                log_not_found off;
        }

  }

Why @gccaedits?

Ed already said it much, much better. I agree with Ed, and stand by his rationale.

So, why write this?

I want to document what a wonderful example this little project is of open source software, permissive intellectual property licenses (a Public Domain dedication in this instance), and open data, and how all of these things together can change the world.

In the two weeks since Ed shared his code, it has had 179 commits from 24 different contributors. It has been forked 92 times, has 33 watchers, and has 460 stargazers. In addition, we've witnessed the proliferation of similarly inspired bots: bots that surface anonymous tweets from national government IP ranges (U.S., Canada, France, Norway, etc.), state and provincial government IP ranges (@NCGAedits, @ONgovEdits, @lagovedits, etc.), big industry IP ranges (@phrmaedits, @oiledits, @monsantoedits, etc.), and intergovernmental organization IP ranges (@un_edits and @NATOedits). I'm aware of over 40 at the time of this writing, and new bots have consistently appeared daily over the past two weeks.

These bots have revealed some pretty amazing and controversial edits. Far, far too many to list here, but here are a few that have caught my eye.

International stories:

  • Russian government anonymous edits on flight MH17 page

Canadian:

  • Canadian House of Commons anonymous edits to Shelly Glover (Minister of Canadian Heritage) article
  • Canadian House of Commons anonymous edits to Pierre-Hugues Boisvenu (Senator) article
  • Homophobic anonymous edits from Natural Resources Canada to Richard Conn article


Much more important than these selected tweets, this software surfaces "big data" in a meaningful way. It provides transparency. It empowers a citizenry. It exists as a resource for research and investigative journalism. And, most importantly in my opinion, software written and shared like this can push all the cynicism aside and give one hope for the future.

#aaronsw