I’ve been collecting tweets to @realDonaldTrump since June 2017. The last time I pulled the dataset together and deduplicated it, I asked myself, “I wonder how many occurrences of ‘fuck’ are in the dataset?” Or, how many fucks are there to give?
The data is updated by running a query on the Standard Search API every five days.
$ twarc search 'to:realdonaldtrump' --log donald_search_$DATE.log > donald_search_$DATE.jsonl
Which yields something like this every five days.
...
donald_search_2018_05_01.jsonl
donald_search_2018_05_01.log
donald_search_2018_05_06.jsonl
donald_search_2018_05_06.log
donald_search_2018_05_11.jsonl
donald_search_2018_05_11.log
donald_search_2018_05_16.jsonl
donald_search_2018_05_16.log
donald_search_2018_05_21.jsonl
donald_search_2018_05_21.log
donald_search_2018_05_26.jsonl
donald_search_2018_05_26.log
donald_search_2018_05_31.jsonl
donald_search_2018_05_31.log
donald_search_2018_06_01.jsonl
donald_search_2018_06_01.log
donald_search_2018_06_06.jsonl
donald_search_2018_06_06.log
...
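The post never shows how $DATE gets set, so here is a minimal sketch of one way to build those file names. The `search_filenames` helper is hypothetical, and the `%Y_%m_%d` format is only inferred from the listing above.

```python
from datetime import date
from typing import Tuple


def search_filenames(day: date) -> Tuple[str, str]:
    """Return the (log, jsonl) file names for a search run on the given day."""
    # Assumed format: year_month_day, matching names like donald_search_2018_05_01.
    stamp = day.strftime("%Y_%m_%d")
    return f"donald_search_{stamp}.log", f"donald_search_{stamp}.jsonl"
```

Calling `search_filenames(date.today())` would give you the pair of names to pass to twarc for that day's run.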
I then cat all the jsonl files together and deduplicate them with deduplicate.py. This currently leaves us with 90,355,874 tweets to work with.
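If you're curious what the deduplication step boils down to, here is a rough sketch. It is not deduplicate.py itself, just an illustration that assumes each line is a tweet JSON object carrying an `id_str` field; `dedupe_tweets` is a name made up for this example.

```python
import json
from typing import Iterable, Iterator


def dedupe_tweets(lines: Iterable[str]) -> Iterator[str]:
    """Yield each JSON line whose tweet id has not been seen before."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines left over from concatenating files
        # Assumption: every tweet object carries its id as the string "id_str".
        tweet_id = json.loads(line)["id_str"]
        if tweet_id not in seen:
            seen.add(tweet_id)
            yield line
```

Keeping every id in a set is simple, but with tens of millions of tweets it will want a fair amount of memory.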
If you want to follow along, you can grab the most recent set of tweet ids from here. Then “hydrate” them like so:
$ gunzip to_realdonaldtrump_20180606_ids.txt.gz
$ twarc hydrate to_realdonaldtrump_20180606_ids.txt > 20180609.jsonl
This will probably take quite a while, since there are potentially 90,355,874 tweets to hydrate. You’ll end up with a jsonl file of around 368G.
Once we have our full dataset, the first thing we’ll do is remove all of the retweets with noretweets.py, giving us just the original tweets at @realDonaldTrump.
$ noretweets.py 20180609.jsonl > 20180609_no_retweets.jsonl
This brings us down to 69,013,268 unique tweets. Your number will probably be lower if you’re working with a hydrated dataset, because tweets that were deleted, or that belong to suspended or protected accounts, will not be hydrated.
$ wc -l 20180609_no_retweets.jsonl
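For a sense of what the retweet filter is doing, here is a minimal sketch. It assumes retweets can be recognized by the presence of a `retweeted_status` key, which is how the Twitter v1.1 API marks them; `drop_retweets` is a hypothetical name and this is not the source of noretweets.py.

```python
import json
from typing import Iterable, Iterator


def drop_retweets(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the JSON lines that are not retweets."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        tweet = json.loads(line)
        # Assumption: a retweet embeds the original tweet under "retweeted_status".
        if "retweeted_status" not in tweet:
            yield line
```

Counting what comes out of `drop_retweets` gives the same kind of number as the `wc -l` above.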
Over the course of collecting, some of the Twitter APIs and fields changed slightly (extended tweets and 280-character tweets). For us, this means the “text” of our tweets can reside in two different attributes.
So, we need to extract the text. Let’s use tweet_text.py:
$ tweet_text.py 20180612_no_retweets.jsonl >| 20180612_tweet_text.txt
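Here is a rough sketch of that kind of extraction. The attribute names are my assumption rather than something confirmed above: `full_text` for extended-mode tweets, `extended_tweet.full_text` for the streaming compatibility format, and `text` for older tweets. The `tweet_text` function is illustrative and not the contents of tweet_text.py.

```python
def tweet_text(tweet: dict) -> str:
    """Return the tweet's text from whichever attribute holds it, on one line."""
    # Assumed lookup order: extended fields first, then the classic "text".
    text = (
        tweet.get("full_text")
        or tweet.get("extended_tweet", {}).get("full_text")
        or tweet.get("text", "")
    )
    # Collapse newlines and runs of whitespace so each tweet is a single line.
    return " ".join(text.split())
```

Flattening each tweet to one line is a design choice here, so that a line count later on is also a tweet count.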
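Now that we have just the text, we can count how many fucks there are with grep and wc. Strictly speaking, this counts the lines of text that contain “fuck” at least once, not every individual occurrence.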
$ grep -i "fuck" 20180612_tweet_text.txt | wc -l
1882456
There are 1,882,456 fucks to give!
That’s a fuck to tweet ratio of 2.73%.
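To double check the arithmetic, and to see how counting matching lines differs from counting every occurrence, here is a small sketch. `count_fucks` and `percentage` are names made up for this example; the two figures passed to `percentage` are the ones reported above.

```python
import re
from typing import Iterable, Tuple

PATTERN = re.compile("fuck", re.IGNORECASE)


def count_fucks(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (lines containing a match, total number of matches)."""
    matching_lines = 0
    occurrences = 0
    for line in lines:
        hits = len(PATTERN.findall(line))
        if hits:
            matching_lines += 1  # what `grep -i ... | wc -l` counts
            occurrences += hits  # every individual match on the line
    return matching_lines, occurrences


def percentage(part: int, whole: int) -> float:
    """Return part as a percentage of whole, rounded to two decimals."""
    return round(part / whole * 100, 2)


# Matching lines and non-retweet tweets, as reported in the post.
print(percentage(1_882_456, 69_013_268))  # 2.73
```

A tweet with three fucks in it counts once toward the first number and three times toward the second.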
For some more fun, let’s take the last 1000 matching lines from our new text file and make an animated gif out of them.
First, let’s get our text:
$ grep -i "fuck" 20180612_tweet_text.txt > fucks.txt
$ tail -n 1000 fucks.txt > 1000_fucks.txt
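The same grep-then-tail step can be sketched in a single pass, which avoids writing the intermediate fucks.txt. This is only an illustration; `last_matching_lines` is a hypothetical helper, and it uses a simple case-insensitive substring check in place of grep.

```python
from collections import deque
from typing import Iterable, List


def last_matching_lines(lines: Iterable[str], needle: str = "fuck", n: int = 1000) -> List[str]:
    """Return the last n lines that contain needle, ignoring case."""
    needle = needle.lower()
    # A bounded deque keeps only the most recent n matches as we stream through.
    return list(deque((line for line in lines if needle in line.lower()), maxlen=n))
```

Because the deque is bounded, memory use stays flat no matter how large the text file is.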
Then let’s create a little bash script.
#!/bin/bash

# Render each line of text as a zero-padded, numbered PNG frame.
index=0
while IFS= read -r line; do
  index=$((index + 1))
  pad=$(printf "%05d" "$index")
  convert -size 800x600 -background black -weight 300 -fill white -gravity Center -font Ubuntu caption:"$line" "/path/to/images/$pad.png"
done < /path/to/1000_fucks.txt

# Stitch the frames together into a looping animated gif.
cd /path/to/images || exit 1
convert -monitor -define registry:temporary-path=/tmp -limit memory 8GiB -limit map 10GiB -delay 90 *.png -loop 0 1000_fucks.gif
Give it a filename, then make it executable, and run it!
You’ll end up with something like this: