Tweets to @realdonaldtrump; How many fucks are there to give?

I’ve been collecting tweets to @realDonaldTrump since June 2017. The last time I pulled the dataset together and deduplicated it, I asked myself, “I wonder how many occurrences of ‘fuck’ are in the dataset?” Or, how many fucks are there to give?

Well…

The data is updated by running a query on the Standard Search API every five days.

$ twarc search 'to:realdonaldtrump' --log donald_search_$DATE.log > donald_search_$DATE.jsonl

Which yields something like this every five days.

...
donald_search_2018_05_01.jsonl
donald_search_2018_05_01.log
donald_search_2018_05_06.jsonl
donald_search_2018_05_06.log
donald_search_2018_05_11.jsonl
donald_search_2018_05_11.log
donald_search_2018_05_16.jsonl
donald_search_2018_05_16.log
donald_search_2018_05_21.jsonl
donald_search_2018_05_21.log
donald_search_2018_05_26.jsonl
donald_search_2018_05_26.log
donald_search_2018_05_31.jsonl
donald_search_2018_05_31.log
donald_search_2018_06_01.jsonl
donald_search_2018_06_01.log
donald_search_2018_06_06.jsonl
donald_search_2018_06_06.log
...
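
The $DATE variable isn't defined in the command above. Judging from the file names, it is the collection date written as YYYY_MM_DD. Here's a minimal sketch of building those names and the matching twarc command in Python. The function names search_files and search_command are made up for illustration, and the five-day scheduling itself (cron or similar) is left out.

```python
from datetime import date


def search_files(day):
    """Return the (jsonl, log) file names for a given collection date."""
    stamp = day.strftime("%Y_%m_%d")
    return f"donald_search_{stamp}.jsonl", f"donald_search_{stamp}.log"


def search_command(day):
    """Return the twarc argument list and the file its stdout should go to."""
    jsonl, log = search_files(day)
    args = ["twarc", "search", "to:realdonaldtrump", "--log", log]
    return args, jsonl


args, output_file = search_command(date(2018, 6, 6))
print(" ".join(args), ">", output_file)
```

To actually run the search, the argument list could be handed to subprocess.run with stdout pointed at the jsonl file.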

Periodically, I cat all the jsonl files together, and then deduplicate them with deduplicate.py. So, this currently leaves us with 90,355,874 tweets to work with.
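
Because each search reaches back several days, consecutive files overlap, which is why the deduplication step matters. The idea can be sketched like this. It is not the actual deduplicate.py, just an illustration that assumes every line is a tweet object with an id_str field.

```python
import json


def deduplicate(lines):
    """Yield each JSONL line whose tweet id has not been seen before."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines left over from concatenation
        tweet_id = json.loads(line)["id_str"]
        if tweet_id not in seen:
            seen.add(tweet_id)
            yield line


sample = [
    '{"id_str": "1", "text": "first"}',
    '{"id_str": "2", "text": "second"}',
    '{"id_str": "1", "text": "first"}',  # duplicate from an overlapping search
]
unique = list(deduplicate(sample))
print(len(unique))
```

Keep in mind that this sketch holds every id in memory, which adds up with tens of millions of tweets.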

If you want to follow along, you can grab the most recent set of tweet ids from here. Then “hydrate” them like so:

$ gunzip to_realdonaldtrump_20180606_ids.txt.gz
$ twarc hydrate to_realdonaldtrump_20180606_ids.txt > 20180609.jsonl 

This will probably take quite a while, since there are potentially 90,355,874 tweets to hydrate. You’ll end up with a jsonl file of around 368G.

Once we have our full dataset, the first thing we’ll do is remove all of the retweets with noretweets.py, leaving just the original tweets to @realDonaldTrump.

$ noretweets.py 20180609.jsonl > 20180609_no_retweets.jsonl

This brings us down to 69,013,268 unique tweets. Your number will probably be lower if you’re working with a hydrated dataset, because tweets that have been deleted, or that belong to suspended or protected accounts, can’t be hydrated.

$ wc -l 20180609_no_retweets.jsonl
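
For reference, the retweet filter can be sketched in a few lines. This is an illustration rather than the actual noretweets.py. It assumes a retweet is any tweet object that carries a retweeted_status attribute, which is how the Twitter API marks retweets in this JSON format.

```python
import json


def is_retweet(tweet):
    """A tweet object that embeds the original tweet is a retweet."""
    return "retweeted_status" in tweet


def without_retweets(lines):
    """Yield only the JSONL lines that are not retweets."""
    for line in lines:
        if line.strip() and not is_retweet(json.loads(line)):
            yield line


sample = [
    '{"id_str": "1", "full_text": "@realDonaldTrump an original reply"}',
    '{"id_str": "2", "full_text": "RT @someone: ...", "retweeted_status": {"id_str": "9"}}',
]
kept = list(without_retweets(sample))
print(len(kept))
```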

Over the course of the collection, some of the Twitter APIs and fields changed slightly (extended tweets, and 280 character tweets). For us, this means the “text” of a tweet can reside in one of two attributes: text or full_text.

So, we need to extract the text. Let’s use tweet_text.py.

$ tweet_text.py 20180609_no_retweets.jsonl >| 20180612_tweet_text.txt
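
The text or full_text fallback looks roughly like this. It's a sketch of the idea, not the actual tweet_text.py. Collapsing whitespace so that each tweet lands on exactly one line is my own assumption here, made so that a later line count matches a tweet count.

```python
import json


def tweet_text(tweet):
    """Prefer full_text (extended tweets) and fall back to text."""
    text = tweet.get("full_text") or tweet.get("text", "")
    # Assumption: collapse newlines and repeated spaces so one tweet is one line.
    return " ".join(text.split())


sample = [
    '{"id_str": "1", "full_text": "a longer\\nextended tweet"}',
    '{"id_str": "2", "text": "an older, shorter tweet"}',
]
texts = [tweet_text(json.loads(line)) for line in sample]
for text in texts:
    print(text)
```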

Now that we have just the text, we can count how many fucks there are with grep and wc!

$ grep -i "fuck" 20180612_tweet_text.txt | wc -l
1882456

There are 1,882,456 fucks to give!
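
One caveat: grep piped to wc -l counts lines that contain at least one match, not individual occurrences, so a line with three fucks still counts once. Here's a small sketch that reports both numbers. The count_matches function is just for illustration.

```python
import re

PATTERN = re.compile("fuck", re.IGNORECASE)


def count_matches(lines):
    """Return (lines with at least one match, total occurrences)."""
    matching_lines = 0
    occurrences = 0
    for line in lines:
        hits = len(PATTERN.findall(line))
        if hits:
            matching_lines += 1
        occurrences += hits
    return matching_lines, occurrences


sample = ["Fuck this", "no swearing here", "fuck fuck FUCKING"]
print(count_matches(sample))
```

The first number corresponds to what the grep and wc pipeline reports. The second will be the same or higher.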

That’s a fuck to tweet ratio of 2.73%.
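
The ratio is just the number of matching lines divided by the number of tweets left after removing retweets:

```python
matching = 1_882_456   # lines matched by grep -i "fuck"
total = 69_013_268     # tweets left after removing retweets

ratio = matching / total * 100
print(f"{ratio:.2f}%")
```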

For some more fun, let’s take the last 1000 matching lines of our new text file, and make an animated gif out of them.

First, let’s get our text:

$ grep -i "fuck" 20180612_tweet_text.txt > fucks.txt
$ tail -n 1000 fucks.txt > 1000_fucks.txt
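
If you'd rather stay in Python, the same grep and tail combination can be sketched with a bounded deque, which only ever holds the last n matches in memory. The last_matching function is made up for illustration, and the lowercase comparison is only an approximation of grep -i.

```python
from collections import deque


def last_matching(lines, needle="fuck", n=1000):
    """Return the last n lines that contain needle, ignoring case."""
    needle = needle.lower()
    matches = (line for line in lines if needle in line.lower())
    # A deque with maxlen silently drops the oldest entries as new ones arrive.
    return list(deque(matches, maxlen=n))


sample = [f"Fuck number {i}" for i in range(5)] + ["a clean line"]
print(last_matching(sample, n=2))
```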

Then let’s create a little bash script.

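#!/bin/bash

# Render each line of 1000_fucks.txt as an 800x600 image,
# then join the images into an animated gif.
index=0

# IFS= and -r keep leading whitespace and backslashes in each line intact.
while IFS= read -r line; do
  index=$((index + 1))
  # Zero-pad the frame number so the *.png glob below sorts frames in order.
  pad=$(printf "%05d" "$index")
  convert -size 800x600 -background black -weight 300 -fill white -gravity Center -font Ubuntu caption:"$line" "/path/to/images/$pad.png"
done < /path/to/1000_fucks.txt

# Stop here if the images directory is missing.
cd /path/to/images || exit 1
convert -monitor -define registry:temporary-path=/tmp -limit memory 8GiB -limit map 10GiB -delay 90 *.png -loop 0 1000_fucks.gif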

Give it a filename, then make it executable, and run it!

You’ll end up with something like this:

1000 fucks

Nick Ruest
Associate Librarian
