Twitter Wordcloud Pipeline

At this past week’s Archives Unleashed datathon, I jokingly created some wordclouds of my Co-PI’s timelines.

Mat Kelly asked about the process this morning, so here is a little how-to of the pipeline:

Requirements:

  • twarc
  • jq
  • sed
  • word_cloud (provides wordcloud_cli.py)

Process:

Use twarc to grab some data.

$ twarc timeline lintool > lintool.jsonl

Extract the tweet text.

$ cat lintool.jsonl | jq -r .full_text > lintool_tweet.txt
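If you want to sanity-check the extraction without a real timeline, here is a throwaway sketch on a fabricated one-line record; `jq -r` prints the raw string instead of a JSON-quoted one. (The `full_text` field name matches what twarc stores for extended-mode tweets.)

```shell
# Fake JSONL record standing in for one line of lintool.jsonl.
echo '{"id": 1, "full_text": "hello world"}' | jq -r .full_text
# prints: hello world
```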

Remove all the URLs from the tweets.

$ sed -e 's!http[s]\?://\S*!!g' lintool_tweet.txt > lintool.txt
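To see what that expression actually does, here is a quick demo on a made-up tweet. Note the `\?` and `\S` escapes assume GNU sed; BSD/macOS sed would need different syntax.

```shell
# Strip every http/https URL: the scheme plus any run of non-whitespace.
echo 'Check out https://example.com/page and http://foo.bar now' \
  | sed -e 's!http[s]\?://\S*!!g'
# prints: Check out  and  now
```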

Create a Wordcloud.

$ wordcloud_cli.py --text lintool.txt --imagefile lintool.png

(Image: lintool wordcloud)


Nota bene

  • Each of these commands has a whole lot of options. Check them out and experiment.
  • Yes, there is probably a better way to do this, and you could even make it into a one-liner. I pulled this together as a favour to Mat.
  • We had initially planned to include wordclouds of collections in AUK, but wordcloud_cli.py doesn’t perform well at scale — scale here meaning txt files of 5G up to 500G of raw text. Maybe one day we’ll revisit it.
Nick Ruest
Associate Librarian
