At this past week’s Archives Unleashed dataton, I jokingly created some wordclouds of my Co-PI’s timelines.

Mat Kelly asked about the process this morning, so here is a little how-to of the pipeline:

Requirements:

Process:

Use twarc to grab some data.

$twarc timeline lintool > lintool.jsonl  Extract the tweet text. $ cat lintool.jsonl | jq -r .full_text > lintool_tweet.txt


Remove all the URLs from the tweets.

$sed -e 's!http[s]\?://\S*!!g' lintool_tweet.txt > lintool.txt  Create a Wordcloud. $ wordcloud_cli.py --text lintool.txt --imagefile lintool.png


Nota bene

• Each of these commands have a whole lot of options. Check them out, and experiment.
• Yes, there is probably a better way to do this, and you could even make it into a one-liner. I pulled this together as a favour to Mat.
• We were going to initially include wordclouds of collections in AUK, but wordcloud_cli.py doesn’t perform well at scale. Scale being, feeding it txt files of 5G up to 500G of raw text. Maybe one day we’ll revisit it.