Exploring #elxn44 Twitter Data

Introduction

This is my third time collecting tweets for a Canadian Federal Election, and it will most likely be the last. The changes to the Twitter API, including the Academic Research product track, along with twarc2 and the great Documenting the Now Slack, have considerably lowered the barrier to collecting and analyzing Twitter data. I’m proud of the work I’ve done over the last seven years collecting and analyzing tweets. I hope it provides a solid implementation pattern for others to build on in the future!

If you want to check out past Canadian election Twitter dataset analysis, have a look at my earlier #elxn42 and #elxn43 posts.

In this analysis, I’m going to provide a mix of examples for examining the overall dataset, using twarc utilities, twut, and pandas. The format of this post is pretty much the same as the last election post I did, much like the results of this election!

Overview

The dataset was collected with Documenting the Now’s twarc. It contains 2,075,645 tweet IDs for #elxn44 tweets.

If you’d like to follow along and work with the tweets, they can be “rehydrated” with Documenting the Now’s twarc, or with Hydrator.

$ twarc hydrate elxn44-ids.txt > elxn44.jsonl
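If you would rather stay in Python, twarc’s library API can do the hydration directly; a minimal sketch, assuming your API credentials are already configured in ~/.twarc:

import json

from twarc import Twarc

# Twarc() picks up credentials from the ~/.twarc config file.
t = Twarc()

# Hydrate each tweet ID and write the full tweet JSON, one per line.
with open("elxn44-ids.txt") as ids, open("elxn44.jsonl", "w") as out:
    for tweet in t.hydrate(ids):
        out.write(json.dumps(tweet) + "\n")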

The dataset was created by collecting tweets from the Standard Search API on a cron job every five days from July 28, 2021 to November 01, 2021.
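The cron job itself isn’t included here, but each scheduled run amounted to something like the following sketch using twarc’s Python API (the output file naming is my own assumption):

import json
import time

from twarc import Twarc

t = Twarc()  # credentials from ~/.twarc

# The Standard Search API only reaches back about a week, so a run
# every five days keeps the collection windows overlapping.
filename = "elxn44-search-%s.jsonl" % time.strftime("%Y-%m-%d")
with open(filename, "w") as out:
    for tweet in t.search("#elxn44"):
        out.write(json.dumps(tweet) + "\n")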

[Image: #elxn44 Tweet Volume]

[Image: #elxn44 wordcloud]

Top languages

Using the full dataset and twut:

import io.archivesunleashed._
import spark.implicits._

val tweets = "elxn44_search.jsonl"
val tweetsDF = spark.read.json(tweets)
val languages = language(tweetsDF)

languages
  .groupBy("lang")
  .count()
  .orderBy(col("count").desc)
  .show(10)
+----+-------+
|lang|  count|
+----+-------+
|  en|1939944|
|  fr|  72900|
| und|  56247|
|  es|   2067|
|  ht|    583|
|  in|    417|
|  tl|    373|
|  ro|    286|
|  ca|    256|
|  pt|    245|
+----+-------+
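If you don’t have a Spark environment handy, the same tally can be pulled straight from the hydrated JSONL with pandas, reading in chunks to keep memory in check; a rough sketch:

import pandas as pd

# Stream the hydrated tweets in chunks and tally the "lang" field.
counts = pd.Series(dtype="float64")
for chunk in pd.read_json("elxn44.jsonl", lines=True, chunksize=100_000):
    counts = counts.add(chunk["lang"].value_counts(), fill_value=0)

print(counts.sort_values(ascending=False).astype(int).head(10))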

Top tweeters

Using the elxn44-user-info.csv derivative and pandas:


import pandas as pd
import altair as alt

userInfo = pd.read_csv("elxn44-user-info.csv")

tweeters = (
    userInfo['screen_name']
      .value_counts()
      .rename_axis("Username")
      .reset_index(name="Count")
      .head(10)
)

tweeter_chart = (
    alt.Chart(tweeters)
      .mark_bar()
      .encode(
          x=alt.X("Count:Q", axis=alt.Axis(title="Tweets")),
          y=alt.Y("Username:O", sort="-x", axis=alt.Axis(title="Username")))
)

tweeter_values = tweeter_chart.mark_text(
    align='left',
    baseline='middle',
    dx=3
    ).encode(
        text='Count:Q'
)

(tweeter_chart + tweeter_values).configure_axis(titleFontSize=20).configure_title(fontSize=35, font='Courier').properties(height=800, width=1600, title="#elxn44 Top Tweeters")
[Image: #elxn44 Top Tweeters]

Retweets

Using retweets.py from twarc utilities, we can find the most retweeted tweets:

$ python twarc/utils/retweets.py elxn44.jsonl | head

1438853428743184384,4930
1440147315390488584,3763
1433062401654472715,3645
1433484548034076700,2186
1436180477724041227,2033
1430717460035014657,1931
1427413115830890515,1813
1440157529737084928,1804
1434126903699456003,1540
1437456048663760896,1517

From there, we can append the tweet ID to https://twitter.com/i/status/ to see the tweet. Here are the top three:

  1. https://twitter.com/i/status/1438853428743184384 (4,930 retweets)

  2. https://twitter.com/i/status/1440147315390488584 (3,763 retweets)

  3. https://twitter.com/i/status/1433062401654472715 (3,645 retweets)
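If you want to do that for the whole list, here’s a quick sketch that turns the retweets.py output into links, assuming the output above was saved to elxn44-retweets.csv:

import csv

# Each row from retweets.py is "tweet_id,retweet_count".
with open("elxn44-retweets.csv") as f:
    for tweet_id, count in csv.reader(f):
        print(f"https://twitter.com/i/status/{tweet_id} ({count} retweets)")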

Top Hashtags

Using the elxn44-hashtags.csv derivative and pandas:

hashtags = pd.read_csv("elxn44-hashtags.csv")

top_tags = (
    hashtags.value_counts()
      .rename_axis("Hashtag")
      .reset_index(name="Count")
      .head(10)
)

tags_chart = (
    alt.Chart(top_tags)
      .mark_bar()
      .encode(
          x=alt.X("Count:Q", axis=alt.Axis(title="Tweets")),
          y=alt.Y("Hashtag:O", sort="-x", axis=alt.Axis(title="Hashtag")))
)

tags_values = tags_chart.mark_text(
    align='left',
    baseline='middle',
    dx=3
    ).encode(
        text='Count:Q'
)

(tags_chart + tags_values).configure_axis(titleFontSize=20).configure_title(fontSize=35, font='Courier').properties(height=800, width=1600, title="#elxn44 Top Hashtags")
[Image: #elxn44 Top Hashtags]

Top URLs

Using the full dataset and twut:

import io.archivesunleashed._
import spark.implicits._

val tweets = "elxn44_search.jsonl"
val tweetsDF = spark.read.json(tweets)
val urlsDF = urls(tweetsDF)

urlsDF
  .groupBy("url")
  .count()
  .orderBy(col("count").desc)
  .show(10)
+-------------------------------------------------------------------------------------------------------------------------------+-----+
|url                                                                                                                            |count|
+-------------------------------------------------------------------------------------------------------------------------------+-----+
|https://www.theglobeandmail.com/politics/article-majority-of-federal-conservative-candidates-wont-disclose-covid-19/           |1107 |
|https://t.co/asEA19aPuX                                                                                                        |1073 |
|https://torontosun.com/news/election-2021/its-what-we-have-to-do-liberal-candidate-says-housing-tax-is-coming                  |672  |
|https://t.co/PYcZuMEeWT                                                                                                        |644  |
|https://montrealgazette.com/news/national/election-2021/survivors-of-dawson-college-shootings-urge-voters-to-shun-conservatives|491  |
|http://leadnow.ca/Cooperate2021                                                                                                |472  |
|http://elections.ca                                                                                                            |467  |
|https://www.theglobeandmail.com/politics/article-liberal-candidate-facing-15-million-lawsuit-over-pandemic-mask-making/        |453  |
|https://www.arcc-cdac.ca/erin-otoole-is-still-not-pro-choice/                                                                  |451  |
|http://Globalnews.ca                                                                                                           |448  |
+-------------------------------------------------------------------------------------------------------------------------------+-----+
[Image: #elxn44 Top URLs]

Adding URLs to Internet Archive

Do you know about the Internet Archive’s handy Save Page Now API?

Well, you could submit all those URLs to the Internet Archive if you wanted to.

You have to be nice and include a 5-second pause between each submission, otherwise your IP address will be blocked for 5 minutes!

Too Many Requests

We are limiting the number of URLs you can submit to be Archived to the Wayback Machine, using the Save Page Now features, to no more than 15 per minute.

If you submit more than that we will block Save Page Now requests from your IP number for 5 minutes.

Please feel free to write to us at info@archive.org if you have questions about this. Please include your IP address and any URLs in the email so we can provide you with better service.

You can use something like Raffaele Messuti’s example here.

Or, a one-liner if you prefer!

$ while read -r line; do curl -s -S "https://web.archive.org/save/$line" && echo "$line submitted to Internet Archive" && sleep 5; done < elxn44-urls.txt
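Or, the same thing in Python with requests, keeping the 5-second pause; a sketch with error handling kept minimal:

import time

import requests

# Submit each URL to Save Page Now, pausing between requests to
# stay under the 15-per-minute limit.
with open("elxn44-urls.txt") as f:
    for url in (line.strip() for line in f if line.strip()):
        r = requests.get("https://web.archive.org/save/" + url)
        status = "submitted to Internet Archive" if r.ok else "failed (%s)" % r.status_code
        print(url, status)
        time.sleep(5)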

I wonder what the delta is this time between collecting organizations, the Internet Archive, and tweeted URLs?

Top media URLs

Using the full dataset and twut:

import io.archivesunleashed._
import spark.implicits._

val tweets = "elxn44_search.jsonl"
val tweetsDF = spark.read.json(tweets)
val mediaUrlsDF = mediaUrls(tweetsDF)

mediaUrlsDF
  .groupBy("image_url")
  .count()
  .orderBy(col("count").desc)
  .show(10)
+----------------------------------------------------------------------------------------------------------+-----+
|image_url                                                                                                 |count|
+----------------------------------------------------------------------------------------------------------+-----+
|https://pbs.twimg.com/media/E_fVjfpX0AIuzbk.jpg                                                           |4447 |
|https://pbs.twimg.com/media/E_x3pCIUYAEflui.jpg                                                           |1530 |
|https://pbs.twimg.com/media/E88AJEGXEAomOpF.jpg                                                           |1350 |
|https://video.twimg.com/ext_tw_video/1427709984591253516/pu/vid/640x360/Tl3_CcAIuO1muNYb.mp4?tag=12       |1009 |
|https://video.twimg.com/ext_tw_video/1427709984591253516/pu/vid/1280x720/hTx3MPjA_q4agA8B.mp4?tag=12      |1009 |
|https://video.twimg.com/ext_tw_video/1427709984591253516/pu/vid/480x270/h0r1T_9vvycNfqbD.mp4?tag=12       |1009 |
|https://video.twimg.com/ext_tw_video/1427709984591253516/pu/pl/Jn2iGhdECjYPQUbD.m3u8?tag=12&container=fmp4|1009 |
|https://video.twimg.com/amplify_video/1428595346905600002/vid/480x270/8NUYCw9aDJTyRLVB.mp4?tag=14         |908  |
|https://video.twimg.com/amplify_video/1428595346905600002/vid/1280x720/VbQB-q2htVF377Rn.mp4?tag=14        |908  |
|https://video.twimg.com/amplify_video/1428595346905600002/pl/MS_CCvQ_RhXMI-jE.m3u8?tag=14                 |908  |
+----------------------------------------------------------------------------------------------------------+-----+
[Image: #elxn44 Top Media]

Top video URLs

Using the full dataset and twut:

import io.archivesunleashed._
import spark.implicits._

val tweets = "elxn44_search.jsonl"
val tweetsDF = spark.read.json(tweets)
val videoUrlsDF = videoUrls(tweetsDF)

videoUrlsDF
  .groupBy("video_url")
  .count()
  .orderBy(col("count").desc)
  .show(10)
+----------------------------------------------------------------------------------------------------------+-----+
|video_url                                                                                                 |count|
+----------------------------------------------------------------------------------------------------------+-----+
|https://video.twimg.com/ext_tw_video/1427709984591253516/pu/vid/1280x720/hTx3MPjA_q4agA8B.mp4?tag=12      |1009 |
|https://video.twimg.com/ext_tw_video/1427709984591253516/pu/vid/480x270/h0r1T_9vvycNfqbD.mp4?tag=12       |1009 |
|https://video.twimg.com/ext_tw_video/1427709984591253516/pu/pl/Jn2iGhdECjYPQUbD.m3u8?tag=12&container=fmp4|1009 |
|https://video.twimg.com/ext_tw_video/1427709984591253516/pu/vid/640x360/Tl3_CcAIuO1muNYb.mp4?tag=12       |1009 |
|https://video.twimg.com/amplify_video/1428595346905600002/vid/640x360/fNG4AQPEDbK3iBkT.mp4?tag=14         |908  |
|https://video.twimg.com/amplify_video/1428595346905600002/vid/480x270/8NUYCw9aDJTyRLVB.mp4?tag=14         |908  |
|https://video.twimg.com/amplify_video/1428595346905600002/pl/MS_CCvQ_RhXMI-jE.m3u8?tag=14                 |908  |
|https://video.twimg.com/amplify_video/1428595346905600002/vid/1280x720/VbQB-q2htVF377Rn.mp4?tag=14        |908  |
|https://video.twimg.com/amplify_video/1428044986424070146/pl/0HqYKJ8inSozt6oK.m3u8?tag=14                 |838  |
|https://video.twimg.com/amplify_video/1428044986424070146/vid/1280x720/dkzhncX0hp4U_pKe.mp4?tag=14        |838  |
+----------------------------------------------------------------------------------------------------------+-----+
[Image: #elxn44 Top Video]

Images

We looked a bit at the media URLs above. Since we have a list (elxn44-media-urls.txt) of all of the media URLs, we can download them and get a high-level overview of the entire collection with a couple of different utilities.

Similar to the one-liner above, we can download all the images like so:

$ while read -r line; do wget "$line"; done < elxn44-media-urls.txt

You can speed up the process if you want with xargs or GNU Parallel.

$ cat elxn44-media-urls.txt | parallel --jobs 24 --gnu "wget '{}'"
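The same idea works in Python with a thread pool; a sketch that names each file after the last path segment of its URL, so treat it as a starting point rather than anything collision-proof:

import os
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

import requests

def fetch(url):
    # Name the file after the last path segment of the URL.
    filename = os.path.basename(urlparse(url).path) or "index"
    response = requests.get(url)
    with open(filename, "wb") as out:
        out.write(response.content)

with open("elxn44-media-urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# 24 download threads, to match the GNU Parallel example above.
with ThreadPoolExecutor(max_workers=24) as pool:
    list(pool.map(fetch, urls))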

Juxta

Juxta is a really great tool that was created by Toke Eskildsen. I’ve written about it in the past, so I won’t go into an overview of it here.

That said, I’ve created a collage of the 212,621 images in the dataset. Click the image below to check it out.

If you’ve already hydrated the dataset, you can skip part of the process that occurs in demo_twitter.sh by setting up a directory structure, and naming files accordingly.

Set up the directory structure. You’ll need to have elxn44-ids.txt at the root of it, and hydrated.json.gz in the elxn44_downloads directory you’ll create below.

$ mkdir elxn44 elxn44_downloads
$ cp elxn44.jsonl elxn44_downloads/hydrated.json
$ gzip elxn44_downloads/hydrated.json

Your setup should look like:

├── elxn44
├── elxn44_downloads
│   ├── hydrated.json.gz
├── elxn44-ids.txt

From there, you can fire up Juxta and wait while it downloads all the images and creates the collage. Depending on the number of cores you have, you can make use of multiple threads by setting the THREADS variable. For example: THREADS=20.

$ THREADS=20 /path/to/juxta/demo_twitter.sh elxn44-ids.txt elxn44

To make sense of the collage, follow along chronologically: the top left corner of the image holds the earliest images in the dataset (July 28, 2021), and the bottom right corner holds the most recent (October 6, 2021). Zoom in and pan around! The images link back to the tweet they came from.

[Image: #elxn44 Juxta]

-30-
