A couple of years ago I wrote about a method for creating a collage out of 1.2M images collected from the 2015 Canadian Federal Election Twitter dataset. That method was very resource intensive in terms of the temporary disk storage required to create the collage. As the number of images in a collage increased, the temporary disk space grew much faster than linearly, roughly quadratically going by these two data points: 3.5T for 1.2M #elxn42 images, and ~90T for 6.1M #WomensMarch images.
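As a rough check on how the temporary storage grew, the two figures above can be used to estimate a scaling exponent. This is only a back-of-the-envelope sketch from two data points; the function name `scaling_exponent` is made up for illustration, and it assumes storage follows a simple power law in the number of images.

```python
import math

def scaling_exponent(n1, s1, n2, s2):
    """Estimate k in storage ~ images**k from two (images, storage) points."""
    return math.log(s2 / s1) / math.log(n2 / n1)

# 1.2M images needed 3.5T; 6.1M images needed roughly 90T.
k = scaling_exponent(1.2e6, 3.5, 6.1e6, 90)
print(f"estimated exponent: {k:.2f}")
```

An exponent near 2 would mean that doubling the number of images roughly quadruples the temporary disk space.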
Once I shared the collage there was some interesting discussion around it. One feature came up a few times that I found fascinating but had no idea how to implement with the montage method: linking back to the tweet a given image came from. Luckily, a colleague in Denmark had something in mind: Juxta.
Why does the Juxta method work better? Instead of creating a single massive image from all the base images and then generating tiles to display with OpenSeadragon, it skips the giant image entirely and just creates the tiles! Even better, if you’re working with Twitter data, you can link back to the original tweet a given image came from.
About the dataset
I began collecting tweets directed at Donald Trump (@realDonaldTrump) in May of 2017 using Documenting the Now’s twarc. Tweets from May 7, 2017 - June 21, 2017 were collected with a combination of the Filter (Streaming) API and Search API. Collecting via the Filter API failed on June 21, 2017. From June 23, 2017 forward, only the Search API was used. Collection runs via a cron job every five days. Periodically the dataset is deduplicated, tweet ids are extracted, and the public dataset is updated.
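The deduplicate-and-extract-ids step can be sketched roughly as follows. This is a minimal illustration, not the exact tooling used for the dataset: it assumes the tweets are stored as line-oriented JSON with the Twitter v1.1 `id_str` field, and the function name `dedupe_tweet_ids` and the sample data are hypothetical.

```python
import json

def dedupe_tweet_ids(lines):
    """Yield each tweet id once, in first-seen order, from JSONL lines."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet_id = json.loads(line)["id_str"]
        if tweet_id not in seen:
            seen.add(tweet_id)
            yield tweet_id

# Hypothetical sample: overlapping collection runs can return the same tweet twice.
sample = [
    '{"id_str": "1", "text": "first"}',
    '{"id_str": "2", "text": "second"}',
    '{"id_str": "1", "text": "first"}',
]
ids = list(dedupe_tweet_ids(sample))
print(ids)
```

Keeping every id in an in-memory set is fine for a sample like this, but for a dataset with well over a hundred million tweets a sort-based or on-disk approach would likely be more practical.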
The collage was created with Version 6 of the dataset: 146,341,720 tweets from May 7, 2017 - January 2019.
About the collage
The collage took about 10 days to complete with 44 threads on an 88-core machine. It consists of 17,525,913 images tweeted at Donald Trump, 93,479,104 tiles, and uses 93,501,108 inodes.
To understand the collage, you can follow along in the image chronologically. The top left corner holds the earliest images in the dataset (May 2017), and the bottom right corner holds the latest (January 2019). Zoom in and pan around!
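To make the chronological layout concrete, here is a small sketch of how an image's position in the timeline could map to a spot in the collage. It assumes the images are placed in row-major order (left to right, then top to bottom) on a roughly square grid, which is consistent with the description above but is an assumption about the layout; the function names `grid_shape` and `grid_position` are hypothetical.

```python
import math

def grid_shape(total, aspect=1.0):
    """Return (rows, columns) for a grid holding `total` images.

    `aspect` is the desired columns-to-rows ratio (1.0 means roughly square).
    """
    columns = math.ceil(math.sqrt(total * aspect))
    rows = math.ceil(total / columns)
    return rows, columns

def grid_position(index, columns):
    """Return (row, column) of the image at chronological `index` (0-based)."""
    return divmod(index, columns)

rows, columns = grid_shape(17_525_913)
first = grid_position(0, columns)            # earliest image: top left
last = grid_position(17_525_913 - 1, columns)  # latest image: bottom row
print(rows, columns, first, last)
```

Under these assumptions, an image about halfway through the collection period by count would sit about halfway down the collage.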