Background
Last August, I began capturing the #elxn42 hashtag as an experiment and potential research project with Ian Milligan. Once Justin Trudeau was sworn in as the 23rd Prime Minister of Canada, we stopped collection and began analysing the dataset. We wrote that analysis up for the Code4Lib Journal, which will be published in the next couple of weeks. In the interim, you can check out our pre-print here. Included in that dataset is a line-delimited list of URLs for every embedded image tweeted in the dataset; 1,203,867 images. So, I downloaded them. It took a couple of days.
getTweetImages

IMAGES=/path/to/elxn42-image-urls.txt
cd /path/to/elxn42/images
cat $IMAGES | while read line; do
  wget "$line"
done
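With over a million URLs, fetching one file at a time is slow. A parallel variant is sketched below using xargs -P; this is not the script I actually ran, and the urls.txt name, worker count, and example URLs are all illustrative:

```shell
# Illustrative only: a tiny stand-in URL list. Point this at the real
# elxn42-image-urls.txt and replace `echo wget` with `wget -q` to
# actually download.
printf '%s\n' \
  'http://example.com/a.jpg' \
  'http://example.com/b.jpg' \
  'http://example.com/c.jpg' > urls.txt

# xargs -n 1 passes one URL per invocation; -P 8 runs up to eight
# downloads at once instead of one after another.
xargs -n 1 -P 8 echo wget < urls.txt
```

The echo dry run prints the commands that would be executed, which is a cheap way to check the batching before committing to a multi-day download.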
Now we can start doing image analysis.
1,203,867 images, now what?
I really wanted to take a macroscopic look at all the images, and looking around, the best tool for the job looked like montage, an ImageMagick command for creating composite images. But it wasn't that simple. 1,203,867 images is a lot of images, and starts getting you thinking about what big data is. Is this big data? I don't know. Maybe?
Attempt #1
I can just point montage at a directory and say go to town, right? NOPE.
$ montage /path/to/1203867/elxn42/images/* elxn42.png
Too many arguments! After glancing through the man page, I find that I can pass it a line-delimited text file with the paths to each file. I run the following command in the directory with all the downloaded images.
find `pwd` -type f | cat > ../images.txt
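To see what montage's @file expects — one absolute path per line — here's the same listing step reproduced on a throwaway directory (the demo directory and file names are made up):

```shell
# Demo only: fake a small directory of "images" and build the path
# list the same way as above.
mkdir -p demo/images
cd demo/images
touch 1.png 2.png 3.png

# find with an absolute starting point emits absolute paths, one per
# line, which is exactly the format the @file argument consumes.
find "$(pwd)" -type f > ../images.txt
wc -l < ../images.txt   # one line per image
cd ../..
```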
Now that I have that, I can pass montage that file, and I should be golden, right? NOPE.
$ montage @images.txt elxn42.png
I run out of RAM, and get a segmentation fault. This was on a machine with 80GB of RAM.
Attempt #2
Is this big data? What is big data?
Where can I get a machine with a bunch of RAM really quick? Amazon!
I spin up a d2.8xlarge EC2 instance (36 cores and 244GB RAM), get my dataset over there, install ImageMagick, and run the command again.
$ montage @images.txt elxn42.png
NOPE. I run out of RAM, and get a segmentation fault. This was on a machine with 244GB of RAM.
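Another route would have been to split the job up rather than scale the machine up: montage the images in batches, then montage the intermediate tiles together. A sketch of that idea, with seq standing in for the real path list and a placeholder chunk size (this is not what I ended up doing):

```shell
# Sketch of a batch-then-combine approach. seq fakes a 25-line path
# list; split breaks it into 10-line chunks: chunk-aa, chunk-ab, chunk-ac.
seq 1 25 > paths.txt
split -l 10 paths.txt chunk-
ls chunk-* | wc -l

# Each chunk could then be montaged into an intermediate tile, keeping
# any single montage run small:
#   for c in chunk-*; do montage "@$c" "tile-$c.png"; done
#   montage tile-*.png combined.png
```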
Attempt #3
Is this big data? What is big data?
I’ve failed on two very large machines. Well, what I would consider large machines. So, I start googling, and reading more ImageMagick documentation. Somebody has to have done something like this before, right? Astronomers deal with big images, right? How do they do this?
Then I find it: ImageMagick Large Image Support/Terapixel support, and the timing couldn’t have been better. Ian and I had recently been set up with our Compute Canada resource allocation. I set up a machine with 8 cores and 12GB RAM, and compiled the latest version of ImageMagick from source; ImageMagick-6.9.3-7.
montage -monitor -define registry:temporary-path=/data/tmp -limit memory 8GiB -limit map 10GiB -limit area 0 @elxn42-tweets-images.txt elxn42.png
Instead of running everything in RAM, which became my issue with this job, I’m able to write all the tmp files ImageMagick creates to disk with -define registry:temporary-path=/data/tmp, and limit my memory usage with -limit memory 8GiB -limit map 10GiB -limit area 0. Knowing this job was probably going to take a long time, -monitor comes in super handy for providing feedback on where the job is at, process-wise.
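For reference, the same limits can also be set through ImageMagick's documented environment variables instead of per-command flags; the values below just mirror the command above:

```shell
# Environment-variable equivalents of the -define/-limit flags.
# ImageMagick reads these at startup, so they apply to every
# invocation in the session, not just one montage run.
export MAGICK_TEMPORARY_PATH=/data/tmp
export MAGICK_MEMORY_LIMIT=8GiB
export MAGICK_MAP_LIMIT=10GiB
export MAGICK_AREA_LIMIT=0

# `identify -list resource` prints the limits currently in effect.
```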
In the end, it took just over 12 days to run the job. It took up 3.5TB of disk space at its peak, and in the end generated a 32GB png file. You can check it out here.
$ pngcheck elxn42.png
OK: elxn42.png (138112x135828, 48-bit RGB, non-interlaced, 69.6%).

$ exiftool elxn42.png
ExifTool Version Number         : 9.46
File Name                       : elxn42.png
Directory                       : .
File Size                       : 32661 MB
File Modification Date/Time     : 2016:03:30 00:48:44-04:00
File Access Date/Time           : 2016:03:30 10:20:26-04:00
File Inode Change Date/Time     : 2016:03:30 09:14:09-04:00
File Permissions                : rw-rw-r--
File Type                       : PNG
MIME Type                       : image/png
Image Width                     : 138112
Image Height                    : 135828
Bit Depth                       : 16
Color Type                      : RGB
Compression                     : Deflate/Inflate
Filter                          : Adaptive
Interlace                       : Noninterlaced
Gamma                           : 2.2
White Point X                   : 0.3127
White Point Y                   : 0.329
Red X                           : 0.64
Red Y                           : 0.33
Green X                         : 0.3
Green Y                         : 0.6
Blue X                          : 0.15
Blue Y                          : 0.06
Background Color                : 65535 65535 65535
Image Size                      : 138112x135828
Concluding Thoughts
Is this big data? I don’t know. I started with 1,203,867 images and made them into a single image. Using 3.5TB of tmp files to create a 32GB image is mind-boggling when you start to think about it. But then it isn’t when you think about it more. Do I need a machine with 3.5TB of RAM to run this in memory? Or do I just need to design a job with the resources I have and be patient? There are always trade-offs. But, at the end of it all, I’m still sitting here asking myself: what is big data?
Maybe this is big data :-)
I extracted every image in the 4.1TB GeoCities WARC collection and you won’t believe what I found next!
— Ian Milligan (@ianmilligan1) March 31, 2016
(me neither… in short: too many!)
@ianmilligan1 so, we're going to montage these, right!?
— nick ruest (@ruebot) March 31, 2016