Background
Last August, I began capturing the #elxn42 hashtag as an experiment and potential research project with Ian Milligan. Once Justin Trudeau was sworn in as the 23rd Prime Minister of Canada, we stopped collection and began analysing the dataset. We wrote that analysis up for the Code4Lib Journal, which will be published in the next couple of weeks. In the interim, you can check out our pre-print here. Included in that dataset is a line-delimited list of URLs for every embedded image tweeted in the dataset; 1,203,867 images. So, I downloaded them. It took a couple of days.
getTweetImages

IMAGES=/path/to/elxn42-image-urls.txt
cd /path/to/elxn42/images
cat $IMAGES | while read line; do
  wget "$line"
done
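With over a million URLs, fetching one file at a time is slow. A parallel variant is sketched below using xargs -P; this is not the script I actually ran, and the urls.txt name, worker count, and example URLs are all illustrative:

```shell
# Illustrative only: a tiny stand-in URL list. Point this at the real
# elxn42-image-urls.txt and replace `echo wget` with `wget -q` to
# actually download.
printf '%s\n' \
  'http://example.com/a.jpg' \
  'http://example.com/b.jpg' \
  'http://example.com/c.jpg' > urls.txt

# xargs -n 1 passes one URL per invocation; -P 8 runs up to eight
# downloads at once instead of one after another.
xargs -n 1 -P 8 echo wget < urls.txt
```

The echo dry run prints the commands that would be executed, which is a cheap way to check the batching before committing to a multi-day download.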
Now we can start doing image analysis.
1,203,867 images, now what?
I really wanted to take a macroscopic look at all the images, and looking around, the best tool for the job looked like montage, an ImageMagick command for creating composite images. But it wasn't that simple. 1,203,867 images is a lot of images, and starts getting you thinking about what big data is. Is this big data? I don't know. Maybe?
Attempt #1
I can just point montage at a directory and say go to town, right? NOPE.
$ montage /path/to/1203867/elxn42/images/* elxn42.png
Too many arguments! After glancing through the man page, I find that I can pass it a line-delimited text file with the paths to each file. I run the following command in the directory with all the downloaded images.
find `pwd` -type f | cat > ../images.txt
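To see what montage's @file expects — one absolute path per line — here's the same listing step reproduced on a throwaway directory (the demo directory and file names are made up):

```shell
# Demo only: fake a small directory of "images" and build the path
# list the same way as above.
mkdir -p demo/images
cd demo/images
touch 1.png 2.png 3.png

# find with an absolute starting point emits absolute paths, one per
# line, which is exactly the format the @file argument consumes.
find "$(pwd)" -type f > ../images.txt
wc -l < ../images.txt   # one line per image
cd ../..
```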
Now that I have that, I can pass montage that file, and I should be golden, right? NOPE.
$ montage @images.txt elxn42.png
I run out of RAM, and get a segmentation fault. This was on a machine with 80GB of RAM.
Attempt #2
Is this big data? What is big data?
Where can I get a machine with a bunch of RAM really quick? Amazon!
I spin up a d2.8xlarge EC2 instance (36 cores and 244GB RAM), get my dataset over there, install ImageMagick, and run the command again.
$ montage @images.txt elxn42.png
NOPE. I run out of RAM, and get a segmentation fault. This was on a machine with 244GB of RAM.
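Another route would have been to split the job up rather than scale the machine up: montage the images in batches, then montage the intermediate tiles together. A sketch of that idea, with seq standing in for the real path list and a placeholder chunk size (this is not what I ended up doing):

```shell
# Sketch of a batch-then-combine approach. seq fakes a 25-line path
# list; split breaks it into 10-line chunks: chunk-aa, chunk-ab, chunk-ac.
seq 1 25 > paths.txt
split -l 10 paths.txt chunk-
ls chunk-* | wc -l

# Each chunk could then be montaged into an intermediate tile, keeping
# any single montage run small:
#   for c in chunk-*; do montage "@$c" "tile-$c.png"; done
#   montage tile-*.png combined.png
```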
Attempt #3
Is this big data? What is big data?
I’ve failed on two very large machines. Well, what I would consider large machines. So, I start googling, and reading more ImageMagick documentation. Somebody has to have done something like this before, right? Astronomers deal with big images, right? How do they do this?
Then I find it: ImageMagick Large Image Support/Terapixel support, and the timing couldn’t have been better. Ian and I had recently been set up with our Compute Canada resource allocation. I set up a machine with 8 cores and 12GB RAM, and compiled the latest version of ImageMagick from source; ImageMagick-6.9.3-7.
montage -monitor -define registry:temporary-path=/data/tmp -limit memory 8GiB -limit map 10GiB -limit area 0 @elxn42-tweets-images.txt elxn42.png
Instead of running everything in RAM, which became my issue with this job, I’m able to write all the tmp files ImageMagick creates to disk with -define registry:temporary-path=/data/tmp, and limit my memory usage with -limit memory 8GiB -limit map 10GiB -limit area 0. Knowing this job was probably going to take a long time, -monitor comes in super handy for providing feedback on where the job is at, process-wise.
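For reference, the same limits can also be set through ImageMagick's documented environment variables instead of per-command flags; the values below just mirror the command above:

```shell
# Environment-variable equivalents of the -define/-limit flags.
# ImageMagick reads these at startup, so they apply to every
# invocation in the session, not just one montage run.
export MAGICK_TEMPORARY_PATH=/data/tmp
export MAGICK_MEMORY_LIMIT=8GiB
export MAGICK_MAP_LIMIT=10GiB
export MAGICK_AREA_LIMIT=0

# `identify -list resource` prints the limits currently in effect.
```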
In the end, it took just over 12 days to run the job. It took up 3.5TB of disk space at its peak, and in the end generated a 32GB png file. You can check it out here.
$ pngcheck elxn42.png
OK: elxn42.png (138112x135828, 48-bit RGB, non-interlaced, 69.6%).

$ exiftool elxn42.png
ExifTool Version Number         : 9.46
File Name                       : elxn42.png
Directory                       : .
File Size                       : 32661 MB
File Modification Date/Time     : 2016:03:30 00:48:44-04:00
File Access Date/Time           : 2016:03:30 10:20:26-04:00
File Inode Change Date/Time     : 2016:03:30 09:14:09-04:00
File Permissions                : rw-rw-r--
File Type                       : PNG
MIME Type                       : image/png
Image Width                     : 138112
Image Height                    : 135828
Bit Depth                       : 16
Color Type                      : RGB
Compression                     : Deflate/Inflate
Filter                          : Adaptive
Interlace                       : Noninterlaced
Gamma                           : 2.2
White Point X                   : 0.3127
White Point Y                   : 0.329
Red X                           : 0.64
Red Y                           : 0.33
Green X                         : 0.3
Green Y                         : 0.6
Blue X                          : 0.15
Blue Y                          : 0.06
Background Color                : 65535 65535 65535
Image Size                      : 138112x135828
Concluding Thoughts
Is this big data? I don’t know. I started with 1,203,867 images and made them into a single image. Using 3.5TB of tmp files to create a 32GB image is mind-boggling when you start to think about it. But then it isn’t when you think about it more. Do I need a machine with 3.5TB of RAM to run this in memory? Or do I just need to design a job with the resources I have and be patient? There are always trade-offs. But, at the end of it all, I’m still sitting here asking myself: what is big data?
Maybe this is big data :-)
I extracted every image in the 4.1TB GeoCities WARC collection and you won’t believe what I found next!
— Ian Milligan (@ianmilligan1) March 31, 2016
(me neither… in short: too many!)
@ianmilligan1 so, we're going to montage these, right!?
— nick ruest (@ruebot) March 31, 2016