Originally posted here.
Trevor Owens and Grace Thomas recently had their article, “The invention and dissemination of the spacer gif: implications for the future of access and use of web archives,” published in the International Journal of Digital Humanities. It’s a great look at the history of the spacer.gif, how it proliferated in the early web, and a case study of digging into web archives and doing a whole lot of analysis. Reading it this past spring motivated me to round out the DataFrame implementation in the Archives Unleashed Toolkit so that more people could do this kind of inspirational work.
In late August 2019, the Archives Unleashed team released version 0.18.0 of the Archives Unleashed Toolkit. If you check out the release notes, you’ll see a lot of new functionality was added, along with some bug fixes. In those notes and user documentation you’ll see the new and expanded functionality with our DataFrame implementation. We now have functions for a variety of binary types (images, audio, video, pdf, spreadsheets, presentation program files, word processor files, and text files) that allow a user to extract binaries of a given type — all the images from a web collection for instance — or extract information about those binaries into a DataFrame that is output as a CSV file. For images you can extract:
the url of an image;
the width and height of the image;
the md5 hash of the image;
and the raw bytes of the image.
The other binary extraction and analysis functions (audio, video, pdf, etc.) produce the same columns as described above, minus the width and height.
So, how do you use this new functionality? We have it all documented here, but for posterity, this is how it works if you have a collection of web archives handy:
```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

val images = RecordLoader
  .loadArchives("/path/to/web/archive/collection", sc)
  .extractImageDetailsDF();

images.select($"url", $"filename", $"extension",
    $"mime_type_web_server", $"mime_type_tika",
    $"width", $"height", $"md5")
  .orderBy(desc("md5"))
  .write
  .csv("/path/to/images/dataframe")
```
Once the script is run across the web archive collection, it will produce a number of `part-234432.csv`-style files in your named output directory. You can concatenate those files into a single one with `cat part* > all-files.csv`. In the end you should end up with something like this:
```
http://it.geocities.com/grannoce/camere/thumb/camera_blu_001.jpg,camera_blu_001.jpg,jpg,image/jpeg,image/jpeg,112,150,fffffef31a159782b97876b7a17eab92
http://ar.geocities.com/angeles_uno/PLAYMATES/1999/JUNIO/KIMBERLY_SPICER/06_small.jpg,06_small.jpg,jpg,image/jpeg,image/jpeg,100,143,fffffd5fe6d986c04f028854bbd4a20a
http://in.geocities.com/nileshtx/images/DSC01219.jpg,DSC01219.jpg,jpg,image/jpeg,image/jpeg,510,768,fffffc7244d39657dd286547fda3fd0d
http://kr.geocities.com/magicianclow/img/favor.gif,favor.gif,gif,image/gif,image/gif,71,20,fffff8a7566c250585fb4453594b9c3e
http://login.space2000.de/logo.gif,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07
http://91-143-80-250.blue.kundencontroller.de/logo.gif,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07
http://cf.geocities.com/rouquins/images/merlin0.jpg,merlin0.jpg,jpg,image/jpeg,image/jpeg,129,140,fffff077e30e213fa08cecc389a60bdb
http://ar.geocities.com/aliaga_fernandoo/ediciones/ed7/imagenes/menu/MENU7_r11_c21.jpg,MENU7_r11_c21.jpg,jpg,image/jpeg,image/jpeg,68,10,ffffe91beaf231ea8b5fc46a1c6b7f32
http://www.geocities.com/audy000/newspic1/qudes.jpg,qudes.jpg,jpg,image/jpeg,image/jpeg,55,24,ffffd381a8c0ae2e6a7d63d8af6b893c
http://ca.geocities.com/brunette_george/holidays/dad_brendon_lighthouse.jpg,dad_brendon_lighthouse.jpg,jpg,image/jpeg,image/jpeg,300,226,ffffc83f77a1558222f40d7a44b1d464
```
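If you want to poke at those rows outside of Spark, a tiny hand-rolled parser is enough. This is just a sketch in plain Scala: `ImageRow` and `parseRow` are names I’ve made up for illustration, and a real parser should handle CSV quoting.

```scala
// Split one row of the toolkit's image CSV into named fields.
// Column order follows the select() in the extraction script above.
case class ImageRow(url: String, filename: String, extension: String,
                    mimeWeb: String, mimeTika: String,
                    width: Int, height: Int, md5: String)

def parseRow(line: String): ImageRow = {
  // These URLs contain no commas; real CSV parsing should handle quoting.
  val f = line.split(",", -1)
  ImageRow(f(0), f(1), f(2), f(3), f(4), f(5).toInt, f(6).toInt, f(7))
}

val sample = "http://login.space2000.de/logo.gif,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07"
val row = parseRow(sample)
println(s"${row.filename} is ${row.width}x${row.height}, md5 ${row.md5}")
```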
So, how is this relevant to the work of Owens and Thomas? Well, a few years ago, our team secured a copy of the GeoCities dataset from the Internet Archive. We’ve done a fair bit of analysis on the collection for a variety of research projects. If you’re curious about some of that research, check out Ian Milligan’s new book, “History in the Age of Abundance? How the Web is Transforming Historical Research.” It features a great case study on GeoCities. Also check out Sarah McTavish’s work, such as this presentation at the Michigan State Global Digital Humanities conference.
We ran an image analysis job across the entire 4 terabyte GeoCities dataset, and ended up with a 16 gigabyte CSV which held data for 121,371,844 images that the toolkit was able to identify!
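A CSV that size won’t fit comfortably in memory, so any local follow-up analysis has to stream it line by line. Here’s a minimal sketch of the md5-counting step in plain Scala (not the Spark job we actually ran); it assumes the md5 is the last comma-separated field, as in the sample output above.

```scala
// Tally md5 occurrences one line at a time, so a 16 GB CSV
// never has to be loaded into memory all at once.
def md5Counts(lines: Iterator[String]): Map[String, Long] =
  lines.foldLeft(Map.empty[String, Long]) { (acc, line) =>
    val md5 = line.substring(line.lastIndexOf(',') + 1)
    acc.updated(md5, acc.getOrElse(md5, 0L) + 1L)
  }

val demo = Iterator(
  "http://login.space2000.de/logo.gif,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07",
  "http://91-143-80-250.blue.kundencontroller.de/logo.gif,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07"
)
println(md5Counts(demo)) // the shared logo.gif hash appears twice
```

In practice you would feed `md5Counts` the lines from `scala.io.Source.fromFile(...)` so only one line is resident at a time.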
| URL | MD5 | COUNT MD5 | COUNT FILENAME |
It looks like `325472601571f31e1bf00674c368d335` (`spaceball.gif`) was the most prolific of the spacer.gif files in the GeoCities dataset. 15.27% of the 121,371,844 images that we identified with the toolkit are `325472601571f31e1bf00674c368d335` (`spaceball.gif`).
`325472601571f31e1bf00674c368d335` is represented by 3,130 different filenames. The full list is available here.
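That filename variety is easy to reproduce in miniature. The sketch below uses toy data in plain Scala (the real numbers came from the full 121-million-row CSV) and mirrors the COUNT MD5 and COUNT FILENAME tallies above: occurrences per hash, and distinct filenames per hash.

```scala
// (filename, md5) pairs stand in for rows of the image CSV.
val rows = Seq(
  ("spaceball.gif", "325472601571f31e1bf00674c368d335"),
  ("spacer.gif",    "325472601571f31e1bf00674c368d335"),
  ("spaceball.gif", "325472601571f31e1bf00674c368d335"),
  ("logo.gif",      "fffff72ef7571cf00d0717ac96bfad07")
)

val byMd5 = rows.groupBy(_._2)

// How many times each hash occurs...
val countMd5 = byMd5.map { case (md5, rs) => md5 -> rs.size }
// ...and how many distinct filenames it hides behind.
val countFilename = byMd5.map { case (md5, rs) => md5 -> rs.map(_._1).distinct.size }

println(countMd5("325472601571f31e1bf00674c368d335"))      // 3 occurrences
println(countFilename("325472601571f31e1bf00674c368d335")) // 2 distinct filenames
```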
The top 10 occurrences are:
[Pie chart: top 500 filenames for 325472601571f31e1bf00674c368d335, with serv removed.]
In total, 17.58% of the images the toolkit identified in the web archive come from the list Owens and Thomas identified. So, nearly 20% of the images in the 4 terabyte GeoCities dataset are 1x1 images 🤯🤯🤯.
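For a back-of-the-envelope check on those percentages, here is the arithmetic in Scala. The counts below are derived from the percentages in the text, not taken from the dataset itself.

```scala
// Sanity-check the reported shares against the total image count.
val totalImages = 121371844L

// 15.27% of all identified images share the spaceball.gif hash
val spaceballCount = math.round(totalImages * 0.1527) // ≈ 18.5 million images

// 17.58% are covered by the Owens & Thomas spacer list
val spacerListCount = math.round(totalImages * 0.1758) // ≈ 21.3 million images

println(f"spaceball.gif hash: $spaceballCount%,d of $totalImages%,d images")
println(f"spacer list total:  $spacerListCount%,d of $totalImages%,d images")
```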
If this type of analysis interests or inspires you, please check out the current call for participation for our fourth Archives Unleashed Datathon in NYC next March.
ATTN: Call for Participation is now OPEN!— The Archives Unleashed Project (@unleasharchives) 16 September 2019
We are pleased to co-host our fourth (@MellonFdn) Archives Unleashed Datathon w/ our colleagues from @columbialib in New York City 26-27 March 2020.
Travel grants available; open to all skill levels.
Spread the word! #webarchiving #NYC
For the latest news and project updates subscribe to our quarterly newsletter.
Keeping in touch with the Archives Unleashed Project is so easy!