GeoCities and the spacer.gif

Originally posted here.

Trevor Owens and Grace Thomas recently published their article, “The invention and dissemination of the spacer gif: implications for the future of access and use of web archives,” in the International Journal of Digital Humanities. It’s a great look at the history of the spacer.gif and how it proliferated in the early web, and it doubles as a case study in digging into web archives and doing a whole lot of analysis. Reading it this past spring really motivated me to round out the DataFrame implementation in the Archives Unleashed Toolkit so that more people could do this kind of inspirational work.

In late August 2019, the Archives Unleashed team released version 0.18.0 of the Archives Unleashed Toolkit. If you check out the release notes, you’ll see a lot of new functionality was added, along with some bug fixes. In those notes and user documentation you’ll see the new and expanded functionality with our DataFrame implementation. We now have functions for a variety of binary types (images, audio, video, pdf, spreadsheets, presentation program files, word processor files, and text files) that allow a user to extract binaries of a given type — all the images from a web collection for instance — or extract information about those binaries into a DataFrame that is output as a CSV file. For images you can extract:

  • the url of an image;

  • the filename;

  • the extension;

  • the MimeType provided by the web server and MimeType as identified by Apache Tika;

  • the width and height of the image;

  • the md5 hash of the image;

  • and the raw bytes of the image.

Similarly, for the other binary extraction and analysis functions (audio, video, pdf, etc.), you are able to have columns as described above minus the height and width.

So, how do you use this new functionality? We do have it all documented here, but for posterity, this is how it works if you have a collection of web archives handy:

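import io.archivesunleashed._
import io.archivesunleashed.df._

// Run inside spark-shell, which provides `sc` and the `$"column"` syntax.
val images = RecordLoader
  .loadArchives("/path/to/web/archive/collection", sc)
  .extractImageDetailsDF()

// Keep every column except the raw bytes and write the result as CSV part files.
images
  .select($"url", $"filename", $"extension", $"mime_type_web_server",
    $"mime_type_tika", $"width", $"height", $"md5")
  .write
  .csv("/path/to/output/directory")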

Once the script is run across the web archive collection, it will produce a bunch of part-234432.csv files in your named output directory. You can concatenate those files into a single one with cat (i.e., by typing cat part* > all-files.csv). You should end up with something like this:

,camera_blu_001.jpg,jpg,image/jpeg,image/jpeg,112,150,fffffef31a159782b97876b7a17eab92
,06_small.jpg,jpg,image/jpeg,image/jpeg,100,143,fffffd5fe6d986c04f028854bbd4a20a
,DSC01219.jpg,jpg,image/jpeg,image/jpeg,510,768,fffffc7244d39657dd286547fda3fd0d
,favor.gif,gif,image/gif,image/gif,71,20,fffff8a7566c250585fb4453594b9c3e
,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07
,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07
,merlin0.jpg,jpg,image/jpeg,image/jpeg,129,140,fffff077e30e213fa08cecc389a60bdb
,MENU7_r11_c21.jpg,jpg,image/jpeg,image/jpeg,68,10,ffffe91beaf231ea8b5fc46a1c6b7f32
,qudes.jpg,jpg,image/jpeg,image/jpeg,55,24,ffffd381a8c0ae2e6a7d63d8af6b893c
,dad_brendon_lighthouse.jpg,jpg,image/jpeg,image/jpeg,300,226,ffffc83f77a1558222f40d7a44b1d464

So, how is this relevant to the work of Owens and Thomas? Well, a few years ago, our team secured a copy of the GeoCities dataset from the Internet Archive. We’ve done a fair bit of analysis on the collection for a variety of research projects. If you’re curious about some of that research, check out Ian Milligan’s new book, “History in the Age of Abundance? How the Web is Transforming Historical Research.” It features a great case study on GeoCities. Also check out Sarah McTavish’s work, such as this presentation at the Michigan State Global Digital Humanities conference.

We ran an image analysis job across the entire 4 terabyte GeoCities dataset, and ended up with a 16 gigabyte CSV which held data for 121,371,844 images that the toolkit was able to identify!


URL        MD5                               COUNT MD5   COUNT FILENAME
           c4746081d66bc2abc269f22ca27ebb46       2,705          373,198
           b4682377ddfbe4e7dabfddb2e543e842       3,336           18,685
           fc94fb0c3ed8a8f909dbc7630a0987ff      69,625              747
           55fade2068e7503eae8d7ddf5eb6bd09       2,551           13,852
           b4682377ddfbe4e7dabfddb2e543e842       3,336            1,780
           fc94fb0c3ed8a8f909dbc7630a0987ff      69,625              747
           4f59788bde58d15d541a9c116d0e850d   2,729,121        2,731,243
           325472601571f31e1bf00674c368d335  18,537,796               39
           accba0b69f352b4c9440f05891b015c5       1,341           26,292
;-)/n.gif  325472601571f31e1bf00674c368d335  18,537,796        1,888,058

It looks like 325472601571f31e1bf00674c368d335 (n.gif or spaceball.gif) was the most prolific of the spacer.gif files in the GeoCities dataset.
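If you want to try this kind of tally yourself, here is a minimal sketch in plain Scala of counting how often each md5 shows up in the image-details CSV. It is not the job we ran on GeoCities. The Md5Counts object, the ImageRow class, and the example.com URLs are all made up for illustration, and the parser assumes unquoted fields with no commas in the filename. On a full collection you would do the grouping in Spark rather than in memory.

```scala
object Md5Counts {
  // One CSV row, reduced to the two fields used here.
  final case class ImageRow(filename: String, md5: String)

  // Expected layout: url,filename,extension,mime_type_web_server,
  // mime_type_tika,width,height,md5. Fields are read from the right so
  // that a comma inside the url does not shift them.
  // Assumption: fields are unquoted and the filename has no comma.
  def parse(line: String): Option[ImageRow] = {
    val f = line.split(",", -1)
    if (f.length < 8) None
    else Some(ImageRow(filename = f(f.length - 7), md5 = f(f.length - 1)))
  }

  // Number of rows per md5, most frequent first (ties broken by md5).
  def countByMd5(rows: Seq[ImageRow]): Seq[(String, Int)] =
    rows
      .groupBy(_.md5)
      .map { case (md5, group) => (md5, group.size) }
      .toSeq
      .sortBy { case (md5, count) => (-count, md5) }

  def main(args: Array[String]): Unit = {
    // The URLs are hypothetical placeholders; the other fields follow the sample output.
    val lines = Seq(
      "http://example.com/a/camera_blu_001.jpg,camera_blu_001.jpg,jpg,image/jpeg,image/jpeg,112,150,fffffef31a159782b97876b7a17eab92",
      "http://example.com/b/logo.gif,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07",
      "http://example.com/c/logo.gif,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07",
      "http://example.com/d/favor.gif,favor.gif,gif,image/gif,image/gif,71,20,fffff8a7566c250585fb4453594b9c3e"
    )
    val rows = lines.flatMap(parse)
    countByMd5(rows).foreach { case (md5, count) => println(s"$count $md5") }
  }
}
```

The first entry of the sorted result is the most prolific md5 in whatever rows you feed it.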

15.27% of the 121,371,844 images that we identified with the toolkit are 325472601571f31e1bf00674c368d335 (n.gif or spaceball.gif)!!
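For anyone checking the arithmetic, the percentage is just the COUNT MD5 value for that hash divided by the total number of images. A small sketch (the SpacerShare object and its percent helper are names invented here):

```scala
object SpacerShare {
  // Share of `part` within `total`, expressed as a percentage.
  def percent(part: Long, total: Long): Double =
    part.toDouble / total.toDouble * 100.0

  def main(args: Array[String]): Unit = {
    val spaceball = 18537796L  // rows with md5 325472601571f31e1bf00674c368d335
    val allImages = 121371844L // images the toolkit identified
    // Print the share rounded to two decimal places.
    println(f"${percent(spaceball, allImages)}%.2f%%")
  }
}
```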

325472601571f31e1bf00674c368d335 is represented with 3,130 different filenames. The full list is available here.

The top 10 occurrences are:

  1. 18507222 serv

  2. 7461 serv.gif

  3. 3981 spacer.gif

  4. 1660 mm_spacer.gif

  5. 953 blank.gif

  6. 629 hbpix

  7. 603 clear.gif

  8. 541 pixel.gif

  9. 513 px1.gif

  10. 448 trans.gif
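A ranking like the one above boils down to filtering on one md5, grouping by filename, and sorting by count. Here is a rough sketch in plain Scala, assuming you already have (filename, md5) pairs in memory. The TopFilenames object is a hypothetical helper and the sample rows are invented, not GeoCities data.

```scala
object TopFilenames {
  // Rank the filenames recorded for one md5, most frequent first.
  // `rows` holds (filename, md5) pairs taken from the image-details CSV.
  def topFilenames(rows: Seq[(String, String)], md5: String, n: Int): Seq[(String, Int)] =
    rows
      .filter { case (_, hash) => hash == md5 }
      .groupBy { case (filename, _) => filename }
      .map { case (filename, hits) => (filename, hits.size) }
      .toSeq
      .sortBy { case (filename, count) => (-count, filename) }
      .take(n)

  def main(args: Array[String]): Unit = {
    val spaceball = "325472601571f31e1bf00674c368d335"
    // Made-up miniature sample, only to show the shape of the result.
    val rows = Seq(
      ("serv", spaceball), ("serv", spaceball), ("serv", spaceball),
      ("spacer.gif", spaceball), ("spacer.gif", spaceball),
      ("blank.gif", spaceball),
      ("logo.gif", "fffff72ef7571cf00d0717ac96bfad07")
    )
    topFilenames(rows, spaceball, 2).foreach { case (name, count) => println(s"$count $name") }
  }
}
```

Dropping the take(n) and counting the entries gives the number of distinct filenames for that md5.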

Pie chart of top 500 filenames for 325472601571f31e1bf00674c368d335 with serv removed.

In total, 17.58% of the images the toolkit identified in the web archive come from the list Owens and Thomas identified. So, nearly 20% of the images in the 4 terabyte GeoCities dataset are 1x1 images 🤯🤯🤯.

If this type of analysis interests or inspires you, please check out the current call for participation for our fourth Archives Unleashed Datathon in NYC next March.

For the latest news and project updates subscribe to our quarterly newsletter.

Keeping in touch with the Archives Unleashed Project is so easy!

Nick Ruest
Associate Librarian