GeoCities and the spacer.gif | Nick Ruest

Originally posted here.

https://miro.medium.com/max/173/1*fu5LWxDSghx3j8XlpiKUug.gif
https://gifcities.org

Trevor Owens and Grace Thomas recently had their article, “The invention and dissemination of the spacer gif: implications for the future of access and use of web archives” published in the International Journal of Digital Humanities. It’s a great look at the history of the spacer.gif, how it proliferated in the early web, and a case study of digging into web archives and doing a whole lot of analysis. Reading it this past spring really motivated me to round out the DataFrame implementation in the Archives Unleashed Toolkit so that more people could do this kind of inspirational work.

In late August 2019, the Archives Unleashed team released version 0.18.0 of the Archives Unleashed Toolkit. If you check out the release notes, you’ll see that a lot of new functionality was added, along with some bug fixes. Both the release notes and the user documentation describe the new and expanded functionality in our DataFrame implementation. We now have functions for a variety of binary types (images, audio, video, PDF, spreadsheets, presentation program files, word processor files, and text files). They allow a user to extract binaries of a given type (all the images from a web collection, for instance) or to extract information about those binaries into a DataFrame that is output as a CSV file. For images you can extract:

  • the url of an image;

  • the filename;

  • the extension;

  • the MimeType provided by the web server and MimeType as identified by Apache Tika;

  • the width and height of the image;

  • the md5 hash of the image;

  • and the raw bytes of the image.

Similarly, for the other binary extraction and analysis functions (audio, video, pdf, etc.), you are able to have columns as described above minus the height and width.

So, how do you use this new functionality? It is all documented here, but for posterity, this is how it works if you have a collection of web archives handy:

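// Run inside spark-shell with the Archives Unleashed Toolkit loaded;
// the shell provides the SparkContext `sc`.
import io.archivesunleashed._
import io.archivesunleashed.df._

// Load every web archive file in the collection and build a DataFrame
// with one row per image the toolkit identifies.
val images = RecordLoader
  .loadArchives("/path/to/web/archive/collection", sc)
  .extractImageDetailsDF()

// Keep the descriptive columns (dropping the raw bytes), sort by MD5 hash
// in descending order, and write the result out as CSV part files.
images.select($"url", $"filename", $"extension", $"mime_type_web_server",
  $"mime_type_tika", $"width", $"height", $"md5")
  .orderBy(desc("md5"))
  .write
  .csv("/path/to/images/dataframe")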

Once the script has run across the web archive collection, it will produce a number of part files (named like part-234432.csv) in your named output directory. You can concatenate them into a single file with cat (e.g. cat part* > all-files.csv). You should end up with something like this:

http://it.geocities.com/grannoce/camere/thumb/camera_blu_001.jpg,camera_blu_001.jpg,jpg,image/jpeg,image/jpeg,112,150,fffffef31a159782b97876b7a17eab92
http://ar.geocities.com/angeles_uno/PLAYMATES/1999/JUNIO/KIMBERLY_SPICER/06_small.jpg,06_small.jpg,jpg,image/jpeg,image/jpeg,100,143,fffffd5fe6d986c04f028854bbd4a20a
http://in.geocities.com/nileshtx/images/DSC01219.jpg,DSC01219.jpg,jpg,image/jpeg,image/jpeg,510,768,fffffc7244d39657dd286547fda3fd0d
http://kr.geocities.com/magicianclow/img/favor.gif,favor.gif,gif,image/gif,image/gif,71,20,fffff8a7566c250585fb4453594b9c3e
http://login.space2000.de/logo.gif,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07
http://91-143-80-250.blue.kundencontroller.de/logo.gif,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07
http://cf.geocities.com/rouquins/images/merlin0.jpg,merlin0.jpg,jpg,image/jpeg,image/jpeg,129,140,fffff077e30e213fa08cecc389a60bdb
http://ar.geocities.com/aliaga_fernandoo/ediciones/ed7/imagenes/menu/MENU7_r11_c21.jpg,MENU7_r11_c21.jpg,jpg,image/jpeg,image/jpeg,68,10,ffffe91beaf231ea8b5fc46a1c6b7f32
http://www.geocities.com/audy000/newspic1/qudes.jpg,qudes.jpg,jpg,image/jpeg,image/jpeg,55,24,ffffd381a8c0ae2e6a7d63d8af6b893c
http://ca.geocities.com/brunette_george/holidays/dad_brendon_lighthouse.jpg,dad_brendon_lighthouse.jpg,jpg,image/jpeg,image/jpeg,300,226,ffffc83f77a1558222f40d7a44b1d464
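Rows like these can be grouped by their MD5 hash to see which images recur across different URLs. The following is a minimal sketch in plain Scala rather than Spark, assuming the eight-column layout written by the script above and no quoted or comma-containing fields. The names Md5Counts, ImageRow, parse, and countByMd5 are made up for illustration and are not part of the toolkit.

```scala
import scala.util.Try

object Md5Counts {
  // A subset of the columns in each CSV row (hypothetical helper type).
  final case class ImageRow(url: String, filename: String, width: Int, height: Int, md5: String)

  // Naive parser: assumes exactly eight comma-separated columns and no quoted fields.
  def parse(line: String): Option[ImageRow] =
    line.split(",", -1) match {
      case Array(url, filename, _, _, _, w, h, md5) =>
        Try(ImageRow(url, filename, w.toInt, h.toInt, md5)).toOption
      case _ => None
    }

  // Count rows per MD5 hash, most frequent first (ties broken by hash).
  def countByMd5(rows: Seq[ImageRow]): Seq[(String, Int)] =
    rows.groupBy(_.md5)
      .map { case (md5, rs) => (md5, rs.size) }
      .toSeq
      .sortBy { case (md5, n) => (-n, md5) }

  def main(args: Array[String]): Unit = {
    // Three rows taken from the sample output above.
    val sample = Seq(
      "http://login.space2000.de/logo.gif,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07",
      "http://91-143-80-250.blue.kundencontroller.de/logo.gif,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07",
      "http://kr.geocities.com/magicianclow/img/favor.gif,favor.gif,gif,image/gif,image/gif,71,20,fffff8a7566c250585fb4453594b9c3e"
    )
    countByMd5(sample.flatMap(parse)).foreach { case (md5, n) => println(s"$md5 $n") }
  }
}
```

In the sample, the two logo.gif rows share a hash, so that hash is counted twice even though the image was served from two different hosts.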

So, how is this relevant to the work of Owens and Thomas? Well, a few years ago, our team secured a copy of the GeoCities dataset from the Internet Archive. We’ve done a fair bit of analysis on the collection for a variety of research projects. If you’re curious about some of that research, check out Ian Milligan’s new book, “History in the Age of Abundance? How the Web is Transforming Historical Research.” It features a great case study on GeoCities. Also check out Sarah McTavish’s work, such as this presentation at the Michigan State Global Digital Humanities conference.

We ran an image analysis job across the entire 4 terabyte GeoCities dataset, and ended up with a 16 gigabyte CSV which held data for 121,371,844 images that the toolkit was able to identify!

Results

| URL | MD5 | Count by MD5 | Count by filename |
|---|---|---|---|
| http://www.geocities.com/clipart/pbi/c.gif | c4746081d66bc2abc269f22ca27ebb46 | 2,705 | 373,198 |
| http://pic.geocities.com/images/pixel.gif | b4682377ddfbe4e7dabfddb2e543e842 | 3,336 | 18,685 |
| http://www.google.com/images/cleardot.gif | fc94fb0c3ed8a8f909dbc7630a0987ff | 69,625 | 747 |
| http://www.google.com/clear.gif | 55fade2068e7503eae8d7ddf5eb6bd09 | 2,551 | 13,852 |
| https://killersites.com/killerSites/resources/dot_clear.gif | b4682377ddfbe4e7dabfddb2e543e842 | 3,336 | 1,780 |
| https://mail.google.com/mail/images/cleardot.gif | fc94fb0c3ed8a8f909dbc7630a0987ff | 69,625 | 747 |
| http://visit.geocities.yahoo.com/visit.gif | 4f59788bde58d15d541a9c116d0e850d | 2,729,121 | 2,731,243 |
| http://blingee.com/images/spaceball.gif | 325472601571f31e1bf00674c368d335 | 18,537,796 | 39 |
| http://www-cdr.stanford.edu/~petrie/blank.gif | accba0b69f352b4c9440f05891b015c5 | 1,341 | 26,292 |
| http://img.artlebedev.ru/;-)/n.gif | 325472601571f31e1bf00674c368d335 | 18,537,796 | 1,888,058 |

It looks like 325472601571f31e1bf00674c368d335 (n.gif or spaceball.gif) was the most prolific of the spacer.gif files in the GeoCities dataset.

15.27% of the 121,371,844 images that we identified with the toolkit are 325472601571f31e1bf00674c368d335 (n.gif or spaceball.gif)!!
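The 15.27% figure follows from dividing the MD5 count for this hash in the results above (18,537,796) by the total number of identified images (121,371,844). A small sketch of that arithmetic, with SpacerShare and percent as illustrative names only:

```scala
object SpacerShare {
  // Share of `part` in `total`, expressed as a percentage.
  def percent(part: Long, total: Long): Double = 100.0 * part / total

  def main(args: Array[String]): Unit = {
    val totalImages = 121371844L // images the toolkit identified
    val spacerHits  = 18537796L  // rows with MD5 325472601571f31e1bf00674c368d335
    // Print the share rounded to two decimal places.
    println(f"${percent(spacerHits, totalImages)}%.2f%%")
  }
}
```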

325472601571f31e1bf00674c368d335 is represented with 3,130 different filenames. The full list is available here.

The top 10 occurrences are:

  1. 18507222 serv

  2. 7461 serv.gif

  3. 3981 spacer.gif

  4. 1660 mm_spacer.gif

  5. 953 blank.gif

  6. 629 hbpix

  7. 603 clear.gif

  8. 541 pixel.gif

  9. 513 px1.gif

  10. 448 trans.gif
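A ranking like the one above could be produced by filtering the rows to a single hash and counting filenames. This is only a sketch in plain Scala, not the query actually used for this post. FilenameRanking and topFilenames are hypothetical names, and the (md5, filename) pairs in main are toy data, not counts from the dataset.

```scala
object FilenameRanking {
  // Rank the filenames under which one MD5 hash appears, most frequent first.
  def topFilenames(pairs: Seq[(String, String)], md5: String, limit: Int): Seq[(String, Int)] =
    pairs
      .collect { case (hash, name) if hash == md5 => name } // keep only the hash of interest
      .groupBy(identity)
      .map { case (name, hits) => (name, hits.size) }
      .toSeq
      .sortBy { case (name, n) => (-n, name) }              // by count, then by name
      .take(limit)

  def main(args: Array[String]): Unit = {
    val spacer = "325472601571f31e1bf00674c368d335"
    // Toy (md5, filename) pairs for illustration only.
    val pairs = Seq(
      (spacer, "serv"), (spacer, "serv"), (spacer, "spacer.gif"),
      ("fffff8a7566c250585fb4453594b9c3e", "favor.gif")
    )
    topFilenames(pairs, spacer, 10).foreach { case (name, n) => println(s"$n $name") }
  }
}
```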

Pie chart of top 500 filenames for 325472601571f31e1bf00674c368d335 with serv removed.

In total, 17.58% of the images the toolkit identified in the web archive come from the list Owens and Thomas identified. So, nearly 18% of the images in the 4 terabyte GeoCities dataset are 1x1 images 🤯🤯🤯.

If this type of analysis interests or inspires you, please check out the current call for participation for our fourth Archives Unleashed Datathon in NYC next March.


For the latest news and project updates subscribe to our quarterly newsletter.

Keeping in touch with the Archives Unleashed Project is so easy!

Nick Ruest
Associate Librarian
