Scalable Content-Based Analysis of Images in Web Archives with TensorFlow and the Archives Unleashed Toolkit


Hsiu-Wei Yang, Linqing Liu, Ian Milligan, Nick Ruest, and Jimmy Lin


We demonstrate the integration of the Archives Unleashed Toolkit, a scalable platform for exploring web archives, with Google's TensorFlow deep learning framework to provide scholars with content-based image analysis capabilities. By applying pretrained deep neural networks for object detection, we extract images of common objects from a 4 TB web archive of GeoCities and compile them into browsable collages. This case study illustrates the kinds of analyses made possible by combining big data and deep learning capabilities.
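
The last step of the workflow summarized above (keep confident detections, group images by object class, lay each group out as a collage) can be illustrated with a minimal sketch. This is not the authors' implementation: the detector and the archive extraction are omitted, and the `Detection` record, the function names, the 0.5 score threshold, and the tile size are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Detection(NamedTuple):
    """One detector output for one image (hypothetical record layout)."""
    image_id: str
    label: str
    score: float


def group_by_label(detections, min_score=0.5):
    """Map each label to the image ids detected with at least min_score.

    The 0.5 default is an assumed threshold, not a value from the paper.
    """
    groups = defaultdict(list)
    for det in detections:
        # Keep confident detections only, and list each image once per label.
        if det.score >= min_score and det.image_id not in groups[det.label]:
            groups[det.label].append(det.image_id)
    return dict(groups)


def collage_layout(image_ids, columns=4, tile=64):
    """Give each image an (x, y) pixel offset in a row-major grid of square tiles."""
    return {
        image_id: ((i % columns) * tile, (i // columns) * tile)
        for i, image_id in enumerate(image_ids)
    }


# Toy input standing in for detector output over archived images.
detections = [
    Detection("a.gif", "cat", 0.91),
    Detection("b.jpg", "cat", 0.42),  # below the threshold, dropped
    Detection("c.jpg", "dog", 0.77),
    Detection("d.png", "cat", 0.66),
]
groups = group_by_label(detections)
cat_layout = collage_layout(groups["cat"], columns=1)
```

The layout function returns only tile offsets, so any imaging library could paste thumbnails at those positions to render the collage.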


This work was primarily supported by the Natural Sciences and Engineering Research Council of Canada. Additional funding for this project has come from the Andrew W. Mellon Foundation. Our sincerest thanks to the Internet Archive for providing us with the GeoCities web archive.