AUT & Last Date Modified

There’s been a long-standing, frequently asked question by participants of Archives Unleashed datathons and the cohort program: how do we find out the date of a resource or page?

Dates can be really hard to decipher in web archives. As a result, we tend to rely on the crawl date, which is a pretty easy thing to grab out of a WARC since it is a mandatory field. While this is something we’ve always had in aut, it’s not the date of a response or creation of a website, but instead is the date on which the crawl occurred. Not ideal for a lot of research questions I’ve seen teams ask of web collections they are exploring.

Embarrassingly, my response to these questions had always been, “dates are hard in web archives,” followed by a shrug. I’d then point folks to one of the many great projects out of ODU’s Web Science and Digital Libraries Research Group, such as Carbon Dating The Web, as a potential path forward.

Even more embarrassing, an easy solution has been right in front of my face this entire time, sitting in the HTTP headers of WARC response records. Sometimes, but not all the time, the HTTP headers of a resource in a WARC response record include a value for Last-Modified. This date should come from the file’s attributes, e.g. ls -lt filename. The header field is pretty easy to grab with Sparkling and then incorporate into the DataFrames produced by aut.

Normally, the date should be in this format: Tue, 15 Nov 1994 12:45:26 GMT (RFC 1123). But of course, there are always exceptions. So we wrote a fuzzy parser utility to extract the date and format it as YYYYMMDDHHMMSS, so that it mirrors the format of crawl_date.
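To make the idea concrete, here is a minimal sketch of that kind of normalization. It is not the parser that ships in aut; the object name LastModifiedSketch and the list of fallback layouts are assumptions made for illustration. It tries the RFC 1123 layout first, falls back to the older asctime layout that HTTP also permits, and returns an empty string when nothing matches.

```scala
import java.time.{LocalDateTime, ZoneOffset, ZonedDateTime}
import java.time.format.DateTimeFormatter
import java.util.Locale
import scala.util.{Success, Try}

// Hypothetical helper for illustration only; not aut's actual implementation.
object LastModifiedSketch {
  // Target layout, mirroring crawl_date: YYYYMMDDHHMMSS.
  private val outputFormat = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  // Layouts that carry a time zone, e.g. "Tue, 15 Nov 1994 12:45:26 GMT".
  private val zonedFormats = Seq(DateTimeFormatter.RFC_1123_DATE_TIME)

  // Layouts without a zone, e.g. asctime: "Sun Nov 6 08:49:37 1994".
  // These are assumed to already be in GMT.
  private val localFormats = Seq(
    DateTimeFormatter.ofPattern("EEE MMM d HH:mm:ss uuuu", Locale.ENGLISH)
  )

  def parse(raw: String): String = {
    // Collapse repeated whitespace (asctime pads single-digit days with a space).
    val cleaned = raw.trim.replaceAll("\\s+", " ")

    // Lazily try each layout in order; zone-aware values are converted to UTC.
    val zoned = zonedFormats.iterator.map { f =>
      Try(ZonedDateTime.parse(cleaned, f)
        .withZoneSameInstant(ZoneOffset.UTC)
        .toLocalDateTime)
    }
    val local = localFormats.iterator.map(f => Try(LocalDateTime.parse(cleaned, f)))

    // First successful parse wins; an empty string signals "no usable date".
    (zoned ++ local)
      .collectFirst { case Success(dt) => dt.format(outputFormat) }
      .getOrElse("")
  }
}
```

Calling LastModifiedSketch.parse("Tue, 15 Nov 1994 12:45:26 GMT") yields 19941115124526. Handling further odd layouts would just mean appending more formatters to the two lists.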

This field is especially useful for web archive collections that are only crawled once, for example when a service like Geocities shuts down permanently, or in the excellent work being done by the Saving Ukrainian Cultural Heritage Online project to preserve Ukrainian cultural heritage.

This functionality is now available as of the 1.2.0 release of the Archives Unleashed Toolkit, and will be available in ARCH in the coming days.

Speaking of ARCH, if you’re interested in piloting it, please fill out this form.

Here it is in action:

import io.archivesunleashed._

// Path to a sample Geocities WARC.
val data = "/sample-data/geocities/GEOCITIES-20091027143300-00114-ia400112.us.archive.org.warc.gz"

// Load every record and show crawl_date next to the new last_modified_date column.
RecordLoader.loadArchives(data, sc)
  .all()
  .select($"crawl_date", $"last_modified_date", $"mime_type_web_server")
  .show(20, false)
+--------------+------------------+--------------------+
|crawl_date    |last_modified_date|mime_type_web_server|
+--------------+------------------+--------------------+
|20091027143300|                  |text/html           |
|20091027143259|20000923233454    |image/jpeg          |
|20091027143259|20020913163029    |image/jpeg          |
|20091027143300|20020211154553    |image/jpeg          |
|20091027143259|19980919164703    |image/jpeg          |
|20091027143259|20080125150303    |text/html           |
|20091027143300|20010921224658    |image/gif           |
|20091027143258|20081009015203    |image/jpeg          |
|20091027143300|                  |text/html           |
|20091027143259|20020416145103    |image/jpeg          |
|20091027143300|20090223022835    |text/html           |
|20091027143300|20030928090558    |image/jpeg          |
|20091027143300|20091027143300    |text/html           |
|20091027143300|20021203212451    |text/html           |
|20091027143300|                  |text/html           |
|20091027143300|20040530033010    |image/bmp           |
|20091027143300|                  |text/html           |
|20091027143259|20090223022352    |text/html           |
|20091027143300|                  |text/html           |
|20091027143300|20010608202736    |text/html           |
+--------------+------------------+--------------------+
only showing top 20 rows

There are a few other possibilities for trying to figure out the date of a resource. Since we already use Apache Tika to identify the language of content and to populate mime_type_tika, we could potentially also use it to identify dates from a given resource in a WARC. It could be a future project if there is a need for it.

I’ve also taken the time to update the Geocities dataset again. The updated dataset includes last_modified_date columns for the file format, web pages, and text files derivatives. It also includes a crawl_date format fix for the domain graph derivative (YYYYMMDD instead of YYYYMMDDHHMMSS).

Here are the 20 most frequent years in which resources in the dataset were last modified:

Geocities: Year of Last-Modified-Date
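For anyone who wants to produce a similar breakdown from their own collection, a rough sketch of the aggregation follows. It is not the code used to build the chart. It assumes that a missing Last-Modified value is stored as an empty string (as the blank cells in the output above suggest), and it uses a handful of values copied from that output in place of a real collection. In practice the DataFrame would come from RecordLoader.loadArchives(data, sc).all().

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc, substring}

// Local session so the sketch runs on its own; spark-shell already provides one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("last-modified-years")
  .getOrCreate()
import spark.implicits._

// Stand-in data: last_modified_date values taken from the sample output above.
// The empty string stands for a record with no Last-Modified header (assumption).
val sample = Seq(
  "20000923233454", "", "20020913163029",
  "20020211154553", "20090223022835", "20090223022352"
).toDF("last_modified_date")

val years = sample
  .filter(col("last_modified_date") =!= "")                       // drop records without a date
  .select(substring(col("last_modified_date"), 1, 4).as("year"))  // keep the YYYY prefix
  .groupBy("year")
  .count()
  .orderBy(desc("count"), col("year"))                            // most frequent first

years.show(20, false)
```

Sorting by count and then by year keeps the ordering stable when two years tie.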

Finally, our preliminary research showed reliable values coming from the Last-Modified headers, so we’d love to hear feedback from users on this new feature as they process their data.

Nick Ruest
Associate Librarian
