Thumbnails in Warclight

One feature of Blacklight that I’ve always wanted to set up in Warclight is displaying thumbnails in the results display. Getting this set up is a bit tricky. But, since Warclight is standardizing metadata on webarchive-discovery’s Solr schema.xml, we can avail ourselves of a number of fields for a potential implementation. The url field is the obvious choice, but the problem is that Blacklight out of the box will try to display a thumbnail for every url field value you give to config.index.thumbnail_field, leading to a whole lot of missing images on a page.

Warclight broken thumbnails

Well, there’s one image record there out of ten, and a whole lot of broken image links. That’s not useful at all. How can we make it better?

If we go back to our Solr schema.xml, or check out the catalog_controller.rb, we see that we have a field called content_type_norm. That field uses Apache Tika to normalize MIME types into ten different buckets. One of those buckets is image. So, if we have a url field value, and we know it’s an image, we could display it as a thumbnail in the results display.

So, how do we do that? There are a few different approaches, all of which use the config.index.thumbnail_method option.

Option 1:

A method that returns the value of url if a record is an image.

def render_thumbnail(document, options = {})
  return unless document.first(:content_type_norm) == 'image'
  image_tag(document.first(:url), options.merge(alt: document.first(:resourcename)))
end

The drawback with this option is that it is going to pull in the image from the live web. This may or may not be the image that was originally captured.

Option 2:

Build a replay URL by making use of other Solr fields available to us.

Using the Internet Archive’s Wayback

def render_thumbnail(document, options = {})
  return unless document.first(:content_type_norm) == 'image'
  replay_thumbnail_url = 'http://wayback.archive.org/' + document.first(:wayback_date).to_s + '/' + document.first(:url)
  image_tag(replay_thumbnail_url, options.merge(alt: document.first(:resourcename)))
end

Using Archive-It’s Wayback

def render_thumbnail(document, options = {})
  return unless document.first(:content_type_norm) == 'image'
  replay_thumbnail_url = 'http://wayback.archive-it.org/' + document.first(:collection_id).to_s + '/' + document.first(:wayback_date).to_s + '/' + document.first(:url)
  image_tag(replay_thumbnail_url, options.merge(alt: document.first(:resourcename)))
end

Or your own replay service

def render_thumbnail(document, options = {})
  return unless document.first(:content_type_norm) == 'image'
  replay_thumbnail_url = 'https://digital.library.yorku.ca/wayback/' + document.first(:wayback_date).to_s + '/' + document.first(:url)
  image_tag(replay_thumbnail_url, options.merge(alt: document.first(:resourcename)))
end

The drawback with these options is that they rely on a single replay point. That works well only if you know every URL in your index is available for replay from the same replay service.

This is being used in our WALK partner instances.
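The three helpers above differ only in the base URL and, for Archive-It, an extra collection segment. As a rough sketch of how that string building could be pulled into one place, here is a plain-Ruby helper. The name replay_url and its collection_id: keyword are hypothetical (they are not part of Warclight or Blacklight), and the timestamp and example.com URL are made-up values for illustration.

```ruby
# Hypothetical helper (not part of Warclight): joins a replay base URL,
# an optional collection segment, the capture timestamp, and the
# original URL into a single replay URL.
def replay_url(base, wayback_date, url, collection_id: nil)
  # Without both a capture date and a URL there is nothing to replay.
  return nil if wayback_date.to_s.empty? || url.to_s.empty?

  segments = [base.to_s.chomp('/')]
  segments << collection_id.to_s unless collection_id.to_s.empty?
  segments << wayback_date.to_s
  # The original URL is appended as-is, matching the helpers above.
  segments.join('/') + '/' + url.to_s
end

# Illustrative values only.
puts replay_url('https://digital.library.yorku.ca/wayback',
                '20180101000000',
                'http://example.com/logo.png')
```

A render_thumbnail method could then call this with whichever base URL a given instance uses, and skip the image_tag when it returns nil.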

Option 3:

Build a replay URL using the Time Travel Service.

def render_thumbnail(document, options = {})
  return unless document.first(:content_type_norm) == 'image'
  replay_thumbnail_url = 'https://timetravel.mementoweb.org/timegate/' + document.first(:url)
  image_tag(replay_thumbnail_url, options.merge(alt: document.first(:resourcename)))
end

The drawback with this option is that it will just return the most recent Memento available. It is a captured version, but we don’t know if it is the captured copy we want.

Option 4:

Build a replay URL using the Time Travel Service API and the wayback_date value.

require 'net/http'
require 'json'

def thumbnail_image(document, options = {})
  return unless document.first(:content_type_norm) == 'image'
  # Ask the Time Travel API for the Memento closest to the capture date.
  time_travel_base_url = 'http://timetravel.mementoweb.org/api/json/'
  time_travel_request_url = time_travel_base_url + document.first(:wayback_date).to_s + '/' + document.first(:url).to_s
  time_travel_response = Net::HTTP.get(URI(time_travel_request_url))
  return if time_travel_response.blank?
  # dig returns nil instead of raising when the response has no closest Memento.
  replay_thumbnail_url = JSON.parse(time_travel_response).dig('mementos', 'closest', 'uri', 0)
  return if replay_thumbnail_url.blank?
  image_tag(replay_thumbnail_url, options.merge(alt: document.first(:resourcename)))
end

The drawback here is that this is going to slow down page load times, since every image record on the page triggers a request to the Time Travel Service. But, it is moving toward a more generalized implementation that covers a use case where you can’t guarantee all the URLs in your index can be replayed on the same replay service.

This has been implemented on the Warclight demo site.
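One way to soften the page-load cost might be to remember lookups that have already been made. The sketch below is plain Ruby and makes several assumptions: the class name MementoLookupCache is hypothetical, the cache is an in-memory Hash that lives for the life of the process and never expires, and the fetcher block is a stand-in for the real Time Travel request (the replay.example.org URL is invented for illustration).

```ruby
# Hypothetical wrapper (not part of Warclight): remembers the result of a
# slow lookup per (wayback_date, url) pair so that repeated renders of the
# same record skip the network call.
class MementoLookupCache
  def initialize(&fetcher)
    @fetcher = fetcher # block that performs the real lookup
    @store = {}
  end

  def fetch(wayback_date, url)
    key = [wayback_date.to_s, url.to_s]
    # key? rather than ||= so that nil results (no Memento found) are cached too.
    return @store[key] if @store.key?(key)

    @store[key] = @fetcher.call(wayback_date, url)
  end
end

# Stand-in fetcher for illustration; a real one would call the Time Travel API.
calls = 0
cache = MementoLookupCache.new do |date, url|
  calls += 1
  "https://replay.example.org/#{date}/#{url}"
end

2.times { cache.fetch('20180101000000', 'http://example.com/logo.png') }
puts calls # the fetcher block ran once for two lookups
```

In a Rails application, Rails.cache.fetch with an expiry would probably be a better fit than a bare Hash, which grows without bound. The sketch is only meant to show the shape of the idea.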

Option 5:

Implement it at a lower level in the stack, directly in Solr using a conditional copyField.

Yeah, I didn’t do this 😀.

The biggest drawback here is that implementing it requires an entire re-index. This is probably a non-starter for nearly every implementation.

Warclight thumbnails

So, that’s a few different ways to implement thumbnail display in Warclight. Can you think of a different, or better way? Let us know!

Special thanks to Andy Jackson and Toke Eskildsen, who critiqued my ideas and sent me down some rewarding paths.


Nota bene

I haven’t added this to the Warclight engine since there are so many different ways to do this, and a given implementation might choose to roll its own option.

Wish

Use the content offset to display the Base64 encoded image out of the ARC/WARC file itself. Can we make Warclight access the ARC/WARC files directly?
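For what it’s worth, the last step of that wish is the easy part. Assuming the image bytes have already been read out of the ARC/WARC record (which this sketch does not attempt), a data: URI could be built like this. The helper name image_data_uri is hypothetical, and the two-byte input is a placeholder, not real image data.

```ruby
# Hypothetical helper (not part of Warclight): turns raw image bytes into
# a data: URI that an <img> tag can use as its src. Reading the bytes from
# the ARC/WARC record is assumed to have happened already.
def image_data_uri(bytes, content_type)
  # pack('m0') is core Ruby's Base64 encoding without line breaks.
  encoded = [bytes].pack('m0')
  "data:#{content_type};base64,#{encoded}"
end

# Placeholder bytes for illustration; a real call would pass the record payload.
puts image_data_uri('hi', 'image/png')
```

The resulting string could be handed to image_tag in place of a replay URL, at the cost of inlining the full image into every results page.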

¯_(ツ)_/¯
