One feature of Blacklight that I’ve always wanted to set up in Warclight is displaying thumbnails in the results display. Getting this set up is a bit tricky. But since Warclight is standardizing its metadata on webarchive-discovery’s Solr schema.xml, we can avail ourselves of a number of fields for a potential implementation. The url field is the obvious choice, but the problem is that Blacklight out of the box will try to display a thumbnail for every url field value you give to config.index.thumbnail_field, leading to a whole lot of missing images on a page.
Well, there’s one image record there out of ten, and a whole lot of broken image links. That’s not useful at all. How can we make it better?
If we go back to our Solr schema.xml, or check out the catalog_controller.rb, we see that we have a field called content_type_norm. That field uses Apache Tika to normalize MIME types into ten different buckets. One of those buckets is image. So, if we have a url field value, and we know it’s an image, we can display it as a thumbnail in the results display.
So, how do we do that? There are a few different approaches, all of which use the config.index.thumbnail_method option.
Option 1:
A method that returns the value of url if a record is an image.
def render_thumbnail(document, options = {})
  # Only image records get a thumbnail; everything else renders nothing.
  return unless document.first(:content_type_norm) == 'image'

  image_tag(document.first(:url), options.merge(alt: document.first(:resourcename)))
end
The drawback with this option is that it is going to pull in the image from the live web. This may or may not be the image that was originally captured.
Option 2:
Build a replay URL by making use of other Solr fields available to us.
Using the Internet Archive’s Wayback
def render_thumbnail(document, options = {})
  return unless document.first(:content_type_norm) == 'image'

  replay_thumbnail_url = "http://wayback.archive.org/#{document.first(:wayback_date)}/#{document.first(:url)}"
  image_tag(replay_thumbnail_url, options.merge(alt: document.first(:resourcename)))
end
Using Archive-It’s Wayback
def render_thumbnail(document, options = {})
  return unless document.first(:content_type_norm) == 'image'

  replay_thumbnail_url = "http://wayback.archive-it.org/#{document.first(:collection_id)}/#{document.first(:wayback_date)}/#{document.first(:url)}"
  image_tag(replay_thumbnail_url, options.merge(alt: document.first(:resourcename)))
end
Or your own replay service
def render_thumbnail(document, options = {})
  return unless document.first(:content_type_norm) == 'image'

  replay_thumbnail_url = "https://digital.library.yorku.ca/wayback/#{document.first(:wayback_date)}/#{document.first(:url)}"
  image_tag(replay_thumbnail_url, options.merge(alt: document.first(:resourcename)))
end
The drawback with these options is that they rely on a single replay endpoint. This works well if you know every URL in your index should be available for replay from the same replay service.
This is being used in our WALK partner instances.
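The three variants above differ only in the base URL and the optional collection segment, so they could be folded into one helper. Here is a minimal sketch, assuming a hypothetical build_replay_url method (it is not part of Warclight or Blacklight) and plain values in place of the Solr document fields:

```ruby
# Hypothetical helper, not part of Warclight or Blacklight.
# Joins a replay base URL, an optional collection id, the capture
# timestamp, and the original URL into a single replay URL.
def build_replay_url(base:, wayback_date:, url:, collection_id: nil)
  # Nothing to build without a capture date and an original URL.
  return nil if wayback_date.to_s.empty? || url.to_s.empty?

  # Drop the trailing slash on the base and skip missing segments.
  prefix = [base.to_s.chomp('/'), collection_id, wayback_date]
           .compact
           .map(&:to_s)
           .reject(&:empty?)

  # The original URL goes last and is appended untouched.
  "#{prefix.join('/')}/#{url}"
end
```

Inside render_thumbnail you would pass document.first(:wayback_date) and document.first(:url), plus document.first(:collection_id) for Archive-It, and hand the result to image_tag when it is not nil.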
Option 3:
Build a replay URL using the Time Travel Service.
def render_thumbnail(document, options = {})
  return unless document.first(:content_type_norm) == 'image'

  replay_thumbnail_url = "https://timetravel.mementoweb.org/timegate/#{document.first(:url)}"
  image_tag(replay_thumbnail_url, options.merge(alt: document.first(:resourcename)))
end
The drawback with this option is that it will just return the most recent Memento available. It is a captured version, but we don’t know if it is the captured copy we want.
Option 4:
Build a replay URL with the Time Travel Service API, using the wayback_date value.
def thumbnail_image(document, options = {})
  return unless document.first(:content_type_norm) == 'image'

  # Ask the Time Travel API for the Memento closest to our capture date.
  time_travel_base_url = 'http://timetravel.mementoweb.org/api/json/'
  time_travel_request_url = "#{time_travel_base_url}#{document.first(:wayback_date)}/#{document.first(:url)}"
  time_travel_response = Net::HTTP.get(URI(time_travel_request_url))
  return if time_travel_response.blank?

  # dig returns nil instead of raising when a key is missing.
  time_travel_response_json = JSON.parse(time_travel_response)
  replay_thumbnail_url = time_travel_response_json.dig('mementos', 'closest', 'uri', 0)
  return if replay_thumbnail_url.blank?

  image_tag(replay_thumbnail_url, options.merge(alt: document.first(:resourcename)))
rescue JSON::ParserError
  # A non-JSON body (for example, an error page) means no thumbnail.
  nil
end
The drawback here is that this is going to slow down page load times, since every image record triggers an HTTP request while the page renders. But it is moving toward a more generalized implementation that covers a use case where you can’t guarantee all the URLs in your index can be replayed on the same replay service.
This has been implemented on the Warclight demo site.
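One way to soften the page-load cost is to avoid repeating the same lookup. Here is a minimal sketch, assuming a hypothetical MementoLookup class (not part of Warclight), an in-memory Hash as the cache, and an injectable fetcher so it can be exercised without touching the network. The response shape it reads is the same mementos/closest/uri path used in the method above.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical wrapper, not part of Warclight: looks up the closest
# Memento once per (wayback_date, url) pair and remembers the answer.
class MementoLookup
  BASE_URL = 'http://timetravel.mementoweb.org/api/json/'

  # fetcher: any callable that takes a request URL and returns the
  # response body. The default performs a real HTTP GET.
  def initialize(fetcher: ->(request_url) { Net::HTTP.get(URI(request_url)) })
    @fetcher = fetcher
    @cache = {}
  end

  # Returns the closest Memento URI as a String, or nil on any failure.
  # Failures are cached too, so a bad record is only tried once.
  def closest_uri(wayback_date, url)
    key = "#{wayback_date}/#{url}"
    return @cache[key] if @cache.key?(key)

    @cache[key] = lookup(BASE_URL + key)
  end

  private

  def lookup(request_url)
    body = @fetcher.call(request_url)
    return nil if body.nil? || body.strip.empty?

    parsed = JSON.parse(body)
    return nil unless parsed.is_a?(Hash)

    # The API lists one or more URIs for the closest capture; take the first.
    Array(parsed.dig('mementos', 'closest', 'uri')).first
  rescue StandardError
    # Network errors, malformed JSON, or an unexpected shape: no thumbnail.
    nil
  end
end
```

A per-process Hash is only there to show the shape. In a real Rails app a shared cache store would probably be a better fit, and the first request for each image still pays the full lookup cost.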
Option 5:
Implement it at a lower level in the stack, directly in Solr using a conditional copyField.
Yeah, I didn’t do this 😀.
The biggest drawback here is that implementing it requires an entire re-index. This is probably a non-starter for nearly every implementation.
So, that’s a few different ways to implement thumbnail display in Warclight. Can you think of a different, or better way? Let us know!
Special thanks to Andy Jackson and Toke Eskildsen, who critiqued my ideas and sent me down some rewarding paths.
Nota bene
I haven’t added this to the Warclight engine since there are so many different ways to do this, and a given implementation might choose to roll its own option.
Wish
Use the content offset to display the Base64 encoded image out of the ARC/WARC file itself. Can we make Warclight access the ARC/WARC files directly?
¯\_(ツ)_/¯
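For what it’s worth, the display half of that wish is the easy part. Here is a minimal sketch, assuming the image bytes have already been pulled out of the ARC/WARC record somehow (seeking to the offset, decompressing, and stripping the record headers is not shown), and using a hypothetical image_data_uri helper:

```ruby
# Hypothetical helper: turns raw image bytes into a data: URI that an
# <img> tag can display without any replay service.
def image_data_uri(bytes, mime_type)
  # 'm0' packs as Base64 with no line breaks, which a data URI needs.
  encoded = [bytes].pack('m0')
  "data:#{mime_type};base64,#{encoded}"
end
```

The result could be passed to image_tag like any other URL, though inlining full-size images into a results page would make the page heavy.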