A Quick Benchmark of Webarchive-Discovery

Sep 25, 2017 3 min read

This past week Compute Canada provided us with resources to setup our Solr Cloud instance for WALK and Archives Unleashed. We were able to get things setup relatively quickly thanks to a bit of preparation and practice on our local machines in the previous weeks. Once everything was setup (5 virtual machines total; 4 Solr Cloud nodes and one indexer – details below), we started benchmarking webarchive-discovery and our Solr Cloud setup with GNU Parallel.¹

Overall this process took about a day to complete. We selected a relatively small collection from one of our Web Archives for Longitudinal Knowledge (WALK) partners. Described in detail below, the collection had more WARCs than number of CPU cores on our indexer, and jobs took around 30 minutes to complete. In our tests, we focused on three main areas: varying the number of jobs and RAM, varying RAM and keeping jobs consistent, and running more jobs that CPU cores available. In the future, we would like to do a more rigorous examination. For each run we deleted the Solr index and then started again, and used webarchive-discovery at this commit.

The command we ran each time was a variation of the command below, with only --jobs, -Xmx, and the log filename modified each time.

$ time find /data/727/4114/warcs -iname "*.gz" -type f | parallel --jobs 16 --gnu "java -Xmx6g -Djava.io.tmpdir=/mnt/tmp -jar /home/ubuntu/warc-indexer.jar -i 'Simon Fraser University Library' -n 'SFU Affiliated Conferences/Institutes' -u '4114' -s http://192.168.32.20:8983/solr/walk {} > $(basename {})-16-06.log"

Below is a summary table of all 19 of the jobs run. Each summary includes the output from time, as well as the number of Solr docs at the end of the run, webarchive-discovery log output from the same file, and the time it took to run on that file. Overall we found that running more jobs than cores available is not only safe, but leads to the fastest results. As you can see from the chart below, running more jobs than cores available had significant performance advantages over several of the other configurations, and there appears to be a sweet spot at 24–32 jobs with 16 cores.

During this process, it occurred to us that Warclight + webarchive-discovery could be a useful tool for crawl QA. It is relatively painless to fire up a Warclight instance, and index some WARCs with webarchive-discovery. Or, it could be as simple as modifying the Rake task used for testing, and running a development server. Once that is done, you’ll have some really good insight into what has been captured by just looking at your facets, and exploring. We’d be curious if others agree with this.

We’d like to extend our thanks and gratitude to Compute Canada, in particular John Simpson, for the resources to do this. Toke Eskildsen for his guidance and assistance on Solr Cloud requirements, tip and tricks, and reviewing and critiquing the process. Andy Jackson, and his team at the UKWA for all their work on webarchive-discovery!

Read more below for information on the setup.

Setup

Collection information

Institution: Simon Fraser University Library
Collection name: SFU Affiliated Conferences/Institutes
Number of WARCs: 26
Collection size: 21G

Solr

Version: 6.6.1
Collection name: walk
Shards: 12
Replication factor: 4
Max Shards Per Node: 12
Number of nodes: 4
CPUs/node: 4
RAM/node: 30G
Java version: OpenJDK Runtime Environment (build 1.8.0_131–8u131-b11–2ubuntu1.16.04.3-b11)
Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0–96-generic x86_64)

Indexer

CPUs: 16
RAM: 120G
Java version: OpenJDK Runtime Environment (build 1.8.0_131–8u131-b11–2ubuntu1.16.04.3-b11)
Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0–96-generic x86_64)