Islandora and nginx

I have been doing a fair bit of scale testing for York University Digital Library over the last couple of weeks. Most of it has been focused on horizontal scaling of the traditional Islandora stack (Drupal, Fedora Commons, FedoraGSearch, Solr, and aDORe-djatoka). The stack is traditionally run with Apache2 in front of it, and Apache2 reverse proxies the parts of the stack that are Tomcat webapps. I was curious whether the stack would work with nginx, and whether I would get any noticeable improvements by just switching from Apache2 to nginx. The preliminary good news is that the stack works with nginx (I'll outline the config below). The not-so-surprising news, according to this, is that I'm not seeing any noticeable improvements. If time permits, I'll do some real benchmarking.

Islandora nginx configurations

Having no experience with nginx, I started searching around, and found a config by David StClair that worked. With a few slight modifications, I was able to get the stack up and running with no major issues. The only major item that I needed to figure out was reverse proxying aDORe-djatoka so that it would play nice with the default settings for Islandora OpenSeadragon. All this turned out to be was figuring out what the nginx equivalents of the ProxyPass and ProxyPassReverse directives were. Turns out that it is very straightforward. With Apache2, we needed:

    #Fedora Commons/Islandora proxying
    ProxyRequests Off
    ProxyPreserveHost On
    <Proxy *>
      Order deny,allow
      Allow from all
    </Proxy>
    ProxyPass /adore-djatoka http://localhost:8080/adore-djatoka
    ProxyPassReverse /adore-djatoka http://localhost:8080/adore-djatoka

This gives us a nice dog in a hat with Apache2.

With nginx we use the proxy_redirect directive.

  server {
    location /adore-djatoka {
      proxy_pass http://localhost:8080/adore-djatoka;
      proxy_redirect http://localhost:8080/adore-djatoka /adore-djatoka;
    }
  }

This gives us a nice dog in a hat with nginx.
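
To make the ProxyPassReverse / proxy_redirect equivalence a bit more concrete, here is a minimal Python sketch of the rewrite that both directives apply to a redirect's Location header coming back from Tomcat. The function name and the sample path are made up for illustration, and it only models the simple prefix replacement used in the config above, not everything nginx can do with proxy_redirect.

```python
def rewrite_location(location, backend_prefix, public_prefix):
    """Swap the backend prefix of a Location header for the public one."""
    if location.startswith(backend_prefix):
        return public_prefix + location[len(backend_prefix):]
    # Anything that does not point at the backend is passed through untouched.
    return location


backend = "http://localhost:8080/adore-djatoka"
# Hypothetical redirect target issued by the Tomcat webapp.
print(rewrite_location(backend + "/resolver", backend, "/adore-djatoka"))
```

Without that rewrite, a redirect from the webapp would send the browser to localhost:8080, which is exactly what breaks the viewer behind a proxy.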

That's really only the major modification that I had to make to get the stack running with nginx. Here is my config adapted from David StClair's example.

  server {

        root /path/to/drupal/install; ## <-- Your only path reference.

        # Enable compression, this will help if you have for instance advagg module
        # by serving Gzip versions of the files.
        gzip_static on;

        location = /favicon.ico {
                log_not_found off;
                access_log off;
        }

        location = /robots.txt {
                allow all;
                log_not_found off;
                access_log off;
        }

        # Very rarely should these ever be accessed outside of your lan
        location ~* \.(txt|log)$ {
                deny all;
        }

        location ~ \..*/.*\.php$ {
                return 403;
        }

        # No no for private
        location ~ ^/sites/.*/private/ {
                return 403;
        }

        # Block access to "hidden" files and directories whose names begin with a
        # period. This includes directories used by version control systems such
        # as Subversion or Git to store control files.
        location ~ (^|/)\. {
                return 403;
        }

        location / {
                # This is cool because no php is touched for static content
                try_files $uri @rewrite;
                proxy_read_timeout 300;
        }

        # Reverse proxy for aDORe-djatoka (Tomcat webapp)
        location /adore-djatoka {
                proxy_pass http://localhost:8080/adore-djatoka;
                proxy_redirect http://localhost:8080/adore-djatoka /adore-djatoka;
        }

        location @rewrite {
                # You have 2 options here
                # For D7 and above:
                # Clean URLs are handled in drupal_environment_initialize().
                rewrite ^ /index.php;
                # For Drupal 6 and below:
                # Some modules enforce no slash (/) at the end of the URL
                # Else this rewrite block wouldn't be needed (GlobalRedirect)
                #rewrite ^/(.*)$ /index.php?q=$1;
        }

        # For Munin
        location /nginx_status {
                stub_status on;
                access_log off;
                deny all;
        }

        location ~ \.php$ {
                fastcgi_split_path_info ^(.+\.php)(/.+)$;
                #NOTE: You should have "cgi.fix_pathinfo = 0;" in php.ini
                include fastcgi_params;
                fastcgi_param SCRIPT_FILENAME $request_filename;
                fastcgi_intercept_errors on;
                # Assumption: this line is not in the listing above as posted;
                # point it at wherever PHP-FPM listens on your system.
                fastcgi_pass unix:/tmp/phpfpm.sock;
        }

        # Fighting with Styles? This little gem is amazing.
        # This is for D7 and D8
        location ~ ^/sites/.*/files/styles/ {
                try_files $uri @rewrite;
        }

        location ~* \.(js|css|png|jpg|jpeg|gif|ico)$ {
                expires max;
                log_not_found off;
        }
  }

IIPC Curator Tools Fair: Islandora Web ARChive solution pack

The following is the text for a video that I was asked to record for the 2014 International Internet Preservation Consortium General Assembly Curator Tools Fair, on the Islandora Web ARChive solution pack.

My name is Nick Ruest. I am a librarian at York University, in Toronto, Ontario. I’m going to give a quick presentation on the Islandora Web ARChive solution pack. I only have a few minutes, so I’ll quickly cover what the module does, what areas of the web archiving life cycle it covers, and provide a quick demonstration.

So, what is the Islandora Web ARChive solution pack?

I’ll step back and quickly answer what Islandora is first. "Islandora is an open source digital asset management system based on Fedora Commons, Drupal and a host of additional applications." A solution pack, in Islandora parlance, is a Drupal module that integrates with the main Islandora module and the Tuque library, thereby allowing users to deposit, create derivatives, and interact with a given type of object. We have solution packs for Audio, Video, Large Images, Images, PDFs, paged content, and now web archives.

The Web ARChive solution pack allows users to ingest and retrieve web archives through the Islandora interface. If we think about it in terms of OAIS, we give the repository a SIP, which in the case of this solution pack can be a single warc file and some descriptive metadata, and if available, a screenshot and/or a PDF. From there, the solution pack will create an AIP and a DIP.

The AIP will contain: the original warc, MODS descriptive metadata, FITS output (file characterization/technical metadata), web dissemination versions of the screenshots (jpg & thumbnail), PDF, and derivatives of the warc via warctools. Those derivatives are a csv and a filtered warc. The csv -- WARC_CSV -- is a listing of all the files in a given warc. This allows a user/researcher to have a quick glance at the contents of the warc. The filtered warc -- WARC_FILTERED -- is a warc file stripped down as much as possible to the text, and it is used only for search indexing/keyword searching.

The DIP is a JPG/TN of the captured website (if supplied) and download links to the WARC, PDF, WARC_CSV, screenshot, and descriptive metadata. Here a link to the ‘archived site’ can be supplied in the default MODS form. The suggested usage here is to provide a link to the object in a local instance of Wayback, if it exists.

I’ve also been asked to address the following questions:

1) What aspects of the web archiving life cycle model does the tool cover? What aspects of the model would you like to/do intend to build into the tool? What functionality does the tool provide that isn’t reflected in the model?

I’ll address what it does not cover first: Appraisal and selection, scoping, and data capture. We allow users to use their own Appraisal and selection, scoping, and data capture processes. So, for example, locally, we use Heritrix for cron based crawls, and our own bash script for one-off crawls.

What does it cover? All of the rest of the steps!

  • Storage and organization: via Fedora Commons & Islandora
  • QA and analysis: via display/DIP -- visualization exposes it!
  • Metadata/description: every web archive object has a MODS descriptive datastream
  • Access/use/reuse: each web archive object has a URI, along with its derivatives. By default warcs are available to download.
  • Preservation: preservation depends on the policies of the repository/institution, but in our case we have a preservation action plan for web archives, and a suite of Islandora preservation modules running (checksum, checksum checker, FITS, and PREMIS) that cover the basics.
  • Risk management: see above.

2) What resources are committed to the tool’s ongoing development? What are major features in the roadmap? Is the code open source?

I developed the original version, and transferred it to the Islandora Foundation, allowing for community stewardship of the project.

Currently, there is no official roadmap for the project. If anybody has ideas, comments, suggestions, or roadmapish ideas, feel free to send a message to the Islandora mailing list.

...and yes, the code is totally open source. It is available under a GPLv3 license, and the canonical version of the code can be found under the Islandora organization on GitHub.

3) What is the user base for the tool? How environment-specific is the tool as opposed to readily reusable by other organizations?

Not entirely sure. It was recently released as part of the 7.x-1.3 version of Islandora.

Given that it is an Islandora module, it is tied to Islandora. So, you’ll have to have at least a 7.x-1.3 instance of Islandora running, along with the solution pack’s dependencies, to run it.

4) What are the tool’s unique features? What are its shortcomings?

I think some unique features are that it is a part of a digital asset management system (it is the first of its kind that I am aware of), and the utilization of warctools for keyword searching and file inventories.

Shortcomings? That it is a part of a digital asset management system.

Very quick demo time!

Islandora Web ARChive SP updates


Some pretty exciting stuff has been happening lately in the Islandora community. Earlier this year, Islandora began the transformation to a federally incorporated, community-driven soliciting non-profit, making it, in my opinion, a much more sustainable project. Thanks to my organization joining on as a member, I've been provided the opportunity to take part in the Roadmap Committee. Since I joined, we have been hard at work creating transparent policies and processes for software contributions, licenses, and resources. Big thanks to the Hydra community for providing great examples to work from!

I signed my first contributor licence agreement, and initiated the process for making the Web ARChive Solution Pack a canonical Islandora project, subject to the same release management and documentation processes as other Islandora modules. After working through the process, I'm happy to see that the Web ARChive Solution Pack is now a canonical Islandora project.

Project updates

I've been slowly picking off items from my initial todo list for the project, and have solved two big issues: indexing the warcs in Solr for full-text/keyword searching, and creating an index of each warc.

Solr indexing was very problematic at first. I ended up having a lot of trouble getting an xslt to take the warc datastream and give it to FedoraGSearch, and in turn to Solr. Frustrated, I began experimenting with newer versions of Solr, which thankfully have Apache Tika bundled, thereby allowing Solr to index basically whatever you throw at it.

I didn't think our users wanted to be searching the full markup of a warc file. Just the actual text. So, using the Internet Archive's Warctools and @tef's wonderful assistance, I was able to incorporate warcfilter into the derivative creation.

$ warcfilter -H text warc_file > filtered_file

You can view an example of the full-text searching of warcs in action here.

In addition to the full-text searching, I wanted to provide users with a quick overview of what is in a given capture, and was able to do so by also incorporating warcindex into the derivative creation.

$ warcindex warc_file > csv_file
#WARC filename offset warc-type warc-subject-uri warc-record-id content-type content-length
/extra/tmp/yul-113521_OBJ.warc 0 warcinfo None <urn:uuid:588604aa-4ade-4e94-b19a-291c6afa905e> application/warc-fields 514
/extra/tmp/yul-113521_OBJ.warc 797 response <urn:uuid:cbeefcb0-dcd1-466e-9c07-5cd45eb84abb> text/dns 61
/extra/tmp/yul-113521_OBJ.warc 1110 response <urn:uuid:6a5d84d1-b548-41e4-a504-c9cf9acfcde7> application/http; msgtype=response 902
/extra/tmp/yul-113521_OBJ.warc 2366 request <urn:uuid:363da425-594e-4365-94fc-64c4bb24c897> application/http; msgtype=request 257
/extra/tmp/yul-113521_OBJ.warc 2952 metadata <urn:uuid:62ed261e-549d-45e8-9868-0da50c1e92c4> application/warc-fields 149
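
If you want to do something with that listing beyond eyeballing it, here is a rough Python sketch of a parser for those lines. It is based only on the sample output above, not on the warctools documentation: in that sample the subject URI column is sometimes absent and the content type can contain a space, so the sketch anchors on the record ID instead of counting columns. The function and key names are my own.

```python
def parse_warcindex_line(line):
    """Parse one line of warcindex output into a dict (None for header/blank)."""
    tokens = line.split()
    if not tokens or tokens[0].startswith("#"):
        return None
    # The record ID is the one column that is always recognisable.
    rid_pos = next(
        (i for i, tok in enumerate(tokens) if tok.startswith("<urn:uuid:")), None
    )
    if rid_pos is None or rid_pos < 3:
        return None
    # Whatever sits between the record type and the record ID is the subject URI.
    subject = " ".join(tokens[3:rid_pos])
    if subject in ("", "None"):
        subject = None
    return {
        "filename": tokens[0],
        "offset": int(tokens[1]),
        "warc_type": tokens[2],
        "subject_uri": subject,
        "record_id": tokens[rid_pos],
        # The content type may contain spaces, e.g. "application/http; msgtype=response".
        "content_type": " ".join(tokens[rid_pos + 1:-1]),
        "content_length": int(tokens[-1]),
    }


sample = (
    "/extra/tmp/yul-113521_OBJ.warc 1110 response "
    "<urn:uuid:6a5d84d1-b548-41e4-a504-c9cf9acfcde7> "
    "application/http; msgtype=response 902"
)
print(parse_warcindex_line(sample))
```

From there it is a short step to counting record types or summing content lengths per capture.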

The updated Web ARChive SP datastreams now look like so:

Warc SP datastreams

One of my major goals with this project has been integration with a local running instance of Wayback, and it looks like we are pretty close. This solution might not be the cleanest, but at least it is a start, and hopefully it will get better over time. I've updated the default MODS form for the module so that it better reflects this Library of Congress example. The key item here is the 'url' element with the 'Archived site' attribute.

  <url displayLabel="Active site"></url>
  <url displayLabel="Archived site"></url>

Wayback accounts for a date in its url structure '' and we can use that to link a given capture to its given dissemination point in Wayback. Using some Islandora Solr magic, I should be able to give that link to a user on a given capture page.
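
As a rough illustration of the idea, here is a small Python sketch that builds such a link from a capture date. Wayback replay URLs conventionally place a 14-digit timestamp (YYYYMMDDhhmmss) ahead of the original URL; the Wayback base URL, the function name, and the example site below are placeholders, not our actual setup.

```python
from datetime import datetime


def wayback_url(wayback_base, captured_at, original_url):
    """Build a replay link: <base>/<14-digit timestamp>/<original URL>."""
    timestamp = captured_at.strftime("%Y%m%d%H%M%S")
    return f"{wayback_base.rstrip('/')}/{timestamp}/{original_url}"


# Placeholder values; swap in your own Wayback instance and crawl date.
print(wayback_url("http://wayback.example.org/wayback",
                  datetime(2013, 10, 1, 4, 0, 0),
                  "http://example.com/"))
```

The capture date would come from the MODS record, and the resulting link is what would go in the 'Archived site' url element.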

We have automated this in our capture and preserve process: capturing warcs with Heritrix, creating MODS datastreams, and screenshots. This allows us to batch import our crawl quickly and efficiently.

Hopefully in the new year we'll have a much more elegant solution!

The Islandora Web ARChive Solution Pack - Open Repositories 2013

Below is the text and slides of my presentation on the Web ARChive solution pack at Open Repositories 2013.

I have a really short amount of time to talk here. So, I am going to focus on the how and why for this solution pack and kinda put it in context of the Web Archiving Life Cycle Model proposed by the Internet Archive earlier this year. Maybe I shouldn't have proposed a 7 minute talk!

Context! Almost a year ago, I was in a meeting and was presented with this problem. YFile, a daily university newspaper -- it was previously a paper, now a website -- had been taken over by marketing a while back, and they deleted all their back content. It is an official university publication, so an official university record, and it will eventually end up in the archives, so it will eventually be our problem; the library's problem. Plainly put, we live in a reality where official records are born and disseminated via the Internet. Many institutions have a strategy in place for transferring official university records that are print or tactile to university archives, but not much exists strategy-wise for websites. So, I naively decided to tackle it.

I tend to just do things. I don't ask permission. I apologize later if I have to. Like maybe taking down the YFile server during the first few initial crawls. If I make mistakes, that is good, I am learning something! What I am doing isn't new, but then again it kinda is. It is a really weird place. I need to crawl a website every day. The Internet Archive crawler comes around whenever it does. There is no way to give the Internet Archive/Wayback Machine a whole bunch of warc files, and I'm not ready to pay for Archive-It.

That won't work for me at all when I have some idea how to do it all myself. So, what is the problem? I need to capture and preserve a website every day. I want to provide the best material to a researcher. I want to keep a fine eye on preservation, but not be a digital pack rat, and I need to constantly keep the librarian and archivist in me pleased, which always seems to come down to the item vs. collection debate and which of those gets the most attention.

How easy is it to grab a website? Pretty damn easy if you're using at least wget 1.14 which has warc support.

How many people here know what a warc is? Warc stands for Web ARChive. It is an ISO standard. It is basically a file -- that can get massive very quickly -- that aggregates the raw resources you request into a single file along with crawl metadata and checksums. PROVENANCE!

This is what the beginning of a warc file looks like.

And here is a selection from the actual archive portion. That is my brief crash course on warc. We can talk about it more later if you have questions. I need to keep moving along.

So, warcs are a little weird to deal with on their own. You can disseminate them with the Wayback Machine, but I assume nobody but a few people on this planet want to see a page full of just warc files. Building something browsable takes a little bit more work. So, I decided to snag a pdf and a screenshot of the front page of the site that I am grabbing with wkhtmltopdf and wkhtmltoimage. Then I toss this all in a single bash script, and give it to cron.

So this is what I have come up with. This is how I capture and preserve a website. The pdf/image + xvfb came from Peter Binkley. X virtual framebuffer is an X11 server that performs all graphical operations in memory, not showing any screen output.
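
For anybody who did not catch the slide, the shape of that capture step can be sketched roughly like this in Python. This is not my actual bash script: the function name, the output basename, and the exact wget options are illustrative assumptions. It only builds the three commands (wget with WARC output, then wkhtmltopdf and wkhtmltoimage under xvfb-run) so you can see what gets handed to the shell.

```python
import shlex


def capture_commands(url, basename):
    """Return the three commands for one capture: warc, pdf, screenshot."""
    return [
        # wget >= 1.14 writes <basename>.warc.gz alongside the mirrored files.
        ["wget", "--mirror", "--page-requisites", f"--warc-file={basename}", url],
        # xvfb-run gives the wkhtmlto* tools a throwaway in-memory X server.
        ["xvfb-run", "-a", "wkhtmltopdf", url, f"{basename}.pdf"],
        ["xvfb-run", "-a", "wkhtmltoimage", url, f"{basename}.png"],
    ]


for cmd in capture_commands("http://example.com/", "example-capture"):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually run the capture you would pass each list to subprocess.run(cmd, check=True), and put the whole thing on cron.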

I've been running that script on cron since last October. Now what? Like I said before, nobody wants to see a page full of warc files. So, I started working with the tools and platforms that I know. In this case, Drupal, Islandora, and Fedora Commons, and created a solution pack. A solution pack, in Islandora parlance, is a Drupal module that integrates with the main Islandora module and the Tuque API to deposit, create derivatives, and interact with a given type of object. So, we have solution packs for Audio, Video, Large Images, Images, PDFs, and paged content.

What does it do? Adds all required Fedora objects to allow users to ingest, create derivatives, and retrieve web archives through the Islandora interface. So we have, Content Models, Data Stream Composite Models, forms, and collection policies. The current iteration of the module allows one to batch ingest a bunch of objects for a given collection, and it will create all of the derivatives (Thumbnail and display image), and index any provided descriptive metadata in Solr as well as the actual WARC file since it is mostly text. The WARC indexing is still pretty experimental, it works, but I don't know how useful it is.

If you want to check out a live demo, and poke around while I am rambling on here, check this site out.

Collection (in terms of the web archiving life-cycle model). This is an object from the Islandora basic collection solution pack.

Seed (in terms of the web archiving life-cycle model). This is an object from the Islandora basic collection solution pack.

Document (in terms of the web archiving life-cycle model). This is an object from the Islandora Web ARChive solution pack.

Here is what my object looks like. The primary archival object is the WARC file, then we have our associated data streams: PDF (from the crawl), MODS/DC (descriptive metadata), Screenshot (from the crawl), FITS (technical metadata), Thumbnail & Medium JPG (derivative display images).

Todo! What I am still working on when I have time.

I want to tie in the Internet Archive's Wayback Machine for playback/dissemination of WARCs. I haven't quite wrapped my head around how best to do the Wayback integration, but I am thinking of using the date field value in the MODS record for an individual crawl.

I'm also thinking of incorporating WARC tools into this solution pack. This would be for quick summaries and maybe a little analysis. Think of how R is incorporated into Dataverse, if you are familiar with that.

I am also working on integrating my silly little bash scripts into the solution pack. That way one could just do the whole fell swoop of crawling, dissemination, and preservation in a single click when ingesting an object in Islandora.

Finally, there is a hell of a lot of metadata in each of these warc files begging for something to be done with them. I haven't figured out a way to parse them in an abstract repeatable way, but if I or somebody else does, it will be great!

Islandora Web ARChive Solution Pack

What is it?

The Islandora Web ARChive Solution Pack is yet another Islandora Solution Pack. This particular solution pack provides the necessary Fedora objects for persisting and disseminating web archive objects; warc files.

What does it do?

Currently, the SP allows a user to upload a warc with an associated MODS form. Once the object is deposited, the associated metadata is displayed along with a download link to the warc file.

You can check out an example here

Can I get the code?

Of course!


If I am doing something obviously wrong, please let me know!

Immediate term:

  1. Incorporate Wayback integration for the DIP. I think this is the best disseminator for the warc files. However, I haven't wrapped my head around how to programmatically provide access to the warc files in Wayback. I know that I will have two warc objects, an AIP warc and a DIP warc (Big thank you to @sbmarks for being a sounding board today!). Fedora will manage the AIP, and Wayback will manage the DIP. Do I iFrame the Wayback URI for the object, or link out to it?

  2. Drupal 7 module. Drupal 7 versions of Islandora Solution Packs should be on their way shortly -- next release, I believe. The caveat to using the Drupal 6 version of this module is the mimetype support. It looks like the Drupal 6 api (file_get_mimetype) doesn't pull the correct mimetype for a warc file. I should get 'application/warc' but I am getting 'application/octet-stream' -- the fallback default for the api.
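
To show why the mimetype falls through, here is a tiny Python illustration of an extension-to-mimetype lookup with a fallback. It is not Drupal's code: the function name and the two-entry "stock" mapping are made up, and the point is only that a warc file gets the fallback until something adds a warc entry to whatever mapping the lookup consults.

```python
import os

FALLBACK = "application/octet-stream"


def guess_mimetype(filename, mapping):
    """Look up a mimetype by file extension, falling back to a generic type."""
    extension = os.path.splitext(filename)[1].lower().lstrip(".")
    return mapping.get(extension, FALLBACK)


# Hypothetical stock mapping with no entry for warc files.
stock = {"pdf": "application/pdf", "jpg": "image/jpeg"}
print(guess_mimetype("crawl.warc", stock))

# Same lookup once a warc entry has been added.
extended = dict(stock, warc="application/warc")
print(guess_mimetype("crawl.warc", extended))
```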

Long term:

  1. Incorporate Islandora microservices. What I would really like to do is allow users to automate this entire process. Basically, just say this is a site I would like to archive. This is the frequency at which I would like it archived, with necessary wget options. This is the default metadata profile for it. Then grab the site, ingest it into Fedora, drop the DIP warc into Wayback, and make it all available.

  2. If you have any idea on how to do the above, or how to do it in a better manner, please let me know!

Islandora development visualization

Hit a bit of a wall yesterday getting checksums working when ingesting content into Islandora, so I made a Gource video of the Islandora commits in my fork of the git repo.

Music by RipCD (@drichert) and myself.

How'd I do it?

  1. I wanted to use the Gravatars, so I used this handy little perl script.
  2. Hopped into the Islandora git repo, and ran:

    gource --user-image-dir .git/avatar/ -s 3 --auto-skip-seconds 0.1 --file-idle-time 50 --max-files 500 --disable-bloom --stop-at-end --highlight-users --hide mouse --background-colour 111111 --font-size 20 --title "Islandora Development" --output-ppm-stream - --output-framerate 60 | avconv -y -r 60 -f image2pipe -vcodec ppm -i - -b 8192K ~/Videos/islandora-gource.mp4

  3. Then I used OpenShot to add the music and uploaded to YouTube.

FITS and Islandora integration

Digital preservationistas rejoice?
I managed to get FITS integration working in Islandora via a plugin. The plugin will automatically create a FITS xml datastream for an object upon ingest in the Islandora interface for a given solution pack. Right now I have it working with the Basic Image Solution Pack, Large Image Solution Pack, and PDF Solution Pack. You just have to make sure is in your apache user's path (thanks @adr). [UPDATE: Works with the Audio Solution Pack now.]
What I had feared was going to be a pretty insane process turned out to be fairly simple and straightforward, which I'll outline here.

  1. I looked at existing plugins for something similar that I could crib from, and found it in the exiftool plugin, which is used in the audio and video solution packs.
  2. Using the existing plugin, I ran some grep queries to figure out how it is used in the overall codebase (Islandora, and solution packs).
  3. Created a feature branch.
  4. Hammered away until I had something working. (Thanks @mmccollow)
  5. Created an ingest rule for a solution pack. This tells the solution pack to call the plugin.
  6. Tested, tested, and tested.
  7. Merged the feature branch with the 6.x branch, pushed, and opened up a pull request.

That is basically it. Let me know if you have any questions. Or, if you know of a way to make it even better, patches welcome ;)
[Update #2]
I've added a configuration option to the Islandora admin page to enable FITS datastream creation, and the ability to define a path to I put it in the advanced section of the admin page which is not expanded by default. This will probably be problematic, and folks won't notice it. It might be a better idea to collect all the various command line tools Islandora uses, and give them all a section in the admin page to define their paths.
I also have FITS creation working with the Video Solution Pack now. Up next, Islandora Scholar... just have to get that up and running ;)

Right! That hackfest report I should have given...

When I was at Islandora Camp trying to wrap my head around all things Islandora and Fedora, I was thinking ahead about a possible project in archives and research collections - migrating our collection/fonds descriptions and finding aids over to ICA AtoM.
ICA AtoM does some pretty cool stuff in terms of access to collection/fonds descriptions, integrates very nicely with Archivematica with accessioning born digital objects, and associating digital representations of item level objects with their respective collection/fonds. My greedy little brain wanted more! I wanted ICA AtoM to be able to pull in Fedora objects automatically and associate them with their respective collection/fonds. So, this is the hackfest proposal I submitted.
So what happened? What'd we end up doing?
The amazing Peter Van Garderen made absolutely sure Artefactual Systems staff was highly represented at hackfest, and I had two amazing people from Artefactual trying to parse my sleep-deprived-scatter-brained-state reasoning/logic behind what I wanted to do. David Juhasz and Jesús García Crespo, you rock!
We spent the first hour or so working through the Fedora REST API documentation looking for the best way to approach the "problem." After about an hour or so of working through a few conditional queries that would need to be strung together, Jesús jumped in and said, "Why aren't we using SWORD for this!?" Good question!
ICA AtoM can speak SWORD, and Fedora can speak SWORD so long as you can get the module working. As things at hackfest generally go for me, it failed. I could not for the life of me get the module to build. I spent some time going through build.xml, but ant and I just weren't going to be friends that day.
Strike one - don't code conditional Fedora REST API queries - not sharable and scalable
Strike two - I couldn't get the SWORD module to build!
Strike three - ???
While brainstorming for other solutions to our "problem", David was looking for examples of how I could share records from our repository. Duh! OAI-PMH! ICA AtoM can harvest OAI. If we can map OAI sets to ICA AtoM collections/fonds, and set records to individual items in a collection/fonds, we're set. Oh my, another use case of OAI-PMH! Yay!
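
The mapping idea can be sketched in a few lines of Python. The ListRecords verb and the metadataPrefix and set arguments are standard OAI-PMH; the repository base URL, the set names, and the fonds identifiers below are hypothetical, and this is not what we actually built at hackfest.

```python
from urllib.parse import urlencode

# Hypothetical mapping of OAI sets in the repository to fonds in ICA AtoM.
SET_TO_FONDS = {
    "fonds_0001": "example-fonds-one",
    "fonds_0002": "example-fonds-two",
}


def list_records_url(base_url, set_spec, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request limited to one set."""
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    })
    return f"{base_url}?{query}"


for set_spec, fonds in SET_TO_FONDS.items():
    # Each harvested record in the set would be attached to this fonds as an item.
    print(fonds, list_records_url("http://repository.example.org/oai2", set_spec))
```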
Did we succeed? Not actually. Turns out the OAI-PMH harvesting code wasn't quite up to snuff at the time, and David, bless his heart, worked on trying to get it up to par before the end of the day. We were not able to pull together a working version, but the framework is there. It was there all along! (Ed, yes we could have and totally should have used atom :P )