Islandora and nginx

Background

I have been doing a fair bit of scale testing for York University Digital Library over the last couple of weeks. Most of it has been focused on horizontal scaling of the traditional Islandora stack (Drupal, Fedora Commons, FedoraGSearch, Solr, and aDORe-djatoka). The stack is traditionally run with Apache2 in front of it, which reverse proxies the parts of the stack that are Tomcat webapps. I was curious whether the stack would work with nginx, and whether I would get any noticeable improvements just by switching from Apache2 to nginx. The preliminary good news is that the stack works with nginx (I'll outline the config below). The not surprising news, according to this, is that I'm not seeing any noticeable improvements. If time permits, I'll do some real benchmarking.
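If you want to run a rough comparison yourself, something as blunt as ApacheBench is a reasonable first pass (the URL below is a placeholder for whatever page you want to hammer, not a real object):

$ # 1,000 requests at a concurrency of 50, run once per front end
$ ab -n 1000 -c 50 http://digital.library.yorku.ca/islandora/object/some:pid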

Islandora nginx configurations

Having no experience with nginx, I started searching around and found a config by David StClair that worked. With a few slight modifications, I was able to get the stack up and running with no major issues. The only major item I needed to figure out was how to reverse proxy aDORe-djatoka so that it would play nice with the default settings for Islandora OpenSeadragon. All this turned out to be was figuring out the nginx equivalents of the ProxyPass and ProxyPassReverse directives, which is very straightforward. With Apache2, we needed:

  
    #Fedora Commons/Islandora proxying
    ProxyRequests Off
    ProxyPreserveHost On
    <Proxy *>
      Order deny,allow
      Allow from all
    </Proxy>
    ProxyPass /adore-djatoka http://digital.library.yorku.ca:8080/adore-djatoka
    ProxyPassReverse /adore-djatoka http://digital.library.yorku.ca:8080/adore-djatoka
  

This gives us a nice dog in a hat with Apache2.

With nginx, we use the proxy_pass and proxy_redirect directives.

  server {
    location /adore-djatoka {
      proxy_pass http://localhost:8080/adore-djatoka;
      proxy_redirect http://localhost:8080/adore-djatoka /adore-djatoka;
    }
  }

This gives us a nice dog in a hat with nginx.
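A quick way to sanity-check the rewrite (the hostname is borrowed from my config below; adjust for your setup) is to inspect the headers nginx hands back:

$ # Any Location header from Tomcat should come back as /adore-djatoka, not :8080
$ curl -I http://kappa.library.yorku.ca/adore-djatoka/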

That's really the only major modification I had to make to get the stack running with nginx. Here is my config, adapted from David StClair's example.

  server {

        server_name kappa.library.yorku.ca;
        root /path/to/drupal/install; ## <-- Your only path reference.

        # Enable compression, this will help if you have for instance advagg module
        # by serving Gzip versions of the files.
        gzip_static on;

        location = /favicon.ico {
                log_not_found off;
                access_log off;
        }

        location = /robots.txt {
                allow all;
                log_not_found off;
                access_log off;
        }

        # Very rarely should these ever be accessed outside of your lan
        location ~* \.(txt|log)$ {
                allow 127.0.0.1;
                deny all;
        }

        location ~ \..*/.*\.php$ {
                return 403;
        }

        # No no for private
        location ~ ^/sites/.*/private/ {
                return 403;
        }

        # Block access to "hidden" files and directories whose names begin with a
        # period. This includes directories used by version control systems such
        # as Subversion or Git to store control files.
        location ~ (^|/)\. {
                return 403;
        }
        location / {
                # This is cool because no php is touched for static content
                try_files $uri @rewrite;
                proxy_read_timeout 300;
        }

        location /adore-djatoka {
                proxy_pass http://localhost:8080/adore-djatoka;
                proxy_redirect http://localhost:8080/adore-djatoka /adore-djatoka;
        }

        location @rewrite {
                # You have 2 options here
                # For D7 and above:
                # Clean URLs are handled in drupal_environment_initialize().
                rewrite ^ /index.php;
                # For Drupal 6 and below:
                # Some modules enforce no slash (/) at the end of the URL
                # Else this rewrite block wouldn't be needed (GlobalRedirect)
                #rewrite ^/(.*)$ /index.php?q=$1;
        }

        # For Munin
        location /nginx_status {
                stub_status on;
                access_log off;
                allow 127.0.0.1;
                deny all;
        }

        location ~ \.php$ {
                fastcgi_split_path_info ^(.+\.php)(/.+)$;
                #NOTE: You should have "cgi.fix_pathinfo = 0;" in php.ini
                include fastcgi_params;
                fastcgi_param SCRIPT_FILENAME $request_filename;
                fastcgi_intercept_errors on;
                fastcgi_pass 127.0.0.1:9000;
        }

        # Fighting with Styles? This little gem is amazing.
        # This is for D7 and D8
        location ~ ^/sites/.*/files/styles/ {
                try_files $uri @rewrite;
        }

        location ~* \.(js|css|png|jpg|jpeg|gif|ico)$ {
                expires max;
                log_not_found off;
        }

  }
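Before cutting traffic over, it's worth validating the config and reloading nginx. These are stock nginx commands, nothing Islandora-specific:

$ # Check the config for syntax errors, then reload without dropping connections
$ sudo nginx -t
$ sudo nginx -s reload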

Why @gccaedits?

Ed already said it much, much better. I agree with Ed, and stand by his rationale.

So, why write this?

I want to document what a wonderful example this little project is of open source software, permissive intellectual property licenses (a Public Domain dedication in this instance), and open data, and how all of these things together can change the world.

In the two weeks since Ed shared his code, it has received 179 commits from 24 different contributors. It has been forked 92 times, has 33 watchers, and has 460 stargazers. In addition, we've witnessed the proliferation of similarly inspired bots: bots that surface anonymous Wikipedia edits made from national government IP ranges (U.S., Canada, France, Norway, etc.), state and provincial government IP ranges (@NCGAedits, @ONgovEdits, @lagovedits, etc.), big industry IP ranges (@phrmaedits, @oiledits, @monsantoedits, etc.), and intergovernmental organization IP ranges (@un_edits and @NATOedits). I'm aware of over 40 at the time of this writing, and new bots have consistently appeared daily over the past two weeks.

These bots have revealed some pretty amazing and controversial edits. Far, far too many to list here, but here are a few that have caught my eye.

International stories:

  • Russian government anonymous edits on flight MH17 page

Canadian:

  • Canadian House of Commons anonymous edits to Shelly Glover (Minister of Canadian Heritage) article
  • Canadian House of Commons anonymous edits to Pierre-Hugues Boisvenu (Senator) article
  • Homophobic anonymous edits from Natural Resources Canada to Richard Conn article


Much more important than these selected tweets, this software surfaces "big data" in a meaningful way. It provides transparency. It empowers a citizenry. It exists as a resource for research and investigative journalism. And, most important in my opinion, software written and shared like this can push all the cynicism aside and give one hope for the future.

#aaronsw

More Rob Ford tweets on a map

Another example of how global the Rob Ford scandal has become via harvested tweets with geographic coordinates. This example is a harvest of #rofo, #robford, #topoli, and #ShirtlessHorde.

The harvest took place on July 6, 2014, and should cover the discussion from around the time of Rob Ford's return on June 30, 2014, through July 6, 2014. The tweets with available geo-information represent less than 10% of all tweets harvested. If you would like the raw tweet data (not the geoJSON - you can grab that if you view the source), you can get it from here. If you would like to see all the tweets harvested, you can view them here. (Warning! This might blow up your browser. There is a fair bit of data here.)

Tweets were harvested with Ed Summers' twarc.

#rofo OR #robford OR #topoli OR #ShirtlessHorde
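For anyone curious how a harvest like this can be reproduced, here is a rough sketch using twarc and jq. Treat the exact invocation as illustrative: twarc's command-line interface has changed across versions, and the jq filter assumes the old Twitter v1.1 tweet JSON with a top-level coordinates field (already a GeoJSON point).

$ # Harvest matching tweets as line-delimited JSON (twarc 1.x-style CLI)
$ twarc search '#rofo OR #robford OR #topoli OR #ShirtlessHorde' > tweets.jsonl
$ # Keep only geotagged tweets and reshape them into GeoJSON features
$ jq -c 'select(.coordinates != null)
    | {type: "Feature", geometry: .coordinates, properties: {text: .text}}' \
    tweets.jsonl > features.jsonl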

IIPC Curator Tools Fair: Islandora Web ARChive solution pack

The following is the text for a video that I was asked to record for the 2014 International Internet Preservation Consortium General Assembly Curator Tools Fair, on the Islandora Web ARChive solution pack.


My name is Nick Ruest. I am a librarian at York University, in Toronto, Ontario. I’m going to give a quick presentation on the Islandora Web ARChive solution pack. I only have a few minutes, so I’ll quickly cover what the module does, what areas of the web archiving life cycle it covers, and provide a quick demonstration.

So, what is the Islandora Web ARChive solution pack?

I’ll step back and quickly answer what Islandora is first. "Islandora is an open source digital asset management system based on Fedora Commons, Drupal and a host of additional applications." A solution pack, in Islandora parlance, is a Drupal module that integrates with the main Islandora module and the Tuque library, thereby allowing users to deposit, create derivatives of, and interact with a given type of object. We have solution packs for Audio, Video, Large Images, Images, PDFs, paged content, and now web archives.

The Web ARChive solution pack allows users to ingest and retrieve web archives through the Islandora interface. If we think about it in terms of OAIS, we give the repository a SIP, which in the case of this solution pack can be a single warc file and some descriptive metadata, plus, if available, a screenshot and/or a PDF. From there, the solution pack will create an AIP and a DIP. The AIP will contain: the original warc, MODS descriptive metadata, FITS output (file characterization/technical metadata), web dissemination versions of the screenshots (JPG and thumbnail), the PDF, and derivatives of the warc created via warctools. Those derivatives are a CSV and a filtered warc. The CSV -- WARC_CSV -- is a listing of all the files in a given warc, which allows a user/researcher to have a quick glance at its contents. The filtered warc -- WARC_FILTERED -- is a warc file stripped down as much as possible to the text, and it is used only for search indexing/keyword searching. The DIP is a JPG/TN of the captured website (if supplied) and download links to the WARC, PDF, WARC_CSV, screenshot, and descriptive metadata. Here a link to the ‘archived site’ can be supplied in the default MODS form. The suggested usage is to provide a link to the object in a local instance of Wayback, if one exists.

I’ve also been asked to address the following questions:

1) What aspects of the web archiving life cycle model does the tool cover? What aspects of the model would you like to/do intend to build into the tool? What functionality does the tool provide that isn’t reflected in the model?

I’ll address what it does not cover first: Appraisal and selection, scoping, and data capture. We allow users to bring their own appraisal and selection, scoping, and data capture processes. So, for example, locally we use Heritrix for cron-based crawls, and our own bash script for one-off crawls (a minimal sketch of a one-off crawl follows below).
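Our actual script isn't reproduced here, but as a sketch, wget's built-in WARC support covers the basics of a one-off crawl (the URL and output name are placeholders, not our real crawl settings):

$ # Mirror a site and write the capture to yfile.warc.gz, with a CDX index alongside
$ wget --mirror --page-requisites --warc-file=yfile --warc-cdx \
      http://yfile.news.yorku.ca/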

What does it cover? All of the rest of the steps!

  • Storage and organization: via Fedora Commons & Islandora
  • QA and analysis: via display/DIP -- visualization exposes it!
  • Metadata/description: every web archive object has a MODS descriptive datastream
  • Access/use/reuse: each web archive object has a URI, along with its derivatives. By default warcs are available to download.
  • Preservation: preservation depends on the policies of the repository/institution, but, in our case we have a preservation action plan for web archives, and suite of Islandora preservation modules running (checksum, checksum checker, FITS, and PREMIS) that cover the basics.
  • Risk management: see above.

2) What resources are committed to the tool’s ongoing development? What are major features in the roadmap? Is the code open source?

I developed the original version, and transferred it to the Islandora Foundation, allowing for community stewardship of the project.

Currently, there is no official roadmap for the project. If anybody has comments, suggestions, or roadmap-ish ideas, feel free to send a message to the Islandora mailing list.

...and yes, the code is totally open source. It is available under a GPLv3 license, and the canonical version of the code can be found under the Islandora organization on Github.

3) What is the user base for the tool? How environment-specific is the tool as opposed to readily reusable by other organizations?

Not entirely sure. It was recently released as part of the 7.x-1.3 version of Islandora.

Given that it is an Islandora module, it is tied to Islandora. So, you’ll have to have at least a 7.x-1.3 instance of Islandora running, along with the solution pack’s dependencies, to run it.

4) What are the tool’s unique features? What are its shortcomings?

I think some unique features are that it is a part of a digital asset management system (it is the first of its kind that I am aware of), and its utilization of warctools for keyword searching and file inventories.

Shortcomings? That it is a part of a digital asset management system.

Very quick demo time!

Rob Ford tweets on a map

Examples of how global the Rob Ford scandal has become via harvested tweets with geographic coordinates.

If you would like the raw tweet data (not the geoJSON - you can grab that if you view the source), you can get it from here and here. Tweets were harvested with Ed Summers' twarc.

robford OR rob ford OR rofo OR topoli OR toronto OR FordNation May 3, 2014

TOpoli OR Toronto OR RobFord November 25, 2013

TOpoli OR Toronto OR RobFord November 14, 2013

Digital Preservation Tools and Islandora

Incorporating a suite of digital preservation tools into various Islandora workflows has been a long-term goal of mine and a few other members in the community, and I'm really happy to see that it is now becoming more and more of a priority in the community.

A couple of years ago, I cut my teeth on contributing to Islandora by creating a FITS plugin for the Drupal 6 version of Islandora. This tool was later expanded into a standalone module with the restructuring of the Drupal 7 code base of Islandora. The Drupal 7 version of Islandora, along with Tuque, has really opened the door for community contributions over the last year or so. Below is a list and description of Islandora modules with a particular focus on the preservation side of the repository platform.

Islandora Checksum

Islandora Checksum is a module that I developed with Adam Vessey (with special thanks to Jonathan Green and Jordan Dukart, who helped me grok Tuque) that allows repository managers to enable the creation of checksums for all datastreams on objects. If enabled, the repository administrator can choose from the default Fedora Commons checksum algorithms: MD5, SHA-1, SHA-256, SHA-384, or SHA-512.
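As an aside, you can always spot-check a stored checksum by hand: pull the datastream content over Fedora's REST API and hash it locally. The PID and datastream ID below are hypothetical; substitute your own.

$ # Fetch the OBJ datastream and compute a SHA-256 to compare against Fedora's
$ curl -s http://localhost:8080/fedora/objects/yul:113521/datastreams/OBJ/content | sha256sum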

This module is licensed under a GPLv3 license, and is currently going through the Islandora Foundation's Licensed Software Acceptance Procedure. If successful, the module will be a part of the next Islandora release.

Islandora Checksum admin

Islandora Checksum Checker

Islandora Checksum Checker is a module by Mark Jordan that extends Islandora Checksum by verifying "the checksums derived from Islandora object datastreams and adds a PREMIS 'fixity check' entry to the object's audit log for each datastream checked."

This module is also licensed under a GPLv3 license.


Islandora PREMIS

Islandora PREMIS is a module by Mark Jordan, Donald Moses, Paul Pound, and myself. The module produces XML and HTML representations of PREMIS metadata for objects in an Islandora repository on the fly. It currently documents all fixity checks performed on an object's datastreams, includes configurable 'agent' entries for an institution as well as for the Fedora Commons software, and maps the contents of each object's "rights" elements in the Dublin Core datastream to equivalent PREMIS "rightsExtension" elements. You can view an example here, along with the XML representation.

What we have implemented so far is just the basics, and we are always seeking feedback to make it better. If you're interested in the discussion or would like to provide feedback, feel free to follow along in the Islandora Google Group thread, and the Github issue queue for the project.

This module is also licensed under a GPLv3 license.

Islandora PREMIS admin

Islandora BagIt

Islandora BagIt is also a module by Mark Jordan (actually a fork of his Drupal module) that utilizes Scholars' Lab's BagItPHP, allowing repository administrators to create Bags of selected content. It currently offers a wide variety of configuration options for exporting contents as Bags, as well as for creating Bags on ingest and/or when objects are modified. The way Mark has structured this module also allows developers to easily extend it by creating additional plugins, and it provides Drush integration as well.

This module is also licensed under a GPLv3 license.

Islandora BagIt admin

Islandora Preservation Documentation

Documentation! One of the most important aspects of digital preservation.

This is not a full-blown module yet. What it currently is, is the beginnings of a generic set of documentation that can be used by repository administrators. Eventually we hope to use a combination of Default Content/UUID Features and Features to provide a default bundle of preservation documentation in an Islandora installation.

The content in this Github repo comes from the documentation and policies we are creating at York University Library, which is derived from the wonderful documentation created by Scholars Portal during their successful ISO 16363 audit.

Islandora Web ARChive SP updates

Community

Some pretty exciting stuff has been happening lately in the Islandora community. Earlier this year, Islandora began the transformation to a federally incorporated, community-driven soliciting non-profit, making it, in my opinion, a much more sustainable project. Thanks to my organization joining on as a member, I've been provided the opportunity to take part in the Roadmap Committee. Since I've joined, we have been hard at work creating transparent policies and processes for software contributions, licenses, and resources. Big thanks to the Hydra community for providing great examples to work from!

I signed my first contributor licence agreement, and initiated the process for making the Web ARChive Solution Pack a canonical Islandora project, subject to the same release management and documentation processes as other Islandora modules. After working through the process, I'm happy to see that the Web ARChive Solution Pack is now a canonical Islandora project.

Project updates

I've been slowly picking off items from my initial todo list for the project, and have solved two big issues: indexing the warcs in Solr for full-text/keyword searching, and creating an index of each warc.

Solr indexing was very problematic at first. I ended up having a lot of trouble getting an XSLT to take the warc datastream and hand it to FedoraGSearch, and in turn to Solr. Frustrated, I began experimenting with newer versions of Solr, which thankfully have Apache Tika bundled, allowing Solr to index basically whatever you throw at it.
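Under the hood, the Tika route amounts to handing documents to Solr's extracting request handler (Solr Cell). Here is a minimal sketch against a stock single-core Solr; the exact path and document id are assumptions, and the handler must be enabled in solrconfig.xml:

$ # Have Solr/Tika extract and index text from the filtered warc
$ curl 'http://localhost:8983/solr/update/extract?literal.id=yul-113521&commit=true' \
    -F 'file=@yul-113521_FILTERED.warc'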

I didn't think our users wanted to be searching the full markup of a warc file, just the actual text. So, using the Internet Archive's Warctools and @tef's wonderful assistance, I was able to incorporate warcfilter into the derivative creation.

$ # Strip the warc down to records with a text content type
$ warcfilter -H text warc_file > filtered_file

You can view an example of the full-text searching of warcs in action here.

In addition to the full-text searching, I wanted to provide users with a quick overview of what is in a given capture, and was able to do so by also incorporating warcindex into the derivative creation.

$ # Write a plain-text inventory (one line per record) of the warc
$ warcindex warc_file > csv_file

#WARC filename offset warc-type warc-subject-uri warc-record-id content-type content-length
/extra/tmp/yul-113521_OBJ.warc 0 warcinfo None <urn:uuid:588604aa-4ade-4e94-b19a-291c6afa905e> application/warc-fields 514
/extra/tmp/yul-113521_OBJ.warc 797 response dns:yfile.news.yorku.ca <urn:uuid:cbeefcb0-dcd1-466e-9c07-5cd45eb84abb> text/dns 61
/extra/tmp/yul-113521_OBJ.warc 1110 response http://yfile.news.yorku.ca/robots.txt <urn:uuid:6a5d84d1-b548-41e4-a504-c9cf9acfcde7> application/http; msgtype=response 902
/extra/tmp/yul-113521_OBJ.warc 2366 request http://yfile.news.yorku.ca/robots.txt <urn:uuid:363da425-594e-4365-94fc-64c4bb24c897> application/http; msgtype=request 257
/extra/tmp/yul-113521_OBJ.warc 2952 metadata http://yfile.news.yorku.ca/robots.txt <urn:uuid:62ed261e-549d-45e8-9868-0da50c1e92c4> application/warc-fields 149

The updated Web ARChive SP datastreams now look like so:

Warc SP datastreams

One of my major goals with this project has been integration with a local running instance of Wayback, and it looks like we are pretty close. This solution might not be the cleanest, but at least it is a start, and hopefully it will get better over time. I've updated the default MODS form for the module so that it better reflects this Library of Congress example. The key item here is the 'url' element with the 'Archived site' attribute.

<location>
  <url displayLabel="Active site">http://yfile.news.yorku.ca/</url>
  <url displayLabel="Archived site">http://digital.library.yorku.ca/wayback/20131226/http://yfile.news.yorku.ca/</url>
</location> 

Wayback accounts for a date in its URL structure ('http://digital.library.yorku.ca/wayback/20131226/http://yfile.news.yorku.ca/'), and we can use that to link a given capture to its dissemination point in Wayback. Using some Islandora Solr magic, I should be able to give that link to a user on a given capture page.
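That "Solr magic" is mostly just retrieving the archived-site URL that gets indexed from the MODS datastream. A hypothetical sketch; the field name is made up, since the real one depends entirely on your FedoraGSearch/Solr mappings:

$ # Look up the "Archived site" URL indexed for a given object
$ # (mods_location_url_mt is a placeholder field name; substitute your own)
$ curl 'http://localhost:8983/solr/select?q=PID:"yul:113521"&fl=mods_location_url_mt&wt=json'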

We have automated this in our capture and preservation process: capturing warcs with Heritrix, and creating MODS datastreams and screenshots. This allows us to batch import our crawls quickly and efficiently.

Hopefully in the new year we'll have a much more elegant solution!

ruebot started following axfelix

September 26, 2013