Digital Preservation Tools and Islandora

Incorporating a suite of digital preservation tools into various Islandora workflows has been a long-term goal of mine and of a few other members of the community, and I'm really happy to see it becoming more and more of a priority.

A couple of years ago, I cut my teeth contributing to Islandora by creating a FITS plugin for the Drupal 6 version of Islandora. That tool was later expanded into a standalone module during the restructuring of the Drupal 7 codebase. The Drupal 7 version of Islandora, along with Tuque, has really opened the door for community contributions over the last year or so. Below is a list and description of Islandora modules with a particular focus on the preservation side of the repository platform.

Islandora Checksum

Islandora Checksum is a module that I developed with Adam Vessey (with special thanks to Jonathan Green and Jordan Dukart, who helped me grok Tuque), that allows repository managers to enable the creation of checksums for all datastreams on objects. If enabled, the repository administrator can choose from the default Fedora Commons checksum algorithms: MD5, SHA-1, SHA-256, SHA-384, and SHA-512.

This module is licensed under GPLv3, and is currently going through the Islandora Foundation's Licensed Software Acceptance Procedure. If successful, the module will be part of the next Islandora release.

Islandora Checksum admin

Islandora Checksum Checker

Islandora Checksum Checker is a module by Mark Jordan that extends Islandora Checksum by verifying "the checksums derived from Islandora object datastreams" and adding "a PREMIS 'fixity check' entry to the object's audit log for each datastream checked."

This module is also licensed under a GPLv3 license.

Islandora Checksum Checker admin

Islandora PREMIS

Islandora PREMIS is a module by Mark Jordan, Donald Moses, Paul Pound, and myself. The module produces XML and HTML representations of PREMIS metadata for objects in an Islandora repository on the fly. It currently documents all fixity checks performed on an object's datastreams, includes configurable 'agent' entries for an institution as well as for the Fedora Commons software, and maps the contents of each object's "rights" elements in the Dublin Core datastream to equivalent PREMIS "rightsExtension" elements. You can view an example here, along with the XML representation.
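For a sense of what a PREMIS fixity entry contains, here is a minimal sketch built with ElementTree. Element names follow the PREMIS 2.x data dictionary; the module's own serialization details may differ:

```python
import xml.etree.ElementTree as ET

PREMIS = "info:lc/xmlns/premis-v2"  # PREMIS 2.x namespace

def fixity_element(algorithm: str, digest: str) -> ET.Element:
    """Build the <fixity> block recorded for one datastream check."""
    fixity = ET.Element(f"{{{PREMIS}}}fixity")
    ET.SubElement(fixity, f"{{{PREMIS}}}messageDigestAlgorithm").text = algorithm
    ET.SubElement(fixity, f"{{{PREMIS}}}messageDigest").text = digest
    return fixity
```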

What we have implemented so far is just the basics, and we are always seeking feedback to make it better. If you're interested in the discussion or would like to provide feedback, feel free to follow along in the Islandora Google Group thread, and the Github issue queue for the project.

This module is also licensed under a GPLv3 license.

Islandora PREMIS admin

Islandora BagIt

Islandora BagIt is also a module by Mark Jordan (actually a fork of his Drupal module) that utilizes Scholars' Lab's BagItPHP, allowing repository administrators to create Bags of selected content. It currently offers a wide variety of configuration options for exporting content as Bags, as well as for creating Bags on ingest and/or when objects are modified. The way Mark has structured this module also allows developers to easily extend it by creating additional plugins, and it provides Drush integration.
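The Bag layout itself is simple enough to sketch. A minimal, hypothetical bagger follows; the module and BagItPHP do much more (tag manifests, fetch.txt, validation):

```python
import hashlib
from pathlib import Path

def make_bag(src: Path, bag: Path) -> None:
    """Copy src's files into bag/data/ and write the BagIt declaration
    plus an MD5 payload manifest, per the BagIt layout."""
    (bag / "data").mkdir(parents=True)
    manifest = []
    for f in sorted(src.iterdir()):
        payload = bag / "data" / f.name
        payload.write_bytes(f.read_bytes())
        digest = hashlib.md5(payload.read_bytes()).hexdigest()
        manifest.append(f"{digest}  data/{f.name}")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    (bag / "manifest-md5.txt").write_text("\n".join(manifest) + "\n")
```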

This module is also licensed under a GPLv3 license.

Islandora BagIt admin

Islandora Preservation Documentation

Documentation! One of the most important aspects of digital preservation.

This is not a full-blown module yet. It is currently the beginnings of a generic set of documentation that can be used by repository administrators. Eventually we hope to use a combination of Default Content/UUID Features and Features to provide a default bundle of preservation documentation in an Islandora installation.

The content in this Github repo comes from the documentation and policies we are creating at York University Library, which is derived from the wonderful documentation created by Scholars Portal during their successful ISO 16363 audit.

Islandora Web ARChive SP updates


Some pretty exciting stuff has been happening lately in the Islandora community. Earlier this year, Islandora began the transformation to a federally incorporated, community-driven, soliciting non-profit, making it, in my opinion, a much more sustainable project. Thanks to my organization joining as a member, I've been provided the opportunity to take part in the Roadmap Committee. Since I joined, we have been hard at work creating transparent policies and processes for software contributions, licenses, and resources. Big thanks to the Hydra community for providing great examples to work from!

I signed my first contributor licence agreement, and initiated the process for making the Web ARChive Solution Pack a canonical Islandora project, subject to the same release management and documentation processes as other Islandora modules. After working through the process, I'm happy to see that the Web ARChive Solution Pack is now a canonical Islandora project.

Project updates

I've been slowly picking off items from my initial todo list for the project, and have solved two big issues: indexing the warcs in Solr for full-text/keyword searching, and creating an index of each warc.

Solr indexing was very problematic at first. I ended up having a lot of trouble getting an XSLT to take the warc datastream and hand it to FedoraGSearch, and in turn to Solr. Frustrated, I began experimenting with newer versions of Solr, which thankfully come with Apache Tika bundled, allowing Solr to index basically whatever you throw at it.

I didn't think our users wanted to search the full markup of a warc file, just the actual text. So, using the Internet Archive's Warctools and @tef's wonderful assistance, I was able to incorporate warcfilter into the derivative creation.

$ warcfilter -H text warc_file > filtered_file

You can view an example of the full-text searching of warcs in action here.

In addition to the full-text searching, I wanted to provide users with a quick overview of what is in a given capture, and was able to do so by also incorporating warcindex into the derivative creation.

$ warcindex warc_file > csv_file

#WARC filename offset warc-type warc-subject-uri warc-record-id content-type content-length
/extra/tmp/yul-113521_OBJ.warc 0 warcinfo None <urn:uuid:588604aa-4ade-4e94-b19a-291c6afa905e> application/warc-fields 514
/extra/tmp/yul-113521_OBJ.warc 797 response <urn:uuid:cbeefcb0-dcd1-466e-9c07-5cd45eb84abb> text/dns 61
/extra/tmp/yul-113521_OBJ.warc 1110 response <urn:uuid:6a5d84d1-b548-41e4-a504-c9cf9acfcde7> application/http; msgtype=response 902
/extra/tmp/yul-113521_OBJ.warc 2366 request <urn:uuid:363da425-594e-4365-94fc-64c4bb24c897> application/http; msgtype=request 257
/extra/tmp/yul-113521_OBJ.warc 2952 metadata <urn:uuid:62ed261e-549d-45e8-9868-0da50c1e92c4> application/warc-fields 149

The updated Web ARChive SP datastreams now look like so:

Warc SP datastreams

One of my major goals with this project has been integration with a local running instance of Wayback, and it looks like we are pretty close. This solution might not be the cleanest, but at least it is a start, and hopefully it will get better over time. I've updated the default MODS form for the module so that it better reflects this Library of Congress example. The key item here is the 'url' element with the 'Archived site' attribute.

  <url displayLabel="Active site"></url>
  <url displayLabel="Archived site"></url>

Wayback accounts for a date in its URL structure, and we can use that to link a given capture to its given dissemination point in Wayback. Using some Islandora Solr magic, I should be able to give that link to a user on a given capture page.
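Since Wayback replay URLs embed a 14-digit timestamp, building the link is mostly string assembly. A sketch, assuming a hypothetical local Wayback instance at wayback.example.org and a capture date pulled from the MODS record:

```python
from datetime import datetime

WAYBACK_BASE = "http://wayback.example.org/wayback"  # hypothetical local instance

def wayback_url(capture_date: datetime, original_url: str) -> str:
    """Build a replay link from Wayback's YYYYMMDDhhmmss path segment."""
    return f"{WAYBACK_BASE}/{capture_date:%Y%m%d%H%M%S}/{original_url}"
```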

We have automated this in our capture-and-preserve process: capturing warcs with Heritrix, creating MODS datastreams, and taking screenshots. This allows us to batch import our crawls quickly and efficiently.

Hopefully in the new year we'll have a much more elegant solution!

The Islandora Web ARChive Solution Pack - Open Repositories 2013

Below is the text and slides of my presentation on the Web ARChive solution pack at Open Repositories 2013.

I have a really short amount of time to talk here. So, I am going to focus on the how and why for this solution pack and kinda put it in context of the Web Archiving Life Cycle Model proposed by the Internet Archive earlier this year. Maybe I shouldn't have proposed a 7 minute talk!

Context! Almost a year ago, I was in a meeting and was presented with this problem. YFile, a daily university newspaper -- it was previously a paper, now a website -- had been taken over by marketing a while back, and they deleted all their back content. They are an official university publication, so an official university record, and will eventually end up in the archives, so it will eventually be our problem; the library's problem. Plainly put, we live in a reality where official records are born and disseminated via the Internet. Many institutions have a strategy in place for transferring official university records that are print or tactile to university archives, but not much exists strategy-wise for websites. So, I naively decided to tackle it.

I tend to just do things. I don't ask permission. I apologize later if I have to. Like maybe taking down the YFile server during the first few initial crawls. If I make mistakes, that is good; I am learning something! What I am doing isn't new, but then again it kinda is. It is a really weird place. I need to crawl a website every day. The Internet Archive crawler comes around whenever it does. There is no way to give the Internet Archive/Wayback Machine a whole bunch of warc files, and I'm not ready to pay for Archive-It.

That won't work for me at all when I have some idea how to do it all myself. So, what is the problem? I need to capture and preserve a website every day. I want to provide the best material to a researcher. I want to keep a fine eye on preservation, but not be a digital pack rat, and I need to constantly keep the librarian and archivist in me pleased, which always seems to come down to the item vs. collection debate and which of those gets the most attention.

How easy is it to grab a website? Pretty damn easy if you're using at least wget 1.14, which has warc support.
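A sketch of the invocation, assembled in Python for clarity (the flags come from wget's manual; the URL and filename prefix are placeholders):

```python
from datetime import date

def wget_warc_command(url: str, prefix: str) -> list:
    """Assemble the argv for a wget (>= 1.14) crawl that writes a WARC."""
    return [
        "wget",
        "--mirror",            # recursive, timestamped crawl
        "--page-requisites",   # pull in CSS, JS, and images as well
        f"--warc-file={prefix}-{date.today():%Y%m%d}",  # WARC output
        url,
    ]
```

Hand the resulting list to subprocess.run() to actually execute the crawl.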

How many people here know what a warc is? Warc stands for web archive. It is an ISO standard. It is basically a file -- one that can get massive very quickly -- that aggregates the raw resources you request into a single file, along with crawl metadata and checksums. PROVENANCE!

This is what the beginning of a warc file looks like.


And here is a selection from the actual archive portion. That is my brief crash course on warc. We can talk about it more later if you have questions. I need to keep moving along.

So, warcs are a little weird to deal with on their own. You can disseminate them with the Wayback Machine, and I assume nobody but a few people on this planet want to see a page full of just warc files. Building something browsable takes a little bit more work. So, I decided to snag a pdf and screenshot of the front page of the site that I am grabbing with wkhtmltopdf and wkhtmltoimage. Then I toss this all in a single bash script, and give it to cron.

So this is what I have come up with. This is how I capture and preserve a website. The pdf/image + xvfb idea came from Peter Binkley. X virtual framebuffer is an X11 server that performs all graphical operations in memory, without showing any screen output.
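The nightly cron job boils down to three commands. A hedged sketch follows: the site URL and output names are hypothetical, and this gestures at the real script rather than reproducing it:

```python
SITE = "http://yfile.news.yorku.ca/"  # hypothetical crawl target

def nightly_commands(stamp: str) -> list:
    """The three steps of the nightly capture: WARC, PDF, screenshot.
    xvfb-run wraps the wkhtmltox tools in an in-memory X server."""
    return [
        ["wget", "--mirror", f"--warc-file=yfile-{stamp}", SITE],
        ["xvfb-run", "wkhtmltopdf", SITE, f"yfile-{stamp}.pdf"],
        ["xvfb-run", "wkhtmltoimage", SITE, f"yfile-{stamp}.png"],
    ]
```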

I've been running that script on cron since last October. Now what? Like I said before, nobody wants to see a page full of warc files. So, I started working with the tools and platforms that I know -- in this case, Drupal, Islandora, and Fedora Commons -- and created a solution pack. A solution pack, in Islandora parlance, is a Drupal module that integrates with the main Islandora module and the Tuque API to deposit, create derivatives for, and interact with a given type of object. So, we have solution packs for Audio, Video, Large Images, Images, PDFs, and paged content.

What does it do? It adds all the required Fedora objects to allow users to ingest, create derivatives for, and retrieve web archives through the Islandora interface. So we have Content Models, Data Stream Composite Models, forms, and collection policies. The current iteration of the module allows one to batch ingest a bunch of objects for a given collection, and it will create all of the derivatives (thumbnail and display image), and index any provided descriptive metadata in Solr, as well as the actual WARC file since it is mostly text. The WARC indexing is still pretty experimental; it works, but I don't know how useful it is.

If you want to check out a live demo, and poke around while I am rambling on here, check this site out.

Collection (in terms of the web archiving life-cycle model). This is an object from the Islandora basic collection solution pack.

Seed (in terms of the web archiving life-cycle model). This is an object from the Islandora basic collection solution pack.

Document (in terms of the web archiving life-cycle model). This is an object from the Islandora Web ARChive solution pack.

Here is what my object looks like. The primary archival object is the WARC file, then we have our associated datastreams: PDF (from the crawl), MODS/DC (descriptive metadata), screenshot (from the crawl), FITS (technical metadata), and thumbnail & medium JPG (derivative display images).

Todo! What I am still working on when I have time.

I want to tie in the Internet Archive's Wayback Machine for playback/dissemination of WARCs. I haven't quite wrapped my head around how best to do the Wayback integration, but I am thinking of using the date field value in the MODS record for an individual crawl.

I'm also thinking of incorporating WARC tools into this solution pack. This would be for quick summaries and maybe a little analysis. Think of how R is incorporated into Dataverse, if you are familiar with that.

I am also working on integrating my silly little bash scripts into the solution pack. That way one could do the whole process of crawling, dissemination, and preservation in one fell swoop, with a single click when ingesting an object in Islandora.

Finally, there is a hell of a lot of metadata in each of these warc files begging for something to be done with them. I haven't figured out a way to parse them in an abstract repeatable way, but if I or somebody else does, it will be great!

Calling out nonsense - John Degen

This post by John Degen looks like F.U.D., Fear, Uncertainty, and Doubt. If it doesn’t, please tell me why. The thing with F.U.D. is that there are generally misconceptions that lead to false conclusions, and that is what I am seeing in the post by Mr. Degen.

Mr. Degen, I respect the position you are in. Like me, you are standing up for a set of values, ethics, and rights for your profession. This is not a black and white issue. There are grey areas where we overlap, and that is where agreement or conflict can exist. In this case, we have a lot of conflict. But we have some stark lines drawn for us with Bill C-11 and the recent Supreme Court rulings. Simply put, rights around fair dealing and educational use have expanded.

Now, the misconceptions.

Misconception one. I am not the great dread pirate Blackbeard of librarianship. In no way have I, nor the Ontario Library and Information Technology Association, said that creators should not be compensated. Yes, the resolution language is ugly Robert’s Rules of Order legalese. That is what it has to be for the setting of an annual general meeting. Do I wish it were plain, simple, beautiful prose? Yes.

WHEREAS there exists model license agreements between Access Copyright and the Association of Universities and Colleges of Canada (AUCC) and between Access Copyright and the Association of Canadian Community Colleges (ACCC), and

WHEREAS there exist agreements between Access Copyright and the University of Toronto and between Access Copyright and the University of Western Ontario, and

WHEREAS the Canadian Association of University Teachers (CAUT), the British Columbia Library Association (BCLA), the Atlantic Provinces Library Association (APLA), the Manitoba Library Association (MLA), the Newfoundland Labrador Library Association (NLLA), the Progressive Librarians’ Guild (PLG) as well as many leading copyright scholars in Canada have taken strong positions against the Access Copyright licenses, and

WHEREAS the addition of “education” to the fair dealing categories, and the broad support for fair dealing in the Supreme Court’s pentalogy rulings of July 2012 provide further support for the position that the Access Copyright license does not provide any additional value to institutions beyond their existing rights, and

WHEREAS the fee structure is inequitable to students on whom the costs are imposed, and

WHEREAS several provisions in the license agreements limit the use of emerging technologies and increase the potential for monitoring and surveillance,

BE IT RESOLVED THAT the Ontario Library and Information Technology Association (OLITA):

  1. Stands opposed to the Access Copyright license agreements as they currently stand, including the AUCC and ACCC Model Licenses and the separate licenses with the University of Toronto and the University of Western Ontario,
  2. Urges Canadian post-secondary institutions not to enter into this licensing agreement,
  3. Encourages those who have already signed to exercise their termination options as soon as possible, and
  4. Recommends that institutions move toward the construction of systems of knowledge creation and sharing based on fair dealing, open access, site licensing as well as transactional licenses where they are needed.

The WHEREAS clauses provide the context, setting, or a lens with respect to the resolution. The resolution, I believe, is fairly explicit. OLITA, “stands opposed to the Access Copyright license agreements as they currently stand...” OLITA did not say, “Access Copyright is Cthulhu. It should be banished from this dimension, and no creator should ever be compensated.” We have an issue with those specific model licenses, and agreements. We are not the first to raise this issue. The Canadian Association of University Teachers, the Atlantic Provinces Library Association, the Newfoundland and Labrador Library Association, the Manitoba Library Association, the BC Library Association, the McMaster University Academic Librarians’ Association, the Progressive Librarians Guild Toronto Area Chapter, and many leading copyright scholars in Canada have all spoken out in opposition to these model agreements and licenses. OLITA isn’t even the first association to oppose or condemn the model licenses and agreements by way of a resolution. CAUT did so last spring, as did the BC Library Association and the McMaster University Academic Librarians’ Association, and believe it or not, the Ontario College and University Library Association. Furthermore, OCULA passed the exact same resolution as OLITA one day prior. Neither Mr. Degen nor Access Copyright seemed to notice this at all, from what I can tell via recent public communications by both parties.

Misconception two. The “dialogue”. I have tried my best to be as transparent as possible. Mr. Degen and Access Copyright seem to be misrepresenting the narrative (“a strategic attempt to influence perception by disseminating negative and dubious or false information”). Mr. Degen and Access Copyright both refer to the letter I referenced in the previous post, and seemingly lead one to believe that the transmission of that letter is the end of the story. That Access Copyright tried to engage in an open dialogue with myself and OLITA, and both I and OLITA refused a dialogue. As I showed in my previous post, I welcomed participation in the process at the AGM for those Access Copyright board members, directors, employees, etc., who are OLITA members. From what I understand, we do have members of OLITA that are affiliated with Access Copyright, so there was every opportunity to participate. Moreover, the Access Copyright Executive Director followed up to my response saying, "We don't see how this can properly take place at your AGM. Would you consider delaying the motion until we have the opportunity to meet and begin a dialogue?" That, in my opinion, is attempting to circumvent a democratic process. The executive director asked that I pull a resolution. There is no right or standing to ask such a thing. As for the following statement, “We don’t see how this can properly take place at your AGM.” Really? Resolutions are a normal part of AGMs. A member has every right to submit a resolution. If it is moved and seconded, then it moves to the agenda for the meeting. So, yes, it can properly take place. To think otherwise is silly.

Finally, if you want to talk let’s talk. Yesterday wasn’t an example of a constructive dialogue. In fact it got really unconstructive. Mr. Degen, I would like to personally apologise if there was any offense taken from any of my actions, and would also like to apologise on behalf of my colleagues. As I ended my previous post, if Access Copyright or Mr. Degen would like to open a dialogue about why these resolutions were unanimously passed, now is the time to do so.

Calling out nonsense - Access Copyright

Inspired by some fearless leaders in our community, this is my Access Copyright story.

This past week, a very interesting series of events unfolded with Access Copyright -- or maybe better said, what unfolded was a lesson in how not to engage in open dialogue. I will not be speaking to the text of the resolutions mentioned below, just the events surrounding them.

Last week was the annual Ontario Library Association Super Conference. During Super Conference, each OLA Division has their Annual General Meeting. Among other things, AGMs provide opportunities for resolutions to be put forward by the membership. At this particular AGM, we had two resolutions put forward: 1) A Memorial Resolution Honouring Aaron Swartz (Thanks ALA/LITA!) and 2) OLITA Resolution on Opposition to Access Copyright License Agreements. Standard procedures were followed, the resolutions were moved and seconded, and sent out to the membership in advance.

This past Monday (2013-01-28) something happened. I received an email from Robert Gilbert, New Media and Communication Services, at Access Copyright. This is what they had to say. I will explain why I am making this letter public later.

I addressed the incorrect information in the letter in a reply to the sender, and cc'd recipients:

Dear Mr. Gilbert,

I'd like to clear up some confusion with the resolution. The posted resolution[1] which I assume you have seen or been directed to is a proposed resolution for the Ontario Library and Information Technology Association's (OLITA) Annual General Meeting[2]. It was sent out in advance to membership.

The resolution has been moved and seconded, and will be put before the membership at the meeting for a vote. Prior to the vote, an opportunity will be provided to speak to the motion, ask questions, and propose amendments. If you are or your colleagues are OLITA members, you are more than welcome to participate.



[1] [2]

A day went by, and other than an out-of-office reply, I didn't hear anything in response. I figured we were done.


On Wednesday evening (2013-01-30), I received an email from the Executive Director of Access Copyright. I will not publish the entire email, but I was asked to delay the motion, "We don't see how this can properly take place at your AGM. Would you consider delaying the motion until we have the opportunity to meet and begin a dialogue?"

I responded:


I due[sic] hope you understand the weight and merit of what you are asking. You are asking that I forgo a democratic process. This is a resolution that was put forward by a member of our association, and will be discussed as [sic] voted on at our AGM.

As I stated previously, if you or any of your colleagues are OLITA members, you are more then welcome to come and part take in this democratic process. You will be provided every opportunity to speak to the resolution on the table.

Other than that, I will in no way interfere this process as you have suggested.



I had hoped this was the end of the exchange.


The OLITA AGM was Friday evening (2013-02-01). Access Copyright was present at the conference as they had a booth in the exhibitors' hall. During the day, a colleague of mine showed me the letter I mentioned earlier. Somewhat (well really a lot) flabbergasted, I asked where and how they got a copy, assuming the only people to see the aforementioned letter were those that sent it, and those that received it. Nope. Access Copyright decided the best way to engage in an open "dialogue" with me, our association and/or our community was to print off a stack of these letters (in a very classy paper stock!) to hand out at their exhibitor booth.

I fully appreciate, and can understand the rationale behind trying to open up a dialogue. However, Access Copyright tried to circumvent a democratic process, refused to engage in a public dialogue, and tried to misrepresent and embarrass OLITA on the exhibitors’ floor. I find these intimidation tactics unacceptable.

We played fair. We brought no mention of Access Copyright's behaviour to the assembly floor. The resolution went forward with a single friendly amendment, and was passed unanimously. The OLITA membership has spoken. If Access Copyright would like to open a dialogue about why these resolutions were unanimously passed, now is the time to do so.

Islandora Web ARChive Solution Pack

What is it?

The Islandora Web ARChive Solution Pack is yet another Islandora Solution Pack. This particular solution pack provides the necessary Fedora objects for persisting and disseminating web archive objects: warc files.

What does it do?

Currently, the SP allows a user to upload a warc with an associated MODS form. Once the object is deposited, the associated metadata is displayed along with a download link to the warc file.

You can check out an example here.

Can I get the code?

Of course!


If I am doing something obviously wrong, please let me know!

Immediate term:

  1. Incorporate Wayback integration for the DIP. I think this is the best disseminator for the warc files. However, I haven't wrapped my head around how to programmatically provide access to the warc files in Wayback. I know that I will have two warc objects, an AIP warc and a DIP warc (big thank you to @sbmarks for being a sounding board today!). Fedora will manage the AIP, and Wayback will manage the DIP. Do I iFrame the Wayback URI for the object, or link out to it?

  2. Drupal 7 module. Drupal 7 versions of Islandora Solution Packs should be on their way shortly -- next release, I believe. The caveat to using the Drupal 6 version of this module is the mimetype support. It looks like the Drupal 6 API (file_get_mimetype) doesn't pull the correct mimetype for warc files. I should get 'application/warc', but I am getting 'application/octet-stream' -- the fallback default for the API.
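Python's stdlib hits the same wall, and the fix is the same in spirit: register the mapping yourself. A sketch (Drupal's own fix would go in its mimetype mapping instead):

```python
import mimetypes

# Without registration, .warc falls through to the octet-stream default,
# the same behaviour as Drupal 6's file_get_mimetype().
mimetypes.add_type("application/warc", ".warc")

def warc_mimetype(filename: str) -> str:
    """Guess a mimetype, with the same fallback Drupal uses."""
    guessed, _ = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"
```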

Long term:

  1. Incorporate Islandora microservices. What I would really like to do is allow users to automate this entire process. Basically, just say this is a site I would like to archive. This is the frequency at which I would like it archived, with necessary wget options. This is the default metadata profile for it. Then grab the site, ingest it into Fedora, drop the DIP warc into Wayback, and make it all available.

  2. If you have any idea on how to do the above, or how to do it in a better manner, please let me know!

DPLA Appfest Drupal integration

Below is the output of the little project I worked on today at the DPLA Appfest. It definitely isn't a perfect solution to the problem. It is not a drop-in module to just grab a collection from the DPLA API and "curate" it in your library's Drupal site. I hate reinventing the wheel, especially if there are existing modules that can solve the problem for you. Moreover, as one of the few people that still respects what OAI-PMH does, it would be worth considering using DPLA as an OAI-PMH provider. But, I'm not sure if that is technically legal in OAI-PMH terms, given that they are most likely harvesting it via OAI-PMH themselves. Don't want to get into an infinite regress of metadata providers. S'up dawg? All jokes aside, I think OAI-PMH would be a better solution than what I tossed together, because it would make harvesting a "set" a hell of a lot easier. My 2¢.

I also have a live demo of it living on my EC2 instance. I've ingested 2000 items from the API, and decided to throw them into a Solr index just to demonstrate the possibilities of what you can do with the ingested content.

Finally, a big giant thank you to DPLA and Chattanooga Public Library for putting this on, and for the wonderful hospitality. This was absolutely fantastic!


Drupal module or distribution

Your Name: Nate Hill

Type of app: Drupal CMS

Description of App: Many, many libraries choose to use Drupal as their content management system or as their application development framework. A contrib Drupal module that creates a simple interface for admin users to curate collections of DPLA content for display on a library website would be useful.



I don't like recreating the wheel. So, let's see what contrib modules already exist, and see if we can just create a workflow to do this to start with. It would be really nice if DPLA had an OAI-PMH provider; then you could just use CCK + Feeds + Feeds OAI-PMH.



  • CCK

    drush pm-download cck

  • Feeds

    drush pm-download feeds

  • Feeds - JSON Parser

    drush pm-download feeds_jsonpath_parser
    cd sites/all/modules/feeds_jsonpath_parser && wget


  • Create a Content Type for the DPLA content you would like to pull in (admin/content/types/add)
  • Create DPLA metadata fields for the Content Type (admin/content/node-type/YOURCONTENTYPE/fields)
  • Create a new feed importer (admin/build/feeds/create)
  • Configure the settings for your new feed importer
    • Basic settings:
    • Select the Content Type you would like to import into
    • Select a frequency you would like Feeds to ingest
    • Fetcher
    • HTTP Fetcher
    • Processor
    • Node processor
    • Select the Content Type you created
    • Mappings (create a mapping for each metadata field you created)
      • Source : jsonpath_parser:0
      • Target : Title
    • Parser
    • JSONPath Parser
    • Settings for JSONPath parser
      • Context: $.docs.*
  • Construct a search you would like to ingest using the DPLA API
    • ex:
  • Start the import! (node/add/YOURCONTENTTYPE)
  • Give the import a title... whatever your heart desires.
  • Add a feed url
  • Click on JSONPath Parser settings, and start adding all of the JSONPaths
  • Click save, and watch the import go.
  • Check out your results
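To see what the JSONPath context `$.docs.*` hands to Feeds, here is a toy walkthrough over a trimmed, made-up DPLA-style response (real responses carry many more fields per doc):

```python
import json

# Hypothetical, trimmed response in the shape the importer expects.
raw = """
{
  "count": 2,
  "docs": [
    {"title": "First item", "creator": "Someone"},
    {"title": "Second item", "creator": "Someone else"}
  ]
}
"""
response = json.loads(raw)

# The context "$.docs.*" selects each element of the docs array; each
# mapping (e.g. jsonpath_parser:0 -> Title) then reads one field per item.
items = response["docs"]
titles = [doc["title"] for doc in items]
```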

York University Libraries Open Access Week 2012 - #blogvsbook

Yesterday, York University Libraries held a debate in the Scott Library entitled, "Be it resolved the blog replace the book?" The debate turned out pretty awesome, and somehow the team arguing for the book won!? (Some might say it was because of @adr's compelling closing statements.) 

Along with livestreaming the debate on ustream, I pulled together (a special thanks to Ed Summers, and his very permissive licensing) a little node.js application to display a "twitterfall" of the hashtag for the event. As is always the case, technology is bound to fail, somehow, someway, at a live event. Turns out that we owe a very special thank you to the giant Amazon outage, which in turn took out Heroku's infrastructure. Good thing my paranoia urged me to use a backup application to snag the archive for the stream, with all of the variations on the hashtag.

Enough about the debate, and Amazon's large internet burp! What I really want to talk about is some fun ways to play with the data we collected from the Twitter API. The backup application I mentioned earlier has some nice visualizations incorporated in it, and it is a pretty slick and simple application to use. But, most importantly, I have a csv (deposited in the OCUL Dataverse site) of all the tweets, for all the hashtag variations I could figure out. Which means we (yes, you! Download the csv and have fun with this too!) can start doing some cool visualizations.

Inspired by @Sarah0s' "Dead easy data visualization for libraries" talk at AccessYUL, I decided to see how easy it would be to toss together a visualization of the number of tweets per user.

This is a fairly basic and easy one to make. You only need two columns: twitter usernames, and corresponding number of tweets. Once you have those entered, just hit publish, and you're good to go. 
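Getting from the raw tweet archive to those two columns is a few lines of Python with `collections.Counter`. A quick sketch, assuming a csv with a `username` column (the real export's column names may differ):

```python
import csv
import io
from collections import Counter

# Stand-in for the exported tweet archive csv; real column names may vary.
raw = io.StringIO(
    "username,text\n"
    "alice,#blogvsbook books forever\n"
    "bob,#blogvsbook blogs!\n"
    "alice,#blogvsbook closing statements\n"
)

# Tally tweets per user: exactly the two columns the chart needs.
counts = Counter(row["username"] for row in csv.DictReader(raw))

for user, n in counts.most_common():
    print(user, n)
```

Swap the `StringIO` for `open("tweets.csv")` and you have the table ready to paste into a chart.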

So, that is something quick and easy. I have "Designing Data Visualizations" on the way. Hopefully that inspires me a bit more, and maybe I'll start playing with d3js again. It should be fairly straightforward to drop the csv into Google Refine and get some json back. In the interim, I'll just leave it up to Bill Denton to show us some really cool stuff with the data in R.

iaTorrent update OR Learning by reading code

Last week, inspired by a meeting, I started tossing together a little Python program to solve a problem. It wasn't perfect. It was warty. But I think I have something worthwhile now. Or, at least useful for me: it gives you what you want, and writes to a log when something goes wrong.

What I really want to do here is just take a moment to sing the praises of learning by reading code. Heading into this little project, I had a basic idea of what I wanted to do, and I knew something like this could be done given Tim's project. I knew that I wanted to make this a module and set it up on PyPI, but I really had no idea how to do so. But! I knew of somebody who did, and who is quite prolific in my mind. Ed making his code available on GitHub (and using very open licenses) made it possible for me to learn how to structure a Python module, how to lay out tests, and how to use argparse/optparse correctly.
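As a taste of what that reading taught me, here is roughly the argparse scaffolding involved (a minimal sketch with placeholder argument names and a made-up example url, not the module's actual interface):

```python
import argparse

def build_parser():
    """Minimal argparse setup of the sort one picks up by reading
    well-structured projects on GitHub."""
    parser = argparse.ArgumentParser(
        description="fetch torrents for an Internet Archive collection"
    )
    parser.add_argument("url", help="advanced search json url for the collection")
    parser.add_argument("directory", help="where to save the .torrent files")
    return parser

# Parsing an explicit argv list makes the parser easy to test.
args = build_parser().parse_args(
    ["http://example.org/search.json", "/tmp/ia-torrent"]
)
print(args.directory)
```

Keeping the parser in its own function is one of those small structural habits that only clicks once you see it in someone else's code.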

So, here is to learning by reading code!

IA Torrent

Yesterday in a meeting for our Digital Initiatives Advisory Group we were discussing what collections we should consider sending over to the UofT Internet Archive shop, and I asked an innocent newbie question - So, do we have copies of everything we have had the Internet Archive digitize?


No big deal. We're in the infant stages of creating a digital preservation program here, and everything that comes with it. INFRASTRUCTURE!

I knew Tim Ribaric over at Brock University wrote an Internet Archive scraper a while back, so I knew it would be possible to get our content if need be. That, combined with the Internet Archive's announcement a little over a month ago about making torrents available for items in the Internet Archive, inspired me to whip together a Python script to grab all the torrents for a given collection.

Last night I threw together a little proof-of-concept grabbing the RSS feed on the York University Libraries Internet Archive page using BeautifulSoup and some ugly regex.
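The idea of that proof-of-concept can be sketched without the ugly regex, using the stdlib's `xml.etree` in place of BeautifulSoup (the feed below is a trimmed stand-in with made-up item names, not the real RSS):

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the collection's RSS feed.
rss = """<rss version="2.0"><channel>
  <item><link>http://archive.org/details/exampleitem1</link></item>
  <item><link>http://archive.org/details/exampleitem2</link></item>
</channel></rss>"""

# Each item's identifier is the last path segment of its link.
identifiers = [
    item.findtext("link").rsplit("/", 1)[-1]
    for item in ET.fromstring(rss).iter("item")
]
print(identifiers)
```

In the real script the feed would be fetched over HTTP first; the parsing step is the same.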

This morning, still inspired and brainstorming with Dan Richert, I started poking around for different ways to get at our collection. The Internet Archive's advanced search is super helpful for this, and I can get the results as json! So, no regex; as Dan told me, "if you solve a problem with regex, you now have two problems."

On the advanced search page, you will need your query parameters. You can grab those from the 'All items (most recently added first)' link on a collection page. For example, the York University Libraries collection query parameters:

(collection:yorkuniversity AND format:pdf) AND -mediatype:collection

Then select your desired output format and number of results (2608 for me, given the number of items in the collection). You end up with some json like this:

         "qin":"(collection:yorkuniversity AND format:pdf) AND -mediatype:collection",
         "q":"( collection:yorkuniversity AND format:pdf ) AND -mediatype:collection;",
            "title":"Revised statutes of Ontario, 1990 = Lois refondues de l'Ontario de 1990",
            "title":"Essai philosophique concernant l'entendement humain : ou l'on montre quelle est l'etendue de nos connoissances certaines, et la maniere dont nous y parvenons",
            "title":"Essai philosophique concernant l'entendement humain : où l'on montre quelle est l'étendue de nos connoissances certaines, et la manière dont nous y parvenons",
            "title":"Essai philosophique concernant l'entendement humain, : ou l'on montre quelle est l'etendue de nos connoissances certaines, et la maniere dont nous y parvenons.",

(Make sure you lop off '&callback=callback&save=yes' at the end of the url.) Once you have the url for the json, it is pretty straightforward from there. You just call the script with the json url and an output directory, like so: '' '/tmp/ia-torrent'
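From the json, the script's job reduces to building one torrent url per identifier. A sketch of that step, assuming the usual /download/{identifier}/{identifier}_archive.torrent pattern for Internet Archive torrents (the docs below are made-up stand-ins):

```python
import json

# Trimmed stand-in for the advanced search json; "identifier" is the
# field the torrent url is built from.
response = json.loads("""
{"response": {"docs": [
  {"identifier": "exampleitem1"},
  {"identifier": "exampleitem2"}
]}}
""")

# Each item's torrent lives at a predictable /download/ path.
torrent_urls = [
    "http://archive.org/download/{0}/{0}_archive.torrent".format(doc["identifier"])
    for doc in response["response"]["docs"]
]
for url in torrent_urls:
    print(url)
```

The real script then fetches each of those urls and writes the .torrent files into the output directory.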

Caveats! I haven't been able to download all the torrents for an entire collection yet. It looks like the Internet Archive's servers don't like the number of requests, and the script dies with:

'IOError: [Errno socket error] [Errno 111] Connection refused'

I've tried throttling the script to one request every 15 seconds, and I still get cut off. If anybody knows whether the Internet Archive has published request-rate limits, or has a better idea for implementing this, please let me know! Add a comment, or fork + clone + pull request. Patches are most welcome!
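One hedge against those refused connections would be retrying with a growing delay instead of a fixed throttle. A sketch of the idea, with a fake fetch function standing in for the real download call:

```python
import time

def fetch_with_backoff(fetch, retries=5, delay=15):
    """Retry a download, doubling the wait after each refused connection.
    (A sketch; the real script would pass its urllib call as `fetch`.)"""
    for attempt in range(retries):
        try:
            return fetch()
        except IOError:
            if attempt == retries - 1:
                raise  # out of retries: let the caller log the failure
            time.sleep(delay)
            delay *= 2

# Fake fetch that is refused twice before succeeding, to show the flow.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("Connection refused")
    return "torrent data"

print(fetch_with_backoff(flaky, delay=0))
```

No guarantee this gets past whatever limit the Internet Archive enforces, but backoff tends to be friendlier to servers than a constant request rate.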

Big thank you to Dan Richert for the impromptu crash course on parsing json this morning!!!