IA Torrent

Yesterday in a meeting for our Digital Initiatives Advisory Group we were discussing what collections we should consider sending over to the UofT Internet Archive shop, and I asked an innocent newbie question - So, do we have copies of everything we have had the Internet Archive digitize?

NOPE.

No big deal. We're in the infant stages of creating a digital preservation program here, and everything that comes with it. INFRASTRUCTURE!

I knew Tim Ribaric over at Brock University wrote an Internet Archive scraper a while back, so I knew it would be possible to get our content if need be. That, combined with the Internet Archive announcement a little over a month ago about making torrents available for items in the Internet Archive, inspired me to whip together a Python script to grab all the torrents for a given collection.

Last night I threw together a little proof-of-concept grabbing the RSS feed on the York University Libraries Internet Archive page using BeautifulSoup and some ugly regex.

This morning, still inspired and brainstorming with Dan Richert, I started poking around for different ways to get at our collection. The Internet Archive's advanced search is super helpful for this, and I can get the results as json! So, no regex; as Dan told me, "if you solve a problem with regex, you now have two problems."

On the advanced search page, you will need your query parameters. You can grab those from the 'All items (most recently added first)' link on a collection page. For example, the York University Libraries collection query parameters:

(collection:yorkuniversity AND format:pdf) AND -mediatype:collection

Then select your desired output format and number of results - 2608 for me, given the number of items in the collection. You end up with some json like this:

{
   "responseHeader":{
      "status":0,
      "QTime":1,
      "params":{
         "json.wrf":"",
         "qin":"(collection:yorkuniversity AND format:pdf) AND -mediatype:collection",
         "fl":"identifier,title",
         "indent":"",
         "start":"0",
         "q":"( collection:yorkuniversity AND format:pdf ) AND -mediatype:collection;",
         "wt":"json",
         "rows":"5"
      }
   },
   "response":{
      "numFound":2608,
      "start":0,
      "docs":[
         {
            "title":"Saint-Pétersbourg",
            "identifier":"saintptersboyork00rauoft"
         },
         {
            "title":"Revised statutes of Ontario, 1990 = Lois refondues de l'Ontario de 1990",
            "identifier":"v4revisedstat1990ontauoft"
         },
         {
            "title":"Essai philosophique concernant l'entendement humain : ou l'on montre quelle est l'etendue de nos connoissances certaines, et la maniere dont nous y parvenons",
            "identifier":"1714essaiphiloso00lockuoft"
         },
         {
            "title":"Essai philosophique concernant l'entendement humain : où l'on montre quelle est l'étendue de nos connoissances certaines, et la manière dont nous y parvenons",
            "identifier":"1729essaiphiloso00lockuoft"
         },
         {
            "title":"Essai philosophique concernant l'entendement humain, : ou l'on montre quelle est l'etendue de nos connoissances certaines, et la maniere dont nous y parvenons.",
            "identifier":"1735essaiphiloso00lockuoft"
         }
      ]
   }
}

(make sure you lop off '&callback=callback&save=yes' at the end of the url). Once you have the url for the json, it is pretty straightforward from there. You just call the script like so:

ia-torrent.py 'http://archive.org/advancedsearch.php?q=%28collection%3Ayorkuniversity+AND+format%3Apdf%29+AND+-mediatype%3Acollection&fl%5B%5D=identifier&fl%5B%5D=title&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=2608&page=1&output=json' '/tmp/ia-torrent'
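
The script itself isn't much more than that: fetch the json, loop over the identifiers, and download each item's torrent file. Here is a minimal sketch of the general idea - not the actual ia-torrent.py, and the <identifier>_archive.torrent naming is an assumption on my part based on the torrent announcement:

#!/usr/bin/env python
# Minimal sketch of the idea only -- not the real ia-torrent.py.
# Assumes each item's torrent lives at:
#   http://archive.org/download/<identifier>/<identifier>_archive.torrent

import json
import os
import sys
import time
import urllib

def main(search_url, dest_dir):
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    # the advanced search url already asks for json output (output=json)
    results = json.loads(urllib.urlopen(search_url).read())
    for doc in results['response']['docs']:
        identifier = doc['identifier']
        torrent_url = 'http://archive.org/download/%s/%s_archive.torrent' % (identifier, identifier)
        dest = os.path.join(dest_dir, identifier + '.torrent')
        try:
            urllib.urlretrieve(torrent_url, dest)
            print 'grabbed %s' % identifier
        except IOError as e:
            print 'failed on %s: %s' % (identifier, e)
        time.sleep(15)  # crude throttle -- apparently still not enough

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])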

Caveats! I haven't been able to download all the torrents for an entire collection yet. Looks like Internet Archive's servers don't like the number of requests, and the script dies out with:

'IOError: [Errno socket error] [Errno 111] Connection refused'

I've tried throttling myself in the script at 15 seconds per request, and still get cut off. If anybody knows whether the Internet Archive has any published request rates, or has a better idea for implementing this, please let me know! Add a comment, or fork + clone + pull request. Patches are most welcome!

Big thank you to Dan Richert for the impromptu crash course on parsing json this morning!!!

Digitized books into the IR - workflow

This past week, we started depositing digitized books into our institutional repository instance for The McMaster Collection. As of this posting we have 216 books in the collection. However, currently these materials are only available to the McMaster community. This is completely out of my control, and I agree with what some of you may be thinking: "wait, out-of-copyright books are not available to the general public!?"

The workflow is a little complicated right now, but it is the beginning and will definitely be improved. Each digitized book has a specific set of outputs associated with it: one folder with a TIFF of each page, one folder with an OCR'd text file for each page, one folder for book metadata, and a searchable PDF. The metadata folder has a MARC record (.mrc & MARC21) pulled from WorldCat via Z39.50. Once we have a bulk of digitized books, we copy the MARC records to separate directories for processing. Our goal here is to parse the MARC records for certain fields (title, publication date, author, etc.) and dump them to a CSV file. We were able to do this by creating a Python script (code below) utilizing a library called pymarc. When the processing of the MARC records is finished, we take the output from the CSV and join (mostly copypasta) it with an XLS file produced by the batch import process for Digital Commons. Once the Digital Commons XLS is finalized, it is uploaded and the Digital Commons system parses it, grabs the PDFs from an accessible directory, and deposits the books.

Future plans...

Automate the copying of PDFs and MARC records via a shell script and set it to run on a cron. Similarly, once the files are moved, the Python script should begin processing the records.
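
That copy step should be simple enough - here is a hypothetical sketch (every path is made up), whether it ends up as a shell script or a bit more Python dropped into cron:

#!/usr/bin/env python
# Hypothetical version of the copy step -- every path here is made up.

import os
import shutil

DIGITIZED_DIR = '/path/to/digitized/books'
MARC_DEST = '/path/to/processing/marc'
PDF_DEST = '/path/to/processing/pdf'

for root, dirs, files in os.walk(DIGITIZED_DIR):
    for name in files:
        src = os.path.join(root, name)
        if name.endswith('.mrc'):
            shutil.copy(src, MARC_DEST)
        elif name.endswith('.pdf'):
            shutil.copy(src, PDF_DEST)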

The bottleneck in the entire process is copying the output from the Python script into the Digital Commons XLS. The MARC records are *old* and not very pretty, especially the date field. Also, the output for the author from the Python script and the input required for the author in the XLS are quite different. The values entered by the cataloguer in the author fields of the MARC record are not consistent (last name, first name or first name, last name), and the XLS requires the first name, middle name, and last name in separate fields. I foresee a lot of regex or editing by hand. :(
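
If it does come down to regex, a hypothetical helper along these lines might at least knock off the common "last name, first name" entries and leave the rest for hand editing:

# Hypothetical helper for the common "Last, First [Middle]" author entries.
def split_author(author):
    author = author.strip().rstrip('.,')
    if ',' in author:
        last, rest = [part.strip() for part in author.split(',', 1)]
        pieces = rest.split()
        first = pieces[0] if pieces else ''
        middle = ' '.join(pieces[1:])
        return first, middle, last
    # no comma at all -- leave it for hand editing
    return author, '', ''

print split_author('Smith, John A.')  # -> ('John', 'A', 'Smith')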


marc2csv.py - Matt McCollow - http://gist.github.com/348178

#!/usr/bin/env python

import csv
from pymarc import MARCReader
from os import listdir
from re import search

SRC_DIR = '/path/to/mrc/records'

# get a list of all .mrc files in source directory
file_list = filter(lambda x: search('.mrc', x), listdir(SRC_DIR))

csv_out = csv.writer(open('marc_records.csv', 'w'), delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)

for item in file_list:
    fd = file(SRC_DIR + '/' + item, 'r')
    reader = MARCReader(fd)
    for record in reader:
        title = author = date = subject = oclc = publisher = ''

        # title
        if record['245'] is not None:
            title = record['245']['a']
            if record['245']['b'] is not None:
                title = title + " " + record['245']['b']

        # determine author
        if record['100'] is not None:
            author = record['100']['a']
        elif record['110'] is not None:
            author = record['110']['a']
        elif record['700'] is not None:
            author = record['700']['a']
        elif record['710'] is not None:
            author = record['710']['a']

        # date
        if record['260'] is not None:
            date = record['260']['c']

        # subject
        if record['650'] is not None:
            subject = record['650']['a']

        # oclc number
        if record['035'] is not None:
            if len(record.get_fields('035')[0].get_subfields('a')) > 0:
                oclc = record['035']['a'].replace('(OCoLC)', '')

        # publisher
        if record['260'] is not None:
            publisher = record['260']['b']

        csv_out.writerow([title, author, date, subject, oclc, publisher])
    fd.close()

MEETINGS ALL DAY - ?_?

Meetings all day. Will everything go better than expected, or will I rage?

Morning:

email - nope, I'm in meetings all day.

Got into work and discovered the contract worker for the giant 25,000 object digitization project started yesterday and nobody told me.

LOOK OF DISAPPROVAL

Checked in the worker and made sure that she was provided with proper documentation regarding file-naming convention, scanning requirements, and storage.

Liaison meeting - teaching with iClickers.

Preliminary meeting with the Science and Technology Center for Archaeological Research project to plan out their research centre in the Institutional Repository. Lots of exciting things were discussed. They are very interested in Open Access, so I gave them some SPARC brochures, and made sure they were aware of the Open Access Addendum for submitting articles to journals. Should see some progress with this project very soon!

Digital Collections - Functional Requirements Meeting (site redesign). Finally! Remember all those posts from the last Library Day in the Life where I was talking about moving to Drupal 6 and instituting a bunch of new features??? Well, some things have changed, but we are going to do all of them and more, including a complete site redesign from the ground up.

IR Steering Committee - Iteration 2, hereafter referred to as the IR Working Group. That, my friends, is a mouthful. Communication, workflow, advocacy, education. I'm the chair of this committee and gathered a new group of people together to move forward with the institutional repository. The meeting went very well; we have a good game plan for moving forward, and a lot of positive plans of action that should be taking place shortly. PRO-GRESS!

Rocked out to Bad Brains for the commute home.

Everything went better than expected.


Historical Perspectives on Canadian Publishing - LAUNCH!!!

Oops, I was supposed to write about this last Thursday when we actually launched. Busy, busy week. So, without further ado - Historical Perspectives on Canadian Publishing!

So, here is the actual library news story. The site was a year in the making, and still has some content that will be added. An immense amount of hard work was put in by the team. I would like to give special thanks for all the hard work put in by the project coordinator, Judy Donnelly; Bev Bayzat, who handled the data management portion of the project; and Matt McCollow, who took over the majority of development responsibilities on the site. Also, many thanks to all of our students who worked on the project - Belinda Hanson, Asiya Zareen, Sherry Sun, and Justina Chong.

Ok, now for the geeky stuff. There are 963 records in the site at the moment, covering approximately 3500 images, audio interviews, and a video tour of Coach House Press. Once again, this collection was built with Drupal and is a sub-site of the overall McMaster University Library Digital Collections site. Users can comment on records and case studies, and logged-in users can tag records.

During the development phase of the project, we decided to use the Faceted Search module a lot more than we had previously, most notably in the right-hand navigation. When users are in a record, a variety of fields are exposed to the faceted search module, allowing them to discover other similar content based on the metadata from the record.

Finally, Matt put in some hard work during the last week of the project to get jPlayer working in the records which have audio, and Galleriffic for the galleries in the "Themes" section.


Internment Camp Correspondences, Gestapo Camp Correspondences, updates, and Russell Library

More updates on the World War, 1939-1945, German Concentration Camps and Prisons Collection. The Internment Camp Correspondences are now finished. There weren't too many of them - only 56 to be exact. With the internment camp letters finished, we have moved on to the Gestapo Camp Correspondences.

I have also added the "Related Information" feature to the World War, 1939-1945, German Concentration Camps and Prisons Collection and the Russell Library Collection. It is just a block in the right column, which is an extension of the faceted search module. Also, in somewhat related news regarding the Russell Library Collection, I have inherited another worker who will be going through the records and adding cover images, title page images, and book plate images to records without them.


Concentration Camp Correspondences

After an entire year of scanning and metadata entry by a couple of amazing students, we have finished a portion of the World War, 1939-1945, German Concentration Camps and Prisons Collection. The entirety of the Concentration Camp Correspondences [http://digitalcollections.mcmaster.ca/concentration-camp-correspondence] - 1031 to be exact - are up online with full metadata records. Also, a *very* helpful volunteer has been going through and translating/summarizing the records. If anybody knows German, Yiddish, Polish, or French and would like to volunteer, please contact me.

Now that all of the records are up, some new discovery features will be added this week. *Hopefully*

The next section of the collection to be scanned is the Internment Camp Correspondences. We got a few done today, and they can be previewed here: http://digitalcollections.mcmaster.ca/internment-and-transit-camps-corre...


2009 OLITA Digital Odyssey

I must say that the Digital Odyssey was the best one-day event I have been to. Just a fantastic day with fantastic people talking about awesome projects. It cheered me up and gave me hope in these crap times. The best part of the day had to be Mike Ridley's keynote speech - the Age of Information is over; it is time for the Age of Imagination, and it will be the library's job to nurture and foster creativity.

Workshops attended:

Walter Lewis - The Perfectibility of Data. I must say that Walter may be a bigger metadata fascist than I am. He showed some cool stuff that I didn't know about - media RSS feeds! Then using Cooliris to visualize said feed. Also, I finally realized how simple it is to provide proper data to interact with Google Earth & Google Maps. Just latitude and longitude coordinates!!!

Loren Fantin - Planning and Managing a Digitization Project. Lots of great stuff in Loren's talk. I don't see a blog entry on the Digital Odyssey site yet, so no link. Biggest lessons learned - scope creep!!! & digitization should be a part of collection management.

Art Rhyno - OCR Options for Scanned Content. Great session on the basics & overview of OCR, and OCR software options. Provided many examples from a variety of OCR software packages. Mostly ABBYY & Ocropus.


The text to my presentation, pdf of slides, keynote file, and powerpoint file.


OMG! You Don't Need CONTENTdm!!!

So, I bet a lot of you are wondering what is up with my title? Well, I don't plan on standing up here taking potshots at OCLC for 15 minutes, but I am sure some people in the crowd wouldn't mind. Basically, the title should have had a very long sub-title along the lines of, like Dr. Strangelove, or: How I Learned to Stop Worrying and Embrace Open Source Software.

How many people here know what CONTENTdm is? Well, straight from the site - it "is a single software solution that handles the storage, management and delivery of your library's digital collections to the Web."

So, I am an Open Source Software evangelist. Yeah, yeah, yeah... I'm a hypocrite. I used proprietary software to make this presentation. I'm not a fascist about open source software; I'm only a fascist when it comes to metadata. But, on a serious note, I strongly believe that libraries should be at the forefront of open source software use. "Being an Open Source Software evangelist is like being a library evangelist." - Karen Schneider. I also believe that academic libraries have a responsibility to play a major role in the development of open source software for libraries. As a side note, I believe this ties in very well with the publish or perish notion of academia. What is open source software but a constant state of peer revision?

Which brings me to why libraries should stop buying proprietary, closed, out-of-the-box software solutions from vendors. I think all of you know what I am talking about: Horizon, Millennium, LunaInsight, DLXS, and CONTENTdm, just to name a few. What do we generally get? Something that works... kinda... for the time being. Support is there... maybe. Oh wait, you want to do that? You'll need to buy this $20,000 add-on. I think Jessamyn West sums this up quite well in her Evergreen Conference closing keynote: "Closed vendor development = Proof of concept. Go! Ship it!!!" And yes, you can argue the same for open source software, but at least the community can get at the source and improve on it!

Now, I told you I was not going to take potshots. So, CONTENTdm is not a pile of garbage. What it does, it does very well. It does things that the digital collections setup that we have built cannot do yet - JPEG2000 (which hopefully we can launch this summer) and Z39.50. It has an API for custom development. But I want more! I want to be able to move with the times. I want to be able to move at my own pace, or the freedom to move with my users. I want the freedom to do what I want to do with the software. What is that? Users want to be able to tag records and bookmark records internally to their account. They want to comment on content, and want a mobile version of the site. Oh wait, I can't do that with CONTENTdm. If I had an open source solution, I would have the freedom to do so.

I would have the freedom to run the program for any purpose. I would have the freedom to study how the program works, and adapt it to my needs. I would have the freedom to redistribute copies so I can help my neighbor. I would have the freedom to improve the program, and release my improvements to the public, so that the whole community benefits. By the way, these four freedoms are from Richard Stallman’s definition of Free Software.

How many people know who Richard Stallman is? Well, for those of you who do not, Richard is the creator of GNU, and founder of the Free Software Foundation.

Just to highlight some of these closed vendor solutions, i.e. CONTENTdm - why don't we take a look at the attack of the clones.

Well, I don’t want to be a clone. I don’t want my site to look like that. To be honest, it looks like something already 5 years old. How am I different? Besides the obviousness of my appearance...

So, what do we use? Drupal.

What is Drupal?

Drupal is a free and open source modular framework and content management system. It is written in PHP, and uses MySQL or PostgreSQL as a database. Drupal is OS agnostic. It can run on a variety of web servers - Apache, IIS (GOD FORBID), Lighttpd, and others - so long as you meet the requirements.

Now, when you download Drupal, you do not get a pre-built digital collection platform. You get the Drupal core, which is about 10-15 core modules, such as: user administration, permissions, search, content types (blogging, pages, etc.), commenting, RSS, and so on.

When Drupal says modular, they mean MODULAR. This image is a cropped section of a 2700 x 3800 pixel image representing the modules contributed to Drupal up to November 2007. Seriously, look at this - there are thousands of contributed modules.

Now, this presents us with an analogy - this is our foundation, our little brick house to build off of. Maybe we can start building up this little brick house into something like this! Now, I'm not saying we've built a skyscraper... but the sky is the limit!

So, a little bit of back story now. When I started at McMaster in September of 2007, the library had just received a grant for the Peace & War in the 20th Century digital collection. They had no digital collections infrastructure, and coming straight out of school, I was very scared, to say the least. Not scared that we had nothing, but scared of failing. I started thinking that I had bitten off a little more than I could chew.

Over the summer before I began, they had started working on selecting and scanning images, and creating corresponding metadata. They chose FileMaker Pro to store the metadata, and planned on creating a dynamic website using FileMaker Pro and an ODBC connector. Scary, huh!? Written into the grant were things stating that this would be a state-of-the-art, web 2.0 site - i.e., tagging and commenting. Mind you, all of this had to be accomplished in one year. So, after I started, I said to continue scanning and creating metadata records with FileMaker Pro for the time being - give me a month to come up with something, and then we will go from there. So, after some testing with Joomla, Plone and Drupal, and some pressure to use CONTENTdm, I decided to hedge my bets with Drupal.

So how do you do this? How do you build a Digital Collections site with Drupal?

The best way to tackle this is not to look at the huge bulk that you have to finish with, but to take it apart piece by piece and build with bricks, or modules.

What are the key pieces we *really* need? Well, obviously the ability to display our digital objects - image, sound, or video - with corresponding metadata. We need a metadata format to start with - Dublin Core. We should be friendly and let others harvest our records (OAI-PMH), thereby adding to the commons. We need a way to get the records in, in a user-friendly manner. Finally, users should be able to search and browse records in a variety of ways.

What are the key Drupal Modules to start with?

CCK - The Content Construction Kit allows you to create your own content types, and add custom fields to any content type. So, this is where Dublin Core comes in. Each of our collections gets its own content type. Then each content type uses the same Dublin Core fields + any additional metadata fields that are unique to the collection. So, for example, the World War II Concentration Camp correspondences have a lot of additional metadata - so we created fields such as prison camp, sub-camp, block number, censor notes, etc.
Views - The Views module provides a flexible method for Drupal site designers to control how lists and tables of content are presented.
Faceted Search - It is what it is, a faceted search module. It allows users to granularly expose themselves to certain content via all the CCK fields that are set up.

[Site Demonstration]

Last but not least - theming! Now, I said I did not want to be a clone. Drupal has a number of theming engines you can take advantage of. In addition, there are a lot of user-contributed themes out there. The absolute best one I recommend is the Zen theme, which is just a framework - a blank white-on-black setup with a skeleton CSS structure that you can add your own muscle to. You can pretty much do whatever you want with it.

Ok, to wrap things up - CONTENTdm is not a free product. By free, I don't just mean price-wise; I mean free to do what you will with it - those four tenets of Free Software that Richard Stallman will *NEVER* waver from. But CONTENTdm is not a bad product, nor is OCLC an evil company. Times are changing, though, and business models are changing. Developers and users want more control; they want to do what they will with a product, and mash it up how they please. I haven't even scratched the surface on what you can do with Drupal and a digital collections site. You saw that little grid of contributed modules. And modules are not that hard to write. I am not a programmer, but I can manipulate and hack PHP & MySQL to do my bidding, and have written modules. So what I am getting at is: why can't OCLC and other library software vendors open up their products? We are witnessing a revolution right now. How many libraries are moving to Evergreen, Koha, and other open source ILSs? This is not destroying the business models. Companies have to adapt. You can make money with open source software - look at Red Hat Linux, look at Equinox. One last example with CONTENTdm as my whipping boy: why not open up CONTENTdm? Let your own users contribute and develop and make the product even better. Do something like Red Hat Linux - give it away for free, and sell support contracts. That is a reliable and proven business model.

digitalcollections.mcmaster.ca released into the wild

Ok, the Digital Collections website is ready for beta testing. Registered users can comment, vote on comments, and tag records - an updated version of the "bookbag"* will be added soon. Collections with content include: Peace and War in the 20th Century, Russell Library, and World War II Concentration Camp Correspondences. At this time, AICT and the Kirtas Book Collection are just outlines, with content to be added later.

For the technical nerds! The site runs on Drupal, and takes advantage of the CCK module. Each collection has its own content type, allowing it to expose its own unique metadata. All of the collections share Dublin Core fields, which, combined with a modified version of the OAI2 module, provide OAI2 compliance. As of right now, there are approximately 165,000 nodes - with the great majority of those records being an experimental version of BRACERS (more on this some other time).

*the bookbag is a feature that allows registered users to bookmark records