Digitized books into the IR - workflow

This past week, we started depositing digitized books into our institutional repository instance for The McMaster Collection. As of this posting we have 216 books in the collection. However, these materials are currently only available to the McMaster community. This is completely out of my control, and I agree with what some of you may be thinking: "wait, out-of-copyright books are not available to the general public!?"

The workflow is a little complicated right now, but it is a beginning and will definitely be improved. Each digitized book has a specific set of output associated with it: one folder with a TIFF of each page, one folder with an OCR'd text file for each page, one folder for book metadata, and a searchable PDF. The metadata folder holds a MARC record (.mrc & MARC21) pulled from WorldCat via Z39.50. Once we have a batch of digitized books, we copy the MARC records to separate directories for processing. Our goal here is to parse the MARC records for certain fields (title, publication date, author, etc.) and dump them to a CSV file. We do this with a Python script (code below) that uses a library called pymarc. When the processing of the MARC records is finished, we take the output from the CSV and join (mostly copypasta) it with an XLS file produced by the batch import process for Digital Commons. Once the Digital Commons XLS is finalized, it is uploaded, and the Digital Commons system parses it, grabs the PDFs from an accessible directory, and deposits the books.

Future plans...

Automate the copying of PDFs and MARC records via a shell script and set it to run on a cron. Similarly, once the files are moved, the Python script should begin processing the records automatically.
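For now, here is a rough sketch of what that automation could look like, written in Python instead of shell so it can hand off to the existing script directly. All paths are hypothetical placeholders, not our actual directories:

#!/usr/bin/env python
# copy_output.py - hypothetical sketch of the cron job: copy PDFs and MARC
# records out of the digitization output directories into the processing
# directories, then hand off to marc2csv.py

import os
import shutil

SRC_ROOT = '/path/to/digitized/books'   # one folder of output per digitized book
PDF_DIR = '/path/to/processing/pdfs'    # where Digital Commons grabs the PDFs
MRC_DIR = '/path/to/processing/mrc'     # the SRC_DIR that marc2csv.py reads

for dirpath, dirnames, filenames in os.walk(SRC_ROOT):
    for name in filenames:
        if name.endswith('.pdf'):
            shutil.copy(os.path.join(dirpath, name), PDF_DIR)
        elif name.endswith('.mrc'):
            shutil.copy(os.path.join(dirpath, name), MRC_DIR)

# once everything has been copied, kick off the MARC record processing
os.system('python marc2csv.py')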

The bottleneck in the entire process is copying the output from the Python script into the Digital Commons XLS. The MARC records are *old* and not very pretty, especially the date field. Also, the output for the author from the Python script and the input required for the author in the XLS are quite different. The values entered by the cataloguer in the author fields of the MARC record are not consistent (last name, first name or first name, last name), and the XLS requires the first name, middle name, and last name in separate fields. I foresee a lot of regex or editing by hand. :(
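As a starting point, something along these lines could handle the common "Last, First Middle" case before the inevitable hand editing. This is only a sketch with a made-up helper name, and it is not part of the script below:

def split_author(raw):
    # split a MARC author string like 'Smith, John A.' into (first, middle, last);
    # anything the cataloguer entered inconsistently will still need hand editing
    name = raw.strip().rstrip('.,')
    if ',' in name:
        # assume 'Last, First Middle'
        last, rest = [part.strip() for part in name.split(',', 1)]
    else:
        # assume 'First Middle Last'
        parts = name.split()
        last, rest = parts[-1], ' '.join(parts[:-1])
    rest_parts = rest.split()
    first = rest_parts[0] if rest_parts else ''
    middle = ' '.join(rest_parts[1:])
    return first, middle, last

# split_author('Smith, John A.')  ->  ('John', 'A', 'Smith')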


marc2csv.py - Matt McCollow - http://gist.github.com/348178

#!/usr/bin/env python

import csv
from pymarc import MARCReader
from os import listdir
from re import search

SRC_DIR = '/path/to/mrc/records'

# get a list of all .mrc files in source directory
file_list = filter(lambda x: search('.mrc', x), listdir(SRC_DIR))

csv_out = csv.writer(open('marc_records.csv', 'w'), delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)

for item in file_list:
    fd = file(SRC_DIR + '/' + item, 'r')
    reader = MARCReader(fd)
    for record in reader:
        title = author = date = subject = oclc = publisher = ''

        # title
        if record['245'] is not None:
            title = record['245']['a']
            if record['245']['b'] is not None:
                title = title + " " + record['245']['b']

        # determine author
        if record['100'] is not None:
            author = record['100']['a']
        elif record['110'] is not None:
            author = record['110']['a']
        elif record['700'] is not None:
            author = record['700']['a']
        elif record['710'] is not None:
            author = record['710']['a']

        # date
        if record['260'] is not None:
            date = record['260']['c']

        # subject
        if record['650'] is not None:
            subject = record['650']['a']

        # oclc number
        if record['035'] is not None:
            if len(record.get_fields('035')[0].get_subfields('a')) > 0:
                oclc = record['035']['a'].replace('(OCoLC)', '')

        # publisher
        if record['260'] is not None:
            publisher = record['260']['b']

        csv_out.writerow([title, author, date, subject, oclc, publisher])
    fd.close()

New Stuffs on the Horizon...

Now that Historical Perspectives of Canadian Publishing is all finished up, we have time, albeit a small amount of time, to concentrate on other portions of the Digital Collections site and on other collections.

World War, 1939-1945, German Concentration Camps and Prisons Collection is nearly complete. Only a few boxes remain to be scanned. The next portion of the project is World War, 1939-1945, Jewish Underground Resistance Collection. This collection is predominantly from 1941-1944 and will contain 325 items. The finding aid for this collection is located here. These collections are two parts of a larger overall project, The Virtual Museum of the Holocaust and Resistance, which will come much later and will be a separate site that pulls from the digital collections site.

Another project that will take a bit more time, but will be an excellent resource once complete, is the migration of the World War I Maps & Aerial Photography collection over to the digital collections site. This will also include approximately 900 more trench maps. The collection will retain the use of MrSID formatting and the Lizard Tech MrSID delivery server. But we will also be including JPEG2000 versions of each map & aerial photo, and those will be served up with a new djatoka image server that our team is working on implementing. Open source > proprietary :D

The major background project that we will be working on is an upgrade from Drupal 5.x to Drupal 6.x, along with cleaning up our code base. Moving to Drupal 6 will provide us with some major improvements, namely RDFa support, which I am most excited about! We will also be working on a solution that will allow our catalogue to pull from our collections, thereby allowing users to search all of our collections at once from the library catalogue.

Keep an eye on the site. I will announce things once we have implemented them. Maybe there will be a site redesign in there too!

Drupal/Digital Collections/Images

The ImageAPI module was recently released for Drupal 5, and it changed a lot of things. During my updates (and redesign), I thought it might be a good idea to provide a sort of "how to" for images in Digital Collections. First things first: make sure you have CCK! Also of note, for each collection that I've set up with CCK, I created a new imagefield and gave it its own directory inside the "files" directory. So, for example, the Peace and War collection has its own image set, and the concentration camp correspondences have their own set.

Modules needed:
ImageAPI
Imagecache
Imagefield

Eye Candy Modules:
ImageAPI Reflect
Thickbox
Lightbox2

So, install and enable all of the modules listed above, then run update.php. After you update, browse to Administrator > Site Configuration and select ImageAPI. From here you can choose to use GD or ImageMagick. Then proceed to Image Toolkit under the same section to set your image quality.

Next, you just set up your image field for your content type (Administrator > Content Management) and select how you want to display it.

If you want to add fancy eye candy, like a reflection under each image thumbnail:

Browse to your Imagecache settings (Administrator > Site Building > Imagecache), then set up a scale action and a reflection action. If you have a white background, change the reflection's background RGB color to 255,255,255.

Drupal & Digital Collection Sites - 2

Ok, more Drupal stuff for the Digital Collections site. I'll yammer on about "must have" modules in this one. Hit the snooze button if you'd like. Oh, and these are in addition to the ones I mentioned in the previous post... Community Tags, Tagadelic, Service Links, Faceted Search, Views, Zen Theme, Quicktabs, and of course CCK.

First things first, CCK add-ons:

  • Audio Field - Defines an audio field type for CCK content.
  • CCK Fieldgroup Tabs - Displays CCK fieldgroups in tabs, enabling content to be split onto tabs in both editing and display.
  • Email - Defines an email field type for CCK.
  • File Field - Defines a file field type.
  • File Field Meta - Adds metadata gathering and storage to File Field.
  • Image - Defines an image field type.
  • Media Field Display - Adds display options for media fields.
  • Node Reference - Defines a field type for referencing one node from another.
  • Number - Defines numeric field types.
  • Text - Defines simple text field types.
  • Video Field - Defines a video field type for CCK content.
  • View Field - Defines a field type that displays the contents of a view in a node.

Category - combined with CCK, these are the two absolute must-have modules for doing digital collections. You *MUST* have some way to organize your collections, and this is the module for it. Set up a container for each collection, then categories for your different "themes," then sub-categories and sub-sub-categories. It goes on and on, but this is the way to do it.

Devel - is a great development module for... you guessed it, DEVELOPMENT! Seriously, it is great. You get fed sooooooo much output. Arrays, beautiful arrays!

Feedback 2.0 - allows site visitors and users to report issues with the site. We use it internally for tracking issues with metadata creation and technical issues during digital projects.

Node Import - this is perfect for those silly FileMaker Pro databases "some" people start out with. It requires quite a bit of patching to get it to work with the CCK image field, but it is well worth it! It ingests standard CSV and TSV files and allows you to map them to specified CCK fields.

Nice Menus - works with the standard Drupal menu setup and additionally provides the ability to have horizontal and vertical menus. It is mostly CSS with a little JavaScript, and the CSS is highly customizable, so you can easily make it work with your theme.

Front Page - allows you to go further than declaring a node for the front page in your settings.php file. If you are feeling all early 21st century and want to rock a Flash landing page, this is the way to do it. Or you can just use it to decide which users see which front page.

ImageCache - allows you to preprocess all of the images on your site with ImageMagick.

Akismet & CAPTCHA - FIGHT THE SPAM BOTS!

Google Analytics - ...well, you probably know all about Google Analytics; if not, for god's sake sign up for an account! Very nice features: track certain users or certain content, restrict tracking on certain content, etc.

Printer-friendly pages - is perfect for all of those text heavy case studies. Destroy the environment and print it out to read it later.

Sections - is perfect for theming "sections" (collections) of the site. Each section can have an installed template, theme or style attached to it.

Pathauto - ...I'll just quote from the description, "The Pathauto module automatically generates path aliases for various kinds of content (nodes, categories, users) without requiring the user to manually specify the path alias. This allows you to get aliases like /category/my-node-title.html instead of /node/123. The aliases are based upon a "pattern" system which the administrator can control."

Automatic Node Titles - is perfect for all those thousands of records that you don't want to type titles for. You can pull from fields to create a title, something like this: creator, source, date - which is done with a small PHP snippet:

<?php
// If there is no creator, build the title as "source, date";
// otherwise build it as "creator, source, date".
$token = '[field_creator-formatted]';
if (empty($token)) {
  return htmlspecialchars_decode("[field_source-formatted], [field_date-formatted]", ENT_QUOTES);
}
else {
  return htmlspecialchars_decode("[field_creator-formatted], [field_source-formatted], [field_date-formatted]", ENT_QUOTES);
}
?>

Well, that is it. Next one will be on the custom/customized modules. Don't worry, I'm not going to beat a dead horse and talk about the OAI module again.

Drupal & Digital Collection Sites - 1

I have written about Drupal & the Digital Collections site (http://digitalcollections.mcmaster.ca) a few times now, but haven't really explained how to make a digital collections site out of Drupal. So, without further ado...

What are the necessities of a digital collections site?

What are some additional features that have become necessary?

  • Tagging
  • Social Bookmarking
  • Faceted Searching
  • Visually rich environment
  • Profiles, internal site bookmarking
  • Contact forms, Image requests, Questions
  • Commenting
  • Content Recommendation

So how do you do all of this with Drupal - sans JPEG2000 support (working on that now)? Well, if you are familiar with Drupal, you know that it is an open source, modular content management system with an amazing support & development community. A standard out-of-the-box Drupal installation will not yield a digital collections site - additional modules are absolutely necessary. Time, effort, and some coding will have to go into it, but it is well worth it. The key to all of it is the Content Construction Kit (CCK). Briefly, CCK allows you to create your own fields for a node. This is where we get the ability to have all of the standard Dublin Core fields, plus any other unique metadata a collection needs to present. What I have done with my site is set up a Content Type for each collection. Each content type shares the standard Dublin Core fields (very helpful for massaging an OAI module for digital collections out of an available OAI module), and each then has its own unique additional metadata. For example, the World War II German Concentration Camp and Prison Camp Collection has metadata fields for Prison, Sub-Prison, Prison Block, etc.
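A rough way to picture that field layout (the field names here are illustrative, not the actual CCK machine names):

# every collection's content type shares the fifteen Dublin Core elements...
DUBLIN_CORE_FIELDS = [
    'title', 'creator', 'subject', 'description', 'publisher', 'contributor',
    'date', 'type', 'format', 'identifier', 'source', 'language', 'relation',
    'coverage', 'rights',
]

# ...and each collection then adds its own fields on top, e.g. the camp and
# prison collection mentioned above
CONCENTRATION_CAMP_FIELDS = DUBLIN_CORE_FIELDS + ['prison', 'sub_prison', 'prison_block']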

I have written about the OAI module a couple of times, but essentially what I did was take the OAI-PMH module, which is an interface for the Bibliography module, and rework it so it interfaces with the CCK fields I created for the standard Dublin Core fields. I have not had the time to generalize it (I hope to in the future, time willing!), so it is hard coded to my collections right now.
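To give a sense of the mapping, here is a sketch of the oai_dc payload the module ends up emitting for a record. This is illustrative Python, not the module's actual PHP; the only assumptions are the standard oai_dc and Dublin Core namespaces:

OAI_DC_NS = 'http://www.openarchives.org/OAI/2.0/oai_dc/'
DC_NS = 'http://purl.org/dc/elements/1.1/'

def oai_dc_record(fields):
    # fields: dict of Dublin Core element name -> value pulled from the CCK fields
    lines = ['<oai_dc:dc xmlns:oai_dc="%s" xmlns:dc="%s">' % (OAI_DC_NS, DC_NS)]
    for element, value in fields.items():
        lines.append('  <dc:%s>%s</dc:%s>' % (element, value, element))
    lines.append('</oai_dc:dc>')
    return '\n'.join(lines)

# e.g. oai_dc_record({'title': 'Recruiting poster', 'date': '1915'})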

Searching is a built-in feature of Drupal. Drupal does a pretty good job of creating a search index for itself, and it offers advanced search features as well. With a content type for each collection, users can limit their search to a specific collection or run a site-wide search.

Browsing a collection can be done by setting up categories and containers for the collections, then placing each record under a specific collection when creating the records, or by doing a massive MySQL update query if you have imported a number of records to start with. Also, for custom browsable options I have used the Views module to create views for specific metadata fields and limited them to a collection. The Faceted Search module also allows you to list all of the fields you would like exposed to faceted search, thereby allowing a user to browse by a variety of field types.
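For the bulk MySQL case, a hypothetical sketch of filing a batch of imported records under a collection term might look like the following. The term ID (42), content type name, and connection details are all made up, and this assumes Drupal 5's term_node table, so check the schema and the real tid on your own install before trying anything like it:

import MySQLdb

# file every imported node of one content type under a collection term
conn = MySQLdb.connect(host='localhost', user='drupal', passwd='secret', db='drupal')
cur = conn.cursor()
cur.execute("""
    INSERT INTO term_node (nid, tid)
    SELECT nid, 42
    FROM node
    WHERE type = 'concentration_camp_correspondence'
""")
conn.commit()
conn.close()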

Not too much to say about JPEG2000 support right now. There are two possible scenarios that I am brainstorming. The first is Lizard Tech. Before I started here, the Library had purchased a Lizard Tech Express Server license in order to display the MrSID images for the World War I trench maps. The new version of the Lizard Tech server supports JPEG2000 and has an API that I should be able to get Drupal to work with - fingers crossed! The other option is the aDORe djatoka open source JPEG2000 server. I planned on working on this at the Access 2008 Hackfest, but got distracted with SOPAC and Evergreen.

So, now for the rest - additional features...

Tagging is done with the Community Tags module, and tag clouds are created with Tagadelic.

Social Bookmarking is done with the Service Links module.

Faceted Searching is done with the Faceted Search module.

Visually rich environment is done with a variety of modules and custom template coding. Modules that assist in making this possible include Views (and many Views sub-modules), Zen Theme, jQuery, Highslide, and Tabs & Quicktabs.

Profiles, internal site bookmarking... user accounts are a standard feature of any content management system. With Drupal we used a custom view and a user hook to allow registered users to bookmark any record to their account.

Contact forms, Image requests, Questions are done with the Contact Form module. Here users can ask questions about records, request images, and report problems with the site or records.

Commenting is another built-in feature of Drupal. Comments are allowed on every record on the site. Unregistered/anonymous users have to deal with a CAPTCHA, whereas registered users do not.

Content Recommendation is done with the Content Recommendation Engine (CRE). This module interfaces with a number of other modules; the main one that I utilize is the Voting API. The Voting API combined with the CRE allows for a Digg-like feature on each record. Each record has a "Curate It!" link, and items that have been "curated" are then featured on the Items Curated page. Drupal also has a popular content feature that I use as well.

So, that is pretty much it for the bullet points listed above. I will have another post or two about Drupal in digital collections: one featuring all of the modules that I take advantage of, and another covering any questions anybody has.

IR Update

It has been a while since I have done an IR update. So, the UPDATE: 3 new e-journals, 2 in the works, and 2 books from the anthropology department have been published to the IR. The big news is that we now have a Scholarly Communications Librarian - Barbara McDonald - who will be the spokesperson for the IR, so I can concentrate more on the technical side of the project. We have also launched Selected Works. It is basically a standardized faculty page that allows faculty to manage their work, upload new documents, generate readership reports, and send mailings. Additionally, it integrates with the IR (Digital Commons), so I can pull from faculty SW sites into the IR. I also nervously gave a presentation at the Learning Technologies Conference this week on Digital Commons. The presentation can be found here.

The new Journals:

18th Century Fiction - http://digitalcommons.mcmaster.ca/ecf
The McMaster Journal of Communication - http://digitalcommons.mcmaster.ca/mjc
Esurio: Ontario Journal of Hunger and Poverty - http://digitalcommons.mcmaster.ca/esurio/

In the works:

Nexus: The Canadian Student Journal of Anthropology
Journal of Ethics in Mental Health

Selected Works Gallery - http://digitalcommons.mcmaster.ca/sw_gallery.html

McMaster University e-Journals

Access 2008

Access was great: my first time going, first time at a hackfest, first time presenting at a conference. I had to laugh to myself at some points as presenters threw out their Access "street cred" - being at the '96 Access... I was in high school.

Hackfest
Worked on problem 7:

7. "Socialize Evergreen! John Blyberg just released SOPAC 2.0
(http://www.thesocialopac.net/ and
http://www.blyberg.net/2008/08/16/sopac-20-what-to-expect/) built on a
Drupal, MySQL, and Sphinx platform, with interfaces defined for
sharing tags, reviews, and ratings of materials (Insurge). Your
mission: accomplish one or more of the following:
* Wouldn’t it be nice to have a different interface for Evergreen?
Maybe one built on Drupal, so you can host your whole site in it? You
can’t argue with those rhetorical questions, can you? So, stop
blubbering and build an ILS connector between Evergreen and Locum
(SOPAC’s catalog discovery layer). Right now the only available ILS
connector for Locum is to III. This aggression will not stand, man!
* Enable Evergreen to contribute tags, reviews, and ratings to
Insurge. Hmm, that means you’ll have to teach Evergreen how to store
tags, reviews, and ratings first. Okay, do both. And figure out how to
pull tags, reviews, and ratings from Insurge while you’re at it, okay?
Geez. Do I have to do everything here?
* Figure out how to replicate Insurge. Possibly to an entirely
different platform. CouchDB? Solr? PostgreSQL + Full Text Search? I
dunno, you’re the big-brained person. LOCKSS!"

I spent the majority of the day setting up an ideal test environment - installing MySQL and PHP5 with any and every dependency necessary, then setting up Drupal, Sphinx, Locum, and Insurge. My plan was to see how I could scrape the web interface in Evergreen, but that was foiled by a bug I found in the server image. Dan Scott troubleshot it for a bit, but it was almost 4:00 by then and everybody was wiped, so we just chalked it up to experience for the future... although Dan was happy I found the bug... and had it fixed by the next morning!

Presentation
The presentation went well. I presented on a panel with Ilana Kingsley (University of Alaska Fairbanks), Dave Mitchell (London Public Library), Harish Nayak (University of Rochester), and Debra Riley-Huff (Ole Miss). We ran really tight on time; from what I hear, that always happens with panel presentations. If you want to have a look at the slides, you can download them here. When the audio from Access gets all hashed out, I'll post that too.

PW20C Launch & Local Press Coverage

Here is the library press release:

The William Ready Division of Archives and Research Collections at McMaster University Library is launching the latest in a series of digital initiatives aimed at bringing its unique collections to a wider, online audience. The new site, Peace and War in the 20th Century, has been developed with the assistance of almost $100,000 in funding from the Department of Canadian Heritage, through the Canadian Memory Fund.

This website aims to create an immersive virtual environment which invites users to explore two of the most central and formative aspects of twentieth-century culture: peace and war. Foregrounding McMaster’s extensive, unique and world-renowned archival collections, incorporating advice from the best subject experts in the field and utilizing state of the art, robust digital technology, the site tells the compelling story of how these two contrary impulses have shaped our country and our world.

Organized into compact thematic modules, constructed to appeal to a wide range of users, content presented in digital form ranges from wrenchingly personal diaries, letters and photographs to the powerful public propaganda of recruiting posters, peace bulletins, and popular songs. The site includes some 3000 database entries and almost 50 individual case studies as well as audio and video segments, maps and an animation of a First World War trench raid, recreated from original archival documents.

The site is already winning praise. Dr. Ken Cruikshank, Chair of the Department of History says: “what makes this website exciting to me is that it introduces students to the exceptional archival resources available to them in their own backyard, at McMaster University Library. The online sources are an exciting addition to research materials currently available on the Internet, and will motivate students interested in studying efforts to make peace, or the social, political and cultural impact of war.”

The project is the first developed by McMaster University Library in collaboration with two community partners, Local History and Archives at Hamilton Public Library and Canadian Warplane Heritage Museum.

A launch event to celebrate the project is being held on Monday, September 29th, from 10:30 am to 11:30 am in Convocation Hall.

If you are interested in attending the launch or for more information, please contact Kathy Garay.

And the Hamilton Spectator story - History lessons online: Major collections contribute to Peace and War website

Got another grant!!! History of Canadian Publishing

Just as we put the finishing touches on the Library and Archives Canada funded Peace & War in the 20th Century project, we received word from the granting agency that our newest grant application has been accepted. We have been awarded almost $100,000 to develop a state-of-the-art, interactive website on the history of Canadian publishing. The project will last for a year (the same amount of time as the PW20C project) and will focus on the history of Canadian publishing houses, people in publishing, authorship, and aspects unique to Canadian culture.

The William Ready Division of Archives and Research Collections houses one of the most prestigious collections on the subject of Canadian publishing. We will also be collaborating with the Thomas Fisher Rare Book Library at the University of Toronto and with Queen's University, both of which also hold extensive archives on the subject.

**Update**

Here is the link to the McMaster Daily News story: http://dailynews.mcmaster.ca/story.cfm?id=5566

Judy Donnelly, project specialist, Rick Stapleton, archivist librarian, Nick Ruest, digital strategies librarian, and Carl Spadoni, research collections librarian, pose with some of the artifacts that will be available on a website about the history of Canadian publishing. Photo by Susan Bubak.