Digital Collections

Fail, Fail, Fail, Success?

This past week I had the privilege of speaking on a panel at Access 2011 about failing entitled, "If you ain't failin', you ain't tryin'!" Amy Buckland moderated the panel where we each took five minutes to tell a library tech fail story to encourage the audience to share their failure stories. I think it went over great, and was cathartic to say the least.
I shared my story, and afterward I had that familiar feeling of "but, wait! I have even more to say!" There are so many lessons to be learned! So, I'll share the story again here, along with *all* of the lessons learned that I would have shared given the requisite time.
The story
Three years ago I was on an Access panel presentation to speak about a project we had just hit a critical milestone on. Ironically, I spoke at Access 2011 on a fail panel about that same project.
When I started at MPOW I was thrown to the wolves. We had received a Library and Archives Canada grant to digitize a large number of items from our collections and create a thematic, cutting-edge, web 2.0 website for it. Think tag clouds, a.k.a. the mullets of the internet (attribution c4lirc). Guess what? We had no infrastructure. No policies or procedures for digitization. No workflows. No metadata policies. No standards.
Given the short turnaround time of the grant - 1 year - and the grant requirements, a vendor-based drop-in solution would not cut it. So we did it all live!
We took a month to do some rapid prototyping and pulled off a pretty cool proof of concept with Drupal. It worked, and continued to work. It was the basis of our infrastructure moving forward, and at the time it was perfect!
In the background of working on the PW20C project, we had the foresight to begin creating an overall "repository" to pull content from - Digital Collections @ Mac. A Drupal 5 based repository infrastructure loosely based on best practices and standards at the time. A standard Dublin Core field set created with CCK for records with our own enhanced metadata fields for collections, a hacked-together OAI-PMH module and some really cool timeline visualizations using the SIMILE project.
Flash forward a year, and we have secured another LAC grant for Historical Perspectives on Canadian Publishing, another thematic digital collection site. The time crunch was in effect, and we pulled together another great project with probably 10x more case studies. My heart goes out to our project coordinator on this one for pulling all of those case studies together.
Flash forward another year, and we have what I believed was a pretty solid framework for digital collections. We have a main digital collections site, and two heavily customized thematic sites. We are also about 8 months into a major upgrade of our digital collections infrastructure: migrating everything from Drupal 5 to Drupal 6.
We upped our functional requirements. We wanted to hang with the cool kids: linked data, seamless JPEG2000 support, KML integration, and MediaRSS support. Yeah, MediaRSS.
Here is where the fail comes to fruition. Mistakes were made. Mistakes were made.
There is what I suppose could be called a koan in the Drupal community: "do it the Drupal way." Problem is, the Drupal way changes depending on who you are talking to, what time of day it is, and what version you are on. Heavily customizing Drupal themes is definitely not the Drupal way to do things. Those two thematic sites became an albatross, and have since been put out to pasture on their own virtual machines. (Note: Drupal 5 and PHP 5.3 really don't like each other.)
Lessons learned
Do *not* create custom thematic digital collections sites. To further clarify this, do not create custom thematic digital collections sites if you have limited personnel resources and actually have other *stuff* to do.
Do *not* create policy, procedures, workflows, best practices on the fly. However, given the title of the panel, sometimes you really need to fail to get those best practices down. So, how about, Do *not* create policy, procedures, workflows, best practices on the fly for mission critical projects.
Your data is your precious. Think one technology step ahead. For us, that means thinking past Drupal, and past Fedora. We need to be able to move from platform to platform with ease. Thankfully we had the wherewithal to structure our data in such a way that it was pretty painless to extract.
Sometimes when you think you are *not* reinventing the wheel, you are in fact reinventing the wheel. Look to the community around you and get involved. Don't be afraid to ask stupid questions. Some of those questions that I thought were stupid and shouldn't be asked were in fact questions that were begging to be asked.
Also akin to reinventing the wheel: the hit-by-the-bus scenario. You build your really awesome-homegrown-fantastic-full-of-awesomeness thing, then you get hit by a bus, take another job, etc., and your place of work is so entirely screwed. At the very least, DOCUMENT, DOCUMENT, DOCUMENT.
The library tech community is pretty rad. We're all doing a lot of similar work that doesn't need to be replicated, or if it does, does not need to be completely reinvented. Again, engage, and interact.
Moving forward, making this fail into a success...
Over the past few months we have taken the time to sit down and write out our digitization/digital collections philosophy with stakeholders. What I thought might be a difficult and painful exercise turned out to be quite wonderful and we came up with a document that I am proud of. 
We also took the time to do a study of what digital preservation means at MPOW, and what we are capable of doing right now, what we can be doing in the near future, and what we should look to achieve in the long-term. This segued nicely into a functional requirements document for our repository infrastructure.
Right now, we are working on creating what I believe to be a solid infrastructure; heavily documented! Something we lacked all along, and what some of my colleagues know me for - that guy who walks around stamping his feet about infrastructure all the time. INFRASTRUCTURE. INFRASTRUCTURE. INFRASTRUCTURE.
Hopefully in a year or two I can come back to Access and present on a panel full of folks turning failures into success!

Node Import fails me | Hack the database!

Over on the dev version of our digital collections site we are working on lots of new features. One of them is JPEG2000 support for our World War I trench maps, World War I aerial photos, and World War II Italian topographical maps. Lightbox2 simply does not cut it when researchers would like to examine these wonderful images. Since we are pretty short-staffed here and don't have the wherewithal to whip up a Drupal module to do this "properly", we have come up with what I think is a pretty creative solution for adding the jp2 images to the records in Drupal.

First off we set up Adore-Djatoka and started converting all of our tifs to jp2 with the utility that comes with Djatoka. We have all of the aerial photos and topographical maps converted. However, we are running into some heap space problems with the trench maps. I am assuming it is because of their sheer size - 400MB-2.5GB. Heap space errors aside, we set up the resolver and the IIPImage Viewer - sample.

Next we set up a CCK IFrame field for each content type and prepped our imports. This is where we ran into a bit of trouble - Node Import does not support CCK IFrame. Problem - time to get creative! We decided to import the records without the jp2 field and then update them in the database, which in turn presented us with a couple more problems... err, how about we say quirks. The update was fairly straightforward, just the following two MySQL queries:

UPDATE content_field_jp2 c JOIN content_field_identifier d ON (c.nid = d.nid) JOIN node n ON (c.nid = n.nid) SET field_jp2_url = CONCAT('', d.field_identifier_value) WHERE n.type = 'wwiap';

UPDATE content_field_jp2 c JOIN content_field_identifier d ON (c.nid = d.nid) SET field_jp2_attributes = 'a:4:{s:5:"width";s:4:"100%";s:6:"height";s:3:"768";s:11:"frameborder";s:1:"0";s:9:"scrolling";s:4:"auto";}';
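
The long string in the second query is a PHP-serialized array, where every string is prefixed with its byte length. Getting one of those lengths wrong makes PHP refuse to unserialize the value, so it can be worth generating the string rather than typing it by hand. Here is a minimal sketch in Python; the helper name `php_serialize_strings` is my own invention, and it only handles a flat array of string keys and string values, which is all this field needs.

```python
def php_serialize_strings(attrs):
    """Encode a flat dict of str -> str in the format PHP's serialize() uses for arrays."""
    parts = []
    for key, value in attrs.items():
        for item in (key, value):
            # PHP stores the byte length of each string, not the character count.
            parts.append('s:%d:"%s";' % (len(item.encode("utf-8")), item))
    return "a:%d:{%s}" % (len(attrs), "".join(parts))


# The iframe attributes used in the second UPDATE query above.
iframe_attributes = {
    "width": "100%",
    "height": "768",
    "frameborder": "0",
    "scrolling": "auto",
}

serialized = php_serialize_strings(iframe_attributes)
print(serialized)
```

The printed value can be pasted into the SET clause as-is. Nested arrays, integers and booleans use different prefixes and are deliberately left out of this sketch.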

Once the above queries were run, the nodes were not technically updated, because writing to the database directly skips the hooks that Drupal invokes when a node is saved. Or as I like to call them, Drupal's rules, regulations and procedures. Basically we had to batch re-save all of them. Hooray for Views Bulk Operations! However, we noticed a problem after we re-saved all of the nodes from a particular content type: the nodes did not always reflect what we had updated them to. We discovered that the CCK cache was interfering. The solution was to wipe the 'content_cache' table, run our two update queries again, then batch re-save all of the records. The results are pretty nice: we have our embedded jp2 with our metadata. Now just to theme everything!
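
If hacking a live database makes you nervous, a join-driven update like the first query can be rehearsed on a throwaway copy first. The sketch below does that with an in-memory SQLite database in Python. It makes a few assumptions: the tables are cut down to the columns the query touches, the sample identifiers are made up, and `VIEWER_PREFIX` is a hypothetical stand-in, since the real prefix is not shown above. SQLite has no `UPDATE ... JOIN`, so the joins are rewritten as subqueries; this rehearses the logic, not the exact MySQL statement.

```python
import sqlite3

# Hypothetical prefix for illustration only; not the real viewer URL.
VIEWER_PREFIX = "http://example.org/viewer?id="

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (nid INTEGER PRIMARY KEY, type TEXT);
    CREATE TABLE content_field_identifier (nid INTEGER, field_identifier_value TEXT);
    CREATE TABLE content_field_jp2 (nid INTEGER, field_jp2_url TEXT);
    INSERT INTO node VALUES (1, 'wwiap'), (2, 'wwiap'), (3, 'page');
    INSERT INTO content_field_identifier VALUES (1, 'ap-0001'), (2, 'ap-0002'), (3, 'pg-0003');
    INSERT INTO content_field_jp2 VALUES (1, NULL), (2, NULL), (3, NULL);
""")

# Build each URL from the record's identifier, but only for 'wwiap' nodes.
conn.execute(
    """
    UPDATE content_field_jp2
    SET field_jp2_url = ? || (
        SELECT d.field_identifier_value
        FROM content_field_identifier d
        WHERE d.nid = content_field_jp2.nid
    )
    WHERE nid IN (SELECT n.nid FROM node n WHERE n.type = ?)
    """,
    (VIEWER_PREFIX, "wwiap"),
)

rows = conn.execute(
    "SELECT nid, field_jp2_url FROM content_field_jp2 ORDER BY nid"
).fetchall()
for row in rows:
    print(row)
```

Rows for other content types keep their NULL URL, which is a quick way to confirm the type filter is doing its job before running the real thing.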


Library day in the life - 5 - Day 4

Wow, day 4 already. This week seems to be going by fast. Worked from home for a bit this morning and then took the train in again. Email this week has been miraculously low. Probably from all the moves. The 5th floor is eerily empty; absolutely bizarre up there now.


Morning soundtrack: BBC World Service podcast, Aphex Twin - Drukqs, Metallica - Master of Puppets. (BTW, did you know Metallica did not produce any records after ...And Justice for All? All the other ones are an urban legend.)

More merges. Down to one standard subject field. Nearly down to one relation field, and one coverage field. Should be done with that by the end of the day. Thank you again, Views Bulk Operations!

Finished pulling quotes together for the potential Crombie family grant.

Lunch with a colleague. Finally get to collect on my World Cup bet. ¡¡¡ESPAÑA!!!


Afternoon soundtrack: nothing. absolutely nothing.

Thursday afternoons bring me down to the William Ready Division of Archives and Research Collections where I work a reference desk shift once a week. I'm one of the few librarians who still work a help desk shift here. Not saying that it is a bad thing, just still getting used to the idea of librarians not on the help desk.

Same old brief afternoon routine: checked servers for updates, checked Drupal module updates, ran updates. New to the routine: committed all the commits from yesterday and updated the MUALA Bargaining Updates blog.

Read the press release put out by Sky River for their antitrust lawsuit against OCLC. This one may be interesting. Meanwhile, merges were constantly running in the background. Almost done.

Found an autographed first edition of Charles Bukowski's "It catches my heart in its hands : new and selected poems 1955-1963" down in research collections during my shift.

blog image
blog image

Promotional items for The great railway, illustrated; written and edited by Pierre Berton


Meetings all day. Will everything go better than expected, or will I rage?


email - nope, I'm in meetings all day.

Got into work and discovered the contract worker for the giant 25,000 object digitization project started yesterday and nobody told me.


Checked in the worker and made sure that she was provided with proper documentation regarding file-naming convention, scanning requirements, and storage.

Liaison meeting - teaching with iClickers.

Preliminary meeting with the Science and Technology Center for Archaeological Research project to plan out their research centre in the Institutional Repository. Lots of exciting things were discussed. They are very interested in Open Access, so I gave them some SPARC brochures, and made sure they were aware of the Open Access Addendum for submitting articles to journals. Should see some progress with this project very soon!

Digital Collections - Functional Requirements Meeting (site redesign). Finally! Remember all those posts from the last library day in the life where I was talking about moving to Drupal 6 and instituting a bunch of new features??? Well, some things have changed, but we are going to do all of them and more, including a complete site redesign from the ground up.

IR Steering Committee - Iteration 2, hereafter referred to as the IR Working Group. That, my friends, is a mouthful. Communication, workflow, advocacy, education. I'm the chair of this committee and gathered a new group of people together to move forward with the institutional repository. The meeting went very well; we have a good game plan for moving forward, and a lot of positive plans of action that should be taking place shortly. PRO-GRESS!

Rocked out to Bad Brains for the commute home.

Everything went better than expected.


The ultimate question when working from home - When do I put pants on?

I should just write a script that pulls from all of these librarydayinthelife & #libday4 tags and make it write a post for me.


Email. Surprisingly not that much for the morning. Hopefully the trend stays that way throughout the day.

Podcast Monday! TWiT, Spark, Quirks & Quarks. Anybody else find Calacanis really annoying when he is on TWiT?

Fink put me on to a Python course from MIT. I really want to be a better programmer, but there are too many hats that I have to wear at work. :(

No new bugs in redmine for digital collections.

Digital collections had a couple of modules that needed to be updated. Updates complete.


Surprisingly very little email. I like this trend for the day.

Back to hacking away at getting Jplayer working on some PW20C Case Studies. Last week I switched the embedded videos on the case studies (1,2,3) from a really crappy unsupported module to just embedding them with the embedded media module and Vimeo. Worked out quite well. Once I get the Jplayer working, I can put PW20C and HPCANPUB out to pasture.

Further work with the Jplayer (still haven't got it to work on dev-pw20c) on digital collections. Setting the player up to work with HTML5 and ogg-vorbis. Gah, I hate flash!!!

Late afternoon.

Code refuses to work properly. I would like to bash my head on the desk, but that is not a good idea. Maybe I should just make a rage comic out of this.

Code not doing as it is told or I am an idiot!

Well, I guess Matt has to stare at the code like this o_0 and it magically works now. Chalk it up to another drupalnomoly.

Pants go on as soon as you have your first smoke of the day.

Karate Chop of Love


Memes, Smemes, Email, SQL, and Galleries

Library Day in the Life - meme validation below

Theme of the day was SQL queries.


Email, email, email, email.

Cleaned up some more raunchy code in an effort to make the theme migration to Drupal 6 less hectic. Lots of SQL queries to un-hack code, and write said code in a standardized fashion. Note to self, do not ever hire immature developers.

Continued to work with Bepress on setting up a couple new journals. Hopefully by the fall we should launch three or four journals; Text Technology, Early Theatre, Bridges: Conversations in Global Politics, and a Northrop Frye journal.


Email, email, email, email.

Investigated more code. Finished cleaning up all the bugs with the test Drupal 6 migration. All that is left is migrating the themes, and cleaning out the hacks. Cool new stuff on the way, Media RSS!

Genius me realized I had ZERO image galleries in the Digital Collections site. So, SQL queries being the theme, I set up galleries: a PW20C general gallery, a PW20C posters gallery, a World War, 1939-1945, German Concentration Camps and Prisons Collection general gallery, and finally a Russell Library general gallery.


Internment Camp Correspondences, Gestapo Camp Correspondences, updates, and Russell Library

More updates on the World War, 1939-1945, German Concentration Camps and Prisons Collection. The Internment Camp Correspondences are now finished. There weren't too many of them - only 56 to be exact. With the internment camp letters finished, we have moved on to the Gestapo Camp Correspondences.

I have also added the "Related Information" feature to the World War, 1939-1945, German Concentration Camps and Prisons Collection and Russell Library Collection. It is just a block in the right column, which is an extension of the faceted search module. Also, in somewhat related news regarding the Russell Library Collection, I have inherited another worker who will be going through the records and adding cover images, title page images, and book plate images to records without them.


Concentration Camp Correspondences

After an entire year of scanning and metadata entry by a couple of amazing students, we have finished a portion of the World War, 1939-1945, German Concentration Camps and Prisons Collection. The entirety of the Concentration Camp Correspondences - 1031 to be exact - are up online with full metadata records. Also, a *very* helpful volunteer has been going through and translating/summarizing the records. If anybody knows German, Yiddish, Polish, or French and would like to volunteer, please contact me.

Now that all of the records are up, some new discovery features will be added this week. *Hopefully*

The next section of the collection to be scanned is the Internment Camp Correspondences. We got a few done today, and they can be previewed here:

blog image
blog image

OMG! You Don't Need CONTENTdm!!!

So, I bet a lot of you are wondering what is up with my title? Well, I don’t plan on standing up here taking potshots at OCLC for 15 minutes, but I am sure some people in the crowd wouldn’t mind. Basically, the title should have had a very long sub-title along the lines of Dr. Strangelove or: How I Learned to Stop Worrying and Embrace Open Source Software.

How many people here know what CONTENTdm is? Well, straight from the site - it is “a single software solution that handles the storage, management and delivery of your library’s digital collections to the Web.”

So, I am an Open Source Software evangelist. Yeah, Yeah, Yeah... I’m a hypocrite. I used proprietary software to make this presentation. I’m not a fascist about open source software, I’m only a fascist when it comes to metadata. But, on a serious note, I strongly believe that libraries should be at the forefront of open source software use. “Being an Open Source Software evangelist is like being a library evangelist.” - Karen Schneider. I also believe that academic libraries have a responsibility to play a major role in the development of open source software for libraries. As a side note, I believe this ties in very well with the publish or perish notion of academia. What is open source software, if not a constant state of peer revision?

Which brings me to why libraries should stop buying proprietary/closed out of the box software solutions from vendors. I think all of you know what I am talking about. Horizon, Millennium, LunaInsight, DLXS, and CONTENTdm. Just to name a few. What do we generally get? Something that works... kinda... for the time being. Support is there... maybe. Oh wait, you want to do that, you’ll need to buy this $20,000 add-on. I think Jessamyn West sums this up quite well in her Evergreen Conference closing keynote, “Closed vendor development = Proof of concept. Go! Ship it!!!” And yes, you can argue the same for open source software, but at least the community can get at the source and improve on it!

Now, I told you I was not going to take potshots. So, CONTENTdm is not a pile of garbage. What it does, it does very well. It does things that the digital collections setup that we have built cannot do yet - JPEG2000 (which hopefully we can launch this summer) and Z39.50. It has an API for custom development. But I want more! I want to be able to move with the times. I want to be able to move at my own pace or the freedom to move with my users. I want the freedom to do what I want to do with the software. What is that? Users want to be able to tag records and bookmark records internally to their account. They want to comment on content, and want a mobile version of the site. Oh wait, I can't do that with CONTENTdm. If I had an open source solution, I would have the freedom to do so.

I would have the freedom to run the program for any purpose. I would have the freedom to study how the program works, and adapt it to my needs. I would have the freedom to redistribute copies so I can help my neighbor. I would have the freedom to improve the program, and release my improvements to the public, so that the whole community benefits. By the way, these four freedoms are from Richard Stallman’s definition of Free Software.

How many people know who Richard Stallman is? Well, for those of you who do not, Richard is the creator of GNU, and founder of the Free Software Foundation.

Just to highlight some of these closed vendor solutions, i.e. CONTENTdm- why don't we take a look at the attack of the clones.

Well, I don’t want to be a clone. I don’t want my site to look like that. To be honest, it looks like something already 5 years old. How am I different? Besides the obviousness of my appearance...

So, what do we use? Drupal.

What is Drupal?

Drupal is a free and open source Modular Framework Content Management System. It is written in PHP, and uses MySQL or PostgreSQL as a database. Drupal is OS agnostic. It can run on a variety of web servers: Apache, IIS (GOD FORBID), Lighttpd and others- so long as you meet the requirements.

Now, when you download Drupal, you do not get a pre-built digital collection platform. You get the Drupal Core, which is about 10-15 core modules such as user administration, permissions, search, content types (blogging, pages, etc), commenting, rss, etc.

When Drupal says Modular, they mean MODULAR. This image is a cropped section of a 2700 x 3800 pixel image representing the contributed modules to Drupal up to November 2007. Seriously, look at this, there are thousands of contributed modules.

Now, this presents us with an analogy - this is our foundation, or our little brick house to build off of. Maybe we can start building up this little brick house, into something like this! Now, I’m not saying we’ve built a skyscraper... but the sky is the limit!

So, a little bit of back story now. When I started at McMaster in September of 2007, the library had just received a grant for the Peace & War in the 20th Century digital collection. They had no digital collections infrastructure, and coming straight out of school, I was very scared to say the least. Not scared that we had nothing, but scared of failing. I started thinking that I had bitten off a little more than I could chew.

Over the summer before I began, they had started working on selecting and scanning images, and creating corresponding metadata. They chose FileMaker Pro to store the metadata, and planned on creating a dynamic website using FileMaker Pro and an ODBC connector. Scary huh!? Written into the grant were things stating that this would be a state of the art, web 2.0 site - i.e., tagging and commenting. Mind you, all of this had to be accomplished in one year. So, after I started, I said to continue scanning, and creating metadata records with FileMaker Pro for the time being. Give me a month to come up with something and then we will go from there. So, after some testing with Joomla, Plone and Drupal, and some pressure to use CONTENTdm, I decided to hedge my bets with Drupal.

So how do you do this? How do you build a Digital Collections site with Drupal?

The best way to tackle this, is not to look at the huge bulk that you have to finish with, but take it apart piece by piece and build with bricks or modules.

What are the key pieces we *really* need? Well, obviously the ability to display our digital object - image, sound, or video with corresponding metadata. We need a metadata format to start with - Dublin Core. We should be friendly, and let others harvest our records (OAI-PMH), thereby adding to the commons. We need a way to get the records in, in a user friendly manner. Finally, users should be able to search and browse records in a variety of ways.
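
To make the harvesting piece a bit more concrete, here is a rough sketch in Python of what a harvester does with an OAI-PMH endpoint: build a `ListRecords` request for the `oai_dc` metadata prefix, then pull Dublin Core fields out of the XML that comes back. The base URL, record identifier and sample record are all invented for illustration, and the response is a trimmed-down string rather than a live fetch, so nothing here reflects our actual endpoint.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "http://example.org/oai"  # hypothetical endpoint, not a real site

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A trimmed, made-up ListRecords response with a single record.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample letter</dc:title>
          <dc:type>Text</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the request URL a harvester would fetch."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def harvest_titles(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text.strip())
    results = []
    for record in root.findall("oai:ListRecords/oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        results.append((identifier, title))
    return results


print(list_records_url(BASE_URL))
print(harvest_titles(SAMPLE_RESPONSE))
```

A real harvester would also follow resumption tokens and handle deleted records; those are left out to keep the sketch short.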

What are the key Drupal Modules to start with?

CCK - The Content Construction Kit allows you to create your own content types, and add custom fields to any content type. So, this is where Dublin Core comes in. For each of our collections, we set up its own content type. Then each content type uses the same Dublin Core fields + any additional metadata fields that are unique to the collection. So for example, the World War II Concentration Camp correspondences have a lot of additional metadata - so we created fields such as prison camp, sub-camp, block number, censor notes, etc.
Views - The Views module provides a flexible method for Drupal site designers to control how lists and tables of content are presented.
Faceted Search Module - It is what it is, a faceted search module. It lets users narrow down to specific content through all the CCK fields that are set up.
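
The idea behind that setup, one shared Dublin Core field set plus per-collection extras, with facets built from whatever fields exist, can be sketched outside Drupal. The Python below is only an illustration of the data model, not how CCK or the Faceted Search module work internally; the function names, content type name, field machine names and sample records are all made up.

```python
from collections import Counter

# The 15 Dublin Core elements shared by every collection's content type.
DUBLIN_CORE_FIELDS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]


def make_content_type(name, extra_fields=()):
    """A content type is the shared Dublin Core set plus collection-specific fields."""
    return {"name": name, "fields": DUBLIN_CORE_FIELDS + list(extra_fields)}


camp_letters = make_content_type(
    "camp_correspondence",
    ["prison_camp", "sub_camp", "block_number", "censor_notes"],
)


def facet_counts(records, field):
    """Count how many records carry each value of a field (one facet)."""
    return Counter(r[field] for r in records if r.get(field))


def narrow(records, field, value):
    """Apply one facet choice, keeping only the matching records."""
    return [r for r in records if r.get(field) == value]


# Placeholder records; the values are not real collection data.
records = [
    {"title": "Letter A", "prison_camp": "Camp X", "language": "German"},
    {"title": "Letter B", "prison_camp": "Camp X", "language": "Polish"},
    {"title": "Letter C", "prison_camp": "Camp Y", "language": "German"},
]

print(facet_counts(records, "prison_camp"))
print([r["title"] for r in narrow(records, "language", "German")])
```

Because every content type starts from the same field list, a facet on a Dublin Core field such as language works across collections, while a facet on prison_camp only ever matches the one collection that defines it.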

[Site Demonstration]

Last but not least - theming! Now, I said I did not want to be a clone. Drupal uses a number of theming engines you can take advantage of. In addition, there are a lot of user contributed themes out there. The absolute best one I recommend is the Zen theme, which is just a framework - a blank white on black setup, with a skeleton css structure that you can add your own muscle to. You can pretty much do whatever you want with it.

Ok, to wrap things up - CONTENTdm is not a free product. By free, I don’t just mean price wise. Free to do what you will with it. Those 4 tenets of Free Software that Richard Stallman will *NEVER* waver from. But, CONTENTdm is not a bad product, nor is OCLC an evil company. But, times are changing, and business models are changing. Developers and users want more control, they want to do what they will with a product. Mash it up how they please. I haven’t even scratched the surface on what you can do with Drupal and a digital collections site. You saw that little grid of contributed modules. And, modules are not that hard to write. I am not a programmer, but I can manipulate and hack PHP & MySQL to do my bidding and have written modules. So what I am getting to, is why can’t OCLC and other library software vendors open up their products? We are witnessing a revolution right now. How many libraries are moving to Evergreen, Koha, and other open source ILSs? This is not destroying the business models. Companies have to adapt. You can make money with open source software. Look at Red Hat Linux, look at Equinox. One last example with CONTENTdm as my whipping boy - Why not open up CONTENTdm? Let your own users contribute and develop and make the product even better. Do something like Red Hat Linux - give it away for free, and sell support contracts. That is a reliable and proven business model.