Digitized books into the IR - workflow

This past week, we started depositing digitized books into our institutional repository instance for The McMaster Collection. As of this posting we have 216 books in the collection. However, currently these materials are only available to the McMaster community. This is completely out of my control, and I agree what some of you may be thinking, “wait, out of copyright books are not available to the general public!?”

The workflow is a little complicated right now, but it is the beginning and will definitely be improved. Each book digitized has a specific set of output associated with it; one folder with a TIFF of each page, one folder with an OCR’d text file for each page, one folder for book metadata, and a searchable PDF. The metadata folder has a MARC record (.mrc & MARC21) pulled from WorldCat via Z39.50. Once we have a bulk of digitized books, we copy MARC records to separate directories for processing. Our goal here is to parse the MARC records for certain fields (title, publication date, author, etc) and dump them to a CSV file. We were able to do this by creating a Python script (code below) utilizing a library called pymarc. When the processing of the MARC records is finished, we take the output from the CSV and join (mostly copypasta) it with an XLS file produced by the batch import process for Digital Commons. Once the Digital Commons XLS is finalized, the XLS is uploaded and the Digital Commons system parses the XLS, grabs the PDFs from an accessible directory, and deposits the books.

Future plans…

Automate the copying of PDFs and MARC records via a shell script and set it to a cron. Similarly, once the files are moved the Python script should begin processing the records.

The bottleneck in the entire process is copying the output from the Python script to the Digital Commons XLS. The MARC records are old and not very pretty, especially with the date field. Also, the output for the author from the Python script and the input required for author in the XLS is quite different. The values entered by the cataloguer in the author fields of the MARC record are not consistent (last name, first name or first name, last name) and the XLS requires the first name, middle name, and last name in separate fields. I foresee a lot of regex or editing by hand. :(

marc2csv.py - Matt McCollow - http://gist.github.com/348178

#!/usr/bin/env python
import csv
from pymarc import MARCReader
from os import listdir
from re import search
SRC_DIR = '/path/to/mrc/records'
# get a list of all .mrc files in source directory
file_list = filter(lambda x: search('.mrc', x), listdir(SRC_DIR))
csv_out = csv.writer(open('marc_records.csv', 'w'), delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
for item in file_list:
  fd = file(SRC_DIR + '/' + item, 'r')
  reader = MARCReader(fd)
  for record in reader:
    title = author = date = subject = oclc = publisher = ''
    # title
    if record['245'] is not None:
      title = record['245']['a']
      if record['245']['b'] is not None:
        title = title + " " + record['245']['b']
    # determine author
    if record['100'] is not None:
      author = record['100']['a']
    elif record['110'] is not None:
      author = record['110']['a']
    elif record['700'] is not None:
      author = record['700']['a']
    elif record['710'] is not None:
      author = record['710']['a']
    # date
    if record['260'] is not None:
      date = record['260']['c']
    # subject
    if record['650'] is not None:
      subject = record['650']['a']
    # oclc number
    if record['035'] is not None:
      if len(record.get_fields('035')[0].get_subfields('a')) > 0:
        oclc = record['035']['a'].replace('(OCoLC)', '')
    # publisher
    if record['260'] is not None:
      publisher = record['260']['b']
    csv_out.writerow([title, author, date, subject, oclc, publisher])


comments powered by Disqus