iaTorrent update OR Learning by reading code

Last week, inspired by a meeting, I started tossing together a little Python program to solve a problem. It wasn't perfect. It was warty. But I think I have something worthwhile now. Or at least useful for me: it gives you what you want, and it writes to a log when something goes wrong.

What I really want to do here is just take a moment to sing the praises of learning by reading code. Heading into this little project, I had a basic idea of what I wanted to do, and I knew something like this could be done given Tim's project. I knew that I wanted to make this a module and set it up on PyPI, but I had really no idea how to do so. But! I knew of somebody who did, and who is quite prolific in my mind. Ed making his code available on GitHub (and under very open licenses) made it possible for me to learn how to structure a Python module, how to structure tests, and how to use argparse/optparse correctly.

So, here is to learning by reading code!

IA Torrent

Yesterday, in a meeting of our Digital Initiatives Advisory Group, we were discussing which collections we should consider sending over to the UofT Internet Archive shop, and I asked an innocent newbie question: so, do we have copies of everything we have had the Internet Archive digitize?

NOPE.

No big deal. We're in the infant stages of creating a digital preservation program here, and everything that comes with it. INFRASTRUCTURE!

I knew Tim Ribaric over at Brock University wrote an Internet Archive scraper a while back, so I knew it would be possible to get our content if need be. That, combined with the Internet Archive's announcement a little over a month ago that torrents are now available for items in the Internet Archive, inspired me to whip together a Python script to grab all the torrents for a given collection.

Last night I threw together a little proof-of-concept grabbing the RSS feed on the York University Libraries Internet Archive page using BeautifulSoup and some ugly regex.

This morning, still inspired and brainstorming with Dan Richert, I started poking around for different ways to get at our collection. The Internet Archive's advanced search is super helpful for this, and I can get the results as json! So, no regex; as Dan told me, "if you solve a problem with regex, you now have two problems."

On the advanced search page, you will need your query parameters. You can grab those from the 'All items (most recently added first)' link on a collection page. For example, the York University Libraries collection query parameters:

(collection:yorkuniversity AND format:pdf) AND -mediatype:collection

Then select your desired output format and the number of results (2608 for me, given the number of items in the collection), and you end up with some json like this:

{
   "responseHeader":{
      "status":0,
      "QTime":1,
      "params":{
         "json.wrf":"",
         "qin":"(collection:yorkuniversity AND format:pdf) AND -mediatype:collection",
         "fl":"identifier,title",
         "indent":"",
         "start":"0",
         "q":"( collection:yorkuniversity AND format:pdf ) AND -mediatype:collection;",
         "wt":"json",
         "rows":"5"
      }
   },
   "response":{
      "numFound":2608,
      "start":0,
      "docs":[
         {
            "title":"Saint-Pétersbourg",
            "identifier":"saintptersboyork00rauoft"
         },
         {
            "title":"Revised statutes of Ontario, 1990 = Lois refondues de l'Ontario de 1990",
            "identifier":"v4revisedstat1990ontauoft"
         },
         {
            "title":"Essai philosophique concernant l'entendement humain : ou l'on montre quelle est l'etendue de nos connoissances certaines, et la maniere dont nous y parvenons",
            "identifier":"1714essaiphiloso00lockuoft"
         },
         {
            "title":"Essai philosophique concernant l'entendement humain : où l'on montre quelle est l'étendue de nos connoissances certaines, et la manière dont nous y parvenons",
            "identifier":"1729essaiphiloso00lockuoft"
         },
         {
            "title":"Essai philosophique concernant l'entendement humain, : ou l'on montre quelle est l'etendue de nos connoissances certaines, et la maniere dont nous y parvenons.",
            "identifier":"1735essaiphiloso00lockuoft"
         }
      ]
   }
}
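Everything we care about lives under response.docs in that output, so pulling out the identifiers and titles only takes a couple of lines with the json module. A rough sketch (urllib2 just stands in for however you fetch the results URL):

import json
import urllib2

# the advanced search URL with output=json (the full URL is shown below)
search_url = 'http://archive.org/advancedsearch.php?...'

results = json.load(urllib2.urlopen(search_url))
for doc in results['response']['docs']:
    print doc['identifier'], '-', doc['title']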

(make sure you lop off '&callback=callback&save=yes' at the end of the url). Once you have the url for the json, it is pretty straightforward from there. You just call the script like so:

ia-torrent.py 'http://archive.org/advancedsearch.php?q=%28collection%3Ayorkuniversity+AND+format%3Apdf%29+AND+-mediatype%3Acollection&fl%5B%5D=identifier&fl%5B%5D=title&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=2608&page=1&output=json' '/tmp/ia-torrent'
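Under the hood there is not much more to it than walking that list of identifiers and grabbing a torrent file for each one. A stripped-down sketch of the idea (the _archive.torrent naming is just the pattern I noticed on item download pages, so treat it as an assumption rather than a documented API):

import json
import os
import sys
import time
import urllib
import urllib2

# the script takes the advanced search URL and an output directory, as above
search_url, out_dir = sys.argv[1], sys.argv[2]

if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

results = json.load(urllib2.urlopen(search_url))
for doc in results['response']['docs']:
    identifier = doc['identifier']
    # assumed torrent naming pattern, observed on archive.org download pages
    torrent_url = 'http://archive.org/download/%s/%s_archive.torrent' % (identifier, identifier)
    urllib.urlretrieve(torrent_url, os.path.join(out_dir, identifier + '.torrent'))
    time.sleep(15)  # crude throttle; see the caveat below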

Caveats! I haven't been able to download all the torrents for an entire collection yet. It looks like the Internet Archive's servers don't like the number of requests, and the script dies with:

'IOError: [Errno socket error] [Errno 111] Connection refused'

I've tried throttling myself in the script at 15 seconds per request, and I still get cut off. If anybody knows whether the Internet Archive has any published request rates, or has a better idea for implementing this, please let me know! Add a comment, or fork + clone + pull request. Patches are most welcome!
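If I take another run at it, one option is to catch the IOError and back off progressively instead of sleeping a fixed 15 seconds. Something like this (hypothetical; it is not in the script yet):

import time
import urllib

def fetch_with_backoff(url, dest, retries=5, wait=15):
    # try to download url to dest, doubling the wait after each refused connection
    for attempt in range(retries):
        try:
            urllib.urlretrieve(url, dest)
            return True
        except IOError:
            time.sleep(wait)
            wait = wait * 2
    return False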

Big thank you to Dan Richert for the impromptu crash course on parsing json this morning!!!

Digitized books into the IR - workflow

This past week, we started depositing digitized books into our institutional repository instance for The McMaster Collection. As of this posting we have 216 books in the collection. However, these materials are currently only available to the McMaster community. This is completely out of my control, and I agree with what some of you may be thinking: "wait, out-of-copyright books are not available to the general public!?"

The workflow is a little complicated right now, but it is the beginning and will definitely be improved. Each digitized book has a specific set of output associated with it: one folder with a TIFF of each page, one folder with an OCR'd text file for each page, one folder of book metadata, and a searchable PDF. The metadata folder has a MARC record (.mrc & MARC21) pulled from WorldCat via Z39.50. Once we have a batch of digitized books, we copy the MARC records to separate directories for processing. Our goal here is to parse the MARC records for certain fields (title, publication date, author, etc.) and dump them to a CSV file. We were able to do this with a Python script (code below) that uses a library called pymarc. When the processing of the MARC records is finished, we take the output from the CSV and join it (mostly copypasta) with an XLS file produced by the batch import process for Digital Commons. Once the Digital Commons XLS is finalized, it is uploaded, and the Digital Commons system parses the XLS, grabs the PDFs from an accessible directory, and deposits the books.

Future plans...

Automate the copying of PDFs and MARC records via a shell script and set it up as a cron job. Similarly, once the files are moved, the Python script should begin processing the records.
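Something along these lines (in Python rather than shell, and with placeholder paths) is probably where the copying piece will start; cron would just point at it:

#!/usr/bin/env python
# Hypothetical sketch: copy new PDFs and MARC records into processing
# directories so marc2csv.py can pick them up. All paths are placeholders.

import os
import shutil

DIGITIZED_DIR = '/path/to/digitized/books'
PDF_DEST = '/path/to/processing/pdfs'
MARC_DEST = '/path/to/processing/marc'

for root, dirs, files in os.walk(DIGITIZED_DIR):
    for name in files:
        src = os.path.join(root, name)
        if name.endswith('.pdf'):
            dest = os.path.join(PDF_DEST, name)
        elif name.endswith('.mrc'):
            dest = os.path.join(MARC_DEST, name)
        else:
            continue
        if not os.path.exists(dest):
            # only copy files we have not already moved over
            shutil.copy2(src, dest)

A cron entry pointing at that, followed by a call to marc2csv.py, would cover both pieces.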

The bottleneck in the entire process is copying the output from the Python script into the Digital Commons XLS. The MARC records are *old* and not very pretty, especially the date field. Also, the output for the author from the Python script and the input required for the author in the XLS are quite different. The values entered by the cataloguer in the author fields of the MARC record are not consistent (sometimes "last name, first name", sometimes "first name last name"), and the XLS requires the first name, middle name, and last name in separate fields. I foresee a lot of regex or editing by hand. :(
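For the author field, the best I can picture so far is a rough heuristic along these lines (hypothetical, and it would still need checking by hand given how inconsistent the records are):

def split_author(raw):
    # very rough guess at (first, middle, last) from a MARC 100$a string
    name = raw.rstrip('.,').strip()
    if ',' in name:
        # assume 'Last, First Middle'
        last, rest = [part.strip() for part in name.split(',', 1)]
        given = rest.split()
    else:
        # assume 'First Middle Last'
        pieces = name.split()
        if not pieces:
            return ('', '', '')
        last, given = pieces[-1], pieces[:-1]
    first = given[0] if given else ''
    middle = ' '.join(given[1:])
    return (first, middle, last)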


marc2csv.py - Matt McCollow - http://gist.github.com/348178

#!/usr/bin/env python

import csv
from pymarc import MARCReader
from os import listdir
from re import search

SRC_DIR = '/path/to/mrc/records'

# get a list of all .mrc files in source directory
file_list = filter(lambda x: search('.mrc', x), listdir(SRC_DIR))

csv_out = csv.writer(open('marc_records.csv', 'w'), delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)

for item in file_list:
    fd = file(SRC_DIR + '/' + item, 'r')
    reader = MARCReader(fd)
    for record in reader:
        title = author = date = subject = oclc = publisher = ''

        # title
        if record['245'] is not None:
            title = record['245']['a']
            if record['245']['b'] is not None:
                title = title + " " + record['245']['b']

        # determine author
        if record['100'] is not None:
            author = record['100']['a']
        elif record['110'] is not None:
            author = record['110']['a']
        elif record['700'] is not None:
            author = record['700']['a']
        elif record['710'] is not None:
            author = record['710']['a']

        # date
        if record['260'] is not None:
            date = record['260']['c']

        # subject
        if record['650'] is not None:
            subject = record['650']['a']

        # oclc number
        if record['035'] is not None:
            if len(record.get_fields('035')[0].get_subfields('a')) > 0:
                oclc = record['035']['a'].replace('(OCoLC)', '')

        # publisher
        if record['260'] is not None:
            publisher = record['260']['b']

        csv_out.writerow([title, author, date, subject, oclc, publisher])
    fd.close()