digitization

IA Torrent

Yesterday, in a meeting of our Digital Initiatives Advisory Group, we were discussing what collections we should consider sending over to the UofT Internet Archive shop, and I asked an innocent newbie question: so, do we have copies of everything we have had the Internet Archive digitize?

NOPE.

No big deal. We're in the infant stages of creating a digital preservation program here, and everything that comes with it. INFRASTRUCTURE!

I knew Tim Ribaric over at Brock University wrote an Internet Archive scraper a while back, so I knew it would be possible to get our content if need be. That, combined with the Internet Archive's announcement a little over a month ago that torrents are now available for items in the Internet Archive, inspired me to whip together a Python script to grab all the torrents for a given collection.

Last night I threw together a little proof-of-concept grabbing the RSS feed on the York University Libraries Internet Archive page using BeautifulSoup and some ugly regex.

This morning, still inspired and brainstorming with Dan Richert, I started poking around for different ways to get at our collection. The Internet Archive's advanced search is super helpful for this, and I can get the results as json! So, no regex; as Dan told me, "if you solve a problem with regex, you now have two problems."

On the advanced search page, you will need your query parameters. You can grab those from the 'All items (most recently added first)' link on a collection page. For example, the York University Libraries collection query parameters:

(collection:yorkuniversity AND format:pdf) AND -mediatype:collection

Then select your desired output format and the number of results: 2608 for me, given the number of items in the collection. You end up with some json like this:

{
   "responseHeader":{
      "status":0,
      "QTime":1,
      "params":{
         "json.wrf":"",
         "qin":"(collection:yorkuniversity AND format:pdf) AND -mediatype:collection",
         "fl":"identifier,title",
         "indent":"",
         "start":"0",
         "q":"( collection:yorkuniversity AND format:pdf ) AND -mediatype:collection;",
         "wt":"json",
         "rows":"5"
      }
   },
   "response":{
      "numFound":2608,
      "start":0,
      "docs":[
         {
            "title":"Saint-Pétersbourg",
            "identifier":"saintptersboyork00rauoft"
         },
         {
            "title":"Revised statutes of Ontario, 1990 = Lois refondues de l'Ontario de 1990",
            "identifier":"v4revisedstat1990ontauoft"
         },
         {
            "title":"Essai philosophique concernant l'entendement humain : ou l'on montre quelle est l'etendue de nos connoissances certaines, et la maniere dont nous y parvenons",
            "identifier":"1714essaiphiloso00lockuoft"
         },
         {
            "title":"Essai philosophique concernant l'entendement humain : où l'on montre quelle est l'étendue de nos connoissances certaines, et la manière dont nous y parvenons",
            "identifier":"1729essaiphiloso00lockuoft"
         },
         {
            "title":"Essai philosophique concernant l'entendement humain, : ou l'on montre quelle est l'etendue de nos connoissances certaines, et la maniere dont nous y parvenons.",
            "identifier":"1735essaiphiloso00lockuoft"
         }
      ]
   }
}
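Pulling the identifiers out of a response like that can be sketched in a few lines of Python. To be clear, this is not the actual ia-torrent.py: the function name torrent_urls is made up for illustration, and the torrent url pattern is my assumption about where the Internet Archive puts an item's torrent, so check it against a real item before relying on it.

```python
import json

# ASSUMPTION: pattern for an item's torrent file; verify against a real item page.
TORRENT_URL = "http://archive.org/download/{id}/{id}_archive.torrent"


def torrent_urls(raw_json):
    """Return (identifier, torrent_url) pairs from an advanced search response."""
    data = json.loads(raw_json)
    docs = data["response"]["docs"]
    return [(doc["identifier"], TORRENT_URL.format(id=doc["identifier"]))
            for doc in docs]


# A trimmed-down copy of the response shown above.
sample = '''{"response": {"numFound": 2608, "start": 0, "docs": [
  {"title": "Saint-Pétersbourg", "identifier": "saintptersboyork00rauoft"},
  {"title": "Revised statutes of Ontario, 1990 = Lois refondues de l'Ontario de 1990",
   "identifier": "v4revisedstat1990ontauoft"}]}}'''

for identifier, url in torrent_urls(sample):
    print(identifier, url)
```

No regex anywhere: the json module hands back plain dictionaries and lists, and the identifiers are right there under response, then docs.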

When you copy the url for the json, make sure you lop off '&callback=callback&save=yes' at the end. Once you have the url, it is pretty straightforward from there. You just call the script like so:

ia-torrent.py 'http://archive.org/advancedsearch.php?q=%28collection%3Ayorkuniversity+AND+format%3Apdf%29+AND+-mediatype%3Acollection&fl%5B%5D=identifier&fl%5B%5D=title&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=2608&page=1&output=json' '/tmp/ia-torrent'
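If you would rather not copy the url out of the browser, something like the following could assemble it for you. Again, a rough sketch and not part of ia-torrent.py: search_url is a hypothetical helper name, and it leaves out the empty sort[] parameters that the advanced search page adds, on the assumption that they are optional.

```python
from urllib.parse import urlencode


def search_url(query, rows, fields=("identifier", "title")):
    """Build an advanced search url that returns json (hypothetical helper)."""
    params = [("q", query)]
    # One fl[] parameter per field we want back in each result.
    params += [("fl[]", field) for field in fields]
    params += [("rows", rows), ("page", 1), ("output", "json")]
    return "http://archive.org/advancedsearch.php?" + urlencode(params)


query = "(collection:yorkuniversity AND format:pdf) AND -mediatype:collection"
url = search_url(query, 2608)
print(url)
```

urlencode takes care of the percent-encoding, so the query can stay readable in the script.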

Caveats! I haven't been able to download all the torrents for an entire collection yet. Looks like Internet Archive's servers don't like the number of requests, and the script dies out with:

'IOError: [Errno socket error] [Errno 111] Connection refused'

I've tried throttling myself in the script at 15 seconds per request, and still get cut off. If anybody knows whether Internet Archive has any published request rates, or has a better idea for implementing this, please let me know! Add a comment, or fork + clone + pull request. Patches are most welcome!
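One thing that might help is retrying with a growing delay instead of a fixed one. This is a generic backoff sketch, not something the script does today and not a confirmed fix, since I don't know what limits the Internet Archive actually enforces. The name fetch_with_backoff and its defaults are made up for illustration.

```python
import time
import urllib.request


def fetch_with_backoff(url, fetch=None, retries=5, base_delay=15, sleep=time.sleep):
    """Fetch url, doubling the wait after each failed attempt (hypothetical helper)."""
    if fetch is None:
        # Default fetcher; a refused connection surfaces as an OSError subclass.
        fetch = lambda u: urllib.request.urlopen(u).read()
    delay = base_delay
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError:
            if attempt == retries - 1:
                raise  # out of retries, let the caller see the error
            sleep(delay)
            delay *= 2  # 15s, 30s, 60s, ...
```

The fetch and sleep arguments are only there so the function can be exercised without hitting the network or actually waiting.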

Big thank you to Dan Richert for the impromptu crash course on parsing json this morning!!!


Digitized books into the IR - workflow

This past week, we started depositing digitized books into our institutional repository instance for The McMaster Collection. As of this posting we have 216 books in the collection. However, these materials are currently only available to the McMaster community. This is completely out of my control, and I agree with what some of you may be thinking: "wait, out-of-copyright books are not available to the general public!?"


MEETINGS ALL DAY - ?_?

Meetings all day. Will everything go better than expected, or will I rage?

Morning:

email - nope, I'm in meetings all day.

Got into work and discovered the contract worker for the giant 25,000 object digitization project started yesterday and nobody told me.

LOOK OF DISAPPROVAL

Checked in with the worker and made sure that she was provided with proper documentation regarding file-naming conventions, scanning requirements, and storage.


Concentration Camp Correspondences

After an entire year of scanning and metadata entry by a couple of amazing students, we have finished a portion of the World War, 1939-1945, German Concentration Camps and Prisons Collection. All of the Concentration Camp Correspondences [http://digitalcollections.mcmaster.ca/concentration-camp-correspondence], 1031 to be exact, are up online with full metadata records.


2009 OLITA Digital Odyssey

I must say that the Digital Odyssey was the best one-day event I have been to. Just a fantastic day with fantastic people talking about awesome projects. It cheered me up and gave me hope in these crap times. The best part of the day had to be Mike Ridley's keynote speech: The Age of Information is over. It is time for the Age of Imagination. It will be the library's job to nurture and foster creativity.

Workshops attended:


OMG! You Don't Need CONTENTdm!!!

So, I bet a lot of you are wondering what is up with my title? Well, I don’t plan on standing up here taking potshots at OCLC for 15 minutes, but I am sure some people in the crowd wouldn’t mind. Basically, the title should have had a very long sub-title, along the lines of Dr. Strangelove or: How I Learned to Stop Worrying and Embrace Open Source Software.

How many people here know what CONTENTdm is? Well, straight from the site, it "is a single software solution that handles the storage, management and delivery of your library’s digital collections to the Web."


digitalcollections.mcmaster.ca released into the wild

Ok, the Digital Collections website is ready for beta testing. Registered users can comment, vote on comments, and tag records, and an updated version of the "bookbag" will be added soon. Collections with content include: Peace and War in the 20th Century, Russell Library, and World War II Concentration Camp Correspondences. At this time, AICT and Kirtas Book Collection are outlines, with content to be added later.
