python

iaTorrent update OR Learning by reading code

Last week, inspired from a meeting, I started tossing together a little python program to solve a problem. It wasn't perfect. It was warty. But I think I have something worthwhile now. Or, at least useful for me -- It gives you what you want, and writes to a log when something goes wrong.

What I really want to do here, is just take a moment to sing the praises of learning by reading code. Heading into this little project, I had a basic idea of what I wanted to do, and I knew something like this could be done given Tim's project. I knew that I wanted to make a this a module, and set it up on PyPI, but I had really know idea how to do so. But! I knew of somebody who did, and is quite proflic in my mind. Ed making his code available on Github (and using very open licenses) made it possible for me to learn how to build the structure for a Python module, the structure for writing tests, and using argparse/optparse correctly.

So, here is to learning by reading code!

»

IA Torrent

Yesterday in a meeting for our Digital Initiatives Advisory Group we were discussing what collections we should consider sending over to the UofT Internet Archive shop, and I asked an innocent newbie question - So, do we have copies of everything we have had the Internet Archive digitize?

NOPE.

No big deal. We're in the infant stages of creating a digital preservation program here, and everything that comes with it. INFRASTRUCTURE!

I knew Tim Ribaric over at Brock University wrote an Internet Archive scraper a while back, so I knew it would be possible to get our content if need be. Knowing that combined with the Internet Archive announcement a little over a month ago about making available torrents for items in the Internet Archive, it inspired me to whip together a Python script to grab all the torrents for a given collection.

Last night I threw together a little proof-of-concept grabbing the RSS feed on the York University Libraries Internet Archive page using BeautifulSoup and some ugly regex.

This morning, still inspired and brainstorming with Dan Richert, I started poking around for different ways to get at our collection. The Internet Archive's advanced search is super helpful for this, and I can get the results as json! So, no regex; as Dan told me, "if you solve a problem with regex, you now have two problems."

On the advanced search page, you will need your query parameters. You can grab those from the 'All items (most recently added first) link on a collection page. For example, the York University Libraries collection query parameters:

(collection:yorkuniversity AND format:pdf) AND -mediatype:collection'

Then selected your desired output format, and number of results. 2608 for me given the number of items in the collection. Then you end up with some json like this:

{
   "responseHeader":{
      "status":0,
      "QTime":1,
      "params":{
         "json.wrf":"",
         "qin":"(collection:yorkuniversity AND format:pdf) AND -mediatype:collection",
         "fl":"identifier,title",
         "indent":"",
         "start":"0",
         "q":"( collection:yorkuniversity AND format:pdf ) AND -mediatype:collection;",
         "wt":"json",
         "rows":"5"
      }
   },
   "response":{
      "numFound":2608,
      "start":0,
      "docs":[
         {
            "title":"Saint-Pétersbourg",
            "identifier":"saintptersboyork00rauoft"
         },
         {
            "title":"Revised statutes of Ontario, 1990 = Lois refondues de l'Ontario de 1990",
            "identifier":"v4revisedstat1990ontauoft"
         },
         {
            "title":"Essai philosophique concernant l'entendement humain : ou l'on montre quelle est l'etendue de nos connoissances certaines, et la maniere dont nous y parvenons",
            "identifier":"1714essaiphiloso00lockuoft"
         },
         {
            "title":"Essai philosophique concernant l'entendement humain : où l'on montre quelle est l'étendue de nos connoissances certaines, et la manière dont nous y parvenons",
            "identifier":"1729essaiphiloso00lockuoft"
         },
         {
            "title":"Essai philosophique concernant l'entendement humain, : ou l'on montre quelle est l'etendue de nos connoissances certaines, et la maniere dont nous y parvenons.",
            "identifier":"1735essaiphiloso00lockuoft"
         }
      ]
   }
}

(make sure you lop off '&callback=callback&save=yes' at the end of the url). Once you have the url for the json, it is pretty straightforward from there. You just call the script like so:

ia-torrent.py 'http://archive.org/advancedsearch.php?q=%28collection%3Ayorkuniversity+AND+format%3Apdf%29+AND+-mediatype%3Acollection&fl%5B%5D=identifier&fl%5B%5D=title&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=2608&page=1&output=json' '/tmp/ia-torrent'

Caveats! I haven't been able to download all the torrents for an entire collection yet. Looks like Internet Archive's servers don't like the number of requests, and the script dies out with:

'IOError: [Errno socket error] [Errno 111] Connection refused'

I've tried throttling myself in the script at 15 seconds per request, and still get cut off. If anybody knows if Internet Archive has any published request rates, or has a better idea in implementing this, please let me know! Add a comment, or fork + clone + pull request. Patches are most welcome!

Big thank you to Dan Richert for the impromptu crash course on parsing json this morning!!!

»

Digitized books into the IR - workflow

This past week, we started depositing digitized books into our institutional repository instance for The McMaster Collection. As of this posting we have 216 books in the collection. However, currently these materials are only available to the McMaster community. This is completely out of my control, and I agree what some of you may be thinking, "wait, out of copyright books are not available to the general public!?"

»
Syndicate content