Yesterday in a meeting for our Digital Initiatives Advisory Group we were discussing what collections we should consider sending over to the UofT Internet Archive shop, and I asked an innocent newbie question - So, do we have copies of everything we have had the Internet Archive digitize?
No big deal. We're in the infant stages of creating a digital preservation program here, and everything that comes with it. INFRASTRUCTURE!
I knew Tim Ribaric over at Brock University wrote an Internet Archive scraper a while back, so I knew it would be possible to get our content if need be. That, combined with the Internet Archive's announcement a little over a month ago that torrents are now available for items in the Internet Archive, inspired me to whip together a Python script to grab all the torrents for a given collection.
Last night I threw together a little proof-of-concept grabbing the RSS feed on the York University Libraries Internet Archive page using BeautifulSoup and some ugly regex.
This morning, still inspired and brainstorming with Dan Richert, I started poking around for different ways to get at our collection. The Internet Archive's advanced search is super helpful for this, and I can get the results as json! So, no regex; as Dan told me, "if you solve a problem with regex, you now have two problems."
On the advanced search page, you will need your query parameters. You can grab those from the 'All items (most recently added first)' link on a collection page. For example, here are the query parameters for the York University Libraries collection:
(collection:yorkuniversity AND format:pdf) AND -mediatype:collection
Then select your desired output format and the number of results (2608 for me, given the number of items in the collection). You end up with some json like this:
"qin":"(collection:yorkuniversity AND format:pdf) AND -mediatype:collection",
"q":"( collection:yorkuniversity AND format:pdf ) AND -mediatype:collection;",
"title":"Revised statutes of Ontario, 1990 = Lois refondues de l'Ontario de 1990",
"title":"Essai philosophique concernant l'entendement humain : ou l'on montre quelle est l'etendue de nos connoissances certaines, et la maniere dont nous y parvenons",
"title":"Essai philosophique concernant l'entendement humain : où l'on montre quelle est l'étendue de nos connoissances certaines, et la manière dont nous y parvenons",
"title":"Essai philosophique concernant l'entendement humain, : ou l'on montre quelle est l'etendue de nos connoissances certaines, et la maniere dont nous y parvenons.",
(Make sure you lop off '&callback=callback&save=yes' from the end of the url.)
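If you would rather not copy the url out of the browser, the same query string can be put together in code. This is only a minimal sketch in Python 3: build_search_url is a hypothetical helper (it is not part of ia-torrent.py), the endpoint and parameter names are taken from the example url in this post, and the empty sort[] parameters are left out on the assumption that they are optional.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the example url in this post.
SEARCH_ENDPOINT = "http://archive.org/advancedsearch.php"


def build_search_url(query, rows, fields=("identifier", "title")):
    """Return an advanced search url that asks for plain json.

    Hypothetical helper for illustration; not part of ia-torrent.py.
    """
    params = [("q", query)]
    # Each requested field is sent as a repeated fl[] parameter.
    params += [("fl[]", field) for field in fields]
    params += [("rows", rows), ("page", 1), ("output", "json")]
    # No callback or save parameters are added, so there is nothing to lop off.
    return SEARCH_ENDPOINT + "?" + urlencode(params)


url = build_search_url(
    "(collection:yorkuniversity AND format:pdf) AND -mediatype:collection",
    rows=2608,
)
print(url)
```

Swap in your own collection name and item count for the query and rows arguments.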
Once you have the url for the json, it is pretty straightforward from there. You just call the script like so:
ia-torrent.py 'http://archive.org/advancedsearch.php?q=%28collection%3Ayorkuniversity+AND+format%3Apdf%29+AND+-mediatype%3Acollection&fl%5B%5D=identifier&fl%5B%5D=title&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=2608&page=1&output=json' '/tmp/ia-torrent'
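For a rough idea of what happens between the json and the torrents, here is a small Python 3 sketch. It is not the actual ia-torrent.py. It rests on two assumptions: that the search results keep their items in a "docs" list under "response", and that each item's torrent lives at archive.org/download/<identifier>/<identifier>_archive.torrent. The function name and the sample identifiers are made up for illustration.

```python
import json


def torrent_urls(search_results):
    """Map parsed search results to one torrent url per item.

    Assumes items sit in search_results["response"]["docs"] and that
    torrents follow the <identifier>_archive.torrent naming pattern.
    """
    docs = search_results["response"]["docs"]
    return [
        "http://archive.org/download/{0}/{0}_archive.torrent".format(doc["identifier"])
        for doc in docs
    ]


# Made-up sample standing in for the json returned by the search url.
sample = json.loads("""
{"response": {"docs": [
    {"identifier": "example-item-one", "title": "Example title one"},
    {"identifier": "example-item-two", "title": "Example title two"}
]}}
""")

for link in torrent_urls(sample):
    print(link)
```

From there, each url can be fetched and written into the output directory given as the second argument.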
Caveats! I haven't been able to download all the torrents for an entire collection yet. Looks like Internet Archive's servers don't like the number of requests, and the script dies out with:
'IOError: [Errno socket error] [Errno 111] Connection refused'
I've tried throttling myself in the script at 15 seconds per request, and still get cut off. If anybody knows whether the Internet Archive has any published request rate limits, or has a better idea for implementing this, please let me know! Add a comment, or fork + clone + pull request. Patches are most welcome!
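One idea that might help, untested against the Internet Archive's servers: instead of a fixed pause, retry a refused request with a delay that doubles each time. The sketch below is Python 3, where IOError is an alias of OSError and a refused connection raises a subclass of it. call_with_backoff and its parameters are hypothetical names, and the 15 second starting delay is simply carried over from the throttle mentioned above, not a known limit.

```python
import time


def call_with_backoff(action, max_tries=5, base_delay=15, sleep=time.sleep):
    """Call action(), retrying on OSError with exponentially growing pauses.

    Hypothetical helper: waits base_delay, then 2x, 4x, ... seconds
    between attempts, and re-raises the error after max_tries failures.
    """
    for attempt in range(max_tries):
        try:
            return action()
        except OSError:
            # Out of attempts: let the caller see the original error.
            if attempt == max_tries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The download for each torrent would be wrapped in a small function and passed in as action. The sleep argument is there so the waiting can be swapped out when testing.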
Big thank you to Dan Richert for the impromptu crash course on parsing json this morning!!!