Friday, July 06, 2007

Screen-Scrapey Goodness

templatemaker, a new Python library by Adrian Holovaty (of Django fame), streamlines the process of screen scraping:

Well, say you want to get the raw data from a bunch of Web pages that use the same template -- like restaurant reviews on Yelp.com, for instance. You can give templatemaker an arbitrary number of HTML files, and it will create the "template" that was used to create those files. ("Template," in this case, means a string with a number of "holes" in it, where the holes represent the parts of the page that change.) Once you've got the template, you can then give it any HTML file that uses that same template, and it will give you the raw data: "The value for hole 1 is 'July 6, 2007', the value for hole 2 is 'blue'," etc.
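
To make the "holes" idea concrete, here is a rough sketch in plain Python. It is not templatemaker's actual API or algorithm: the function names (tokenize, learn_template, extract) and the sample pages are made up for illustration, and it only compares two pages using difflib from the standard library. Anything the two pages share becomes fixed template text, and anything that differs becomes a hole.

```python
import re
from difflib import SequenceMatcher

HOLE = "(.*?)"  # regex group standing in for one "hole" in the template


def tokenize(html):
    """Split a page into tags, words, and runs of whitespace."""
    return re.findall(r"<[^>]+>|[^<\s]+|\s+", html)


def learn_template(page_a, page_b):
    """Diff two pages token by token and compile the result to a regex.

    Tokens common to both pages are kept as literal text; every stretch
    where the pages differ is collapsed into a single hole.
    """
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    pieces = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.append(re.escape("".join(a[i1:i2])))
        elif not pieces or pieces[-1] != HOLE:
            pieces.append(HOLE)
    return re.compile("".join(pieces), re.DOTALL)


def extract(template, page):
    """Return the hole values for a page, or None if it does not fit."""
    match = template.fullmatch(page)
    return match.groups() if match else None


# Hypothetical sample pages that share one layout.
page_a = "<h1>Luigi's</h1><p>Rating: 4</p>"
page_b = "<h1>Taqueria</h1><p>Rating: 2</p>"
template = learn_template(page_a, page_b)

print(extract(template, "<h1>Pho 75</h1><p>Rating: 5</p>"))
```

A real scraper would learn from more than two pages and cope with messier markup, which is the part templatemaker takes care of.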

This is helpful if you want to, say, get the data out of your district's web-based SIS or data warehouse and into something more flexible for analysis at the school level. Just don't tell anyone at the district what you're doing.
