Sunday, March 30, 2008

Benchmarking HTML/XML Parsing in Python

If you do any HTML or XML wrangling in Python, you should check out Ian Bicking's comparison of the performance of various libraries. His methodology seems a bit loose, but the differences in performance are wide enough that the results are clear-cut:

So in conclusion: lxml kicks ass. You can use it in ways you couldn’t use other systems. You can parse, serialize, parse, serialize, and repeat the process a couple times with your HTML before the performance will hurt you. With high-level constructs many constructs can happen in very fast C code without calling out to Python. As an example, if you do an XPath query, the query string is compiled into something native and traverses the native libxml2 objects, only creating Python objects to wrap the query results. In addition, things like the modest memory use make me more confident that lxml will act reliably even under unexpected load.
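
To make the XPath and parse/serialize points concrete, here is a minimal sketch (mine, not code from Ian's benchmark). The sample markup and the helper names `round_trip` and `link_targets` are made up for illustration. It assumes lxml is installed and falls back to the standard library's ElementTree, which only supports a subset of XPath, if it isn't.

```python
try:
    from lxml import etree  # third-party; wraps libxml2
    HAVE_LXML = True
except ImportError:
    # Standard-library fallback with a similar API but only a subset of XPath.
    import xml.etree.ElementTree as etree
    HAVE_LXML = False

# Made-up sample document, kept well-formed so either parser accepts it.
DOC = (
    "<html><body><p>Hello</p>"
    "<a href='/one'>One</a><a href='/two'>Two</a><a>no link</a>"
    "</body></html>"
)


def round_trip(markup, times=3):
    """Parse, then serialize and re-parse the tree `times` times."""
    root = etree.fromstring(markup)
    for _ in range(times):
        root = etree.fromstring(etree.tostring(root))
    return root


def link_targets(root):
    """Return the href values of all <a> elements that have one."""
    if HAVE_LXML:
        # The query runs inside libxml2; only the results are wrapped
        # as Python objects (string-like values for attribute nodes).
        return [str(href) for href in root.xpath("//a/@href")]
    # ElementTree cannot select attributes directly, so select elements.
    return [a.get("href") for a in root.findall(".//a[@href]")]


root = round_trip(DOC)
hrefs = link_targets(root)
print(hrefs)  # ['/one', '/two']
```

With lxml, the `xpath()` call is the part that stays in C; the Python list comprehension only touches the handful of results that come back.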

lxml is also really easy to use and very Pythonic. The only drawback is that it can be a headache to install on Macs.
