Tuesday, February 13, 2007

Ersatz Data Warehousing for Schools with Python: Part 2, Data Dump

A Richer Picture digital portfolio system will allow an administrator to create a full export of the system's MySQL database. In general terms, this is an excellent practice; the school can always dump their data and create their own complete version of the database locally, in MySQL or any other relational database. In practice, this process can be a little bumpy.

What you get is a zip file with a bunch of comma-delimited text files, one for each table, and an SQL file. I was hoping this file of SQL commands would automatically recreate the database. Unfortunately, it doesn't. I'm not entirely sure what it does, but generally it seems designed to export the data rather than import it. Regardless, it does serve as a good ersatz schema, and turning it into a import script mostly just requires changing the WRITE statements to LOAD statements. Even then, I ran into a string of problems trying to deal with the commas and line endings contained within the text of individual database entries. I think they need to be more explicit about escaping the line endings in the export, although I've found the relevant MySQL docs somewhat vague. I've currently wrangled it into a position where I think I'm importing all the records, but the comments containing line endings are getting chopped. I've gone back and forth with David about this on email a bit, and hopefully we'll get this completely resolved, and I'll post the full script.

If I was more experienced with relational databases, this would be old hat, I suppose. These problems come from underspecified nature of the CSV data format. Unlike, say, XML, there is no one right way to handle character escaping and the like (not that everyone gets it right in XML, but at least you can definitively say they're wrong, for what that's worth). You can export XML from MySQL, and I'd prefer it, even in a really braindead dialect. In theory, the right way to do this would be to use some of the IMS XML specs, but given that I have no tools that make processing data in that XML format easier than in any other, it doesn't give an significant practical advantage at this point (and SIF is a whole message passing system, which would be great to have in this case, but getting it running on either end is a big job).

Anyhow, I've got most of the raw data in my own MySQL database. Hopefully the next steps will be more interesting and enlightening.

No comments: