computer science, math, programming and other stuff
a blog by Christopher Swenson

Painful sed script of Wordpress doom

So, I figured out most of the problems, and how to solve them.

Problem one: pasting text from Word stores the document in Windows-1252 encoding.

Problem two: this is stored in MySQL, by default, in the deceptively-similar-to-but-incompatible-with-1252 ISO-8859-1 encoding.

Problem three: Wordpress is smart, and automatically converts 1252 into HTML entities, which hides the problem until you dump the database.

Problem four: UTF-8 is freely intermingled with 1252 text in the entries.

Problem five: Windows does really weird things to encode non-1252 characters (like smart quotes) in a 1252 character stream.

Problem six: Some piece of software went wrong at some point (probably PHP or phpMyAdmin) , and converted the 1252, non-1252 weird smart quotes, ISO-8859-1, and UTF-8 codes to UTF-8 (by converting the byte literal values instead of the underlying characters, which for UTF-8 codes means that you have to decode twice to get the real answer), but MySQL keeps the text marked as ISO-8859-1.

Problem seven: you are me, and you have 2,300 of these invalid characters in a MySQL dump that needs to be imported.

Problem eight: Mac OS X, by default, uses a sub-standard version of sed that does not support replacement of literal values ("\xC2", for example), making it impossible to replace any non-ASCII characters.

Solution: install sed from Fink, and run the following sed script on your MySQL dump:

s/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/g
s/\xc3\xa2\xe2\x82\xac\xe2\x84\xa2/\’/g
s/\xc3\xa2\xe2\x82\xac\xe2\x80\x9d/\—/g
s/\xE2\x80\x9C/\–/g
s/\xc3\xa2\xe2\x82\xac\xc2\xa2/\•/g
s/\xC3\xA2\xE2\x82\xAC\xC5\x93/\“/g
s/\xC3\xA2\xE2\x82\xAC\xC2\x9D/\”/g
s/\xC3\xA2\xE2\x82\xAC\xC2\x9C/\‘/g
s/\xC3\xA2\xE2\x82\xAC/\”/g
s/\xC3\x85\xE2\x80\x99/\‘/g
s/\xc3\x82\xc2\xba/\°/g
s/\xC3\x82\xC2\xA9/\©/g
s/\xc3\xa2\xC2\xA6/\…/g
s/\xC3\x82\xC2\xA3/\£/g
s/\xC3\x83\xC2\xA3/\á/g
s/\xc3\x83\xc2\xa1/\á/g
s/\xC3\x83\xC2\xA8/\è/g
s/\xC3\x83\xC2\xA9/\é/g
s/\xC3\x83\xC2\xA7/\ç/g
s/\xc3\x82\xc2\xae/\®/g
s/\xc3\x83\xc2\x81/\Á/g
s/\xc3\x83\xc2\xab/\ë/g
s/\xc3\x82\xc2\xbd/\½/g
s/\xc3\x82\xc2\xbc/\¼/g
s/\xc3\xa2\xe2\x80\x9e\xc2\xa2/\™/g
s/\xc2\xbf//g
s/\xc2\xa0/\ /g
s/\xc3\x82\xc2\xb9/\¹/g
s/\xc3\x82\xc2\xb0/\°/g
s/\xc3\x82\xc2\xab/\«/g
s/\xc3\x83\xc2\xb6/\ö/g

This removes all but 30 of my invalid characters, and those remaining ones appear to be pretty much broke (viewing the original articles shows garbage, and there is no logical character that would fit in those places). Good enough.