converting mediawiki mysql database from latin1 to utf8


Sometime after upgrading our copy of MediaWiki from the antique version I'd been stuck running to the shiny latest version, I noticed (well, some of the wiki users noticed first…) that there were some borked characters - accented French characters, Mandarin characters, and fancy schmancy "smart quotes" were displaying as gobbledygook. Smelled like a UTF8-related issue - IIRC, MediaWiki switched the front end of the web app to UTF8, but my database was languishing behind in krufty latin1 encoding. Oops.
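
If you want to confirm the suspicion before touching anything, asking MySQL directly shows which character set each table thinks it's using. A minimal check, assuming MySQL 5.0+ (where information_schema is available) and the same placeholder credentials the script below uses:

/usr/local/mysql/bin/mysql --host=localhost --user=source_database_username --password=password \
	-e "SELECT table_name, table_collation FROM information_schema.tables WHERE table_schema = 'source_database_name';"

If the collations come back as latin1_swedish_ci (MySQL's default) while the wiki front end is serving UTF8, that's the mismatch.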

So, how to convert a database that's approaching a gigabyte of data? I started googling. Found some hints. But none of the solutions I found actually worked for me. So I googled some more, and duct-tape-and-bubblegummed a script together that seems to have successfully converted the database from latin1 to UTF8.

Here's what I did, in case I need to do it again...

#!/bin/bash

# general config stuff
mysql_path="/usr/local/mysql/bin"

# source database config
# change this to point to the database server that is currently hosting the database.
source_host="localhost"
source_db="source_database_name" # mine was mediawiki
source_user="source_database_username"
source_pw="password" #change this. duh.
temp_sql_dir="/path/to/a/directory"

# destination database config
# change this to point to the database server that will host the converted database
# this could be the same server as above, but use a different database on it.
dest_host="localhost"
dest_db="dest_database_name" # mine was mediawiki-utf8
dest_user="dest_database_username"
dest_pw="password"

# magic happens
clear
for table in `$mysql_path/mysql --host=$source_host --user=$source_user --password=$source_pw $source_db -e 'show tables' | egrep -v 'Tables_in_' `; do
	echo "Dumping $table"
	
	# dump the table from the first database
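	# (--extended-insert=false writes one INSERT per row, and --default-character-set=latin1
	# keeps the connection charset matched to the latin1 columns so the bytes come out untouched)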
	$mysql_path/mysqldump --host=$source_host --user=$source_user --password=$source_pw --extended-insert=false --quote-names --default-character-set=latin1 $source_db $table > $temp_sql_dir/$table.sql

	# convert the charset declarations from latin1 to utf8
	# sed seems to bork mysteriously, and the mysql 'replace' command borks the files, so I settled on perl...
	# you could add other transformations here, too...
	perl -pi -w -e 's/latin1/utf8/g;' $temp_sql_dir/$table.sql
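	# (note: the substitution runs over the whole dump file, so any literal "latin1" in the
	# page text itself gets rewritten too, not just the charset declarations)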
	
	# import the converted table data into a fresh database table
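	# (--default-character-set=utf8 makes the server read the incoming bytes as UTF-8, so the
	# data is relabeled rather than re-encoded)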
	$mysql_path/mysql --host=$dest_host --user=$dest_user --password=$dest_pw --default-character-set=utf8 $dest_db < $temp_sql_dir/$table.sql
done

# or not.
if [ "$table" = "" ]; then
	echo "No tables found in db: $db"
fi
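
A quick sanity check afterwards is to compare row counts between the old and new copies, to make sure nothing got silently dropped - something along these lines, reusing the variables from the script above:

# compare per-table row counts between the source and destination databases
for table in `$mysql_path/mysql --host=$source_host --user=$source_user --password=$source_pw $source_db -e 'show tables' | egrep -v 'Tables_in_' `; do
	src_rows=`$mysql_path/mysql --host=$source_host --user=$source_user --password=$source_pw $source_db -e "select count(*) from $table" | tail -1`
	dest_rows=`$mysql_path/mysql --host=$dest_host --user=$dest_user --password=$dest_pw $dest_db -e "select count(*) from $table" | tail -1`
	echo "$table: $src_rows -> $dest_rows"
done

Once the counts line up, the last step is pointing MediaWiki at the converted copy by changing $wgDBname (and the matching user/password settings) in LocalSettings.php.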

It seems to have worked for me: it took maybe 5 minutes to convert almost a gig of data. It borked on one table - categorylinks failed because of some problem with the key - so I just copied that one over manually after the script finished.
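
For what it's worth, a common cause of that kind of key failure is that utf8 columns in MySQL reserve up to 3 bytes per character, so an index that fit under the key-length limit in latin1 can blow past it once the columns are declared utf8 (the classic "Specified key was too long" error). One way around it - a sketch only, with made-up key and column names, so substitute whatever the error message actually complains about - is to shorten the offending index to a prefix in that table's dump and re-run the single-table import:

# shorten the offending index to a prefix, then retry the import for just that table
# (the key/column names here are examples only - use the ones from the actual error)
perl -pi -w -e 's/KEY `cl_sortkey` \(`cl_to`,`cl_sortkey`\)/KEY `cl_sortkey` (`cl_to`,`cl_sortkey`(100))/;' $temp_sql_dir/categorylinks.sql
$mysql_path/mysql --host=$dest_host --user=$dest_user --password=$dest_pw --default-character-set=utf8 $dest_db < $temp_sql_dir/categorylinks.sql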

It is highly probable that there is a better way to do this. Perhaps even some magic bit that could have been twiddled to do this automagically and/or instantly on the server. I couldn't find the proper lever to throw, so wound up trying brute force. When in doubt, try brute force. It worked for me. It might not work for you. You've been warned.
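
For the record, the server-side lever that does exist is ALTER TABLE ... CONVERT TO CHARACTER SET, but it's not the same trick: it re-encodes the column data on the assumption that the bytes really are latin1, so if the bytes are actually UTF8 that was merely sitting in latin1 columns (which is the situation the dump-and-relabel approach above handles), it will double-encode them into even worse garbage. Roughly, per table - shown only so you know what to be careful with:

# NOT equivalent to the script above: this re-encodes the data as if it were genuine latin1
# ("some_table" is a placeholder)
$mysql_path/mysql --host=$source_host --user=$source_user --password=$source_pw $source_db \
	-e 'ALTER TABLE some_table CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci'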

