Converting MySQL from LATIN1 to UTF8
MySQL defaults to “latin1″ as its character set, but at some point, most people want to migrate to “utf8″. I realize that there are dozens of posts about how people handled this, and yet, not a single one of those worked completely for me.
I wanted MySQL to use “utf8″ for the character set and “utf8_unicode_ci” (not “utf8_general_ci”) for the collation, and I wanted it all to work on RDS with ActiveRecord on Rails and Rack.
Here are my steps:
- use mysqldump to extract the old data as “latin1″
- use sed to replace “latin1″ with “utf8″ in the dump file
- create the new database with the right parameters: character set utf8 collate utf8_unicode_ci
- use mysql --default-character-set=utf8 to pipe the converted dump into the new database
Here is my code:
# Dump the old database as latin1, because ironically, mysqldump defaults to utf8. mysqldump --default-character-set=latin1 db > db.dump # If you need to convert a MySQL dump from one character set to another, use iconv. iconv -f LATIN1 -t UTF-8 < db.dump > db.dump # If you've been running mysqldump without parameters on a latin1 instance, you can convert the dump from UTF8 to latin1 to correct it. iconv -f UTF-8 -t LATIN1 < db.dump > db.dump # Rewrite the dump to say 'utf8' and 'utf8_unicode_ci' in all the right places. sed -e 's/SET NAMES latin1/SET NAMES utf8/g' -i db.dump sed -e 's/CHARSET=latin1/CHARSET=utf8 COLLATE=utf8_unicode_ci/g' -i db.dump # Create a new database with the correct parameters. create database db character set utf8 collate utf8_unicode_ci; # Verify it. show create database db; # Pipe the converted database dump into MySQL. mysql -h hostname --default-character-set=utf8 -u root -p db < db.dump
To verify the character set and collation, you can always query the MySQL variables:
show variables like 'collation%'; show variables like 'character%';
For those using Amazon’s AWS RDS for their MySQL instance, you have to create a parameter group with “utf8″ values. I’d guess you could just modify the current parameter group then apply it, but I haven’t verified that.
# Create a parameter group. rds-create-db-parameter-group utf8 -e mysql5.1 -d utf8 # Modify the parameter group's values rds-modify-db-parameter-group utf8 \ --parameters="name=character_set_server, value=utf8, method=immediate" \ --parameters="name=character_set_client, value=utf8, method=immediate" \ --parameters="name=character_set_results,value=utf8,method=immediate" \ --parameters="name=collation_server, value=utf8_unicode_ci, method=immediate" \ --parameters="name=collation_connection, value=utf8_unicode_ci, method=immediate" # Check the parameter group's values. rds-describe-db-parameters utf8 --source=User # Push this new parameter group to your instance. rds-modify-db-instance rds-db --db-parameter-group-name utf8 # Reboot the instance (necessary: http://aws.amazon.com/articles/Amazon-RDS/2935). rds-reboot-db-instance rds-db
ActiveRecord in Rails and Rack
ActiveRecord supports an :encoding option in its parameters for ActiveRecord::Base.establish_connection. The option tells the connection to execute SET NAMES as soon as the connection is established, thereby telling the server what the character set the client wants.
However, I also wanted to specify the collation. When I added :encoding to ActiveRecord::Base.establish_connection, collation_connection (from the MySQL variables, not connection.collation) remained as “utf8_general_ci”. Some people have indicated that you can specify :collation in database.yml for the establish_connection call, but that never worked for me. I think MySQL bug #34980 prevented it. Others indicated that you can simply add ActiveRecord::Base.connection.execute("set collation_connection='utf8_unicode_ci'") at the bottom of environment.rb for Rails; that also never worked for me. To specify the collation in Rails, I used a before_filter in application_controller.rb. See my code below.
# Rack: add a statement right after establishing the connection. ActiveRecord::Base.establish_connection( :adapter => "mysql", :host => "host", :username => "username", :password => "password", :database => "database", :encoding => "utf8", :reconnect => true ) ActiveRecord::Base.connection.execute("SET collation_connection='utf8_unicode_ci'"); # Rails: add a before_filter in application_controller.rb before_filter :set_database_collation def set_database_collation ActiveRecord::Base.connection.execute("set collation_connection='utf8_unicode_ci'") end
I pieced my steps together from the following helpful links:
- UTF8 on RDS
- Official AWS steps
- Derek Sivers converting CD Baby
- RDS Parameter group
- UTF8 through the stack
- UTF8 with Rails
- Character Hell in MySQL
- MySQL character set