bdunagan
fill the void

Posted
29 September 2011 @ 12pm

Tagged
random

7 Comments

Converting MySQL from LATIN1 to UTF8

MySQL defaults to “latin1″ as its character set, but at some point, most people want to migrate to “utf8″. I realize that there are dozens of posts about how people handled this, and yet, not a single one of those worked completely for me.

I wanted MySQL to use “utf8″ for the character set and “utf8_unicode_ci” (not “utf8_general_ci”) for the collation, and I wanted it all to work on RDS with ActiveRecord on Rails and Rack.

Here are my steps:

  1. use mysqldump to extract the old data as “latin1″
  2. use sed to replace “latin1″ with “utf8″ in the dump file
  3. create the new database with the right parameters: character set utf8 collate utf8_unicode_ci
  4. use mysql --default-character-set=utf8 to pipe the converted dump into the new database

Here is my code:

# Dump the old database as latin1, because ironically, mysqldump defaults to utf8.
mysqldump --default-character-set=latin1 db > db.dump

# If you need to convert a MySQL dump from one character set to another, use iconv.
iconv -f LATIN1 -t UTF-8 < db.dump > db.dump

# If you've been running mysqldump without parameters on a latin1 instance, you can convert the dump from UTF8 to latin1 to correct it.
iconv -f UTF-8 -t LATIN1 < db.dump > db.dump

# Rewrite the dump to say 'utf8' and 'utf8_unicode_ci' in all the right places.
sed -e 's/SET NAMES latin1/SET NAMES utf8/g' -i db.dump
sed -e 's/CHARSET=latin1/CHARSET=utf8 COLLATE=utf8_unicode_ci/g' -i db.dump

# Create a new database with the correct parameters.
create database db character set utf8 collate utf8_unicode_ci;
# Verify it.
show create database db;

# Pipe the converted database dump into MySQL.
mysql -h hostname --default-character-set=utf8 -u root -p db < db.dump

To verify the character set and collation, you can always query the MySQL variables:

show variables like 'collation%';
show variables like 'character%';

Amazon RDS

For those using Amazon’s AWS RDS for their MySQL instance, you have to create a parameter group with “utf8″ values. I’d guess you could just modify the current parameter group then apply it, but I haven’t verified that.

# Create a parameter group.
rds-create-db-parameter-group utf8 -e mysql5.1 -d utf8

# Modify the parameter group's values
rds-modify-db-parameter-group utf8 \
    --parameters="name=character_set_server, value=utf8, method=immediate" \
    --parameters="name=character_set_client, value=utf8, method=immediate" \
    --parameters="name=character_set_results,value=utf8,method=immediate" \
    --parameters="name=collation_server, value=utf8_unicode_ci, method=immediate" \
    --parameters="name=collation_connection, value=utf8_unicode_ci, method=immediate"

# Check the parameter group's values.
rds-describe-db-parameters utf8 --source=User

# Push this new parameter group to your instance.
rds-modify-db-instance rds-db --db-parameter-group-name utf8

# Reboot the instance (necessary: http://aws.amazon.com/articles/Amazon-RDS/2935).
rds-reboot-db-instance rds-db

ActiveRecord in Rails and Rack

ActiveRecord supports an :encoding option in its parameters for ActiveRecord::Base.establish_connection. The option tells the connection to execute SET NAMES as soon as the connection is established, thereby telling the server what the character set the client wants.

However, I also wanted to specify the collation. When I added :encoding to ActiveRecord::Base.establish_connection, collation_connection (from the MySQL variables, not connection.collation) remained as “utf8_general_ci”. Some people have indicated that you can specify :collation in database.yml for the establish_connection call, but that never worked for me. I think MySQL bug #34980 prevented it. Others indicated that you can simply add ActiveRecord::Base.connection.execute("set collation_connection='utf8_unicode_ci'") at the bottom of environment.rb for Rails; that also never worked for me. To specify the collation in Rails, I used a before_filter in application_controller.rb. See my code below.

# Rack: add a statement right after establishing the connection.
ActiveRecord::Base.establish_connection(
  :adapter  => "mysql",
  :host     => "host",
  :username => "username",
  :password => "password",
  :database => "database",
  :encoding => "utf8",
  :reconnect => true
)
ActiveRecord::Base.connection.execute("SET collation_connection='utf8_unicode_ci'");

# Rails: add a before_filter in application_controller.rb
before_filter :set_database_collation
def set_database_collation
  ActiveRecord::Base.connection.execute("set collation_connection='utf8_unicode_ci'")
end

Pointers

I pieced my steps together from the following helpful links:


7 Comments

Posted by
Jeff Rafter
5 October 2011 @ 10pm

Hi it looks like the iconv line is off, but in many cases you don’t need to run the iconv. depending on how you run the export.


Posted by
Jeff Rafter
5 October 2011 @ 10pm

But fwiw, this was SUPER helpful. Thanks so much for posting it!


Posted by
bdunagan
6 October 2011 @ 10am

@Jeff True about iconv, but it took me a bit to realize I didn’t need it. Glad it helped!


Posted by
Paul
22 May 2012 @ 2pm

Great article! Did you have to restart the MySQL server after you changed the database to UTF-8?


Posted by
bdunagan
23 May 2012 @ 9am

Glad it helped! No need to restart the server after the switch.


Posted by
Laurence
21 June 2012 @ 8am

This worked perfect, thanks!


Posted by
Rob Blake
12 October 2012 @ 9am

Perfect. This is exactly what I was looking for. Thanks for sharing.


Leave a Comment