3 steps to fix encoding problems in Ruby

You only really think about a string’s encoding when it breaks. You check your exception tracker and see

Encoding::InvalidByteSequenceError: "\xFE" on UTF-8

staring you in the face. Or maybe “they’re” starts showing up as “theyâ€™re”.

So, when you have a bad encoding, how do you figure out what broke? And how can you fix it?

What is an encoding?

If you can picture what an encoding does to a string, these bugs are easier to fix.

You can think of a string as an array of bytes, or small numbers:

irb(main):001:0> "hello!".bytes
=> [104, 101, 108, 108, 111, 33]

In this encoding (UTF-8, which matches ASCII for these characters), 104 means h, 33 means !, and so on.

It gets trickier when you use characters that are less common in English:

irb(main):002:0> "hellṏ!".bytes
=> [104, 101, 108, 108, 225, 185, 143, 33]

Now it’s harder to tell which number represents which character. Instead of one byte, ṏ is represented by the group of bytes [225, 185, 143]. But there’s still a relationship between bytes and characters. And a string’s encoding defines that relationship.
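As a rough illustration of that relationship, here is a small sketch that pairs each character with the bytes that represent it. The variable names are made up for this example, and the \u escape is used only so the snippet doesn't depend on the file's source encoding:

```ruby
# "hell\u1E4F!" is the same string as the one in the irb example above.
str = "hell\u1E4F!"

puts str.length    # => 6 (characters)
puts str.bytesize  # => 8 (bytes)

# Pair each character with the bytes that represent it in UTF-8.
byte_groups = str.each_char.map { |char| [char, char.bytes] }
byte_groups.each { |char, bytes| puts "#{char.inspect} => #{bytes.inspect}" }
```

Every character except the accented one maps to a single byte, which is why the two counts differ by two.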

Take a look at what a single set of bytes looks like when you try different encodings:

# Try an ISO-8859-1 string with a special character!
irb(main):003:0> str = "hellÔ!".encode("ISO-8859-1"); str.encode("UTF-8")
=> "hellÔ!"

irb(main):004:0> str.bytes
=> [104, 101, 108, 108, 212, 33]

# What would that string look like interpreted as ISO-8859-5 instead?
irb(main):005:0> str.force_encoding("ISO-8859-5"); str.encode("UTF-8")
=> "hellд!"

irb(main):006:0> str.bytes
=> [104, 101, 108, 108, 212, 33]

The bytes didn’t change. But that doesn’t look right at all. Changing the encoding changed how the string printed, without changing the bytes.
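The garbled “theyâ€™re” from the introduction comes from this same kind of mislabeling. Here is a small sketch, assuming the text started out as UTF-8 and was then read as Windows-1252 (a common combination, but still an assumption here):

```ruby
original = "they\u2019re"  # "they're" with a curly apostrophe, in UTF-8

# Relabel the UTF-8 bytes as Windows-1252 (bytes unchanged), then
# translate that misreading back to UTF-8 so it can be displayed.
garbled = original.dup.force_encoding("Windows-1252").encode("UTF-8")
puts garbled

# Undo it: translate back to Windows-1252 to recover the original
# bytes, then label them with the encoding they were really in.
repaired = garbled.encode("Windows-1252").force_encoding("UTF-8")
puts repaired == original  # => true
```

The round trip only works because every garbled character still exists in Windows-1252. Once replacement characters have crept in, the original bytes are gone.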

And not all strings can be represented in all encodings:

irb(main):061:0> "hi∑".encode("Windows-1252")
Encoding::UndefinedConversionError: U+2211 to WINDOWS-1252 in conversion from UTF-8 to WINDOWS-1252
	from (irb):61:in `encode'
	from (irb):61
	from /usr/local/bin/irb:11:in `<main>'

Most encodings cover only a limited set of characters, so they can’t represent every possible character. You’ll see that error when a character in one encoding doesn’t exist in another, or when Ruby doesn’t know how to translate a character between two encodings.

You can work around this error if you pass extra options into encode:

irb(main):064:0> "hi∑".encode("Windows-1252", invalid: :replace, undef: :replace)
=> "hi?"

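The invalid and undef options tell encode to substitute a replacement character instead of raising an error: invalid covers byte sequences that aren’t valid in the source encoding, and undef covers characters that have no equivalent in the destination encoding. By default, that replacement character is ?. (When you convert to a Unicode encoding, it’s �.)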

Unfortunately, when you replace characters with encode, you lose information: afterward, you can’t tell which characters were replaced by ?. But if you need your data to be in that new encoding, losing some data can be better than things being broken.
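If a bare ? is too easy to miss, encode also accepts a replace option that sets the replacement string. A minimal sketch, where the "[?]" marker is an arbitrary choice for this example:

```ruby
text = "hi\u2211"  # "hi" followed by the summation sign, in UTF-8

# undef: :replace swaps out characters with no Windows-1252 equivalent;
# replace: picks what they are swapped for.
marked = text.encode("Windows-1252", invalid: :replace, undef: :replace, replace: "[?]")
puts marked  # => hi[?]

# Trying the strict conversion first tells you whether anything
# would be lost at all.
lossless = begin
  text.encode("Windows-1252")
  true
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  false
end
puts lossless  # => false
```

A marker that can't appear in normal text makes the damaged spots easier to find later.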

So far, you’ve seen three key string methods to help you understand encodings:

encode, which translates a string to another encoding (converting characters to their equivalent in the new encoding)
bytes, which will show you the bytes that make up a string
force_encoding, which relabels a string’s bytes with a different encoding, so you can see how they would be interpreted

The major difference between encode and force_encoding is that encode might change bytes, and force_encoding won’t.
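You can see that difference directly by comparing bytes before and after each call. A short sketch, reusing the string from the earlier irb session:

```ruby
utf8 = "hell\u00D4!"  # the "hell" string with a circumflexed O, in UTF-8

# encode translates characters, so the bytes for the accented O change.
latin1 = utf8.encode("ISO-8859-1")
p utf8.bytes    # => [104, 101, 108, 108, 195, 148, 33]
p latin1.bytes  # => [104, 101, 108, 108, 212, 33]

# force_encoding only swaps the label; the bytes stay exactly the same.
relabeled = latin1.dup.force_encoding("ISO-8859-5")
p relabeled.bytes == latin1.bytes  # => true
p relabeled.encoding               # => #<Encoding:ISO-8859-5>
```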

A three-step process for fixing encoding bugs

You can fix most encoding issues with three steps:

1. Discover which encoding your string is actually in.

This sounds easy. But just because a string says it’s in some encoding doesn’t mean it actually is:

irb(main):078:0> "hi\x99!".encoding
=> #<Encoding:UTF-8>

That’s not right. If those bytes were valid UTF-8, Ruby wouldn’t need to show one of them as a backslashed hex escape. So how do you figure out the right encoding for your string?
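One way to confirm that suspicion in code is String#valid_encoding?, which reports whether a string's bytes are well-formed for the encoding it claims to be in. A short sketch (the force_encoding call just mimics input that arrived mislabeled as UTF-8):

```ruby
suspect = "hi\x99!".dup.force_encoding("UTF-8")

p suspect.encoding         # => #<Encoding:UTF-8>
p suspect.valid_encoding?  # => false, since 0x99 can't stand alone in UTF-8

# A quick way to see the raw byte values in hex, for looking up in a chart.
hex_bytes = suspect.bytes.map { |byte| format("%02X", byte) }
p hex_bytes  # => ["68", "69", "99", "21"]
```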

A lot of older software will stick to a single default encoding, so you can research where the input came from. Did someone paste it in from Word? It could be Windows-1252. Did it come from a file or did you pull it from an older website? It might be ISO-8859-1.

I’ve also found it helpful to search for encoding tables, like the ones on the Wikipedia pages for Windows-1252 and ISO-8859-1. On those tables, you can look up the characters referenced by the unknown numbers, and see if they make sense in context.

In this example, the Windows-1252 chart shows that the byte 0x99 (the \x99 in the string is hexadecimal) represents the “™” character. Under ISO-8859-1, 0x99 isn’t a printable character. If ™ makes sense here, you could assume the input was in Windows-1252 and move on. Otherwise, you could keep researching until you found a character that seems more reasonable.
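Instead of reading charts by hand, you can also let Ruby interpret the mystery bytes under several candidate encodings and compare the results. This is only a sketch: the candidate list is an assumption, and you'd pick encodings that fit where your input came from.

```ruby
raw = "hi\x99!".b  # .b gives a binary (ASCII-8BIT) copy of the bytes

candidates = ["Windows-1252", "ISO-8859-1", "UTF-8"]

# For each candidate, relabel the bytes and try to translate them to UTF-8.
# nil means the bytes don't form a usable string in that encoding.
readings = candidates.map do |name|
  labeled = raw.dup.force_encoding(name)
  reading =
    begin
      labeled.valid_encoding? ? labeled.encode("UTF-8") : nil
    rescue Encoding::UndefinedConversionError
      nil
    end
  [name, reading]
end.to_h

readings.each do |name, reading|
  puts "#{name}: #{reading ? reading.inspect : '(not a valid reading)'}"
end
```

Ruby can only tell you which readings are possible, not which one is right. Here the ISO-8859-1 reading comes back as an invisible control character, which is a hint that Windows-1252 is the better guess.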

2. Decide which encoding you want the string to be.

This one’s easy. Unless you have a really good reason, you want your strings to be UTF-8 encoded.

There’s one other common encoding you might use in Ruby: ASCII-8BIT. In ASCII-8BIT, every character is represented by a single byte. That is, str.chars.length == str.bytes.length. So, if you want a lot of control over the specific bytes in your string, ASCII-8BIT might be a good option.
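To make that concrete, here is a brief sketch using String#b, which returns a copy of a string labeled ASCII-8BIT:

```ruby
text = "hell\u1E4F!"  # the accented "hell" string from earlier, in UTF-8
binary = text.b       # same bytes, labeled ASCII-8BIT

p text.chars.length    # => 6
p binary.chars.length  # => 8, one "character" per byte
p binary.chars.length == binary.bytes.length  # => true

# ASCII-8BIT also goes by the name BINARY.
p Encoding::BINARY == Encoding::ASCII_8BIT  # => true

# Nothing is lost: relabeling the bytes as UTF-8 gives the text back.
p binary.dup.force_encoding("UTF-8") == text  # => true
```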

3. Re-encode your string from the encoding in step 1 to the encoding in step 2.

You can do this with the encode method. In this example, our string was in the Windows-1252 encoding, and we want it to become UTF-8. Pretty straightforward:

irb(main):088:0> "hi\x99!".encode("UTF-8", "Windows-1252")
=> "hi™!"

Much better, even though the order of the arguments in that call (destination encoding first, source encoding second) has always seemed backwards to me.
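Put together, the three steps might look something like this. The helper name to_utf8 and its source_encoding parameter are hypothetical, made up for this sketch, and the source encoding still has to come from your own detective work in step 1:

```ruby
# Hypothetical helper: treat `string` as `source_encoding` (step 1)
# and translate it to UTF-8 (steps 2 and 3).
def to_utf8(string, source_encoding)
  # Strings that are already valid UTF-8 are returned untouched.
  return string if string.encoding == Encoding::UTF_8 && string.valid_encoding?

  # encode(destination, source): the destination comes first.
  string.encode("UTF-8", source_encoding)
end

fixed = to_utf8("hi\x99!", "Windows-1252")
puts fixed
puts to_utf8("already fine", "Windows-1252")
```

The early return is a design choice for this sketch: it avoids re-translating text that is already good UTF-8, at the cost of trusting any string that happens to be valid.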

It can be brain-bending to imagine different interpretations of the same array of bytes. Especially when one of those interpretations is broken. But there’s a great way to become a lot more comfortable with encodings: Play with them.

Open an irb console, and mess around with encode, bytes, and force_encoding. Watch how encode changes the bytes making up the string. Build intuition about what different encodings look like. When you’ve grown more comfortable with encodings and use these steps, you’ll fix in minutes what would have taken you hours before.

Finally, if you want to learn how to make a habit out of learning these kinds of things by doing, grab the free sample chapter of my book. Breaking things in the console is a really fun way to study ideas like this.
