You only really think about a string’s encoding when it breaks. When you check your exception tracker and see an encoding error staring you in the face. Or maybe “they’re” starts showing up as “they’re”.
So, when you have a bad encoding, how do you figure out what broke? And how can you fix it?
What is an encoding?
If you can imagine what encoding does to a string, these bugs are easier to fix.
You can think of a string as an array of bytes, or small numbers:
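Here’s a minimal sketch of that idea. The string "hi!" and the variable names are just my own illustration; any short ASCII string behaves the same way:

```ruby
# An illustrative ASCII-only string (not special in any way).
greeting = "hi!"

# String#bytes returns the numbers that make up the string.
greeting_bytes = greeting.bytes
p greeting_bytes # => [104, 105, 33]
```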
In this encoding, 104 means h, 33 means !, and so on.
It gets trickier when you use characters that are less common in English:
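As a rough example, assuming the string is UTF-8 (Ruby’s default for string literals), here’s a short string containing ṏ. I’ve written the character as a \u escape so there’s no doubt about which code point it is:

```ruby
# "\u1E4F" is the character ṏ; the rest of the string is plain ASCII.
accented = "h\u1E4F!"

accented_bytes = accented.bytes
p accented_bytes   # => [104, 225, 185, 143, 33]
p accented.length  # => 3 (three characters, but five bytes)
```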
Now it’s harder to tell which number represents which character. Instead of one byte, ṏ is represented by the group of bytes [225, 185, 143]. But there’s still a relationship between bytes and characters. And a string’s encoding defines that relationship.
Take a look at what a single set of bytes looks like when you try different encodings:
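A small sketch of what that can look like, using "café" and ISO-8859-1 as my own stand-ins. The string is relabeled with force_encoding, then converted back to UTF-8 only so it prints the way ISO-8859-1 would read it:

```ruby
word = "caf\u00E9" # "café", stored as UTF-8

# Same bytes, but now Ruby is told to read them as ISO-8859-1.
relabeled = word.dup.force_encoding("ISO-8859-1")

p word.bytes      # => [99, 97, 102, 195, 169]
p relabeled.bytes # => [99, 97, 102, 195, 169]

# Converting to UTF-8 shows the characters ISO-8859-1 sees in those bytes.
puts relabeled.encode("UTF-8") # => cafÃ©
```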
The bytes didn’t change. But that doesn’t look right at all. Changing the encoding changed how the string printed, without changing the bytes.
And not all strings can be represented in all encodings:
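For instance, ISO-8859-1 has no ṏ, so converting to it fails. This sketch rescues the error only so you can see its class; the string is my own example:

```ruby
text = "h\u1E4F!" # "hṏ!" in UTF-8

conversion_error = nil
begin
  # ISO-8859-1 has no equivalent for ṏ, so this raises.
  text.encode("ISO-8859-1")
rescue Encoding::UndefinedConversionError => e
  conversion_error = e
  puts "#{e.class}: #{e.message}"
end
```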
Most encodings are small, and can’t handle every possible character. You’ll see that error when a character in one encoding doesn’t exist in another, or when Ruby can’t figure out how to translate a character between two encodings.
You can work around this error if you pass extra options into encode:
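Here’s a sketch with both options set to :replace. The strings are illustrative, and the second half assumes you’re starting from raw binary data with a byte that has no UTF-8 equivalent:

```ruby
text = "h\u1E4F!" # "hṏ!" in UTF-8

# ṏ has no ISO-8859-1 equivalent, so it gets swapped for a replacement.
latin = text.encode("ISO-8859-1", invalid: :replace, undef: :replace)
puts latin # => h?!

# Going the other way: a binary string whose last byte can't be converted.
raw = "hi\xFF".b
unicode = raw.encode("UTF-8", invalid: :replace, undef: :replace)
p unicode == "hi\uFFFD" # => true
```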
The invalid and undef options replace characters that can’t be translated with a different character. By default, that replacement character is ?. (When you convert to Unicode, it’s �).
Unfortunately, when you replace characters with encode, you might lose information. You have no idea which bytes were replaced by ?. But if you need your data to be in that new encoding, losing data can be better than things being broken.
So far, you’ve seen three key string methods to help you understand encodings:
- encode, which translates a string to another encoding (converting characters to their equivalent in the new encoding)
- bytes, which will show you the bytes that make up a string
- force_encoding, which will show you what those bytes would look like interpreted by a different encoding
The major difference between encode and force_encoding is that encode might change the bytes in the string, and force_encoding won’t.
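You can see that difference side by side. This is a sketch with "café" and ISO-8859-1 as my own example values:

```ruby
original = "caf\u00E9" # "café" in UTF-8

# encode keeps the characters and rewrites the bytes to match.
converted = original.encode("ISO-8859-1")

# force_encoding keeps the bytes and only changes the label.
relabeled = original.dup.force_encoding("ISO-8859-1")

p original.bytes  # => [99, 97, 102, 195, 169]
p converted.bytes # => [99, 97, 102, 233]
p relabeled.bytes # => [99, 97, 102, 195, 169]
```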
A three-step process for fixing encoding bugs
You can fix most encoding issues with three steps:
1. Discover which encoding your string is actually in.
This sounds easy. But just because a string says it’s some encoding, doesn’t mean it actually is:
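Here’s a sketch of a mislabeled string. The value "hi\x99!" is an assumed example: a byte that isn’t valid UTF-8, inside a string that still claims to be UTF-8. valid_encoding? is a quick way to confirm the mismatch:

```ruby
# A string literal containing a raw byte that isn't valid UTF-8.
mystery = "hi\x99!"

p mystery.encoding == Encoding::UTF_8 # => true (that's what the string claims)
p mystery.valid_encoding?             # => false (but the bytes disagree)
p mystery                             # => "hi\x99!"
```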
That’s not right – if it was really UTF-8, it wouldn’t have that weird backslashed number in it. So how do you figure out the right encoding for your string?
A lot of older software will stick to a single default encoding, so you can research where the input came from. Did someone paste it in from Word? It could be Windows-1252. Did it come from a file or did you pull it from an older website? It might be ISO-8859-1.
I’ve also found it helpful to search for encoding tables, like the ones on those linked Wikipedia pages. On those tables, you can look up the characters referenced by the unknown numbers, and see if they make sense in context.
In this example, the Windows-1252 chart shows that the byte 0x99 (the \x99 in the string) represents the “™” character. Byte 0x99 has no printable character under ISO-8859-1. If ™ makes sense here, you could assume the input was in Windows-1252 and move on. Otherwise, you could keep researching until you found a character that seems more reasonable.
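If you’d rather let Ruby do the chart lookup, you can try each candidate encoding and compare the results. This is only a sketch: the candidate list is my guess for this example, not a complete set, and the byte string is assumed:

```ruby
# The raw bytes, tagged as binary so no encoding is assumed yet.
raw = "hi\x99!".b

# Guesses to try, based on where the input might have come from.
candidates = ["Windows-1252", "ISO-8859-1"]

# Read the same bytes under each guess, then convert to UTF-8 for display.
readings = candidates.map do |name|
  [name, raw.dup.force_encoding(name).encode("UTF-8")]
end.to_h

readings.each { |name, text| puts "#{name}: #{text.inspect}" }
# Windows-1252 reads byte 0x99 as ™.
# ISO-8859-1 turns it into the invisible control character U+0099.
```

Whichever reading makes sense in context is your best guess for step 1.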
2. Decide which encoding you want the string to be.
This one’s easy. Unless you have a really good reason, you want your strings to be UTF-8 encoded.
There’s one other common encoding you might use in Ruby: ASCII-8BIT. In ASCII-8BIT, every character is represented by a single byte. That is, str.chars.length == str.bytes.length. So, if you want a lot of control over the specific bytes in your string, ASCII-8BIT might be a good option.
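A quick sketch of that property, using String#b to get an ASCII-8BIT copy of a string (the string itself is just an example):

```ruby
utf8 = "h\u1E4F!" # "hṏ!" in UTF-8

p utf8.chars.length # => 3
p utf8.bytes.length # => 5

# String#b returns a copy tagged as ASCII-8BIT (also known as BINARY).
binary = utf8.b

p binary.encoding == Encoding::ASCII_8BIT # => true
p binary.chars.length                     # => 5
p binary.bytes.length                     # => 5
```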
3. Re-encode your string from the encoding in step 1 to the encoding in step 2.
You can do this with the encode method. In this example, our string was in the Windows-1252 encoding, and we want it to become UTF-8. Pretty straightforward:
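Here’s a sketch of that call, assuming the mislabeled string is "hi\x99!". The first argument to encode is the encoding you want, and the second is the encoding the bytes are really in:

```ruby
# Labeled UTF-8, but the bytes are really Windows-1252.
mystery = "hi\x99!"

# encode(destination, source): read the bytes as Windows-1252, write UTF-8.
fixed = mystery.encode("UTF-8", "Windows-1252")

puts fixed              # => hi™!
p fixed.valid_encoding? # => true
```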
Much better. (Even though the order of the encodings in that call always seemed backwards to me).
It can be brain-bending to imagine different interpretations of the same array of bytes. Especially when one of those interpretations is broken. But there’s a great way to become a lot more comfortable with encodings: Play with them.
Open an irb console, and mess around with encode, bytes, and force_encoding. Watch how encode changes the bytes making up the string. Build intuition about what different encodings look like. When you’ve grown more comfortable with encodings and use these steps, you’ll fix in minutes what would have taken you hours before.
Finally, if you want to learn how to make a habit out of learning these kinds of things by doing, grab the free sample chapter of my book. Breaking things in the console is a really fun way to study ideas like this.