When UTF8 != UTF8

Not all encoding problems are problems with different encodings

Problem

Recently I encountered a problem with umlauts in file names. I had to read names from a directory and find and update the appropriate entry in the database. So if I had a file named hund.pdf (Hund is German for dog) I had to find the corresponding record in the database and attach the file. Almost all files went smooth but the ones with umlauts failed all.

Certainly an encoding problem I thought. So I converted the string to UTF-8 before querying. Again the query returned an empty result set. So I read up on the various configuration options for JDBC, Oracle and Active Record (it is a JRuby on Rails based web app). I tried them all starting with nls_language and ending with temporary setting the locale. No luck.

Querying the database with a hard coded string containing umlauts worked. Both strings even printed on the console looked identically.

So last but not least I compared the string from the file name with a hard coded one: they weren’t equal. Looking at the bytes a strange character combination was revealed \204\136. What’s that? UTF8 calls this a combining diaeresis. What’s that? In UTF8 you can encode umlauts with their corresponding characters or use a combination of the character without an umlaut and the combining diaeresis. So ‘ä’ becomes ‘a\204\136’.

Solution

The solution is to normalize the string. In (J)Ruby you can achieve this in the following way:

string = string.mb_chars.normalize.to_s

And in Java this would be:

string = Normalizer.normalize(string, Normalizer.Form.NFKC)

Ruby uses NFKC (or kc for short) as a default and suggests this for databases and validations.

Lesson learned: So the next time you encounter encoding problems look twice it might be in the right encoding but with the wrong bytes.

2 thoughts on “When UTF8 != UTF8”

    1. Yes, it is not specific to UTF-8. UTF-8 is just the instance we encountered of several encodings which have these problems.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.