Problem
Recently I encountered a problem with umlauts in file names. I had to read names from a directory and find and update the corresponding entry in the database. So if I had a file named hund.pdf (Hund is German for dog), I had to find the matching record in the database and attach the file. Almost all files went smoothly, but every file with umlauts in its name failed.
Certainly an encoding problem, I thought. So I converted the string to UTF-8 before querying. Again the query returned an empty result set. So I read up on the various configuration options for JDBC, Oracle and Active Record (it is a JRuby on Rails based web app). I tried them all, starting with nls_language and ending with temporarily setting the locale. No luck.
Querying the database with a hard-coded string containing umlauts worked. Both strings even looked identical when printed on the console.
So last but not least I compared the string from the file name with a hard-coded one: they weren't equal. Looking at the bytes revealed a strange character combination: \204\136. What's that? Unicode calls this a combining diaeresis (U+0308). What does that mean? In Unicode you can encode an umlaut either as its own precomposed character or as a combination of the base character without the umlaut followed by the combining diaeresis. So 'ä' becomes 'a\204\136'.
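To make this visible, here is a minimal Java sketch (my own illustration, not taken from the original setup) that builds both forms explicitly: the precomposed 'ä' and the decomposed 'a' plus combining diaeresis print identically, yet comparing them fails, which is exactly why the query returned an empty result set.

// precomposed vs. decomposed: looks the same, compares differently
public class UmlautDemo {
    public static void main(String[] args) {
        String composed = "\u00E4";     // 'ä' as a single code point
        String decomposed = "a\u0308";  // 'a' followed by the combining diaeresis
        System.out.println(composed);                     // prints: ä
        System.out.println(decomposed);                   // prints: ä
        System.out.println(composed.equals(decomposed));  // prints: false
    }
}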
Solution
The solution is to normalize the string. In (J)Ruby you can achieve this in the following way:
string = string.mb_chars.normalize.to_s
And in Java (using java.text.Normalizer) this would be:
string = Normalizer.normalize(string, Normalizer.Form.NFKC);
Rails' ActiveSupport uses NFKC (or :kc for short) as the default normalization form and suggests it for databases and validations.
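As a quick check of the Java variant, here is a short sketch (variable names and the class name are mine, assuming the file system hands you the decomposed form while the database stores the precomposed one):

import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String fromFileName = "a\u0308";  // decomposed form, as read from the directory
        String fromDatabase = "\u00E4";   // precomposed form, as stored in the database
        System.out.println(fromFileName.equals(fromDatabase)); // false
        String normalized = Normalizer.normalize(fromFileName, Normalizer.Form.NFKC);
        System.out.println(normalized.equals(fromDatabase));   // true
    }
}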
Lesson learned: the next time you encounter encoding problems, look twice. The string might be in the right encoding but consist of different bytes than you expect.
That has nothing to do with UTF-8
Yes, it is not specific to UTF-8. UTF-8 is just the encoding in which we encountered the problem; several other encodings are affected in the same way.