Problem
Recently I encountered a problem with umlauts in file names. I had to read names from a directory and find and update the corresponding entry in the database. So if I had a file named hund.pdf (Hund is German for dog), I had to find the matching record in the database and attach the file. Almost all files went smoothly, but every file with umlauts in its name failed.
Certainly an encoding problem, I thought. So I converted the string to UTF-8 before querying. Again the query returned an empty result set. So I read up on the various configuration options for JDBC, Oracle and Active Record (it is a JRuby on Rails based web app). I tried them all, starting with nls_language and ending with temporarily setting the locale. No luck.
Querying the database with a hard-coded string containing umlauts worked. Both strings even looked identical when printed on the console.
So last but not least I compared the string from the file name with a hard-coded one: they weren't equal. Looking at the bytes revealed a strange character combination: \204\136. What's that? Unicode calls this a combining diaeresis (U+0308). What does that mean? In Unicode you can encode an umlaut either as its own precomposed character or as a combination of the base character without the umlaut followed by the combining diaeresis. So 'ä' becomes 'a\204\136'.
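To make this visible, here is a minimal Java sketch (my own illustration, not taken from the original setup) that builds both forms explicitly: the precomposed 'ä' and the decomposed 'a' plus combining diaeresis print identically, yet comparing them fails, which is exactly why the query returned an empty result set.

// precomposed vs. decomposed: looks the same, compares differently
public class UmlautDemo {
    public static void main(String[] args) {
        String composed = "\u00E4";     // 'ä' as a single code point
        String decomposed = "a\u0308";  // 'a' followed by the combining diaeresis
        System.out.println(composed);                     // prints: ä
        System.out.println(decomposed);                   // prints: ä
        System.out.println(composed.equals(decomposed));  // prints: false
    }
}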
Solution
The solution is to normalize the string. In (J)Ruby you can achieve this in the following way:
string = string.mb_chars.normalize.to_s
And in Java (using java.text.Normalizer) this would be:
string = Normalizer.normalize(string, Normalizer.Form.NFKC);
Rails' ActiveSupport uses NFKC (or :kc for short) as the default normalization form and suggests it for databases and validations.
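As a quick check of the Java variant, here is a short sketch (variable names and the class name are mine, assuming the file system hands you the decomposed form while the database stores the precomposed one):

import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String fromFileName = "a\u0308";  // decomposed form, as read from the directory
        String fromDatabase = "\u00E4";   // precomposed form, as stored in the database
        System.out.println(fromFileName.equals(fromDatabase)); // false
        String normalized = Normalizer.normalize(fromFileName, Normalizer.Form.NFKC);
        System.out.println(normalized.equals(fromDatabase));   // true
    }
}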
Lesson learned: the next time you encounter encoding problems, look twice. The string might be in the right encoding but consist of different bytes than you expect.
That has nothing to do with UTF-8
Yes, it is not specific to UTF-8. UTF-8 is just the encoding in which we encountered the problem; several other encodings are affected in the same way.