I recently had to correct text data stored in an Oracle database with a wrong character set. A bug in one of our applications caused text data to be stored in the ISO 8859-1 West European character encoding. The database, however, is configured to work with UTF-8 encoded strings. The bug was fixed, but the existing data had to be corrected afterwards. If you ever encounter the same problem you can use Oracle’s CONVERT() function to fix the data. The syntax is as follows:
CONVERT(text, to_charset, from_charset)
First you have to find out Oracle’s names of the desired source and target character sets. You can look them up in this table. In my case the the character set names are 'UTF8' and 'WE8ISO8859P1' (for Western European 8-bit ISO-8859-1). You can check the function by selecting from DUAL:
SELECT CONVERT('ä ö ü Ä Ö Ü ß', 'UTF8', 'WE8ISO8859P1') FROM DUAL;
If the result shows the same scrambled characters you see in the data you know that you have chosen the correct character set names:
The other direction should unscramble the characters again:
Finally, you can convert your data:
SET text_column=CONVERT(text_column, 'WE8ISO8859P1', 'UTF8');
Not all encoding problems are problems with different encodings
Recently I encountered a problem with umlauts in file names. I had to read names from a directory and find and update the appropriate entry in the database. So if I had a file named hund.pdf (Hund is German for dog) I had to find the corresponding record in the database and attach the file. Almost all files went smooth but the ones with umlauts failed all.
Certainly an encoding problem I thought. So I converted the string to UTF-8 before querying. Again the query returned an empty result set. So I read up on the various configuration options for JDBC, Oracle and Active Record (it is a JRuby on Rails based web app). I tried them all starting with nls_language and ending with temporary setting the locale. No luck.
Querying the database with a hard coded string containing umlauts worked. Both strings even printed on the console looked identically.
So last but not least I compared the string from the file name with a hard coded one: they weren’t equal. Looking at the bytes a strange character combination was revealed \204\136. What’s that? UTF8 calls this a combining diaeresis. What’s that? In UTF8 you can encode umlauts with their corresponding characters or use a combination of the character without an umlaut and the combining diaeresis. So ‘ä’ becomes ‘a\204\136’.
The solution is to normalize the string. In (J)Ruby you can achieve this in the following way:
Most developers already struggled with textual data from some third party system and getting garbage special characters and the like because of wrong character encodings. Some days ago we encountered an obscure problem when it was possible to login into one of our apps from the computer with the password database running but not from other machines using the same db. After diving into the problem we found out that they SHA-1 hashes generated from our app were slightly different. Looking at the code revealed that platform encoding was used and that lead to different results:
The apps were running on Windows XP and Windows 2k3 Server respectively and you would expect that it would not make much of a difference but in fact it did!
Always specify the encoding explicitly, when exchanging character data with any other system. Here are some examples:
String.getBytes(“utf-8”), new Printwriter(file, “ascii”) in Java
HTML-Forms with attribute accept-charset="ISO-8859-1"
In XML headers <?xml version="1.0" encoding="ISO-8859-15"?>
In your Database and/or JDBC driver
In your file format documentation
In LaTeX documents
everywhere where you can provide that info easily (e.g. as a comment in a config file)
Problems with character encodings seem to appear every once in a while either as end user, when your umlauts get garbled or as a programmer that has to deal with third party input like web forms or text files.
The text file rant
After stumbling over an encoding problem *again* I thought a bit about the whole issue and some of my thought manifested in this rant about text files. I do not want to blame our computer science predecessors for inventing and using restricted charsets like ASCII or iso8859. Nobody has forseen the rapid development of computers and their worldwide adoption and use in everyday life and thus need for an extensible charset (think of the addition of new symbols like the €), let aside performance and memory considerations. The problem I see with text files is that there is no standard way to describe the used encoding. Most text files just leave it to the user to guess what the encoding might be whereas almost all binary file formats feature some kind of defined header with metadata about the content, e.g. bit depth and compression method in image files. For text files you usually have to use heuristical tools which work more or less depending on the input.
A standardized header for text files right from the start would have helped to indicate the encoding and possibly language or encoding version information of the text and many problems we have today would not exist. The encoding attribute in the XML header or the byte order mark in UTF-8 are workarounds for the fundamental problem of a missing text file header.