Most developers have already struggled with textual data from some third-party system, getting garbage special characters and the like because of wrong character encodings. Some days ago we encountered an obscure problem: it was possible to log in to one of our apps from the computer running the password database, but not from other machines using the same DB. After diving into the problem we found out that the SHA-1 hashes generated by our app were slightly different. Looking at the code revealed that the platform encoding was used, and that led to different results:
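A minimal sketch of the issue, assuming the usual pattern of hashing the password bytes (this is a reconstruction, not our original code; the class and method names are made up):

import java.io.UnsupportedEncodingException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PasswordHasher {

    // Broken: getBytes() without an argument uses the platform default
    // encoding, so the same password can yield different SHA-1 hashes on
    // machines with different default charsets (e.g. Cp1252 vs. UTF-8).
    public static byte[] hashPlatformDependent(String password)
            throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("SHA-1").digest(password.getBytes());
    }

    // Fixed: an explicit encoding makes the byte representation, and
    // therefore the hash, identical on every machine.
    public static byte[] hashPortable(String password)
            throws NoSuchAlgorithmException, UnsupportedEncodingException {
        return MessageDigest.getInstance("SHA-1").digest(password.getBytes("UTF-8"));
    }
}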
The apps were running on Windows XP and Windows 2003 Server respectively, and you would expect that this would not make much of a difference, but in fact it did!
Lesson:
Always specify the encoding explicitly when exchanging character data with any other system. Here are some examples:
- String.getBytes("utf-8"), new PrintWriter(file, "ascii") in Java (see the sketch after this list)
- HTML forms with the attribute
accept-charset="ISO-8859-1"
- In XML headers
<?xml version="1.0" encoding="ISO-8859-15"?>
- In your Database and/or JDBC driver
- In your file format documentation
- In LaTeX documents
- everywhere you can provide that info easily (e.g. as a comment in a config file)
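For the Java items above, the explicit versions look roughly like this (just a sketch; the file name is an arbitrary example):

import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;

public class ExplicitEncodings {
    public static void main(String[] args)
            throws FileNotFoundException, UnsupportedEncodingException {
        String text = "Umlauts: äöü, Euro sign: €";

        // Explicit charset instead of the platform default
        byte[] utf8Bytes = text.getBytes("UTF-8");

        // PrintWriter with an explicit charset
        PrintWriter writer = new PrintWriter(new File("report.txt"), "UTF-8");
        writer.println(text);
        writer.close();

        System.out.println(utf8Bytes.length + " bytes written");
    }
}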
Problems with character encodings seem to appear every once in a while, either as an end user when your umlauts get garbled, or as a programmer who has to deal with third-party input like web forms or text files.
The text file rant
After stumbling over an encoding problem *again* I thought a bit about the whole issue, and some of my thoughts manifested in this rant about text files. I do not want to blame our computer science predecessors for inventing and using restricted charsets like ASCII or ISO 8859. Nobody foresaw the rapid development of computers and their worldwide adoption and use in everyday life, and thus the need for an extensible charset (think of the addition of new symbols like the €), let alone performance and memory considerations. The problem I see with text files is that there is no standard way to describe the encoding used. Most text files just leave it to the user to guess what the encoding might be, whereas almost all binary file formats feature some kind of defined header with metadata about the content, e.g. bit depth and compression method in image files. For text files you usually have to use heuristic tools which work more or less well depending on the input.
A standardized header for text files right from the start would have helped to indicate the encoding and possibly language or encoding version information of the text, and many of the problems we have today would not exist. The encoding attribute in the XML header or the byte order mark in UTF-8 are workarounds for the fundamental problem of a missing text file header.
Tell me about it; character encoding hell indeed.
I have spent countless hours figuring out character encoding issues. Take for instance a web site running PHP with a MySQL backend and automated backups.
The bash shell, the Apache server and the MySQL database all need to be using UTF-8 encoding. Why the shell? As I found out, backing up a database to disk without the shell using UTF-8 means all the UTF-8 encoding is lost.
My personal favorite: UTF-8 characters use 1 to 4 octets versus the 1 octet of MySQL's default latin1. This means that text fields containing UTF-8 characters can exceed the field size after conversion. Very much fun ensues.
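A quick Java sketch to make the octet counts concrete (not MySQL-specific, just showing how the byte length grows with the encoding):

import java.io.UnsupportedEncodingException;

public class OctetDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "Grüße";                                   // 5 characters
        System.out.println(text.length());                       // 5
        System.out.println(text.getBytes("ISO-8859-1").length);  // 5 octets in latin1
        System.out.println(text.getBytes("UTF-8").length);       // 7 octets in UTF-8 (ü and ß take 2 each)
    }
}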
Arg. Character encoding is hell.
Actually, your solution is still incomplete, because Unicode characters can be encoded in different ways: you need to normalise to something like NFC first and then compare.
See Unicode UAX15 http://www.unicode.org/reports/tr15/
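In Java, that normalisation step might look something like this sketch (java.text.Normalizer is available since Java 6):

import java.text.Normalizer;

public class NormalizeBeforeComparing {
    public static void main(String[] args) {
        String composed   = "\u00FC";   // 'ü' as a single code point
        String decomposed = "u\u0308";  // 'u' followed by a combining diaeresis

        System.out.println(composed.equals(decomposed)); // false

        // Normalize both to NFC before comparing (or hashing)
        String a = Normalizer.normalize(composed, Normalizer.Form.NFC);
        String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(a.equals(b)); // true
    }
}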
As an i18n expert and informatician I’d like to advise you: always use UTF-{8,16} – in HTML forms, XML and the like. Make it a habit to work with UTF and to program with Unicode (which is easy in Java and, since version 2.6, in Python).
As soon as your application gets worldwide attention you will thank yourself.
And UTF has a text header so that the charset can be recognized as what it is.
Yes, that’s what we are doing, but sometimes you have to deal with third party systems that use some other fixed encoding and you cannot do anything about it. Regarding the UTF text-header: You have to write it explicitly to the file, don’t you?
The InputStreamReader class can be helpful too. You can specify the charset with which the decoding should happen:
http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStreamReader.html
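For example (a small sketch; the file name is made up):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class ReadWithCharset {
    public static void main(String[] args) throws IOException {
        // Decode the file explicitly as UTF-8 instead of the platform default
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("input.txt"), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    }
}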
I have had many bad experiences with the UTF-8 byte order mark. I would not recommend it to anyone.