Encoding problems are common place in software development but sometimes you get them in unexpected places.
About the setup: we have a web application written in Grails (though the choice of framework here doesn’t really matter) running on Tomcat. A flash application sends a HTTP Get request to this web application.
As you might know parameters in Get request are encoded in the URL with the so called percent encoding for example: %20 for space. But how are they encoded? UTF8?
Looking at our tomcat configuration all Get parameters are decoded with UTF8. Great. But looking at the output of what the flash app sends us we see scrambled Umlauts. Hmmm clearly the flash app does not use UTF8. But wait! There’s another option in Tomcat for decoding Get parameters: look into the header and use the encoding specified there. A restart later nothing changed. So flash does not send its encoding in the HTTP header. Well, let’s take a look at the HTTP standard:
If a reserved character is found in a URI component and no delimiting role is known for that character, then it must be interpreted as representing the data octet corresponding to that character's encoding in US-ASCII.
Ah.. US-ASCII and what about non ASCII ones? Wikipedia states:
For a non-ASCII character, it is typically converted to its byte sequence in UTF-8, and then each byte value is represented as above.
Typically? Not in our case, so we tried ISO-8859-1 and finally the umlauts are correct! But currency signs like the euro are again garbage. So which encoding is similar to Latin-1 but not quite the same?
Yes, guess what: cp1252, the Windows native encoding.
And we tested all this on a Mac?!
If you are seeing ‘currency signs like the euro are again garbage’ when using ISO-8859-1.
It doesn’t necessarily have to be CP-1252, it could be that it is ISO-8859-15 encoded instead. They support the same printable characters, but their code mapping is subtly different I believe.
Could be. But in our case it was Cp1252.