Volkan.
10.03.2005
Encoding solved!
- In .NET, a String is not a simple byte array; it is a Unicode-aware char array. To be more specific: internally, .NET applications use UTF-16 Unicode encoding for string data.
- If no encoding is specified, the platform's default encoding is used (of course), which may be different from the encoding you expect.
- The byte array extracted from a String depends on the encoding you name explicitly: str.getBytes("iso-8859-1") returns a different byte array than str.getBytes("iso-8859-9") whenever the String contains characters that the two encodings map differently.
- When you convert a byte array to a String using encoding E, the bytes are mapped to what are called "Unicode code points". So if you take a byte[] that was extracted from a String using iso-8859-9 and convert it to a String using iso-8859-1, some characters come out wrong: the Turkish ğ (byte 0xF0 in iso-8859-9) turns into ð, for example, and depending on the font or console some may show up as question marks (?) or boxes. The data is still valid. And yes, the data is there. You just cannot see it as intended, because iso-8859-1 maps those byte values to different code points than iso-8859-9 does.
- Some encoding translations (such as ISO8859_1 and ISO8859_9) are reversible: every byte value maps to exactly one character and back, so decoding bytes and encoding them again returns the original bytes. Here is a quote:
"Not all encoders assume a one-to-one relationship between byte values and character values. To ensure a reliable translation, do not rely on the default locale encoder. Explicitly specify an encoder that uses a reversible translation."
- Let us explain it with a use-case:
(Sorry for my laziness: the code is Java-ish, not C#. What's important here is the concept, though.)
aString = aStream.toString("ISO8859_1");
/*
* assume that the stream is iso8859_9 encoded
* but we convert it to an ISO8859_1 encoded String.
*/
aByteArray = aString.getBytes("ISO8859_1");
/*
* These bytes belong to an iso8859_9 encoded stream and
* we have now recovered them since iso8859_1
* encoding is reversible.
* (from the above quote)
*/
Thus you can be sure that when we create a String out of aByteArray using ISO8859_9 encoding, we will get our original String without any loss.
This issue might not be apparent, though, if the program was not tested under a non-Western locale.
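The use-case above can be tried out as a small runnable Java program. This is only a sketch under a few assumptions: the class name ReversibleDecodeDemo, the helper throughLatin1 and the three-letter sample text are made up for illustration, and it assumes that your JRE provides the ISO-8859-9 charset (standard JDKs do, but the platform specification does not guarantee it).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are not from the original post.
final class ReversibleDecodeDemo {
    static final Charset LATIN1 = StandardCharsets.ISO_8859_1;
    // Assumption: the running JRE supports ISO-8859-9.
    static final Charset LATIN5 = Charset.forName("ISO-8859-9");

    /** Decodes the bytes as ISO-8859-1 and encodes the result again. */
    static byte[] throughLatin1(byte[] raw) {
        String misread = new String(raw, LATIN1);
        return misread.getBytes(LATIN1);
    }

    public static void main(String[] args) {
        // Turkish g-breve, dotless i, s-cedilla (written as escapes).
        String original = "\u011F\u0131\u015F";

        // The "stream": bytes that really are ISO-8859-9 encoded.
        byte[] raw = original.getBytes(LATIN5);

        // Wrongly decoded as ISO-8859-1: different characters, same bytes.
        String misread = new String(raw, LATIN1);

        // Encoding with the same charset gives the original bytes back.
        byte[] recovered = throughLatin1(raw);

        // Decoding with the right charset restores the text.
        String restored = new String(recovered, LATIN5);

        System.out.println("misread differs:  " + !misread.equals(original));
        System.out.println("bytes recovered:  " + Arrays.equals(raw, recovered));
        System.out.println("text restored:    " + restored.equals(original));
    }
}
```

All three lines should print true. The intermediate String looks wrong on screen, yet no byte was lost on the way.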
Side Note:
There is another issue called "round trip compatibility". That is, when you convert a String from encoding A to encoding B and then back to encoding A, you get your original String without any loss. Round-trip compatibility is not strictly related to the reversibility issue explained above, but it is another concept to consider when you try to internationalize your application.
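A rough way to probe round-trip compatibility in Java is to encode a String to the target encoding and decode it straight back. The sketch below treats Java's internal UTF-16 String as "encoding A" and the given charset as "encoding B"; the class name RoundTripCheck and the method survivesRoundTrip are hypothetical, and ISO-8859-9 support in the JRE is again an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are not from the original post.
final class RoundTripCheck {
    /**
     * Returns true if the text is unchanged after being encoded to
     * the given charset and decoded back with the same charset.
     */
    static boolean survivesRoundTrip(String text, Charset charset) {
        // getBytes substitutes a replacement byte (usually '?')
        // for every character the charset cannot represent.
        byte[] encoded = text.getBytes(charset);
        return text.equals(new String(encoded, charset));
    }

    public static void main(String[] args) {
        // Turkish g-breve, dotless i, s-cedilla (written as escapes).
        String turkish = "\u011F\u0131\u015F";

        // Assumption: the running JRE supports ISO-8859-9.
        Charset latin5 = Charset.forName("ISO-8859-9");

        System.out.println(survivesRoundTrip(turkish, latin5));                      // true
        System.out.println(survivesRoundTrip(turkish, StandardCharsets.UTF_8));      // true
        System.out.println(survivesRoundTrip(turkish, StandardCharsets.ISO_8859_1)); // false
    }
}
```

The last call fails because ISO-8859-1 has no slot for these three letters, so they are replaced on the way out and cannot be brought back.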
As a final remark:
There ain't no such thing as 'plain text'.
Cheers for now.
References:
- http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4296969
incorrect behavior of several character converters.
- http://www.i18nguy.com/unicode/codepages.html
codepages at the push button.
- http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemtextencodingclassconverttopic2.asp
Encoding.Convert method.
- http://www.simultrans.com/seminars/seminar199808.htm
SimulTrans localization seminar: Demystifying Unicode.
- Java(tm) Development Kit - JDK(tm) 1.1.7 Software README
- http://www.joelonsoftware.com/articles/Unicode.html
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- http://www.cl.cam.ac.uk/~mgk25/unicode.html
UTF-8 and Unicode FAQ for Unix/Linux