Note: This blog is sorta marketing-related and less frequently updated than the other blogs that I author. If you are more of a techy-geek than a marketing wizard, then cre8ive hut will be much more interesting for you.
Volkan.

9.24.2005

Yet another encoding problem.

Lately I have been struggling a lot with .NET and encoding issues.

Here are some of my findings:

Environment:
The server uses CP1252 (Latin) encoding (as a result, the ODBC DataReader and ODBC command objects use CP1252 encoding as well).

Request and response encoding are CP1254, however.
(Changing the server encoding to CP1254 is out of the question, since the server is not dedicated to me.)

Pages are written in ASP.Net C#

I'll try to explain every step I do. Forgive me if I bore you.

Here is what I do:
  1. Insert a CP1254-encoded String directly into a database table using phpMyAdmin's GUI (so that I am sure it is stored as a CP1254-encoded String in the DB).
  2. Read the column from that table into a DataReader.
    Although the data in the DB is CP1254-encoded, the DataReader returns an (improperly) CP1252-decoded String.
  3. Here comes the fun part:
    When I do

    (1) byte[] bytes = Encoding.GetEncoding(1252).GetBytes(theString);
    I receive a properly encoded byte array of CP1254 encoding without any errors, garbage characters, question marks etc.
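The read path in steps 1 to 3 can be sketched outside the database. This is a minimal illustration in Python rather than the C# of the actual pages; the sample text and variable names are made up, and it assumes the DataReader does nothing more than decode the stored bytes as CP1252.

```python
# Hypothetical sample text; every character exists in CP1254.
original = "İstanbul'da güneşli bir gün, çığ yok"

# Step 1: what ends up in the table is assumed to be the CP1254 bytes.
stored_bytes = original.encode("cp1254")

# Step 2: the DataReader is assumed to decode those bytes as CP1252,
# which produces the improper (garbled) string.
misread = stored_bytes.decode("cp1252")

# Step 3, relation (1): re-encode the garbled string with CP1252.
recovered_bytes = misread.encode("cp1252")

print(misread)
print(recovered_bytes == stored_bytes)
print(recovered_bytes.decode("cp1254") == original)
```

The garbled string looks wrong on screen, yet each of its characters still stands for exactly one of the original bytes, which is why encoding it back with CP1252 loses nothing for this sample.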

An interesting question would be "is it always the case"?

That is;

If I create a String S by decoding an E2-encoded byte array B1 with encoding E1, and then encode S with E1 again into a byte array B2, will B1 and B2 always be equal for all possible string values and all possible encodings?

(Too many variables around ain't there?)

Or am I just lucky because CP1254 and CP1252 are quite similar encodings and somehow their cross-transformation manages to stay reversible?
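For single-byte code pages the "always?" question is small enough to brute-force: there are only 256 byte values. The sketch below is Python, not C#, and `lossy_bytes` is a made-up helper name; it reports which byte values fail to survive a decode followed by an encode with the same codec. Python's code page tables are an assumption here, and the .NET tables may treat unassigned bytes differently, so the result is only indicative.

```python
def lossy_bytes(encoding):
    """Return the byte values that do not survive decode -> encode."""
    lossy = []
    for value in range(256):
        raw = bytes([value])
        try:
            if raw.decode(encoding).encode(encoding) != raw:
                lossy.append(value)
        except UnicodeError:
            # This codec has no mapping for the byte (or the character).
            lossy.append(value)
    return lossy

for name in ("cp1252", "cp1254", "latin-1"):
    print(name, [hex(b) for b in lossy_bytes(name)])
```

With Python's tables, the only problem values for CP1252 and CP1254 lie in the 0x80 to 0x9F range, where a few slots are unassigned. The Turkish letters all sit at 0xA0 or above, so text made of them passes through unharmed.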

Let us have a look at the other side of the coin:

I have a CP1254-encoded String S that I read from the page's response.

I convert it into a CP1254-encoded byte array B1 using S.GetBytes (with CP1254 encoding).

If I assume that the relation (1) given above is reversible
(which I must; else I won't have any reasonable explanation :) ),
decoding B1 using CP1252 encoding should create a CP1252-decoded (improper) String that is identical to the one our DataReader generated above.

And that happens to be the case: when I insert the improper String (which displays incorrectly in the response output) using the OdbcCommand object, magically correct values are inserted into the DB with the proper encoding, with no mistyped characters whatsoever.
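The write path can be sketched the same way. Again this is Python rather than C#, with made-up sample text and names, and it assumes the ODBC command object simply encodes whatever String it is given as CP1252 before sending it to the database.

```python
# Hypothetical text as it arrives correctly on the page.
page_text = "Şişli, Iğdır"

# B1: the CP1254 bytes we actually want stored in the table.
b1 = page_text.encode("cp1254")

# Deliberately build the improper string by decoding B1 as CP1252.
improper = b1.decode("cp1252")

# Assumption: the command object encodes its input as CP1252.
written = improper.encode("cp1252")

print(improper)
print(written == b1)
```

Under that assumption the bytes that reach the table are exactly B1, so a CP1254-aware client such as phpMyAdmin shows the text correctly even though the String handed to the command looked garbled.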

This makes me deduce that the DataReader and CommandObject operate at byte level.

To sum up; my finding out of all this hassle is as follows:

"If a string S is created out of an arbitrary byte array B using encoding E, we can retrieve B without any loss when we use S.GetBytes(E), no matter what B or E is." (2)

However, the question "is it always the case?" will remain unanswered, since I have other things to do than test other encoding pairs to find one that falsifies my argument (CP1252 and CP1254 have given me enough trouble already). IMHO, this two-sided transformation succeeds only because CP1252 (Western European Latin) and CP1254 (Turkish Latin) are close enough. Had it been a transformation between, say, Chinese and Turkish encodings, then IMHO there would have been data loss.
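Trying a few other encodings does not take long. The sketch below is Python, not C#; `survives` is a made-up helper and the word "güneş" is just a sample. It takes CP1254 bytes, decodes them with some other encoding E, encodes back with E, and checks whether the bytes are unchanged. GBK stands in for "a Chinese encoding", and strict error handling is an assumption (a decoder that substitutes replacement characters would lose data silently instead of raising).

```python
def survives(raw, encoding):
    """True if decoding raw with encoding and re-encoding returns raw."""
    try:
        return raw.decode(encoding).encode(encoding) == raw
    except UnicodeError:
        # The bytes are not valid in this encoding at all.
        return False

# CP1254 bytes of a Turkish word; the last byte is 0xFE.
turkish = "güneş".encode("cp1254")

for enc in ("cp1252", "latin-1", "utf-8", "gbk"):
    print(enc, survives(turkish, enc))
```

For this input the two single-byte code pages return True, while UTF-8 and GBK return False: the Turkish bytes are not valid sequences in those multi-byte encodings. So (2) does not hold for every E, which fits the suspicion about Chinese and Turkish encodings.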

Anyway (2) at least holds true for CP1252 and CP1254 encoded Strings. And that's good enough for me.

