
Thursday, March 09, 2006

ajax and charset conversion

If you write ajax applications and your pages are encoded in a non-UTF character set, you will most probably need a conversion mechanism so that the data you send to the server is encoded properly, without damaging native characters.

Recently, I've been in a similar situation and I've written two methods (one server-side, one client-side) to sort out the issue.

Although my solution only covers the Turkish character set (iso8859-9), it can be generalized to suit your needs.

XMLHttpRequest uses UTF-8 encoding to send data to the server.

You should normally use javascript's escape function to convert the data you wish to send into something the server will not misinterpret.

As you may know, escaping a string replaces special characters such as space, ampersand (&) and percent (%) with their hexadecimal escape codes, so that they do not damage the format of the QueryString when it is posted to the server.
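As a rough illustration (not part of the original solution), the snippet below prints what escape returns for a few made-up sample strings. It assumes an environment where the legacy global escape function is available, which is the case in browsers and in Node.js.

```javascript
// Sample inputs chosen only for this demonstration.
var samples = ["a b", "x&y", "50%", "ü", "ş"];

// escape() turns reserved ASCII characters into %XX hex codes.
// Characters up to U+00FF become a single %XX byte (not UTF-8),
// and characters above U+00FF become a non-standard %uXXXX sequence.
var escaped = samples.map(function (s) {
    return escape(s);
});

console.log(escaped.join(" "));
// prints: a%20b x%26y 50%25 %FC %u015F
```

Note that neither %FC nor %u015F is the UTF-8 form of the character, which is %C3%BC for ü and %C5%9F for ş.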

Under normal circumstances, escaping the data before sending it to the server is enough to encode it. However, in our special case (where we are using a non-UTF charset, iso-8859-9, along with native characters) it is not enough, because escape does not produce UTF-8 for non-ASCII characters. We need to convert the native Turkish characters to their UTF-8 equivalents as well.

Here is how to do it:


function iso88599Escape(strText)
{
    // UTF-8 percent-encodings of the Turkish characters.
    // Calling escape() on the whole string first would already turn these
    // characters into %XX / %uXXXX codes, so they are handled per character.
    var map = {
        "ı":"%C4%B1", "İ":"%C4%B0",
        "ü":"%C3%BC", "Ü":"%C3%9C",
        "ğ":"%C4%9F", "Ğ":"%C4%9E",
        "ş":"%C5%9F", "Ş":"%C5%9E",
        "ç":"%C3%A7", "Ç":"%C3%87",
        "ö":"%C3%B6", "Ö":"%C3%96"
    };
    // Mapped characters get their UTF-8 form; everything else goes through escape().
    return strText.replace(/[\s\S]/g, function (ch) {
        return map.hasOwnProperty(ch) ? map[ch] : escape(ch);
    });
}
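As one of the commenters below points out, encodeURIComponent already emits UTF-8 percent-encodings for every character, so a hand-written table is not strictly needed. Here is a minimal sketch of that alternative; buildQueryString and the sample parameters are hypothetical names used only for illustration.

```javascript
// Hypothetical helper: builds "key=value&key=value" from a plain object,
// encoding both keys and values as UTF-8 percent-escapes.
function buildQueryString(params) {
    var pairs = [];
    for (var key in params) {
        if (Object.prototype.hasOwnProperty.call(params, key)) {
            pairs.push(encodeURIComponent(key) + "=" + encodeURIComponent(params[key]));
        }
    }
    return pairs.join("&");
}

// Sample data made up for this example.
var query = buildQueryString({ city: "İstanbul", q: "şeker & çay" });
console.log(query);
// prints: city=%C4%B0stanbul&q=%C5%9Feker%20%26%20%C3%A7ay
```

The Turkish characters come out with the same byte sequences as in the table above (for example %C4%B0 for İ), and the ampersand inside the value is escaped so it cannot break the QueryString.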

There is another caveat here, though:
We are sending the data in the QueryString to the server in UTF-8 format.
However, the server is configured to interpret the data it receives as if it were an
8-bit iso8859-9 encoded string. When it comes to native characters, this encoding differs from UTF-8.

So we need another conversion method on the server to turn the UTF-8 data it received into a properly encoded iso8859-9 string.

A quick and dirty solution would be a brute-force replacement of misinterpreted character sequences:


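public static string AjaxRequestToIso88599(string value)
{
    // Left: UTF-8 bytes misread as a single-byte Latin string. Right: the intended character.
    // The exact garbled sequences depend on the code page the server decoded with.
    return value.Replace("Ãœ","Ü"
        ).Replace("Åž","Ş"
        ).Replace("Äž","Ğ"
        ).Replace("Ã‡","Ç"
        ).Replace("Ä°","İ"
        ).Replace("Ã–","Ö"
        ).Replace("Ã¼","ü"
        ).Replace("ÅŸ","ş"
        ).Replace("ÄŸ","ğ"
        ).Replace("Ã§","ç"
        ).Replace("Ä±","ı"
        ).Replace("Ã¶","ö");
}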

I hear you say "There should be a better way to do it."
And yes, you are right.

Let us go through it step by step:


UTF data as byte array(ajax request)
-> [server (Request.QueryString)]
-> ISO-8859-9 encoded String


The data posted to the server (i.e. the querystring we just formed) is in UTF-8 format.
The server, however, interprets it as if it were a Latin-formatted stream (namely a stream with the iso-8859-9 charset). This is what creates those cryptic characters.
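To make the misinterpretation concrete, here is a small sketch that reproduces it: it takes the UTF-8 bytes of a sample string and reads each byte as one iso-8859-9 character. decodeIso88599 is a hypothetical helper written for this example, and TextEncoder is assumed to be available (modern browsers and Node.js).

```javascript
// ISO-8859-9 is identical to Latin-1 except for these six byte values.
var ISO_8859_9_OVERRIDES = {
    0xD0: "Ğ", 0xDD: "İ", 0xDE: "Ş",
    0xF0: "ğ", 0xFD: "ı", 0xFE: "ş"
};

// Hypothetical helper: reads every byte as one ISO-8859-9 character.
function decodeIso88599(bytes) {
    var out = "";
    for (var i = 0; i < bytes.length; i++) {
        var b = bytes[i];
        out += ISO_8859_9_OVERRIDES[b] || String.fromCharCode(b);
    }
    return out;
}

// "ğüş" is C4 9F, C3 BC, C5 9F in UTF-8: two bytes per character.
var utf8Bytes = new TextEncoder().encode("ğüş");
var misread = decodeIso88599(utf8Bytes);

console.log(utf8Bytes.length, misread.length);
// prints: 6 6
```

Three characters went in and six came out: every Turkish letter has turned into a pair such as Ã followed by ¼, which is exactly the kind of cryptic sequence the brute-force method replaces.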



So we need to convert the String back into what it once was: a UTF String!

To do it, we first recover the original byte array by encoding the incorrectly decoded String back to bytes with iso-8859-9.
Then we decode those bytes as UTF-8.


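// Requires: using System.Text;
public static string Iso88599ToUTF8(string value)
{
    // Step 1: turn the misread string back into the bytes the client sent.
    // Step 2: decode those bytes as what they really are, UTF-8.
    return Encoding.GetEncoding("UTF-8").GetString(
        Encoding.GetEncoding("ISO-8859-9").GetBytes(value)
    );
}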
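The same round trip can be tried out in JavaScript to check the idea without a server. This is only a sketch under a few assumptions: encodeIso88599 and iso88599ToUtf8 are hypothetical helpers written for this example, and TextDecoder is assumed to be available (modern browsers and Node.js).

```javascript
// The six characters where ISO-8859-9 differs from Latin-1.
var ISO_8859_9_CHARS = {
    "Ğ": 0xD0, "İ": 0xDD, "Ş": 0xDE,
    "ğ": 0xF0, "ı": 0xFD, "ş": 0xFE
};

// Hypothetical helper: one byte per character. It is lenient in that any
// other character up to U+00FF is passed through as its Latin-1 byte.
function encodeIso88599(text) {
    var bytes = new Uint8Array(text.length);
    for (var i = 0; i < text.length; i++) {
        var ch = text.charAt(i);
        var code = ISO_8859_9_CHARS.hasOwnProperty(ch)
            ? ISO_8859_9_CHARS[ch]
            : text.charCodeAt(i);
        if (code > 0xFF) {
            throw new RangeError("Not an ISO-8859-9 character: " + ch);
        }
        bytes[i] = code;
    }
    return bytes;
}

// Mirrors the C# method: string -> ISO-8859-9 bytes -> decode as UTF-8.
function iso88599ToUtf8(value) {
    return new TextDecoder("utf-8").decode(encodeIso88599(value));
}

// C3 BC and C4 9F misread as four single-byte characters.
var garbled = "\u00C3\u00BC\u00C4\u009F";
console.log(iso88599ToUtf8(garbled));
// prints: üğ
```

Plain ASCII passes through unchanged, so the conversion can be applied to a whole QueryString value rather than only to the parts known to contain Turkish characters.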

Easy cheesy!
One line of code and your String is properly converted.

Enjoy!

9 Comments

Anonymous said...
That's exactly the article I was looking for. Thank you very much for your help. By the way, it would be better to use encodeURIComponent() rather than the escape() method. Then there would be no need to write a client-side iso function.

Well done :)

Turgay
10:38 AM  
Anonymous said...
Thank you.

http://www.dynamicdrive.com/dynamicindex17/ajaxtabscontent/index.htm
How can we add what you wrote here to the code at http://www.dynamicdrive.com/dynamicindex17/ajaxtabscontent/ajaxtabs/ajaxtabs.js from this link? We would appreciate your help. sukru_saglam[at]hotmail.com
6:11 PM  
Anonymous said...
Hi everyone

Here is another way to get rid of those weird characters :)

public static string DecodeCharFilter(string Text)
{
    Stream s = new MemoryStream(ASCIIEncoding.Default.GetBytes(Text));
    StreamReader sr = new StreamReader(s);

    byte[] CurrentBytes = sr.CurrentEncoding.GetBytes(sr.ReadToEnd());
    byte[] EncodingBytes = Encoding.Convert(sr.CurrentEncoding, Encoding.GetEncoding(1254), CurrentBytes);

    return Encoding.Default.GetString(EncodingBytes);
}

Just replace 1254 with the codepage value you want to convert to.

May you use it in good programs :)

ShipTor
12:50 PM  
Volkan Ozcelik said...
> Here is another way to get rid of those weird characters :)

Thanks for sharing your snippet.

However, if your hosting provider's locale differs from your development server's locale (and both are non-UTF8), then things may turn out to be weirder than you expect.

It all boils down to trial and error (and strong nerves and some luck, hopefully).

Cheers.
9:04 PM  
Volkan Ozcelik said...
>it would be better to use encodeURIComponent()

Yes Turgay, you are right. encodeURIComponent is a more robust alternative. (I think I've addressed it somewhere but I cannot remember where)
9:08 PM  
Anonymous said...
Hi Volkan,

If the user puts these settings in web.config there will be no problem, because
when the web.config settings are missing, the framework works according to the
global (server) settings.

<globalization requestEncoding="utf-8" responseEncoding="utf-8" fileEncoding="iso-8859-9"
culture="tr-TR" uiCulture="tr"/>

Regards.
ShipTor
4:03 PM  
Volkan Ozcelik said...
"
<globalization requestEncoding="utf-8" responseEncoding="utf-8" fileEncoding="iso-8859-9"
culture="tr-TR" uiCulture="tr"/>
"

Hi.
My web.config is something like:

<globalization
requestEncoding="iso-8859-9"
responseEncoding="iso-8859-9"
fileEncoding="iso-8859-9"
culture="tr-TR"
uiCulture="tr-TR" />

After several trials and errors it's the best I could achieve, because my database is iso-8859-9 encoded.

If the db were utf encoded things would have been much simpler (which is not possible, because conversion is too costly; I should've decided it from the beginning).

Anyways, thanks for the comment.
I'll give it a try in my upcoming projects.
3:16 PM  
Volkan Ozcelik said...
Hi Şükrü,

... how can we add what you wrote here to the code?

Your question seems somewhat unrelated.
Do you have encoding issues with the AJAX tabs?

If it's encoding, then it depends on a variety of things, including your choice of development platform, your configuration files etc.

I'd be glad to help, if you clarify your question a bit.
3:19 PM  
Anonymous said...
With the iso format, ğ and ş come through as Y; when I encode with windows-1254 instead, all characters come through correctly.
5:19 PM  

