06 May 2010

Beware, your text editor encodes your characters

Al salamo Alykom,

Lately I keep running into encoding issues, and they have me thinking about how characters actually get encoded!

And here's another note, which I actually got from BalusC on Stack Overflow.

OK, when you intend to use a particular character encoding in your html/jsp/php/... page, you have to know that your text editor is itself encoding your text when it saves the file.
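To make that concrete, here is a small Python sketch (my own illustration, not something from BalusC). The sample string and the helper name `save_like_an_editor` are made up for the demo. It saves the same text twice, once as UTF-8 and once as GB2312, roughly the way an editor would, and then looks at the raw bytes that end up on disk.

```python
import os
import tempfile

# Hypothetical sample text; any characters covered by GB2312 would do.
text = "完美女人"

def save_like_an_editor(content, encoding):
    """Write content to a temp file in the given encoding, return the raw bytes on disk."""
    fd, path = tempfile.mkstemp(suffix=".html")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(content)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)

utf8_bytes = save_like_an_editor(text, "utf-8")
gb2312_bytes = save_like_an_editor(text, "gb2312")

# Same four characters, two different byte sequences.
print(len(utf8_bytes), utf8_bytes.hex())
print(len(gb2312_bytes), gb2312_bytes.hex())
```

The four characters take 12 bytes in UTF-8 and 8 bytes in GB2312, so whatever reads the file later has to know which encoding the editor used.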

What does this mean?

Well, suppose you have an HTML page that you need to submit in an encoding other than UTF-8 (say, GB2312, a Chinese encoding).

If you type a field value directly in the text editor (as opposed to letting the user enter it through a text box), and the editor saves the file in a different encoding than the one the page declares, the characters will come out corrupted. (This typically happens with hidden fields.)

For example, this page:

<HTML>
<HEAD>
<meta http-equiv='Content-Type' content='text/html; charset=gb2312'>
</HEAD>
<BODY>
<form name="form" method="post" action="http://localhost:8080/testChinaEncoding/EncodingServlet" accept-charset="gb2312">
<input type="text" name="username" size="50" value="美白祛斑—做完美女人"> <!-- typed in the editor and saved as UTF-8, not GB2312 -->
<input type="submit">
</form>
</BODY>
</HTML>


The point is, when you open this page in a browser, you will notice that the characters appearing in the text box are corrupted!
Yes, this is because you typed these characters in the text editor using one encoding (usually UTF-8), but the page tells the browser to display them using another character encoding (GB2312)!
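The mismatch can be reproduced outside the browser too. This is only a rough Python sketch under my own assumptions: the sample string is a slice of the form value above, and decoding with `errors="replace"` just stands in for how a browser copes with bytes that are not valid GB2312.

```python
# Hypothetical sample text (a slice of the form value above).
text = "完美女人"

# What the editor wrote to disk when it saved the page as UTF-8.
bytes_on_disk = text.encode("utf-8")

# What happens when the page declares charset=gb2312: the same bytes
# get interpreted as GB2312. errors="replace" keeps the demo from
# raising on byte sequences that are not valid GB2312.
shown_in_browser = bytes_on_disk.decode("gb2312", errors="replace")

# One possible fix: re-save the file in the encoding the page declares.
fixed_bytes = text.encode("gb2312")
shown_after_fix = fixed_bytes.decode("gb2312")

print(shown_in_browser == text)  # False: the characters are corrupted
print(shown_after_fix == text)   # True: bytes and declared charset agree
```

So the fix is either to save the file in GB2312 so it matches the declared charset, or to declare the charset the editor really used.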

And here's a screen-shot from the output page:


Note the big difference between the characters written in the form and the characters displayed!

For more info, please see: http://balusc.blogspot.com/2009/05/unicode-how-to-get-characters-right.html
