1. Garbled characters caused by file page encoding.
Each file (java, js, jsp, html, etc.) is saved in a particular encoding. Its contents display correctly when the file is opened with that encoding, but appear garbled when it is opened with a different one.
In Eclipse, each project has an encoding setting (Text file encoding). It is generally inherited from the platform default, so on a Chinese-language Windows system it is usually GBK. A good habit is to set the project encoding to UTF-8 as soon as you create a new project.
The reason is simple: UTF-8 can represent the characters used by every country in the world, so it works as an international encoding. The relationship between the common character sets GBK, GB2312, and UTF-8 is as follows:
GBK is an extension of the national standard GB2312 and remains compatible with it. GBK and GB2312 text is not converted to UTF-8 directly; the conversion goes through Unicode as the intermediate form.
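The conversion through Unicode can be sketched in a few lines of Java, where a String plays the role of the Unicode intermediate form. This is only a minimal illustration: the class and method names (TranscodeDemo, gbkToUtf8) are made up for this example, and the sample text is written with Unicode escapes (\u4e2d\u6587 is the two-character word for "Chinese") so that the example does not depend on the encoding of the source file itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original article.
class TranscodeDemo {
    private static final Charset GBK = Charset.forName("GBK");

    // Decodes GBK bytes into a String (Unicode), then encodes that String as UTF-8.
    static byte[] gbkToUtf8(byte[] gbkBytes) {
        String text = new String(gbkBytes, GBK);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] gbk = "\u4e2d\u6587".getBytes(GBK);
        byte[] utf8 = gbkToUtf8(gbk);
        System.out.println(gbk.length + " GBK bytes -> " + utf8.length + " UTF-8 bytes");
    }
}
```

The same pattern works for GB2312 or any other charset the JDK supports: decode with the source charset, encode with the target charset.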
2. Garbled characters caused by string conversion of different character sets.
Encoding a String with a character set produces a byte array, and different character sets generally produce byte arrays of different lengths for the same text. If those bytes are later decoded with a different character set than the one used to encode them, non-ASCII text such as Chinese will come out garbled.
For example, the following code:
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class TestCharset {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String strChineseString = "中文";
        // Default character set of the running JVM
        String encoding = System.getProperty("file.encoding");
        System.out.println("The system default character set is: " + encoding);
        // Byte length of the same text in GBK, in UTF-8, and in the default charset
        System.out.println(strChineseString.getBytes(Charset.forName("GBK")).length);
        System.out.println(strChineseString.getBytes(Charset.forName("UTF-8")).length);
        System.out.println(strChineseString.getBytes().length);
    }
}
The output result is:
The system default character set is: UTF-8
4
6
6
As you can see, the byte arrays produced by GBK and UTF-8 encoding have different lengths. The reason is that UTF-8 uses 3 bytes to encode each of these Chinese characters, while GBK uses 2 bytes. Because my project uses UTF-8 by default, the array returned by getBytes() without parameters has the same length as the one encoded in UTF-8. For detailed knowledge about character sets, please refer to the article address given in the first part.
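To see where the different lengths come from, the encoded bytes can be printed in hexadecimal. The sketch below is only an illustration; the class and method names (HexBytesDemo, hex) are invented for this example, and the sample text is again written with Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original article.
class HexBytesDemo {
    // Encodes text with the given charset and renders each byte as two hex digits.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587";
        System.out.println("GBK:   " + hex(text, Charset.forName("GBK")));  // d6 d0 ce c4
        System.out.println("UTF-8: " + hex(text, StandardCharsets.UTF_8)); // e4 b8 ad e6 96 87
    }
}
```

Two bytes per character in GBK and three per character in UTF-8 give the lengths 4 and 6 reported above.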
Description of the getBytes method in JDK:
getBytes() Encodes this String into a byte sequence using the platform's default character set and stores the result in a new byte array.
getBytes(Charset charset) Encodes this String into a byte sequence using the given charset and stores the result into a new byte array.
Internally, every String has its own representation. Once the getBytes method is called, however, the resulting byte array is encoded in one specific character set, and no further conversion is needed.
After getting the above byte array, you can call another method of String to generate the String that needs to be transcoded.
The test example is as follows:
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class TestCharset {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String strChineseString = "中文";
        byte[] byteGBK = null;
        byte[] byteUTF8 = null;
        byteGBK = strChineseString.getBytes(Charset.forName("GBK"));
        byteUTF8 = strChineseString.getBytes(Charset.forName("utf-8"));
        System.out.println(new String(byteGBK, "GBK"));
        System.out.println(new String(byteGBK, "utf-8"));
        System.out.println("**************************");
        System.out.println(new String(byteUTF8, "utf-8"));
        System.out.println(new String(byteUTF8, "GBK"));
    }
}
The output result is:
中文
����
**************************
中文
涓枃
As you can see, whichever character set was used to encode a String into bytes must also be used when constructing a String from those bytes; otherwise garbled characters appear.
To put it simply, only String transcoding that satisfies the following formula will not be garbled.
String strSource = "the string you want to transcode";
String strSomeEncoding = "utf-8"; // for example utf-8
String strTarget = new String(strSource.getBytes(Charset.forName(strSomeEncoding)), strSomeEncoding);
Description of the corresponding String constructors in the JDK:
String(byte[] bytes) Constructs a new String by decoding the specified byte array using the platform's default character set.
String(byte[] bytes, Charset charset) Constructs a new String by decoding the specified byte array using the specified charset.
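The String constructors above silently substitute a replacement character for bytes that cannot be decoded, which is how the garbled output shown earlier comes about. When it is more useful to fail early, a CharsetDecoder can be asked to report errors instead. The following is only a sketch of that idea; the class and method names (StrictDecodeDemo, decodeStrict) are invented for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original article.
class StrictDecodeDemo {
    // Decodes bytes with the given charset and throws instead of producing
    // replacement characters when the bytes are not valid in that charset.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        byte[] gbkBytes = "\u4e2d\u6587".getBytes(Charset.forName("GBK"));
        try {
            decodeStrict(gbkBytes, StandardCharsets.UTF_8);
            System.out.println("decoded without errors");
        } catch (CharacterCodingException e) {
            System.out.println("bytes are not valid UTF-8: " + e);
        }
    }
}
```

A check like this does not guess the right character set, but it does catch a wrong one before the garbled text spreads further into the program.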
3. Chinese garbled characters caused by Socket network transmission.
When using Socket for communication, there are many options for transmission. You can use PrintStream or PrintWriter. Transmitting English is okay, but transmitting Chinese may cause garbled characters. There are many opinions on the Internet. After actual testing, it was found that the problem still lies in bytes and characters.
As we all know, Java I/O is divided into byte streams and character streams. A character (char) is 16 bits and a byte is 8 bits. PrintStream is a byte stream, and PrintWriter is a character stream.
A String is held internally as 16-bit Unicode (UTF-16) code units, but what travels over a socket is always bytes, so the characters have to be encoded with some character set on the way out and decoded on the way in. Both PrintStream and PrintWriter use the platform default character set for this unless one is specified.
You can understand it like this: a Chinese character occupies more than one byte in GBK or UTF-8, so if the two ends do not agree on the character set, or the data is handled one byte at a time, garbled characters may appear. Generally, PrintWriter is used when processing Chinese, preferably on top of an OutputStreamWriter with an explicitly chosen character set.
In my final tests, no garbled characters appeared when using PrintWriter. The code is as follows:
import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.Socket;

public class TestSocket {
    public static void main(String[] args) throws IOException {
        Socket socket = new Socket();
        DataOutputStream dos = null;
        PrintWriter pw = null;
        BufferedReader in = null;
        String responseXml = "要传输的中文"; // the Chinese text to transmit
        //..........
        dos = new DataOutputStream(socket.getOutputStream());
        // Writer without auto-flush; OutputStreamWriter uses the platform default charset here
        pw = new PrintWriter(new OutputStreamWriter(dos));
        pw.println(responseXml);
        pw.flush();
    }
}
One thing to note: use PrintWriter's println rather than write, otherwise a server that reads line by line will not be able to read the data. The reason is that println appends a line terminator after the string, while write does not.
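As a complement to the code above, here is a minimal sketch in which both ends name the character set explicitly instead of relying on the platform default. It is not the author's test setup: the class and method names (Utf8SocketDemo, sendAndReceive) are invented, and for the sake of a self-contained example both ends of the connection live in one process on the loopback interface.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original article.
class Utf8SocketDemo {
    // Sends one line over a loopback socket and returns what the receiving side decoded.
    static String sendAndReceive(String message) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            // Sender: characters are encoded as UTF-8, regardless of the platform default.
            PrintWriter out = new PrintWriter(
                    new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8), false);
            out.println(message); // println adds the line terminator that readLine() waits for
            out.flush();
            // Receiver: bytes are decoded with the same charset.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(accepted.getInputStream(), StandardCharsets.UTF_8));
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        String received = sendAndReceive("\u4e2d\u6587");
        System.out.println("\u4e2d\u6587".equals(received) ? "round trip ok" : "garbled");
    }
}
```

With the charset fixed on both sides, the result no longer depends on the default encoding of either machine.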
4. Chinese garbled characters are displayed in JSP.
Sometimes the JSP page will have garbled characters when displaying Chinese. In most cases, it is a problem with the character set configuration and page encoding. As long as there are no problems with the following configurations, there will generally be no garbled characters.
a. Add the following statement at the top of the JSP page:
<%@ page contentType="text/html; charset=utf-8" language="java" errorPage="" %>
b. Add the following statement in the head tag of HTML.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
c. Ensure that the JSP page encoding is the same as the above two charsets. I mentioned this in the first point of the article.
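The charsets above can be chosen as needed; they do not have to be utf-8. However, utf-8 supports the languages of all countries well, Chinese in particular, so it is recommended. I once ran into a problem where the character 滘 could not be displayed correctly on a GB2312-encoded page.
5. Chinese passed via POST and GET arrives garbled on the back end.
Passing Chinese from the front end is done with either the GET or the POST method.
a. The GET case:
With GET, the Chinese text is mainly passed in the URL.
In a js file, you can encode it with the following code.
var url = "http://www.baidu.com/s?industry=编码";
url = encodeURI(url);
In a JSP file, you can encode it with the following statements.
Import at the top of the page: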
<%@ page import="java.net.URLEncoder" %>
Where encoding is needed, use URLEncoder:
<a href="xxxxx.xx?industry=<%=URLEncoder.encode("http://www.baidu.com/s?wd=编码", "UTF-8")%>">
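What URLEncoder does to a Chinese parameter value can be checked outside a JSP page with plain Java. The sketch below is only an illustration; the class and method names (UrlEncodeDemo, encodeParam, decodeParam) are invented, and the value \u7f16\u7801 is the same two-character word used in the URLs above, written with Unicode escapes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class, not part of the original article.
class UrlEncodeDemo {
    // Percent-encodes a parameter value using the UTF-8 bytes of each character.
    static String encodeParam(String value) throws UnsupportedEncodingException {
        return URLEncoder.encode(value, "UTF-8");
    }

    // Reverses encodeParam; the charset must match the one used for encoding.
    static String decodeParam(String encoded) throws UnsupportedEncodingException {
        return URLDecoder.decode(encoded, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String encoded = encodeParam("\u7f16\u7801");
        // prints http://www.baidu.com/s?industry=%E7%BC%96%E7%A0%81
        System.out.println("http://www.baidu.com/s?industry=" + encoded);
    }
}
```

Each Chinese character becomes three percent-encoded bytes, so the URL itself contains only ASCII and survives transport unchanged.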
Whichever method is used, the back end should use the following code to obtain the Chinese text:
request.setCharacterEncoding("utf-8");
String industry = new String(request.getParameter("industry").getBytes("ISO8859-1"), "UTF-8");
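[Note]
1. request.setCharacterEncoding specifies the encoding of the submitted content. Once it is set, getParameter() returns correctly decoded strings directly; if it is not set, iso8859-1 is used by default. For consistency, the encoding should be specified explicitly. Per the Servlet API, this setting applies to the request body; how the query string of a GET request is decoded is up to the container, which is why the second statement above re-decodes the parameter by hand. If the container already decodes the query string as UTF-8, that manual step is unnecessary and would itself garble the text.
2. The second statement of the code above seems to contradict the formula given in section 2. I struggled with this for a long time and finally found that ISO8859-1 is a fairly old encoding, usually called Latin-1. It is a single-byte encoding, which matches the most basic unit of representation in a computer: every byte value maps to exactly one character, so decoding bytes as ISO8859-1 and encoding them again loses nothing. Using it as the intermediate step is therefore generally safe.
ISO-8859-1 is the charset that Java web containers have traditionally used by default for data transferred over the network, while GB2312 is the standard Chinese character set. When you perform an operation that involves network transfer, such as submitting a form, the data decoded as ISO-8859-1 has to be converted to GB2312 for display. If the browser interprets ISO-8859-1 data directly as GB2312, the two are incompatible and the result is garbled. To save trouble, using the utf-8 charset everywhere is recommended.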
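The ISO-8859-1 round trip described in the note can be reproduced without a servlet container. The sketch below first simulates the wrong decoding and then undoes it with the same new String(... getBytes ...) pattern as the back-end code. It is only an illustration; the class and method names (Latin1RepairDemo, misdecode, repair) are invented for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original article.
class Latin1RepairDemo {
    // Simulates a container that decoded UTF-8 request bytes as ISO-8859-1.
    static String misdecode(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Recovers the original text: ISO-8859-1 maps every byte to exactly one char,
    // so getBytes(ISO_8859_1) returns the untouched UTF-8 bytes.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u7f16\u7801";
        String garbled = misdecode(original);
        System.out.println(garbled.length());                  // 6: one char per UTF-8 byte
        System.out.println(original.equals(repair(garbled)));  // true
    }
}
```

The repair only works because the wrong decoding was lossless. A wrong decoding with a multi-byte charset such as GBK or UTF-8 can destroy bytes, and then the original text cannot be recovered this way.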
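b. The POST case.
The POST case is simpler: where the post call is made, specify the charset in the header of the request, for example:
xmlHttp.open("post", url, true);
xmlHttp.setRequestHeader("Content-Type", "text/xml; charset=utf-8");
xmlHttp.send(param);
Here param is the parameter to be passed.
The back end is handled the same way as for the GET method. Make sure the charsets used for sending and receiving are the same.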
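6. Chinese sent from the back end to the front end arrives garbled.
Here is a function for sending information that avoids garbled characters. The core idea is again to set the charset of the response stream. The function is as follows:
// Imports assumed for this fragment (Struts 2 and the javax.servlet API)
import java.io.PrintWriter;
import javax.servlet.http.HttpServletResponse;
import org.apache.struts2.ServletActionContext;

/**
 * @Function: writeResponse
 * @Description: returns a string to an ajax request
 * @param str: json
 * @return: true: output succeeded, false: output failed
 */
public boolean writeResponse(String str) {
    boolean ret = true;
    try {
        HttpServletResponse response = ServletActionContext.getResponse();
        // Set the charset before obtaining the writer
        response.setContentType("text/html;charset=utf-8");
        PrintWriter pw = response.getWriter();
        pw.print(str);
        pw.close();
    } catch (Exception e) {
        ret = false;
        e.printStackTrace();
    }
    return ret;
}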
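7. Garbled file names when downloading files.
Anyone who has implemented file downloads knows that the name of the downloaded file easily comes out garbled. The cause is again that the encoding of the output is not specified.
Here is a piece of code to help you download without garbled names.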
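HttpServletResponse response = ServletActionContext.getResponse();
response.setContentType("text/html;charset=utf-8");
response.reset();
String header = "attachment; filename=" + picName;
// Common workaround: header values are sent as ISO-8859-1, so re-decoding the UTF-8
// bytes as ISO-8859-1 lets them reach the browser unchanged (browser support varies)
header = new String(header.getBytes("UTF-8"), "ISO-8859-1");
response.setHeader("Content-disposition", header);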
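These few lines are the core of the code. Pay attention to where reset() is called; its position must not be changed.
reset() clears the buffer, removing for example blank lines already written at the start of the response. According to the Servlet API it also clears the status code and the headers, so setHeader has to be called after it.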
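A different approach, which is not the one used in the code above, is to percent-encode the file name and send it in the filename* form of the Content-Disposition header defined by RFC 6266 and RFC 5987. The sketch below only builds the header value and does not touch the servlet API; the class and method names (ContentDispositionDemo, attachmentHeader) are invented for this example, and whether a particular browser honors filename* is something to verify in your own environment.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class, not part of the original article.
class ContentDispositionDemo {
    // Builds a Content-Disposition value whose file name is percent-encoded UTF-8.
    static String attachmentHeader(String fileName) throws UnsupportedEncodingException {
        String encoded = URLEncoder.encode(fileName, "UTF-8")
                .replace("+", "%20")   // URLEncoder writes spaces as '+', headers need %20
                .replace("*", "%2A");  // '*' is left as-is by URLEncoder but is not allowed here
        return "attachment; filename*=UTF-8''" + encoded;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // prints attachment; filename*=UTF-8''%E4%B8%AD%E6%96%87%201.png
        System.out.println(attachmentHeader("\u4e2d\u6587 1.png"));
    }
}
```

The resulting string contains only ASCII characters, so it can be passed to response.setHeader("Content-disposition", ...) without any further byte-level conversion.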