Home  >  Article  >  Java  >  JAVA WEB notes--Chinese garbled characters

JAVA WEB notes--Chinese garbled characters

巴扎黑
巴扎黑Original
2017-06-26 11:11:011448browse

JAVA WEB Garbled Code Problem Analysis

Cause of Garbled Code

In the process of Java Web development, we often encounter the problem of garbled code. The reason for garbled code can be summarized as character encoding. Does not match the decoding method.

Since the reason for garbled characters is that the character encoding and decoding methods do not match, then why do we have to encode the characters? Is it okay not to encode them? This is because the basic unit of storing data in a computer is 1 byte, that is, 8 bits, so the maximum number of characters it can express is 28=256, and in our real society There are far more characters (Chinese characters, English, other characters, etc.) than this number, so in order to solve the conflict between characters and bytes, characters must be encoded before they can be stored in the computer.

Encoding and decoding

 Common encoding methods in computers include ASCII, ISO-8859-1, GB2312, UTF-16, and UTF-8. .

ASCII code is represented by the lower 7 bits of a byte, so the maximum number of characters that can be expressed is 27=128. ISO-8859-1 is an extension of the ISO organization based on ASCII code. It is compatible with ASCII code and covers most Western European characters. ISO8859-1 uses one byte to represent, so it can express up to 256 characters. GB2312 uses double-byte encoding. The encoding range is A1-F7, where A1-A9 is the symbol area and B0-F7 is the Chinese character area, containing 6763 Chinese characters. GBK is to expand the GB2312 encoding and add more Chinese characters. There are 21,003 Chinese characters that can be expressed. UTF-16 uses a fixed-length encoding method. No matter what character is represented, it is represented by 2 bytes. This is also the storage format of characters in JAVA memory. Contrary to UTF-16, UTF-8 uses a variable-length encoding method, and different types of characters can be composed of 1-6 bytes.

Let’s take a look at the encoding of different encoding methods in the computer using the string "Hyuuga Hinata", as shown below.

Garbled code analysis and solution

 For the garbled code problem in JAVA WEB, We divide the garbled characters caused by the request and the garbled characters caused by the response. For different garbled characters, we need to analyze the causes of the garbled characters, that is, what is the character encoding method and what is the decoding method.

For the garbled characters caused by the request, we need to analyze the Http request and check its encoding method. Since the HTTP request is divided into Get request and Post request, we will discuss them separately next.

For Get request, it is the default request method of the browser, and the submission method when the form is set to "Get" when submitting. We check the specific content through Firefox browser as follows:

The address bar is:

The request content is:

 

Through the above request, we can see that the query string in the GET request is stored in the request line and sent to the WEB server , through the "Hyuga Hinata" encoding we can see that the encoding method used by the browser for this string is "UTF-8".

Looking at the server code, we can see garbled characters (as shown below). This is because the server decodes the data after receiving the string encoding by default using ISO-8859-1. , so the encoding and decoding methods are not unified.

 

The solution is as follows:

First get the string user before decoding encoding, and then specify the encoding method of the string, as shown below:

## The solution diagram is as follows:

 In the process of Java web development, we pass parameters in hyperlinks and often encounter Chinese situations. In this case, we need to encode Chinese, we can set it to UTF-8, and the decoding scheme is the same as above.

 

<a href="${pageContext.request.contextPath}/Test?user=<%=URLEncoder.encode("日向雏田", "UTF-8")%>">点击</a>

 For Post request, it is the submission method when the form submission is set to "Post". We use the Firefox browser to view its specific content as follows:

The address bar and its page are:

The post request content is:

We can know from the above picture that in the post request , put the request content directly in the request body and send it to the web server, and the encoding method is "utf-8".

In this response Servlet, the doPost method body is as follows:

 

public void doPost(HttpServletRequest request, HttpServletResponse response)
			throws ServletException, IOException {
		String user=request.getParameter("user");
		System.out.println(user);//输出为日向雏田
	}

 The reason for the garbled characters here is still when the code getParameter ("user"), the web server uses the default decoding scheme "ISO-8859-1" for decoding, resulting in encoding and decoding schemes I disagree, the solution can be to use the get request garbled solution, but there is a simpler solution, directly specify the encoding/decoding scheme of the method body as "utf-8". The plan is as follows.

 

public void doPost(HttpServletRequest request, HttpServletResponse response)
			throws ServletException, IOException {
		response.setCharacterEncoding("utf-8");  //设置请求体的编码/解码方案为UTF-8 但是请求行的编码解码方案不会受影响
		String user=request.getParameter("user");
		System.out.println(user);          //输出为日向雏田
	}

The above analysis of the garbled code caused by the request is completed.

In the garbled code caused by the impact, the web server will write the response content into the response body and return it to the client without involving the status line. For example, if "HelloWorld" is output to the browser, the response is as shown in the figure below.

We have to involve four methods for the garbled code caused by the response, as follows:

response.setHeader("Content-Type", "text/html;cahrset=utf-8");//设置发送到客户端的响应的内容类型和响应内容的编码类型(响应体的编码类型)
response.setCharacterEncoding("utf-8");//设置响应体的编码类型
response.getWriter();           //获取响应的输出字符流 
response.getOutputStream();        //获取响应的输出字节流

 For setting the response body Encoding type, such as response.setHeader("Content-Type", "text/html;cahrset=utf-8"); and response.setCharacterEncoding("utf-8"); The encoding methods set by these two methods are equivalent. If the encoding method of the response body is not set, the default is ISO-8859-1, and the subsequent encoding method of the response body characters will iterate the previous encoding method. Both of these methods are valid before the getWriter method, and the method of setting encoding in the getWriter method will be invalid.

But these two methods are a little different, that is, the browser will automatically use the setHeader("Content-Type", "text/html;cahrset=utf-8") method. The encoding method of the response body is decoded, and the setCharacterEncoding() method is not used by all browsers to decode. The two methods are tested below, and the results are as follows:

	public void doGet(HttpServletRequest request, HttpServletResponse response)
			throws ServletException, IOException {
		response.setHeader("Content-Type", "text/html;charset=utf-8");
		response.getWriter().write("日向雏田");
	}

 

 

public void doGet(HttpServletRequest request, HttpServletResponse response)
			throws ServletException, IOException {
		response.setCharacterEncoding("utf-8");
		response.getWriter().write("日向雏田");
	}

  

 

   从上面可以看到第一个方法对于浏览器来说,支持的较好,提倡采用第一种方法设置响应体的字符编码方式。

  对于获取响应字符输出流的方法,如果在此之前没有设置响应体的编码方式,那么默认为null,即ISO-8859-1方式进行编码。而且后面设置的编码方式会覆盖前面设置的编码方式。在getWriter()方法之后设置的编码无效。

  对于获取响应输出字节流,我们在输出字符串时,我们需要设置字符串的编码方式如果没有那么默认ISO-8859-1。

  对于前面2个输出流,由于只有一个输出缓存,所以这两个方法互斥。

  以上,为了保证响应无乱码,需要保证字符编码和解码方法的统一,方案如下:

	public void doGet(HttpServletRequest request, HttpServletResponse response)
			throws ServletException, IOException {
//	方案1
//		response.setHeader("Content-Type", "text/html;charset=utf-8");
//		response.getWriter().write("日向雏田");
//	方案2
//		response.getOutputStream().write("日向雏田".getBytes("UTF-8"));
//	方案1,2互斥
	}

  

   此外在Java web开发过程中,我们还会遇到当进行文件下载时,中文文件名导致的问题,如下图所示:

	public void doGet(HttpServletRequest request, HttpServletResponse response)
			throws ServletException, IOException {
		String realPath=this.getServletContext().getRealPath("/src/日向雏田.jpg");
		String fileName=realPath.substring(realPath.lastIndexOf(&#39;\\&#39;)+1);
		response.setHeader("content-disposition", "attachment;filename="+fileName);
		InputStream is=new FileInputStream(new File(realPath));
		OutputStream os=response.getOutputStream();
		byte[] buff=new byte[1024];
		int len=0;
		while((len=is.read(buff))>0){
			os.write(buff, 0, len);
		}
		os.close();
		is.close();
	}

  采用火狐浏览器进行测试,查看页面效果,及其响应结果如下:

  

  经过查看响应头分析,下载文件名存放在响应头中,且对于中文文字没有采用UTF-8、UTF-16、GBK等等能识别中文的编码,那么对于中文文件名导致采用哪种编码方式呢?查看REF 7578得知,在此处采用ASCII编码,但是REF规定,如果不可避免的要使用非ASCII码的字符,程序员应该均匀的使用UTF-8,来最小化交互操作的问题。

  所以,解决方案就是把文件名编码成UTF-8,传递给响应头,浏览器(部分)默认对该文件名进行UTF-8解码处理。

public void doGet(HttpServletRequest request, HttpServletResponse response)
			throws ServletException, IOException {
		String realPath=this.getServletContext().getRealPath("/src/日向雏田.jpg");
		String fileName=realPath.substring(realPath.lastIndexOf(&#39;\\&#39;)+1);
		String utf_8Name=URLEncoder.encode(fileName,"utf-8");//解决方案
		response.setHeader("content-disposition", "attachment;filename="+utf_8Name);
		InputStream is=new FileInputStream(new File(realPath));
		OutputStream os=response.getOutputStream();
		byte[] buff=new byte[1024];
		int len=0;
		while((len=is.read(buff))>0){
			os.write(buff, 0, len);
		}
		os.close();
		is.close();
	}

  效果如下:其中火狐浏览器并没有对其解码

 

The above is the detailed content of JAVA WEB notes--Chinese garbled characters. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn