
Some cases of garbled Chinese characters in files under Linux

高洛峰
2016-12-15 16:28:49

Garbled characters are caused by a character set mismatch: when the system is not configured to use the character set that the text is actually encoded in, it cannot interpret the bytes correctly and displays garbage. The solution is not difficult.

First, note that the language environment of a Linux system is controlled by environment variables, chiefly $LANG and $LC_ALL. To fix garbled characters, these two variables need to be set correctly.
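To see which variable actually decides a given locale category, the lookup order can be sketched in a few lines of Python. This is a minimal sketch, not a system tool: the function name `effective_locale` is hypothetical, and it assumes the usual POSIX precedence, where LC_ALL overrides the category's own variable (for example LC_CTYPE), which in turn overrides LANG.

```python
import os


def effective_locale(category, env=None):
    """Return (variable_name, value) that decides the given LC_* category.

    Assumed precedence: LC_ALL, then the category's own variable
    (for example LC_CTYPE), then LANG. Unset or empty values are skipped.
    """
    env = os.environ if env is None else env
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return name, value
    # Nothing is set, so the default "C" locale applies.
    return None, "C"


if __name__ == "__main__":
    for cat in ("LC_CTYPE", "LC_MESSAGES"):
        print(cat, "->", effective_locale(cat))
```

Running it in the shell where the garbled text appears shows quickly whether a stray LC_ALL is overriding the LANG value you expected.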

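There are two situations in which garbled characters appear:
1. Garbled characters in the terminal (pure shell interface)
vi /etc/profile
export LC_ALL="zh_CN.GB18030:zh_CN.GB2312:zh_CN.GBK:zh_CN:en_US.UTF-8:en_US:en:zh:zh_TW:zh_CN.BIG5"
Save and exit, then reboot the system (logging in again is also enough for /etc/profile to be re-read). Note that glibc reads LC_ALL as a single locale name; only LANGUAGE is treated as a colon-separated priority list. If the list form has no effect, set a single value instead, for example export LC_ALL="zh_CN.GB18030".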

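2. Garbled characters in X-window (graphical interface)
vi /etc/sysconfig/i18n
LANG="zh_CN.GB18030:zh_CN.GB2312:zh_CN.GBK:zh_CN:en_US.UTF-8:en_US:en:zh:zh_TW:zh_CN.BIG5"
LANGUAGE="zh_CN.GB18030:zh_CN.GB2312:zh_CN.GBK:zh_CN:en_US.UTF-8:en_US:en:zh:zh_TW:zh_CN.BIG5"
Save and reboot. As with LC_ALL, LANG is read as a single locale name, so if the list form has no effect, keep the list in LANGUAGE and set LANG to one value such as zh_CN.GB18030.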

There are many Chinese character set encodings, and I am not very clear about how compatible they are with each other, so I listed as many as I could find; you can trim the list yourself. The general solution is to modify the variables that control the locale and to extend the set of character sets the OS supports (the prerequisite is that the character set is actually available on the system; according to the author, if the kernel lacks it, the kernel has to be recompiled).


The web system under development is deployed on Red Hat.
RH version information:
LSB Version: :core-3.1-amd64:core-3.1-ia32:core-3.1-noarch:graphics-3.1-amd64:graphics-3.1-ia32:graphics-3.1-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 5 (Tikanga)
Release: 5
Codename: Tikanga
-------------------------------
locale information:
LANG=zh_CN.UTF-8
LC_CTYPE="zh_CN.UTF-8"
LC_NUMERIC="zh_CN.UTF-8"
LC_TIME="zh_CN.UTF-8"
LC_COLLATE="zh_CN.UTF-8"
LC_MONETARY="zh_CN.UTF-8"
LC_MESSAGES="zh_CN.UTF-8"
LC_PAPER="zh_CN.UTF-8"
LC_NAME="zh_CN.UTF-8"
LC_ADDRESS="zh_CN.UTF-8"
LC_TELEPHONE="zh_CN.UTF-8"
LC_MEASUREMENT="zh_CN.UTF-8"
LC_IDENTIFICATION="zh_CN.UTF-8"
LC_ALL=
-------------------------------
Several files in the program directory need to be read and their names displayed on the page. The file names are in Chinese.
I used the File.list() method to get the list of file names, but every name was displayed as garbled characters. I tried the following conversions:
new String(filename.getBytes("utf-8"), "GBK");
new String(filename.getBytes("iso-8859-1"), "GBK");
new String(filename.getBytes(), "GBK");

None of them worked.
System.getProperty("file.encoding") returns "utf-8".
In addition, plain ls shows the names garbled, while ls --show-control-chars displays the Chinese names correctly (on the console).
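One plausible reason the conversions above cannot work is that the damage is already done before the string reaches the program. The Python sketch below illustrates this under an assumption the original does not confirm: the names are stored on disk as GBK bytes and the runtime decoded them as UTF-8. The file name `中文.txt` is only an illustration.

```python
name = "中文.txt"
gbk_bytes = name.encode("gbk")  # the bytes a GBK system would store on disk

# Decoding GBK bytes as UTF-8 is lossy: each invalid sequence is replaced
# with U+FFFD, so the original bytes can no longer be reconstructed.
decoded_wrong = gbk_bytes.decode("utf-8", errors="replace")

# Analogue of new String(filename.getBytes("utf-8"), "GBK"):
# re-encoding the damaged string and decoding it as GBK does not help.
repaired = decoded_wrong.encode("utf-8").decode("gbk", errors="replace")

# The round trip only works when the wrong decoder was lossless, such as
# ISO-8859-1 (latin-1), which maps every byte to exactly one character.
via_latin1 = gbk_bytes.decode("latin-1")
recovered = via_latin1.encode("latin-1").decode("gbk")

print(repr(decoded_wrong))
print("repair after UTF-8 decoding worked:", repaired == name)
print("repair after latin-1 decoding worked:", recovered == name)
```

This is why the iso-8859-1 trick helps only when the runtime's default charset is ISO-8859-1; with file.encoding set to utf-8, the fix has to happen at the locale or file system level instead.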




Suggested fix: add the locale. Your system probably does not support the GBK character set.

Under Ubuntu, edit the list of supported locales: vi /var/lib/locales/supported.d/local

After adding the locale, run locale-gen to refresh the character set cache.
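To confirm that a locale is usable after regenerating, one option is to ask the C library to switch to it. The sketch below is a minimal check using Python's standard `locale` module; the helper name `is_locale_available` is hypothetical, and the locale names in the usage loop are examples only.

```python
import locale


def is_locale_available(name):
    """Return True if the C library accepts `name` for LC_CTYPE."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        return False
    finally:
        # Restore whatever was active before the probe.
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == "__main__":
    for candidate in ("zh_CN.GBK", "zh_CN.GB18030", "zh_CN.UTF-8"):
        print(candidate, is_locale_available(candidate))
```

A result of False for zh_CN.GBK suggests the locale still has to be added and generated.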


If you work on Linux with files created under Windows, you will often run into the encoding conversion problem: files created on Chinese Windows are GBK (gb2312) by default, while Linux generally uses UTF-8. Here is how to check a file's encoding on Linux and how to convert it.
First, check the file encoding:
On Linux, the file encoding can be checked in the following ways:
1. View the file encoding directly in Vim
:set fileencoding
This displays the encoding of the current file.
If you only want to view files in other encodings, or files opened in Vim show garbled characters, add the following to the
~/.vimrc file:
set encoding=utf-8
set fileencodings=ucs-bom,utf-8,cp936
With this, Vim can identify the file encoding automatically (UTF-8 or GBK encoded files). In practice it simply tries each encoding in the fileencodings list in order. If none fits, fileencoding is left empty and the file is read as if it were in the encoding given by encoding; append latin1 to the end of the list if you want a latin-1 fallback.
2. Use enca to view the file encoding (if the command is not installed, you can install it with sudo yum install -y enca)
$ enca filename
filename: Universal transformation format 8 bits; UTF-8
  CRLF line terminators
One thing to note is that enca does not recognize some GBK encoded files very well. In that case it reports:
Unrecognized encoding
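When enca cannot recognize a file, the same try-in-order idea that fileencodings uses can be approximated by hand. The sketch below is a rough heuristic, not what enca or Vim actually implement: the function name `guess_encoding` and the candidate list are assumptions, and it only reports the first encoding that decodes without error.

```python
import codecs

# Order matters: UTF-8 is strict, so it goes before the looser cp936 (GBK).
CANDIDATES = ("utf-8", "cp936")


def guess_encoding(data, candidates=CANDIDATES):
    """Return the first candidate that decodes `data` cleanly, else None."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # UTF-8 with a byte order mark
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None


if __name__ == "__main__":
    import sys

    # Usage: python guess.py FILE...
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(path, guess_encoding(handle.read()))
```

A clean decode does not prove the guess is right, since many byte sequences are valid in more than one encoding, so treat the result as a hint.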
Second, file encoding conversion
1. Convert the file encoding directly in Vim, for example to convert a file to UTF-8 (the conversion happens when the file is written):
:set fileencoding=utf-8
2. Convert with iconv. The command format is as follows:
iconv -f from-encoding -t to-encoding inputfile
For example, to convert a GBK encoded file into UTF-8:
iconv -f GBK -t UTF-8 file1 -o file2
3. Convert with enconv
For example, to convert a GBK encoded file to UTF-8, run:
enconv -L zh_CN -x UTF-8 filename
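For scripted conversions, the iconv example above has a close equivalent in Python. This is a minimal sketch under stated assumptions: the helper name `convert_file` and its defaults are hypothetical, and the whole file is read into memory, which is fine for ordinary text files.

```python
from pathlib import Path


def convert_file(src, dst, from_enc="gbk", to_enc="utf-8"):
    """Re-encode src into dst, roughly like: iconv -f GBK -t UTF-8 src -o dst."""
    # Work on raw bytes so that line endings (for example CRLF) are untouched.
    raw = Path(src).read_bytes()
    # Raises UnicodeDecodeError if the file is not valid in from_enc.
    text = raw.decode(from_enc)
    Path(dst).write_bytes(text.encode(to_enc))


if __name__ == "__main__":
    import sys

    # Usage: python convert.py INPUT OUTPUT
    convert_file(sys.argv[1], sys.argv[2])
```

Writing to a separate output file, as the iconv example does, keeps the original intact in case the source encoding was guessed wrongly.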
Third, file name encoding conversion:
When files are copied from Linux to Windows or from Windows to Linux, Chinese file names sometimes come out garbled. The reason is that the default encoding for Chinese file names on Windows is GBK, while the default file name encoding on Linux is UTF-8. Because the encodings differ, the file name is garbled, and it has to be transcoded to fix the problem.
On Linux, the convmv tool exists specifically to convert file name encodings. It can convert file names from GBK to UTF-8, or from UTF-8 to GBK.
First check whether convmv is installed on your system. If not, install it with:
yum -y install convmv

The usage of convmv is as follows:
convmv -f source-encoding -t new-encoding [options] file-name
Common options:
-r Process subdirectories recursively
--notest Actually perform the rename. By default convmv only runs a test and does not touch the files.
--list List all supported encodings
--unescape Decode percent-escapes in names, such as turning %20 into a space

For example, to convert a UTF-8 encoded file name to GBK, the command is as follows:
convmv -f UTF-8 -t GBK --notest <UTF-8 encoded file name>
After this conversion, the file name is encoded in GBK (only the encoding of the file name is converted; the file content does not change).
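What a file name conversion amounts to can be sketched in Python by working on raw byte names. This is only an illustration of the idea, not a replacement for convmv: the function names `convert_name` and `plan_renames` are hypothetical, it handles a single directory, and it does not replicate convmv's safety checks (for example, skipping names that already look like valid UTF-8). Like convmv without --notest, it only reports what would be renamed.

```python
import os


def convert_name(raw_name, from_enc, to_enc):
    """Re-encode one file name given as bytes."""
    return raw_name.decode(from_enc).encode(to_enc)


def plan_renames(directory, from_enc="gbk", to_enc="utf-8"):
    """Return (old, new) byte-name pairs for one directory. Dry run only."""
    plans = []
    # Passing a bytes path makes os.listdir return the raw byte names.
    for raw in os.listdir(os.fsencode(directory)):
        try:
            new = convert_name(raw, from_enc, to_enc)
        except UnicodeError:
            continue  # not valid in from_enc (or not representable): skip
        if new != raw:
            plans.append((raw, new))
    return plans


if __name__ == "__main__":
    for old, new in plan_renames("."):
        print(old, "->", new)
```

Applying a plan would be a matter of calling os.rename on each pair inside the directory, ideally after checking that the target name does not already exist.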


Fourth, Vim encoding settings
Like all popular text editors, Vim can edit files in a wide range of character encodings, including the popular Unicode encodings such as UCS-2 and UTF-8. Unfortunately, as with a lot of software from the Linux world, you have to set this up yourself.
Vim has four options related to character encoding: encoding, fileencoding, fileencodings, and termencoding (for the possible values of these options, see the Vim online help, :help encoding-names). Their meanings are as follows:
* encoding: the character encoding Vim uses internally, covering buffers, menu text, message text, and so on. The default is chosen according to your locale. The user manual recommends changing its value only in .vimrc, and in practice that seems to be the only place where changing it makes sense. The file you edit and save can still use a different encoding. For example, if encoding is utf-8 and the file being edited is cp936, Vim automatically converts the file to utf-8 when reading it (so Vim can work with it) and converts it back to cp936 (the file's saved encoding) when writing it.
* fileencoding: the character encoding of the file currently being edited. Vim also saves the file in this encoding (regardless of whether the file is new).
* fileencodings: the ordered list Vim uses to detect fileencoding automatically. On startup, Vim tries the encodings in this list one by one against the file being opened and sets fileencoding to the first one that fits. It is therefore best to put the Unicode encodings at the front of this list and the Latin encoding latin1 at the end.
* termencoding: the character encoding of the terminal Vim runs in (or of the Windows console window). If the terminal uses the same encoding as encoding, nothing needs to be set. Otherwise, set termencoding so that Vim converts to the terminal's encoding automatically. This option has no effect on gVim, the GUI version commonly used under Windows. For console-mode Vim it is the code page of the Windows console, and it usually does not need to be changed.


Fifth, how Vim works with multiple character encodings
1. Start Vim and set the character encoding of buffers, menu text, and message text according to the encoding value set in .vimrc.

2. Read the file to be edited and test its encoding against the character encodings listed in fileencodings, one by one. Set fileencoding to the encoding that was detected and appears to be correct (Note 1).

3. Compare the values of fileencoding and encoding. If they differ, call iconv to convert the file content to the character encoding given by encoding, and put the converted content into the buffer opened for this file. At this point the file can be edited. Note that on Windows this step requires the external iconv.dll (Note 2). Make sure this file exists in $VIMRUNTIME or in one of the directories listed in the PATH environment variable.

4. When saving the file after editing, compare the values of fileencoding and encoding again. If they differ, call iconv again to convert the buffer text into the character encoding given by fileencoding and save it to the specified file. Again, this requires iconv.dll.

Since Unicode covers characters from almost all languages, and UTF-8 is a very cost-effective Unicode encoding (it takes less space than UCS-2 for text that is mostly ASCII), it is recommended to set encoding to utf-8. Another reason is that with encoding set to utf-8, Vim detects the encoding of files more accurately (perhaps this is the main reason). For files edited on Chinese Windows, it is more appropriate to save them as GB2312/GBK for compatibility with other software, so it is recommended to set fileencoding to chinese (chinese is an alias that means gb2312 on Unix and cp936, the GBK code page, on Windows).
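The read, convert, edit, and convert-back cycle described in the steps above can be modelled in a few lines of Python. This is a toy model under stated assumptions, not Vim's implementation: the names `load_buffer`, `save_buffer`, and `INTERNAL` are hypothetical, the internal encoding is fixed to UTF-8, and detection is reduced to "first encoding that decodes without error".

```python
INTERNAL = "utf-8"  # plays the role of Vim's 'encoding' option


def load_buffer(file_bytes, fileencodings=("utf-8", "cp936", "latin-1")):
    """Steps 2 and 3: detect the file encoding, convert to the internal one."""
    for enc in fileencodings:
        try:
            text = file_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
        return text.encode(INTERNAL), enc
    # Nothing matched: keep the bytes as they are, with no file encoding set.
    return file_bytes, ""


def save_buffer(buffer_bytes, fileencoding):
    """Step 4: convert the buffer back to the file's own encoding."""
    if not fileencoding or fileencoding == INTERNAL:
        return buffer_bytes
    return buffer_bytes.decode(INTERNAL).encode(fileencoding)


if __name__ == "__main__":
    on_disk = "中文编码".encode("cp936")
    buf, fenc = load_buffer(on_disk)
    buf += "测试".encode(INTERNAL)  # the edit happens in the internal encoding
    print(fenc, save_buffer(buf, fenc) == "中文编码测试".encode("cp936"))
```

The model also shows why latin-1 belongs at the end of the list: it accepts any byte sequence, so anything placed after it would never be tried.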
