In this tutorial, we will explain what character encoding means, and then give some examples of using command line tools to convert files from one character encoding to another. Finally, we will look at how to convert files in various character encodings to UTF-8 under Linux.
As you may already know, computers do not natively understand or store characters, numbers, or anything else that humans work with; they handle only binary data. A bit has only two possible values: 0 or 1, true or false, yes or no. Everything else, such as characters, numbers, and images, must be represented in binary form before a computer can process it.
Simply put, a character encoding is a set of rules that tells the computer how to interpret raw 0s and 1s as actual characters. In an encoding, each character is represented by a number.
There are many character encoding schemes, such as ASCII, ANSI, and Unicode. Below is an example of ASCII encoding.
Character    Binary
A            01000001
B            01000010
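To reproduce these bit patterns yourself, here is a minimal sketch. The function name char_to_binary is made up for this illustration and is not a standard tool. It relies on the printf convention that a leading quote yields the character's numeric code, and it is only meant for single-byte ASCII characters.

```shell
#!/bin/bash
# char_to_binary is a hypothetical helper name used only in this example.
# It prints the 8-bit binary form of a single ASCII character.
char_to_binary() {
    # A leading quote makes printf output the character's numeric code.
    code=$(printf '%d' "'$1")
    bits=""
    i=7
    # Collect the eight bits from most significant to least significant.
    while [ "$i" -ge 0 ]; do
        bits="$bits$(( (code >> i) & 1 ))"
        i=$(( i - 1 ))
    done
    printf '%s\n' "$bits"
}

char_to_binary A   # prints 01000001
char_to_binary B   # prints 01000010
```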
In Linux, the command line tool iconv is used to convert text using one encoding to another encoding.
You can use the file command with the -i (or --mime) option to view the character encoding of a file. This option makes file print the MIME (Multipurpose Internet Mail Extensions) type and character set of the file, as in the examples below:
$ file -i Car.java
$ file -i CarDriver.java
View the encoding of a file in Linux
The iconv tool is used as follows:
$ iconv option
$ iconv options -f from-encoding -t to-encoding inputfile(s) -o outputfile
Here, -f or --from-code specifies the input encoding, and -t or --to-code specifies the output encoding.
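As a small illustration of the -f and -t options, the sketch below pipes a single character through iconv and prints the bytes before and after as hexadecimal. It assumes that iconv reads standard input when no input file is given and writes to standard output when -o is omitted. The variable names are chosen for this example only.

```shell
#!/bin/bash
# The letter é is the single byte 0xE9 (octal 351) in ISO-8859-1.
latin1_hex=$(printf '\351' | od -An -tx1 | tr -d ' \n')

# Convert that byte to UTF-8 and dump the result the same way.
utf8_hex=$(printf '\351' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "ISO-8859-1 bytes: $latin1_hex"   # e9
echo "UTF-8 bytes:      $utf8_hex"     # c3a9
```

The same character takes one byte in ISO-8859-1 and two bytes in UTF-8, which is why a file's size can change after conversion.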
To list all existing encoded character sets, you can use the following command:
$ iconv -l
List all existing encoded character sets
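The full list is long, so it can be handy to check a single encoding name directly. One possible approach, sketched below, is to attempt a conversion of empty input and look at the exit status. The function name encoding_supported is hypothetical, and NO-SUCH-ENCODING is a deliberately invalid name used for the demonstration.

```shell
#!/bin/bash
# encoding_supported is a hypothetical helper name used only in this example.
# It succeeds if iconv accepts the given name as an input encoding.
encoding_supported() {
    # Converting empty input is enough to make iconv validate the name.
    iconv -f "$1" -t UTF-8 </dev/null >/dev/null 2>&1
}

for enc in ISO-8859-1 NO-SUCH-ENCODING; do
    if encoding_supported "$enc"; then
        echo "$enc: supported"
    else
        echo "$enc: not supported"
    fi
done
```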
Convert the file from ISO-8859-1 encoding to UTF-8 encoding
Below, we will learn how to convert a file from one encoding scheme to another. The following commands convert a file from ISO-8859-1 encoding to UTF-8 encoding.
Consider the following file input.file, which contains these characters:
� � � �
We start by looking at the encoding of this file, and then look at the file content. Finally, we can convert all characters to UTF-8 encoding.
After running the iconv command, we can check the contents of the output file and the character encoding it uses as follows.
$ file -i input.file
$ cat input.file
$ iconv -f ISO-8859-1 -t UTF-8//TRANSLIT input.file -o out.file
$ cat out.file
$ file -i out.file
Convert ISO-8859-1 to UTF-8 in Linux
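If you want to try the conversion without an existing ISO-8859-1 file, the following sketch creates one in a temporary directory, converts it, and then converts it back to confirm that nothing was lost. The sample bytes and file names are assumptions made for this example and are not the contents of input.file above.

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Write the four bytes E0 E8 EC F2 (à è ì ò in ISO-8859-1) to a sample file.
printf '\340\350\354\362' > "$workdir/sample.latin1"

# Convert to UTF-8, then back again to ISO-8859-1.
iconv -f ISO-8859-1 -t UTF-8 "$workdir/sample.latin1" -o "$workdir/sample.utf8"
iconv -f UTF-8 -t ISO-8859-1 "$workdir/sample.utf8" -o "$workdir/roundtrip.latin1"

# Each of these characters needs two bytes in UTF-8, so 4 bytes become 8.
utf8_size=$(( $(wc -c < "$workdir/sample.utf8") ))
echo "UTF-8 size in bytes: $utf8_size"

# The round trip should reproduce the original file byte for byte.
if cmp -s "$workdir/sample.latin1" "$workdir/roundtrip.latin1"; then
    roundtrip_result="round trip OK"
else
    roundtrip_result="round trip differs"
fi
echo "$roundtrip_result"

rm -rf "$workdir"
```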
Note: If the string //IGNORE is appended to the output encoding, characters that cannot be converted are discarded, and an error message is displayed after the conversion.
If the string //TRANSLIT is appended to the output encoding, as in the example above (UTF-8//TRANSLIT), characters are transliterated where needed and possible. That is, if a character cannot be represented in the output encoding, it is replaced by one or more similar-looking characters.
Finally, if a character is not in the output encoding and cannot be transliterated, it is replaced by a question mark (?) in the output file.
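The effect of the two suffixes is easiest to see by converting to an encoding that lacks a character. The sketch below converts the word "café" from UTF-8 to ASCII three ways. The exact results depend on the iconv implementation and, for //TRANSLIT, on the current locale, so no output is promised here; the comments describe what GNU iconv is documented to do.

```shell
#!/bin/bash
# "café" encoded in UTF-8; é (bytes C3 A9) has no ASCII equivalent.
sample=$(printf 'caf\303\251')

# Without a suffix, GNU iconv reports an error at the unconvertible character.
if printf '%s' "$sample" | iconv -f UTF-8 -t ASCII >/dev/null 2>&1; then
    plain_status="succeeded"
else
    plain_status="failed"
fi
echo "plain conversion: $plain_status"

# //TRANSLIT: é is replaced by a similar-looking character, or by ? if none is found.
translit_out=$(printf '%s' "$sample" | iconv -f UTF-8 -t ASCII//TRANSLIT 2>/dev/null)
echo "//TRANSLIT: $translit_out"

# //IGNORE: é is discarded; the error message is hidden here with 2>/dev/null.
ignore_out=$(printf '%s' "$sample" | iconv -f UTF-8 -t ASCII//IGNORE 2>/dev/null)
echo "//IGNORE:   $ignore_out"
```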
Convert multiple files to UTF-8 encoding
Back to our topic. If you want to convert multiple files or even all files in a certain directory to UTF-8 encoding, you can write a simple shell script like the following and name it encoding.sh:
#!/bin/bash
### Replace "value_here" with the input encoding
FROM_ENCODING="value_here"
### Output encoding (UTF-8)
TO_ENCODING="UTF-8"
### Conversion command
CONVERT="iconv -f $FROM_ENCODING -t $TO_ENCODING"
### Convert multiple files in a loop
for file in *.txt; do
    $CONVERT "$file" -o "${file%.txt}.utf8.converted"
done
exit 0
Save the file, then make it executable. Run the script from the directory that contains the files to be converted (*.txt):
$ chmod +x encoding.sh
$ ./encoding.sh
Important: You can also make this script more general, so that it converts from any given character encoding to another. To do this, you only need to change the values of the FROM_ENCODING and TO_ENCODING variables. Don't forget to also change the output file name, "${file%.txt}.utf8.converted", so that it reflects the new target encoding.
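One way to generalize the script is to turn the encodings, the directory, and the file extension into parameters. The sketch below is only one possible design: the function name convert_dir, its argument order, and the output naming scheme (the target encoding inserted before .converted) are assumptions made for this example.

```shell
#!/bin/bash
# convert_dir is a hypothetical helper name used only in this example.
# Usage: convert_dir FROM_ENCODING TO_ENCODING DIRECTORY EXTENSION
convert_dir() {
    from=$1
    to=$2
    dir=$3
    ext=$4
    for file in "$dir"/*."$ext"; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$file" ] || continue
        # Name the output after the target encoding, e.g. a.UTF-8.converted.
        iconv -f "$from" -t "$to" "$file" -o "${file%."$ext"}.$to.converted" || return 1
    done
}

# Demonstration in a temporary directory with one ISO-8859-1 file.
demo=$(mktemp -d)
printf '\351\n' > "$demo/a.txt"          # é followed by a newline
convert_dir ISO-8859-1 UTF-8 "$demo" txt

# Dump the converted file as hex: é becomes c3 a9, the newline stays 0a.
converted_hex=$(od -An -tx1 < "$demo/a.UTF-8.converted" | tr -d ' \n')
echo "$converted_hex"
rm -rf "$demo"
```

Returning a non-zero status on the first failed conversion makes it easier to notice a wrong encoding name before the whole directory has been processed.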
For more information, you can check the man page of iconv.
$ man iconv
To sum up, understanding the concept of character encoding and knowing how to convert text from one encoding scheme to another is essential for computer users, especially programmers, who work with text.