Detailed explanation of character encoding in Python-Python Tutorial-php.cn

零下一度

Jun 16, 2017 am 10:49 AM


This article covers the basics of character encoding in Python. It is shared here as a reference.

Preface

Character encoding is easy to get wrong. Keep these points in mind:

1. Whichever encoding a file was saved with is the encoding that must be used to open it.

2. To execute a program, the interpreter first reads the file into memory.

3. Unicode is the parent encoding: it can only be encoded into other encoding formats.

UTF-8 and GBK are child encodings: they can only be decoded into Unicode.
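A minimal Python 3 sketch of the first rule. The sample text and variable names are illustrative and not from the original article: bytes written with one encoding only come back intact when decoded with that same encoding.

```python
# Illustrative text; any non-ASCII string would do.
text = "中文"

# Encode: Unicode (str) -> bytes in a specific encoding.
gbk_bytes = text.encode("gbk")

# Decoding with the encoding used to save gives the original text back.
same_encoding = gbk_bytes.decode("gbk")

# Decoding these bytes with a different encoding fails here
# (in other cases a mismatch can silently produce garbled text instead).
try:
    gbk_bytes.decode("utf-8")
    mismatch_failed = False
except UnicodeDecodeError:
    mismatch_failed = True

print(same_encoding == text, mismatch_failed)
```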

1. What is character encoding

Computers can only work with binary, so the code we write must be converted into binary before a computer can process it. How are the characters we type converted into binary? A standard maps each character to a specific number, one to one. This standard is called a character encoding.

Character------(Character encoding)------->Number
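The character-to-number mapping can be observed directly in Python 3 with the built-in ord() and chr() functions. This is a small sketch; the sample characters are chosen only for illustration.

```python
# ord() maps a character to its number (its Unicode code point).
letter_number = ord("A")
han_number = ord("中")

# The number is what the computer stores, ultimately in binary.
print(letter_number, bin(letter_number))
print(han_number, hex(han_number))

# chr() maps a number back to its character.
letter = chr(letter_number)
han = chr(han_number)
print(letter, han)
```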

2. Development history of character encoding

1. ASCII

Computers originated in the United States, and so did character encoding. English needs only 26 letters plus digits and some special symbols, unlike Chinese, where even primary school students must learn thousands of characters. So the United States adopted ASCII (American Standard Code for Information Interchange) as its character encoding. One byte represents one character, and 1 byte = 8 bits, which allows 2 to the 8th power, or 256, different values. Initially only 7 bits were used, giving 128 characters, which was enough for English (the limit was also a matter of cost). Later, extended 8-bit encodings used the 8th bit to add Latin characters. At that point all 256 values were taken, which covered English and other Latin-alphabet languages.
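A short sketch of the ASCII limits described above, using Python 3's built-in "ascii" codec. The sample strings are illustrative.

```python
# ASCII uses 7 bits, so it defines 2**7 = 128 code values (0..127).
ascii_size = 2 ** 7

# Plain English text fits in ASCII, one byte per character.
english = "Hello".encode("ascii")

# A character outside that range cannot be encoded as ASCII.
try:
    "中".encode("ascii")
    ascii_rejected = False
except UnicodeEncodeError:
    ascii_rejected = True

print(ascii_size, len(english), ascii_rejected)
```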

2. GBK

China needed its own encoding for Chinese characters. In 1980 the national standards administration issued GB2312, which was later extended into GBK. GBK uses two bytes to represent a Chinese character, so in theory there are at most 2 to the 16th power, or 65,536, combinations, which is enough for commonly used Chinese characters.

At the same time, other countries released their own national character encoding standards, such as Japan's Shift_JIS and South Korea's EUC-KR.

3. Unicode

At their peak there were reportedly hundreds of character encodings, none compatible with the others. That made it hard for systems around the world to exchange text, so Unicode came into being. Unicode 1.0 was published in 1991 by the Unicode Consortium and was later aligned with the ISO/IEC 10646 standard. Known as the universal code, it originally used two bytes per character, giving 65,536 combinations, enough to cover most of the world's languages in common use. Later versions extended the range beyond two bytes.

4. UTF-8

Unicode is good, but it has a problem: English text that used to fit in one byte per character now needs two, doubling the storage space. UTF-8 was created to address this. It uses 1 byte for English characters and 3 bytes for most Chinese characters.
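The byte counts can be checked in Python 3. In this sketch, the "utf-16-le" codec is used as a stand-in for a fixed two-byte form of Unicode; that choice, like the sample characters, is an assumption made for illustration.

```python
samples = ["a", "中"]
encodings = ["utf-8", "gbk", "utf-16-le"]

# Map (character, encoding) -> number of bytes used to store it.
sizes = {(ch, enc): len(ch.encode(enc)) for ch in samples for enc in encodings}

for (ch, enc), size in sizes.items():
    print(f"{ch!r} in {enc}: {size} byte(s)")
```

The English letter takes one byte in UTF-8 but two in the two-byte form, while the Chinese character takes three bytes in UTF-8 and two in GBK.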

5. In its fixed-width two-byte form, Unicode stores every character in two bytes. This is simple and crude: converting between characters and numbers is fast, but it takes up more storage space.

UTF-8 uses different lengths for different characters, which saves space, but conversion is slower than with fixed-width Unicode.

Memory uses Unicode, because memory exists for speed: it is worth sacrificing a little space to keep things fast.

Disk storage and network transmission use UTF-8, because disk and network I/O latency is far greater than the cost of UTF-8 conversion, and network transmission should save as much bandwidth as possible.

3. Python interpreter execution

Stage one: the Python interpreter starts, which is comparable to starting a text editor.

Stage two: the Python interpreter acts like a text editor and opens t.py, reading its contents from the hard disk into memory.

Stage three: the Python interpreter interprets and executes the code of t.py that was just loaded into memory.

In stage two, the t.py file was saved with some character encoding, and the Python interpreter must use that same encoding when it opens the file (the default source encoding is ASCII in Python 2 and UTF-8 in Python 3). If the file was saved with an encoding other than the interpreter's default, write a declaration such as # coding: gbk on the first or second line of the file. This tells the interpreter to read the file with the declared encoding instead of its default, so the file is decoded correctly.
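One way to see which encoding Python 3 would pick for a source file is the standard library's tokenize.detect_encoding(). The sketch below feeds it two made-up source snippets; the helper name source_encoding and the snippets are hypothetical.

```python
import io
import tokenize

def source_encoding(source_bytes):
    """Return the encoding Python 3 would use to decode this source code."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

# A file that declares its encoding on the first line.
declared = source_encoding(b"# coding: gbk\nname = 'x'\n")

# A file with no declaration falls back to the Python 3 default.
default = source_encoding(b"name = 'x'\n")

print(declared, default)
```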

In stage three, the interpreter executes the code that has been loaded into memory (as Unicode by default). When execution reaches an operation such as defining a variable, new memory is allocated. Note that this new memory does not necessarily hold Unicode: it is just space and can hold data in any encoding the program chooses. In Python 3, for example, a str holds Unicode text, while the bytes returned by str.encode() hold that text in whichever encoding was specified.
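A small Python 3 sketch of this point, with illustrative variable names: the same text can live in memory either as Unicode (str) or as bytes in a chosen encoding.

```python
name = "中文"                      # str: Unicode text
name_gbk = name.encode("gbk")      # bytes: the same text encoded as GBK
name_utf8 = name.encode("utf-8")   # bytes: the same text encoded as UTF-8

# len() counts characters for str and bytes for bytes.
print(type(name).__name__, len(name))
print(type(name_gbk).__name__, len(name_gbk))
print(type(name_utf8).__name__, len(name_utf8))
```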

4. Encoding and decoding

Saving a file writes data from memory to the hard disk.

Reading a file loads data from the hard disk into memory.

Unicode is the parent encoding; UTF-8 and GBK are child encodings. To convert one child encoding to another, first convert it to the parent encoding, then convert from the parent encoding to the other child encoding.

Decoding (decode) is the process of converting a child encoding into the parent encoding, Unicode.

Encoding (encode) is the process of converting Unicode into another encoding.
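The child-to-parent-to-child conversion can be sketched in Python 3 as follows. The transcode helper is a hypothetical name introduced here for illustration.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert bytes between two child encodings by way of Unicode (str)."""
    text = data.decode(source_encoding)   # child encoding -> Unicode
    return text.encode(target_encoding)   # Unicode -> other child encoding

gbk_data = "中文".encode("gbk")
utf8_data = transcode(gbk_data, "gbk", "utf-8")

print(gbk_data, utf8_data)
```

There is no direct GBK-to-UTF-8 step here: the bytes always pass through a str in between.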

As mentioned before, a file read into memory becomes Unicode (this is the default and can be changed explicitly). Reading a file from the hard disk therefore means decoding the UTF-8 on disk into Unicode.

Saving a file moves it from memory to the hard disk. The disk stores UTF-8, so the Unicode in memory must be encoded into UTF-8.
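A minimal Python 3 sketch of saving and reading with an explicit encoding. The file name t.txt and the sample text are assumptions for illustration; the file lives in a temporary directory that is removed afterwards.

```python
import os
import tempfile

text = "编码 encoding"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "t.txt")  # hypothetical file name

    # Saving: Unicode text in memory is encoded to UTF-8 bytes on disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # The raw bytes exactly as stored on disk.
    with open(path, "rb") as f:
        raw = f.read()

    # Reading: UTF-8 bytes on disk are decoded back into Unicode text.
    with open(path, "r", encoding="utf-8") as f:
        loaded = f.read()

print(raw == text.encode("utf-8"), loaded == text)
```

Passing encoding= explicitly avoids depending on the platform default, which may differ between systems.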

5. The difference between Python2 and Python3

1. The default source encoding of Python 2 is ASCII, so running a file saved as UTF-8 that contains non-ASCII characters raises an error. Add # coding: utf-8 at the top of the file.

A str in Python 2 is a byte string, meaning it is the result of encoding. Prefixing a string literal with u makes it Unicode, which can then be encoded into bytes.

Python 2 has two string types, str and unicode. A str literal becomes unicode when a u is placed in front of it.

2. The default source encoding of Python 3 is UTF-8, so files saved as UTF-8 can be opened directly.

A str in Python 3 is Unicode.

Python 3 also has two string types, bytes and str: bytes holds encoded bytes and str holds Unicode text.
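A short Python 3 sketch of the two types, with illustrative values: str and bytes are distinct, do not mix implicitly, and are converted with encode() and decode().

```python
text = "中"                    # str: Unicode text
data = text.encode("utf-8")    # bytes: encoded form of the same text

# Python 3 refuses to combine str and bytes implicitly.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Converting back requires an explicit decode.
roundtrip = data.decode("utf-8")

print(type(text).__name__, type(data).__name__, mixed_ok, roundtrip == text)
```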

6. Print to the terminal

First, note that on Chinese-language Windows the terminal's default encoding is GBK.

The terminal is also an application running in memory, so printing with print() moves data from memory to memory. Unicode strings therefore print correctly as long as the terminal's encoding can represent the characters. In Python 2, however, every string without the u prefix is a byte string. If the terminal uses GBK while the Python 2 byte string is in the declared UTF-8 (or default ASCII) encoding, the output in the terminal is garbled.
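In Python 3, the encoding used for terminal output is exposed as sys.stdout.encoding. The sketch below also shows one defensive option, replacing characters the terminal encoding cannot represent; the helper name encode_for_terminal is hypothetical.

```python
import sys

def encode_for_terminal(text, terminal_encoding):
    """Encode text for a terminal, replacing characters it cannot represent."""
    return text.encode(terminal_encoding, errors="replace")

# The encoding Python uses when print() writes to this terminal.
print(sys.stdout.encoding)

# A GBK terminal can represent Chinese text; an ASCII-only one cannot.
gbk_output = encode_for_terminal("中文", "gbk")
ascii_output = encode_for_terminal("中文", "ascii")

print(gbk_output, ascii_output)
```

With errors="replace", unsupported characters become question marks instead of raising UnicodeEncodeError.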

This is my current understanding. If I later find errors or unclear explanations, I will revise them. Character encoding really is full of pitfalls.
