
A brief discussion of character encoding and strings in Python

青灯夜游
2018-10-29 17:54:15

This article gives a brief overview of character encoding and strings in Python. It may be useful as a reference.

What is character encoding?

For example, the Chinese character "中" can be represented in the following ways:

Decimal: 20013

Binary: 01001110 00101101 (Unicode) / 11100100 10111000 10101101 (UTF-8)

Hexadecimal: 4e2d (written '\u4e2d' in Python)
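As a quick check, the representations listed above can be reproduced in Python 3 with built-in functions. This is only a minimal sketch; the variable names are chosen here for illustration.

```python
# Inspect the different representations of the character '中'.
ch = '中'

code_point = ord(ch)                 # Unicode code point as an integer
as_hex = format(code_point, '04x')   # the same number in hexadecimal
as_bin = format(code_point, '016b')  # the same number as 16 binary digits

# UTF-8 stores the same character as three bytes.
utf8_bytes = ch.encode('utf-8')
utf8_bin = ' '.join(format(b, '08b') for b in utf8_bytes)

print(code_point)  # 20013
print(as_hex)      # 4e2d
print(as_bin)      # 0100111000101101
print(utf8_bin)    # 11100100 10111000 10101101
```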

ASCII encoding

  • Each ASCII character takes 1 byte

  • Can only encode plain English text (letters, digits, and common symbols)

  • Saves space

Unicode encoding

  • Unicode encoding usually takes 2 bytes per character. (For example, the letter A encoded with ASCII is decimal 65, binary 01000001; A's Unicode encoding is 00000000 01000001.)

  • Unicode unifies all languages into a single encoding, which resolves encoding conflicts and makes the garbled-text problem disappear

  • Takes twice the storage space of ASCII for English text, which is not cost-effective for storage and transmission (UTF-8 is the solution)

UTF-8 encoding (variable-length Unicode encoding)

UTF-8 encodes each Unicode character into 1-4 bytes depending on the size of its code point (the original UTF-8 design allowed up to 6 bytes, but the current standard limits it to 4). Common English letters are encoded into 1 byte, Chinese characters usually take 3 bytes, and only very rare characters take 4 bytes.

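Character    ASCII                    Unicode              UTF-8
A            01000001                 00000000 01000001    01000001
中           x (cannot be encoded)    01001110 00101101    11100100 10111000 10101101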

1) If the text you want to transmit contains a large number of English characters, encoding with UTF-8 can save space;

2) ASCII encoding can be regarded as a subset of UTF-8 encoding. Therefore, a large amount of legacy software that only supports ASCII can continue to work with UTF-8.
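Both points can be checked with a short experiment. The sketch below is illustrative only: the sample characters (including an accented letter and an emoji, written as escapes) are assumptions added here and are not taken from the article.

```python
# Count how many bytes UTF-8 needs for characters from different ranges.
# '\u00e9' is an accented Latin letter and '\U0001F600' is an emoji.
samples = ['A', '\u00e9', '中', '\U0001F600']
byte_lengths = {ch: len(ch.encode('utf-8')) for ch in samples}
print(list(byte_lengths.values()))  # [1, 2, 3, 4]

# Pure ASCII text yields identical bytes under both encodings,
# which is why ASCII-only software can still read such UTF-8 data.
text = 'Hello, world'
same_bytes = text.encode('ascii') == text.encode('utf-8')
print(same_bytes)  # True
```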

How character encoding commonly works in a computer system:

Memory: Unicode is used uniformly

Hard disk, transmission: converted to UTF-8

When you browse the web, the server converts the dynamically generated Unicode content into UTF-8 and then transmits it to the browser.
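The same pattern (Unicode in memory, UTF-8 on disk) can be sketched with a file round trip. The temporary file and the variable names below are assumptions made for this illustration.

```python
import os
import tempfile

text = 'A中'  # in memory: a str made of Unicode characters

# Write the text to a temporary file, encoding it as UTF-8 on the way out.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt', delete=False) as f:
    f.write(text)
    path = f.name

# On disk the content is plain bytes: 1 byte for 'A' plus 3 bytes for '中'.
with open(path, 'rb') as f:
    raw = f.read()

# Reading it back with the same encoding restores the original str.
with open(path, encoding='utf-8') as f:
    restored = f.read()

os.remove(path)

print(len(raw))          # 4
print(restored == text)  # True
```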

Python string

Related functions

  • The ord() function returns the integer representation (code point) of a single character. Its argument is the character to look up, and it returns an integer.

  • The chr() function converts an integer code point to the corresponding single character.

  • The encode() method converts a str to bytes, using the encoding given as its argument.

'str'.encode('ascii') or 'str'.encode('utf-8') returns a bytes object.

Encoding Chinese text with ascii raises an error.

  • The decode() method converts bytes read from the network or from disk to str, using the encoding given as its argument.

b'bytes'.decode('ascii') or b'bytes'.decode('utf-8') returns a str.

If the bytes cannot be decoded, an error is raised. If only a small number of the bytes are invalid, you can pass errors='ignore' to skip them:

>>> b'\xe4\xb8\xad\xff'.decode('utf-8', errors='ignore')
'中'

  • The len() function counts the characters in a str; applied to bytes, it counts the bytes.

>>> len(b'ABC')
3
>>> len(b'\xe4\xb8\xad\xe6\x96\x87')
6
>>> len('中文'.encode('utf-8'))
6
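The functions listed above can be tried together in one short script. This is a hedged sketch: the flag variables are illustrative names, and errors='replace' is a standard option added here for comparison that the article does not mention.

```python
# ord() and chr() are inverses for a single character.
code = ord('A')
letter = chr(20013)

# encode() turns str into bytes; decode() turns bytes back into str.
data = '中文'.encode('utf-8')
text = data.decode('utf-8')

# ASCII cannot represent Chinese characters, so encoding fails.
try:
    '中文'.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# A stray invalid byte makes strict decoding fail...
broken = b'\xe4\xb8\xad\xff'
try:
    broken.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# ...while errors='ignore' drops it and errors='replace' marks it with U+FFFD.
ignored = broken.decode('utf-8', errors='ignore')
replaced = broken.decode('utf-8', errors='replace')

# len() counts characters for str and bytes for bytes.
print(code, len(text), len(data))   # 65 2 6
print(ascii_failed, strict_failed)  # True True
print(len(ignored), len(replaced))  # 1 2
```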

In Python 3, strings are Unicode, which means Python strings support multiple languages. Python's string type is str. If you want to transmit a string over the network or save it to disk, you need to convert the str to bytes. To avoid garbled characters, always use UTF-8 when converting between str and bytes.

The difference between str and bytes

  • 1) In a str, one character corresponds to one or more bytes, but each element of a bytes object occupies exactly one byte. (Encoding breaks a multi-byte character into several single bytes.)

>>> 'ABC'.encode('ascii')
b'ABC'
>>> '中文'.encode('utf-8')
b'\xe4\xb8\xad\xe6\x96\x87'

In bytes, any byte that cannot be displayed as an ASCII character is shown as \x##.

  • 2) A bytes literal is written with a b prefix in front of the quotes, for example b'ABC'.
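The difference shows up clearly when the two types are indexed. The following is a small sketch with illustrative variable names, not code from the article.

```python
s = '中文'
b = s.encode('utf-8')

# Indexing a str yields a one-character str; indexing bytes yields an int (0-255).
first_char = s[0]
first_byte = b[0]

# Each str character maps to one or more bytes after encoding.
per_char = [len(c.encode('utf-8')) for c in s]

print(type(s).__name__, type(b).__name__)  # str bytes
print(first_byte)                          # 228
print(per_char)                            # [3, 3]
print(len(s), len(b))                      # 2 6
```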

Saving .py files that contain Chinese characters as UTF-8

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

The first comment line tells Linux/OS X systems that this is an executable Python program; Windows ignores this comment.

The second comment line tells the Python interpreter to read the source code as UTF-8; otherwise, the Chinese text you write in the source code may come out garbled.

The editor must also be set to save the file as UTF-8 without BOM.

String formatting

>>> 'Hello, %s' % 'world'
'Hello, world'
>>> 'Hi, %s, you have $%d.' % ('Michael', 1000000)
'Hi, Michael, you have $1000000.'

  • The % operator is used to format strings. Each %? placeholder inside the string is matched by a variable or value after the operator, and the order must be the same. If there is only one %?, the parentheses can be omitted.

  • To escape, use %% to represent a literal %:

>>> 'growth rate: %d %%' % 7
'growth rate: 7 %'

Placeholder    Replacement content
%d             Integer
%f             Floating point number
%s             String
%x             Hexadecimal integer
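The % operator and the placeholders listed above can be combined as in the sketch below. The sample values and the %.2f precision specifier are illustrative additions, not part of the article.

```python
# One placeholder: the value can follow % directly.
greeting = 'Hello, %s' % 'world'

# Several placeholders: pass a tuple, in the same order as the placeholders.
line = '%s is %d years old, %.2f m tall, id 0x%x' % ('Ann', 30, 1.756, 255)

# %% produces a literal percent sign.
rate = 'growth rate: %d%%' % 7

print(greeting)  # Hello, world
print(line)      # Ann is 30 years old, 1.76 m tall, id 0xff
print(rate)      # growth rate: 7%
```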

format()

Another way to format a string is to use the string's format() method, which replaces the placeholders {0}, {1}... in the string with the passed-in arguments in order. This way of writing is much more cumbersome than %:

>>> 'Hello, {0}, the score has improved by {1:.1f}%'.format('Xiao Ming', 17.125)
'Hello, Xiao Ming, the score has improved by 17.1%'
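A slightly extended sketch of format() follows. The named-field variant and the score value 85 are illustrative assumptions; the article itself only shows positional fields.

```python
# Positional fields {0}, {1} are filled from the arguments in order.
msg = 'Hello, {0}, the score has improved by {1:.1f}%'.format('Xiao Ming', 17.125)

# Fields can also be named, which keeps longer templates readable.
named = '{name} scored {score:d}'.format(name='Xiao Ming', score=85)

print(msg)    # Hello, Xiao Ming, the score has improved by 17.1%
print(named)  # Xiao Ming scored 85
```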


Statement:
This article is reproduced from cnblogs.com. If there is any infringement, please contact admin@php.cn to have it removed.