python自然语言编码转换模块codecs介绍-Python Tutorial-php.cn

Home

Backend Development

Python Tutorial

python自然语言编码转换模块codecs介绍

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

Jun 10, 2016 pm 03:15 PM

pythonmodule

python对多国语言的处理是支持的很好的，它可以处理现在任意编码的字符，这里深入的研究一下python对多种不同语言的处理。

有一点需要清楚的是，当python要做编码转换的时候，会借助于内部的编码，转换过程是这样的：

复制代码代码如下:

原有编码 -> 内部编码 -> 目的编码

python的内部是使用unicode来处理的，但是unicode的使用需要考虑的是它的编码格式有两种，一是UCS-2，它一共有65536个码位，另一种是UCS-4，它有2147483648g个码位。对于这两种格式，python都是支持的，这个是在编译时通过--enable-unicode=ucs2或--enable-unicode=ucs4来指定的。那么我们自己默认安装的python有的什么编码怎么来确定呢？有一个办法，就是通过sys.maxunicode的值来判断：

复制代码代码如下:

import sys
print sys.maxunicode

如果输出的值为65535,那么就是UCS-2,如果输出是1114111就是UCS-4编码。
我们要认识到一点：当一个字符串转换为内部编码后，它就不是str类型了！它是unicode类型：

复制代码代码如下:

a = "风卷残云"
print type(a)
b = a.unicode(a, "gb2312")
print type(b)

输出：

复制代码代码如下:

这个时候b可以方便的任意转换为其他编码，比如转换为utf-8:

复制代码代码如下:

c = b.encode("utf-8")
print c

c输出的东西看起来是乱码，那就对了，因为是utf-8的字符串。

好了，该说说codecs模块了，它和我上面说的概念是密切相关的。codecs专门用作编码转换，当然，其实通过它的接口是可以扩展到其他关于代码方面的转换的，这个东西这里不涉及。

复制代码代码如下:

#-*- encoding: gb2312 -*-
import codecs, sys

print '-'*60
# 创建gb2312编码器
look = codecs.lookup("gb2312")
# 创建utf-8编码器
look2 = codecs.lookup("utf-8")

a = "我爱北京天安门"

print len(a), a
# 把a编码为内部的unicode, 但为什么方法名为decode呢，我的理解是把gb2312的字符串解码为unicode
b = look.decode(a)
# 返回的b[0]是数据，b[1]是长度，这个时候的类型是unicode了
print b[1], b[0], type(b[0])
# 把内部编码的unicode转换为gb2312编码的字符串，encode方法会返回一个字符串类型
b2 = look.encode(b[0])
# 发现不一样的地方了吧？转换回来之后，字符串长度由14变为了7! 现在的返回的长度才是真正的字数，原来的是字节数
print b2[1], b2[0], type(b2[0])
# 虽然上面返回了字数，但并不意味着用len求b2[0]的长度就是7了，仍然还是14，仅仅是codecs.encode会统计字数
print len(b2[0])

上面的代码就是codecs的使用，是最常见的用法。另外还有一个问题就是，如果我们处理的文件里的字符编码是其他类型的呢？这个读取进行做处理也需要特殊的处理的。codecs也提供了方法.

复制代码代码如下:

#-*- encoding: gb2312 -*-
import codecs, sys

# 用codecs提供的open方法来指定打开的文件的语言编码，它会在读取的时候自动转换为内部unicode
bfile = codecs.open("dddd.txt", 'r', "big5")
#bfile = open("dddd.txt", 'r')

ss = bfile.read()
bfile.close()
# 输出，这个时候看到的就是转换后的结果。如果使用语言内建的open函数来打开文件，这里看到的必定是乱码
print ss, type(ss)

上面这个处理big5的，可以去找段big5编码的文件试试。

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Is Tuple Comprehension possible in Python? If yes, how and if not why?Apr 28, 2025 pm 04:34 PM

Article discusses impossibility of tuple comprehension in Python due to syntax ambiguity. Alternatives like using tuple() with generator expressions are suggested for creating tuples efficiently.(159 characters)

What are Modules and Packages in Python?Apr 28, 2025 pm 04:33 PM

The article explains modules and packages in Python, their differences, and usage. Modules are single files, while packages are directories with an __init__.py file, organizing related modules hierarchically.

What is docstring in Python?Apr 28, 2025 pm 04:30 PM

Article discusses docstrings in Python, their usage, and benefits. Main issue: importance of docstrings for code documentation and accessibility.

What is a lambda function?Apr 28, 2025 pm 04:28 PM

Article discusses lambda functions, their differences from regular functions, and their utility in programming scenarios. Not all languages support them.

What is a break, continue and pass in Python?Apr 28, 2025 pm 04:26 PM

Article discusses break, continue, and pass in Python, explaining their roles in controlling loop execution and program flow.

What is a pass in Python?Apr 28, 2025 pm 04:25 PM

The article discusses the 'pass' statement in Python, a null operation used as a placeholder in code structures like functions and classes, allowing for future implementation without syntax errors.

Can we Pass a function as an argument in Python?Apr 28, 2025 pm 04:23 PM

Article discusses passing functions as arguments in Python, highlighting benefits like modularity and use cases such as sorting and decorators.

What is the difference between / and // in Python?Apr 28, 2025 pm 04:21 PM

Article discusses / and // operators in Python: / for true division, // for floor division. Main issue is understanding their differences and use cases.Character count: 158

See all articles

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Assassin's Creed Shadows: Seashell Riddle Solution

1 months agoByDDD

What's New in Windows 11 KB5054979 & How to Fix Update Issues

3 weeks agoByDDD

Where to find the Crane Control Keycard in Atomfall

1 months agoByDDD

How to fix KB5055523 fails to install in Windows 11?

2 weeks agoByDDD

InZoi: How To Apply To School And University

3 weeks agoByDDD

Hot Tools

EditPlus Chinese cracked version

Small size, syntax highlighting, does not support code prompt function

SublimeText3 Chinese version

Chinese version, very easy to use

MinGW - Minimalist GNU for Windows

This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.

mPDF

mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),

DVWA

Damn Vulnerable Web App (DVWA) is a PHP/MySQL web application that is very vulnerable. Its main goals are to be an aid for security professionals to test their skills and tools in a legal environment, to help web developers better understand the process of securing web applications, and to help teachers/students teach/learn in a classroom environment Web application security. The goal of DVWA is to practice some of the most common web vulnerabilities through a simple and straightforward interface, with varying degrees of difficulty. Please note that this software

Hot Topics

Where is the login entrance for gmail email?

7789

1644

1401

1298

1234