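I wrote a script that processes text by replacing every punctuation mark and symbol with a space, using Python's maketrans and translate. It works on ASCII-encoded files, but on a UTF-8 file it raises an error saying the maketrans arguments are not the same length, even though they look the same length to me: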
  File "/Users/lgq/Desktop/p3.py", line 10, in text_to_words
    "abcdefghijklmnopqrstuvwxyz ")
ValueError: the first two maketrans arguments must have equal length
I read that maketrans cannot be used with UTF-8. How should I replace characters in UTF-8 text, then? Any pointers would be appreciated.
def text_to_words(the_text):
    """
    Return a list of words with all punctuation removed,
    and all in lowercase.
    """
    my_substitutions = the_text.maketrans(
        # If you find any of these
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\",
        # Replace them by these
        "abcdefghijklmnopqrstuvwxyz ")

    # Translate the text now.
    cleaned_text = the_text.translate(my_substitutions)
    wds = cleaned_text.split()
    return wds

def get_words_in_book(filename):
    """ Read a book from filename, and return a list of its words."""
    f = open(filename, "r", encoding = "utf-8")
    content = f.read()
    f.close()
    wds = text_to_words(content)
    return wds

book_words = get_words_in_book("alice.txt")
print("There are {0} words in the book, the first 100 are\n{1}".
      format(len(book_words), book_words[:100]))
滿天的星座 2017-05-18 11:00:56
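First, the two strings do not have the same length. Inside a string literal, \" is one character and \\ is also one character. You can check this with len().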
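As a quick check of the point above, here is a minimal sketch that measures both arguments as they are shown in the post. The names find and replace are made up for illustration, and the forum may have collapsed runs of spaces in the second string, so the asker's real file could differ.

```python
# The two maketrans arguments as displayed in the post
# (variable names are hypothetical, chosen for this sketch).
find = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\"
replace = "abcdefghijklmnopqrstuvwxyz "

# Each escape sequence counts as a single character.
print(len('\"'), len('\\'))

# The two arguments must have equal length, otherwise maketrans raises ValueError.
print(len(find), len(replace))
```

With the strings as displayed, the first has 68 characters and the second 27, which is exactly the situation the ValueError describes.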
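Also, when asking about strings, it is best to state your Python version, since string handling differs between Python 2 and Python 3.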
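To illustrate why the version matters here, the following sketch prints the interpreter version and picks up maketrans from wherever it lives in that version. The two-letter table is only a toy example, not the asker's data.

```python
import sys

# Output worth including when asking a question about strings.
print(sys.version)

if sys.version_info[0] >= 3:
    # Python 3: maketrans is a static method of str and works on Unicode text.
    maketrans = str.maketrans
else:
    # Python 2: maketrans lives in the string module and works on byte strings.
    from string import maketrans

# Toy table: map "A" to "a" and "B" to "b".
table = maketrans("AB", "ab")
print("ABC".translate(table))
```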
The maketrans arguments are not the same length:
my_substitutions = the_text.maketrans(
    # If you find any of these
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\",
    # Replace them by these
    "abcdefghijklmnopqrstuvwxyz ")
Test code:
# -*- coding: utf-8 -*-
from string import maketrans

def text_to_words(the_text):
    """
    Return a list of words with all punctuation removed,
    and all in lowercase.
    """
    my_substitutions = maketrans(
        # If you find any of these
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\",
        # Replace them by these (26 lowercase letters, then 42 spaces,
        # so that both arguments are 68 characters long)
        "abcdefghijklmnopqrstuvwxyz" + " " * 42)

    # Translate the text now.
    cleaned_text = the_text.translate(my_substitutions)
    wds = cleaned_text.split()
    return wds

print(text_to_words('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~\'\\测试'))
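Output:
['abcdefghijklmnopqrstuvwxyz', '\xe6\xb5\x8b\xe8\xaf\x95']
This is the result under Python 2, where a str is a byte string, so the UTF-8 bytes of the Chinese text pass through unchanged and are shown as escapes.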
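For Python 3, which the question's code appears to target, a rough equivalent might look like the sketch below. It is not the asker's or the answerer's exact code: string.punctuation stands in for the hand-typed punctuation list, and the padding is computed so the two arguments cannot drift out of sync.

```python
import string

def text_to_words(the_text):
    """Return a list of lowercase words with ASCII digits and punctuation removed."""
    # Characters to look for: uppercase letters, digits, ASCII punctuation.
    find = string.ascii_uppercase + string.digits + string.punctuation
    # Lowercase letters, then enough spaces to match the length of find.
    replace = string.ascii_lowercase + " " * (len(find) - len(string.ascii_lowercase))
    table = str.maketrans(find, replace)
    return the_text.translate(table).split()

print(text_to_words("Alice's Adventures in Wonderland, 1865! 测试"))
```

In Python 3 the text is already decoded Unicode, so the file's UTF-8 encoding does not affect maketrans. Characters that are not in the table, such as full-width punctuation, are left unchanged.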