I wrote a script to process text: it replaces every symbol in the text with a space, using maketrans and translate in Python. It works with ASCII-encoded files, but with UTF-8 files it raises an error saying that the two maketrans arguments are not of equal length, even though they look the same length to me:
File "/Users/lgq/Desktop/p3.py", line 10, in text_to_words
"abcdefghijklmnopqrstuvwxyz ")
ValueError: the first two maketrans arguments must have equal length
I looked it up and read that maketrans cannot be used with UTF-8. So how should I replace characters in UTF-8 text? Any advice would be appreciated.
def text_to_words(the_text):
    """
    Return a list of words with all punctuation removed,
    and all in lowercase.
    """
    my_substitutions = the_text.maketrans(
        # If you find any of these
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\",
        # Replace them by these
        "abcdefghijklmnopqrstuvwxyz ")
    # Translate the text now.
    cleaned_text = the_text.translate(my_substitutions)
    wds = cleaned_text.split()
    return wds

def get_words_in_book(filename):
    """ Read a book from filename, and return a list of its words."""
    f = open(filename, "r", encoding = "utf-8")
    content = f.read()
    f.close()
    wds = text_to_words(content)
    return wds

book_words = get_words_in_book("alice.txt")
print("There are {0} words in the book, the first 100 are\n{1}".
      format(len(book_words), book_words[:100]))
滿天的星座2017-05-18 11:00:56
First of all, the two strings really are not the same length: \" is a single character, and \\ is also a single character. You can check this with len().
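To make the len() check concrete, here is a minimal sketch for Python 3. It assumes the two arguments are exactly as they appear in the question (in particular, that the second string ends in a single space); the names symbols, replacements and padded are only illustrative and are not from the original code.

```python
# The two maketrans arguments as shown in the question (assumed verbatim).
symbols = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\"
replacements = "abcdefghijklmnopqrstuvwxyz "

# Each escape sequence is a single character once the literal is parsed.
print(len("\""), len("\\"))

# Compare the lengths that maketrans actually sees.
print(len(symbols), len(replacements))

# One possible fix: pad the replacement string with spaces up to the same length.
padded = replacements.ljust(len(symbols))
table = str.maketrans(symbols, padded)
print("Hello, World!".translate(table).split())
```

If the two lengths printed on the second line differ, maketrans raises the ValueError from the question no matter how the input file is encoded.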
Also, when asking about strings, it is best to state which version of Python you are using.
The arguments passed to maketrans are not of equal length:
    my_substitutions = the_text.maketrans(
        # If you find any of these
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\",
        # Replace them by these
        "abcdefghijklmnopqrstuvwxyz ")
Test code:
from string import maketrans  # Python 2 only

def text_to_words(the_text):
    """
    Return a list of words with all punctuation removed,
    and all in lowercase.
    """
    my_substitutions = maketrans(
        # If you find any of these
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\",
        # Replace them by these (26 letters padded with 42 spaces, 68 characters in total)
        "abcdefghijklmnopqrstuvwxyz" + " " * 42)
    # Translate the text now.
    cleaned_text = the_text.translate(my_substitutions)
    wds = cleaned_text.split()
    return wds
text_to_words('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~\'\测试')
Output:
['abcdefghijklmnopqrstuvwxyz', '\xe6\xb5\x8b\xe8\xaf\x95']
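This is the result of running it under Python 2, where the string is a byte string, so the two Chinese characters appear as their UTF-8 bytes.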
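For Python 3, where str is Unicode and str.maketrans works regardless of how the file was encoded, a rough equivalent might look like the sketch below. It is not the code from the question: it swaps in string.punctuation and string.digits instead of a hand-typed list, builds the padding with len() so the two arguments cannot drift apart, and lowercases with lower(). Only ASCII punctuation is covered; full-width punctuation in Chinese text would need to be added to to_blank (an illustrative name).

```python
import string

def text_to_words(the_text):
    """Return the words of the_text, lowercased, with ASCII punctuation and digits removed."""
    # Characters that should become spaces (ASCII only; an assumption of this sketch).
    to_blank = string.punctuation + string.digits
    # Build the replacement from the same length, so the arguments always match.
    table = str.maketrans(to_blank, " " * len(to_blank))
    return the_text.lower().translate(table).split()

print(text_to_words("Hello, World! 测试 123"))
# ['hello', 'world', '测试']
```

Non-ASCII characters such as 测试 pass through unchanged because they are not in the translation table.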