python3.x - How to use maketrans in Python with UTF-8 files

I wrote a script that processes text by replacing every symbol with a space, using maketrans and translate in Python. It works with ASCII-encoded files, but with UTF-8 files it raises an error saying that the two maketrans arguments are not of equal length, even though they look the same length to me:

File "/Users/lgq/Desktop/p3.py", line 10, in text_to_words

"abcdefghijklmnopqrstuvwxyz                                                   ") 

ValueError: the first two maketrans arguments must have equal length

I searched and found claims that maketrans cannot be used with UTF-8. So how should I replace characters in UTF-8 text? Any advice would be appreciated.

def text_to_words(the_text):
    """ 
        Return a list of words with all punctuation removed,
        and all in lowercase.
    """
    my_substitutions = the_text.maketrans(
        # If you find any of these
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\",
        # Replace them by these
        "abcdefghijklmnopqrstuvwxyz                                            ")
    # Translate the text now.
    cleaned_text = the_text.translate(my_substitutions)
    wds = cleaned_text.split()
    return wds


def get_words_in_book(filename):
    """ Read a book from filename, and return a list of its words."""
    with open(filename, "r", encoding="utf-8") as f:
        content = f.read()
    wds = text_to_words(content)
    return wds


book_words = get_words_in_book("alice.txt")
print("There are {0} words in the book, the first 100 are\n{1}".
        format(len(book_words), book_words[:100]))

过去多啦不再A梦, 2789 days ago

All replies (1)

  • 滿天的星座

    2017-05-18 11:00:56

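    First of all, the two strings are not the same length: \" is a single character, and \\ is also a single character, so the first string is shorter than it looks in the source.
    You can check this with len().
    Also, for questions about strings, it is best to state the Python version.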
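The len() check can be sketched as follows for Python 3. The names find_chars and replace_chars are made up for this illustration, and the search string is assumed to end in an escaped backslash (\\), as described above. Building the replacement from the measured length avoids counting spaces by hand:

```python
# The "find" string from the question, with the trailing backslash written as \\.
find_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\"

# \" and \\ each take two characters in the source but are one character in the string.
print(len(find_chars))  # 68

# 26 lowercase letters, then one space for every remaining character.
replace_chars = "abcdefghijklmnopqrstuvwxyz" + " " * (len(find_chars) - 26)
print(len(replace_chars))  # 68

# With equal lengths, maketrans no longer raises ValueError.
table = str.maketrans(find_chars, replace_chars)
print("Hello, World!".translate(table).split())  # ['hello', 'world']
```

If the two len() results differ, maketrans raises the ValueError shown in the question.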

    The maketrans arguments are not of equal length:

     my_substitutions = the_text.maketrans(
            # If you find any of these
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\",
            # Replace them by these
            "abcdefghijklmnopqrstuvwxyz                                            ")

    Test code:

    from string import maketrans  # Python 2 only; Python 3 uses str.maketrans instead
    
    def text_to_words(the_text):
        """ 
            Return a list of words with all punctuation removed,
            and all in lowercase.
        """
        my_substitutions = maketrans(
            # If you find any of these
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\",
            # Replace them by these
            "abcdefghijklmnopqrstuvwxyz                                          ")
        # Translate the text now.
        cleaned_text = the_text.translate(my_substitutions)
        wds = cleaned_text.split()
        return wds
    
    text_to_words('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~\'\\测试')

    Output:

    ['abcdefghijklmnopqrstuvwxyz', '\xe6\xb5\x8b\xe8\xaf\x95']

    This is the result of running under Python 2.
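For Python 3, which the question uses, a rough equivalent might look like the sketch below. It is not the original textbook code: it swaps the hand-typed character lists for string.punctuation and string.digits, and it lowercases with str.lower(), which also affects non-ASCII cased letters. In Python 3, str.maketrans works on decoded text (Unicode code points), so the file being UTF-8 does not matter once it has been read with the right encoding.

```python
import string


def text_to_words(the_text):
    """Return a list of lowercase words with ASCII punctuation and digits removed."""
    # Characters to blank out: ASCII punctuation and digits.
    to_blank = string.punctuation + string.digits
    # Both arguments are built to the same length, so maketrans cannot
    # fail with "must have equal length".
    table = str.maketrans(to_blank, " " * len(to_blank))
    return the_text.lower().translate(table).split()


# Non-ASCII characters are left untouched by the table.
print(text_to_words("Alice's Adventures, 测试!"))
# ['alice', 's', 'adventures', '测试']

# The escaped bytes in the Python 2 output above are the UTF-8 encoding of 测试.
print("测试".encode("utf-8"))
# b'\xe6\xb5\x8b\xe8\xaf\x95'
```

The Python 2 run prints raw bytes because its str type is a byte string. In Python 3 the same text stays as readable characters.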
