
How to use python to batch modify the encoding format of text files

WBOY
2023-05-01 19:13:11

Use Python to batch-convert the encoding of text files

This article shows how to batch-convert text files between encodings such as ASCII, GB2312, and UTF-8. In terms of character coverage, UTF-8 > GB2312 > ASCII, so converting from GB2312 to UTF-8 is the safe direction. Converting the other way can fail or produce garbled characters when a file contains characters that GB2312 cannot represent.
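The direction matters because every character GB2312 can represent also exists in UTF-8, but not the other way round. A minimal sketch using only Python's built-in codecs; the helper name fits_in_gb2312 and the sample strings are illustrative assumptions:

```python
# Text that GB2312 can represent survives a round trip through UTF-8.
text = "编码 abc"
gb_bytes = text.encode("gb2312")
utf8_bytes = gb_bytes.decode("gb2312").encode("utf-8")
round_trip_ok = utf8_bytes.decode("utf-8") == text


def fits_in_gb2312(s):
    """Return True if every character of s can be encoded as GB2312."""
    try:
        s.encode("gb2312")
        return True
    except UnicodeEncodeError:
        return False


print(round_trip_ok)             # GB2312 -> UTF-8 loses nothing
print(fits_in_gb2312("编码"))    # simplified Chinese fits
print(fits_in_gb2312("한국어"))  # Hangul has no GB2312 code point
```

A check like this could be run before converting a UTF-8 file to GB2312, so that files with unrepresentable characters are skipped rather than damaged.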

The main differences between GB2312 and UTF-8:

Character coverage: UTF-8 > GB2312. UTF-8 can encode every Unicode character, whereas GB2312 covers ASCII, 6,763 simplified Chinese characters, and a limited set of symbols and letters from other scripts.

File size: for Chinese text, UTF-8 > GB2312. A Chinese character typically takes three bytes in UTF-8 but two bytes in GB2312, so the same Chinese text is larger in UTF-8 and takes correspondingly longer to transfer. ASCII characters take one byte in both encodings.

Scope of application: GB2312 is a localized character set used mainly in mainland China. UTF-8 is an international encoding that covers the characters needed by all languages, so it is far more portable. UTF-8 text displays correctly in any browser that supports UTF-8, regardless of country.
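The size difference described above can be checked directly by counting encoded bytes. This is a small sketch; the helper name encoded_size and the sample strings are assumptions chosen for illustration:

```python
def encoded_size(text, encoding):
    """Number of bytes that text occupies in the given encoding."""
    return len(text.encode(encoding))


for sample in ("中文", "abc"):
    print(sample, encoded_size(sample, "gb2312"), encoded_size(sample, "utf-8"))
# 中文 4 6
# abc 3 3
```

Two Chinese characters take 4 bytes in GB2312 and 6 bytes in UTF-8, while plain ASCII text is the same size in both.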

import sys
import chardet
import codecs
 
def get_encoding_type(fileName):
    '''Detect the encoding of a text file and return chardet's result dict.'''
    with open(fileName, 'rb') as f:
        data = f.read()
    # chardet guesses from the raw bytes; the result looks like
    # {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
    return chardet.detect(data)
 
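def convert_encoding_type(filename_in, filename_out, encode_in="gb2312", encode_out="utf-8"):
    '''Convert a text file from encode_in to encode_out.'''
    # Example arguments:
    #filename_in = 'flash.c'
    #filename_out = 'flash_gb2312.c'
    #encode_in = 'utf-8'   # encoding of the input file
    #encode_out = 'gb2312' # encoding of the output file
    # Read everything and close the input first, so that filename_out
    # may safely be the same path as filename_in.
    with codecs.open(filename=filename_in, mode='r', encoding=encode_in) as fi:
        data = fi.read()
    # newline='' keeps the original line endings instead of translating them.
    with open(filename_out, mode='w', encoding=encode_out, newline='') as fo:
        fo.write(data)  # the with block closes the file; no explicit close() needed
    # with open(filename_out, 'rb') as f:
    #     data = f.read()
    #     print(chardet.detect(data))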
 
if __name__=="__main__":
    # fileName = sys.argv[1]
    # get_encoding_type(fileName)
    # convert_encoding_type(fileName, fileName)
    filename_of_files = sys.argv[1]   # text file listing one full file path per line
    with open(filename_of_files, 'r') as f:
        lines = f.readlines()
    for line in lines:
        fileName = line.strip()  # drop the trailing newline (and '\r' in CRLF lists)
        if not fileName:
            continue  # skip blank lines
        encoding_type = get_encoding_type(fileName)
        if encoding_type['encoding']=='GB2312':
            print(encoding_type)
            convert_encoding_type(fileName, fileName)  # overwrite in place
            print(fileName)
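The script above overwrites each file in place, so an interrupted write or a wrong input encoding can damage the original. One possible safeguard, sketched here with the standard library only, is to write the converted text to a temporary file and swap it in once it is complete. The function name convert_in_place and the gb18030 default are assumptions, not part of the original script. GB18030 is a superset of GB2312, so it also decodes files labelled GB2312 that in fact contain GBK characters.

```python
import os
import tempfile


def convert_in_place(path, encode_in="gb18030", encode_out="utf-8"):
    """Re-encode path, replacing it only after the new content is fully written."""
    # newline="" keeps the file's original line endings untouched.
    with open(path, "r", encoding=encode_in, newline="") as fi:
        data = fi.read()  # raises UnicodeDecodeError if encode_in is wrong

    # Create the temporary file next to the target so os.replace stays
    # on the same filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding=encode_out, newline="") as fo:
            fo.write(data)
        os.replace(tmp_path, path)  # swap in the converted file
    except Exception:
        os.remove(tmp_path)  # clean up and leave the original untouched
        raise
```

Because the original is read and decoded before anything is written, a decoding error leaves the file exactly as it was. Note that the replacement file gets the temporary file's permissions, which may differ from the original's.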

Supplement: batch-converting files to UTF-8 with Python


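import os
from chardet import detect

xml_path = './'  # directory to scan; looping over its .xml files is an assumption based on the variable name
for name in os.listdir(xml_path):
    path = os.path.join(xml_path, name)
    if os.path.isfile(path) and name.endswith('.xml'):
        with open(path, 'rb+') as f:
            content = f.read()
            codeType = detect(content)['encoding'] or 'utf-8'  # fall back when detection fails
            content = content.decode(codeType, "ignore").encode("utf8")
            f.seek(0)
            f.write(content)
            f.truncate()  # discard leftover bytes if the result is shorter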
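Encoding detection is a statistical guess, and decoding with "ignore" silently drops bytes when the guess is wrong. A cautious variant would skip files whose detection confidence is low. The sketch below works on a dict shaped like the chardet result shown earlier; the function name pick_encoding and the 0.8 threshold are hypothetical choices, not values from the original article.

```python
def pick_encoding(detection, min_confidence=0.8):
    """Return the detected encoding name, or None if the guess is too weak.

    detection is a dict shaped like chardet.detect() output, e.g.
    {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}.
    """
    encoding = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return None
    return encoding


print(pick_encoding({"encoding": "GB2312", "confidence": 0.99}))
print(pick_encoding({"encoding": "GB2312", "confidence": 0.40}))
print(pick_encoding({"encoding": None, "confidence": 0.0}))
```

A caller would convert the file only when pick_encoding returns a name, and log or skip it otherwise.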


Statement:
This article is reproduced from yisu.com. If there is any infringement, please contact admin@php.cn to have it removed.