How to use Python to batch modify the encoding format of text files
This article shows how to convert the encoding of text files in batches, for example between ASCII, GB2312, and UTF-8. In terms of character set size, UTF-8 > GB2312 > ASCII, so it is best to convert from GB2312 to UTF-8. Converting in the other direction fails or produces garbled characters whenever the text contains a character that the smaller character set cannot represent.
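As a rough illustration of why the direction of conversion matters, the sketch below checks whether a string can be encoded in a given character set. The helper name can_encode and the sample strings are assumptions made for this example; only Python's built-in codecs are used.

```python
def can_encode(text, encoding):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Assumed sample strings: plain ASCII, simplified Chinese, and Chinese plus Korean.
for sample in ["hello", "中文", "中文 한"]:
    for encoding in ["ascii", "gb2312", "utf-8"]:
        print(repr(sample), encoding, can_encode(sample, encoding))
```

Every sample can be encoded as UTF-8, while the smaller character sets reject any text that contains characters outside their range.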
The main differences between GB2312 and UTF-8:
About character set size: UTF-8 > GB2312 (UTF-8 can encode every Unicode character, while GB2312 covers mainly simplified Chinese characters plus a limited set of symbols and letters).
About storage size: UTF-8 > GB2312 for Chinese text (a common Chinese character takes 3 bytes in UTF-8 but 2 bytes in GB2312, so GB2312 files are smaller and load slightly faster; ASCII characters take 1 byte in both).
About scope of application: GB2312 is mainly used in mainland China and is a localized character set. UTF-8 covers the characters needed by all countries in the world, so it is an international encoding with strong versatility. UTF-8 encoded text can be displayed by browsers in any country, as long as they support UTF-8.
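The storage difference can be checked directly. The following is a small sketch (the helper name encoded_sizes and the sample strings are assumptions for illustration) that compares the number of bytes each encoding needs for the same text.

```python
def encoded_sizes(text):
    """Return the byte length of `text` in GB2312 and in UTF-8."""
    return {
        "gb2312": len(text.encode("gb2312")),
        "utf-8": len(text.encode("utf-8")),
    }


print(encoded_sizes("hello"))     # {'gb2312': 5, 'utf-8': 5}
print(encoded_sizes("中文编码"))  # {'gb2312': 8, 'utf-8': 12}
```

ASCII-only text has the same size in both encodings; the difference only appears for the Chinese characters themselves.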
    import sys
    import codecs

    import chardet


    def get_encoding_type(fileName):
        '''Return chardet's guess for the encoding of a text file.'''
        with open(fileName, 'rb') as f:
            data = f.read()
        encoding_type = chardet.detect(data)
        # print(encoding_type)
        # e.g. {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
        return encoding_type


    def convert_encoding_type(filename_in, filename_out, encode_in="gb2312", encode_out="utf-8"):
        '''Convert the encoding of a text file.'''
        # filename_in = 'flash.c'
        # filename_out = 'flash_gb2312.c'
        # encode_in = 'utf-8'    # encoding of the input file
        # encode_out = 'gb2312'  # encoding of the output file
        with codecs.open(filename=filename_in, mode='r', encoding=encode_in) as fi:
            data = fi.read()
        # The input is fully read and closed before the output is opened,
        # so filename_in and filename_out may be the same path.
        # newline='' writes line endings exactly as they were read.
        with open(filename_out, mode='w', encoding=encode_out, newline='') as fo:
            fo.write(data)
        # The with-statement closes fo; no explicit close() is needed.

        # with open(filename_out, 'rb') as f:
        #     data = f.read()
        #     print(chardet.detect(data))


    if __name__ == "__main__":
        # fileName = sys.argv[1]
        # get_encoding_type(fileName)
        # convert_encoding_type(fileName, fileName)
        filename_of_files = sys.argv[1]  # text file with one full file path per line
        with open(filename_of_files, 'r') as f:
            lines = f.readlines()
        for line in lines:
            fileName = line.strip()  # drop the trailing newline (and '\r' on Windows)
            if not fileName:
                continue  # skip blank lines
            encoding_type = get_encoding_type(fileName)
            if encoding_type['encoding'] == 'GB2312':
                print(encoding_type)
                convert_encoding_type(fileName, fileName)
                print(fileName)
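Because the script above overwrites each file in place, an interrupted run or a wrong encoding guess can damage the original. One possible safeguard, shown here only as a sketch, is to decode strictly, write the result to a temporary file, and swap it in once it is complete. The function name convert_file is an assumption for this example, and it relies only on the standard library (no chardet).

```python
import os
import tempfile


def convert_file(path, encode_in="gb2312", encode_out="utf-8"):
    """Re-encode `path` in place. Return False (file untouched) on failure."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        # Strict decoding/encoding: refuse to convert rather than lose data.
        data = raw.decode(encode_in).encode(encode_out)
    except UnicodeError:
        return False
    # Write to a temporary file in the same directory, then replace the
    # original, so a crash never leaves a half-written file behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
    os.replace(tmp_path, path)
    return True
```

A call such as convert_file("notes.txt") (a hypothetical file name) returns True when the file was rewritten as UTF-8 and False when its bytes are not valid GB2312, in which case the file is left as it was.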
Batch conversion of files to UTF-8 format in Python
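    import glob
    import os

    from chardet import detect

    xml_path = './'  # directory that holds the files to convert

    # Assumption: the files to convert are the .xml files in xml_path.
    for path in glob.glob(os.path.join(xml_path, '*.xml')):
        with open(path, 'rb+') as f:
            content = f.read()
            codeType = detect(content)['encoding']
            if codeType is None:
                continue  # chardet could not guess the encoding; skip the file
            # Decode with the detected encoding, dropping undecodable bytes
            content = content.decode(codeType, "ignore").encode("utf8")
            f.seek(0)
            f.write(content)
            f.truncate()  # discard leftover bytes if the new content is shorter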
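The snippet above decodes with the "ignore" error handler, which silently discards any bytes that are invalid in the detected encoding. The sketch below, using made-up sample bytes, compares the three standard error handlers so the trade-off is visible before choosing one.

```python
# Assumed sample data: valid GB2312 bytes with one stray invalid byte in the middle.
raw = "中文".encode("gb2312") + b"\xff" + b"ok"

ignored = raw.decode("gb2312", "ignore")    # silently drops the bad byte
replaced = raw.decode("gb2312", "replace")  # marks the bad byte with U+FFFD

try:
    raw.decode("gb2312")  # default "strict" handler
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored), repr(replaced), strict_failed)
```

With "ignore" the damage is invisible in the output, so "replace" or the default strict mode may be preferable when it matters whether a file was converted cleanly.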