Use Python to process text data
Experimental purpose
Be familiar with Python's basic data structures and with file input and output.
Experimental Data
We use the evaluation data and evaluation task of the xx machine learning conference in xxxx. The data includes a training set and a test set. The evaluation task is to use the given training data to predict whether each relationship in the test set is a positive or negative example, writing 1 or 0 at the end of each sample.
The training set is described as follows: the first column is the relationship type, the second and third columns are the names of the two people, the fourth column is the title, and the fifth column marks whether the relationship is a positive or negative example (1 = positive, 0 = negative); the sixth column indicates that the sample belongs to the training set.
The test set format is basically the same as the training set. The only difference is that the fifth column carries no positive/negative example mark.
Relationship | Person 1 | Person 2 | Event
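Before writing any file code, it helps to see how one record splits into its columns. A minimal sketch with a made-up sample line (the relationship, names, and event below are invented, not taken from the real data set):

```python
# Hypothetical six-column training record, fields separated by tabs
record = "rumored discord\tAlice\tBob\tSome Event\t1\ttrain"
fields = record.split("\t")   # split into the six columns
relationship = fields[0]      # column 1: relationship type
label = fields[4]             # column 5: 1 = positive, 0 = negative
print(relationship, label)
```

The same `split("\t")` call is the core of every step below.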
Experimental content
Process the training-set data, keeping only the first five columns; name the output text file exp1_1.txt.
Split the data obtained in the first step by its 19 relationship types. Store the generated files in the exp1_train folder: following the order in which the relationship categories first appear, the first category's data goes into 1.txt, the second into 2.txt, and so on up to 19.txt.
Classify each test-set sample by relationship category, in the same order as the 19 training categories: samples of the same relationship type go into one text file, so 19 test files are generated, keeping the same format as the test set. Store them in the exp1_test folder, named 1_test.txt, 2_test.txt, and so on. At the same time, record each sample's line position in the original test set, in one-to-one correspondence with the 19 test files. For example, the line number of each sample of the first category ("rumored discord") in the original file is recorded in an index file; the index files are saved as index1.txt, index2.txt, and so on.
Problem-solving ideas
1. The first question tests our knowledge of file operations and lists. The main difficulty is reading the .new file; after processing it as required, we generate a txt file. Let's look at the concrete code:
import os

# List recording the relationship types in order of first appearance
relation_types = []
# 'xxx' is a placeholder: fill in the encoding of your own files
with open("task1.trainSentence.new", "r", encoding='xxx') as file_input:
    # exp1_1.txt is created if it does not exist
    with open("exp1_1.txt", "w", encoding='xxx') as file_output:
        for line in file_input:                   # iterate over the file line by line
            arr = line.split('\t')                # split on the tab separator
            if arr[0] not in relation_types:      # first time this relationship appears
                relation_types.append(arr[0])     # remember it
            # write only the first five columns to the output file
            file_output.write(arr[0] + "\t" + arr[1] + "\t" + arr[2] + "\t" + arr[3] + "\t" + arr[4] + "\n")
# the with blocks close both files automatically
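The column-trimming step can be sanity-checked on a couple of in-memory lines, without touching the real files (the sample values below are invented):

```python
# Two made-up six-column lines in the training format
lines = [
    "rel_a\tAlice\tBob\tEvent1\t1\ttrain",
    "rel_b\tCarol\tDave\tEvent2\t0\ttrain",
]
# keep only the first five columns of each line
trimmed = ["\t".join(line.split("\t")[:5]) for line in lines]
print(trimmed[0])
```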
2. The second question again examines file operations. Based on the file generated in question 1, samples are grouped by relationship type, with a loop and a condition doing the work. Let's look at the concrete code:
import os

# 'xxx' is a placeholder: fill in the encoding of your own files
file_1 = open("exp1_1.txt", encoding='xxx')
os.mkdir("exp1_train")      # create the output directory
os.chdir("exp1_train")      # switch the working directory to it

b = 1                       # sequence number of the current group file
file_2 = open("{}.txt".format(b), "w", encoding="xxx")
prev = None                 # relationship type of the previous line
for line in file_1:                 # read exp1_1.txt line by line
    arr = line.split("\t")          # split on the tab separator
    if prev is not None and arr[0] != prev:   # relationship type changed
        file_2.close()              # close the finished group file
        b += 1                      # next file number
        file_2 = open("{}.txt".format(b), "w", encoding="xxx")
    prev = arr[0]
    file_2.write(line)              # write the line into its group file
file_1.close()   # close exp1_1.txt, created in question 1
file_2.close()   # close the last group file
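The grouping logic above relies on lines of the same relationship being consecutive; the same idea can be isolated on in-memory data (the sample values are invented):

```python
# Consecutive lines with the same first column land in the same numbered bucket
lines = [
    "rel_a\tAlice\tBob\tEvent1\t1",
    "rel_a\tCarol\tDave\tEvent2\t0",
    "rel_b\tErin\tFrank\tEvent3\t1",
]
buckets = {}      # bucket number -> list of lines
b = 0             # current bucket number
current = None    # relationship of the current bucket
for line in lines:
    rel = line.split("\t")[0]
    if rel != current:        # relationship changed: start a new bucket
        b += 1
        current = rel
    buckets.setdefault(b, []).append(line)
print(len(buckets), len(buckets[1]))
```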
3. Finally, classify the test set into the 19 categories established by the training set. We can traverse the training data with a dictionary, recording each relationship the first time it appears; contents with the same relationship then go into the same file, and a new file is created whenever a new relationship is seen.
import os

# Build the mapping relationship -> category number from the training data.
# 'xxx' is a placeholder: fill in the encoding of your own files.
with open("exp1_1.txt", encoding='xxx') as file_in1:
    i = 1                      # next category number
    arr2 = {}                  # dictionary: relationship -> category number
    for line in file_in1:      # traverse line by line
        arr3 = line[0:2]       # the relationship is the first two characters
        if arr3 not in arr2.keys():
            arr2[arr3] = i
            i += 1             # next category number

file_in = open("task1.test.new", encoding='xxx')   # open the test file
os.mkdir("exp1_test")          # create the output directory
os.chdir("exp1_test")          # switch the working directory to it
for line in file_in:
    arr = line[0:2]            # relationship of this test sample
    # append the sample to the file of its category
    with open("{}_test.txt".format(arr2[arr]), "a", encoding='xxx') as file_out:
        file_out.write(line)

i = 1                          # line number in the original test set
file_in.seek(0)                # read the test file again from the start
os.mkdir("exp1_index")
os.chdir("exp1_index")
for line in file_in:
    arr = line[0:2]
    with open("index{}.txt".format(arr2[arr]), "a", encoding='xxx') as file_out:
        line = line[0:-1]      # strip the trailing newline
        file_out.write(line + '\t' + "{}".format(i) + "\n")
    i += 1
file_in.close()
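The dictionary trick above, numbering each relationship the first time it is seen, can be checked in isolation (the relationship names below are invented placeholders):

```python
# Assign 1, 2, 3, ... to relationships in order of first appearance
relations_seen = {}
i = 1
for rel in ["rel_a", "rel_a", "rel_b", "rel_c", "rel_b"]:
    if rel not in relations_seen:   # a new relationship gets the next number
        relations_seen[rel] = i
        i += 1
print(relations_seen)
```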
Use Python to process numerical data
Experimental purpose
Be familiar with Python's basic data structures and with file input and output.
Experimental Data
The data comes from the XX Tianchi Competition in XXXX, also used in the XXth Big Data Challenge of Chinese Universities. It includes two tables: the user behavior table mars_tianchi_user_actions.csv and the song-artist table mars_tianchi_songs.csv. The competition released sampled song-artist data, along with six months (20150301-20150831) of user behavior history related to these artists. Contestants must predict each artist's playback data for the following 2 months, i.e. 60 days (20150901-20151030).
Experimental content
- Process the song-artist data mars_tianchi_songs and count the number of artists and the number of songs for each artist. The output file is exp2_1.csv: the first column is the artist's ID, the second column is the number of that artist's songs, and the last line gives the total number of artists.
- Merge the user behavior table and the song-artist table into one large table, using song_id as the join key. The first to fifth columns keep the column names of the user behavior table, and the sixth to tenth columns take the names of the second to sixth columns of the song-artist table. The output file is exp2_2.csv.
- Count, per artist, the daily playback volume of all of that artist's songs. The output file is exp2_3.csv, with columns artist id, date Ds, and total song playback volume. Note: only song plays are counted here, not downloads or collections.
Problem-solving ideas (using the pandas library):
1. (1) Use .drop_duplicates() to delete duplicate values. (2) Use .loc[:, 'artist_id'].value_counts() to count how many times each artist repeats, i.e. the number of songs per artist. (3) Use .loc[:, 'songs_id'].value_counts() to check that no song is duplicated.

import pandas as pd

data = pd.read_csv(r"C:\mars_tianchi_songs.csv")       # read the data
Newdata = data.drop_duplicates(subset=['artist_id'])   # drop duplicate artist rows
artist_sum = Newdata['artist_id'].count()              # number of distinct artists
# repetition count, i.e. the number of songs per artist
# (alternative: data.duplicated(subset=['artist_id']).count())
artistChongFu_count = data.loc[:, 'artist_id'].value_counts()
# check that no songs_id is duplicated
songChongFu_count = data.loc[:, 'songs_id'].value_counts()
artistChongFu_count.loc['artist_sum'] = artist_sum     # append the artist total as the last line
artistChongFu_count.to_csv('exp2_1.csv')               # output file exp2_1.csv
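The value_counts()/drop_duplicates() pattern can be verified on a toy frame before running it on the real CSV (the column names follow the competition data; the values are made up):

```python
import pandas as pd

# Tiny stand-in for mars_tianchi_songs.csv
songs = pd.DataFrame({
    "songs_id": ["s1", "s2", "s3", "s4"],
    "artist_id": ["a1", "a1", "a2", "a1"],
})
song_counts = songs["artist_id"].value_counts()   # songs per artist
n_artists = songs.drop_duplicates(subset=["artist_id"])["artist_id"].count()
print(song_counts["a1"], n_artists)
```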
2. Use merge() to merge the two tables:

import pandas as pd

data = pd.read_csv(r"C:\mars_tianchi_songs.csv")
data_two = pd.read_csv(r"C:\mars_tianchi_user_actions.csv")
num = pd.merge(data_two, data)   # joins on the shared song_id column
num.to_csv('exp2_2.csv')
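By default, pd.merge() joins on all columns the two frames share, which here is just song_id. A toy check of that behavior (columns assumed from the task description, values invented):

```python
import pandas as pd

# Tiny stand-ins for the user-actions and songs tables
actions = pd.DataFrame({"user_id": ["u1", "u2"], "song_id": ["s1", "s2"]})
songs = pd.DataFrame({"song_id": ["s1", "s2"], "artist_id": ["a1", "a2"]})
merged = pd.merge(actions, songs)   # inner join on song_id
print(list(merged.columns))
```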
3. Use groupby()[].sum() to add up the repeated items:

import pandas as pd

data = pd.read_csv('exp2_2.csv')
# sum over rows with the same (artist_id, Ds)
DataCHongfu = data.groupby(['artist_id', 'Ds'])['gmt_create'].sum()
DataCHongfu.to_csv('exp2_3.csv')
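The groupby-then-sum behavior can be seen on a toy frame; the plays column below is invented purely to show the grouping, it is not a column of the real data:

```python
import pandas as pd

plays = pd.DataFrame({
    "artist_id": ["a1", "a1", "a2"],
    "Ds": ["20150301", "20150301", "20150301"],
    "plays": [2, 3, 5],
})
# one summed value per (artist_id, Ds) pair
daily = plays.groupby(["artist_id", "Ds"])["plays"].sum()
print(daily[("a1", "20150301")])
```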
The above is the detailed content of How to manipulate text data using Python?. For more information, please follow other related articles on the PHP Chinese website!