
Use Python to process text data

Experimental purpose

Become familiar with Python's basic data structures, as well as file input and output.

Experimental data

We use the evaluation data and evaluation task of the xx machine learning conference in xxxx. The data consists of a training set and a test set. The evaluation task is, given the training data, to predict whether each relationship in the test set is a positive or negative example, appending 1 or 0 to the end of each sample.

The training set is described as follows: the first column is the relationship type, the second and third columns are the names of the two people, the fourth column is the title, the fifth column marks whether the relationship is a positive or negative example (1 for positive, 0 for negative), and the sixth column indicates the training set.

Event    Character 1    Character 2    Title    Relationship (0 or 1)    Training set

The test set is described as follows. The format is basically the same as the training set; the only difference is that the fifth column no longer carries the positive/negative example mark.

Relationship    Character 1    Character 2    Event

Experimental content

Process the training set data, keeping only the first five columns, and name the output text file exp1_1.txt.

Classify the 19 relationship types based on the data obtained in the first step, and store the generated text files in the exp1_train folder. In the order in which the relationship categories first appear, store the data of the first relationship category in 1.txt, the second in 2.txt, and so on up to 19.txt.

For the test set, classify each sample by relationship category, following the same order as the 19 training categories, i.e. put the data of the same relationship type into one text file, again generating 19 test files. Their format stays consistent with the test set. Store them in the exp1_test folder and name the files 1_test.txt, 2_test.txt, and so on. At the same time, record the position of each sample in the original test set, with one index file corresponding to each of the 19 test files: for example, the line numbers in the original text of the samples of the first relationship type, "rumored discord", are recorded in index1.txt, those of the second type in index2.txt, and so on.

Problem-solving ideas

1. The first question tests our knowledge of file operations and lists. The main difficulty is reading the .new file; after processing it according to the requirements, we generate a txt file. Let's look at the concrete code implementation:

relation_types = []                                      # list used to record the relationship types seen so far
with open("task1.trainSentence.new", "r", encoding='xxx') as file_input:  # open the .new file; replace xxx with your own encoding
    with open("exp1_1.txt", "w", encoding='xxx') as file_output:          # open exp1_1.txt (created if it does not exist); replace xxx with your own encoding

        for line in file_input:                          # iterate over the file line by line
            arr = line.split('\t')                       # split on the tab separator
            if arr[0] not in relation_types:             # if this relationship type has not been seen yet
                relation_types.append(arr[0])            # record it
            file_output.write(arr[0]+"\t"+arr[1]+"\t"+arr[2]+"\t"+arr[3]+"\t"+arr[4]+"\n")  # write the first five columns to the output file
# the with blocks close both files automatically, so no explicit close() calls are needed

2. The second question again examines file operations. Based on the file generated in question 1, events are classified by relationship type so that samples of the same type end up grouped together. A loop with a simple condition solves it; let's look at the concrete code implementation:

import os

file_1 = open("exp1_1.txt", encoding='xxx')             # open the file from question 1; replace xxx with your own encoding
os.mkdir("exp1_train")                                  # create the output directory
os.chdir("exp1_train")                                  # change the working directory to it
first_line = file_1.readline()                          # read the first line of exp1_1.txt
arr = first_line.split("\t")                            # split on the tab separator
b = 1                                                   # sequence number of the current group file
file_2 = open("{}.txt".format(b), "w", encoding="xxx")  # open the first group file; replace xxx with your own encoding
file_2.write(first_line)                                # the first line belongs to the first group
for line in file_1:                                     # read the remaining lines
    arr_1 = line.split("\t")                            # split on the tab separator
    if arr[0] != arr_1[0]:                              # the relationship type differs from the previous line's type
        file_2.close()                                  # close the current group file
        b += 1                                          # advance the file sequence number
        file_2 = open("{}.txt".format(b), "w", encoding="xxx")  # create a new file for the new relationship type
    arr = arr_1                                         # remember this line's columns for the next comparison
    file_2.write(line)                                  # write the line into the file of its type
file_1.close()                                          # close exp1_1.txt created in question 1
file_2.close()                                          # close the last group file
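
Note that the loop above assumes that lines with the same relationship type appear consecutively in exp1_1.txt, so a new file is started whenever the type changes. If the rows were not grouped like this, a dictionary of open file handles would be safer. Below is a minimal sketch of that variant, keeping the original file names and the xxx encoding placeholder:

import os

os.makedirs("exp1_train", exist_ok=True)                 # create the output directory if it does not exist

handles = {}                                             # relationship type -> open file handle
with open("exp1_1.txt", encoding="xxx") as file_in:      # replace xxx with your own encoding
    for line in file_in:
        rel = line.split("\t")[0]                        # relationship type of this line
        if rel not in handles:                           # first time this type appears: open the next numbered file
            path = os.path.join("exp1_train", "{}.txt".format(len(handles) + 1))
            handles[rel] = open(path, "w", encoding="xxx")
        handles[rel].write(line)                         # append the line to the file of its type

for handle in handles.values():                          # close all group files
    handle.close()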

3. Next, classify the test set into the 19 categories established by the training set, according to the relationship between the characters. We can traverse the data with a dictionary: look up each relationship, put content with the same relationship into the same file, and create a new one whenever a new relationship appears.

import os

with open("exp1_1.txt", encoding='xxx') as file_in1:     # open the file from question 1; replace xxx with your own encoding
    i = 1                                                # next category number to assign
    arr2 = {}                                            # dictionary mapping relationship -> category number
    for line in file_in1:                                # iterate line by line
        arr3 = line[0:2]                                 # the relationship is read from the first two characters of the line
        if arr3 not in arr2.keys():                      # first time this relationship is seen
            arr2[arr3] = i                               # assign it the next category number
            i += 1                                       # advance the category number
    file_in = open("task1.test.new", encoding='xxx')     # open the test file; replace xxx with your own encoding
    os.mkdir("exp1_test")                                # create the output directory
    os.chdir("exp1_test")                                # change the working directory to it
    for line in file_in:                                 # first pass: distribute each test sample into its category file
        arr = line[0:2]                                  # relationship of this sample
        with open("{}_test.txt".format(arr2[arr]), "a", encoding='xxx') as file_out:
            file_out.write(line)                         # append the sample to the file of its category
    i = 1                                                # line number of the sample in the original test set
    file_in.seek(0)                                      # rewind the test file for a second pass
    os.mkdir("exp1_index")                               # create the index directory (it ends up inside exp1_test because the working directory was changed above)
    os.chdir("exp1_index")
    for line in file_in:                                 # second pass: build the index files
        arr = line[0:2]                                  # relationship of this sample
        with open("index{}.txt".format(arr2[arr]), "a", encoding='xxx') as file_out:
            line = line[0:-1]                            # strip the trailing newline
            file_out.write(line + '\t' + "{}".format(i) + "\n")  # record the sample together with its original line number
        i += 1
    file_in.close()                                      # close the test file
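
One caveat about the code above: line[0:2] keys each sample on the first two characters of the line, which only works if every relationship name is exactly two characters long. A more defensive variant keys on the whole first tab-separated field instead. The sketch below keeps the original file names and the xxx encoding placeholder, and writes the index files into a separate exp1_index folder instead of changing the working directory:

import os

rel_ids = {}                                                  # relationship -> category number, in order of first appearance
with open("exp1_1.txt", encoding="xxx") as train_in:          # replace xxx with your own encoding
    for line in train_in:
        rel = line.split("\t")[0]
        if rel not in rel_ids:
            rel_ids[rel] = len(rel_ids) + 1

os.makedirs("exp1_test", exist_ok=True)
os.makedirs("exp1_index", exist_ok=True)

with open("task1.test.new", encoding="xxx") as test_in:
    for line_no, line in enumerate(test_in, start=1):         # line_no is the sample's position in the original test set
        cat = rel_ids[line.split("\t")[0]]                    # category number of this sample's relationship
        with open(os.path.join("exp1_test", "{}_test.txt".format(cat)), "a", encoding="xxx") as test_out:
            test_out.write(line)                              # the test sample, unchanged
        with open(os.path.join("exp1_index", "index{}.txt".format(cat)), "a", encoding="xxx") as index_out:
            index_out.write(line.rstrip("\n") + "\t" + str(line_no) + "\n")   # sample plus its original line number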

Use Python to process numerical data

Experimental purpose

Become familiar with Python's basic data structures, as well as file input and output.

Experimental data

We use the data of the XX Tianchi Competition in XXXX, which is also the data of the XXth Big Data Challenge of Chinese Universities. The data includes two tables: the user behavior table mars_tianchi_user_actions.csv and the song/artist table mars_tianchi_songs.csv. The competition releases sampled song and artist data, along with the user behavior history related to these artists over 6 months (20150301-20150831). Contestants need to predict each artist's playback data for the following 2 months, i.e. 60 days (20150901-20151030).


Experimental content

  • Process the song/artist data mars_tianchi_songs and count the number of artists and the number of songs for each artist. The output file is exp2_1.csv: the first column is the artist's ID, the second column is the number of songs by that artist, and the last line outputs the number of artists.

  • Merge the user behavior table and the song/artist table into one large table, using song_id as the join key. Columns one to five keep the column names of the user behavior table, and columns six to ten take the names of columns two to six of the song/artist table. The output file is exp2_2.csv.

  • For each artist, count the daily playback volume of all of that artist's songs. The output file is exp2_3.csv, with columns artist id, date Ds, and total song playback volume. Note: only song plays are counted here, not downloads or collections.

Problem-solving ideas (using the pandas library)

1.

(1) Use .drop_duplicates() to delete duplicate values

(2) Use .loc[:,'artist_id'].value_counts() to count how many times each artist appears, i.e. the number of songs for each artist

(3) Use .loc[:,'songs_id'].value_counts() to check that there are no duplicate songs

import pandas as pd

data = pd.read_csv(r"C:\mars_tianchi_songs.csv")              # read the data
Newdata = data.drop_duplicates(subset=['artist_id'])          # drop duplicate artist rows
artist_sum = Newdata['artist_id'].count()                     # total number of artists
# artistChongFu_count = data.duplicated(subset=['artist_id']).count()
artistChongFu_count = data.loc[:,'artist_id'].value_counts()  # how many times each artist appears, i.e. the number of songs per artist
songChongFu_count = data.loc[:,'songs_id'].value_counts()     # check that no song is duplicated
artistChongFu_count.loc['artist_sum'] = artist_sum            # append the number of artists as the last row
artistChongFu_count.to_csv('exp2_1.csv')                      # output file exp2_1.csv
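
The same per-artist counts can also be obtained with a groupby; this is just an equivalent sketch, assuming the column names used above (artist_id, songs_id):

import pandas as pd

songs = pd.read_csv(r"C:\mars_tianchi_songs.csv")
per_artist = songs.groupby('artist_id')['songs_id'].count()    # number of songs per artist
per_artist.loc['artist_sum'] = songs['artist_id'].nunique()    # append the number of distinct artists as the last row
per_artist.to_csv('exp2_1.csv')                                # first column artist ID, second column song count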

2. Use merge() to merge the two tables

import pandas as pd

data = pd.read_csv(r"C:\mars_tianchi_songs.csv")             # song/artist table
data_two = pd.read_csv(r"C:\mars_tianchi_user_actions.csv")  # user behavior table
num = pd.merge(data_two, data)                               # merge the two tables on their common column(s)
num.to_csv('exp2_2.csv')                                     # output file exp2_2.csv
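
Because pd.merge() with no key joins on every column the two tables share, it can be safer to name the join key explicitly. The task states the association is song_id (the songs-table code above uses songs_id, so use whichever name matches the actual header); a minimal sketch:

import pandas as pd

actions = pd.read_csv(r"C:\mars_tianchi_user_actions.csv")
songs = pd.read_csv(r"C:\mars_tianchi_songs.csv")
merged = pd.merge(actions, songs, on='song_id', how='left')    # keep every user action, attach the song's artist columns
merged.to_csv('exp2_2.csv', index=False)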

3. Use groupby()[].sum() to add up repeated entries

import pandas as pd

data = pd.read_csv('exp2_2.csv')                                      # read the merged table
DataCHongfu = data.groupby(['artist_id','Ds'])['gmt_create'].sum()    # add up the repeated (artist, day) entries
DataCHongfu.to_csv('exp2_3.csv')                                      # output file exp2_3.csv
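
One caution: if gmt_create is a timestamp, summing it adds up timestamps rather than counting plays. If the merged table has an action-type column in which one value marks a play (the column name 'action_type' and the value 1 below are assumptions, not given in the text, so check the actual data), a hedged sketch for counting daily plays per artist could look like this:

import pandas as pd

data = pd.read_csv('exp2_2.csv')
plays = data[data['action_type'] == 1]                         # assumption: action_type == 1 marks a play
daily_plays = plays.groupby(['artist_id', 'Ds']).size()        # number of play records per artist per day
daily_plays.to_csv('exp2_3.csv')                               # artist id, date Ds, total plays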
