DataFrame using pandas for data processing

Sep 15, 2020, 04:20 PM

This is the second article in our pandas data processing series. Let's talk about the most important data structure in pandas: DataFrame.

In the previous article, we introduced Series and noted that a Series is roughly a one-dimensional array for which pandas wraps many convenient, easy-to-use APIs. A DataFrame can be understood simply as a dict of Series, which splices the data together into a two-dimensional table. It also provides many interfaces for table-level and batch data processing, which greatly reduces the difficulty of data processing.

Create DataFrame

DataFrame is a tabular data structure with two indexes: a row index and a column index. These let us easily retrieve the corresponding rows and columns, which makes looking up data during processing much easier.

First, let’s start with the simplest one, how to create a DataFrame.

Create from dictionary

We create a dict whose keys are the column names and whose values are lists. When we pass this dict to the DataFrame constructor, it creates a DataFrame for us with each key as a column name and each value as that column's data. When we output it in Jupyter, the contents of the DataFrame are automatically displayed as a table.
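As a minimal sketch of the dict-based construction described above (the column names and values here are made up for illustration and are not from the original article):

```python
import pandas as pd

# Hypothetical sample data: each key becomes a column name,
# each list becomes the values of that column.
data = {
    "name": ["Alice", "Bob", "Carol"],
    "age": [25, 30, 35],
}

df = pd.DataFrame(data)
print(df)
```

All the lists should have the same length, since each one fills a full column of the table.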

We can also create a DataFrame from a two-dimensional numpy array. If we pass in just the numpy array without specifying column names, pandas will use integers as the column labels.
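A small sketch of this, assuming an arbitrary 3x4 integer array as input (the array itself is only an illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical 3x4 array; any two-dimensional array works the same way.
arr = np.arange(12).reshape(3, 4)

# No column names given, so pandas labels the columns 0, 1, 2, 3.
df_from_array = pd.DataFrame(arr)
print(df_from_array.columns.tolist())
```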

When creating the DataFrame, we can pass a list of strings to the columns parameter to specify the column names.
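For example, the following sketch names the columns of a numpy-backed DataFrame through the columns parameter (the names "a" to "d" are placeholders chosen for illustration):

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(3, 4)

# The list passed to columns must have one name per array column.
df_named = pd.DataFrame(arr, columns=["a", "b", "c", "d"])
print(df_named)
```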

Reading from a file

Another very powerful feature of pandas is that it can read data from files in various formats to create a DataFrame, such as the commonly used Excel and CSV formats, or even databases.

For structured data such as Excel, CSV, and JSON, pandas provides dedicated APIs. We just find the corresponding API and use it.
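The sketch below shows two of these readers, read_csv and read_json. To keep it self-contained, it reads from in-memory text through io.StringIO instead of real files; in practice we would pass a file path. The sample content is made up. pandas also offers read_excel, which typically needs an additional Excel engine package installed.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "name,age\nAlice,25\nBob,30\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Hypothetical JSON content: a list of records.
json_text = '[{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]'
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```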

If the data is in some special format, that doesn't matter either. We can use read_table, which reads data from various text files; we complete the creation by passing in the separator and other parameters. For example, in the previous article verifying the dimensionality reduction effect of PCA, we read data from a .data file. The delimiter between columns in that file is a space, not the comma used by CSV or a tab character. We pass the sep parameter to specify the delimiter and complete the data reading.
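As a rough sketch of reading space-delimited text with read_table: the content below is a made-up stand-in for a .data file (not the file from the previous article), with a first line that holds the column names.

```python
import io
import pandas as pd

# Hypothetical space-separated content; the first line is the header.
text = "x y z\n5.1 3.5 1.4\n4.9 3.0 1.4\n"

# sep=" " tells read_table to split columns on a single space
# instead of its default tab separator.
df_space = pd.read_table(io.StringIO(text), sep=" ")
print(df_space)
```

If the columns are separated by a varying number of spaces, a regular expression such as sep=r"\s+" is a common alternative.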

The header parameter indicates which line or lines of the file are used as the column names. By default, pandas behaves as if header=0, meaning the first line is used as the column names. If the data does not contain column names, header=None must be specified; otherwise the first row of data will be consumed as the column names. We rarely need multi-level column names, so in practice we usually either keep the default or set header=None.
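To make the effect of header concrete, here is a small sketch that reads the same header-less text twice. The data is invented for illustration.

```python
import io
import pandas as pd

# Hypothetical file content with NO column-name line: three data rows.
text = "5.1 3.5\n4.9 3.0\n6.2 2.9\n"

# Default behaviour: the first data row is mistaken for the column names.
with_default = pd.read_table(io.StringIO(text), sep=" ")

# header=None: all rows are kept as data, columns are labeled 0, 1.
no_header = pd.read_table(io.StringIO(text), sep=" ", header=None)

print(len(with_default), with_default.columns.tolist())
print(len(no_header), no_header.columns.tolist())
```

With the default setting one row of data is silently lost, which is the kind of problem header=None avoids.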

Among all these methods of creating a DataFrame, the most commonly used is the last one: reading from a file. When we do machine learning or take part in competitions on Kaggle, the data is usually ready-made and given to us as files, and we rarely need to create data ourselves. In real work scenarios, the data may not be stored in files, but it still comes from a source, usually some big data platform, and the model obtains its training data from there.

So in general, we rarely use the other methods of creating a DataFrame. It is enough to have some understanding of them and to focus on mastering reading from files.

Common operations

The following introduces some common pandas operations. I already knew these before I learned pandas systematically, for a simple reason: they are used so often that they can be considered must-know common sense.

View data

When we evaluate a DataFrame instance in Jupyter, all the data in the DataFrame is printed for us. If there are too many rows, the middle part is replaced by an ellipsis. For a DataFrame with a large amount of data, we generally do not output it directly like this, but instead display only the first few or last few rows. Two APIs are needed here.

The method for displaying the first few rows is called head. It accepts a parameter that specifies how many rows to display, counting from the beginning.

Since there is an API for displaying the first few rows, there is also one for the last few. That API is called tail, and with it we can view the specified number of rows at the end of the DataFrame.
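A minimal sketch of head and tail, using a made-up 100-row DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with 100 rows.
df = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2})

first_three = df.head(3)  # first 3 rows
last_three = df.tail(3)   # last 3 rows

print(first_three)
print(last_three)
```

If the argument is omitted, both methods return 5 rows.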

Add, delete and modify columns

We mentioned before that a DataFrame is essentially equivalent to a dict of Series. Since it behaves like a dict, we can naturally obtain a specific Series by its key.

There are two ways to get a specified column of a DataFrame: we can access the column name as an attribute with a dot, or look it up dict-style with square brackets.
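Both access styles can be sketched as follows (the DataFrame and its column names are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 30, 35], "city": ["Paris", "Rome", "Oslo"]})

by_attribute = df.age   # attribute-style access
by_key = df["age"]      # dict-style access

# Both return the same Series.
print(by_attribute.equals(by_key))
```

Attribute-style access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so the dict-style form is the safer default.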

We can also read multiple columns at the same time. For multiple columns, only one method is supported: the dict-style lookup. It accepts a list and finds the data for the columns named in that list. The result is a new DataFrame made up of those columns.
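A short sketch of selecting several columns at once, again with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [25, 30],
    "city": ["Paris", "Rome"],
})

# Note the double brackets: the outer pair is the lookup,
# the inner pair is the list of column names.
subset = df[["name", "age"]]
print(type(subset).__name__)
print(subset)
```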

We can use del to delete a column we don't need.
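For instance, assuming a small illustrative DataFrame with a "city" column we no longer want:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [25, 30],
    "city": ["Paris", "Rome"],
})

# del removes the column from df in place.
del df["city"]
print(df.columns.tolist())
```

Because del modifies the DataFrame in place, the column is gone for every later use of df. When the original should be kept, df.drop(columns=["city"]) returns a new DataFrame instead.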

Creating a new column is also very simple. We can assign directly to the DataFrame, just as we would assign to a key of a dict.

The assigned value does not have to be a single number; it can also be an array.
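Both kinds of assignment can be sketched like this (the column names "const" and "arr" are placeholders):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# A scalar is repeated for every row.
df["const"] = 0

# An array supplies one value per row.
df["arr"] = np.array([10, 20, 30])

print(df)
```

When assigning an array, its length has to match the number of rows in the DataFrame.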

Modifying a column is just as simple: we overwrite the original data with the same kind of assignment.
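A minimal sketch of overwriting an existing column, using an invented "price" column:

```python
import pandas as pd

df = pd.DataFrame({"price": [1.0, 2.0, 3.0]})

# Assigning to an existing column name replaces its contents.
df["price"] = df["price"] * 10
print(df)
```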

Convert to numpy array

Sometimes pandas is inconvenient for what we want to do and we would like the underlying raw data. We can use .values directly to obtain the numpy array corresponding to the DataFrame.

Each column in a DataFrame has its own type, but after conversion to a numpy array all the data share a single type. pandas therefore finds a common type for all columns, which is why you often end up with the object type. So it is best to check the column types before using .values, to avoid errors caused by an unexpected type.
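A small sketch of the conversion on an all-integer DataFrame (data invented for illustration). It also shows to_numpy(), which the pandas documentation recommends as the alternative to .values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

arr = df.values        # attribute used in this article
arr2 = df.to_numpy()   # equivalent method form

print(isinstance(arr, np.ndarray), arr.shape)
```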
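The type behavior can be checked with a sketch like the following, where the mixed-type DataFrame is made up for illustration:

```python
import numpy as np
import pandas as pd

mixed = pd.DataFrame({
    "i": [1, 2],        # integers
    "f": [1.5, 2.5],    # floats
    "s": ["a", "b"],    # strings
})

# Check the per-column types first.
print(mixed.dtypes)

# Numbers and strings have no common numeric type, so we get object.
print(mixed.values.dtype)

# Restricting to the numeric columns gives a float array instead.
print(mixed[["i", "f"]].values.dtype)
```

Selecting only the columns we actually need before converting is one simple way to avoid an unwanted object array.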

Summary

In today's article we learned about the relationship between DataFrame and Series, and covered some DataFrame basics and common usage. Although a DataFrame can be roughly regarded as a dict of Series, it is a separate data structure with many APIs of its own. It supports many fancy operations and is a powerful tool for processing data.

According to some industry statistics, an algorithm engineer spends about 70% of their time on data processing, while the time spent actually writing models and tuning parameters may be less than 20%. This shows how necessary and important data processing is. In the Python world, pandas is the best scalpel and toolbox for data processing. I hope everyone can master it.

Statement
This article is reproduced from juejin. If there is any infringement, please contact admin@php.cn for removal.