Home >Database >Mysql Tutorial >How to select a subset of data in SQL query style in Pandas?

How to select a subset of data in SQL query style in Pandas?

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBforward: 2023-08-31 10:13:101253browse

如何在 Pandas 的 SQL 查询样式中选择数据子集？

Introduction

In this article, I will show you how to use Pandas to perform data analysis with SQL-style filtering. Most enterprise data is stored in databases that require SQL to retrieve and manipulate. For example, companies like Oracle, IBM, Microsoft have their own databases and their own SQL implementations.

Data scientists must deal with SQL at some point in their career because data is not always stored in CSV files. I personally prefer to use Oracle because most of my company's data is stored in Oracle.

Scenario – 1 Suppose we have a task to find all movies from our dataset that have the following criteria.

The language of the movie should be English (en) or Spanish (es).
The movie's popularity must be between 500 and 1000.
The movie's status must be Published.
The number of votes must be greater than 5000. For the above scenario, the SQL statement looks like the following.

SELECT
FROM WHERE
title AS movie_title
,original_language AS movie_language
,popularityAS movie_popularity
,statusAS movie_status
,vote_count AS movie_vote_count movies_data
original_languageIN (&#39;en&#39;, &#39;es&#39;)

AND status=(&#39;Released&#39;)
AND popularitybetween 500 AND 1000
AND vote_count > 5000;

Now that you have seen the SQL statement that meets your needs, let us proceed step by step using pandas. I'm going to show you two ways.

Method 1: Boolean Index

1. Load the movies_data dataset into the DataFrame.

import pandas as pd movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv")

Assign a variable to each condition.

languages = [ "en" , "es" ] condition_on_languages = movies . original_language . isin ( languages )
condition_on_status = movies . status == "Released"
condition_on_popularity = movies . popularity . between ( 500 , 1000 )
condition_on_votecount = movies . vote_count > 5000

3. Combine all conditions (boolean array) together.

final_conditions = ( condition_on_languages & condition_on_status & condition_on_popularity & condition_on_votecount )
columns = [ "title" , "original_language" , "status" , "popularity" , "vote_count" ]
# clubbing all together movies . loc [ final_conditions , columns ]

zhpublished724.24778410867##788Deadpool

Title	original_language	Status	Popularity	vote_count
##95星
zh	Released	514.569956	10995

Method 2:- .query() method.

.query() method is a method for filtering data in SQL where clause style. Conditions can be passed to this method as strings, however, column names must not contain any spaces.

If there are spaces in the column names, use the python replace function to replace them with underscores.

In my experience, I have found that the query() method is faster than previous methods when applied to larger DataFrames.

import pandas as pd movies = pd . read_csv ( "https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv" )

4. Build the query string and execute the method.

Please note that the .query method does not work with triple quoted strings that span multiple lines.

final_conditions = (
"original_language in [&#39;en&#39;,&#39;es&#39;]"
"and status == &#39;Released&#39; "
"and popularity > 500 "
"and popularity < 1000"
"and vote_count > 5000"
) final_result = movies . query ( final_conditions )
final_result

## 95##67512001758000000Deadpool

## There are many more, usually in my coding I have Multiple values to check my "in" clause. So the syntax above is not ideal. Python variables can be referenced using the at symbol (@). You can also programmatically create values as Python lists and use them with (@).

##Budget	id	original_language	original_title	Popularity	Release Date	Revenue	Runtime	st
165000000	157336	zh	星	724.247784	May 11, 2014	##169.0	Relationship		##788
293660	zh	514.569956	September 2, 2016	p>783112979	108.0	Relationship

movie_languages = [ &#39;en&#39; , &#39;es&#39; ]
final_conditions = (
"original_language in @movie_languages "
"and status == &#39;Released&#39; "
"and popularity > 500 "
"and popularity < 1000"
"and vote_count > 5000" )
final_result = movies . query ( final_conditions )
final_result

##Budget

idoriginal_languageoriginal_titlePopularityRelease DateRevenueRuntimest## 95165000000293660514.569956


157336 zh	星	724.247784	May 11, 2014	675120017	169.0	Relationship		##788	##58000000
zh	Deadpool	September 2, 2016	783112979	108.0 p>	Relationship

The above is the detailed content of How to select a subset of data in SQL query style in Pandas?. For more information, please follow other related articles on the PHP Chinese website!

Python sql pandas 字符串 td oracle 数据库数据分析 microsoft

Statement：

This article is reproduced at:tutorialspoint.com. If there is any infringement, please contact admin@php.cn delete

Previous article：Determine what the character set of the MySQL database is set toNext article：Determine what the character set of the MySQL database is set to

See more

How to select a subset of data in SQL query style in Pandas?

Introduction

Method 1: Boolean Index

.query() method is a method for filtering data in SQL where clause style. Conditions can be passed to this method as strings, however, column names must not contain any spaces.

Related articles