Home  >  Article  >  Database  >  How to select a subset of data in SQL query style in Pandas?

How to select a subset of data in SQL query style in Pandas?

WBOY
WBOYforward
2023-08-31 10:13:101109browse

如何在 Pandas 的 SQL 查询样式中选择数据子集?

Introduction

In this article, I will show you how to use Pandas to perform data analysis with SQL-style filtering. Most enterprise data is stored in databases that require SQL to retrieve and manipulate. For example, companies like Oracle, IBM, Microsoft have their own databases and their own SQL implementations.

Data scientists must deal with SQL at some point in their career because data is not always stored in CSV files. I personally prefer to use Oracle because most of my company's data is stored in Oracle.

Scenario – 1 Suppose we have a task to find all movies from our dataset that have the following criteria.

  • The language of the movie should be English (en) or Spanish (es).
  • The movie's popularity must be between 500 and 1000.
  • The movie's status must be Published.
  • The number of votes must be greater than 5000. For the above scenario, the SQL statement looks like the following.
SELECT
FROM WHERE
title AS movie_title
,original_language AS movie_language
,popularityAS movie_popularity
,statusAS movie_status
,vote_count AS movie_vote_count movies_data
original_languageIN ('en', 'es')

AND status=('Released')
AND popularitybetween 500 AND 1000
AND vote_count > 5000;

Now that you have seen the SQL statement that meets your needs, let us proceed step by step using pandas. I'm going to show you two ways.

Method 1: Boolean Index

1. Load the movies_data dataset into the DataFrame.

import pandas as pd movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv")

Assign a variable to each condition.

languages = [ "en" , "es" ] condition_on_languages = movies . original_language . isin ( languages )
condition_on_status = movies . status == "Released"
condition_on_popularity = movies . popularity . between ( 500 , 1000 )
condition_on_votecount = movies . vote_count > 5000

3. Combine all conditions (boolean array) together.

final_conditions = ( condition_on_languages & condition_on_status & condition_on_popularity & condition_on_votecount )
columns = [ "title" , "original_language" , "status" , "popularity" , "vote_count" ]
# clubbing all together movies . loc [ final_conditions , columns ]

zhpublished724.24778410867##788Deadpool
Title

original_language

Status

Popularity

vote_count
##95星

zh

Released

514.569956

10995

Method 2:- .query() method.

.query() method is a method for filtering data in SQL where clause style. Conditions can be passed to this method as strings, however, column names must not contain any spaces.

If there are spaces in the column names, use the python replace function to replace them with underscores.

In my experience, I have found that the query() method is faster than previous methods when applied to larger DataFrames.

import pandas as pd movies = pd . read_csv ( "https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv" )

4. Build the query string and execute the method.

Please note that the .query method does not work with triple quoted strings that span multiple lines.

final_conditions = (
"original_language in ['en','es']"
"and status == 'Released' "
"and popularity > 500 "
"and popularity < 1000"
"and vote_count > 5000"
) final_result = movies . query ( final_conditions )
final_result

## 95##67512001758000000Deadpool

## There are many more, usually in my coding I have Multiple values ​​to check my "in" clause. So the syntax above is not ideal. Python variables can be referenced using the at symbol (@). You can also programmatically create values ​​as Python lists and use them with (@).
##Budget

id

original_language

original_title

Popularity

Release Date

Revenue

Runtime

st

165000000

157336

zh

724.247784

May 11, 2014

##169.0

Relationship

##788

293660

zh

514.569956

September 2, 2016

p>783112979

108.0

Relationship

movie_languages = [ &#39;en&#39; , &#39;es&#39; ]
final_conditions = (
"original_language in @movie_languages "
"and status == &#39;Released&#39; "
"and popularity > 500 "
"and popularity < 1000"
"and vote_count > 5000" )
final_result = movies . query ( final_conditions )
final_result

##Budget

idoriginal_languageoriginal_titlePopularityRelease DateRevenueRuntimest## 95165000000293660514.569956

157336

zh

724.247784

May 11, 2014

675120017

169.0

Relationship

##788

##58000000

zh

Deadpool

September 2, 2016

783112979

108.0 p>

Relationship

The above is the detailed content of How to select a subset of data in SQL query style in Pandas?. For more information, please follow other related articles on the PHP Chinese website!

Statement:
This article is reproduced at:tutorialspoint.com. If there is any infringement, please contact admin@php.cn delete