- Apache Iceberg Crash Course: What is a Data Lakehouse and a Table Format?
- Free Copy of Apache Iceberg: The Definitive Guide
- Free Apache Iceberg Crash Course
- Iceberg Lakehouse Engineering Video Playlist
Data engineers and scientists regularly reach for different tools for different kinds of data work, from large-scale distributed processing to in-memory data manipulation. The alexmerced/spark35nb Docker image simplifies this by providing a pre-configured environment in which you can experiment with several popular data tools, including PySpark, Pandas, DuckDB, Polars, and DataFusion.
In this blog, we will walk you through setting up this environment and show how to use each of these tools to perform basic data operations such as writing data, loading data, and running queries and aggregations. Whether you are processing large datasets or just need to manipulate a small amount of in-memory data, you will see how these libraries complement one another.
Section 1: Setting Up Your Environment
1.1 Pulling the Docker Image
First, pull the alexmerced/spark35nb image from Docker Hub. The image ships with a pre-configured environment that includes Spark 3.5.2, JupyterLab, and many popular data manipulation libraries such as Pandas, DuckDB, and Polars.
Run the following command to pull the image:
docker pull alexmerced/spark35nb
Next, run the container with the following command:
docker run -p 8888:8888 -p 4040:4040 -p 7077:7077 -p 8080:8080 -p 18080:18080 -p 6066:6066 -p 7078:7078 -p 8081:8081 alexmerced/spark35nb
Once the container is up and running, open your browser and navigate to localhost:8888 to access JupyterLab, where you will perform all of the data operations.
Now that your environment is set up, we can move on to performing some basic data operations with PySpark, Pandas, DuckDB, Polars, and DataFusion.
Section 2: Working with PySpark
2.1 What is PySpark?
PySpark is the Python API for Apache Spark, an open-source engine built for large-scale data processing and distributed computing. It lets you work with big data by distributing both the data and the computation across a cluster. Although Spark usually runs on a distributed cluster, this setup runs it locally on a single node, which is ideal for development and testing.
With PySpark, you can perform data manipulation, SQL queries, machine learning, and more, all within a framework designed to handle big data efficiently. In this section, we will cover how to write and query data with PySpark inside the JupyterLab environment.
2.2 Writing Data with PySpark
Let's start by creating a simple dataset in PySpark. First, initialize a Spark session, which is required for interacting with Spark. We will then build a small DataFrame from sample data and display it.
from pyspark.sql import SparkSession

# Initialize the Spark session
spark = SparkSession.builder.appName("PySpark Example").getOrCreate()

# Sample data: a list of tuples containing names and ages
data = [("Alice", 34), ("Bob", 45), ("Catherine", 29)]

# Create a DataFrame
df = spark.createDataFrame(data, ["Name", "Age"])

# Show the DataFrame
df.show()
In this example, we created a DataFrame with three rows representing people's names and ages. The df.show() method prints the contents of the DataFrame, making it easy to inspect the data we just created.
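Since this subsection is about writing data, here is a minimal sketch of persisting the same DataFrame to disk; the data/people_parquet output path is an illustrative choice, not something defined by the image.

# Write the DataFrame to Parquet (the path is an arbitrary example location)
df.write.mode("overwrite").parquet("data/people_parquet")

# Read it back to confirm the round trip
spark.read.parquet("data/people_parquet").show()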
2.3 Loading and Querying Data with PySpark
Next, let's load a dataset from a file and run some basic queries. PySpark can read a variety of file formats, including CSV, JSON, and Parquet.
For this example, assume we have a CSV file with more data about people, which we load into a DataFrame. We will then demonstrate a simple filter query and an aggregation that counts the number of people in each age group.
# Load a CSV file into a DataFrame
df_csv = spark.read.csv("data/people.csv", header=True, inferSchema=True)

# Show the first few rows of the DataFrame
df_csv.show()

# Filter the data to only include people older than 30
df_filtered = df_csv.filter(df_csv["Age"] > 30)

# Show the filtered DataFrame
df_filtered.show()

# Group by Age and count the number of people in each age group
df_grouped = df_csv.groupBy("Age").count()

# Show the result of the grouping
df_grouped.show()
In this example, we used spark.read.csv() to load a CSV file into a PySpark DataFrame. We then applied two different operations:
- Filtering: we filtered the DataFrame to show only rows where the age is greater than 30.
- Aggregation: we grouped the data by age and counted how many people fall into each age group.
With PySpark, you can run far more complex queries and aggregations over large datasets, which makes it a powerful tool for big data processing; one such query is sketched below.
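As an illustration, here is a hedged sketch that registers the CSV-backed DataFrame as a temporary view and runs a Spark SQL aggregation over it; it assumes the df_csv DataFrame from the previous snippet is still in scope.

# Register the DataFrame as a temporary view so it can be queried with SQL
df_csv.createOrReplaceTempView("people")

# Count people per age group, restricted to those older than 30
result = spark.sql("""
    SELECT Age, COUNT(*) AS num_people
    FROM people
    WHERE Age > 30
    GROUP BY Age
    ORDER BY Age
""")
result.show()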
In the next section, we will explore Pandas, which is ideal for smaller, in-memory data manipulation that does not require distributed processing.
Section 3: Data Manipulation with Pandas
3.1 What is Pandas?
Pandas is one of the most widely used Python libraries for data manipulation and analysis. It provides easy-to-use data structures such as the DataFrame, which lets you work with tabular data in an intuitive way. Unlike PySpark, which is designed for large-scale distributed data processing, Pandas works in memory, making it ideal for small to medium-sized datasets.
With Pandas, you can read and write data from various formats, including CSV, Excel, and JSON, and perform common data operations like filtering, aggregating, and merging data with simple and readable syntax.
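None of the later snippets show a merge, so here is a minimal, hypothetical sketch of joining two small DataFrames; the ages and cities tables are made up purely for illustration.

import pandas as pd

ages = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [34, 45]})
cities = pd.DataFrame({"Name": ["Alice", "Bob"], "City": ["Paris", "Berlin"]})

# Merge the two DataFrames on the shared 'Name' column (an inner join by default)
merged = ages.merge(cities, on="Name")
print(merged)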
3.2 Loading Data with Pandas
Let’s start by loading a dataset into a Pandas DataFrame. We’ll read a CSV file, which is a common file format for data storage, and display the first few rows.
import pandas as pd

# Load a CSV file into a Pandas DataFrame
df_pandas = pd.read_csv("data/people.csv")

# Display the first few rows of the DataFrame
print(df_pandas.head())
In this example, we read the CSV file people.csv using pd.read_csv() and loaded it into a Pandas DataFrame. The head() method lets you view the first few rows of the DataFrame, which is useful for quickly inspecting the data.
3.3 Basic Operations with Pandas
Now that we have loaded the data, let’s perform some basic operations, such as filtering rows and grouping data. Pandas allows you to apply these operations easily with simple Python syntax.
# Filter the data to show only people older than 30
df_filtered = df_pandas[df_pandas["Age"] > 30]

# Display the filtered data
print(df_filtered)

# Group the data by 'Age' and count the number of people in each age group
df_grouped = df_pandas.groupby("Age").count()

# Display the grouped data
print(df_grouped)
Here, we filtered the data to include only people older than 30 using a simple boolean expression. Then, we used the groupby() function to group the DataFrame by age and count the number of people in each age group.
Pandas is incredibly efficient for in-memory data operations, making it a go-to tool for smaller datasets that can fit in your machine's memory. In the next section, we’ll explore DuckDB, a SQL-based tool that enables fast querying over in-memory data.
Section 4: Exploring DuckDB
4.1 What is DuckDB?
DuckDB is an in-memory SQL database management system (DBMS) designed for analytical workloads. It offers high-performance, efficient querying of datasets directly within your Python environment. DuckDB is particularly well-suited for performing complex SQL queries on structured data, like CSVs or Parquet files, without needing to set up a separate database server.
DuckDB is lightweight, yet powerful, and can be used as an alternative to tools like SQLite, especially when working with analytical queries on large datasets.
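As a quick taste of that, here is a small sketch that queries the example data/people.csv file in place with DuckDB's SQL interface, before we route anything through Pandas.

import duckdb

# DuckDB can query a CSV file directly, with no server and no explicit load step
result = duckdb.sql("SELECT Name, Age FROM 'data/people.csv' WHERE Age > 30").df()
print(result)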
4.2 Writing Data into DuckDB
DuckDB can easily integrate with Pandas, allowing you to transfer data from a Pandas DataFrame into DuckDB for SQL-based queries. Here’s how to create a table in DuckDB using the data from Pandas.
import duckdb

# Connect to an in-memory DuckDB instance
conn = duckdb.connect()

# Create a table in DuckDB from the Pandas DataFrame
conn.execute("CREATE TABLE people AS SELECT * FROM df_pandas")

# Show the content of the 'people' table
conn.execute("SELECT * FROM people").df()
In this example, we connected to DuckDB and created a new table people from the Pandas DataFrame df_pandas. DuckDB’s execute() function allows you to run SQL commands, making it easy to interact with data using SQL queries.
4.3 Querying Data in DuckDB
Once your data is loaded into DuckDB, you can run SQL queries to filter, aggregate, and analyze your data. DuckDB supports a wide range of SQL functionality, making it ideal for users who prefer SQL over Python for data manipulation.
# Query to select people older than 30
result = conn.execute("SELECT Name, Age FROM people WHERE Age > 30").df()

# Display the result of the query
print(result)

# Query to group people by age and count the number of people in each age group
result_grouped = conn.execute("SELECT Age, COUNT(*) as count FROM people GROUP BY Age").df()

# Display the grouped result
print(result_grouped)
In this example, we used SQL to filter the people table, selecting only those who are older than 30. We then ran a grouping query to count the number of people in each age group.
DuckDB is an excellent choice when you need SQL-like functionality directly in your Python environment. It allows you to leverage the power of SQL without the overhead of setting up and managing a database server. In the next section, we will explore Polars, a DataFrame library known for its speed and efficiency.
Section 5: Leveraging Polars for Fast DataFrame Operations
5.1 What is Polars?
Polars is a DataFrame library designed for high-performance data manipulation. It’s known for its speed and efficiency, particularly when compared to libraries like Pandas. Polars is written in Rust and uses an optimized query engine to handle large datasets quickly and with minimal memory usage. It also provides a similar interface to Pandas, making it easy to learn and integrate into existing Python workflows.
Polars is particularly well-suited for processing large datasets that might not fit into memory as easily or for scenarios where performance is a critical factor.
5.2 Working with Polars
Let’s start by creating a Polars DataFrame from a Python dictionary. We’ll then perform some basic operations like filtering and aggregating data.
import polars as pl

# Create a Polars DataFrame
df_polars = pl.DataFrame({
    "Name": ["Alice", "Bob", "Catherine"],
    "Age": [34, 45, 29]
})

# Display the Polars DataFrame
print(df_polars)
In this example, we created a Polars DataFrame using a Python dictionary. The syntax is similar to Pandas, but the operations are optimized for speed. Polars offers lazy evaluation, which means it can optimize the execution of multiple operations at once, reducing computation time.
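As a minimal sketch of that lazy API, assuming the same example data/people.csv file, you can scan a CSV lazily, chain operations, and only materialize the result with collect(). Note that older Polars releases spell some of these methods differently (groupby and pl.count() instead of group_by and pl.len()).

# Build a lazy query plan; nothing executes until collect() is called
lazy_result = (
    pl.scan_csv("data/people.csv")
      .filter(pl.col("Age") > 30)
      .group_by("Age")
      .agg(pl.len().alias("count"))
)

# Execute the optimized plan
print(lazy_result.collect())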
5.3 Filtering and Aggregating with Polars
Now, let’s perform some common data operations such as filtering and aggregating the data. These operations are highly optimized in Polars and can be done using a simple and expressive syntax.
# Filter the DataFrame to show only people older than 30
df_filtered = df_polars.filter(pl.col("Age") > 30)

# Display the filtered DataFrame
print(df_filtered)

# Group by 'Age' and count the number of people in each age group
# (newer Polars releases spell these methods group_by() and len())
df_grouped = df_polars.groupby("Age").count()

# Display the grouped result
print(df_grouped)
In this example, we filtered the data to show only rows where the age is greater than 30, and then we grouped the data by age to count how many people are in each group. These operations are highly efficient in Polars due to its optimized memory management and query execution engine.
Polars is ideal when you need the speed of a DataFrame library for both small and large datasets, and when performance is a key requirement. Next, we will explore DataFusion, a tool for SQL-based querying over Apache Arrow data.
Section 6: DataFusion for Query Execution
6.1 What is DataFusion?
DataFusion is an in-memory query execution engine built on top of Apache Arrow, an efficient columnar memory format for analytics. It provides a powerful SQL engine that allows users to run complex queries over structured data stored in Arrow format. DataFusion is part of the Apache Arrow ecosystem, which aims to provide fast data interoperability across different data processing tools.
DataFusion is particularly well-suited for scenarios where you need to query large in-memory datasets using SQL without the overhead of traditional databases. Its integration with Arrow ensures that the data processing is both fast and memory-efficient.
6.2 Writing and Querying Data with DataFusion
DataFusion allows you to execute SQL queries on in-memory data using Apache Arrow. Let’s first create a DataFrame using DataFusion and then perform a few SQL queries on it.
from datafusion import SessionContext
import pyarrow as pa

# Initialize a DataFusion session
ctx = SessionContext()

# Sample data as an Arrow record batch (DataFusion operates natively on Arrow data)
batch = pa.RecordBatch.from_arrays(
    [pa.array(["Alice", "Bob", "Catherine"]), pa.array([34, 45, 29])],
    names=["Name", "Age"],
)

# Register the batch as a table named 'people'
# (one partition containing one batch; depending on your datafusion version,
# convenience constructors such as from_pydict may also be available)
ctx.register_record_batches("people", [[batch]])

# Query the data to select people older than 30
result = ctx.sql("SELECT Name, Age FROM people WHERE Age > 30").collect()

# Display the result (a list of Arrow record batches)
print(result)
In this example, we used DataFusion's SessionContext to register Arrow data as a table named people. We then performed a simple SQL query to filter the data for people older than 30. DataFusion lets you combine the power of SQL with the speed and efficiency of Apache Arrow's in-memory format.
6.3 Aggregating Data with DataFusion
Just like in DuckDB, we can perform aggregation queries to group data by a specific field and count the number of records in each group. Let’s see how this works in DataFusion.
# Group by 'Age' and count the number of people in each age group
result_grouped = ctx.sql("SELECT Age, COUNT(*) as count FROM people GROUP BY Age").collect()

# Display the grouped result
print(result_grouped)
In this query, we grouped the data by the 'Age' column and counted how many people were in each age group. DataFusion’s SQL execution engine ensures that queries run efficiently, even on large datasets stored in-memory.
DataFusion is a great tool for users who need fast, SQL-based querying of large in-memory datasets and want to take advantage of Apache Arrow’s high-performance columnar data format. It’s particularly useful for building analytical pipelines that involve heavy querying of structured data.
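For pipelines like that, you typically register files rather than hand-built record batches. Here is a hedged sketch using register_csv with the same example data/people.csv path; register_parquet works analogously for Parquet files, though the exact helpers available depend on your datafusion version.

# Register a CSV file as a table and query it with SQL
ctx.register_csv("people_csv", "data/people.csv")

result = ctx.sql("""
    SELECT Age, COUNT(*) AS count
    FROM people_csv
    WHERE Age > 30
    GROUP BY Age
""").to_pandas()

print(result)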
Bonus Section: Integrating Dremio with Python
What is Dremio?
Dremio is a powerful data lakehouse platform that helps organizations unify and query their data from various sources. It enables users to easily govern, join, and accelerate queries on their data without the need for expensive and complex data warehouse infrastructures. Dremio's ability to access and query data directly from formats like Apache Iceberg, Delta Lake, S3, RDBMS, and JSON files, along with its performance enhancements, reduces the workload on traditional data warehouses.
Dremio is built on top of Apache Arrow, a high-performance columnar in-memory format, and utilizes Arrow Flight to accelerate the transmission of large datasets over the network. This integration provides blazing-fast query performance while enabling interoperability between various analytics tools.
In this section, we will demonstrate how to set up Dremio in a Docker container and use Python to query Dremio's data sources using the dremio-simple-query library.
7.1 Setting Up Dremio with Docker
To run Dremio on your local machine, use the following Docker command:
docker run -p 9047:9047 -p 31010:31010 -p 45678:45678 -p 32010:32010 -e DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist --name try-dremio dremio/dremio-oss
Once Dremio is up and running, navigate to http://localhost:9047 in your browser to access the Dremio UI. Here, you can configure your data sources, create virtual datasets, and explore the platform's capabilities.
7.2 Querying Dremio with Python using dremio-simple-query
The dremio-simple-query library allows you to query Dremio using Apache Arrow Flight, providing a high-performance interface for fetching and analyzing data from Dremio sources. With this library, you can easily convert Dremio queries into Pandas, Polars, or DuckDB DataFrames, or work directly with Apache Arrow data.
Here’s how to get started:
Step 1: Install the necessary libraries
Make sure you have the dremio-simple-query library installed (it is pre-installed on the alexmerced/spark35nb image). You can install it using pip:
pip install dremio-simple-query
Step 2: Set up your connection to Dremio
You’ll need your Dremio credentials to retrieve a token and establish a connection. Here’s a basic example:
from dremio_simple_query.connect import get_token, DremioConnection
from os import getenv
from dotenv import load_dotenv

# Load environment variables (TOKEN and ARROW_ENDPOINT)
load_dotenv()

# Login to Dremio and get a token
login_endpoint = "http://{host}:9047/apiv2/login"
payload = {
    "userName": "your_username",
    "password": "your_password"
}
token = get_token(uri=login_endpoint, payload=payload)

# Dremio Arrow Flight endpoint, make sure to put in the right host for your Dremio instance
arrow_endpoint = "grpc://{host}:32010"

# Establish connection to Dremio using Arrow Flight
dremio = DremioConnection(token, arrow_endpoint)
If you are running this locally with the docker run command above, the host should be the IP address of the Dremio container on the Docker network, which you can find by running docker inspect.
In this code, we use the get_token function to retrieve an authentication token from Dremio's REST API and establish a connection to Dremio's Arrow Flight endpoint.
Step 3: Query Dremio and retrieve data in various formats
Once connected, you can use the connection to query Dremio and retrieve results in different formats, including Arrow, Pandas, Polars, and DuckDB. Here’s how:
Querying Data and Returning as Arrow Table:
# Query Dremio and return data as an Apache Arrow Table
stream = dremio.toArrow("SELECT * FROM my_table;")
arrow_table = stream.read_all()

# Display Arrow Table
print(arrow_table)
Converting to a Pandas DataFrame:
# Query Dremio and return data as a Pandas DataFrame
df = dremio.toPandas("SELECT * FROM my_table;")
print(df)
Converting to a Polars DataFrame:
# Query Dremio and return data as a Polars DataFrame
df_polars = dremio.toPolars("SELECT * FROM my_table;")
print(df_polars)
Querying with DuckDB:
# Query Dremio and return as a DuckDB relation
duck_rel = dremio.toDuckDB("SELECT * FROM my_table")

# Perform a query on the DuckDB relation
result = duck_rel.query("my_table", "SELECT * FROM my_table WHERE Age > 30").fetchall()

# Display results
print(result)
With the dremio-simple-query library, you can efficiently query large datasets from Dremio and immediately start analyzing them with various tools like Pandas, Polars, and DuckDB, all while leveraging the high-performance Apache Arrow format under the hood.
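As a small illustration of that hand-off, the Arrow table fetched above can be passed straight into Polars for further local analysis; the Age column referenced here is hypothetical and depends on what my_table actually contains.

import polars as pl

# Wrap the Arrow table returned by Dremio in a Polars DataFrame
df_from_dremio = pl.from_arrow(arrow_table)

# Continue the analysis locally, e.g. filter on the (hypothetical) Age column
print(df_from_dremio.filter(pl.col("Age") > 30))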
7.3 Why Use Dremio?
Dremio provides several benefits that make it a powerful addition to your data stack:
- Governance: Centralize governance over all your data sources, ensuring compliance and control.
- Data Federation: Join data across various sources, such as Iceberg, Delta Lake, JSON, CSV, and relational databases, without moving the data.
- Performance: Accelerate your queries with the help of Dremio's query acceleration features and Apache Arrow Flight.
- Cost Savings: By offloading workloads from traditional data warehouses, Dremio can reduce infrastructure costs.
Dremio's close relationship with Apache Arrow ensures that your queries are both fast and efficient, allowing you to seamlessly integrate various data sources and tools into your analytics workflows.
Conclusion
In this blog, we explored how to use a variety of powerful tools for data operations within a Python notebook environment. Starting with the alexmerced/spark35nb Docker image, we demonstrated how to set up a development environment that includes PySpark, Pandas, DuckDB, Polars, and DataFusion—each optimized for different data processing needs. We showcased basic operations like writing, querying, and aggregating data using each tool’s unique strengths.
- PySpark enables scalable, distributed processing for large datasets, perfect for big data environments.
- Pandas offers in-memory, easy-to-use data manipulation for smaller datasets, making it the go-to tool for quick data exploration.
- DuckDB provides an efficient, in-memory SQL engine, ideal for analytical queries without the need for complex infrastructure.
- Polars brings lightning-fast DataFrame operations, combining performance and simplicity for larger or performance-critical datasets.
- DataFusion, with its foundation in Apache Arrow, allows for high-performance SQL querying, particularly for analytical workloads in memory.
Finally, we introduced Dremio, which integrates with Apache Arrow to enable lightning-fast queries across a range of data sources. With the dremio-simple-query library, Dremio allows analysts to quickly fetch and analyze data using tools like Pandas, Polars, and DuckDB, ensuring that data is available when and where it's needed without the overhead of traditional data warehouses.
Whether you're working with small datasets or handling massive amounts of data in distributed environments, this setup provides a versatile, efficient, and scalable platform for any data engineering or data science project. By leveraging these tools together, you can cover the full spectrum of data processing, from exploration to large-scale analytics, with minimal setup and maximum performance.