
Data analysis and machine learning

Python+ big data computing platform, PyODPS architecture construction

Big data platforms are largely built on the Hadoop ecosystem, which is essentially a Java environment. Many people prefer Python and R for data analysis, but those tools are usually applied to small-data or local processing problems. How can the two worlds be combined for greater value? The figure above shows the existing Hadoop ecosystem alongside the existing Python environment.

MaxCompute

MaxCompute is a big data platform for offline computing. It provides TB/PB-level data processing, multi-tenancy, out-of-the-box use, and an isolation mechanism to ensure security. The main analysis tool on MaxCompute is SQL, which is simple, easy to use, and declarative. Tunnel provides a channel for uploading and downloading data without going through the SQL engine's scheduling.

Pandas

Pandas is a data analysis tool built on NumPy. Its most important structure is the DataFrame, and it provides a series of plotting APIs backed by matplotlib. It also interacts easily with third-party Python libraries.

PyODPS architecture

[Figure: PyODPS architecture]

PyODPS brings Python to big data analysis; its architecture is shown in the figure above. The bottom layer is the basic API, which operates on tables, functions, and resources on MaxCompute. Above it sits the DataFrame framework, which consists of two parts. The front end defines a set of expression operations: code written by the user is converted into an expression tree, just as in an ordinary language, and users can define custom functions, visualize results, and interact with third-party libraries. At the top of the backend is the Optimizer, whose role is to optimize the expression tree. Both the ODPS and pandas paths then pass through a compiler and analyzer before being submitted to an Engine for execution.

Background

Why build a DataFrame framework?

[Figure: three dimensions of a big data analysis tool]

Any big data analysis tool faces problems along three dimensions: expressiveness (are the API, syntax, and programming language simple and intuitive?), data (can storage and metadata be compressed and accessed efficiently?), and engine (is computing performance sufficient?). This leads to a choice between two tools: pandas and SQL.

[Figure: pandas vs. SQL comparison]

As shown in the figure above, pandas has excellent expressiveness, but its data must fit in memory, and its engine is a single machine limited by that machine's performance. SQL is less expressive, but it handles large amounts of data: when the data volume is small the engine offers no advantage, but when the data volume is large the engine becomes a major advantage. PyODPS's goal is to combine the advantages of both.

PyODPS DataFrame

PyODPS DataFrame is written in Python, so you can use Python variables, conditionals, and loops. It defines its own front end with pandas-like syntax for better expressiveness. The backend determines the concrete execution engine based on the data source, following the visitor design pattern, which makes it extensible. Execution is entirely lazy: nothing runs until the user calls a method that triggers immediate execution.

[Figure: PyODPS DataFrame example code]

As the figure above shows, the syntax is very similar to pandas.
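As a rough illustration of the pandas-style chained syntax, here is the same shape of query expressed with pandas itself (a MaxCompute connection would be needed to run it through PyODPS; the iris column names and values are stand-ins):

```python
import pandas as pd

# A small stand-in for the iris table used throughout this article
df = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor"],
    "petal_length": [1.4, 1.7, 4.5],
})

# pandas-style chained expression: filter rows, then select columns.
# In PyODPS the same chain would build an expression tree lazily
# instead of executing immediately.
result = df[df.petal_length > 1.5][["species", "petal_length"]]
```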

Expressions and abstract syntax trees

[Figure: expression tree example]

As can be seen from the figure above, the user performs a GroupBy on an original Collection and then a column selection. At the bottom is the source Collection, from which two fields are taken: species, which drives the By operation, and petal_length, which is aggregated to produce an aggregate value. The species field is selected directly, while the aggregated petal_length has one added to it.
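A minimal sketch of how such an expression tree might be represented (the node and field names here are illustrative, not PyODPS internals):

```python
# Each node records an operation and its children, mirroring how a
# chained DataFrame call builds a tree instead of executing anything.
class Node:
    def __init__(self, op, *children, **attrs):
        self.op = op
        self.children = list(children)
        self.attrs = attrs

source = Node("Collection", name="iris")
grouped = Node("GroupBy", source, by=["species"])
selected = Node("Select", grouped,
                fields=["species", "petal_length_mean + 1"])

def depth(node):
    # Walk the tree the way a backend visitor would
    return 1 + max((depth(c) for c in node.children), default=0)
```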

Optimizer(operation merging)

[Figure: operation merging]

The backend first uses the Optimizer to optimize the expression tree: the GroupBy runs first, with the column selection on top of it. Through operation merging, the petal_length aggregation and the subsequent add-one can be folded together, finally forming a single GroupBy Collection.

Optimizer(column pruning)

[Figure: column pruning]

When a user joins two DataFrames and then retrieves only two columns from the result, submitting this as-is to a big data environment is very inefficient, because not every column is actually used. The columns under the join must therefore be pruned. For example, if only one field of DataFrame1 is used, we project just that field to form a new Collection, and likewise for DataFrame2. This greatly reduces the amount of data flowing through the join.
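Column pruning can be demonstrated with a toy join in plain Python (the row layout and helper names are invented for illustration):

```python
# Rows as dicts; a "join" whose inputs are first pruned to only the
# columns the downstream query actually uses.
left = [{"key": 1, "a": 10, "unused": 0}, {"key": 2, "a": 20, "unused": 0}]
right = [{"key": 1, "b": "x", "junk": 9}]

def prune(rows, cols):
    # Projection: keep only the requested columns
    return [{c: r[c] for c in cols} for r in rows]

def join(lrows, rrows, key):
    out = []
    for l in lrows:
        for r in rrows:
            if l[key] == r[key]:
                out.append({**l, **r})
    return out

# The pruned join never materializes "unused" or "junk"
joined = join(prune(left, ["key", "a"]), prune(right, ["key", "b"]), "key")
```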

Optimizer (predicate pushdown)

[Figure: predicate pushdown]

If two DataFrames are joined and then filtered, the filtering should be pushed down below the join, reducing the amount of input the join has to process.
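The same toy-join idea shows why pushdown pays off: filtering first shrinks the join's input while producing an identical result (helper names are again illustrative):

```python
rows_l = [{"k": i, "v": i * 10} for i in range(1000)]
rows_r = [{"k": i, "w": i} for i in range(1000)]

def join(a, b):
    index = {r["k"]: r for r in b}
    return [{**l, **index[l["k"]]} for l in a if l["k"] in index]

# Filter AFTER the join: the join must process all 1000 rows per side
naive = [r for r in join(rows_l, rows_r) if r["v"] < 50]

# Predicate pushdown: filter first, so the join sees only 5 rows
pushed = join([r for r in rows_l if r["v"] < 50], rows_r)
```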

Visualization

[Figure: visualization example]

PyODPS provides visualize() to make it easy for users to inspect an expression. As the example on the right shows, the ODPS SQL backend compiles the expression into a SQL statement for execution.

Backend

[Figure: backend flexibility]

As the figure above shows, the computing backend is very flexible: users can even join a pandas DataFrame with data from a MaxCompute table.

Analyzer

The role of the Analyzer is to convert certain operations for specific backends. For example:

Some operations, such as value_counts, are supported natively by pandas, so the pandas backend needs no processing; the ODPS SQL backend has no direct equivalent, so at execution time the Analyzer rewrites the operation into groupby + sort operations;

Some operators cannot be expressed with built-in functions when compiling to ODPS SQL and are rewritten into custom (user-defined) functions.
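The value_counts rewrite can be sketched in plain Python: the rewritten form (group by the value, count each group, sort by count descending) produces the same answer as the native operation:

```python
from collections import Counter

values = ["a", "b", "a", "c", "a", "b"]

# What value_counts gives directly on a pandas-style backend
counts = Counter(values).most_common()

# What the Analyzer's rewrite does conceptually on the SQL backend:
# group by value, count each group, then sort by count descending
groups = {}
for v in values:
    groups[v] = groups.get(v, 0) + 1
rewritten = sorted(groups.items(), key=lambda kv: -kv[1])
```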

ODPS SQL backend

[Figure: ODPS SQL backend compilation]

How does the ODPS SQL backend compile to SQL and then execute? The compiler traverses the expression tree from top to bottom looking for a Join or Union, and compiles each sub-process recursively. When the Engine actually executes, the Analyzer first rewrites the expression tree; sub-processes are compiled top-down, and the SQL clauses are assembled bottom-up. Finally a complete SQL statement is produced, the SQL is submitted, and a task is returned.
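A toy bottom-up compiler conveys the idea: each node contributes its clause, wrapping the SQL already produced for its child (node kinds, dict layout, and generated SQL shape are all simplifications, not PyODPS's real compiler):

```python
# Each parent wraps its child's SQL in a subquery and adds its clause.
def compile_node(node):
    kind = node["op"]
    if kind == "source":
        return f"SELECT * FROM {node['table']}"
    child_sql = compile_node(node["child"])  # recurse bottom-up
    if kind == "filter":
        return f"SELECT * FROM ({child_sql}) t WHERE {node['cond']}"
    if kind == "select":
        cols = ", ".join(node["cols"])
        return f"SELECT {cols} FROM ({child_sql}) t"
    raise ValueError(kind)

tree = {"op": "select", "cols": ["species"],
        "child": {"op": "filter", "cond": "petal_length > 1.5",
                  "child": {"op": "source", "table": "iris"}}}
sql = compile_node(tree)
```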

pandas backend

The pandas backend first visits the expression tree and maps each expression-tree node to a pandas operation. Traversing the whole tree produces a DAG. The Engine executes in DAG topological order, applying one pandas operation after another, and finally obtains a result. In a big data environment, the pandas backend's role is local debugging; when the data volume is small, pandas can be used for the computation directly.
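The topological-order execution can be sketched with a small depth-first topological sort (node names and the dependency layout are illustrative):

```python
# Toy DAG execution order: each node runs only after everything it
# depends on, the way the pandas backend applies one operation per
# expression-tree node.
def topo_order(deps):
    # deps maps node -> list of nodes it depends on
    order, seen = [], set()
    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for d in deps[n]:
            visit(d)
        order.append(n)
    for n in deps:
        visit(n)
    return order

deps = {"source": [], "filter": ["source"], "select": ["filter"]}
order = topo_order(deps)
```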

Difficulties + Pitfalls

Back-end compilation errors easily lose their context: with multiple optimize and analyze passes, it is hard to trace which earlier visitor node caused the problem. Solution: keep each module independent and fully tested;

Bytecode compatibility: MaxCompute only supports executing custom functions under Python 2.7;

SQL execution order.

ML Machine Learning

[Figure: machine learning example]

Machine learning takes DataFrames as both input and output. For example, given an iris DataFrame, first use the name field to create a classification field, then call the split method to divide the data into 60% training data and 40% test data. Next, initialize a RandomForests model (containing one decision tree), call the train method on the training data, call the predict method to produce predictions, and call segments[0] to see the visualized result.
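The 60/40 split step can be illustrated in plain Python (PyODPS ML itself requires a MaxCompute project, so this only shows the idea; the row values are placeholders):

```python
import random

# Shuffle the rows, then cut at 60% for training and 40% for test,
# mirroring the split(0.6) step described above.
random.seed(42)  # fixed seed so the example is reproducible
rows = list(range(100))
random.shuffle(rows)
cut = int(len(rows) * 0.6)
train, test = rows[:cut], rows[cut:]
```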

Future plans

Distributed NumPy, with DataFrame built on a distributed NumPy backend;

In-memory computing to improve the interactive experience;

TensorFlow.

