Big data full-stack development language – Python

Mar 29, 2017 pm 03:51 PM
Some time ago, ThoughtWorks held a community event in Shenzhen that included a talk titled "Fullstack JavaScript". The idea was that JavaScript can be used for front-end, server-side, and even database (MongoDB) development, so a web application developer needs to learn only one language to implement an entire application.

Inspired by this, I realized that Python can be called a full-stack development language for big data, because it is a popular language in cloud infrastructure, DevOps, big data processing, and related fields.

Field                 Popular languages
Cloud infrastructure  Python, Java, Go
DevOps                Python, Shell, Ruby, Go
Web crawler           Python, PHP, C++
Data processing       Python, R, Scala

Just as you can write a complete web application knowing only JavaScript, you can build a complete big data processing platform knowing only Python.

Cloud infrastructure

These days, if we don’t support cloud platforms, massive data, or dynamic scaling, we don’t dare to say that we do big data. At most, we dare to tell others that we do business intelligence (BI).

Cloud platforms are divided into private clouds and public clouds. OpenStack, the popular private cloud platform, is written in Python. CloudStack, once its challenger, stressed at launch that it was written in Java and claimed this as an advantage over Python. Yet at the beginning of 2015, Citrix, the company behind CloudStack, announced that it would join the OpenStack Foundation, and CloudStack appeared to be nearing its end.

If building your own private cloud is too much trouble, use a public cloud. AWS, GCE, Azure, Alibaba Cloud, and Qingyun all provide Python SDKs; GCE provides only Python and JavaScript SDKs, and Qingyun provides only a Python SDK. Clearly, cloud platforms attach great importance to Python.

No discussion of infrastructure is complete without Hadoop. Today Hadoop is no longer the first choice for big data processing, because its MapReduce engine is not fast enough. However, two other Hadoop components, HDFS and YARN, are becoming more and more popular. Hadoop is developed in Java and has no official Python support, but many third-party libraries wrap Hadoop's APIs (pydoop, hadoopy, etc.).
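To make the MapReduce model concrete, here is a minimal sketch of a word count in plain Python. It only imitates the map, shuffle, and reduce phases in a single process; it does not use Hadoop, pydoop, or hadoopy, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are chosen here purely for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    sample = ["big data with Python", "Python for big data"]
    print(word_count(sample))
```

In a real cluster the map and reduce steps run on different machines and the framework performs the shuffle; the sketch only shows the shape of the computation.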

The replacement for Hadoop MapReduce is Spark, which is said to be up to 100 times faster. Spark is written in Scala but provides development interfaces for Scala, Java, and Python; it would be unreasonable to try to win over the many data scientists who develop in Python without supporting Python. HDFS alternatives such as GlusterFS and Ceph provide Python support directly. Mesos, an alternative to YARN, is implemented in C++ and also provides support packages for Java and Python.

DevOps

In Chinese, DevOps is sometimes rendered as "developers running their own operations". In the Internet era, we can remain competitive only by testing new ideas quickly and delivering business value safely and reliably as early as possible. The technical practices DevOps advocates, such as automated build, test, and deployment and system measurement, are indispensable in this era.

How automated builds work depends on the application. For a Python application, tools such as setuptools, pip, virtualenv, tox, and flake8 make automated builds very simple. Moreover, because almost all Linux systems ship with a Python interpreter, using Python for automation does not require installing any extra software on the system first.

For automated testing, the Python-based Robot Framework is a favorite among enterprise applications, and it is language-agnostic. Cucumber also has many supporters, and its Python counterpart Lettuce can do exactly the same thing. Locust is also receiving more and more attention in automated performance testing.

Among automated configuration management tools, older ones such as Chef and Puppet are developed in Ruby and still maintain strong momentum. However, the new generation, Ansible and SaltStack (both developed in Python), is more lightweight and is welcomed by more and more developers, which has begun to put considerable pressure on its predecessors.

In system monitoring and measurement, the traditional Nagios is gradually declining, newcomers such as Sensu are well received, and New Relic, delivered as a cloud service, has become the standard for startups. None of these is implemented in Python, but integrating Python with them is not difficult.

In addition to the above tools, PaaS platforms based on Python that provide complete DevOps functions, such as Cloudify and Deis, have not yet become popular, but they have already received a lot of attention.

Web Crawler

Where does the data in big data come from? Apart from the few companies able to generate large amounts of data themselves, most have to rely on crawlers to collect data from the Internet for analysis.

Web crawling is one of Python's traditional strengths. The most popular crawler framework Scrapy, the HTTP toolkit urllib2, the HTML parsing tool Beautiful Soup, and the XML parser lxml are all libraries that can stand on their own.
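As a rough illustration of the parsing half of a crawler, the sketch below pulls links out of an HTML string. It deliberately uses only the standard library's html.parser and urllib.parse rather than Scrapy, Beautiful Soup, or lxml, so it runs without installing anything; the class name LinkExtractor and the helper extract_links are names made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors that have no usable href attribute.
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all absolute link targets found in the given HTML text."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    page = '<a href="/a">A</a> <a href="http://other.example/b">B</a>'
    print(extract_links(page, "http://example.com/index.html"))
```

A dedicated library copes far better with malformed markup; the point here is only that fetching a page and extracting its links takes a few lines.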

However, a web crawler is not as simple as opening web pages and parsing HTML. An efficient crawler must support a large number of flexible concurrent operations and often has to fetch thousands or even tens of thousands of pages at the same time. The traditional thread pool approach is wasteful: once the number of threads reaches the thousands, system resources are mostly spent on thread scheduling. Because Python supports coroutines well, many concurrency libraries have been built on them, such as Gevent and Eventlet, along with distributed task frameworks such as Celery. ZeroMQ, which is considered more efficient than AMQP, also offered a Python version early on. With support for high concurrency, web crawlers can truly reach the scale of big data.
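The coroutine approach can be sketched with the standard library's asyncio, used here in place of Gevent or Eventlet so that nothing needs to be installed. The fake_fetch function is a hypothetical stand-in that only sleeps instead of making an HTTP request, and the URLs and the concurrency limit are made-up values.

```python
import asyncio


async def fake_fetch(url, delay=0.01):
    """Hypothetical stand-in for an HTTP request: it only sleeps."""
    await asyncio.sleep(delay)
    return url, len(url)


async def crawl(urls, max_concurrency=100):
    """Fetch all URLs concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        async with semaphore:
            return await fake_fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


def run_crawl(count=1000):
    """Drive the crawl over `count` example URLs on a single thread."""
    urls = [f"http://example.com/page/{i}" for i in range(count)]
    return asyncio.run(crawl(urls))


if __name__ == "__main__":
    results = run_crawl()
    print(len(results), "pages fetched")
```

All the simulated requests share one thread; each coroutine gives up control while it waits, which is why thousands of them cost far less than thousands of threads.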

The captured data then needs word segmentation, and Python holds its own here as well. The well-known natural language processing package NLTK and Jieba, which specializes in Chinese word segmentation, are both powerful tools for the job.
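To give a feel for what Chinese word segmentation involves, here is a minimal sketch of forward maximum matching, one of the simplest dictionary-based approaches. It is not how NLTK or Jieba work internally (real libraries use more sophisticated methods), and the tiny dictionary is a made-up example.

```python
def forward_max_match(text, dictionary, max_word_len=None):
    """Segment text greedily, always taking the longest dictionary word.

    Characters that start no dictionary word are emitted one at a time.
    """
    if max_word_len is None:
        max_word_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for size in range(min(max_word_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                pos += size
                break
    return tokens


if __name__ == "__main__":
    # Toy dictionary, assumed for illustration only.
    words = {"大数据", "数据", "全栈", "开发", "语言"}
    print(forward_max_match("大数据全栈开发语言", words))
```

Greedy matching breaks down on ambiguous text, which is one reason practical segmenters combine a dictionary with statistical models.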

Data processing

Everything is ready except the east wind, and that east wind is the data processing algorithm. From statistical theory to data mining, machine learning, and the deep learning theory proposed in recent years, data science is in an era where a hundred flowers bloom. What programming languages do data scientists use?

In theoretical research, R may be the most popular language among data scientists, but its problems are also obvious. Because R was created by statisticians, its syntax is somewhat unusual. Moreover, R still has a long way to go on the engineering side before it can support large-scale distributed systems. Therefore, many companies use R for prototyping and, once the algorithm is settled, translate it into an engineering language.

Python is also one of data scientists' favorite languages. Unlike R, Python is itself an engineering language, so algorithms that data scientists implement in Python can go directly into products, which helps big data startups save costs. It is precisely because data scientists love Python and R that Spark provides very good support for both languages.

Python has many data processing libraries. The high-performance scientific computing libraries NumPy and SciPy lay a solid foundation for more advanced algorithms. matplotlib makes plotting in Python as easy as in Matlab. Scikit-learn and Milk implement many machine learning algorithms. Theano uses GPU acceleration to achieve high-performance symbolic math and multi-dimensional matrix computation, and Pylearn2, built on top of Theano, is an important member of the deep learning field. Of course, there is also Pandas, a big data processing library widely used in engineering. Its DataFrame design was borrowed from R and later inspired the Spark project to implement a similar mechanism.

By the way, there is also IPython. This tool is so useful that I had almost come to regard it as part of the standard library and forgot to introduce it. IPython is an interactive Python environment that lets you see the result of each piece of Python code in real time. By default IPython runs on the command line, and you can execute ipython notebook to run it in a web page. Figures drawn with matplotlib can be displayed inline in IPython Notebook.
IPython Notebook files can be shared with other people so that they can reproduce your results in their own environment; if they do not have a running environment, the notebook can also be converted directly to HTML or PDF.

Why Python

It is precisely because application development engineers, operation and maintenance engineers, and data scientists all like Python that Python has become a full-stack development language for big data systems.

For development engineers, Python's elegance and simplicity are undoubtedly its biggest attraction. In the Python interactive environment, execute import this and read the Zen of Python, and you will understand why Python is so attractive. The Python community has always been very active. Unlike the explosive growth of packages in the NodeJS community, the number of Python packages has grown at a relatively steady rate, and their quality is relatively high. Many people criticize Python for being too strict about whitespace, but it is precisely this requirement that gives Python an advantage over other languages on large projects. The OpenStack projects, which total more than 2 million lines of code, are evidence of this.

For operations engineers, Python's biggest advantage is that almost all Linux distributions ship with a Python interpreter. Shell is powerful, but its syntax is not elegant, and writing more complex tasks in it is painful. Replacing Shell with Python for such tasks is a liberation for operations staff.
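As a small example of the kind of task that gets awkward in Shell, the sketch below walks a directory tree and reports files above a size threshold, largest first. It uses only the standard library; the function name find_large_files and the /var/log path and 10 MB threshold in the usage section are assumptions made for illustration.

```python
from pathlib import Path


def find_large_files(root, min_bytes):
    """Return (path, size) pairs for files of at least min_bytes, largest first."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            size = path.stat().st_size
            if size >= min_bytes:
                hits.append((path, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Example path and threshold; adjust both for the system at hand.
    for path, size in find_large_files("/var/log", 10 * 1024 * 1024):
        print(f"{size:>12}  {path}")
```

Because the result is an ordinary list of tuples, filtering, reporting, or cleanup logic can be layered on without the quoting and parsing issues of a Shell pipeline.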

For data scientists, Python is simple yet powerful. Compared with C/C++, it requires far less low-level work, so models can be validated quickly; compared with Java, its syntax is concise and expressive, and the same work takes only about a third of the code; compared with Matlab and Octave, it is more mature as an engineering language. More than one programming expert has said that Python is the most suitable language for university computer science courses (MIT's introductory computer science course uses Python), because it lets people learn the most important thing about programming: how to solve problems.

By the way, Microsoft took part in PyCon 2015 and made a high-profile announcement that it would improve the Python programming experience on Windows, including Python support in Visual Studio and better compilation of Python C extensions on Windows. Imagine a future in which Python becomes a default component of Windows.
