Use Python to analyze 1.4 billion pieces of data
The Google Ngram Viewer is a fun and useful tool that uses Google's vast trove of data scanned from books to plot changes in word usage over time. For example, here is the word Python (case sensitive):
This image from books.google.com/ngrams… shows how usage of the word 'Python' changes over time.
It is driven by Google's n-gram dataset, which records how often a particular word or phrase appears in Google Books for each year of publication. While this is not complete (it does not include every book ever published!), there are millions of books in the dataset, spanning the period from the 16th century to 2008. The dataset can be downloaded for free from here.
I decided to use Python and my new data loading library PyTubes to see how easy it was to regenerate the plot above.
The 1-gram dataset expands to 27 GB of data on disk, which is a sizeable amount to read into Python. In one big lump, Python can handle gigabytes of data easily, but once that data has been broken apart into individual Python objects and processed, things become much slower and less memory efficient.
In total, these 1.4 billion rows of data (1,430,727,243) are spread across 38 source files, covering 24 million distinct words (24,359,460, including part-of-speech tags, see below), counted over the years 1505-2008.
Things slow down quickly when processing a billion rows of data, and native Python is not optimized for this kind of work. Fortunately, numpy is really good at handling large amounts of data, and with a few simple tricks we can make this analysis feasible.
Handling strings in Python/numpy is complicated. The memory overhead of strings in Python is significant, and numpy can only handle strings of known, fixed length. Since most of the words have different lengths, this is not ideal.
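As a quick illustration of that fixed-length behaviour (a minimal sketch, not code from the original article):

```python
import numpy as np

# numpy stores string arrays with a single fixed-width dtype sized to the
# longest element, so every entry reserves space for the longest word.
words = np.array(["a", "Python", "monty"])
print(words.dtype)  # <U6 -- each element is stored as 6 unicode characters
```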
All code/examples below were run on a 2016 MacBook Pro with 8 GB of RAM. Hardware or a cloud instance with more RAM will perform better.
The 1-gram data is stored in tab-separated files.
Each row contains the following fields: the 1-gram (the word itself), the year of publication, the number of times the word was seen in that year, and the number of distinct books it appeared in.
To generate the chart we want, we only need to know three of these: the word itself, the year, and how many times it was seen.
By extracting just this information, the extra cost of processing variable-length string data is avoided, but we still need to compare string values to work out which rows contain the field we are interested in. This is what pytubes can do:
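The original code block is not reproduced here, so below is a sketch of the kind of pytubes pipeline the article describes, based on my reading of the pytubes examples. The method names (read_files, split, tsv, multi, get, equals, to, ndarray) and the file path are assumptions; treat this as an illustration rather than the author's exact code.

```python
import glob
from os import path

import tubes  # pytubes

# The path below is a placeholder; point it at the downloaded 1-gram files.
FILES = glob.glob(path.expanduser("~/data/ngrams/1gram/googlebooks*"))
WORD = "Python"

one_grams_tube = (
    tubes.Each(FILES)
    .read_files()           # stream the raw bytes of each file
    .split()                # split the stream into lines
    .tsv(headers=False)     # parse each line as tab-separated fields
    .multi(lambda row: (
        row.get(0).equals(WORD.encode("utf-8")),  # 1 if the 1-gram is "Python"
        row.get(1).to(int),                       # year
        row.get(2).to(int),                       # match count
    ))
)

# Materialise the whole stream as a numpy ndarray (~1.4 billion rows, 3 columns).
one_grams = one_grams_tube.ndarray()
```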
After about 170 seconds (just under 3 minutes), one_grams is a numpy array containing almost 1.4 billion rows of data, looking like this (table headers added for illustration):
╒═══════════╤════════╤═════════╕
│ Is_Word   │ Year   │ Count   │
╞═══════════╪════════╪═════════╡
│ 0         │ 1799   │ 2       │
├───────────┼────────┼─────────┤
│ 0         │ 1804   │ 1       │
├───────────┼────────┼─────────┤
│ 0         │ 1805   │ 1       │
├───────────┼────────┼─────────┤
│ 0         │ 1811   │ 1       │
├───────────┼────────┼─────────┤
│ 0         │ 1820   │ ...     │
╘═══════════╧════════╧═════════╛
From here, it is just a matter of using numpy methods to calculate the things we need:
Total word usage per year
Google shows the percentage of occurrences of each word (the number of times the word appears in a given year, divided by the total number of words in that year), which is more useful than the raw counts. To calculate this percentage, we need to know the total number of words per year.
Fortunately, numpy makes this very easy:
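A minimal numpy sketch of that step, assuming the year and count columns for every row (not just the 'Python' rows) have been loaded; the variable names and the tiny stand-in arrays are mine, purely for illustration:

```python
import numpy as np

# Hypothetical stand-ins: in practice `all_years` and `all_counts` are the
# year and count columns for the full ~1.4 billion row dataset.
all_years = np.array([1799, 1804, 1804, 1805], dtype=np.int64)
all_counts = np.array([2, 1, 3, 1], dtype=np.int64)

# Year-indexed totals: year_totals[1804] is the total word count for 1804.
year_totals = np.zeros(2009, dtype=np.int64)
np.add.at(year_totals, all_years, all_counts)  # accumulate counts per year
print(year_totals[1804])  # -> 4
```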
Plotting this shows how many words Google has collected for each year:
It is clear that before 1800 the amount of data falls off rapidly, which distorts the final results and hides the patterns we are interested in. To avoid this problem, we only import data from 1800 onwards:
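The article presumably applied this cut while loading; as a plain-numpy illustration of the same idea, continuing the hypothetical arrays above:

```python
# Boolean mask keeping only rows from 1800 onwards.
keep = all_years >= 1800
all_years, all_counts = all_years[keep], all_counts[keep]
```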
This leaves 1.3 billion rows of data (only 3.7% of rows are from before 1800).
Getting Python's percentage share for each year is now straightforward.
A simple trick is to create arrays indexed by year, 2009 elements long, so that the index of each element equals the year itself; finding 1995, for example, is then just a matter of getting element 1995.
None of which takes much effort at all:
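A sketch of that year-indexed trick, continuing the earlier hypothetical sketches (one_grams as loaded by pytubes, year_totals from the totals step); the column order and variable names are my assumptions:

```python
import numpy as np

# Columns of the filtered ndarray: match flag, year, count.
is_python = one_grams[:, 0] == 1
python_years = one_grams[is_python, 1]
python_counts = one_grams[is_python, 2]

# Year-indexed counts: word_counts[1995] is how often "Python" appeared in 1995.
word_counts = np.zeros(2009, dtype=np.int64)
np.add.at(word_counts, python_years, python_counts)

# Percentage of all words in each year that were "Python"; years with no data
# would divide by zero, so mask them out.
with np.errstate(divide="ignore", invalid="ignore"):
    percentages = np.where(year_totals > 0,
                           100.0 * word_counts / year_totals,
                           0.0)
```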
Plotting word_counts gives the following:
The shape looks similar to Google's version, but the actual percentages do not match. I think that is because the downloaded dataset includes part-of-speech-tagged forms of each word (for example: Python_VERB). This is not explained very well on the Google page and raises several questions: how is 'Python' used as a verb?
Does the total count for 'Python' include 'Python_VERB'? And so on.
Fortunately, the approach used here produces a graph that looks very similar to Google's, and the relative trends are unaffected, so for this exploration I am not going to try to fix that.
Performance
For example, pre-calculating the total word usage per year and storing it in a separate lookup table would save significant time. Likewise, storing the word usage in a separate database/file and indexing the first column would eliminate almost all of the processing time.
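For instance, the per-year totals could be computed once and persisted, so later runs skip the expensive full pass (a minimal sketch reusing the hypothetical year_totals array from above; the file name is arbitrary):

```python
import numpy as np

# One-off: save the per-year totals lookup table to disk.
np.save("year_totals.npy", year_totals)

# Subsequent runs: load it back in milliseconds instead of re-summing 1.4bn rows.
year_totals = np.load("year_totals.npy")
```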
Even so, this exploration shows that with numpy, the fledgling pytubes, standard commodity hardware and Python, it is possible to load, process and extract arbitrary statistics from over a billion rows of data in a reasonable amount of time.
Language Wars
The source data is noisy (it contains all English words ever used, not just mentions of programming languages, and python, for example, has non-technical meanings too!), so to adjust for this we have done two things:
Only the capitalized forms of the names are matched (Python, not python).
Each language's total mention count has been normalized against its average percentage over 1800 to 1960, which should give a reasonable baseline, given that Pascal was first mentioned in 1970.
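A sketch of that baseline adjustment, assuming a year-indexed percentage array (as above) for each language; the variable names are mine and the 1800-1960 slice follows the description here:

```python
# `percentages` is a year-indexed array of mention percentages for one language.
baseline = percentages[1800:1961].mean()   # average over 1800-1960 inclusive
normalised = percentages / baseline        # 1.0 == that language's 1800-1960 average
```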
Results:
Compared to Google (without any baseline adjustment):
Running time: just over 10 minutes
Future pytubes improvements
At this stage, pytubes only has a single integer type, which is 64-bit. This means the numpy arrays generated by pytubes use an i8 dtype for all integers. In some places (like the ngrams data), 64-bit integers are overkill and waste memory (the full ndarray is about 38 GB; smaller dtypes could easily reduce that by 60%). I plan to add 1-, 2- and 4-byte integer support (github.com/stestagg/py… )
More filtering logic - Tube.skip_unless() is a reasonably simple way to filter rows, but it lacks the ability to combine conditions (AND/OR/NOT). In some use cases this would allow the size of the loaded data to be reduced more quickly.
Better string matching - simple tests like startswith, endswith, contains, and is_one_of could easily be added to significantly improve the effectiveness of loading string data.