Use Python to analyze 1.4 billion pieces of data
The Google Ngram Viewer is a fun and useful tool that uses Google's vast trove of data scanned from books to plot changes in word usage over time. For example, here is the word Python (case sensitive):
This image from books.google.com/ngrams… shows how usage of the word 'Python' changes over time.
It is driven by Google's n-gram dataset, which records how often a particular word or phrase appears in Google Books for each year of publication. While this is not complete (it does not include every book ever published!), there are millions of books in the dataset, spanning the period from the 16th century to 2008. The dataset can be downloaded for free from here.
I decided to use Python and my new data loading library PyTubes to see how easy it was to regenerate the plot above.
The 1-gram dataset expands to 27 GB of data on disk, which is a sizeable amount to read into Python. In one big lump, Python can handle gigabytes of data easily, but once that data has been broken apart into individual Python objects and processed, things become much slower and less memory efficient.
In total, these 1.4 billion rows of data (1,430,727,243) are spread across 38 source files, covering 24 million distinct words (24,359,460, including part-of-speech tags, see below), counted over the years 1505-2008.
Things slow down quickly when processing a billion rows of data, and native Python is not optimized for this kind of work. Fortunately, numpy is really good at handling large amounts of data, and with a few simple tricks we can make this analysis feasible.
Handling strings in Python/numpy is complicated. The memory overhead of strings in Python is significant, and numpy can only handle strings of known, fixed length. Since most of the words have different lengths, this is not ideal.
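As a quick illustration of that fixed-length behaviour (a minimal sketch, not code from the original article):

```python
import numpy as np

# numpy stores string arrays with a single fixed-width dtype sized to the
# longest element, so every entry reserves space for the longest word.
words = np.array(["a", "Python", "monty"])
print(words.dtype)  # <U6 -- each element is stored as 6 unicode characters
```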
All code/examples below were run on a 2016 MacBook Pro with 8 GB of RAM. Hardware or a cloud instance with more RAM will perform better.
The 1-gram data is stored in tab-separated files.
Each row contains the following fields: the 1-gram (the word itself), the year of publication, the number of times the word was seen in that year, and the number of distinct books it appeared in.
To generate the chart we want, we only need to know three of these: the word itself, the year, and how many times it was seen.
By extracting just this information, the extra cost of processing variable-length string data is avoided, but we still need to compare string values to work out which rows contain the field we are interested in. This is what pytubes can do:
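The original code block is not reproduced here, so below is a sketch of the kind of pytubes pipeline the article describes, based on my reading of the pytubes examples. The method names (read_files, split, tsv, multi, get, equals, to, ndarray) and the file path are assumptions; treat this as an illustration rather than the author's exact code.

```python
import glob
from os import path

import tubes  # pytubes

# The path below is a placeholder; point it at the downloaded 1-gram files.
FILES = glob.glob(path.expanduser("~/data/ngrams/1gram/googlebooks*"))
WORD = "Python"

one_grams_tube = (
    tubes.Each(FILES)
    .read_files()           # stream the raw bytes of each file
    .split()                # split the stream into lines
    .tsv(headers=False)     # parse each line as tab-separated fields
    .multi(lambda row: (
        row.get(0).equals(WORD.encode("utf-8")),  # 1 if the 1-gram is "Python"
        row.get(1).to(int),                       # year
        row.get(2).to(int),                       # match count
    ))
)

# Materialise the whole stream as a numpy ndarray (~1.4 billion rows, 3 columns).
one_grams = one_grams_tube.ndarray()
```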
After about 170 seconds (just under 3 minutes), one_grams is a numpy array containing almost 1.4 billion rows of data, looking like this (table headers added for illustration):
╒═══════════╤════════╤═════════╕
│ Is_Word   │ Year   │ Count   │
╞═══════════╪════════╪═════════╡
│ 0         │ 1799   │ 2       │
├───────────┼────────┼─────────┤
│ 0         │ 1804   │ 1       │
├───────────┼────────┼─────────┤
│ 0         │ 1805   │ 1       │
├───────────┼────────┼─────────┤
│ 0         │ 1811   │ 1       │
├───────────┼────────┼─────────┤
│ 0         │ 1820   │ ...     │
╘═══════════╧════════╧═════════╛
From here, it is just a matter of using numpy methods to calculate the things we need:
Total word usage per year
Google shows the percentage of occurrences of each word (the number of times the word appears in a given year, divided by the total number of words in that year), which is more useful than the raw counts. To calculate this percentage, we need to know the total number of words per year.
Fortunately, numpy makes this very easy:
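A minimal numpy sketch of that step, assuming the year and count columns for every row (not just the 'Python' rows) have been loaded; the variable names and the tiny stand-in arrays are mine, purely for illustration:

```python
import numpy as np

# Hypothetical stand-ins: in practice `all_years` and `all_counts` are the
# year and count columns for the full ~1.4 billion row dataset.
all_years = np.array([1799, 1804, 1804, 1805], dtype=np.int64)
all_counts = np.array([2, 1, 3, 1], dtype=np.int64)

# Year-indexed totals: year_totals[1804] is the total word count for 1804.
year_totals = np.zeros(2009, dtype=np.int64)
np.add.at(year_totals, all_years, all_counts)  # accumulate counts per year
print(year_totals[1804])  # -> 4
```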
Plotting this shows how many words Google has collected for each year:
It is clear that before 1800 the amount of data falls off rapidly, which distorts the final results and hides the patterns we are interested in. To avoid this problem, we only import data from 1800 onwards:
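The article presumably applied this cut while loading; as a plain-numpy illustration of the same idea, continuing the hypothetical arrays above:

```python
# Boolean mask keeping only rows from 1800 onwards.
keep = all_years >= 1800
all_years, all_counts = all_years[keep], all_counts[keep]
```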
This leaves 1.3 billion rows of data (only 3.7% of rows are from before 1800).
Getting Python's percentage share for each year is now straightforward.
A simple trick is to create arrays indexed by year, 2009 elements long, so that the index of each element equals the year itself; finding 1995, for example, is then just a matter of getting element 1995.
None of which takes much effort at all:
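A sketch of that year-indexed trick, continuing the earlier hypothetical sketches (one_grams as loaded by pytubes, year_totals from the totals step); the column order and variable names are my assumptions:

```python
import numpy as np

# Columns of the filtered ndarray: match flag, year, count.
is_python = one_grams[:, 0] == 1
python_years = one_grams[is_python, 1]
python_counts = one_grams[is_python, 2]

# Year-indexed counts: word_counts[1995] is how often "Python" appeared in 1995.
word_counts = np.zeros(2009, dtype=np.int64)
np.add.at(word_counts, python_years, python_counts)

# Percentage of all words in each year that were "Python"; years with no data
# would divide by zero, so mask them out.
with np.errstate(divide="ignore", invalid="ignore"):
    percentages = np.where(year_totals > 0,
                           100.0 * word_counts / year_totals,
                           0.0)
```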
Plotting word_counts gives the following:
The shape looks similar to Google's version, but the actual percentages do not match. I think that is because the downloaded dataset includes part-of-speech-tagged forms of each word (for example: Python_VERB). This is not explained very well on the Google page and raises several questions: how is 'Python' used as a verb?
Does the total count for 'Python' include 'Python_VERB'? And so on.
Fortunately, the approach used here produces a graph that looks very similar to Google's, and the relative trends are unaffected, so for this exploration I am not going to try to fix that.
Performance
For example, pre-calculating the total word usage per year and storing it in a separate lookup table would save significant time. Likewise, storing the word usage in a separate database/file and indexing the first column would eliminate almost all of the processing time.
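For instance, the per-year totals could be computed once and persisted, so later runs skip the expensive full pass (a minimal sketch reusing the hypothetical year_totals array from above; the file name is arbitrary):

```python
import numpy as np

# One-off: save the per-year totals lookup table to disk.
np.save("year_totals.npy", year_totals)

# Subsequent runs: load it back in milliseconds instead of re-summing 1.4bn rows.
year_totals = np.load("year_totals.npy")
```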
Even so, this exploration shows that with numpy, the fledgling pytubes, standard commodity hardware and Python, it is possible to load, process and extract arbitrary statistics from over a billion rows of data in a reasonable amount of time.
Language Wars
The source data is noisy (it contains all English words ever used, not just mentions of programming languages, and python, for example, has non-technical meanings too!), so to adjust for this we have done two things:
Only the capitalized forms of the names are matched (Python, not python).
Each language's total mention count has been normalized against its average percentage over 1800 to 1960, which should give a reasonable baseline, given that Pascal was first mentioned in 1970.
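A sketch of that baseline adjustment, assuming a year-indexed percentage array (as above) for each language; the variable names are mine and the 1800-1960 slice follows the description here:

```python
# `percentages` is a year-indexed array of mention percentages for one language.
baseline = percentages[1800:1961].mean()   # average over 1800-1960 inclusive
normalised = percentages / baseline        # 1.0 == that language's 1800-1960 average
```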
Results:
Compared to Google (without any baseline adjustment):
Running time: just over 10 minutes
Future pytubes improvements
At this stage, pytubes only has a single integer type, which is 64-bit. This means the numpy arrays generated by pytubes use an i8 dtype for all integers. In some places (like the ngrams data), 64-bit integers are overkill and waste memory (the full ndarray is about 38 GB; smaller dtypes could easily reduce that by 60%). I plan to add 1-, 2- and 4-byte integer support (github.com/stestagg/py… )
More filtering logic - Tube.skip_unless() is a reasonably simple way to filter rows, but it lacks the ability to combine conditions (AND/OR/NOT). In some use cases this would allow the size of the loaded data to be reduced more quickly.
Better string matching - simple tests like startswith, endswith, contains, and is_one_of could easily be added to significantly improve the effectiveness of loading string data.