
PHP large data volume and massive data processing algorithm summary_PHP tutorial

WBOY · Original · 2016-07-21 15:29:37

The following is a general summary of the methods I use to process massive data. These methods may not cover every problem, but they can handle the vast majority of cases you will encounter. Some of the questions below come directly from companies' written interview tests. The methods are not necessarily optimal; if you have a better approach, you are welcome to discuss it with me.

1.Bloom filter

Scope of application: can be used to implement a data dictionary, to detect duplicate data, or to compute set intersections

Basic principles and key points:
The principle is very simple: a bit array plus k independent hash functions. To insert an element, set the bits at the positions given by the hash functions to 1. During lookup, if all the bits at those positions are 1, the element is considered present. Obviously this process does not guarantee that the lookup result is 100% correct. Deleting an inserted keyword is also not supported, because the bits for that keyword may be shared by other keywords. A simple improvement is the counting Bloom filter, which replaces the bit array with an array of counters and thus supports deletion.

Another important question is how to determine the size of the bit array m and the number of hash functions k from the number of input elements n. The error rate is minimized when k = (ln 2) * (m / n). For an error rate no greater than E, m must be at least n * lg(1/E) to represent any set of n elements. But m should be larger still, because at least half of the bit array should remain 0; this gives m >= n * lg(1/E) * lg e, which is roughly 1.44 times n * lg(1/E) (lg denotes the logarithm base 2).

For example, if we assume the error rate is 0.01, then m should be approximately 13 times n. So k is about 8.

Note that m and n use different units here: m is measured in bits, while n is measured in number of elements (to be precise, the number of distinct elements). Since a single element is usually many bits long, using a Bloom filter usually saves memory.
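To make the formulas concrete, here is a minimal PHP sketch of a Bloom filter (not code from the original article): it sizes m and k from n and the error rate E as described above, and derives the k positions from two base hashes. The class name and the choice of crc32/md5 are only for illustration, and the sketch assumes 64-bit PHP so that crc32() returns a non-negative value.

<?php
// Minimal Bloom filter sketch (illustrative only, not production code).
class BloomFilter
{
    private $bits; // bit array stored as a string of bytes
    private $m;    // number of bits
    private $k;    // number of hash functions

    public function __construct(int $n, float $errorRate)
    {
        // m ~ 1.44 * n * lg(1/E), k = (ln 2) * (m / n), as discussed above
        $this->m = (int)ceil(1.44 * $n * log(1 / $errorRate, 2));
        $this->k = max(1, (int)round(($this->m / $n) * log(2)));
        $this->bits = str_repeat("\0", (int)ceil($this->m / 8));
    }

    // Derive k positions from two base hashes (assumes 64-bit PHP).
    private function positions(string $key): array
    {
        $h1 = crc32($key);
        $h2 = hexdec(substr(md5($key), 0, 8));
        $pos = [];
        for ($i = 0; $i < $this->k; $i++) {
            $pos[] = ($h1 + $i * $h2) % $this->m;
        }
        return $pos;
    }

    public function add(string $key): void
    {
        foreach ($this->positions($key) as $p) {
            $byte = intdiv($p, 8);
            $this->bits[$byte] = chr(ord($this->bits[$byte]) | (1 << ($p % 8)));
        }
    }

    public function mightContain(string $key): bool
    {
        foreach ($this->positions($key) as $p) {
            if ((ord($this->bits[intdiv($p, 8)]) & (1 << ($p % 8))) === 0) {
                return false; // definitely not present
            }
        }
        return true; // probably present (false positives are possible)
    }
}

For example, new BloomFilter(1000000, 0.01) allocates about 9.6 million bits (1.44 * n * lg(1/E)) and uses 7 hash positions per key, following the formulas above.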

Extension:
A Bloom filter maps the elements of a set onto a bit array and uses k bits (where k is the number of hash functions) to indicate whether an element is in the set. A counting Bloom filter (CBF) expands each bit of the array into a counter, thereby supporting element deletion. A spectral Bloom filter (SBF) associates the counters with the number of occurrences of each element, using the minimum counter value to approximate an element's frequency.

Problem example: You are given two files, A and B, each storing 5 billion URLs. Each URL occupies 64 bytes and the memory limit is 4 GB. Find the URLs common to files A and B. What if there are three or even n files?

Based on this problem, let's calculate the memory usage. 4 GB = 2^32 bytes, about 4 billion bytes, or roughly 34 billion bits. With n = 5 billion and an error rate of 0.01, we would need about 65 billion bits. The 34 billion bits available is not far off, although it will increase the error rate somewhat. In addition, if these URLs correspond one to one with IP addresses, they can be converted into IPs, which makes things much simpler.

2.Hashing

Scope of application: Basic data structure for quick search and deletion, usually the total amount of data can be put into memory

Basic principles and key points:
Choice of hash function: for strings, integers, permutations and so on, there are specific corresponding hash methods.
Collision handling: one approach is open hashing, also known as the chaining (zipper) method; the other is closed hashing, also known as open addressing.

Extension:
The d in d-left hashing means multiple. Let's simplify the problem and look at 2-left hashing first. 2-left hashing divides a hash table into two halves of equal length, called T1 and T2, and equips T1 and T2 with hash functions h1 and h2 respectively. When storing a new key, both hash functions are evaluated, producing two addresses h1[key] and h2[key]. You then check position h1[key] in T1 and position h2[key] in T2 to see which already holds more (colliding) keys, and store the new key in the less loaded position. If both sides are equally loaded, for example both positions are empty or both hold one key, the new key is stored in the left subtable T1; this is where "2-left" comes from. When looking up a key, you must hash twice and check both positions.

Problem examples:
1). From massive log data, extract the IP with the most visits to Baidu on a certain day.

The number of IPs is still limited, at most 2^32, so you can consider using a hash to store the IPs directly in memory and then compute the statistics.
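A minimal PHP sketch of this hash-counting idea, assuming a hypothetical log file access.log whose first whitespace-separated field on each line is the IP:

<?php
// Count hits per IP with a hash table (a PHP array), then take the top one.
$counts = [];
$fp = fopen('access.log', 'r');
while (($line = fgets($fp)) !== false) {
    $ip = strtok($line, " \t");          // assume the IP is the first field
    if ($ip !== false) {
        $counts[$ip] = ($counts[$ip] ?? 0) + 1;
    }
}
fclose($fp);

arsort($counts);                         // sort by count, descending
$topIp = array_key_first($counts);       // the IP with the most visits
echo $topIp, ' => ', $counts[$topIp], PHP_EOL;

If the log is too large for all distinct IPs to fit in memory at once, the same idea can be applied after first splitting the log into smaller files by a hash of the IP, as discussed in the classic problem analysis at the end of this article.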

3.bit-map

Scope of application: quick lookup, duplicate detection, and deletion of data. Generally, the range of the data is within about 10 times that of an int

Basic principles and key points: use a bit array to indicate whether certain elements exist, for example the set of 8-digit phone numbers

Extensions: bloom filter can be seen as an extension of bit-map

Problem examples:

1) A file is known to contain some phone numbers, each 8 digits long; count the number of distinct numbers.

The largest 8-digit number is 99,999,999, so about 100 million bits are needed, which is a little over 12 MB of memory. (A bit-map sketch for this example appears after example 2 below.)

2) Find the number of non-repeating integers among 250 million integers. The memory space is not enough to accommodate these 250 million integers.

Expand the bit-map and use 2 bits per number: 00 means the number has not appeared, 01 means it has appeared once, and 10 means it has appeared twice or more. Alternatively, instead of using 2-bit fields directly, we can use two separate bit-maps to simulate this 2-bit-map.
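For example 1) above, a minimal PHP sketch of the bit-map idea; the file name phones.txt is hypothetical and is assumed to contain one 8-digit number per line:

<?php
// Mark each 8-digit number in a bit array of 100,000,000 bits (~12.5 MB),
// counting a number only the first time it is seen.
$max = 100000000;                               // 8-digit numbers: 0 .. 99,999,999
$bitmap = str_repeat("\0", intdiv($max, 8) + 1);
$distinct = 0;

$fp = fopen('phones.txt', 'r');
while (($line = fgets($fp)) !== false) {
    $n = (int)trim($line);
    $byte = intdiv($n, 8);
    $mask = 1 << ($n % 8);
    if ((ord($bitmap[$byte]) & $mask) === 0) {  // first occurrence of this number
        $bitmap[$byte] = chr(ord($bitmap[$byte]) | $mask);
        $distinct++;
    }
}
fclose($fp);

echo "distinct numbers: $distinct\n";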

4. Heap

Scope of application: finding the top n of massive data, where n is relatively small, so the heap can fit into memory

Basic principles and key points: use a max-heap to find the n smallest elements and a min-heap to find the n largest. For example, to find the n smallest, compare the current element with the largest element in the max-heap; if it is smaller, replace that largest element. The n elements left at the end are the n smallest. This works well for large amounts of data when n is relatively small, because all of the top n elements can be obtained in a single scan, which is very efficient.

Extension: Dual heap, a max heap combined with a min heap, can be used to maintain the median.

Problem examples:
1) Find the largest 100 numbers among 1 million numbers.

Use a minimum heap of 100 elements.
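A minimal PHP sketch of this, using SplMinHeap; the function name and the input iterable are hypothetical:

<?php
// Keep a min-heap of size 100 while scanning; whatever survives is the top 100.
function top100(iterable $numbers): array
{
    $heap = new SplMinHeap();
    foreach ($numbers as $x) {
        if ($heap->count() < 100) {
            $heap->insert($x);
        } elseif ($x > $heap->top()) {   // larger than the smallest number kept so far
            $heap->extract();            // drop the current minimum
            $heap->insert($x);
        }
    }
    return iterator_to_array($heap);     // the 100 largest, smallest first
}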

5. Double-layer bucket division ---- in essence this is the idea of divide and conquer, with the emphasis on the technique of "dividing"!

Applicable range: kth largest, median, non-repetitive or repeating numbers

Basic principles and key points: because the range of the elements is too large to use a direct addressing table, the range is narrowed down step by step through multiple rounds of division until it is small enough to handle directly. The range can be narrowed multiple times; two layers is just one example.

Extension:

Problem example:
1).Find the number of unique integers among 250 million integers. The memory space is not enough to accommodate these 250 million integers.

This is a bit like the pigeonhole principle. There are 2^32 possible integers, so we can divide these 2^32 numbers into 2^8 regions (for example, using one file per region), distribute the data into the different regions, and then handle each region directly with a bitmap. In other words, as long as there is enough disk space, it can be solved easily.

2) Given 50 million ints, find their median.

This example is clearer than the one above. First divide the int range into 2^16 regions, then read the data and count how many numbers fall into each region. From these counts we can determine which region contains the median, and also which rank within that region the median occupies. In a second scan, we count only the numbers that fall into that region. (A sketch of this two-pass approach follows the next paragraph.)

In fact, if the numbers are int64 rather than int, three such divisions reduce the problem to an acceptable size: first divide the int64 range into 2^24 regions and determine which region contains the target number, then divide that region into 2^20 sub-regions and determine which sub-region contains it; that sub-region holds only 2^20 distinct values, so a direct addressing table can be used for the final count.
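A rough PHP sketch of the two-pass approach for example 2), assuming for simplicity that the values are non-negative 32-bit ints stored one per line in a hypothetical file ints.txt:

<?php
// Pass 1: bucket by the high 16 bits and find which bucket holds the median.
$counts = array_fill(0, 1 << 16, 0);
$total  = 0;

$fp = fopen('ints.txt', 'r');
while (($line = fgets($fp)) !== false) {
    $v = (int)trim($line);
    $counts[$v >> 16]++;                 // high 16 bits -> one of 2^16 buckets
    $total++;
}
fclose($fp);

$target = intdiv($total + 1, 2);         // rank of the (lower) median
$seen = 0;
foreach ($counts as $bucket => $c) {
    if ($seen + $c >= $target) {
        $rankInBucket = $target - $seen; // the median is the k-th smallest in this bucket
        echo "median is element #$rankInBucket of bucket $bucket\n";
        // Pass 2 would re-read the file, keep only values with ($v >> 16) == $bucket,
        // sort them in memory, and take the $rankInBucket-th smallest.
        break;
    }
    $seen += $c;
}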

6. Database index

Scope of application: addition, deletion, modification and query of large amounts of data

Basic principles and key points: use the design and implementation methods of databases (indexes) to handle the insertion, deletion, modification, and query of massive data.
Extensions:
Problem examples:


7. Inverted index (Inverted index)

Scope of application: search engines, keyword queries

Basic principles and key points: why is it called an inverted index? It is an indexing method used in full-text search to store, for each word, a mapping to its locations in a document or a set of documents.

Taking English as an example, the following is the text to be indexed:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
We can get the following inverted file index:
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
A query for "what", "is" and "it" then corresponds to the intersection of these sets.

The forward index was developed to store a list of words for each document. Forward index queries suit ordered, frequent full-text queries over each document and verification of the words a document contains. In a forward index the document occupies the central position: each document points to the sequence of words (index entries) it contains. The inverted index reverses this relationship: each word points to the documents that contain it.
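A small PHP sketch that builds the inverted index for the three example texts above and answers the query "what is it" by intersecting the posting lists (an in-memory illustration only):

<?php
$docs = [
    0 => "it is what it is",
    1 => "what is it",
    2 => "it is a banana",
];

// Build the inverted index: word => set of document ids.
$index = [];
foreach ($docs as $id => $text) {
    foreach (array_unique(explode(' ', $text)) as $word) {
        $index[$word][$id] = true;
    }
}

// Intersect the posting lists of the query terms.
$query = ['what', 'is', 'it'];
$result = null;
foreach ($query as $word) {
    $postings = array_keys($index[$word] ?? []);
    $result = $result === null
        ? $postings
        : array_values(array_intersect($result, $postings));
}
print_r($result);    // [0, 1], matching the sets above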

Extension:

Problem example: Document retrieval system, query which documents contain a certain word, such as keyword search for common academic papers.

8. External sorting

Scope of application: sorting of big data, deduplication

Basic principles and key points: the merge step of external sorting, replacement selection with a loser tree, and the optimal merge tree

Extension:

Problem example:
1). There is a 1 GB file in which each line is a word; no word exceeds 16 bytes, and the memory limit is 1 MB. Return the 100 most frequent words.

This data has an obvious characteristic: words are at most 16 bytes, but with only 1 MB of memory a hash table will not fit, so external sorting can be used instead, with the memory serving as an input buffer.
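A minimal PHP sketch of the merge step, assuming the 1 GB file has already been split into sorted run files run0.txt, run1.txt, ... (hypothetical names), each produced by sorting a chunk that fits in memory. Only one line per run is held in memory at a time:

<?php
// Min-heap over the current front line of each run (SplHeap is a max-heap
// on compare(), so the comparison is inverted here).
class RunHeap extends SplHeap
{
    protected function compare($a, $b): int
    {
        return strcmp($b[0], $a[0]);     // smallest word on top
    }
}

$runFiles = glob('run*.txt');            // hypothetical sorted run files
$handles = array_map(function ($f) { return fopen($f, 'r'); }, $runFiles);

$heap = new RunHeap();
foreach ($handles as $i => $fp) {
    if (($line = fgets($fp)) !== false) {
        $heap->insert([rtrim($line, "\n"), $i]);
    }
}

$out = fopen('sorted.txt', 'w');
while (!$heap->isEmpty()) {
    [$word, $i] = $heap->extract();      // smallest remaining word
    fwrite($out, $word . "\n");
    if (($line = fgets($handles[$i])) !== false) {
        $heap->insert([rtrim($line, "\n"), $i]);
    }
}
fclose($out);

Counting the frequency of each word is then a single scan of sorted.txt, since equal words are now adjacent; a small min-heap of size 100 (as in section 4) picks out the 100 most frequent.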

9.trie tree

Applicable scope: large amount of data with many repetitions, but the number of distinct items is small enough to fit into memory

Basic principles and key points: the implementation method, and how to represent a node's children

Extension: compression implementation.

Problem example:
1). There are 10 files, each 1 GB in size. Each line of each file stores a user query, and the queries in each file may be repeated. Sort the queries by frequency.

2). There are 10 million strings, some of which are duplicates. Remove all duplicates and keep only the distinct strings. How would you design and implement this? (A trie sketch follows this list.)

3). Find popular queries: the query strings have a relatively high degree of repetition; although the total number is 10 million, after removing duplicates there are no more than 3 million of them, and each query string is at most 255 bytes.
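For example 2), a minimal PHP trie sketch; it assumes the character '#' does not occur in the strings, since '#' is used here as the end-of-string marker:

<?php
class Trie
{
    private $root = [];

    // Returns true if $s was newly inserted, false if it was already present.
    public function insert(string $s): bool
    {
        $node = &$this->root;
        for ($i = 0, $len = strlen($s); $i < $len; $i++) {
            $c = $s[$i];
            if (!isset($node[$c])) {
                $node[$c] = [];
            }
            $node = &$node[$c];
        }
        $isNew = !isset($node['#']);      // '#' marks the end of a stored string
        $node['#'] = true;
        return $isNew;
    }
}

$trie = new Trie();
foreach (['abc', 'abd', 'abc'] as $s) {
    if ($trie->insert($s)) {
        echo $s, PHP_EOL;                 // prints each distinct string once
    }
}

For example 3), the same structure can store a counter at each end-of-string marker instead of a boolean, turning the trie into a frequency table for the queries.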

10. Distributed processing mapreduce

Scope of application: large amount of data, but the number of distinct items is small enough to fit into memory

Basic principles and key points: hand the data to different machines for processing, partition the data (map), and combine the results (reduce).

Extension:

Problem example:

1). The canonical example application of MapReduce is a process to count the appearances of each different word in a set of documents:

void map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    EmitIntermediate(w, 1);

void reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  int result = 0;
  for each v in partialCounts:
    result += ParseInt(v);
  Emit(result);

Here, each document is split into words, and each word is counted initially with a "1" value by the Map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to Reduce, so this function just needs to sum all of its input values to find the total appearances of that word. (A runnable PHP version of this word count appears after example 3 below.)

2). Massive data is distributed across 100 computers. Find a way to efficiently compute the TOP 10 of this batch of data.

3). There are N machines in total, and each machine holds N numbers. Each machine can store and operate on at most O(N) numbers. How do you find the median of the N^2 numbers?
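A small single-machine PHP simulation of the word-count pseudocode in example 1): map() emits (word, 1) pairs, the pairs are grouped by word, and reduce() sums the partial counts. In a real MapReduce system the grouping and the calls to reduce() are performed by the framework across many machines; the function names and documents here are only for illustration.

<?php
function mapDoc(string $name, string $document): array
{
    $pairs = [];
    foreach (preg_split('/\s+/', trim($document)) as $word) {
        $pairs[] = [$word, 1];            // emit (word, 1)
    }
    return $pairs;
}

function reduceWord(string $word, array $partialCounts): int
{
    return array_sum($partialCounts);     // sum the partial counts for one word
}

$docs = ['d1' => 'it is what it is', 'd2' => 'what is it'];

// "Shuffle": group all emitted counts by word.
$grouped = [];
foreach ($docs as $name => $text) {
    foreach (mapDoc($name, $text) as [$word, $count]) {
        $grouped[$word][] = $count;
    }
}

$totals = [];
foreach ($grouped as $word => $counts) {
    $totals[$word] = reduceWord($word, $counts);
}
print_r($totals);                         // it => 3, is => 3, what => 2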


Classic problem analysis

Tens of millions or hundreds of millions of data items (with duplicates): count the top N items that appear most often. There are two situations: the data can be read into memory at once, or it cannot.

Available ideas: trie tree + heap, database index, partitioning into subsets for statistics, hashing, distributed computing, approximate statistics, external sorting

Whether the data "can be read into memory at once" actually refers to the amount of data after removing duplicates. If the deduplicated data fits into memory, we can build a dictionary over the data, for example with a map, hashmap, or trie, and then compute the statistics directly. While updating each item's occurrence count, we can also use a heap to maintain the top N items with the most occurrences; of course, this increases the maintenance cost, and it is not as efficient as computing the complete statistics first and then finding the top N.
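A minimal PHP sketch of this dictionary-plus-heap combination for the in-memory case; the function name is hypothetical:

<?php
// Count occurrences in a hash table, then keep a size-N min-heap of
// [count, item] pairs to select the N most frequent items.
function topN(iterable $items, int $n): array
{
    $counts = [];
    foreach ($items as $item) {
        $counts[$item] = ($counts[$item] ?? 0) + 1;
    }

    $heap = new SplMinHeap();                // orders [count, item] pairs by count
    foreach ($counts as $item => $c) {
        if ($heap->count() < $n) {
            $heap->insert([$c, $item]);
        } elseif ($c > $heap->top()[0]) {    // more frequent than the current minimum
            $heap->extract();
            $heap->insert([$c, $item]);
        }
    }
    return iterator_to_array($heap);         // the top N as [count, item] pairs
}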

If the data cannot fit into memory, on the one hand we can consider whether the dictionary method above can be adapted to this situation; the change to make is to store the dictionary on disk instead of in memory, which can follow the way databases store data.

Of course there is a better way: use distributed computing, essentially the map-reduce process. First partition the data by value range, either on the data itself or on its hash (for example md5), and send each range to a different machine; ideally each partition can be read into memory at once, so that each machine is responsible for one range of values. This is, in effect, the map step. After obtaining the results, each machine only needs to extract its top N items with the most occurrences, and then the per-machine results are combined to select the overall top N. This is, in effect, the reduce step.

You may be tempted to simply distribute the data across the machines arbitrarily, but that does not produce a correct result, because one item may be spread evenly across machines while another may be concentrated entirely on one machine, and different items may have equal counts. For example, suppose we want the 100 most frequent items and we spread 10 million data items over 10 machines, then take the 100 most frequent items on each machine and merge them. This does not guarantee finding the true 100th item: the true 100th item might occur 10,000 times overall but be split across the 10 machines so that each machine sees only 1,000 of its occurrences, while on each machine there are other items, seen say 1,001 times locally, that crowd it out, so the item with 10,000 occurrences is eliminated. Even if we let each machine select its 1,000 most frequent items and then merge, an error is still possible, because there may be a large number of items with 1,001 local occurrences. Therefore the data must not be divided across machines arbitrarily; instead, each item must be mapped to a machine by its hash value, so that each machine processes one range of values.
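A tiny PHP sketch of this partitioning rule: route each item to a machine by a hash of its value, so that all occurrences of the same item land on the same machine and the per-machine top N results can be merged safely. The function name and machine count are only illustrative.

<?php
function machineFor(string $item, int $numMachines): int
{
    // md5 -> first 8 hex chars -> integer, then modulo the number of machines
    return hexdec(substr(md5($item), 0, 8)) % $numMachines;
}

echo machineFor('example-query', 10), PHP_EOL;   // always the same machine for this item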

The external sorting approach consumes a lot of IO and is not very efficient. The distributed method above can also be used on a single machine: divide the total data into several sub-files by value range, process them one by one, and finally merge the words and their frequencies, which can itself be done with an external-sort style merge.

In addition, approximate computation can be considered: by exploiting the properties of natural language, we can keep only the words that actually occur most often in practice as the dictionary, so that it fits into memory.
