
Given two files a and b, each storing 5 billion URLs, each URL occupying 64 bytes, with a memory limit of 4 GB, how do you find the URLs common to files a and b?

WBOY (Original)
2016-07-13


We can estimate the size of each file as 5×10⁹ URLs × 64 bytes ≈ 320 GB, far larger than the 4 GB memory limit, so neither file can be loaded fully into memory. Consider a divide-and-conquer approach instead.
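A quick back-of-the-envelope check of these numbers (using decimal GB/MB, and the 1000-way split described below):

```python
# Size estimate for the problem: 5 billion URLs at 64 bytes each.
n_urls = 5_000_000_000
bytes_per_url = 64

total_bytes = n_urls * bytes_per_url
total_gb = total_bytes / 10**9          # decimal gigabytes
print(total_gb)                          # 320.0 GB per file, far above the 4 GB limit

# After hashing into 1000 buckets, each small file is:
per_bucket_mb = total_bytes / 1000 / 10**6
print(per_bucket_mb)                     # 320.0 MB, which fits in memory comfortably
```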
Traverse file a, compute hash(url) % 1000 for each URL, and write the URL into one of 1000 small files (call them a0, a1, ..., a999). Each small file is then roughly 320 MB. Traverse file b and distribute its URLs into 1000 small files (b0, b1, ..., b999) using the same hash. After this pass, any URL present in both files must land in a corresponding pair of small files (a0 with b0, a1 with b1, ..., a999 with b999), because the same URL always hashes to the same bucket; non-corresponding pairs (such as a0 and b99) cannot share a URL. So we only need to find the common URLs within each of the 1000 pairs of small files.
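The partitioning pass can be sketched in Python as follows. The file-naming scheme and the choice of MD5 are my assumptions, not from the original text; a stable hash like MD5 matters here because Python's built-in `hash()` is randomized per process, which would assign different buckets when a and b are partitioned in separate runs:

```python
import hashlib
import os

def bucket_of(url: str, n_buckets: int = 1000) -> int:
    # Stable hash: the same URL maps to the same bucket in every run,
    # so matching URLs from a and b always land in corresponding files.
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def partition(path: str, prefix: str, n_buckets: int = 1000, out_dir: str = ".") -> None:
    # Stream the big file line by line (constant memory) and append
    # each URL to the small file chosen by its hash bucket.
    outs = [open(os.path.join(out_dir, f"{prefix}{i}"), "w") for i in range(n_buckets)]
    try:
        with open(path) as f:
            for line in f:
                url = line.strip()
                if url:
                    outs[bucket_of(url, n_buckets)].write(url + "\n")
    finally:
        for fh in outs:
            fh.close()
```

Note that holding 1000 output files open at once can brush against the default per-process file-descriptor limit on some systems (often 1024), so the limit may need raising, or the buckets can be written in batches.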
For example, for the pair a0 and b0: traverse a0 and insert each URL into a hash set (the hash_map). Then traverse b0; whenever a URL is already in the set, it exists in both a and b, so write it to the output file.
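A minimal sketch of this per-pair step, with a Python `set` standing in for the hash_map (the function name and one-URL-per-line format are assumptions):

```python
def common_urls(path_a: str, path_b: str) -> set:
    # Load the small piece a_i into an in-memory set (~320 MB of URLs),
    # then stream b_i and keep every URL already present in the set.
    with open(path_a) as f:
        seen = {line.strip() for line in f if line.strip()}
    common = set()
    with open(path_b) as f:
        for line in f:
            url = line.strip()
            if url in seen:
                common.add(url)
    return common
```

Repeating this over the 1000 pairs (a0/b0 through a999/b999) and concatenating the results yields the full set of common URLs.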
If the partition is uneven and some small files are still too large (for example, larger than 2 GB), split those files again in the same way, using a different hash function or modulus, until every piece fits in memory.

A Baidu interviewer asked me this yesterday, so I studied it today.
