Algorithms have no borders, domestic algorithms are equally good
[CSDN Report] A few days ago, CSDN ran a translation of an article published on High Scalability by Matt Abrams, deputy director of data analytics at AddThis. In it, Abrams described how AddThis counted more than a billion distinct objects using only 1.5 KB of memory, a striking demonstration of what a good algorithm can do.
The article attracted wide attention on Weibo, where we learned that Yitao's own algorithms are every bit as impressive. CSDN therefore interviewed Zhang Yang of the Yitao Data Department (he studied at Yantai University and Beihang University, received a master's degree in computer theory from Beihang University in 2011, and joined Taobao that same year) and asked him to explain the algorithms Yitao uses.
Photo: Zhang Yang, engineer, Yitao Data Department
CSDN: First of all, could you introduce yourself and your day-to-day work?
Zhang Yang: My name is Zhang Yang, and my nickname in the company is Ye Feng. I am an ordinary coder in the Yitao Data Department; like thousands of other coders, I spend my days typing code and writing programs, and I regard it as the second greatest pleasure in life (the first being eating). I am very interested in technologies such as PHP, Nginx, data mining, machine learning, algorithms, compilers, and distributed storage and computing, and I also enjoy mathematics and history. I love the work of writing programs and hope to make programming a lifelong career. Besides writing programs, I like to study mathematics and algorithms, and I am happy to summarize what I learn into articles published on my blog to share with everyone. Interested readers can visit my blog: codinglabs.org.
My position in the Yitao Data Department is front-end development, but my "front end" is broader than that of the typical front-end engineer. In addition to being responsible for HTML, CSS, and JavaScript, I develop back-end programs in PHP and Lua, and out of interest and need I occasionally write C and algorithm code (which I like very much and greatly enjoy). I also do some operations work, such as setting up server environments and maintaining online servers.
CSDN: What prompted you to be interested in algorithms?
Zhang Yang: It probably comes from my interest in mathematics; I have always liked mathematical things. I first formally encountered algorithms in my sophomore year of college, when I bought a copy of Introduction to Algorithms and really began to understand the basics of the field: asymptotic complexity, algorithm analysis, dynamic programming, greedy algorithms, NP problems, and so on. Reading it, I was amazed; every algorithm in the book shines with human ingenuity, and studying them gave me a satisfaction and pleasure that is hard to put into words. In my later studies and work I kept seeing, through practical applications, how algorithms solve real problems in every field and push human civilization forward, which deepened my respect for them.
CSDN: Why did Yitao Data Department develop this cardinality estimation algorithm?
Zhang Yang: The Yitao Data Department mainly does data analysis and mining in the e-commerce domain and integrates these techniques closely with the business to build data products and services, such as analytics and recommendation systems. These products serve not only external users but also support operations inside the company and the wider group.
In e-commerce data analysis, computing certain key indicators is a very common task. One such indicator is the unique visitor count (UV for short), the number of distinct visitors within given time and space constraints. Generally we first tag each unique visitor by some means (for example, a cookie) and record that tag in every access log. Computing a UV then amounts to counting the distinct elements in a multiset of user tags, which mathematically is the set's cardinality.
Computing cardinality this way runs into two difficulties:
First, it does not lend itself to real-time stream computing. Some of our products provide real-time UV, the number of unique visitors from some point in time (say, midnight today) up to the present. To support this, a data structure with fast lookups (such as a B-tree) must be kept in memory for every UV being tracked, so that when a new visit arrives on the real-time stream we can quickly check whether the visitor has been seen before, which determines whether the UV grows by 1 or stays unchanged. If we want to provide this service for one million stores at once, we must keep one million B-trees in memory, and if we also compute UV broken down by traffic source, that number multiplies rapidly. This is a serious strain on our computing and memory resources.
Second, traditional cardinality computations cannot be merged effectively. Even after last hour's UV and this hour's UV have each been computed, getting the combined UV of the two hours still requires an expensive recomputation. A bitmap data structure does merge quickly (a single pass of bitwise ORs, as sketched below), but its space cost is too high, and since the number of possible combinations of time periods grows exponentially with the number of periods, neither B-trees nor plain bitmaps are an effective answer to big data.
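To make the merge property concrete, here is a minimal C sketch of the bitmap approach (my own illustration, not code from any Yitao system; it assumes GCC/Clang for `__builtin_popcountll`). Each period keeps one bit per possible visitor id, merging two periods is a bitwise OR, and the exact UV is a population count. The space problem is equally visible: a mere 65,536 possible visitors already cost 8 KB per store per period.

```c
#include <stdint.h>
#include <stdio.h>

#define BITMAP_WORDS 1024   /* 1024 * 64 = 65536 bits: one bit per possible visitor id */

/* Mark visitor `id` as seen in this period's bitmap. */
static void bitmap_set(uint64_t *bm, uint32_t id) {
    bm[id >> 6] |= 1ULL << (id & 63);
}

/* Exact UV of a period = number of set bits. */
static uint64_t bitmap_count(const uint64_t *bm) {
    uint64_t n = 0;
    for (int i = 0; i < BITMAP_WORDS; i++)
        n += (uint64_t)__builtin_popcountll(bm[i]);
    return n;
}

int main(void) {
    static uint64_t hour1[BITMAP_WORDS], hour2[BITMAP_WORDS], both[BITMAP_WORDS];
    bitmap_set(hour1, 12); bitmap_set(hour1, 99);   /* visitors 12, 99 in hour 1 */
    bitmap_set(hour2, 99); bitmap_set(hour2, 7);    /* visitors 99, 7 in hour 2 */

    /* Merging is one linear pass of ORs -- no need to reprocess the raw logs. */
    for (int i = 0; i < BITMAP_WORDS; i++)
        both[i] = hour1[i] | hour2[i];

    printf("UV hour1=%llu hour2=%llu combined=%llu\n",   /* prints 2, 2, 3 */
           (unsigned long long)bitmap_count(hour1),
           (unsigned long long)bitmap_count(hour2),
           (unsigned long long)bitmap_count(both));
    return 0;
}
```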
Against this background, Wang Xiaozhe (known within the company as Qingwu), a technical expert in the Yitao Data Department, studied the cardinality estimation literature together with Clearspring's Java implementation (stream-lib), and took the lead in introducing a cardinality estimation algorithm into our holographic effect platform (codenamed Moonlight Treasure Box). It successfully solved the problem of computing a huge number of UVs in a small amount of memory, and it carried the real-time effectiveness computation for all of the venue slots on Tmall and Taobao during the Double Eleven and Double Twelve promotions.
To make it easier for non-Java projects to use these algorithms, Wang Xiaozhe and I wrote a C implementation, ccard-lib, based on the relevant papers and with reference to stream-lib; Zhang Wei (known within the company as Minzhan), an engineer in the Yitao Data Department, then implemented a PHP extension on top of it. This C implementation is now used in several Yitao Data Department products and has been open-sourced on GitHub.
CSDN: Can you describe the Yitao Data Department's cardinality estimation algorithms for readers in more detail?
Zhang Yang: The algorithm we mainly use is Adaptive Counting, which comes from the paper "Fast and accurate traffic matrix measurement using adaptive cardinality counting". In ccard-lib, though, I also implemented the other common cardinality estimation algorithms: Linear Counting, LogLog Counting, and HyperLogLog Counting.
These are all probabilistic algorithms: they trade a certain amount of accuracy (the accuracy is controllable, and mathematical analysis gives methods for controlling it) for a dramatic saving in computing resources. For example, we can estimate a UV in the hundreds of millions with an error of no more than 2% using only 8 KB of memory, a huge saving compared with B-trees or raw bitmaps. And because these algorithms operate on a bitmap of hashed values, they keep the bitmap's efficient merging while slashing its memory footprint, which resolves both of the difficulties described above at once.
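As a taste of how simple the core of such an estimator can be, here is a minimal Linear Counting sketch in C (my own illustration, not the ccard-lib code; the 64-bit mixer is the finalizer step of MurmurHash3, standing in for a full hash function). Every element is hashed onto one of m bits; if z bits remain zero, the estimate is n ≈ -m·ln(z/m):

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define M_BITS 65536                    /* 2^16 bits = 8 KB, as in the results below */

static uint8_t bitmap[M_BITS / 8];      /* zero-initialized static storage */

/* 64-bit mixer (MurmurHash3's fmix64 finalizer), standing in for a full hash. */
static uint64_t mix64(uint64_t x) {
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

static void lc_add(uint64_t item) {
    uint64_t h = mix64(item) % M_BITS;  /* map the item to one of m bits */
    bitmap[h >> 3] |= (uint8_t)(1u << (h & 7));
}

static double lc_estimate(void) {
    long zeros = 0;
    for (long i = 0; i < M_BITS; i++)
        if (!(bitmap[i >> 3] & (1u << (i & 7))))
            zeros++;
    /* Linear Counting estimator: n ~ -m * ln(z / m). */
    return -(double)M_BITS * log((double)zeros / M_BITS);
}

int main(void) {
    for (uint64_t i = 0; i < 50000; i++)    /* 50000 distinct items */
        lc_add(i);
    printf("estimated: %.0f\n", lc_estimate());  /* close to 50000 */
    return 0;
}
```

Hash collisions in the bitmap are expected and are exactly what the logarithmic correction accounts for; the estimator only breaks down once the bitmap is nearly full (z approaching 0), which is where the LogLog family takes over.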
When using 2^16 (64K) bits, the estimated results are as follows:
Linear Counting with Murmurhash:
actual: 50000, estimated: 50062, error: 0.12%
actual: 100000, estimated: 99924, error: 0.08%
actual: 150000, estimated: 149865, error: 0.09%
actual: 200000, estimated: 199916, error: 0.04%
actual: 250000, estimated: 250123, error: 0.05%
actual: 300000, estimated: 299942, error: 0.02%
actual: 350000, estimated: 349801, error: 0.06%
actual: 400000, estimated: 400101, error: 0.03%
actual: 450000, estimated: 449955, error: 0.01%
actual: 500000, estimated: 500065, error: 0.01%
Linear Counting with Lookup3hash:
actual: 50000, estimated: 49835, error: 0.33%
actual: 100000, estimated: 99461, error: 0.54%
actual: 150000, estimated: 149006, error: 0.66%
actual: 200000, estimated: 198501, error: 0.75%
actual: 250000, estimated: 248365, error: 0.65%
actual: 300000, estimated: 298065, error: 0.65%
actual: 350000, estimated: 347504, error: 0.71%
actual: 400000, estimated: 397292, error: 0.68%
actual: 450000, estimated: 446700, error: 0.73%
actual: 500000, estimated: 495944, error: 0.81%
Hyperloglog Counting with Murmurhash:
actual: 50000, estimated: 50015, error: 0.03%
actual: 100000, estimated: 100048, error: 0.05%
actual: 150000, estimated: 149709, error: 0.19%
actual: 200000, estimated: 201595, error: 0.80%
actual: 250000, estimated: 250168, error: 0.07%
actual: 300000, estimated: 299864, error: 0.05%
actual: 350000, estimated: 348571, error: 0.41%
actual: 400000, estimated: 398583, error: 0.35%
actual: 450000, estimated: 448632, error: 0.30%
actual: 500000, estimated: 498330, error: 0.33%
Hyperloglog Counting with Lookup3hash:
actual: 50000, estimated: 49628, error: 0.74%
actual: 100000, estimated: 99357, error: 0.64%
actual: 150000, estimated: 148880, error: 0.75%
actual: 200000, estimated: 200475, error: 0.24%
actual: 250000, estimated: 249362, error: 0.26%
actual: 300000, estimated: 299119, error: 0.29%
actual: 350000, estimated: 349225, error: 0.22%
actual: 400000, estimated: 398805, error: 0.30%
actual: 450000, estimated: 448373, error: 0.36%
actual: 500000, estimated: 498183, error: 0.36%
Adaptive Counting with Murmurhash:
actual: 50000, estimated: 50015, error: 0.03%
actual: 100000, estimated: 100048, error: 0.05%
actual: 150000, estimated: 149709, error: 0.19%
actual: 200000, estimated: 201059, error: 0.53%
actual: 250000, estimated: 249991, error: 0.00%
actual: 300000, estimated: 300067, error: 0.02%
actual: 350000, estimated: 349610, error: 0.11%
actual: 400000, estimated: 399875, error: 0.03%
actual: 450000, estimated: 450348, error: 0.08%
actual: 500000, estimated: 500977, error: 0.20%
Adaptive Counting with Lookup3hash:
actual: 50000, estimated: 49628, error: 0.74%
actual: 100000, estimated: 99357, error: 0.64%
actual: 150000, estimated: 148880, error: 0.75%
actual: 200000, estimated: 199895, error: 0.05%
actual: 250000, estimated: 249563, error: 0.17%
actual: 300000, estimated: 299047, error: 0.32%
actual: 350000, estimated: 348665, error: 0.38%
actual: 400000, estimated: 399266, error: 0.18%
actual: 450000, estimated: 450196, error: 0.04%
actual: 500000, estimated: 499516, error: 0.10%
Loglog Counting with Murmurhash:
actual: 50000, estimated: 59857, error: 19.71%
actual: 100000, estimated: 103108, error: 3.11%
actual: 150000, estimated: 150917, error: 0.61%
actual: 200000, estimated: 201059, error: 0.53%
actual: 250000, estimated: 249991, error: 0.00%
actual: 300000, estimated: 300067, error: 0.02%
actual: 350000, estimated: 349610, error: 0.11%
actual: 400000, estimated: 399875, error: 0.03%
actual: 450000, estimated: 450348, error: 0.08%
actual: 500000, estimated: 500977, error: 0.20%
Loglog Counting with Lookup3hash:
actual: 50000, estimated: 59870, error: 19.74%
actual: 100000, estimated: 103044, error: 3.04%
actual: 150000, estimated: 150435, error: 0.29%
actual: 200000, estimated: 199895, error: 0.05%
actual: 250000, estimated: 249563, error: 0.17%
actual: 300000, estimated: 299047, error: 0.32%
actual: 350000, estimated: 348665, error: 0.38%
actual: 400000, estimated: 399266, error: 0.18%
actual: 450000, estimated: 450196, error: 0.04%
actual: 500000, estimated: 499516, error: 0.10%
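Note how plain LogLog Counting is badly biased at small cardinalities (19.7% error at 50,000) while Adaptive Counting tracks the truth closely everywhere. The reason is that Adaptive Counting uses the LogLog estimator but falls back to a Linear Counting style estimate while many buckets are still empty. Below is a minimal C sketch of that switch, following the description in the paper cited above (my own illustration, not the ccard-lib code; mix64 is the same MurmurHash3 finalizer as before, and `__builtin_clzll` assumes GCC/Clang):

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define K 10                      /* 2^10 = 1024 buckets */
#define M (1 << K)

static uint8_t reg[M];            /* per-bucket register: max rank seen so far */

static uint64_t mix64(uint64_t x) {
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

static void ac_add(uint64_t item) {
    uint64_t h = mix64(item);
    uint32_t idx = (uint32_t)(h >> (64 - K));  /* top K bits choose the bucket */
    uint64_t rest = h << K;                    /* rank = position of first 1 bit */
    uint8_t rank = rest ? (uint8_t)(__builtin_clzll(rest) + 1)
                        : (uint8_t)(64 - K + 1);
    if (rank > reg[idx]) reg[idx] = rank;
}

static double ac_estimate(void) {
    int empty = 0;
    double sum = 0.0;
    for (int i = 0; i < M; i++) {
        if (reg[i] == 0) empty++;
        sum += reg[i];
    }
    double e = (double)empty / M;
    if (e >= 0.051)                     /* many empty buckets: LogLog is biased,   */
        return -(double)M * log(e);     /* use the Linear Counting style estimate  */
    return 0.39701 * M * pow(2.0, sum / M);  /* LogLog: alpha_m * m * 2^(mean rank) */
}

int main(void) {
    for (uint64_t i = 0; i < 500000; i++)
        ac_add(i);
    printf("estimated: %.0f\n", ac_estimate());  /* within a few percent of 500000 */
    return 0;
}
```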
Space does not permit me to describe the details of these algorithms here. I previously published a translated article on my blog, but it too is only an overview. I am already preparing posts that introduce cardinality estimation algorithms in detail, including the mathematical underpinnings and some interpretation of the papers; interested readers are welcome to follow my blog.
CSDN: On Weibo you describe yourself as a "severe code-cleanliness obsessive", which is an interesting label. Can we take that to mean you care a great deal about coding standards? How do you keep your code clean in everyday development?
Zhang Yang: That label is partly self-mockery. I do care a lot about code formatting; if I see code that is badly formatted, untidy, or even has one extra blank line, I feel very uncomfortable. At heart I have a perfectionist streak about code style.
But there are two sides to this. For something I develop on my own in a specialized area, such as an algorithm library, I can stick to that perfectionism, but in settings that require collaboration it is not really appropriate. Realistically, business code can never stay beautiful forever; you have to strike a balance between delivery schedule and code quality. As for maintaining standards, I have always believed you cannot rely purely on programmers' self-discipline or on preaching coding conventions; enforcing them through tools (such as lint) and process is more effective.
CSDN: What difficulties remain to be overcome in your future work?
Zhang Yang: The difficulties come mainly from two directions.
One is the difficulty of improving the algorithms themselves. No algorithm in this world is perfect. Take the cardinality estimation algorithms above: they greatly reduce memory use, but if the number of dimensions explodes, memory consumption still becomes extreme, and merging bitmaps is not free either; sometimes in-memory and on-disk bitmaps have to be merged, and once the volume of bitmaps grows too large, disk I/O becomes the bottleneck. How to optimize and improve the algorithms for a concrete scenario is therefore a hard problem. One approach is to read the relevant papers and learn from the latest results of the major research institutions and companies around the world; another is to improve the algorithms ourselves, which demands a very deep grasp of the algorithm itself and of the mathematical analysis behind it, and so places high demands on the engineers' theoretical level.
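One property worth noting in this context is that register-based sketches merge as cheaply as bitmaps do: because each register stores the maximum rank observed, the sketch of a union is just the element-wise maximum of the two register arrays. That is what makes pre-aggregating per hour and later combining arbitrary periods (whether the sketches live in memory or on disk) feasible at all. A minimal sketch, reusing the register layout from the example above:

```c
#include <stdint.h>

#define M 1024   /* number of registers, matching the sketch above */

/* The union's sketch is the element-wise max of the inputs: each register
 * holds the largest rank seen, and the union has seen everything both saw. */
static void sketch_merge(uint8_t *dst, const uint8_t *a, const uint8_t *b) {
    for (int i = 0; i < M; i++)
        dst[i] = a[i] > b[i] ? a[i] : b[i];
}
```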
The other is combining algorithms with business products. Algorithms are, after all, relatively abstract things, and there is a long way to go before one is concretely applied in a product. Finding the best fit between algorithm and product is also one of the key points, and one of the difficulties, of the work.
2012 has passed; we survived the "end of the world" and ushered in a new chapter. In 2013 the Internet enters a new stage of development, with data of every kind flooding the network. Big data has become a problem every Internet company must face, and how to extract as much useful information as possible while consuming the fewest resources is a question every one of them should be considering. After these two recent articles on algorithms, readers should have a general picture. Of course, every algorithm has its strengths and weaknesses; in daily work we must still choose algorithms according to actual usage rather than over-generalize. (Author: Wang Xudong / Reviewer: Bao Yan)