search
HomeWeb Front-endJS TutorialUsing heap to implement Top K algorithm (JS implementation)_javascript skills

Let’s talk about the Top K algorithm first, the details are as follows

Application scenarios:

The search engine will record all the search strings used by the user for each search through log files. The length of each query string is 1-255 bytes.
Assume that there are currently 10 million records (the repetition degree of these query strings is relatively high, although the total number is 10 million, but if the repetitions are removed, it does not exceed 3 million. The higher the repetition degree of a query string, it means that the query string The more users there are, the more popular it is. ), please count the 10 most popular query strings, and the memory required cannot exceed 1G.

Essential knowledge:
What is a hash table?
Hash table (Hash table, also called hash table) is a data structure that is directly accessed based on the key value.

That is, it accesses the record by mapping the key value to a location in the table to speed up the search. This mapping function is called a hash function, and the array storing the records is called a hash table.

The method of hash table is actually very simple. It is to convert the Key into an integer number through a fixed algorithm function, the so-called hash function, and then modulate the number to the length of the array. The remainder result is As the subscript of the array, the value is stored in the array space with the number as the subscript.
When using a hash table for query, the hash function is used again to convert the key into the corresponding array subscript, and locate the space to obtain the value. In this way, the positioning performance of the array can be fully utilized for data processing. position.
Problem analysis:

To count the most popular queries, first count the number of occurrences of each Query, and then find the Top 10 based on the statistical results. So we can design the algorithm in two steps based on this idea.

That is, the solution to this problem is divided into the following two steps:

Step one: Query statistics (count the number of occurrences of each Query)
Query statistics has the following two methods to choose from:
1), direct sorting method (often statistics in the log file, use Cat File | Format Key | Sort | Uniq -C | Sort -NR | Head -N 10, this is the method)
First of all, the algorithm we first think of is sorting. First, we sort all the queries in the log, and then traverse the sorted queries and count the number of times each query appears.

But there are clear requirements in the question, that is, the memory cannot exceed 1G, there are 10 million records, each record is 255Byte, obviously it will occupy 2.375G of memory, this condition does not meet the requirements.

Let us recall the content of the data structure course. When the amount of data is relatively large and cannot be accommodated in the memory, we can use external sorting to sort. Here we can use merge sort, because merge sort has a Better time complexity O(NlgN).

After sorting, we will traverse the ordered Query file, count the number of occurrences of each Query, and write it to the file again.

A comprehensive analysis shows that the time complexity of sorting is O(NlgN), and the time complexity of traversal is O(N), so the overall time complexity of the algorithm is O(N+NlgN)=O(NlgN) .

2), Hash Table method (this method is very good for counting the number of occurrences of a string)
In the first method, we use a sorting method to count the number of occurrences of each Query. The time complexity is NlgN. So is there a better way to store it with lower time complexity?

It is explained in the title that although there are 10 million Query, due to the high degree of repetition, there are actually only 3 million Query, each Query is 255Byte, so we can consider putting them all into memory. Now we just need a suitable data structure. Here, Hash Table is definitely our priority choice, because the query speed of Hash Table is very fast, with almost O(1) time complexity.

So, we have our algorithm:

Maintain a HashTable whose Key is Query string and Value is the number of occurrences of Query. Each time a Query is read, if the string is not in the Table, then add the string and set the Value to 1; if If the string is in Table, then add one to the count of the string. Finally, we completed the processing of this massive data within O(N) time complexity.

Compared with Algorithm 1, this method improves the time complexity by an order of magnitude, which is O(N), but it is not only an optimization in time complexity, this method only needs to IO the data file once, while Algorithm 1 The number of IOs is high, so Algorithm 2 has better operability in engineering than Algorithm 1.

Step 2: Find the Top 10 (find the 10 that appear most frequently)
Algorithm 1: Ordinary sorting (we only need to find the top10, so all sorting is redundant)
I think everyone is familiar with the sorting algorithm, so I won’t go into details here. What we need to note is that the time complexity of the sorting algorithm is NlgN. In this question, three million records can be saved with 1G of memory.

Algorithm 2: Partial Sorting      
The question requirement is to find the Top 10, so we do not need to sort all the Query. We only need to maintain an array of 10 sizes, initialize 10 Query, and sort each Query from large to small according to the number of statistics. , and then traverse the 3 million records. Each time a record is read, it is compared with the last Query in the array. If it is smaller than this Query, then continue traversing. Otherwise, the last data in the array is eliminated (it still needs to be placed in the appropriate position to keep it there. sequence), add the current Query. Finally, when all the data has been traversed, the 10 Queries in this array will be the Top10 we are looking for.

It is not difficult to analyze that in this way, the worst time complexity of the algorithm is N*K, where K refers to the top number.

Algorithm 3: Heap
In Algorithm 2, we have optimized the time complexity from NlogN to N*K. I have to say that this is a relatively big improvement, but is there a better way?

Analyze it, in Algorithm 2, after each comparison is completed, the required operation complexity is K, because the elements need to be inserted into a linear table, and sequential comparison is used. Let us note here that the array is ordered. Every time we search, we can use the binary method to search. In this way, the complexity of the operation is reduced to logK. However, the subsequent problem is data movement, because movement The number of data has increased. However, this algorithm is still improved over Algorithm 2.

Based on the above analysis, let’s think about it, is there a data structure that can both search and move elements quickly?

The answer is yes, that is heap.
With the help of the heap structure, we can find and adjust/move in log time. So at this point, our algorithm can be improved to this, maintaining a small root heap of size K (10 in this question), then traversing 3 million Query, and comparing them with the root elements respectively.

The idea is consistent with the above-mentioned Algorithm 2, except that in Algorithm 3, we use a data structure such as a minimum heap instead of an array, which reduces the time complexity of finding the target element from O(K) to O(logK).
So, using the heap data structure and Algorithm 3, the final time complexity is reduced to N*logK. Compared with Algorithm 2, there is a relatively large improvement.

At this point, the algorithm is completely over. After the above first step, first use the Hash table to count the number of occurrences of each Query, O(N); then in the second step, use the heap data structure to find the Top 10, N* O(logK). So, our final time complexity is: O(N) + N'*O(logK). (N is 10 million, N' is 3 million).

How does js use heap to implement Top K algorithm?

1. Use heap algorithm to implement Top, the time complexity is O(LogN)

function top(arr,comp){ 
if(arr.length == 0){return ;} 
var i = arr.length / 2 | 0 ; 
for(;i >= 0; i--){ 
if(comp(arr[i], arr[i * 2])){exch(arr, i, i*2);} 
if(comp(arr[i], arr[i * 2 + 1])) {exch(arr, i, i*2 + 1);} 
} 
return arr[0];   
  
} 
   
function exch(arr,i,j){ 
var t = arr[i]; 
arr[i] = arr[j]; 
arr[j] = t; 
}

2. Call the heap implementation K times, the time complexity is O(K * LogN)

function topK(arr,n,comp){ 
if(!arr || arr.length == 0 || n <=0 || n > arr.length){ 
return -1; 
} 
  
  
var ret = new Array(); 
for(var i = 0;i < n; i++){ 
var max = top(arr,comp); 
ret.push(max); 
arr.splice(0,1); 
} 
return ret; 
}

3. Test

var ret = topK(new Array(16,22,91,0,51,44,23),3,function (a,b){return a < b;}); 
console.log(ret);

The above is the implementation of Top K algorithm using heap. What is Top K algorithm? I hope it will be helpful to everyone's learning.

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
如何使用JS和百度地图实现地图平移功能如何使用JS和百度地图实现地图平移功能Nov 21, 2023 am 10:00 AM

如何使用JS和百度地图实现地图平移功能百度地图是一款广泛使用的地图服务平台,在Web开发中经常用于展示地理信息、定位等功能。本文将介绍如何使用JS和百度地图API实现地图平移功能,并提供具体的代码示例。一、准备工作使用百度地图API前,首先需要在百度地图开放平台(http://lbsyun.baidu.com/)上申请一个开发者账号,并创建一个应用。创建完成

js字符串转数组js字符串转数组Aug 03, 2023 pm 01:34 PM

js字符串转数组的方法:1、使用“split()”方法,可以根据指定的分隔符将字符串分割成数组元素;2、使用“Array.from()”方法,可以将可迭代对象或类数组对象转换成真正的数组;3、使用for循环遍历,将每个字符依次添加到数组中;4、使用“Array.split()”方法,通过调用“Array.prototype.forEach()”将一个字符串拆分成数组的快捷方式。

如何使用JS和百度地图实现地图多边形绘制功能如何使用JS和百度地图实现地图多边形绘制功能Nov 21, 2023 am 10:53 AM

如何使用JS和百度地图实现地图多边形绘制功能在现代网页开发中,地图应用已经成为常见的功能之一。而地图上绘制多边形,可以帮助我们将特定区域进行标记,方便用户进行查看和分析。本文将介绍如何使用JS和百度地图API实现地图多边形绘制功能,并提供具体的代码示例。首先,我们需要引入百度地图API。可以利用以下代码在HTML文件中导入百度地图API的JavaScript

如何使用JS和百度地图实现地图热力图功能如何使用JS和百度地图实现地图热力图功能Nov 21, 2023 am 09:33 AM

如何使用JS和百度地图实现地图热力图功能简介:随着互联网和移动设备的迅速发展,地图成为了一种普遍的应用场景。而热力图作为一种可视化的展示方式,能够帮助我们更直观地了解数据的分布情况。本文将介绍如何使用JS和百度地图API来实现地图热力图的功能,并提供具体的代码示例。准备工作:在开始之前,你需要准备以下事项:一个百度开发者账号,并创建一个应用,获取到相应的AP

js中new操作符做了哪些事情js中new操作符做了哪些事情Nov 13, 2023 pm 04:05 PM

js中new操作符做了:1、创建一个空对象,这个新对象将成为函数的实例;2、将新对象的原型链接到构造函数的原型对象,这样新对象就可以访问构造函数原型对象中定义的属性和方法;3、将构造函数的作用域赋给新对象,这样新对象就可以通过this关键字来引用构造函数中的属性和方法;4、执行构造函数中的代码,构造函数中的代码将用于初始化新对象的属性和方法;5、如果构造函数中没有返回等等。

用JavaScript模拟实现打字小游戏!用JavaScript模拟实现打字小游戏!Aug 07, 2022 am 10:34 AM

这篇文章主要为大家详细介绍了js实现打字小游戏,文中示例代码介绍的非常详细,具有一定的参考价值,感兴趣的小伙伴们可以参考一下。

php可以读js内部的数组吗php可以读js内部的数组吗Jul 12, 2023 pm 03:41 PM

php在特定情况下可以读js内部的数组。其方法是:1、在JavaScript中,创建一个包含需要传递给PHP的数组的变量;2、使用Ajax技术将该数组发送给PHP脚本。可以使用原生的JavaScript代码或者使用基于Ajax的JavaScript库如jQuery等;3、在PHP脚本中,接收传递过来的数组数据,并进行相应的处理即可。

js是什么编程语言?js是什么编程语言?May 05, 2019 am 10:22 AM

js全称JavaScript,是一种具有函数优先的轻量级,直译式、解释型或即时编译型的高级编程语言,是一种属于网络的高级脚本语言;JavaScript基于原型编程、多范式的动态脚本语言,并且支持面向对象、命令式和声明式,如函数式编程。

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

Repo: How To Revive Teammates
1 months agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
2 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
Hello Kitty Island Adventure: How To Get Giant Seeds
1 months agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Dreamweaver Mac version

Dreamweaver Mac version

Visual web development tools

Atom editor mac version download

Atom editor mac version download

The most popular open source editor

WebStorm Mac version

WebStorm Mac version

Useful JavaScript development tools

VSCode Windows 64-bit Download

VSCode Windows 64-bit Download

A free and powerful IDE editor launched by Microsoft

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor