How to implement a relevance scoring function for full-text search in JavaScript

Full-text search, unlike most topics in machine learning, is a problem web programmers run into regularly in their daily work. A customer asks for a search box somewhere, and you write a SQL statement along the lines of WHERE title LIKE %:query% to implement it. That works fine at first, until one day the customer comes back and says, "The search is broken!"

Of course, the search isn't actually "broken"; the results just aren't what the customer wants. Ordinary users don't know how to craft exact matches, so the results they get back are poor. To fix this, you decide to use full-text search. After some tedious study, you enable MySQL's FULLTEXT index and switch to the more advanced query syntax, "MATCH() ... AGAINST()".


Problem solved, time to celebrate! And everything is fine while the database stays small.

But as the data grows, the database gets slower and slower. MySQL simply isn't a great full-text search tool. So you decide to use Elasticsearch, refactor your code, and deploy a Lucene-powered full-text search cluster. You find that it works very well: fast and accurate.

At this point you can’t help but wonder: Why is Lucene so awesome?

This article (covering TF-IDF, Okapi BM25, and relevance scoring in general) and the next one (covering indexing) will walk you through the basic concepts behind full-text search.

Relevance

For every search query, we can define a "relevance score" for each document. When users search, we can then sort by relevance score instead of by document creation time. That way the most relevant document comes first, no matter how long ago it was created (though sometimes the creation time matters too).

There are many, many ways to compute relevance, but we'll start with the simplest, purely statistical approach. It requires no understanding of the language itself; it determines a relevance score by counting term usage and matches, and by weighting terms according to how rare they are across the documents.

This algorithm does not care whether the word is a noun or a verb, nor does it care about the meaning of the word. The only thing it cares about is which words are common and which are rare. If a search query contains both common words and rare words, you'd better give the document containing the rare words a higher score and lower the weight of the common words.

This algorithm is called Okapi BM25. It builds on two basic concepts: term frequency ("TF") and inverse document frequency ("IDF"). Combined, they form "TF-IDF", a statistical measure of how important a term is in a document.

TF-IDF

Term Frequency, or “TF” for short, is a very simple metric: the number of times a specific word appears in a document. You can divide this value by the total number of words in the document to get a score. For example, if there are 100 words in the document and the word 'the' appears 8 times, then the TF of 'the' is 8 or 8/100 or 8% (depending on how you want to express it).

Inverse Document Frequency, or "IDF" for short, is a bit more involved: the rarer a word, the higher its IDF. It is computed by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.

If you multiply these two numbers together (TF*IDF), you will get the weight of a word in the document. "Weight" is defined as: How rare is the word and how often does it appear in the document?

You can use this concept to run search queries against documents. For each keyword in the query, calculate its TF-IDF score for each document and add the scores together. The document with the highest total is the best match for the query.
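To make that concrete, here is a rough sketch of naive TF-IDF scoring (my own illustration, not the BM25 code later in this article; it assumes the documents have already been tokenized into arrays of terms):

// Naive TF-IDF scoring sketch (illustrative only, not the BM25 class below).
// Assumes `documents` is an array of token arrays, e.g. [['the', 'dog', 'walk'], ...].
function tfIdfScores(queryTokens, documents) {
  var totalDocs = documents.length;
  return documents.map(function(tokens, id) {
    var score = 0;
    queryTokens.forEach(function(term) {
      var count = tokens.filter(function(t) { return t === term; }).length;
      if (count === 0) { return; }                  // term not in this document
      var tf = count / tokens.length;               // term frequency
      var docsWithTerm = documents.filter(function(d) {
        return d.indexOf(term) !== -1;
      }).length;
      var idf = Math.log(totalDocs / docsWithTerm); // inverse document frequency
      score += tf * idf;                            // sum TF * IDF over query terms
    });
    return { id: id, score: score };
  }).sort(function(a, b) { return b.score - a.score; });
}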

How cool!

Okapi BM25

The algorithm above works, but it isn't perfect. It gives us a purely statistical relevance score, and we can do better.

Okapi BM25 is considered one of the most advanced ranking algorithms around (or so says Elasticsearch). On top of TF-IDF it adds two tunable parameters, k1 and b, which control "term frequency saturation" and "field-length normalization" respectively. What on earth do those mean?

To understand "term frequency saturation" intuitively, imagine two articles of roughly the same length that both discuss baseball. Let's also assume the rest of the corpus has very little baseball-related content, so the word "baseball" has a high IDF: it's rare and important. Both articles devote considerable space to baseball, but one uses the word "baseball" more often than the other. Should one article really score much higher than the other in this case? Since both documents discuss baseball at length, it hardly matters whether "baseball" appears 40 times or 80 times. In fact, 30 times should be the cap!


This is "word frequency saturation. The native TF-IDF algorithm has no concept of saturation, so a document with "baseball" appearing 80 times has a score twice as high as one that appears 40 times. Sometimes, this is what we want, but Sometimes we don’t want that.

Okapi BM25's k1 parameter controls how quickly term frequency saturates. Its value is usually between 1.2 and 2.0; the lower it is, the faster saturation sets in (meaning the two documents above end up with nearly the same score, because both contain a lot of the word "baseball").
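To see the saturation effect in numbers, here is a tiny sketch of my own (it uses only the term-frequency part of the BM25 formula that appears in the code below, ignoring length normalization for the moment):

// TF component of BM25, ignoring length normalization.
function saturatedTf(count, k1) {
  return (count * (k1 + 1)) / (count + k1);
}

console.log(saturatedTf(40, 1.3)); // ~2.23
console.log(saturatedTf(80, 1.3)); // ~2.26, barely higher despite twice the occurrences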

Field-length normalization compares a document's length to the average length of all documents. It is useful for single-field collections (like ours), where it puts documents of different lengths on the same footing. It makes even more sense for two-field collections (such as "title" and "body"), since it also puts the title and body fields on a comparable footing. Field-length normalization is controlled by b, which takes a value between 0 and 1: 1 means full normalization, and 0 means none.
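Again in numbers, here is the length-normalization factor from the denominator of the BM25 formula used in the code below (the example values are my own illustration):

// Length-normalization factor from BM25's denominator; lengths are counted in terms.
function lengthNorm(b, docLength, avgDocLength) {
  return 1 - b + b * (docLength / avgDocLength);
}

console.log(lengthNorm(0.75, 200, 100)); // 1.75  (longer-than-average docs are penalized)
console.log(lengthNorm(0.75, 50, 100));  // 0.625 (shorter docs get a boost)
console.log(lengthNorm(0, 200, 100));    // 1     (b = 0 disables normalization)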

Algorithm

You can read about the Okapi BM25 formula on its Wikipedia page. Now that we know what each term in the formula means, it should be easy enough to follow, so let's skip the formula and go straight to the code:

BM25.Tokenize = function(text) {
  text = text
    .toLowerCase()
    .replace(/\W/g, ' ')
    .replace(/\s+/g, ' ')
    .trim()
    .split(' ')
    .map(function(a) { return stemmer(a); });
 
  // Filter out stopStems
  var out = [];
  for (var i = 0, len = text.length; i < len; i++) {
    if (stopStems.indexOf(text[i]) === -1) {
      out.push(text[i]);
    }
  }
 
  return out;
};

We define a simple static method, Tokenize(), that parses a string into an array of tokens. Along the way we lowercase every token (to reduce entropy), run the Porter stemmer to reduce entropy further and improve matching ("walking" and "walk" become the same term), and filter out stop words (very common words) to cut entropy even more; the stemmer() function and the stopStems list come from the source linked at the end of this article. Please bear with me if I'm over-explaining this section; it's tangential to the concepts I really want to get to.
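The methods below also read and write some per-instance state: this.documents, this.terms, this.totalDocuments, this.totalDocumentTermLength, this.averageDocumentLength, and the k1 and b parameters. The constructor isn't shown in this excerpt; a minimal sketch that would support the code below (the default values for k1 and b are my assumption) looks like this:

// Minimal constructor sketch; k1 = 1.3 and b = 0.75 are assumed defaults.
BM25 = function() {
  this.documents = {};               // id -> per-document record
  this.terms = {};                   // term -> { n, idf } across the whole corpus
  this.totalDocuments = 0;
  this.totalDocumentTermLength = 0;
  this.averageDocumentLength = 0;
  this.k1 = 1.3;
  this.b = 0.75;
};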

BM25.prototype.addDocument = function(doc) {
  if (typeof doc.id === 'undefined') { throw new Error('ID is a required property of documents.'); }
  if (typeof doc.body === 'undefined') { throw new Error('Body is a required property of documents.'); }
 
  // Raw tokenized list of words
  var tokens = BM25.Tokenize(doc.body);
 
  // Will hold unique terms and their counts and frequencies
  var _terms = {};
 
  // docObj will eventually be added to the documents database
  var docObj = {id: doc.id, tokens: tokens, body: doc.body};
 
  // Count number of terms
  docObj.termCount = tokens.length;
 
  // Increment totalDocuments
  this.totalDocuments++;
 
  // Readjust averageDocumentLength
  this.totalDocumentTermLength += docObj.termCount;
  this.averageDocumentLength = this.totalDocumentTermLength / this.totalDocuments;
 
  // Calculate term frequency
  // First get terms count
  for (var i = 0, len = tokens.length; i < len; i++) {
    var term = tokens[i];
    if (!_terms[term]) { 
      _terms[term] = {
        count: 0,
        freq: 0
      }; 
    };
    _terms[term].count++;
  }
 
  // Then re-loop to calculate term frequency.
  // We'll also update inverse document frequencies here.
  var keys = Object.keys(_terms);
  for (var i = 0, len = keys.length; i < len; i++) {
    var term = keys[i];
    // Term Frequency for this document.
    _terms[term].freq = _terms[term].count / docObj.termCount;
 
    // Inverse Document Frequency initialization
    if (!this.terms[term]) {
      this.terms[term] = {
        n: 0, // Number of docs this term appears in, uniquely
        idf: 0
      };
    }
 
    this.terms[term].n++;
  };
 
  // Calculate inverse document frequencies
  // This is SLOWish so if you want to index a big batch of documents,
  // comment this out and run it once at the end of your addDocuments run
  // If you're only indexing a document or two at a time you can leave this in.
  // this.updateIdf();
 
  // Add docObj to docs db
  docObj.terms = _terms;
  this.documents[docObj.id] = docObj;
};

Here's where the magic happens, in the addDocument() method. We essentially build and maintain two parallel data structures: this.documents and this.terms.

this.documents is a database of all documents: it stores each document's original text, its length, and a list of every term in the document along with each term's count and frequency. With this structure we can answer questions easily and quickly (yes, very quickly, in O(1) hash-lookup time) such as: how many times does the word 'walk' appear in document #3?

We also maintain another data structure, this.terms, which covers every term in the corpus. With it we can answer questions in O(1) time such as: how many documents does the word 'walk' appear in, and what is its inverse document frequency?

Finally, we record the length of each document and record the average length of documents across the entire corpus.
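For reference, after indexing a couple of documents the two structures have roughly the following shape (the shapes come from the code above; the values are made up):

// Example shape of the two indexes after indexing (illustrative values).
var documentsExample = {
  3: {
    id: 3,
    body: 'the dog walks ...',
    tokens: ['dog', 'walk' /* ... */],
    termCount: 57,
    terms: {
      walk: { count: 2, freq: 2 / 57 }  // per-document count and frequency
    }
  }
};

var termsExample = {
  walk: { n: 12, idf: 1.84 }            // number of docs containing the term, and its IDF
};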


Note that in the code above, idf is initialized to 0 and the call to updateIdf() is commented out. That's because updateIdf() runs quite slowly and only needs to run once, after indexing. Since running it once is enough, there's no reason to run it 5,000 times. Commenting it out during a large batch of indexing and running it once at the end saves a lot of time. Here's the code for that function:

BM25.prototype.updateIdf = function() {
  var keys = Object.keys(this.terms);
  for (var i = 0, len = keys.length; i < len; i++) {
    var term = keys[i];
    var num = (this.totalDocuments - this.terms[term].n + 0.5);
    var denom = (this.terms[term].n + 0.5);
    this.terms[term].idf = Math.max(Math.log10(num / denom), 0.01);
  }
};

This is a very simple function, but since it has to walk over every term in the corpus and update every value, it's a bit slow. It implements the BM25 variant of inverse document frequency (you can find the formula on Wikipedia): the logarithm of the number of documents that do not contain the term (plus 0.5) divided by the number that do (plus 0.5). I modified it slightly so the return value is always greater than zero.

BM25.prototype.search = function(query) {
 
  var queryTerms = BM25.Tokenize(query);
  var results = [];
 
  // Look at each document in turn. There are better ways to do this with inverted indices.
  var keys = Object.keys(this.documents);
  for (var j = 0, nDocs = keys.length; j < nDocs; j++) {
    var id = keys[j];
    // The relevance score for a document is the sum of a tf-idf-like
    // calculation for each query term.
    this.documents[id]._score = 0;
 
    // Calculate the score for each query term
    for (var i = 0, len = queryTerms.length; i < len; i++) {
      var queryTerm = queryTerms[i];
 
      // We've never seen this term before so IDF will be 0.
      // Means we can skip the whole term, it adds nothing to the score
      // and isn't in any document.
      if (typeof this.terms[queryTerm] === 'undefined') {
        continue;
      }
 
      // This term isn't in the document, so the TF portion is 0 and this
      // term contributes nothing to the search score.
      if (typeof this.documents[id].terms[queryTerm] === 'undefined') {
        continue;
      }
 
      // The term is in the document, let's go.
      // The whole term is :
      // IDF * (TF * (k1 + 1)) / (TF + k1 * (1 - b + b * docLength / avgDocLength))
 
      // IDF is pre-calculated for the whole docset.
      var idf = this.terms[queryTerm].idf;
      // Numerator of the TF portion.
      var num = this.documents[id].terms[queryTerm].count * (this.k1 + 1);
      // Denominator of the TF portion.
      var denom = this.documents[id].terms[queryTerm].count 
        + (this.k1 * (1 - this.b + (this.b * this.documents[id].termCount / this.averageDocumentLength)));
 
      // Add this query term to the score
      this.documents[id]._score += idf * num / denom;
    }
 
    if (!isNaN(this.documents[id]._score) && this.documents[id]._score > 0) {
      results.push(this.documents[id]);
    }
  }
 
  results.sort(function(a, b) { return b._score - a._score; });
  return results.slice(0, 10);
};

Finally, the search() method iterates over all documents, computes each one's BM25 score, and sorts them in descending order. Of course, it would be unwise to scan every document in the corpus on each search; that problem is addressed in Part Two (inverted indexes and performance).


The code above is well commented, and the gist is this: for each document and each query term, calculate the BM25 score. The IDF of each term is pre-computed, so using it is just a lookup; term frequencies are likewise pre-computed as part of each document's record. After that, all that's left is a little simple arithmetic. Finally, we attach a temporary _score field to each document, sort by score in descending order, and return the top 10 results.
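Putting it all together, usage looks roughly like this (a sketch under the assumptions above: the constructor from earlier, and stemmer() and stopStems loaded from the linked source):

// Hypothetical end-to-end usage of the BM25 class described in this article.
var engine = new BM25();

engine.addDocument({ id: 1, body: 'The pitcher threw the baseball across home plate.' });
engine.addDocument({ id: 2, body: 'The hikers walked along the mountain trail all afternoon.' });

// Run once after batch indexing, since the call is commented out in addDocument().
engine.updateIdf();

var results = engine.search('baseball pitcher');
results.forEach(function(doc) {
  console.log(doc.id, doc._score);
});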
Examples, source code, notes and next steps

There are many ways to optimize the example above; we'll cover them in the second part of this full-text search series, so stay tuned. I hope to finish it within a few weeks. Here's what the next part will cover:

  • Inverted indexes and fast search
  • Faster indexing
  • Better search results

For this demo, I wrote a small Wikipedia crawler that grabbed the first paragraph of quite a few (85,000) Wikipedia articles. Since indexing all 85K documents takes around 90 seconds on my computer, I've cut the corpus in half for the demo; I don't want you to burn your laptop's battery just for a simple full-text demonstration.

Because indexing is a heavy, synchronous CPU operation, I run it in a web worker so that it executes on a background thread. You can find the full source code here, along with references to the stemming algorithm and my stop-word list. As for the license: as always, it's free for educational purposes, not for any commercial purpose.

Finally, the demo. Once indexing is complete, try searching for random words and phrases that Wikipedia would know about. Note that only 40,000 paragraphs are indexed, so you may have to try a few different topics.
