Home >Web Front-end >JS Tutorial >How to implement relevance scoring function for full-text search in JavaScript_Basic knowledge
Full-text search, unlike most other problems in the field of machine learning, is a problem that web programmers often encounter in their daily work. The customer may ask you to provide a search box somewhere, and then you will write a SQL statement like WHERE title LIKE %:query% to implement the search function. In the beginning, this was no problem, until one day, the customer came to you and said, "There was an error in the search!"
Of course, there is nothing actually “wrong” with the search, it’s just that the search results are not what the customer wants. Ordinary users don't know how to do exact matching, so the search results they get are of poor quality. To solve the problem, you decide to use full-text search. After a period of boring learning, you enabled MySQL's FULLTEXT index and used more advanced query syntax, such as "MATCH() ... AGAINST()".
Okay, the problem is solved and the flowers are finished! This is no problem when the database size is small.
But when you have more and more data, you will find that your database is getting slower and slower. MySQL is not a very easy-to-use full-text search tool. So you decide to use ElasticSearch, refactor your code, and deploy a Lucene-powered full-text search cluster. You will find that it works very well, fast and accurately.
At this point you can’t help but wonder: Why is Lucene so awesome?
This article (mainly introducing TF-IDF, Okapi BM-25 and ordinary relevance scoring) and the next article (mainly introducing indexing) will tell you the basic concepts behind full-text search.
Relevance
For each search query, we easily define a "relevance score" for each document. When users search, we can use relevance scores to sort instead of document appearance time. This way, the most relevant document will be ranked first, no matter how long ago it was created (of course, sometimes it is also related to the creation time of the document).
There are many, many ways to calculate the correlation between words, but we will start with the simplest, statistics-based method. This method does not require an understanding of the language itself, but determines a "relevance score" by counting word usage, matching, and weighting based on the prevalence of unique words in the document.
This algorithm does not care whether the word is a noun or a verb, nor does it care about the meaning of the word. The only thing it cares about is which words are common and which are rare. If a search query contains both common words and rare words, you'd better give the document containing the rare words a higher score and lower the weight of the common words.
This algorithm is called Okapi BM25. It contains two basic concepts: term frequency (term frequency), abbreviated as term frequency ("TF"), and inverse document frequency (abbreviated as "IDF"). Putting them together, it is called "TF-IDF" , which is a statistical measure used to indicate how important a term is in a document.
TF-IDF
Term Frequency, or “TF” for short, is a very simple metric: the number of times a specific word appears in a document. You can divide this value by the total number of words in the document to get a score. For example, if there are 100 words in the document and the word 'the' appears 8 times, then the TF of 'the' is 8 or 8/100 or 8% (depending on how you want to express it).
Inverse Document Frequency, or "IDF" for short, is a bit more complicated: the rarer a word, the higher this value. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of the quotient. The rarer the word, the higher the "IDF" it will produce.
If you multiply these two numbers together (TF*IDF), you will get the weight of a word in the document. "Weight" is defined as: How rare is the word and how often does it appear in the document?
You can use this concept for search queries on documents. For each keyword in the query, calculate their TF-IDF score and add them together. The document with the highest score is the document that best matches the query statement.
How cool!
Okapi BM25
The above algorithm is a usable algorithm, but it is not perfect. It gives a statistically based correlation score algorithm, which we can further improve.
Okapi BM25 is considered one of the most advanced ranking algorithms so far (hence the name ElasticSearch ). Okapi BM25 adds two adjustable parameters, k1 and b, based on TF-IDF, which represent "term frequency saturation" and "field length specification" respectively. What the hell is this?
To understand “word frequency saturation” intuitively, please imagine two articles of similar length discussing baseball. Also, let's assume that all documents (except these two) don't have much baseball-related content, so the word "baseball" will have a high IDF - it's extremely rare and important. Both articles discuss baseball, and both devote considerable space to it, but one uses the word "baseball" more than the other. So in this case, does one article really have a much different score than another article? Since both documents discuss baseball at large, it doesn't matter whether the word "baseball" appears 40 times or 80 times. In fact, 30 times should be the cap!
This is "word frequency saturation. The native TF-IDF algorithm has no concept of saturation, so a document with "baseball" appearing 80 times has a score twice as high as one that appears 40 times. Sometimes, this is what we want, but Sometimes we don’t want that.
In addition, Okapi BM25 also has a k1 parameter, which is used to adjust the rate of saturation change. The value of the k1 parameter is generally between 1.2 and 2.0. The lower the value, the faster the saturation process. (meaning the two documents above have the same score because they both contain a lot of the word "baseball")
Field-length normalization reduces the length of the document to the average length of all documents. This is useful for single-field collections (such as ours), to unify documents of different lengths to the same comparison criteria. It makes more sense for two-field collections (such as "title" and "body"), which also unifies the title and body fields to the same comparison criteria. Field length reduction is represented by b, which has a value between 0 and 1, with 1 meaning full reduction and 0 meaning no reduction.
Algorithm
You can learn about the formula of Okapi algorithm in Okapi BM25 Wikipedia . Now that we know what each term in the formula is, it must be easy to understand. So let’s not mention the formula and go directly to the code:
BM25.Tokenize = function(text) { text = text .toLowerCase() .replace(/\W/g, ' ') .replace(/\s+/g, ' ') .trim() .split(' ') .map(function(a) { return stemmer(a); }); // Filter out stopStems var out = []; for (var i = 0, len = text.length; i < len; i++) { if (stopStems.indexOf(text[i]) === -1) { out.push(text[i]); } } return out; };
We define a simple static method Tokenize() to parse strings into an array of tokens. That's it, we lowercase all tokens (to reduce entropy). We run the Porter Stemmer algorithm to reduce the amount of entropy while also improving the degree of matching ("walking" and "walk" matching are the same). And we also filter out stop words (very common words) to further reduce entropy. Before I get too deep into the concepts I write about, please bear with me if I overexplain this section.
BM25.prototype.addDocument = function(doc) { if (typeof doc.id === 'undefined') { throw new Error(1000, 'ID is a required property of documents.'); }; if (typeof doc.body === 'undefined') { throw new Error(1001, 'Body is a required property of documents.'); }; // Raw tokenized list of words var tokens = BM25.Tokenize(doc.body); // Will hold unique terms and their counts and frequencies var _terms = {}; // docObj will eventually be added to the documents database var docObj = {id: doc.id, tokens: tokens, body: doc.body}; // Count number of terms docObj.termCount = tokens.length; // Increment totalDocuments this.totalDocuments++; // Readjust averageDocumentLength this.totalDocumentTermLength += docObj.termCount; this.averageDocumentLength = this.totalDocumentTermLength / this.totalDocuments; // Calculate term frequency // First get terms count for (var i = 0, len = tokens.length; i < len; i++) { var term = tokens[i]; if (!_terms[term]) { _terms[term] = { count: 0, freq: 0 }; }; _terms[term].count++; } // Then re-loop to calculate term frequency. // We'll also update inverse document frequencies here. var keys = Object.keys(_terms); for (var i = 0, len = keys.length; i < len; i++) { var term = keys[i]; // Term Frequency for this document. _terms[term].freq = _terms[term].count / docObj.termCount; // Inverse Document Frequency initialization if (!this.terms[term]) { this.terms[term] = { n: 0, // Number of docs this term appears in, uniquely idf: 0 }; } this.terms[term].n++; }; // Calculate inverse document frequencies // This is SLOWish so if you want to index a big batch of documents, // comment this out and run it once at the end of your addDocuments run // If you're only indexing a document or two at a time you can leave this in. // this.updateIdf(); // Add docObj to docs db docObj.terms = _terms; this.documents[docObj.id] = docObj; };
This is where the addDocument() method magically appears. We basically build and maintain two similar data structures: this.documents. and this.terms.
This.documentsis is a database that stores all documents. It stores all the original text of the document, the length information of the document, and a list that stores all the words in the document and the number and frequency of occurrence of the words. Using this data structure, we can answer the following questions easily and quickly (yes, very fast, only requiring a hash table query time of O(1) time complexity): In document #3, 'walk' this How many times does the word appear?
We also use another data structure, this.terms. It represents all words in the corpus. Through this data structure, we can answer the following questions in O(1) time: How many documents does the word 'walk' appear in? What is their id?
Finally, we record the length of each document and record the average length of documents across the entire corpus.
Note that in the above code, idf is initialized to 0 and the updateidf() method is commented out. This is because this method runs very slowly and only needs to be run once, after indexing. Since running it once can meet the needs, there is no need to run it 5,000 times. Commenting it out first and running it after a large batch of indexing operations can save a lot of time. Here is the code for this function:
BM25.prototype.updateIdf = function() { var keys = Object.keys(this.terms); for (var i = 0, len = keys.length; i < len; i++) { var term = keys[i]; var num = (this.totalDocuments - this.terms[term].n + 0.5); var denom = (this.terms[term].n + 0.5); this.terms[term].idf = Math.max(Math.log10(num / denom), 0.01); } };
This is a very simple function, but because it needs to traverse all the words in the entire corpus and update the values of all words, it works a bit slowly. This method is implemented using the standard formula for inverse document frequency (you can find this formula on Wikipedia) - divide the total number of documents by the number of documents containing the word, and then take the logarithm of the quotient get. I made some modifications so that the return value is always greater than 0.
BM25.prototype.search = function(query) { var queryTerms = BM25.Tokenize(query); var results = []; // Look at each document in turn. There are better ways to do this with inverted indices. var keys = Object.keys(this.documents); for (var j = 0, nDocs = keys.length; j < nDocs; j++) { var id = keys[j]; // The relevance score for a document is the sum of a tf-idf-like // calculation for each query term. this.documents[id]._score = 0; // Calculate the score for each query term for (var i = 0, len = queryTerms.length; i < len; i++) { var queryTerm = queryTerms[i]; // We've never seen this term before so IDF will be 0. // Means we can skip the whole term, it adds nothing to the score // and isn't in any document. if (typeof this.terms[queryTerm] === 'undefined') { continue; } // This term isn't in the document, so the TF portion is 0 and this // term contributes nothing to the search score. if (typeof this.documents[id].terms[queryTerm] === 'undefined') { continue; } // The term is in the document, let's go. // The whole term is : // IDF * (TF * (k1 + 1)) / (TF + k1 * (1 - b + b * docLength / avgDocLength)) // IDF is pre-calculated for the whole docset. var idf = this.terms[queryTerm].idf; // Numerator of the TF portion. var num = this.documents[id].terms[queryTerm].count * (this.k1 + 1); // Denomerator of the TF portion. var denom = this.documents[id].terms[queryTerm].count + (this.k1 * (1 - this.b + (this.b * this.documents[id].termCount / this.averageDocumentLength))); // Add this query term to the score this.documents[id]._score += idf * num / denom; } if (!isNaN(this.documents[id]._score) && this.documents[id]._score > 0) { results.push(this.documents[id]); } } results.sort(function(a, b) { return b._score - a._score; }); return results.slice(0, 10); };
Finally, the search() method traverses all documents and gives the BM25 score of each document, and then sorts them in descending order. Of course, it would be unwise to go through every document in the corpus during the search. This issue is addressed in Part Two (Inverted Indexes and Performance).
The code above is well commented and the gist of it is as follows: Calculate the BM25 score for each document and each word. The idf score of the word has been pre-calculated, and you only need to query it when using it. Word frequencies are also pre-computed as part of the document properties. After that, only four simple arithmetic operations are needed. Finally, add a temporary variable _score to each document, then sort it in descending order according to score and return the top 10 results.
Examples, source code, notes and next steps
There are many ways to optimize the above example, we will introduce them in the second part of "Full Text Search", please stay tuned. I hope I can finish it in a few weeks. Below is a list of things that will be mentioned next time:
For this demonstration, I wrote a small Wikipedia crawler that crawled to the first paragraph of quite a few (85,000) Wikipedia articles. Since indexing all 85K files takes around 90 seconds, on my computer I've cut that in half. I don’t want you to waste your laptop power just for a simple full text presentation.
Because indexing is a heavy, modular CPU operation, I treat it as a network worker. Indexing runs on a background thread --You can find the full source code here. You will also find source code references to the stemming algorithm and my stop word list. As for the code license, it is still free for educational purposes and not for any commercial purposes.
Finally, the demo. Once the index is complete, try looking for random things and phrases and Wikipedia will know about it. Note that there are only 40,000 segments indexed, so you may want to try some new topics.