Home > Article > Web Front-end > Detailed introduction to the sample code of relevance scoring for JavaScript full-text search
Full-text search, unlike most other problems in the field of machine learning, is a problem that Web programmers often encounter in their daily work. The customer may ask you to provide a search box somewhere, and then you will write a SQL statement like WHERE title LIKE %:query% to implement the search function. In the beginning, this was no problem, until one day, the customer came to you and said, "There was an error in the search!"
Of course, there was no "error" in the search, but the search results were not what the customer wanted. need. Ordinary users don't know how to do exact matching, so the search results they get are of poor quality. To solve the problem, you decide to use full-text search. After a period of boring learning, you opened the FULLTEXT index of MySQL and used more advanced query syntax, such as "MATCH() ... AGAINST() ".
Okay, the problem is solved and the flowers are finished! This is no problem when the database size is small.
But when you have more and more data, you will find that your database is getting slower and slower. MySQL is not a very easy-to-use full-text searchtool. So you decide to use ElasticSearch, refactor your code, and deploy a Lucene driven full-text search cluster. You will find that it works very well, fast and accurately.
At this point you can’t help but wonder: Why is Lucene so awesome?
This article (mainly introduces TF-IDF, Okapi BM-25 and ordinary relevance scores) and the next article (mainly introduces indexes) will tell you about full-text search The Basic Concept behind it.
For each search query, it is easy to define a "relevance score" for each document. When users search, we can use relevance scores to sort instead of document appearance time. This way, the most relevant document will be ranked first, no matter how long ago it was created (of course, sometimes it is also related to the creation time of the document).
There are many, many ways to calculate the correlation between words, but we will start with the simplest, statistics-based method. This method does not require an understanding of the language itself, but determines a "relevance score" by counting word usage, matching, and weighting based on the prevalence of unique words in the document.
This algorithm does not care whether the word is a noun or a verb, nor does it care about the meaning of the word. The only thing it cares about is which words are common and which are rare. If a search query contains both common words and rare words, you'd better give the document containing the rare words a higher score and lower the weight of the common words.
This algorithm is called Okapi BM25. It contains two basic concepts: term frequency (term frequency), abbreviated as term frequency ("TF") , and inverse document frequency (inverse document frequency), abbreviated as ("IDF") . Putting them together is called "TF-IDF", which is a statistical measure used to indicate how important a word (term) is in a document.
Term Frequency ( Term Frequency), referred to as "TF", is a very simple metric: the number of times a specific word appears in a document . You can divide this value by the total number of words in the document to get a score. For example, if there are 100 words in the document and the word ‘the’ appears 8 times, then the TF of ‘the’ is 8 or 8/100 or 8% (depending on how you want to express it).
Inverse Document Frequency (Inverse Document Frequency), referred to as "IDF", is a bit more complicated: the rarer a word, the higher this value. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of the quotient. The rarer the word, the higher the "IDF" it will produce.
If you multiply these two numbers together (TF*IDF), you will get the weight of a word in the document. "Weight" is defined as: How rare is the word and how often does it appear in the document?
You can use this concept for search queries for documents. For each keyword in the query, calculate their TF-IDF score and add them together. The document with the highest score is the document that best matches the query statement.
It’s cool!
The above algorithm is a usable algorithm, but it is not perfect. It gives a statistically based correlation score algorithm, which we can further improve.
Okapi BM25 is considered one of the most advanced ranking algorithms so far (so it is called ElasticSearch). Okapi BM25 adds two adjustable parameters, k1 and b, based on TF-IDF, which represent "term frequency saturation" and "field length specification" respectively. What the hell is this?
To understand "word frequency saturation" intuitively, please imagine two articles of similar length discussing baseball. Also, let's assume that all documents (except these two) don't have much baseball-related content, so the word "baseball" will have a high IDF - it's extremely rare and important. Both articles discuss baseball, and both devote considerable space to it, but one uses the word "baseball" more than the other. So in this case, does one article really have a much different score than another article? Since both documents discuss baseball at large, it doesn't matter whether the word "baseball" appears 40 times or 80 times. In fact, 30 times should be the limit!
This is "word frequency saturation. The native TF-IDF algorithm does not have the concept of saturation, so a document with "baseball" appearing 80 times has a score twice as high as one that appears 40 times. Sometimes, at this time, What we want, but sometimes we don't want this. In addition, Okapi BM25 also has a k1 parameter, which is used to adjust the rate of saturation change. The value of the k1 parameter is generally between 1.2 and 2.0. The lower the value, the faster the saturation process (meaning that the two documents above have the same score, because they both contain a large number of the word "baseball")
Field length reduction ( Field-length normalization reduces the length of a document to the average length of all documents. This is useful for single-field collections (such as ours) to unify documents of different lengths to the same comparison criteria. On. It is more meaningful for dual-field collections (such as "title" and "body"). It can also unify the title and body fields to the same comparison condition. The field length reduction is represented by b, and its value is 0. Between Knowing what each item in the formula is, it must be easy to understand, so we will not mention the formula and go directly to the code:
BM25.Tokenize = function(text) { text = text .toLowerCase() .replace(/\W/g, ' ') .replace(/\s+/g, ' ') .trim() .split(' ') .map(function(a) { return stemmer(a); }); // Filter out stopStems var out = []; for (var i = 0, len = text.length; i < len; i++) { if (stopStems.indexOf(text[i]) === -1) { out.push(text[i]); } } return out; };
We define a simple
static.into the
array of tokens. In this way, we lowercase all tokens (to reduce entropy) and we run Porter Stemmer. algorithm to reduce the amount of entropy and also improve the matching degree ("walking" and "walk" matching are the same) and we also filter out stop words (very common words) in order to further reduce the entropy value. Before I dive into the concepts, please bear with me if I overexplain this section. BM25.prototype.addDocument = function(doc) {
if (typeof doc.id === 'undefined') { throw new Error(1000, 'ID is a required property of documents.'); };
if (typeof doc.body === 'undefined') { throw new Error(1001, 'Body is a required property of documents.'); };
// Raw tokenized list of words
var tokens = BM25.Tokenize(doc.body);
// Will hold unique terms and their counts and frequencies
var _terms = {};
// docObj will eventually be added to the documents database
var docObj = {id: doc.id, tokens: tokens, body: doc.body};
// Count number of terms
docObj.termCount = tokens.length;
// Increment totalDocuments
this.totalDocuments++;
// Readjust averageDocumentLength
this.totalDocumentTermLength += docObj.termCount;
this.averageDocumentLength = this.totalDocumentTermLength / this.totalDocuments;
// Calculate term frequency
// First get terms count
for (var i = 0, len = tokens.length; i < len; i++) {
var term = tokens[i];
if (!_terms[term]) {
_terms[term] = {
count: 0,
freq: 0
};
};
_terms[term].count++;
}
// Then re-loop to calculate term frequency.
// We'll also update inverse document frequencies here.
var keys = Object.keys(_terms);
for (var i = 0, len = keys.length; i < len; i++) {
var term = keys[i];
// Term Frequency for this document.
_terms[term].freq = _terms[term].count / docObj.termCount;
// Inverse Document Frequency initialization
if (!this.terms[term]) {
this.terms[term] = {
n: 0, // Number of docs this term appears in, uniquely
idf: 0
};
}
this.terms[term].n++;
};
// Calculate inverse document frequencies
// This is SLOWish so if you want to index a big batch of documents,
// comment this out and run it once at the end of your addDocuments run
// If you're only indexing a document or two at a time you can leave this in.
// this.updateIdf();
// Add docObj to docs db
docObj.terms = _terms;
this.documents[docObj.id] = docObj;
};
This is where the addDocument() method magically appears. We basically create and maintain two similar ones. Data structure: this.documents. and this.terms. This.documentsis is a database that saves all documents. It saves all the original text of the document, the length information of the document and a list, which is stored in the list. The number and frequency of all words and words in the document. Using this data structure, we can easily and quickly (yes, very fast, only requiring a hash table query time of O(1) time). Answer the following question: How many times does the word 'walk' appear in document #3? We also use another data structure, this.terms. It represents all words in the corpus. Through this data structure, we can answer the following questions in O(1) time: How many documents does the word 'walk' appear in? What is their id?
Finally, we record the length of each document and record the average length of documents in the entire corpus.
Note that in the above code, idf is initialized to 0, and the updateidf() method is
commentedout. This is because this method runs very slowly and only needs to be run once, after indexing. Since running it once can meet the needs, there is no need to run it 5,000 times. Commenting it out first and running it after a large batch of indexing operations can save a lot of time. The following is the code of this
function:
BM25.prototype.updateIdf = function() { var keys = Object.keys(this.terms); for (var i = 0, len = keys.length; i < len; i++) { var term = keys[i]; var num = (this.totalDocuments - this.terms[term].n + 0.5); var denom = (this.terms[term].n + 0.5); this.terms[term].idf = Math.max(Math.log10(num / denom), 0.01); } };
这是一个非常简单的函数,但是由于它需要遍历整个语料库中的所有词语,并更新所有词语的值,这就导致它工作的就有点慢。这个方法的实现采用了逆向文档频率 (inverse document frequency) 的标准公式(你可以在 Wikipedia 上找到这个公式)— 由总文件数目除以包含该词语之文件的数目,再将得到的商取对数得到。我做了一些修改,让返回值一直大于0。
BM25.prototype.search = function(query) { var queryTerms = BM25.Tokenize(query); var results = []; // Look at each document in turn. There are better ways to do this with inverted indices. var keys = Object.keys(this.documents); for (var j = 0, nDocs = keys.length; j < nDocs; j++) { var id = keys[j]; // The relevance score for a document is the sum of a tf-idf-like // calculation for each query term. this.documents[id]._score = 0; // Calculate the score for each query term for (var i = 0, len = queryTerms.length; i < len; i++) { var queryTerm = queryTerms[i]; // We've never seen this term before so IDF will be 0. // Means we can skip the whole term, it adds nothing to the score // and isn't in any document. if (typeof this.terms[queryTerm] === 'undefined') { continue; } // This term isn't in the document, so the TF portion is 0 and this // term contributes nothing to the search score. if (typeof this.documents[id].terms[queryTerm] === 'undefined') { continue; } // The term is in the document, let's go. // The whole term is : // IDF * (TF * (k1 + 1)) / (TF + k1 * (1 - b + b * docLength / avgDocLength)) // IDF is pre-calculated for the whole docset. var idf = this.terms[queryTerm].idf; // Numerator of the TF portion. var num = this.documents[id].terms[queryTerm].count * (this.k1 + 1); // Denomerator of the TF portion. var denom = this.documents[id].terms[queryTerm].count + (this.k1 * (1 - this.b + (this.b * this.documents[id].termCount / this.averageDocumentLength))); // Add this query term to the score this.documents[id]._score += idf * num / denom; } if (!isNaN(this.documents[id]._score) && this.documents[id]._score > 0) { results.push(this.documents[id]); } } results.sort(function(a, b) { return b._score - a._score; }); return results.slice(0, 10); };
最后,search() 方法遍历所有的文档,并给出每个文档的 BM25 分数,然后按照由大到小的顺序进行排序。当然了,在搜索过程中遍历语料库中的每个文档实是不明智。这个问题在 Part Two (反向索引和性能)中得到解决。
上面的代码已经做了很好的注释,其要点如下:为每个文档和每个词语计算 BM25 分数。词语的 idf 分数已经预先计算好了,使用的时候只需要查询即可。词语频率作为文档属性的一部分也已经预先计算好了。之后只需要简单的四则运算即可。最后给每个文档增加一个临时变量 _score,然后根据 score 做降序排列并返回前 10 个结果。
上面的示例有很多方法进行优化,我们将在 “全文搜索”的第二部分中介绍它们,欢迎继续收看。我希望我能在几个星期之后完成它。下面列了下次将要提到的内容:
反向索引和快速搜索
快速索引
更好的搜索结果
为了这个演示,我编了一个小的维基百科爬虫,爬到相当多(85000)维基百科文章的第一段。由于索引到所有85K文件需要90秒左右,在我的电脑我已经削减了一半。不想让你们仅仅为了一个简单的全文本演示浪费你的笔记本电脑电量。
因为索引是一个繁重的、模块化的CPU操作,我把它当成一个网络工作者。索引运行在一个后台线程上–在这里你可以找到完整的源代码。你也会发现到词干算法和我的停用词列表中的源代码参考。至于代码许可,还是一如既往为教育目的而免费,而不用于任何商业目的。
最后是演示。一旦索引完成,尝试寻找随机的东西和短语,维基百科会知道的。注意,只有40000段的索引,所以你可能要尝试一些新的话题。
The above is the detailed content of Detailed introduction to the sample code of relevance scoring for JavaScript full-text search. For more information, please follow other related articles on the PHP Chinese website!