首頁 >web前端 >js教程 >純前端倒排全文搜索

純前端倒排全文搜索

Mary-Kate Olsen原創: 2024-10-14 14:27:29317瀏覽

原文連結：https://i18n.site/blog/tech/search

順序

經過幾週的開發，i18n.site（純靜態Markdown多語言翻譯網站架設工具）現已支援純前端全文搜尋。

純前端倒排全文搜索

本文將分享i18n.site純前端全文搜尋的技術實作。存取 i18n.site 體驗搜尋功能。

程式碼開源：搜尋核心/互動介面

無伺服器全文搜尋解決方案概述

對於文件/個人部落格等中小型純靜態網站，建立自建全文搜尋後端過於繁重，無服務全文搜尋是更常見的選擇。

無伺服器全文搜尋解決方案分為兩大類：

第一個涉及第三方搜尋服務供應商，例如 algolia.com，它們提供用於全文搜尋的前端元件。

此類服務需要根據搜尋量付費，且由於合規問題通常無法向中國大陸用戶提供。

它們不能離線或在內部網路上使用，並且有很大的限制。本文不再詳述。

第二類是純前端全文檢索。

目前常見的純前端全文檢索工具有lunrjs和ElasticLunr.js（基於lunrjs的二次開發）。

lunrjs 有兩種建立索引的方法，但都有各自的問題。

預建索引檔

由於索引包含了文件中的所有單字，因此其尺寸較大。
每次新增或修改文件時，都必須載入新的索引檔。
這會增加用戶等待時間並消耗大量頻寬。

動態載入文件並建立索引

建立索引是一項計算密集型任務，每次訪問時重建索引可能會導致明顯的延遲，從而導致糟糕的用戶體驗。

除了lunrjs之外，還有其他全文搜尋解決方案，例如：

fusejs，透過計算字串之間的相似度進行搜尋。

此方案效能較差，不適合全文檢索（參考Fuse.js 查詢時間長超過10秒，如何最佳化？）。

TinySearch使用布隆過濾器進行搜索，無法進行前綴搜索（例如輸入goo搜索good或google），無法實現自動完成效果。

針對現有解決方案的缺陷，i18n.site開發了全新的純前端全文搜尋解決方案，具有以下特點：

支援多語言搜索，體積小巧；使用 gzip 打包後，搜索內核只有 6.9KB（相比之下，lunrjs 為 25KB）
基於IndexedDB建立倒排索引，記憶體佔用量低，效能快
新增/修改文件時，僅對新增或修改的文件重新索引，減少計算量
支援前綴搜索，使用者輸入時即時顯示搜尋結果
離線可用性

以下將介紹i18n.site技術實現的細節。

多語言分詞

分詞使用瀏覽器原生的Intl.Segmenter，所有主流瀏覽器都支援。

純前端倒排全文搜索

分詞的coffeescript程式碼如下：

SEG = new Intl.Segmenter 0, granularity: "word"

seg = (txt) =>
  r = []
  for {segment} from SEG.segment(txt)
    for i from segment.split('.')
      i = i.trim()
      if i and !'|`'.includes(i) and !/\p{P}/u.test(i)
        r.push i
  r

export default seg

export segqy = (q) =>
  seg q.toLocaleLowerCase()

地點：

/p{P}/ 是符合標點符號的正規表示式，包括： ! " # $ % & ' ( ) * , - . / : ; ? @ [ ] ^ _ { | } ~. . p>
- split('.' )是因為Firefox瀏覽器分詞不分詞。
指數建構

IndexedDB 中建立了 5 個物件儲存表：
透過傳入文件 url 和版本號 ver 的數組，檢查 doc 表中文件是否存在。如果不存在，則建立倒排索引。同時，未傳入文件的倒排索引將被刪除。

此方法允許增量索引，減少計算負載。

In the front-end interface, a progress bar for index loading can be displayed to avoid lag during the initial load. See "Animated Progress Bar, Based on a Single progress + Pure CSS Implementation" English / Chinese.

IndexedDB High Concurrent Writing

The project is developed based on the asynchronous encapsulation of IndexedDB, idb.

IndexedDB reads and writes are asynchronous. When creating an index, documents are loaded concurrently to build the index.

To avoid data loss due to concurrent writes, you can refer to the following coffeescript code, which adds a ing cache between reading and writing to intercept competitive writes.

`coffee
pusher = =>
ing = new Map()
(table, id, val)=>
id_set = ing.get(id)
if id_set
id_set.add val
return
```
id_set = new Set([val])
ing.set id, id_set
pre = await table.get(id)
li = pre?.li or []

loop
  to_add = [...id_set]
  li.push(...to_add)
  await table.put({id,li})
  for i from to_add
    id_set.delete i
  if not id_set.size
    ing.delete id
    break
return
```
rindexPush = pusher()
prefixPush = pusher()
`

Prefix Real-Time Search

To display search results in real-time as the user types, for example, showing words like words and work that start with wor when wor is entered.

The search kernel uses the prefix table for the last word after segmentation to find all words with that prefix and search sequentially.

An anti-shake function, debounce (implemented as follows), is used in the front-end interaction to reduce the frequency of searches triggered by user input, thus minimizing computational load.

js
export default (wait, func) => {
var timeout;
return function(...args) {
clearTimeout(timeout);
timeout = setTimeout(func.bind(this, ...args), wait);
};
}

Precision and Recall

The search first segments the keywords entered by the user.

Assuming there are N words after segmentation, the results are first returned with all keywords, followed by results with N-1, N-2, ..., 1 keywords.

The search results displayed first ensure query precision, while subsequent loaded results (click the "Load More" button) ensure recall.

On-Demand Loading

To improve response speed, the search uses the yield generator to implement on-demand loading, returning results after each limit query.

Note that after each yield, a new IndexedDB query transaction must be opened for the next search.

Prefix Real-Time Search

To display search results in real-time as the user types, for example, showing words like words and work that start with wor when wor is entered.

The search kernel uses the prefix table for the last word after segmentation to find all words with that prefix and search sequentially.

An anti-shake function, debounce (implemented as follows), is used in the front-end interaction to reduce the frequency of searches triggered by user input, thus minimizing computational load.

js
export default (wait, func) => {
var timeout;
return function(...args) {
clearTimeout(timeout);
timeout = setTimeout(func.bind(this, ...args), wait);
};
}

Offline Availability

The index table does not store the original text, only words, reducing storage space.

Highlighting search results requires reloading the original text, and using service worker can avoid repeated network requests.

Also, because service worker caches all articles, once a search is performed, the entire website, including search functionality, becomes offline available.

Optimization for Displaying MarkDown Documents

The pure front-end search solution provided by i18n.site is optimized for MarkDown documents.

When displaying search results, the chapter name is shown, and clicking navigates to that chapter.

Summary

The pure front-end implementation of inverted full-text search, without the need for a server, is very suitable for small to medium-sized websites such as documents and personal blogs.

i18n.site's open-source self-developed pure front-end search is compact, responsive, and addresses the various shortcomings of current pure front-end full-text search solutions, providing a better user experience.