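Original link: https://i18n.site/blog/tech/search
Preface
After several weeks of development, i18n.site (a purely static Markdown multilingual translation and site-building tool) now supports pure front-end full-text search.
This article shares the technical implementation of i18n.site's pure front-end full-text search. Visit i18n.site to try the search feature.
The code is open source: search kernel / interactive interface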
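Overview of Serverless Full-Text Search Solutions
For small and medium-sized purely static websites such as documentation sites and personal blogs, building a self-hosted full-text search backend is too heavyweight, so serverless full-text search is the more common choice.
Serverless full-text search solutions fall into two broad categories:
The first category relies on third-party search service providers such as algolia.com, which offer front-end components for full-text search.
Such services charge by search volume and, because of compliance issues, are often unavailable to users in mainland China.
They cannot be used offline or on an intranet, which is a significant limitation. This article does not cover them further.
The second category is pure front-end full-text search.
Common pure front-end full-text search tools currently include lunrjs and ElasticLunr.js (a secondary development based on lunrjs).
lunrjs has two ways of building an index, each with its own problems.
- Pre-built index files
Because the index contains every word in the documents, it is large.
Every time a document is added or modified, a new index file must be loaded.
This increases user wait time and consumes a lot of bandwidth.
- Loading documents and building the index dynamically
Building an index is computationally intensive. Rebuilding it on every visit can cause noticeable lag and a poor user experience.
Besides lunrjs, there are other full-text search solutions, such as:
fusejs, which searches by computing the similarity between strings.
This approach performs poorly and is not suited to full-text search (see "Fuse.js long query taking more than 10 seconds, how to optimize?").
TinySearch uses a Bloom filter for search, so it cannot do prefix search (for example, typing goo to find good or google) and cannot provide autocomplete.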
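To address the shortcomings of existing solutions, i18n.site developed a new pure front-end full-text search solution with the following features:
- Supports multilingual search and is compact: the search kernel is only 6.9KB after gzip compression (by comparison, lunrjs is 25KB)
- Builds an inverted index on IndexedDB, with low memory usage and fast performance
- When documents are added or modified, only those documents are re-indexed, reducing computation
- Supports prefix search, showing results in real time as the user types
- Available offline
The details of i18n.site's technical implementation are described below.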
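Multilingual Word Segmentation
Word segmentation uses the browser's native Intl.Segmenter, which all mainstream browsers support.
The CoffeeScript code for segmentation is as follows: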
```coffee
SEG = new Intl.Segmenter 0, granularity: "word"

# Split text into index terms, skipping empty pieces and punctuation.
seg = (txt) =>
  r = []
  for {segment} from SEG.segment(txt)
    for i from segment.split('.')
      i = i.trim()
      if i and !'|`'.includes(i) and !/\p{P}/u.test(i)
        r.push i
  r

export default seg

# Queries are lowercased before segmentation.
export segqy = (q) =>
  seg q.toLocaleLowerCase()
```
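Where:
- /\p{P}/u is a regular expression that matches Unicode punctuation, including: ! " # % & ' ( ) * , - . / : ; ? @ [ ] _ { }. Symbols such as | and ` are not in this class, which is why the code filters them separately.
- split('.') is needed because word segmentation in Firefox does not split on `.`.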
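For readers who do not use CoffeeScript, the same routine can be sketched in plain JavaScript. This is a minimal sketch rather than the project's source: it assumes a runtime that provides Intl.Segmenter, and it passes undefined for the locale where the original passes 0 (both fall back to the default locale).

```javascript
// Word-granularity segmenter using the runtime's default locale.
const SEG = new Intl.Segmenter(undefined, { granularity: "word" });

// Split text into index terms. Empty pieces, the symbols | and `, and any
// piece that contains a punctuation character are dropped.
function seg(txt) {
  const r = [];
  for (const { segment } of SEG.segment(txt)) {
    // A segment such as "a.b" may arrive unsplit, so split on "." here.
    for (let i of segment.split(".")) {
      i = i.trim();
      if (i && !"|`".includes(i) && !/\p{P}/u.test(i)) r.push(i);
    }
  }
  return r;
}

// Queries are lowercased before segmentation.
const segqy = (q) => seg(q.toLocaleLowerCase());

console.log(segqy("Hello, i18n.site!")); // [ 'hello', 'i18n', 'site' ]
```

Because the punctuation test rejects a whole piece, a token such as don't (which contains an apostrophe) is skipped rather than split.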
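Index Construction
Five object stores are created in IndexedDB:
- word: id - word
- doc: id - document URL - document version number
- docWord: document id - array of word ids
- prefix: prefix - array of word ids
- rindex: word id - document id - array of line numbers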
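To make the relationship between these tables concrete, here is a minimal in-memory sketch of how one document might populate them. It is an illustration under stated assumptions, not the project's code: Map objects stand in for the object stores, whitespace splitting stands in for the real segmenter, the names tables, wordId and indexDoc are hypothetical, and the prefix table is left out for brevity.

```javascript
// In-memory stand-ins for four of the object stores (hypothetical shapes).
const tables = {
  word: new Map(),    // word -> word id
  doc: new Map(),     // doc id -> { url, ver }
  docWord: new Map(), // doc id -> array of word ids
  rindex: new Map(),  // word id -> Map(doc id -> array of line numbers)
};

// Return the id of a word, assigning the next free id on first sight.
function wordId(word) {
  if (!tables.word.has(word)) tables.word.set(word, tables.word.size + 1);
  return tables.word.get(word);
}

// Index one document given as an array of lines.
function indexDoc(docId, url, ver, lines) {
  tables.doc.set(docId, { url, ver });
  const seen = new Set();
  lines.forEach((line, lineNo) => {
    for (const w of line.toLowerCase().split(/\s+/).filter(Boolean)) {
      const id = wordId(w);
      seen.add(id);
      if (!tables.rindex.has(id)) tables.rindex.set(id, new Map());
      const byDoc = tables.rindex.get(id);
      if (!byDoc.has(docId)) byDoc.set(docId, []);
      const lineNos = byDoc.get(docId);
      // Record each line at most once per word.
      if (lineNos[lineNos.length - 1] !== lineNo) lineNos.push(lineNo);
    }
  });
  tables.docWord.set(docId, [...seen]);
}

indexDoc(1, "/blog/search", 1, ["full text search", "search in the browser"]);
console.log(tables.rindex.get(wordId("search")).get(1)); // [ 0, 1 ]
```

Keeping docWord alongside rindex means that when a document is removed, its word ids are known without scanning the whole inverted index.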
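Indexing takes an array of document url and version number ver pairs and checks whether each document exists in the doc table. If it does not, an inverted index is created for it. At the same time, the inverted indexes of documents that were not passed in are deleted.
This approach allows incremental indexing and reduces the computational load.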
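The decision of what to index and what to delete can be sketched as a small pure function. This is a sketch under assumptions: planIndexUpdate is a hypothetical name, a Map of url to ver stands in for the doc table, and a document whose version differs from the stored one is treated as needing re-indexing.

```javascript
// stored: Map of url -> ver already recorded in the doc table.
// incoming: array of [url, ver] pairs describing the current site.
function planIndexUpdate(stored, incoming) {
  const toIndex = [];
  const keep = new Set();
  for (const [url, ver] of incoming) {
    keep.add(url);
    // New documents and documents whose version changed need (re)indexing.
    if (stored.get(url) !== ver) toIndex.push(url);
  }
  // Anything stored but no longer passed in has its index removed.
  const toRemove = [...stored.keys()].filter((url) => !keep.has(url));
  return { toIndex, toRemove };
}

const stored = new Map([["/a", 1], ["/b", 1], ["/c", 2]]);
const plan = planIndexUpdate(stored, [["/a", 1], ["/b", 2], ["/d", 1]]);
console.log(plan); // { toIndex: [ '/b', '/d' ], toRemove: [ '/c' ] }
```

Unchanged documents such as /a cost only one lookup, which is where the saving over a full rebuild comes from.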
In the front-end interface, a progress bar for index loading can be displayed so that the initial load does not appear to hang. See "Animated Progress Bar, Based on a Single progress + Pure CSS Implementation" (English / Chinese).
High-Concurrency Writes to IndexedDB
The project is built on idb, an asynchronous wrapper around IndexedDB.
IndexedDB reads and writes are asynchronous. When creating an index, documents are loaded concurrently to build the index.
To avoid losing data when concurrent writes race, you can refer to the following CoffeeScript code, which adds an ing cache between the read and the write to intercept competing writes.
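```coffee
# Returns a push function that serializes appends to the same id.
pusher = =>
  ing = new Map() # id -> Set of values waiting to be written
  (table, id, val)=>
    id_set = ing.get(id)
    if id_set
      # A write for this id is already in flight: queue the value and leave.
      id_set.add val
      return

    id_set = new Set([val])
    ing.set id, id_set
    pre = await table.get(id)
    li = pre?.li or []

    # Keep writing until no new values arrived during the last put.
    loop
      to_add = [...id_set]
      li.push(...to_add)
      await table.put({id,li})
      for i from to_add
        id_set.delete i
      if not id_set.size
        ing.delete id
        break
    return

rindexPush = pusher()
prefixPush = pusher()
```
Prefix Real-Time Search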
Search results are displayed in real time as the user types. For example, entering wor shows words that start with wor, such as words and work.
The search kernel looks up the last word after segmentation in the prefix table to find all words with that prefix, then searches for each of them in turn.
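The prefix lookup can be illustrated with a small in-memory sketch. It rests on assumptions that the source does not confirm: that every leading substring of a word is stored as a key, and that plain word strings are stored rather than word ids. The names buildPrefixTable and expandQuery are hypothetical.

```javascript
// Build a prefix -> words map. The real table stores word ids; plain
// strings are used here to keep the sketch readable.
function buildPrefixTable(words) {
  const prefix = new Map();
  for (const word of words) {
    // Every leading substring of the word, including the word itself.
    for (let n = 1; n <= word.length; n++) {
      const p = word.slice(0, n);
      if (!prefix.has(p)) prefix.set(p, []);
      prefix.get(p).push(word);
    }
  }
  return prefix;
}

// Expand only the last query term through the prefix table; earlier terms
// are treated as complete words.
function expandQuery(prefix, terms) {
  const last = terms[terms.length - 1];
  return { exact: terms.slice(0, -1), candidates: prefix.get(last) ?? [] };
}

const table = buildPrefixTable(["words", "work", "search"]);
console.log(expandQuery(table, ["full", "wor"]));
// { exact: [ 'full' ], candidates: [ 'words', 'work' ] }
```

Only the last term is expanded because it is the one the user is still typing.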
In the front-end interaction, a debounce function (implemented as follows) limits how often user input triggers a search, reducing the computational load.
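```js
// Returns a wrapper that calls func only after `wait` ms without new calls.
export default (wait, func) => {
  let timeout;
  return function (...args) {
    // Cancel the pending call, then schedule a fresh one.
    clearTimeout(timeout);
    timeout = setTimeout(func.bind(this, ...args), wait);
  };
};
```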
Precision and Recall
The search first segments the keywords entered by the user.
Assuming there are N words after segmentation, results containing all N keywords are returned first, followed by results containing N-1, N-2, ..., 1 keywords.
The results displayed first ensure query precision, while the results loaded afterwards (by clicking the "Load More" button) ensure recall.
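The ordering described above can be sketched as follows. This is an eager, in-memory illustration rather than the project's lazy IndexedDB implementation; rankByMatchCount and the postings map (word to a Set of document ids) are hypothetical names.

```javascript
// postings: Map of word -> Set of doc ids containing that word.
// Returns groups of doc ids, best match first: docs with all N query
// words, then N-1, ..., down to 1.
function rankByMatchCount(postings, words) {
  const unique = new Set(words);
  const hits = new Map(); // doc id -> number of query words matched
  for (const w of unique) {
    for (const doc of postings.get(w) ?? []) {
      hits.set(doc, (hits.get(doc) ?? 0) + 1);
    }
  }
  const groups = [];
  for (let n = unique.size; n >= 1; n--) {
    const docs = [...hits].filter(([, c]) => c === n).map(([d]) => d);
    if (docs.length) groups.push({ matched: n, docs });
  }
  return groups;
}

const postings = new Map([
  ["pure", new Set([1, 2])],
  ["frontend", new Set([1, 3])],
  ["search", new Set([1, 2, 3, 4])],
]);
console.log(JSON.stringify(rankByMatchCount(postings, ["pure", "frontend", "search"])));
// [{"matched":3,"docs":[1]},{"matched":2,"docs":[2,3]},{"matched":1,"docs":[4]}]
```

The first group is the high-precision set; the later groups widen recall.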
On-Demand Loading
To improve response speed, the search uses a generator (yield) to load results on demand, yielding results after every query of limit items.
Note that after each yield, a new IndexedDB transaction must be opened before the next query.
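The batching pattern can be sketched with a plain generator. This is a simplified, synchronous illustration: the real search would be asynchronous, and fetchBatch is a hypothetical callback standing in for one IndexedDB query.

```javascript
// Yield matches in batches of `limit`. In the browser, each fetchBatch
// call would open its own transaction: an IndexedDB transaction
// auto-commits once control returns to the event loop with no pending
// requests, so it cannot be held open while waiting for the user.
function* searchInBatches(fetchBatch, limit) {
  let offset = 0;
  for (;;) {
    const batch = fetchBatch(offset, limit);
    if (batch.length) yield batch;
    // A short batch means the data is exhausted.
    if (batch.length < limit) return;
    offset += batch.length;
  }
}

const all = ["a", "b", "c", "d", "e"];
const gen = searchInBatches((o, n) => all.slice(o, o + n), 2);
console.log(gen.next().value); // [ 'a', 'b' ]  (first page)
console.log(gen.next().value); // [ 'c', 'd' ]  (after "Load More")
console.log(gen.next().value); // [ 'e' ]
console.log(gen.next().done);  // true
```

Nothing beyond the first batch is computed until the caller asks for it, which is what keeps the first response fast.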
Offline Availability
The index table does not store the original text, only words, reducing storage space.
Highlighting search results requires reloading the original text, and a service worker avoids repeated network requests for it.
Because the service worker caches every article, once a search has been performed the entire website, including the search feature, becomes available offline.
Optimization for Displaying Markdown Documents
The pure front-end search solution provided by i18n.site is optimized for Markdown documents.
When search results are displayed, the chapter name is shown, and clicking a result navigates to that chapter.
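One way to find the chapter for a hit is to walk up from the matched line to the nearest heading. The sketch below assumes ATX-style headings (lines starting with #) and 0-based line numbers. The source does not describe how the project does this, and chapterOf is a hypothetical name.

```javascript
// Given the lines of a Markdown document and the 0-based line number of a
// hit, return the nearest ATX heading at or above that line, or null.
function chapterOf(lines, lineNo) {
  let inFence = false;
  let heading = null;
  for (let i = 0; i <= lineNo && i < lines.length; i++) {
    const line = lines[i];
    if (/^\s*(`{3,}|~{3,})/.test(line)) {
      // Toggle on code fences so "#" lines inside code are ignored.
      inFence = !inFence;
    } else if (!inFence) {
      const m = /^(#{1,6})\s+(.*?)\s*#*\s*$/.exec(line);
      if (m) heading = { level: m[1].length, title: m[2] };
    }
  }
  return heading;
}

const FENCE = "`".repeat(3);
const md = ["# Search", "intro", "## Index", FENCE, "# not a heading", FENCE, "rindex table"];
console.log(chapterOf(md, 6)); // { level: 2, title: 'Index' }
console.log(chapterOf(md, 1)); // { level: 1, title: 'Search' }
```

Since rindex already stores line numbers per document, a lookup like this needs only the document text that is reloaded for highlighting.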
Summary
Inverted-index full-text search implemented purely in the front end needs no server and is well suited to small and medium-sized websites such as documentation sites and personal blogs.
i18n.site's open-source, self-developed pure front-end search is compact and responsive, and it addresses the shortcomings of existing pure front-end full-text search solutions, providing a better user experience.