Detailed explanation of how to use Node.js to segment text content and extract keywords
This article introduces how to use Node.js to segment text content and extract keywords. Readers who need it can use it as a reference.
Before discussing the technology, let's start with something cute: you don't understand the world of foodies~~
Articles on Zhongcheng Translation have tags. Users can quickly filter articles of interest by tag, and related articles can be recommended based on tag associations. At present, however, tags are set when an article is recommended, they are all in English, and manually set tags are inevitably neither standardized nor complete. Although tags can be edited by hand after an article is published, we cannot expect users or administrators to keep them appropriate all the time, so we need a tool that generates tags automatically.
Among current open-source word segmentation tools, jieba is a powerful component with excellent performance. Fortunately, it has a Node.js version: nodejieba.
Installing and using nodejieba is simple:
npm install nodejieba

var nodejieba = require("nodejieba");

var result = nodejieba.cut("帝国主义要把我们的地瓜分掉");
console.log(result);
//[ '帝国主义', '要', '把', '我们', '的', '地', '瓜分', '掉' ]

result = nodejieba.cut('土地,俺老孙的金箍棒在哪里?');
console.log(result);
//[ '土地', ',', '俺', '老', '孙', '的', '金箍棒', '在', '哪里', '?' ]

result = nodejieba.cut('大圣,您的金箍棒就棒在特别配您的头型!');
console.log(result);
//[ '大圣', ',', '您', '的', '金箍棒', '就', '棒', '在', '特别', '配', '您', '的', '头型', '!' ]
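The segmentation result keeps punctuation marks as separate tokens. If only words are wanted, for example before counting them, the punctuation can be dropped afterwards. The following is a minimal sketch that works on a plain token array; `dropPunctuation` is a hypothetical helper and not part of nodejieba.

```javascript
// Hypothetical helper, not part of nodejieba: remove tokens that consist
// only of punctuation, symbols, or whitespace.
function dropPunctuation(tokens) {
  const onlyPunctuation = /^[\p{P}\p{S}\s]+$/u;
  return tokens.filter(token => !onlyPunctuation.test(token));
}

// Token array copied from the segmentation output above.
const tokens = ['土地', ',', '俺', '老', '孙', '的', '金箍棒', '在', '哪里', '?'];
console.log(dropPunctuation(tokens));
// [ '土地', '俺', '老', '孙', '的', '金箍棒', '在', '哪里' ]
```

The Unicode property escapes (`\p{P}`, `\p{S}`) cover full-width Chinese punctuation as well as ASCII punctuation, which is why the `u` flag is required.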
We can load our own dictionary and set the weight and part of speech for each word in the dictionary:
Edit user.utf8:
地瓜 9999 n
金箍 9999 n
棒就棒在 9999
Then load the dictionary through nodejieba.load.
var nodejieba = require("nodejieba");

nodejieba.load({
  userDict: './user.utf8',
});

var result = nodejieba.cut("帝国主义要把我们的地瓜分掉");
console.log(result);
//[ '帝国主义', '要', '把', '我们', '的', '地瓜', '分', '掉' ]

result = nodejieba.cut('土地,俺老孙的金箍棒在哪里?');
console.log(result);
//[ '土地', ',', '俺', '老', '孙', '的', '金箍棒', '在', '哪里', '?' ]

result = nodejieba.cut('大圣,您的金箍棒就棒在特别配您的头型!');
console.log(result);
//[ '大圣', ',', '您', '的', '金箍', '棒就棒在', '特别', '配', '您', '的', '头型', '!' ]
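Why does a large weight change where the text is split? One way to picture it, assuming the segmenter prefers the split whose words have the highest combined frequency, is the toy dynamic-programming sketch below. It is not nodejieba's implementation: `segment`, the dictionary, and every frequency in it are made up for illustration.

```javascript
// Toy frequency-weighted segmenter (illustration only, NOT nodejieba's code).
// dict: Map of word -> made-up frequency.
function segment(text, dict) {
  const chars = Array.from(text);
  const n = chars.length;
  let total = 0;
  for (const freq of dict.values()) total += freq;

  // best[i]: best log-probability of any split of the first i characters.
  // prev[i]: start index of the last word in that best split.
  const best = new Array(n + 1).fill(-Infinity);
  const prev = new Array(n + 1).fill(0);
  best[0] = 0;

  for (let end = 1; end <= n; end++) {
    for (let start = 0; start < end; start++) {
      const word = chars.slice(start, end).join('');
      // Unknown single characters get frequency 1 so a split always exists.
      const freq = dict.get(word) || (end - start === 1 ? 1 : 0);
      if (freq === 0) continue;
      const score = best[start] + Math.log(freq / total);
      if (score > best[end]) {
        best[end] = score;
        prev[end] = start;
      }
    }
  }

  // Walk back from the end to recover the words of the best split.
  const words = [];
  for (let end = n; end > 0; end = prev[end]) {
    words.unshift(chars.slice(prev[end], end).join(''));
  }
  return words;
}

// Made-up frequencies: "地瓜" is rare, so "地 / 瓜分 / 掉" wins.
const baseDict = new Map([['地', 100], ['瓜分', 50], ['分', 100], ['掉', 100], ['地瓜', 1]]);
console.log(segment('地瓜分掉', baseDict)); // [ '地', '瓜分', '掉' ]

// Raising the weight of "地瓜" flips the result to "地瓜 / 分 / 掉".
const boostedDict = new Map(baseDict).set('地瓜', 9999);
console.log(segment('地瓜分掉', boostedDict)); // [ '地瓜', '分', '掉' ]
```

The same reasoning suggests why a very high weight can also produce unwanted splits, as with '棒就棒在' in the last result above: a heavily weighted entry wins wherever its characters appear.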
In addition to word segmentation, we can use nodejieba to extract keywords:
const content = `
HTTP, HTTP/2 and performance optimization
The purpose of this article is to tell you, through comparison, why you should migrate from HTTP to HTTPS and why support for HTTP/2 should be added. Before comparing HTTP and HTTP/2, let’s first look at what HTTP is.
What is HTTP
HTTP is a set of rules for communication on the World Wide Web. HTTP is an application layer protocol and runs on top of the TCP/IP layer. When a user requests a web page through a browser, HTTP is responsible for processing the request and establishing a connection between the web server and the client.
With HTTP/2, performance can be improved without using sprite images, compression, or splicing. However, this does not mean that these techniques should not be used. But this has clearly demonstrated the necessity for us to move from HTTP/1.1 to HTTP/2.
`;
const nodejieba = require("nodejieba");

const result = nodejieba.extract(content, 20);
console.log(result);
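The output is similar to the following. (The sample text was originally Chinese and is shown here in English translation, which is why the extracted keywords are Chinese.)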
[ { word: 'HTTP', weight: 140.8704516850025 },
  { word: '请求', weight: 14.23018001394 },
  { word: '应该', weight: 14.052171126120001 },
  { word: '万维网', weight: 12.2912397395 },
  { word: 'TCP', weight: 11.739204307083542 },
  { word: '1.1', weight: 11.739204307083542 },
  { word: 'Web', weight: 11.739204307083542 },
  { word: '雪碧图', weight: 11.739204307083542 },
  { word: 'HTTPS', weight: 11.739204307083542 },
  { word: 'IP', weight: 11.739204307083542 },
  { word: '应用层', weight: 11.2616203224 },
  { word: '客户端', weight: 11.1926274509 },
  { word: '浏览器', weight: 10.8561552143 },
  { word: '拼接', weight: 9.85762638414 },
  { word: '比较', weight: 9.5435285574 },
  { word: '网页', weight: 9.53122979951 },
  { word: '服务器', weight: 9.41204128224 },
  { word: '使用', weight: 9.03259988558 },
  { word: '必要性', weight: 8.81927328699 },
  { word: '添加', weight: 8.0484751722 } ]
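The weights above are consistent with a TF-IDF style score: several words share the value 11.739204307083542, and the weight of 'HTTP' is exactly 12 times that value. The sketch below illustrates that reading of the numbers. It is an inference from the output, not nodejieba's source; `scoreTokens`, the IDF table, and the idea of a default IDF for words missing from the table are assumptions.

```javascript
// Sketch of a TF-IDF style score: occurrence count times an IDF value.
// ASSUMPTION: words missing from the IDF table fall back to a default IDF.
function scoreTokens(tokens, idfTable, defaultIdf) {
  const counts = new Map();
  for (const token of tokens) {
    counts.set(token, (counts.get(token) || 0) + 1);
  }
  return Array.from(counts, ([word, count]) => ({
    word,
    weight: count * (idfTable.has(word) ? idfTable.get(word) : defaultIdf),
  })).sort((a, b) => b.weight - a.weight);
}

// Default IDF taken from the shared weight in the output above.
const DEFAULT_IDF = 11.739204307083542;

// Hypothetical token list: 'HTTP' twelve times and 'TCP' once.
const sampleTokens = [...new Array(12).fill('HTTP'), 'TCP'];
console.log(scoreTokens(sampleTokens, new Map(), DEFAULT_IDF));
// 'HTTP' scores about 140.87 and 'TCP' scores 11.739204307083542
```

Read this way, a word that occurs often in the text but is rare in general ranks highest, which matches 'HTTP' leading the list.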
We add some new keywords to the user dictionary (user.utf8):
性能
HTTP/2
After loading the updated dictionary, the output is as follows:
[ { word: 'HTTP', weight: 105.65283876375187 },
  { word: 'HTTP/2', weight: 58.69602153541771 },
  { word: '请求', weight: 14.23018001394 },
  { word: '应该', weight: 14.052171126120001 },
  { word: '性能', weight: 12.61259281884 },
  { word: '万维网', weight: 12.2912397395 },
  { word: 'IP', weight: 11.739204307083542 },
  { word: 'HTTPS', weight: 11.739204307083542 },
  { word: '1.1', weight: 11.739204307083542 },
  { word: 'TCP', weight: 11.739204307083542 },
  { word: 'Web', weight: 11.739204307083542 },
  { word: '雪碧图', weight: 11.739204307083542 },
  { word: '应用层', weight: 11.2616203224 },
  { word: '客户端', weight: 11.1926274509 },
  { word: '浏览器', weight: 10.8561552143 },
  { word: '拼接', weight: 9.85762638414 },
  { word: '比较', weight: 9.5435285574 },
  { word: '网页', weight: 9.53122979951 },
  { word: '服务器', weight: 9.41204128224 },
  { word: '使用', weight: 9.03259988558 } ]
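The dictionary entries used in this article take one of three shapes per line: a word alone, a word with a weight, or a word with a weight and a part of speech. If the tag vocabulary is kept somewhere else, the file can be generated rather than edited by hand. The following is a minimal sketch using only Node's built-in modules; `formatUserDict` and the temporary output path are hypothetical.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: turn entries into dictionary text, one entry per line.
// Shapes follow the article: "word", "word weight", or "word weight tag".
function formatUserDict(entries) {
  return entries.map(({ word, weight, tag }) => {
    if (tag !== undefined && weight === undefined) {
      throw new Error(`"${word}": this sketch expects a weight whenever a tag is given`);
    }
    return [word, weight, tag].filter(part => part !== undefined).join(' ');
  }).join('\n') + '\n';
}

const dictText = formatUserDict([
  { word: '地瓜', weight: 9999, tag: 'n' },
  { word: '棒就棒在', weight: 9999 },
  { word: 'HTTP/2' },
]);

// Written to a temporary location here; the article itself uses ./user.utf8.
const dictPath = path.join(os.tmpdir(), 'user.utf8');
fs.writeFileSync(dictPath, dictText, 'utf8');
console.log(dictText);
```

Generating the file keeps the dictionary and the tag vocabulary in sync, at the cost of one extra build step.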
On this basis, we use a whitelist to keep only the words that are suitable as tags:
const content = `
HTTP, HTTP/2 and performance optimization
The purpose of this article is to tell you through comparison why you should migrate from HTTP to HTTPS, and why support for HTTP/2 should be added. Before comparing HTTP and HTTP/2, let’s first look at what HTTP is.
What is HTTP
HTTP is a set of rules for communication on the World Wide Web. HTTP is an application layer protocol that runs on top of the TCP/IP layer. When a user requests a web page through a browser, HTTP is responsible for processing the request and establishing a connection between the web server and the client.
With HTTP/2, performance can be improved without using sprite images, compression, or splicing. However, this does not mean that these techniques should not be used. But this has clearly demonstrated the necessity for us to move from HTTP/1.1 to HTTP/2.
`;
const nodejieba = require("nodejieba");

nodejieba.load({
  userDict: './user.utf8',
});

const result = nodejieba.extract(content, 20);

const tagList = ['HTTPS', 'HTTP', 'HTTP/2', 'Web', '浏览器', '性能'];
console.log(result.filter(item => tagList.indexOf(item.word) >= 0));
Finally we get:
[ { word: 'HTTP', weight: 105.65283876375187 },
  { word: 'HTTP/2', weight: 58.69602153541771 },
  { word: '性能', weight: 12.61259281884 },
  { word: 'HTTPS', weight: 11.739204307083542 },
  { word: 'Web', weight: 11.739204307083542 },
  { word: '浏览器', weight: 10.8561552143 } ]
This is the result we want.
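The whitelist step can be wrapped in a small reusable function that also caps the number of tags. The following is a minimal sketch operating on the `{ word, weight }` objects shown above; `pickTags` and its `maxTags` parameter are hypothetical, and the sample keywords are copied from the earlier output.

```javascript
// Hypothetical helper: keep whitelisted keywords, highest weight first,
// and return at most maxTags tag names.
function pickTags(keywords, whitelist, maxTags = 5) {
  const allowed = new Set(whitelist);
  return keywords
    .filter(item => allowed.has(item.word))
    .sort((a, b) => b.weight - a.weight)
    .slice(0, maxTags)
    .map(item => item.word);
}

// A few entries copied from the extraction output above.
const sampleKeywords = [
  { word: 'HTTP', weight: 105.65283876375187 },
  { word: 'HTTP/2', weight: 58.69602153541771 },
  { word: '请求', weight: 14.23018001394 },
  { word: '性能', weight: 12.61259281884 },
  { word: 'HTTPS', weight: 11.739204307083542 },
  { word: '浏览器', weight: 10.8561552143 },
];
const allowedTags = ['HTTPS', 'HTTP', 'HTTP/2', 'Web', '浏览器', '性能'];

console.log(pickTags(sampleKeywords, allowedTags, 3));
// [ 'HTTP', 'HTTP/2', '性能' ]
```

A `Set` keeps each lookup constant-time, which matters little for six tags but helps once the whitelist grows to a full tag vocabulary.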
The above is the basic method of using the word segmentation library nodejieba. In the future, we can use it to automatically analyze and add corresponding tags to the translations published by Zhongcheng Translation, so as to provide translators and readers with a better user experience.