How can a custom dictionary and a custom stop-word list optimize Jieba word segmentation and improve topic extraction from scenic-spot reviews?

Improving topic extraction from scenic-spot reviews: optimizing the Jieba word segmentation strategy

When Jieba is used for Chinese word segmentation and combined with an LDA model to extract topics from scenic-spot reviews, poor segmentation often reduces the accuracy of the extracted topics. This article proposes two optimization strategies for this problem: building a custom dictionary and building a custom stop-word list.

The existing code does not segment words accurately enough, so the topic keywords that the LDA model extracts are inaccurate. The following methods are recommended to improve it:

Strategy 1: Build a custom dictionary

Scenic-spot reviews use a specialized vocabulary, so a custom dictionary tailored to scenic spots is crucial. You can follow these steps:

  1. Reverse-engineer the Sogou travel dictionary: Parse Sogou's tourism dictionary (or another large tourism-related dictionary) and extract vocabulary relevant to scenic-spot reviews, such as scenic-spot names, service types, and facility names.
  2. Supplement domain vocabulary: Manually add words that are missing from the Sogou dictionary but appear frequently in scenic-spot reviews. This requires analyzing a large volume of review data to identify keywords that the existing dictionary splits incorrectly or does not recognize.
  3. Integrate and clean up: Merge the extracted and manually added vocabulary into a single custom dictionary, then deduplicate and normalize the entries so the dictionary stays consistent.
  4. Load the custom dictionary: Load the custom dictionary before segmenting with Jieba so that its entries take priority during word segmentation.
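
The merge, deduplicate, and normalize step can be sketched with the standard library alone. This is a minimal illustration, not the original author's code: the function names and the sample terms are hypothetical, and the choice of NFKC normalization is an assumption. The output follows Jieba's user-dictionary format of one entry per line, where the word may be followed by an optional frequency separated by a space.

```python
import unicodedata


def normalize_term(term):
    """Apply NFKC normalization (e.g. full-width to half-width letters) and trim whitespace."""
    return unicodedata.normalize("NFKC", term).strip()


def build_user_dict(*sources):
    """Merge several term lists, dropping blanks and duplicates, keeping first-seen order."""
    seen = set()
    merged = []
    for terms in sources:
        for raw in terms:
            term = normalize_term(raw)
            # Fields in a jieba user dictionary are space-separated, so this
            # sketch simply skips terms that contain spaces.
            if not term or " " in term or term in seen:
                continue
            seen.add(term)
            merged.append(term)
    return merged


def format_user_dict(terms, freq=None):
    """Render one entry per line: the word, optionally followed by a frequency."""
    lines = [term if freq is None else f"{term} {freq}" for term in terms]
    return "".join(f"{line}\n" for line in lines)


if __name__ == "__main__":
    sogou_terms = ["玻璃栈道", "观光索道"]     # hypothetical terms extracted from a dictionary
    manual_terms = ["观光索道 ", "ＶＩＰ通道"]  # hypothetical manual additions
    print(format_user_dict(build_user_dict(sogou_terms, manual_terms)), end="")
```

Saving the returned text to a UTF-8 file gives a dictionary that can be passed to jieba.load_userdict.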

Strategy 2: Build a custom stop-word list

Besides the custom dictionary, it is also important to optimize the stop-word list.

  1. Use open-source resources on GitHub: Many open-source Chinese stop-word lists are available on GitHub. Choose a suitable one as the base.
  2. Add stop words specific to scenic-spot reviews: Add words that appear frequently in scenic-spot reviews but contribute nothing to topic extraction, such as modal particles and colloquial expressions.
  3. Keep the stop-word list lean: A stop-word list that is too large can delete important information by mistake.
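
The three steps above can be sketched as follows. This is an illustrative example under stated assumptions: the function names, the idea of a "protected" word list, and the sample words are all hypothetical. It merges a base list with domain-specific additions, removes words you want to keep as topic keywords, and counts how often each stop word would be deleted so that the list can be reviewed by hand.

```python
from collections import Counter


def build_stop_words(base_words, domain_words=(), protected_words=()):
    """Merge base and domain stop words, then drop any word marked as protected."""
    stop_words = {w.strip() for w in base_words if w.strip()}
    stop_words.update(w.strip() for w in domain_words if w.strip())
    return stop_words - {w.strip() for w in protected_words}


def removed_token_counts(tokenized_docs, stop_words):
    """Count how often each stop word would be removed, for manual review."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(t for t in tokens if t in stop_words)
    return counts


if __name__ == "__main__":
    base = ["的", "了", "排队"]   # hypothetical entries from an open-source list
    extra = ["哈哈", "真的"]      # hypothetical review-specific additions
    keep = ["排队"]               # "queueing" may be a real topic in reviews
    stop_words = build_stop_words(base, extra, keep)
    docs = [["景色", "真的", "美"], ["排队", "真的", "哈哈"]]
    print(removed_token_counts(docs, stop_words).most_common())
```

Reviewing the most frequently removed words is one simple way to spot entries that are deleting useful information.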

Code improvement suggestions:

Integrate the custom dictionary and the custom stop-word list into the code, and modify the tokenize and delete_stopwords functions accordingly:

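import jieba
from gensim import corpora, models
# ... (Other imports)

# Load the custom dictionary before any segmentation takes place
jieba.load_userdict("path/to/your/custom_dictionary.txt")

# Load the custom stop-word list into a set for fast membership tests
with open("path/to/your/custom_stopwords.txt", encoding="utf-8") as f:
    custom_stop_words = {line.strip() for line in f if line.strip()}

# Broadcast the stop words to the Spark executors
# (assumes `spark` is the SparkSession created in the existing code)
broadcastVar = spark.sparkContext.broadcast(custom_stop_words)

# ... (The tokenize and delete_stopwords functions are modified to use custom_stop_words)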
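
The article does not show the modified tokenize and delete_stopwords functions, so the following is only one possible shape for them, not the original code. The `cut` and `min_len` parameters are hypothetical additions: `cut` lets you swap in a different segmenter (it defaults to jieba.lcut), and `min_len` optionally drops very short tokens.

```python
def tokenize(text, cut=None):
    """Segment text into tokens and drop whitespace-only tokens."""
    if cut is None:
        # Imported lazily so the helper can be exercised without jieba installed.
        import jieba
        cut = jieba.lcut
    return [tok.strip() for tok in cut(text) if tok.strip()]


def delete_stopwords(tokens, stop_words, min_len=1):
    """Keep tokens that are not stop words and are at least min_len characters long."""
    return [t for t in tokens if t not in stop_words and len(t) >= min_len]


if __name__ == "__main__":
    # A stand-in segmenter that splits on "/" so the example runs without jieba.
    fake_cut = lambda s: s.split("/")
    tokens = tokenize("玻璃栈道/很/刺激", cut=fake_cut)
    print(delete_stopwords(tokens, {"很"}))
```

In the Spark setup above, the stop-word set passed to delete_stopwords would be broadcastVar.value.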

Together, these two strategies should improve Jieba's segmentation accuracy, reduce the influence of noise words, and in turn improve the accuracy of the topics that the LDA model extracts from scenic-spot reviews. Remember to replace "path/to/your/custom_dictionary.txt" and "path/to/your/custom_stopwords.txt" with the actual paths to your custom dictionary and stop-word list. Also consider tuning LDA model parameters such as num_topics and passes for the best results.
