Sharing a Chinese word segmentation search tool for ASP.NET

黄舟 (original)
2017-10-08 09:34:52, 2275 views

jieba is a Chinese word segmentation library for Python. It has been ported to the .NET platform as jieba.NET, and in my view it can fully replace the combination of Lucene.Net and Pangu word segmentation.

I wrote this because of an interview yesterday, where I was asked: how would you implement keyword search on a website? All I talked about was SQL fuzzy queries, SQL statement optimization, and caching. I had come across keyword segmentation before, but I did not know of a mature segmentation and retrieval library on the .NET platform. Java has Lucene, and although Lucene has been ported to .NET, the port is updated slowly. When I was learning Python, I had noticed its word segmentation search and word cloud libraries, so I wondered whether any Python segmentation library had been ported to .NET. I looked up Python's jieba library, and sure enough it had been ported!
Original introduction: the .NET version of jieba Chinese word segmentation, jieba.NET
The common word segmentation component on the .NET platform is Pangu, but it has not been updated for a long time. The most obvious sign is the built-in dictionary: jieba's dictionary has about 500,000 entries, while Pangu's has about 170,000, which leads to noticeably different segmentation results. In addition, for out-of-vocabulary words (words missing from the dictionary), jieba "adopts an HMM model based on the word-forming ability of Chinese characters and uses the Viterbi algorithm", and the results look good.
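To give a feel for what "an HMM model ... and the Viterbi algorithm" means here, below is a minimal sketch, not jieba's actual code. It assumes the usual four character tags (B = begins a word, M = middle, E = ends a word, S = single-character word). Every number in it, the `singles` set, and the `Emit` function are made up for illustration; a real segmenter loads these tables from a trained corpus. It uses top-level statements, so it needs C# 9 or later.

```csharp
using System;
using System.Collections.Generic;

// Toy parameters, invented for this sketch; real tables are trained on a corpus.
const double Impossible = -1e9;   // stands in for log(0)
const string States = "BMES";     // Begin, Middle, End, Single

// A sentence can only start with B or S.
var start = new Dictionary<char, double> { ['B'] = Math.Log(0.6), ['S'] = Math.Log(0.4) };

// Allowed tag transitions; any pair not listed is treated as impossible.
var trans = new Dictionary<(char, char), double>
{
    [('B', 'M')] = Math.Log(0.3), [('B', 'E')] = Math.Log(0.7),
    [('M', 'M')] = Math.Log(0.4), [('M', 'E')] = Math.Log(0.6),
    [('E', 'B')] = Math.Log(0.5), [('E', 'S')] = Math.Log(0.5),
    [('S', 'B')] = Math.Log(0.5), [('S', 'S')] = Math.Log(0.5),
};

// Hypothetical emission model: a few function characters prefer the S tag.
var singles = new HashSet<char>("的了都等");

double Emit(char state, char ch)
{
    bool single = singles.Contains(ch);
    if (state == 'S') return Math.Log(single ? 0.7 : 0.1);   // toy scores, not normalized
    return Math.Log(single ? 0.1 : 0.3);
}

double Trans(char prev, char next) =>
    trans.TryGetValue((prev, next), out var v) ? v : Impossible;

// Returns the highest-scoring tag sequence, one tag per character.
string Viterbi(string text)
{
    if (text.Length == 0) return "";
    int n = text.Length, k = States.Length;
    var score = new double[n, k];   // best log-score of a path ending in state s at position t
    var back = new int[n, k];       // previous state on that best path

    for (int s = 0; s < k; s++)
        score[0, s] = (start.TryGetValue(States[s], out var p0) ? p0 : Impossible)
                      + Emit(States[s], text[0]);

    for (int t = 1; t < n; t++)
        for (int s = 0; s < k; s++)
        {
            double best = double.NegativeInfinity;
            int arg = 0;
            for (int p = 0; p < k; p++)
            {
                double cand = score[t - 1, p] + Trans(States[p], States[s]);
                if (cand > best) { best = cand; arg = p; }
            }
            score[t, s] = best + Emit(States[s], text[t]);
            back[t, s] = arg;
        }

    // A sentence cannot end in the middle of a word, so only E (2) or S (3) may be last.
    int last = score[n - 1, 2] >= score[n - 1, 3] ? 2 : 3;
    var tags = new char[n];
    for (int t = n - 1; t >= 0; t--)
    {
        tags[t] = States[last];
        last = back[t, last];
    }
    return new string(tags);
}

// Cuts the text after every E or S tag.
List<string> Segment(string text)
{
    string tags = Viterbi(text);
    var words = new List<string>();
    int begin = 0;
    for (int i = 0; i < text.Length; i++)
        if (tags[i] == 'E' || tags[i] == 'S')
        {
            words.Add(text.Substring(begin, i - begin + 1));
            begin = i + 1;
        }
    return words;
}

Console.WriteLine(string.Join("/ ", Segment("技术性器件的安装工作")));
```

With toy numbers like these the cuts will not match jieba's. The point is only the mechanics: score every tag at every position, remember the best predecessor, then walk back from the best final tag.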

We can also search for it and install it directly from the NuGet package manager in VS2013:

I saw someone in the comments say that a segmenter is doing well if it can correctly split the sentence "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作" (a well-known stress test, because many adjacent characters can be grouped into the wrong words). I tested it myself:


using System;
using JiebaNet.Segmenter;

class Program
{
    static void Main()
    {
        // The test sentence is reused for every mode
        const string text = "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作";
        var segmenter = new JiebaSegmenter();
        Console.WriteLine("Original query: {0}", text);

        // Full mode: list every dictionary word that can be found in the text
        var segments1 = segmenter.Cut(text, cutAll: true);
        Console.WriteLine("[Full mode]: {0}", string.Join("/ ", segments1));
        // Accurate mode is the default
        var segments2 = segmenter.Cut(text);
        Console.WriteLine("[Accurate mode]: {0}", string.Join("/ ", segments2));
        // Same call as above: accurate mode also uses the HMM model by default
        var segments3 = segmenter.Cut(text);
        Console.WriteLine("[New word recognition]: {0}", string.Join("/ ", segments3));
        // Search engine mode
        var segments4 = segmenter.CutForSearch(text);
        Console.WriteLine("[Search engine mode]: {0}", string.Join("/ ", segments4));
        // Same default call again, shown under the ambiguity-resolution label
        var segments5 = segmenter.Cut(text);
        Console.WriteLine("[Ambiguity resolution]: {0}", string.Join("/ ", segments5));

        Console.Read(); // keep the console window open
    }
}

Output:

Not bad. Apart from full mode, every mode splits the sentence the way a human would read it.
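Coming back to the interview question, here is a rough sketch of how segmented terms could feed a keyword search through a small in-memory inverted index. This is my own illustration, not part of jieba.NET. `BuildIndex`, `Search`, and the sample pages are hypothetical names, and `Bigrams` is only a stand-in tokenizer so the sketch runs without any package. With jieba.NET installed, one would presumably pass `segmenter.CutForSearch` in its place. It needs C# 9 or later for top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in tokenizer (an assumption for this sketch): overlapping two-character grams.
IEnumerable<string> Bigrams(string text)
{
    for (int i = 0; i + 1 < text.Length; i++)
        yield return text.Substring(i, 2);
}

// Maps each term to the ids of the documents that contain it.
Dictionary<string, HashSet<int>> BuildIndex(
    IReadOnlyList<string> docs, Func<string, IEnumerable<string>> tokenize)
{
    var index = new Dictionary<string, HashSet<int>>();
    for (int id = 0; id < docs.Count; id++)
        foreach (var term in tokenize(docs[id]))
        {
            if (!index.TryGetValue(term, out var ids))
                index[term] = ids = new HashSet<int>();
            ids.Add(id);
        }
    return index;
}

// Returns the ids of documents that contain every term of the query (AND semantics).
List<int> Search(
    Dictionary<string, HashSet<int>> index, string query,
    Func<string, IEnumerable<string>> tokenize)
{
    var terms = tokenize(query).Distinct().ToList();
    if (terms.Count == 0 || !terms.All(index.ContainsKey))
        return new List<int>();

    var result = new HashSet<int>(index[terms[0]]);
    foreach (var term in terms.Skip(1))
        result.IntersectWith(index[term]);
    return result.OrderBy(id => id).ToList();
}

// Hypothetical sample data for the demo
var pages = new[] { "24口交换机的安装工作", "技术性器件的安装", "交换机等技术性器件" };
var pageIndex = BuildIndex(pages, Bigrams);
Console.WriteLine(string.Join(", ", Search(pageIndex, "交换机", Bigrams)));   // prints 0, 2
```

Compared with a SQL `LIKE '%...%'` query, the lookup no longer scans every row. The quality of the results then depends mostly on how well the tokenizer splits the text, which is where a real segmenter earns its keep.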
