
Detailed explanation of Lucene configuration, with illustrated code for creating an index and running full-text retrieval

黄舟
Original, 2017-09-06 10:03:01

Lucene


Lucene is an open source full-text search toolkit. It is not a complete full-text search engine, but an architecture for building one: it provides a complete query engine and indexing engine, and a partial text analysis engine (covering two Western languages, English and German). Its purpose is to give software developers a simple, easy-to-use toolkit for adding full-text search to a target system, or for building a complete full-text search engine on top of it.

Advantages


(1) The index file format is independent of the application platform. Lucene defines an index file format based on 8-bit bytes, so that compatible systems or applications on different platforms can share the index files they create.

(2) Building on the inverted index of traditional full-text search engines, Lucene implements block indexing: it builds small index files for newly added files, which speeds up indexing, and then merges them with the existing index to optimize it.

(3) A well-designed object-oriented architecture lowers the learning curve for extending Lucene and makes it easy to add new functionality.

(4) The text analysis interface is independent of language and file format. The indexer builds index files from a stream of Tokens, so users only need to implement the text analysis interface to support new languages and file formats.

(5) A powerful set of query engines is provided by default, so users get strong query capabilities without writing code themselves. Out of the box, Lucene's query implementation supports Boolean operations, fuzzy queries (Fuzzy Search), grouped queries, and more.
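To make the fuzzy query in point (5) more concrete, here is a minimal sketch of the idea behind it: two terms are treated as a fuzzy match when the edit distance between them is small. This is plain Java with no Lucene dependency; the class name `EditDistanceSketch` and the helper `fuzzyMatches` are hypothetical, and Lucene's own fuzzy matching is considerably more elaborate than this textbook dynamic-programming version.

```java
final class EditDistanceSketch {

    /** Classic Levenshtein distance: minimum number of single-character
     *  insertions, deletions, or substitutions that turn a into b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Hypothetical helper: does an indexed term match the query within maxEdits? */
    static boolean fuzzyMatches(String indexedTerm, String query, int maxEdits) {
        return levenshtein(indexedTerm, query) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("lucene", "lucine"));    // 1
        System.out.println(fuzzyMatches("lucene", "lucne", 2)); // true
        System.out.println(fuzzyMatches("lucene", "solr", 2));  // false
    }
}
```

A search for a slightly misspelled word can therefore still find the intended term, at the cost of comparing the query against more candidate terms than an exact lookup would.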

Concept


First, take a look at this picture, which has been circulating for a long time. My understanding of it is:

On the left: data is collected from various sources, such as the web, text files, and databases, and the collected data is indexed by Lucene.

On the right: the user submits a search, the search is run against the index, and the results are returned.
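The two sides of that picture can be sketched with a toy inverted index in plain Java. This is only an illustration of the concept, not how Lucene stores its index: the class `ToyInvertedIndex` and its methods are hypothetical, and the tokenizer simply splits on non-word characters, which is adequate for English but not for Chinese (that is what an analyzer such as IKAnalyzer is for).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

final class ToyInvertedIndex {
    // term -> ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private final List<String> documents = new ArrayList<>();

    /** Left side of the picture: take collected text and index it. Returns the document id. */
    int add(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Right side of the picture: look a term up in the index and return the matching documents. */
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        SortedSet<Integer> docIds = postings.get(term.toLowerCase(Locale.ROOT));
        if (docIds != null) {
            for (int docId : docIds) {
                hits.add(documents.get(docId));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("Lucene builds an inverted index");
        index.add("A search engine queries the index");
        System.out.println(index.search("index").size()); // 2
        System.out.println(index.search("lucene"));       // [Lucene builds an inverted index]
    }
}
```

The point of the structure is that a search never scans the documents themselves; it only looks up the term in the map built at indexing time.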

Lucene configuration


Configuration is simple: import a few jar packages and create an index directory.

I am using version 6.6.0, the latest at the time of writing. The core package is lucene-core-6.6.0.jar, which you can download from the official website http://lucene.apache.org/. This package is enough for testing.

The index directory is named index. You can choose any name, because its contents are generated automatically. If you are interested, you can download a tool to inspect the index.
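Since the directory name is arbitrary, it can help to build its path in one place. The sketch below resolves a directory under the current working directory (the `user.dir` system property, the same base the code in this article uses) with `java.nio.file.Paths`; the class name `IndexPathSketch` and the method `indexDir` are hypothetical. The resulting `Path` is the kind of value that `FSDirectory.open` accepts.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class IndexPathSketch {

    /** Resolve a named directory under baseDir as an absolute, normalized path. */
    static Path indexDir(String baseDir, String name) {
        return Paths.get(baseDir, name).toAbsolutePath().normalize();
    }

    public static void main(String[] args) {
        // "index" is just the name chosen in this article; any name works
        Path dir = indexDir(System.getProperty("user.dir"), "index");
        System.out.println(dir);
    }
}
```

Passing the base directory and the name as separate arguments avoids hand-written path separators such as "/index".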

The other jar package, IKAnalyzer6.5.0.jar, is an extension package for the analyzer that adds support for Chinese word segmentation. These two directories sit at the same level as the src directory.

Without further ado, here is the code:

First, following the Lucene concept picture above, we need to create an index. I simply declared the exceptions with throws here. In real code they should be handled; I was just being lazy.

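import java.nio.file.FileSystems;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer; // package name assumed from the IK Analyzer jar

public static void createindex() throws Exception {
    // Analyzer is an abstract class; IKAnalyzer is the implementation used here for Chinese word segmentation
    Analyzer analyzer = new IKAnalyzer();
    // Create the IndexWriterConfig object
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    // Open the index directory ("index" under the project directory) and create the IndexWriter.
    // try-with-resources closes both, even if an exception is thrown.
    try (Directory dir = FSDirectory.open(FileSystems.getDefault().getPath(System.getProperty("user.dir") + "/index"));
         IndexWriter iWriter = new IndexWriter(dir, config)) {
        // Clear any previous index
        iWriter.deleteAll();
        // Create the document object
        Document doc = new Document();
        // Add a text field to the document: field name, content, and field type
        doc.add(new Field("fieldname", "坚持到底gl博主的博文,转载请注释出处", TextField.TYPE_STORED));
        // Add the document to the IndexWriter, which writes it to the index files
        iWriter.addDocument(doc);
    }
}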

After running this, you can see that the index files have been created in your index directory.

With the index created, the next step is to query it, passing in the word you want to search for.

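import java.nio.file.FileSystems;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public static void search(String string) throws Exception {
    // Open the same "index" directory that createindex() wrote to
    try (Directory dir = FSDirectory.open(FileSystems.getDefault().getPath(System.getProperty("user.dir") + "/index"));
         DirectoryReader dReader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(dReader);
        // First argument: the field name; second argument: the string the user wants to search for.
        // A TermQuery is not analyzed, so the string must match an indexed token exactly.
        Term t = new Term("fieldname", string);
        // Wrap the term in a query that Lucene can execute
        Query query = new TermQuery(t);
        // Run the query, returning at most 10 results
        TopDocs top = searcher.search(query, 10);
        // Number of hits
        System.out.println("Hits: " + top.totalHits);
        // Iterate over the array of matching documents
        for (ScoreDoc scoreDoc : top.scoreDocs) {
            // Print the stored content of the matching field
            System.out.println(searcher.doc(scoreDoc.doc).get("fieldname"));
        }
    }
}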

And with that, a basic full-text search test is working. It is worth reviewing what you have so far and then extending it.

Here is one more piece of code that helps with understanding: it prints the tokens the analyzer produces.

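import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer; // package name assumed from the IK Analyzer jar

public static void main(String[] args) throws Exception {
    String chString = "坚持到底的文章,转载请注释出处";
    Analyzer analyzer = new IKAnalyzer();
    // The field name "word" only labels the stream; the text is what gets segmented
    TokenStream stream = analyzer.tokenStream("word", chString);
    CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    // Print one token per line
    while (stream.incrementToken()) {
        System.out.println(cta.toString());
    }
    stream.end();
    stream.close();
}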

The display is as follows:

You can also add the following files. One thing to watch out for is their encoding.

The first one: ext.dic, the extended dictionary, which lists words that should be kept together as a single token. For example, word segmentation may split the four-character phrase "坚持到底" ("persist to the end") into "坚持" and "到底". If you add "坚持到底" to this file, it is indexed as a single term.

The third one: stopword.dic, the extended stop-word dictionary. Words that you do not want to appear in the segmentation output, whether on their own or split out of other text, can be listed here, and they will not be matched when searching.

The second one: the configuration that points to the two extended dictionaries above.
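As a rough sketch of the encoding point, the Java below writes a dictionary file with an explicit UTF-8 charset instead of relying on the platform default, and parses a small configuration document. Several things here are assumptions rather than facts from this article: that IK Analyzer expects UTF-8 dictionaries, that the configuration is a Java properties XML document, and that its keys are `ext_dict` and `ext_stopwords`. The class and method names are hypothetical; check the documentation shipped with your IK Analyzer jar before relying on any of it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

final class IkDictionarySketch {

    // Assumed configuration content; the keys ext_dict and ext_stopwords are assumptions
    static final String SAMPLE_CONFIG =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
        + "<properties>\n"
        + "  <entry key=\"ext_dict\">ext.dic</entry>\n"
        + "  <entry key=\"ext_stopwords\">stopword.dic</entry>\n"
        + "</properties>\n";

    /** Write one dictionary entry per line, explicitly encoded as UTF-8. */
    static void writeDictionary(Path file, List<String> words) throws IOException {
        Files.write(file, words, StandardCharsets.UTF_8);
    }

    /** Parse a Java properties XML document into key/value pairs. */
    static Properties parseConfig(String xml) throws IOException {
        Properties props = new Properties();
        props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return props;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("ik-sketch");
        Path ext = dir.resolve("ext.dic");
        writeDictionary(ext, Arrays.asList("坚持到底"));
        // Reading back with the same charset returns the entry unchanged
        System.out.println(Files.readAllLines(ext, StandardCharsets.UTF_8));
        System.out.println(parseConfig(SAMPLE_CONFIG).getProperty("ext_dict"));
    }
}
```

If a dictionary is saved in a different encoding than the one it is read with, its entries no longer equal the text being segmented, so the extra words silently have no effect.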

These are the basics you need to master. There are also many kinds of word segmentation algorithms that are worth exploring further.
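One of the simplest dictionary-based segmentation algorithms is forward maximum matching: at each position, take the longest dictionary word that starts there, and fall back to a single character if none matches. The sketch below is a toy version in plain Java, not IK Analyzer's actual algorithm; the class name `ForwardMaxMatchSketch` and the small dictionary are made up for illustration. It also shows why adding a phrase to an extended dictionary changes the tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ForwardMaxMatchSketch {

    /** Greedy forward maximum matching over a set of known words. */
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the window until it is a dictionary word or a single character
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("坚持", "到底", "文章"));
        System.out.println(segment("坚持到底的文章", dict, 4)); // [坚持, 到底, 的, 文章]

        // Adding the phrase to the dictionary keeps it together as one token
        dict.add("坚持到底");
        System.out.println(segment("坚持到底的文章", dict, 4)); // [坚持到底, 的, 文章]
    }
}
```

Greedy matching is fast but can choose the wrong split for ambiguous text, which is one reason real analyzers use more refined strategies.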
