Twelve functions of the Nlpir Parser search and mining intelligent platform
Text mining has become an increasingly popular and important research area within data mining. General data mining focuses on relational, transactional, and structured data in data warehouses; text mining instead studies text databases made up of large numbers of documents from various data sources. These documents may contain structured fields such as title, author, publication date, and length, as well as unstructured text components such as the abstract and body. Moreover, their content is written in human natural language, whose semantics are difficult for computers to process. Traditional information retrieval technology therefore can no longer keep up with the growing need to process large volumes of text. In response, text mining methods have been proposed to compare documents, rank them by importance and relevance, or find patterns and trends across multiple documents.
The Nlpir Parser search and mining intelligent platform is a basic tool set for developing network search, natural language understanding, and text mining technology. The development platform consists of multiple middleware components. Each middleware API can be seamlessly integrated into customers' complex application systems, is compatible with operating systems such as Windows, Linux, and FreeBSD, and can be called from development languages such as Java, C, and C#.
The Nlpir Parser search and mining intelligent platform is also a software package designed specifically for processing and refining raw text collections. It provides a visual display of the results each middleware component produces and can be used as a small-scale data processing tool. Users can process their own data with it.
Twelve major functions of Nlpir Parser search and mining intelligent platform:
1. Accurate full-text retrieval: Supports data types such as text, numbers, dates, and strings, with efficient multi-field search. It supports query syntax such as AND/OR/NOT and NEAR proximity, and supports retrieval in Uyghur, Tibetan, Mongolian, Arabic, Korean, and other minority languages. It can be seamlessly integrated with existing text processing systems and database systems.
2. New word discovery: Mines a list of meaningful new words from a file collection, which can be used to compile the user's professional dictionary. The new words can also be edited and annotated further and then imported into the word segmentation dictionary, improving the accuracy of the word segmentation system and adapting it to new changes in language.
3. Word segmentation: Segments the original corpus into words, automatically recognizes out-of-vocabulary words such as names of people, places, and institutions, and applies new-word tags and part-of-speech tags. User-defined dictionaries can be imported during analysis.
4. Statistical analysis and terminology translation: Based on the segmentation and annotation results, the system automatically computes unigram word frequency statistics and bigram transition probability statistics (it counts how often two words occur next to each other, from which the transition probability is derived). For commonly used terms, the corresponding English explanation is given automatically.
5. Text clustering and hot spot analysis: It can automatically analyze hot events from large-scale data and provide key feature descriptions of event topics. It is also suitable for hotspot analysis of long texts and short texts such as text messages and Weibo.
6. Classification filtering: Based on pre-specified rules and examples, the system automatically filters out samples that meet the needs from a large number of documents.
7. Positive and negative analysis: For pre-specified analysis targets and example samples, the system automatically extracts positive and negative scores and sample sentences from massive document collections.
8. Automatic summary: It can automatically extract the essence of the content of a single or multiple articles, making it convenient for users to quickly browse the text content.
9. Keyword extraction: From a single article or a collection of articles, several words or phrases representing the central idea of the article can be extracted, which can be used for refined reading, semantic query, and quick matching.
10. Document deduplication: It can quickly and accurately determine whether there are records with the same or similar content in a file collection or database, and find all duplicate records at the same time.
11. HTML text extraction: Automatically discards navigation-type pages and strips HTML tags and distracting text such as navigation and advertisements from web pages, returning only the valuable text content. Suitable for preprocessing and analysis of large-scale Internet information.
12. Automatic encoding recognition and conversion: Automatically identifies the encoding of the content and converts it uniformly to GBK.
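To make the statistics in function 4 more concrete, here is a minimal Python sketch of unigram frequency and bigram transition probability counting over text that has already been segmented. It is not the Nlpir Parser API: the function name `ngram_statistics` and the sample sentences are assumptions made for illustration, and the transition probability is estimated in the usual way as the count of a word pair divided by the number of times the left word is followed by any word.

```python
from collections import Counter

def ngram_statistics(sentences):
    """Compute unigram counts, bigram counts, and transition probabilities.

    `sentences` is a list of token lists, i.e. text that has already been
    segmented into words. (Hypothetical helper, not part of Nlpir Parser.)
    """
    unigram = Counter()
    bigram = Counter()
    left_totals = Counter()  # how often each word appears as the left word of a pair

    for tokens in sentences:
        unigram.update(tokens)
        for left, right in zip(tokens, tokens[1:]):
            bigram[(left, right)] += 1
            left_totals[left] += 1

    # P(right | left) = count(left, right) / count(left followed by any word)
    transition = {
        (left, right): count / left_totals[left]
        for (left, right), count in bigram.items()
    }
    return unigram, bigram, transition

# Illustrative, pre-segmented sample data (an assumption for this sketch).
sentences = [
    ["text", "mining", "is", "popular"],
    ["text", "mining", "finds", "patterns"],
    ["text", "retrieval", "is", "old"],
]
unigram, bigram, transition = ngram_statistics(sentences)
print(unigram["text"])                    # 3
print(bigram[("text", "mining")])         # 2
print(transition[("text", "mining")])     # 2 of the 3 pairs starting with "text"
```

In this sample, "text" is followed by "mining" twice and by "retrieval" once, so the estimated probability of "mining" following "text" is 2/3.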
In most cases, text mining data sets are very large and still growing, so it is impractical to store and process all of the data on a single machine. It is therefore necessary to study text mining algorithms that can run in parallel, so that text mining tasks can be distributed across a computer cluster. This brings together the needs of cloud computing and data-intensive computing, which is itself a growing field.
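The split-and-merge pattern behind parallel text mining can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Nlpir Parser works. The function names (`count_shard`, `merge_counts`, `parallel_word_count`) are hypothetical, words are split on whitespace rather than by a real word segmenter, and a local thread pool stands in for the machines of a cluster. In CPython, threads do not speed up CPU-bound counting, so the pool here only shows the structure of the computation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_shard(documents):
    """Map step: count word frequencies within one shard of documents."""
    counts = Counter()
    for doc in documents:
        # Whitespace splitting is a simplification; Chinese text would
        # need a word segmenter at this point.
        counts.update(doc.lower().split())
    return counts

def merge_counts(partials):
    """Reduce step: merge the per-shard counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def parallel_word_count(documents, workers=4):
    """Split documents into shards, count each shard, then merge."""
    shards = [documents[i::workers] for i in range(workers)]
    # The thread pool stands in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_shard, shards))
    return merge_counts(partials)

# Illustrative sample data (an assumption for this sketch).
docs = [
    "text mining finds patterns",
    "Text retrieval ranks documents",
    "mining large text collections",
]
totals = parallel_word_count(docs, workers=2)
print(totals["text"], totals["mining"])  # 3 2
```

Because each shard is counted independently and the merge step only adds counts together, the same two functions could in principle run on separate machines, with only the small per-shard counters sent back for merging.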