How to Optimize Fuzzy Matching of Emails and Phone Numbers in Elasticsearch?

Patricia Arquette
2024-10-30 15:51:02

Fuzzy Matching Emails and Phone Numbers in Elasticsearch

Elasticsearch supports partial ("fuzzy") matching on fields such as emails and phone numbers. This article shows how to keep those queries fast by moving the work to index time: custom analyzers and ngram token filters store the fragments of each value in the index, so a search becomes a simple term lookup.

Custom Analyzers for Fuzzy Matching

To fuzzy match emails and phone numbers efficiently, create custom analyzers in Elasticsearch. A custom analyzer combines optional char filters that rewrite the raw text, a tokenizer that splits the text into tokens, and token filters that transform those tokens.

Email Analyzer

The index_email_analyzer uses the standard tokenizer to split the input into tokens. It then applies the lowercase, name_ngram_filter, and trim filters, which convert each token to lowercase, generate ngrams of 1 to 20 characters, and strip leading and trailing whitespace.

The search_email_analyzer also uses the standard tokenizer but applies only the lowercase and trim filters. The ngram filter is not needed at search time because the search term is looked up among the ngrams already stored in the index.
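To make the index-time and search-time behaviour concrete, here is a minimal Python sketch that imitates the two email analyzers. It is an approximation, not Elasticsearch code: the helper names are hypothetical, and the tokenizer step is assumed rather than implemented (the input is given as the tokens the standard tokenizer is expected to produce for "John.Doe@Gmail.com", with "@" acting as a token boundary).

```python
def ngrams(token, min_gram=1, max_gram=20):
    """Return every substring of `token` whose length is in [min_gram, max_gram]."""
    return {
        token[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(token) - size + 1)
    }


def index_email_terms(tokens):
    """Approximate index_email_analyzer: lowercase, ngram, then trim each token."""
    terms = set()
    for token in tokens:
        terms |= {gram.strip() for gram in ngrams(token.lower())}
    terms.discard("")
    return terms


def search_email_terms(tokens):
    """Approximate search_email_analyzer: lowercase and trim, no ngrams."""
    return {token.lower().strip() for token in tokens}


# Assumed tokenizer output for "John.Doe@Gmail.com".
indexed = index_email_terms(["John.Doe", "Gmail.com"])

# The analyzed search term is one of the stored ngrams, so it matches.
print(search_email_terms(["GMAIL.com"]) <= indexed)  # True

# The literal "@" never reaches the index, so this exact term is absent.
print("@gmail.com" in indexed)  # False
```

The sketch also shows the cost of this design: every token is expanded into all of its substrings, which speeds up searches at the price of a larger index.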

Phone Analyzer

For phone numbers, the index_phone_analyzer first applies the digit_only char filter, which strips every non-digit character so that only the digits are analyzed. The digit_edge_ngram_tokenizer then emits edge ngrams of 1 to 15 digits, all anchored at the start of the number, so every prefix of a phone number is indexed.

The search_phone_analyzer applies the same digit_only char filter and then uses the keyword tokenizer, which emits the whole input as a single token. That token is compared against the indexed prefixes, so a search matches every phone number that starts with the entered digits.
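The prefix behaviour can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Elasticsearch's implementation: the function names are hypothetical, and the sample number is made up.

```python
import re


def digits_only(text):
    """Mimic the digit_only char filter: drop every run of non-digit characters."""
    return re.sub(r"\D+", "", text)


def index_phone_terms(phone, min_gram=1, max_gram=15):
    """Approximate index_phone_analyzer: prefixes of the digit string, 1 to 15 long."""
    digits = digits_only(phone)
    longest = min(len(digits), max_gram)
    return [digits[:size] for size in range(min_gram, longest + 1)]


def search_phone_term(text):
    """Approximate search_phone_analyzer: the cleaned input as one token."""
    return digits_only(text).strip()


indexed = index_phone_terms("136-5550-1234")
print(indexed[:3])                          # ['1', '13', '136']
print(search_phone_term("136") in indexed)  # True: "136" is a stored prefix
print(search_phone_term("555") in indexed)  # False: "555" is not at the start
```

Because only prefixes are stored, a search for digits from the middle of a number finds nothing; matching arbitrary substrings would need a full ngram tokenizer instead of an edge ngram one.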

Implementing the Analyzers

Here's a sample index definition that registers these analyzers and applies them to the email and phone fields. Because name_ngram_filter spans 1 to 20 characters, index.max_ngram_diff is raised to 19; Elasticsearch 7 and later otherwise reject an ngram filter whose max_gram exceeds min_gram by more than 1. The email_url_analyzer is defined but not attached to either field.

PUT myindex
{
  "settings": {
    "index": {
      "max_ngram_diff": 19
    },
    "analysis": {
      "analyzer": {
        "email_url_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": [ "trim" ]
        },
        "index_phone_analyzer": {
          "type": "custom",
          "char_filter": [ "digit_only" ],
          "tokenizer": "digit_edge_ngram_tokenizer",
          "filter": [ "trim" ]
        },
        "search_phone_analyzer": {
          "type": "custom",
          "char_filter": [ "digit_only" ],
          "tokenizer": "keyword",
          "filter": [ "trim" ]
        },
        "index_email_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "name_ngram_filter", "trim" ]
        },
        "search_email_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "trim" ]
        }
      },
      "char_filter": {
        "digit_only": {
          "type": "pattern_replace",
          "pattern": "\\D+",
          "replacement": ""
        }
      },
      "tokenizer": {
        "digit_edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 15,
          "token_chars": [ "digit" ]
        }
      },
      "filter": {
        "name_ngram_filter": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "email": {
        "type": "text",
        "analyzer": "index_email_analyzer",
        "search_analyzer": "search_email_analyzer"
      },
      "phone": {
        "type": "text",
        "analyzer": "index_phone_analyzer",
        "search_analyzer": "search_phone_analyzer"
      }
    }
  }
}

Performing Fuzzy Queries

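To match emails at "gmail.com" or phone numbers starting with "136", send search requests like the ones below. They use match queries so that each field's search analyzer is applied to the input. A term query skips analysis, so it would look up the literal term "@gmail.com", which is never indexed because the standard tokenizer drops the "@".

POST myindex/_search
{
  "query": {
    "match": {
      "email": "@gmail.com"
    }
  }
}

POST myindex/_search
{
  "query": {
    "match": {
      "phone": "136"
    }
  }
}

The ngrams are generated once, at index time, by the index analyzers. At search time the input is only normalized by the search analyzer and then looked up among those stored ngrams. Note that the email query is a substring match on each token: it also finds addresses whose domain merely contains "gmail.com".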
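For readers who issue searches from application code, the following is a minimal Python sketch using only the standard library. The helper names are hypothetical, and the cluster URL (http://localhost:9200, no authentication) is an assumption. It builds match queries so that the search analyzer configured on each field is applied to the input.

```python
import json
import urllib.request


def match_query(field, text):
    """Build a search body that matches `text` against `field` via its search analyzer."""
    return {"query": {"match": {field: text}}}


def search(base_url, index, body):
    """POST the body to <base_url>/<index>/_search and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Print the request bodies; nothing is sent to a cluster here.
for field, text in (("email", "@gmail.com"), ("phone", "136")):
    print(json.dumps(match_query(field, text)))

# With a reachable cluster (assumed URL), a search could be run like this:
# hits = search("http://localhost:9200", "myindex", match_query("phone", "136"))["hits"]["hits"]
```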