Home >Backend Development >Python Tutorial >Levenshtein Distance: The Ultimate Guide to Measuring Textual Similarity
The Levenshtein distance, also known as the edit distance, is an essential metric for assessing the similarity between two strings. It counts the minimum number of operations necessary to transform one string into another. These operations include:
This concept is at the heart of many modern applications, such as spelling correction, fuzzy search, and DNA comparison.
The Levenshtein distance between two strings (A) and (B) of lengths (n) and (m), respectively, can be calculated using a dynamic approach. We define a matrix (D) of dimensions ((n 1) times (m 1)), where each (D[i][j]) represents the minimum cost to transform the (i) first characters of (A) into the (j) first characters of (B).
The recurrence formula is:
Here is a simple Python implementation to calculate the Levenshtein distance:
def levenshtein_distance(a, b): n, m = len(a), len(b) dp = [[0] * (m + 1) for _ in range(n + 1)] for i in range(n + 1): for j in range(m + 1): if i == 0: dp[i][j] = j elif j == 0: dp[i][j] = i elif a[i - 1] == b[j - 1]: dp[i][j] = dp[i - 1][j - 1] else: dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]) return dp[n][m] # Exemple d'utilisation print(levenshtein_distance("kitten", "sitting")) # Sortie : 3
Spell checkers use Levenshtein to suggest close words in case of typos. For example, if you type helo, it might suggest hello or hero.
In search engines, the Levenshtein distance allows you to obtain results even when the user makes typing errors.
In bioinformatics, this distance helps measure the similarity between two DNA sequences, each operation representing a possible mutation.
Identity theft detection systems can compare user input with existing data, taking into account small textual differences.
The classic algorithm uses a full matrix, which can be memory intensive. Fortunately, we can optimize using only two lines of memory, because each calculation ( D[i][j] ) depends only on ( D[i-1][j] ), ( D[i][j-1] ), and (D[i-1][j-1]).
def levenshtein_distance(a, b): n, m = len(a), len(b) dp = [[0] * (m + 1) for _ in range(n + 1)] for i in range(n + 1): for j in range(m + 1): if i == 0: dp[i][j] = j elif j == 0: dp[i][j] = i elif a[i - 1] == b[j - 1]: dp[i][j] = dp[i - 1][j - 1] else: dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]) return dp[n][m] # Exemple d'utilisation print(levenshtein_distance("kitten", "sitting")) # Sortie : 3
The Levenshtein distance is a powerful, versatile and widely used tool in many fields. Although it is simple to understand, its complex optimizations and applications demonstrate its value in modern systems.
Exploring further, we can also turn to variants like the Damerau-Levenshtein distance, which takes transpositions into account. You are now equipped to integrate this tool into your projects or simply impress your peers with your in-depth knowledge!
The above is the detailed content of Levenshtein Distance: The Ultimate Guide to Measuring Textual Similarity. For more information, please follow other related articles on the PHP Chinese website!