How to Efficiently Calculate String Similarity Percentage in MySQL?
To measure how similar two strings stored in MySQL are, a common choice is the Levenshtein distance: the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into the other.
MySQL has no built-in Levenshtein function, so it has to be defined as a stored function. Here's an example function (levenshtein) that computes the Levenshtein distance between two strings, s1 and s2:
-- DELIMITER is needed in the mysql command-line client so the body's semicolons do not end the statement early
DELIMITER $$
CREATE FUNCTION `levenshtein`(s1 TEXT, s2 TEXT) RETURNS INT DETERMINISTIC
BEGIN
  -- cv0 and cv1 hold the current and previous rows of the distance matrix.
  -- Each cell is stored as a single byte, so inputs are effectively limited to 255 characters.
  DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
  DECLARE s1_char CHAR;
  DECLARE cv0, cv1 TEXT;
  SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
  -- Edge cases: identical strings, or one empty string
  IF s1 = s2 THEN
    RETURN 0;
  ELSEIF s1_len = 0 THEN
    RETURN s2_len;
  ELSEIF s2_len = 0 THEN
    RETURN s1_len;
  ELSE
    -- Initialize the previous row with the values 0..s2_len
    WHILE j <= s2_len DO
      SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
    END WHILE;
    -- Main loop over s1 characters
    WHILE i <= s1_len DO
      SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
      -- Inner loop over s2 characters
      WHILE j <= s2_len DO
        -- Start from the cell to the left plus one (insertion)
        SET c = c + 1;
        IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
        -- Diagonal cell plus the substitution cost
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
        IF c > c_temp THEN SET c = c_temp; END IF;
        -- Cell above plus one (deletion)
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j + 1, 1)), 16, 10) + 1;
        IF c > c_temp THEN SET c = c_temp; END IF;
        SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
      END WHILE;
      SET cv1 = cv0, i = i + 1;
    END WHILE;
  END IF;
  -- Return the Levenshtein distance
  RETURN c;
END$$
DELIMITER ;
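Once the function is created, a few quick calls make a reasonable sanity check. The input pairs below are illustrative choices made here, not part of the original example; the expected values follow from the definition of the distance and from the function's edge-case branches.

```sql
-- 'kitten' -> 'sitten' -> 'sittin' -> 'sitting': three single-character edits
SELECT levenshtein('kitten', 'sitting') AS distance;  -- expected: 3

-- Edge cases handled before the main loop
SELECT levenshtein('abc', 'abc') AS distance;         -- expected: 0 (identical strings)
SELECT levenshtein('', 'abc') AS distance;            -- expected: 3 (length of the non-empty string)
```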
With the Levenshtein distance computed, you can derive a similarity percentage by normalizing the distance by the length of the longer string:
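DELIMITER $$
CREATE FUNCTION `levenshtein_ratio`(s1 TEXT, s2 TEXT) RETURNS INT DETERMINISTIC
BEGIN
  DECLARE s1_len, s2_len, max_len INT;
  -- CHAR_LENGTH counts characters (LENGTH counts bytes), matching what levenshtein() measures
  SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2);
  IF s1_len > s2_len THEN SET max_len = s1_len; ELSE SET max_len = s2_len; END IF;
  -- Two empty strings are identical; this also avoids dividing by zero
  IF max_len = 0 THEN RETURN 100; END IF;
  RETURN ROUND((1 - levenshtein(s1, s2) / max_len) * 100);
END$$
DELIMITER ;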
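The arithmetic behind the ratio can be checked on its own, without either stored function. As an illustrative case chosen here, take the pair 'kitten' and 'sitting': the distance is 3 and the longer string has 7 characters.

```sql
-- similarity = ROUND((1 - distance / max_len) * 100)
-- distance = 3, max_len = 7  ->  (1 - 0.4286) * 100 = 57.14, rounded to 57
SELECT ROUND((1 - 3 / 7) * 100) AS similarity;  -- 57
```

A distance of 0 gives 100, and a distance equal to the length of the longer string gives 0.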
For example, if you have two strings in MySQL:
SET @a = "Welcome to Stack Overflow";
SET @b = "Hello to stack overflow";
Using the levenshtein_ratio function:
SELECT LEVENSHTEIN_RATIO(@a, @b);
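This returns the similarity percentage between the two strings as an integer. Worked by hand, the strings differ by 4 edits when case is ignored (turning "Welcome" into "Hello"), and the longer string has 25 characters, so the result should be ROUND((1 - 4 / 25) * 100) = 84. Under a case-sensitive collation the distance rises to 6 and the result drops to 76, so the exact value depends on the collation in effect.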
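In practice the function is usually applied to a column rather than to two variables. The following is a minimal sketch under stated assumptions: the table demo_titles, its rows, and the length threshold of 10 are all hypothetical and chosen only for illustration. Because the stored function runs once per row and cannot use an index, a cheap pre-filter helps; the one below relies on the fact that the edit distance can never be smaller than the difference in length between the two strings.

```sql
-- Hypothetical table and data, for illustration only
CREATE TEMPORARY TABLE demo_titles (title VARCHAR(100));
INSERT INTO demo_titles (title) VALUES
  ('Welcome to Stack Overflow'),
  ('Hello to stack overflow'),
  ('An unrelated title');

SET @q = 'Welcome to Stack Overflow';

SELECT title, levenshtein_ratio(title, @q) AS similarity
FROM demo_titles
-- Skip rows whose length alone already implies a large distance (threshold is an assumption)
WHERE ABS(CHAR_LENGTH(title) - CHAR_LENGTH(@q)) <= 10
ORDER BY similarity DESC;
```

The tighter the length threshold, the fewer rows reach the expensive function call, at the cost of excluding matches that differ greatly in length.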