Home >Backend Development >PHP Tutorial >PHP function similar_text() principle analysis_PHP tutorial

PHP function similar_text() principle analysis_PHP tutorial

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOriginal: 2016-07-21 14:58:27961browse

PHP has a function similar_text() that calculates the similarity of two strings. It can get a percentage to express the similarity of the two strings. The effect is as follows:

similar_text('aaaa', 'aaaa', $percent);
var_dump($percent);
//float(100)
similar_text('aaaa', 'aaaabbbb', $percent );
var_dump($percent);
//float(66.666666666667)
similar_text('abcdef', 'aabcdefg', $percent);
var_dump($percent);
/ /float(85.714285714286)
Using this function, you can use it for fuzzy search or other functions that require fuzzy matching. Recently, I have involved this function in the feature matching step in the research on verification code recognition.

But what kind of algorithm does this function use? I studied his underlying implementation and summarized it in three steps:

(1) Find the longest segment with the same part in the two strings;
(2) Use the same method to find the longest segment with the same part in the remaining two segments. Analogize until there are no identical parts;
(3) Similarity = sum of the lengths of all identical parts * 2 / sum of the lengths of the two strings;

The source code version I studied is PHP 5.4.6, the relevant code is located in lines 2951~3031 of the file php-5.4.6/ext/standard/string.c. The following is the source code after I added comments.

//Find the longest segment with the same part in the two strings
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int * pos2, int *max)
{
char *p, *q;
char *end1 = (char *) txt1 + len1;
char *end2 = (char *) txt2 + len2;
int l;

*max = 0;
//Start traversing based on the first string
for (p = (char *) txt1; p < end1; p++) {
//Traverse the second string
for (q = (char *) txt2; q < end2; q++) {
//Find the same character, continue to loop, l is the length of the same part
for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
//Bubble method to find the longest l and remember the starting position of the same part
if (l > *max) {
*max = l;
*pos1 = p - txt1;
*pos2 = q - txt2;
}
}
}
}

//Calculate the total length of the same part of the two strings
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
int sum;
int pos1, pos2, max;

//Find out The longest segment of the same part of the two strings
php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);
//This is the initial assignment to sum and the judgment of the max value
//If max is zero, it means that the two strings do not have any same characters, and it will jump out if
if ((sum = max)) {
//Recurse for the first half, the same segment length Accumulation
if (pos1 && pos2) {
sum += php_similar_char(txt1, pos1,
txt2, pos2);
}
//Recurse for the second half, the same segment length is accumulated
if ((pos1 + max < len1) && (pos2 + max < len2)) {
sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
txt2 + pos2 + max, len2 - pos2 - max);
}
}

return sum;
}

//PHP function definition
PHP_FUNCTION(similar_text)
{
char *t1, *t2;
zval **percent = NULL;
int ac = ZEND_NUM_ARGS();
int sim;
int t1_len, t2_len;

//Check parameter validity
if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ss|Z", &t1, &t1_len, &t2, &t2_len, &percent) == FAILURE) {
return;
}

//If there is a third parameter
if (ac > 2) {
convert_to_double_ex(percent);
}

//If two characters The string length is 0, and 0 is returned
if (t1_len + t2_len == 0) {
if (ac > 2) {
Z_DVAL_PP(percent) = 0;
}

RETURN_LONG(0);
}

//Call the above function to calculate the similarity of two strings
sim = php_similar_char(t1, t1_len, t2, t2_len);

//You can see the calculation formula of the third parameter percent
if (ac > 2) {
Z_DVAL_PP(percent) = sim * 200.0 / (t1_len + t2_len);
}

RETURN_LONG(sim);
}

In addition, PHP also provides another function levenshtein() for calculating string similarity. It expresses string similarity by calculating the edit distance of two strings. This is also a very common algorithm. The performance of levenshtein() is better than similar_text(), because from the previous code analysis, we can see that the complexity of similar_text() is O(n^3), n represents the length of the longest string, and levenshtein() The complexity is O(m*n), m and n are the lengths of the two strings respectively.

Statement：

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Previous article：Lion system /usr/bin/php default php.ini_PHP tutorialNext article：Lion system /usr/bin/php default php.ini_PHP tutorial

See more

PHP function similar_text() principle analysis_PHP tutorial

Related articles