Below I first give the source code and a simple test for two existing versions of a string truncation function, and then a more practical version of my own. Note that the truncation problems discussed here all concern UTF-8 encoded Chinese strings.
discuz version
The code is as follows:
/**
 * [discuz] Truncate a string in plain PHP, without mb_substr or other extensions installed.
 * A Chinese character is counted as 2 characters.
 * @param $string The string to truncate
 * @param $length The number of characters to keep
 * @param $dot    The tail appended in place of the truncated part
 * @return The truncated string
 */
function cutstr($string, $length, $dot = '...') {
    // If the string is no longer than the requested length, return it unchanged.
    // Using strlen (a byte count) here is a serious weakness. For example, to cut
    // 4 Chinese characters from "新年快乐" ("Happy New Year") you must know how many
    // bytes those 4 characters occupy; otherwise the result may be "新年快乐...".
    if (strlen($string) <= $length) {
        return $string;
    }
    // Temporarily unescape the htmlspecialchars entities in the original string,
    // wrapping each one in chr(1) markers so it can be restored later
    $pre = chr(1);
    $end = chr(1);
    $string = str_replace(
        array('&amp;', '&quot;', '&lt;', '&gt;'),
        array($pre . '&' . $end, $pre . '"' . $end, $pre . '<' . $end, $pre . '>' . $end),
        $string
    );
    $strcut = ''; // Initialize the return value
    // UTF-8 branch (this check is incomplete: CHARSET might also be written 'utf8').
    // CHARSET is a constant that must be defined elsewhere.
    if (strtolower(CHARSET) == 'utf-8') {
        // $n: byte pointer; $tn: byte length of the last character read;
        // $noc: number of characters counted so far
        $n = $tn = $noc = 0;
        while ($n < strlen($string)) {
            $t = ord($string[$n]);
            if ($t == 9 || $t == 10 || (32 <= $t && $t <= 126)) {
                // Half-width (ASCII) character: advance 1 byte, count 1
                $tn = 1;
                $n++;
                $noc++;
            } elseif (194 <= $t && $t <= 223) {
                // Two-byte character: advance 2 bytes, count 2
                $tn = 2;
                $n += 2;
                $noc += 2;
            } elseif (224 <= $t && $t <= 239) {
                // Three-byte character (most Chinese characters): advance 3 bytes, count 2
                $tn = 3;
                $n += 3;
                $noc += 2;
            } elseif (240 <= $t && $t <= 247) {
                $tn = 4;
                $n += 4;
                $noc += 2;
            } elseif (248 <= $t && $t <= 251) {
                $tn = 5;
                $n += 5;
                $noc += 2;
            } elseif ($t == 252 || $t == 253) {
                $tn = 6;
                $n += 6;
                $noc += 2;
            } else {
                $n++;
            }
            // Leave the loop once the requested count is reached
            if ($noc >= $length) {
                break;
            }
        }
        // If the last character overshot the limit, drop it to make room for $dot
        if ($noc > $length) {
            $n -= $tn;
        }
        $strcut = substr($string, 0, $n);
    } else {
        // Non-UTF-8 double-byte encodings: a byte above 127 starts a 2-byte character
        for ($i = 0; $i < $length; $i++) {
            $strcut .= ord($string[$i]) > 127 ? $string[$i] . $string[++$i] : $string[$i];
        }
    }
    // Restore the original htmlspecialchars entities
    $strcut = str_replace(
        array($pre . '&' . $end, $pre . '"' . $end, $pre . '<' . $end, $pre . '>' . $end),
        array('&amp;', '&quot;', '&lt;', '&gt;'),
        $strcut
    );
    // Drop a trailing entity that was cut in half (an unmatched chr(1) marker)
    $pos = strrpos($strcut, chr(1));
    if ($pos !== false) {
        $strcut = substr($strcut, 0, $pos);
    }
    return $strcut . $dot; // Append $dot to the truncated string and return it
}
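The byte ranges tested in the loop above are UTF-8 lead bytes: the first byte of a character tells you how many bytes the character occupies. The following is only a minimal sketch of that idea, not discuz code; `utf8_sequence_length` and `utf8_char_count` are hypothetical names, and the sketch follows current UTF-8, which stops at 4-byte sequences (the 5- and 6-byte branches in cutstr come from the older definition). It assumes the source file itself is saved as UTF-8.

```php
<?php
// Hypothetical helper: byte length of the UTF-8 character that starts with $lead.
// Returns 0 for a continuation byte or an invalid lead byte.
function utf8_sequence_length($lead) {
    $t = ord($lead);
    if ($t < 0x80) return 1;                 // ASCII
    if ($t >= 0xC2 && $t <= 0xDF) return 2;  // 194..223
    if ($t >= 0xE0 && $t <= 0xEF) return 3;  // 224..239, most Chinese characters
    if ($t >= 0xF0 && $t <= 0xF4) return 4;  // 240..244
    return 0;
}

// Hypothetical helper: count characters by hopping from lead byte to lead byte.
function utf8_char_count($s) {
    $count = 0;
    $n = 0;
    $len = strlen($s);
    while ($n < $len) {
        $n += max(1, utf8_sequence_length($s[$n])); // skip at least one byte
        $count++;
    }
    return $count;
}

$s = "欲穷千里目";
echo strlen($s), "\n";          // 15 (bytes)
echo utf8_char_count($s), "\n"; // 5 (characters)
```

The gap between those two numbers, 15 bytes against 5 characters, is exactly what the length comparison in cutstr ignores.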
The biggest flaw of the discuz version is that it uses strlen to get the length of the original string, which is a byte count, and compares that with the $length parameter. Because a character in UTF-8 does not occupy a fixed number of bytes, you face a dilemma: to keep 4 Chinese characters, what length should you pass? 8 bytes or 12 bytes? This cannot be predicted, and it is precisely because of this problem that discuz's cutstr is actually buggy, as the following test results show:
The code is as follows:
$str1 = "欲穷千里目"; // 5 Chinese characters, 15 bytes in UTF-8
echo cutstr($str1, 10, "...")."\n"; // Output: 欲穷千里目... [This is a bug. Can you see what causes it?]
echo cutstr($str1, 15, "...")."\n"; // Output: 欲穷千里目
The bug arises because cutstr counts each Chinese character as 2 characters when cutting, so 5 Chinese characters count as 10, while the original string is 15 bytes long. cutstr therefore believes it has "successfully" cut 10 characters out of a 15-character string, and appends the tail. To fix the bug, check whether the returned substring is identical to the original string; if it is, do not append the tail.
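The fix described above can be sketched as a small wrapper. This is an illustration only: `with_tail` is a hypothetical name and is not part of discuz.

```php
<?php
// Hypothetical helper: append $dot only when $cut really differs from $original.
function with_tail($original, $cut, $dot = '...') {
    return ($cut === $original) ? $cut : $cut . $dot;
}

echo with_tail("欲穷千里目", "欲穷千里目"), "\n"; // 欲穷千里目
echo with_tail("欲穷千里目", "欲穷"), "\n";       // 欲穷...
```

Assuming cutstr and the CHARSET constant are defined as above, one way to use it would be to call cutstr with an empty tail and let the wrapper decide: with_tail($str1, cutstr($str1, 10, ''), '...').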
ecshop version
The code is as follows:
/**
 * [ecshop] Truncate a string with PHP's mb_substr or iconv_substr extension functions.
 * A Chinese character is counted as 1 character.
 * This function is only suitable for UTF-8 encoded Chinese strings.
 *
 * @param $str    The original string
 * @param $length The number of characters to keep
 * @param $append The tail appended in place of the truncated part
 * @return The truncated string
 */
function sub_str($str, $length = 0, $append = '...') {
    $str = trim($str);
    $strlength = strlen($str);
    if ($length == 0 || $length >= $strlength) {
        return $str;
    } elseif ($length < 0) {
        // A negative length is counted back from the end of the string
        $length = $strlength + $length;
        if ($length < 0) {
            $length = $strlength;
        }
    }
    if (function_exists('mb_substr')) {
        $newstr = mb_substr($str, 0, $length, 'utf-8');
    } elseif (function_exists('iconv_substr')) {
        $newstr = iconv_substr($str, 0, $length, 'utf-8');
    } else {
        //$newstr = trim_right(substr($str, 0, $length));
        $newstr = substr($str, 0, $length);
    }
    // Append the tail only if something was actually cut off
    if ($append && $str != $newstr) {
        $newstr .= $append;
    }
    return $newstr;
}
The distinguishing feature of the ecshop version, and also its drawback, is that a Chinese character counts as 1 character. If the original string contains no Chinese characters, for example abcd1234, and the intent is to keep 4 Chinese characters or 8 English characters, the ecshop version does not give the expected result: the return value is abcd. A simple test follows:
The code is as follows:
$str1 = "白日依山尽，黄河入海流";
echo $str1."\n";
echo sub_str($str1, 4, "...")."\n"; // Output: 白日依山...
$str2 = "白1日2伊3山4";
echo $str2."\n";
echo sub_str($str2, 4, "...")."\n"; // Output: 白1日2...
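The "one Chinese character equals one character" rule can also be reproduced without mbstring or iconv. The sketch below relies on PCRE's /u modifier and is not ecshop code; `first_chars` is a hypothetical name, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: keep the first $count UTF-8 characters of $str.
function first_chars($str, $count) {
    $count = max(0, (int) $count);
    // With /u, "." matches one whole UTF-8 character; /s lets it match newlines too.
    if (preg_match('/^.{0,' . $count . '}/us', $str, $m) === 1) {
        return $m[0];
    }
    return $str; // preg_match failed (e.g. invalid UTF-8): return the input unchanged
}

echo first_chars("白1日2伊3山4", 4), "\n"; // 白1日2
echo first_chars("abcd1234", 4), "\n";     // abcd
```

Both calls keep exactly 4 characters, which shows the drawback described above: the pure-ASCII string ends up half as wide as the mixed one.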
Optimized version
In most scenarios where Chinese strings are truncated, the requirement is: "the original string may mix Chinese, English, and digits; a Chinese character counts as 2 characters, and an English letter or digit counts as 1 character." An implementation for this requirement is given below:
The code is as follows:
/**
 * Truncate a string. A Chinese character is counted as 2 characters;
 * both GBK and UTF-8 encodings are supported.
 * @param $string The string to truncate
 * @param $length The number of characters to keep
 * @param $append The tail appended to the substring
 * @return The truncated string
 */
function substring($string, $length, $append = false) {
    if ($length <= 0) {
        return '';
    }
    // Keep the input as given, for the comparison at the end
    $original = $string;
    // Check whether the original string is UTF-8 encoded:
    // a UTF-8 string survives a round trip through GBK unchanged
    $is_utf8 = false;
    $str1 = @iconv("UTF-8", "GBK", $string);
    $str2 = @iconv("GBK", "UTF-8", $str1);
    if ($string == $str2) {
        $is_utf8 = true;
        // For UTF-8 input, work on the GBK copy (2 bytes per Chinese character)
        $string = $str1;
    }
    $newstr = '';
    // Never read past the end of the string
    $max = min($length, strlen($string));
    for ($i = 0; $i < $max; $i++) {
        // A byte above 127 starts a 2-byte GBK character: take both bytes
        $newstr .= ord($string[$i]) > 127 ? $string[$i] . $string[++$i] : $string[$i];
    }
    if ($is_utf8) {
        $newstr = @iconv("GBK", "UTF-8", $newstr);
    }
    // Compare in the original encoding, so the tail is added only if something was cut
    if ($append && $newstr != $original) {
        $newstr .= $append;
    }
    return $newstr;
}
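The encoding check at the top of substring depends on the iconv extension. As an alternative sketch, and not the method used above, PCRE can validate UTF-8 by itself: with the /u modifier, preg_match fails on input that is not valid UTF-8. `looks_like_utf8` is a hypothetical name. Like the round-trip test, this is a heuristic, since a GBK byte sequence can occasionally also be valid UTF-8.

```php
<?php
// Hypothetical helper: true if $string is valid UTF-8 (plain ASCII included).
function looks_like_utf8($string) {
    // An empty pattern with /u matches any valid UTF-8 subject;
    // on invalid UTF-8, preg_match returns false instead of 1.
    return preg_match('//u', $string) === 1;
}

var_dump(looks_like_utf8("白日"));     // bool(true), assuming this file is saved as UTF-8
var_dump(looks_like_utf8("\xB0\xD7")); // bool(false): the GBK bytes for "白"
```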
The test results for substring are shown below (the results are the same for GBK and UTF-8):
The code is as follows:
$str1 = "白日依山尽，黄河入海流";
echo substring($str1, 4, "...")."\n"; // Output: 白日...
echo substring($str1, 5, "...")."\n"; // Output: 白日依...
$str2 = "12白34日56伊78山";
echo substring($str2, 4, "...")."\n"; // Output: 12白...
echo substring($str2, 5, "...")."\n"; // Output: 12白3...
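The same counting rule (a Chinese character counts as 2, an English letter or digit as 1) can also be applied directly to UTF-8 text, without the detour through GBK. The following is only a sketch under stated assumptions: `width_cut` is a hypothetical name, the input is assumed to be UTF-8, and every non-ASCII character is treated as width 2. Unlike substring, it never exceeds the requested width, so a wide character that would straddle the limit is dropped rather than kept.

```php
<?php
// Hypothetical helper: cut a UTF-8 string to at most $width columns,
// counting single-byte (ASCII) characters as 1 and all other characters as 2.
function width_cut($string, $width, $append = '...') {
    if ($width <= 0) {
        return '';
    }
    // Split into whole UTF-8 characters; preg_split returns false on invalid UTF-8.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return $string; // not valid UTF-8: leave the input untouched
    }
    $out = '';
    $used = 0;
    foreach ($chars as $ch) {
        $w = (strlen($ch) === 1) ? 1 : 2;
        if ($used + $w > $width) {
            break; // the next character would not fit
        }
        $out .= $ch;
        $used += $w;
    }
    // Append the tail only if something was actually cut off
    return ($out === $string) ? $out : $out . $append;
}

echo width_cut("12白34日56伊78山", 4), "\n"; // 12白...
echo width_cut("12白34日56伊78山", 5), "\n"; // 12白3...
echo width_cut("欲穷千里目", 10), "\n";      // 欲穷千里目
```

The last call shows that the tail problem seen in cutstr does not occur here, because the result is compared with the original string before the tail is added.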
Author: edwardlost's blog