Splitting UTF-8 strings into characters (supports Chinese, Japanese, Korean, etc.; efficient)

WBOY
Original: 2016-07-25 09:08:02

I use the code below because mb_substr and mb_strlen are too slow.

The code is not my own. It relies on a property of the UTF-8 encoding: the lead byte of every character announces how many bytes the character occupies, and every following byte starts with the bits 10.
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
Reading the lead byte therefore gives the character boundary and the number of bytes the character occupies, so the string can be cut into an array of characters.
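
The table above can also be read as a set of bit masks on the lead byte. The following is a minimal sketch of that idea, not part of the original function: the helper name utf8_sequence_length is hypothetical and chosen only for illustration. Note that the 5- and 6-byte forms come from the original UTF-8 design; RFC 3629 restricts valid UTF-8 to at most 4 bytes, so they should not appear in well-formed input.

```php
<?php
// Hypothetical helper (an assumption, not from the original article):
// returns the sequence length announced by a lead byte, or 0 when the
// byte cannot start a character.
function utf8_sequence_length(int $byte): int
{
    if (($byte & 0x80) === 0x00) return 1; // 0xxxxxxx
    if (($byte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($byte & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($byte & 0xF8) === 0xF0) return 4; // 11110xxx
    if (($byte & 0xFC) === 0xF8) return 5; // 111110xx (obsolete form)
    if (($byte & 0xFE) === 0xFC) return 6; // 1111110x (obsolete form)
    return 0; // 10xxxxxx continuation byte, or 0xFE / 0xFF
}

// Print the lead byte and announced length for a few sample characters.
foreach (["A", "é", "中", "😀"] as $char) {
    $lead = ord($char[0]);
    printf("%s: lead byte 0x%02X, %d byte(s)\n", $char, $lead, utf8_sequence_length($lead));
}
// A: lead byte 0x41, 1 byte(s)
// é: lead byte 0xC3, 2 byte(s)
// 中: lead byte 0xE4, 3 byte(s)
// 😀: lead byte 0xF0, 4 byte(s)
```

Returning 0 for a continuation byte makes malformed input visible to the caller instead of silently guessing a length.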

This is convenient if you manipulate individual characters frequently. In my tests the function was about 10 times faster than mb_substr. I once wrote an "N million banned words" replacement class, and while developing it I compared the two approaches in detail; this function clearly won.

function str_split_utf8($str) {
    // Place each UTF-8 character of the string into an array.
    $array = array();
    $len = strlen($str);
    for ($i = 0; $i < $len; ) {
        $value = ord($str[$i]);
        // The lead byte determines how many bytes the character occupies.
        if ($value >= 0xC0 && $value <= 0xDF) {
            $split = 2;
        } elseif ($value >= 0xE0 && $value <= 0xEF) {
            $split = 3;
        } elseif ($value >= 0xF0 && $value <= 0xF7) {
            $split = 4;
        } elseif ($value >= 0xF8 && $value <= 0xFB) {
            $split = 5;
        } elseif ($value >= 0xFC) {
            $split = 6;
        } else {
            // ASCII byte, or a stray continuation byte (0x80-0xBF) in
            // malformed input: treat it as a single-byte element.
            $split = 1;
        }
        // substr() stops at the end of the string, so a truncated
        // trailing sequence does not read past the last byte.
        $array[] = substr($str, $i, $split);
        $i += $split;
    }
    return $array;
}
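
Once a string has been split into an array of characters, character-based operations that would otherwise need mb_strlen or mb_substr reduce to plain array functions. The sketch below illustrates this under stated assumptions: chars_substr is a hypothetical helper name, and preg_split with the u modifier is used only as a stand-in for str_split_utf8 so that the snippet runs on its own.

```php
<?php
// Hypothetical helper (an assumption, not from the original article):
// character-based substring over an array of UTF-8 characters, such as
// the array returned by str_split_utf8().
function chars_substr(array $chars, int $start, ?int $length = null): string
{
    return implode('', array_slice($chars, $start, $length));
}

$text = "你好, world";

// Stand-in for str_split_utf8($text) so this snippet is self-contained.
$chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

echo count($chars), "\n";               // 9 characters (strlen() reports 13 bytes)
echo chars_substr($chars, 0, 2), "\n";  // 你好
echo chars_substr($chars, 4), "\n";     // world
```

Splitting once and reusing the array is where the saving comes from when many substrings or lengths are needed for the same text.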
