
Splitting UTF-8 strings into characters (supports Chinese, Japanese, Korean, etc.; efficient)

WBOY
2016-07-25 09:08:02
mb_substr and mb_strlen were too slow for my purposes, so I use the code below instead.

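This code is not my own. It relies on a property of the UTF-8 encoding: the first byte of every sequence announces how many bytes the character occupies.
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
Reading the lead byte is therefore enough to find each character boundary, determine how many bytes the character occupies, and split the string into an array of characters. The 5- and 6-byte forms come from the original UTF-8 design; RFC 3629 now limits UTF-8 to 4 bytes, so they do not occur in valid modern text.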
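The lead-byte patterns above can be checked with bit masks. The following is a minimal sketch, not part of the original code: the helper name utf8_sequence_length is hypothetical, and treating stray continuation bytes as one-byte units is an assumption made here so that a scan always advances.

```php
<?php
// Hypothetical helper (not from the original article): returns how many
// bytes a UTF-8 sequence occupies, judging only by its first byte.
function utf8_sequence_length(int $leadByte): int
{
    if (($leadByte & 0x80) === 0x00) { return 1; } // 0xxxxxxx
    if (($leadByte & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($leadByte & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($leadByte & 0xF8) === 0xF0) { return 4; } // 11110xxx
    if (($leadByte & 0xFC) === 0xF8) { return 5; } // 111110xx (obsolete form)
    if (($leadByte & 0xFE) === 0xFC) { return 6; } // 1111110x (obsolete form)
    // Assumption: a stray continuation byte (10xxxxxx) or 0xFE/0xFF is
    // treated as a single byte so that the caller keeps moving forward.
    return 1;
}

// "A", "é", "中" and an emoji, written as raw UTF-8 bytes.
foreach (["A", "\xC3\xA9", "\xE4\xB8\xAD", "\xF0\x9F\x98\x80"] as $char) {
    printf("%d\n", utf8_sequence_length(ord($char[0]))); // prints 1, 2, 3, 4
}
```

Masking compares only the fixed prefix bits of each pattern, which is equivalent to the range comparisons used in the function further down.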


This is handy for anyone who frequently works with strings character by character. In my tests the function was about 10 times faster than mb_substr. I once wrote an "N million banned word replacement class", and while developing that class I compared the two approaches in detail; this function clearly won.
    function str_split_utf8($str) {
        // Place each character of the string into an array.
        $array = array();
        $len = strlen($str);
        for ($i = 0; $i < $len; ) {
            // The lead byte tells how many bytes this character occupies.
            $value = ord($str[$i]);
            // Default: ASCII, or a stray continuation byte (0x80-0xBF).
            $split = 1;
            if ($value >= 0xC0 && $value <= 0xDF) {
                $split = 2;
            } elseif ($value >= 0xE0 && $value <= 0xEF) {
                $split = 3;
            } elseif ($value >= 0xF0 && $value <= 0xF7) {
                $split = 4;
            } elseif ($value >= 0xF8 && $value <= 0xFB) {
                $split = 5;
            } elseif ($value >= 0xFC) {
                $split = 6;
            }
            // substr() stops at the end of the string, so a truncated
            // sequence cannot cause an out-of-range offset.
            $array[] = substr($str, $i, $split);
            $i += $split;
        }
        return $array;
    }