Why Does PREG_OFFSET_CAPTURE Return Byte Counts Instead of Character Counts with UTF8 and the 'u' Modifier?
When you call preg_match with the u modifier to process UTF-8 text, PREG_OFFSET_CAPTURE still reports match offsets in bytes, not characters. The difference shows up whenever a multibyte character precedes the match:
preg_match('/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
echo $a_matches[0][1]; // Prints 2 (byte offset); the character offset of "H" in "¡Hola!" is 1
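Although the u modifier tells the regex engine to treat both the pattern and the subject as UTF-8, the reported offsets remain byte offsets. Here "¡" occupies two bytes (0xC2 0xA1), so "H" starts at byte 2 even though it is the second character. To get a character offset, cut the subject at the byte offset with substr (which works on bytes) and count the characters in that prefix with mb_strlen: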
$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
// Pass the encoding explicitly so the result does not depend on the configured default.
echo mb_strlen(substr($str, 0, $a_matches[0][1]), 'UTF-8'); // Prints 1
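When a pattern has several capture groups, the same conversion has to be applied to every entry in the matches array. The following is a minimal sketch of one way to do that. The function name preg_match_char_offsets is hypothetical (it is not part of PHP), and the sketch assumes the mbstring extension is loaded and the subject is valid UTF-8.

```php
<?php
// Hypothetical helper, not a built-in: runs preg_match with
// PREG_OFFSET_CAPTURE and rewrites each byte offset as a character offset.
function preg_match_char_offsets($pattern, $subject, &$matches)
{
    $result = preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);
    if ($result !== 1) {
        return $result; // 0 means no match, false means an error occurred
    }

    foreach ($matches as $key => $match) {
        $byteOffset = $match[1];
        if ($byteOffset < 0) {
            continue; // groups that did not participate are reported with offset -1
        }
        // substr() cuts on bytes; mb_strlen() counts characters in that prefix.
        $matches[$key][1] = mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
    }

    return $result;
}

preg_match_char_offsets('/H(o)/u', "\xC2\xA1Hola!", $m);
echo $m[0][1], ' ', $m[1][1], "\n"; // Prints "1 2"
```

The converted offsets are suitable for character-based functions such as mb_substr. They should not be passed to byte-based functions like substr, nor used as the offset argument of preg_match, since those still expect bytes.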