Why are preg_match() Offsets in Bytes, Not Characters, Even with UTF-8 Support?
When matching UTF-8 strings using preg_match() with the PREG_OFFSET_CAPTURE flag, the offsets returned for each match are counted in bytes, not characters. This holds even with the u modifier: it makes the regex engine treat both the pattern and the subject as UTF-8, but the captured offsets remain byte positions within the subject string.
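The behavior is easy to reproduce. The following is a minimal sketch, assuming the mbstring extension is available; the variable names are illustrative. The subject starts with "¡" (U+00A1), which takes two bytes in UTF-8, so the "H" that follows is the second character but sits at byte offset 2.

```php
<?php
// "¡" (U+00A1) is encoded as the two bytes 0xC2 0xA1 in UTF-8.
$str = "\xC2\xA1Hola!";

preg_match('/H/u', $str, $matches, PREG_OFFSET_CAPTURE);

// Each entry of $matches is [matched text, byte offset].
echo $matches[0][1], "\n";            // 2: byte offset of "H", not its character index (1)
echo strlen($str), "\n";              // 7: length in bytes
echo mb_strlen($str, 'UTF-8'), "\n";  // 6: length in characters
```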
To obtain character-based offsets, a workaround involving mb_strlen can be employed. substr() operates on bytes, so substr($str, 0, $byteOffset) returns exactly the part of the subject that precedes the match. mb_strlen then counts the UTF-8 characters in that prefix, and that count is the character offset of the match.
Here's an example:
$str = "\xC2\xA1Hola!"; // "¡Hola!": the leading "¡" occupies two bytes in UTF-8
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE); // $a_matches[0][1] is 2 (bytes)
echo mb_strlen(substr($str, 0, $a_matches[0][1]), 'UTF-8'); // Outputs 1
With this conversion, the reported offset is a character position within the UTF-8 string (1 for the "H" in the example) rather than a byte position (2).
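When a pattern has several capture groups, the same conversion can be applied to every entry of the match array. The sketch below wraps preg_match() in a hypothetical helper named utf8_preg_match (not a built-in function); it assumes the mbstring extension is loaded and that the subject is valid UTF-8.

```php
<?php
/**
 * Hypothetical helper: runs preg_match() with PREG_OFFSET_CAPTURE and
 * rewrites every byte offset in $matches as a character offset.
 * Returns whatever preg_match() returned (1, 0, or false).
 */
function utf8_preg_match($pattern, $subject, &$matches)
{
    $result = preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);
    if ($result !== 1) {
        return $result; // no match or error: nothing to convert
    }
    foreach ($matches as $key => $match) {
        // Groups that did not participate in the match have offset -1; skip them.
        if ($match[1] >= 0) {
            $prefix = substr($subject, 0, $match[1]); // bytes before the match
            $matches[$key][1] = mb_strlen($prefix, 'UTF-8');
        }
    }
    return $result;
}

// "¡Hola, señor!" written with escapes so the file encoding does not matter.
$str = "\xC2\xA1Hola, se\xC3\xB1or!";
utf8_preg_match('/se(\x{F1})or/u', $str, $m);

echo $m[0][1], "\n"; // 7: character offset of "señor" (byte offset 8)
echo $m[1][1], "\n"; // 9: character offset of "ñ" (byte offset 10)
```

Each conversion rescans the prefix of the subject, so for long subjects with many matches it may be worth converting only the offsets that are actually needed.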