Why are preg_match() Offsets in Bytes, Not Characters, Even with UTF-8 Support?

DDD
2024-12-06 05:35:19

PREG_OFFSET_CAPTURE and UTF-8 Strings: A Byte-Counting Mismatch

When preg_match() is called on a UTF-8 string with the PREG_OFFSET_CAPTURE flag, the offsets it reports are counted in bytes, not characters. This holds even with the u modifier, which makes PCRE treat both the pattern and the subject as UTF-8: matching becomes character-aware, but the captured offsets remain byte-based.
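The mismatch can be seen in a minimal sketch like the one below. It assumes the mbstring extension is loaded; the variable names are illustrative, and mb_strpos() is used only to show the character position for comparison.

```php
<?php
// "¡Hola!" with the "¡" written as its two UTF-8 bytes (0xC2 0xA1).
$subject = "\xC2\xA1Hola!";

preg_match('/H/u', $subject, $matches, PREG_OFFSET_CAPTURE);

// Offset as reported by preg_match(): counted in bytes.
$byteOffset = $matches[0][1];

// Position of the same "H" counted in UTF-8 characters.
$charOffset = mb_strpos($subject, 'H', 0, 'UTF-8');

echo "byte offset: $byteOffset\n"; // byte offset: 2
echo "char offset: $charOffset\n"; // char offset: 1
```

The two values differ by one because the single character "¡" that precedes the "H" occupies two bytes.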

Resolving the Byte-Counting Issue

To obtain character-based offsets, convert the byte offset after matching. substr() operates on bytes, so substr($str, 0, $offset) returns exactly the part of the subject that precedes the match. Passing that prefix to mb_strlen() (from the mbstring extension) counts its UTF-8 characters, and that count is the character offset of the match.

Here's a modified example:

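$str = "\xC2\xA1Hola!"; // "¡Hola!": the "¡" occupies two bytes in UTF-8
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
$byte_offset = $a_matches[0][1]; // 2, counted in bytes
echo mb_strlen(substr($str, 0, $byte_offset), 'UTF-8'); // Outputs 1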

The result is the character position of the match within the UTF-8 string: 1 here, because only the single character "¡" precedes the "H", even though preg_match() reports a byte offset of 2.
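When a pattern has several capture groups, the same conversion can be applied to every entry of the match array. The following is a rough sketch of such a wrapper, not a built-in: preg_match_chars() is a hypothetical name, and it assumes the mbstring extension and a subject that is valid UTF-8.

```php
<?php
// Hypothetical helper (not part of PHP): runs preg_match() with
// PREG_OFFSET_CAPTURE and rewrites each byte offset as a character offset.
function preg_match_chars(string $pattern, string $subject, &$matches = null)
{
    $result = preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);
    if ($result !== 1) {
        return $result; // 0 (no match) or false (error): nothing to convert
    }
    foreach ($matches as $key => $match) {
        // Groups that did not participate are reported with offset -1; skip them.
        if ($match[1] >= 0) {
            // substr() cuts by bytes, mb_strlen() counts the characters in that prefix.
            $matches[$key][1] = mb_strlen(substr($subject, 0, $match[1]), 'UTF-8');
        }
    }
    return $result;
}

// "¡Hola, señor!": both "¡" and "ñ" take two bytes each in UTF-8.
$str = "\xC2\xA1Hola, se\xC3\xB1or!";
preg_match_chars('/se(\x{F1})or/u', $str, $m);

printf("full match at character %d\n", $m[0][1]); // full match at character 7
printf("group 1 at character %d\n", $m[1][1]);    // group 1 at character 9
```

Each conversion rescans the subject from its start, so for many matches in a long string it may be worth caching the prefix length rather than recomputing it per group.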
