
PHP—PCRE regular expression performance

伊谢尔伦
2016-11-21 17:08:47

Some items in a pattern are more efficient than others. For example, a character class such as [aeiou] is more efficient than the equivalent alternation (a|e|i|o|u). In general, the simplest construct that provides the required behavior is usually the most efficient. Jeffrey Friedl's book (Mastering Regular Expressions) contains a lot of discussion about regular expression performance.
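As a small illustration of the point above, the following sketch runs both forms against the same string and checks that they find the same matches. The subject string and variable names are made up for the example; the snippet only shows that the two constructs are interchangeable in result, not how much faster one is.

```php
<?php
// Hypothetical subject string, chosen only for illustration.
$subject = 'regular expression';

// Character class: one test per position.
$classCount = preg_match_all('/[aeiou]/', $subject, $classMatches);

// Alternation: the same set of characters written as five branches.
$altCount = preg_match_all('/(a|e|i|o|u)/', $subject, $altMatches);

// Both patterns find the same vowels in the same order.
var_dump($classCount === $altCount);
var_dump($classMatches[0] === $altMatches[0]);
```

Since the results are identical, the character class is the safer default whenever every branch of the alternation is a single character.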

When a pattern starts with .* and the PCRE_DOTALL option is set, PCRE implicitly anchors the pattern, because it can match only at the beginning of the target string. If PCRE_DOTALL is not set, PCRE cannot make this optimization: the . metacharacter does not match newlines, so if the target string contains newlines, the pattern may match starting after one of them rather than at the very beginning. For example, the pattern (.*) second matches the target string "first\nand second" (where \n stands for a newline character), and the first captured substring is "and". To find this match, PCRE has to retry the match starting after every newline in the target string.
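The example from the paragraph above can be reproduced directly. In PHP the PCRE_DOTALL option corresponds to the s pattern modifier; the variable names here are arbitrary.

```php
<?php
// Double quotes so that \n is a real newline character.
$subject = "first\nand second";

// Without the s modifier, . cannot cross the newline, so the match
// starts after it and the first group captures only "and".
preg_match('/(.*) second/', $subject, $withoutDotall);

// With the s modifier (PCRE_DOTALL), . also matches the newline, so the
// match starts at the beginning and the group spans both lines.
preg_match('/(.*) second/s', $subject, $withDotall);

var_dump($withoutDotall[1]);
var_dump($withDotall[1]);
```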

If you use such a pattern with target strings that do not contain newlines, you get the best performance by setting PCRE_DOTALL, or by starting the pattern with ^.* to indicate the anchoring explicitly. This saves PCRE from scanning along the target string looking for a newline from which to restart.
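A minimal sketch of the three variants, assuming the target string is known to contain no newlines (the sample string is hypothetical). All three produce the same capture; the second and third simply tell PCRE that the match can only start at the beginning.

```php
<?php
// Hypothetical newline-free target string.
$subject = 'first and second';

$patterns = [
    'unanchored' => '/(.*) second/',   // PCRE may retry at later positions
    'dotall'     => '/(.*) second/s',  // implicitly anchored via PCRE_DOTALL
    'explicit'   => '/^(.*) second/',  // explicitly anchored with ^
];

$captures = [];
foreach ($patterns as $name => $pattern) {
    preg_match($pattern, $subject, $m);
    $captures[$name] = $m[1];
}

var_dump($captures);
```

Note that the variants stop being equivalent as soon as the target string may contain a newline, so the anchored forms are only a drop-in replacement when that possibility is ruled out.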

Beware of patterns that contain nested unlimited repeats. These can take a long time to run when applied to a string that does not match. Consider the pattern fragment (a+)*.

This pattern can match "aaaa" in 33 different ways, and this number increases rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4 times, and for each of those cases other than 0, the + repeats can match different numbers of times.) When the remainder of the pattern is such that the entire match is going to fail, PCRE has in principle to try every possible variation, and this can take an extremely long time.

An optimization catches some of the simpler cases, such as (a+)*b, where a literal character follows immediately. Before starting the normal matching procedure, PCRE checks whether a "b" occurs later in the target string, and fails immediately if it does not. However, this optimization cannot be used when no literal character follows. You can see the difference by comparing the behavior of (a+)*\d with the pattern above. Applied to a whole line of "a" characters, (a+)*b reports failure almost immediately, whereas (a+)*\d takes considerable time once the target string is longer than about 20 characters.
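The comparison described above can be tried with a sketch like the following. It rests on two assumptions that are not part of the original text: JIT compilation is switched off through the pcre.jit setting, and pcre.backtrack_limit is lowered to 10000, so that the slow case is cut short with an error instead of running for a long time. The exact behavior may vary with the PHP and PCRE versions in use.

```php
<?php
// Assumptions for this sketch: disable JIT and lower the backtrack limit
// so the runaway case stops early and reports an error.
ini_set('pcre.jit', '0');
ini_set('pcre.backtrack_limit', '10000');

// A whole line of "a" characters, longer than 20.
$subject = str_repeat('a', 25);

// A literal "b" must follow, so PCRE can reject the string up front.
$withLiteral = preg_match('/(a+)*b/', $subject);
$withLiteralError = preg_last_error();

// \d is not a literal character, so the up-front check does not apply
// and PCRE works through the ways of splitting the run of "a"s.
$withoutLiteral = preg_match('/(a+)*\d/', $subject);
$withoutLiteralError = preg_last_error();

var_dump($withLiteral, $withLiteralError === PREG_NO_ERROR);
var_dump($withoutLiteral, $withoutLiteralError === PREG_BACKTRACK_LIMIT_ERROR);
```

Under these assumptions the first call should simply report no match, while the second is expected to give up with PREG_BACKTRACK_LIMIT_ERROR. Checking preg_last_error() after a false return is the usual way to tell such a failure apart from an ordinary non-match.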
