Home > Article > Web Front-end > How to correctly understand regular expression backtracking in js
This time I will bring you how to correctly understand regular expressionbacktracking in js, and what are the notesfor correctly using regular expression backtracking in js. Here are practical cases. , let’s take a look.
In regular expression implementation, backtracking is a basic part of the matching process. It is the reason why regular expressions are so easy to use and powerful. However, backtracking is computationally expensive and can lead to loss of control if the design is wrong. Backtracking is the only factor that affects overall performance. Understanding how it works, and how to reduce the frequency of use, may be the key to writing efficient regular expressions
When a regular expression scans the target string, from Scan the components of the regular expression one by one from left to right, testing whether a match can be found at each position. For each quantifier and branch, one must decide how to proceed. If it is a quantifier (such as *, +? or {2,}), then the regular expression must determine when to try to match more characters; if it encounters a branch (via the | operator), then the regular expression must start from these Choose one of the options to try.
When the regex makes such a decision, it remembers another option for later use if necessary. If the selected scheme matches successfully, the regular expression will continue to scan the regular expression template, and if the rest of the matches also succeed, the matching ends. However, if the chosen option fails to find a match, or if subsequent matches fail, the regex will backtrack to the last decision point and select one of the remaining options. Continue like this until a match is found, or all possible permutations of quantifiers and branching options have been tried, then the process is abandoned, then moved to the next character at the beginning of the process, and the process is repeated.
For example, the following code demonstrates how this process handles branches through backtracking.
/h(ello|appy) hippo/.test("hello there, happy hippo");
The above line of regular expression is used to match "hello hippo
" or "happy hippo
". At the beginning of the test, we were looking for an h. The first letter of the target string happened to be h, and we found it immediately. Next, the subexpression (ello|appy) provides two processing options. The regular expression selects the leftmost option (branch selection always proceeds from left to right), checks whether ello matches the next character of the string, it does, and then the regular expression matches the following spaces.
However, the regular expression "hits a dead end" in the next match, because the h in hippo cannot match the next letter t in the string. The regex can't give up at this point because it hasn't tried all the options yet, so it backtracks to the last checkpoint (after matching the initial h) and tries to match the second branch option. But since the match was unsuccessful and there were no more options, the regular expression believed that matching starting from the first character of the string would not be successful, so it started searching again from the second character. The regular expression did not find h, so it continued to look backwards until it found the 14th letter, which matched the happy h. Then the regular expression branches again, and this time ello fails to match, but in the second branch after backtracking, it matches the entire string "happy hippo" and the match succeeds.
As another example, the following code demonstrates backtracking with repeated quantifiers.
var str = "<p>Para 1.</p>" +"<img src='smiley.jpg'>" +"<p>Para 2.</p>" +"<p>p.</p>"; /<p>.*<\/p>/i.test(str);
The regular expression first matches the three letters at the beginning of the string
, and then .*. The dot means matching any character except newline characters, and the asterisk, a "greedy" quantifier, means repeating zero or more times to match as many times as possible. Because there are no newlines in the target string, the regular expression will match the entire remaining string! However, since there is more content in the regular expression template to match, the regular expression tries to match <. Since the match at the end of the string is unsuccessful, we backtrack one character at a time and continue to try to match < until the regular expression returns to the < position of the
tag. Next try to match \/ (escaped backslash), the match is successful, then match p, the match is unsuccessful. The regular expression continues to backtrack and repeats this process until it finally matches at the end of the second paragraph. Returning a successful match requires scanning from the head of the first paragraph to the end of the last paragraph, which may not be the result we want.Change the "greedy" quantifier * in the regular expression to the "lazy" (aka "non-greedy") quantifier *? to match a single paragraph. Backtracking for "lazy" quantifiers works in the opposite way. When the regular expression /
.*?<\/p>/ advances to .*?, it first attempts to skip them all, and then continues to match <\/p>.
这样做是因为*?匹配零次或多次,尽可能少重复,尽可能少意味着可以重复零次。但是,当随后的<在字符串的这一点上匹配失败时,正则表达式回溯并尝试下一个最小的字符数:1个。正则表达式继续像这样向前回溯到第一段的末尾,在那里量词后面的<\/p>得到完全匹配。
如果目标字符串只有一个段落,那么此正则表达式的“贪婪”版本和“懒惰”版本是等价的,但尝试匹配的过程不同。
当一个正则表达式占用浏览器几秒甚至更长时间时,问题原因很可能是回溯失控。为说明此问题,给出下面的正则表达式,它的目标是匹配整个HTML文件。此表达式被拆分成多行是为了适合页面显示。与其他正则表达式不同,JavaScript在没有选项时可使点号匹配任意字符,包括换行符,所以此例中以[\s\S]匹配任意字符。
/<html>[\s\S]*?<head>[\s\S]*?<title>[\s\S]*?<\/title>[\s\S]*?<\/head> [\s\S]*?<body>[\s\S]*?<\/body>[\s\S]*?<\/html>/
此正则表达式匹配在正常HTML 字符串时工作良好,但当目标字符串缺少一个或多个标签时,就会变得十分糟糕。例如