Home >Web Front-end >JS Tutorial >How to correctly understand regular expression backtracking in js

How to correctly understand regular expression backtracking in js

php中世界最好的语言Original: 2018-03-30 13:56:481600browse

This time I will bring you how to correctly understand regular expressionbacktracking in js, and what are the notesfor correctly using regular expression backtracking in js. Here are practical cases. , let’s take a look.

In regular expression implementation, backtracking is a basic part of the matching process. It is the reason why regular expressions are so easy to use and powerful. However, backtracking is computationally expensive and can lead to loss of control if the design is wrong. Backtracking is the only factor that affects overall performance. Understanding how it works, and how to reduce the frequency of use, may be the key to writing efficient regular expressions

When a regular expression scans the target string, from Scan the components of the regular expression one by one from left to right, testing whether a match can be found at each position. For each quantifier and branch, one must decide how to proceed. If it is a quantifier (such as *, +? or {2,}), then the regular expression must determine when to try to match more characters; if it encounters a branch (via the | operator), then the regular expression must start from these Choose one of the options to try.

When the regex makes such a decision, it remembers another option for later use if necessary. If the selected scheme matches successfully, the regular expression will continue to scan the regular expression template, and if the rest of the matches also succeed, the matching ends. However, if the chosen option fails to find a match, or if subsequent matches fail, the regex will backtrack to the last decision point and select one of the remaining options. Continue like this until a match is found, or all possible permutations of quantifiers and branching options have been tried, then the process is abandoned, then moved to the next character at the beginning of the process, and the process is repeated.

For example, the following code demonstrates how this process handles branches through backtracking.

/h(ello|appy) hippo/.test("hello there, happy hippo");

The above line of regular expression is used to match "hello hippo" or "happy hippo". At the beginning of the test, we were looking for an h. The first letter of the target string happened to be h, and we found it immediately. Next, the subexpression (ello|appy) provides two processing options. The regular expression selects the leftmost option (branch selection always proceeds from left to right), checks whether ello matches the next character of the string, it does, and then the regular expression matches the following spaces.

However, the regular expression "hits a dead end" in the next match, because the h in hippo cannot match the next letter t in the string. The regex can't give up at this point because it hasn't tried all the options yet, so it backtracks to the last checkpoint (after matching the initial h) and tries to match the second branch option. But since the match was unsuccessful and there were no more options, the regular expression believed that matching starting from the first character of the string would not be successful, so it started searching again from the second character. The regular expression did not find h, so it continued to look backwards until it found the 14th letter, which matched the happy h. Then the regular expression branches again, and this time ello fails to match, but in the second branch after backtracking, it matches the entire string "happy hippo" and the match succeeds.

As another example, the following code demonstrates backtracking with repeated quantifiers.

var str = "<p>Para 1.</p>" +"<img src=&#39;smiley.jpg&#39;>" +"<p>Para 2.</p>" +"<p>p.</p>";
/<p>.*<\/p>/i.test(str);

The regular expression first matches the three letters at the beginning of the string

, and then .*. The dot means matching any character except newline characters, and the asterisk, a "greedy" quantifier, means repeating zero or more times to match as many times as possible. Because there are no newlines in the target string, the regular expression will match the entire remaining string! However, since there is more content in the regular expression template to match, the regular expression tries to match <. Since the match at the end of the string is unsuccessful, we backtrack one character at a time and continue to try to match < until the regular expression returns to the < position of the

tag. Next try to match \/ (escaped backslash), the match is successful, then match p, the match is unsuccessful. The regular expression continues to backtrack and repeats this process until it finally matches

at the end of the second paragraph. Returning a successful match requires scanning from the head of the first paragraph to the end of the last paragraph, which may not be the result we want.

Change the "greedy" quantifier * in the regular expression to the "lazy" (aka "non-greedy") quantifier *? to match a single paragraph. Backtracking for "lazy" quantifiers works in the opposite way. When the regular expression /

.*?<\/p>/ advances to .*?, it first attempts to skip them all, and then continues to match <\/p>.

这样做是因为*?匹配零次或多次，尽可能少重复，尽可能少意味着可以重复零次。但是，当随后的<在字符串的这一点上匹配失败时，正则表达式回溯并尝试下一个最小的字符数：1个。正则表达式继续像这样向前回溯到第一段的末尾，在那里量词后面的<\/p>得到完全匹配。

如果目标字符串只有一个段落，那么此正则表达式的“贪婪”版本和“懒惰”版本是等价的，但尝试匹配的过程不同。

当一个正则表达式占用浏览器几秒甚至更长时间时，问题原因很可能是回溯失控。为说明此问题，给出下面的正则表达式，它的目标是匹配整个HTML文件。此表达式被拆分成多行是为了适合页面显示。与其他正则表达式不同，JavaScript在没有选项时可使点号匹配任意字符，包括换行符，所以此例中以[\s\S]匹配任意字符。

/<html>[\s\S]*?<head>[\s\S]*?<title>[\s\S]*?<\/title>[\s\S]*?<\/head>
[\s\S]*?<body>[\s\S]*?<\/body>[\s\S]*?<\/html>/

此正则表达式匹配在正常HTML 字符串时工作良好，但当目标字符串缺少一个或多个标签时，就会变得十分糟糕。例如标签缺失，最后一个[\s\S]*?将扩展到字符串的末尾，因为在那里没有发现标签，然后正则表达式将查看此前的[\s\S]*?队列记录的回溯位置，使它们进一步扩大。正则表达式尝试扩展倒数第二个[\s\S]*?—用它匹配标签，就是此前匹配过正则表达式模板<\/body>的那个标签，然后继续查找第二个标签，直到字符串的末尾。当所有这些步骤都失败时，倒数第三个[\s\S]*?将被扩展，直至字符串的末尾，依此类推。

此类问题的解决办法在于尽可能具体地指出分隔符之间的字符匹配形式，如模板“.*?”用于匹配双引号包围的一个字符串。用更具体的[^"\rn]*取代过于宽泛的.*?就去除了回溯时可能发生的几种情况，如尝试用点号匹配引号，或者扩展搜索超出预期范围。

在HTML 的例子中解决办法不是那么简单。不能使用否定字符类型，如用[^<]替代[\s\S]，因为在搜索过程中可能会遇到其他类型的标签。但是，可以通过重复一个非捕获组来达到同样效果，它包含一个回溯（阻塞下一个所需的标签）和[\s\S]（任意字符）元序列。这样可以确保中间位置上查找的每个标签都会失败。然后，更重要的是，[\s\S]模板在回溯过程中阻塞的标签在被发现之前不能被扩展。应用此方法后对正则表达式的最终修改如下：

/<html>(?:(?!<head>)[\s\S])*<head>(?:(?!<title>)[\s\S])*<title>
(?:(?!<\/title>)[\s\S])*<\/title>(?:(?!<\/head>)[\s\S])*<\/head>
(?:(?!<body>)[\s\S])*<body>(?:(?!<\/body>)[\s\S])*<\/body>
(?:(?!<\/html>)[\s\S])*<\/html>/

虽然这样做消除了潜在的回溯失控，并允许正则表达式在匹配不完整HTML字符串失败时的使用时间与文本长度呈线性关系，但是正则表达式的效率并没有提高。像这样为每个匹配字符进行多次前瞻，缺乏效率，而且成功匹配过程也相当慢。匹配较短字符串时使用此方法相当不错，而匹配一个HTML 文件可能需要前瞻并测试上千次。

相信看了本文案例你已经掌握了方法，更多精彩请关注php中文网其它相关文章！

How to correctly understand regular expression backtracking in js

Related articles