syntax
Once-only subpatterns
Once-only subpatterns
Once-only subpatterns
With both maximizing and minimizing repetition,
failure of what follows normally causes the repeated item to be
re-evaluated to see if a different number of repeats allows the
rest of the pattern to match. Sometimes it is useful to prevent
this, either to change the nature of the match, or to cause it fail
earlier than it otherwise might, when the author of the pattern
knows there is no point in carrying on.
Consider, for example, the pattern \d+foo when
applied to the subject line 123456bar
After matching all 6 digits and then failing to
match “foo”, the normal action of the matcher is to try again with
only 5 digits matching the \d+ item, and then with 4, and so on,
before ultimately failing. Once-only subpatterns provide the means
for specifying that once a portion of the pattern has matched, it
is not to be re-evaluated in this way, so the matcher would give up
immediately on failing to match “foo” the first time. The notation
is another kind of special parenthesis, starting with (?> as in
this example: (?>\d+)bar
This kind of parenthesis “locks up” the part of the
pattern it contains once it has matched, and a failure further into
the pattern is prevented from backtracking into it. Backtracking
past it to previous items, however, works as normal.
An alternative description is that a subpattern of
this type matches the string of characters that an identical
standalone pattern would match, if anchored at the current point in
the subject string.
Once-only subpatterns are not capturing
subpatterns. Simple cases such as the above example can be thought
of as a maximizing repeat that must swallow everything it can. So,
while both \d+ and \d+? are prepared to adjust the number of digits
they match in order to make the rest of the pattern match,
(?>\d+) can only match an entire sequence of digits.
This construction can of course contain arbitrarily
complicated subpatterns, and it can be nested.
Once-only subpatterns can be used in conjunction
with lookbehind assertions to specify efficient matching at the end
of the subject string. Consider a simple pattern such as
abcd$ when applied to a long string which does not match.
Because matching proceeds from left to right, PCRE will look for
each “a” in the subject and then see if what follows matches the
rest of the pattern. If the pattern is specified as
^.*abcd$ then the initial .* matches the entire string at
first, but when this fails (because there is no following “a”), it
backtracks to match all but the last character, then all but the
last two characters, and so on. Once again the search for “a”
covers the entire string, from right to left, so we are no better
off. However, if the pattern is written as
^(?>.*)(?<=abcd) then there can be no backtracking
for the .* item; it can match only the entire string. The
subsequent lookbehind assertion does a single test on the last four
characters. If it fails, the match fails immediately. For long
strings, this approach makes a significant difference to the
processing time.
When a pattern contains an unlimited repeat inside
a subpattern that can itself be repeated an unlimited number of
times, the use of a once-only subpattern is the only way to avoid
some failing matches taking a very long time indeed. The pattern
(\D+|<\d+>)*[!?] matches an unlimited number of
substrings that either consist of non-digits, or digits enclosed in
<>, followed by either ! or ?. When it matches, it runs
quickly. However, if it is applied to
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa it
takes a long time before reporting failure. This is because the
string can be divided between the two repeats in a large number of
ways, and all have to be tried. (The example used [!?] rather than
a single character at the end, because both PCRE and Perl have an
optimization that allows for fast failure when a single character
is used. They remember the last single character that is required
for a match, and fail early if it is not present in the string.) If
the pattern is changed to ((?>\D+)|<\d+>)*[!?]
sequences of non-digits cannot be broken, and failure happens
quickly.