    Once-only subpatterns

    With both maximizing and minimizing repetition,
    failure of what follows normally causes the repeated item to be
    re-evaluated to see if a different number of repeats allows the
    rest of the pattern to match. Sometimes it is useful to prevent
    this, either to change the nature of the match, or to cause it to fail
    earlier than it otherwise might, when the author of the pattern
    knows there is no point in carrying on.

    Consider, for example, the pattern \d+foo when
    applied to the subject line 123456bar

    After matching all 6 digits and then failing to
    match “foo”, the normal action of the matcher is to try again with
    only 5 digits matching the \d+ item, and then with 4, and so on,
    before ultimately failing. Once-only subpatterns provide the means
    for specifying that once a portion of the pattern has matched, it
    is not to be re-evaluated in this way, so the matcher would give up
    immediately on failing to match “foo” the first time. The notation
    is another kind of special parenthesis, starting with (?> as in
    this example: (?>\d+)foo

    This kind of parenthesis “locks up” the part of the
    pattern it contains once it has matched, and a failure further into
    the pattern is prevented from backtracking into it. Backtracking
    past it to previous items, however, works as normal.

    An alternative description is that a subpattern of
    this type matches the string of characters that an identical
    standalone pattern would match, if anchored at the current point in
    the subject string.

    Once-only subpatterns are not capturing
    subpatterns. Simple cases such as the above example can be thought
    of as a maximizing repeat that must swallow everything it can. So,
    while both \d+ and \d+? are prepared to adjust the number of digits
    they match in order to make the rest of the pattern match,
    (?>\d+) can only match an entire sequence of digits.

    This construction can of course contain arbitrarily
    complicated subpatterns, and it can be nested.

    Once-only subpatterns can be used in conjunction
    with lookbehind assertions to specify efficient matching at the end
    of the subject string. Consider a simple pattern such as
    abcd$ when applied to a long string which does not match.
    Because matching proceeds from left to right, PCRE will look for
    each “a” in the subject and then see if what follows matches the
    rest of the pattern. If the pattern is specified as
    ^.*abcd$ then the initial .* matches the entire string at
    first, but when this fails (because there is no following “a”), it
    backtracks to match all but the last character, then all but the
    last two characters, and so on. Once again the search for “a”
    covers the entire string, from right to left, so we are no better
    off. However, if the pattern is written as
    ^(?>.*)(?<=abcd) then there can be no backtracking
    for the .* item; it can match only the entire string. The
    subsequent lookbehind assertion does a single test on the last four
    characters. If it fails, the match fails immediately. For long
    strings, this approach makes a significant difference to the
    processing time.

    When a pattern contains an unlimited repeat inside
    a subpattern that can itself be repeated an unlimited number of
    times, the use of a once-only subpattern is the only way to avoid
    some failing matches taking a very long time indeed. The pattern
    (\D+|<\d+>)*[!?] matches an unlimited number of
    substrings that either consist of non-digits, or digits enclosed in
    <>, followed by either ! or ?. When it matches, it runs
    quickly. However, if it is applied to
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa it
    takes a long time before reporting failure. This is because the
    string can be divided between the two repeats in a large number of
    ways, and all have to be tried. (The example uses [!?] rather than
    a single character at the end, because both PCRE and Perl have an
    optimization that allows for fast failure when a single character
    is used. They remember the last single character that is required
    for a match, and fail early if it is not present in the string.) If
    the pattern is changed to ((?>\D+)|<\d+>)*[!?]
    sequences of non-digits cannot be broken, and failure happens