regexp-php-reference-php-recursive-0

  • PCRE regex
    syntax
  • Recursive patterns

  • Recursive patterns
  • Recursive patterns

    Recursive patterns

    Consider the problem of matching a string in
    parentheses, allowing for unlimited nested parentheses. Without the
    use of recursion, the best that can be done is to use a pattern
    that matches up to some fixed depth of nesting. It is not possible
    to handle an arbitrary nesting depth. Perl 5.6 has provided an
    experimental facility that allows regular expressions to recurse
    (among other things). The special item (?R) is provided for the
    specific case of recursion. This PCRE pattern solves the
    parentheses problem (assume the PCRE_EXTENDED option is set so that white space is
    ignored): \( ( (?>[^()]+) | (?R) )* \)

    First it matches an opening parenthesis. Then it
    matches any number of substrings which can either be a sequence of
    non-parentheses, or a recursive match of the pattern itself (i.e. a
    correctly parenthesized substring). Finally there is a closing
    parenthesis.

    This particular example pattern contains nested
    unlimited repeats, and so the use of a once-only subpattern for
    matching strings of non-parentheses is important when applying the
    pattern to strings that do not match. For example, when it is
    applied to
    (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
    it yields “no match” quickly. However, if a once-only subpattern is
    not used, the match runs for a very long time indeed because there
    are so many different ways the + and * repeats can carve up the
    subject, and all have to be tested before failure can be
    reported.

    The values set for any capturing subpatterns are
    those from the outermost level of the recursion at which the
    subpattern value is set. If the pattern above is matched against
    (ab(cd)ef) the value for the capturing parentheses is
    “ef”, which is the last value taken on at the top level. If
    additional parentheses are added, giving \( ( ( (?>[^()]+) |
    (?R) )* ) \)
    then the string they capture is “ab(cd)ef”, the
    contents of the top level parentheses. If there are more than 15
    capturing parentheses in a pattern, PCRE has to obtain extra memory
    to store data during a recursion, which it does by using
    pcre_malloc, freeing it via pcre_free afterwards. If no memory can
    be obtained, it saves data for the first 15 capturing parentheses
    only, as there is no way to give an out-of-memory error from within
    a recursion.

    (?1), (?2) and so on can be used
    for recursive subpatterns too. It is also possible to use named
    subpatterns: (?P>name) or (?&name).

    If the syntax for a recursive subpattern reference
    (either by number or by name) is used outside the parentheses to
    which it refers, it operates like a subroutine in a programming
    language. An earlier example pointed out that the pattern
    (sens|respons)e and \1ibility matches “sense and
    sensibility” and “response and responsibility”, but not “sense and
    responsibility”. If instead the pattern (sens|respons)e and
    (?1)ibility
    is used, it does match “sense and responsibility”
    as well as the other two strings. Such references must, however,
    follow the subpattern to which they refer.

    The maximum length of a subject string is the
    largest positive number that an integer variable can hold. However,
    PCRE uses recursion to handle subpatterns and indefinite
    repetition. This means that the available stack space may limit the
    size of a subject string that can be processed by certain
    patterns.