syntax

## Recursive patterns

## Recursive patterns

## Recursive patterns

Consider the problem of matching a string in

parentheses, allowing for unlimited nested parentheses. Without the

use of recursion, the best that can be done is to use a pattern

that matches up to some fixed depth of nesting. It is not possible

to handle an arbitrary nesting depth. Perl 5.6 has provided an

experimental facility that allows regular expressions to recurse

(among other things). The special item (?R) is provided for the

specific case of recursion. This PCRE pattern solves the

parentheses problem (assume the PCRE_EXTENDED option is set so that white space is

ignored): *\( ( (?>[^()]+) | (?R) )* \)*

First it matches an opening parenthesis. Then it

matches any number of substrings which can either be a sequence of

non-parentheses, or a recursive match of the pattern itself (i.e. a

correctly parenthesized substring). Finally there is a closing

parenthesis.

This particular example pattern contains nested

unlimited repeats, and so the use of a once-only subpattern for

matching strings of non-parentheses is important when applying the

pattern to strings that do not match. For example, when it is

applied to

*(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()*

it yields “no match” quickly. However, if a once-only subpattern is

not used, the match runs for a very long time indeed because there

are so many different ways the + and * repeats can carve up the

subject, and all have to be tested before failure can be

reported.

The values set for any capturing subpatterns are

those from the outermost level of the recursion at which the

subpattern value is set. If the pattern above is matched against

*(ab(cd)ef)* the value for the capturing parentheses is

“ef”, which is the last value taken on at the top level. If

additional parentheses are added, giving *\( ( ( (?>[^()]+) |
(?R) )* ) \)* then the string they capture is “ab(cd)ef”, the

contents of the top level parentheses. If there are more than 15

capturing parentheses in a pattern, PCRE has to obtain extra memory

to store data during a recursion, which it does by using

pcre_malloc, freeing it via pcre_free afterwards. If no memory can

be obtained, it saves data for the first 15 capturing parentheses

only, as there is no way to give an out-of-memory error from within

a recursion.

*(?1)*, *(?2)* and so on can be used

for recursive subpatterns too. It is also possible to use named

subpatterns: *(?P>name)* or *(?&name)*.

If the syntax for a recursive subpattern reference

(either by number or by name) is used outside the parentheses to

which it refers, it operates like a subroutine in a programming

language. An earlier example pointed out that the pattern

*(sens|respons)e and \1ibility* matches “sense and

sensibility” and “response and responsibility”, but not “sense and

responsibility”. If instead the pattern *(sens|respons)e and
(?1)ibility* is used, it does match “sense and responsibility”

as well as the other two strings. Such references must, however,

follow the subpattern to which they refer.

The maximum length of a subject string is the

largest positive number that an integer variable can hold. However,

PCRE uses recursion to handle subpatterns and indefinite

repetition. This means that the available stack space may limit the

size of a subject string that can be processed by certain

patterns.