syntax
Recursive patterns
Recursive patterns
Recursive patterns
Consider the problem of matching a string in
parentheses, allowing for unlimited nested parentheses. Without the
use of recursion, the best that can be done is to use a pattern
that matches up to some fixed depth of nesting. It is not possible
to handle an arbitrary nesting depth. Perl 5.6 has provided an
experimental facility that allows regular expressions to recurse
(among other things). The special item (?R) is provided for the
specific case of recursion. This PCRE pattern solves the
parentheses problem (assume the PCRE_EXTENDED option is set so that white space is
ignored): \( ( (?>[^()]+) | (?R) )* \)
First it matches an opening parenthesis. Then it
matches any number of substrings which can either be a sequence of
non-parentheses, or a recursive match of the pattern itself (i.e. a
correctly parenthesized substring). Finally there is a closing
parenthesis.
This particular example pattern contains nested
unlimited repeats, and so the use of a once-only subpattern for
matching strings of non-parentheses is important when applying the
pattern to strings that do not match. For example, when it is
applied to
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
it yields “no match” quickly. However, if a once-only subpattern is
not used, the match runs for a very long time indeed because there
are so many different ways the + and * repeats can carve up the
subject, and all have to be tested before failure can be
reported.
The values set for any capturing subpatterns are
those from the outermost level of the recursion at which the
subpattern value is set. If the pattern above is matched against
(ab(cd)ef) the value for the capturing parentheses is
“ef”, which is the last value taken on at the top level. If
additional parentheses are added, giving \( ( ( (?>[^()]+) |
(?R) )* ) \) then the string they capture is “ab(cd)ef”, the
contents of the top level parentheses. If there are more than 15
capturing parentheses in a pattern, PCRE has to obtain extra memory
to store data during a recursion, which it does by using
pcre_malloc, freeing it via pcre_free afterwards. If no memory can
be obtained, it saves data for the first 15 capturing parentheses
only, as there is no way to give an out-of-memory error from within
a recursion.
(?1), (?2) and so on can be used
for recursive subpatterns too. It is also possible to use named
subpatterns: (?P>name) or (?&name).
If the syntax for a recursive subpattern reference
(either by number or by name) is used outside the parentheses to
which it refers, it operates like a subroutine in a programming
language. An earlier example pointed out that the pattern
(sens|respons)e and \1ibility matches “sense and
sensibility” and “response and responsibility”, but not “sense and
responsibility”. If instead the pattern (sens|respons)e and
(?1)ibility is used, it does match “sense and responsibility”
as well as the other two strings. Such references must, however,
follow the subpattern to which they refer.
The maximum length of a subject string is the
largest positive number that an integer variable can hold. However,
PCRE uses recursion to handle subpatterns and indefinite
repetition. This means that the available stack space may limit the
size of a subject string that can be processed by certain
patterns.