Certain items that may appear in patterns are more
efficient than others. It is more efficient to use a character
class like [aeiou] than a set of alternatives such as (a|e|i|o|u).
The simplest construction that provides the required
behaviour is usually the most efficient. Jeffrey Friedl’s book
contains a lot of discussion about optimizing regular expressions
for efficient performance.
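
The character-class advice can be checked with a rough timing sketch. It uses Python's re module rather than PCRE itself, which is an assumption made purely for illustration, so absolute timings and the size of the gap will differ between engines. The names CLASS_RE, ALT_RE, SUBJECT and find_all are hypothetical.

```python
import re
import timeit

# Two patterns that accept exactly the same single characters.
CLASS_RE = re.compile(r"[aeiou]")
ALT_RE = re.compile(r"(a|e|i|o|u)")

# Hypothetical subject: almost no vowels, so most positions fail to match.
SUBJECT = "rhythm myths crypt " * 200 + "u"


def find_all(pattern, subject):
    """Return the list of matched characters for either pattern."""
    return [m.group(0) for m in pattern.finditer(subject)]


if __name__ == "__main__":
    # Both forms must find the same vowels; only the cost should differ.
    assert find_all(CLASS_RE, SUBJECT) == find_all(ALT_RE, SUBJECT)
    for name, pat in (("class", CLASS_RE), ("alternation", ALT_RE)):
        secs = timeit.timeit(lambda: find_all(pat, SUBJECT), number=200)
        print(f"{name:12s} {secs:.4f}s")
```

No expected timings are given here because they depend on the engine, the build and the machine.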
When a pattern begins with .* and the PCRE_DOTALL option is set, the pattern is implicitly
anchored by PCRE, since it can match only at the start of a subject
string. However, if PCRE_DOTALL is not set, PCRE cannot make this
optimization, because the . metacharacter does not then match a
newline, and if the subject string contains newlines, the pattern
may match from the character immediately following one of them
instead of from the very start. For example, the pattern (.*)
second matches the subject “first\nand second” (where \n
stands for a newline character) with the first captured substring
being “and”. In order to do this, PCRE has to retry the match
starting after every newline in the subject.
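
The effect of the dot-matches-newline setting on this example can be reproduced with a minimal sketch. It uses Python's re module, whose re.DOTALL flag plays the same role as PCRE_DOTALL; this is an assumption for illustration, not PCRE's own API. The sketch shows only where the match starts, not whether an engine applies the implicit-anchoring optimization internally.

```python
import re

SUBJECT = "first\nand second"
PATTERN = r"(.*) second"

# Without DOTALL, "." stops at the newline, so no match is possible from
# offset 0 and the engine has to retry from later positions.
plain = re.search(PATTERN, SUBJECT)

# With DOTALL, "." also matches the newline, so the match starts at offset 0.
dotall = re.search(PATTERN, SUBJECT, re.DOTALL)

if __name__ == "__main__":
    print(repr(plain.group(1)), plain.start())    # 'and' 6
    print(repr(dotall.group(1)), dotall.start())  # 'first\nand' 0
```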
If you are using such a pattern with subject
strings that do not contain newlines, the best performance is
obtained by setting PCRE_DOTALL, or starting the pattern with ^.* to
indicate explicit anchoring. That saves PCRE from having to scan
along the subject looking for a newline to restart at.
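
As a hedged illustration of why this rewrite is safe only for newline-free subjects, the sketch below compares three forms of the same pattern, again using Python's re module as a stand-in for PCRE (an assumption). The subjects and the names variants, results and anchored_on_multiline are hypothetical.

```python
import re

PATTERN = r"(.*) second"
SUBJECT = "first and second"  # hypothetical subject with no newline

variants = {
    "unanchored": re.compile(PATTERN),
    "dotall": re.compile(PATTERN, re.DOTALL),
    "explicit ^": re.compile("^" + PATTERN),
}

# On a newline-free subject all three forms capture the same text.
results = {name: rx.search(SUBJECT).group(1) for name, rx in variants.items()}

# On a subject that does contain a newline, the explicitly anchored form no
# longer matches at all, so the rewrite changes behaviour there.
anchored_on_multiline = variants["explicit ^"].search("first\nand second")

if __name__ == "__main__":
    print(results)
    print(anchored_on_multiline)  # None
```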
Beware of patterns that contain nested indefinite
repeats. These can take a long time to run when applied to a string
that does not match. Consider the pattern fragment (a+)*, which
can match “aaaa” in 16 different ways, and
this number increases very rapidly as the string gets longer. (The
* repeat can match 0, 1, 2, 3, or 4 times, and for each of those
cases other than 0, the + repeats can match different numbers of
times.) When the remainder of the pattern is such that the entire
match is going to fail, PCRE has in principle to try every possible
variation, and this can take an extremely long time.
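
To make the growth concrete, here is a small counting sketch for a nested repeat such as (a+)*. It does not call any regex engine; it simply enumerates, under one stated convention, the ways the fragment can match at the start of a run of n “a” characters: each iteration of the group takes a non-empty run, and the outer repeat may stop after any iteration, including before the first. The function name count_ways is hypothetical.

```python
from functools import lru_cache


def count_ways(n: int) -> int:
    """Count the ways (a+)* can match at the start of n 'a' characters."""

    @lru_cache(maxsize=None)
    def ways(remaining: int) -> int:
        total = 1  # the outer repeat stops here
        for run in range(1, remaining + 1):
            # one more iteration of (a+) consuming `run` characters
            total += ways(remaining - run)
        return total

    return ways(n)


if __name__ == "__main__":
    for n in (4, 10, 20, 30):
        print(n, count_ways(n))
```

Under this convention the count doubles with every extra character (16 for four characters, over a million for twenty), which is why a failing match can take so long.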
An optimization catches some of the simpler
cases, such as (a+)*b, where a literal character follows.
Before embarking on the standard matching procedure, PCRE checks
that there is a “b” later in the subject string, and if there is
not, it fails the match immediately. However, when there is no
following literal this optimization cannot be used. You can see the
difference by comparing the behaviour of (a+)*\d with the
pattern above. The pattern above, (a+)*b, fails almost instantly when
applied to a whole line of “a” characters, whereas (a+)*\d takes
an appreciable time with strings longer than about 20