regexp-php-reference-php-performance-8

  • PCRE regex
    syntax
  • Performance

  • Performance
  • Performance

    Performance

    Certain items that may appear in patterns are more
    efficient than others. It is more efficient to use a character
    class like [aeiou] than a set of alternatives such as (a|e|i|o|u).
    In general, the simplest construction that provides the required
    behaviour is usually the most efficient. Jeffrey Friedl’s book
    contains a lot of discussion about optimizing regular expressions
    for efficient performance.

    When a pattern begins with .* and the PCRE_DOTALL option is set, the pattern is implicitly
    anchored by PCRE, since it can match only at the start of a subject
    string. However, if PCRE_DOTALL is not set, PCRE cannot make this
    optimization, because the . metacharacter does not then match a
    newline, and if the subject string contains newlines, the pattern
    may match from the character immediately following one of them
    instead of from the very start. For example, the pattern (.*)
    second
    matches the subject “first\nand second” (where \n
    stands for a newline character) with the first captured substring
    being “and”. In order to do this, PCRE has to retry the match
    starting after every newline in the subject.

    If you are using such a pattern with subject
    strings that do not contain newlines, the best performance is
    obtained by setting PCRE_DOTALL, or starting the pattern with ^.* to
    indicate explicit anchoring. That saves PCRE from having to scan
    along the subject looking for a newline to restart at.

    Beware of patterns that contain nested indefinite
    repeats. These can take a long time to run when applied to a string
    that does not match. Consider the pattern fragment
    (a+)*

    This can match “aaaa” in 33 different ways, and
    this number increases very rapidly as the string gets longer. (The
    * repeat can match 0, 1, 2, 3, or 4 times, and for each of those
    cases other than 0, the + repeats can match different numbers of
    times.) When the remainder of the pattern is such that the entire
    match is going to fail, PCRE has in principle to try every possible
    variation, and this can take an extremely long time.

    An optimization catches some of the more simple
    cases such as (a+)*b where a literal character follows.
    Before embarking on the standard matching procedure, PCRE checks
    that there is a “b” later in the subject string, and if there is
    not, it fails the match immediately. However, when there is no
    following literal this optimization cannot be used. You can see the
    difference by comparing the behaviour of (a+)*\d with the
    pattern above. The former gives a failure almost instantly when
    applied to a whole line of “a” characters, whereas the latter takes
    an appreciable time with strings longer than about 20
    characters.