    Subpatterns

    Subpatterns are delimited by parentheses (round
    brackets), which can be nested. Marking part of a pattern as a
    subpattern does two things:

    1. It localizes a set of alternatives. For example,
      the pattern cat(aract|erpillar|) matches one of the words
      “cat”, “cataract”, or “caterpillar”. Without the parentheses, it
      would match “cataract”, “erpillar” or the empty string.

    2. It sets up the subpattern as a capturing subpattern
      (as defined above). When the whole pattern matches, the portion of
      the subject string that matched the subpattern is passed back to
      the caller via the ovector argument of pcre_exec(); in PHP, it
      appears in the matches array filled by functions such as
      preg_match(). Opening parentheses are counted from left to right
      (starting from 1) to obtain the numbers of the capturing
      subpatterns.
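
    The first point can be checked with a short sketch. The helper
    name matches_whole() is hypothetical and introduced only for this
    illustration; it reports whether the first match found by
    preg_match() spans the entire subject.

```php
<?php
// Hypothetical helper: true when the first match found spans the whole subject.
function matches_whole(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject, $m) === 1 && $m[0] === $subject;
}

$grouped   = '/cat(aract|erpillar|)/';  // alternatives apply only after "cat"
$ungrouped = '/cataract|erpillar|/';    // alternatives apply to the whole pattern

foreach (['cat', 'cataract', 'caterpillar', 'erpillar', ''] as $word) {
    printf(
        "%-14s grouped=%-6s ungrouped=%s\n",
        var_export($word, true),
        var_export(matches_whole($grouped, $word), true),
        var_export(matches_whole($ungrouped, $word), true)
    );
}
```

    With these inputs, the grouped pattern accepts “cat”, “cataract”
    and “caterpillar”, while the ungrouped one accepts “cataract”,
    “erpillar” and the empty string.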

    For example, if the string “the red king” is
    matched against the pattern the ((red|white) (king|queen))
    the captured substrings are “red king”, “red”, and “king”, and are
    numbered 1, 2, and 3.
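
    As a minimal sketch of this numbering, the same pattern can be
    run through preg_match(). The variable names are illustrative;
    index 0 of the matches array holds the whole match and the
    captured substrings follow at indexes 1 to 3.

```php
<?php
$pattern = '/the ((red|white) (king|queen))/';

if (preg_match($pattern, 'the red king', $matches) === 1) {
    // Index 0 is the whole match; indexes 1..3 follow the order in
    // which the opening parentheses appear in the pattern.
    foreach ($matches as $number => $text) {
        echo $number, ' => ', $text, PHP_EOL;
    }
}
```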

    Having plain parentheses serve both functions is not always
    helpful. Often a grouping subpattern is needed without any
    capturing. If an opening parenthesis is followed by “?:”, the
    subpattern does not do any capturing, and is not counted when
    computing the number of any subsequent capturing subpatterns. For
    example, if the string “the white queen” is matched against the
    pattern the ((?:red|white) (king|queen)) the captured substrings
    are “white queen” and “queen”, and are numbered 1 and 2.

    The maximum number of captured substrings is 65535. It may not be
    possible to compile such large patterns, however, depending on the
    configuration options of libpcre.
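
    The effect of “?:” on numbering can be seen by running the
    capturing and non-capturing variants side by side. This is only a
    sketch; the variable names are illustrative.

```php
<?php
$capturing    = '/the ((red|white) (king|queen))/';
$nonCapturing = '/the ((?:red|white) (king|queen))/';
$subject      = 'the white queen';

preg_match($capturing, $subject, $withColour);
preg_match($nonCapturing, $subject, $withoutColour);

// With (?:...), the colour is no longer captured, so (king|queen)
// moves from number 3 to number 2.
print_r($withColour);
print_r($withoutColour);
```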

    As a convenient shorthand, if any option settings
    are required at the start of a non-capturing subpattern, the option
    letters may appear between the “?” and the “:”. Thus the two
    patterns

    (?i:saturday|sunday)
    (?:(?i)saturday|sunday)
    

    match exactly the same set of strings. Because
    alternative branches are tried from left to right, and options are
    not reset until the end of the subpattern is reached, an option
    setting in one branch does affect subsequent branches, so the above
    patterns match “SUNDAY” as well as “Saturday”.
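
    A small sketch comparing the two spellings follows. The ^ and $
    anchors and the list of test words are assumptions added for the
    illustration, so that each word must match in full.

```php
<?php
$shorthand = '/^(?i:saturday|sunday)$/';
$longhand  = '/^(?:(?i)saturday|sunday)$/';

$results = [];
foreach (['Saturday', 'SUNDAY', 'sunday', 'Monday'] as $day) {
    // Each entry records [shorthand matched, longhand matched].
    $results[$day] = [
        preg_match($shorthand, $day) === 1,
        preg_match($longhand, $day) === 1,
    ];
}

var_export($results);
```

    Both patterns should agree on every input: the three spellings of
    the weekend days match regardless of case, and “Monday” does not.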

    It is possible to name a subpattern using the
    syntax (?P<name>pattern). This subpattern will then
    be indexed in the matches array by its normal numeric position and
    also by name. PHP 5.2.2 introduced two alternative syntaxes:
    (?<name>pattern) and (?'name'pattern).
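
    The sketch below applies all three spellings to the same subject.
    The names “year” and “month” and the sample subject are
    illustrative assumptions, not part of the syntax.

```php
<?php
// The same pattern written with each of the three naming syntaxes.
$patterns = [
    '/(?P<year>\d{4})-(?P<month>\d{2})/',
    '/(?<year>\d{4})-(?<month>\d{2})/',
    "/(?'year'\\d{4})-(?'month'\\d{2})/",
];

$results = [];
foreach ($patterns as $pattern) {
    preg_match($pattern, 'released 2024-05', $matches);
    $results[] = $matches;
}

// Each capture is available both by name and by numeric position.
echo $results[0]['year'], ' / ', $results[0][1], PHP_EOL;
echo $results[0]['month'], ' / ', $results[0][2], PHP_EOL;
```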

    Sometimes a regular expression needs several alternative
    subgroups, only one of which can match for any given subject.
    Normally, each of these is given its own backreference number even
    though only one of them can ever match. To overcome this, the (?|
    syntax allows duplicate numbers. Consider the following regex
    matched against the string Sunday:

    (?:(Sat)ur|(Sun))day
    

    Here Sun is stored in backreference 2,
    while backreference 1 is empty. Matching Saturday instead yields Sat in
    backreference 1, while backreference 2 does not exist. Changing the
    pattern to use (?| fixes this problem:

    (?|(Sat)ur|(Sun))day
    

    Using this pattern, both Sun and
    Sat would be stored in backreference 1.
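
    The difference between the two patterns can be observed with
    preg_match(). This is a minimal sketch; the variable names are
    illustrative.

```php
<?php
$plain = '/(?:(Sat)ur|(Sun))day/';
$reset = '/(?|(Sat)ur|(Sun))day/';

$results = [];
foreach (['Saturday', 'Sunday'] as $subject) {
    preg_match($plain, $subject, $plainMatches);
    preg_match($reset, $subject, $resetMatches);
    $results[$subject] = ['plain' => $plainMatches, 'reset' => $resetMatches];
}

// With (?|...), the captured prefix is always found at index 1.
print_r($results);
```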