Unicode character properties
Since PHP 5.1.0, three additional escape sequences to
match generic character types are available when UTF-8 mode is selected (that is, when the u pattern modifier is used). They are:
- \p{xx}: a character with the xx property
- \P{xx}: a character without the xx property
- \X: an extended Unicode sequence
The property names represented by xx above
are limited to the Unicode general category properties listed in the table
below and the script names described later in this section. Each
character has exactly one general category property, specified by a two-letter
abbreviation. For compatibility with Perl, negation can be
specified by including a circumflex between the opening brace and
the property name. For example, \p{^Lu} is the same as
\P{Lu}.
If only one letter is specified with \p or
\P, it includes all the properties that start with that
letter. In this case, in the absence of negation, the curly
brackets in the escape sequence are optional; these two examples
have the same effect:
- \p{L}
- \pL
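The two equivalences described above can be checked with a short sketch. It assumes preg_match() with the u modifier (which selects UTF-8 mode); the helper name prop_matches() and the sample strings are made up for this example only.

```php
<?php
// Hypothetical helper for this example: true if $pattern matches $subject.
function prop_matches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

// \p{L} and \pL are two spellings of the same test: any letter.
$braced   = prop_matches('/^\p{L}+$/u', 'Straße');
$unbraced = prop_matches('/^\pL+$/u', 'Straße');

// \p{^Lu} and \P{Lu} are two spellings of the same test:
// any character that is not an upper case letter.
$caret   = prop_matches('/^\p{^Lu}+$/u', 'abc 123');
$negated = prop_matches('/^\P{Lu}+$/u', 'abc 123');

var_dump($braced, $unbraced, $caret, $negated);
```

All four calls should report a match for these inputs. Note that the braces cannot be dropped in the negated caret form, since the shorthand only applies to a single letter without negation.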
Property | Matches | Notes |
---|---|---|
C | Other | |
Cc | Control | |
Cf | Format | |
Cn | Unassigned | |
Co | Private use | |
Cs | Surrogate | |
L | Letter | Includes the following properties: Ll, Lm, Lo, Lt and Lu. |
Ll | Lower case letter | |
Lm | Modifier letter | |
Lo | Other letter | |
Lt | Title case letter | |
Lu | Upper case letter | |
M | Mark | |
Mc | Spacing mark | |
Me | Enclosing mark | |
Mn | Non-spacing mark | |
N | Number | |
Nd | Decimal number | |
Nl | Letter number | |
No | Other number | |
P | Punctuation | |
Pc | Connector punctuation | |
Pd | Dash punctuation | |
Pe | Close punctuation | |
Pf | Final punctuation | |
Pi | Initial punctuation | |
Po | Other punctuation | |
Ps | Open punctuation | |
S | Symbol | |
Sc | Currency symbol | |
Sk | Modifier symbol | |
Sm | Mathematical symbol | |
So | Other symbol | |
Z | Separator | |
Zl | Line separator | |
Zp | Paragraph separator | |
Zs | Space separator | |
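As an informal check of a few rows from the table, the sketch below tests one sample character against each two-letter property. The sample characters are arbitrary picks for illustration, and the variable names are not part of any API.

```php
<?php
// Property abbreviation => one sample character expected to carry it.
$samples = [
    'Lu' => 'A', // upper case letter
    'Ll' => 'a', // lower case letter
    'Nd' => '7', // decimal number
    'Pd' => '-', // dash punctuation (hyphen-minus)
    'Sc' => '€', // currency symbol
    'Sm' => '+', // mathematical symbol
    'Zs' => ' ', // space separator
];

$results = [];
foreach ($samples as $property => $char) {
    // Build e.g. /^\p{Lu}$/u; the u modifier selects UTF-8 mode.
    $results[$property] = preg_match('/^\p{' . $property . '}$/u', $char) === 1;
}

// A one-letter property covers every two-letter property that starts
// with it, so \pL counts both "A" (Lu) and "a" (Ll) but not "7" or "-".
$letterCount = preg_match_all('/\pL/u', 'Aa7-', $letters);

var_dump($results, $letterCount);
```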
Extended properties such as
InMusicalSymbols are not supported by PCRE.
Specifying case-insensitive (caseless) matching
does not affect these escape sequences. For example,
\p{Lu} always matches only upper case letters.
Sets of Unicode characters are defined as belonging
to certain scripts. A character from one of these sets can be
matched using a script name. For example:
- \p{Greek}
- \P{Han}
Those that are not part of an identified script are
lumped together as Common. The current list of scripts
is:
Arabic | Armenian | Avestan | Balinese | Bamum |
Batak | Bengali | Bopomofo | Brahmi | Braille |
Buginese | Buhid | Canadian_Aboriginal | Carian | Chakma |
Cham | Cherokee | Common | Coptic | Cuneiform |
Cypriot | Cyrillic | Deseret | Devanagari | Egyptian_Hieroglyphs |
Ethiopic | Georgian | Glagolitic | Gothic | Greek |
Gujarati | Gurmukhi | Han | Hangul | Hanunoo |
Hebrew | Hiragana | Imperial_Aramaic | Inherited | Inscriptional_Pahlavi |
Inscriptional_Parthian | Javanese | Kaithi | Kannada | Katakana |
Kayah_Li | Kharoshthi | Khmer | Lao | Latin |
Lepcha | Limbu | Linear_B | Lisu | Lycian |
Lydian | Malayalam | Mandaic | Meetei_Mayek | Meroitic_Cursive |
Meroitic_Hieroglyphs | Miao | Mongolian | Myanmar | New_Tai_Lue |
Nko | Ogham | Old_Italic | Old_Persian | Old_South_Arabian |
Old_Turkic | Ol_Chiki | Oriya | Osmanya | Phags_Pa |
Phoenician | Rejang | Runic | Samaritan | Saurashtra |
Sharada | Shavian | Sinhala | Sora_Sompeng | Sundanese |
Syloti_Nagri | Syriac | Tagalog | Tagbanwa | Tai_Le |
Tai_Tham | Tai_Viet | Takri | Tamil | Telugu |
Thaana | Thai | Tibetan | Tifinagh | Ugaritic |
Vai | Yi |
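Script matching can be sketched as follows. The sample text and variable names are invented for this example; the only assumption about the environment is a PCRE build with Unicode property support, used here through the u modifier.

```php
<?php
$text = 'alpha α beta β 漢字';

// \p{Greek} picks out the characters that belong to the Greek script.
preg_match_all('/\p{Greek}/u', $text, $greek);

// \P{Han} matches any character that is NOT in the Han script, so a
// string made only of such characters matches, while one containing
// Han ideographs does not.
$noHan  = preg_match('/^\P{Han}+$/u', 'abc αβγ') === 1;
$hasHan = preg_match('/^\P{Han}+$/u', 'abc 漢字') === 1;

// Digits, spaces and ASCII punctuation are not tied to a single script
// and are therefore reported as Common.
$common = preg_match('/^\p{Common}+$/u', '123 !') === 1;

var_dump($greek[0], $noHan, $hasHan, $common);
```

For this input, $greek[0] should hold the two Greek letters, $noHan and $common should be true, and $hasHan should be false.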
The \X escape matches a Unicode extended
grapheme cluster. An extended grapheme cluster is one or more
Unicode characters that combine to form a single glyph. In effect,
this can be thought of as the Unicode equivalent of the dot (.), as
it will match one composed character, regardless of how many
individual code points are actually used to render it.
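The difference between \X and the dot can be seen in a small sketch. It assumes PHP 7.0 or later for the "\u{...}" string escape; the subject string is an arbitrary example made of "e", a combining acute accent (U+0301), and "a".

```php
<?php
// Three code points, but only two user-perceived characters: "é" and "a".
$subject = "e\u{0301}a";

// \X consumes the base letter together with its combining accent.
$clusterCount = preg_match_all('/\X/u', $subject, $byCluster);

// The dot, in UTF-8 mode, consumes one code point at a time.
$codePointCount = preg_match_all('/./u', $subject, $byCodePoint);

var_dump($clusterCount, $codePointCount); // int(2), int(3)
```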
In versions of PCRE older than 8.32 (which
corresponds to PHP versions before 5.4.14 when using the bundled
PCRE library), \X is equivalent to
(?>\PM\pM*). That is, it matches a character without
the “mark” property, followed by zero or more characters with the
“mark” property, and treats the sequence as an atomic group (see
below). Characters with the “mark” property are typically accents
that affect the preceding character.
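The older definition can be written out directly, which may help when reading patterns meant for those PCRE versions. This is only a sketch of the equivalent pattern given above, applied to an arbitrary sample string ("\u{...}" needs PHP 7.0 or later).

```php
<?php
// "e" + U+0301 COMBINING ACUTE ACCENT (a mark), followed by "a".
$subject = "e\u{0301}a";

// One non-mark character followed by any number of marks, as an atomic group.
$count = preg_match_all('/(?>\PM\pM*)/u', $subject, $sequences);

var_dump($count, $sequences[0]); // two matches: "e" with its accent, then "a"
```

For simple base-plus-accent text like this, the result is the same as what \X gives; the newer extended grapheme cluster rules cover more cases than this pattern does.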
Matching characters by Unicode property is not
fast, because PCRE has to search a structure that contains data for
over fifteen thousand characters. That is why the traditional
escape sequences such as \d and \w do not use
Unicode properties in PCRE by default.