pcre2pattern.3 - OpenGrok cross reference for /aosp_15_r20/external/pcre/doc/pcre2pattern.3

Lines Matching full:are
7 The syntax and semantics of the regular expressions that are supported by PCRE2
17 Perl's regular expressions are described in its own documentation, and regular
18 expressions in general are covered in a number of books, some of which have
23 This document discusses the regular expression patterns that are supported by
27 discussed below are not available when DFA matching is used. The advantages and
29 function, are discussed in the
40 by special items at the start of a pattern. These are not Perl-compatible, but
41 are provided to make these options accessible to pattern writers who are not
159 These facilities are provided to catch runaway matches that are provoked by
180 \fBpcre2_dfa_match()\fP interpreters are used for matching. It does not apply
227 The newline convention affects where the circumflex and dollar assertions are
257 the sections below, character code values are ASCII or Unicode; in an EBCDIC
258 environment these characters may have different code values, and there are no
273 pattern), letters are matched independently of case. Note that there are two
275 equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F
281 character classes, alternatives, and repetitions in the pattern. These are
283 for themselves but instead are interpreted in some special way.
285 There are two different sets of metacharacters: those that are recognized
286 anywhere in the pattern except within square brackets, and those that are
303 Brace characters { and } are also used to enclose data for constructions such
305 tab characters that follow { or precede } are allowed and are ignored. In the
313 a character class the only metacharacters are:
323 between a # outside a character class and the next newline, inclusive, are
326 same applies, but in addition unescaped space and horizontal tab characters are
327 ignored inside a character class. Note: only these two characters are ignored,
328 not the full set of pattern white space characters that are ignored outside a
355 other characters (in particular, those whose code points are greater than 127)
361 behaviour is different from Perl in that $ and @ are handled as literals in
395 environment, these escapes are as follows:
412 digits are read (letters can be in upper or lower case). Any number of
417 Characters whose code points are less than 256 can be defined by either of the
419 they are handled. For example, \exdc is exactly the same as \ex{dc} or \e334.
433 also use curly brackets, spaces are not allowed and would result in the string
442 There are some legacy applications where the escape sequence \er is expected to
458 only characters that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ],
471 APC character. Unfortunately, there are several variants of EBCDIC. In most of
476 After \e0 up to two further octal digits are read. If there are fewer than two
477 digits, just those that are present are used. Thus the sequence \e0\ex\e015
498 if there are at least that many previous capture groups in the expression, the
510 Otherwise, up to three octal digits are read to form a character code.
519   \e40    is the same, provided there are fewer than 40
535 Note that octal values of 100 or greater that are specified using this syntax
537 digits are ever read.
543 Characters that are specified using octal or hexadecimal numbers are
551 Invalid Unicode code points are all those in the range 0xd800 to 0xdfff (the
555 and UTF-32 modes, because these values are not representable in UTF-16.
566 \eB, \eR, and \eX are not special inside a character class. Like other
574 In Perl, the sequences \eF, \el, \eL, \eu, and \eU are recognized by its string
587 can be coded as \eg{name}. Backreferences are discussed
604 syntax for referencing a capture group as a subroutine. Details are discussed
609 Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
658 The default \es characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
659 space (32), which are defined as white space in the "C" locale. This list may
677 or "french" in Windows, some character codes greater than 127 are used for
678 accented letters, and these are then matched by \ew. The use of locales with
681 By default, characters whose code points are greater than 127 never match \ed,
686 is set, the behaviour is changed so that Unicode properties are used to
700 \eB because they are defined in terms of \ew and \eW. Matching these sequences
714 points, whether or not PCRE2_UCP is set. The horizontal space characters are:
736 The vertical space characters are:
760 This is an example of an "atomic group", details of which are given
771 In other modes, two additional characters whose code points are greater than 255
787 Note that these special settings, which are not Perl-compatible, are recognized
804 sequences that match characters with specific properties are available. They
806 sequences are of course limited to testing characters whose code points are
808 greater than 0x10ffff (the Unicode limit) may be encountered. These are all
817 The extra escape sequences that provide property support are:
823 The property names represented by \fIxx\fP above are not case-sensitive, and in
825 underscores are ignored. There is support for Unicode script names, Unicode
833 Certain other Perl properties such as "InMusicalSymbols" are not supported by
842 There are three different syntax forms for matching a script. Each Unicode
848 property types are recognized, and an equals sign is an alternative to the
854 greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
855 part of an identified script are lumped together as "Common". The current list
874 of negation, the curly brackets in the escape sequence are optional; these two
880 The following general category property codes are supported:
930 The Cs (Surrogate) property applies only to characters whose code points are in
931 the range U+D800 to U+DFFF. These characters are no different to any other
933 However, they are not valid in Unicode strings and so cannot be tested by PCRE2
958 values are true or false. You can obtain a list of those that are recognized by
971 The recognized classes are:
997 An equals sign may be used instead of a colon. The class names are
998 case-insensitive; only the short names listed above are recognized.
1012 define the boundaries of extended grapheme clusters. The rules are defined in
1043 regional indicator (RI) characters if there are an odd number of RI characters
1057 explicitly. These properties are:
1074 languages. These are the characters $, @, ` (grave accent), and all characters
1076 surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are
1077 excluded. (Universal Character Names are of the form \euHHHH or \eUHHHHHHHH
1146 The backslashed assertions are:
1172 start and end of the subject string, whatever options are set. Thus, they are
1173 independent of multiline mode. These three assertions are not affected by the
1203 The circumflex and dollar metacharacters are zero-width assertions. That is,
1205 characters from the subject string. These two metacharacters are concerned with
1208 and LF characters are treated as ordinary data characters, and are not
1223 alternatives are involved, but it should be the first thing in each alternative
1227 "anchored" pattern. (There are also other constructs that can cause a pattern
1234 character of the pattern if a number of alternatives are involved, but it
1242 The meanings of the circumflex and dollar metacharacters are changed if the
1252 patterns that are anchored in single line mode because all branches start with
1253 ^ are not anchored in multiline mode, and a match for circumflex is possible
1263 preferred, even if the single characters CR and LF are also recognized as
1291 of CR or LF match dot. When all Unicode line endings are being recognized, dot
1371 character's individual bytes are then captured by the appropriate number of
1397 are in the class by enumerating those that are not. A class that starts with a
1406 match "A", whereas a caseful version would. Note that there are two ASCII
1411 Characters that might indicate line breaks are never treated in any special way
1427 class; it matches the backspace character. The sequences \eB, \eR, and \eX are
1456 example [\e000-\e037]. Ranges can include any characters that are valid for the
1461 surrogates, are always permitted.
1463 There is a special case in EBCDIC environments for ranges whose end points are
1465 Perl, EBCDIC code points within the range that are not letters are omitted. For
1474 tables for a French locale are in use, [\exc8-\excb] matches accented E
1484 The only metacharacters that are recognized in character classes are backslash,
1519 The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
1531 syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
1532 supported, and an error is given if they are encountered.
1537 unless certain options are set (see below), some of the classes are changed so
1538 that Unicode character properties are used. This is achieved by replacing
1552 classes are handled specially in UCP mode:
1565 This matches the same characters as [:graph:] plus space characters that are
1578 The other POSIX classes are unchanged by PCRE2_UCP, and match only characters
1581 There are two options that can be used to restrict the POSIX classes to ASCII
1599 Only these exact character sequences are recognized. A sequence such as
1609 normally shows which is wanted, without the need for the assertions that are
1618 Vertical bar characters are used to separate alternative patterns. For example,
1626 that succeeds is used. If the alternatives are within a group
1640 of letters enclosed between "(?" and ")". The following are Perl-compatible,
1641 and are described in detail in the
1645 documentation. The option letters are:
1656 example (?-im). The two "extended" options are not independent; unsetting
1681 However, except for 'r', these are not unset by (?^), which is equivalent to
1693 group these options are reset to the state they were before the group. For
1709 As a convenient shorthand, if any option settings are required at the start of
1718 \fBNote:\fP There are other PCRE2-specific options, applying to the whole
1722 Details are given in the section entitled
1727 above. There are also the (*UTF) and (*UCP) leading sequences that can be used
1728 to set UTF and Unicode property modes; they are equivalent to setting the
1738 Groups are delimited by parentheses (round brackets), which can be nested.
1754 Opening parentheses are counted from left to right (starting from 1) to obtain
1760 the captured substrings are "red king", "red", and "king", and are numbered 1,
1764 There are often times when grouping is required without capturing. If an
1772 the captured substrings are "white queen" and "queen", and are numbered 1 and
1775 As a convenient shorthand, if any option settings are required at the start of
1782 match exactly the same set of strings. Because alternative branches are tried
1783 from left to right, and options are not reset until the end of the group is
1798 Because the two alternatives are inside a (?| group, both sets of capturing
1799 parentheses are numbered one. Thus, when the pattern matches, you can look
1802 alternatives. Inside a (?| group, parentheses are numbered as usual, but the
1875 Named capture groups are allocated numbers as well as names, exactly as
1877 are primarily identified by numbers; any names are just aliases for these
1885 Consider this pattern, where there are two capture groups, both numbered 1:
1930 There are five capture groups, but only one is ever set after a match. The
1938 pattern, the groups to which the name refers are checked in the order in which
1958 recursion, all groups with the same name are tested. If the condition is true
1995 no upper limit; if the second number and the comma are both omitted, the
2018 because braces are used in other items such as \eN{U+345} or \ek{name}.
2028 capture groups that are referenced as
2038 below). Except for parenthesized groups, items that have a {0} quantifier are
2054 such patterns. However, because there are cases where this can be useful, such
2055 patterns are now accepted, but whenever an iteration of such a group matches no
2060 By default, quantifiers are "greedy", that is, they match as much as possible
2091 the quantifiers are not greedy by default, but individual ones can be made
2110 However, there are some cases where the optimization cannot be used. When .*
2111 is inside capturing parentheses that are the subject of a backreference
2136 "tweedledee". However, if there are nested capture groups, the corresponding
2185 Atomic groups are not capture groups. Simple cases such as the above example
2187 So, while both \ed+ and \ed+? are prepared to adjust the number of digits they
2205 Possessive quantifiers are always greedy; the setting of the PCRE2_UNGREEDY
2206 option is ignored. They are a convenient notation for the simpler forms of
2259 always taken as a backreference, and causes an error only if there are not that
2275 there is no problem when named capture groups are used (see below).
2279 signed or unsigned number, optionally enclosed in braces. These examples are
2302 patterns that are created by joining together fragments that contain references
2330 There are several different ways of writing backreferences to named capture
2332 original Perl syntax is \ek<name> or \ek'name'. All of these are now supported
2356 backslash are taken as part of a potential backreference number. If the pattern
2399 coded as \eb, \eB, \eA, \eG, \eZ, \ez, ^ and $ are described
2405 More complicated assertions are coded as parenthesized groups. There are two
2413 The Perl-compatible lookaround assertions are atomic. If an assertion is true,
2415 assertion. However, there are some cases where non-atomic assertions can be
2421 below, but they are not Perl-compatible.
2431 Assertion groups are not capture groups. If an assertion contains capture
2432 groups within it, these are counted for the purposes of numbering the capture
2435 such as (.)\eg{-1} can be used to check that two adjacent characters are the
2439 captured are discarded (as happens with any pattern branch that fails to
2441 this means that no captured substrings are ever retained after a successful
2446 branch are retained, and matching continues with the next pattern item after
2453 (see below), captured substrings are retained, because matching continues with
2482 sections, the various assertions are described using the original symbolic
2506 (?!foo) is always true when the next three characters are "bar". A
2525 a lookbehind assertion are restricted such that there must be a known maximum
2526 to the lengths of all the strings it matches. There are two cases:
2556 \eX and \eR escapes, which can match different numbers of code units, are never
2563 calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
2597 covers the entire string, from right to left, so we are no better off. However,
2616 matches "foo" preceded by three digits that are not "999". Notice that each of
2618 string. First there is a check that the previous three characters are all
2619 digits, and then there is a check that the same three characters are not "999".
2621 of which are digits and the last three of which are not "999". For example, it
2627 that the first three are digits, and then the second assertion checks that the
2628 preceding three characters are not "999".
2640 characters that are not "999".
2647 Traditional lookaround assertions are atomic. That is, if an assertion is true,
2649 assertion. However, there are some cases where non-atomic positive assertions
2671 using an ungreedy .*? to scan from the left. If this succeeds, we are done, but
2693 Non-atomic assertions are not supported by the alternative matching function
2694 \fBpcre2_dfa_match()\fP. They are supported by JIT, but only if they do not
2707 In concept, a script run is a sequence of characters that are all from the same
2708 Unicode script such as Latin or Greek. However, because some scripts are
2709 commonly used together, and because some diacritical and other marks are used
2723 parenthesis, it fails if the sequence of characters that it matches are not a
2725 used to detect spoofing attacks using characters that look the same, but are
2728 the matched characters in a sequence of non-spaces that follow white space are
2733 To be sure that they are all from the Latin script (for example), a lookahead
2741 digits, underscore, and dots are permitted at the start:
2758 encountered. Script runs are not supported by the alternate matching function,
2778 already been matched. The two possible forms of conditional group are:
2785 string (it always matches). If there are more than two alternatives in the
2789 itself. This pattern fragment is an example where the alternatives are complex:
2794 There are five kinds of condition: references to capture groups, references to
2826 matches one or more characters that are not parentheses. The third part is a
2849 digits are ambiguous (see the following section). Rewriting the above example
2902 At "top level", all these recursion test conditions are false.
2940 they are dealing with by using this condition to match a string such as
2973 dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
2979 for which captures are retained only for positive assertions that succeed.)
2986 There are two ways of including comments in patterns that are processed by
2993 closing parenthesis. Nested parentheses are not permitted. If the
2997 characters are interpreted as newlines is controlled by an option passed to the
3087 The first two capture groups (a) and (b) are both numbered 1, and group (c)
3091 reference (?1) was used. In other words, relative references are just a
3096 reference is not inside the parentheses that are referenced. They are always
3120 the match runs for a very long time indeed because there are so many different
3124 At the end of a match, the values of capturing parentheses are those from
3141 arbitrary nesting. Only digits are allowed in nested brackets (that is, when
3142 recursing), whereas any characters are permitted at the outer level.
3163 Starting with release 10.30, recursive subroutine calls are no longer treated
3175 palindrome when there are an odd number of characters, or nothing when there
3233 occur. However, any capturing parentheses that are set during the subroutine
3236 Processing options such as case-independence are fixed when a group is
3264 syntax for calling a group as a subroutine, possibly recursively. Here are two
3275 Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
3292 entry point is set to NULL, callouts are disabled.
3295 function is to be called. There are two kinds of callout: those with a
3309 one side-effect is that sometimes callouts are skipped. If you need all
3312 programming interface to the callout function, are given in the
3329 callouts are automatically installed before each item in the pattern. They are
3359 There are a number of special "Backtracking Control Verbs" (to use Perl's
3363 present. The names are not required to be unique within the pattern.
3373 only backslash items that are permitted are \eQ, \eE, and sequences such as
3380 skipped, and #-comments are recognized, exactly as in the rest of the pattern.
3390 Since these verbs are specifically related to backtracking, most of them can be
3417 PCRE2 contains some optimizations that are used to speed up matching by running
3444 The following verbs act as soon as they are encountered.
3482 (??{}). Those are, of course, Perl features that are not present in PCRE2. The
3518 including those inside assertions and atomic groups. However, there are
3556 If you are interested in (*MARK) values after failed matches, you should
3568 The following verbs do nothing when they are encountered. Matching continues
3598 caller. However, (*SKIP:NAME) searches only for names that are set with
3606 unless PCRE2's start-of-match optimizations are turned off, as shown in this
3633 possessive quantifier, but there are some uses of (*PRUNE) that cannot be
3674 means that it does not see (*MARK) settings that are inside atomic groups or
3675 assertions, because they are never re-entered by backtracking. Compare the
3698 names that are set by other backtracking verbs.
3712 succeeds and BAR fails, COND3 is tried. If subsequently BAZ fails, there are no
3724 Consider this pattern, where A, B, etc. are complex pattern fragments that do
3729 If A and B are matched, but there is a failure in C, matching does not
3768 etc. are complex pattern fragments:
3795 If the subject is "abac", Perl matches unless its optimizations are disabled,
3810 without any further processing; captured strings and a mark name (if set) are
3812 fail without any further processing; captured substrings and any mark name are
3816 a positive assertion and false for a negative one; captured substrings are
3821 is confined to the assertion, because Perl lookaround assertions are atomic. A
3832 above. These assertions must be standalone (not used as conditions). They are
3841 The other backtracking verbs are not treated specially if they appear in a