Lines Matching full:is
21 PCRE2 is the name used for a revised API for the PCRE library, which is
26 API is more extensible, and it was simplified by abolishing the sepa-
29 extensively refactored and new features introduced. The old library is
30 now obsolete and is no longer maintained.
34 are available using the Python syntax. There is also some support for
42 unit is not related to the bit size of the underlying hardware. In a
50 gory properties. Unicode support is optional at build time (but is the
61 generic names such as pcre2_compile(), and the documentation is written
62 assuming that this is the case.
72 pcre2pattern and pcre2compat pages. There is a syntax summary in the
76 library is built. The pcre2_config() function makes it possible for a
86 any name clashes. In some environments, it is possible to control which
87 external symbols are exported when a shared library is built, and in
99 tern and any data against which it is matched to be checked for UTF-8
100 validity. If the data string is very long, such a check might use suf-
104 One way of guarding against this possibility is to use the pcre2_pat-
114 If your application is one that supports UTF, be aware that validity
115 checking can take time. If the same data string is to be matched many
123 compile-time error if it is encountered. It is also possible to build
126 Another way that performance can be hit is by running a pattern that
130 function in the pcre2api page. There is a similar function called
132 ory that is used.
138 tions. In the "man" format, each of these is a separate "man page". In
139 the HTML format, each is a separate page, linked from the index page.
143 (which is a program listing), and the short pages for individual func-
170 In the "man" and HTML formats, there is also a short page for each C
180 Putting an actual email address here is a spam magnet. If you want to
203 PCRE2 is a new API for PCRE, starting at release 10.0. This document
450 first is replaced by pcre2_set_depth_limit(); the second is no longer
477 patterns that can be processed by pcre2_compile(). This facility is ex-
486 code units, respectively. However, there is just one header file,
504 For example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR
505 types are pointers to constants of the equivalent UCHAR types, that is,
517 tion and macro names. PCRE2_CODE_UNIT_WIDTH is not defined by default.
524 Any code that is to be included in an environment where the value of
525 PCRE2_CODE_UNIT_WIDTH is unknown should also use the real function
526 names. (Unfortunately, it is not possible in C code to save and restore
529 If PCRE2_CODE_UNIT_WIDTH is not defined before including pcre2.h, a
545 PCRE2 has its own native API, which is described in this document.
564 program that demonstrates the simplest way of using them is provided in
566 of this program is given in the pcre2demo documentation, and the
576 Just-in-time (JIT) compiler support is an optional feature of PCRE2
581 nothing if JIT support is not available.
588 JIT matching is automatically used by pcre2_match() if it is available,
589 unless the PCRE2_NO_JIT option is set. There is also a direct interface
594 A second matching function, pcre2_dfa_match(), which is not Perl-com-
595 patible, is also provided. This uses a different algorithm for the
600 and their advantages and disadvantages is given in the pcre2matching
601 documentation. There is no JIT support for pcre2_dfa_match().
619 functions is called with a NULL argument, the function returns immedi-
634 blocks of various sorts. In all cases, if one of these functions is
642 which is an unsigned integer type, currently always defined as size_t.
643 The largest value that can be stored in such a type (that is
644 ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated
646 handled is one less than this maximum. Note that string lengths are al-
647 ways given in code units. Only in the 8-bit library is such a length
661 Each of the first three conventions is used by at least one operating
662 system as its standard newline sequence. When PCRE2 is built, a default
663 can be specified. If it is not, the default is set to LF, which is the
670 In the PCRE2 documentation the word "newline" is used to mean "the
674 CRLF is a recognized line ending sequence, the match position advance-
675 ment for a non-anchored pattern. There is more detail about this in the
685 In a multithreaded application it is important to keep thread-specific
687 library code itself is thread-safe: it contains no static or global
688 variables. The API is designed to be fairly simple for non-threaded ap-
697 A pointer to the compiled form of a pattern is returned to the user
698 when pcre2_compile() is successful. The data in the compiled pattern is
699 fixed, and does not change when the pattern is matched. Therefore, it
700 is thread-safe, that is, the same compiled pattern can be used by more
703 use them. However, if the just-in-time (JIT) optimization feature is
710 multiple threads. This is somewhat tricky to do correctly. If you know
711 that writing to a pointer is atomic in your environment, you can use
726 The reason for checking the pointer a second time is as follows: Sev-
736 above logic is not sufficient. The thread that is doing the compiling
739 pointer itself, a separate "pointer is valid" flag (that can be updated
755 If JIT is being used, but the JIT compilation is not being done immedi-
756 ately (perhaps waiting to see if the pattern is used often enough),
757 similar logic is required. JIT compilation updates a value within the
767 PCRE2 functions are called. A context is nothing more than a collection
769 parameters together in a context is a convenient way of passing them to
794 in a context instead of directly. A context is just a block of memory
797 is required.
799 There are three different types of context: a general context that is
807 in the PCRE2 library. The context is named `general' rather than
810 you do not need to bother with a general context. A general context is
823 Whenever code in PCRE2 calls these functions, the final argument is the
826 tions malloc() and free() are used. (This is not currently useful, as
828 might be.) The private_malloc() function is used (if supplied) to ob-
834 was used. When the time comes to free the block, this function is
846 If this function is passed a NULL argument, it returns immediately
851 A compile context is required if you want to provide an external func-
862 A compile context is also required if you are using custom memory man-
866 A compile context is created, copied, and freed by the following func-
877 A compile context is created with default values for its parameters.
879 on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
886 Unicode line ending sequence. The value is used by the JIT compiler and
894 only argument is a general context. This function builds a set of char-
912 is compiled with this context. If the pattern is longer, an error is
913 generated. This facility is provided so that applications that accept
914 patterns from external sources can limit their size. The default is the
915 largest number that a PCRE2_SIZE variable can hold, which is effec-
922 compiled version of a pattern that is compiled with this context. If
923 the pattern needs more memory, an error is generated. This facility is
925 sources can limit the amount of memory they use. The default is the
926 largest number that a PCRE2_SIZE variable can hold, which is effec-
933 variable-length lookbehind assertion. The default is set when PCRE2 is
945 PCRE2_NEWLINE_NUL (the NUL character, that is a binary zero).
950 When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EX-
952 the end of internal comments starting with #. The value is saved with
960 This parameter adjusts the limit, set when PCRE2 is built (default
969 There is at least one application that runs PCRE2 in threads with very
970 limited system stack, where running out of stack is to be avoided at
972 stack is actually available during compilation. For a finer control,
973 you can supply a function that is called whenever pcre2_compile()
979 nesting, and the second is user data that is set up by the last argu-
981 should return zero if all is well, or non-zero to force an error.
985 A match context is required if you want to:
997 A match context is created, copied, and freed by the following func-
1008 A match context is created with default values for its parameters.
1010 on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
1032 vance in the subject string. The default value is PCRE2_UNSET. The
1034 MATCH if a match with a starting point before or at the given offset is
1037 For example, if the pattern /abc/ is matched against "123abc" with an
1038 offset limit less than 3, the result is PCRE2_ERROR_NOMATCH. A match
1040 pcre2_dfa_match(), or pcre2_substitute() is greater than the offset
1044 tion when calling pcre2_compile() so that when JIT is in use, different
1045 code can be compiled. If a match is started with a non-default match
1046 limit when PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
1052 the subject. If this is set with an offset limit, a match must occur in
1054 whichever limit comes first is used.
1066 pcre2jit documentation for more details). If the limit is reached, the
1067 negative error code PCRE2_ERROR_HEAPLIMIT is returned. The default
1068 limit can be set when PCRE2 is built; if it is not, the default is set
1069 very large and is essentially unlimited.
1076 where ddd is a decimal number. However, such a setting is ignored un-
1077 less ddd is less than the limit set by the caller of pcre2_match() or,
1078 if no such limit is set, less than the default.
1085 For pcre2_dfa_match(), a vector on the system stack is used when pro-
1087 this is not big enough is heap memory used. In this case, setting a
1096 in their search trees. The classic example is a pattern that uses
1099 There is an internal counter in pcre2_match() that is incremented each
1105 to pcre2_dfa_match(), though the counting is done in a different way.
1107 When pcre2_match() is called with a pattern that was successfully
1108 processed by pcre2_jit_compile(), the way in which matching is executed
1109 is entirely different. However, there is still the possibility of run-
1111 value is also used in this case (but in a different way) to limit how
1114 The default value for the limit can be set when PCRE2 is built; the de-
1115 fault is 10 million, which handles all but the most extreme cases. A
1121 where ddd is a decimal number. However, such a setting is ignored un-
1122 less ddd is less than the limit set by the caller of pcre2_match() or
1123 pcre2_dfa_match() or, if no such limit is set, less than the default.
1129 pcre2_match(). Each time a nested backtracking point is passed, a new
1130 memory frame is used to remember the state of matching at that point.
1131 Thus, this parameter indirectly limits the amount of memory that is
1137 The depth limit is not relevant, and is ignored, when matching is done
1138 using JIT compiled code. However, it is supported by pcre2_dfa_match(),
1141 recursions. This limits, indirectly, the amount of system stack that is
1148 If the depth of internal recursive function calls is great enough, lo-
1151 memory that is used. A recursive pattern such as /(.(?2))((?1)|)/, when
1153 deal of memory. However, it is probably better to limit heap usage di-
1156 The default value for the depth limit can be set when PCRE2 is built;
1157 if it is not, the default is set to the same value as the default for
1158 the match limit. If the limit is exceeded, pcre2_match() or
1165 where ddd is a decimal number. However, such a setting is ignored un-
1166 less ddd is less than the limit set by the caller of pcre2_match() or
1167 pcre2_dfa_match() or, if no such limit is set, less than the default.
1179 The first argument for pcre2_config() specifies which information is
1180 required. The second argument is a pointer to memory into which the in-
1181 formation is placed. If NULL is passed, the function returns the amount
1182 of memory that is needed for the requested information. For calls that
1183 return numerical values, the value is in bytes; when requesting these
1185 that return strings, the required length is given in code units, not
1188 When requesting information, the returned value from pcre2_config() is
1190 TION if the value in the first argument is not recognized. The follow-
1191 ing information is available:
1195 The output is a uint32_t integer whose value indicates what character
1199 or CRLF. The default can be overridden when a pattern is compiled.
1203 The output is a uint32_t integer whose lower bits indicate which code
1210 The output is a uint32_t integer that gives the default limit for the
1217 The output is a uint32_t integer that gives, in kibibytes, the default
1224 The output is a uint32_t integer that is set to one if support for
1225 just-in-time compiling is included in the library; otherwise it is set
1232 The where argument should point to a buffer that is at least 48 code
1234 pcre2_config() with where set to NULL.) The buffer is filled with a
1236 compiler is configured, for example "x86 32bit (little endian + un-
1237 aligned)". If JIT support is not available, PCRE2_ERROR_BADOPTION is
1238 returned, otherwise the number of code units used is returned. This is
1243 The output is a uint32_t integer that contains the number of bytes used
1244 for internal linkage in compiled regular expressions. When PCRE2 is
1246 2. This is the value that is returned by pcre2_config(). However, when
1247 the 16-bit library is compiled, a value of 3 is rounded up to 4, and
1248 when the 32-bit library is compiled, internal linkages always use 4
1249 bytes, so the configured value is not relevant.
1251 The default value of 2 for the 8-bit and 16-bit libraries is sufficient
1259 The output is a uint32_t integer that gives the default match limit for
1265 The output is a uint32_t integer whose value specifies the default
1266 character sequence that is recognized as meaning "newline". The values
1281 The output is a uint32_t integer that is set to one if the use of \C
1282 was permanently disabled when PCRE2 was built; otherwise it is set to
1287 The output is a uint32_t integer that gives the maximum depth of nest-
1288 ing of parentheses (of any kind) in a pattern. This limit is imposed to
1289 cap the amount of system stack used when a pattern is compiled. It is
1290 specified when PCRE2 is built; the default is 250. This limit does not
1297 This parameter is obsolete and should not be used in new code. The out-
1298 put is a uint32_t integer that is always set to zero.
1302 The output is a uint32_t integer that gives the length of PCRE2's char-
1308 The where argument should point to a buffer that is at least 24 code
1311 without Unicode support, the buffer is filled with the text "Unicode
1313 "8.0.0") is inserted. The number of code units used is returned. This
1314 is the length of the string plus one unit for the terminating zero.
1318 The output is a uint32_t integer that is set to one if Unicode support
1319 is available; otherwise it is set to zero. Unicode support implies UTF
1324 The where argument should point to a buffer that is at least 24 code
1326 pcre2_config() with where set to NULL.) The buffer is filled with the
1327 PCRE2 version string, zero-terminated. The number of code units used is
1328 returned. This is the length of the string plus one unit for the termi-
1345 The pattern is defined by a pointer to a string of code units and a
1346 length in code units. If the pattern is zero-terminated, the length can
1348 length of zero is treated as an empty string (NULL with a non-zero
1353 If the compile context argument ccontext is NULL, memory for the com-
1354 piled pattern is obtained by calling malloc(). Otherwise, it is ob-
1357 it is no longer needed. If pcre2_code_free() is called with a NULL ar-
1363 low), the JIT information cannot be copied (because it is position-de-
1366 pcre2_code_copy() is called with a NULL argument, it returns NULL.
1374 used throughout, so this behaviour is appropriate. Nevertheless, there
1378 pointing to the new tables. The memory for the new tables is automati-
1379 cally freed when pcre2_code_free() is called for the new copy of the
1380 compiled code. If pcre2_code_copy_with_tables() is called with a NULL
1383 NOTE: When one of the matching functions is called, pointers to the
1389 string, you have used the PCRE2_COPY_MATCHED_SUBJECT option, which is
1410 If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
1419 ing is in force. These are the same as given by pcre2_match() and
1421 There is no separate documentation for the positive error codes, be-
1426 pcre2.h. When compilation is successful errorcode is set to a value
1430 The value returned in erroroffset is an indication of where in the pat-
1431 tern an error occurred. When there is no error, zero is returned. A
1432 non-zero value is not necessarily the furthest point in the pattern
1433 that was read. For example, after the error "lookbehind assertion is
1435 assertion. For an invalid UTF-8 or UTF-16 string, the offset is that of
1439 in these cases, the offset passed back is the length of the pattern.
1440 Note that the offset is in code units, not characters, even in a UTF
1452 PCRE2_ZERO_TERMINATED, /* the pattern is zero-terminated */
1466 If this bit is set, the pattern is forced to be "anchored", that is, it
1467 is constrained to match only at the first matching point in the string
1468 that is being searched (the "subject string"). This effect can also be
1469 achieved by appropriate constructs in the pattern itself, which is the
1475 immediately follows an opening one is treated as a data character for
1476 the class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the
1483 When it is set:
1488 (2) \u matches a lower case "u" character unless it is followed by four
1493 (3) \x matches a lower case "x" character unless it is followed by two
1495 code point to match. By default, as in Perl, a hexadecimal number is
1507 In multiline mode (when PCRE2_MULTILINE is set), the circumflex
1509 is set), and also after any internal newline. However, it does not
1517 such as (*MARK:NAME) is any sequence of characters that does not in-
1518 clude a closing parenthesis. The name is not processed in any way, and
1519 it is not possible to include a closing parenthesis in the name. How-
1520 ever, if the PCRE2_ALT_VERBNAMES option is set, normal backslash pro-
1521 cessing is applied to verb names and only an unescaped closing paren-
1524 PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
1525 whitespace in verb names is skipped and #-comments are recognized, ex-
1530 If this bit is set, pcre2_compile() automatically inserts callout
1537 If this bit is set, letters in the pattern match both upper and lower
1538 case letters in the subject. It is equivalent to Perl's /i option, and
1540 PCRE2_UTF or PCRE2_UCP is set, Unicode properties are used for all
1548 For lower valued characters with only one other case, a lookup table is
1549 used for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup
1550 table is used for all code points less than 256, and higher code points
1556 If this bit is set, a dollar metacharacter in the pattern matches only
1559 before any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
1560 if PCRE2_MULTILINE is set. There is no equivalent to this option in
1565 If this bit is set, a dot metacharacter in the pattern matches any
1569 ject is at a newline. This option is equivalent to Perl's /s option,
1577 If this bit is set, names used to identify capture groups need not be
1578 unique. This can be helpful for certain types of pattern when it is
1585 If this bit is set, the end of any pattern match must be right at the
1589 patterns, a new match is then tried at the next starting point. How-
1600 which is the only way to do it in Perl.
1603 to the first (that is, the longest) matched string. Other parallel
1609 If this bit is set, most white space characters in the pattern are to-
1611 a \Q...\E sequence. However, white space is not allowed within se-
1613 within numerical quantifiers such as {1,3}. Ignorable white space is
1616 TENDED is equivalent to Perl's /x option, and it can be changed within
1619 When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recog-
1622 ble is normally created by pcre2_maketables(), which uses the isspace()
1628 When PCRE2 is compiled with Unicode support, in addition to these char-
1632 U+2029 (paragraph separator). This set of characters is the same as
1641 comment is a literal newline sequence in the pattern; escape sequences
1645 ting in the compile context that is passed to pcre2_compile() or by a
1648 A default is defined when PCRE2 is built.
1656 acter class. PCRE2_EXTENDED_MORE is equivalent to Perl's /xx option,
1661 If this option is set, the start of an unanchored pattern match must be
1664 line. If startoffset is non-zero, the limiting newline is not necessar-
1666 string is "abc\nxyz" (where \n represents a single-character newline) a
1667 pattern match for "yz" succeeds with PCRE2_FIRSTLINE if startoffset is
1669 general limiting facility. If PCRE2_FIRSTLINE is set with an offset
1671 limit. In other words, whichever limit comes first is used. This option
1676 If this option is set, all meta-characters in the pattern are disabled,
1677 and it is treated as a literal string. Matching literal strings with a
1678 regular expression engine is not the most efficient way of doing it. If
1695 less such sequences are suitably aligned. This facility is not sup-
1701 If this option is set, a backreference to an unset capture group
1704 tion is set (assuming it can find an "a" in the subject), whereas it
1716 DONLY is set). Note, however, that unless PCRE2_DOTALL is set, the "any
1718 iour (for ^, $, and dot) is the same as Perl.
1720 When PCRE2_MULTILINE is set, the "start of line" and "end of line"
1723 start and end. This is equivalent to Perl's /m option, and it can be
1733 This option locks out the use of \C in the pattern that is being com-
1738 is also a build-time option that permanently locks out the use of \C.
1753 or UTF-32, depending on which library is in use. In particular, it pre-
1761 If this option is set, it disables the use of numbered capturing paren-
1762 theses in the pattern. Any opening parenthesis that is not followed by
1765 is the same as Perl's /n option. Note that, when this option is set,
1772 If this option is set, it disables "auto-possessification", which is an
1777 a full unoptimized search and run all the callouts, but it is mainly
1782 If this option is set, it disables an optimization that is applied when
1783 .* is the first significant item in a top-level branch of a pattern,
1785 The optimization is automatically disabled for .* if it is inside an
1786 atomic group or a capture group that is the subject of a backreference,
1788 is not disabled, such a pattern is automatically anchored if
1789 PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
1791 at the start of the subject or following a newline is remembered. Like
1796 This is an option whose main effect is at matching time. It does not
1801 match, in order to speed up the process. For example, if it is known
1806 start of a pattern is not considered until after a suitable starting
1809 skipped if the pattern is never actually used. The start-up optimiza-
1811 the pattern is run.
1815 where the result is "no match", the callouts do occur, and that items
1824 When this is compiled, PCRE2 records the fact that a match must start
1825 with the character "A". Suppose the subject string is "DEFABC". The
1829 does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE
1831 first match attempt is run starting from "D" and when this fails,
1833 sult is "no match".
1836 matching subject, which is recorded when possible. Consider the pattern
1840 The minimum length for a match is two characters. If the subject is
1842 match "BB", which is long enough. In the process, (*MARK:2) is encoun-
1843 tered and remembered. When the match attempt fails, the next "B" is
1844 found, but there is only one character left, so there are no more at-
1845 tempts, and "no match" is returned with the "last mark seen" set to
1846 "2". If NO_START_OPTIMIZE is set, however, matches are tried at every
1848 (*MARK:1) is encountered, but there is no "B", so the "last mark seen"
1849 that is returned is "1". In this case, the optimizations do not affect
1850 the overall match result, which is still "no match", but they do affect
1851 the auxiliary information that is returned.
1855 When PCRE2_UTF is set, the validity of the pattern as a UTF string is
1858 document. If an invalid UTF sequence is found, pcre2_compile() returns
1861 If you know that your pattern is a valid UTF string, and you want to
1863 PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an in-
1864 valid UTF string as a pattern is undefined. It may cause your program
1872 able the error that is given if an escape sequence for an invalid Uni-
1873 code code point is encountered in the pattern. In particular, the so-
1877 section entitled "Extra compile options" below. However, this is pos-
1886 PCRE2_UCP is set, Unicode properties are used to classify characters.
1891 The second effect of PCRE2_UCP is to force the use of Unicode proper-
1892 ties for upper/lower casing operations, even when PCRE2_UTF is not set.
1894 This option is available only if PCRE2 has been compiled with Unicode
1895 support (which is the default). The PCRE2_EXTRA_CASELESS_RESTRICT op-
1903 are not greedy by default, but become greedy if followed by "?". It is
1910 is going to be used to set a non-default offset limit in a match con-
1911 text for matches that use this pattern. An error is generated if an
1912 offset limit is set without this option. For more details, see the de-
1920 instead of single-code-unit strings. It is available when PCRE2 is
1921 built to include Unicode support (which is the default). If Unicode
1922 support is not available, the use of this option provokes an error. De-
1935 assertions, following Perl's lead. This option is provided to re-enable
1937 ones) in case anybody is relying on it.
1942 It is forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode
1948 string that is being checked for validity by PCRE2.
1957 If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surro-
1960 only match subject characters if the matching function is called with
1969 as a hexadecimal character code, where hhh.. is any number of hexadeci-
1975 is set. It can be changed within a pattern by means of the (?aD) op-
1981 PCRE2_UCP is set. It can be changed within a pattern by means of the
1987 PCRE2_UCP is set. It can be changed within a pattern by means of the
1993 to match only ASCII digits, even when PCRE2_UCP is set. It can be
1999 and [:xdigit:], to match only ASCII characters, even when PCRE2_UCP is
2006 This is a dangerous option. Use with care. By default, an unrecognized
2008 time error when detected by pcre2_compile(). Perl is somewhat inconsis-
2009 tent in handling such items: for example, \j is treated as a literal
2011 ings are given in both cases if Perl's warning switch is enabled. How-
2015 If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
2017 treated as single-character escapes. For example, \j is a literal "j"
2018 and \x{2z} is treated as the literal string "x{2z}". Setting this op-
2020 results. Also note that a sequence such as [\N{] is interpreted as a
2021 malformed attempt at [\N{...}] and so is treated as [N{] whereas [\N]
2022 gives an error because an unqualified \N is a valid escape sequence but
2023 is not supported in a character class. To reiterate: this is a danger-
2028 When either PCRE2_UCP or PCRE2_UTF is set, caseless matching follows
2031 ASCII characters. The ASCII letter S is case-equivalent to U+017f (long
2032 S) and the ASCII letter K is case-equivalent to U+212a (Kelvin sign).
2041 pattern is expected to match a newline. If this option is set, \r in a
2042 pattern is converted to \n so that it matches a LF (linefeed) instead
2049 This option is provided for use by the -x option of pcre2grep. It
2050 causes the pattern only to match complete lines. This is achieved by
2052 piled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set,
2058 This option is provided for use by the -w option of pcre2grep. It
2060 the start and the end. This is achieved by automatically inserting the
2062 end. The option may be used with PCRE2_LITERAL. However, it is ignored
2063 if PCRE2_EXTRA_MATCH_LINE is also set.
2086 just-in-time compiler is available, further processes a compiled pat-
2091 JIT compilation is a heavyweight optimization. It can take some time
2111 When PCRE2 is built with Unicode support (the default), certain Unicode
2113 the PCRE2_UCP option can be set when a pattern is compiled; this causes
2117 effects apply even when PCRE2_UTF is not set. There are, however, some
2121 The use of locales with Unicode is discouraged. If you are handling
2127 ternal tables recognize only ASCII characters. However, when PCRE2 is
2128 built, it is possible to cause the internal tables to be rebuilt in the
2135 code, the need for this locale support is expected to die away.
2138 in the relevant locale. The only argument to this function is a general
2140 argument is NULL, the system malloc() is used. The result can be passed
2155 The locale name "fr_FR" is used on Linux and other Unix-like systems;
2156 if you are using Windows, the name for the French locale is "french".
2158 The pointer that is passed (via the compile context) to pcre2_compile()
2159 is saved with the compiled pattern, and the same tables are used by the
2164 It is the caller's responsibility to ensure that the memory containing
2174 or whether the processor is 32-bit or 64-bit. A copy of the result of
2180 pcre2_dftables program, which is part of the PCRE2 build system, can be
2191 The first argument for pcre2_pattern_info() is a pointer to the com-
2193 is required, and the third argument is a pointer to a variable to re-
2194 ceive the data. If the third argument is NULL, the first argument is
2196 that is required for the information requested. Otherwise, the yield of
2197 the function is zero for success, or one of the following negative num-
2203 PCRE2_ERROR_UNSET the requested field is not set
2205 The "magic number" is placed at the start of each compiled pattern as a
2206 simple check against passing an arbitrary memory pointer. Here is a
2214 PCRE2_INFO_SIZE, /* what is required */
2233 For example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EX-
2234 TENDED option, the result for PCRE2_INFO_ALLOPTIONS is PCRE2_EXTENDED
2240 A pattern compiled without PCRE2_ANCHORED is automatically anchored by
2241 PCRE2 if the first significant item in every top-level branch is one of
2244 ^ unless PCRE2_MULTILINE is set
2249 When .* is the first significant item, anchoring is possible only when
2252 .* is not in an atomic group
2253 .* is not in a capture group that is the subject
2255 PCRE2_DOTALL is in force for .*
2257 PCRE2_NO_DOTSTAR_ANCHOR is not set
2259 For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
2269 a capture group is set in a conditional group such as (?(3)a|b) is also
2270 a backreference. Zero is returned if there are no backreferences.
2274 The output is a uint32_t integer whose value indicates what character
2282 where (?| is not used, this is also the total number of capture groups.
2288 the form (*LIMIT_DEPTH=nnnn) at the start, the value is returned. The
2292 is less than the limit set or defaulted by the caller of the match
2303 structed, a pointer to it is returned. Otherwise NULL is returned. The
2310 variable. If there is a fixed first value, for example, the letter "c"
2311 from a pattern such as (cat|cow|coyote), 1 is returned, and the value
2312 can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed
2313 first value, but it is known that a match can occur only at the start
2314 of the subject or following a newline in the subject, 2 is returned.
2315 Otherwise, and for anchored patterns, 0 is returned.
2322 library, the value is always less than 256. In the 16-bit library the
2330 backtracking positions when the pattern is processed by pcre2_match()
2345 variable. An explicit match is either a literal CR or LF character, or
2352 (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu-
2355 Note that this limit will only be used during matching if it is less
2360 Return 1 if the (?J) or (?-J) option setting is used in the pattern,
2373 Returns 1 if there is a rightmost literal code unit that must exist in
2375 point to a uint32_t variable. If there is no such value, 0 is returned.
2376 When 1 is returned, the code unit value itself can be retrieved using
2377 PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
2379 for the pattern /^a\d+z\d+/ the returned value is 1 (with "z" returned
2380 from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is
2394 tains recursive subroutine calls it is not always possible to determine
2401 (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third ar-
2404 SET. Note that this limit will only be used during matching if it is
2419 Note that this information is useful for multi-segment matching only if
2421 (?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is
2425 was at the start. PCRE2_INFO_MAXLOOKBEHIND is really only useful as a
2432 value is returned. Otherwise the returned value is 0. This value is not
2433 computed when PCRE2_NO_START_OPTIMIZE is set. The value is a number of
2436 value is a lower bound to the length of any matching string. There may
2438 string that does match is at least that long.
2448 strings by name. It is also possible to extract the data directly, by
2451 do the conversion, you need to use the name-to-number map, which is de-
2460 This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit li-
2466 The rest of the entry is the corresponding name, zero terminated.
2468 The names are in alphabetical order. If (?| is used to create multiple
2471 the same name, but there is only one entry in the table. Different
2475 ted, but only if PCRE2_DUPNAMES is set. They appear in the table in the
2477 this is the order of increasing number; when (?| is used this is not
2483 is set, so white space - including newlines - is ignored):
2489 each entry in the table is eight bytes long. The table is as follows,
2499 name-to-number map, remember that the length of the entries is likely
2504 The output is one of the following uint32_t values:
2521 code units of the compiled pattern itself. The value that is used when
2522 pcre2_compile() is getting memory in which to place the compiled pat-
2538 argument is a pointer to a compiled pattern, the second points to a
2539 callback function, and the third is arbitrary user data. The callback
2540 function is called for every callout in the pattern in the order in
2541 which they appear. Its first argument is a pointer to a callout enumer-
2542 ation block, and its second argument is the user_data value that was
2550 It is possible to save compiled patterns on disc or elsewhere, and re-
2556 of PCRE2 is really just a bytecode dump. The functions whose names be-
2573 Information about a successful or unsuccessful match is placed in a
2574 match data block, which is an opaque structure that is accessed by
2577 subject. This is known as the ovector.
2581 tions above. For pcre2_match_data_create(), the first argument is the
2584 When using pcre2_match(), one pair of offsets is required to identify
2594 A minimum of at least 1 pair is imposed by pcre2_match_data_create(),
2595 so it is always possible to return the overall matched string in the
2597 pcre2_dfa_match(). The maximum number of pairs is 65535; if the first
2598 argument of pcre2_match_data_create() is greater than this, 65535 is
2601 The second argument of pcre2_match_data_create() is a pointer to a gen-
2606 For pcre2_match_data_create_from_pattern(), the first argument is a
2607 pointer to a compiled pattern. The ovector is created to be exactly the
2610 with pcre2_dfa_match(). The second argument is again a pointer to a
2611 general context, but in this case if NULL is passed, the memory is ob-
2620 When a call of pcre2_match() fails, valid data is available in the
2621 match block only when the error is PCRE2_ERROR_NOMATCH, PCRE2_ER-
2623 actly what is available depends on the error, and is detailed below.
2625 When one of the matching functions is called, pointers to the compiled
2631 string, you have used the PCRE2_COPY_MATCHED_SUBJECT option, which is
2635 When a match data block itself is no longer needed, it should be freed
2636 by calling pcre2_match_data_free(). If this function is called with a
2649 in bytes, of the block that is its argument.
2651 When pcre2_match() runs interpretively (that is, without using JIT), it
2658 Heap memory is used for the frames vector; if the initial memory block
2659 turns out to be too small during matching, it is automatically ex-
2660 panded. When pcre2_match() returns, the memory is not freed, but re-
2662 matches that use the same block. It is automatically freed when the
2663 match data block itself is freed.
2669 that run in environments where memory is constrained can check this and
2680 The function pcre2_match() is called to match a subject string against
2681 a compiled pattern, which is passed in the code argument. You can call
2686 This function is the main matching facility of the library, and it op-
2687 erates in a Perl-like manner. For specialist use there is also an al-
2688 ternative matching function, which is described below in the section
2691 Here is an example of a simple call to pcre2_match():
2703 If the subject string is zero-terminated, the length can be given as
2710 The subject string is passed to pcre2_match() as a pointer in subject,
2712 and offset are in code units, not characters. That is, they are in
2715 cessing is enabled. As a special case, if subject is NULL and length is
2716 zero, the subject is assumed to be an empty string. If length is non-
2717 zero, an error occurs if subject is NULL.
2719 If startoffset is greater than the length of the subject, pcre2_match()
2720 returns PCRE2_ERROR_BADOFFSET. When the starting offset is zero, the
2721 search for a match starts at the beginning of the subject, and this is
2728 A non-zero starting offset is useful when searching for another match
2737 only if the current position in the subject is not a word boundary.)
2739 pcre2_match() finds the first occurrence. If pcre2_match() is called
2741 not match, because \B is always false at the start of the subject,
2742 which is deemed to be a word boundary. However, if pcre2_match() is
2744 the second occurrence of "iss" because it is able to look behind the
2745 starting point to discover that it is preceded by a letter.
2747 Finding all the matches in a subject is tricky when the pattern can
2748 match an empty string. It is possible to emulate Perl's /g behaviour by
2752 again. There is some code that demonstrates how to do this in the
2755 so, and the current character is CR followed by LF, advance the start-
2758 If a non-zero starting offset is passed when the pattern is anchored, a
2759 single attempt to match at the given offset is made. This can only suc-
2772 TIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below.
2774 Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not sup-
2775 ported by the just-in-time (JIT) compiler. If it is set, JIT matching
2776 is disabled and the interpretive code in pcre2_match() is run.
2777 PCRE2_DISABLE_RECURSELOOP_CHECK is ignored by JIT, but apart from
2791 By default, a pointer to the subject is remembered in the match data
2795 plications where the lifetime of the subject string is not guaranteed,
2796 it may be necessary to make a copy of the subject string, but it is
2797 wasteful to do this unless the match is successful. After a successful
2798 match, if PCRE2_COPY_MATCHED_SUBJECT is set, the subject is copied and
2799 the new pointer is remembered in the match data block instead of the
2801 match block itself is used. The copy is automatically freed when
2802 pcre2_match_data_free() is called to free the match data block. It is
2803 also automatically freed if the match data block is re-used for another
2808 This option is relevant only to pcre2_match() for interpretive match-
2809 ing. It is ignored when JIT is used, and is forbidden for
2815 if the limits are large. There is therefore a check at the start of
2816 each recursion. If the same group is still active from a previous
2817 call, and the current subject pointer is the same as it was at the
2819 ject has not changed, an error is generated.
2822 trigger this error. This option disables the check. It is provided
2827 If the PCRE2_ENDANCHORED option is set, any string that pcre2_match()
2833 This option specifies that the first character of the subject string is not
2841 This option specifies that the end of the subject string is not the end
2850 An empty string is not considered to be a valid match if this option is
2857 is applied to a string not beginning with "a" or "b", it matches an
2859 match is not valid, so pcre2_match() searches further into the string
2864 This is like PCRE2_NOTEMPTY, except that it locks out an empty string
2865 match only at the first matching position, that is, at the start of the
2867 subject is permitted. If the pattern is anchored, such a match can oc-
2873 pcre2_jit_compile(), JIT is automatically used when pcre2_match() is
2879 When PCRE2_UTF is set at compile time, the validity of the subject as a
2880 UTF string is checked unless PCRE2_NO_UTF_CHECK is passed to
2882 The latter special case is discussed in detail in the pcre2unicode doc-
2885 In the default case, if a non-zero starting offset is given, the check
2886 is applied only to that part of the subject that could be inspected
2887 during matching, and there is a check that the starting offset points
2895 The check is carried out before any other processing takes place, and a
2896 negative error code is returned if the check fails. There are several
2902 If you know that your subject is valid, and you want to skip this check
2909 PCRE2_NO_UTF_CHECK is set at match time the effect of passing an in-
2910 valid string as a subject, or an invalid value of startoffset, is unde-
2918 curs if the end of the subject string is reached successfully, but
2925 TIAL_HARD) is set, matching continues by testing any remaining alterna-
2926 tives. Only if no complete match can be found is PCRE2_ERROR_PARTIAL
2928 TIAL_SOFT specifies that the caller is prepared to handle a partial
2931 If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this
2932 case, if a partial match is found, pcre2_match() immediately returns
2934 other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid-
2937 There is a more detailed discussion of partial and multi-segment match-
2943 When PCRE2 is built, a default newline convention is set; this is usu-
2950 alter the way the match starting position is advanced after a match
2953 When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
2955 pattern fails when the current starting position is at a CRLF sequence,
2957 the match position is advanced by two characters instead of one, in
2960 The above rule is a compromise that makes the most common cases work as
2961 expected. For example, if the pattern is .+A (and the PCRE2_DOTALL op-
2962 tion is not set), it does not match the string "\r\nA" because, after
2968 An explicit match for CR or LF is either a literal appearance of one of
2975 is a valid newline sequence and explicit \r or \n escapes appear in the
2988 Friedl's book, this is called "capturing" in what follows, and the
2989 phrase "capture group" (Perl terminology) is used for a fragment of a
3000 strings. It is part of the match data block. The function
3005 Within the ovector, the first in each pair of values is set to the off-
3006 set of the first code unit of a substring, and the second is set to the
3008 ues are always code unit offsets, not character offsets. That is, they
3013 first pair of offsets (that is, ovector[0] and ovector[1]) are set.
3019 tern. The next pair is used for the first captured substring, and so
3020 on. The value returned by pcre2_match() is one more than the highest
3022 been captured, the returned value is 3. If there are no captured sub-
3023 strings, the return value from a successful match is 1, indicating that
3028 the match. For example, if the pattern (?=ab\K) is matched against
3031 If a capture group is matched repeatedly within a single match opera-
3032 tion, it is the last portion of the subject that it matched that is re-
3035 If the ovector is too small to hold all the captured substring offsets,
3036 as much as possible is filled in, and the function returns a value of
3038 called with a match data block whose ovector is of minimum length (that
3039 is, one pair).
3041 It is possible for capture group number n+1 to match some part of the
3043 string "abc" is matched against the pattern (a|(z))(bc) the return from
3044 the function is 4, and groups 1 and 3 are matched, but 2 is not. When
3050 is matched against the pattern (abc)(x(yz)?)? groups 2 and 3 are not
3051 matched. The return from the function is 2, because the highest used
3052 capture group number is 1. The offsets for the second and third capture
3053 groups (assuming the vector is large enough, of course) are set to
3057 in the pattern are never changed. That is, if a pattern contains n cap-
3071 is retained in the match data block and can be retrieved by the above
3073 times, the result is undefined.
3080 returns a pointer to the zero-terminated name, which is within the com-
3081 piled pattern. If no name is available, NULL is returned. The length of
3082 the name (excluding the terminating zero) is stored in the code unit
3086 After a successful match, the name that is returned is the last mark
3089 the matching path contains (*MARK:A)(*PRUNE), the name "A" is returned.
3090 After a "no match" or a partial match, the last encountered name is re-
3095 When it matches "bc", the returned name is A. The B mark is "seen" in
3096 the first branch of the group, but it is not on the matching path. On
3098 name is B.
3102 anchoring is removed from the pattern above, there is an initial check
3115 value is always the same as ovector[0] because \K does not affect the
3129 them. The codes are given names in the header file. If UTF checking is
3130 in force and an invalid UTF subject string is detected, one of a number
3131 of UTF-specific negative error codes is returned. Details are given in
3147 to catch the case when it is passed a junk pointer. This is the error
3148 that is returned when the magic number is not present.
3152 This error is given when a compiled pattern is passed to a function in
3154 piled by the 8-bit library is passed to a 16-bit or 32-bit library
3174 This error is never generated by pcre2_match() itself. It is provided
3194 This error is returned when a pattern that was successfully studied us-
3195 ing JIT is being matched, but the memory available for the just-in-time
3196 processing stack is not large enough. See the pcre2jit documentation
3205 Heap memory is used to remember backtracking points. This error is
3207 Note that a different error, PCRE2_ERROR_HEAPLIMIT, is given if the
3208 amount of memory needed exceeds the heap limit. PCRE2_ERROR_NOMEMORY is
3209 also returned if PCRE2_COPY_MATCHED_SUBJECT is set and memory alloca-
3218 This error is returned when pcre2_match() detects a recursion loop
3224 groups, cannot be detected until matching is attempted.
3234 sage(). The code is passed as the first argument, with the remaining
3236 units, into which the text message is placed. The message is returned
3237 in code units of the appropriate width for the library that is being
3240 The returned message is terminated with a trailing zero, and the func-
3242 zero. If the error number is unknown, the negative error code PCRE2_ER-
3243 ROR_BADDATA is returned. If the buffer is too small, the message is
3245 PCRE2_ERROR_NOMEMORY is returned. None of the messages are very long;
3246 a buffer size of 120 code units is ample.
3267 strings. A substring that contains a binary zero is correctly extracted
3268 and has a further zero added on the end, but the result is not, of
3274 match, only substring zero is available. An attempt to extract any
3280 the match. For example, if the pattern (?=ab\K) is matched against
3287 argument is a pointer to the match data block, the second is the group
3288 number, and the third is a pointer to a variable into which the length
3289 is placed. If you just want to know whether or not the substring has
3301 units. This is updated to contain the actual number of code units used
3307 terminating zero. When the substring is no longer needed, the memory
3310 The return value from all these functions is zero for success, or a
3312 code is returned. If a substring number greater than zero is used af-
3313 ter a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible
3323 There is no substring with that number in the pattern, that is, the
3324 number is greater than the number of capturing parentheses.
3329 the pattern, is greater than the number of slots in the ovector, so the
3335 pattern is (abc)|(def) and the subject is "def", and the ovector con-
3336 tains at least two capturing slots, substring number 1 is unset.
3349 cluding a terminating zero that is added to each of them. All this is
3350 done in a single block of memory that is obtained using the same memory
3354 after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.
3356 The address of the memory block is returned via listptr, which is also
3357 the start of the list of string pointers. The end of the list is marked
3358 by a NULL pointer. The address of the list of lengths is returned via
3362 function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem-
3363 ory block could not be obtained. When the list is no longer needed, it
3366 If this function encounters a substring that is unset, which can happen
3395 the number of the capture group called "xxx" is 2. If the name is known
3398 ment is the compiled pattern, and the second is the name. The yield of
3399 the function is the group number, PCRE2_ERROR_NOSUBSTRING if there is
3400 no group with that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if there is
3407 gument is a name instead of a number. If PCRE2_DUPNAMES is set and
3410 group that is set.
3412 If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
3414 than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is re-
3415 turned. If there is at least one group with a slot in the ovector, but
3416 no group is found to be set, PCRE2_ERROR_UNSET is returned.
3438 with the replacement string, whose length is supplied in rlength, which
3440 a special case, if replacement is NULL and rlength is zero, the re-
3441 placement is assumed to be an empty string. If rlength is non-zero, an
3442 error occurs if replacement is NULL.
3444 There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to re-
3445 turn just the replacement string(s). The default action is to perform
3446 just one replacement if the pattern matches, but there is an option
3451 that were carried out. This may be zero if no match was found, and is
3452 never greater than one unless PCRE2_SUBSTITUTE_GLOBAL is set. A nega-
3453 tive value is returned if an error is detected.
3464 block is obtained and freed within this function, using memory manage-
3468 If match_data is not NULL and PCRE2_SUBSTITUTE_MATCHED is not set, the
3469 provided block is used for all calls to pcre2_match(), and its contents
3476 such option is PCRE2_SUBSTITUTE_MATCHED. When this is set, an external
3480 tor) is then used for the first substitution instead of calling
3486 changed when PCRE2_SUBSTITUTE_MATCHED is set. If PCRE2_SUBSTI-
3487 TUTE_GLOBAL is also set, pcre2_match() is called after the first sub-
3488 stitution to check for further matches, but this is done using an in-
3492 The code argument is not used for matching before the first substitu-
3493 tion when PCRE2_SUBSTITUTE_MATCHED is set, but it must be provided,
3494 even when PCRE2_SUBSTITUTE_GLOBAL is not set, because it contains in-
3498 The default action of pcre2_substitute() is to return a copy of the
3500 STITUTE_REPLACEMENT_ONLY is set, only the replacement substrings are
3507 the function is successful, the value is updated to contain the length
3508 in code units of the new string, excluding the trailing zero that is
3511 If the function is not successful, the value set via outlengthptr de-
3513 string, the value is the offset in the replacement string where the er-
3514 ror was detected. For other errors, the value is PCRE2_UNSET by de-
3516 less PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set.
3519 buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
3520 ORY immediately. If this option is set, however, pcre2_substitute()
3523 buffer that is needed. This value is passed back via the outlengthptr
3527 Passing a buffer size of zero is a permitted way of finding out how
3528 much memory is needed for given substitution. However, this does mean
3529 that the entire operation is carried out twice. Depending on the appli-
3534 The replacement string, which is interpreted as a UTF string in UTF
3535 mode, is checked for UTF validity unless PCRE2_NO_UTF_CHECK is set. An
3539 If PCRE2_SUBSTITUTE_LITERAL is set, the replacement string is not in-
3540 terpreted in any way. By default, however, a dollar character is an es-
3543 tern. Dollar is the only escape character (backslash is treated as lit-
3553 the entire matched string. For example, if the pattern a(b)c is
3555 is "=+babcb+=".
3560 (*MARK:A)(*PRUNE) the name inserted is "A", but for (*MARK:A)(*PRUNE:B)
3561 the relevant name is "B". This facility can be used to perform simple
3569 string, replacing every matching substring. If this option is not set,
3570 only the first matching substring is replaced. The search for matches
3571 takes place in the original subject string (that is, previous replace-
3572 ments do not affect it). Iteration is implemented by advancing the
3573 startoffset value for each search, which is always passed the entire
3574 subject string. If an offset limit is set in the match context, search-
3575 ing stops when that limit is reached.
3579 set limit. Here is a pcre2test example:
3587 set is performed. If this is not successful, the offset is advanced by
3588 one character except when CRLF is a valid newline sequence and the next
3589 two characters are CR, LF. In this case, the offset is advanced by two
3598 known groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated
3599 as empty strings when inserted as described above. If this option is
3605 replacement string. Without this option, only the dollar character is
3607 When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
3609 Firstly, backslash in a replacement string is interpreted as an escape
3621 it is a letter) to upper or lower case, respectively, and then the
3629 ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
3633 The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
3634 flexibility to capture group substitution. The syntax is similar to
3641 fies a default value. If group <n> is set, its value is inserted; if
3642 not, <string> is expanded and the result inserted. The second form
3643 specifies strings that are expanded and inserted when group <n> is set
3644 or unset, respectively. The first form is just a convenient shorthand
3664 If PCRE2_SUBSTITUTE_LITERAL is set, PCRE2_SUBSTITUTE_UNKNOWN_UNSET,
3671 code. Except for PCRE2_ERROR_NOMATCH (which is never returned), errors
3674 PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
3675 tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
3677 PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
3678 ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set)
3679 when the simple (non-extended) syntax is used and PCRE2_SUBSTITUTE_UN-
3680 SET_EMPTY is not set.
3682 PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big
3683 enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size
3684 of buffer that is needed is returned via outlengthptr. Note that this
3687 PCRE2_ERROR_NULL is returned if PCRE2_SUBSTITUTE_MATCHED is set but the
3688 match_data argument is NULL or if the subject or replacement arguments
3689 are NULL. For backward compatibility reasons an exception is made for
3690 the replacement argument if the rlength argument is also 0.
3692 PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in
3698 than the current position in the subject, which can happen if \K is
3712 callout function for pcre2_substitute(). This information is passed in
3713 a match context. The callout function is called after each substitution
3715 callout function is not called for simulated substitutions that happen
3718 The first argument of the callout function is a pointer to a substitute
3731 current version is 0. The version number will increase in future if
3732 more fields are added, but the intention is never to remove any of the
3735 The subscount field is the number of the current match. It is 1 for the
3741 that are set in the ovector, and is always greater than zero.
3747 The second argument of the callout function is the value passed as
3749 the callout function is interpreted as follows:
3751 If the value is zero, the replacement is accepted, and, if PCRE2_SUB-
3752 STITUTE_GLOBAL is set, processing continues with a search for the next
3753 match. If the value is not zero, the current replacement is not ac-
3754 cepted. If the value is greater than zero, processing continues when
3755 PCRE2_SUBSTITUTE_GLOBAL is set. Otherwise (the value is less than zero
3756 or PCRE2_SUBSTITUTE_GLOBAL is not set), the rest of the input is copied
3766 When a pattern is compiled with the PCRE2_DUPNAMES option, names for
3774 An example is shown in the pcre2pattern documentation.
3778 to the given name that is set. Only if none are set is PCRE2_ERROR_UN-
3779 SET is returned. The pcre2_substring_number_from_name() function re-
3785 first argument is the compiled pattern, and the second is the name. If
3793 units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
3796 The format of the name table is described above in the section entitled
3810 is described in the pcre2callout documentation.
3812 What you have to do is to insert a callout right at the end of the pat-
3813 tern. When your callout function is called, extract and save the cur-
3827 The function pcre2_dfa_match() is called to match a subject string
3831 different characteristics to the normal algorithm, and is not compati-
3840 is used in a different way, and this is described below. The other com-
3842 description is not repeated here.
3845 workspace vector should contain at least 20 elements. It is used for
3847 space is needed for patterns and subjects where there are a lot of po-
3850 Here is an example of a simple call to pcre2_dfa_match():
3873 as for pcre2_match(), so their description is not repeated here.
3879 the details are slightly different. When PCRE2_PARTIAL_HARD is set for
3881 subject is reached and there is still at least one matching possibility
3883 matches have already been found. When PCRE2_PARTIAL_SOFT is set, the
3884 return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
3885 if the end of the subject is reached, there have been no complete
3886 matches, but there is still at least one matching possibility. The por-
3888 was found is set as the first matching string in both cases. There is a
3896 tive algorithm works, this is necessarily the shortest possible match
3901 When pcre2_dfa_match() returns a partial match, it is possible to call
3904 it is set, the workspace and wscount options must reference the same
3905 vector as before because data about the match so far is left in them
3906 after a partial match. There is more discussion of this facility in the
3919 is matched against the string
3921 This is <something> <something else> <something further> no more
3929 On success, the yield of the function is a number greater than zero,
3930 which is the number of matched substrings. The offsets of the sub-
3942 length; that is, the longest matching string is first. If there were
3943 too many matches to fit into the ovector, the yield of the function is
3944 zero, and the vector is filled with the longest matches.
3948 example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA
3949 matching, this means that only one possible match is found. If you re-
3963 This return is given if pcre2_dfa_match() encounters an item in the
3969 This return is given if pcre2_dfa_match() encounters a condition item
3975 This return is given if pcre2_dfa_match() is called for a pattern that
3976 was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for
3981 This return is given if pcre2_dfa_match() runs out of space in the
3986 When a recursion or subroutine call is processed, the matching function
3988 workspace. This error is given if the internal ovector is not large
3989 enough. This should be extremely rare, as a vector of size 1000 is
3994 When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option,
3997 these checks fail, this error is given.
4033 PCRE2 is distributed with a configure script that can be used to build
4037 formation about building with Autotools (some of which is repeated be-
4040 There is a lot more information about building PCRE2 without using Au-
4050 can be selected when the library is compiled. It assumes use of the
4071 it is not described. Options that specify values have names that start
4073 tion is output.
4078 By default, a library called libpcre2-8 is built, containing functions
4095 the POSIX wrapper is for the 8-bit library only, and that pcre2grep is
4125 By default, PCRE2 is built with support for Unicode and UTF character
4131 It is not possible to build one library with Unicode support and an-
4159 tion when calling pcre2_compile(). There is also a build-time option
4168 Just-in-time (JIT) compiler support is included in the build by speci-
4173 This support is available only for certain hardware architectures. If
4174 this option is set for an unsupported architecture, a building error
4179 which enables JIT only if the current hardware is supported. You can
4180 check if JIT is enabled in the configuration summary that is output at
4186 which enables the use of an execmem allocator in JIT that is compatible
4187 with SELinux. This has no effect if JIT is not enabled. See the
4189 is enabled, pcre2grep automatically makes use of it, unless you add
4199 the end of a line. This is the normal newline character on Unix-like
4203 --enable-newline-is-cr
4205 to the configure command. There is also an --enable-newline-is-lf op-
4212 --enable-newline-is-crlf
4214 to the configure command. There is a fourth option, specified by
4216 --enable-newline-is-anycrlf
4221 --enable-newline-is-any
4227 U+2029). The final option is
4229 --enable-newline-is-nul
4234 Whatever default line ending convention is selected when PCRE2 is built
4236 it is recommended to use the standard for your operating system.
4247 the default is changed so that \R matches only CR, LF, or CRLF. What-
4248 ever is selected when PCRE2 is built can be overridden by applications
4258 for a compiled pattern of around 64 thousand code units. This is suffi-
4260 people do want to process truly enormous patterns, so it is possible to
4267 16-bit library, a value of 3 is rounded up to 4. In these libraries,
4270 value is always 4 and cannot be overridden; the value of --with-link-
4271 size is ignored.
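The link-size rules in these hits can be summarised in a few lines of C. This is only a sketch: effective_link_size() is a hypothetical helper, not part of the PCRE2 API, and it assumes that the 8-bit library accepts the values 2, 3, and 4 unchanged (the hits above state only the 16-bit rounding and the fixed 32-bit value).

```c
/* Hypothetical helper, not a PCRE2 function: returns the internal link
   size (in bytes) that a build ends up with, given the code unit width
   (8, 16 or 32) and the value passed to --with-link-size (2, 3 or 4).
   Returns 0 when the inputs are outside those ranges. */
int effective_link_size(int code_unit_width, int requested)
{
    /* 32-bit library: always 4; the requested value is ignored. */
    if (code_unit_width == 32)
        return 4;

    /* Only 2, 3 and 4 are meaningful requests for the other widths. */
    if (requested < 2 || requested > 4)
        return 0;

    /* 16-bit library: a request of 3 is rounded up to 4. */
    if (code_unit_width == 16)
        return requested == 3 ? 4 : requested;

    /* 8-bit library: assumed to use the requested value as given. */
    if (code_unit_width == 8)
        return requested;

    return 0;
}
```

For instance, effective_link_size(16, 3) yields 4, while effective_link_size(32, 2) yields 4 regardless of the request.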
4280 The default is 10 million, but this can be changed by adding a setting
4287 counting is done differently).
4290 points. The more nested backtracking points there are (that is, the
4291 deeper the search tree), the more memory is needed. There is an upper
4294 default limit (in effect unlimited) is 20 million. You can change this
4303 arrangements) is used.
4306 pcre2_match() interpreter. This limit defaults to the value that is set
4313 This depth limit indirectly limits the amount of heap memory that is
4316 is used before the limit is reached varies from pattern to pattern.
4329 able number of characters are supported only if there is a maximum
4330 matching length for each top-level branch. There is a limit to this
4345 less than 256. By default, PCRE2 is built with a set of tables that are
4352 Instead, a program called pcre2_dftables is compiled and run. This out-
4385 character code is ASCII or Unicode, which is a superset of ASCII. This
4386 is the case for most computer operating systems. PCRE2 can, however, be
4395 It is not possible to support both EBCDIC and UTF-8 codes in the same
4399 The EBCDIC character that corresponds to an ASCII LF is assumed to have
4401 is used. In such an environment you should use
4407 0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
4408 acter (which, in Unicode, is 0x85).
4410 The options that select newline behaviour, such as --enable-newline-is-
4418 within the patterns it is matching. There are two kinds: one that gen-
4420 gram or script. If --disable-pcre2grep-callout-fork is added to the
4421 configure command, only the first kind of callout is supported; if
4422 --disable-pcre2grep-callout is used, all callouts are completely ig-
4443 pcre2grep uses an internal buffer to hold a "window" on the file it is
4445 it finds a match. The default starting size of the buffer is 20KiB. The
4446 buffer itself is three times this size, but because of the way it is
4447 used for holding "before" lines, the longest line that is guaranteed to
4448 be processable is the notional buffer size. If a longer line is encoun-
4450 maximum size, whose default is 1MiB or the starting size, whichever is
4469 to the configure command, pcre2test is linked with the libreadline or
4470 libedit library, respectively, and when its input is from a terminal,
4472 and history facilities. Note that libreadline is GPL-licensed, so if
4479 system-installed readline library this is sufficient. However, in some
4480 environments (e.g. if an unmodified distribution version of readline is
4489 is automatically included, you may need to add something like
4502 to the configure command, additional debugging code is included in the
4503 build. This feature is intended for use by the PCRE2 maintainers.
4514 valid memory accesses, and is mostly useful for debugging PCRE2 itself.
4519 If your C compiler is gcc, you can build a version of PCRE2 that can
4527 Note that using ccache (a caching C compiler) is incompatible with code
4533 before running make to build PCRE2, so that ccache is not used.
4535 When --enable-coverage is used, the following additional targets are
4540 This creates a fresh coverage report for the PCRE2 test suite. It is
4578 __STDC_VERSION__ is defined and has a value greater than or equal to
4579 199901L (indicating support for C99). However, there is at least one
4585 is specified, no use is made of the z or t modifiers. Instead of %td or
4586 %zu, a suitable format is used depending on the size of long for the
4592 There is a special option for use by people who want to run fuzzing
4602 pattern, and if that succeeds, to match it. This is done both with no
4607 zcheck to be created. This is normally run under valgrind or used when
4608 PCRE2 is compiled with address sanitizing enabled. It calls the fuzzing
4609 function and outputs information about what it is doing. The input
4611 rest of it is a literal input string. Otherwise, it is assumed to be a
4624 has changed (the stack is no longer used) and this option now does
4671 PCRE2 provides a feature called "callout", which is a means of tem-
4678 ture is available. This does a callout after each change to the subject
4679 string and is described in the pcre2api documentation; the rest of this
4680 document is concerned with callouts during pattern matching.
4683 external function is to be called. Different callout points can be
4685 default value is zero. Alternatively, the argument may be a delimited
4687 ending delimiter is the same as the start, except for {, where the end-
4688 ing delimiter is }. If the ending delimiter is needed within the
4694 If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
4697 callout. For example, if PCRE2_AUTO_CALLOUT is used with the pattern
4701 it is processed as if it were
4705 Here is a more complicated example:
4709 With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
4713 Notice that there is a callout before and after each parenthesis and
4715 dition is an assertion, an automatic callout is inserted immediately
4727 pcre2test indicates how the pattern is being matched. This is useful
4741 that what follows cannot be part of the repeat. For example, a+[bc] is
4743 is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
4744 to the string "aaaa" is:
4751 This indicates that when matching [bc] fails, there is no backtracking
4752 into a+ (because it is being treated as a++) and therefore the callouts
4771 By default, an optimization is applied when .* is the first significant
4772 item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
4773 any character, the pattern is automatically anchored. If PCRE2_DOTALL
4774 is not set, a match can start only after an internal newline or at the
4779 This optimization is disabled, however, if .* is in an atomic group or
4780 if there is a backreference to the capture group in which it appears.
4781 It is also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
4784 For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
4785 and applied to the string "aa", the pcre2test output is:
4795 ject. In other words, the pattern is anchored. You can disable this op-
4812 there is no subsequent attempt to match with an empty subject.
4817 callouts. For example, if the pattern is
4822 the subject string is "abyz", the lack of "d" means that matching
4823 doesn't ever start, and the callout is never reached. However, with
4824 "abyd", though the result is still no match, the callout is obeyed.
4828 running a match if the subject is not long enough, or, for unanchored
4840 function is provided in the match context, it is called. This applies
4842 out function is a pointer to a pcre2_callout block. The second argument
4843 is the void * callout data that was supplied when the callout was set
4866 current version is 2; the three callout string fields were added for
4871 but the intention is never to remove any of the existing fields.
4875 For a numerical callout, callout_string is NULL, and callout_number
4876 contains the number of the callout, in the range 0-255. This is the
4877 number that follows (?C for callouts that are part of the pattern; it is
4882 For callouts with string arguments, callout_number is always zero, and
4883 callout_string points to the string that is contained within the com-
4884 piled pattern. Its length is given by callout_string_length. Duplicated
4886 been turned into single characters, but there is no other processing of
4888 zero is present after the string, but is not included in the length.
4889 The delimiter that was used to start the string is also stored within
4893 The callout_string_offset field is the code unit offset to the start of
4894 the callout argument string within the original pattern string. This is
4903 The offset_vector field is a pointer to a vector of capturing offsets
4907 For calls to pcre2_match(), the offset_vector field is not (since re-
4912 call within a pattern completes, the capturing state is reset to what
4918 strings have yet been captured, the value of capture_last is 0 and the
4919 value of capture_top is 1. The values of these fields do not always
4921 ((a)(b))(?C2) is taken, capture_last is 1 but capture_top is 4.
4927 the match is by definition not complete. Substrings that have not been
4936 support substring capturing. The value of capture_top is always 1 and
4937 the value of capture_last is always 0 for DFA matching.
4944 quence \K has been encountered, this value is changed to reflect the
4945 modified starting point. If the pattern is not anchored, the callout
4956 processed in the pattern string. When the callout is at the end of the
4957 pattern, the length is zero. When the callout precedes an opening
4960 the length is 3. For an alternation bar or a closing parenthesis, the
4961 length is one, unless a closing parenthesis is followed by a quanti-
4962 fier, in which case its length is included. (This changed in release
4980 The callout_flags field is always zero in callouts from
4981 pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
4982 JIT is used, the following bits may be set:
4986 This is set for the first callout after the start of matching for each
4991 This is set if there has been a matching backtrack since the previous
4992 callout, or since the start of matching if this is the first callout
4997 cate the presence of these bits unless the callout_extra modifier is
5000 The information in the callout_flags field is provided so that applica-
5001 tions can track and tell their users how matching with backtracking is
5003 understand how PCRE2 works. There is no support in pcre2_dfa_match()
5004 because there is no backtracking in DFA matching, and there is no sup-
5005 port in JIT because JIT is all about maximizing matching performance.
5006 In both these cases the callout_flags field is always zero.
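As a rough illustration of how a callout function might interpret the callout_flags field, the sketch below decodes the two bits into a label. It is not taken from the PCRE2 sources: describe_callout_flags() is a hypothetical helper, and the DEMO_ constants are stand-ins assumed to mirror PCRE2_CALLOUT_STARTMATCH and PCRE2_CALLOUT_BACKTRACK from pcre2.h, which real code would include and use directly.

```c
#include <stdint.h>

/* Stand-in bit values, assumed to match PCRE2_CALLOUT_STARTMATCH and
   PCRE2_CALLOUT_BACKTRACK in pcre2.h. Real code would include pcre2.h
   and use those names instead. */
#define DEMO_CALLOUT_STARTMATCH 0x00000001u
#define DEMO_CALLOUT_BACKTRACK  0x00000002u

/* Hypothetical helper: maps a callout_flags value to a short label.
   A value of zero is what callouts from pcre2_dfa_match() or JIT
   always deliver. */
const char *describe_callout_flags(uint32_t flags)
{
    int start = (flags & DEMO_CALLOUT_STARTMATCH) != 0;
    int back  = (flags & DEMO_CALLOUT_BACKTRACK) != 0;

    if (start && back) return "new start position, after backtracking";
    if (start)         return "new start position";
    if (back)          return "after backtracking";
    return "no flags set";
}
```

A callout function could pass the callout_flags member of its pcre2_callout block to such a helper when logging how a match progresses.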
5012 is zero, matching proceeds as normal. If the value is greater than
5015 failed. If the value is less than zero, the match is abandoned, and the
5020 "no match" failure. The error number PCRE2_ERROR_CALLOUT is reserved
5033 argument is a pointer to a compiled pattern, the second points to a
5034 callback function, and the third is arbitrary user data. The callback
5035 function is called for every callout in the pattern in the order in
5036 which they appear. Its first argument is a pointer to a callout enumer-
5037 ation block, and its second argument is the user_data value that was
5047 callout_string Points to callout string or is NULL
5049 The version number is currently 0. It will increase if new fields are
5051 namesakes in the pcre2_callout block that is used for callouts during
5054 Note that the value of pattern_position is unique for each callout.
5055 However, if a callout occurs inside a group that is quantified with a
5056 non-zero minimum or a fixed maximum, the group is replicated inside the
5057 compiled pattern. For example, a pattern such as /(a){2}/ is compiled
5063 zero value, scanning the pattern stops, and that value is returned from
5100 1. When PCRE2_DOTALL (equivalent to Perl's /s qualifier) is not set,
5102 matches the next character unless it is the start of a newline se-
5103 quence. This means that, if the newline setting is CR, CRLF, or NUL,
5106 never to match LF, even when 0x0A is not a newline indicator.
5114 serts that the next character is not "a" three times (in principle;
5120 4. If a braced quantifier such as {1,2} appears where there is nothing
5126 negative assertion is a condition that has a matching branch (that is,
5127 the condition is false). Perl may set such capture groups in other
5136 PCRE2, an error is generated by default. However, if either of the
5137 PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U and \u are
5141 is built with Unicode support (the default). The properties that can be
5146 erty, but in PCRE2 its use is limited. See the pcre2pattern documenta-
5148 ports (such as \p{Letter}) are not supported by PCRE2, nor is it per-
5149 mitted to prefix any of these properties with "Is".
5152 in between are treated as literals. However, this is slightly different
5169 The \Q...\E sequence is recognized both inside and outside character
5179 and backtracking into subroutine calls is now supported, as in Perl.
5182 group that is called as a subroutine (whether or not recursively),
5183 their effect is confined to that group; it does not extend to the sur-
5184 rounding pattern. This is not always the case in Perl. In particular,
5185 if (*THEN) is present in a group that is called as a subroutine, its
5186 action is limited to that group, even if the group does not contain any
5191 first one that is backtracked onto acts. For example, in the pattern
5193 in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
5194 it is the same as PCRE2, but there are cases where it differs.
5197 captured strings when part of a pattern is repeated. For example,
5199 set, but in PCRE2 it is set to "b".
5201 14. PCRE2's handling of duplicate capture group numbers and names is
5202 not as general as Perl's. This is a consequence of the fact that PCRE2
5206 but different names, is not supported, and causes an error at compile
5209 avoid this confusing situation, an error is given at compile time.
5213 modifier is set, Perl allowed white space between ( and ? though the
5223 not affected when case-independent matching is specified. For example,
5226 \p{Ll} match all letters, regardless of case, when case independence is
5231 there is an option for re-enabling the previous behaviour. When this
5232 option is set, \K is acted on when it occurs in positive assertions,
5233 but is ignored in negative assertions.
5238 PCRE2 for some time before. This list is with respect to Perl 5.38:
5240 (a) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the
5243 (b) A backslash followed by a letter with no special meaning is
5246 (c) If PCRE2_UNGREEDY is set, the greediness of the repetition quanti-
5247 fiers is inverted, that is, by default they are not greedy, but if fol-
5259 (g) The callout facility is PCRE2-specific. Perl supports codeblocks
5262 (h) The partial matching facility is PCRE2-specific.
5265 different way and is not Perl-compatible.
5271 (k) PCRE2 supports non-atomic positive lookaround assertions. This is
5284 ing the intermediate matches on the heap, which is ~10% slower but does
5297 ple is /(?:|(?0)abcd)(?(R)|\z)/, which matches a sequence of any number
5328 Just-in-time compiling is a heavyweight optimization that can greatly
5330 cessing before the match is performed, so it is of most benefit when
5331 the same pattern is going to be matched many times. This does not nec-
5332 essarily mean many calls of a matching function; if the pattern is not
5335 string is very long, it may still pay to use JIT even for one-off
5336 matches. JIT support is available for all of the 8-bit, 16-bit and
5340 function. It does not apply when the DFA matching function is being
5346 JIT support is an optional feature of PCRE2. The "configure" option
5347 --enable-jit (or equivalent CMake option) must be set when PCRE2 is
5348 built if you want to use JIT. The support is limited to the following
5360 If --enable-jit is set on an unsupported platform, compilation fails.
5362 A client program can tell if JIT support is available by calling
5363 pcre2_config() with the PCRE2_CONFIG_JIT option. The result is one if
5366 particular match. One reason for this is that there are a number of op-
5368 other reason is that in some environments JIT is unable to get memory
5370 fig() is that if it returns zero, JIT will definitely not be used.
5373 JIT when possible. The API is implemented in a way that falls back to
5374 the interpretive code if JIT is not available or cannot be used for a
5376 there is a "fast path" API that is JIT-specific.
5382 is to call pcre2_jit_compile() after successfully compiling a pattern
5383 with pcre2_compile(). This function has two arguments: the first is the
5385 second is zero or more of the following option bits: PCRE2_JIT_COM-
5388 If JIT support is not available, a call to pcre2_jit_compile() does
5390 pattern is passed to the JIT compiler, which turns it into machine code
5393 is zero on success, or a negative error code.
5395 There is a limit to the size of pattern that JIT supports, imposed by
5398 timizations are introduced. If a pattern is too big, a call to
5407 pcre2_match() is called, the appropriate code is run if it is avail-
5408 able. Otherwise, the pattern is matched using interpretive code.
5416 ing. If pcre2_jit_compile() is called with no option bits set, it imme-
5417 diately returns zero. This is an alternative way of testing whether JIT
5418 is available.
5420 At present, it is not possible to free JIT compiled code except when
5421 the entire compiled pattern is freed by calling pcre2_code_free().
5434 stack. Such a callback function is called whenever JIT code is about to
5436 the callback function is not obeyed.
5438 If the JIT compiler finds an unsupported item, no JIT data is gener-
5442 result of 0 means that JIT support is not available, or the pattern was
5451 When a pattern is compiled with the PCRE2_UTF option, subject strings
5453 fault, this is checked at the start of matching and an error is gener-
5454 ated if invalid UTF is detected. The PCRE2_NO_UTF_CHECK option can be
5456 you are sure that a subject string is valid. If this option is used
5457 with an invalid string, the result is undefined. The calling program
5461 UTF sequences is available. Calling pcre2_compile() with the
5464 pile() is subsequently called, the compiled JIT code also supports in-
5466 interpretive cases, is given in the pcre2unicode documentation.
5468 There is also an obsolete option for pcre2_jit_compile() called
5470 ibility. It is superseded by the pcre2_compile() option
5483 If the PCRE2_NO_JIT option is passed to pcre2_match() it disables the
5493 When a pattern is matched using JIT, the return values are the same as
5499 The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
5500 searching a very large pattern tree goes on for too long, as it is in
5501 the same circumstance when JIT is not used, but the details of exactly
5502 what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
5503 is never returned when JIT matching is used.
5511 ROR_JIT_STACKLIMIT is given when there is not enough stack. Three func-
5513 There is further discussion about the use of JIT stacks in the section
5520 NULL if there is an error. The pcre2_jit_stack_free() function is used
5521 to free a stack that is no longer needed. If its argument is NULL, this
5523 cally minded: the address space is allocated by mmap or VirtualAlloc.)
5534 The first argument is a pointer to a match context. When this is subse-
5536 JIT stack is used. If this argument is NULL, the function returns imme-
5540 (1) If callback is NULL and data is NULL, an internal 32KiB block
5541 on the machine stack is used. This is the default when a match
5542 context is created.
5544 (2) If callback is NULL and data is not NULL, data must be
5548 (3) If callback is not NULL, it must point to a function that is
5551 function is NULL, the internal 32KiB stack is used; otherwise the
5555 A callback function is obeyed whenever JIT code is about to be run; it
5556 is not obeyed when pcre2_match() is called with options that are incom-
5564 up non-sequential matches in one thread is to use callouts: if a call-
5569 you assign or pass back NULL from a callback, that is thread-safe, be-
5572 thread so that the application is thread-safe.
5574 Strictly speaking, even more is allowed. You can assign the same non-
5575 NULL stack to a match context that is used by any number of patterns,
5579 is available for use. However, this is an inefficient solution, and not
5582 This is a suggestion for how a multithreaded program that needs to set
5594 All the functions described in this section do nothing if JIT is not
5602 PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
5603 where the local data of the current node is pushed before checking its
5604 child nodes. Allocating real machine stack on some platforms is diffi-
5606 extend the stack on PowerPC. Although it is possible, its updating
5614 memory data (this is important because of pointers). Thus we can allo-
5616 4KiB) if that is enough. However, we can still grow up to 1MiB anytime
5621 The owner of the stack is the user program, not the JIT studied pattern
5622 or anything else. The user program must ensure that if a stack is being
5623 used by pcre2_match(), (that is, it is assigned to a match context that
5624 is passed to the pattern currently running), that stack must not be
5626 The best practice for multithreaded programs is to allocate a stack for
5633 a pointer is set. There is no reference counting or any other magic.
5638 also replace the stack in a context at any time when it is not in use.
5644 No, because this is too costly in terms of resources. However, you
5645 could implement some clever idea which releases the stack if it is not
5649 (6) OK, the stack is for long term memory allocation. But what happens
5650 if a pattern causes stack overflow with a stack of 1MiB? Is that 1MiB
5651 kept until the stack is freed?
5654 ory sometimes without freeing the stack. There is no API for this at
5659 (7) This is too much of a headache. Isn't there any better solution for
5670 The JIT executable allocator does not free all memory when it is possi-
5674 calling pcre2_jit_free_unused_memory(). Its argument is a general con-
5681 This is a single-threaded example that specifies a JIT stack without
5710 JIT is not available, it is convenient for programs that are written
5713 for use where JIT is known to be available, and which need the best
5718 The fast path function is called pcre2_jit_match(), and it takes ex-
5720 must be specified with a length; PCRE2_ZERO_TERMINATED is not sup-
5722 PCRE2_ENDANCHORED) are ignored, as is the PCRE2_NO_JIT option. The re-
5724 ROR_JIT_BADOPTION if a matching mode (partial or complete) is requested
5729 ple, if the subject pointer is NULL but the length is non-zero, an im-
5730 mediate error is given. Also, unless PCRE2_NO_UTF_CHECK is set, a UTF
5731 subject string is tested for validity. In the interests of speed, these
5732 checks do not happen on the JIT fast path. If invalid UTF data is
5734 the result is undefined. The program may crash or loop or give wrong
5736 pcre2_jit_match() in UTF mode only if you are sure the subject is
5775 There are some size limitations in PCRE2 but it is hoped that they will
5778 The maximum size of a compiled pattern is approximately 64 thousand
5779 code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with
5780 the default internal linkage size, which is 2 bytes for these li-
5783 (when building the 16-bit library, 3 is rounded up to 4). See the
5785 for details. In these cases the limit is substantially larger. How-
5786 ever, the speed of execution is slower. In the 32-bit library, the in-
5787 ternal linkage size is always 4.
5789 The maximum length of a source pattern string is essentially unlimited;
5790 it is the largest number a PCRE2_SIZE variable can hold. However, the
5793 The maximum length (in code units) of a subject string is one less than
5794 the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an un-
5796 is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-termi-
5803 of characters, the maximum length of any branch is 65535 characters. If
5805 matching length for every branch is limited. The default limit is set
5809 There is no limit to the number of parenthesized groups, but there can
5810 be no more than 65535 capture groups, and there is a limit to the depth
5811 of nesting of parenthesized subpatterns of all kinds. This is imposed
5813 default limit can be specified when PCRE2 is built; if not, the default
5814 is set to 250. An application can change this limit by calling
5817 The maximum length of name for a named capture group is 32 code units,
5818 and the maximum number of such groups is 10000.
5821 (*THEN) verb is 255 code units for the 8-bit library and 65535 code
5824 The maximum length of a string argument to a callout is the largest
5827 The maximum amount of heap memory used for matching is controlled by
5829 The default is a very large number, effectively unlimited.
5861 subject string. The "standard" algorithm is the one provided by the
5864 time (JIT) optimization that is described in the pcre2jit documentation
5865 is compatible with this function.
5867 An alternative algorithm is provided by the pcre2_dfa_match() function;
5868 it operates in a different way, and is not Perl-compatible. This alter-
5872 When there is only one possible way in which a given subject string can
5879 is matched against the string
5891 makes the tree of infinite size, but it is still a tree. Matching the
5901 sions", the standard algorithm is an "NFA algorithm". It conducts a
5902 depth-first search of the pattern tree. That is, it proceeds along a
5903 single path through the tree, checking that the subject matches what is
5904 required. When there is a mismatch, the algorithm tries any alterna-
5909 branches are tried is controlled by the greedy or ungreedy nature of
5912 If a leaf node is reached, a matching string has been found, and at
5913 that point the algorithm stops. Thus, if there is more than one possi-
5915 this is the shortest, the longest, or some intermediate length depends
5919 Because it ends up with a single path through the tree, it is rela-
5931 matches. In Friedl's terminology, this is a kind of "DFA algorithm",
5932 though it is not implemented as a traditional finite state machine (it
5935 Although the general principle of this matching algorithm is that it
5936 scans the subject string only once, without backtracking, there is one
5937 exception: when a lookaround assertion is encountered, the characters
5941 The scan continues until either the end of the subject is reached, or
5944 match has failed). Thus, if there is more than one possible match,
5947 order of length. There is an option to stop the algorithm after the
5948 first match (which is necessarily the shortest) is found.
5953 the match data block is therefore not advisable when doing DFA match-
5961 is matched against the string "the caterpillar catchment", the result
5962 is the three strings "caterpillar", "cater", and "cat" that start at
5968 ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
5969 is no point even considering the possibility of backtracking into the
5971 match is found. If you really do want multiple matches in such cases,
5980 greedy nature of repetition quantifiers is not relevant (though it may
5984 could also match what is quantified, for example in a pattern like
5990 a non-possessive quantifier. Similarly, if an atomic group is present,
5991 it is matched as if it were a standalone pattern at the current point,
5992 and the longest match is then "locked in" for the rest of the overall
5996 is not straightforward to keep track of captured substrings for the
6012 be on some paths and not on others), is not supported.
6014 7. Callouts are supported, but the value of the capture_top field is
6015 always 1, and the value of the capture_last field is always 0.
6018 matches a single code unit, even in a UTF mode, is not supported in
6024 are not supported. (*FAIL) is supported, and behaves like a failing
6027 10. The PCRE2_MATCH_INVALID_UTF option for pcre2_compile() is not sup-
6033 The main advantage of the alternative algorithm is that all possible
6035 in particular, the longest match is found. To find more than one match
6039 Partial matching is possible with this algorithm, though it has some
6048 1. It is substantially slower than the standard algorithm. This is
6049 partly because it has to search for all possible matches, but is also
6050 because it is less susceptible to optimization.
6058 4. JIT optimization is not supported.
6088 In normal use of PCRE2, if there is a match up to the end of a subject
6090 PCRE2_ERROR_NOMATCH is returned, just like any other failing match.
6094 One example is an application where the subject string is very long,
6095 and not all available at once. The requirement here is to be able to do
6096 the matching segment by segment, but special action is needed when a
6099 Another example is checking a user input string as it is typed, to en-
6103 Partial matching is a PCRE2-specific feature; it is not Perl-compati-
6104 ble. It is requested by setting one of the PCRE2_PARTIAL_HARD or
6106 ference between the two options is whether or not a partial match is
6120 tial matches on the same pattern. Separate code is compiled for each
6122 matching code is used.
6126 tern, and abandons matching immediately if it is not present in the
6130 on shorter strings. This optimization is also disabled for partial
6137 subject string is reached successfully, but either more characters are
6139 change what is matched.
6141 Example 1: if the pattern is /abc/ and the subject is "ab", more char-
6145 Example 2: if the pattern is /ab+/ and the subject is "ab", a complete
6147 what is matched. In this case, only PCRE2_PARTIAL_HARD returns a par-
6150 On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if
6151 the next pattern item is \z, \Z, \b, \B, or $ there is always a partial
6162 tion exists in case there is a lookbehind that inspects characters be-
6165 (3) There is a special case when the whole pattern can match an empty
6166 string. When the starting point is at the end of the subject, the
6167 empty string match is a possibility, and if PCRE2_PARTIAL_SOFT is set
6168 and neither of the above conditions is true, it is returned. However,
6171 "there is going to be a match at this point, but until some more char-
6178 When a partial matching option is set, the result of calling
6192 When a partial match is returned, the first two elements in the ovector
6199 If it is matched against "456abc123xyz" the result is a complete match,
6201 the "start of match" point. However, if a partial match is requested
6202 and the subject string is "456abc12", a partial match is found for the
6206 If there is more than one partial match, the first one that was found
6207 provides the data that is returned. Consider this pattern:
6211 If this is matched against the subject string "abc123dog", both alter-
6212 natives fail to match, but the end of the subject is reached during
6213 matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3
6218 How a partial match is processed by pcre2_match()
6220 What happens when a partial match is identified depends on which of the
6221 two partial matching options is set.
6223 If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon
6224 as a partial match is found, without continuing to search for possible
6225 complete matches. This option is "hard" because it prefers an earlier
6227 tion is made that the end of the supplied subject string is not the
6228 true end of the available data, which is why \z, \Z, \b, \B, and $ al-
6231 If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but
6233 tried. If no complete match can be found, PCRE2_ERROR_PARTIAL is re-
6234 turned instead of PCRE2_ERROR_NOMATCH. This option is "soft" because it
6236 items in a pattern behave as if the subject string is potentially com-
6238 for \b and \B the end of the subject is treated as a non-alphanumeric.
6245 This matches either "dog" or "dogsbody", greedily (that is, it prefers
6246 the longer string if possible). If it is matched against the string
6248 However, if PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
6249 TIAL. On the other hand, if the pattern is made ungreedy the result is
6254 In this case the result is always a complete match because that is
6259 /dog(sbody)?/ is the same as /dogsbody|dog/
6260 /dog(sbody)??/ is the same as /dog|dogsbody/
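The two equivalences listed above can be checked with any backtracking engine that shares this greedy and lazy syntax. The sketch below is only an illustration: it uses Python's standard re module as a stand-in (an assumption; it is not PCRE2 and has no partial matching), and the helper name first_match is hypothetical.

```python
import re

SUBJECT = "dogsbody"


def first_match(pattern, subject=SUBJECT):
    """Return the text of the first match of pattern in subject, or None."""
    m = re.search(pattern, subject)
    return m.group(0) if m else None


# Greedy optional group: the longer alternative is tried first.
greedy = first_match(r"dog(sbody)?")
greedy_alt = first_match(r"dogsbody|dog")

# Ungreedy optional group: the shorter alternative is tried first.
lazy = first_match(r"dog(sbody)??")
lazy_alt = first_match(r"dog|dogsbody")

print(greedy, greedy_alt, lazy, lazy_alt)
```

In each pair the quantified form and the explicit alternation pick the same string, which is why the ungreedy form always yields a complete match on "dog" before "sbody" is even considered.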
6269 calling pcre2_match(). Here is a run of pcre2test using a pattern that
6281 matching options. Here is an example where there is a difference:
6290 With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
6291 PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete,
6292 so there is only a partial match.
6304 segment, and are in the middle of very long strings, so the pattern is
6309 returns a partial match at the end of a segment whenever there is the
6319 data> ...the date is 23ja\=ph
6321 data> ...the date is 23jan19 and on that day...\=offset=15
6327 the one in which the partial match was found. This is the most
6328 straightforward approach, typically using a memory buffer that is twice
6330 buffer is discarded, the second half is moved to the start of the
6331 buffer, and a new segment is added before repeating the match as in the
6336 this is not at present straightforward. In cases such as the above,
6337 where the pattern does not contain any lookbehinds, it is sufficient to
6342 with '<' if the allusedtext modifier is set:
6349 However, the allusedtext modifier is not available for JIT matching,
6351 characters. For this reason, this information is not available via the
6352 API. It is therefore not possible in general to obtain the exact number
6359 mation that is currently available via the API is the length of the
6362 pcre2_pattern_info() with the PCRE2_INFO_MAXLOOKBEHIND option is the
6364 behind moves back when it is processed. A pattern such as
6368 In a non-UTF or a 32-bit case, moving back is just a subtraction, but
6377 ously. If the end of the subject is reached before the end of the pat-
6378 tern, there is the possibility of a partial match.
6380 When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
6382 are returned. If PCRE2_PARTIAL_HARD is set, a partial match takes
6384 was matched when the longest partial match was found is set as the
6388 there is no difference between greedy and ungreedy repetition, its
6389 behaviour is different from that of pcre2_match(). Consider the string "dog"
6396 "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.
6402 is possible to continue the match by providing additional subject data
6405 same working space as before, because this is where details of the pre-
6408 matching over multiple segments. Here is an example using pcre2test:
6418 (restarted) match. Notice that when the match is complete, only the
6419 last part is shown; PCRE2 does not retain the previously partially-
6420 matched string. It is up to the calling program to do that if it needs
6422 fails, it is not possible to try again at a new starting point. All
6423 this facility is capable of doing is continuing with the previous match
6428 If the first part of the subject is "ABC123", a partial match of the
6429 first alternative is found at offset 3. There is no partial match for
6437 way of doing it is to retain some or all of the segment and try a new
6439 ity is to work with two buffers. If a partial match at offset n in the
6440 first buffer is followed by "no match" when PCRE2_DFA_RESTART is used
6473 by PCRE2 are described in detail below. There is a quick-reference syn-
6484 detail. This description of PCRE2's regular expressions is intended as
6488 ported by PCRE2 when its main matching function, pcre2_match(), is
6490 pcre2_dfa_match(), which matches using a different algorithm that is
6492 available when DFA matching is used. The advantages and disadvantages
6513 strings, PCRE2 must be built to include Unicode support (which is the
6517 which is equivalent to setting the relevant PCRE2_UTF. How setting a
6518 UTF mode affects pattern matching is mentioned in several places below.
6519 There is also a summary of features in the pcre2unicode page.
6523 PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not al-
6528 Another special sequence that may appear at the start of a pattern is
6534 greater than 127, even when UTF is not set. These behaviours can be
6539 restrict them for security reasons. If the PCRE2_NEVER_UCP option is
6540 passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in
6547 to whichever matching function is subsequently called to match the pat-
6556 item. For example, by default a+b is treated as a++b. For more details,
6576 If a pattern that starts with (*NO_JIT) is successfully compiled, an
6578 pcre2_jit_compile() is ignored.
6582 The pcre2_match() function contains a counter that is incremented every
6587 that is used, but there is also an explicit memory limit that can be
6591 voked by patterns with huge matching trees. A common example is a pat-
6593 not match. When one of these limits is reached, pcre2_match() gives an
6601 where d is any number of decimal digits. However, the value of the set-
6605 If there is more than one setting of one of these limits, the lower
6606 value is used. The heap limit is specified in kibibytes (units of 1024
6610 name is still recognized for backwards compatibility.
6614 limit is used (but in a different way) when JIT is being used, or when
6615 pcre2_dfa_match() is called, to limit computing resource usage by those
6616 matching functions. The depth limit is ignored by JIT but is relevant
6630 It is also possible to specify a newline convention by starting a pat-
6641 tion. For example, on a Unix system where LF is the default newline se-
6646 changes the convention to CR. That pattern matches "a\nb" because LF is
6647 no longer a newline. If more than one of these settings is present, the
6648 last one is used.
6652 acter when PCRE2_DOTALL is not set, and the behaviour of \N when not
6654 escape sequence matches. By default, this is any Unicode newline se-
6662 It is possible to restrict \R to match only CR, LF, or CRLF (instead of
6666 CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.
6680 A regular expression is a pattern that is matched against a subject
6687 matches a portion of a subject string that is identical to itself. When
6688 caseless matching is specified (the PCRE2_CASELESS option or (?i)
6693 PCRE2_UTF or PCRE2_UCP is set, unless the PCRE2_EXTRA_CASELESS_RESTRICT
6694 option is in force (either passed to pcre2_compile() or set by (?r)
6725 fore or after the comma. The exception to this is \u{...} which is an
6726 ECMAScript compatibility feature that is recognized only when the
6727 PCRE2_EXTRA_ALT_BSUX option is set. ECMAScript does not ignore such
6730 Part of a pattern that is in square brackets is called a "character
6739 If a pattern is compiled with the PCRE2_EXTENDED option, most white
6744 PCRE2_EXTENDED_MORE option is set, the same applies, but in addition
6756 The backslash character has several uses. Firstly, if it is followed by
6757 a character that is not a digit or a letter, it takes away any special
6763 character would otherwise be interpreted as a metacharacter, so it is
6774 space even when the PCRE2_EXTENDED option is set so that most other
6775 white space is ignored. The behaviour is different from Perl in that $
6792 The \Q...\E sequence is recognized both inside and outside character
6793 classes. An isolated \E that is not preceded by \Q is ignored. If \Q
6794 is not followed by \E later in the pattern, the literal interpretation
6795 continues to the end of the pattern (that is, \E is assumed at the
6796 end). If the isolated \Q is inside a character class, this causes an
6797 error, because the character class is then not terminated by a closing
6803 acters in patterns in a visible manner. There is no restriction on the
6805 is being prepared by text editing, it is often easier to use one of the
6810 \a alarm, that is, the BEL character (hex 07)
6811 \cx "control-x", where x is a non-control ASCII character
6824 By default, after \x that is not followed by {, from zero to two hexa-
6828 there is no terminating }, an error occurs.
6831 of the two syntaxes for \x or by an octal sequence. There is no differ-
6832 ence in the way they are handled. For example, \xdc is exactly the same
6836 Support is available for some ECMAScript (aka JavaScript) escape se-
6837 quences via two compile-time options. If PCRE2_ALT_BSUX is set, the se-
6838 quence \x followed by { is not recognized. Only if \x is followed by
6839 two hexadecimal digits is it recognized as a character escape. Other-
6840 wise it is interpreted as a literal "x" character. In this mode, sup-
6841 port for code points greater than 256 is provided by \u, which must be
6842 followed by four hexadecimal digits; otherwise it is interpreted as a
6846 dition, \u{hhh..} is recognized as the character specified by hexadeci-
6850 syntax is from ECMAScript 6.
6852 The \N{U+hhh..} escape sequence is recognized only when PCRE2 is oper-
6854 Unicode name; PCRE2 does not support this. Note that when \N is not
6856 ent meaning, matching any character that is not a newline.
6858 There are some legacy applications where the escape sequence \r is ex-
6860 is set, \r in a pattern is converted to \n so that it matches a LF
6863 An error occurs if \c is not followed by a character whose ASCII code
6864 point is in the range 32 to 126. The precise effect of \cx is as fol-
6865 lows: if x is a lower case letter, it is converted to upper case. Then
6866 bit 6 of the character (hex 40) is inverted. Thus \cA to \cZ become hex
6867 01 to hex 1A (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and
6868 \c; becomes hex 7B (; is 3B). If the code unit following \c has a code
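The \cx rule quoted above (fold a lower case letter to upper case, then invert bit 6, hex 40) is plain arithmetic on ASCII code points. A minimal sketch follows; the function name control_escape is hypothetical and the code models only the rule as stated in these lines, not PCRE2's own implementation or its EBCDIC behaviour.

```python
def control_escape(x):
    # Code point produced by the escape "backslash c x", per the quoted rule.
    code = ord(x)
    # The text says an error occurs outside the ASCII range 32 to 126.
    if not 32 <= code <= 126:
        raise ValueError("character after \\c must be printable ASCII")
    # Lower case letters are converted to upper case first.
    if "a" <= x <= "z":
        code = ord(x.upper())
    # Then bit 6 (hex 40) is inverted.
    return code ^ 0x40


for ch in "AZ{;a":
    print(ch, format(control_escape(ch), "02X"))
```

The values for A, Z, { and ; reproduce the hex 01, 1A, 3B and 7B figures given in the text.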
6871 When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported.
6873 The \c escape is processed as specified for Perl in the perlebcdic doc-
6884 which is BEL in ASCII but DEL in EBCDIC.
6887 but because 127 is not a control character in EBCDIC, Perl makes it
6890 FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
6898 if the pattern character that follows is itself an octal digit.
6901 in braces. An error occurs if this is not the case. This escape is a
6906 For greater clarity and unambiguity, it is best to avoid following \ by
6911 The handling of a backslash followed by a digit other than 0 is compli-
6915 its as a decimal number. If the number is less than 10, begins with the
6917 groups in the expression, the entire sequence is taken as a backrefer-
6918 ence. A description of how this works is given later, following the
6928 \040 is another way of writing an ASCII space
6929 \40 is the same, provided there are fewer than 40
6931 \7 is always a backreference
6934 \011 is always a tab
6935 \0113 is a tab followed by the character "3"
6940 \81 is always a backreference
6959 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in
6967 class, \b is interpreted as the backspace character (hex 08).
6969 When not followed by an opening brace, \N is not allowed in a character
6980 tions is set, \U matches a "U" character, and \u can be used to define
6986 closed in braces, is an absolute or relative backreference. A named
6993 name or a number enclosed either in angle brackets or single quotes, is
6996 \g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
6997 erence; the latter is a subroutine call.
7001 Another use of backslash is for specifying generic character types:
7004 \D any character that is not a decimal digit
7006 \H any character that is not a horizontal white space character
7007 \N any character that is not a newline
7009 \S any character that is not a white space character
7011 \V any character that is not a vertical white space character
7016 when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change
7017 the meaning of \N. Note that when \N is followed by an opening brace it
7026 the appropriate type. If the current matching point is at the end of
7027 the subject string, all of them fail, because there is no character to
7032 cale. This list may vary if locale-specific matching is taking place.
7034 is recognized as white space, and in others the VT character is not.
7036 A "word" character is an underscore or any character that is a letter
7037 or digit. By default, the definition of letters and digits is con-
7039 specific matching is taking place (see "Locale support" in the pcre2api
7043 use of locales with Unicode is discouraged.
7048 matching is happening. These escape sequences retain their original
7050 ciency reasons. If the PCRE2_UCP option is set, the behaviour is
7067 these sequences is noticeably slower when PCRE2_UCP is set.
7077 list of code points, whether or not PCRE2_UCP is set. The horizontal
7116 any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
7121 This is an example of an "atomic group", details of which are given be-
7125 riage return, U+000D), or NEL (next line, U+0085). Because this is an
7126 atomic group, the two-character sequence is treated as a single unit
7131 rator, U+2029). Unicode support is not needed for these characters to
7134 It is possible to restrict \R to match only CR, LF, or CRLF (instead of
7136 PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for "back-
7137 slash R".) This can be made the default when PCRE2 is built; if this is
7139 CODE option. It is also possible to specify these settings by starting
7148 be in upper case. If more than one of them is present, the last one is
7155 Inside a character class, \R is treated as an unrecognized escape se-
7160 When PCRE2 is built with Unicode support (the default), three addi-
7169 Matching characters by Unicode property is not fast, because PCRE2 has
7171 erty. That is why the traditional escape sequences such as \d and \w do
7184 and underscores are ignored. There is support for Unicode script names,
7196 scripts ("Script Extensions") with which it is commonly used. Using the
7198 script is Adlam, whereas \p{scx:Adlam} matches, in addition, characters
7201 sign is an alternative to the colon. If a script name is given without
7202 a property type, for example, \p{Adlam}, it is treated as \p{scx:Ad-
7220 brace and the property name. For example, \p{^Lu} is the same as
7223 If only one letter is specified with \p or \P, it includes all the gen-
7277 The special property LC, which has the synonym L&, is also supported:
7279 words, a letter that is not classified as a modifier or "other".
7283 ferent to any other character when PCRE2 is not in UTF mode (using the
7290 \p{Letter}) are not supported by PCRE2, nor is it permitted to prefix
7291 any of these properties with "Is".
7293 No character that is in the Unicode table has the Cn (unassigned) prop-
7294 erty. Instead, this property is assumed for any code point that is not
7298 For example, \p{Lu} always matches only upper case letters. This is
7303 Unicode defines a number of binary properties, that is, properties
7385 7. Do not break within emoji flag sequences. That is, do not break be-
7396 non-standard, non-Perl properties internally when PCRE2_UCP is set.
7407 (separator) property. Xsp is the same as Xps; in PCRE1 it used to ex-
7412 There is another non-standard property, Xuc, which matches any charac-
7418 are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
7425 characters not to be included in the final matched sequence that is re-
7437 ture is similar to a lookbehind assertion (described below), but the
7438 part of the pattern that precedes \K is not constrained to match a lim-
7439 ited number of characters, as is required for a lookbehind assertion.
7445 matches "foobar", the first substring is still set to "foo".
7451 is set, \K is acted upon when it occurs inside positive assertions, but
7452 is ignored in negative assertions. Note that when a pattern such as
7460 If the subject is "foobar", a call to pcre2_match() with a starting
7462 is, the start of the reported match is earlier than where the match
7467 The final use of backslash is for certain simple assertions. An asser-
7470 use of groups for more complicated assertions is described below. The
7483 character class, an "invalid escape sequence" error is generated.
7485 A word boundary is a position in the subject string where the current
7489 PCRE2 is built with Unicode support, the meanings of \w and \W can be
7490 changed by setting the PCRE2_UCP option. When this is done, it also af-
7493 determines which it is. For example, the fragment \ba matches "a" at
7502 acters. However, if the startoffset argument of pcre2_match() is non-
7503 zero, indicating that matching is to start at a point other than the
7505 \Z and \z is that \Z matches before a newline at the end of the string
7508 The \G assertion is true only when the current matching position is at
7511 startoffset is non-zero. By calling pcre2_match() multiple times with
7512 appropriate arguments, you can mimic Perl's /g option, and it is in
7516 starting character of the matching process, is subtly different from
7522 If all the alternatives of a pattern begin with \G, the expression is
7523 anchored to the starting match position, and the "anchored" flag is set
7530 That is, they test for a particular condition being true without con-
7533 line convention is set so that only the two-character sequence CRLF is
7538 character is an assertion that is true only if the current matching
7539 point is at the start of the subject string. If the startoffset argu-
7540 ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum-
7541 flex can never match if the PCRE2_MULTILINE option is unset. Inside a
7547 alternative in which it appears if the pattern is ever to match that
7548 branch. If all possible alternatives start with a circumflex, that is,
7549 if the pattern is constrained to match only at the start of the sub-
7550 ject, it is said to be an "anchored" pattern. (There are also other
7553 The dollar character is an assertion that is true only if the current
7554 matching point is at the end of the subject string, or immediately be-
7556 TEOL is set. Note, however, that it does not actually match the new-
7567 the PCRE2_MULTILINE option is set. When this is the case, a dollar
7578 match for circumflex is possible when the startoffset argument of
7579 pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
7580 if PCRE2_MULTILINE is set.
7583 nizes the two-character sequence CRLF as a newline, this is preferred,
7585 lines. For example, if the newline convention is "any", a multiline
7587 than after CR, even though CR on its own is a valid newline. (It also
7592 start with \A it is always anchored, whether or not PCRE2_MULTILINE is
7604 ter sequence CRLF is the only line ending, dot does not match CR if it
7605 is immediately followed by LF, but otherwise it matches all characters
7606 (including isolated CRs and LFs). When ANYCRLF is selected for line
7612 PCRE2_DOTALL option is set, a dot matches any one character, without
7613 exception. If the two-character sequence CRLF is present in the sub-
7616 The handling of dot is entirely independent of the handling of circum-
7621 like a dot, except that it is not affected by the PCRE2_DOTALL option.
7625 When \N is followed by an opening brace it has a different meaning. See
7634 unit, whether or not a UTF mode is set. In the 8-bit library, one code
7635 unit is one byte; in the 16-bit library it is a 16-bit unit; in the
7636 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
7637 line-ending characters. The feature is provided in Perl in order to
7638 match individual bytes in UTF-8 mode, but it is unclear how it can use-
7644 sults, because PCRE2 assumes that it is matching character by character
7647 PCRE2_MATCH_INVALID_UTF option is used).
7650 PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also
7658 to optimize and so the match is always run using the interpreter.
7660 In the 32-bit library, however, \C is always supported (when not ex-
7662 whether or not UTF-32 is specified.
7664 In general, the \C escape sequence is best avoided. However, one way of
7666 ters is to use a lookahead to check the length of the next character,
7686 closing square bracket. A closing square bracket on its own is not spe-
7687 cial by default. If a closing square bracket is required as a member
7691 the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
7696 the first character in the class definition is a circumflex, in which
7698 If a circumflex is actually required as a member of the class, ensure
7699 it is not the first character, or escape it with a backslash.
7702 while [^aeiou] matches any character that is not a lower case vowel.
7703 Note that a circumflex is just a convenient notation for specifying the
7705 class that starts with a circumflex is not an assertion; it still con-
7707 the current pointer is at the end of the string.
7710 \x, or \N{U+hh..} in the usual way. When caseless matching is set, any
7717 ther PCRE2_UTF or PCRE2_UCP is set.
7721 quence is in use, and whatever setting of the PCRE2_DOTALL and
7722 PCRE2_MULTILINE options is used. A class such as [^a] always matches
7735 they cause an error. The same is true for \N when not followed by an
7740 tween d and m, inclusive. If a minus character is required in a class,
7748 or \H. However, unless the hyphen is the last character in the class,
7749 Perl outputs a warning in its warning mode, as this is most likely a
7750 user error. As PCRE2 has no facility for warning, an error is given in
7753 It is not possible to have the literal character "]" as the end charac-
7754 ter of a range. A pattern such as [W-]46] is interpreted as a class of
7756 would match "W46]" or "-46]". However, if the "]" is escaped with a
7757 backslash it is interpreted as the end of range, so [W-\]46] is inter-
7772 There is a special case in EBCDIC environments for ranges whose end
7777 points. However, if the range is specified numerically, for example,
7780 If a range that includes letters is used when caseless matching is set,
7781 it matches the letters in either case. For example, [W-c] is equivalent
7829 CR (13), and space (32). If locale-specific matching is taking place,
7834 The name "word" is a Perl extension, and "blank" is a GNU extension
7835 from Perl 5.8. Another Perl extension is negation, which is indicated
7841 POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
7842 these are not supported, and an error is given if they are encountered.
7846 ters in the range 128-255 when locale-specific matching is happening.
7849 used. This is achieved by replacing POSIX classes with other sequences,
7875 characters that are not controls, that is, characters with
7885 Unicode code points start at U+FF10. This is a change that
7892 ASCII characters when PCRE2_UCP is set. The option PCRE2_EX-
7903 ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
7906 [[:<:]] is converted to \b(?=\w)
7907 [[:>:]] is converted to \b(?<=\w)
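The two replacement assertions above can be tried out directly, since \b, lookahead and lookbehind are widely supported. This sketch uses Python's re module as a stand-in (an assumption; Python does not accept the [[:<:]] and [[:>:]] spellings themselves), and the helper name positions is hypothetical.

```python
import re

START_OF_WORD = r"\b(?=\w)"   # what [[:<:]] is converted to
END_OF_WORD = r"\b(?<=\w)"    # what [[:>:]] is converted to


def positions(pattern, subject):
    """Offsets at which the zero-width pattern matches in subject."""
    return [m.start() for m in re.finditer(pattern, subject)]


subject = "hi there"
starts = positions(START_OF_WORD, subject)
ends = positions(END_OF_WORD, subject)
print(starts, ends)
```

Both patterns match only at word boundaries; the lookahead keeps the boundaries that have a word character after them, and the lookbehind keeps those that have one before them.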
7911 support is not compatible with Perl. It is provided to help migrations
7912 from other environments, and is best not used in any new patterns. Note
7915 character normally shows which is wanted, without the need for the as-
7930 appear, and an empty alternative is permitted (matching the empty
7932 to right, and the first one that succeeds is used. If the alternatives
7951 For example, (?im) sets caseless, multiline matching. It is also possi-
7958 PCRE2_EXTENDED, is also permitted. Only one hyphen may appear in the
7960 the option is unset. An empty options setting "(?)" is allowed. Need-
7963 If the first character following (? is a circumflex, it causes all of
7979 However, except for 'r', these are not unset by (?^), which is equiva-
7980 lent to (?-imnrsx). If 'a' is not followed by any of the upper case
7984 TRA_ASCII_POSIX is set, but including it in (?aP) means that (?-aP)
7987 When one of these option changes occurs at top level (that is, not in-
7996 matches abc and aBc and no other strings (assuming PCRE2_CASELESS is
8003 first branch is abandoned before the option setting. This is because
8018 tion is called. In addition, the pattern can contain special leading
8043 is passed back to the caller, separately from the portion that matched
8049 king" is matched against the pattern
8056 The fact that plain parentheses fulfil two functions is not always
8057 helpful. There are often times when grouping is required without cap-
8058 turing. If an opening parenthesis is followed by a question mark and a
8059 colon, the group does not do any capturing, and is not counted when
8061 the string "the white queen" is matched against the pattern
8066 1 and 2. The maximum number of capture groups is 65535.
8077 the group is reached, an option setting in one branch does affect sub-
8086 with (?| and is itself a non-capturing group. For example, consider
8094 matched. This construct is useful when you want to capture part, but
8096 theses are numbered as usual, but the number is reset at the start of
8099 lowing example is taken from the Perl documentation. The numbers under-
8106 A backreference to a capture group uses the most recent value that is
8117 A relative reference such as (?-1) is no different: it is just a conve-
8121 number, the test is true if any group with that number has matched.
8123 An alternative approach to using this "branch reset" feature is to use
8129 Identifying capture groups by number is simple, but it can be very hard
8131 an expression is modified, the numbers may change. To help with this
8139 Names may be up to 128 code units long. When PCRE2_UTF is not set, they
8141 must start with a non-digit. When PCRE2_UTF is set, the syntax of group
8142 names is extended to allow any Unicode letter or Unicode decimal digit.
8145 ^[_A-Za-z][_A-Za-z0-9]*\z when PCRE2_UTF is not set
8146 ^[_\p{L}][_\p{L}\p{Nd}]*\z when PCRE2_UTF is set
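As an illustration of the non-UTF rule above, a name can be checked before it is placed in a pattern. The sketch below is an assumption-laden stand-in: it uses Python's re module, replaces the ^...\z anchors with fullmatch(), covers only the case where PCRE2_UTF is not set (Python's re has no \p{L}), and treats the 128 code unit limit mentioned earlier as a character count, which holds for ASCII names in the 8-bit library. The names MAX_NAME_LENGTH and is_valid_group_name are hypothetical.

```python
import re

MAX_NAME_LENGTH = 128  # limit quoted in the text, in code units
NON_UTF_NAME = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")


def is_valid_group_name(name):
    """True if name fits the non-UTF group-name rule and length limit."""
    if len(name) > MAX_NAME_LENGTH:
        return False
    # fullmatch() plays the role of the ^ and \z anchors.
    return NON_UTF_NAME.fullmatch(name) is not None


for candidate in ["word_1", "_x", "1abc", "a-b", ""]:
    print(repr(candidate), is_valid_group_name(candidate))
```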
8174 vokes a compile-time error. However, there is still scope for confu-
8179 Although the second group number 1 is not explicitly named, the name AA
8180 is still an alias for any group 1. Whether the pattern matches "aa" or
8205 There are five capture groups, but only one is ever set after a match.
8209 was. (An alternative way of solving this problem is to use a "branch
8215 is set is used for the reference. For example, this pattern matches
8222 corresponds to the first occurrence of the name is used. In the absence
8223 of duplicate numbers this is the one with the lowest number.
8228 the condition is true for any one of them, the overall condition is
8229 true. This is the same behaviour as testing by number. For further de-
8236 Repetition is specified by quantifiers, which may follow any one of
8258 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
8259 special character. If the second number is omitted, but the comma is
8260 present, there is no upper limit; if the second number and the comma
8270 matches exactly 8 digits. If the first number is omitted, the lower
8271 limit is taken as zero; in this case the upper limit must be present.
8273 X{,4} is interpreted as X{0,4}
8275 This is a change in behaviour that happened in Perl 5.34.0 and PCRE2
8280 of a quantifier, the brace is taken as a literal character. In particu-
8281 lar, this means that {,} is a literal string of three characters.
8283 Note that not every opening brace is potentially the start of a quanti-
8289 of which is represented by a two-byte sequence in a UTF-8 string. Simi-
8294 The quantifier {0} is permitted, causing the expression to behave as if
8305 * is equivalent to {0,}
8306 + is equivalent to {1,}
8307 ? is equivalent to {0,1}
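The three abbreviations above can be compared against their brace forms on a handful of subjects. This is a small sketch using Python's re module as a stand-in for PCRE2 (an assumption; the two engines agree on this syntax), with hypothetical names EQUIVALENTS, SUBJECTS and same_behaviour.

```python
import re

# Single-character quantifier mapped to its brace form.
EQUIVALENTS = {
    r"a*": r"a{0,}",
    r"a+": r"a{1,}",
    r"a?": r"a{0,1}",
}

SUBJECTS = ["", "a", "aa", "aaa", "b"]


def same_behaviour(short, long_form):
    """True if both patterns accept exactly the same test subjects."""
    return all(
        bool(re.fullmatch(short, s)) == bool(re.fullmatch(long_form, s))
        for s in SUBJECTS
    )


results = {short: same_behaviour(short, long_form)
           for short, long_form in EQUIVALENTS.items()}
print(results)
```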
8309 It is possible to construct infinite loops by following a group that
8323 By default, quantifiers are "greedy", that is, they match as much as
8326 this gives problems is in trying to match comments in C programs. These
8338 the .* item. However, if a quantifier is followed by a question mark,
8345 tifiers is not otherwise changed, just the preferred number of matches.
8352 which matches one digit by preference, but can match two if that is the
8355 If the PCRE2_UNGREEDY option is set (an option that is not available in
8360 When a parenthesized group is quantified with a minimum repeat count
8361 that is greater than 1 or with a limited maximum, more memory is re-
8366 (equivalent to Perl's /s) is set, thus allowing the dot to match new-
8367 lines, the pattern is implicitly anchored, because whatever follows
8369 so there is no point in retrying the overall match at any position af-
8373 In cases where it is known that the subject string contains no new-
8374 lines, it is worth setting PCRE2_DOTALL in order to obtain this opti-
8378 When .* is inside capturing parentheses that are the subject of a
8384 If the subject is "xyz123abc123" the match point is the fourth charac-
8385 ter. For this reason, such a pattern is not implicitly anchored.
8387 Another case where implicit anchoring is not applied is when the lead-
8388 ing .* is inside an atomic group. Once again, a match at the start may
8395 there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
8397 When a capture group is repeated, the value captured is the substring
8403 is "tweedledee". However, if there are nested capture groups, the cor-
8409 matches "aba" the value of the second captured substring is "b".
8417 rest of the pattern to match. Sometimes it is useful to prevent this,
8419 than it otherwise might, when the author of the pattern knows there is
8428 action of the matcher is to try again with only 5 digits matching the
8431 the means for specifying that once a group has matched, it is not to be
8436 is a kind of special parenthesis, starting with (?> as in this example:
8446 contains once it has matched, and a failure further into the pattern is
8450 An alternative description is that a group of this type matches exactly
8462 group is just a single repeated item, as in the example above, a sim-
8475 GREEDY option is ignored. They are a convenient notation for the sim-
8476 pler forms of atomic group. However, there is no difference in the
8481 The possessive quantifier syntax is an extension to the Perl 5.8 syn-
8488 simple pattern constructs. For example, the sequence A+B is treated as
8489 A++B because there is no point in backtracking into a sequence of A's
8495 group is the only way to avoid some failing matches taking a very long
8502 matches, it runs quickly. However, if it is applied to
8506 it takes a long time before reporting failure. This is because the
8511 single character is used. They remember the last single character that
8512 is required for a match, and fail early if it is not present in the
8513 string.) If the pattern is changed so that it uses an atomic group,
8524 0 (and possibly further digits) is a backreference to a capture group
8525 earlier (that is, to its left) in the pattern, provided there have been
8528 However, if the decimal number following the backslash is less than 8,
8529 it is always taken as a backreference, and causes an error only if
8531 words, the group that is referenced need not be to the left of the ref-
8533 can make sense when a repetition is involved and the group to the right
8536 It is not possible to have a numerical "forward backreference" to a
8537 group whose number is 8 or more using this syntax because a sequence
8538 such as \50 is interpreted as a character defined in octal. See the
8542 is no problem when named capture groups are used (see below).
8545 following a backslash is to use the \g escape sequence. This escape
8554 ity that is present in the older syntax. It is also useful when literal
8555 digits follow the reference. A signed number is a relative reference.
8560 The sequence \g{-1} is a reference to the capture group whose number is
8562 example (where the next group would be numbered 3) is it equivalent to
8564 is inside a capture group, that group is included in the count, so in
8573 The sequence \g{+1} is a reference to the next capture group that is
8586 not "sense and responsibility". If caseful matching is in force at the
8587 time of the backreference, the case of letters is relevant. For exam-
8593 original capture group is matched caselessly.
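
The backreference behaviour quoted in the lines above can be sketched with Python's re module. This is only a stand-in for PCRE2: the assumption is that the two engines agree on these simple cases, although they are separate implementations. The helper name whole_match and the group name "stem" are hypothetical.

```python
import re

# Numbered backreference: \1 must repeat the text that group 1 actually matched.
NUMBERED = re.compile(r"(sens|respons)e and \1ibility")

# The same pattern with a named group and the Python-style (?P=name) reference.
NAMED = re.compile(r"(?P<stem>sens|respons)e and (?P=stem)ibility")

# Group 1 is matched caselessly, but the backreference is compared casefully.
RAH = re.compile(r"((?i:rah))\s+\1")


def whole_match(pattern, text):
    """Return True when the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ("sense and sensibility",
                 "response and responsibility",
                 "sense and responsibility"):
        print(text, whole_match(NUMBERED, text), whole_match(NAMED, text))
    for text in ("rah rah", "RAH RAH", "RAH rah"):
        print(text, whole_match(RAH, text))
```

The last pattern uses a scoped (?i:...) group because Python does not accept a bare (?i) in the middle of a pattern; PCRE2 itself does.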
8596 capture groups. The .NET syntax is \k{name}, the Python syntax is
8597 (?P=name), and the original Perl syntax is \k<name> or \k'name'. All of
8600 named references, is also supported by PCRE2. We could rewrite the
8608 A capture group that is referenced by name may appear in the pattern
8618 the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref-
8625 PCRE2_EXTENDED_MORE option is set, this can be white space. Otherwise,
8631 when the group is first used, so, for example, (a\1) never matches.
8652 An assertion is a test on the characters following or preceding the
8662 group is matched in the normal way, and if it is true, matching contin-
8667 is true, but there is a subsequent matching failure, there is no back-
8675 determines which branch of the condition is followed.
8686 fails to match). A negative assertion is true only when all its
8694 branch means that the assertion is not true. If such an assertion is
8706 10.35 the only restriction is that an unlimited maximum repetition is
8707 changed to be one more than the minimum. For example, {3,} is treated
8718 (*positive_lookahead: or (*pla: is the same as (?=
8719 (*negative_lookahead: or (*nla: is the same as (?!
8720 (*positive_lookbehind: or (*plb: is the same as (?<=
8721 (*negative_lookbehind: or (*nlb: is the same as (?<!
8723 For example, (*pla:foo) is the same assertion as (?=foo). In the fol-
8739 matches any occurrence of "foo" that is not followed by "bar". Note
8744 does not find an occurrence of "bar" that is preceded by something
8746 the assertion (?!foo) is always true when the next three characters are
8747 "bar". A lookbehind assertion is needed to achieve the other effect.
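
As a minimal sketch of the difference between these lookahead and lookbehind forms, the following again uses Python's re module as a stand-in for PCRE2 (an assumption; only fixed-length lookbehinds are used, which both engines accept). The helper name found is hypothetical.

```python
import re

# foo(?!bar): "foo" that is not followed by "bar".
FOO_NOT_BEFORE_BAR = re.compile(r"foo(?!bar)")

# (?!foo)bar: the lookahead inspects the same three characters that must
# then match "bar", so it can never fail at a position where "bar" matches.
LOOKAHEAD_THEN_BAR = re.compile(r"(?!foo)bar")

# (?<!foo)bar: "bar" that is not preceded by "foo".
BAR_NOT_AFTER_FOO = re.compile(r"(?<!foo)bar")


def found(pattern, text):
    """Return True when pattern matches anywhere in text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    for pattern in (FOO_NOT_BEFORE_BAR, LOOKAHEAD_THEN_BAR, BAR_NOT_AFTER_FOO):
        print(pattern.pattern, found(pattern, "foobar"), found(pattern, "bazbar"))
```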
8750 most convenient way to do it is with (?!) because an empty string al-
8753 is a synonym for (?!).
8762 does find an occurrence of "bar" that is not preceded by "foo". The
8771 there is a limit of 65535 characters to the lengths, which do not have
8772 to be the same, as this example demonstrates. This is the only kind of
8782 The maximum matching length for any branch of the lookbehind is limited
8784 ited repetition (for example \d*) is not supported. In some cases, the
8797 length string. However, recursion, that is, a "subroutine" call into a
8798 group that is already active, is not supported.
8803 numbers), and if the backreference is by name, the name must be unique.
8819 the pattern is specified as
8824 (because there is no following "a"), it backtracks to match all but the
8827 so we are no better off. However, if the pattern is written as
8844 each of the assertions is applied independently at the same point in
8845 the subject string. First there is a check that the previous three
8846 characters are all digits, and then there is a check that the same
8850 foo". A pattern to do that is
8862 matches an occurrence of "baz" that is preceded by "bar" which in turn
8863 is not preceded by "foo", while
8867 is another pattern that matches "foo" preceded by three digits and any
8873 Traditional lookaround assertions are atomic. That is, if an assertion
8874 is true, but there is a subsequent matching failure, there is no back-
8883 also appears earlier in the string, that is, it must appear at least
8890 is "word3". How does it work? At the start, ^(?x) anchors the pattern
8894 the assertion can match a word, which is captured by group 1. In other
8898 The current matching point is then reset to the start of the subject,
8915 group later in the pattern. If this is not the case, the rest of the
8919 There is one exception to backtracking into a non-atomic assertion. If
8920 an (*ACCEPT) control verb is triggered, the assertion succeeds atomi-
8921 cally. That is, a subsequent match failure cannot backtrack into the
8933 In concept, a script run is a sequence of characters that are all from
8936 other marks are used with multiple scripts, it is not that simple.
8937 There is a full description of the rules that PCRE2 uses in the section
8940 If part of a pattern is enclosed between (*script_run: or (*sr: and a
8945 "paypal.com" is an infamous example, where the letters could be a mix-
8957 This works as long as the first character is expected to be a character
8958 in that script, and not (for example) punctuation, which is allowed
8959 with any script. If this is not the case, a more creative lookahead is
8966 In many cases, backtracking into a script run pattern fragment is not
8968 Because this is a common requirement, a shorthand notation is provided
8971 (*asr:...) is the same as (*sr:(?>...))
8973 Note that the atomic group is inside the script run. Putting it outside
8976 Support for script runs is not available if PCRE2 is compiled without
8977 Unicode support. A compile-time error is given if any of the above con-
8978 structs is encountered. Script runs are not supported by the alternate
8989 It is possible to cause the matching process to obey a pattern fragment
8997 If the condition is satisfied, the yes-pattern is used; otherwise the
8998 no-pattern (if present) is used. An absent no-pattern is equivalent to
9003 the level of the condition itself. This pattern fragment is an example
9016 the condition is true if a capture group of that number has previously
9017 matched. If there is more than one capture group with the same number
9019 is true if any of them have matched. An alternative notation, which is
9020 a PCRE2 extension, not supported by Perl, is to precede the digits with
9021 a plus or minus sign. In this case, the group number is relative rather
9027 is not used; it provokes a compile-time error.
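
A condition that tests a capture group by number can be sketched as follows. Python's re module is used as a stand-in for PCRE2 (an assumption; it supports the absolute form (?(1)...) but not the signed relative form mentioned above). The helper name balanced_once is hypothetical.

```python
import re

# Group 1 optionally captures an opening parenthesis. The conditional group
# (?(1)\)) then requires a closing parenthesis only if group 1 matched.
OPTIONAL_PARENS = re.compile(r"(\()?[^()]+(?(1)\))")


def balanced_once(text):
    """Return True when text is a run of non-parentheses, optionally wrapped
    in exactly one pair of parentheses."""
    return OPTIONAL_PARENS.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ("abc", "(abc)", "(abc", "abc)"):
        print(text, balanced_once(text))
```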
9036 character is present, sets it as the first captured substring. The sec-
9038 third part is a conditional group that tests whether or not the first
9039 capture group matched. If it did, that is, if subject started with an
9040 opening parenthesis, the condition is true, and so the yes-pattern is
9041 executed and a closing parenthesis is required. Otherwise, since no-
9042 pattern is not present, the conditional group matches nothing. In other
9058 PCRE1, which had this facility before Perl, the syntax (?(name)...) is
9065 If the name used in a condition of this kind is a duplicate, the test
9066 is applied to all groups of the same name, and is true if any one of
9072 part of the pattern to another, whether or not it is actually recur-
9076 If a condition is the string (R), and there is no capture group with
9077 the name R, the condition is true if matching is currently in a recur-
9079 digits follow the letter R, and there is no group with that name, the
9080 condition is true if the most recent call is into a group with the
9082 is a contrived example that is equivalent to a+b:
9086 However, in both cases, if there is a capture group with a matching
9096 the condition is true if the most recent recursion is into a group of
9100 the current level. If the name used in a condition of this kind is a
9101 duplicate, the test is applied to all groups of the same name, and is
9102 true if any one of them is the most recent recursion.
9108 If the condition is the string (DEFINE), the condition is always false,
9109 even if there is a group with the name DEFINE. In this case, there may
9110 be only one alternative in the rest of the conditional group. It is al-
9112 DEFINE is that it can be used to define subroutines that can be refer-
9113 enced from elsewhere. (The use of subroutines is described below.) For
9120 The first part of the pattern is a DEFINE group inside which another
9121 group named "byte" is defined. This matches an individual component of
9123 this part of the pattern is skipped because DEFINE acts like a false
9140 This pattern matches "yes" if the PCRE2 version is greater than or equal to
9146 If the condition is not in any of the above formats, it must be a
9157 The condition is a positive lookahead assertion that matches an op-
9160 ter is found, the subject is matched against the first alternative;
9161 otherwise it is matched against the second. This pattern matches
9165 When an assertion that is a condition contains capture groups, any cap-
9166 turing that occurs in a matching branch is retained afterwards, for
9183 PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped #
9186 the pattern. Which characters are interpreted as newlines is controlled
9189 line conventions" above. Note that the end of this type of comment is a
9192 when PCRE2_EXTENDED is set, and the default newline convention (a sin-
9193 gle linefeed character) is in force:
9198 for a newline in the pattern. The sequence \n is still literal at this
9207 that can be done is to use a pattern that matches up to some fixed
9208 depth of nesting. It is not possible to handle an arbitrary nesting
9229 zero and a closing parenthesis is a recursive subroutine call of the
9231 group. (If not, it is a non-recursive subroutine call, which is de-
9232 scribed in the next section.) The special item (?R) or (?0) is a recur-
9236 PCRE2_EXTENDED option is set so that white space is ignored):
9242 cursive match of the pattern itself (that is, a correctly parenthesized
9243 substring). Finally there is a closing parenthesis. Note the use of a
9256 tricky. This is made easier by the use of relative references. Instead
9260 the point at which it is encountered.
9269 (c) is number 2. When the reference (?-2) is encountered, the second
9270 most recently opened parentheses has the number 1, but it is the first
9275 It is also possible to refer to subsequent capture groups, by writing
9277 the reference is not inside the parentheses that are referenced. They
9281 An alternative approach is to use named parentheses. The Perl syntax
9282 for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup-
9287 If there is more than one group with the same name, the earliest one is
9292 strings of non-parentheses is important when applying the pattern to
9293 strings that do not match. For example, when this pattern is applied to
9297 it yields "no match" quickly. However, if a possessive quantifier is
9305 tion). If the pattern above is matched against
9309 the value for the inner capturing parentheses (numbered 2) is "ef",
9310 which is the last value taken on at the top level. If a capture group
9311 is not matched at the top level, its final captured value is unset,
9318 brackets (that is, when recursing), whereas any characters are permit-
9323 In this pattern, (?(R) is the start of a conditional group, with two
9325 (?R) item is the actual recursive call.
9333 group. That is, once it had matched some of the subject string, it was
9339 treated as atomic. That is, they can be re-entered to try unused alter-
9340 natives if there is a matching failure later in the pattern. This is
9367 processing is in the handling of captured values. Formerly in Perl,
9385 is used outside the parentheses to which it refers, it operates a bit
9405 is used, it does match "sense and responsibility" as well as the other
9406 two strings. Another example is given in the discussion of DEFINE
9414 Processing options such as case-independence are fixed when a group is
9415 defined, so if it is used as a subroutine, such options cannot be
9424 subroutines is described in the section entitled "Backtracking verbs in
9431 name or a number enclosed either in angle brackets or single quotes, is
9439 PCRE2 supports an extension to Oniguruma: if a number is preceded by a
9440 plus or a minus sign it is taken as a relative reference. For example:
9445 synonymous. The former is a backreference; the latter is a subroutine
9454 strings that match the same pair of parentheses when there is a repeti-
9458 trary Perl code. The feature is called "callout". The caller of PCRE2
9461 context to pcre2_match() or pcre2_dfa_match(). If no match context is
9462 passed, or if the callout entry point is set to NULL, callouts are dis-
9466 external function is to be called. There are two kinds of callout:
9468 on its own with no argument is treated as (?C0). A numerical argument
9475 tion is called. It is provided with the number or string argument of
9476 the callout, the position in the pattern, and one item of data that is
9481 time, and one side-effect is that sometimes callouts are skipped. If
9495 If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical
9497 They are all numbered 255. If there is a conditional group in the pat-
9498 tern whose condition is an assertion, an additional callout is inserted
9511 ending delimiter is the same as the start, except for {, where the end-
9512 ing delimiter is }. If the ending delimiter is needed within the
9517 The doubling is removed before the string is passed to the callout
9527 or not a name argument is present. The names are not required to be
9530 By default, for compatibility with Perl, a name is any sequence of
9531 characters that does not include a closing parenthesis. The name is not
9532 processed in any way, and it is not possible to include a closing
9534 PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati-
9537 When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to
9545 or PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb
9546 names is skipped, and #-comments are recognized, exactly as in the rest
9548 verb names unless PCRE2_ALT_VERBNAMES is also set.
9550 The maximum length of a name is 255 in the 8-bit library and 65535 in
9551 the 16-bit and 32-bit libraries. If the name is empty, that is, if the
9552 closing parenthesis immediately follows the colon, the effect is as if
9557 them can be used only when the pattern is to be matched using the tra-
9564 capture groups called as subroutines (whether or not recursively) is
9576 pile(), or by starting the pattern with (*NO_START_OPT). There is more
9590 of the pattern. However, when it is inside a capture group that is
9591 called as a subroutine, only that group is ended successfully. Matching
9596 If (*ACCEPT) is inside capturing parentheses, the data so far is cap-
9601 This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
9604 (*ACCEPT) is the only backtracking verb that is allowed to be quanti-
9612 is triggered and the match succeeds. In both cases, all but C is cap-
9623 may be abbreviated to (*F). It is equivalent to (?!) but easier to
9624 read. The Perl documentation notes that it is probably useful only when
9626 are not present in PCRE2. The nearest equivalent is the callout fea-
9631 A match with the string "aaaa" always fails, but the callout is taken
9635 CEPT) and (*MARK:NAME)(*FAIL), respectively, that is, a (*MARK) is
9640 There is one verb whose main purpose is to track how a match was ar-
9646 A name is always required with this verb. For all the other backtrack-
9647 ing control verbs, a NAME argument is optional.
9650 the matching path is passed back to the caller as described in the sec-
9654 differences in those cases when (*MARK) is used in conjunction with
9657 The mark name that was last encountered on the matching path is passed
9658 back. A verb without a NAME argument is ignored for this purpose. Here
9659 is an example of pcre2test output, where the "mark" modifier requests
9670 The (*MARK) name is tagged with "MK:" in this output, and in this exam-
9671 ple it indicates which of the two alternatives matched. This is a more
9675 If a verb with a name is encountered in a positive assertion that is
9676 true, the name is recorded and passed back if it is the last-encoun-
9681 the entire match process is returned. For example:
9687 Note that in this unanchored example the mark is retained from the
9694 ensure that the match is always attempted.
9699 tinues with what follows, but if there is a subsequent match failure,
9700 causing a backtrack to the verb, a failure is forced. That is, back-
9703 that is true, its effect is confined to that group, because once the
9704 group has been matched, there is never any backtracking into it. Back-
9709 tracking reaches them. The behaviour described below is what happens
9710 when the verb is not in a subroutine or an assertion. Subsequent sec-
9715 This verb causes the whole match to fail outright if there is a later
9717 tern is unanchored, no further attempts to find a match by advancing
9718 the starting point take place. If (*COMMIT) is the only backtracking
9719 verb that is encountered, once it has been passed pcre2_match() is com-
9728 The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM-
9729 MIT). It is like (*MARK:NAME) in that the name is remembered for pass-
9734 If there is more than one backtracking verb in a pattern, a different
9739 Note that (*COMMIT) at the start of a pattern is not the same as an an-
9755 character. The pattern is now applied starting at "x", and so the
9762 the subject if there is a later matching failure that causes backtrack-
9763 ing to reach it. If the pattern is unanchored, the normal "bumpalong"
9765 occur as usual to the left of (*PRUNE), before it is reached, or when
9766 matching to the right of (*PRUNE), but if there is no match to the
9768 (*PRUNE) is just an alternative to an atomic group or possessive quan-
9773 The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
9774 It is like (*MARK:NAME) in that the name is remembered for passing back
9780 This verb, when given without a name, is like (*PRUNE), except that if
9781 the pattern is unanchored, the "bumpalong" advance is not to the next
9784 it cannot be part of a successful match if there is a later mismatch.
9789 If the subject is "aaaac...", after the first match attempt fails
9797 If (*SKIP) is used to specify a new starting position that is the same
9799 lookbehind) earlier, the position specified by (*SKIP) is ignored, and
9804 When (*SKIP) has an associated name, its behaviour is modified. When
9805 such a (*SKIP) is triggered, the previous path through the pattern is
9806 searched for the most recent (*MARK) that has the same name. If one is
9807 found, the "bumpalong" advance is to the subject position that corre-
9809 no (*MARK) with a matching name is found, the (*SKIP) is ignored.
9826 In the first example, the (*MARK) setting is in an atomic group, so it
9827 is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored.
9829 character position. In the second example, the (*MARK) setting is not
9832 ond character. This time, the (*MARK) is never seen because "a" does
9842 tracking reaches it. That is, it cancels any further backtracking
9848 If the COND1 pattern matches, FOO is tried (and possibly further items
9851 into COND1. If that succeeds and BAR fails, COND3 is tried. If subse-
9852 quently BAZ fails, there are no more alternatives, so there is a back-
9853 track to whatever came before the entire group. If (*THEN) is not in-
9856 The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
9857 It is like (*MARK:NAME) in that the name is remembered for passing back
9861 A group that does not contain a | character is just a part of the en-
9862 closing alternative; it is not a nested alternation with only one al-
9870 If A and B are matched, but there is a failure in C, matching does not
9871 backtrack into A; instead it moves to the next alternative, that is, D.
9872 However, if the group containing (*THEN) is given an alternative, it
9877 The effect of (*THEN) is now confined to the inner group. After a fail-
9882 Note that a conditional group is not considered as having two alterna-
9883 tives, because only one is ever used. In other words, the | character
9889 If the subject is "ba", this pattern does not match. Because .*? is un-
9891 fails, the character "b" is matched, but "c" is not. At this point,
9893 the presence of the | character. The conditional group is part of the
9899 when subsequent matching fails. (*THEN) is the weakest, carrying on the
9902 character (for an unanchored pattern). (*SKIP) is similar, except that
9903 the advance may be more than one character. (*COMMIT) is the strongest,
9908 If more than one backtracking verb is present in a pattern, the one
9909 that is backtracked onto first acts. For example, consider this pat-
9917 is consistent, but is not always the same as Perl's. It means that if
9923 If there is a matching failure to the right, backtracking onto (*PRUNE)
9924 causes it to be triggered, and its action is taken. There can never be
9934 If the subject is "abac", Perl matches unless its optimizations are
9942 whether or not the assertion is standalone or acting as the condition
9951 If the assertion is a condition, (*ACCEPT) causes the condition to be
9957 effect is confined to the assertion, because Perl lookaround assertions
9958 are atomic. A backtrack that occurs after such an assertion is complete
9960 (*MARK) name that is set in an assertion is not "seen" by an instance
9970 The effect of (*THEN) is not allowed to escape beyond an assertion. If
9984 These behaviours occur whether or not the group is called recursively.
9989 ment of the other verbs in subroutines is different in some cases.
9996 tine. There is then a backtrack at the outer level.
10000 if there is no such group within the subroutine's group, the subroutine
10001 match fails and there is a backtrack at the outer level.
10046 the compiled version. However, there is one case where the memory usage
10049 maximum, the whole group is repeated in the compiled code. For example,
10054 is compiled as if it were
10058 (Technical aside: It is done this way so that backtrack points within
10062 is not usually a problem. However, if the numbers are large, and par-
10068 uses over 50KiB when compiled using the 8-bit library. When PCRE2 is
10070 limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
10071 libraries, and this is reached with the above pattern if the outer rep-
10072 etition is increased from 3 to 4. PCRE2 can be compiled to use larger
10073 internal pointers and thus handle larger compiled patterns, but it is
10076 One way of reducing the memory usage for such patterns is to make use
10083 this kind of pattern is not always exactly equivalent, because any cap-
10085 If this is not a problem, this kind of rewriting will allow you to
10103 On a 64-bit system the frame size for a pattern with no captures is 128
10110 initial heap allocation is obtained the first time any match data block
10111 is passed to pcre2_match(). This is remembered with the match data
10112 block and re-used if that block is used for another match. It is freed
10113 when the match data block itself is freed.
10115 The size of the initial block is the larger of 20KiB or ten times the
10116 pattern's frame size, unless the heap limit is less than this, in which
10117 case the heap limit is used. If the initial block proves to be too
10118 small during matching, it is replaced by a larger block, subject to the
10119 heap limit. The heap limit is checked only when a new block is to be
10141 ciently than others. It is more efficient to use a character class like
10144 required behaviour is usually the most efficient. Jeffrey Friedl's book
10149 Using Unicode character properties (the \p, \P, and \X escapes) is
10160 pcre2_match(); the performance loss is less with a DFA matching func-
10161 tion, and in both cases there is not much difference for \b.
10165 option is set, the pattern is implicitly anchored by PCRE2, since it
10168 tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is au-
10171 If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, be-
10185 tain newlines, the best performance is obtained by setting
10200 the pattern is such that the entire match is going to fail, PCRE2 has
10209 matching procedure, PCRE2 checks that there is a "b" later in the sub-
10210 ject string, and if there is not, it fails the match immediately. How-
10211 ever, when there is no following literal this optimization cannot be
10220 In many cases, the solution to this kind of performance issue is to use
10227 end of the data, and is the kind of pattern that might be used when
10229 either one character that is not "<" or a "<" that is not followed by
10230 "inet". However, each time a parenthesis is processed, a backtracking
10231 position is passed, so this formulation uses a memory frame for each
10232 matched character. For a long string, a lot of memory is required. Con-
10240 sessive quantifier is used to stop any backtracking into the runs of
10243 that is not followed by "inet" is encountered (and we assume this is
10247 long subject strings is to write repeated parenthesized subpatterns to
10253 matching, and on the amount of heap memory that is used. The default
10255 can be changed when PCRE2 is built, and they can also be set when
10256 pcre2_match() or pcre2_dfa_match() is called. For details of these in-
10262 that allow a pattern to match. This is done by repeatedly matching with
10329 On Unix-like systems the PCRE2 POSIX library is called libpcre2-posix,
10331 an application. Because the POSIX functions call the native ones, it is
10335 it is recommended that PCRE2POSIX_SHARED is defined before including
10351 which is the "correct" name, if there is no clash. It provides two
10365 options have been implemented. In addition, the option REG_EXTENDED is
10376 When PCRE2 is called via these functions, it is only the API that is
10380 that the API approximates to the POSIX definition; it is not fully
10381 POSIX-compatible, and in multi-unit encoding domains it is probably
10391 The function pcre2_regcomp() is called to compile a pattern into an in-
10392 ternal form. By default, the pattern is a C string terminated by a bi-
10393 nary zero (but see REG_PEND below). The preg argument is a pointer to a
10394 regex_t structure that is used as a base for storing information about
10395 the compiled regular expression. It is also used for input when
10396 REG_PEND is set. The regex_t structure used by pcre2_regcomp() is de-
10397 fined in pcre2posix.h and is not the same as the structure used by
10400 The argument cflags is either zero, or contains one or more of the bits
10405 The PCRE2_DOTALL option is set when the regular expression is passed
10406 for compilation to the native function. Note that REG_DOTALL is not
10411 The PCRE2_CASELESS option is set when the regular expression is passed
10416 The PCRE2_MULTILINE option is set when the regular expression is passed
10423 The PCRE2_LITERAL option is set when the regular expression is passed
10427 REG_NOSUB, REG_PEND, and REG_UTF. Note that REG_NOSPEC is not part of
10432 When a pattern that is compiled with this flag is passed to
10441 If this option is set, the re_endp field in the preg structure (which
10446 re_endp field is ignored. This is a GNU extension to the POSIX standard
10452 The PCRE2_UCP option is set when the regular expression is passed for
10455 ASCII values. Note that REG_UCP is not part of the POSIX standard.
10459 The PCRE2_UNGREEDY option is set when the regular expression is passed
10460 for compilation to the native function. Note that REG_UNGREEDY is not
10465 The PCRE2_UTF option is set when the regular expression is passed for
10468 Note that REG_UTF is not part of the POSIX standard.
10471 function. This means that the regex is compiled with PCRE2 default se-
10473 subject string is the Perl way, not the POSIX way. Note that setting
10478 The yield of pcre2_regcomp() is zero on success, and non-zero other-
10479 wise. The preg structure is filled in on success, and one other member
10480 of the structure (as well as re_endp) is public: re_nsub contains the
10484 NOTE: If the yield of pcre2_regcomp() is non-zero, you must not attempt
10486 to pcre2_regexec(), the result is undefined and your program is likely
10492 This area is not simple, because POSIX and Perl take different views of
10493 things. It is not possible to get PCRE2 to obey POSIX semantics, but
10506 This is the equivalent table for a POSIX-compatible pattern matcher:
10516 This behaviour is not what happens when PCRE2 is called via its POSIX
10517 API. By default, PCRE2's behaviour is the same as Perl's, except that
10518 there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
10519 and Perl, there is no way to stop newline from matching [^a].
10523 there is no way to make PCRE2 behave exactly as for the REG_NEWLINE ac-
10526 pcre2_compile(), and REG_DOTALL passes PCRE2_DOTALL. There is no way to
10532 The function pcre2_regexec() is called to match a compiled pattern preg
10533 against a given string, which is by default terminated by a zero byte
10539 The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match-
10544 The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2
10545 matching function. Note that REG_NOTEMPTY is not part of the POSIX
10551 The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match-
10556 When this option is set, the subject string starts at string +
10559 ros within the subject string, and indeed, using REG_STARTEND is the
10568 This is a BSD extension, compatible with but not specified by IEEE
10572 length of the string, not how it is matched. Setting REG_STARTEND and
10573 passing pmatch as NULL are mutually exclusive; the error REG_INVARG is
10577 matched strings is returned. The nmatch and pmatch arguments of
10582 less REG_STARTEND is set); in both these cases no data about any
10583 matched strings is returned.
10601 fined in the header file, of which REG_NOMATCH is the "expected" fail-
10608 pcre2_regcomp() or pcre2_regexec() to a printable message. If preg is
10610 A message terminated by a binary zero is placed in errbuf. If the
10611 buffer is too short, only the first errbuf_size - 1 characters of the
10612 error message are used. The yield of the function is the size of buffer
10614 value is greater than errbuf_size if the message was truncated.
10653 PCRE2 is supplied in the file pcre2demo.c in the src directory in the
10654 PCRE2 distribution. A listing of this program is given in the pcre2demo
10658 The demonstration program compiles the regular expression that is its
10665 If the -g option is given on the command line, the program then goes on
10667 subject string. The logic is a little bit tricky because of the possi-
10669 is going on.
10671 The code in pcre2demo.c is an 8-bit program that uses the PCRE2 8-bit
10678 If PCRE2 is installed in the standard include and library directories
10684 If PCRE2 is installed elsewhere, you may need to add additional options
10698 Note that there is a much more comprehensive test program, called
10701 though not all three need be installed). The pcre2demo program is pro-
10704 If you try to run pcre2demo when PCRE2 is not installed in the standard
10711 This is caused by the way shared library support works on those sys-
10758 form instead of having to compile them every time the application is
10760 it is not possible to save and reload the JIT data, because it is posi-
10770 output is really just a bytecode dump, which is why it can only be re-
10779 The facility for saving and restoring compiled patterns is intended for
10781 pcre2_serialize_decode() is expected to be trusted data, not data from
10782 arbitrary external sources. There is only some simple consistency
10783 checking, not complete validation of what is being re-loaded. Corrupted
10785 pattern in the serialized data is corrupted, the deserializing code may
10786 read beyond the end of the byte stream that is passed to it.
10794 use the same character tables. A single copy of the tables is included
10795 in the byte stream (its size is 1088 bytes). For more details of char-
10804 respectively. The final argument is a pointer to a general context,
10806 this argument is NULL, malloc() is used to obtain memory for the byte
10807 stream. The yield of the function is the number of serialized patterns,
10810 PCRE2_ERROR_BADDATA the number of patterns is zero or less
10814 PCRE2_ERROR_NULL the 1st, 3rd, or 4th argument is NULL
10821 appropriate manner. Here is sample code that compiles two patterns and
10823 that is open for output. The error checking that should be present in a
10839 Note that the serialized data is binary data that may contain any of
10841 tween binary and non-binary data, be sure that the file is opened for
10848 alize_free(). If this function is called with a NULL argument, it re-
10856 from a file). The management of this memory block is up to the applica-
10868 nal argument is a pointer to a general context, which can be used to
10870 this argument is NULL, malloc() and free() are used. After deserializa-
10871 tion, the byte stream is no longer needed and can be discarded.
10878 If the vector is not large enough for all the patterns in the byte
10879 stream, it is filled with those that fit, and the remainder are ig-
10880 nored. The yield of the function is the number of decoded patterns, or
10883 PCRE2_ERROR_BADDATA second argument is zero or less
10888 PCRE2_ERROR_NULL first or third argument is NULL
10890 PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was
10894 freed by calling pcre2_code_free(). However, be aware that there is a
10897 gle copy of the character tables is used by all the decoded patterns
10898 and a reference count is used to arrange for its memory to be automati-
10899 cally freed when the last pattern is freed, but there is no locking on
10906 ized, the JIT data is discarded and so is no longer available after a
10945 \x where x is non-alphanumeric is a literal x
10948 Note that white space inside \Q...\E is always treated as literal, even
10949 if PCRE2_EXTENDED is set, causing most other white space to be ignored.
10958 after the comma. The exception is \u{...} which is not Perl-compatible
10959 and is recognized only when PCRE2_EXTRA_ALT_BSUX is set. This is an EC-
10968 \a alarm, that is, the BEL character (hex 07)
10969 \cx "control-x", where x is a non-control ASCII character
10982 If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
10989 When \x is not followed by {, from zero to two hexadecimal digits are
10992 literal "x". Likewise, if \u (in ALT_BSUX mode) is not followed by
10996 Note that \0dd is always an octal code. The treatment of backslash fol-
10997 lowed by a non-zero digit is complicated; for details see the section
11000 \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in
11011 \D a character that is not a decimal digit
11013 \H a character that is not a horizontal white space character
11014 \N a character that is not a newline
11019 \S a character that is not a white space character
11021 \V a character that is not a vertical white space character
11026 \C is dangerous because it may leave the current matching point in the
11028 use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also
11033 matching is happening, \s and \w may also match characters with code
11034 points in the range 128-255. If the PCRE2_UCP option is set, the behav-
11035 iour of these escape sequences is changed to use Unicode properties and
11109 Unicode defines a number of binary properties, that is, properties
11183 default, but some of them use Unicode properties if PCRE2_UCP is set.
11216 (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
11231 From release 10.38 \K is not permitted by default in lookaround asser-
11233 LOW_LOOKAROUND_BSK option is set, the previous behaviour is re-enabled.
11234 When this option is set, \K is honoured in positive assertions, but ig-
11292 ever, it means that (?-aP) is really (?-PT) which disables all ASCII
11296 a mixture of setting and unsetting such as (?i-x) is allowed, but there
11297 may be only one hyphen. Setting (but no unsetting) is allowed after (?^
11303 of them may appear. For the first three, d is a decimal number.
11319 pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
11367 characters, the maximum for each branch is limited to a value set by
11368 the caller of pcre2_compile() or defaulted. The default is set when
11369 PCRE2 is built (ultimate default 255). If every branch matches a fixed
11370 number of characters, the limit for each branch is 65535 characters.
11448 conditions or recursion tests. Such a condition is interpreted as a
11455 (*MARK) the name is mandatory, for the others it is optional. (*SKIP)
11456 changes its behaviour if :NAME is present. The others just set a name
11457 for passing back to the caller, but this is not a name that (*SKIP) can
11467 so only if the pattern is not anchored.
11473 (*MARK:NAME); if not found, the (*SKIP) is ignored
11476 The effect of one of these verbs in a group called as a subroutine is
11525 PCRE2 is normally built with Unicode support, though if you do not need
11529 format (depending on the code unit width), but this is not the default.
11535 is constrained. The program can call pcre2_compile() with the PCRE2_UTF
11538 That is, the programmer can prevent the supplier of the pattern from
11552 When PCRE2 is built with Unicode support, the escape sequences \p{..},
11553 \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set-
11563 For example, \p{L} matches a letter. Its longer synonym, \p{Letter}, is
11565 prefixed by "Is", for compatibility with Perl 5.6. PCRE2 does not sup-
11576 The escape sequence \N{U+<hex digits>} is recognized as another way of
11577 specifying a Unicode character by code point in a UTF mode. It is not
11592 documentation). For this reason, there is a build-time option that dis-
11593 ables support for \C completely. There is also a less draconian com-
11594 pile-time option for locking out the use of \C when a pattern is com-
11597 The use of \C is not supported by the alternative matching function
11598 pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
11601 support \C in these modes. If JIT optimization is requested for a UTF-8
11603 pcre2_match() is called, the matching will be carried out by the inter-
11610 mains true even when PCRE2 is built to include Unicode support, because
11616 capes work is changed so that Unicode properties are used to determine
11622 classes are all low-valued characters unless the PCRE2_UCP option is
11623 set, but there is an option to override this.
11628 is set.
11633 If either PCRE2_UTF or PCRE2_UCP is set, upper/lower case processing
11636 these, a direct table lookup is used for speed. A few Unicode charac-
11646 PCRE2_EXTRA_CASELESS_RESTRICT option. When this is set, all characters
11655 within the parentheses is a script run. In concept, a script run is a
11658 diacritical and other marks are used with multiple scripts, it is not
11665 "Unknown" is used for code points that have not been assigned, and also
11671 "Common" is used for characters that are used with many scripts. These
11675 "Inherited" is used for characters such as diacritical marks that mod-
11681 U+102E0 (Coptic Epact thousands mark) is used only with Arabic and Cop-
11683 called Script Extension exists. Its value is a list of scripts that ap-
11686 characters such as U+102E0 more than one Script is listed. There are
11691 string of characters is a script run. Note, however, that there are
11698 A string that is less than two characters long is a script run. This is
11703 If a character's Script Extension property is the single value "Inher-
11704 ited", it is always accepted as part of a script run. This is also true
11711 A simple example is an Internet name such as "google.com". The letters
11712 are all in the Latin script, and the dot is Common, so this string is a
11714 the Latin "o"; a string that looks the same, but with Cyrillic "o"s is
11731 The Chinese Han script is commonly used in conjunction with other
11755 When the PCRE2_UTF option is set, the strings passed as patterns and
11757 functions. If an invalid UTF string is passed, a negative error code is
11760 which is used for this purpose after a UTF error.
11764 mance, for example in the case of a long subject string that is being
11767 it is given (respectively) contains only valid UTF code unit sequences.
11769 If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
11770 result is undefined and your program may crash or loop indefinitely or
11771 give incorrect results. There is, however, one mode of matching that
11772 can handle invalid UTF subject strings. This is enabled by passing
11773 PCRE2_MATCH_INVALID_UTF to pcre2_compile() and is discussed below in
11775 PCRE2_MATCH_INVALID_UTF is not set.
11786 Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any
11788 pcre2_dfa_match() calls with a non-zero starting offset, the check is
11790 matching, and there is a check that the starting offset points to the
11798 In addition to checking the format of the string, there is a check to
11808 other words, the whole surrogate thing is a fudge for UTF-16 which un-
11812 that is given if an escape sequence for an invalid Unicode code point
11813 is encountered in the pattern. If you want to allow escape sequences
11815 TRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is possible
11832 nally defined by RFC 2279) allows for up to 6 bytes, and this is
11842 the character do not have the binary value 0b10 (that is, either the
11843 most significant bit is 0, or the next bit is 1).
11848 A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
11868 A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
11869 for a value that can be represented by fewer bytes, which is invalid.
11876 binary value 0b10 (that is, the most significant bit is 1 and the sec-
11877 ond is 0). Such a byte can only validly occur as the second or subse-
11901 PCRE2_ERROR_UTF32_ERR2 Code point is greater than 0x10ffff
11908 VALID_UTF option. This is supported by pcre2_match(), including JIT
11909 matching, but not by pcre2_dfa_match(). When PCRE2_MATCH_INVALID_UTF is
11921 generates, but if pcre2_jit_compile() is subsequently called, it does
11922 generate different code. If JIT is not used, the option affects the be-
11924 VALID_UTF is set at compile time, PCRE2_NO_UTF_CHECK is ignored at
11936 pattern is matched fragment by fragment. The result of a successful
11937 match, however, is given as code unit offsets in the entire subject
11944 If pcre2_match() is called with an offset that points to an invalid
11945 UTF-sequence, that sequence is skipped, and the match starts at the
11950 \bWORD\b would match an instance of WORD that is surrounded by invalid