Lines Matching +full:turing +full:- +full:complete
1 -----------------------------------------------------------------------------
8 -----------------------------------------------------------------------------
16 PCRE2 - Perl-compatible regular expressions (revised API)
26 API is more extensible, and it was simplified by abolishing the sepa-
32 As well as Perl-style regular expression patterns, some features that
39 The source code for PCRE2 can be compiled to support strings of 8-bit,
40 16-bit, or 32-bit code units, which means that up to three separate li-
43 64-bit environment that also supports 32-bit applications, versions of
44 PCRE2 that are compiled in both 64-bit and 32-bit modes may be needed.
46 The original work to extend PCRE to 16-bit and 32-bit code units was
49 unit, or as UTF-encoded Unicode, with support for Unicode general cate-
55 pcre2test -C
58 ending in _8, _16, or _32, respectively (for example, pcre2_com-
64 In addition to the Perl-compatible matching function, PCRE2 contains an
65 alternative function that matches the same compiled patterns in a dif-
77 client to discover which features are available. The features them-
78 selves are described in the pcre2build page. Documentation about build-
80 NON-AUTOTOOLS_BUILD files in the source distribution.
93 If you are using PCRE2 in a non-UTF application that permits users to
96 For example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
97 mode, which interprets patterns and subjects as strings of UTF-8 code
98 units instead of individual 8-bit characters. This causes both the pat-
99 tern and any data against which it is matched to be checked for UTF-8
100 validity. If the data string is very long, such a check might use suf-
101 ficiently many resources as to cause your application to lose perfor-
104 One way of guarding against this possibility is to use the pcre2_pat-
107 calling pcre2_compile(). This causes a compile time error if the pat-
108 tern contains a UTF-setting sequence.
111 be enabled from within the pattern, by specifying "(*UCP)". This fea-
119 The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
121 middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C op-
123 compile-time error if it is encountered. It is also possible to build
128 Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
131 pcre2_set_depth_limit() that can be used to restrict the amount of mem-
137 The user documentation for PCRE2 comprises a number of different sec-
143 (which is a program listing), and the short pages for individual func-
144 tions, are concatenated in pcre2.txt, for ease of searching. The sec-
148 pcre2-config show PCRE2 installation configuration information
155 pcre2grep description of the pcre2grep command (8-bit only)
156 pcre2jit discussion of just-in-time optimization support
163 pcre2posix the POSIX-compatible C API for the 8-bit library
187 Copyright (c) 1997-2021 University of Cambridge.
191 ------------------------------------------------------------------------------
199 PCRE2 - Perl-compatible regular expressions (revised API)
204 contains a description of all its native functions. See the pcre2 docu-
476 These functions provide a way of converting non-PCRE2 patterns into
477 patterns that can be processed by pcre2_compile(). This facility is ex-
483 PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
485 There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
488 for all three libraries. One, two, or all three can be installed simul-
489 taneously. On Unix-like systems the libraries are called libpcre2-8,
490 libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
508 Character strings are passed to a PCRE2 library as sequences of un-
511 specified as zero-terminated.
514 macros are defined whose names are the generic forms such as pcre2_com-
516 PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific func-
534 single library. For example, if you want to run a match using a pat-
546 There are also some wrapper functions for the 8-bit library that corre-
548 to all the functionality of PCRE2 and they are not thread-safe. They
559 program against a non-dll PCRE2 library, you must define PCRE2_STATIC
563 and matching regular expressions in a Perl-compatible manner. A sample
570 passed as bits in an options argument. There are also some more compli-
571 cated parameters such as custom memory management functions and re-
576 Just-in-time (JIT) compiler support is an optional feature of PCRE2
578 speeds up the matching performance of many patterns. Programs can re-
585 pcre2_jit_stack_assign() in order to control the JIT code's memory us-
591 less sanity checking. The JIT-specific functions are discussed in the
594 A second matching function, pcre2_dfa_match(), which is not Perl-com-
598 there are lookaround assertions). However, this algorithm does not re-
617 pcre2_substring_free() and pcre2_substring_list_free() are also pro-
619 functions is called with a NULL argument, the function returns immedi-
629 Finally, there are functions for finding out information about a com-
644 ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated
646 handled is one less than this maximum. Note that string lengths are al-
647 ways given in code units. Only in the 8-bit library is such a length
654 strings: a single CR (carriage return) character, a single LF (line-
655 feed) character, the two-character sequence CRLF, any of the three pre-
664 Unix standard. However, the newline convention can be changed by an ap-
665 plication when calling pcre2_compile(), or it can be specified by spe-
667 settings. See the pcre2pattern page for details of the special charac-
673 dollar metacharacters, the handling of #-comments in /x mode, and, when
674 CRLF is a recognized line ending sequence, the match position advance-
675 ment for a non-anchored pattern. There is more detail about this in the
685 In a multithreaded application it is important to keep thread-specific
687 library code itself is thread-safe: it contains no static or global
688 variables. The API is designed to be fairly simple for non-threaded ap-
689 plications while at the same time ensuring that multithreaded applica-
692 There are several different blocks of data that are used to pass infor-
700 is thread-safe, that is, the same compiled pattern can be used by more
703 use them. However, if the just-in-time (JIT) optimization feature is
714 Get a read-only (shared) lock (mutex) for pointer
726 The reason for checking the pointer a second time is as follows: Sev-
742 Get a read-only (shared) lock (mutex) for pointer
755 If JIT is being used, but the JIT compilation is not being done immedi-
760 pcre2_code_copy() or pcre2_code_copy_with_tables() can be used to ob-
761 tain a private copy of the compiled code before calling the JIT com-
774 In a multithreaded application, if the parameters in a context are val-
777 it must make its own thread-specific copy.
782 of a match. This includes details of what was matched, as well as addi-
791 memory management or non-standard character tables. To keep function
795 that holds the parameter values. Applications that do not need to ad-
800 relevant for several PCRE2 operations, a compile-time context, and a
801 match-time context.
805 At present, this context just contains pointers to (and data for) ex-
825 function may be NULL, in which case the system memory management func-
828 might be.) The private_malloc() function is used (if supplied) to ob-
851 A compile context is required if you want to provide an external func-
853 values of any of the following compile-time parameters:
862 A compile context is also required if you are using custom memory man-
863 agement. If none of these apply, just pass NULL as the context argu-
866 A compile context is created, copied, and freed by the following func-
894 only argument is a general context. This function builds a set of char-
900 As PCRE2 has developed, almost all the 32 option bits that are avail-
903 bits which are used for some newer, assumed rarer, options. This func-
905 It does not modify any existing setting. The available options are de-
915 largest number that a PCRE2_SIZE variable can hold, which is effec-
926 largest number that a PCRE2_SIZE variable can hold, which is effec-
933 variable-length lookbehind assertion. The default is set when PCRE2 is
934 built, with the ultimate default being 255, the same as Perl. Lookbe-
940 This specifies which characters or character sequences are to be recog-
943 two-character sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
950 When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EX-
962 stops rogue patterns using up too much system stack when being com-
963 piled. The limit applies to parentheses of all kinds, not just captur-
979 nesting, and the second is user data that is set up by the last argu-
981 should return zero if all is well, or non-zero to force an error.
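
The guard-function contract described in the lines above (nesting depth as the first argument, user data as the second, zero to continue and non-zero to force an error) can be sketched as follows. This is a minimal illustration, not code from the PCRE2 distribution: the nesting_policy structure and the nesting_guard name are hypothetical, and the function is written so that it can be exercised without linking against PCRE2. In a real program it would be registered with pcre2_set_compile_recursion_guard() on a compile context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user data: the deepest parenthesis nesting the
   application is willing to accept. */
typedef struct {
    uint32_t max_depth;
} nesting_policy;

/* Guard callback. The first argument is the current depth of
   parenthesis nesting; the second is the user data pointer supplied
   when the guard was registered. Returns zero if all is well, or
   non-zero to force a compile error. */
int nesting_guard(uint32_t depth, void *user_data)
{
    const nesting_policy *policy = user_data;
    assert(policy != NULL);   /* this sketch requires user data */
    return depth > policy->max_depth ? 1 : 0;
}

/* Registration (requires PCRE2, shown as a comment only):
     pcre2_set_compile_recursion_guard(ccontext, nesting_guard, &policy);
*/
```

With a policy of { 3 }, a call such as nesting_guard(4, &policy) returns non-zero, which would make the compile fail, while depths 0 to 3 return zero.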
997 A match context is created, copied, and freed by the following func-
1017 during a matching operation. Details are given in the pcre2callout doc-
1024 This sets up a callout function for PCRE2 to call after each substitu-
1025 tion made by pcre2_substitute(). Details are given in the section enti-
1031 The offset_limit parameter limits how far an unanchored search can ad-
1033 pcre2_match() and pcre2_dfa_match() functions return PCRE2_ERROR_NO-
1043 When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT op-
1045 code can be compiled. If a match is started with a non-default match
1062 also applies to pcre2_dfa_match(), which may use the heap when process-
1064 atomic groups. This limit does not apply to matching with the JIT opti-
1076 where ddd is a decimal number. However, such a setting is ignored un-
1082 pcre2_match() uses the heap are given in the pcre2perform documenta-
1085 For pcre2_dfa_match(), a vector on the system stack is used when pro-
1093 The match_limit parameter provides a means of preventing PCRE2 from us-
1109 is entirely different. However, there is still the possibility of run-
1114 The default value for the limit can be set when PCRE2 is built; the de-
1121 where ddd is a decimal number. However, such a setting is ignored un-
1148 If the depth of internal recursive function calls is great enough, lo-
1149 cal workspace vectors are allocated on the heap from version 10.32 on-
1153 deal of memory. However, it is probably better to limit heap usage di-
1165 where ddd is a decimal number. However, such a setting is ignored un-
1170 CHECKING BUILD-TIME OPTIONS
1180 required. The second argument is a pointer to memory into which the in-
1189 non-negative on success, or the negative error code PCRE2_ERROR_BADOP-
1190 TION if the value in the first argument is not recognized. The follow-
1197 PCRE2_BSR_UNICODE means that \R matches any Unicode line ending se-
1204 unit widths were selected when PCRE2 was built. The 1-bit indicates
1205 8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup-
1212 recursions, lookarounds, and atomic groups in pcre2_dfa_match(). Fur-
1225 just-in-time compiling is included in the library; otherwise it is set
1227 that JIT will be used for any given match. See the pcre2jit documenta-
1236 compiler is configured, for example "x86 32bit (little endian + un-
1247 the 16-bit library is compiled, a value of 3 is rounded up to 4, and
1248 when the 32-bit library is compiled, internal linkages always use 4
1251 The default value of 2 for the 8-bit and 16-bit libraries is sufficient
1287 The output is a uint32_t integer that gives the maximum depth of nest-
1291 take into account the stack that may already be used by the calling ap-
1297 This parameter is obsolete and should not be used in new code. The out-
1302 The output is a uint32_t integer that gives the length of PCRE2's char-
1327 PCRE2 version string, zero-terminated. The number of code units used is
1328 returned. This is the length of the string plus one unit for the termi-
1346 length in code units. If the pattern is zero-terminated, the length can
1348 length of zero is treated as an empty string (NULL with a non-zero
1353 If the compile context argument ccontext is NULL, memory for the com-
1354 piled pattern is obtained by calling malloc(). Otherwise, it is ob-
1355 tained from the same memory function that was used for the compile con-
1357 it is no longer needed. If pcre2_code_free() is called with a NULL ar-
1362 However, if the code has been processed by the JIT compiler (see be-
1363 low), the JIT information cannot be copied (because it is position-de-
1364 pendent). The new copy can initially be used only for non-JIT match-
1369 a multithreaded application to acquire a private copy of shared com-
1378 pointing to the new tables. The memory for the new tables is automati-
1390 described in the section entitled "Option bits for pcre2_match()" be-
1394 that affect the compilation. It should be zero if none of them are re-
1397 well) can also be set and unset from within the pattern (see the de-
1400 For those options that can be different in different parts of the pat-
1406 Some additional options and less frequently required compile-time para-
1410 If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
1412 error code and an offset (number of code units) within the pattern, re-
1413 spectively, when pcre2_compile() returns NULL because a compilation er-
1416 There are nearly 100 positive error codes that pcre2_compile() may re-
1418 error codes that are used for invalid UTF strings when validity check-
1421 There is no separate documentation for the positive error codes, be-
1423 pcre2_get_error_message() function (see "Obtaining a textual error mes-
1424 sage" below) should be self-explanatory. Macro names starting with
1427 that returns the message "no error" if passed to pcre2_get_error_mes-
1430 The value returned in erroroffset is an indication of where in the pat-
1432 non-zero value is not necessarily the furthest point in the pattern
1435 assertion. For an invalid UTF-8 or UTF-16 string, the offset is that of
1441 mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
1444 This code fragment shows a typical straightforward call to pcre2_com-
1452 PCRE2_ZERO_TERMINATED, /* the pattern is zero-terminated */
1485 (1) \U matches an upper case "U" character; by default \U causes a com-
1500 using the PCRE2_EXTRA_ALT_BSUX extra option (see "Extra compile op-
1502 to patterns. Neither of these options affects the processing of re-
1511 Perl. If you want a multiline circumflex also to match after a termi-
1517 such as (*MARK:NAME) is any sequence of characters that does not in-
1519 it is not possible to include a closing parenthesis in the name. How-
1520 ever, if the PCRE2_ALT_VERBNAMES option is set, normal backslash pro-
1521 cessing is applied to verb names and only an unescaped closing paren-
1525 whitespace in verb names is skipped and #-comments are recognized, ex-
1531 items, all with number 255, before each pattern item, except immedi-
1532 ately before or after an explicit callout in the pattern. For discus-
1543 characters, K and S, that, in addition to their lower case ASCII equiv-
1544 alents, are case-equivalent with U+212A (Kelvin sign) and U+017F (long
1545 S) respectively. If you do not want this case equivalence, you can sup-
1551 (available only in 16-bit or 32-bit mode) are treated as not having an-
1568 this option, a dot does not match when the current position in the sub-
1570 and it can be changed within a pattern by a (?s) option setting. A neg-
1572 escape sequence always matches a non-newline character, independent of
1589 patterns, a new match is then tried at the next starting point. How-
1604 matches, which are necessarily substrings of the first one, must obvi-
1609 If this bit is set, most white space characters in the pattern are to-
1611 a \Q...\E sequence. However, white space is not allowed within se-
1615 quantifier and a following + that indicates possessiveness. PCRE2_EX-
1619 When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recog-
1621 256 that are flagged as white space in its low-character table. The ta-
1628 When PCRE2 is compiled with Unicode support, in addition to these char-
1629 acters, five more Unicode "Pattern White Space" characters are recog-
1630 nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-
1631 right mark), U+200F (right-to-left mark), U+2028 (line separator), and
1637 As well as ignoring most white space, PCRE2_EXTENDED also causes char-
1644 Which characters are interpreted as newlines can be specified by a set-
1646 special sequence at the start of the pattern, as described in the sec-
1652 This option has the effect of PCRE2_EXTENDED, but, in addition, un-
1653 escaped space and horizontal tab characters are ignored inside a char-
1655 set of pattern white space characters that are ignored outside a char-
1663 start of matching, though the matched text may continue over the new-
1664 line. If startoffset is non-zero, the limiting newline is not necessar-
1666 string is "abc\nxyz" (where \n represents a single-character newline) a
1676 If this option is set, all meta-characters in the pattern are disabled,
1679 you are doing a lot of literal matching and are worried about effi-
1684 PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EX-
1692 sequences. Note, however, that the 16-bit and 32-bit PCRE2 libraries
1694 cannot find valid UTF sequences within an arbitrary string of bytes un-
1695 less such sequences are suitably aligned. This facility is not sup-
1696 ported for DFA matching. For details, see the pcre2unicode documenta-
1703 alternative to fail). A pattern such as (\1)(a) succeeds when this op-
1715 string, or before a terminating newline (except when PCRE2_DOLLAR_EN-
1717 character" metacharacter (.) does not match at a newline. This behav-
1733 This option locks out the use of \C in the pattern that is being com-
1734 piled. This escape can cause unpredictable behaviour in UTF-8 or
1735 UTF-16 modes, because it may leave the current matching point in the
1736 middle of a multi-code-unit character. This option may be useful in ap-
1738 is also a build-time option that permanently locks out the use of \C.
1752 This option locks out interpretation of the pattern as UTF-8, UTF-16,
1753 or UTF-32, depending on which library is in use. In particular, it pre-
1755 by starting the pattern with (*UTF). This option may be useful in ap-
1761 If this option is set, it disables the use of numbered capturing paren-
1772 If this option is set, it disables "auto-possessification", which is an
1775 are in use, auto-possessification means that some callouts are never
1783 .* is the first significant item in a top-level branch of a pattern,
1803 the matching code searches the subject for that value, and fails imme-
1804 diately if it cannot find it, without actually running the main match-
1808 items are in use, these "start-up" optimizations can cause them to be
1809 skipped if the pattern is never actually used. The start-up optimiza-
1810 tions are in effect a pre-scan of the subject that takes place before
1813 The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
1826 start-up optimization scans along the subject, finds "A" and runs the
1827 first match attempt from there. The (*COMMIT) item means that the pat-
1832 (*COMMIT) prevents any further matches being tried, so the overall re-
1835 As another start-up optimization makes use of a minimum length for a
1842 match "BB", which is long enough. In the process, (*MARK:2) is encoun-
1844 found, but there is only one character left, so there are no more at-
1857 UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode
1863 PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an in-
1871 Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis-
1872 able the error that is given if an escape sequence for an invalid Uni-
1873 code code point is encountered in the pattern. In particular, the so-
1877 section entitled "Extra compile options" below. However, this is pos-
1878 sible only in UTF-8 and UTF-32 modes, because these values are not rep-
1879 resentable in UTF-16.
1891 The second effect of PCRE2_UCP is to force the use of Unicode proper-
1893 This makes it possible to process strings in the 16-bit UCS-2 code.
1895 support (which is the default). The PCRE2_EXTRA_CASELESS_RESTRICT op-
1897 match only ASCII characters and non-ASCII characters match only non-
1910 is going to be used to set a non-default offset limit in a match con-
1912 offset limit is set without this option. For more details, see the de-
1920 instead of single-code-unit strings. It is available when PCRE2 is
1922 support is not available, the use of this option provokes an error. De-
1935 assertions, following Perl's lead. This option is provided to re-enable
1941 This option applies when compiling a pattern in UTF-8 or UTF-32 mode.
1942 It is forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode
1944 in UTF-16 to encode code points with values in the range 0x10000 to
1945 0x10ffff. The surrogates cannot therefore be represented in UTF-16.
1946 They can be represented in UTF-8 and UTF-32, but are defined as invalid
1947 code points, and cause errors if encountered in a UTF-8 or UTF-32
1952 when using PCRE2 to check for unwanted characters in UTF-8 strings, ex-
1954 PCRE2_NO_UTF_CHECK option does not disable the error that occurs, be-
1957 If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surro-
1958 gate code point values in UTF-8 and UTF-32 patterns no longer provoke
1966 \x in the way that ECMAscript (aka JavaScript) does. Additional func-
1969 as a hexadecimal character code, where hhh.. is any number of hexadeci-
1975 is set. It can be changed within a pattern by means of the (?aD) op-
2002 to ensure that (?-aP) unsets all ASCII restrictions for POSIX classes.
2007 escape such as \j or a malformed one such as \x{2z} causes a compile-
2008 time error when detected by pcre2_compile(). Perl is somewhat inconsis-
2010 "j", and non-hexadecimal digits in \x{} are just ignored, though warn-
2011 ings are given in both cases if Perl's warning switch is enabled. How-
2017 treated as single-character escapes. For example, \j is a literal "j"
2018 and \x{2z} is treated as the literal string "x{2z}". Setting this op-
2023 is not supported in a character class. To reiterate: this is a danger-
2030 are two case-equivalent character sets that contain both ASCII and non-
2031 ASCII characters. The ASCII letter S is case-equivalent to U+017f (long
2032 S) and the ASCII letter K is case-equivalent to U+212a (Kelvin sign).
2033 This option disables recognition of case-equivalences that cross the
2034 ASCII/non-ASCII boundary. In a caseless match, both characters must ei-
2035 ther be ASCII or non-ASCII. The option can be changed with a pattern by
2043 of a CR (carriage return) character. The option does not affect a lit-
2049 This option is provided for use by the -x option of pcre2grep. It
2050 causes the pattern only to match complete lines. This is achieved by
2051 automatically inserting the code for "^(?:" at the start of the com-
2053 the matched line may be in the middle of the subject string. This op-
2058 This option is provided for use by the -w option of pcre2grep. It
2066 JUST-IN-TIME (JIT) COMPILATION
2086 just-in-time compiler is available, further processes a compiled pat-
2092 for patterns to be analyzed, and for one-off matches and simple pat-
2108 code points are less than 256. By default, higher-valued code points
2114 \w and friends to use Unicode property support instead of the built-in
2115 tables. PCRE2_UCP also causes upper/lower casing operations on charac-
2125 PCRE2 contains a built-in set of character tables that are used by de-
2126 fault. These are sufficient for many applications. Normally, the in-
2129 default "C" locale of the local system, which may cause them to be dif-
2132 The built-in tables can be overridden by tables supplied by the appli-
2134 from the default. As more and more applications change to using Uni-
2155 The locale name "fr_FR" is used on Linux and other Unix-like systems;
2174 or whether the processor is 32-bit or 64-bit. A copy of the result of
2176 re-used later, even in a different program or on another computer. The
2181 used stand-alone to create a file that contains a set of binary tables.
2191 The first argument for pcre2_pattern_info() is a pointer to the com-
2193 is required, and the third argument is a pointer to a variable to re-
2197 the function is zero for success, or one of the following negative num-
2207 typical call of pcre2_pattern_info(), to obtain the length of the com-
2225 to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the op-
2226 tions that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
2227 TIONS returns the compile options as modified by any top-level (*XXX)
2230 compile context by calling the pcre2_set_compile_extra_options() func-
2233 For example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EX-
2241 PCRE2 if the first significant item in every top-level branch is one of
2247 .* sometimes - see below
2259 For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
2290 been set, the call to pcre2_pattern_info() returns the error PCRE2_ER-
2297 In the absence of a single first code unit for a non-anchored pattern,
2298 pcre2_compile() may construct a 256-bit table that defines a fixed set
2302 means "any code unit of value 255 or above". If such a table was con-
2309 a non-anchored pattern. The third argument should point to a uint32_t
2321 The third argument should point to a uint32_t variable. In the 8-bit
2322 library, the value is always less than 256. In the 16-bit library the
2323 value can be up to 0xffff. In the 32-bit library in UTF-32 mode the
2324 value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
2333 in the pattern. Each additional capture group adds two PCRE2_SIZE vari-
2346 \r or \n or one of the equivalent hexadecimal or octal escape se-
2352 (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu-
2360 Return 1 if the (?J) or (?-J) option setting is used in the pattern,
2362 (?J) and (?-J) set and unset the local PCRE2_DUPNAMES option, respec-
2367 If the compiled pattern was successfully processed by pcre2_jit_com-
2387 PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu-
2393 third argument should point to a uint32_t variable. When a pattern con-
2395 whether or not it can match an empty string. PCRE2 takes a cautious ap-
2401 (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third ar-
2403 set, the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UN-
2405 less than the limit set or defaulted by the caller of the match func-
2411 code units) when it starts to process each of its branches. This re-
2413 should point to a uint32_t integer. The simple assertions \b and \B re-
2414 quire a one-character lookbehind and cause PCRE2_INFO_MAXLOOKBEHIND to
2415 return 1 in the absence of anything longer. \A also registers a one-
2419 Note that this information is useful for multi-segment matching only if
2423 one character, then the nested lookbehind also moves back by two char-
2427 multi-segment matching.
2444 PCRE2 supports the use of named as well as numbered capturing parenthe-
2445 ses. The names are just an additional way of identifying the parenthe-
2447 pcre2_substring_get_byname() are provided for extracting captured sub-
2451 do the conversion, you need to use the name-to-number map, which is de-
2454 The map consists of a number of fixed-size entries. PCRE2_INFO_NAME-
2460 This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit li-
2461 brary, the first two bytes of each entry are the number of the captur-
2462 ing parenthesis, most significant byte first. In the 16-bit library,
2463 the pointer points to 16-bit code units, the first of which contains
2464 the parenthesis number. In the 32-bit library, the pointer points to
2465 32-bit code units, the first of which contains the parenthesis number.
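
The 8-bit entry layout described above (a two-byte group number, most significant byte first, at the start of each fixed-size entry) can be decoded with a few lines of C. The sketch below is an illustration only: find_group_number is a hypothetical helper, it assumes that the rest of each entry is the zero-terminated group name, and it takes the table pointer, entry count, and entry size as plain arguments so that it runs without PCRE2. In a real program those three values would come from pcre2_pattern_info() with PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT, and PCRE2_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Search an 8-bit-library name table for a group name.
   table       start of the name table
   name_count  number of entries in the table
   entry_size  size of each entry in bytes
   name        zero-terminated group name to look for
   Returns the group number, or 0 if the name is not present
   (capture group numbers start at 1, so 0 is never a valid hit). */
unsigned int find_group_number(const uint8_t *table, uint32_t name_count,
                               uint32_t entry_size, const char *name)
{
    assert(entry_size >= 3);  /* 2 number bytes + at least a terminator */
    for (uint32_t i = 0; i < name_count; i++) {
        const uint8_t *entry = table + (size_t)i * entry_size;
        /* Group number: two bytes, most significant byte first. */
        unsigned int number = ((unsigned int)entry[0] << 8) | entry[1];
        /* Assumed: the name follows immediately, zero-terminated. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return number;
    }
    return 0;
}
```

For a table with 8-byte entries holding "date" (1), "day" (5), "month" (4), and "year" (2), a lookup of "month" returns 4. If several entries share a name, this sketch returns the first one it meets; a linear scan is used for simplicity.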
2469 capture groups with the same number, as described in the section on du-
2474 Duplicate names for capture groups with different numbers are permit-
2478 necessarily the case because later capture groups may have lower num-
2482 pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED
2483 is set, so white space - including newlines - is ignored):
2485 (?<date> (?<year>(\d\d)?\d\d) -
2486 (?<month>\d\d) - (?<day>\d\d) )
2490 with non-printing bytes shows in hexadecimal, and undefined bytes shown
2499 name-to-number map, remember that the length of the entries is likely
2513 This identifies the character sequence that will be recognized as mean-
2518 Return the size of the compiled pattern in bytes (for all three li-
2522 pcre2_compile() is getting memory in which to place the compiled pat-
2523 tern may be slightly larger than the value returned by this option, be-
2525 over-estimate. Processing a pattern with the JIT compiler does not al-
2541 which they appear. Its first argument is a pointer to a callout enumer-
2543 passed to pcre2_callout_enumerate(). The contents of the callout enu-
2550 It is possible to save compiled patterns on disc or elsewhere, and re-
2556 of PCRE2 is really just a bytecode dump. The functions whose names be-
2557 gin with pcre2_serialize_ are used for converting to and from the seri-
2580 you must create a match data block by calling one of the creation func-
2587 to record the matched portion of the subject plus three captured sub-
2601 The second argument of pcre2_match_data_create() is a pointer to a gen-
2611 general context, but in this case if NULL is passed, the memory is ob-
2617 after a match operation has finished, using functions that are de-
2621 match block only when the error is PCRE2_ERROR_NOMATCH, PCRE2_ER-
2622 ROR_PARTIAL, or one of the error codes for an invalid UTF string. Ex-
2632 described in the section entitled "Option bits for pcre2_match()" be-
2652 makes use of a vector of data frames for remembering backtracking posi-
2653 tions. The size of each individual frame depends on the number of cap-
2654 turing parentheses in the pattern and can be obtained by calling
2655 pcre2_pattern_info() with the PCRE2_INFO_FRAMESIZE option (see the sec-
2659 turns out to be too small during matching, it is automatically ex-
2660 panded. When pcre2_match() returns, the memory is not freed, but re-
2683 order to find multiple matches in the subject string or to match dif-
2686 This function is the main matching facility of the library, and it op-
2687 erates in a Perl-like manner. For specialist use there is also an al-
2703 If the subject string is zero-terminated, the length can be given as
2705 common matching parameters are to be changed. For details, see the sec-
2713 bytes for the 8-bit library, 16-bit code units for the 16-bit library,
2714 and 32-bit code units for the 32-bit library, whether or not UTF pro-
2716 zero, the subject is assumed to be an empty string. If length is non-
2722 by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
2723 set must point to the start of a character, or to the end of the sub-
2724 ject (in UTF-32 mode, one code unit equals one character, so all off-
2725 sets are valid). Like the pattern string, the subject may contain bi-
2728 A non-zero starting offset is useful when searching for another match
2755 so, and the current character is CR followed by LF, advance the start-
2758 If a non-zero starting offset is passed when the pattern is anchored, a
2759 single attempt to match at the given offset is made. This can only suc-
2761 the subject. In other words, the anchoring must be the result of set-
2769 PCRE2_COPY_MATCHED_SUBJECT, PCRE2_DISABLE_RECURSELOOP_CHECK, PCRE2_EN-
2771 PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK, PCRE2_PAR-
2774 Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not sup-
2775 ported by the just-in-time (JIT) compiler. If it is set, JIT matching
2794 must not be freed until all such operations are complete. For some ap-
2803 also automatically freed if the match data block is re-used for another
2808 This option is relevant only to pcre2_match() for interpretive match-
2812 The use of recursion in patterns can lead to infinite loops. In the in-
2818 start of that group, and the furthest inspected character of the sub-
2821 There are rare cases of matches that would complete, but nevertheless
2828 matches must be right at the end of the subject string. Note that set-
2843 in multiline mode) a newline immediately before it. Setting this with-
2845 match. This option affects only the behaviour of the dollar metacharac-
2867 subject is permitted. If the pattern is anchored, such a match can oc-
2882 The latter special case is discussed in detail in the pcre2unicode doc-
2885 In the default case, if a non-zero starting offset is given, the check
2893 that the sequences \b and \B are one-character lookbehinds.
2899 validity of UTF-8 strings, UTF-16 strings, and UTF-32 strings in the
2909 PCRE2_NO_UTF_CHECK is set at match time the effect of passing an in-
2910 valid string as a subject, or an invalid value of startoffset, is unde-
2911 fined. Your program may crash or loop indefinitely or give wrong re-
2917 These options turn on the partial matching feature. A partial match oc-
2919 there are not enough subject characters to complete the match. In addi-
2924 If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PAR-
2925 TIAL_HARD) is set, matching continues by testing any remaining alterna-
2926 tives. Only if no complete match can be found is PCRE2_ERROR_PARTIAL
2927 returned instead of PCRE2_ERROR_NOMATCH. In other words, PCRE2_PAR-
2929 match, but only if no complete match can be found.
2934 other words, when PCRE2_PARTIAL_HARD is set, a partial match is
2935 considered to be more important than an alternative complete match.
2937 There is a more detailed discussion of partial and multi-segment match-
2943 When PCRE2 is built, a default newline convention is set; this is usu-
2948 pcre2pattern page. During matching, the newline choice affects the be-
2961 expected. For example, if the pattern is .+A (and the PCRE2_DOTALL op-
2964 However, the pattern [\r\n]A does match that string, because it con-
2965 tains an explicit CR or LF reference, and so advances only by one char-
2971 not count, nor does \s, even though it includes CR and LF in the char-
2998 Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
3002 pcre2_get_ovector_count() returns the number of pairs of values it con-
3005 Within the ovector, the first in each pair of values is set to the off-
3007 offset of the first code unit after the end of a substring. These val-
3009 are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit li-
3010 brary, and 32-bit offsets in the 32-bit library.
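The pair layout of the ovector can be modelled outside PCRE2. The sketch below uses Python's re module as a stand-in for the matcher, so it illustrates the layout only and is not the PCRE2 API. The names UNSET, flat_ovector, and substring are hypothetical; UNSET plays the role of PCRE2_UNSET, and the offsets are Python string indexes, which coincide with 8-bit code unit offsets only for ASCII subjects.

```python
import re

UNSET = -1  # stand-in for PCRE2_UNSET (Python reports -1 for unset groups)


def flat_ovector(match):
    """Flatten the group spans into [start0, end0, start1, end1, ...]."""
    vec = []
    for group in range(match.re.groups + 1):
        start, end = match.span(group)
        vec.extend([start, end])
    return vec


def substring(subject, ovector, number):
    """Return the text of group `number`, or None if the group is unset."""
    start, end = ovector[2 * number], ovector[2 * number + 1]
    if start == UNSET:
        return None
    # The second value of a pair is the offset just past the substring.
    return subject[start:end]


subject = "def"
ovector = flat_ovector(re.search(r"(abc)|(def)", subject))
print(ovector)                          # pair 0 is the whole match
print(substring(subject, ovector, 1))   # group 1 did not participate
print(substring(subject, ovector, 2))
```

With the pattern (abc)|(def) and the subject "def", the first pair covers the whole match, the second pair is unset, and the third pair covers "def".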
3018 the portion of the subject string that was matched by the entire pat-
3022 been captured, the returned value is 3. If there are no captured sub-
3031 If a capture group is matched repeatedly within a single match opera-
3032 tion, it is the last portion of the subject that it matched that is re-
3048 Offset values that correspond to unused groups at the end of the ex-
3057 in the pattern are never changed. That is, if a pattern contains n cap-
3058 turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
3059 pcre2_match(). The other elements retain whatever values they previ-
3080 returns a pointer to the zero-terminated name, which is within the com-
3090 After a "no match" or a partial match, the last encountered name is re-
3100 Warning: By default, certain start-of-match optimizations are used to
3103 for the presence of "c" in the subject before running the matching en-
3105 any marks. You can disable the start-of-match optimizations by setting
3112 offset of the character at which the match started. For a non-partial
3125 If pcre2_match() fails, it returns a negative number. This can be con-
3126 verted to a text string by calling the pcre2_get_error_message() func-
3131 of UTF-specific negative error codes is returned. Details are given in
3146 PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
3153 a library of a different code unit width, for example, a pattern com-
3154 piled by the 8-bit library is passed to a 16-bit or 32-bit library
3194 This error is returned when a pattern that was successfully studied us-
3195 ing JIT is being matched, but the memory available for the just-in-time
3209 also returned if PCRE2_COPY_MATCHED_SUBJECT is set and memory alloca-
3219 within the pattern. Specifically, it means that either the whole pat-
3222 might do this are detected and faulted at compile time, but more com-
3233 match, or auxiliary) can be obtained by calling pcre2_get_error_mes-
3240 The returned message is terminated with a trailing zero, and the func-
3242 zero. If the error number is unknown, the negative error code PCRE2_ER-
3266 extracting captured substrings as new, separate, zero-terminated
3272 zero refers to the entire matched substring, with higher numbers refer-
3283 extracts a zero-length empty string.
3292 The pcre2_substring_copy_bynumber() function copies a captured sub-
3295 function that was used for the match data block. The first two argu-
3312 code is returned. If a substring number greater than zero is used af-
3335 pattern is (abc)|(def) and the subject is "def", and the ovector con-
3346 The pcre2_substring_list_get() function extracts all available sub-
3348 builds a second list that contains their lengths (in code units), ex-
3360 therefore need the lengths, you may supply NULL as the lengthsptr argu-
3362 function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem-
3369 be distinguished from a genuine zero-length substring by inspecting the
3390 To extract a substring by name, you first have to find associated num-
3397 the name by calling pcre2_substring_number_from_name(). The first argu-
3406 the "bynumber" functions, the only difference being that the second ar-
3414 than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is re-
3420 group numbers in the pcre2pattern page, you cannot use names to distin-
3439 can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. As
3440 a special case, if replacement is NULL and rlength is zero, the re-
3441 placement is assumed to be an empty string. If rlength is non-zero, an
3444 There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to re-
3447 that requests multiple replacements (see PCRE2_SUBSTITUTE_GLOBAL be-
3452 never greater than one unless PCRE2_SUBSTITUTE_GLOBAL is set. A nega-
3457 error return. For global replacements, matches in which \K in a lookbe-
3462 pcre2_match(), except that the partial matching options are not permit-
3464 block is obtained and freed within this function, using memory manage-
3471 will always be a no-match error. The contents of the ovector within the
3479 arguments. The data in the match_data block (return code, offset vec-
3481 pcre2_match() from within pcre2_substitute(). This allows an applica-
3486 changed when PCRE2_SUBSTITUTE_MATCHED is set. If PCRE2_SUBSTI-
3487 TUTE_GLOBAL is also set, pcre2_match() is called after the first sub-
3488 stitution to check for further matches, but this is done using an in-
3492 The code argument is not used for matching before the first substitu-
3494 even when PCRE2_SUBSTITUTE_GLOBAL is not set, because it contains in-
3495 formation such as the UTF setting and the number of capturing parenthe-
3499 subject string with matched substrings replaced. However, if PCRE2_SUB-
3505 The outlengthptr argument of pcre2_substitute() must point to a vari-
3511 If the function is not successful, the value set via outlengthptr de-
3513 string, the value is the offset in the replacement string where the er-
3514 ror was detected. For other errors, the value is PCRE2_UNSET by de-
3515 fault. This includes the case of the output buffer being too small, un-
3519 buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
3521 continues to go through the motions of matching and substituting (with-
3524 variable, with the result of the function still being PCRE2_ER-
3529 that the entire operation is carried out twice. Depending on the appli-
3531 the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER-
3536 invalid UTF replacement string causes an immediate return with the rel-
3539 If PCRE2_SUBSTITUTE_LITERAL is set, the replacement string is not in-
3540 terpreted in any way. By default, however, a dollar character is an es-
3541 cape character that can specify the insertion of characters from cap-
3542 ture groups and names from (*MARK) or other control verbs in the pat-
3543 tern. Dollar is the only escape character (backslash is treated as lit-
3551 brackets are required only if the following character would be inter-
3571 takes place in the original subject string (that is, previous replace-
3574 subject string. If an offset limit is set in the match context, search-
3578 the subject string by setting either or both of startoffset and an off-
3586 with zero length, an attempt to find a non-empty match at the same off-
3597 PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including un-
3600 not set, an attempt to insert an unset group causes the PCRE2_ERROR_UN-
3611 particular character codes, and backslash followed by any non-alphanu-
3618 current state: \U and \L change to upper or lower case forcing, respec-
3623 all inserted characters, including those from capture groups and let-
3628 Note that case forcing sequences such as \U...\E do not nest. For exam-
3630 \E has no effect. Note also that the PCRE2_ALT_BSUX and PCRE2_EX-
3637 ${<n>:-<string>}
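The effect of the ${<n>:-<string>} form can be sketched as follows. This is an illustration in Python using the re module, not the pcre2_substitute() implementation. The helper expand_defaults is hypothetical, it handles only this one form, and it treats <string> as literal text, whereas PCRE2 expands it before insertion.

```python
import re

# Matches ${<n>:-<string>} where <n> is a group number or name.
_DEFAULT_FORM = re.compile(r"\$\{(\w+):-([^}]*)\}")


def expand_defaults(match, replacement):
    """Insert group <n> if it is set, otherwise insert the default string."""
    def one(item):
        ref, default = item.group(1), item.group(2)
        key = int(ref) if ref.isdigit() else ref
        value = match.group(key)          # None when the group is unset
        return default if value is None else value
    return _DEFAULT_FORM.sub(one, replacement)


m = re.search(r"(abc)|(def)", "def")
result = expand_defaults(m, "[${1:-none}][${2:-none}]")
print(result)
```

Group 1 is unset in this match, so its default is used; group 2 is set, so its contents are inserted.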
3640 As before, <n> may be a group number or a name. The first form speci-
3661 substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause un-
3665 PCRE2_SUBSTITUTE_UNSET_EMPTY, and PCRE2_SUBSTITUTE_EXTENDED are irrele-
3674 PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
3677 PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
3679 when the simple (non-extended) syntax is used and PCRE2_SUBSTITUTE_UN-
3693 the replacement string, with more particular errors being PCRE2_ER-
3702 obtained by calling the pcre2_get_error_message() function (see "Ob-
3719 callout block structure, which contains the following fields, not nec-
3736 first callout, 2 for the second, and so on. The input and output point-
3751 If the value is zero, the replacement is accepted, and, if PCRE2_SUB-
3753 match. If the value is not zero, the current replacement is not ac-
3767 capture groups are not required to be unique. Duplicate names are al-
3773 match, only one of each set of identically-named groups participates.
3778 to the given name that is set. Only if none are set is
3779 PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name() function re-
3791 point to the first and last entries in the name-to-number table for the
3805 which stops when it finds the first match at a given point in the sub-
3808 function (see below) instead. If you cannot use the alternative func-
3812 What you have to do is to insert a callout right at the end of the pat-
3813 tern. When your callout function is called, extract and save the cur-
3831 different characteristics to the normal algorithm, and is not compati-
3832 ble with Perl. Some of the features of PCRE2 patterns are not sup-
3840 is used in a different way, and this is described below. The other com-
3846 keeping track of multiple paths through the pattern tree. More work-
3847 space is needed for patterns and subjects where there are a lot of po-
3869 PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO-
3882 that requires additional characters. This happens even if some complete
3885 if the end of the subject is reached, there have been no complete
3886 matches, but there is still at least one matching possibility. The por-
3889 more detailed discussion of partial and multi-segment matching, with
3895 stop as soon as it has found one match. Because of the way the alterna-
3911 When pcre2_dfa_match() succeeds, it may have matched more than one sub-
3930 which is the number of matched substrings. The offsets of the sub-
3936 Calls to the convenience functions that extract substrings by name re-
3937 turn the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used af-
3946 NOTE: PCRE2's "auto-possessification" optimization usually applies to
3949 matching, this means that only one possible match is found. If you re-
3950 ally do want multiple matches in such cases, either use an ungreedy re-
3951 peat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when com-
4016 Copyright (c) 1997-2024 University of Cambridge.
4020 ------------------------------------------------------------------------------
4028 PCRE2 - Perl-compatible regular expressions (revised API)
4034 the library in Unix-like environments using the applications known as
4036 CMake instead of configure. The text file README contains general in-
4037 formation about building with Autotools (some of which is repeated be-
4040 There is a lot more information about building PCRE2 without using Au-
4042 hand") in the text file called NON-AUTOTOOLS-BUILD. You should consult
4043 this file as well as the README file if you are building in a non-Unix-
4047 PCRE2 BUILD-TIME OPTIONS
4051 configure script, where the optional features are selected or dese-
4052 lected by providing options to configure before running the make com-
4053 mand. However, the same options can be selected in both Unix-like and
4054 non-Unix-like environments if you are using CMake instead of configure
4059 compiler, as described in NON-AUTOTOOLS-BUILD.
4061 The complete list of options for configure (which includes the standard
4062 ones such as the selection of the installation directory) can be ob-
4065 ./configure --help
4068 names begin with --enable or --disable. Because of the way that config-
4069 ure works, --enable and --disable always come in pairs, so the comple-
4072 with --with. At the end of a configure run, a summary of the configura-
4076 BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
4078 By default, a library called libpcre2-8 is built, containing functions
4080 either as single-byte characters, or UTF-8 strings. You can also build
4081 two other libraries, called libpcre2-16 and libpcre2-32, which process
4082 strings that are contained in arrays of 16-bit and 32-bit code units,
4083 respectively. These can be interpreted either as single-unit characters
4084 or UTF-16/UTF-32 strings. To build these additional libraries, add one
4087 --enable-pcre2-16
4088 --enable-pcre2-32
4090 If you do not want the 8-bit library, add
4092 --disable-pcre2-8
4095 the POSIX wrapper is for the 8-bit library only, and that pcre2grep is
4096 an 8-bit program. Neither of these is built if you select only the
4097 16-bit or 32-bit libraries.
4106 --disable-shared
4107 --disable-static
4109 to the configure command. Setting --disable-shared ensures that PCRE2
4114 you want these binaries to be fully statically linked, you can set LD-
4117 LDFLAGS=--static ./configure --disable-shared
4119 Note the two hyphens in --static. Of course, this works only if static
4128 --disable-unicode
4131 It is not possible to build one library with Unicode support and an-
4134 Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
4135 UTF-16 or UTF-32. To do that, applications that use the library can set
4136 the PCRE2_UTF option when they call pcre2_compile() to compile a pat-
4144 and Nd, script names, and some bi-directional properties are supported.
4156 mode, can cause unpredictable behaviour because it may leave the cur-
4157 rent matching point in the middle of a multi-code-unit character. The
4158 application can lock it out by setting the PCRE2_NEVER_BACKSLASH_C op-
4159 tion when calling pcre2_compile(). There is also a build-time option
4161 --enable-never-backslash-C
4166 JUST-IN-TIME COMPILER SUPPORT
4168 Just-in-time (JIT) compiler support is included in the build by speci-
4171 --enable-jit
4177 --enable-jit=auto
4184 --enable-jit-sealloc
4191 --disable-pcre2grep-jit
4199 the end of a line. This is the normal newline character on Unix-like
4203 --enable-newline-is-cr
4205 to the configure command. There is also an --enable-newline-is-lf op-
4209 the two-character sequence CRLF (CR immediately followed by LF). If you
4212 --enable-newline-is-crlf
4216 --enable-newline-is-anycrlf
4221 --enable-newline-is-any
4224 newline sequences are the three just mentioned, plus the single charac-
4229 --enable-newline-is-nul
4231 which causes NUL (binary zero) to be set as the default line-ending
4245 --enable-bsr-anycrlf
4247 the default is changed so that \R matches only CR, LF, or CRLF. What-
4255 part to another (for example, from an opening parenthesis to an alter-
4256 nation metacharacter). By default, in the 8-bit and 16-bit libraries,
4257 two-byte values are used for these offsets, leading to a maximum size
4258 for a compiled pattern of around 64 thousand code units. This is suffi-
4261 compile PCRE2 to use three-byte or four-byte offsets by adding a set-
4264 --with-link-size=3
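The relationship between the link size and the maximum compiled pattern size is simple arithmetic: an offset stored in n bytes can take 256 to the power n distinct values. The sketch below, in Python, shows only this order-of-magnitude bound. The function name is hypothetical, and the precise limit enforced by PCRE2 is an implementation detail that this calculation does not claim to reproduce.

```python
def offset_values(link_size_bytes: int) -> int:
    """Number of distinct values an offset of the given byte width can hold."""
    if link_size_bytes not in (2, 3, 4):
        raise ValueError("link size must be 2, 3, or 4")
    return 256 ** link_size_bytes


for size in (2, 3, 4):
    print(size, offset_values(size))
```

A two-byte link gives 65536 values, which corresponds to the "around 64 thousand" figure; three and four bytes raise the bound to about 16 million and 4 thousand million respectively.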
4267 16-bit library, a value of 3 is rounded up to 4. In these libraries,
4269 to load additional data when handling them. For the 32-bit library the
4270 value is always 4 and cannot be overridden; the value of --with-link-
4283 --with-match-limit=500000
4297 --with-heap-limit=500
4307 for --with-match-limit. You can set a lower default limit by adding,
4310 --with-match-limit-depth=10000
4317 This limit was more useful in versions before 10.30, where function re-
4322 for lookaround assertions, atomic groups, and recursion within pat-
4326 LIMITING VARIABLE-LENGTH LOOKBEHIND ASSERTIONS
4328 Lookbehind assertions in which one or more branches can match a vari-
4330 matching length for each top-level branch. There is a limit to this
4334 --with-max-varlookbehind=100
4336 The limit can be changed at runtime by calling pcre2_set_max_varlookbe-
4349 --enable-rebuild-chartables
4352 Instead, a program called pcre2_dftables is compiled and run. This out-
4354 your C run-time system. This method of replacing the tables does not
4364 cc src/pcre2_dftables.c -o pcre2_dftables
4368 want to specify a locale, you must use the -L option:
4370 LC_ALL=fr_FR ./pcre2_dftables -L src/pcre2_chartables.c
4372 You can also specify -b (with or without -L). This causes the tables to
4374 can be loaded into memory by an application and passed to pcre2_com-
4376 The tables are just a string of bytes, independent of hardware charac-
4387 compiled to run in an 8-bit EBCDIC environment by adding
4389 --enable-ebcdic --disable-unicode
4391 to the configure command. This setting implies --enable-rebuild-charta-
4392 bles. You should only use it if you know that you are in an EBCDIC en-
4395 It is not possible to support both EBCDIC and UTF-8 codes in the same
4396 version of the library. Consequently, --enable-unicode and --enable-
4403 --enable-ebcdic-nl25
4405 as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
4407 0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
4410 The options that select newline behaviour, such as --enable-newline-is-
4411 cr, and equivalent run-time options, refer to these character values in
4418 within the patterns it is matching. There are two kinds: one that gen-
4419 erates output using local code, and another that calls an external pro-
4420 gram or script. If --disable-pcre2grep-callout-fork is added to the
4422 --disable-pcre2grep-callout is used, all callouts are completely ig-
4423 nored. For more details of pcre2grep callouts, see the pcre2grep docu-
4433 --enable-pcre2grep-libz
4434 --enable-pcre2grep-libbz2
4436 to the configure command. These options naturally require that the rel-
4448 be processable is the notional buffer size. If a longer line is encoun-
4454 --with-pcre2grep-bufsize=51200
4455 --with-pcre2grep-max-bufsize=2097152
4458 values by using --buffer-size and --max-buffer-size on the command
4466 --enable-pcre2test-libreadline
4467 --enable-pcre2test-libedit
4469 to the configure command, pcre2test is linked with the libreadline or-
4471 it reads it using the readline() function. This provides line-editing
4472 and history facilities. Note that libreadline is GPL-licensed, so if
4477 Setting --enable-pcre2test-libreadline causes the -lreadline option to
4479 system-installed readline library this is sufficient. However, in some
4491 LIBS="-lncurses"
4500 --enable-debug
4510 --enable-valgrind
4513 certain memory regions as unaddressable. This allows it to detect in-
4523 --enable-coverage
4535 When --enable-coverage is used, the following additional targets are
4541 equivalent to running "make coverage-reset", "make coverage-baseline",
4542 "make check", and then "make coverage-report".
4544 make coverage-reset
4548 make coverage-baseline
4552 make coverage-report
4556 make coverage-clean-report
4558 This removes the generated coverage report without cleaning the cover-
4561 make coverage-clean-data
4566 make coverage-clean
4569 For more information about code coverage, see the gcov and lcov docu-
4583 --disable-percent-zt
4595 --enable-fuzz-support
4597 At present this applies only to the 8-bit library. If set, it causes an
4598 extra library called libpcre2-fuzzsupport.a to be built, but not in-
4599 stalled. This contains a single function called LLVMFuzzerTestOneIn-
4606 Setting --enable-fuzz-support also causes a binary called pcre2fuz-
4621 --disable-stack-for-recursion
4630 pcre2api(3), pcre2-config(3).
4643 Copyright (c) 1997-2024 University of Cambridge.
4647 ------------------------------------------------------------------------------
4655 PCRE2 - Perl-compatible regular expressions (revised API)
4671 PCRE2 provides a feature called "callout", which is a means of tem-
4672 porarily passing control to the caller of PCRE2 in the middle of pat-
4677 When using the pcre2_substitute() function, an additional callout fea-
4687 ending delimiter is the same as the start, except for {, where the end-
4707 A(\d{2}|--)
4711 (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
4714 alternation bar. If the pattern contains a conditional group whose con-
4728 information when you are trying to optimize the performance of a par-
4738 Auto-possessification
4740 At compile time, PCRE2 "auto-possessifies" repeated items when it knows
4746 --->aaaa
4754 the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
4758 --->aaaa
4775 beginning of the subject, and pcre2_compile() remembers this. If a pat-
4776 tern has more than one top-level branch, automatic anchoring occurs if
4781 It is also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
4787 --->aa
4794 This shows that all match attempts start at the beginning of the sub-
4795 ject. In other words, the pattern is anchored. You can disable this op-
4797 starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out-
4800 --->aa
4810 This shows more match attempts, starting at the second subject charac-
4831 You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
4841 to normal, DFA, and JIT matching. The first argument to the call-
4867 version 1, and the callout_flags field for version 2. If you are writ-
4876 contains the number of the callout, in the range 0-255. This is the
4883 callout_string points to the string that is contained within the com-
4891 delimiter as callout_string[-1] if you need it.
4907 For calls to pcre2_match(), the offset_vector field is not (since re-
4909 matching function in the match data block. Instead it points to an in-
4915 The capture_last field contains the number of the most recently cap-
4917 number of the highest numbered captured substring so far. If no sub-
4923 The contents of ovector[2] to ovector[<capture_top>*2-1] can be in-
4927 the match is by definition not complete. Substrings that have not been
4932 was passed to the matching function in the match data block for call-
4943 at which the current match attempt started. However, if the escape se-
4958 parenthesis, the length includes meta characters that follow the paren-
4961 length is one, unless a closing parenthesis is followed by a quanti-
4964 was that of the entire group, and before an alternation bar or a clos-
4970 are used by pcre2test to show the next item to be matched when display-
4974 zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
4996 starting position in the subject. Output from pcre2test does not indi-
5000 The information in the callout_flags field is provided so that applica-
5004 because there is no backtracking in DFA matching, and there is no sup-
5018 Negative values should normally be chosen from the set of PCRE2_ER-
5036 which they appear. Its first argument is a pointer to a callout enumer-
5038 passed to pcre2_callout_enumerate(). The data block contains the fol-
5056 non-zero minimum or a fixed maximum, the group is replicated inside the
5062 The callback function should normally return zero. If it returns a non-
5077 Copyright (c) 1997-2024 University of Cambridge.
5081 ------------------------------------------------------------------------------
5089 PCRE2 - Perl-compatible regular expressions (revised API)
5102 matches the next character unless it is the start of a newline se-
5111 3. Like Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
5113 does not assert that the next three characters are not "a". It just as-
5118 on non-lookaround assertions.
5121 to repeat (for example, at the start of a branch), PCRE2 raises an er-
5131 \u, \U, and \N when followed by a character name. \N on its own, match-
5132 ing a non-newline character, and \N{U+dd..}, matching a Unicode code
5134 letters are implemented by Perl's general string-handling and are not
5145 binary properties. Both PCRE2 and Perl support the Cs (surrogate) prop-
5146 erty, but in PCRE2 its use is limited. See the pcre2pattern documenta-
5147 tion for details. The long synonyms for property names that Perl sup-
5148 ports (such as \p{Letter}) are not supported by PCRE2, nor is it per-
5155 variables). Also, Perl does "double-quotish backslash interpolation" on
5183 their effect is confined to that group; it does not extend to the sur-
5198 matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2 un-
5203 works internally just with numbers, using an external table to trans-
5218 such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
5223 not affected when case-independent matching is specified. For example,
5229 18. From release 5.32.0, Perl locks out the use of \K in lookaround as-
5231 there is an option for re-enabling the previous behaviour. When this
5235 19. PCRE2 provides some extensions to the Perl regular expression fa-
5241 $ meta-character matches only at the very end of the string.
5246 (c) If PCRE2_UNGREEDY is set, the greediness of the repetition quanti-
5247 fiers is inverted, that is, by default they are not greedy, but if fol-
5259 (g) The callout facility is PCRE2-specific. Perl supports codeblocks
5262 (h) The partial matching facility is PCRE2-specific.
5265 different way and is not Perl-compatible.
5271 (k) PCRE2 supports non-atomic positive lookaround assertions. This is
5272 an extension to the lookaround facilities. The default, Perl-compatible
5278 supports relative group numbers such as +2 and -4 in all three cases.
5282 20. Perl has different limits than PCRE2. See the pcre2limit
5283 documentation for details. With 5.10, Perl moved from recursion to iteration, keep-
5285 not fall into any stack-overflow limit. PCRE2 made a similar change at
5286 release 10.30, and also has many build-time and run-time customizable
5289 21. Unlike Perl, PCRE2 doesn't have character set modifiers and spe-
5296 can be handled by PCRE2, either by the interpreter or the JIT. An exam-
5311 Copyright (c) 1997-2023 University of Cambridge.
5315 ------------------------------------------------------------------------------
5323 PCRE2 - Perl-compatible regular expressions (revised API)
5326 PCRE2 JUST-IN-TIME COMPILER SUPPORT
5328 Just-in-time compiling is a heavyweight optimization that can greatly
5329 speed up pattern matching. However, it comes at the cost of extra pro-
5331 the same pattern is going to be matched many times. This does not nec-
5333 anchored, matching attempts may take place many times at various posi-
5335 string is very long, it may still pay to use JIT even for one-off
5336 matches. JIT support is available for all of the 8-bit, 16-bit and
5337 32-bit PCRE2 libraries.
5339 JIT support applies only to the traditional Perl-compatible matching
5347 --enable-jit (or equivalent CMake option) must be set when PCRE2 is
5351 ARM 32-bit (v7, and Thumb2)
5352 ARM 64-bit
5354 Intel x86 32-bit and 64-bit
5356 MIPS 32-bit and 64-bit
5357 Power PC 32-bit and 64-bit
5358 RISC-V 32-bit and 64-bit
5360 If --enable-jit is set on an unsupported platform, compilation fails.
5366 particular match. One reason for this is that there are a number of op-
5367 tions and pattern items that are not supported by JIT (see below). An-
5369 in which to build its compiled code. The only guarantee from pcre2_con-
5376 there is a "fast path" API that is JIT-specific.
5385 second is zero or more of the following option bits: PCRE2_JIT_COM-
5396 the size of machine stack that it uses. The exact rules are not docu-
5397 mented because they may change at any time, in particular, when new op-
5401 PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for com-
5402 plete matches. If you want to run partial matches using the PCRE2_PAR-
5407 pcre2_match() is called, the appropriate code is run if it is avail-
5412 the option bits. For example, you can call it once with PCRE2_JIT_COM-
5415 will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
5416 ing. If pcre2_jit_compile() is called with no option bits set, it imme-
5424 are described in the section entitled "Controlling the JIT stack" be-
5433 stack" below, even if you do not need to supply a non-default JIT
5435 be obeyed. If the match-time options are not right for JIT execution,
5438 If the JIT compiler finds an unsupported item, no JIT data is gener-
5440 pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JITSIZE op-
5441 tion. A non-zero result means that JIT compilation was successful. A
5452 are normally expected to be a valid sequence of UTF code units. By de-
5453 fault, this is checked at the start of matching and an error is gener-
5462 PCRE2_MATCH_INVALID_UTF option has two effects: it tells the inter-
5463 preter in pcre2_match() to support invalid UTF, and, if pcre2_jit_com-
5464 pile() is subsequently called, the compiled JIT code also supports in-
5469 PCRE2_JIT_INVALID_UTF, which currently exists only for backward compat-
5487 when running in a UTF mode, and a callout immediately before an asser-
5510 large or complicated patterns need more than this. The error PCRE2_ER-
5511 ROR_JIT_STACKLIMIT is given when there is not enough stack. Three func-
5516 The pcre2_jit_stack_create() function creates a JIT stack. Its argu-
5522 function returns immediately, without doing anything. (For the techni-
5534 The first argument is a pointer to a match context. When this is subse-
5536 JIT stack is used. If this argument is NULL, the function returns imme-
5556 is not obeyed when pcre2_match() is called with options that are incom-
5558 determine whether a match operation was executed by JIT or by the in-
5564 up non-sequential matches in one thread is to use callouts: if a call-
5569 you assign or pass back NULL from a callback, that is thread-safe, be-
5571 pass back a non-NULL JIT stack, this must be a different stack for each
5572 thread so that the application is thread-safe.
5574 Strictly speaking, even more is allowed. You can assign the same non-
5583 up non-default JIT stacks might operate:
5591 Use a one-line callback function
5602 PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
5604 child nodes. Allocating real machine stack on some platforms is diffi-
5611 Modern operating systems have a nice feature: they can reserve an ad-
5614 memory data (this is important because of pointers). Thus we can allo-
5634 You can free compiled patterns, contexts, and stacks in any order, any-
5653 Especially on embedded systems, it might be a good idea to release mem-
5656 allocated memory for any stack and another which allows releasing mem-
5670 The JIT executable allocator does not free all memory when it is possi-
5674 calling pcre2_jit_free_unused_memory(). Its argument is a general con-
5675 text, for custom memory management, or NULL for standard memory manage-
5681 This is a single-threaded example that specifies a JIT stack without
5718 The fast path function is called pcre2_jit_match(), and it takes ex-
5720 must be specified with a length; PCRE2_ZERO_TERMINATED is not sup-
5722 PCRE2_ENDANCHORED) are ignored, as is the PCRE2_NO_JIT option. The re-
5723 turn values are also the same as for pcre2_match(), plus PCRE2_ER-
5724 ROR_JIT_BADOPTION if a matching mode (partial or complete) is requested
5728 number of other sanity checks are performed on the arguments. For exam-
5729 ple, if the subject pointer is NULL but the length is non-zero, an im-
5758 Copyright (c) 1997-2024 University of Cambridge.
5762 ------------------------------------------------------------------------------
5770 PCRE2 - Perl-compatible regular expressions (revised API)
5779 code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with
5780 the default internal linkage size, which is 2 bytes for these li-
5783 (when building the 16-bit library, 3 is rounded up to 4). See the
5785 for details. In these cases the limit is substantially larger. How-
5786 ever, the speed of execution is slower. In the 32-bit library, the in-
5794 the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an un-
5796 is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-termi-
5801 There are two different limits that apply to branches of lookbehind as-
5821 (*THEN) verb is 255 code units for the 8-bit library and 65535 code
5822 units for the 16-bit and 32-bit libraries.
5825 number a 32-bit unsigned integer can hold.
5842 Copyright (c) 1997-2023 University of Cambridge.
5846 ------------------------------------------------------------------------------
5854 PCRE2 - Perl-compatible regular expressions (revised API)
5862 pcre2_match() function. This works in the same way as Perl's matching func-
5863 tion, and provides a Perl-compatible matching operation. The just-in-
5868 it operates in a different way, and is not Perl-compatible. This alter-
5869 native has advantages and disadvantages compared with the standard al-
5889 The set of strings that are matched by a regular expression can be rep-
5894 tree: depth-first and breadth-first, and these correspond to the two
5900 In the terminology of Jeffrey Friedl's book "Mastering Regular Expres-
5902 depth-first search of the pattern tree. That is, it proceeds along a
5904 required. When there is a mismatch, the algorithm tries any alterna-
5913 that point the algorithm stops. Thus, if there is more than one possi-
5916 on the way the alternations and the greedy or ungreedy repetition quan-
5919 Because it ends up with a single path through the tree, it is rela-
5920 tively straightforward for this algorithm to keep track of the sub-
5927 This algorithm conducts a breadth-first search of the tree. Starting
5938 following or preceding the current point have to be independently in-
5953 the match data block is therefore not advisable when doing DFA match-
5963 the fifth character of the subject. The algorithm does not automati-
5966 PCRE2's "auto-possessification" optimization usually applies to charac-
5967 ter repeats at the end of a pattern (as well as internally). For exam-
5972 either use an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS-
5976 not supported or behave differently in the alternative matching func-
5979 1. Because the algorithm finds all possible matches, the greedy or un-
5981 affect auto-possessification, as just described). During matching,
5990 a non-possessive quantifier. Similarly, if an atomic group is present,
5998 algorithm does not attempt to do this. This means that no captured sub-
6001 3. Because no substrings are captured, backreferences within the pat-
6004 4. For the same reason, conditional expressions that use a backrefer-
6010 6. Because many paths through the tree may be active, the \K escape se-
6019 these modes, because the alternative algorithm moves through the sub-
6027 10. The PCRE2_MATCH_INVALID_UTF option for pcre2_compile() is not sup-
6041 matching and discusses multi-segment matching.
6071 Copyright (c) 1997-2024 University of Cambridge.
6075 ------------------------------------------------------------------------------
6083 PCRE2 - Perl-compatible regular expressions
6099 Another example is checking a user input string as it is typed, to en-
6103 Partial matching is a PCRE2-specific feature; it is not Perl-compati-
6105 PCRE2_PARTIAL_SOFT options when calling a matching function. The dif-
6107 preferred to an alternative complete match, though the details differ
6111 If you want to use partial matching with just-in-time optimized code,
6113 you must also call pcre2_jit_compile() with one or both of these op-
6119 PCRE2_JIT_COMPLETE should also be set if you are going to run non-par-
6124 Setting a partial matching option disables two of PCRE2's standard op-
6125 timization hints. PCRE2 remembers the last literal code unit in a pat-
6138 needed to complete the match, or the addition of more characters might
6141 Example 1: if the pattern is /abc/ and the subject is "ab", more char-
6142 acters are definitely needed to complete a match. In this case both
6145 Example 2: if the pattern is /ab+/ and the subject is "ab", a complete
6147 what is matched. In this case, only PCRE2_PARTIAL_HARD returns a par-
6148 tial match; PCRE2_PARTIAL_SOFT returns the complete match.
6158 assertions and the \K escape sequence provide ways of inspecting char-
6161 (2) The pattern contains one or more lookbehind assertions. This condi-
6162 tion exists in case there is a lookbehind that inspects characters be-
6169 because adding more characters might result in a non-empty match,
6171 "there is going to be a match at this point, but until some more char-
6172 acters are added, we do not know if it will be an empty string or some-
6182 A complete match has been found, starting and ending within this sub-
6189 Adding more characters may result in a complete match that uses one
6194 the rest of the ovector are undefined. The appearance of \K in the pat-
6199 If it is matched against "456abc123xyz" the result is a complete match,
6203 string "abc12", because all these characters are needed for a subse-
6204 quent re-match with additional characters.
6211 If this is matched against the subject string "abc123dog", both alter-
6214 and 9, identifying "123dog" as the first partial match. (In this exam-
6225 complete matches. This option is "hard" because it prefers an earlier
6226 partial match over a later complete match. For this reason, the assump-
6228 true end of the available data, which is why \z, \Z, \b, \B, and $ al-
6233 tried. If no complete match can be found, PCRE2_ERROR_PARTIAL is re-
6235 prefers a complete match over a partial match. All the various matching
6236 items in a pattern behave as if the subject string is potentially com-
6238 for \b and \B the end of the subject is treated as a non-alphanumeric.
6240 The difference between the two partial matching options can be illus-
6247 "dog" with PCRE2_PARTIAL_SOFT, it yields a complete match for "dog".
6248 However, if PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
6254 In this case the result is always a complete match because that is
6255 found first, and matching never continues after finding a complete
6291 PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete,
6295 MULTI-SEGMENT MATCHING WITH pcre2_match()
6297 PCRE was not originally designed with multi-segment matching in mind.
6299 multi-segment matching possible have been added. A very long string can
6301 with the aim of achieving the same results that would happen if the en-
6313 When a partial match occurs, the next segment must be added to the cur-
6314 rent subject and the match re-run, using the startoffset argument of
6334 If there are memory constraints, you may want to discard text that pre-
6353 of characters that must be retained in order to get the right match re-
6358 use that to decide how much text to retain. The only lookbehind infor-
6363 maximum number of characters (not code units) that any individual look-
6368 In a non-UTF or a 32-bit case, moving back is just a subtraction, but
6369 in UTF-8 or UTF-16 you have to count characters while moving back
6376 without backtracking, searching for all possible matches simultane-
6377 ously. If the end of the subject is reached before the end of the pat-
6381 there have been no complete matches. Otherwise, the complete matches
6383 precedence over any complete matches. The portion of the string that
6388 there is no difference between greedy and ungreedy repetition, its be-
6394 Whereas the standard function stops as soon as it finds the complete
6399 MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
6403 and calling the function again with the same compiled regular expres-
6405 same working space as before, because this is where details of the pre-
6416 The first call has "23ja" as the subject, and requests partial match-
6418 (restarted) match. Notice that when the match is complete, only the
6419 last part is shown; PCRE2 does not retain the previously partially-
6433 match at one point in the subject are remembered. Depending on the ap-
6438 complete match, as described for pcre2_match() above. Another possibil-
6455 Copyright (c) 1997-2019 University of Cambridge.
6459 ------------------------------------------------------------------------------
6467 PCRE2 - Perl-compatible regular expressions (revised API)
6473 by PCRE2 are described in detail below. There is a quick-reference syn-
6475 and semantics as closely as it can. PCRE2 also supports some alterna-
6482 of which have copious examples. Jeffrey Friedl's "Mastering Regular Ex-
6487 This document discusses the regular expression patterns that are sup-
6491 not Perl-compatible. Some of the features discussed below are not
6493 of the alternative function, and how it differs from the normal func-
6497 SPECIAL START-OF-PATTERN ITEMS
6500 set by special items at the start of a pattern. These are not Perl-com-
6502 writers who are not able to change the program that processes the pat-
6503 tern. Any number of these items may appear, but they must all be to-
6509 In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
6510 as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
6511 can be specified for the 32-bit library, in which case it constrains
6522 restrict them to non-UTF data for security reasons. If the
6523 PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not al-
6530 causes sequences such as \d and \w to use Unicode properties to deter-
6532 less than 256 via a lookup table. It also causes upper/lower casing op-
6547 to whichever matching function is subsequently called to match the pat-
6548 tern. These options lock out the matching of empty strings, either en-
6551 Disabling auto-possessification
6559 Disabling start-up optimizations
6562 setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti-
6569 as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza-
6570 tions that apply to patterns whose top-level branches all start with .*
6590 These facilities are provided to catch runaway matches that are pro-
6591 voked by patterns with huge matching trees. A common example is a pat-
6601 where d is any number of decimal digits. However, the value of the set-
6624 strings: a single CR (carriage return) character, a single LF (line-
6625 feed) character, the two-character sequence CRLF, any of the three pre-
6630 It is also possible to specify a newline convention by starting a pat-
6640 These override the default and the options given to the compiling func-
6641 tion. For example, on a Unix system where LF is the default newline se-
6650 The newline convention affects where the circumflex and dollar asser-
6651 tions are true. It also affects the interpretation of the dot metachar-
6654 escape sequence matches. By default, this is any Unicode newline se-
6663 the complete set of Unicode line endings) by setting the option
6665 starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI-
6672 character code instead of ASCII or Unicode (typically a mainframe sys-
6673 tem). In the sections below, character code values are ASCII or Uni-
6691 their lower case ASCII equivalents, are case-equivalent with Unicode
6703 There are two different sets of metacharacters: those that are recog-
6721 Brace characters { and } are also used to enclose data for construc-
6724 and are ignored. In the case of quantifiers, they may also appear be-
6735 - indicates character range
6741 sequence, or between a # outside a character class and the next new-
6742 line, inclusive, are ignored. An escaping backslash can be used to in-
6764 always safe to precede a non-alphanumeric with backslash to specify
6765 that it stands for itself. In particular, if you want to match a back-
6768 Only ASCII digits and letters have any special meaning after a back-
6777 Perl, $ and @ cause variable interpolation. Also, Perl does "double-
6800 Non-printing characters
6802 A second use of backslash provides a way of encoding non-printing char-
6804 appearance of non-printing characters in a pattern, but when a pattern
6806 following escape sequences instead of the binary character it repre-
6807 sents. In an ASCII or Unicode environment, these escapes are as fol-
6811 \cx "control-x", where x is a non-control ASCII character
6824 By default, after \x that is not followed by {, from zero to two hexa-
6826 number of hexadecimal digits may appear between \x{ and }. If a charac-
6831 of the two syntaxes for \x or by an octal sequence. There is no differ-
6836 Support is available for some ECMAScript (aka JavaScript) escape se-
6837 quences via two compile-time options. If PCRE2_ALT_BSUX is set, the se-
6839 two hexadecimal digits is it recognized as a character escape. Other-
6840 wise it is interpreted as a literal "x" character. In this mode, sup-
6845 PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in ad-
6846 dition, \u{hhh..} is recognized as the character specified by hexadeci-
6847 mal code point. There may be any number of hexadecimal digits, but un-
6852 The \N{U+hhh..} escape sequence is recognized only when PCRE2 is oper-
6855 followed by an opening brace (curly bracket) it has an entirely differ-
6858 There are some legacy applications where the escape sequence \r is ex-
6864 point is in the range 32 to 126. The precise effect of \cx is as fol-
6869 point less than 32 or greater than 126, a compile-time error occurs.
6873 The \c escape is processed as specified for Perl in the perlebcdic doc-
6874 ument. The only characters that are allowed after \c are A-Z, a-z, or
6875 one of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
6877 letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
6878 \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c? be-
6890 FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
6891 certain other characters have POSIX-BC values, PCRE2 makes \c? generate
6895 than two digits, just those that are present are used. Thus the se-
6911 The handling of a backslash followed by a digit other than 0 is compli-
6914 Outside a character class, PCRE2 reads the digit and any following dig-
6917 groups in the expression, the entire sequence is taken as a backrefer-
6919 discussion of parenthesized groups. Otherwise, up to three octal dig-
6922 Inside a character class, PCRE2 handles \8 and \9 as the literal char-
6923 acters "8" and "9", and otherwise reads up to three octal digits fol-
6924 lowing the backslash, using them to generate a data character. Any sub-
6951 8-bit non-UTF mode no greater than 0xff
6952 16-bit non-UTF mode no greater than 0xffff
6953 32-bit non-UTF mode no greater than 0xffffffff
6957 (the so-called "surrogate" code points). The check for these can be
6960 UTF-8 and UTF-32 modes, because these values are not representable in
6961 UTF-16.
6979 However, if either of the PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX op-
6985 The sequence \g followed by a signed or unsigned number, optionally en-
6992 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
6996 \g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
7013 \W any "non-word" character
7018 has a different meaning. See the section entitled "Non-printing charac-
7022 Each pair of lower and upper case escape sequences partitions the com-
7031 (13), and space (32), which are defined as white space in the "C" lo-
7032 cale. This list may vary if locale-specific matching is taking place.
7033 For example, in some locales the "non-breaking space" character (\xA0)
7037 or digit. By default, the definition of letters and digits is con-
7038 trolled by PCRE2's low-valued character tables, and may vary if locale-
7040 page). For example, in a French locale such as "fr_FR" in Unix-like
7047 be different for characters in the range 128-255 when locale-specific
7049 meanings from before Unicode support was available, mainly for effi-
7058 The addition of \p{Mn} (non-spacing mark) and the replacement of an ex-
7059 plicit test for underscore with a test for \p{Pc} (connector punctua-
7072 reset within a pattern by means of an internal option setting (see be-
7082 U+00A0 Non-break space
7089 U+2004 Three-per-em space
7090 U+2005 Four-per-em space
7091 U+2006 Six-per-em space
7096 U+202F Narrow no-break space
7110 In 8-bit, non-UTF-8 mode, only the characters with code points less
7116 any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
7121 This is an example of an "atomic group", details of which are given be-
7122 low. This particular group matches either the two-character sequence
7124 U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car-
7126 atomic group, the two-character sequence is treated as a single unit
7130 than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
7135 the complete set of Unicode line endings) by setting the option
7136 PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for "back-
7138 the case, the other behaviour can be requested via the PCRE2_BSR_UNI-
7145 These override the default and the options given to the compiling func-
7146 tion. Note that these special settings, which are not Perl-compatible,
7149 used. They can be combined with a change of newline convention; for ex-
7155 Inside a character class, \R is treated as an unrecognized escape se-
7160 When PCRE2 is built with Unicode support (the default), three addi-
7162 are available. They can be used in any mode, though in 8-bit and 16-bit
7163 non-UTF modes these sequences are of course limited to testing charac-
7165 In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode
7166 limit) may be encountered. These are all treated as being in the Un-
7170 to do a multistage table lookup in order to find a character's prop-
7182 The property names represented by xx above are not case-sensitive, and
7186 (including newline), Bidi_Class, a number of binary (yes/no) proper-
7194 There are three different syntax forms for matching a script. Each Uni-
7202 a property type, for example, \p{Adlam}, it is treated as \p{scx:Ad-
7206 Unassigned characters (and in non-UTF 32-bit mode, characters with code
7208 that are not part of an identified script are lumped together as "Com-
7209 mon". The current list of recognized script names and their 4-character
7212 pcre2test -LS
7217 Each character has exactly one Unicode general category property, spec-
7218 ified by a two-letter abbreviation. For compatibility with Perl, nega-
7223 If only one letter is specified with \p or \P, it includes all the gen-
7250 Mn Non-spacing mark
7282 points are in the range U+D800 to U+DFFF. These characters are no dif-
7284 16-bit or 32-bit library). However, they are not valid in Unicode
7285 strings and so cannot be tested by PCRE2 in UTF mode, unless UTF valid-
7293 No character that is in the Unicode table has the Cn (unassigned) prop-
7308 pcre2test -LP
7327 L left-to-right
7328 LRE left-to-right embedding
7329 LRI left-to-right isolate
7330 LRO left-to-right override
7331 NSM non-spacing mark
7335 R right-to-left
7336 RLE right-to-left embedding
7337 RLI right-to-left isolate
7338 RLO right-to-left override
7343 case-insensitive; only the short names listed above are recognized.
7354 properties that had been used for emojis. Instead it introduced vari-
7355 ous emoji-specific properties. PCRE2 uses only the Extended Picto-
7364 2. Do not end between CR and LF; otherwise end after any control char-
7370 be followed by a V or T character; an LVT or T character may be fol-
7373 4. Do not end before extending characters or spacing marks or the zero-
7374 width joiner (ZWJ) character. Characters with the "mark" property al-
7379 6. Do not end within emoji modifier sequences or emoji ZWJ (zero-width
7385 7. Do not break within emoji flag sequences. That is, do not break be-
7393 As well as the standard Unicode properties described above, PCRE2 sup-
7394 ports four more that make it possible to convert traditional escape se-
7396 non-standard, non-Perl properties internally when PCRE2_UCP is set.
7404 Xan matches characters that have either the L (letter) or the N (num-
7407 (separator) property. Xsp is the same as Xps; in PCRE1 it used to ex-
7409 matches the same characters as Xan, plus those that match Mn (non-spac-
7412 There is another non-standard property, Xuc, which matches any charac-
7419 Note that the Xuc property does not match these sequences but the char-
7425 characters not to be included in the final matched sequence that is re-
7436 mode), though it again reports the matched string as "bar". This fea-
7438 part of the pattern that precedes \K is not constrained to match a lim-
7440 The use of \K does not interfere with the setting of captured sub-
7447 From version 5.32.0 Perl forbids the use of \K in lookaround asser-
7450 pcre2_compile() to re-enable the previous behaviour. When this option
7467 The final use of backslash is for certain simple assertions. An asser-
7490 changed by setting the PCRE2_UCP option. When this is done, it also af-
7499 set. Thus, they are independent of multiline mode. These three asser-
7501 which affect only the behaviour of the circumflex and dollar metachar-
7502 acters. However, if the startoffset argument of pcre2_match() is non-
7509 the start point of the matching process, as specified by the startoff-
7511 startoffset is non-zero. By calling pcre2_match() multiple times with
7529 The circumflex and dollar metacharacters are zero-width assertions.
7530 That is, they test for a particular condition being true without con-
7532 are concerned with matching the starts and ends of lines. If the new-
7533 line convention is set so that only the two-character sequence CRLF is
7539 point is at the start of the subject string. If the startoffset argu-
7540 ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum-
7542 character class, circumflex has an entirely different meaning (see be-
7549 if the pattern is constrained to match only at the start of the sub-
7554 matching point is at the end of the subject string, or immediately be-
7555 fore a newline at the end of the string (by default), unless PCRE2_NO-
7556 TEOL is set. Note, however, that it does not actually match the new-
7559 branch in which it appears. Dollar has no special meaning in a charac-
7579 pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
7582 When the newline convention (see "Newline conventions" below) recog-
7583 nizes the two-character sequence CRLF as a newline, this is preferred,
7584 even if the single characters CR and LF are also recognized as new-
7598 Outside a character class, a dot in the pattern matches any one charac-
7599 ter in the subject string except (by default) a character that signi-
7603 Dot never matches a single line-ending character. When the two-charac-
7613 exception. If the two-character sequence CRLF is present in the sub-
7616 The handling of dot is entirely independent of the handling of circum-
7626 the section entitled "Non-printing characters" above for details. Perl
7634 unit, whether or not a UTF mode is set. In the 8-bit library, one code
7635 unit is one byte; in the 16-bit library it is a 16-bit unit; in the
7636 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
7637 line-ending characters. The feature is provided in Perl in order to
7638 match individual bytes in UTF-8 mode, but it is unclear how it can use-
7642 one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
7643 string may start with a malformed UTF character. This has undefined re-
7645 in a valid UTF string (by default it checks the subject string's valid-
7654 below) in UTF-8 or UTF-16 modes, because this would make it impossible
7657 these UTF modes. The former gives a match-time error; the latter fails
7660 In the 32-bit library, however, \C is always supported (when not ex-
7662 whether or not UTF-32 is specified.
7665 using it that avoids the problem of malformed UTF-8 or UTF-16 charac-
7667 as in this pattern, which could be used with a UTF-8 string (ignore
7670 (?| (?=[\x00-\x7f])(\C) |
7671 (?=[\x80-\x{7ff}])(\C)(\C) |
7672 (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
7673 (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
7677 below). The assertions at the start of each branch check the next UTF-8
7678 character for values whose encoding uses 1, 2, 3, or 4 bytes, respec-
7679 tively. The character's individual bytes are then captured by the ap-
7686 closing square bracket. A closing square bracket on its own is not spe-
7705 class that starts with a circumflex is not an assertion; it still con-
7711 letters in a class represent both their upper case and lower case ver-
7714 would. Note that there are two ASCII characters, K and S, that, in ad-
7715 dition to their lower case ASCII equivalents, are case-equivalent with
7716 Unicode U+212A (Kelvin sign) and U+017F (long S) respectively when ei-
7720 special way when matching character classes, whatever line-ending se-
7728 matches any hexadecimal digit. In UTF modes, the PCRE2_UCP option af-
7733 backspace character. The sequences \B, \R, and \X are not special in-
7738 The minus (hyphen) character can be used to specify a range of charac-
7739 ters in a character class. For example, [d-m] matches any letter be-
7744 [b-d-z] matches letters in the range b to d, a hyphen character, or z.
7753 It is not possible to have the literal character "]" as the end charac-
7754 ter of a range. A pattern such as [W-]46] is interpreted as a class of
7755 two characters ("W" and "-") followed by a literal string "46]", so it
7756 would match "W46]" or "-46]". However, if the "]" is escaped with a
7757 backslash it is interpreted as the end of range, so [W-\]46] is inter-
7762 Ranges normally include all code points between the start and end char-
7763 acters, inclusive. They can also be used for code points specified nu-
7764 merically, for example [\000-\037]. Ranges can include any characters
7765 that are valid for the current mode. In any UTF mode, the so-called
7768 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables this check). How-
7769 ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
7773 points are both specified as literal letters in the same case. For com-
7775 letters are omitted. For example, [h-k] matches only four characters,
7778 [\x88-\x92] or [h-\x92], all code points are included.
7781 it matches the letters in either case. For example, [W-c] is equivalent
7782 to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
7783 character tables for a French locale are in use, [\xc8-\xcb] matches
7797 special compatibility feature - see the next two sections), and the
7798 terminating closing square bracket. However, escaping other non-al-
7815 ascii character codes 0 - 127
7829 CR (13), and space (32). If locale-specific matching is taking place,
7840 matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
7845 the POSIX character classes, although this may be different for charac-
7846 ters in the range 128-255 when locale-specific matching is happening.
7866 when printed. In Unicode property terms, it matches all char-
7871 U+2066 - U+2069 Various "isolate"s
7878 [:punct:] This matches all characters that have the Unicode P (punctua-
7892 ASCII characters when PCRE2_UCP is set. The option PCRE2_EX-
7893 TRA_ASCII_DIGIT affects just [:digit:] and [:xdigit:]. Within a pat-
7894 tern, this can be set and unset by (?aT) and (?-aT). The PCRE2_EX-
7896 including [:digit:] and [:xdigit:]. Within a pattern, (?aP) and (?-aP)
7913 that \b matches at the start and the end of a word (see "Simple asser-
7914 tions" above), and in a Perl-style pattern the preceding or following
7915 character normally shows which is wanted, without the need for the as-
7916 sertions that are used above in order to give exactly the POSIX behav-
7918 (and therefore \b) by default, so it also affects these POSIX se-
7941 Perl-compatible, and are described in detail in the pcre2api documenta-
7951 For example, (?im) sets caseless, multiline matching. It is also possi-
7952 ble to unset these options by preceding the relevant letters with a hy-
7953 phen, for example (?-im). The two "extended" options are not indepen-
7956 A combined setting and unsetting such as (?im-sx), which sets
7960 the option is unset. An empty options setting "(?)" is allowed. Need-
7965 cause some options to be re-instated, but a hyphen may not appear.
7967 Some PCRE2-specific options can be changed by the same mechanism using
7979 However, except for 'r', these are not unset by (?^), which is equiva-
7980 lent to (?-imnrsx). If 'a' is not followed by any of the upper case
7983 PCRE2_EXTRA_ASCII_DIGIT has no additional effect when PCRE2_EX-
7984 TRA_ASCII_POSIX is set, but including it in (?aP) means that (?-aP)
7987 When one of these option changes occurs at top level (that is, not in-
8008 start of a non-capturing group (see the next section), the option let-
8016 Note: There are other PCRE2-specific options, applying to the whole
8017 pattern, which can be set by the application when the compiling func-
8023 are equivalent to setting the PCRE2_UTF and PCRE2_UCP options, respec-
8041 2. It creates a "capture group". This means that, when the whole pat-
8053 the captured substrings are "red king", "red", and "king", and are num-
8057 helpful. There are often times when grouping is required without cap-
8058 turing. If an opening parenthesis is followed by a question mark and a
8069 start of a non-capturing group, the option letters may appear between
8077 the group is reached, an option setting in one branch does affect sub-
8078 sequent branches, so the above patterns match "SUNDAY" as well as "Sat-
8086 with (?| and is itself a non-capturing group. For example, consider
8091 Because the two alternatives are inside a (?| group, both sets of cap-
8092 turing parentheses are numbered one. Thus, when the pattern matches,
8095 not all, of one of a number of alternatives. Inside a (?| group, paren-
8098 whole group start after the highest number used in any branch. The fol-
8099 lowing example is taken from the Perl documentation. The numbers under-
8102 # before ---------------branch-reset----------- after
8117 A relative reference such as (?-1) is no different: it is just a conve-
8120 If a condition test for a group's having matched refers to a non-unique
8133 was not added to Perl until release 5.10. Python had the feature ear-
8141 must start with a non-digit. When PCRE2_UTF is set, the syntax of group
8145 ^[_A-Za-z][_A-Za-z0-9]*\z when PCRE2_UTF is not set
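As a rough illustration of the non-UTF name rule above, the sketch below checks candidate group names against that pattern using Python's `re` module rather than PCRE2 itself. Two things here are assumptions and not part of PCRE2: the helper name `is_valid_group_name` is hypothetical, and `\Z` stands in for `\z` because older Python versions do not accept `\z` (in Python, `\Z` matches only at the very end of the string).

```python
import re

# Non-UTF rule quoted above, with \Z substituted for \z (see the lead-in).
NAME_RE = re.compile(r"^[_A-Za-z][_A-Za-z0-9]*\Z")


def is_valid_group_name(name: str) -> bool:
    """Return True if name has the shape of a non-UTF group name."""
    return NAME_RE.match(name) is not None


# A few candidates: the first two fit the rule, the rest do not.
for candidate in ["_tmp", "R1", "1st", "a-b", ""]:
    print(repr(candidate), is_valid_group_name(candidate))
```

This checks only the character rule shown in the pattern; any other constraints PCRE2 places on names are outside the sketch.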
8156 complete name-to-number translation table from a compiled pattern, as
8160 Warning: When more than one capture group has the same number, as de-
8173 number to be associated with more than one name. The example above pro-
8174 vokes a compile-time error. However, there is still scope for confu-
8183 By default, a name must be unique within a pattern, except that dupli-
8188 The duplicate name constraint can be disabled by setting the PCRE2_DUP-
8194 of a weekday, either as a 3-letter abbreviation or as the full name,
8212 If you make a backreference to a non-unique named group from elsewhere
8221 If you make a subroutine call to a non-unique named group, the one that
8229 true. This is the same behaviour as testing by number. For further de-
8280 of a quantifier, the brace is taken as a literal character. In particu-
8283 Note that not every opening brace is potentially the start of a quanti-
8289 of which is represented by a two-byte sequence in a UTF-8 string. Simi-
8295 the previous item and the quantifier were not present. This may be use-
8296 ful for capture groups that are referenced as subroutines from else-
8297 where in the pattern (but see also the section entitled "Defining cap-
8302 For convenience, the three most common quantifiers have single-charac-
8320 does not prevent backtracking into any of the iterations if a subse-
8344 does the right thing with C comments. The meaning of the various quan-
8361 that is greater than 1 or with a limited maximum, more memory is re-
8362 quired for the compiled pattern, in proportion to the size of the mini-
8366 (equivalent to Perl's /s) is set, thus allowing the dot to match new-
8369 so there is no point in retrying the overall match at any position af-
8373 In cases where it is known that the subject string contains no new-
8374 lines, it is worth setting PCRE2_DOTALL in order to obtain this opti-
8384 If the subject is "xyz123abc123" the match point is the fourth charac-
8387 Another case where implicit anchoring is not applied is when the lead-
8393 It matches "ab" in the subject "aab". The use of the backtracking con-
8403 is "tweedledee". However, if there are nested capture groups, the cor-
8416 to be re-evaluated to see if a different number of repeats allows the
8432 re-evaluated in this way.
8457 the number of digits they match in order to make the rest of the pat-
8462 group is just a single repeated item, as in the example above, a sim-
8463 pler notation, called a "possessive quantifier" can be used. This con-
8474 Possessive quantifiers are always greedy; the setting of the PCRE2_UN-
8475 GREEDY option is ignored. They are a convenient notation for the sim-
8481 The possessive quantifier syntax is an extension to the Perl 5.8 syn-
8490 when B must follow. This feature can be disabled by the PCRE2_NO_AUTO-
8493 When a pattern contains an unlimited repeat inside a group that can it-
8500 matches an unlimited number of substrings that either consist of non-
8508 * repeat in a large number of ways, and all have to be tried. (The ex-
8518 sequences of non-digits cannot be broken, and failure happens quickly.
8531 words, the group that is referenced need not be to the left of the ref-
8539 subsection entitled "Non-printing characters" above for further details
8540 of the handling of digits following a backslash. Other forms of back-
8553 An unsigned number specifies an absolute reference without the ambigu-
8558 (abc(def)ghi)\g{-1}
8560 The sequence \g{-1} is a reference to the capture group whose number is
8563 \2, and \g{-2} would be equivalent to \1. Note that if this construct
8565 this example \g{-2} also refers to group 1:
8567 (A)(\g{-2}B)
8587 time of the backreference, the case of letters is relevant. For exam-
8618 the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref-
8621 Because there may be many capture groups in a pattern, all digits fol-
8622 lowing a backslash are taken as part of a potential backreference num-
8632 However, such references can be useful inside repeated groups. For ex-
8637 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
8638 ation of the group, the backreference matches the character string cor-
8641 the backreference. This can be done using alternation, as in the exam-
8659 subject string, and those that look behind it, and in each case an as-
8662 group is matched in the normal way, and if it is true, matching contin-
8663 ues after it, but with the matching position in the subject string re-
8666 The Perl-compatible lookaround assertions are atomic. If an assertion
8667 is true, but there is a subsequent matching failure, there is no back-
8668 tracking into the assertion. However, there are some cases where non-
8669 atomic assertions can be useful. PCRE2 has some support for these, de-
8670 scribed in the section entitled "Non-atomic assertions" below, but they
8671 are not Perl-compatible.
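The atomic behaviour described above can be seen with a small experiment. The sketch below uses Python's `re` module, which is a separate engine from PCRE2 but treats Perl-style lookaheads the same way in this respect; the patterns and the subject string are made up for illustration.

```python
import re

subject = "abc"

# Ordinary group: \w+ first takes "abc", then gives back "c" so that
# the trailing \w can match.
plain = re.search(r"^(\w+)\w", subject)

# Lookahead version: the assertion captures "abc" once and \1 then
# consumes it. When the trailing \w fails, the engine does not go back
# into the assertion to try a shorter capture, so the match fails.
atomic = re.search(r"^(?=(\w+))\1\w", subject)

print(plain.group(1) if plain else None)
print(atomic)
```

The only difference between the two patterns is that the second wraps the capture in a lookahead, which is what removes the backtracking.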
8677 Assertion groups are not capture groups. If an assertion contains cap-
8679 the capture groups in the whole pattern. Within each branch of an as-
8681 way. For example, a sequence such as (.)\g{-1} can be used to check
8688 retained after a successful negative assertion. When an assertion con-
8691 For a positive assertion, internally captured substrings in the suc-
8692 cessful branch are retained, and matching continues with the next pat-
8701 Most assertion groups may be repeated; though it makes no sense to as-
8702 sert the same thing several times, the side effect of capturing in pos-
8713 to specify lookaround assertions. Perl 5.28 introduced some experimen-
8715 start with (* instead of (? and must be written using lower case let-
8723 For example, (*pla:foo) is the same assertion as (?=foo). In the fol-
8724 lowing sections, the various assertions are described using the origi-
8734 matches a word followed by a semicolon, but does not include the semi-
8750 most convenient way to do it is with (?!) because an empty string al-
8767 If every top-level alternative matches a fixed length, for example
8777 in which one or more top-level alternatives can match more than one
8783 to a value set by the calling program (default 255 characters). Unlim-
8785 escape sequence \K (see above) can be used instead of a lookbehind as-
8786 sertion at the start of a pattern to get round the length limit re-
8789 In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which
8792 the lookbehind. The \X and \R escapes, which can match different num-
8796 lookbehinds, as long as the called capture group matches a limited-
8800 PCRE2 supports backreferences in lookbehinds, but only if certain con-
8804 Of course, the referenced group must itself match a limited length sub-
8810 Possessive quantifiers can be used in conjunction with lookbehind as-
8817 proceeds from left to right, PCRE2 will look for each "a" in the sub-
8832 quantifier; it can match only the entire string. The subsequent lookbe-
8847 three characters are not "999". This pattern does not match "foo" pre-
8849 three of which are not "999". For example, it doesn't match "123abc-
8871 NON-ATOMIC ASSERTIONS
8874 is true, but there is a subsequent matching failure, there is no back-
8875 tracking into the assertion. However, there are some cases where non-
8882 Consider the problem of finding the right-most word in a string that
8891 and sets the "x" option, which causes white space (introduced for read-
8895 words, when the assertion first succeeds, it captures the right-most
8901 succeeds, we are done, but if the last word in the string does not oc-
8903 lookahead (?= or (*pla: had been used, the assertion could not be re-
8907 Using a non-atomic lookahead, however, means that when the last word
8909 find the second-last word, and so on, until either the match succeeds,
8912 Two conditions must be met for a non-atomic assertion to be useful: the
8917 using a non-atomic assertion just wastes resources.
8919 There is one exception to backtracking into a non-atomic assertion. If
8920 an (*ACCEPT) control verb is triggered, the assertion succeeds atomi-
8924 Non-atomic assertions are not supported by the alternative matching
8942 matches are not a script run. After a failure, normal backtracking oc-
8943 curs. Script runs can be used to detect spoofing attacks using charac-
8945 "paypal.com" is an infamous example, where the letters could be a mix-
8946 ture of Latin and Cyrillic. This pattern ensures that the matched char-
8947 acters in a sequence of non-spaces that follow white space are a script
8963 \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
8977 Unicode support. A compile-time error is given if any of the above con-
8979 matching function, pcre2_dfa_match() because they use the same mecha-
8994 (?(condition)yes-pattern)
8995 (?(condition)yes-pattern|no-pattern)
8997 If the condition is satisfied, the yes-pattern is used; otherwise the
8998 no-pattern (if present) is used. An absent no-pattern is equivalent to
8999 an empty string (it always matches). If there are more than two alter-
9000 natives in the group, a compile-time error occurs. Each of the two al-
9001 ternatives may itself contain nested groups of any form, including con-
9009 There are five kinds of condition: references to capture groups, refer-
9010 ences to recursion, two pseudo-conditions called DEFINE and VERSION,
9023 enclosing this condition) can be referenced by (?(-1), the next most
9024 recent by (?(-2), and so on. Inside loops it can also make sense to re-
9027 is not used; it provokes a compile-time error.
9029 Consider the following pattern, which contains non-significant white
9036 character is present, sets it as the first captured substring. The sec-
9040 opening parenthesis, the condition is true, and so the yes-pattern is
9041 executed and a closing parenthesis is required. Otherwise, since no-
9043 words, this pattern matches a sequence of non-parentheses, optionally
9049 ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
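A minimal sketch of the conditional-group idea, written for Python's `re` module rather than PCRE2. It assumes an absolute reference `(?(1)...)` in place of the relative `(?(-1)...)` shown above, because Python's `re` does not accept relative condition references, and it leaves out the "other stuff" surrounding the fragment. `re.VERBOSE` plays the role of the extended option that makes the white space non-significant.

```python
import re

# Group 1 optionally captures "(". The conditional (?(1) \) ) demands a
# closing ")" only when group 1 took part in the match.
PAREN_RE = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

# Try balanced, bare, and unbalanced inputs against the whole string.
for text in ["abc", "(abc)", "(abc", "abc)"]:
    print(text, bool(PAREN_RE.fullmatch(text)))
```

`fullmatch` is used so that a stray parenthesis cannot simply be skipped by matching a shorter substring.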
9060 the letter R followed by digits are ambiguous (see the following sec-
9071 "Recursion" in this sense refers to any subroutine-like call from one
9072 part of the pattern to another, whether or not it is actually recur-
9077 the name R, the condition is true if matching is currently in a recur-
9087 name, the condition tests for its being set, as described in the sec-
9089 group with the name R1 by adding (?<R1>) to the above pattern com-
9110 be only one alternative in the rest of the conditional group. It is al-
9112 DEFINE is that it can be used to define subroutines that can be refer-
9117 (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
9125 to match the four dot-separated components of an IPv4 address, insist-
9130 Programs that link with a PCRE2 library can check the version by call-
9132 that do not have access to the underlying code cannot do this. A spe-
9148 or lookbehind assertion. However, it must be a traditional atomic as-
9149 sertion, not one of the non-atomic assertions.
9151 Consider this pattern, again containing non-significant white space,
9154 (?(?=[^a-z]*[a-z])
9155 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
9157 The condition is a positive lookahead assertion that matches an op-
9158 tional sequence of non-letters followed by a letter. In other words, it
9159 tests for the presence of at least one letter in the subject. If a let-
9162 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
9165 When an assertion that is a condition contains capture groups, any cap-
9166 turing that occurs in a matching branch is retained afterwards, for
9167 both positive and negative assertions, because matching always contin-
9168 ues after the assertion, whether it succeeds or fails. (Compare non-
9169 conditional assertions, for which captures are retained only for posi-
9188 at the start of the pattern, as described in the section entitled "New-
9192 when PCRE2_EXTENDED is set, and the default newline convention (a sin-
9211 For some time, Perl has provided a facility that allows regular expres-
9222 Obviously, PCRE2 cannot support the interpolation of Perl code. In-
9231 group. (If not, it is a non-recursive subroutine call, which is de-
9232 scribed in the next section.) The special item (?R) or (?0) is a recur-
9241 substrings which can either be a sequence of non-parentheses, or a re-
9244 possessive quantifier to avoid backtracking into sequences of non-
9257 of (?1) in the pattern above you can write (?-2) to refer to the second
9266 (?|(a)|(b)) (c) (?-2)
9269 (c) is number 2. When the reference (?-2) is encountered, the second
9272 the same if an absolute reference (?1) was used. In other words, rela-
9278 are always non-recursive subroutine calls, as described in the next
9282 for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup-
9290 The example pattern that we have been looking at contains nested unlim-
9292 strings of non-parentheses is important when applying the pattern to
9304 callout function can be used (see below and the pcre2callout documenta-
9316 recursion. Consider this pattern, which matches text in angle brack-
9318 brackets (that is, when recursing), whereas any characters are permit-
9324 different alternatives for the recursive and non-recursive cases. The
9334 never re-entered, even if it contained untried alternatives and there
9339 treated as atomic. That is, they can be re-entered to try unused alter-
9344 Supporting backtracking into recursions simplifies certain types of re-
9353 match fails. If you want to match typical palindromic phrases, the pat-
9354 tern has to ignore all non-word characters, which can be done like
9360 such as "A man, a plan, a canal: Panama!". Note the use of the posses-
9361 sive quantifier *+ to avoid backtracking into sequences of non-word
9388 to match at the current matching position. The called group may be de-
9389 fined before or after the reference. A numbered reference can be ab-
9393 (...(relative)...)...(?-1)...
9414 Processing options such as case-independence are fixed when a group is
9418 (abc)(?i:(?-1))
9430 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
9432 an alternative syntax for calling a group as a subroutine, possibly re-
9442 (abc)(?i:\g<-1>)
9453 This makes it possible, amongst other things, to extract different sub-
9454 strings that match the same pair of parentheses when there is a repeti-
9457 PCRE2 provides a similar feature, but of course it cannot obey arbi-
9462 passed, or if the callout entry point is set to NULL, callouts are dis-
9474 During matching, when PCRE2 reaches a callout point, the external func-
9481 time, and one side-effect is that sometimes callouts are skipped. If
9483 disable the relevant optimizations. More details, including a complete
9497 They are all numbered 255. If there is a conditional group in the pat-
9509 A delimited string may be used instead of a number as a callout argu-
9511 ending delimiter is the same as the start, except for {, where the end-
9534 PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati-
9540 and sequences such as \x{100} that define character code points. Char-
9546 names is skipped, and #-comments are recognized, exactly as in the rest
9550 The maximum length of a name is 255 in the 8-bit library and 65535 in
9551 the 16-bit and 32-bit libraries. If the name is empty, that is, if the
9553 the colon were not there. Any number of these verbs may occur in a pat-
9557 them can be used only when the pattern is to be matched using the tra-
9574 course, be processed. You can suppress the start-of-match optimizations
9575 by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com-
9592 then continues at the outer level. If (*ACCEPT) is triggered in a posi-
9596 If (*ACCEPT) is inside capturing parentheses, the data so far is cap-
9601 This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
9604 (*ACCEPT) is the only backtracking verb that is allowed to be quanti-
9612 is triggered and the match succeeds. In both cases, all but C is cap-
9613 tured. Whereas (*COMMIT) (see below) means "fail on backtrack", a re-
9616 Warning: (*ACCEPT) should not be used within a script run group, be-
9626 are not present in PCRE2. The nearest equivalent is the callout fea-
9634 (*ACCEPT:NAME) and (*FAIL:NAME) behave the same as (*MARK:NAME)(*AC-
9640 There is one verb whose main purpose is to track how a match was ar-
9641 rived at, though it also has a secondary use in conjunction with ad-
9646 A name is always required with this verb. For all the other backtrack-
9649 When a match succeeds, the last-encountered mark name on
9650 the matching path is passed back to the caller as described in the sec-
9651 tion entitled "Other information about the match" in the pcre2api docu-
9670 The (*MARK) name is tagged with "MK:" in this output, and in this exam-
9672 efficient way of obtaining this information than putting each alterna-
9676 true, the name is recorded and passed back if it is the last-encoun-
9698 The following verbs do nothing when they are encountered. Matching con-
9700 causing a backtrack to the verb, a failure is forced. That is, back-
9704 group has been matched, there is never any backtracking into it. Back-
9708 These verbs differ in exactly what kind of failure occurs when back-
9710 when the verb is not in a subroutine or an assertion. Subsequent sec-
9716 matching failure that causes backtracking to reach it. Even if the pat-
9719 verb that is encountered, once it has been passed pcre2_match() is com-
9728 The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM-
9729 MIT). It is like (*MARK:NAME) in that the name is remembered for pass-
9731 that are set with (*MARK), ignoring those set by any of the other back-
9739 Note that (*COMMIT) at the start of a pattern is not the same as an an-
9740 chor, unless PCRE2's start-of-match optimizations are turned off, as
9762 the subject if there is a later matching failure that causes backtrack-
9768 (*PRUNE) is just an alternative to an atomic group or possessive quan-
9782 character, but to the position in the subject where (*SKIP) was encoun-
9791 skips on to start the next attempt at "c". Note that a possessive quan-
9793 suppress backtracking during the first match attempt, the second at-
9807 found, the "bumpalong" advance is to the subject position that corre-
9813 atomic groups or assertions, because they are never re-entered by back-
9831 backtracks, and this causes a new matching attempt to start at the sec-
9841 This verb causes a skip to the next innermost alternative when back-
9844 that it can be used for a pattern-based if-then-else block:
9851 into COND1. If that succeeds and BAR fails, COND3 is tried. If subse-
9852 quently BAZ fails, there are no more alternatives, so there is a back-
9853 track to whatever came before the entire group. If (*THEN) is not in-
9861 A group that does not contain a | character is just a part of the en-
9862 closing alternative; it is not a nested alternation with only one al-
9863 ternative. The effect of (*THEN) extends beyond such a group to the en-
9864 closing alternative. Consider this pattern, where A, B, etc. are com-
9877 The effect of (*THEN) is now confined to the inner group. After a fail-
9882 Note that a conditional group is not considered as having two alterna-
9889 If the subject is "ba", this pattern does not match. Because .*? is un-
9909 that is backtracked onto first acts. For example, consider this pat-
9947 name (if set) are retained. In a standalone negative assertion, (*AC-
9948 CEPT) causes the assertion to fail without any further processing; cap-
9956 reach them. This means that, for the Perl-compatible assertions, their
9958 are atomic. A backtrack that occurs after such an assertion is complete
9963 PCRE2 now supports non-atomic positive assertions, as described in the
9964 section entitled "Non-atomic assertions" above. These assertions must
9965 be standalone (not used as conditions). They are not Perl-compatible.
9966 For these assertions, a later backtrack does jump back into the asser-
9967 tion, and therefore verbs such as (*COMMIT) can be triggered by back-
9975 in a standalone positive assertion. In a conditional positive asser-
9977 or (*PRUNE) causes the condition to be false. However, for both stand-
9979 (*SKIP), or (*PRUNE) causes the assertion to be true, without consider-
9987 to succeed without any further processing. Matching then continues af-
9988 ter the subroutine call. Perl documents this behaviour. Perl's treat-
9995 when triggered by being backtracked to in a group called as a subrou-
10020 Copyright (c) 1997-2024 University of Cambridge.
10024 ------------------------------------------------------------------------------
10032 PCRE2 - Perl-compatible regular expressions (revised API)
10037 Two aspects of performance are discussed below: memory usage and pro-
10062 is not usually a problem. However, if the numbers are large, and par-
10068 uses over 50KiB when compiled using the 8-bit library. When PCRE2 is
10070 limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
10071 libraries, and this is reached with the above pattern if the outer rep-
10077 of PCRE2's "subroutine" facility. Re-writing the above pattern as
10083 this kind of pattern is not always exactly equivalent, because any cap-
10086 process patterns that PCRE2 cannot otherwise handle. The matching per-
10088 same. (This applies from release 10.30 - things were different in ear-
10094 From release 10.30, the interpretive (non-JIT) version of pcre2_match()
10095 uses very little system stack at run time. In earlier releases recur-
10097 cause problems, but this usage has been eliminated. Backtracking posi-
10103 On a 64-bit system the frame size for a pattern with no captures is 128
10107 the system stack, but this still caused some issues for multi-thread
10112 block and re-used if that block is used for another match. It is freed
10124 function calls, but only for processing atomic groups, lookaround as-
10129 has been re-factored to use heap memory when necessary for internal
10140 Certain items in regular expression patterns are processed more effi-
10142 [aeiou] than a set of single-character alternatives such as
10146 expressions for efficient performance. This document contains a few ob-
10150 slow, because PCRE2 has to use a multi-stage table lookup whenever it
10160 pcre2_match(); the performance loss is less with a DFA matching func-
10163 When a pattern begins with .* not in atomic parentheses, nor in paren-
10167 multiple top-level branches, they must all be anchorable. The optimiza-
10168 tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is au-
10171 If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, be-
10173 subject string contains newlines, the pattern may match from the char-
10184 If you are using such a pattern with subject strings that do not con-
10186 PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate ex-
10187 plicit anchoring. That saves PCRE2 from having to scan along the sub-
10201 in principle to try every possible variation, and this can take an ex-
10209 matching procedure, PCRE2 checks that there is a "b" later in the sub-
10210 ject string, and if there is not, it fails the match immediately. How-
10221 an atomic group or a possessive quantifier. This can often reduce mem-
10232 matched character. For a long string, a lot of memory is required. Con-
10238 This runs much faster, because sequences of characters that do not con-
10239 tain "<" are "swallowed" in one item inside the parentheses, and a pos-
10241 non-"<" characters. This version also uses a lot less memory because
10256 pcre2_match() or pcre2_dfa_match() is called. For details of these in-
10276 Copyright (c) 1997-2022 University of Cambridge.
10280 ------------------------------------------------------------------------------
10288 PCRE2 - Perl-compatible regular expressions (revised API)
10309 This set of functions provides a POSIX-style API for the PCRE2 regular
10310 expression 8-bit library. There are no POSIX-style wrappers for PCRE2's
10311 16-bit and 32-bit libraries. See the pcre2api documentation for a de-
10312 scription of PCRE2's native API, which contains much additional func-
10315 IMPORTANT NOTE: The functions described here are NOT thread-safe, and
10316 should not be used in multi-threaded applications. They are also lim-
10317 ited to processing subjects that are not bigger than 2GB. Use the na-
10326 risk of accidentally linking with POSIX functions from a different li-
10329 On Unix-like systems the PCRE2 POSIX library is called libpcre2-posix,
10330 so can be accessed by adding -lpcre2-posix to the command for linking
10332 also necessary to add -lpcre2-8.
10341 regcomp() etc. These simply passed their arguments to the PCRE2 func-
10354 names start with "REG_"; these are used for setting options and identi-
10360 Note that these functions are just POSIX-style wrappers for PCRE2's na-
10362 they are not thread-safe or even POSIX compatible.
10373 PCRE2-specific features via the POSIX calling interface or to add BSD
10377 POSIX-like in style. The syntax and semantics of the regular expres-
10379 various PCRE2 options, as described below. "POSIX-like in style" means
10381 POSIX-compatible, and in multi-unit encoding domains it is probably
10391 The function pcre2_regcomp() is called to compile a pattern into an in-
10392 ternal form. By default, the pattern is a C string terminated by a bi-
10396 REG_PEND is set. The regex_t structure used by pcre2_regcomp() is de-
10398 other libraries that provide POSIX-style matching.
10418 the defined POSIX behaviour for REG_NEWLINE (see the following sec-
10424 for compilation to the native function. This disables all meta charac-
10433 pcre2_regexec() for matching, the nmatch and pmatch arguments are ig-
10434 nored, and no captured strings are returned. Versions of the PCRE2 li-
10435 brary prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile op-
10436 tion, but this no longer happens because it disables the use of back-
10443 the end of the pattern before calling pcre2_regcomp(). The pattern it-
10444 self may now contain binary zeros, which are treated as data charac-
10467 all data strings used for matching it to be treated as UTF-8 strings.
10471 function. This means that the regex is compiled with PCRE2 default se-
10475 It does not affect the way newlines are matched by the dot metacharac-
10478 The yield of pcre2_regcomp() is zero on success, and non-zero other-
10481 number of capturing subpatterns in the regular expression. Various er-
10484 NOTE: If the yield of pcre2_regcomp() is non-zero, you must not attempt
10506 This is the equivalent table for a POSIX-compatible pattern matcher:
10523 there is no way to make PCRE2 behave exactly as for the REG_NEWLINE ac-
10539 The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match-
10546 standard. However, setting this option can give more POSIX-like behav-
10551 The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match-
10558 point to the first character beyond the string. There may be binary ze-
10565 relative to string + pmatch[0].rm_so, but this differs from other im-
10570 intended to be portable to other systems. Note that a non-zero rm_so
10578 pcre2_regexec() are ignored (except possibly as input for REG_STAR-
10581 The value of nmatch may be zero, and the value pmatch may be NULL (un-
10593 Unused entries in the array have both structure members set to -1.
10597 other similarly named types from other libraries that provide POSIX-
10600 A successful match yields a zero return; various error codes are de-
10601 fined in the header file, of which REG_NOMATCH is the "expected" fail-
10607 The pcre2_regerror() function maps a non-zero errorcode from either
10611 buffer is too short, only the first errbuf_size - 1 characters of the
10619 Compiling a regular expression causes memory to be allocated and asso-
10621 such memory, after which preg may no longer be used as a compiled ex-
10635 Copyright (c) 1997-2024 University of Cambridge.
10639 ------------------------------------------------------------------------------
10647 PCRE2 - Perl-compatible regular expressions (revised API)
10652 A simple, complete demonstration program to get you started with using
10656 can save this listing to re-create the contents of pcre2demo.c.
10661 used. If matching succeeds, the program outputs the portion of the sub-
10662 ject that matched, together with the contents of any captured sub-
10665 If the -g option is given on the command line, the program then goes on
10667 subject string. The logic is a little bit tricky because of the possi-
10671 The code in pcre2demo.c is an 8-bit program that uses the PCRE2 8-bit
10672 library. It handles strings and characters that are stored in 8-bit
10675 treated as UTF-8 strings, where characters may occupy multiple code
10679 for your operating system, you should be able to compile the demonstra-
10682 cc -o pcre2demo pcre2demo.c -lpcre2-8
10685 to the command line. For example, on a Unix-like system that has PCRE2
10686 installed in /usr/local, you can compile the demonstration program us-
10689 cc -o pcre2demo -I/usr/local/include pcre2demo.c \
10690 -L/usr/local/lib -lpcre2-8
10696 ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
10699 pcre2test, which supports many more facilities for testing regular ex-
10700 pressions using all three PCRE2 libraries (8-bit, 16-bit, and 32-bit,
10701 though not all three need be installed). The pcre2demo program is pro-
10708 ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file
10711 This is caused by the way shared library support works on those sys-
10714 -R/usr/local/lib
10729 Copyright (c) 1997-2016 University of Cambridge.
10733 ------------------------------------------------------------------------------
10739 PCRE2 - Perl-compatible regular expressions (revised API)
10742 SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS
10759 run. However, if you are using the just-in-time optimization feature,
10760 it is not possible to save and reload the JIT data, because it is posi-
10761 tion-dependent. The host on which the patterns are reloaded must be
10764 For example, patterns compiled on a 32-bit system using PCRE2's 16-bit
10765 library cannot be reloaded on a 64-bit system, nor can they be reloaded
10766 using the 8-bit library.
10770 output is really just a bytecode dump, which is why it can only be re-
10773 linked with a fixed version of PCRE2 must be prepared to recompile pat-
10783 checking, not complete validation of what is being re-loaded. Corrupted
10795 in the byte stream (its size is 1088 bytes). For more details of char-
10796 acter tables, see the section on locale support in the pcre2api docu-
10802 the length of the vector. The third and fourth arguments point to vari-
10816 PCRE2_ERROR_BADMAGIC means either that a pattern's code has been cor-
10817 rupted, or that a slot in the vector does not point to a compiled pat-
10840 the 256 possible byte values. On systems that make a distinction be-
10841 tween binary and non-binary data, be sure that the file is opened for
10846 freed in the usual way by calling pcre2_code_free(). When you have fin-
10847 ished with the byte stream, it too must be freed by calling pcre2_seri-
10848 alize_free(). If this function is called with a NULL argument, it re-
10852 RE-USING PRECOMPILED PATTERNS
10854 In order to re-use a set of saved patterns you must first make the se-
10856 from a file). The management of this memory block is up to the applica-
10867 and its length, and the third argument points to a byte stream. The fi-
10870 this argument is NULL, malloc() and free() are used. After deserializa-
10879 stream, it is filled with those that fit, and the remainder are ig-
10895 potential race issue if you are using multiple patterns that were de-
10896 coded from a single byte stream in a multithreaded application. A sin-
10898 and a reference count is used to arrange for its memory to be automati-
10905 If a pattern was processed by pcre2_jit_compile() before being serial-
10921 Copyright (c) 1997-2018 University of Cambridge.
10925 ------------------------------------------------------------------------------
10933 PCRE2 - Perl-compatible regular expressions (revised API)
10938 The full syntax and semantics of the regular expressions that are sup-
10940 document contains a quick-reference summary of the syntax.
10945 \x where x is non-alphanumeric is a literal x
10958 after the comma. The exception is \u{...} which is not Perl-compatible
10959 and is recognized only when PCRE2_EXTRA_ALT_BSUX is set. This is an EC-
10969 \cx "control-x", where x is a non-control ASCII character
10990 read, but in ALT_BSUX mode \x must be followed by two hexadecimal dig-
10996 Note that \0dd is always an octal code. The treatment of backslash fol-
10997 lowed by a non-zero digit is complicated; for details see the section
10998 "Non-printing characters" in the pcre2pattern documentation, where de-
11023 \W a "non-word" character
11027 middle of a UTF-8 or UTF-16 character. The application can lock out the
11031 By default, \d, \s, and \w match only ASCII characters, even in UTF-8
11032 mode or in the 16-bit and 32-bit libraries. However, if locale-specific
11034 points in the range 128-255. If the PCRE2_UCP option is set, the behav-
11037 that can restrict individual sequences to matching only ASCII charac-
11040 Property descriptions in \p and \P are matched caselessly; hyphens, un-
11066 Mn Non-spacing mark
11099 Xuc Universally-named character: one that can be
11103 Perl and POSIX space are now the same. Perl added VT to its space char-
11114 pcre2test -LP
11119 Many script names and their 4-letter abbreviations are recognized in
11121 of course). You can obtain a list of these scripts by running this com-
11124 pcre2test -LS
11143 L left-to-right
11144 LRE left-to-right embedding
11145 LRI left-to-right isolate
11146 LRO left-to-right override
11147 NSM non-spacing mark
11151 R right-to-left
11152 RLE right-to-left embedding
11153 RLI right-to-left isolate
11154 RLO right-to-left override
11163 [x-y] range (can be used for hex characters)
11169 ascii 0-127
11231 From release 10.38 \K is not permitted by default in lookaround asser-
11232 tions, for compatibility with Perl. However, if the PCRE2_EXTRA_AL-
11233 LOW_LOOKAROUND_BSK option is set, the previous behaviour is re-enabled.
11234 When this option is set, \K is honoured in positive assertions, but ig-
11249 (?:...) non-capture group
11250 (?|...) non-capture group; reset group numbers for
11253 In non-UTF modes, names may contain underscores and ASCII letters and
11260 (?>...) atomic non-capture group
11261 (*atomic:...) atomic non-capture group
11283 (?r) restrict caseless to either ASCII or non-ASCII
11288 (?-...) unset the given option(s)
11291 (?aP) implies (?aT) as well, though this has no additional effect. How-
11292 ever, it means that (?-aP) is really (?-PT) which disables all ASCII
11296 a mixture of setting and unsetting such as (?i-x) is allowed, but there
11298 for example (?^in). An option setting may appear at the start of a non-
11301 The following are recognized only at the very start of a pattern or af-
11310 (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
11313 (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
11327 These are recognized only at the very start of the pattern or after op-
11340 These are recognized only at the very start of the pattern or after op-
11365 Each top-level branch of a lookbehind must have a limit for the number
11373 NON-ATOMIC LOOKAROUND ASSERTIONS
11375 These assertions are specific to PCRE2 and are not Perl-compatible.
11401 \g-n relative reference by number
11403 \g{-n} relative reference by number
11416 (?-n) call subroutine by relative number
11425 \g<-n> call subroutine by relative number (PCRE2 extension)
11426 \g'-n' call subroutine by relative number (PCRE2 extension)
11431 (?(condition)yes-pattern)
11432 (?(condition)yes-pattern|no-pattern)
11436 (?(-n) relative reference condition (PCRE2 extension)
11464 The following act only when a subsequent match failure causes a back-
11466 what happens afterwards. Those that advance the start-of-match point do
11508 Copyright (c) 1997-2023 University of Cambridge.
11512 ------------------------------------------------------------------------------
11520 PCRE2 - Perl-compatible regular expressions (revised API)
11528 properties and can process strings of text in UTF-8, UTF-16, and UTF-32
11533 There are two ways of telling PCRE2 to switch to UTF mode, where char-
11546 one-code-unit characters. There are also some other changes to the way
11553 \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set-
11556 properties such as Lu for an upper case letter or Nd for a decimal num-
11561 The full lists are given in the pcre2pattern and pcre2syntax documenta-
11565 prefixed by "Is", for compatibility with Perl 5.6. PCRE2 does not sup-
11578 allowed in non-UTF mode.
11580 In UTF mode, repeat quantifiers apply to complete UTF characters, not
11591 multi-unit characters (see the description of \C in the pcre2pattern
11592 documentation). For this reason, there is a build-time option that dis-
11593 ables support for \C completely. There is also a less draconian com-
11594 pile-time option for locking out the use of \C when a pattern is com-
11598 pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
11600 modes provokes a match-time error. Also, the JIT optimization does not
11601 support \C in these modes. If JIT optimization is requested for a UTF-8
11602 or UTF-16 pattern that contains \C, it will not succeed, and so when
11603 pcre2_match() is called, the matching will be carried out by the inter-
11609 set as in non-UTF mode, all with code points less than 256. This re-
11614 you can use explicit Unicode property tests such as \p{Nd}. Alterna-
11615 tively, if you set the PCRE2_UCP option, the way that the character es-
11622 classes are all low-valued characters unless the PCRE2_UCP option is
11631 UNICODE CASE-EQUIVALENCE
11635 are less than 128 and that have at most two case-equivalent values. For
11636 these, a direct table lookup is used for speed. A few Unicode charac-
11637 ters such as Greek sigma have more than two code points that are case-
11639 PCRE2_UTF allows Unicode-style case processing for non-UTF character
11640 encodings such as UCS-2.
11643 ASCII lower case equivalents, have a non-ASCII one as well (long S and
11644 Kelvin sign). Recognition of these non-ASCII characters as case-equiv-
11647 in a case equivalence must either be ASCII or non-ASCII; there can be
11656 sequence of characters that are all from the same Unicode script. How-
11661 Every Unicode character has a Script property, mostly with a value cor-
11666 for the surrogate code points. In the PCRE2 32-bit library, characters
11668 which are accessible only in non-UTF mode, are assigned the Unknown
11672 include punctuation, emoji, mathematical, musical, and currency sym-
11675 "Inherited" is used for characters such as diacritical marks that mod-
11681 U+102E0 (Coptic Epact thousands mark) is used only with Arabic and Cop-
11683 called Script Extension exists. Its value is a list of scripts that ap-
11687 also some Common characters that have a single, non-Common script in
11693 constraint for decimal digits. These are covered in subsequent sec-
11700 run. Longer strings are checked using only the Script Extensions prop-
11703 If a character's Script Extension property is the single value "Inher-
11707 at least one script in common in their Script Extension lists. In set-
11723 The first has the Script Extension list Arabic, Hanifi Rohingya, Syr-
11725 of them could appear in script runs of either Arabic or Hanifi Ro-
11733 Katakana scripts together with Han; Korean uses Hangul and Han; Tai-
11736 "virtual scripts". Thus, a script run may contain a mixture of Hira-
11739 Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical Stan-
11740 dard 39 ("Unicode Security Mechanisms", http://unicode.org/re-
11748 from the common ASCII digits. In addition to the script checking de-
11758 returned. The code unit offset to the offending character can be ex-
11763 and therefore want to skip these checks in order to improve perfor-
11765 scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
11782 UTF-16 and UTF-32 strings can indicate their endianness by a special code
11783 known as a byte-order mark (BOM). The PCRE2 functions do not handle
11788 pcre2_dfa_match() calls with a non-zero starting offset, the check is
11796 that the sequences \b and \B are one-character lookbehinds.
11800 the surrogate area. The so-called "non-character" code points are not
11805 UTF-16, where they are used in pairs to encode code points with values
11806 greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
11807 are available independently in the UTF-8 and UTF-32 encodings. (In
11808 other words, the whole surrogate thing is a fudge for UTF-16 which un-
11809 fortunately messes up UTF-8 and UTF-32.)
11814 such as \x{d800} (a surrogate code point) you can set the PCRE2_EX-
11816 only in UTF-8 and UTF-32 modes, because these values are not repre-
11817 sentable in UTF-16.
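
The surrogate-pair arithmetic referred to above can be sketched as follows. This is an illustration only and is not taken from the PCRE2 sources; the function names is_surrogate() and utf16_encode() are hypothetical. It shows why a code point in the range 0xd800 to 0xdfff has no UTF-16 representation of its own, while values greater than 0xFFFF are split across two code units.

```c
#include <stdint.h>

/* Returns 1 if cp lies in the surrogate range 0xd800..0xdfff, else 0. */
int is_surrogate(uint32_t cp)
{
    return cp >= 0xD800u && cp <= 0xDFFFu;
}

/*
 * Hypothetical helper: encodes one code point as UTF-16 into out[].
 * Returns the number of 16-bit code units written (1 or 2), or 0 when
 * the value cannot be represented (a surrogate, or above 0x10ffff).
 */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (is_surrogate(cp) || cp > 0x10FFFFu)
        return 0;

    if (cp <= 0xFFFFu) {
        out[0] = (uint16_t)cp;          /* one code unit is enough */
        return 1;
    }

    cp -= 0x10000u;                     /* 20 bits remain */
    out[0] = (uint16_t)(0xD800u + (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00u + (cp & 0x3FFu));  /* low surrogate  */
    return 2;
}
```

For example, calling utf16_encode(0x1F600, buf) stores the pair 0xD83D, 0xDE00 in buf and returns 2, whereas utf16_encode(0xD800, buf) returns 0.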
11819 Errors in UTF-8 strings
11821 The following negative error codes are given for invalid UTF-8 strings:
11829 The string ends with a truncated UTF-8 character; the code specifies
11830 how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
11831 characters to be no longer than 4 bytes, the encoding scheme (origi-
11853 A 4-byte character has a value greater than 0x10ffff; these code points
11858 A 3-byte character has a value in the range 0xd800 to 0xdfff; this
11859 range of code points is reserved by RFC 3629 for use with UTF-16, and
11860 so are excluded from UTF-8.
11868 A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
11870 For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
11876 binary value 0b10 (that is, the most significant bit is 1 and the sec-
11877 ond is 0). Such a byte can only validly occur as the second or subse-
11878 quent byte of a multi-byte character.
11883 can never occur in a valid UTF-8 string.
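
The categories of UTF-8 error described above can be illustrated with a small stand-alone checker. This is a sketch under stated assumptions, not PCRE2's validation code: the enum values and the function utf8_check_one() are hypothetical and do not correspond to PCRE2's error numbers. It examines only the first character of a buffer, and it reports truncation before it inspects any following bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch; not PCRE2 error numbers. */
enum utf8_status {
    UTF8_OK = 0,
    UTF8_TRUNCATED,       /* buffer ends inside a character          */
    UTF8_BAD_TRAIL,       /* a following byte lacks the 0b10 prefix  */
    UTF8_ISOLATED_TRAIL,  /* 0b10xxxxxx where a first byte is needed */
    UTF8_OVERLONG,        /* value fits in a shorter encoding        */
    UTF8_SURROGATE,       /* 0xd800..0xdfff                          */
    UTF8_TOO_LARGE,       /* greater than 0x10ffff                   */
    UTF8_TOO_LONG,        /* 5- or 6-byte form (excluded by RFC 3629) */
    UTF8_INVALID_BYTE     /* 0xfe or 0xff                            */
};

/* Checks the character starting at s[0]; on success stores it in *cp. */
enum utf8_status utf8_check_one(const uint8_t *s, size_t len, uint32_t *cp)
{
    /* Smallest value that legitimately needs n bytes, indexed by n. */
    static const uint32_t min_value[] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    size_t n, i;
    uint32_t v;
    uint8_t b;

    if (len == 0) return UTF8_TRUNCATED;
    b = s[0];
    if (b < 0x80) { *cp = b; return UTF8_OK; }
    if ((b & 0xC0) == 0x80) return UTF8_ISOLATED_TRAIL;
    if (b >= 0xFE) return UTF8_INVALID_BYTE;

    if      ((b & 0xE0) == 0xC0) { n = 2; v = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { n = 3; v = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { n = 4; v = b & 0x07; }
    else if ((b & 0xFC) == 0xF8) { n = 5; v = b & 0x03; }
    else                         { n = 6; v = b & 0x01; } /* 0xfc, 0xfd */

    if (len < n) return UTF8_TRUNCATED;
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return UTF8_BAD_TRAIL;
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (v < min_value[n]) return UTF8_OVERLONG;
    if (n > 4) return UTF8_TOO_LONG;
    if (v > 0x10FFFFu) return UTF8_TOO_LARGE;
    if (v >= 0xD800u && v <= 0xDFFFu) return UTF8_SURROGATE;

    *cp = v;
    return UTF8_OK;
}
```

With this sketch, the two bytes 0xc0, 0xae are reported as UTF8_OVERLONG because they decode to 0x2e, which fits in one byte. The three bytes 0xed, 0xa0, 0x80 (0xd800) are reported as UTF8_SURROGATE, and a lone 0x80 as UTF8_ISOLATED_TRAIL.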
11885 Errors in UTF-16 strings
11887 The following negative error codes are given for invalid UTF-16
11895 Errors in UTF-32 strings
11897 The following negative error codes are given for invalid UTF-32
11907 UTF sequences if you call pcre2_compile() with the PCRE2_MATCH_IN-
11914 and you are not certain that your subject strings are valid UTF se-
11917 for UTF validity. An invalid string may cause undefined behaviour, in-
11922 generate different code. If JIT is not used, the option affects the be-
11923 haviour of the interpretive code in pcre2_match(). When PCRE2_MATCH_IN-
11929 \p{Any}, it does not even match negative items such as [^X]. A lookbe-
11945 UTF-sequence, that sequence is skipped, and the match starts at the
11953 Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbi-
11958 Note, however, that the 16-bit and 32-bit PCRE2 libraries process
11974 Copyright (c) 1997-2023 University of Cambridge.
11978 ------------------------------------------------------------------------------