pcre2.txt - OpenGrok cross reference for /aosp_15

Lines Matching +full:turing +full:- +full:complete
1 -----------------------------------------------------------------------------
8 -----------------------------------------------------------------------------
16        PCRE2 - Perl-compatible regular expressions (revised API)
26        API is more extensible, and it was simplified by abolishing  the  sepa-
32        As well as Perl-style regular expression patterns, some  features  that
39        The  source code for PCRE2 can be compiled to support strings of 8-bit,
40        16-bit, or 32-bit code units, which means that up to three separate li-
43        64-bit  environment that also supports 32-bit applications, versions of
44        PCRE2 that are compiled in both 64-bit and 32-bit modes may be needed.
46        The original work to extend PCRE to 16-bit and 32-bit  code  units  was
49        unit, or as UTF-encoded Unicode, with support for Unicode general cate-
55          pcre2test -C
58        ending in _8,  _16,  or  _32,  respectively  (for  example,  pcre2_com-
64        In addition to the Perl-compatible matching function, PCRE2 contains an
65        alternative  function that matches the same compiled patterns in a dif-
77        client to discover which features are  available.  The  features  them-
78        selves are described in the pcre2build page. Documentation about build-
80        NON-AUTOTOOLS_BUILD files in the source distribution.
93        If  you  are using PCRE2 in a non-UTF application that permits users to
96        For  example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
97        mode, which interprets patterns and subjects as strings of  UTF-8  code
98        units instead of individual 8-bit characters. This causes both the pat-
99        tern  and  any data against which it is matched to be checked for UTF-8
100        validity. If the data string is very long, such a check might use  suf-
101        ficiently  many  resources as to cause your application to lose perfor-
104        One way of guarding against this possibility is to use  the  pcre2_pat-
107        calling  pcre2_compile().  This causes a compile time error if the pat-
108        tern contains a UTF-setting sequence.
111        be  enabled  from within the pattern, by specifying "(*UCP)". This fea-
119        The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
121        middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C  op-
123        compile-time  error  if it is encountered. It is also possible to build
128        Nested unlimited repeats in a pattern are a common example. PCRE2  pro-
131        pcre2_set_depth_limit() that can be used to restrict the amount of mem-
137        The  user  documentation for PCRE2 comprises a number of different sec-
143        (which is a program listing), and the short pages for individual  func-
144        tions,  are  concatenated in pcre2.txt, for ease of searching. The sec-
148          pcre2-config       show PCRE2 installation configuration information
155          pcre2grep          description of the pcre2grep command (8-bit only)
156          pcre2jit           discussion of just-in-time optimization support
163          pcre2posix         the POSIX-compatible C API for the 8-bit library
187        Copyright (c) 1997-2021 University of Cambridge.
191 ------------------------------------------------------------------------------
199        PCRE2 - Perl-compatible regular expressions (revised API)
204        contains a description of all its native functions. See the pcre2 docu-
476        These  functions  provide  a  way of converting non-PCRE2 patterns into
477        patterns that can be processed by pcre2_compile(). This facility is ex-
483 PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
485        There are three PCRE2 libraries, supporting 8-bit, 16-bit,  and  32-bit
488        for all three libraries. One, two, or all three can be installed simul-
489        taneously.  On  Unix-like  systems the libraries are called libpcre2-8,
490        libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
508        Character  strings  are  passed  to a PCRE2 library as sequences of un-
511        specified as zero-terminated.
514        macros are defined whose names are the generic forms such as pcre2_com-
516        PCRE2_CODE_UNIT_WIDTH  to generate the appropriate width-specific func-
534        single  library.   For example, if you want to run a match using a pat-
546        There are also some wrapper functions for the 8-bit library that corre-
548        to  all  the  functionality of PCRE2 and they are not thread-safe. They
559        program against a non-dll PCRE2 library, you must  define  PCRE2_STATIC
563        and matching regular expressions in a Perl-compatible manner. A  sample
570        passed as bits in an options argument. There are also some more compli-
571        cated parameters such as custom memory  management  functions  and  re-
576        Just-in-time  (JIT)  compiler  support  is an optional feature of PCRE2
578        speeds  up  the matching performance of many patterns. Programs can re-
585        pcre2_jit_stack_assign()  in order to control the JIT code's memory us-
591        less  sanity  checking. The JIT-specific functions are discussed in the
594        A second matching function, pcre2_dfa_match(), which is  not  Perl-com-
598        there are lookaround assertions). However, this algorithm does not  re-
617        pcre2_substring_free() and pcre2_substring_list_free()  are  also  pro-
619        functions is called with a NULL argument, the function returns  immedi-
629        Finally, there are functions for finding out information about  a  com-
644        ~(PCRE2_SIZE)0) is reserved as a special indicator for  zero-terminated
646        handled is one less than this maximum. Note that string lengths are al-
647        ways given in code units. Only in the 8-bit library is  such  a  length
654        strings:  a  single  CR (carriage return) character, a single LF (line-
655        feed) character, the two-character sequence CRLF, any of the three pre-
664        Unix standard. However, the newline convention can be changed by an ap-
665        plication  when calling pcre2_compile(), or it can be specified by spe-
667        settings.  See the pcre2pattern page for details of the special charac-
673        dollar metacharacters, the handling of #-comments in /x mode, and, when
674        CRLF is a recognized line ending sequence, the match position  advance-
675        ment for a non-anchored pattern. There is more detail about this in the
685        In a multithreaded application it is important to keep  thread-specific
687        library code itself is thread-safe: it contains  no  static  or  global
688        variables. The API is designed to be fairly simple for non-threaded ap-
689        plications  while at the same time ensuring that multithreaded applica-
692        There are several different blocks of data that are used to pass infor-
700        is thread-safe, that is, the same compiled pattern can be used by  more
703        use  them.  However,  if the just-in-time (JIT) optimization feature is
714          Get a read-only (shared) lock (mutex) for pointer
726        The  reason  for checking the pointer a second time is as follows: Sev-
742          Get a read-only (shared) lock (mutex) for pointer
755        If JIT is being used, but the JIT compilation is not being done immedi-
760        pcre2_code_copy() or pcre2_code_copy_with_tables() can be used  to  ob-
761        tain  a  private  copy of the compiled code before calling the JIT com-
774        In a multithreaded application, if the parameters in a context are val-
777        it must make its own thread-specific copy.
782        of a match. This includes details of what was matched, as well as addi-
791        memory management or non-standard character tables.  To  keep  function
795        that holds the parameter values.  Applications that do not need to  ad-
800        relevant for several PCRE2 operations, a compile-time  context,  and  a
801        match-time context.
805        At  present,  this context just contains pointers to (and data for) ex-
825        function may be NULL, in which case the system memory management  func-
828        might  be.)  The private_malloc() function is used (if supplied) to ob-
851        A  compile context is required if you want to provide an external func-
853        values of any of the following compile-time parameters:
862        A  compile context is also required if you are using custom memory man-
863        agement.  If none of these apply, just pass NULL as the  context  argu-
866        A  compile context is created, copied, and freed by the following func-
894        only argument is a general context. This function builds a set of char-
900        As  PCRE2  has developed, almost all the 32 option bits that are avail-
903        bits which are used for some newer, assumed rarer, options. This  func-
905        It does not modify any existing setting. The available options are  de-
915        largest number that a PCRE2_SIZE variable can  hold,  which  is  effec-
926        largest number that a PCRE2_SIZE variable can  hold,  which  is  effec-
933        variable-length lookbehind assertion. The default is set when PCRE2  is
934        built,  with  the ultimate default being 255, the same as Perl. Lookbe-
940        This specifies which characters or character sequences are to be recog-
943        two-character  sequence  CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
950        When  a  pattern  is  compiled  with  the  PCRE2_EXTENDED  or PCRE2_EX-
962        stops rogue patterns using up too much system  stack  when  being  com-
963        piled.  The limit applies to parentheses of all kinds, not just captur-
979        nesting, and the second is user data that is set up by the  last  argu-
981        should return zero if all is well, or non-zero to force an error.
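
The guard function described in the hits above can be sketched as follows. This is a minimal illustration, not code from the PCRE2 distribution: the nesting_policy type and the nesting_guard name are hypothetical, and only the callback shape (a uint32_t nesting depth plus a user data pointer, returning zero to continue or non-zero to force an error) comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user data for the guard: the deepest nesting we accept. */
typedef struct {
    uint32_t max_depth;
} nesting_policy;

/* Guard callback: the first argument is the current depth of parenthesis
   nesting, the second is the user data pointer supplied at registration.
   Returns 0 if all is well, or non-zero to force a compile error. */
int nesting_guard(uint32_t depth, void *user_data)
{
    const nesting_policy *policy = (const nesting_policy *)user_data;
    assert(policy != NULL);
    return depth > policy->max_depth ? 1 : 0;
}
```

An application would typically register such a function on a compile context with pcre2_set_compile_recursion_guard(), passing a pointer to its policy structure as the last argument.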
997        A  match  context  is created, copied, and freed by the following func-
1017        during a matching operation. Details are given in the pcre2callout doc-
1024        This sets up a callout function for PCRE2 to call after each  substitu-
1025        tion made by pcre2_substitute(). Details are given in the section enti-
1031        The  offset_limit parameter limits how far an unanchored search can ad-
1033        pcre2_match()  and  pcre2_dfa_match()  functions return PCRE2_ERROR_NO-
1043        When using this facility, you must set the  PCRE2_USE_OFFSET_LIMIT  op-
1045        code  can  be  compiled. If a match is started with a non-default match
1062        also applies to pcre2_dfa_match(), which may use the heap when process-
1064        atomic groups. This limit does not apply to matching with the JIT opti-
1076        where  ddd  is a decimal number. However, such a setting is ignored un-
1082        pcre2_match()  uses  the  heap are given in the pcre2perform documenta-
1085        For pcre2_dfa_match(), a vector on the system stack is used  when  pro-
1093        The match_limit parameter provides a means of preventing PCRE2 from us-
1109        is entirely different. However, there is still the possibility of  run-
1114        The default value for the limit can be set when PCRE2 is built; the de-
1121        where ddd is a decimal number. However, such a setting is  ignored  un-
1148        If  the depth of internal recursive function calls is great enough, lo-
1149        cal workspace vectors are allocated on the heap from version 10.32  on-
1153        deal  of memory. However, it is probably better to limit heap usage di-
1165        where ddd is a decimal number. However, such a setting is  ignored  un-
1170 CHECKING BUILD-TIME OPTIONS
1180        required. The second argument is a pointer to memory into which the in-
1189        non-negative  on success, or the negative error code PCRE2_ERROR_BADOP-
1190        TION if the value in the first argument is not recognized. The  follow-
1197        PCRE2_BSR_UNICODE  means  that  \R  matches any Unicode line ending se-
1204        unit  widths  were  selected  when PCRE2 was built. The 1-bit indicates
1205        8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit  sup-
1212        recursions,  lookarounds,  and atomic groups in pcre2_dfa_match(). Fur-
1225        just-in-time compiling is included in the library; otherwise it is  set
1227        that  JIT will be used for any given match. See the pcre2jit documenta-
1236        compiler is configured, for example "x86 32bit  (little  endian  +  un-
1247        the 16-bit library is compiled, a value of 3 is rounded up  to  4,  and
1248        when  the  32-bit  library  is compiled, internal linkages always use 4
1251        The default value of 2 for the 8-bit and 16-bit libraries is sufficient
1287        The  output is a uint32_t integer that gives the maximum depth of nest-
1291        take into account the stack that may already be used by the calling ap-
1297        This parameter is obsolete and should not be used in new code. The out-
1302        The output is a uint32_t integer that gives the length of PCRE2's char-
1327        PCRE2 version string, zero-terminated. The number of code units used is
1328        returned. This is the length of the string plus one unit for the termi-
1346        length in code units. If the pattern is zero-terminated, the length can
1348        length of zero is treated as an empty  string  (NULL  with  a  non-zero
1353        If the compile context argument ccontext is NULL, memory for  the  com-
1354        piled  pattern  is  obtained  by calling malloc(). Otherwise, it is ob-
1355        tained from the same memory function that was used for the compile con-
1357        it is no longer needed.  If pcre2_code_free() is called with a NULL ar-
1362        However,  if  the  code has been processed by the JIT compiler (see be-
1363        low), the JIT information cannot be copied (because it is  position-de-
1364        pendent).   The  new copy can initially be used only for non-JIT match-
1369        a  multithreaded  application  to acquire a private copy of shared com-
1378        pointing  to the new tables. The memory for the new tables is automati-
1390        described in the section entitled "Option bits for  pcre2_match()"  be-
1394        that affect the compilation. It should be zero if none of them are  re-
1397        well)  can  also  be set and unset from within the pattern (see the de-
1400        For those options that can be different in different parts of the  pat-
1406        Some additional options and less frequently required compile-time para-
1410        If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
1412        error code and an offset (number of code units) within the pattern, re-
1413        spectively, when pcre2_compile() returns NULL because a compilation er-
1416        There are nearly 100 positive error codes that pcre2_compile() may  re-
1418        error codes that are used for invalid UTF strings when validity  check-
1421        There is no separate documentation for the positive  error  codes,  be-
1423        pcre2_get_error_message() function (see "Obtaining a textual error mes-
1424        sage" below) should be  self-explanatory.  Macro  names  starting  with
1427        that  returns  the message "no error" if passed to pcre2_get_error_mes-
1430        The value returned in erroroffset is an indication of where in the pat-
1432        non-zero  value  is  not  necessarily the furthest point in the pattern
1435        assertion. For an invalid UTF-8 or UTF-16 string, the offset is that of
1441        mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
1444        This  code  fragment shows a typical straightforward call to pcre2_com-
1452            PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
1485        (1) \U matches an upper case "U" character; by default \U causes a com-
1500        using the PCRE2_EXTRA_ALT_BSUX extra option  (see  "Extra  compile  op-
1502        to  patterns.  Neither  of  these options affects the processing of re-
1511        Perl.  If  you want a multiline circumflex also to match after a termi-
1517        such  as  (*MARK:NAME)  is any sequence of characters that does not in-
1519        it  is  not possible to include a closing parenthesis in the name. How-
1520        ever, if the PCRE2_ALT_VERBNAMES option is set, normal  backslash  pro-
1521        cessing  is  applied to verb names and only an unescaped closing paren-
1525        whitespace  in verb names is skipped and #-comments are recognized, ex-
1531        items,  all  with  number 255, before each pattern item, except immedi-
1532        ately before or after an explicit callout in the pattern.  For  discus-
1543        characters, K and S, that, in addition to their lower case ASCII equiv-
1544        alents,  are case-equivalent with U+212A (Kelvin sign) and U+017F (long
1545        S) respectively. If you do not want this case equivalence, you can sup-
1551        (available only in 16-bit or 32-bit mode) are treated as not having an-
1568        this option, a dot does not match when the current position in the sub-
1570        and it can be changed within a pattern by a (?s) option setting. A neg-
1572        escape sequence always matches a non-newline character, independent  of
1589        patterns,  a  new  match is then tried at the next starting point. How-
1604        matches,  which are necessarily substrings of the first one, must obvi-
1609        If this bit is set, most white space characters in the pattern are  to-
1611        a \Q...\E sequence. However, white space  is  not  allowed  within  se-
1615        quantifier and a following + that indicates  possessiveness.  PCRE2_EX-
1619        When PCRE2 is compiled without Unicode support,  PCRE2_EXTENDED  recog-
1621        256 that are flagged as white space in its low-character table. The ta-
1628        When PCRE2 is compiled with Unicode support, in addition to these char-
1629        acters,  five  more Unicode "Pattern White Space" characters are recog-
1630        nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-
1631        right mark), U+200F (right-to-left mark), U+2028 (line separator),  and
1637        As well as ignoring most white space, PCRE2_EXTENDED also causes  char-
1644        Which characters are interpreted as newlines can be specified by a set-
1646        special  sequence at the start of the pattern, as described in the sec-
1652        This  option  has  the  effect of PCRE2_EXTENDED, but, in addition, un-
1653        escaped space and horizontal tab characters are ignored inside a  char-
1655        set of pattern white space characters that are ignored outside a  char-
1663        start  of  matching, though the matched text may continue over the new-
1664        line. If startoffset is non-zero, the limiting newline is not necessar-
1666        string is "abc\nxyz" (where \n represents a single-character newline) a
1676        If this option is set, all meta-characters in the pattern are disabled,
1679        you are doing a lot of literal matching and  are  worried  about  effi-
1684        PCRE2_UTF,  and  PCRE2_USE_OFFSET_LIMIT.  The  extra  options PCRE2_EX-
1692        sequences.  Note, however, that the 16-bit and 32-bit  PCRE2  libraries
1694        cannot find valid UTF sequences within an arbitrary string of bytes un-
1695        less such sequences are suitably aligned. This  facility  is  not  sup-
1696        ported  for  DFA matching. For details, see the pcre2unicode documenta-
1703        alternative to fail).  A pattern such as (\1)(a) succeeds when this op-
1715        string,  or  before a terminating newline (except when PCRE2_DOLLAR_EN-
1717        character" metacharacter (.) does not match at a newline.  This  behav-
1733        This option locks out the use of \C in the pattern that is  being  com-
1734        piled.   This  escape  can  cause  unpredictable  behaviour in UTF-8 or
1735        UTF-16 modes, because it may leave the current matching  point  in  the
1736        middle of a multi-code-unit character. This option may be useful in ap-
1738        is also a build-time option that permanently locks out the use of \C.
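
To make "the middle of a multi-code-unit character" concrete for UTF-8, here is a small sketch that tests whether a byte offset lands on a continuation byte. It does not use PCRE2 at all; the function name is hypothetical, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if byte offset pos in a valid UTF-8 string of len bytes falls
   inside a multi-byte character (on a continuation byte of the form
   10xxxxxx), and 0 if it is on a character boundary. */
int utf8_offset_is_mid_character(const unsigned char *s, size_t len, size_t pos)
{
    assert(s != NULL && pos <= len);
    if (pos == len)
        return 0;                      /* end of string is a boundary */
    return (s[pos] & 0xC0) == 0x80;    /* continuation bytes are 10xxxxxx */
}
```

An application that permits \C could use a check of this kind on match offsets before treating them as character positions.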
1752        This  option  locks out interpretation of the pattern as UTF-8, UTF-16,
1753        or UTF-32, depending on which library is in use. In particular, it pre-
1755        by  starting  the pattern with (*UTF). This option may be useful in ap-
1761        If this option is set, it disables the use of numbered capturing paren-
1772        If this option is set, it disables "auto-possessification", which is an
1775        are  in  use,  auto-possessification means that some callouts are never
1783        .*  is  the  first significant item in a top-level branch of a pattern,
1803        the matching code searches the subject for that value, and fails  imme-
1804        diately  if it cannot find it, without actually running the main match-
1808        items  are  in use, these "start-up" optimizations can cause them to be
1809        skipped if the pattern is never actually used. The  start-up  optimiza-
1810        tions  are  in effect a pre-scan of the subject that takes place before
1813        The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
1826        start-up  optimization  scans along the subject, finds "A" and runs the
1827        first match attempt from there. The (*COMMIT) item means that the  pat-
1832        (*COMMIT)  prevents any further matches being tried, so the overall re-
1835        As another start-up optimization makes use of a minimum  length  for  a
1842        match "BB", which is long enough. In the process, (*MARK:2) is  encoun-
1844        found, but there is only one character left, so there are no  more  at-
1857        UTF-8 strings, UTF-16 strings, and UTF-32 strings in  the  pcre2unicode
1863        PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an in-
1871        Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis-
1872        able  the error that is given if an escape sequence for an invalid Uni-
1873        code code point is encountered in the pattern. In particular,  the  so-
1877        section entitled "Extra compile options" below.  However, this is  pos-
1878        sible only in UTF-8 and UTF-32 modes, because these values are not rep-
1879        resentable in UTF-16.
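
The reason surrogate values cannot be represented in UTF-16 can be illustrated with a short encoder sketch. The surrogate range U+D800 to U+DFFF is a property of Unicode that is not shown in the hits above, and both function names are hypothetical; nothing here is part of the PCRE2 API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if cp is a surrogate code point (U+D800..U+DFFF). */
int is_surrogate(uint32_t cp)
{
    return cp >= 0xD800u && cp <= 0xDFFFu;
}

/* Encodes cp as UTF-16 into out. Returns the number of code units written
   (1 or 2), or 0 if cp has no UTF-16 encoding: surrogate values are
   reserved for pairs, and nothing above 0x10ffff is a code point. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (is_surrogate(cp) || cp > 0x10FFFFu)
        return 0;
    if (cp < 0x10000u) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000u;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800u + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00u + (cp & 0x3FFu)); /* low surrogate  */
    return 2;
}
```

Because the values D800 to DFFF are consumed as the two halves of a pair, a lone surrogate code point has no encoding of its own in UTF-16, whereas UTF-8 and UTF-32 can physically carry the value even though it is defined as invalid.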
1891        The second effect of PCRE2_UCP is to force the use of  Unicode  proper-
1893        This  makes  it  possible  to process strings in the 16-bit UCS-2 code.
1895        support  (which is the default).  The PCRE2_EXTRA_CASELESS_RESTRICT op-
1897        match only ASCII characters and non-ASCII characters  match  only  non-
1910        is going to be used to set a non-default offset limit in a  match  con-
1912        offset limit is set without this option. For more details, see the  de-
1920        instead of single-code-unit strings. It  is  available  when  PCRE2  is
1922        support is not available, the use of this option provokes an error. De-
1935        assertions, following Perl's lead. This option is provided to re-enable
1941        This  option  applies when compiling a pattern in UTF-8 or UTF-32 mode.
1942        It is forbidden in UTF-16 mode, and ignored in non-UTF  modes.  Unicode
1944        in  UTF-16  to  encode  code points with values in the range 0x10000 to
1945        0x10ffff. The surrogates cannot therefore  be  represented  in  UTF-16.
1946        They can be represented in UTF-8 and UTF-32, but are defined as invalid
1947        code  points,  and  cause  errors  if  encountered in a UTF-8 or UTF-32
1952        when using PCRE2 to check for unwanted characters in UTF-8 strings, ex-
1954        PCRE2_NO_UTF_CHECK option does not disable the error that  occurs,  be-
1957        If  the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surro-
1958        gate code point values in UTF-8 and UTF-32 patterns no  longer  provoke
1966        \x in the way that ECMAScript (aka JavaScript) does.  Additional  func-
1969        as a hexadecimal character code, where hhh.. is any number of hexadeci-
1975        is  set.   It can be changed within a pattern by means of the (?aD) op-
2002        to ensure that (?-aP) unsets all ASCII restrictions for POSIX classes.
2007        escape  such  as \j or a malformed one such as \x{2z} causes a compile-
2008        time error when detected by pcre2_compile(). Perl is somewhat inconsis-
2010        "j",  and non-hexadecimal digits in \x{} are just ignored, though warn-
2011        ings are given in both cases if Perl's warning switch is enabled.  How-
2017        treated as single-character escapes. For example, \j is a  literal  "j"
2018        and  \x{2z}  is treated as the literal string "x{2z}". Setting this op-
2023        is not supported in a character class. To reiterate: this is a  danger-
2030        are two case-equivalent character sets that contain both ASCII and non-
2031        ASCII characters. The ASCII letter S is case-equivalent to U+017f (long
2032        S) and the ASCII letter K is case-equivalent to U+212a  (Kelvin  sign).
2033        This  option  disables  recognition of case-equivalences that cross the
2034        ASCII/non-ASCII boundary. In a caseless match, both characters must ei-
2035        ther be ASCII or non-ASCII. The option can be changed with a pattern by
2043        of  a CR (carriage return) character. The option does not affect a lit-
2049        This  option  is  provided  for  use  by the -x option of pcre2grep. It
2050        causes the pattern only to match complete lines. This  is  achieved  by
2051        automatically  inserting  the  code for "^(?:" at the start of the com-
2053        the  matched  line may be in the middle of the subject string. This op-
2058        This option is provided for use by  the  -w  option  of  pcre2grep.  It
2066 JUST-IN-TIME (JIT) COMPILATION
2086        just-in-time compiler is available, further processes a  compiled  pat-
2092        for  patterns  to  be analyzed, and for one-off matches and simple pat-
2108        code points are less than 256. By default,  higher-valued  code  points
2114        \w and friends to use Unicode property support instead of the  built-in
2115        tables.  PCRE2_UCP also causes upper/lower casing operations on charac-
2125        PCRE2  contains a built-in set of character tables that are used by de-
2126        fault.  These are sufficient for many applications. Normally,  the  in-
2129        default "C" locale of the local system, which may cause them to be dif-
2132        The built-in tables can be overridden by tables supplied by the  appli-
2134        from the default.  As more and more applications change to  using  Uni-
2155        The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
2174        or whether the processor is 32-bit or 64-bit. A copy of the  result  of
2176        re-used later, even in a different program or on another computer.  The
2181        used stand-alone to create a file that contains a set of binary tables.
2191        The  first  argument  for pcre2_pattern_info() is a pointer to the com-
2193        is required, and the third argument is a pointer to a variable  to  re-
2197        the function is zero for success, or one of the following negative num-
2207        typical call of pcre2_pattern_info(), to obtain the length of the  com-
2225        to  a  uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the op-
2226        tions that were passed to  pcre2_compile(),  whereas  PCRE2_INFO_ALLOP-
2227        TIONS  returns  the compile options as modified by any top-level (*XXX)
2230        compile context by calling the pcre2_set_compile_extra_options()  func-
2233        For  example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EX-
2241        PCRE2 if the first significant item in every top-level branch is one of
2247          .*    sometimes - see below
2259        For  patterns  that are auto-anchored, the PCRE2_ANCHORED bit is set in
2290        been set, the call to pcre2_pattern_info() returns the error  PCRE2_ER-
2297        In the absence of a single first code unit for a non-anchored  pattern,
2298        pcre2_compile()  may construct a 256-bit table that defines a fixed set
2302        means  "any  code unit of value 255 or above". If such a table was con-
2309        a  non-anchored  pattern. The third argument should point to a uint32_t
2321        The  third  argument  should point to a uint32_t variable. In the 8-bit
2322        library, the value is always less than 256. In the 16-bit  library  the
2323        value  can  be  up  to 0xffff. In the 32-bit library in UTF-32 mode the
2324        value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
2333        in the pattern. Each additional capture group adds two PCRE2_SIZE vari-
2346        \r or \n or one of the  equivalent  hexadecimal  or  octal  escape  se-
2352        (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu-
2360        Return  1  if  the (?J) or (?-J) option setting is used in the pattern,
2362        (?J)  and  (?-J) set and unset the local PCRE2_DUPNAMES option, respec-
2367        If the compiled pattern was successfully  processed  by  pcre2_jit_com-
2387        PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu-
2393        third argument should point to a uint32_t variable. When a pattern con-
2395        whether or not it can match an empty string. PCRE2 takes a cautious ap-
2401        (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third  ar-
2403        set, the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UN-
2405        less  than  the limit set or defaulted by the caller of the match func-
2411        code  units)  when  it starts to process each of its branches. This re-
2413        should point to a uint32_t integer. The simple assertions \b and \B re-
2414        quire  a one-character lookbehind and cause PCRE2_INFO_MAXLOOKBEHIND to
2415        return 1 in the absence of anything longer. \A also  registers  a  one-
2419        Note that this information is useful for multi-segment matching only if
2423        one  character, then the nested lookbehind also moves back by two char-
2427        multi-segment matching.
2444        PCRE2 supports the use of named as well as numbered capturing parenthe-
2445        ses.  The names are just an additional way of identifying the parenthe-
2447        pcre2_substring_get_byname() are provided for extracting captured  sub-
2451        do the conversion, you need to use the name-to-number map, which is de-
2454        The map consists of a number of  fixed-size  entries.  PCRE2_INFO_NAME-
2460        This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit li-
2461        brary,  the first two bytes of each entry are the number of the captur-
2462        ing parenthesis, most significant byte first. In  the  16-bit  library,
2463        the  pointer  points  to 16-bit code units, the first of which contains
2464        the parenthesis number. In the 32-bit library, the  pointer  points  to
2465        32-bit  code units, the first of which contains the parenthesis number.
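
The 8-bit entry layout described above can be decoded with a few lines of C. This is a sketch under stated assumptions: the helper names are hypothetical, the table pointer and entry size are assumed to have been obtained already through pcre2_pattern_info(), and the name is assumed to follow the two number bytes as a zero-terminated string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns a pointer to entry number index in a table of fixed-size entries. */
const uint8_t *name_table_entry(const uint8_t *table, size_t entry_size,
                                size_t index)
{
    assert(table != NULL && entry_size > 2);
    return table + index * entry_size;
}

/* Group number held in the first two bytes of an 8-bit library entry,
   most significant byte first. */
uint32_t name_entry_group_number(const uint8_t *entry)
{
    assert(entry != NULL);
    return ((uint32_t)entry[0] << 8) | (uint32_t)entry[1];
}

/* The zero-terminated group name that follows the two number bytes
   (assumed layout for the 8-bit library). */
const char *name_entry_name(const uint8_t *entry)
{
    assert(entry != NULL);
    return (const char *)(entry + 2);
}
```

In the 16-bit and 32-bit libraries the group number occupies a single code unit, so the two-byte combination step would not apply there.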
2469        capture groups with the same number, as described in the section on du-
2474        Duplicate names for capture groups with different numbers  are  permit-
2478        necessarily  the  case because later capture groups may have lower num-
2482        pattern  after  compilation by the 8-bit library (assume PCRE2_EXTENDED
2483        is set, so white space - including newlines - is ignored):
2485          (?<date> (?<year>(\d\d)?\d\d) -
2486          (?<month>\d\d) - (?<day>\d\d) )
2490        with non-printing bytes shown in hexadecimal, and undefined bytes shown
2499        name-to-number map, remember that the length of the entries  is  likely
2513        This identifies the character sequence that will be recognized as mean-
2518        Return  the  size  of  the compiled pattern in bytes (for all three li-
2522        pcre2_compile()  is  getting memory in which to place the compiled pat-
2523        tern may be slightly larger than the value returned by this option, be-
2525        over-estimate.  Processing a pattern with the JIT compiler does not al-
2541        which they appear. Its first argument is a pointer to a callout enumer-
2543        passed  to  pcre2_callout_enumerate(). The contents of the callout enu-
2550        It  is possible to save compiled patterns on disc or elsewhere, and re-
2556        of PCRE2 is really just a bytecode dump.  The functions whose names be-
2557        gin with pcre2_serialize_ are used for converting to and from the seri-
2580        you must create a match data block by calling one of the creation func-
2587        to record the matched portion of the subject plus three  captured  sub-
2601        The second argument of pcre2_match_data_create() is a pointer to a gen-
2611        general context, but in this case if NULL is passed, the memory is  ob-
2617        after a match operation has finished,  using  functions  that  are  de-
2621        match block only  when  the  error  is  PCRE2_ERROR_NOMATCH,  PCRE2_ER-
2622        ROR_PARTIAL,  or  one of the error codes for an invalid UTF string. Ex-
2632        described in the section entitled "Option bits for  pcre2_match()"  be-
2652        makes use of a vector of data frames for remembering backtracking posi-
2653        tions.  The size of each individual frame depends on the number of cap-
2654        turing parentheses in the  pattern  and  can  be  obtained  by  calling
2655        pcre2_pattern_info() with the PCRE2_INFO_FRAMESIZE option (see the sec-
2659        turns out to be too small during  matching,  it  is  automatically  ex-
2660        panded.  When  pcre2_match()  returns, the memory is not freed, but re-
2683        order  to  find multiple matches in the subject string or to match dif-
2686        This function is the main matching facility of the library, and it  op-
2687        erates  in  a Perl-like manner. For specialist use there is also an al-
2703        If  the  subject  string is zero-terminated, the length can be given as
2705        common matching parameters are to be changed. For details, see the sec-
2713        bytes  for the 8-bit library, 16-bit code units for the 16-bit library,
2714        and 32-bit code units for the 32-bit library, whether or not  UTF  pro-
2716        zero,  the  subject is assumed to be an empty string. If length is non-
2722        by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
2723        set must point to the start of a character, or to the end of  the  sub-
2724        ject  (in  UTF-32 mode, one code unit equals one character, so all off-
2725        sets are valid). Like the pattern string, the subject may  contain  bi-
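The rule that a UTF-8 starting offset must fall on a character boundary or at the end of the subject can be checked before a match is attempted. The following is a minimal sketch in plain C, assuming the subject is already known to be valid UTF-8; is_utf8_char_boundary() is a hypothetical helper, not a PCRE2 function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `offset` is usable as a starting
   offset in a valid UTF-8 subject of `length` bytes, otherwise 0. */
static int is_utf8_char_boundary(const unsigned char *subject,
                                 size_t length, size_t offset)
{
    assert(subject != NULL);
    if (offset > length)
        return 0;               /* beyond the end of the subject */
    if (offset == length)
        return 1;               /* the end of the subject is allowed */
    /* UTF-8 continuation bytes have the bit pattern 10xxxxxx, so any
       other byte starts a character. */
    return (subject[offset] & 0xC0) != 0x80;
}
```

For the four-byte subject 'a', 0xC3, 0xA9, 'z' (the letter a, e-acute, and z), offsets 0, 1, 3, and 4 are accepted and offset 2, which lands inside the two-byte character, is rejected.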
2728        A  non-zero  starting offset is useful when searching for another match
2755        so,  and the current character is CR followed by LF, advance the start-
2758        If a non-zero starting offset is passed when the pattern is anchored, a
2759        single attempt to match at the given offset is made. This can only suc-
2761        the  subject.  In other words, the anchoring must be the result of set-
2769        PCRE2_COPY_MATCHED_SUBJECT, PCRE2_DISABLE_RECURSELOOP_CHECK,  PCRE2_EN-
2771        PCRE2_NOTEMPTY_ATSTART,  PCRE2_NO_JIT,  PCRE2_NO_UTF_CHECK,  PCRE2_PAR-
2774        Setting  PCRE2_ANCHORED  or PCRE2_ENDANCHORED at match time is not sup-
2775        ported by the just-in-time (JIT) compiler. If it is set,  JIT  matching
2794        must not be freed until all such operations are complete. For some  ap-
2803        also automatically freed if the match data block is re-used for another
2808        This  option  is relevant only to pcre2_match() for interpretive match-
2812        The use of recursion in patterns can lead to infinite loops. In the in-
2818        start  of  that group, and the furthest inspected character of the sub-
2821        There are rare cases of matches that would complete,  but  nevertheless
2828        matches  must be right at the end of the subject string. Note that set-
2843        in  multiline mode) a newline immediately before it. Setting this with-
2845        match. This option affects only the behaviour of the dollar metacharac-
2867        subject is permitted.  If the pattern is anchored, such a match can oc-
2882        The latter special case is discussed in detail in the pcre2unicode doc-
2885        In  the default case, if a non-zero starting offset is given, the check
2893        that the sequences \b and \B are one-character lookbehinds.
2899        validity  of  UTF-8  strings, UTF-16 strings, and UTF-32 strings in the
2909        PCRE2_NO_UTF_CHECK  is  set  at match time the effect of passing an in-
2910        valid string as a subject, or an invalid value of startoffset, is unde-
2911        fined.  Your program may crash or loop indefinitely or give  wrong  re-
2917        These options turn on the partial matching feature. A partial match oc-
2919        there are not enough subject characters to complete the match. In addi-
2924        If this situation arises when PCRE2_PARTIAL_SOFT  (but  not  PCRE2_PAR-
2925        TIAL_HARD) is set, matching continues by testing any remaining alterna-
2926        tives.  Only  if  no complete match can be found is PCRE2_ERROR_PARTIAL
2927        returned instead of PCRE2_ERROR_NOMATCH.  In  other  words,  PCRE2_PAR-
2929        match, but only if no complete match can be found.
2934        other words, when PCRE2_PARTIAL_HARD is set, a partial match is considered
2935        to be more important than an alternative complete match.
2937        There is a more detailed discussion of partial and multi-segment match-
2943        When  PCRE2 is built, a default newline convention is set; this is usu-
2948        pcre2pattern page. During matching, the newline choice affects the  be-
2961        expected. For example, if the pattern is .+A (and the PCRE2_DOTALL  op-
2964        However,  the  pattern  [\r\n]A does match that string, because it con-
2965        tains an explicit CR or LF reference, and so advances only by one char-
2971        not  count, nor does \s, even though it includes CR and LF in the char-
2998        Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
3002        pcre2_get_ovector_count() returns the number of pairs of values it con-
3005        Within the ovector, the first in each pair of values is set to the off-
3007        offset of the first code unit after the end of a substring. These  val-
3009        are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit li-
3010        brary, and 32-bit offsets in the 32-bit library.
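The pairing of offsets in the ovector can be illustrated without the library itself. The sketch below assumes the 8-bit library, where offsets are byte offsets, and models the ovector as a plain array of size_t. SKETCH_UNSET is a stand-in defined here for the marker that identifies an unset group, and copy_group() is a hypothetical helper rather than a PCRE2 function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: stand-in for the library's "unset" offset marker. */
#define SKETCH_UNSET SIZE_MAX

/* Hypothetical helper: copy capture group `group` out of `subject` as a
   zero-terminated string. Pair 0 is the whole match. Returns the length
   in bytes, or -1 if the group is out of range, unset, or does not fit. */
static long copy_group(const char *subject, const size_t *ovector,
                       size_t pair_count, size_t group,
                       char *buf, size_t buf_size)
{
    assert(subject != NULL && ovector != NULL && buf != NULL);
    if (group >= pair_count)
        return -1;
    size_t start = ovector[2 * group];      /* first code unit of substring */
    size_t end = ovector[2 * group + 1];    /* first code unit after it */
    if (start == SKETCH_UNSET || end == SKETCH_UNSET || start > end)
        return -1;
    size_t len = end - start;
    if (len + 1 > buf_size)
        return -1;                          /* no room for the terminator */
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (long)len;
}
```

With the pattern (abc)|(def) and the subject "def", the modelled ovector would be { 0, 3, SKETCH_UNSET, SKETCH_UNSET, 0, 3 }: pair 0 and group 2 both yield "def", while group 1 is reported as unset.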
3018        the portion of the subject string that was matched by the  entire  pat-
3022        been captured, the returned value is 3. If there are no  captured  sub-
3031        If  a  capture group is matched repeatedly within a single match opera-
3032        tion, it is the last portion of the subject that it matched that is re-
3048        Offset values that correspond to unused groups at the end  of  the  ex-
3057        in the pattern are never changed. That is, if a pattern contains n cap-
3058        turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
3059        pcre2_match().  The  other  elements retain whatever values they previ-
3080        returns a pointer to the zero-terminated name, which is within the com-
3090        After a "no match" or a partial match, the last encountered name is re-
3100        Warning:  By  default, certain start-of-match optimizations are used to
3103        for the presence of "c" in the subject before running the matching  en-
3105        any  marks. You can disable the start-of-match optimizations by setting
3112        offset  of  the character at which the match started. For a non-partial
3125        If pcre2_match() fails, it returns a negative number. This can be  con-
3126        verted  to a text string by calling the pcre2_get_error_message() func-
3131        of  UTF-specific negative error codes is returned. Details are given in
3146        PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
3153        a library of a different code unit width, for example, a  pattern  com-
3154        piled  by  the  8-bit  library  is passed to a 16-bit or 32-bit library
3194        This error is returned when a pattern that was successfully studied us-
3195        ing JIT is being matched, but the memory available for the just-in-time
3209        also returned if PCRE2_COPY_MATCHED_SUBJECT is set and  memory  alloca-
3219        within the pattern. Specifically, it means that either the  whole  pat-
3222        might do this are detected and faulted at compile time, but  more  com-
3233        match,  or  auxiliary)  can be obtained by calling pcre2_get_error_mes-
3240        The returned message is terminated with a trailing zero, and the  func-
3242        zero. If the error number is unknown, the negative error code PCRE2_ER-
3266        extracting   captured  substrings  as  new,  separate,  zero-terminated
3272        zero refers to the entire matched substring, with higher numbers refer-
3283        extracts a zero-length empty string.
3292        The  pcre2_substring_copy_bynumber()  function  copies  a captured sub-
3295        function that was used for the match data block. The  first  two  argu-
3312        code  is returned.  If a substring number greater than zero is used af-
3335        pattern is (abc)|(def) and the subject is "def", and the  ovector  con-
3346        The  pcre2_substring_list_get()  function  extracts  all available sub-
3348        builds  a  second list that contains their lengths (in code units), ex-
3360        therefore need the lengths, you may supply NULL as the lengthsptr argu-
3362        function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the  mem-
3369        be distinguished from a genuine zero-length substring by inspecting the
3390        To  extract a substring by name, you first have to find associated num-
3397        the name by calling pcre2_substring_number_from_name(). The first argu-
3406        the "bynumber" functions, the only difference being that the second ar-
3414        than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is re-
3420        group numbers in the pcre2pattern page, you cannot use names to distin-
3439        can  be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. As
3440        a special case, if replacement is NULL and rlength  is  zero,  the  re-
3441        placement  is assumed to be an empty string. If rlength is non-zero, an
3444        There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to re-
3447        that requests multiple replacements  (see  PCRE2_SUBSTITUTE_GLOBAL  be-
3452        never  greater  than one unless PCRE2_SUBSTITUTE_GLOBAL is set. A nega-
3457        error return. For global replacements, matches in which \K in a lookbe-
3462        pcre2_match(), except that the partial matching options are not permit-
3464        block  is obtained and freed within this function, using memory manage-
3471        will always be a no-match error. The contents of the ovector within the
3479        arguments.  The  data in the match_data block (return code, offset vec-
3481        pcre2_match()  from  within pcre2_substitute(). This allows an applica-
3486        changed   when   PCRE2_SUBSTITUTE_MATCHED   is  set.  If  PCRE2_SUBSTI-
3487        TUTE_GLOBAL is also set, pcre2_match() is called after the  first  sub-
3488        stitution  to  check for further matches, but this is done using an in-
3492        The  code  argument is not used for matching before the first substitu-
3494        even  when  PCRE2_SUBSTITUTE_GLOBAL is not set, because it contains in-
3495        formation such as the UTF setting and the number of capturing parenthe-
3499        subject string with matched substrings replaced. However, if PCRE2_SUB-
3505        The  outlengthptr  argument of pcre2_substitute() must point to a vari-
3511        If  the  function is not successful, the value set via outlengthptr de-
3513        string, the value is the offset in the replacement string where the er-
3514        ror  was  detected.  For  other errors, the value is PCRE2_UNSET by de-
3515        fault. This includes the case of the output buffer being too small, un-
3519        buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
3521        continues to go through the motions of matching and substituting (with-
3524        variable, with  the  result  of  the  function  still  being  PCRE2_ER-
3529        that the entire operation is carried out twice. Depending on the appli-
3531        the  excess  afterwards,  instead   of   using   PCRE2_SUBSTITUTE_OVER-
3536        invalid UTF replacement string causes an immediate return with the rel-
3539        If  PCRE2_SUBSTITUTE_LITERAL  is set, the replacement string is not in-
3540        terpreted in any way. By default, however, a dollar character is an es-
3541        cape character that can specify the insertion of characters  from  cap-
3542        ture  groups  and names from (*MARK) or other control verbs in the pat-
3543        tern. Dollar is the only escape character (backslash is treated as lit-
3551        brackets  are  required only if the following character would be inter-
3571        takes place in the original subject string (that is, previous  replace-
3574        subject string. If an offset limit is set in the match context, search-
3578        the subject string by setting either or both of startoffset and an off-
3586        with zero length, an attempt to find a non-empty match at the same off-
3597        PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including un-
3600        not set, an attempt to insert an unset group causes the PCRE2_ERROR_UN-
3611        particular character codes, and backslash followed by any  non-alphanu-
3618        current state: \U and \L change to upper or lower case forcing, respec-
3623        all  inserted  characters, including those from capture groups and let-
3628        Note that case forcing sequences such as \U...\E do not nest. For exam-
3630        \E  has  no  effect.  Note  also  that the PCRE2_ALT_BSUX and PCRE2_EX-
3637          ${<n>:-<string>}
3640        As before, <n> may be a group number or a name. The first  form  speci-
3661        substitutions.  However,  PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause un-
3665        PCRE2_SUBSTITUTE_UNSET_EMPTY, and PCRE2_SUBSTITUTE_EXTENDED are irrele-
3674        PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
3677        PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
3679        when the simple (non-extended) syntax is used and  PCRE2_SUBSTITUTE_UN-
3693        the replacement string, with more  particular  errors  being  PCRE2_ER-
3702        obtained by calling the pcre2_get_error_message()  function  (see  "Ob-
3719        callout block structure, which contains the following fields, not  nec-
3736        first callout, 2 for the second, and so on. The input and output point-
3751        If  the  value is zero, the replacement is accepted, and, if PCRE2_SUB-
3753        match.  If  the  value  is not zero, the current replacement is not ac-
3767        capture groups are not required to be unique. Duplicate names  are  al-
3773        match,  only  one of each set of identically-named groups participates.
3778        to the given name that is set. Only if none are set is  PCRE2_ERROR_UN-
3779        SET  is  returned.  The pcre2_substring_number_from_name() function re-
3791        point to the first and last entries in the name-to-number table for the
3805        which  stops when it finds the first match at a given point in the sub-
3808        function  (see  below) instead. If you cannot use the alternative func-
3812        What you have to do is to insert a callout right at the end of the pat-
3813        tern.  When your callout function is called, extract and save the  cur-
3831        different characteristics to the normal algorithm, and is not  compati-
3832        ble  with  Perl.  Some  of  the features of PCRE2 patterns are not sup-
3840        is used in a different way, and this is described below. The other com-
3846        keeping  track  of  multiple paths through the pattern tree. More work-
3847        space is needed for patterns and subjects where there are a lot of  po-
3869        PCRE2_COPY_MATCHED_SUBJECT,  PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO-
3882        that requires additional characters. This happens even if some complete
3885        if the end of the subject is  reached,  there  have  been  no  complete
3886        matches, but there is still at least one matching possibility. The por-
3889        more detailed discussion of partial and  multi-segment  matching,  with
3895        stop as soon as it has found one match. Because of the way the alterna-
3911        When pcre2_dfa_match() succeeds, it may have matched more than one sub-
3930        which  is  the  number  of  matched substrings. The offsets of the sub-
3936        Calls  to the convenience functions that extract substrings by name re-
3937        turn the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used af-
3946        NOTE: PCRE2's "auto-possessification" optimization usually  applies  to
3949        matching,  this means that only one possible match is found. If you re-
3950        ally do want multiple matches in such cases, either use an ungreedy re-
3951        peat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when  com-
4016        Copyright (c) 1997-2024 University of Cambridge.
4020 ------------------------------------------------------------------------------
4028        PCRE2 - Perl-compatible regular expressions (revised API)
4034        the library in Unix-like environments using the applications  known  as
4036        CMake  instead  of configure. The text file README contains general in-
4037        formation about building with Autotools (some of which is repeated  be-
4040        There is a lot more information about building PCRE2 without using  Au-
4042        hand") in the text file called NON-AUTOTOOLS-BUILD.  You should consult
4043        this file as well as the README file if you are building in a non-Unix-
4047 PCRE2 BUILD-TIME OPTIONS
4051        configure  script,  where  the  optional features are selected or dese-
4052        lected by providing options to configure before running the  make  com-
4053        mand.  However,  the same options can be selected in both Unix-like and
4054        non-Unix-like environments if you are using CMake instead of  configure
4059        compiler, as described in NON-AUTOTOOLS-BUILD.
4061        The complete list of options for configure (which includes the standard
4062        ones  such  as  the selection of the installation directory) can be ob-
4065          ./configure --help
4068        names begin with --enable or --disable. Because of the way that config-
4069        ure  works, --enable and --disable always come in pairs, so the comple-
4072        with --with. At the end of a configure run, a summary of the configura-
4076 BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
4078        By  default, a library called libpcre2-8 is built, containing functions
4080        either  as single-byte characters, or UTF-8 strings. You can also build
4081        two other libraries, called libpcre2-16 and libpcre2-32, which  process
4082        strings  that  are contained in arrays of 16-bit and 32-bit code units,
4083        respectively. These can be interpreted either as single-unit characters
4084        or UTF-16/UTF-32 strings. To build these additional libraries, add  one
4087          --enable-pcre2-16
4088          --enable-pcre2-32
4090        If you do not want the 8-bit library, add
4092          --disable-pcre2-8
4095        the POSIX wrapper is for the 8-bit library only, and that pcre2grep  is
4096        an  8-bit  program.  Neither  of these is built if you select only the
4097        16-bit or 32-bit libraries.
4106          --disable-shared
4107          --disable-static
4109        to the configure command. Setting --disable-shared ensures  that  PCRE2
4114        you  want these binaries to be fully statically linked, you can set LD-
4117        LDFLAGS=--static ./configure --disable-shared
4119        Note the two hyphens in --static. Of course, this works only if  static
4128          --disable-unicode
4131        It  is  not  possible to build one library with Unicode support and an-
4134        Of itself, Unicode support does not make PCRE2 treat strings as  UTF-8,
4135        UTF-16 or UTF-32. To do that, applications that use the library can set
4136        the  PCRE2_UTF  option when they call pcre2_compile() to compile a pat-
4144        and Nd, script names, and some bi-directional properties are supported.
4156        mode,  can  cause unpredictable behaviour because it may leave the cur-
4157        rent matching point in the middle of a multi-code-unit  character.  The
4158        application  can lock it out by setting the PCRE2_NEVER_BACKSLASH_C op-
4159        tion when calling pcre2_compile(). There is also a build-time option
4161          --enable-never-backslash-C
4166 JUST-IN-TIME COMPILER SUPPORT
4168        Just-in-time (JIT) compiler support is included in the build by  speci-
4171          --enable-jit
4177          --enable-jit=auto
4184          --enable-jit-sealloc
4191          --disable-pcre2grep-jit
4199        the  end  of  a line. This is the normal newline character on Unix-like
4203          --enable-newline-is-cr
4205        to  the  configure command. There is also an --enable-newline-is-lf op-
4209        the two-character sequence CRLF (CR immediately followed by LF). If you
4212          --enable-newline-is-crlf
4216          --enable-newline-is-anycrlf
4221          --enable-newline-is-any
4224        newline sequences are the three just mentioned, plus the single charac-
4229          --enable-newline-is-nul
4231        which  causes  NUL  (binary  zero) to be set as the default line-ending
4245          --enable-bsr-anycrlf
4247        the  default  is changed so that \R matches only CR, LF, or CRLF. What-
4255        part to another (for example, from an opening parenthesis to an  alter-
4256        nation  metacharacter).  By default, in the 8-bit and 16-bit libraries,
4257        two-byte values are used for these offsets, leading to a  maximum  size
4258        for a compiled pattern of around 64 thousand code units. This is suffi-
4261        compile PCRE2 to use three-byte or four-byte offsets by adding  a  set-
4264          --with-link-size=3
4267        16-bit library, a value of 3 is rounded up to 4.  In  these  libraries,
4269        to load additional data when handling them. For the 32-bit library  the
4270        value  is  always 4 and cannot be overridden; the value of --with-link-
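The link-size rules described here can be summarized as a small piece of arithmetic. The sketch below is illustrative only: effective_link_size() and link_offset_range() are hypothetical helpers, and the range is a simple upper bound on the values a link of that many bytes can hold, not the library's exact pattern-size limit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: link size in bytes that results from a requested
   --with-link-size value (2, 3, or 4) for a given code unit width. */
static int effective_link_size(int code_unit_bits, int requested)
{
    assert(code_unit_bits == 8 || code_unit_bits == 16 ||
           code_unit_bits == 32);
    assert(requested >= 2 && requested <= 4);
    if (code_unit_bits == 32)
        return 4;                   /* always 4, cannot be overridden */
    if (code_unit_bits == 16 && requested == 3)
        return 4;                   /* 3 is rounded up to 4 */
    return requested;
}

/* Number of distinct offset values representable in `link_size` bytes. */
static uint64_t link_offset_range(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    return (uint64_t)1 << (8 * link_size);
}
```

With the default two-byte links the range is 65536, which corresponds to the limit of around 64 thousand code units mentioned above; three-byte links raise it to 16777216.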
4283          --with-match-limit=500000
4297          --with-heap-limit=500
4307        for  --with-match-limit.  You  can set a lower default limit by adding,
4310          --with-match-limit-depth=10000
4317        This limit was more useful in versions before 10.30, where function re-
4322        for lookaround assertions, atomic groups,  and  recursion  within  pat-
4326 LIMITING VARIABLE-LENGTH LOOKBEHIND ASSERTIONS
4328        Lookbehind  assertions  in which one or more branches can match a vari-
4330        matching  length  for  each  top-level branch. There is a limit to this
4334          --with-max-varlookbehind=100
4336        The limit can be changed at runtime by calling pcre2_set_max_varlookbe-
4349          --enable-rebuild-chartables
4352        Instead, a program called pcre2_dftables is compiled and run. This out-
4354        your C run-time system. This method of replacing the  tables  does  not
4364          cc src/pcre2_dftables.c -o pcre2_dftables
4368        want to specify a locale, you must use the -L option:
4370          LC_ALL=fr_FR ./pcre2_dftables -L src/pcre2_chartables.c
4372        You can also specify -b (with or without -L). This causes the tables to
4374        can  be  loaded  into memory by an application and passed to pcre2_com-
4376        The tables are just a string of bytes, independent of hardware  charac-
4387        compiled to run in an 8-bit EBCDIC environment by adding
4389          --enable-ebcdic --disable-unicode
4391        to the configure command. This setting implies --enable-rebuild-charta-
4392        bles.  You should only use it if you know that you are in an EBCDIC en-
4395        It is not possible to support both EBCDIC and UTF-8 codes in  the  same
4396        version  of  the  library. Consequently, --enable-unicode and --enable-
4403          --enable-ebcdic-nl25
4405        as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
4407        0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
4410        The options that select newline behaviour, such as --enable-newline-is-
4411        cr, and equivalent run-time options, refer to these character values in
4418        within the patterns it is matching. There are two kinds: one that  gen-
4419        erates output using local code, and another that calls an external pro-
4420        gram  or  script.   If --disable-pcre2grep-callout-fork is added to the
4422        --disable-pcre2grep-callout  is  used,  all callouts are completely ig-
4423        nored. For more details of pcre2grep callouts, see the pcre2grep  docu-
4433          --enable-pcre2grep-libz
4434          --enable-pcre2grep-libbz2
4436        to the configure command. These options naturally require that the rel-
4448        be processable is the notional buffer size. If a longer line is encoun-
4454          --with-pcre2grep-bufsize=51200
4455          --with-pcre2grep-max-bufsize=2097152
4458        values by using --buffer-size  and  --max-buffer-size  on  the  command
4466          --enable-pcre2test-libreadline
4467          --enable-pcre2test-libedit
4469        to  the configure command, pcre2test is linked with the libreadline or-
4471        it  reads  it using the readline() function. This provides line-editing
4472        and history facilities. Note that libreadline is  GPL-licensed,  so  if
4477        Setting --enable-pcre2test-libreadline causes the -lreadline option  to
4479        system-installed readline library this is sufficient. However, in  some
4491          LIBS="-lncurses"
4500          --enable-debug
4510          --enable-valgrind
4513        certain  memory  regions as unaddressable. This allows it to detect in-
4523          --enable-coverage
4535        When --enable-coverage is used,  the  following  additional  targets  are
4541        equivalent to running "make coverage-reset", "make  coverage-baseline",
4542        "make check", and then "make coverage-report".
4544          make coverage-reset
4548          make coverage-baseline
4552          make coverage-report
4556          make coverage-clean-report
4558        This  removes the generated coverage report without cleaning the cover-
4561          make coverage-clean-data
4566          make coverage-clean
4569        For more information about code coverage, see the gcov and  lcov  docu-
4583          --disable-percent-zt
4595          --enable-fuzz-support
4597        At present this applies only to the 8-bit library. If set, it causes an
4598        extra  library  called  libpcre2-fuzzsupport.a to be built, but not in-
4599        stalled. This contains a single  function  called  LLVMFuzzerTestOneIn-
4606        Setting  --enable-fuzz-support  also  causes  a binary called pcre2fuz-
4621          --disable-stack-for-recursion
4630        pcre2api(3), pcre2-config(3).
4643        Copyright (c) 1997-2024 University of Cambridge.
4647 ------------------------------------------------------------------------------
4655        PCRE2 - Perl-compatible regular expressions (revised API)
4671        PCRE2  provides  a  feature  called "callout", which is a means of tem-
4672        porarily passing control to the caller of PCRE2 in the middle  of  pat-
4677        When  using the pcre2_substitute() function, an additional callout fea-
4687        ending delimiter is the same as the start, except for {, where the end-
4707          A(\d{2}|--)
4711          (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
4714        alternation bar. If the pattern contains a conditional group whose con-
4728        information when you are trying to optimize the performance of  a  par-
4738    Auto-possessification
4740        At compile time, PCRE2 "auto-possessifies" repeated items when it knows
4746          --->aaaa
4754        the  auto-possessify  feature  by  passing   PCRE2_NO_AUTO_POSSESS   to
4758          --->aaaa
4775        beginning of the subject, and pcre2_compile() remembers this. If a pat-
4776        tern  has more than one top-level branch, automatic anchoring occurs if
4781        It is also disabled if the pattern contains (*PRUNE) or  (*SKIP).  How-
4787          --->aa
4794        This shows that all match attempts start at the beginning of  the  sub-
4795        ject. In other words, the pattern is anchored. You can disable this op-
4797        starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the  out-
4800          --->aa
4810        This  shows more match attempts, starting at the second subject charac-
4831        You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
4841        to both normal, DFA, and JIT matching. The first argument to the  call-
4867        version  1, and the callout_flags field for version 2. If you are writ-
4876        contains the number of the callout, in the range  0-255.  This  is  the
4883        callout_string  points  to the string that is contained within the com-
4891        delimiter as callout_string[-1] if you need it.
4907        For  calls  to pcre2_match(), the offset_vector field is not (since re-
4909        matching  function in the match data block. Instead it points to an in-
4915        The capture_last field contains the number of the  most  recently  cap-
4917        number of the highest numbered captured substring so far.  If  no  sub-
4923        The contents of ovector[2] to  ovector[<capture_top>*2-1]  can  be  in-
4927        the match is by definition not complete. Substrings that have not  been
4932        was  passed  to the matching function in the match data block for call-
4943        at which the current match attempt started. However, if the escape  se-
4958        parenthesis, the length includes meta characters that follow the paren-
4961        length  is  one,  unless a closing parenthesis is followed by a quanti-
4964        was that of the entire group, and before an alternation bar or a  clos-
4970        are used by pcre2test to show the next item to be matched when display-
4974        zero-terminated name of the most recently passed (*MARK), (*PRUNE),  or
4996        starting position in the subject. Output from pcre2test does not  indi-
5000        The information in the callout_flags field is provided so that applica-
5004        because  there is no backtracking in DFA matching, and there is no sup-
5018        Negative values should normally be chosen from  the  set  of  PCRE2_ER-
5036        which they appear. Its first argument is a pointer to a callout enumer-
5038        passed to pcre2_callout_enumerate(). The data block contains  the  fol-
5056        non-zero minimum or a fixed maximum, the group is replicated inside the
5062        The callback function should normally return zero. If it returns a non-
5077        Copyright (c) 1997-2024 University of Cambridge.
5081 ------------------------------------------------------------------------------
5089        PCRE2 - Perl-compatible regular expressions (revised API)
5102        matches the next character unless it is the  start  of  a  newline  se-
5111        3.  Like  Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
5113        does not assert that the next three characters are not "a". It just as-
5118        on non-lookaround assertions.
5121        to repeat (for example, at the start of a branch), PCRE2 raises an  er-
5131        \u, \U, and \N when followed by a character name. \N on its own, match-
5132        ing a non-newline character, and \N{U+dd..}, matching  a  Unicode  code
5134        letters are implemented by Perl's general string-handling and  are  not
5145        binary properties. Both PCRE2 and Perl support the Cs (surrogate) prop-
5146        erty, but in PCRE2 its use is limited. See the pcre2pattern  documenta-
5147        tion  for  details. The long synonyms for property names that Perl sup-
5148        ports (such as \p{Letter}) are not supported by PCRE2, nor is  it  per-
5155        variables). Also, Perl does "double-quotish backslash interpolation" on
5183        their effect is confined to that group; it does not extend to the  sur-
5198        matching  "aba"  against the pattern /^(a(b)?)+$/ in Perl leaves $2 un-
5203        works internally just with numbers, using an external table  to  trans-
5218        such  as  [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
5223        not  affected when case-independent matching is specified. For example,
5229        18. From release 5.32.0, Perl locks out the use of \K in lookaround as-
5231        there is an option for re-enabling the previous  behaviour.  When  this
5235        19. PCRE2 provides some extensions to the Perl regular  expression  fa-
5241        $ meta-character matches only at the very end of the string.
5246        (c)  If PCRE2_UNGREEDY is set, the greediness of the repetition quanti-
5247        fiers is inverted, that is, by default they are not greedy, but if fol-
5259        (g)  The  callout  facility is PCRE2-specific. Perl supports codeblocks
5262        (h) The partial matching facility is PCRE2-specific.
5265        different way and is not Perl-compatible.
5271        (k)  PCRE2  supports non-atomic positive lookaround assertions. This is
5272        an extension to the lookaround facilities. The default, Perl-compatible
5278        supports  relative  group numbers such as +2 and -4 in all three cases.
5282        20. Perl has different limits than PCRE2. See the pcre2limit documenta-
5283        tion for details. With release 5.10, Perl moved from recursion to iteration, keep-
5285        not  fall into any stack-overflow limit. PCRE2 made a similar change at
5286        release 10.30, and also has many build-time and  run-time  customizable
5289        21.  Unlike  Perl,  PCRE2 doesn't have character set modifiers and spe-
5296        can be handled by PCRE2, either by the interpreter or the JIT. An exam-
5311        Copyright (c) 1997-2023 University of Cambridge.
5315 ------------------------------------------------------------------------------
5323        PCRE2 - Perl-compatible regular expressions (revised API)
5326 PCRE2 JUST-IN-TIME COMPILER SUPPORT
5328        Just-in-time  compiling  is a heavyweight optimization that can greatly
5329        speed up pattern matching. However, it comes at the cost of extra  pro-
5331        the same pattern is going to be matched many times. This does not  nec-
5333        anchored, matching attempts may take place many times at various  posi-
5335        string  is  very  long,  it  may  still pay to use JIT even for one-off
5336        matches. JIT support is available for all  of  the  8-bit,  16-bit  and
5337        32-bit PCRE2 libraries.
5339        JIT  support  applies  only to the traditional Perl-compatible matching
5347        --enable-jit (or equivalent CMake option) must be  set  when  PCRE2  is
5351          ARM 32-bit (v7, and Thumb2)
5352          ARM 64-bit
5354          Intel x86 32-bit and 64-bit
5356          MIPS 32-bit and 64-bit
5357          Power PC 32-bit and 64-bit
5358          RISC-V 32-bit and 64-bit
5360        If --enable-jit is set on an unsupported platform, compilation fails.
5366        particular match. One reason for this is that there are a number of op-
5367        tions and pattern items that are not supported by JIT (see below).  An-
5369        in which to build its compiled code. The only guarantee from pcre2_con-
5376        there is a "fast path" API that is JIT-specific.
5385        second  is  zero  or  more of the following option bits: PCRE2_JIT_COM-
5396        the size of machine stack that it uses. The exact rules are  not  docu-
5397        mented because they may change at any time, in particular, when new op-
5401        PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for  com-
5402        plete  matches. If you want to run partial matches using the PCRE2_PAR-
5407        pcre2_match()  is  called,  the appropriate code is run if it is avail-
5412        the option bits. For example, you can call it once with  PCRE2_JIT_COM-
5415        will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
5416        ing. If pcre2_jit_compile() is called with no option bits set, it imme-
5424        are  described  in the section entitled "Controlling the JIT stack" be-
5433        stack"  below,  even  if  you  do  not need to supply a non-default JIT
5435        be obeyed. If the match-time options are not right for  JIT  execution,
5438        If  the  JIT  compiler finds an unsupported item, no JIT data is gener-
5440        pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JITSIZE op-
5441        tion. A non-zero result means that JIT compilation  was  successful.  A
5452        are  normally expected to be a valid sequence of UTF code units. By de-
5453        fault, this is checked at the start of matching and an error is  gener-
5462        PCRE2_MATCH_INVALID_UTF option has two effects:  it  tells  the  inter-
5463        preter  in pcre2_match() to support invalid UTF, and, if pcre2_jit_com-
5464        pile() is subsequently called, the compiled JIT code also supports  in-
5469        PCRE2_JIT_INVALID_UTF, which currently exists only for backward compat-
5487        when  running in a UTF mode, and a callout immediately before an asser-
5510        large or complicated patterns need more than this. The error  PCRE2_ER-
5511        ROR_JIT_STACKLIMIT is given when there is not enough stack. Three func-
5516        The pcre2_jit_stack_create() function creates a JIT  stack.  Its  argu-
5522        function returns immediately, without doing anything. (For the  techni-
5534        The first argument is a pointer to a match context. When this is subse-
5536        JIT stack is used. If this argument is NULL, the function returns imme-
5556        is not obeyed when pcre2_match() is called with options that are incom-
5558        determine  whether  a match operation was executed by JIT or by the in-
5564        up  non-sequential matches in one thread is to use callouts: if a call-
5569        you assign or pass back NULL from a callback, that is thread-safe,  be-
5571        pass back a non-NULL JIT stack, this must be a different stack for each
5572        thread so that the application is thread-safe.
5574        Strictly speaking, even more is allowed. You can assign the  same  non-
5583        up non-default JIT stacks might operate:
5591          Use a one-line callback function
5602        PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
5604        child nodes.  Allocating real machine stack on some platforms is diffi-
5611        Modern  operating  systems have a nice feature: they can reserve an ad-
5614        memory data (this is important because of pointers). Thus we can  allo-
5634        You can free compiled patterns, contexts, and stacks in any order, any-
5653        Especially on embedded systems, it might be a good idea to release mem-
5656        allocated  memory for any stack and another which allows releasing mem-
5670        The JIT executable allocator does not free all memory when it is possi-
5674        calling pcre2_jit_free_unused_memory(). Its argument is a general  con-
5675        text, for custom memory management, or NULL for standard memory manage-
5681        This  is  a  single-threaded example that specifies a JIT stack without
5718        The fast path function is called pcre2_jit_match(), and  it  takes  ex-
5720        must be specified with a  length;  PCRE2_ZERO_TERMINATED  is  not  sup-
5722        PCRE2_ENDANCHORED) are ignored, as is the PCRE2_NO_JIT option. The  re-
5723        turn  values  are  also  the  same as for pcre2_match(), plus PCRE2_ER-
5724        ROR_JIT_BADOPTION if a matching mode (partial or complete) is requested
5728        number of other sanity checks are performed on the arguments. For exam-
5729        ple,  if the subject pointer is NULL but the length is non-zero, an im-
5758        Copyright (c) 1997-2024 University of Cambridge.
5762 ------------------------------------------------------------------------------
5770        PCRE2 - Perl-compatible regular expressions (revised API)
5779        code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with
5780        the default internal linkage size, which  is  2  bytes  for  these  li-
5783        (when building the 16-bit library, 3 is  rounded  up  to  4).  See  the
5785        for  details.  In  these cases the limit is substantially larger.  How-
5786        ever, the speed of execution is slower. In the 32-bit library, the  in-
5794        the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an un-
5796        is ~(PCRE2_SIZE)0) is reserved as a special indicator  for  zero-termi-
5801        There are two different limits that apply to branches of lookbehind as-
5821        (*THEN)  verb  is  255  code units for the 8-bit library and 65535 code
5822        units for the 16-bit and 32-bit libraries.
5825        number a 32-bit unsigned integer can hold.
5842        Copyright (c) 1997-2023 University of Cambridge.
5846 ------------------------------------------------------------------------------
5854        PCRE2 - Perl-compatible regular expressions (revised API)
5862        pcre2_match() function. This works in the same way as Perl's matching func-
5863        tion,  and  provides  a Perl-compatible matching operation. The just-in-
5868        it operates in a different way, and is not Perl-compatible. This alter-
5869        native has advantages and disadvantages compared with the standard  al-
5889        The set of strings that are matched by a regular expression can be rep-
5894        tree: depth-first and breadth-first, and these correspond  to  the  two
5900        In  the terminology of Jeffrey Friedl's book "Mastering Regular Expres-
5902        depth-first  search  of  the pattern tree. That is, it proceeds along a
5904        required. When there is a mismatch, the algorithm  tries  any  alterna-
5913        that  point the algorithm stops. Thus, if there is more than one possi-
5916        on the way the alternations and the greedy or ungreedy repetition quan-
5919        Because it ends up with a single path through the  tree,  it  is  rela-
5920        tively  straightforward  for  this  algorithm to keep track of the sub-
5927        This  algorithm  conducts  a breadth-first search of the tree. Starting
5938        following  or  preceding the current point have to be independently in-
5953        the  match  data block is therefore not advisable when doing DFA match-
5963        the  fifth  character  of the subject. The algorithm does not automati-
5966        PCRE2's "auto-possessification" optimization usually applies to charac-
5967        ter repeats at the end of a pattern (as well as internally). For  exam-
5972        either use an ungreedy repeat ("a\d+?") or set  the  PCRE2_NO_AUTO_POS-
5976        not supported or behave differently in the alternative  matching  func-
5979        1.  Because the algorithm finds all possible matches, the greedy or un-
5981        affect  auto-possessification,  as  just  described).  During matching,
5990        a  non-possessive quantifier. Similarly, if an atomic group is present,
5998        algorithm does not attempt to do this. This means that no captured sub-
6001        3. Because no substrings are captured, backreferences within  the  pat-
6004        4.  For  the same reason, conditional expressions that use a backrefer-
6010        6. Because many paths through the tree may be active, the \K escape se-
6019        these  modes,  because the alternative algorithm moves through the sub-
6027        10.  The PCRE2_MATCH_INVALID_UTF option for pcre2_compile() is not sup-
6041        matching and discusses multi-segment matching.
6071        Copyright (c) 1997-2024 University of Cambridge.
6075 ------------------------------------------------------------------------------
6083        PCRE2 - Perl-compatible regular expressions
6099        Another  example is checking a user input string as it is typed, to en-
6103        Partial  matching  is a PCRE2-specific feature; it is not Perl-compati-
6105        PCRE2_PARTIAL_SOFT  options  when calling a matching function. The dif-
6107        preferred  to  an alternative complete match, though the details differ
6111        If  you  want to use partial matching with just-in-time optimized code,
6113        you  must  also  call pcre2_jit_compile() with one or both of these op-
6119        PCRE2_JIT_COMPLETE should also be set if you are going to run  non-par-
6124        Setting  a partial matching option disables two of PCRE2's standard op-
6125        timization hints. PCRE2 remembers the last literal code unit in a  pat-
6138        needed to complete the match, or the addition of more characters  might
6141        Example  1: if the pattern is /abc/ and the subject is "ab", more char-
6142        acters are definitely needed to complete a match.  In  this  case  both
6145        Example  2: if the pattern is /ab+/ and the subject is "ab", a complete
6147        what  is  matched. In this case, only PCRE2_PARTIAL_HARD returns a par-
6148        tial match; PCRE2_PARTIAL_SOFT returns the complete match.
6158        assertions  and the \K escape sequence provide ways of inspecting char-
6161        (2) The pattern contains one or more lookbehind assertions. This condi-
6162        tion exists in case there is a lookbehind that inspects characters  be-
6169        because  adding  more  characters  might  result  in a non-empty match,
6171        "there  is going to be a match at this point, but until some more char-
6172        acters are added, we do not know if it will be an empty string or some-
6182          A complete match has been found, starting and ending within this sub-
6189          Adding  more  characters may result in a complete match that uses one
6194        the rest of the ovector are undefined. The appearance of \K in the pat-
6199        If it is matched against "456abc123xyz" the result is a complete match,
6203        string  "abc12",  because  all these characters are needed for a subse-
6204        quent re-match with additional characters.
6211        If  this is matched against the subject string "abc123dog", both alter-
6214        and 9, identifying "123dog" as the first partial match. (In this  exam-
6225        complete matches. This option is "hard" because it prefers  an  earlier
6226        partial match over a later complete match. For this reason, the assump-
6228        true end of the available data, which is why \z, \Z, \b, \B, and $  al-
6233        tried. If no complete match can be found,  PCRE2_ERROR_PARTIAL  is  re-
6235        prefers a complete match over a partial match. All the various matching
6236        items  in a pattern behave as if the subject string is potentially com-
6238        for \b and \B the end of the subject is treated as a non-alphanumeric.
6240        The  difference  between the two partial matching options can be illus-
6247        "dog" with PCRE2_PARTIAL_SOFT, it yields a complete  match  for  "dog".
6248        However,  if  PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
6254        In  this  case  the  result  is always a complete match because that is
6255        found first, and matching never  continues  after  finding  a  complete
6291        PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete,
6295 MULTI-SEGMENT MATCHING WITH pcre2_match()
6297        PCRE  was  not originally designed with multi-segment matching in mind.
6299        multi-segment matching possible have been added. A very long string can
6301        with the aim of achieving the same results that would happen if the en-
6313        When a partial match occurs, the next segment must be added to the cur-
6314        rent subject and the match re-run, using the  startoffset  argument  of
6334        If there are memory constraints, you may want to discard text that pre-
6353        of characters that must be retained in order to get the right match re-
6358        use that to decide how much text to retain. The only lookbehind  infor-
6363        maximum number of characters (not code units) that any individual look-
6368        In  a  non-UTF or a 32-bit case, moving back is just a subtraction, but
6369        in UTF-8 or UTF-16 you have  to  count  characters  while  moving  back
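The difference between subtracting and counting characters can be sketched in C for the UTF-8 case. This is a minimal illustration, not PCRE2 code: the function name utf8_move_back is hypothetical, and it assumes the buffer holds valid UTF-8, so that every continuation byte has the bit pattern 10xxxxxx.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the PCRE2 API).
 * Starting at byte offset pos in a valid UTF-8 buffer s, move back by up
 * to n characters and return the resulting byte offset. The walk stops
 * early at offset 0 if fewer than n characters precede pos. */
size_t utf8_move_back(const unsigned char *s, size_t pos, size_t n)
{
    while (n > 0 && pos > 0) {
        pos--;                                   /* step onto the previous byte */
        while (pos > 0 && (s[pos] & 0xC0) == 0x80)
            pos--;                               /* skip continuation bytes */
        n--;                                     /* one whole character passed */
    }
    return pos;
}
```

An application that discards already-processed text could call this with the number of characters it needs to retain, then keep everything from the returned offset onwards. In a non-UTF or 32-bit buffer the same step would be the plain subtraction pos - n, clamped at zero.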
6376        without  backtracking,  searching  for  all possible matches simultane-
6377        ously. If the end of the subject is reached before the end of the  pat-
6381        there  have  been  no complete matches. Otherwise, the complete matches
6383        precedence  over  any  complete matches. The portion of the string that
6388        there is no difference between greedy and ungreedy repetition, its  be-
6394        Whereas the standard function stops as soon as it  finds  the  complete
6399 MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
6403        and  calling  the function again with the same compiled regular expres-
6405        same working space as before, because this is where details of the pre-
6416        The first call has "23ja" as the subject, and requests  partial  match-
6418        (restarted) match.  Notice that when the match is  complete,  only  the
6419        last  part  is  shown;  PCRE2 does not retain the previously partially-
6433        match  at one point in the subject are remembered. Depending on the ap-
6438        complete match, as described for pcre2_match() above. Another possibil-
6455        Copyright (c) 1997-2019 University of Cambridge.
6459 ------------------------------------------------------------------------------
6467        PCRE2 - Perl-compatible regular expressions (revised API)
6473        by PCRE2 are described in detail below. There is a quick-reference syn-
6475        and  semantics as closely as it can.  PCRE2 also supports some alterna-
6482        of which have copious examples. Jeffrey Friedl's "Mastering Regular Ex-
6487        This  document  discusses the regular expression patterns that are sup-
6491        not  Perl-compatible.  Some  of  the  features  discussed below are not
6493        of  the  alternative function, and how it differs from the normal func-
6497 SPECIAL START-OF-PATTERN ITEMS
6500        set by special items at the start of a pattern. These are not Perl-com-
6502        writers who are not able to change the program that processes the  pat-
6503        tern.  Any  number  of these items may appear, but they must all be to-
6509        In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
6510        as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
6511        can  be  specified  for the 32-bit library, in which case it constrains
6522        restrict  them  to  non-UTF  data  for   security   reasons.   If   the
6523        PCRE2_NEVER_UTF  option is passed to pcre2_compile(), (*UTF) is not al-
6530        causes sequences such as \d and \w to use Unicode properties to  deter-
6532        less than 256 via a lookup table. It also causes upper/lower casing op-
6547        to whichever matching function is subsequently called to match the pat-
6548        tern. These options lock out the matching of empty strings, either  en-
6551    Disabling auto-possessification
6559    Disabling start-up optimizations
6562        setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti-
6569        as  setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza-
6570        tions that apply to patterns whose top-level branches all start with .*
6590        These facilities are provided to catch runaway matches  that  are  pro-
6591        voked  by patterns with huge matching trees. A common example is a pat-
6601        where d is any number of decimal digits. However, the value of the set-
6624        strings: a single CR (carriage return) character, a  single  LF  (line-
6625        feed) character, the two-character sequence CRLF, any of the three pre-
6630        It  is also possible to specify a newline convention by starting a pat-
6640        These override the default and the options given to the compiling func-
6641        tion. For example, on a Unix system where LF is the default newline se-
6650        The newline convention affects where the circumflex and  dollar  asser-
6651        tions are true. It also affects the interpretation of the dot metachar-
6654        escape  sequence  matches.  By default, this is any Unicode newline se-
6663        the complete set  of  Unicode  line  endings)  by  setting  the  option
6665        starting a pattern with (*BSR_ANYCRLF).  For  completeness,  (*BSR_UNI-
6672        character code instead of ASCII or Unicode (typically a mainframe  sys-
6673        tem).  In  the  sections below, character code values are ASCII or Uni-
6691        their lower case ASCII equivalents, are  case-equivalent  with  Unicode
6703        There are two different sets of metacharacters: those that  are  recog-
6721        Brace  characters  {  and } are also used to enclose data for construc-
6724        and  are  ignored. In the case of quantifiers, they may also appear be-
6735          -      indicates character range
6741        sequence,  or  between  a # outside a character class and the next new-
6742        line, inclusive, are ignored. An escaping backslash can be used to  in-
6764        always safe to precede a non-alphanumeric  with  backslash  to  specify
6765        that it stands for itself.  In particular, if you want to match a back-
6768        Only  ASCII  digits  and letters have any special meaning after a back-
6777        Perl, $ and @ cause variable interpolation. Also,  Perl  does  "double-
6800    Non-printing characters
6802        A second use of backslash provides a way of encoding non-printing char-
6804        appearance  of non-printing characters in a pattern, but when a pattern
6806        following escape sequences instead of the binary  character  it  repre-
6807        sents.  In  an  ASCII or Unicode environment, these escapes are as fol-
6811          \cx         "control-x", where x is a non-control ASCII character
6824        By default, after \x that is not followed by {, from zero to two  hexa-
6826        number of hexadecimal digits may appear between \x{ and }. If a charac-
6831        of the two syntaxes for \x or by an octal sequence. There is no differ-
6836        Support is available for some ECMAScript (aka  JavaScript)  escape  se-
6837        quences via two compile-time options. If PCRE2_ALT_BSUX is set, the se-
6839        two hexadecimal digits is it recognized as a character  escape.  Other-
6840        wise  it  is interpreted as a literal "x" character. In this mode, sup-
6845        PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in  ad-
6846        dition, \u{hhh..} is recognized as the character specified by hexadeci-
6847        mal code point.  There may be any number of hexadecimal digits, but un-
6852        The  \N{U+hhh..} escape sequence is recognized only when PCRE2 is oper-
6855        followed by an opening brace (curly bracket) it has an entirely differ-
6858        There are some legacy applications where the escape sequence \r is  ex-
6864        point is in the range 32 to 126. The precise effect of \cx is  as  fol-
6869        point less than 32 or greater than 126, a compile-time error occurs.
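The \cx calculation for an ASCII or Unicode environment can be sketched as below. The rule used here (a lower case letter is first converted to upper case, then bit 6, hex 40, of the character is inverted) is taken from the full pcre2pattern text rather than from the lines shown in this listing, and the function name control_escape_value is hypothetical.

```c
/* Hypothetical helper (not part of the PCRE2 API).
 * Returns the code point that \cx stands for in an ASCII/Unicode
 * environment, or -1 where a compile-time error would occur because
 * x is less than 32 or greater than 126. */
int control_escape_value(int x)
{
    if (x < 32 || x > 126)
        return -1;                 /* outside the permitted range */
    if (x >= 'a' && x <= 'z')
        x = x - 'a' + 'A';         /* lower case is converted to upper case */
    return x ^ 0x40;               /* invert bit 6 (hex 40) */
}
```

Under this rule \cA and \ca both give hex 01, \cZ gives hex 1A, and \c? gives hex 7F. The EBCDIC behaviour described next follows different rules and is not covered by this sketch.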
6873        The \c escape is processed as specified for Perl in the perlebcdic doc-
6874        ument.  The  only characters that are allowed after \c are A-Z, a-z, or
6875        one of @, [, \, ], ^, _, or ?. Any other character provokes a  compile-
6877        letters (in either case) encode characters 1-26 (hex 01 to hex 1A);  [,
6878        \,  ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c? be-
6890        FF),  but  in  the one Perl calls POSIX-BC its value is 95 (hex 5F). If
6891        certain other characters have POSIX-BC values, PCRE2 makes \c? generate
6895        than  two  digits,  just  those that are present are used. Thus the se-
6911        The handling of a backslash followed by a digit other than 0 is compli-
6914        Outside a character class, PCRE2 reads the digit and any following dig-
6917        groups  in the expression, the entire sequence is taken as a backrefer-
6919        discussion  of parenthesized groups.  Otherwise, up to three octal dig-
6922        Inside a character class, PCRE2 handles \8 and \9 as the literal  char-
6923        acters  "8"  and "9", and otherwise reads up to three octal digits fol-
6924        lowing the backslash, using them to generate a data character. Any sub-
6951          8-bit non-UTF mode    no greater than 0xff
6952          16-bit non-UTF mode   no greater than 0xffff
6953          32-bit non-UTF mode   no greater than 0xffffffff
6957        (the so-called "surrogate" code points). The check  for  these  can  be
6960        UTF-8  and  UTF-32 modes, because these values are not representable in
6961        UTF-16.
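The limits above can be expressed as a small range check. This is a sketch only: the function name escape_value_allowed is hypothetical, the UTF branch assumes the Unicode limit of 0x10ffff and the surrogate range U+D800 to U+DFFF mentioned elsewhere in this text, and it models the default behaviour without any option that relaxes the surrogate check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of the PCRE2 API).
 * Reports whether a code point given by a character escape is within the
 * default limit for a library with the given code unit width (8, 16 or 32),
 * in either UTF or non-UTF mode. */
bool escape_value_allowed(uint32_t cp, int width, bool utf)
{
    if (utf) {
        /* Any UTF mode: within the Unicode limit and not a surrogate. */
        return cp <= 0x10ffff && !(cp >= 0xd800 && cp <= 0xdfff);
    }
    switch (width) {
    case 8:  return cp <= 0xff;
    case 16: return cp <= 0xffff;
    case 32: return true;          /* every 32-bit value is <= 0xffffffff */
    default: return false;         /* unknown code unit width */
    }
}
```

For example, the value 0x100 is rejected in 8-bit non-UTF mode but accepted in 16-bit non-UTF mode, and 0xd800 is rejected in every UTF mode.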
6979        However, if either of the PCRE2_ALT_BSUX  or  PCRE2_EXTRA_ALT_BSUX  op-
6985        The sequence \g followed by a signed or unsigned number, optionally en-
6992        For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
6996        \g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
7013          \W     any "non-word" character
7018        has a different meaning. See the section entitled "Non-printing charac-
7022        Each pair of lower and upper case escape sequences partitions the  com-
7031        (13),  and  space (32), which are defined as white space in the "C" lo-
7032        cale. This list may vary if locale-specific matching is  taking  place.
7033        For  example, in some locales the "non-breaking space" character (\xA0)
7037        or  digit.   By  default,  the definition of letters and digits is con-
7038        trolled by PCRE2's low-valued character tables, and may vary if locale-
7040        page). For example, in a French locale such  as  "fr_FR"  in  Unix-like
7047        be different for characters in the range 128-255  when  locale-specific
7049        meanings from before Unicode support was available,  mainly  for  effi-
7058        The addition of \p{Mn} (non-spacing mark) and the replacement of an ex-
7059        plicit  test  for underscore with a test for \p{Pc} (connector punctua-
7072        reset within a pattern by means of an internal option setting (see  be-
7082          U+00A0     Non-break space
7089          U+2004     Three-per-em space
7090          U+2005     Four-per-em space
7091          U+2006     Six-per-em space
7096          U+202F     Narrow no-break space
7110        In 8-bit, non-UTF-8 mode, only the characters  with  code  points  less
7116        any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is  equivalent
7121        This is an example of an "atomic group", details of which are given be-
7122        low.   This  particular group matches either the two-character sequence
7124        U+000A),  VT  (vertical  tab, U+000B), FF (form feed, U+000C), CR (car-
7126        atomic  group,  the  two-character sequence is treated as a single unit
7130        than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
7135        the  complete  set  of  Unicode  line  endings)  by  setting the option
7136        PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation  for  "back-
7138        the  case,  the other behaviour can be requested via the PCRE2_BSR_UNI-
7145        These override the default and the options given to the compiling func-
7146        tion.  Note that these special settings, which are not Perl-compatible,
7149        used. They can be combined with a change of newline convention; for ex-
7155        Inside a character class, \R is treated as an unrecognized  escape  se-
7160        When  PCRE2  is  built  with Unicode support (the default), three addi-
7162        are available. They can be used in any mode, though in 8-bit and 16-bit
7163        non-UTF  modes these sequences are of course limited to testing charac-
7165        In  32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode
7166        limit) may be encountered. These are all treated as being  in  the  Un-
7170        to do a multistage table lookup in order to find  a  character's  prop-
7182        The  property names represented by xx above are not case-sensitive, and
7186        (including  newline),  Bidi_Class,  a number of binary (yes/no) proper-
7194        There are three different syntax forms for matching a script. Each Uni-
7202        a property type, for example, \p{Adlam}, it is  treated  as  \p{scx:Ad-
7206        Unassigned characters (and in non-UTF 32-bit mode, characters with code
7208        that are not part of an identified script are lumped together as  "Com-
7209        mon". The current list of recognized script names and their 4-character
7212          pcre2test -LS
7217        Each character has exactly one Unicode general category property, spec-
7218        ified  by a two-letter abbreviation. For compatibility with Perl, nega-
7223        If only one letter is specified with \p or \P, it includes all the gen-
7250          Mn    Non-spacing mark
7282        points  are in the range U+D800 to U+DFFF. These characters are no dif-
7284        16-bit  or  32-bit  library).   However,  they are not valid in Unicode
7285        strings and so cannot be tested by PCRE2 in UTF mode, unless UTF valid-
7293        No character that is in the Unicode table has the Cn (unassigned) prop-
7308          pcre2test -LP
7327          L           left-to-right
7328          LRE         left-to-right embedding
7329          LRI         left-to-right isolate
7330          LRO         left-to-right override
7331          NSM         non-spacing mark
7335          R           right-to-left
7336          RLE         right-to-left embedding
7337          RLI         right-to-left isolate
7338          RLO         right-to-left override
7343        case-insensitive; only the short names listed above are recognized.
7354        properties that had been used for emojis.  Instead it introduced  vari-
7355        ous  emoji-specific  properties.  PCRE2  uses  only the Extended Picto-
7364        2.  Do not end between CR and LF; otherwise end after any control char-
7370        be  followed  by  a V or T character; an LVT or T character may be fol-
7373        4. Do not end before extending characters or spacing marks or the zero-
7374        width joiner (ZWJ) character. Characters with the "mark"  property  al-
7379        6.  Do not end within emoji modifier sequences or emoji ZWJ (zero-width
7385        7.  Do not break within emoji flag sequences. That is, do not break be-
7393        As  well as the standard Unicode properties described above, PCRE2 sup-
7394        ports four more that make it possible to convert traditional escape se-
7396        non-standard,  non-Perl  properties  internally  when PCRE2_UCP is set.
7404        Xan matches characters that have either the L (letter) or the  N  (num-
7407        (separator)  property.  Xsp is the same as Xps; in PCRE1 it used to ex-
7409        matches the same characters as Xan, plus those that match Mn (non-spac-
7412        There  is another non-standard property, Xuc, which matches any charac-
7419        Note that the Xuc property does not match these sequences but the char-
7425        characters not to be included in the final matched sequence that is re-
7436        mode),  though  it again reports the matched string as "bar". This fea-
7438        part of the pattern that precedes \K is not constrained to match a lim-
7440        The use of \K does not interfere with  the  setting  of  captured  sub-
7447        From  version  5.32.0  Perl  forbids the use of \K in lookaround asser-
7450        pcre2_compile() to re-enable the previous behaviour. When  this  option
7467        The final use of backslash is for certain simple assertions. An  asser-
7490        changed by setting the PCRE2_UCP option. When this is done, it also af-
7499        set. Thus, they are independent of multiline mode. These  three  asser-
7501        which affect only the behaviour of the circumflex and dollar  metachar-
7502        acters.  However,  if the startoffset argument of pcre2_match() is non-
7509        the start point of the matching process, as specified by the  startoff-
7511        startoffset is non-zero. By calling pcre2_match() multiple  times  with
7529        The  circumflex  and  dollar  metacharacters are zero-width assertions.
7530        That is, they test for a particular condition being true  without  con-
7532        are  concerned  with matching the starts and ends of lines. If the new-
7533        line convention is set so that only the two-character sequence CRLF  is
7539        point  is  at the start of the subject string. If the startoffset argu-
7540        ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is  set,  circum-
7542        character class, circumflex has an entirely different meaning (see  be-
7549        if  the  pattern  is constrained to match only at the start of the sub-
7554        matching point is at the end of the subject string, or immediately  be-
7555        fore  a newline at the end of the string (by default), unless PCRE2_NO-
7556        TEOL is set. Note, however, that it does not actually  match  the  new-
7559        branch  in which it appears. Dollar has no special meaning in a charac-
7579        pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option  is  ignored
7582        When  the  newline  convention (see "Newline conventions" below) recog-
7583        nizes the two-character sequence CRLF as a newline, this is  preferred,
7584        even  if  the  single  characters CR and LF are also recognized as new-
7598        Outside a character class, a dot in the pattern matches any one charac-
7599        ter in the subject string except (by default) a character  that  signi-
7603        Dot  never matches a single line-ending character. When the two-charac-
7613        exception.  If the two-character sequence CRLF is present in  the  sub-
7616        The  handling of dot is entirely independent of the handling of circum-
7626        the section entitled "Non-printing characters" above for details.  Perl
7634        unit,  whether or not a UTF mode is set. In the 8-bit library, one code
7635        unit is one byte; in the 16-bit library it is a  16-bit  unit;  in  the
7636        32-bit  library  it  is  a 32-bit unit. Unlike a dot, \C always matches
7637        line-ending characters. The feature is provided in  Perl  in  order  to
7638        match individual bytes in UTF-8 mode, but it is unclear how it can use-
7642        one unit with \C in UTF-8 or UTF-16 mode means that  the  rest  of  the
7643        string may start with a malformed UTF character. This has undefined re-
7645        in a valid UTF string (by default it checks the subject string's valid-
7654        below)  in UTF-8 or UTF-16 modes, because this would make it impossible
7657        these UTF modes.  The former gives a match-time error; the latter fails
7660        In  the  32-bit  library, however, \C is always supported (when not ex-
7662        whether or not UTF-32 is specified.
7665        using  it  that avoids the problem of malformed UTF-8 or UTF-16 charac-
7667        as  in  this  pattern,  which could be used with a UTF-8 string (ignore
7670          (?| (?=[\x00-\x7f])(\C) |
7671              (?=[\x80-\x{7ff}])(\C)(\C) |
7672              (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
7673              (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
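The four lookaheads above split code points into the ranges that UTF-8 encodes in 1, 2, 3, or 4 bytes. As a rough cross-check only, the sketch below restates that range table in Python and compares it with Python's own UTF-8 encoder. It does not use PCRE2, and `utf8_length` is a hypothetical helper name introduced for illustration.

```python
def utf8_length(code_point: int) -> int:
    """Bytes needed to encode a code point in UTF-8 (hypothetical helper).

    The boundaries mirror the four ranges in the pattern above.
    """
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x1FFFFF:
        return 4
    raise ValueError("code point is outside the ranges in the pattern")


# Compare the table with Python's encoder for one sample per range.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
```

Surrogate code points (U+D800 to U+DFFF) are deliberately left out of the samples, because Python's encoder rejects them.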
7677        below). The assertions at the start of each branch check the next UTF-8
7678        character for values whose encoding uses 1, 2, 3, or 4  bytes,  respec-
7679        tively.  The  character's individual bytes are then captured by the ap-
7686        closing square bracket. A closing square bracket on its own is not spe-
7705        class that starts with a circumflex is not an assertion; it still  con-
7711        letters in a class represent both their upper case and lower case  ver-
7714        would.  Note that there are two ASCII characters, K and S, that, in ad-
7715        dition to their lower case ASCII equivalents, are case-equivalent  with
7716        Unicode  U+212A (Kelvin sign) and U+017F (long S) respectively when ei-
7720        special  way  when matching character classes, whatever line-ending se-
7728        matches any hexadecimal digit. In UTF modes, the PCRE2_UCP  option  af-
7733        backspace  character.  The sequences \B, \R, and \X are not special in-
7738        The minus (hyphen) character can be used to specify a range of  charac-
7739        ters  in  a  character class. For example, [d-m] matches any letter be-
7744        [b-d-z] matches letters in the range b to d, a hyphen character, or z.
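The two classes discussed above, [d-m] and [b-d-z], can be tried out quickly. The sketch below uses Python's `re` module purely as a stand-in: it parses these two particular classes the same way, but it is not PCRE2 and differs elsewhere. `class_members` and the sample string are assumptions made for the example.

```python
import re

# Python's re is a stand-in here, not PCRE2 itself.
D_TO_M = re.compile(r"[d-m]")            # a single range, d through m
B_TO_D_HYPHEN_Z = re.compile(r"[b-d-z]")  # range b-d, a literal hyphen, then z


def class_members(pattern, candidates):
    """Return the single characters in candidates that the class matches."""
    return [c for c in candidates if pattern.fullmatch(c)]


candidates = "abcdemnyz-"
print(class_members(D_TO_M, candidates))
print(class_members(B_TO_D_HYPHEN_Z, candidates))
```

In the second class the hyphen that follows the range b-d cannot start another range, so it is taken literally.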
7753        It is not possible to have the literal character "]" as the end charac-
7754        ter of a range. A pattern such as [W-]46] is interpreted as a class  of
7755        two  characters ("W" and "-") followed by a literal string "46]", so it
7756        would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
7757        backslash  it is interpreted as the end of range, so [W-\]46] is inter-
7762        Ranges normally include all code points between the start and end char-
7763        acters, inclusive. They can also be used for code points specified  nu-
7764        merically,  for  example [\000-\037]. Ranges can include any characters
7765        that are valid for the current mode. In any  UTF  mode,  the  so-called
7768        PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES  option  disables this check). How-
7769        ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
7773        points are both specified as literal letters in the same case. For com-
7775        letters are omitted. For example, [h-k] matches only  four  characters,
7778        [\x88-\x92] or [h-\x92], all code points are included.
7781        it matches the letters in either case. For example, [W-c] is equivalent
7782        to [][\\^_`wxyzabc], matched caselessly, and  in  a  non-UTF  mode,  if
7783        character  tables  for  a French locale are in use, [\xc8-\xcb] matches
7797        special  compatibility  feature  -  see the next two sections), and the
7798        terminating closing square bracket.  However,  escaping  other  non-al-
7815          ascii    character codes 0 - 127
7829        CR  (13),  and space (32). If locale-specific matching is taking place,
7840        matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
7845        the POSIX character classes, although this may be different for charac-
7846        ters in the range 128-255 when locale-specific matching  is  happening.
7866                  when printed. In Unicode property terms, it matches all char-
7871                    U+2066 - U+2069  Various "isolate"s
7878        [:punct:] This matches all characters that have the Unicode P (punctua-
7892        ASCII  characters  when  PCRE2_UCP  is  set.   The   option   PCRE2_EX-
7893        TRA_ASCII_DIGIT  affects  just  [:digit:] and [:xdigit:]. Within a pat-
7894        tern, this can be set and unset by  (?aT)  and  (?-aT).  The  PCRE2_EX-
7896        including [:digit:] and [:xdigit:]. Within a pattern, (?aP) and  (?-aP)
7913        that  \b matches at the start and the end of a word (see "Simple asser-
7914        tions" above), and in a Perl-style pattern the preceding  or  following
7915        character  normally shows which is wanted, without the need for the as-
7916        sertions that are used above in order to give exactly the POSIX  behav-
7918        (and therefore \b) by default, so  it  also  affects  these  POSIX  se-
7941        Perl-compatible, and are described in detail in the pcre2api documenta-
7951        For example, (?im) sets caseless, multiline matching. It is also possi-
7952        ble to unset these options by preceding the relevant letters with a hy-
7953        phen, for example (?-im). The two "extended" options are  not  indepen-
7956        A   combined  setting  and  unsetting  such  as  (?im-sx),  which  sets
7960        the  option  is unset. An empty options setting "(?)" is allowed. Need-
7965        cause some options to be re-instated, but a hyphen may not appear.
7967        Some PCRE2-specific options can be changed by the same mechanism  using
7979        However,  except for 'r', these are not unset by (?^), which is equiva-
7980        lent to (?-imnrsx). If 'a' is not followed by any  of  the  upper  case
7983        PCRE2_EXTRA_ASCII_DIGIT   has   no  additional  effect  when  PCRE2_EX-
7984        TRA_ASCII_POSIX is set, but including it in  (?aP)  means  that  (?-aP)
7987        When  one of these option changes occurs at top level (that is, not in-
8008        start of a non-capturing group (see the next section), the option  let-
8016        Note:  There  are  other  PCRE2-specific options, applying to the whole
8017        pattern, which can be set by the application when the  compiling  func-
8023        are  equivalent to setting the PCRE2_UTF and PCRE2_UCP options, respec-
8041        2. It creates a "capture group". This means that, when the  whole  pat-
8053        the captured substrings are "red king", "red", and "king", and are num-
8057        helpful.  There are often times when grouping is required without  cap-
8058        turing.  If an opening parenthesis is followed by a question mark and a
8069        start  of  a non-capturing group, the option letters may appear between
8077        the  group is reached, an option setting in one branch does affect sub-
8078        sequent branches, so the above patterns match "SUNDAY" as well as "Sat-
8086        with  (?|  and  is  itself a non-capturing group. For example, consider
8091        Because the two alternatives are inside a (?| group, both sets of  cap-
8092        turing  parentheses  are  numbered one. Thus, when the pattern matches,
8095        not all, of one of a number of alternatives. Inside a (?| group, paren-
8098        whole group start after the highest number used in any branch. The fol-
8099        lowing example is taken from the Perl documentation. The numbers under-
8102          # before  ---------------branch-reset----------- after
8117        A relative reference such as (?-1) is no different: it is just a conve-
8120        If a condition test for a group's having matched refers to a non-unique
8133        was  not  added to Perl until release 5.10. Python had the feature ear-
8141        must start with a non-digit. When PCRE2_UTF is set, the syntax of group
8145          ^[_A-Za-z][_A-Za-z0-9]*\z   when PCRE2_UTF is not set
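An application that builds patterns from user-supplied group names may want to check the names before compiling. The sketch below applies only the non-UTF character rule shown above, using Python's `re` as a stand-in. `is_valid_group_name` is a hypothetical helper, and `fullmatch()` takes the place of the ^ and \z anchors.

```python
import re

# Same character rule as the non-UTF pattern above. fullmatch() stands in
# for the ^ ... \z anchors; this is Python's re module, not PCRE2.
_GROUP_NAME = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")


def is_valid_group_name(name: str) -> bool:
    """Check a candidate group name against the character rule only."""
    return _GROUP_NAME.fullmatch(name) is not None


print(is_valid_group_name("weekday"))
print(is_valid_group_name("9lives"))
```

The check covers only which characters are allowed. Any other limits that PCRE2 places on names are not modelled here.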
8156        complete  name-to-number  translation table from a compiled pattern, as
8160        Warning:  When  more than one capture group has the same number, as de-
8173        number to be associated with more than one name. The example above pro-
8174        vokes  a  compile-time  error. However, there is still scope for confu-
8183        By  default, a name must be unique within a pattern, except that dupli-
8188        The duplicate name constraint can be disabled by setting the PCRE2_DUP-
8194        of a weekday, either as a 3-letter abbreviation or as  the  full  name,
8212        If you make a backreference to a non-unique named group from  elsewhere
8221        If you make a subroutine call to a non-unique named group, the one that
8229        true.  This is the same behaviour as testing by number. For further de-
8280        of a quantifier, the brace is taken as a literal character. In particu-
8283        Note that not every opening brace is potentially the start of a quanti-
8289        of which is represented by a two-byte sequence in a UTF-8 string. Simi-
8295        the previous item and the quantifier were not present. This may be use-
8296        ful for capture groups that are referenced as  subroutines  from  else-
8297        where  in the pattern (but see also the section entitled "Defining cap-
8302        For  convenience, the three most common quantifiers have single-charac-
8320        does  not  prevent  backtracking into any of the iterations if a subse-
8344        does  the right thing with C comments. The meaning of the various quan-
8361        that  is  greater  than 1 or with a limited maximum, more memory is re-
8362        quired for the compiled pattern, in proportion to the size of the mini-
8366        (equivalent  to  Perl's /s) is set, thus allowing the dot to match new-
8369        so there is no point in retrying the overall match at any position  af-
8373        In cases where it is known that the subject  string  contains  no  new-
8374        lines,  it  is worth setting PCRE2_DOTALL in order to obtain this opti-
8384        If  the subject is "xyz123abc123" the match point is the fourth charac-
8387        Another case where implicit anchoring is not applied is when the  lead-
8393        It matches "ab" in the subject "aab". The use of the backtracking  con-
8403        is  "tweedledee". However, if there are nested capture groups, the cor-
8416        to  be  re-evaluated to see if a different number of repeats allows the
8432        re-evaluated in this way.
8457        the number of digits they match in order to make the rest of  the  pat-
8462        group is just a single repeated item, as in the example above,  a  sim-
8463        pler  notation, called a "possessive quantifier" can be used. This con-
8474        Possessive  quantifiers are always greedy; the setting of the PCRE2_UN-
8475        GREEDY option is ignored. They are a convenient notation for  the  sim-
8481        The possessive quantifier syntax is an extension to the Perl  5.8  syn-
8490        when B must follow.  This feature can be disabled by the PCRE2_NO_AUTO-
8493        When a pattern contains an unlimited repeat inside a group that can it-
8500        matches an unlimited number of substrings that either consist  of  non-
8508        * repeat in a large number of ways, and all have to be tried. (The  ex-
8518        sequences of non-digits cannot be broken, and failure happens quickly.
8531        words, the group that is referenced need not be to the left of the ref-
8539        subsection entitled "Non-printing characters" above for further details
8540        of the handling of digits following a backslash. Other forms  of  back-
8553        An unsigned number specifies an absolute reference without the  ambigu-
8558          (abc(def)ghi)\g{-1}
8560        The sequence \g{-1} is a reference to the capture group whose number is
8563        \2, and \g{-2} would be equivalent to \1. Note that if  this  construct
8565        this example \g{-2} also refers to group 1:
8567          (A)(\g{-2}B)
8587        time  of  the backreference, the case of letters is relevant. For exam-
8618        the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref-
8621        Because  there may be many capture groups in a pattern, all digits fol-
8622        lowing a backslash are taken as part of a potential backreference  num-
8632        However, such references can be useful inside repeated groups. For  ex-
8637        matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
8638        ation of the group, the backreference matches the character string cor-
8641        the  backreference. This can be done using alternation, as in the exam-
8659        subject string, and those that look behind it, and in each case an  as-
8662        group is matched in the normal way, and if it is true, matching contin-
8663        ues  after it, but with the matching position in the subject string re-
8666        The Perl-compatible lookaround assertions are atomic. If  an  assertion
8667        is  true, but there is a subsequent matching failure, there is no back-
8668        tracking into the assertion. However, there are some cases  where  non-
8669        atomic  assertions can be useful. PCRE2 has some support for these, de-
8670        scribed in the section entitled "Non-atomic assertions" below, but they
8671        are not Perl-compatible.
8677        Assertion groups are not capture groups. If an assertion contains  cap-
8679        the capture groups in the whole pattern. Within each branch of  an  as-
8681        way. For example, a sequence such as (.)\g{-1} can  be  used  to  check
8688        retained after a successful negative assertion. When an assertion  con-
8691        For  a  positive  assertion, internally captured substrings in the suc-
8692        cessful branch are retained, and matching continues with the next  pat-
8701        Most  assertion groups may be repeated; though it makes no sense to as-
8702        sert the same thing several times, the side effect of capturing in pos-
8713        to  specify lookaround assertions. Perl 5.28 introduced some experimen-
8715        start with (* instead of (? and must be written using lower  case  let-
8723        For  example,  (*pla:foo) is the same assertion as (?=foo). In the fol-
8724        lowing sections, the various assertions are described using the  origi-
8734        matches  a word followed by a semicolon, but does not include the semi-
8750        most  convenient  way to do it is with (?!) because an empty string al-
8767        If every top-level alternative matches a fixed length, for example
8777        in  which  one  or  more top-level alternatives can match more than one
8783        to a value set by the calling program (default 255 characters).  Unlim-
8785        escape sequence \K (see above) can be used instead of a lookbehind  as-
8786        sertion  at  the  start  of a pattern to get round the length limit re-
8789        In UTF-8 and UTF-16 modes, PCRE2 does not allow the  \C  escape  (which
8792        the  lookbehind.  The \X and \R escapes, which can match different num-
8796        lookbehinds,  as  long  as  the called capture group matches a limited-
8800        PCRE2  supports backreferences in lookbehinds, but only if certain con-
8804        Of course, the referenced group must itself match a limited length sub-
8810        Possessive  quantifiers  can be used in conjunction with lookbehind as-
8817        proceeds from left to right, PCRE2 will look for each "a" in  the  sub-
8832        quantifier; it can match only the entire string. The subsequent lookbe-
8847        three characters are not "999".  This pattern does not match "foo" pre-
8849        three of which are not "999". For example, it  doesn't  match  "123abc-
8871 NON-ATOMIC ASSERTIONS
8874        is  true, but there is a subsequent matching failure, there is no back-
8875        tracking into the assertion. However, there are some cases  where  non-
8882        Consider the problem of finding the right-most word in  a  string  that
8891        and sets the "x" option, which causes white space (introduced for read-
8895        words,  when  the  assertion first succeeds, it captures the right-most
8901        succeeds,  we are done, but if the last word in the string does not oc-
8903        lookahead  (?=  or (*pla: had been used, the assertion could not be re-
8907        Using a non-atomic lookahead, however, means that when  the  last  word
8909        find the second-last word, and so on, until either the match  succeeds,
8912        Two conditions must be met for a non-atomic assertion to be useful: the
8917        using a non-atomic assertion just wastes resources.
8919        There  is one exception to backtracking into a non-atomic assertion. If
8920        an (*ACCEPT) control verb is triggered, the assertion  succeeds  atomi-
8924        Non-atomic assertions are not supported  by  the  alternative  matching
8942        matches  are not a script run. After a failure, normal backtracking oc-
8943        curs. Script runs can be used to detect spoofing attacks using  charac-
8945        "paypal.com" is an infamous example, where the letters could be a  mix-
8946        ture of Latin and Cyrillic. This pattern ensures that the matched char-
8947        acters in a sequence of non-spaces that follow white space are a script
8963          \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
8977        Unicode support. A compile-time error is given if any of the above con-
8979        matching function, pcre2_dfa_match() because they use the  same  mecha-
8994          (?(condition)yes-pattern)
8995          (?(condition)yes-pattern|no-pattern)
8997        If  the  condition is satisfied, the yes-pattern is used; otherwise the
8998        no-pattern (if present) is used. An absent no-pattern is equivalent  to
8999        an  empty string (it always matches). If there are more than two alter-
9000        natives in the group, a compile-time error occurs. Each of the two  al-
9001        ternatives may itself contain nested groups of any form, including con-
9009        There are five kinds of condition: references to capture groups, refer-
9010        ences  to  recursion,  two pseudo-conditions called DEFINE and VERSION,
9023        enclosing this condition) can be referenced by (?(-1),  the  next  most
9024        recent by (?(-2), and so on. Inside loops it can also make sense to re-
9027        is not used; it provokes a compile-time error.
9029        Consider  the  following  pattern, which contains non-significant white
9036        character is present, sets it as the first captured substring. The sec-
9040        opening  parenthesis,  the condition is true, and so the yes-pattern is
9041        executed and a closing parenthesis is required.  Otherwise,  since  no-
9043        words,  this  pattern matches a sequence of non-parentheses, optionally
9049          ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
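The optional-parenthesis condition can be tried with Python's `re` module, which accepts the absolute form (?(1)...) but not the relative (?(-1)...) form shown above. This is a sketch under that substitution, not PCRE2 itself. `OPTIONAL_PARENS` and `matches_whole` are names invented for the example.

```python
import re

# Python's re stands in for PCRE2; the absolute reference (?(1)...) is used
# because Python does not accept the relative (?(-1)...) form.
OPTIONAL_PARENS = re.compile(
    r"""
    ( \( )?        # optional opening parenthesis, captured as group 1
    [^()]+         # one or more non-parenthesis characters
    (?(1) \) )     # closing parenthesis required only if group 1 was set
    """,
    re.VERBOSE,
)


def matches_whole(text: str) -> bool:
    """True if the whole of text is matched by the conditional pattern."""
    return OPTIONAL_PARENS.fullmatch(text) is not None


for sample in ["abc", "(abc)", "(abc", "abc)"]:
    print(sample, matches_whole(sample))
```

The two unbalanced samples are rejected for different reasons. In "(abc" the condition demands a closing parenthesis that is missing. In "abc)" group 1 is unset, so the trailing parenthesis is left unmatched.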
9060        the letter R followed by digits are ambiguous (see the  following  sec-
9071        "Recursion"  in  this sense refers to any subroutine-like call from one
9072        part of the pattern to another, whether or not it  is  actually  recur-
9077        the  name R, the condition is true if matching is currently in a recur-
9087        name,  the  condition tests for its being set, as described in the sec-
9089        group  with  the  name  R1  by adding (?<R1>) to the above pattern com-
9110        be only one alternative in the rest of the conditional group. It is al-
9112        DEFINE  is that it can be used to define subroutines that can be refer-
9117          (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
9125        to  match the four dot-separated components of an IPv4 address, insist-
9130        Programs that link with a PCRE2 library can check the version by  call-
9132        that do not have access to the underlying code cannot do this.  A  spe-
9148        or  lookbehind  assertion. However, it must be a traditional atomic as-
9149        sertion, not one of the non-atomic assertions.
9151        Consider this pattern, again containing  non-significant  white  space,
9154          (?(?=[^a-z]*[a-z])
9155          \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
9157        The  condition  is  a  positive lookahead assertion that matches an op-
9158        tional sequence of non-letters followed by a letter. In other words, it
9159        tests for the presence of at least one letter in the subject. If a let-
9162        strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
9165        When an assertion that is a condition contains capture groups, any cap-
9166        turing  that  occurs  in  a matching branch is retained afterwards, for
9167        both positive and negative assertions, because matching always  contin-
9168        ues  after  the  assertion, whether it succeeds or fails. (Compare non-
9169        conditional assertions, for which captures are retained only for  posi-
9188        at the start of the pattern, as described in the section entitled "New-
9192        when  PCRE2_EXTENDED is set, and the default newline convention (a sin-
9211        For some time, Perl has provided a facility that allows regular expres-
9222        Obviously,  PCRE2  cannot  support  the interpolation of Perl code. In-
9231        group. (If not, it is a non-recursive subroutine  call,  which  is  de-
9232        scribed in the next section.) The special item (?R) or (?0) is a recur-
9241        substrings  which can either be a sequence of non-parentheses, or a re-
9244        possessive  quantifier  to  avoid  backtracking  into sequences of non-
9257        of (?1) in the pattern above you can write (?-2) to refer to the second
9266          (?|(a)|(b)) (c) (?-2)
9269        (c) is number 2. When the reference (?-2) is  encountered,  the  second
9272        the same if an absolute reference (?1) was used. In other words,  rela-
9278        are always non-recursive subroutine calls, as  described  in  the  next
9282        for this is (?&name); PCRE1's earlier syntax  (?P>name)  is  also  sup-
9290        The example pattern that we have been looking at contains nested unlim-
9292        strings of non-parentheses is important when applying  the  pattern  to
9304        callout function can be used (see below and the pcre2callout documenta-
9316        recursion.   Consider  this pattern, which matches text in angle brack-
9318        brackets  (that is, when recursing), whereas any characters are permit-
9324        different  alternatives  for the recursive and non-recursive cases. The
9334        never  re-entered,  even if it contained untried alternatives and there
9339        treated as atomic. That is, they can be re-entered to try unused alter-
9344        Supporting backtracking into recursions simplifies certain types of re-
9353        match fails. If you want to match typical palindromic phrases, the pat-
9354        tern has to ignore all non-word characters,  which  can  be  done  like
9360        such as "A man, a plan, a canal: Panama!". Note the use of the  posses-
9361        sive  quantifier  *+  to  avoid backtracking into sequences of non-word
9388        to match at the current matching position. The called group may be  de-
9389        fined  before  or  after the reference. A numbered reference can be ab-
9393          (...(relative)...)...(?-1)...
9414        Processing options such as case-independence are fixed when a group  is
9418          (abc)(?i:(?-1))
9430        For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
9432        an alternative syntax for calling a group as a subroutine, possibly re-
9442          (abc)(?i:\g<-1>)
9453        This makes it possible, amongst other things, to extract different sub-
9454        strings that match the same pair of parentheses when there is a repeti-
9457        PCRE2 provides a similar feature, but of course it  cannot  obey  arbi-
9462        passed, or if the callout entry point is set to NULL, callouts are dis-
9474        During matching, when PCRE2 reaches a callout point, the external func-
9481        time,  and  one  side-effect is that sometimes callouts are skipped. If
9483        disable  the relevant optimizations. More details, including a complete
9497        They  are all numbered 255. If there is a conditional group in the pat-
9509        A  delimited  string may be used instead of a number as a callout argu-
9511        ending delimiter is the same as the start, except for {, where the end-
9534        PCRE2_ALT_VERBNAMES option, but the result is no  longer  Perl-compati-
9540        and sequences such as \x{100} that define character code points.  Char-
9546        names is skipped, and #-comments are recognized, exactly as in the rest
9550        The  maximum  length of a name is 255 in the 8-bit library and 65535 in
9551        the 16-bit and 32-bit libraries. If the name is empty, that is, if  the
9553        the colon were not there. Any number of these verbs may occur in a pat-
9557        them  can be used only when the pattern is to be matched using the tra-
9574        course, be processed. You can suppress the start-of-match optimizations
9575        by  setting  the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com-
9592        then continues at the outer level. If (*ACCEPT) is triggered in a posi-
9596        If  (*ACCEPT)  is inside capturing parentheses, the data so far is cap-
9601        This matches "AB", "AAD", or "ACD"; when it matches "AB", "B"  is  cap-
9604        (*ACCEPT)  is  the only backtracking verb that is allowed to be quanti-
9612        is  triggered  and the match succeeds. In both cases, all but C is cap-
9613        tured. Whereas (*COMMIT) (see below) means "fail on backtrack",  a  re-
9616        Warning:  (*ACCEPT)  should  not be used within a script run group, be-
9626        are  not  present  in PCRE2. The nearest equivalent is the callout fea-
9634        (*ACCEPT:NAME)  and  (*FAIL:NAME)  behave the same as (*MARK:NAME)(*AC-
9640        There  is  one  verb whose main purpose is to track how a match was ar-
9641        rived at, though it also has a secondary use in  conjunction  with  ad-
9646        A  name is always required with this verb. For all the other backtrack-
9649        When a match succeeds, the name of the last-encountered  mark  name  on
9650        the matching path is passed back to the caller as described in the sec-
9651        tion entitled "Other information about the match" in the pcre2api docu-
9670        The (*MARK) name is tagged with "MK:" in this output, and in this exam-
9672        efficient  way of obtaining this information than putting each alterna-
9676        true,  the  name  is recorded and passed back if it is the last-encoun-
9698        The following verbs do nothing when they are encountered. Matching con-
9700        causing  a  backtrack  to the verb, a failure is forced. That is, back-
9704        group has been matched, there is never any backtracking into it.  Back-
9708        These  verbs  differ  in exactly what kind of failure occurs when back-
9710        when  the  verb is not in a subroutine or an assertion. Subsequent sec-
9716        matching failure that causes backtracking to reach it. Even if the pat-
9719        verb that is encountered, once it has been passed pcre2_match() is com-
9728        The behaviour of (*COMMIT:NAME) is not the same  as  (*MARK:NAME)(*COM-
9729        MIT).  It is like (*MARK:NAME) in that the name is remembered for pass-
9731        that are set with (*MARK), ignoring those set by any of the other back-
9739        Note that (*COMMIT) at the start of a pattern is not the same as an an-
9740        chor,  unless  PCRE2's  start-of-match optimizations are turned off, as
9762        the subject if there is a later matching failure that causes backtrack-
9768        (*PRUNE) is just an alternative to an atomic group or possessive  quan-
9782        character, but to the position in the subject where (*SKIP) was encoun-
9791        skips on to start the next attempt at "c". Note that a possessive quan-
9793        suppress  backtracking  during  the first match attempt, the second at-
9807        found,  the  "bumpalong" advance is to the subject position that corre-
9813        atomic groups or assertions, because they are never re-entered by back-
9831        backtracks, and this causes a new matching attempt to start at the sec-
9841        This  verb  causes  a skip to the next innermost alternative when back-
9844        that it can be used for a pattern-based if-then-else block:
9851        into  COND1.  If that succeeds and BAR fails, COND3 is tried. If subse-
9852        quently BAZ fails, there are no more alternatives, so there is a  back-
9853        track  to  whatever came before the entire group. If (*THEN) is not in-
9861        A group that does not contain a | character is just a part of  the  en-
9862        closing  alternative;  it is not a nested alternation with only one al-
9863        ternative. The effect of (*THEN) extends beyond such a group to the en-
9864        closing alternative.  Consider this pattern, where A, B, etc. are  com-
9877        The effect of (*THEN) is now confined to the inner group. After a fail-
9882        Note that a conditional group is not considered as having two  alterna-
9889        If the subject is "ba", this pattern does not match. Because .*? is un-
9909        that is backtracked onto first acts. For example,  consider  this  pat-
9947        name  (if  set) are retained. In a standalone negative assertion, (*AC-
9948        CEPT) causes the assertion to fail without any further processing; cap-
9956        reach them. This means that, for the Perl-compatible assertions,  their
9958        are atomic. A backtrack that occurs after such an assertion is complete
9963        PCRE2  now supports non-atomic positive assertions, as described in the
9964        section entitled "Non-atomic assertions" above. These  assertions  must
9965        be  standalone  (not used as conditions). They are not Perl-compatible.
9966        For these assertions, a later backtrack does jump back into the  asser-
9967        tion,  and  therefore verbs such as (*COMMIT) can be triggered by back-
9975        in  a  standalone  positive assertion. In a conditional positive asser-
9977        or (*PRUNE) causes the condition to be false. However, for both  stand-
9979        (*SKIP), or (*PRUNE) causes the assertion to be true, without consider-
9987        to  succeed without any further processing. Matching then continues af-
9988        ter the subroutine call. Perl documents this behaviour.  Perl's  treat-
9995        when  triggered  by being backtracked to in a group called as a subrou-
10020        Copyright (c) 1997-2024 University of Cambridge.
10024 ------------------------------------------------------------------------------
10032        PCRE2 - Perl-compatible regular expressions (revised API)
10037        Two  aspects  of performance are discussed below: memory usage and pro-
10062        is not usually a problem. However, if the numbers are large,  and  par-
10068        uses over 50KiB when compiled using the 8-bit library.  When  PCRE2  is
10070        limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
10071        libraries, and this is reached with the above pattern if the outer rep-
10077        of PCRE2's "subroutine" facility. Re-writing the above pattern as
10083        this kind of pattern is not always exactly equivalent, because any cap-
10086        process  patterns that PCRE2 cannot otherwise handle. The matching per-
10088        same.  (This applies from release 10.30 - things were different in ear-
10094        From release 10.30, the interpretive (non-JIT) version of pcre2_match()
10095        uses very little system stack at run time. In earlier  releases  recur-
10097        cause problems, but this usage has been eliminated. Backtracking  posi-
10103        On a 64-bit system the frame size for a pattern with no captures is 128
10107        the system stack, but this still caused some  issues  for  multi-thread
10112        block and re-used if that block is used for another match. It is  freed
10124        function calls, but only for processing atomic groups,  lookaround  as-
10129        has been re-factored to use heap memory  when  necessary  for  internal
10140        Certain items in regular expression patterns are processed  more  effi-
10142        [aeiou]   than   a   set   of  single-character  alternatives  such  as
10146        expressions for efficient performance. This document contains a few ob-
10150        slow,  because  PCRE2 has to use a multi-stage table lookup whenever it
10160        pcre2_match(); the performance loss is less with a DFA  matching  func-
10163        When  a pattern begins with .* not in atomic parentheses, nor in paren-
10167        multiple top-level branches, they must all be anchorable. The optimiza-
10168        tion  can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is au-
10171        If PCRE2_DOTALL is not set, PCRE2 cannot make  this  optimization,  be-
10173        subject string contains newlines, the pattern may match from the  char-
10184        If you are using such a pattern with subject strings that do  not  con-
10186        PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate  ex-
10187        plicit  anchoring.  That saves PCRE2 from having to scan along the sub-
10201        in  principle to try every possible variation, and this can take an ex-
10209        matching  procedure, PCRE2 checks that there is a "b" later in the sub-
10210        ject string, and if there is not, it fails the match immediately.  How-
10221        an atomic group or a possessive quantifier. This can often reduce  mem-
10232        matched character. For a long string, a lot of memory is required. Con-
10238        This runs much faster, because sequences of characters that do not con-
10239        tain "<" are "swallowed" in one item inside the parentheses, and a pos-
10241        non-"<" characters. This version also uses a lot  less  memory  because
10256        pcre2_match()  or pcre2_dfa_match() is called. For details of these in-
10276        Copyright (c) 1997-2022 University of Cambridge.
10280 ------------------------------------------------------------------------------
10288        PCRE2 - Perl-compatible regular expressions (revised API)
10309        This  set of functions provides a POSIX-style API for the PCRE2 regular
10310        expression 8-bit library. There are no POSIX-style wrappers for PCRE2's
10311        16-bit and 32-bit libraries. See the pcre2api documentation for  a  de-
10312        scription  of  PCRE2's native API, which contains much additional func-
10315        IMPORTANT NOTE: The functions described here are NOT  thread-safe,  and
10316        should  not  be used in multi-threaded applications. They are also lim-
10317        ited to processing subjects that are not bigger than 2GB. Use  the  na-
10326        risk of accidentally linking with POSIX functions from a different  li-
10329        On  Unix-like systems the PCRE2 POSIX library is called libpcre2-posix,
10330        so can be accessed by adding -lpcre2-posix to the command  for  linking
10332        also necessary to add -lpcre2-8.
10341        regcomp()  etc.  These simply passed their arguments to the PCRE2 func-
10354        names start with "REG_"; these are used for setting options and identi-
10360        Note that these functions are just POSIX-style wrappers for PCRE2's na-
10362        they are not thread-safe or even POSIX compatible.
10373        PCRE2-specific features via the POSIX calling interface or to  add  BSD
10377        POSIX-like in style. The syntax and semantics of  the  regular  expres-
10379        various PCRE2 options, as described below. "POSIX-like in style"  means
10381        POSIX-compatible, and in multi-unit encoding  domains  it  is  probably
10391        The function pcre2_regcomp() is called to compile a pattern into an in-
10392        ternal  form. By default, the pattern is a C string terminated by a bi-
10396        REG_PEND is set. The regex_t structure used by pcre2_regcomp()  is  de-
10398        other libraries that provide POSIX-style matching.
10418        the defined POSIX behaviour for REG_NEWLINE  (see  the  following  sec-
10424        for compilation to the native function. This disables all meta  charac-
10433        pcre2_regexec()  for  matching, the nmatch and pmatch arguments are ig-
10434        nored, and no captured strings are returned. Versions of the PCRE2  li-
10435        brary  prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile op-
10436        tion, but this no longer happens because it disables the use  of  back-
10443        the end of the pattern before calling pcre2_regcomp(). The pattern  it-
10444        self  may  now  contain binary zeros, which are treated as data charac-
10467        all  data  strings used for matching it to be treated as UTF-8 strings.
10471        function.  This means that the regex is compiled with PCRE2 default se-
10475        It  does not affect the way newlines are matched by the dot metacharac-
10478        The yield of pcre2_regcomp() is zero on success,  and  non-zero  other-
10481        number of capturing subpatterns in the regular expression. Various  er-
10484        NOTE: If the yield of pcre2_regcomp() is non-zero, you must not attempt
10506        This is the equivalent table for a POSIX-compatible pattern matcher:
10523        there is no way to make PCRE2 behave exactly as for the REG_NEWLINE ac-
10539        The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match-
10546        standard.  However, setting this option can give more POSIX-like behav-
10551        The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match-
10558        point to the first character beyond the string. There may be binary ze-
10565        relative to string + pmatch[0].rm_so, but this differs from  other  im-
10570        intended  to  be  portable to other systems. Note that a non-zero rm_so
10578        pcre2_regexec()  are  ignored  (except  possibly as input for REG_STAR-
10581        The value of nmatch may be zero, and the value pmatch may be NULL  (un-
10593        Unused entries in the array have both structure members set to -1.
10597        other  similarly  named  types from other libraries that provide POSIX-
10600        A successful match yields a zero return; various error  codes  are  de-
10601        fined  in the header file, of which REG_NOMATCH is the "expected" fail-
10607        The pcre2_regerror() function maps a  non-zero  errorcode  from  either
10611        buffer is too short, only the first errbuf_size - 1 characters  of  the
10619        Compiling a regular expression causes memory to be allocated and  asso-
10621        such memory, after which preg may no longer be used as a  compiled  ex-
10635        Copyright (c) 1997-2024 University of Cambridge.
10639 ------------------------------------------------------------------------------
10647        PCRE2 - Perl-compatible regular expressions (revised API)
10652        A  simple, complete demonstration program to get you started with using
10656        can save this listing to re-create the contents of pcre2demo.c.
10661        used. If matching succeeds, the program outputs the portion of the sub-
10662        ject  that  matched,  together  with  the contents of any captured sub-
10665        If the -g option is given on the command line, the program then goes on
10667        subject string. The logic is a little bit tricky because of the  possi-
10671        The code in pcre2demo.c is an 8-bit program that uses the  PCRE2  8-bit
10672        library.  It  handles  strings  and characters that are stored in 8-bit
10675        treated as UTF-8 strings, where characters  may  occupy  multiple  code
10679        for your operating system, you should be able to compile the demonstra-
10682          cc -o pcre2demo pcre2demo.c -lpcre2-8
10685        to the command line. For example, on a Unix-like system that has  PCRE2
10686        installed  in /usr/local, you can compile the demonstration program us-
10689          cc -o pcre2demo -I/usr/local/include pcre2demo.c \
10690             -L/usr/local/lib -lpcre2-8
10696          ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
10699        pcre2test,  which supports many more facilities for testing regular ex-
10700        pressions using all three PCRE2 libraries (8-bit, 16-bit,  and  32-bit,
10701        though  not all three need be installed). The pcre2demo program is pro-
10708          ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file
10711        This  is  caused  by the way shared library support works on those sys-
10714          -R/usr/local/lib
10729        Copyright (c) 1997-2016 University of Cambridge.
10733 ------------------------------------------------------------------------------
10739        PCRE2 - Perl-compatible regular expressions (revised API)
10742 SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS
10759        run. However, if you are using the just-in-time  optimization  feature,
10760        it is not possible to save and reload the JIT data, because it is posi-
10761        tion-dependent.  The  host  on  which the patterns are reloaded must be
10764        For example, patterns compiled on a 32-bit system using PCRE2's  16-bit
10765        library cannot be reloaded on a 64-bit system, nor can they be reloaded
10766        using the 8-bit library.
10770        output  is really just a bytecode dump, which is why it can only be re-
10773        linked with a fixed version of PCRE2 must be prepared to recompile pat-
10783        checking, not complete validation of what is being re-loaded. Corrupted
10795        in the byte stream (its size is 1088 bytes). For more details of  char-
10796        acter  tables,  see the section on locale support in the pcre2api docu-
10802        the length of the vector. The third and fourth arguments point to vari-
10816        PCRE2_ERROR_BADMAGIC  means  either that a pattern's code has been cor-
10817        rupted, or that a slot in the vector does not point to a compiled  pat-
10840        the  256  possible  byte values. On systems that make a distinction be-
10841        tween binary and non-binary data, be sure that the file is  opened  for
10846        freed in the usual way by calling pcre2_code_free(). When you have fin-
10847        ished with the byte stream, it too must be freed by calling pcre2_seri-
10848        alize_free().  If  this function is called with a NULL argument, it re-
10852 RE-USING PRECOMPILED PATTERNS
10854        In order to re-use a set of saved patterns you must first make the  se-
10856        from a file). The management of this memory block is up to the applica-
10867        and its length, and the third argument points to a byte stream. The fi-
10870        this argument is NULL, malloc() and free() are used. After deserializa-
10879        stream, it is filled with those that fit, and  the  remainder  are  ig-
10895        potential race issue if you are using multiple patterns that  were  de-
10896        coded  from a single byte stream in a multithreaded application. A sin-
10898        and a reference count is used to arrange for its memory to be automati-
10905        If  a pattern was processed by pcre2_jit_compile() before being serial-
10921        Copyright (c) 1997-2018 University of Cambridge.
10925 ------------------------------------------------------------------------------
10933        PCRE2 - Perl-compatible regular expressions (revised API)
10938        The  full syntax and semantics of the regular expressions that are sup-
10940        document contains a quick-reference summary of the syntax.
10945          \x         where x is non-alphanumeric is a literal x
10958        after the comma. The exception is \u{...} which is not  Perl-compatible
10959        and is recognized only when PCRE2_EXTRA_ALT_BSUX is set. This is an EC-
10969          \cx        "control-x", where x is a non-control ASCII character
10990        read,  but in ALT_BSUX mode \x must be followed by two hexadecimal dig-
10996        Note that \0dd is always an octal code. The treatment of backslash fol-
10997        lowed  by  a non-zero digit is complicated; for details see the section
10998        "Non-printing characters" in the pcre2pattern documentation, where  de-
11023          \W         a "non-word" character
11027        middle of a UTF-8 or UTF-16 character. The application can lock out the
11031        By  default,  \d, \s, and \w match only ASCII characters, even in UTF-8
11032        mode or in the 16-bit and 32-bit libraries. However, if locale-specific
11034        points in the range 128-255. If the PCRE2_UCP option is set, the behav-
11037        that can restrict individual sequences to matching only  ASCII  charac-
11040        Property descriptions in \p and \P are matched caselessly; hyphens, un-
11066          Mn         Non-spacing mark
11099          Xuc        Universally-named character: one that can be
11103        Perl and POSIX space are now the same. Perl added VT to its space char-
11114          pcre2test -LP
11119        Many  script  names  and their 4-letter abbreviations are recognized in
11121        of course). You can obtain a list of these scripts by running this com-
11124          pcre2test -LS
11143          L           left-to-right
11144          LRE         left-to-right embedding
11145          LRI         left-to-right isolate
11146          LRO         left-to-right override
11147          NSM         non-spacing mark
11151          R           right-to-left
11152          RLE         right-to-left embedding
11153          RLI         right-to-left isolate
11154          RLO         right-to-left override
11163          [x-y]       range (can be used for hex characters)
11169          ascii       0-127
11231        From  release 10.38 \K is not permitted by default in lookaround asser-
11232        tions, for compatibility with Perl.  However,  if  the  PCRE2_EXTRA_AL-
11233        LOW_LOOKAROUND_BSK option is set, the previous behaviour is re-enabled.
11234        When this option is set, \K is honoured in positive assertions, but ig-
11249          (?:...)         non-capture group
11250          (?|...)         non-capture group; reset group numbers for
11253        In  non-UTF  modes, names may contain underscores and ASCII letters and
11260          (?>...)         atomic non-capture group
11261          (*atomic:...)   atomic non-capture group
11283          (?r)            restrict caseless to either ASCII or non-ASCII
11288          (?-...)         unset the given option(s)
11291        (?aP) implies (?aT) as well, though this has no additional effect. How-
11292        ever, it means that (?-aP) is really (?-PT) which  disables  all  ASCII
11296        a mixture of setting and unsetting such as (?i-x) is allowed, but there
11298        for example (?^in). An option setting may appear at the start of a non-
11301        The following are recognized only at the very start of a pattern or af-
11310          (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
11313          (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
11327        These are recognized only at the very start of the pattern or after op-
11340        These are recognized only at the very start of the pattern or after op-
11365        Each top-level branch of a lookbehind must have a limit for the  number
11373 NON-ATOMIC LOOKAROUND ASSERTIONS
11375        These assertions are specific to PCRE2 and are not Perl-compatible.
11401          \g-n            relative reference by number
11403          \g{-n}          relative reference by number
11416          (?-n)           call subroutine by relative number
11425          \g<-n>          call subroutine by relative number (PCRE2 extension)
11426          \g'-n'          call subroutine by relative number (PCRE2 extension)
11431          (?(condition)yes-pattern)
11432          (?(condition)yes-pattern|no-pattern)
11436          (?(-n)              relative reference condition (PCRE2 extension)
11464        The following act only when a subsequent match failure causes  a  back-
11466        what happens afterwards. Those that advance the start-of-match point do
11508        Copyright (c) 1997-2023 University of Cambridge.
11512 ------------------------------------------------------------------------------
11520        PCRE2 - Perl-compatible regular expressions (revised API)
11528        properties and can process strings of text in UTF-8, UTF-16, and UTF-32
11533        There  are two ways of telling PCRE2 to switch to UTF mode, where char-
11546        one-code-unit  characters. There are also some other changes to the way
11553        \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set-
11556        properties such as Lu for an upper case letter or Nd for a decimal num-
11561        The full lists are given in the pcre2pattern and pcre2syntax documenta-
11565        prefixed  by "Is", for compatibility with Perl 5.6. PCRE2 does not sup-
11578        allowed in non-UTF mode.
11580        In  UTF  mode, repeat quantifiers apply to complete UTF characters, not
11591        multi-unit characters (see the description of \C  in  the  pcre2pattern
11592        documentation). For this reason, there is a build-time option that dis-
11593        ables  support  for  \C completely. There is also a less draconian com-
11594        pile-time option for locking out the use of \C when a pattern  is  com-
11598        pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
11600        modes  provokes a match-time error. Also, the JIT optimization does not
11601        support \C in these modes. If JIT optimization is requested for a UTF-8
11602        or UTF-16 pattern that contains \C, it will not succeed,  and  so  when
11603        pcre2_match() is called, the matching will be carried out by the inter-
11609        set  as  in  non-UTF mode, all with code points less than 256. This re-
11614        you  can  use  explicit Unicode property tests such as \p{Nd}. Alterna-
11615        tively, if you set the PCRE2_UCP option, the way that the character es-
11622        classes are all low-valued characters unless the  PCRE2_UCP  option  is
11631 UNICODE CASE-EQUIVALENCE
11635        are less than 128 and that have at most two case-equivalent values. For
11636        these, a direct table lookup is used for speed. A few  Unicode  charac-
11637        ters  such as Greek sigma have more than two code points that are case-
11639        PCRE2_UTF  allows  Unicode-style  case processing for non-UTF character
11640        encodings such as UCS-2.
11643        ASCII  lower case equivalents, have a non-ASCII one as well (long S and
11644        Kelvin sign).  Recognition of these non-ASCII characters as case-equiv-
11647        in a case equivalence must either be ASCII or non-ASCII; there  can  be
11656        sequence of characters that are all from the same Unicode script.  How-
11661        Every Unicode character has a Script property, mostly with a value cor-
11666        for  the surrogate code points. In the PCRE2 32-bit library, characters
11668        which  are  accessible  only  in non-UTF mode, are assigned the Unknown
11672        include  punctuation,  emoji,  mathematical, musical, and currency sym-
11675        "Inherited" is used for characters such as diacritical marks that  mod-
11681        U+102E0 (Coptic Epact thousands mark) is used only with Arabic and Cop-
11683        called Script Extension exists. Its value is a list of scripts that ap-
11687        also some Common characters that have a single,  non-Common  script  in
11693        constraint  for  decimal  digits.  These are covered in subsequent sec-
11700        run.  Longer strings are checked using only the Script Extensions prop-
11703        If a character's Script Extension property is the single value  "Inher-
11707        at least one script in common in their Script Extension lists. In  set-
11723        The first has the Script Extension list Arabic, Hanifi  Rohingya,  Syr-
11725        of them could appear in script runs of  either  Arabic  or  Hanifi  Ro-
11733        Katakana scripts together with Han; Korean uses Hangul  and  Han;  Tai-
11736        "virtual  scripts".  Thus,  a script run may contain a mixture of Hira-
11739        Bopomofo and Han. PCRE2 (like Perl) follows Unicode's  Technical  Stan-
11740        dard   39   ("Unicode   Security   Mechanisms",  http://unicode.org/re-
11748        from  the  common  ASCII digits. In addition to the script checking de-
11758        returned. The code unit offset to the offending character  can  be  ex-
11763        and  therefore  want  to  skip these checks in order to improve perfor-
11765        scanned  repeatedly.   If you set the PCRE2_NO_UTF_CHECK option at com-
11782        UTF-16 and UTF-32 strings can indicate their endianness by special code
11783        known as a byte-order mark (BOM). The PCRE2  functions  do  not  handle
11788        pcre2_dfa_match()  calls  with a non-zero starting offset, the check is
11796        that the sequences \b and \B are one-character lookbehinds.
11800        the  surrogate  area. The so-called "non-character" code points are not
11805        UTF-16,  where they are used in pairs to encode code points with values
11806        greater than 0xFFFF. The code points that are encoded by  UTF-16  pairs
11807        are  available  independently  in  the  UTF-8 and UTF-32 encodings. (In
11808        other words, the whole surrogate thing is a fudge for UTF-16 which  un-
11809        fortunately messes up UTF-8 and UTF-32.)
11814        such as \x{d800} (a surrogate code point) you  can  set  the  PCRE2_EX-
11816        only in UTF-8 and UTF-32 modes, because these  values  are  not  repre-
11817        sentable in UTF-16.
11819    Errors in UTF-8 strings
11821        The following negative error codes are given for invalid UTF-8 strings:
11829        The  string  ends  with a truncated UTF-8 character; the code specifies
11830        how many bytes are missing (1 to 5). Although RFC 3629 restricts  UTF-8
11831        characters  to  be  no longer than 4 bytes, the encoding scheme (origi-
11853        A 4-byte character has a value greater than 0x10ffff; these code points
11858        A  3-byte  character  has  a  value in the range 0xd800 to 0xdfff; this
11859        range of code points is reserved by RFC 3629 for use with UTF-16,  and
11860        so are excluded from UTF-8.
11868        A  2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
11870        For  example,  the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
11876        binary value 0b10 (that is, the most significant bit is 1 and the  sec-
11877        ond  is  0). Such a byte can only validly occur as the second or subse-
11878        quent byte of a multi-byte character.
11883        can never occur in a valid UTF-8 string.
11885    Errors in UTF-16 strings
11887        The  following  negative  error  codes  are  given  for  invalid UTF-16
11895    Errors in UTF-32 strings
11897        The following  negative  error  codes  are  given  for  invalid  UTF-32
11907        UTF  sequences  if  you  call  pcre2_compile() with the PCRE2_MATCH_IN-
11914        and  you  are  not  certain that your subject strings are valid UTF se-
11917        for UTF validity. An invalid string may cause undefined behaviour,  in-
11922        generate different code. If JIT is not used, the option affects the be-
11923        haviour of the interpretive code in pcre2_match(). When PCRE2_MATCH_IN-
11929        \p{Any}, it does not even match negative items such as [^X]. A  lookbe-
11945        UTF-sequence,  that  sequence  is  skipped, and the match starts at the
11953        Using PCRE2_MATCH_INVALID_UTF, an application can run matches on  arbi-
11958        Note, however, that the  16-bit  and  32-bit  PCRE2  libraries  process
11974        Copyright (c) 1997-2023 University of Cambridge.
11978 ------------------------------------------------------------------------------