pcre2.txt - OpenGrok cross reference for /aosp_15

Lines Matching full:are
6 the pcre2demo program. There are separate text files for the pcre2grep and
27        rate  "study" optimizing function; in PCRE2, patterns are automatically
34        are available using the Python syntax. There is also some  support  for
35        one  or  two .NET and Oniguruma syntax items, and there are options for
44        PCRE2 that are compiled in both 64-bit and 32-bit modes may be needed.
70        Details  of  exactly which Perl regular expression features are and are
71        not supported by  PCRE2  are  given  in  separate  documents.  See  the
77        client to discover which features are  available.  The  features  them-
78        selves are described in the pcre2build page. Documentation about build-
83        data  tables  that  are  used by more than one of the exported external
84        functions, but which are not intended  for  use  by  external  callers.
87        external symbols are exported when a shared library is  built,  and  in
88        these cases the undocumented symbols are not exported.
93        If  you  are using PCRE2 in a non-UTF application that permits users to
128        Nested unlimited repeats in a pattern are a common example. PCRE2  pro-
141        pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
144        tions,  are  concatenated in pcre2.txt, for ease of searching. The sec-
145        tions are as follows:
448        These functions became obsolete at release 10.30 and are retained  only
479        and  POSIX  basic  and  extended patterns can be converted. Details are
485        There are three PCRE2 libraries, supporting 8-bit, 16-bit,  and  32-bit
489        taneously.  On  Unix-like  systems the libraries are called libpcre2-8,
498        There are also three different sets of data types:
505        types are pointers to constants of the equivalent UCHAR types, that is,
506        they are pointers to vectors of unsigned code units.
508        Character  strings  are  passed  to a PCRE2 library as sequences of un-
514        macros are defined whose names are the generic forms such as pcre2_com-
539        other PCRE2 documents, functions and data  types  are  described  using
546        There are also some wrapper functions for the 8-bit library that corre-
548        to  all  the  functionality of PCRE2 and they are not thread-safe. They
549        are described in the pcre2posix documentation. Both these APIs define a
553        error codes are defined in the header file pcre2.h, which also contains
562        The  functions pcre2_compile() and pcre2_match() are used for compiling
569        The compiling and matching functions recognize various options that are
570        passed as bits in an options argument. There are also some more compli-
572        source  limits  that  are  passed  in "contexts" (which are just memory
591        less  sanity  checking. The JIT-specific functions are discussed in the
598        there are lookaround assertions). However, this algorithm does not  re-
603        In  addition  to  the  main compiling and matching functions, there are
605        string that has been matched by pcre2_match(). They are:
617        pcre2_substring_free() and pcre2_substring_list_free()  are  also  pro-
626        Functions  whose  names begin with pcre2_serialize_ are used for saving
629        Finally, there are functions for finding out information about  a  com-
633        Functions with names ending with _free() are used  for  freeing  memory
641        units  in  several  places. These values are always of type PCRE2_SIZE,
646        handled is one less than this maximum. Note that string lengths are al-
657        are  the  three just mentioned, plus the single characters VT (vertical
692        There are several different blocks of data that are used to pass infor-
707        In  a more complicated situation, where patterns are compiled only when
708        they are first needed, but are still shared between  threads,  pointers
767        PCRE2 functions are called. A context is nothing more than a collection
771        are stored in contexts are in some sense  "advanced  features"  of  the
774        In a multithreaded application, if the parameters in a context are val-
775        ues  that  are  never  changed, the same context can be used by all the
789        Some PCRE2 functions have a lot of parameters, many of which  are  used
793        API extensible, "uncommon" parameters are passed to  certain  functions
799        There  are  three different types of context: a general context that is
806        ternal memory management functions that are called from several  places
818        whose prototypes are:
826        tions  malloc()  and free() are used. (This is not currently useful, as
827        there are no other fields in a general context,  but  in  future  there
829        tain memory for storing the context, and all three values are saved  as
862        A  compile context is also required if you are using custom memory man-
900        As  PCRE2  has developed, almost all the 32 option bits that are avail-
903        bits which are used for some newer, assumed rarer, options. This  func-
905        It does not modify any existing setting. The available options are  de-
935        hind assertions without a bounding length are not supported.
940        This specifies which characters or character sequences are to be recog-
1017        during a matching operation. Details are given in the pcre2callout doc-
1025        tion made by pcre2_substitute(). Details are given in the section enti-
1082        pcre2_match()  uses  the  heap are given in the pcre2perform documenta-
1094        ing  up  too many computing resources when processing patterns that are
1103        take place. For patterns that are not anchored, the count restarts from
1144        version 10.32, only local variables are allocated on the stack  and  as
1149        cal workspace vectors are allocated on the heap from version 10.32  on-
1213        ther details are given with pcre2_set_depth_limit() above.
1219        pcre2_dfa_match().     Further     details     are      given      with
1260        pcre2_match().  Further  details are given with pcre2_set_match_limit()
1267        are:
1373        of  these  character  tables.) In many applications the same tables are
1375        are occasions when a copy of a compiled pattern and the relevant tables
1376        are  needed.  The pcre2_code_copy_with_tables() provides this facility.
1377        Copies of both the code and the tables are  made,  with  the  new  code
1384        compiled pattern and the subject string are set in the match data block
1394        that affect the compilation. It should be zero if none of them are  re-
1395        quired.  The  available  options  are described below. Some of them (in
1396        particular, those that are compatible with Perl,  but  some  others  as
1411        diately.  Otherwise,  the  variables to which these point are set to an
1416        There are nearly 100 positive error codes that pcre2_compile() may  re-
1417        turn  if it finds an error in the pattern. There are also some negative
1418        error codes that are used for invalid UTF strings when validity  check-
1419        ing  is  in  force.  These  are  the same as given by pcre2_match() and
1420        pcre2_dfa_match(), and are described in the pcre2unicode documentation.
1422        cause  the  textual  error  messages  that  are obtained by calling the
1425        PCRE2_ERROR_  are defined for both positive and negative error codes in
1438        Some errors are not detected until the whole pattern has been  scanned;
1461        The following names for option bits are defined in the  pcre2.h  header
1525        whitespace  in verb names is skipped and #-comments are recognized, ex-
1540        PCRE2_UTF or PCRE2_UCP is set, Unicode  properties  are  used  for  all
1542        code points are greater than U+007F. Note  that  there  are  two  ASCII
1544        alents,  are case-equivalent with U+212A (Kelvin sign) and U+017F (long
1551        (available only in 16-bit or 32-bit mode) are treated as not having an-
1567        ever matches one character, even if newlines are coded as CRLF. Without
1580        There are more details of named capture  groups  below;  see  also  the
1604        matches,  which are necessarily substrings of the first one, must obvi-
1609        If this bit is set, most white space characters in the pattern are  to-
1621        256 that are flagged as white space in its low-character table. The ta-
1624        relevant  characters  are  those  with code points 0x0009 (tab), 0x000A
1629        acters,  five  more Unicode "Pattern White Space" characters are recog-
1630        nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-
1634        space  characters that are matched by the \h and \v escapes in patterns
1635        are a much bigger set.
1644        Which characters are interpreted as newlines can be specified by a set-
1653        escaped space and horizontal tab characters are ignored inside a  char-
1654        acter  class. Note: only these two characters are ignored, not the full
1655        set of pattern white space characters that are ignored outside a  char-
1676        If this option is set, all meta-characters in the pattern are disabled,
1679        you are doing a lot of literal matching and  are  worried  about  effi-
1681        options  that  are  allowed  with  PCRE2_LITERAL  are:  PCRE2_ANCHORED,
1685        TRA_MATCH_LINE and PCRE2_EXTRA_MATCH_WORD are also supported. Any other
1695        less such sequences are suitably aligned. This  facility  is  not  sup-
1727        setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in  a
1775        are  in  use,  auto-possessification means that some callouts are never
1800        There  are  a  number of optimizations that may occur at the start of a
1808        items  are  in use, these "start-up" optimizations can cause them to be
1810        tions  are  in effect a pre-scan of the subject that takes place before
1816        such as (*COMMIT) and (*MARK) are considered at every possible starting
1844        found, but there is only one character left, so there are no  more  at-
1846        "2". If NO_START_OPTIMIZE is set, however, matches are tried  at  every
1856        automatically  checked.  There  are  discussions  about the validity of
1874        called  "surrogate"  code points (0xd800 to 0xdfff) are invalid. If you
1878        sible only in UTF-8 and UTF-32 modes, because these values are not rep-
1885        classes. By default, only  ASCII  characters  are  recognized,  but  if
1886        PCRE2_UCP  is  set, Unicode properties are used to classify characters.
1887        There are some PCRE2_EXTRA options (see below) that add  finer  control
1888        to  this  behaviour.  More  details are given in the section on generic
1903        are not greedy by default, but become greedy if followed by "?". It  is
1919        strings  that  are  subsequently processed as strings of UTF characters
1923        tails of how PCRE2_UTF changes the behaviour of PCRE2 are given in  the
1930        pcre2_set_compile_extra_options() function are as follows:
1943        "surrogate" code points in the range 0xd800 to 0xdfff are used in pairs
1946        They can be represented in UTF-8 and UTF-32, but are defined as invalid
1959        errors  and are incorporated in the compiled pattern. However, they can
2010        "j",  and non-hexadecimal digits in \x{} are just ignored, though warn-
2011        ings are given in both cases if Perl's warning switch is enabled.  How-
2016        pcre2_compile(),  all  unrecognized  or  malformed escape sequences are
2030        are two case-equivalent character sets that contain both ASCII and non-
2040        There are some legacy applications where the escape sequence  \r  in  a
2088        interpretive  matching function. Full details are given in the pcre2jit
2105        PCRE2 handles caseless matching, and determines whether characters  are
2108        code points are less than 256. By default,  higher-valued  code  points
2117        effects  apply even when PCRE2_UTF is not set. There are, however, some
2121        The  use  of  locales  with Unicode is discouraged. If you are handling
2125        PCRE2  contains a built-in set of character tables that are used by de-
2126        fault.  These are sufficient for many applications. Normally,  the  in-
2137        External  tables  are built by calling the pcre2_maketables() function,
2145        For example, to build and use  tables  that  are  appropriate  for  the
2147        are treated as letters), the following code could be used:
2156        if you are using Windows, the name for the French locale is "french".
2159        is saved with the compiled pattern, and the same tables are used by the
2165        the tables remains available while they are still in use. When they are
2172        The tables described above are just a sequence of binary  bytes,  which
2217        The possible values for the second argument are defined in pcre2.h, and
2218        are as follows:
2250        all the following are true:
2259        For  patterns  that are auto-anchored, the PCRE2_ANCHORED bit is set in
2270        a backreference.  Zero is returned if there are no backreferences.
2301        code unit values greater than 255 are supported, the flag bit  for  255
2329        Return the size (in bytes) of the data frames that are used to remember
2445        ses.  The names are just an additional way of identifying the parenthe-
2447        pcre2_substring_get_byname() are provided for extracting captured  sub-
2461        brary,  the first two bytes of each entry are the number of the captur-
2468        The names are in alphabetical order. If (?| is used to create  multiple
2472        names for groups of the same number are not permitted.
2474        Duplicate names for capture groups with different numbers  are  permit-
2488        There are four named capture groups, so the table has four entries, and
2524        cause there are cases where the code that calculates the  size  has  to
2544        meration block are described in the pcre2callout  documentation,  which
2552        the patterns are reloaded must be running the same  version  of  PCRE2,
2557        gin with pcre2_serialize_ are used for converting to and from the seri-
2558        alized  form.  They  are described in the pcre2serialize documentation.
2592        should be made large enough to hold as many as are expected.
2603        the memory for the match data block. If you are not using custom memory
2617        after a match operation has finished,  using  functions  that  are  de-
2626        pattern  and the subject string are set in the match data block so that
2705        common matching parameters are to be changed. For details, see the sec-
2712        and offset are in code units, not characters.  That  is,  they  are  in
2725        sets are valid). Like the pattern string, the subject may  contain  bi-
2768        The    only    bits    that    may    be    set   are   PCRE2_ANCHORED,
2778        PCRE2_NO_JIT  (obviously),  the remaining options are supported for JIT
2794        must not be freed until all such operations are complete. For some  ap-
2815        if the limits are large. There is therefore a check  at  the  start  of
2821        There are rare cases of matches that would complete,  but  nevertheless
2851        set.  If  there are alternatives in the pattern, they are tried. If all
2889        there  are no lookbehind assertions in the pattern, the check starts at
2892        if  there are not that many characters before the starting offset. Note
2893        that the sequences \b and \B are one-character lookbehinds.
2896        negative error code is returned if the check fails. There  are  several
2898        problems with the code unit sequence. There are discussions  about  the
2905        subsequent  calls  to pcre2_match() if you are making repeated calls to
2919        there are not enough subject characters to complete the match. In addi-
2993        groups there are in a compiled pattern.
3008        ues  are always code unit offsets, not character offsets. That is, they
3009        are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit li-
3013        first  pair  of  offsets  (that is, ovector[0] and ovector[1]) are set.
3022        been captured, the returned value is 3. If there are no  captured  sub-
3029        "ab", the start and end offset values for the match are 2 and 0.
3037        zero.  If captured substrings are not of interest, pcre2_match() may be
3044        the function is 4, and groups 1 and 3 are matched, but 2 is  not.  When
3046        groups are set to PCRE2_UNSET.
3049        pression  are also set to PCRE2_UNSET. For example, if the string "abc"
3050        is matched against the pattern (abc)(x(yz)?)? groups 2 and  3  are  not
3053        groups (assuming the vector is large enough,  of  course)  are  set  to
3057        in the pattern are never changed. That is, if a pattern contains n cap-
3058        turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
3061        are unchanged.
3072        functions  in  appropriate  circumstances.  If they are called at other
3100        Warning:  By  default, certain start-of-match optimizations are used to
3119        the code unit offset of the invalid UTF character. Details are given in
3128        codes  are  also  returned  by other functions, and are documented with
3129        them. The codes are given names in the header file. If UTF checking  is
3131        of  UTF-specific negative error codes is returned. Details are given in
3132        the pcre2unicode page. The following are the other errors that  may  be
3222        might do this are detected and faulted at compile time, but  more  com-
3245        PCRE2_ERROR_NOMEMORY  is returned.  None of the messages are very long;
3265        described above.  For convenience, auxiliary functions are provided for
3281        "ab", the start and end offset values for the match are  2  and  0.  In
3296        ments  of  these  functions are a pointer to the match data block and a
3299        The final arguments of pcre2_substring_copy_bynumber() are a pointer to
3305        to variables that are updated with a pointer to the new memory and  the
3314        error codes are:
3405        For  convenience,  there are also "byname" functions that correspond to
3408        there are duplicate names, these functions scan all the groups with the
3412        If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING  is
3413        returned.  If  all  groups  with the name have numbers that are greater
3421        guish the different capture groups, because names are not  included  in
3456        match  to  end  before it starts are not supported, and give rise to an
3459        in the previous iteration are also not supported.
3461        The  first  seven  arguments  of pcre2_substitute() are the same as for
3462        pcre2_match(), except that the partial matching options are not permit-
3470        afterwards are the result of the final call. For global  changes,  this
3485        The contents of the  externally  supplied  match  data  block  are  not
3500        STITUTE_REPLACEMENT_ONLY  is  set,  only the replacement substrings are
3501        returned. In the global case, multiple replacements are concatenated in
3544        eral). The following forms are always recognized:
3551        brackets  are  required only if the following character would be inter-
3589        two  characters are CR, LF. In this case, the offset is advanced by two
3606        special, and only the group insertion forms  listed  above  are  valid.
3615        There are also four escape sequences for forcing the case  of  inserted
3625        was  set when the pattern was compiled, Unicode properties are used for
3626        case forcing characters whose code points are greater than 127.
3643        specifies  strings that are expanded and inserted when group <n> is set
3665        PCRE2_SUBSTITUTE_UNSET_EMPTY, and PCRE2_SUBSTITUTE_EXTENDED are irrele-
3666        vant and are ignored.
3672        from pcre2_match() are passed straight back.
3689        are NULL. For backward compatibility reasons an exception is  made  for
3732        more  fields are added, but the intention is never to remove any of the
3737        ers are copies of the values passed to pcre2_substitute().
3741        that are set in the ovector, and is always greater than zero.
3767        capture groups are not required to be unique. Duplicate names  are  al-
3769        feature. Indeed, if such groups are named, they are required to use the
3772        Normally, patterns that use duplicate names are such that  in  any  one
3776        When  duplicates   are   present,   pcre2_substring_copy_byname()   and
3778        to the given name that is set. Only if none are set is  PCRE2_ERROR_UN-
3780        turns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are  duplicate
3786        the third and fourth arguments are NULL, the function returns  a  group
3789        When the third and fourth arguments are not NULL, they must be pointers
3790        to  variables  that are updated by the function. After it has run, they
3793        units.  In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
3832        ble  with  Perl.  Some  of  the features of PCRE2 patterns are not sup-
3833        ported. Nevertheless, there are times when this kind of matching can be
3838        The  arguments  for  the pcre2_dfa_match() function are the same as for
3841        mon arguments are used in the same way as for pcre2_match(),  so  their
3847        space is needed for patterns and subjects where there are a lot of  po-
3868        zero.  The  only   bits   that   may   be   set   are   PCRE2_ANCHORED,
3872        PCRE2_DFA_RESTART. All but the last four of these are exactly the  same
3879        the details are slightly different. When PCRE2_PARTIAL_HARD is set  for
3914        matches  are all initial substrings of the longer matches. For example,
3923        the three matched strings are
3931        strings are returned in the ovector, and can be extracted by number  in
3941        The  matched  strings  are  stored  in  the ovector in reverse order of
3957        Many  of  the  errors  are  the same as for pcre2_match(), as described
3958        above.  There are in addition the following errors that are specific to
3971        a specific capture group. These are not supported.
3995        some plausibility checks are made on the  contents  of  the  workspace,
4035        Autotools. Also in the distribution are files to support building using
4043        this file as well as the README file if you are building in a non-Unix-
4051        configure  script,  where  the  optional features are selected or dese-
4054        non-Unix-like environments if you are using CMake instead of  configure
4057        If  you  are not using Autotools or CMake, option selection can be done
4082        strings  that  are contained in arrays of 16-bit and 32-bit code units,
4096        an  8-bit  program.  Neither  of these are built if you select only the
4110        libraries  are  built  as  static libraries. The binaries that are then
4112        pcre2grep)  are linked statically with one or more PCRE2 libraries, but
4120        versions of all the relevant libraries are available for linking.
4144        and Nd, script names, and some bi-directional properties are supported.
4145        Details are given in the pcre2pattern documentation.
4181        the  end  of a configure run. If you are enabling JIT under SELinux you
4208        Alternatively, you can specify that line endings are to be indicated by
4224        newline sequences are the three just mentioned, plus the single charac-
4254        Within  a  compiled  pattern,  offset values are used to point from one
4257        two-byte values are used for these offsets, leading to a  maximum  size
4290        points. The more nested backtracking points there  are  (that  is,  the
4321        depth  of recursive function calls in pcre2_dfa_match(). These are used
4329        able number of characters are supported only  if  there  is  a  maximum
4338        number of characters (not necessarily all the same) are not constrained
4344        PCRE2 uses fixed tables for processing characters whose code points are
4345        less than 256. By default, PCRE2 is built with a set of tables that are
4346        distributed in the file src/pcre2_chartables.c.dist. These  tables  are
4351        to  the  configure  command, the distributed tables are no longer used.
4355        work if you are cross compiling, because pcre2_dftables needs to be run
4376        The tables are just a string of bytes, independent of hardware  charac-
4392        bles.  You should only use it if you know that you are in an EBCDIC en-
4397        ebcdic are mutually exclusive.
4418        within the patterns it is matching. There are two kinds: one that  gen-
4422        --disable-pcre2grep-callout  is  used,  all callouts are completely ig-
4437        evant  libraries  are installed on your system. Configuration will fail
4438        if they are not.
4535        When --enable-coverage is used,  the  following  addition  targets  are
4600        put()  whose  arguments are a pointer to a string and the length of the
4603        options and with some random options bits that are generated  from  the
4610        strings are specified by arguments: if an argument starts with "="  the
4612        file name, and the contents of the file are the test string.
4721        This applies only to assertion conditions (because they are  themselves
4726        automatic callouts.  When any callouts are  present,  the  output  from
4728        information when you are trying to optimize the performance of  a  par-
4777        all branches are anchorable.
4834        that callouts such as the example above are obeyed.
4867        version  1, and the callout_flags field for version 2. If you are writ-
4870        The version number will increase in future if more  fields  are  added,
4900        The  remaining  fields in the callout block are the same for both kinds
4926        The  values in ovector[0] and ovector[1] are always PCRE2_UNSET because
4928        captured  but whose numbers are less than capture_top also have both of
4967        The  pattern_position  and next_item_length fields are intended to help
4969        the  same  callout  number. However, they are set for all callouts, and
4970        are used by pcre2test to show the next item to be matched when display-
4995        Both  bits  are  set when a backtrack has caused a "bumpalong" to a new
5049        The  version  number is currently 0. It will increase if new fields are
5050        ever added to the block. The remaining fields are  the  same  as  their
5096        here  are  with  respect  to  Perl version 5.38.0, but as both Perl and
5097        PCRE2 are continually changing, the information may at times be out  of
5109        it does have are given in the pcre2unicode page.
5113        does not assert that the next three characters are not "a". It just as-
5124        5.  Capture groups that occur inside negative lookaround assertions are
5125        counted, but their entries in the offsets vector are set  only  when  a
5130        6.  The  following Perl escape sequences are not supported: \F, \l, \L,
5133        point,  are  supported.  The  escapes that modify the case of following
5134        letters are implemented by Perl's general string-handling and  are  not
5135        part of its pattern matching engine. If any of these are encountered by
5137        PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U  and  \u  are
5140        7. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
5142        tested  with  \p  and \P are limited to the general category properties
5148        ports (such as \p{Letter}) are not supported by PCRE2, nor is  it  per-
5152        in between are treated as literals. However, this is slightly different
5153        from  Perl  in  that  $  and  @ are also handled as literals inside the
5181        11. In PCRE2, if any of the backtracking control verbs are  used  in  a
5187        |  characters.  Note  that such groups are processed as anchored at the
5188        point where they are tested.
5194        it is the same as PCRE2, but there are cases where it differs.
5196        13.  There are some differences that are concerned with the settings of
5220        because they are almost certainly user mistakes.
5222        17. In PCRE2, the upper/lower case character properties Lu and  Ll  are
5247        fiers is inverted, that is, by default they are not greedy, but if fol-
5248        lowed by a question mark they are.
5273        lookarounds are atomic.
5275        (l) There are three syntactical items in patterns that can refer  to  a
5366        particular match. One reason for this is that there are a number of op-
5367        tions and pattern items that are not supported by JIT (see below).  An-
5396        the size of machine stack that it uses. The exact rules are  not  docu-
5398        timizations  are  introduced.   If  a  pattern  is  too  big, a call to
5424        are  described  in the section entitled "Controlling the JIT stack" be-
5427        There are some pcre2_match() options that are not supported by JIT, and
5428        there are also some pattern items that JIT cannot handle.  Details  are
5435        be obeyed. If the match-time options are not right for  JIT  execution,
5445        guarantee  the  use  of  JIT at match time because there are some match
5446        time options that are not supported by JIT.
5452        are  normally expected to be a valid sequence of UTF code units. By de-
5456        you are sure that a subject string is valid. If  this  option  is  used
5477        The pcre2_match() options that  are  supported  for  JIT  matching  are
5481        are not supported at match time.
5486        The only unsupported pattern items are \C (match a  single  data  unit)
5493        When a pattern is matched using JIT, the return values are the same  as
5502        what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
5512        tions are provided for managing blocks of memory for use as JIT stacks.
5517        ments  are  a starting size, a maximum size, and a general context (for
5528        should use. Its arguments are as follows:
5537        diately, without doing anything. There are three cases for  the  values
5556        is not obeyed when pcre2_match() is called with options that are incom-
5562        by assigning directly or by callback), as  long  as  the  patterns  are
5576        as long as they are not used for matching by multiple  threads  at  the
5710        JIT  is  not  available, it is convenient for programs that are written
5712        pcre2_match() does have a performance impact. Programs that are written
5722        PCRE2_ENDANCHORED) are ignored, as is the PCRE2_NO_JIT option. The  re-
5723        turn  values  are  also  the  same as for pcre2_match(), plus PCRE2_ER-
5728        number of other sanity checks are performed on the arguments. For exam-
5736        pcre2_jit_match()  in  UTF  mode  only  if  you are sure the subject is
5775        There are some size limitations in PCRE2 but it is hoped that they will
5781        braries.  If  you  want  to  process regular expressions that are truly
5801        There are two different limits that apply to branches of lookbehind as-
5859        This document describes the two different algorithms that are available
5870        gorithm, and these are described below.
5874        arises, however, when there are multiple possibilities. For example, if
5883        there are three possible answers. The standard algorithm finds only one
5889        The set of strings that are matched by a regular expression can be rep-
5893        thought  of  as  a  search of the tree.  There are two ways to search a
5909        branches  are  tried  is controlled by the greedy or ungreedy nature of
5917        tifiers are specified in the pattern.
5921        strings that are matched by portions of  the  pattern  in  parentheses.
5942        there  are  no more unterminated paths. At this point, terminated paths
5943        represent the different matching possibilities (if there are none,  the
5946        longest.  The  matches  are returned in the output vector in decreasing
5956        Note also that all the matches that are found start at the  same  point
5975        There  are  a  number of features of PCRE2 regular expressions that are
5977        tion. Those that are not supported cause an error if encountered.
5982        greedy and ungreedy quantifiers are treated in exactly  the  same  way.
5999        strings are available.
6001        3. Because no substrings are captured, backreferences within  the  pat-
6002        tern are not supported.
6005        ence as the condition or test for a specific group  recursion  are  not
6008        5. Again for the same reason, script runs are not supported.
6014        7. Callouts are supported, but the value of the  capture_top  field  is
6024        are not supported. (*FAIL) is supported, and  behaves  like  a  failing
6034        matches (at a single point in the subject) are automatically found, and
6053        within invalid UTF string are not supported.
6055        3. Although atomic groups are supported, their use does not provide the
6089        string, but more characters are needed to  match  the  entire  pattern,
6091        There are circumstances where it might be helpful to  distinguish  this
6108        between the two types of matching function. If both  options  are  set,
6119        PCRE2_JIT_COMPLETE should also be set if you are going to run  non-par-
6137        subject  string is reached successfully, but either more characters are
6142        acters are definitely needed to complete a match.  In  this  case  both
6172        acters are added, we do not know if it will be an empty string or some-
6194        the rest of the ovector are undefined. The appearance of \K in the pat-
6203        string  "abc12",  because  all these characters are needed for a subse-
6213        matching,  so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3
6215        ple,  there are two partial matches, because "dog" on its own partially
6232        matching continues as normal, and other alternatives in the pattern are
6303        strings  that  are  being  sought are much shorter than each individual
6304        segment, and are in the middle of very long strings, so the pattern  is
6334        If there are memory constraints, you may want to discard text that pre-
6361        if  there  are  nested  lookbehinds.  The  value  returned  by  calling
6382        are returned.  If PCRE2_PARTIAL_HARD is  set,  a  partial  match  takes
6406        vious partial match are stored. You can set the  PCRE2_PARTIAL_SOFT  or
6433        match  at one point in the subject are remembered. Depending on the ap-
6472        The  syntax and semantics of the regular expressions that are supported
6473        by PCRE2 are described in detail below. There is a quick-reference syn-
6480        Perl's  regular expressions are described in its own documentation, and
6481        regular expressions in general are covered in a number of  books,  some
6487        This  document  discusses the regular expression patterns that are sup-
6491        not  Perl-compatible.  Some  of  the  features  discussed below are not
6494        tion, are discussed in the pcre2matching page.
6500        set by special items at the start of a pattern. These are not Perl-com-
6501        patible,  but  are provided to make these options accessible to pattern
6502        writers who are not able to change the program that processes the  pat-
6590        These facilities are provided to catch runaway matches  that  are  pro-
6613        interpreters are used for matching. It does not apply to JIT. The match
6651        tions are true. It also affects the interpretation of the dot metachar-
6673        tem).  In  the  sections below, character code values are ASCII or Uni-
6675        values, and there are no code points greater than 255.
6689        within the pattern), letters are matched independently  of  case.  Note
6690        that  there  are  two  ASCII  characters, K and S, that, in addition to
6691        their lower case ASCII equivalents, are  case-equivalent  with  Unicode
6699        These are encoded in the pattern by the use of metacharacters, which do
6700        not  stand  for  themselves but instead are interpreted in some special
6703        There are two different sets of metacharacters: those that  are  recog-
6705        that are recognized within square brackets.  Outside  square  brackets,
6706        the metacharacters are as follows:
6721        Brace  characters  {  and } are also used to enclose data for construc-
6723        and/or horizontal tab characters that follow { or precede } are allowed
6724        and  are  ignored. In the case of quantifiers, they may also appear be-
6731        class". In a character class the only metacharacters are:
6742        line, inclusive, are ignored. An escaping backslash can be used to  in-
6745        unescaped  space  and  horizontal  tab  characters are ignored inside a
6746        character class. Note: only these two characters are ignored,  not  the
6747        full  set  of pattern white space characters that are ignored outside a
6769        slash. All other characters (in particular, those whose code points are
6770        greater than 127) are treated as literals.
6776        and @ are handled as literals in \Q...\E sequences in PCRE2, whereas in
6807        sents.  In  an  ASCII or Unicode environment, these escapes are as fol-
6825        decimal  digits  are  read (letters can be in upper or lower case). Any
6830        Characters whose code points are less than 256 can be defined by either
6832        ence in the way they are handled. For example, \xdc is exactly the same
6848        like  other places that also use curly brackets, spaces are not allowed
6858        There are some legacy applications where the escape sequence \r is  ex-
6874        ument.  The  only characters that are allowed after \c are A-Z, a-z, or
6888        generate  the  APC character. Unfortunately, there are several variants
6894        After \0 up to two further octal digits are read. If  there  are  fewer
6895        than  two  digits,  just  those that are present are used. Thus the se-
6916        digit 8 or 9, or if there are  at  least  that  many  previous  capture
6920        its are read to form a character code.
6929          \40    is the same, provided there are fewer than 40
6942        Note  that octal values of 100 or greater that are specified using this
6944        three octal digits are ever read.
6948        Characters  that  are  specified using octal or hexadecimal numbers are
6956        Invalid Unicode code points are all those in the range 0xd800 to 0xdfff
6960        UTF-8  and  UTF-32 modes, because these values are not representable in
6970        class.   \B,  \R, and \X are not special inside a character class. Like
6976        In  Perl,  the  sequences  \F, \l, \L, \u, and \U are recognized by its
6987        backreference  can  be  coded as \g{name}. Backreferences are discussed
6995        Details are discussed later.   Note  that  \g{...}  (Perl  syntax)  and
6996        \g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
7030        The default \s characters are HT (9), LF (10), VT  (11),  FF  (12),  CR
7031        (13),  and  space (32), which are defined as white space in the "C" lo-
7042        are used for accented letters, and these are then matched  by  \w.  The
7045        By  default,  characters  whose  code points are greater than 127 never
7051        changed so that Unicode properties  are  used  to  determine  character
7066        \b, and \B because they are defined in terms of  \w  and  \W.  Matching
7078        space characters are:
7100        The vertical space characters are:
7111        than 256 are relevant.
7121        This is an example of an "atomic group", details of which are given be-
7129        In other modes, two additional characters whose code points are greater
7130        than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
7146        tion.  Note that these special settings, which are not Perl-compatible,
7147        are  recognized only at the very start of a pattern, and that they must
7162        are available. They can be used in any mode, though in 8-bit and 16-bit
7163        non-UTF  modes these sequences are of course limited to testing charac-
7164        ters whose code points are less than U+0100 and U+10000,  respectively.
7166        limit) may be encountered. These are all treated as being  in  the  Un-
7176        The extra escape sequences that provide property support are:
7182        The  property names represented by xx above are not case-sensitive, and
7184        and underscores are ignored. There is support for Unicode script names,
7188        other  Perl  properties such as "InMusicalSymbols" are not supported by
7194        There are three different syntax forms for matching a script. Each Uni-
7200        "script extensions" for the property types are recognized, and a equals
7207        points greater than 0x10FFFF) are assigned the "Unknown" script. Others
7208        that are not part of an identified script are lumped together as  "Com-
7225        the  absence of negation, the curly brackets in the escape sequence are
7231        The following general category property codes are supported:
7282        points  are in the range U+D800 to U+DFFF. These characters are no dif-
7284        16-bit  or  32-bit  library).   However,  they are not valid in Unicode
7290        \p{Letter}) are not supported by PCRE2, nor is it permitted  to  prefix
7304        whose only values are true or false. You can obtain  a  list  of  those
7305        that  are  recognized  by \p and \P, along with their abbreviations, by
7316        The recognized classes are:
7342        An equals sign may be used instead of a  colon.  The  class  names  are
7343        case-insensitive; only the short names listed above are recognized.
7352        clusters. The rules are defined in Unicode Standard Annex 29,  "Unicode
7368        characters  are of five types: L, V, T, LV, and LVT. An L character may
7386        tween regional indicator (RI) characters if there are an odd number  of
7397        However, they may also be used explicitly. These properties are:
7414        other  programming  languages.  These are the characters $, @, ` (grave
7417        most base (ASCII) characters are excluded. (Universal  Character  Names
7418        are  of  the  form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
7471        backslashed assertions are:
7498        at  the  very start and end of the subject string, whatever options are
7499        set. Thus, they are independent of multiline mode. These  three  asser-
7500        tions  are  not  affected  by the PCRE2_NOTBOL or PCRE2_NOTEOL options,
7529        The  circumflex  and  dollar  metacharacters are zero-width assertions.
7532        are  concerned  with matching the starts and ends of lines. If the new-
7534        recognized  as  a newline, isolated CR and LF characters are treated as
7535        ordinary data characters, and are not recognized as newlines.
7546        of alternatives are involved, but it should be the first thing in  each
7550        ject, it is said to be an "anchored" pattern.  (There  are  also  other
7558        of alternatives are involved, but it should be the  last  item  in  any
7566        The meanings of the circumflex and dollar metacharacters are changed if
7576        Consequently,  patterns  that  are anchored in single line mode because
7577        all branches start with ^ are not anchored in  multiline  mode,  and  a
7584        even  if  the  single  characters CR and LF are also recognized as new-
7608        endings are being recognized, dot does not match CR or LF or any of the
7679        tively.  The  character's individual bytes are then captured by the ap-
7704        characters  that  are in the class by enumerating those that are not. A
7714        would.  Note that there are two ASCII characters, K and S, that, in ad-
7715        dition to their lower case ASCII equivalents, are case-equivalent  with
7719        Characters that might indicate line breaks are  never  treated  in  any
7733        backspace  character.  The sequences \B, \R, and \X are not special in-
7765        that are valid for the current mode. In any  UTF  mode,  the  so-called
7770        are always permitted.
7773        points are both specified as literal letters in the same case. For com-
7774        patibility  with Perl, EBCDIC code points within the range that are not
7775        letters are omitted. For example, [h-k] matches only  four  characters,
7776        even though the codes for h and k are 0x88 and 0x92, a range of 11 code
7778        [\x88-\x92] or [h-\x92], all code points are included.
7783        character  tables  for  a French locale are in use, [\xc8-\xcb] matches
7793        The  only  metacharacters  that are recognized in character classes are
7811        names are:
7828        The default "space" characters are HT (9), LF (10), VT (11),  FF  (12),
7842        these are not supported, and an error is given if they are encountered.
7847        However,  in UCP mode, unless certain options are set (see below), some
7848        of the classes are changed so that  Unicode  character  properties  are
7863        POSIX classes are handled specially in UCP mode:
7875                  characters that are not controls, that  is,  characters  with
7888        The  other  POSIX  classes  are  unchanged by PCRE2_UCP, and match only
7891        There are two options that can be used to restrict the POSIX classes to
7909        Only these exact character sequences are recognized. A sequence such as
7916        sertions that are used above in order to give exactly the POSIX  behav-
7924        Vertical  bar characters are used to separate alternative patterns. For
7933        are  within a group (defined below), "succeeds" means matching the rest
7940        sequence  of  letters  enclosed between "(?" and ")". The following are
7941        Perl-compatible, and are described in detail in the pcre2api documenta-
7942        tion. The option letters are:
7953        phen, for example (?-im). The two "extended" options are  not  indepen-
7979        However,  except for 'r', these are not unset by (?^), which is equiva-
7991        follows  it.  At  the  end  of the group these options are reset to the
8007        As  a  convenient shorthand, if any option settings are required at the
8016        Note:  There  are  other  PCRE2-specific options, applying to the whole
8020        what  has  been  defaulted.   Details are given in the section entitled
8021        "Newline sequences" above. There are also the (*UTF) and (*UCP) leading
8023        are  equivalent to setting the PCRE2_UTF and PCRE2_UCP options, respec-
8031        Groups are delimited by parentheses  (round  brackets),  which  can  be
8047        Opening parentheses are counted from left to right (starting from 1) to
8053        the captured substrings are "red king", "red", and "king", and are num-
8057        helpful.  There are often times when grouping is required without  cap-
8065        the captured substrings are "white queen" and "queen", and are numbered
8068        As a convenient shorthand, if any option settings are required  at  the
8075        match exactly the same set of strings. Because alternative branches are
8076        tried from left to right, and options are not reset until  the  end  of
8091        Because the two alternatives are inside a (?| group, both sets of  cap-
8092        turing  parentheses  are  numbered one. Thus, when the pattern matches,
8096        theses are numbered as usual, but the number is reset at the  start  of
8152        Named capture groups are allocated numbers as well as names, exactly as
8154        are primarily identified by numbers; any names  are  just  aliases  for
8163        names.  Consider this pattern, where there are two capture groups, both
8184        cate names are permitted for groups with the same number, for example:
8205        There are five capture groups, but only one is ever set after a  match.
8213        in  the pattern, the groups to which the name refers are checked in the
8227        or to check for recursion, all groups with the same name are tested. If
8261        are  both omitted, the quantifier specifies an exact number of required
8284        fier because braces are used  in  other  items  such  as  \N{U+345}  or
8296        ful for capture groups that are referenced as  subroutines  from  else-
8299        groups, items that have a {0} quantifier are omitted from the  compiled
8316        time for such patterns. However, because there are cases where this can
8317        be useful, such patterns are now accepted, but whenever an iteration of
8323        By default, quantifiers are "greedy", that is, they match  as  much  as
8356        Perl), the quantifiers are not greedy by default, but  individual  ones
8377        However, there are some cases where the optimization  cannot  be  used.
8378        When  .*   is  inside  capturing  parentheses that are the subject of a
8403        is  "tweedledee". However, if there are nested capture groups, the cor-
8454        Atomic  groups  are  not capture groups. Simple cases such as the above
8456        everything  it can.  So, while both \d+ and \d+? are prepared to adjust
8474        Possessive  quantifiers are always greedy; the setting of the PCRE2_UN-
8475        GREEDY option is ignored. They are a convenient notation for  the  sim-
8530        there  are not that many capture groups in the entire pattern. In other
8542        is no problem when named capture groups are used (see below).
8547        braces. These examples are all identical:
8570        also  in  patterns  that are created by joining together fragments that
8595        There  are  several  different  ways of writing backreferences to named
8598        these are now supported by both Perl and  PCRE2.  Perl  5.10's  unified
8622        lowing a backslash are taken as part of a potential backreference  num-
8654        assertions  coded  as  \b,  \B,  \A,  \G, \Z, \z, ^ and $ are described
8657        More complicated assertions are coded as  parenthesized  groups.  There
8658        are  two  kinds:  those  that look ahead of the current position in the
8666        The Perl-compatible lookaround assertions are atomic. If  an  assertion
8668        tracking into the assertion. However, there are some cases  where  non-
8671        are not Perl-compatible.
8677        Assertion groups are not capture groups. If an assertion contains  cap-
8678        ture  groups within it, these are counted for the purposes of numbering
8682        that two adjacent characters are the same.
8685        were captured are discarded (as happens with any  pattern  branch  that
8687        branches fail to match; this means that no captured substrings are ever
8692        cessful branch are retained, and matching continues with the next  pat-
8696        substrings are retained,  because  matching  continues  with  the  "no"
8724        lowing sections, the various assertions are described using the  origi-
8746        the assertion (?!foo) is always true when the next three characters are
8763        contents  of a lookbehind assertion are restricted such that there must
8765        are two cases:
8793        bers of code units, are never permitted in lookbehinds.
8795        "Subroutine" calls (see below) such as (?2) or (?&X) are  permitted  in
8801        ditions are met. The PCRE2_MATCH_UNSET_BACKREF option must not be  set,
8827        so we are no better off. However, if the pattern is written as
8843        matches  "foo" preceded by three digits that are not "999". Notice that
8846        characters are all digits, and then there is  a  check  that  the  same
8847        three characters are not "999".  This pattern does not match "foo" pre-
8848        ceded  by  six  characters,  the first of which are digits and the last
8849        three of which are not "999". For example, it  doesn't  match  "123abc-
8855        checking that the first three are digits, and then the second assertion
8856        checks that the preceding three characters are not "999".
8868        three characters that are not "999".
8873        Traditional lookaround assertions are atomic. That is, if an  assertion
8875        tracking into the assertion. However, there are some cases  where  non-
8901        succeeds,  we are done, but if the last word in the string does not oc-
8924        Non-atomic assertions are not supported  by  the  alternative  matching
8925        function pcre2_dfa_match(). They are supported by JIT, but only if they
8933        In  concept, a script run is a sequence of characters that are all from
8935        scripts  are  commonly  used together, and because some diacritical and
8936        other marks are used with multiple scripts,  it  is  not  that  simple.
8942        matches  are not a script run. After a failure, normal backtracking oc-
8944        ters  that  look  the  same, but are from different scripts. The string
8947        acters in a sequence of non-spaces that follow white space are a script
8952        To  be  sure  that  they are all from the Latin script (for example), a
8960        needed. For example, if digits, underscore, and dots are  permitted  at
8978        structs  is encountered. Script runs are not supported by the alternate
8992        already been matched. The two possible forms of conditional group are:
8999        an  empty string (it always matches). If there are more than two alter-
9004        where the alternatives are complex:
9009        There are five kinds of condition: references to capture groups, refer-
9037        ond part matches one or more characters that are not  parentheses.  The
9060        the letter R followed by digits are ambiguous (see the  following  sec-
9104        At "top level", all these recursion test conditions are false.
9134        which version of PCRE2 they are dealing with by using this condition to
9162        strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
9163        letters and dd are digits.
9169        conditional assertions, for which captures are retained only for  posi-
9175        There are two ways of including comments in patterns that are processed
9182        next  closing parenthesis. Nested parentheses are not permitted. If the
9186        the pattern. Which characters are interpreted as newlines is controlled
9262        Be aware however, that if duplicate capture group numbers are  in  use,
9268        The first two capture groups (a) and (b) are both numbered 1, and group
9273        tive references are just a shorthand for computing a group number.
9277        the  reference  is not inside the parentheses that are referenced. They
9278        are always non-recursive subroutine calls, as  described  in  the  next
9298        not used, the match runs for a very long time indeed because there  are
9302        At the end of a match, the values of capturing  parentheses  are  those
9317        ets, allowing for arbitrary nesting. Only digits are allowed in  nested
9318        brackets  (that is, when recursing), whereas any characters are permit-
9338        Starting  with  release 10.30, recursive subroutine calls are no longer
9350        the  palindrome  when there are an odd number of characters, or nothing
9351        when there are an even number of characters, but in order  to  work  it
9411        calls  can  now  occur. However, any capturing parentheses that are set
9414        Processing options such as case-independence are fixed when a group  is
9433        cursively.  Here  are  two  of the examples used above, rewritten using
9444        Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
9462        passed, or if the callout entry point is set to NULL, callouts are dis-
9466        external function is to be called. There  are  two  kinds  of  callout:
9481        time,  and  one  side-effect is that sometimes callouts are skipped. If
9484        description of the programming interface to the callout  function,  are
9496        callouts are automatically installed before each item in  the  pattern.
9497        They  are all numbered 255. If there is a conditional group in the pat-
9523        There  are  a  number  of  special "Backtracking Control Verbs" (to use
9525        matching.  They are generally of the form (*VERB) or (*VERB:NAME). Some
9527        or not a name argument is present. The names are  not  required  to  be
9539        name.  However, the only backslash items that are permitted are \Q, \E,
9541        acter type escapes such as \d are faulted.
9546        names is skipped, and #-comments are recognized, exactly as in the rest
9556        Since these verbs are specifically related  to  backtracking,  most  of
9569        PCRE2 contains some optimizations that are used to speed up matching by
9585        The following verbs act as soon as they are encountered.
9625        combined with (?{}) or (??{}). Those are, of course, Perl features that
9626        are  not  present  in PCRE2. The nearest equivalent is the callout fea-
9653        including those inside assertions and atomic groups. However, there are
9692        If you are interested in  (*MARK)  values  after  failed  matches,  you
9698        The following verbs do nothing when they are encountered. Matching con-
9731        that are set with (*MARK), ignoring those set by any of the other back-
9740        chor,  unless  PCRE2's  start-of-match optimizations are turned off, as
9769        tifier, but there are some uses of (*PRUNE) that cannot be expressed in
9812        which means that it does not  see  (*MARK)  settings  that  are  inside
9813        atomic groups or assertions, because they are never re-entered by back-
9837        ignores names that are set by other backtracking verbs.
9852        quently BAZ fails, there are no more alternatives, so there is a  back-
9864        closing alternative.  Consider this pattern, where A, B, etc. are  com-
9870        If A and B are matched, but there is a failure in C, matching does  not
9879        fail  because  there  are  no  more  alternatives to try. In this case,
9910        tern, where A, B, etc. are complex pattern fragments:
9934        If the subject is "abac", Perl matches  unless  its  optimizations  are
9947        name  (if  set) are retained. In a standalone negative assertion, (*AC-
9949        tured substrings and any mark name are discarded.
9953        substrings are retained in both cases.
9958        are atomic. A backtrack that occurs after such an assertion is complete
9965        be  standalone  (not used as conditions). They are not Perl-compatible.
9971        there  are no more branches to try, (*THEN) causes a positive assertion
9974        The other backtracking verbs are not treated specially if  they  appear
10037        Two  aspects  of performance are discussed below: memory usage and pro-
10044        Patterns are compiled by PCRE2 into a reasonably efficient interpretive
10062        is not usually a problem. However, if the numbers are large,  and  par-
10063        ticularly  if  such repetitions are nested, the memory usage can become
10084        tures  within  subroutine calls are lost when the subroutine completes.
10087        formance of the two different versions of the pattern are  roughly  the
10098        tions  are now explicitly remembered in memory frames controlled by the
10109        10.41 backtracking memory frames are always held  in  heap  memory.  An
10130        workspace  when  recursing,  though  recursive function calls are still
10140        Certain items in regular expression patterns are processed  more  effi-
10164        theses that are the subject of a backreference,  and  the  PCRE2_DOTALL
10184        If you are using such a pattern with subject strings that do  not  con-
10239        tain "<" are "swallowed" in one item inside the parentheses, and a pos-
10254        values of the limits are very large, and unlikely ever to operate. They
10310        expression 8-bit library. There are no POSIX-style wrappers for PCRE2's
10315        IMPORTANT NOTE: The functions described here are NOT  thread-safe,  and
10316        should  not  be used in multi-threaded applications. They are also lim-
10317        ited to processing subjects that are not bigger than 2GB. Use  the  na-
10320        These  functions  are  wrapper functions that ultimately call the PCRE2
10321        native API. Their prototypes are defined  in  the  pcre2posix.h  header
10334        On Windows systems, if you are linking to a DLL version of the library,
10354        names start with "REG_"; these are used for setting options and identi-
10360        Note that these functions are just POSIX-style wrappers for PCRE2's na-
10362        they are not thread-safe or even POSIX compatible.
10367        that are written to the POSIX interface often use  it,  this  makes  it
10369        are not even defined.
10371        There are also some options that are not defined by POSIX.  These  have
10378        sions  themselves  are  still  those of Perl, subject to the setting of
10426        only other options that are  allowed  with  REG_NOSPEC  are  REG_ICASE,
10433        pcre2_regexec()  for  matching, the nmatch and pmatch arguments are ig-
10434        nored, and no captured strings are returned. Versions of the  PCRE  li-
10444        self  may  now  contain binary zeros, which are treated as data charac-
10470        In the absence of these flags, no options  are  passed  to  the  native
10475        It  does not affect the way newlines are matched by the dot metacharac-
10476        ter (they are not) or by a negative class such as [^a] (they are).
10482        ror codes are defined in the header file.
10563        string and any captured substrings are  still  given  relative  to  the
10573        passing pmatch as NULL are mutually exclusive; the error REG_INVARG  is
10578        pcre2_regexec()  are  ignored  (except  possibly as input for REG_STAR-
10586        captured substrings, are returned via the pmatch argument, which points
10595        regmatch_t  as  well  as  the  regoff_t  typedef it uses are defined in
10596        pcre2posix.h and are not warranted to have the same size or  layout  as
10600        A successful match yields a zero return; various error  codes  are  de-
10612        error message are used. The yield of the function is the size of buffer
10660        argument. No PCRE2 options are set, and default  character  tables  are
10672        library.  It  handles  strings  and characters that are stored in 8-bit
10674        but  if  the  pattern starts with "(*UTF)", both it and the subject are
10756        If  you  are running an application that uses a large number of regular
10759        run. However, if you are using the just-in-time  optimization  feature,
10761        tion-dependent.  The  host  on  which the patterns are reloaded must be
10772        restrictions  mentioned  above.   Applications  that are not statically
10803        ables which are set to point to the created byte stream and its length,
10858        find out how many compiled patterns are in the serialized data  without
10866        a  vector.  The  first two arguments are a pointer to a suitable vector
10870        this argument is NULL, malloc() and free() are used. After deserializa-
10879        stream, it is filled with those that fit, and  the  remainder  are  ig-
10895        potential race issue if you are using multiple patterns that  were  de-
10938        The  full syntax and semantics of the regular expressions that are sup-
10939        ported by PCRE2 are described in the pcre2pattern  documentation.  This
10954        With  one  exception, wherever brace characters { and } are required to
10956        horizontal  tab  characters  that follow { or precede } are allowed and
10957        are ignored. In the case of quantifiers, they may also appear before or
10983        following are also recognized:
10989        When \x is not followed by {, from zero to two hexadecimal  digits  are
10999        tails  of  escape  processing  in  EBCDIC  environments are also given.
11036        they  match  many  more  characters, but there are some option settings
11040        Property descriptions in \p and \P are matched caselessly; hyphens, un-
11041        derscores,  and  white  space are ignored, in accordance with Unicode's
11103        Perl and POSIX space are now the same. Perl added VT to its space char-
11110        whose  only  values  are  true or false. You can obtain a list of those
11111        that are recognized by \p and \P, along with  their  abbreviations,  by
11119        Many  script  names  and their 4-letter abbreviations are recognized in
11132        The recognized classes are:
11255        are permitted. In both cases, a name must not start with a digit.
11270        Changes  of these options within a group are automatically cancelled at
11301        The following are recognized only at the very start of a pattern or af-
11327        These are recognized only at the very start of the pattern or after op-
11340        These are recognized only at the very start of the pattern or after op-
11375        These assertions are specific to PCRE2 and are not Perl-compatible.
11458        see. The following act immediately they are reached:
11486        The allowed string delimiters are ` ' " ^ % # $ (which are the same for
11533        There  are two ways of telling PCRE2 to switch to UTF mode, where char-
11544        In  UTF mode, both the pattern and any subject strings that are matched
11545        against it are treated as UTF strings instead of strings of  individual
11546        one-code-unit  characters. There are also some other changes to the way
11547        characters are handled, as documented below.
11554        ting.   The Unicode properties that can be tested are a subset of those
11555        that Perl supports. Currently they are limited to the general  category
11561        The full lists are given in the pcre2pattern and pcre2syntax documenta-
11562        tion.  In  general,  only the short names for properties are supported.
11574        up to \777 are also recognized; larger ones can be coded using \o{...}.
11586        In UTF mode, capture group names are not restricted to ASCII,  and  may
11612        that  this also applies to \b and \B, because they are defined in terms
11616        capes work is changed so that Unicode properties are used to  determine
11617        which  characters  match,  though  there are some options that suppress
11622        classes are all low-valued characters unless the  PCRE2_UCP  option  is
11635        are less than 128 and that have at most two case-equivalent values. For
11637        ters  such as Greek sigma have more than two code points that are case-
11638        equivalent, and these are treated specially. Setting PCRE2_UCP  without
11642        There are two ASCII characters (S and K) that,  in  addition  to  their
11656        sequence of characters that are all from the same Unicode script.  How-
11657        ever, because some scripts are commonly used together, and because some
11658        diacritical  and  other marks are used with multiple scripts, it is not
11663        There are also three special values:
11667        whose code points are greater  than  the  Unicode  maximum  (U+10FFFF),
11668        which  are  accessible  only  in non-UTF mode, are assigned the Unknown
11671        "Common" is used for characters that are used with many scripts.  These
11676        ify a previous character. These are considered to take on the script of
11679        Some  Inherited characters are used with many scripts, but many of them
11680        are only normally used with a small number  of  scripts.  For  example,
11686        characters  such  as  U+102E0 more than one Script is listed. There are
11691        string  of  characters  is  a script run. Note, however, that there are
11693        constraint  for  decimal  digits.  These are covered in subsequent sec-
11700        run.  Longer strings are checked using only the Script Extensions prop-
11712        are all in the Latin script, and the dot is Common, so this string is a
11734        wanese  Mandarin  uses  Bopomofo  and Han. These three combinations are
11735        treated as special cases when checking script runs and are, in  effect,
11747        set. Some of these decimal digits are  visually  indistinguishable
11756        subjects are (by default) checked for validity on entry to the relevant
11762        In some situations, you may already know that your strings  are  valid,
11792        are no lookbehind assertions in the pattern, the check  starts  at  the
11795        if  there are not that many characters before the starting offset. Note
11796        that the sequences \b and \B are one-character lookbehinds.
11800        the  surrogate  area. The so-called "non-character" code points are not
11804        Characters in the "Surrogate Area" of Unicode are reserved for  use  by
11805        UTF-16,  where they are used in pairs to encode code points with values
11806        greater than 0xFFFF. The code points that are encoded by  UTF-16  pairs
11807        are  available  independently  in  the  UTF-8 and UTF-32 encodings. (In
11816        only in UTF-8 and UTF-32 modes, because these  values  are  not  repre-
11821        The following negative error codes are given for invalid UTF-8 strings:
11830        how many bytes are missing (1 to 5). Although RFC 3629 restricts  UTF-8
11849        long; these code points are excluded by RFC 3629.
11854        are excluded by RFC 3629.
11859        range of code points are reserved by RFC 3629 for use with UTF-16,  and
11860        so are excluded from UTF-8.
11887        The  following  negative  error  codes  are  given  for  invalid UTF-16
11897        The following  negative  error  codes  are  given  for  invalid  UTF-32
11914        and  you  are  not  certain that your subject strings are valid UTF se-
11938        string in the usual way. There are a few points to consider:
11940        The internal boundaries are not interpreted as the beginnings  or  ends
11954        trary  data,  knowing  that  any  matched strings that are returned are
11961        such sequences are suitably aligned.