Lines Matching full:are
6 the pcre2demo program. There are separate text files for the pcre2grep and
27 rate "study" optimizing function; in PCRE2, patterns are automatically
34 are available using the Python syntax. There is also some support for
35 one or two .NET and Oniguruma syntax items, and there are options for
44 PCRE2 that are compiled in both 64-bit and 32-bit modes may be needed.
70 Details of exactly which Perl regular expression features are and are
71 not supported by PCRE2 are given in separate documents. See the
77 client to discover which features are available. The features them-
78 selves are described in the pcre2build page. Documentation about build-
83 data tables that are used by more than one of the exported external
84 functions, but which are not intended for use by external callers.
87 external symbols are exported when a shared library is built, and in
88 these cases the undocumented symbols are not exported.
93 If you are using PCRE2 in a non-UTF application that permits users to
128 Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
141 pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
144 tions, are concatenated in pcre2.txt, for ease of searching. The sec-
145 tions are as follows:
448 These functions became obsolete at release 10.30 and are retained only
479 and POSIX basic and extended patterns can be converted. Details are
485 There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
489 taneously. On Unix-like systems the libraries are called libpcre2-8,
498 There are also three different sets of data types:
505 types are pointers to constants of the equivalent UCHAR types, that is,
506 they are pointers to vectors of unsigned code units.
508 Character strings are passed to a PCRE2 library as sequences of un-
514 macros are defined whose names are the generic forms such as pcre2_com-
539 other PCRE2 documents, functions and data types are described using
546 There are also some wrapper functions for the 8-bit library that corre-
548 to all the functionality of PCRE2 and they are not thread-safe. They
549 are described in the pcre2posix documentation. Both these APIs define a
553 error codes are defined in the header file pcre2.h, which also contains
562 The functions pcre2_compile() and pcre2_match() are used for compiling
569 The compiling and matching functions recognize various options that are
570 passed as bits in an options argument. There are also some more compli-
572 source limits that are passed in "contexts" (which are just memory
591 less sanity checking. The JIT-specific functions are discussed in the
598 there are lookaround assertions). However, this algorithm does not re-
603 In addition to the main compiling and matching functions, there are
605 string that has been matched by pcre2_match(). They are:
617 pcre2_substring_free() and pcre2_substring_list_free() are also pro-
626 Functions whose names begin with pcre2_serialize_ are used for saving
629 Finally, there are functions for finding out information about a com-
633 Functions with names ending with _free() are used for freeing memory
641 units in several places. These values are always of type PCRE2_SIZE,
646 handled is one less than this maximum. Note that string lengths are al-
657 are the three just mentioned, plus the single characters VT (vertical
692 There are several different blocks of data that are used to pass infor-
707 In a more complicated situation, where patterns are compiled only when
708 they are first needed, but are still shared between threads, pointers
767 PCRE2 functions are called. A context is nothing more than a collection
771 are stored in contexts are in some sense "advanced features" of the
774 In a multithreaded application, if the parameters in a context are val-
775 ues that are never changed, the same context can be used by all the
789 Some PCRE2 functions have a lot of parameters, many of which are used
793 API extensible, "uncommon" parameters are passed to certain functions
799 There are three different types of context: a general context that is
806 ternal memory management functions that are called from several places
818 whose prototypes are:
826 tions malloc() and free() are used. (This is not currently useful, as
827 there are no other fields in a general context, but in future there
829 tain memory for storing the context, and all three values are saved as
862 A compile context is also required if you are using custom memory man-
900 As PCRE2 has developed, almost all the 32 option bits that are avail-
903 bits which are used for some newer, assumed rarer, options. This func-
905 It does not modify any existing setting. The available options are de-
935 hind assertions without a bounding length are not supported.
940 This specifies which characters or character sequences are to be recog-
1017 during a matching operation. Details are given in the pcre2callout doc-
1025 tion made by pcre2_substitute(). Details are given in the section enti-
1082 pcre2_match() uses the heap are given in the pcre2perform documenta-
1094 ing up too many computing resources when processing patterns that are
1103 take place. For patterns that are not anchored, the count restarts from
1144 version 10.32, only local variables are allocated on the stack and as
1149 cal workspace vectors are allocated on the heap from version 10.32 on-
1213 ther details are given with pcre2_set_depth_limit() above.
1219 pcre2_dfa_match(). Further details are given with
1260 pcre2_match(). Further details are given with pcre2_set_match_limit()
1267 are:
1373 of these character tables.) In many applications the same tables are
1375 are occasions when a copy of a compiled pattern and the relevant tables
1376 are needed. The pcre2_code_copy_with_tables() provides this facility.
1377 Copies of both the code and the tables are made, with the new code
1384 compiled pattern and the subject string are set in the match data block
1394 that affect the compilation. It should be zero if none of them are re-
1395 quired. The available options are described below. Some of them (in
1396 particular, those that are compatible with Perl, but some others as
1411 diately. Otherwise, the variables to which these point are set to an
1416 There are nearly 100 positive error codes that pcre2_compile() may re-
1417 turn if it finds an error in the pattern. There are also some negative
1418 error codes that are used for invalid UTF strings when validity check-
1419 ing is in force. These are the same as given by pcre2_match() and
1420 pcre2_dfa_match(), and are described in the pcre2unicode documentation.
1422 cause the textual error messages that are obtained by calling the
1425 PCRE2_ERROR_ are defined for both positive and negative error codes in
1438 Some errors are not detected until the whole pattern has been scanned;
1461 The following names for option bits are defined in the pcre2.h header
1525 whitespace in verb names is skipped and #-comments are recognized, ex-
1540 PCRE2_UTF or PCRE2_UCP is set, Unicode properties are used for all
1542 code points are greater than U+007F. Note that there are two ASCII
1544 alents, are case-equivalent with U+212A (Kelvin sign) and U+017F (long
1551 (available only in 16-bit or 32-bit mode) are treated as not having an-
1567 ever matches one character, even if newlines are coded as CRLF. Without
1580 There are more details of named capture groups below; see also the
1604 matches, which are necessarily substrings of the first one, must obvi-
1609 If this bit is set, most white space characters in the pattern are to-
1621 256 that are flagged as white space in its low-character table. The ta-
1624 relevant characters are those with code points 0x0009 (tab), 0x000A
1629 acters, five more Unicode "Pattern White Space" characters are recog-
1630 nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-
1634 space characters that are matched by the \h and \v escapes in patterns
1635 are a much bigger set.
1644 Which characters are interpreted as newlines can be specified by a set-
1653 escaped space and horizontal tab characters are ignored inside a char-
1654 acter class. Note: only these two characters are ignored, not the full
1655 set of pattern white space characters that are ignored outside a char-
1676 If this option is set, all meta-characters in the pattern are disabled,
1679 you are doing a lot of literal matching and are worried about effi-
1681 options that are allowed with PCRE2_LITERAL are: PCRE2_ANCHORED,
1685 TRA_MATCH_LINE and PCRE2_EXTRA_MATCH_WORD are also supported. Any other
1695 less such sequences are suitably aligned. This facility is not sup-
1727 setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a
1775 are in use, auto-possessification means that some callouts are never
1800 There are a number of optimizations that may occur at the start of a
1808 items are in use, these "start-up" optimizations can cause them to be
1810 tions are in effect a pre-scan of the subject that takes place before
1816 such as (*COMMIT) and (*MARK) are considered at every possible starting
1844 found, but there is only one character left, so there are no more at-
1846 "2". If NO_START_OPTIMIZE is set, however, matches are tried at every
1856 automatically checked. There are discussions about the validity of
1874 called "surrogate" code points (0xd800 to 0xdfff) are invalid. If you
1878 sible only in UTF-8 and UTF-32 modes, because these values are not rep-
1885 classes. By default, only ASCII characters are recognized, but if
1886 PCRE2_UCP is set, Unicode properties are used to classify characters.
1887 There are some PCRE2_EXTRA options (see below) that add finer control
1888 to this behaviour. More details are given in the section on generic
1903 are not greedy by default, but become greedy if followed by "?". It is
1919 strings that are subsequently processed as strings of UTF characters
1923 tails of how PCRE2_UTF changes the behaviour of PCRE2 are given in the
1930 pcre2_set_compile_extra_options() function are as follows:
1943 "surrogate" code points in the range 0xd800 to 0xdfff are used in pairs
1946 They can be represented in UTF-8 and UTF-32, but are defined as invalid
1959 errors and are incorporated in the compiled pattern. However, they can
2010 "j", and non-hexadecimal digits in \x{} are just ignored, though warn-
2011 ings are given in both cases if Perl's warning switch is enabled. How-
2016 pcre2_compile(), all unrecognized or malformed escape sequences are
2030 are two case-equivalent character sets that contain both ASCII and non-
2040 There are some legacy applications where the escape sequence \r in a
2088 interpretive matching function. Full details are given in the pcre2jit
2105 PCRE2 handles caseless matching, and determines whether characters are
2108 code points are less than 256. By default, higher-valued code points
2117 effects apply even when PCRE2_UTF is not set. There are, however, some
2121 The use of locales with Unicode is discouraged. If you are handling
2125 PCRE2 contains a built-in set of character tables that are used by de-
2126 fault. These are sufficient for many applications. Normally, the in-
2137 External tables are built by calling the pcre2_maketables() function,
2145 For example, to build and use tables that are appropriate for the
2147 are treated as letters), the following code could be used:
2156 if you are using Windows, the name for the French locale is "french".
2159 is saved with the compiled pattern, and the same tables are used by the
2165 the tables remains available while they are still in use. When they are
2172 The tables described above are just a sequence of binary bytes, which
2217 The possible values for the second argument are defined in pcre2.h, and
2218 are as follows:
2250 all the following are true:
2259 For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
2270 a backreference. Zero is returned if there are no backreferences.
2301 code unit values greater than 255 are supported, the flag bit for 255
2329 Return the size (in bytes) of the data frames that are used to remember
2445 ses. The names are just an additional way of identifying the parenthe-
2447 pcre2_substring_get_byname() are provided for extracting captured sub-
2461 brary, the first two bytes of each entry are the number of the captur-
2468 The names are in alphabetical order. If (?| is used to create multiple
2472 names for groups of the same number are not permitted.
2474 Duplicate names for capture groups with different numbers are permit-
2488 There are four named capture groups, so the table has four entries, and
2524 cause there are cases where the code that calculates the size has to
2544 meration block are described in the pcre2callout documentation, which
2552 the patterns are reloaded must be running the same version of PCRE2,
2557 gin with pcre2_serialize_ are used for converting to and from the seri-
2558 alized form. They are described in the pcre2serialize documentation.
2592 should be made large enough to hold as many as are expected.
2603 the memory for the match data block. If you are not using custom memory
2617 after a match operation has finished, using functions that are de-
2626 pattern and the subject string are set in the match data block so that
2705 common matching parameters are to be changed. For details, see the sec-
2712 and offset are in code units, not characters. That is, they are in
2725 sets are valid). Like the pattern string, the subject may contain bi-
2768 The only bits that may be set are PCRE2_ANCHORED,
2778 PCRE2_NO_JIT (obviously), the remaining options are supported for JIT
2794 must not be freed until all such operations are complete. For some ap-
2815 if the limits are large. There is therefore a check at the start of
2821 There are rare cases of matches that would complete, but nevertheless
2851 set. If there are alternatives in the pattern, they are tried. If all
2889 there are no lookbehind assertions in the pattern, the check starts at
2892 if there are not that many characters before the starting offset. Note
2893 that the sequences \b and \B are one-character lookbehinds.
2896 negative error code is returned if the check fails. There are several
2898 problems with the code unit sequence. There are discussions about the
2905 subsequent calls to pcre2_match() if you are making repeated calls to
2919 there are not enough subject characters to complete the match. In addi-
2993 groups there are in a compiled pattern.
3008 ues are always code unit offsets, not character offsets. That is, they
3009 are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit li-
3013 first pair of offsets (that is, ovector[0] and ovector[1]) are set.
3022 been captured, the returned value is 3. If there are no captured sub-
3029 "ab", the start and end offset values for the match are 2 and 0.
3037 zero. If captured substrings are not of interest, pcre2_match() may be
3044 the function is 4, and groups 1 and 3 are matched, but 2 is not. When
3046 groups are set to PCRE2_UNSET.
3049 pression are also set to PCRE2_UNSET. For example, if the string "abc"
3050 is matched against the pattern (abc)(x(yz)?)? groups 2 and 3 are not
3053 groups (assuming the vector is large enough, of course) are set to
3057 in the pattern are never changed. That is, if a pattern contains n cap-
3058 turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
3061 are unchanged.
3072 functions in appropriate circumstances. If they are called at other
3100 Warning: By default, certain start-of-match optimizations are used to
3119 the code unit offset of the invalid UTF character. Details are given in
3128 codes are also returned by other functions, and are documented with
3129 them. The codes are given names in the header file. If UTF checking is
3131 of UTF-specific negative error codes is returned. Details are given in
3132 the pcre2unicode page. The following are the other errors that may be
3222 might do this are detected and faulted at compile time, but more com-
3245 PCRE2_ERROR_NOMEMORY is returned. None of the messages are very long;
3265 described above. For convenience, auxiliary functions are provided for
3281 "ab", the start and end offset values for the match are 2 and 0. In
3296 ments of these functions are a pointer to the match data block and a
3299 The final arguments of pcre2_substring_copy_bynumber() are a pointer to
3305 to variables that are updated with a pointer to the new memory and the
3314 error codes are:
3405 For convenience, there are also "byname" functions that correspond to
3408 there are duplicate names, these functions scan all the groups with the
3412 If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
3413 returned. If all groups with the name have numbers that are greater
3421 guish the different capture groups, because names are not included in
3456 match to end before it starts are not supported, and give rise to an
3459 in the previous iteration are also not supported.
3461 The first seven arguments of pcre2_substitute() are the same as for
3462 pcre2_match(), except that the partial matching options are not permit-
3470 afterwards are the result of the final call. For global changes, this
3485 The contents of the externally supplied match data block are not
3500 STITUTE_REPLACEMENT_ONLY is set, only the replacement substrings are
3501 returned. In the global case, multiple replacements are concatenated in
3544 eral). The following forms are always recognized:
3551 brackets are required only if the following character would be inter-
3589 two characters are CR, LF. In this case, the offset is advanced by two
3606 special, and only the group insertion forms listed above are valid.
3615 There are also four escape sequences for forcing the case of inserted
3625 was set when the pattern was compiled, Unicode properties are used for
3626 case forcing characters whose code points are greater than 127.
3643 specifies strings that are expanded and inserted when group <n> is set
3665 PCRE2_SUBSTITUTE_UNSET_EMPTY, and PCRE2_SUBSTITUTE_EXTENDED are irrele-
3666 vant and are ignored.
3672 from pcre2_match() are passed straight back.
3689 are NULL. For backward compatibility reasons an exception is made for
3732 more fields are added, but the intention is never to remove any of the
3737 ers are copies of the values passed to pcre2_substitute().
3741 that are set in the ovector, and is always greater than zero.
3767 capture groups are not required to be unique. Duplicate names are al-
3769 feature. Indeed, if such groups are named, they are required to use the
3772 Normally, patterns that use duplicate names are such that in any one
3776 When duplicates are present, pcre2_substring_copy_byname() and
3778 to the given name that is set. Only if none are set is PCRE2_ERROR_UN-
3780 turns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are duplicate
3786 the third and fourth arguments are NULL, the function returns a group
3789 When the third and fourth arguments are not NULL, they must be pointers
3790 to variables that are updated by the function. After it has run, they
3793 units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
3832 ble with Perl. Some of the features of PCRE2 patterns are not sup-
3833 ported. Nevertheless, there are times when this kind of matching can be
3838 The arguments for the pcre2_dfa_match() function are the same as for
3841 mon arguments are used in the same way as for pcre2_match(), so their
3847 space is needed for patterns and subjects where there are a lot of po-
3868 zero. The only bits that may be set are PCRE2_ANCHORED,
3872 PCRE2_DFA_RESTART. All but the last four of these are exactly the same
3879 the details are slightly different. When PCRE2_PARTIAL_HARD is set for
3914 matches are all initial substrings of the longer matches. For example,
3923 the three matched strings are
3931 strings are returned in the ovector, and can be extracted by number in
3941 The matched strings are stored in the ovector in reverse order of
3957 Many of the errors are the same as for pcre2_match(), as described
3958 above. There are in addition the following errors that are specific to
3971 a specific capture group. These are not supported.
3995 some plausibility checks are made on the contents of the workspace,
4035 Autotools. Also in the distribution are files to support building using
4043 this file as well as the README file if you are building in a non-Unix-
4051 configure script, where the optional features are selected or dese-
4054 non-Unix-like environments if you are using CMake instead of configure
4057 If you are not using Autotools or CMake, option selection can be done
4082 strings that are contained in arrays of 16-bit and 32-bit code units,
4096 an 8-bit program. Neither of these are built if you select only the
4110 libraries are built as static libraries. The binaries that are then
4112 pcre2grep) are linked statically with one or more PCRE2 libraries, but
4120 versions of all the relevant libraries are available for linking.
4144 and Nd, script names, and some bi-directional properties are supported.
4145 Details are given in the pcre2pattern documentation.
4181 the end of a configure run. If you are enabling JIT under SELinux you
4208 Alternatively, you can specify that line endings are to be indicated by
4224 newline sequences are the three just mentioned, plus the single charac-
4254 Within a compiled pattern, offset values are used to point from one
4257 two-byte values are used for these offsets, leading to a maximum size
4290 points. The more nested backtracking points there are (that is, the
4321 depth of recursive function calls in pcre2_dfa_match(). These are used
4329 able number of characters are supported only if there is a maximum
4338 number of characters (not necessarily all the same) are not constrained
4344 PCRE2 uses fixed tables for processing characters whose code points are
4345 less than 256. By default, PCRE2 is built with a set of tables that are
4346 distributed in the file src/pcre2_chartables.c.dist. These tables are
4351 to the configure command, the distributed tables are no longer used.
4355 work if you are cross compiling, because pcre2_dftables needs to be run
4376 The tables are just a string of bytes, independent of hardware charac-
4392 bles. You should only use it if you know that you are in an EBCDIC en-
4397 ebcdic are mutually exclusive.
4418 within the patterns it is matching. There are two kinds: one that gen-
4422 --disable-pcre2grep-callout is used, all callouts are completely ig-
4437 evant libraries are installed on your system. Configuration will fail
4438 if they are not.
4535 When --enable-coverage is used, the following additional targets are
4600 put() whose arguments are a pointer to a string and the length of the
4603 options and with some random options bits that are generated from the
4610 strings are specified by arguments: if an argument starts with "=" the
4612 file name, and the contents of the file are the test string.
4721 This applies only to assertion conditions (because they are themselves
4726 automatic callouts. When any callouts are present, the output from
4728 information when you are trying to optimize the performance of a par-
4777 all branches are anchorable.
4834 that callouts such as the example above are obeyed.
4867 version 1, and the callout_flags field for version 2. If you are writ-
4870 The version number will increase in future if more fields are added,
4900 The remaining fields in the callout block are the same for both kinds
4926 The values in ovector[0] and ovector[1] are always PCRE2_UNSET because
4928 captured but whose numbers are less than capture_top also have both of
4967 The pattern_position and next_item_length fields are intended to help
4969 the same callout number. However, they are set for all callouts, and
4970 are used by pcre2test to show the next item to be matched when display-
4995 Both bits are set when a backtrack has caused a "bumpalong" to a new
5049 The version number is currently 0. It will increase if new fields are
5050 ever added to the block. The remaining fields are the same as their
5096 here are with respect to Perl version 5.38.0, but as both Perl and
5097 PCRE2 are continually changing, the information may at times be out of
5109 it does have are given in the pcre2unicode page.
5113 does not assert that the next three characters are not "a". It just as-
5124 5. Capture groups that occur inside negative lookaround assertions are
5125 counted, but their entries in the offsets vector are set only when a
5130 6. The following Perl escape sequences are not supported: \F, \l, \L,
5133 point, are supported. The escapes that modify the case of following
5134 letters are implemented by Perl's general string-handling and are not
5135 part of its pattern matching engine. If any of these are encountered by
5137 PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U and \u are
5140 7. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
5142 tested with \p and \P are limited to the general category properties
5148 ports (such as \p{Letter}) are not supported by PCRE2, nor is it per-
5152 in between are treated as literals. However, this is slightly different
5153 from Perl in that $ and @ are also handled as literals inside the
5181 11. In PCRE2, if any of the backtracking control verbs are used in a
5187 | characters. Note that such groups are processed as anchored at the
5188 point where they are tested.
5194 it is the same as PCRE2, but there are cases where it differs.
5196 13. There are some differences that are concerned with the settings of
5220 because they are almost certainly user mistakes.
5222 17. In PCRE2, the upper/lower case character properties Lu and Ll are
5247 fiers is inverted, that is, by default they are not greedy, but if fol-
5248 lowed by a question mark they are.
5273 lookarounds are atomic.
5275 (l) There are three syntactical items in patterns that can refer to a
5366 particular match. One reason for this is that there are a number of op-
5367 tions and pattern items that are not supported by JIT (see below). An-
5396 the size of machine stack that it uses. The exact rules are not docu-
5398 timizations are introduced. If a pattern is too big, a call to
5424 are described in the section entitled "Controlling the JIT stack" be-
5427 There are some pcre2_match() options that are not supported by JIT, and
5428 there are also some pattern items that JIT cannot handle. Details are
5435 be obeyed. If the match-time options are not right for JIT execution,
5445 guarantee the use of JIT at match time because there are some match
5446 time options that are not supported by JIT.
5452 are normally expected to be a valid sequence of UTF code units. By de-
5456 you are sure that a subject string is valid. If this option is used
5477 The pcre2_match() options that are supported for JIT matching are
5481 are not supported at match time.
5486 The only unsupported pattern items are \C (match a single data unit)
5493 When a pattern is matched using JIT, the return values are the same as
5502 what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
5512 tions are provided for managing blocks of memory for use as JIT stacks.
5517 ments are a starting size, a maximum size, and a general context (for
5528 should use. Its arguments are as follows:
5537 diately, without doing anything. There are three cases for the values
5556 is not obeyed when pcre2_match() is called with options that are incom-
5562 by assigning directly or by callback), as long as the patterns are
5576 as long as they are not used for matching by multiple threads at the
5710 JIT is not available, it is convenient for programs that are written
5712 pcre2_match() does have a performance impact. Programs that are written
5722 PCRE2_ENDANCHORED) are ignored, as is the PCRE2_NO_JIT option. The re-
5723 turn values are also the same as for pcre2_match(), plus PCRE2_ER-
5728 number of other sanity checks are performed on the arguments. For exam-
5736 pcre2_jit_match() in UTF mode only if you are sure the subject is
5775 There are some size limitations in PCRE2 but it is hoped that they will
5781 braries. If you want to process regular expressions that are truly
5801 There are two different limits that apply to branches of lookbehind as-
5859 This document describes the two different algorithms that are available
5870 gorithm, and these are described below.
5874 arises, however, when there are multiple possibilities. For example, if
5883 there are three possible answers. The standard algorithm finds only one
5889 The set of strings that are matched by a regular expression can be rep-
5893 thought of as a search of the tree. There are two ways to search a
5909 branches are tried is controlled by the greedy or ungreedy nature of
5917 tifiers are specified in the pattern.
5921 strings that are matched by portions of the pattern in parentheses.
5942 there are no more unterminated paths. At this point, terminated paths
5943 represent the different matching possibilities (if there are none, the
5946 longest. The matches are returned in the output vector in decreasing
5956 Note also that all the matches that are found start at the same point
5975 There are a number of features of PCRE2 regular expressions that are
5977 tion. Those that are not supported cause an error if encountered.
5982 greedy and ungreedy quantifiers are treated in exactly the same way.
5999 strings are available.
6001 3. Because no substrings are captured, backreferences within the pat-
6002 tern are not supported.
6005 ence as the condition or test for a specific group recursion are not
6008 5. Again for the same reason, script runs are not supported.
6014 7. Callouts are supported, but the value of the capture_top field is
6024 are not supported. (*FAIL) is supported, and behaves like a failing
6034 matches (at a single point in the subject) are automatically found, and
6053 within invalid UTF string are not supported.
6055 3. Although atomic groups are supported, their use does not provide the
6089 string, but more characters are needed to match the entire pattern,
6091 There are circumstances where it might be helpful to distinguish this
6108 between the two types of matching function. If both options are set,
6119 PCRE2_JIT_COMPLETE should also be set if you are going to run non-par-
6137 subject string is reached successfully, but either more characters are
6142 acters are definitely needed to complete a match. In this case both
6172 acters are added, we do not know if it will be an empty string or some-
6194 the rest of the ovector are undefined. The appearance of \K in the pat-
6203 string "abc12", because all these characters are needed for a subse-
6213 matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3
6215 ple, there are two partial matches, because "dog" on its own partially
6232 matching continues as normal, and other alternatives in the pattern are
6303 strings that are being sought are much shorter than each individual
6304 segment, and are in the middle of very long strings, so the pattern is
6334 If there are memory constraints, you may want to discard text that pre-
6361 if there are nested lookbehinds. The value returned by calling
6382 are returned. If PCRE2_PARTIAL_HARD is set, a partial match takes
6406 vious partial match are stored. You can set the PCRE2_PARTIAL_SOFT or
6433 match at one point in the subject are remembered. Depending on the ap-
6472 The syntax and semantics of the regular expressions that are supported
6473 by PCRE2 are described in detail below. There is a quick-reference syn-
6480 Perl's regular expressions are described in its own documentation, and
6481 regular expressions in general are covered in a number of books, some
6487 This document discusses the regular expression patterns that are sup-
6491 not Perl-compatible. Some of the features discussed below are not
6494 tion, are discussed in the pcre2matching page.
6500 set by special items at the start of a pattern. These are not Perl-com-
6501 patible, but are provided to make these options accessible to pattern
6502 writers who are not able to change the program that processes the pat-
6590 These facilities are provided to catch runaway matches that are pro-
6613 interpreters are used for matching. It does not apply to JIT. The match
6651 tions are true. It also affects the interpretation of the dot metachar-
6673 tem). In the sections below, character code values are ASCII or Uni-
6675 values, and there are no code points greater than 255.
6689 within the pattern), letters are matched independently of case. Note
6690 that there are two ASCII characters, K and S, that, in addition to
6691 their lower case ASCII equivalents, are case-equivalent with Unicode
6699 These are encoded in the pattern by the use of metacharacters, which do
6700 not stand for themselves but instead are interpreted in some special
6703 There are two different sets of metacharacters: those that are recog-
6705 that are recognized within square brackets. Outside square brackets,
6706 the metacharacters are as follows:
6721 Brace characters { and } are also used to enclose data for construc-
6723 and/or horizontal tab characters that follow { or precede } are allowed
6724 and are ignored. In the case of quantifiers, they may also appear be-
6731 class". In a character class the only metacharacters are:
6742 line, inclusive, are ignored. An escaping backslash can be used to in-
6745 unescaped space and horizontal tab characters are ignored inside a
6746 character class. Note: only these two characters are ignored, not the
6747 full set of pattern white space characters that are ignored outside a
6769 slash. All other characters (in particular, those whose code points are
6770 greater than 127) are treated as literals.
6776 and @ are handled as literals in \Q...\E sequences in PCRE2, whereas in
6807 sents. In an ASCII or Unicode environment, these escapes are as fol-
6825 decimal digits are read (letters can be in upper or lower case). Any
6830 Characters whose code points are less than 256 can be defined by either
6832 ence in the way they are handled. For example, \xdc is exactly the same
6848 like other places that also use curly brackets, spaces are not allowed
6858 There are some legacy applications where the escape sequence \r is ex-
6874 ument. The only characters that are allowed after \c are A-Z, a-z, or
6888 generate the APC character. Unfortunately, there are several variants
6894 After \0 up to two further octal digits are read. If there are fewer
6895 than two digits, just those that are present are used. Thus the se-
6916 digit 8 or 9, or if there are at least that many previous capture
6920 its are read to form a character code.
6929 \40 is the same, provided there are fewer than 40
6942 Note that octal values of 100 or greater that are specified using this
6944 three octal digits are ever read.
6948 Characters that are specified using octal or hexadecimal numbers are
6956 Invalid Unicode code points are all those in the range 0xd800 to 0xdfff
6960 UTF-8 and UTF-32 modes, because these values are not representable in
6970 class. \B, \R, and \X are not special inside a character class. Like
6976 In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its
6987 backreference can be coded as \g{name}. Backreferences are discussed
6995 Details are discussed later. Note that \g{...} (Perl syntax) and
6996 \g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
7030 The default \s characters are HT (9), LF (10), VT (11), FF (12), CR
7031 (13), and space (32), which are defined as white space in the "C" lo-
7042 are used for accented letters, and these are then matched by \w. The
7045 By default, characters whose code points are greater than 127 never
7051 changed so that Unicode properties are used to determine character
7066 \b, and \B because they are defined in terms of \w and \W. Matching
7078 space characters are:
7100 The vertical space characters are:
7111 than 256 are relevant.
7121 This is an example of an "atomic group", details of which are given be-
7129 In other modes, two additional characters whose code points are greater
7130 than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
7146 tion. Note that these special settings, which are not Perl-compatible,
7147 are recognized only at the very start of a pattern, and that they must
7162 are available. They can be used in any mode, though in 8-bit and 16-bit
7163 non-UTF modes these sequences are of course limited to testing charac-
7164 ters whose code points are less than U+0100 and U+10000, respectively.
7166 limit) may be encountered. These are all treated as being in the Un-
7176 The extra escape sequences that provide property support are:
7182 The property names represented by xx above are not case-sensitive, and
7184 and underscores are ignored. There is support for Unicode script names,
7188 other Perl properties such as "InMusicalSymbols" are not supported by
7194 There are three different syntax forms for matching a script. Each Uni-
7200 "script extensions" for the property types are recognized, and an equals
7207 points greater than 0x10FFFF) are assigned the "Unknown" script. Others
7208 that are not part of an identified script are lumped together as "Com-
7225 the absence of negation, the curly brackets in the escape sequence are
7231 The following general category property codes are supported:
7282 points are in the range U+D800 to U+DFFF. These characters are no dif-
7284 16-bit or 32-bit library). However, they are not valid in Unicode
7290 \p{Letter}) are not supported by PCRE2, nor is it permitted to prefix
7304 whose only values are true or false. You can obtain a list of those
7305 that are recognized by \p and \P, along with their abbreviations, by
7316 The recognized classes are:
7342 An equals sign may be used instead of a colon. The class names are
7343 case-insensitive; only the short names listed above are recognized.
7352 clusters. The rules are defined in Unicode Standard Annex 29, "Unicode
7368 characters are of five types: L, V, T, LV, and LVT. An L character may
7386 tween regional indicator (RI) characters if there are an odd number of
7397 However, they may also be used explicitly. These properties are:
7414 other programming languages. These are the characters $, @, ` (grave
7417 most base (ASCII) characters are excluded. (Universal Character Names
7418 are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
7471 backslashed assertions are:
7498 at the very start and end of the subject string, whatever options are
7499 set. Thus, they are independent of multiline mode. These three asser-
7500 tions are not affected by the PCRE2_NOTBOL or PCRE2_NOTEOL options,
7529 The circumflex and dollar metacharacters are zero-width assertions.
7532 are concerned with matching the starts and ends of lines. If the new-
7534 recognized as a newline, isolated CR and LF characters are treated as
7535 ordinary data characters, and are not recognized as newlines.
7546 of alternatives are involved, but it should be the first thing in each
7550 ject, it is said to be an "anchored" pattern. (There are also other
7558 of alternatives are involved, but it should be the last item in any
7566 The meanings of the circumflex and dollar metacharacters are changed if
7576 Consequently, patterns that are anchored in single line mode because
7577 all branches start with ^ are not anchored in multiline mode, and a
7584 even if the single characters CR and LF are also recognized as new-
7608 endings are being recognized, dot does not match CR or LF or any of the
7679 tively. The character's individual bytes are then captured by the ap-
7704 characters that are in the class by enumerating those that are not. A
7714 would. Note that there are two ASCII characters, K and S, that, in ad-
7715 dition to their lower case ASCII equivalents, are case-equivalent with
7719 Characters that might indicate line breaks are never treated in any
7733 backspace character. The sequences \B, \R, and \X are not special in-
7765 that are valid for the current mode. In any UTF mode, the so-called
7770 are always permitted.
7773 points are both specified as literal letters in the same case. For com-
7774 patibility with Perl, EBCDIC code points within the range that are not
7775 letters are omitted. For example, [h-k] matches only four characters,
7776 even though the codes for h and k are 0x88 and 0x92, a range of 11 code
7778 [\x88-\x92] or [h-\x92], all code points are included.
7783 character tables for a French locale are in use, [\xc8-\xcb] matches
7793 The only metacharacters that are recognized in character classes are
7811 names are:
7828 The default "space" characters are HT (9), LF (10), VT (11), FF (12),
7842 these are not supported, and an error is given if they are encountered.
7847 However, in UCP mode, unless certain options are set (see below), some
7848 of the classes are changed so that Unicode character properties are
7863 POSIX classes are handled specially in UCP mode:
7875 characters that are not controls, that is, characters with
7888 The other POSIX classes are unchanged by PCRE2_UCP, and match only
7891 There are two options that can be used to restrict the POSIX classes to
7909 Only these exact character sequences are recognized. A sequence such as
7916 sertions that are used above in order to give exactly the POSIX behav-
7924 Vertical bar characters are used to separate alternative patterns. For
7933 are within a group (defined below), "succeeds" means matching the rest
7940 sequence of letters enclosed between "(?" and ")". The following are
7941 Perl-compatible, and are described in detail in the pcre2api documenta-
7942 tion. The option letters are:
7953 phen, for example (?-im). The two "extended" options are not indepen-
7979 However, except for 'r', these are not unset by (?^), which is equiva-
7991 follows it. At the end of the group these options are reset to the
8007 As a convenient shorthand, if any option settings are required at the
8016 Note: There are other PCRE2-specific options, applying to the whole
8020 what has been defaulted. Details are given in the section entitled
8021 "Newline sequences" above. There are also the (*UTF) and (*UCP) leading
8023 are equivalent to setting the PCRE2_UTF and PCRE2_UCP options, respec-
8031 Groups are delimited by parentheses (round brackets), which can be
8047 Opening parentheses are counted from left to right (starting from 1) to
8053 the captured substrings are "red king", "red", and "king", and are num-
8057 helpful. There are often times when grouping is required without cap-
8065 the captured substrings are "white queen" and "queen", and are numbered
8068 As a convenient shorthand, if any option settings are required at the
8075 match exactly the same set of strings. Because alternative branches are
8076 tried from left to right, and options are not reset until the end of
8091 Because the two alternatives are inside a (?| group, both sets of cap-
8092 turing parentheses are numbered one. Thus, when the pattern matches,
8096 theses are numbered as usual, but the number is reset at the start of
8152 Named capture groups are allocated numbers as well as names, exactly as
8154 are primarily identified by numbers; any names are just aliases for
8163 names. Consider this pattern, where there are two capture groups, both
8184 cate names are permitted for groups with the same number, for example:
8205 There are five capture groups, but only one is ever set after a match.
8213 in the pattern, the groups to which the name refers are checked in the
8227 or to check for recursion, all groups with the same name are tested. If
8261 are both omitted, the quantifier specifies an exact number of required
8284 fier because braces are used in other items such as \N{U+345} or
8296 ful for capture groups that are referenced as subroutines from else-
8299 groups, items that have a {0} quantifier are omitted from the compiled
8316 time for such patterns. However, because there are cases where this can
8317 be useful, such patterns are now accepted, but whenever an iteration of
8323 By default, quantifiers are "greedy", that is, they match as much as
8356 Perl), the quantifiers are not greedy by default, but individual ones
8377 However, there are some cases where the optimization cannot be used.
8378 When .* is inside capturing parentheses that are the subject of a
8403 is "tweedledee". However, if there are nested capture groups, the cor-
8454 Atomic groups are not capture groups. Simple cases such as the above
8456 everything it can. So, while both \d+ and \d+? are prepared to adjust
8474 Possessive quantifiers are always greedy; the setting of the PCRE2_UN-
8475 GREEDY option is ignored. They are a convenient notation for the sim-
8530 there are not that many capture groups in the entire pattern. In other
8542 is no problem when named capture groups are used (see below).
8547 braces. These examples are all identical:
8570 also in patterns that are created by joining together fragments that
8595 There are several different ways of writing backreferences to named
8598 these are now supported by both Perl and PCRE2. Perl 5.10's unified
8622 lowing a backslash are taken as part of a potential backreference num-
8654 assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
8657 More complicated assertions are coded as parenthesized groups. There
8658 are two kinds: those that look ahead of the current position in the
8666 The Perl-compatible lookaround assertions are atomic. If an assertion
8668 tracking into the assertion. However, there are some cases where non-
8671 are not Perl-compatible.
8677 Assertion groups are not capture groups. If an assertion contains cap-
8678 ture groups within it, these are counted for the purposes of numbering
8682 that two adjacent characters are the same.
8685 were captured are discarded (as happens with any pattern branch that
8687 branches fail to match; this means that no captured substrings are ever
8692 cessful branch are retained, and matching continues with the next pat-
8696 substrings are retained, because matching continues with the "no"
8724 lowing sections, the various assertions are described using the origi-
8746 the assertion (?!foo) is always true when the next three characters are
8763 contents of a lookbehind assertion are restricted such that there must
8765 are two cases:
8793 bers of code units, are never permitted in lookbehinds.
8795 "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
8801 ditions are met. The PCRE2_MATCH_UNSET_BACKREF option must not be set,
8827 so we are no better off. However, if the pattern is written as
8843 matches "foo" preceded by three digits that are not "999". Notice that
8846 characters are all digits, and then there is a check that the same
8847 three characters are not "999". This pattern does not match "foo" pre-
8848 ceded by six characters, the first three of which are digits and the last
8849 three of which are not "999". For example, it doesn't match "123abc-
8855 checking that the first three are digits, and then the second assertion
8856 checks that the preceding three characters are not "999".
8868 three characters that are not "999".
8873 Traditional lookaround assertions are atomic. That is, if an assertion
8875 tracking into the assertion. However, there are some cases where non-
8901 succeeds, we are done, but if the last word in the string does not oc-
8924 Non-atomic assertions are not supported by the alternative matching
8925 function pcre2_dfa_match(). They are supported by JIT, but only if they
8933 In concept, a script run is a sequence of characters that are all from
8935 scripts are commonly used together, and because some diacritical and
8936 other marks are used with multiple scripts, it is not that simple.
8942 matches are not a script run. After a failure, normal backtracking oc-
8944 ters that look the same, but are from different scripts. The string
8947 acters in a sequence of non-spaces that follow white space are a script
8952 To be sure that they are all from the Latin script (for example), a
8960 needed. For example, if digits, underscore, and dots are permitted at
8978 structs is encountered. Script runs are not supported by the alternate
8992 already been matched. The two possible forms of conditional group are:
8999 an empty string (it always matches). If there are more than two alter-
9004 where the alternatives are complex:
9009 There are five kinds of condition: references to capture groups, refer-
9037 ond part matches one or more characters that are not parentheses. The
9060 the letter R followed by digits are ambiguous (see the following sec-
9104 At "top level", all these recursion test conditions are false.
9134 which version of PCRE2 they are dealing with by using this condition to
9162 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
9163 letters and dd are digits.
9169 conditional assertions, for which captures are retained only for posi-
9175 There are two ways of including comments in patterns that are processed
9182 next closing parenthesis. Nested parentheses are not permitted. If the
9186 the pattern. Which characters are interpreted as newlines is controlled
9262 Be aware however, that if duplicate capture group numbers are in use,
9268 The first two capture groups (a) and (b) are both numbered 1, and group
9273 tive references are just a shorthand for computing a group number.
9277 the reference is not inside the parentheses that are referenced. They
9278 are always non-recursive subroutine calls, as described in the next
9298 not used, the match runs for a very long time indeed because there are
9302 At the end of a match, the values of capturing parentheses are those
9317 ets, allowing for arbitrary nesting. Only digits are allowed in nested
9318 brackets (that is, when recursing), whereas any characters are permit-
9338 Starting with release 10.30, recursive subroutine calls are no longer
9350 the palindrome when there are an odd number of characters, or nothing
9351 when there are an even number of characters, but in order to work it
9411 calls can now occur. However, any capturing parentheses that are set
9414 Processing options such as case-independence are fixed when a group is
9433 cursively. Here are two of the examples used above, rewritten using
9444 Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
9462 passed, or if the callout entry point is set to NULL, callouts are dis-
9466 external function is to be called. There are two kinds of callout:
9481 time, and one side-effect is that sometimes callouts are skipped. If
9484 description of the programming interface to the callout function, are
9496 callouts are automatically installed before each item in the pattern.
9497 They are all numbered 255. If there is a conditional group in the pat-
9523 There are a number of special "Backtracking Control Verbs" (to use
9525 matching. They are generally of the form (*VERB) or (*VERB:NAME). Some
9527 or not a name argument is present. The names are not required to be
9539 name. However, the only backslash items that are permitted are \Q, \E,
9541 acter type escapes such as \d are faulted.
9546 names is skipped, and #-comments are recognized, exactly as in the rest
9556 Since these verbs are specifically related to backtracking, most of
9569 PCRE2 contains some optimizations that are used to speed up matching by
9585 The following verbs act as soon as they are encountered.
9625 combined with (?{}) or (??{}). Those are, of course, Perl features that
9626 are not present in PCRE2. The nearest equivalent is the callout fea-
9653 including those inside assertions and atomic groups. However, there are
9692 If you are interested in (*MARK) values after failed matches, you
9698 The following verbs do nothing when they are encountered. Matching con-
9731 that are set with (*MARK), ignoring those set by any of the other back-
9740 chor, unless PCRE2's start-of-match optimizations are turned off, as
9769 tifier, but there are some uses of (*PRUNE) that cannot be expressed in
9812 which means that it does not see (*MARK) settings that are inside
9813 atomic groups or assertions, because they are never re-entered by back-
9837 ignores names that are set by other backtracking verbs.
9852 quently BAZ fails, there are no more alternatives, so there is a back-
9864 closing alternative. Consider this pattern, where A, B, etc. are com-
9870 If A and B are matched, but there is a failure in C, matching does not
9879 fail because there are no more alternatives to try. In this case,
9910 tern, where A, B, etc. are complex pattern fragments:
9934 If the subject is "abac", Perl matches unless its optimizations are
9947 name (if set) are retained. In a standalone negative assertion, (*AC-
9949 tured substrings and any mark name are discarded.
9953 substrings are retained in both cases.
9958 are atomic. A backtrack that occurs after such an assertion is complete
9965 be standalone (not used as conditions). They are not Perl-compatible.
9971 there are no more branches to try, (*THEN) causes a positive assertion
9974 The other backtracking verbs are not treated specially if they appear
10037 Two aspects of performance are discussed below: memory usage and pro-
10044 Patterns are compiled by PCRE2 into a reasonably efficient interpretive
10062 is not usually a problem. However, if the numbers are large, and par-
10063 ticularly if such repetitions are nested, the memory usage can become
10084 tures within subroutine calls are lost when the subroutine completes.
10087 formance of the two different versions of the pattern are roughly the
10098 tions are now explicitly remembered in memory frames controlled by the
10109 10.41 backtracking memory frames are always held in heap memory. An
10130 workspace when recursing, though recursive function calls are still
10140 Certain items in regular expression patterns are processed more effi-
10164 theses that are the subject of a backreference, and the PCRE2_DOTALL
10184 If you are using such a pattern with subject strings that do not con-
10239 tain "<" are "swallowed" in one item inside the parentheses, and a pos-
10254 values of the limits are very large, and unlikely ever to operate. They
10310 expression 8-bit library. There are no POSIX-style wrappers for PCRE2's
10315 IMPORTANT NOTE: The functions described here are NOT thread-safe, and
10316 should not be used in multi-threaded applications. They are also lim-
10317 ited to processing subjects that are not bigger than 2GB. Use the na-
10320 These functions are wrapper functions that ultimately call the PCRE2
10321 native API. Their prototypes are defined in the pcre2posix.h header
10334 On Windows systems, if you are linking to a DLL version of the library,
10354 names start with "REG_"; these are used for setting options and identi-
10360 Note that these functions are just POSIX-style wrappers for PCRE2's na-
10362 they are not thread-safe or even POSIX compatible.
10367 that are written to the POSIX interface often use it, this makes it
10369 are not even defined.
10371 There are also some options that are not defined by POSIX. These have
10378 sions themselves are still those of Perl, subject to the setting of
10426 only other options that are allowed with REG_NOSPEC are REG_ICASE,
10433 pcre2_regexec() for matching, the nmatch and pmatch arguments are ig-
10434 nored, and no captured strings are returned. Versions of the PCRE li-
10444 self may now contain binary zeros, which are treated as data charac-
10470 In the absence of these flags, no options are passed to the native
10475 It does not affect the way newlines are matched by the dot metacharac-
10476 ter (they are not) or by a negative class such as [^a] (they are).
10482 ror codes are defined in the header file.
10563 string and any captured substrings are still given relative to the
10573 passing pmatch as NULL are mutually exclusive; the error REG_INVARG is
10578 pcre2_regexec() are ignored (except possibly as input for REG_STAR-
10586 captured substrings, are returned via the pmatch argument, which points
10595 regmatch_t as well as the regoff_t typedef it uses are defined in
10596 pcre2posix.h and are not warranted to have the same size or layout as
10600 A successful match yields a zero return; various error codes are de-
10612 error message are used. The yield of the function is the size of buffer
10660 argument. No PCRE2 options are set, and default character tables are
10672 library. It handles strings and characters that are stored in 8-bit
10674 but if the pattern starts with "(*UTF)", both it and the subject are
10756 If you are running an application that uses a large number of regular
10759 run. However, if you are using the just-in-time optimization feature,
10761 tion-dependent. The host on which the patterns are reloaded must be
10772 restrictions mentioned above. Applications that are not statically
10803 ables which are set to point to the created byte stream and its length,
10858 find out how many compiled patterns are in the serialized data without
10866 a vector. The first two arguments are a pointer to a suitable vector
10870 this argument is NULL, malloc() and free() are used. After deserializa-
10879 stream, it is filled with those that fit, and the remainder are ig-
10895 potential race issue if you are using multiple patterns that were de-
10938 The full syntax and semantics of the regular expressions that are sup-
10939 ported by PCRE2 are described in the pcre2pattern documentation. This
10954 With one exception, wherever brace characters { and } are required to
10956 horizontal tab characters that follow { or precede } are allowed and
10957 are ignored. In the case of quantifiers, they may also appear before or
10983 following are also recognized:
10989 When \x is not followed by {, from zero to two hexadecimal digits are
10999 tails of escape processing in EBCDIC environments are also given.
11036 they match many more characters, but there are some option settings
11040 Property descriptions in \p and \P are matched caselessly; hyphens, un-
11041 derscores, and white space are ignored, in accordance with Unicode's
11103 Perl and POSIX space are now the same. Perl added VT to its space char-
11110 whose only values are true or false. You can obtain a list of those
11111 that are recognized by \p and \P, along with their abbreviations, by
11119 Many script names and their 4-letter abbreviations are recognized in
11132 The recognized classes are:
11255 are permitted. In both cases, a name must not start with a digit.
11270 Changes of these options within a group are automatically cancelled at
11301 The following are recognized only at the very start of a pattern or af-
11327 These are recognized only at the very start of the pattern or after op-
11340 These are recognized only at the very start of the pattern or after op-
11375 These assertions are specific to PCRE2 and are not Perl-compatible.
11458 see. The following act as soon as they are reached:
11486 The allowed string delimiters are ` ' " ^ % # $ (which are the same for
11533 There are two ways of telling PCRE2 to switch to UTF mode, where char-
11544 In UTF mode, both the pattern and any subject strings that are matched
11545 against it are treated as UTF strings instead of strings of individual
11546 one-code-unit characters. There are also some other changes to the way
11547 characters are handled, as documented below.
11554 ting. The Unicode properties that can be tested are a subset of those
11555 that Perl supports. Currently they are limited to the general category
11561 The full lists are given in the pcre2pattern and pcre2syntax documenta-
11562 tion. In general, only the short names for properties are supported.
11574 up to \777 are also recognized; larger ones can be coded using \o{...}.
11586 In UTF mode, capture group names are not restricted to ASCII, and may
11612 that this also applies to \b and \B, because they are defined in terms
11616 capes work is changed so that Unicode properties are used to determine
11617 which characters match, though there are some options that suppress
11622 classes are all low-valued characters unless the PCRE2_UCP option is
11635 are less than 128 and that have at most two case-equivalent values. For
11637 ters such as Greek sigma have more than two code points that are case-
11638 equivalent, and these are treated specially. Setting PCRE2_UCP without
11642 There are two ASCII characters (S and K) that, in addition to their
11656 sequence of characters that are all from the same Unicode script. How-
11657 ever, because some scripts are commonly used together, and because some
11658 diacritical and other marks are used with multiple scripts, it is not
11663 There are also three special values:
11667 whose code points are greater than the Unicode maximum (U+10FFFF),
11668 which are accessible only in non-UTF mode, are assigned the Unknown
11671 "Common" is used for characters that are used with many scripts. These
11676 ify a previous character. These are considered to take on the script of
11679 Some Inherited characters are used with many scripts, but many of them
11680 are only normally used with a small number of scripts. For example,
11686 characters such as U+102E0 more than one Script is listed. There are
11691 string of characters is a script run. Note, however, that there are
11693 constraint for decimal digits. These are covered in subsequent sec-
11700 run. Longer strings are checked using only the Script Extensions prop-
11712 are all in the Latin script, and the dot is Common, so this string is a
11734 wanese Mandarin uses Bopomofo and Han. These three combinations are
11735 treated as special cases when checking script runs and are, in effect,
11747 set. Some of these decimal digits are visually indistinguishable
11756 subjects are (by default) checked for validity on entry to the relevant
11762 In some situations, you may already know that your strings are valid,
11792 are no lookbehind assertions in the pattern, the check starts at the
11795 if there are not that many characters before the starting offset. Note
11796 that the sequences \b and \B are one-character lookbehinds.
11800 the surrogate area. The so-called "non-character" code points are not
11804 Characters in the "Surrogate Area" of Unicode are reserved for use by
11805 UTF-16, where they are used in pairs to encode code points with values
11806 greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
11807 are available independently in the UTF-8 and UTF-32 encodings. (In
11816 only in UTF-8 and UTF-32 modes, because these values are not repre-
11821 The following negative error codes are given for invalid UTF-8 strings:
11830 how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
11849 long; these code points are excluded by RFC 3629.
11854 are excluded by RFC 3629.
11859 range of code points are reserved by RFC 3629 for use with UTF-16, and
11860 so are excluded from UTF-8.
11887 The following negative error codes are given for invalid UTF-16
11897 The following negative error codes are given for invalid UTF-32
11914 and you are not certain that your subject strings are valid UTF se-
11938 string in the usual way. There are a few points to consider:
11940 The internal boundaries are not interpreted as the beginnings or ends
11954 trary data, knowing that any matched strings that are returned are
11961 such sequences are suitably aligned.