-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE2 man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. Neither has
the pcre2demo program. There are separate text files for the pcre2grep and
pcre2test commands.
-----------------------------------------------------------------------------



PCRE2(3)                   Library Functions Manual                   PCRE2(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


INTRODUCTION

       PCRE2 is the name used for a revised API for the PCRE library, which is
       a set of functions, written in C, that implement regular expression
       pattern matching using the same syntax and semantics as Perl, with just
       a few differences. After nearly two decades, the limitations of the
       original API were making development increasingly difficult. The new
       API is more extensible, and it was simplified by abolishing the
       separate "study" optimizing function; in PCRE2, patterns are
       automatically optimized where possible.
       Since forking from PCRE1, the code has been extensively refactored and
       new features introduced. The old library is now obsolete and is no
       longer maintained.

       As well as Perl-style regular expression patterns, some features that
       appeared in Python and the original PCRE before they appeared in Perl
       are available using the Python syntax. There is also some support for
       one or two .NET and Oniguruma syntax items, and there are options for
       requesting some minor changes that give better ECMAScript (aka
       JavaScript) compatibility.

       The source code for PCRE2 can be compiled to support strings of 8-bit,
       16-bit, or 32-bit code units, which means that up to three separate
       libraries may be installed, one for each code unit size. The size of
       code unit is not related to the bit size of the underlying hardware. In
       a 64-bit environment that also supports 32-bit applications, versions
       of PCRE2 that are compiled in both 64-bit and 32-bit modes may be
       needed.

       The original work to extend PCRE to 16-bit and 32-bit code units was
       done by Zoltan Herczeg and Christian Persch, respectively. In all three
       cases, strings can be interpreted either as one character per code
       unit, or as UTF-encoded Unicode, with support for Unicode general
       category properties. Unicode support is optional at build time (but is
       the default).
       However, processing strings as UTF code units must be enabled
       explicitly at run time. The version of Unicode in use can be discovered
       by running

         pcre2test -C

       The three libraries contain identical sets of functions, with names
       ending in _8, _16, or _32, respectively (for example,
       pcre2_compile_8()). However, by defining PCRE2_CODE_UNIT_WIDTH to be 8,
       16, or 32, a program that uses just one code unit width can be written
       using generic names such as pcre2_compile(), and the documentation is
       written assuming that this is the case.

       In addition to the Perl-compatible matching function, PCRE2 contains an
       alternative function that matches the same compiled patterns in a
       different way. In certain circumstances, the alternative function has
       some advantages. For a discussion of the two matching algorithms, see
       the pcre2matching page.

       Details of exactly which Perl regular expression features are and are
       not supported by PCRE2 are given in separate documents. See the
       pcre2pattern and pcre2compat pages. There is a syntax summary in the
       pcre2syntax page.

       Some features of PCRE2 can be included, excluded, or changed when the
       library is built. The pcre2_config() function makes it possible for a
       client to discover which features are available.
       The features themselves are described in the pcre2build page.
       Documentation about building PCRE2 for various operating systems can be
       found in the README and NON-AUTOTOOLS_BUILD files in the source
       distribution.

       The libraries contain a number of undocumented internal functions and
       data tables that are used by more than one of the exported external
       functions, but which are not intended for use by external callers.
       Their names all begin with "_pcre2", which hopefully will not provoke
       any name clashes. In some environments, it is possible to control which
       external symbols are exported when a shared library is built, and in
       these cases the undocumented symbols are not exported.


SECURITY CONSIDERATIONS

       If you are using PCRE2 in a non-UTF application that permits users to
       supply arbitrary patterns for compilation, you should be aware of a
       feature that allows users to turn on UTF support from within a pattern.
       For example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
       mode, which interprets patterns and subjects as strings of UTF-8 code
       units instead of individual 8-bit characters. This causes both the
       pattern and any data against which it is matched to be checked for
       UTF-8 validity.
       If the data string is very long, such a check might use enough
       resources to degrade the performance of your application.

       One way of guarding against this possibility is to use the
       pcre2_pattern_info() function to check the compiled pattern's options
       for PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option
       when calling pcre2_compile(). This causes a compile-time error if the
       pattern contains a UTF-setting sequence.

       The use of Unicode properties for character types such as \d can also
       be enabled from within the pattern, by specifying "(*UCP)". This
       feature can be disallowed by setting the PCRE2_NEVER_UCP option.

       If your application is one that supports UTF, be aware that validity
       checking can take time. If the same data string is to be matched many
       times, you can use the PCRE2_NO_UTF_CHECK option for the second and
       subsequent matches to avoid running redundant checks.

       The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
       to problems, because it may leave the current matching point in the
       middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C
       option can be used by an application to lock out the use of \C, causing
       a compile-time error if it is encountered. It is also possible to build
       PCRE2 with the use of \C permanently disabled.

       Another way that performance can be hit is by running a pattern that
       has a very large search tree against a string that will never match.
       Nested unlimited repeats in a pattern are a common example. PCRE2
       provides some protection against this: see the pcre2_set_match_limit()
       function in the pcre2api page. There is a similar function called
       pcre2_set_depth_limit() that can be used to restrict the amount of
       memory that is used.


USER DOCUMENTATION

       The user documentation for PCRE2 comprises a number of different
       sections. In the "man" format, each of these is a separate "man page".
       In the HTML format, each is a separate page, linked from the index
       page. In the plain text format, the descriptions of the pcre2grep and
       pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
       respectively. The remaining sections, except for the pcre2demo section
       (which is a program listing), and the short pages for individual
       functions, are concatenated in pcre2.txt, for ease of searching.
       The sections are as follows:

         pcre2              this document
         pcre2-config       show PCRE2 installation configuration information
         pcre2api           details of PCRE2's native C API
         pcre2build         building PCRE2
         pcre2callout       details of the pattern callout feature
         pcre2compat        discussion of Perl compatibility
         pcre2convert       details of pattern conversion functions
         pcre2demo          a demonstration C program that uses PCRE2
         pcre2grep          description of the pcre2grep command (8-bit only)
         pcre2jit           discussion of just-in-time optimization support
         pcre2limits        details of size and other limits
         pcre2matching      discussion of the two matching algorithms
         pcre2partial       details of the partial matching facility
         pcre2pattern       syntax and semantics of supported regular
                              expression patterns
         pcre2perform       discussion of performance issues
         pcre2posix         the POSIX-compatible C API for the 8-bit library
         pcre2sample        discussion of the pcre2demo program
         pcre2serialize     details of pattern serialization
         pcre2syntax        quick syntax reference
         pcre2test          description of the pcre2test command
         pcre2unicode       discussion of Unicode and UTF support

       In the "man" and HTML formats, there is also a short page for each C
       library function, listing its arguments and results.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.

       Putting an actual email address here is a spam magnet. If you want to
       email me, use my two names separated by a dot at gmail.com.


REVISION

       Last updated: 27 August 2021
       Copyright (c) 1997-2021 University of Cambridge.


PCRE2 10.38                     27 August 2021                        PCRE2(3)
------------------------------------------------------------------------------



PCRE2API(3)                Library Functions Manual                PCRE2API(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

       #include <pcre2.h>

       PCRE2 is a new API for PCRE, starting at release 10.0. This document
       contains a description of all its native functions. See the pcre2
       document for an overview of all the PCRE2 documentation.


PCRE2 NATIVE API BASIC FUNCTIONS

       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
         pcre2_compile_context *ccontext);

       void pcre2_code_free(pcre2_code *code);

       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
         pcre2_general_context *gcontext);

       pcre2_match_data *pcre2_match_data_create_from_pattern(
         const pcre2_code *code, pcre2_general_context *gcontext);

       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext,
         int *workspace, PCRE2_SIZE wscount);

       void pcre2_match_data_free(pcre2_match_data *match_data);


PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS

       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_match_data_size(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_match_data_heapframes_size(
         pcre2_match_data *match_data);

       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);

       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);


PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       void pcre2_general_context_free(pcre2_general_context *gcontext);


PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
         const uint8_t *tables);

       int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
         uint32_t extra_options);

       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
         PCRE2_SIZE value);

       int pcre2_set_max_pattern_compiled_length(
         pcre2_compile_context *ccontext, PCRE2_SIZE value);

       int pcre2_set_max_varlookbehind(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_newline(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
         int (*guard_function)(uint32_t, void *), void *user_data);


PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS

       pcre2_match_context *pcre2_match_context_create(
         pcre2_general_context *gcontext);

       pcre2_match_context *pcre2_match_context_copy(
         pcre2_match_context *mcontext);

       void pcre2_match_context_free(pcre2_match_context *mcontext);

       int pcre2_set_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_callout_block *, void *),
         void *callout_data);

       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_substitute_callout_block *, void *),
         void *callout_data);

       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
         PCRE2_SIZE value);

       int pcre2_set_heap_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_match_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_depth_limit(pcre2_match_context *mcontext,
         uint32_t value);


PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS

       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR *buffer,
         PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       int pcre2_substring_get_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR **bufferptr,
         PCRE2_SIZE *bufflen);

       int pcre2_substring_length_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_SIZE *length);

       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_SIZE *length);

       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

       int pcre2_substring_number_from_name(const pcre2_code *code,
         PCRE2_SPTR name);

       void pcre2_substring_list_free(PCRE2_UCHAR **list);

       int pcre2_substring_list_get(pcre2_match_data *match_data,
         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);


PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION

       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext, PCRE2_SPTR replacementz,
         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
         PCRE2_SIZE *outlengthptr);


PCRE2 NATIVE API JIT FUNCTIONS

       int pcre2_jit_compile(pcre2_code *code, uint32_t options);

       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       pcre2_jit_stack *pcre2_jit_stack_create(size_t startsize,
         size_t maxsize, pcre2_general_context *gcontext);

       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
         pcre2_jit_callback callback_function, void *callback_data);

       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);


PCRE2 NATIVE API SERIALIZATION FUNCTIONS

       int32_t pcre2_serialize_decode(pcre2_code **codes,
         int32_t number_of_codes, const uint8_t *bytes,
         pcre2_general_context *gcontext);

       int32_t pcre2_serialize_encode(const pcre2_code **codes,
         int32_t number_of_codes, uint8_t **serialized_bytes,
         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);

       void pcre2_serialize_free(uint8_t *bytes);

       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);


PCRE2 NATIVE API AUXILIARY FUNCTIONS

       pcre2_code *pcre2_code_copy(const pcre2_code *code);

       pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);

       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
         PCRE2_SIZE bufflen);

       const uint8_t *pcre2_maketables(pcre2_general_context *gcontext);

       void pcre2_maketables_free(pcre2_general_context *gcontext,
         const uint8_t *tables);

       int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
         void *where);

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       int pcre2_config(uint32_t what, void *where);


PCRE2 NATIVE API OBSOLETE FUNCTIONS

       int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_recursion_memory_management(
         pcre2_match_context *mcontext,
         void *(*private_malloc)(size_t, void *),
         void (*private_free)(void *, void *), void *memory_data);

       These functions became obsolete at release 10.30 and are retained only
       for backward compatibility. They should not be used in new code. The
       first is replaced by pcre2_set_depth_limit(); the second is no longer
       needed and has no effect (it always returns zero).


PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS

       pcre2_convert_context *pcre2_convert_context_create(
         pcre2_general_context *gcontext);

       pcre2_convert_context *pcre2_convert_context_copy(
         pcre2_convert_context *cvcontext);

       void pcre2_convert_context_free(pcre2_convert_context *cvcontext);

       int pcre2_set_glob_escape(pcre2_convert_context *cvcontext,
         uint32_t escape_char);

       int pcre2_set_glob_separator(pcre2_convert_context *cvcontext,
         uint32_t separator_char);

       int pcre2_pattern_convert(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, PCRE2_UCHAR **buffer,
         PCRE2_SIZE *blength, pcre2_convert_context *cvcontext);

       void pcre2_converted_pattern_free(PCRE2_UCHAR *converted_pattern);

       These functions provide a way of converting non-PCRE2 patterns into
       patterns that can be processed by pcre2_compile(). This facility is
       experimental and may be changed in future releases.
       At present, "globs" and POSIX basic and extended patterns can be
       converted. Details are given in the pcre2convert documentation.


PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES

       There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
       code units, respectively. However, there is just one header file,
       pcre2.h. This contains the function prototypes and other definitions
       for all three libraries. One, two, or all three can be installed
       simultaneously. On Unix-like systems the libraries are called
       libpcre2-8, libpcre2-16, and libpcre2-32, and they can also co-exist
       with the original PCRE libraries. Every PCRE2 function comes in three
       different forms, one for each library, for example:

         pcre2_compile_8()
         pcre2_compile_16()
         pcre2_compile_32()

       There are also three different sets of data types:

         PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
         PCRE2_SPTR8,  PCRE2_SPTR16,  PCRE2_SPTR32

       The UCHAR types define unsigned code units of the appropriate widths.
       For example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR
       types are pointers to constants of the equivalent UCHAR types, that is,
       they are pointers to vectors of unsigned code units.

       Character strings are passed to a PCRE2 library as sequences of un-
       signed integers in code units of the appropriate width. The length of a
       string may be given as a number of code units, or the string may be
       specified as zero-terminated.

       Many applications use only one code unit width. For their convenience,
       macros are defined whose names are the generic forms such as pcre2_com-
       pile() and PCRE2_SPTR. These macros use the value of the macro
       PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific func-
       tion and macro names. PCRE2_CODE_UNIT_WIDTH is not defined by default.
       An application must define it to be 8, 16, or 32 before including
       pcre2.h in order to make use of the generic names.

       Applications that use more than one code unit width can be linked with
       more than one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to
       be 0 before including pcre2.h, and then use the real function names.
       Any code that is to be included in an environment where the value of
       PCRE2_CODE_UNIT_WIDTH is unknown should also use the real function
       names. (Unfortunately, it is not possible in C code to save and restore
       the value of a macro.)

       If PCRE2_CODE_UNIT_WIDTH is not defined before including pcre2.h, a
       compiler error occurs.

       When using multiple libraries in an application, you must take care
       when processing any particular pattern to use only functions from a
       single library. For example, if you want to run a match using a pat-
       tern that was compiled with pcre2_compile_16(), you must do so with
       pcre2_match_16(), not pcre2_match_8() or pcre2_match_32().

       In the function summaries above, and in the rest of this document and
       other PCRE2 documents, functions and data types are described using
       their generic names, without the _8, _16, or _32 suffix.


PCRE2 API OVERVIEW

       PCRE2 has its own native API, which is described in this document.
       There are also some wrapper functions for the 8-bit library that corre-
       spond to the POSIX regular expression API, but they do not give access
       to all the functionality of PCRE2 and they are not thread-safe. They
       are described in the pcre2posix documentation. Both these APIs define a
       set of C function calls.

       The native API C data types, function prototypes, option values, and
       error codes are defined in the header file pcre2.h, which also contains
       definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release
       numbers for the library. Applications can use these to include support
       for different releases of PCRE2.

       In a Windows environment, if you want to statically link an application
       program against a non-dll PCRE2 library, you must define PCRE2_STATIC
       before including pcre2.h.

       The functions pcre2_compile() and pcre2_match() are used for compiling
       and matching regular expressions in a Perl-compatible manner. A sample
       program that demonstrates the simplest way of using them is provided in
       the file called pcre2demo.c in the PCRE2 source distribution. A listing
       of this program is given in the pcre2demo documentation, and the
       pcre2sample documentation describes how to compile and run it.

       The compiling and matching functions recognize various options that are
       passed as bits in an options argument. There are also some more compli-
       cated parameters such as custom memory management functions and re-
       source limits that are passed in "contexts" (which are just memory
       blocks, described below). Simple applications do not need to make use
       of contexts.

       Just-in-time (JIT) compiler support is an optional feature of PCRE2
       that can be built in appropriate hardware environments. It greatly
       speeds up the matching performance of many patterns. Programs can re-
       quest that it be used if available by calling pcre2_jit_compile() after
       a pattern has been successfully compiled by pcre2_compile().
       This does
       nothing if JIT support is not available.

       More complicated programs might need to make use of the specialist
       functions pcre2_jit_stack_create(), pcre2_jit_stack_free(), and
       pcre2_jit_stack_assign() in order to control the JIT code's memory us-
       age.

       JIT matching is automatically used by pcre2_match() if it is available,
       unless the PCRE2_NO_JIT option is set. There is also a direct interface
       for JIT matching, which gives improved performance at the expense of
       less sanity checking. The JIT-specific functions are discussed in the
       pcre2jit documentation.

       A second matching function, pcre2_dfa_match(), which is not Perl-com-
       patible, is also provided. This uses a different algorithm for the
       matching. The alternative algorithm finds all possible matches (at a
       given point in the subject), and scans the subject just once (unless
       there are lookaround assertions). However, this algorithm does not re-
       turn captured substrings. A description of the two matching algorithms
       and their advantages and disadvantages is given in the pcre2matching
       documentation. There is no JIT support for pcre2_dfa_match().

       In addition to the main compiling and matching functions, there are
       convenience functions for extracting captured substrings from a subject
       string that has been matched by pcre2_match().
       They are:

         pcre2_substring_copy_byname()
         pcre2_substring_copy_bynumber()
         pcre2_substring_get_byname()
         pcre2_substring_get_bynumber()
         pcre2_substring_list_get()
         pcre2_substring_length_byname()
         pcre2_substring_length_bynumber()
         pcre2_substring_nametable_scan()
         pcre2_substring_number_from_name()

       pcre2_substring_free() and pcre2_substring_list_free() are also pro-
       vided, to free memory used for extracted strings. If either of these
       functions is called with a NULL argument, the function returns immedi-
       ately without doing anything.

       The function pcre2_substitute() can be called to match a pattern and
       return a copy of the subject string with substitutions for parts that
       were matched.

       Functions whose names begin with pcre2_serialize_ are used for saving
       compiled patterns on disc or elsewhere, and reloading them later.

       Finally, there are functions for finding out information about a com-
       piled pattern (pcre2_pattern_info()) and about the configuration with
       which PCRE2 was built (pcre2_config()).

       Functions with names ending with _free() are used for freeing memory
       blocks of various sorts.
       In all cases, if one of these functions is
       called with a NULL argument, it does nothing.


STRING LENGTHS AND OFFSETS

       The PCRE2 API uses string lengths and offsets into strings of code
       units in several places. These values are always of type PCRE2_SIZE,
       which is an unsigned integer type, currently always defined as size_t.
       The largest value that can be stored in such a type (that is
       ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated
       strings and unset offsets. Therefore, the longest string that can be
       handled is one less than this maximum. Note that string lengths are al-
       ways given in code units. Only in the 8-bit library is such a length
       the same as the number of bytes in the string.


NEWLINES

       PCRE2 supports five different conventions for indicating line breaks in
       strings: a single CR (carriage return) character, a single LF (line-
       feed) character, the two-character sequence CRLF, any of the three pre-
       ceding, or any Unicode newline sequence. The Unicode newline sequences
       are the three just mentioned, plus the single characters VT (vertical
       tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
       separator, U+2028), and PS (paragraph separator, U+2029).

       Each of the first three conventions is used by at least one operating
       system as its standard newline sequence. When PCRE2 is built, a default
       can be specified. If it is not, the default is set to LF, which is the
       Unix standard. However, the newline convention can be changed by an ap-
       plication when calling pcre2_compile(), or it can be specified by spe-
       cial text at the start of the pattern itself; this overrides any other
       settings. See the pcre2pattern page for details of the special charac-
       ter sequences.

       In the PCRE2 documentation the word "newline" is used to mean "the
       character or pair of characters that indicate a line break". The choice
       of newline convention affects the handling of the dot, circumflex, and
       dollar metacharacters, the handling of #-comments in /x mode, and, when
       CRLF is a recognized line ending sequence, the match position advance-
       ment for a non-anchored pattern. There is more detail about this in the
       section on pcre2_match() options below.

       The choice of newline convention does not affect the interpretation of
       the \n or \r escape sequences, nor does it affect what \R matches; this
       has its own separate convention.


MULTITHREADING

       In a multithreaded application it is important to keep thread-specific
       data separate from data that can be shared between threads. The PCRE2
       library code itself is thread-safe: it contains no static or global
       variables. The API is designed to be fairly simple for non-threaded ap-
       plications while at the same time ensuring that multithreaded applica-
       tions can use it.

       There are several different blocks of data that are used to pass infor-
       mation between the application and the PCRE2 libraries.

   The compiled pattern

       A pointer to the compiled form of a pattern is returned to the user
       when pcre2_compile() is successful. The data in the compiled pattern is
       fixed, and does not change when the pattern is matched. Therefore, it
       is thread-safe, that is, the same compiled pattern can be used by more
       than one thread simultaneously. For example, an application can compile
       all its patterns at the start, before forking off multiple threads that
       use them. However, if the just-in-time (JIT) optimization feature is
       being used, it needs separate memory stack areas for each thread. See
       the pcre2jit documentation for more details.

       In a more complicated situation, where patterns are compiled only when
       they are first needed, but are still shared between threads, pointers
       to compiled patterns must be protected from simultaneous writing by
       multiple threads. This is somewhat tricky to do correctly. If you know
       that writing to a pointer is atomic in your environment, you can use
       logic like this:

         Get a read-only (shared) lock (mutex) for pointer
         if (pointer == NULL)
           {
           Get a write (unique) lock for pointer
           if (pointer == NULL) pointer = pcre2_compile(...
           }
         Release the lock
         Use pointer in pcre2_match()

       Of course, testing for compilation errors should also be included in
       the code.

       The reason for checking the pointer a second time is as follows: Sev-
       eral threads may have acquired the shared lock and tested the pointer
       for being NULL, but only one of them will be given the write lock, with
       the rest kept waiting. The winning thread will compile the pattern and
       store the result. After this thread releases the write lock, another
       thread will get it, and if it does not retest pointer for being NULL,
       will recompile the pattern and overwrite the pointer, creating a memory
       leak and possibly causing other issues.

       In an environment where writing to a pointer may not be atomic, the
       above logic is not sufficient. The thread that is doing the compiling
       may be descheduled after writing only part of the pointer, which could
       cause other threads to use an invalid value. Instead of checking the
       pointer itself, a separate "pointer is valid" flag (that can be updated
       atomically) must be used:

         Get a read-only (shared) lock (mutex) for pointer
         if (!pointer_is_valid)
           {
           Get a write (unique) lock for pointer
           if (!pointer_is_valid)
             {
             pointer = pcre2_compile(...
             pointer_is_valid = TRUE
             }
           }
         Release the lock
         Use pointer in pcre2_match()

       If JIT is being used, but the JIT compilation is not being done immedi-
       ately (perhaps waiting to see if the pattern is used often enough),
       similar logic is required. JIT compilation updates a value within the
       compiled code block, so a thread must gain unique write access to the
       pointer before calling pcre2_jit_compile(). Alternatively,
       pcre2_code_copy() or pcre2_code_copy_with_tables() can be used to ob-
       tain a private copy of the compiled code before calling the JIT com-
       piler.

   Context blocks

       The next main section below introduces the idea of "contexts" in which
       PCRE2 functions are called. A context is nothing more than a collection
       of parameters that control the way PCRE2 operates. Grouping a number of
       parameters together in a context is a convenient way of passing them to
       a PCRE2 function without using lots of arguments. The parameters that
       are stored in contexts are in some sense "advanced features" of the
       API. Many straightforward applications will not need to use contexts.

       In a multithreaded application, if the parameters in a context are val-
       ues that are never changed, the same context can be used by all the
       threads. However, if any thread needs to change any value in a context,
       it must make its own thread-specific copy.

   Match blocks

       The matching functions need a block of memory for storing the results
       of a match. This includes details of what was matched, as well as addi-
       tional information such as the name of a (*MARK) setting. Each thread
       must provide its own copy of this memory.


PCRE2 CONTEXTS

       Some PCRE2 functions have a lot of parameters, many of which are used
       only by specialist applications, for example, those that use custom
       memory management or non-standard character tables. To keep function
       argument lists at a reasonable size, and at the same time to keep the
       API extensible, "uncommon" parameters are passed to certain functions
       in a context instead of directly. A context is just a block of memory
       that holds the parameter values. Applications that do not need to ad-
       just any of the context parameters can pass NULL when a context pointer
       is required.

       There are three different types of context: a general context that is
       relevant for several PCRE2 operations, a compile-time context, and a
       match-time context.

   The general context

       At present, this context just contains pointers to (and data for) ex-
       ternal memory management functions that are called from several places
       in the PCRE2 library. The context is named `general' rather than
       specifically `memory' because in future other fields may be added. If
       you do not want to supply your own custom memory management functions,
       you do not need to bother with a general context.
       A general context is
       created by:

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       The two function pointers specify custom memory management functions,
       whose prototypes are:

         void *private_malloc(PCRE2_SIZE, void *);
         void private_free(void *, void *);

       Whenever code in PCRE2 calls these functions, the final argument is the
       value of memory_data. Either of the first two arguments of the creation
       function may be NULL, in which case the system memory management func-
       tions malloc() and free() are used. (This is not currently useful, as
       there are no other fields in a general context, but in future there
       might be.) The private_malloc() function is used (if supplied) to ob-
       tain memory for storing the context, and all three values are saved as
       part of the context.

       Whenever PCRE2 creates a data block of any kind, the block contains a
       pointer to the free() function that matches the malloc() function that
       was used. When the time comes to free the block, this function is
       called.

       A general context can be copied by calling:

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       The memory used for a general context should be freed by calling:

       void pcre2_general_context_free(pcre2_general_context *gcontext);

       If this function is passed a NULL argument, it returns immediately
       without doing anything.

   The compile context

       A compile context is required if you want to provide an external func-
       tion for stack checking during compilation or to change the default
       values of any of the following compile-time parameters:

         What \R matches (Unicode newlines or CR, LF, CRLF only)
         PCRE2's character tables
         The newline character sequence
         The compile time nested parentheses limit
         The maximum length of the pattern string
         The extra options bits (none set by default)

       A compile context is also required if you are using custom memory man-
       agement. If none of these apply, just pass NULL as the context argu-
       ment of pcre2_compile().

       A compile context is created, copied, and freed by the following func-
       tions:

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       A compile context is created with default values for its parameters.
       These can be changed by calling the following functions, which return 0
       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only
       CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any
       Unicode line ending sequence. The value is used by the JIT compiler and
       by the two interpreted matching functions, pcre2_match() and
       pcre2_dfa_match().

       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
         const uint8_t *tables);

       The value must be the result of a call to pcre2_maketables(), whose
       only argument is a general context.
       This function builds a set of char-
       acter tables in the current locale.

       int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
         uint32_t extra_options);

       As PCRE2 has developed, almost all the 32 option bits that are avail-
       able in the options argument of pcre2_compile() have been used up. To
       avoid running out, the compile context contains a set of extra option
       bits which are used for some newer, assumed rarer, options. This func-
       tion sets those bits. It always sets all the bits (either on or off);
       that is, it replaces the whole set and does not preserve any existing
       setting. The available options are defined in the section entitled
       "Extra compile options" below.

       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
         PCRE2_SIZE value);

       This sets a maximum length, in code units, for any pattern string that
       is compiled with this context. If the pattern is longer, an error is
       generated. This facility is provided so that applications that accept
       patterns from external sources can limit their size. The default is the
       largest number that a PCRE2_SIZE variable can hold, which is effec-
       tively unlimited.

       int pcre2_set_max_pattern_compiled_length(
         pcre2_compile_context *ccontext, PCRE2_SIZE value);

       This sets a maximum size, in bytes, for the memory needed to hold the
       compiled version of a pattern that is compiled with this context. If
       the pattern needs more memory, an error is generated. This facility is
       provided so that applications that accept patterns from external
       sources can limit the amount of memory they use. The default is the
       largest number that a PCRE2_SIZE variable can hold, which is effec-
       tively unlimited.

       int pcre2_set_max_varlookbehind(pcre2_compile_context *ccontext,
         uint32_t value);

       This sets a maximum length for the number of characters matched by a
       variable-length lookbehind assertion. The default is set when PCRE2 is
       built, with the ultimate default being 255, the same as Perl. Lookbe-
       hind assertions without a bounding length are not supported.

       int pcre2_set_newline(pcre2_compile_context *ccontext,
         uint32_t value);

       This specifies which characters or character sequences are to be recog-
       nized as newlines.
       The value must be one of PCRE2_NEWLINE_CR (carriage
       return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the
       two-character sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
       of the above), PCRE2_NEWLINE_ANY (any Unicode newline sequence), or
       PCRE2_NEWLINE_NUL (the NUL character, that is a binary zero).

       A pattern can override the value set in the compile context by starting
       with a sequence such as (*CRLF). See the pcre2pattern page for details.

       When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EX-
       TENDED_MORE option, the newline convention affects the recognition of
       the end of internal comments starting with #. The value is saved with
       the compiled pattern for subsequent use by the JIT compiler and by the
       two interpreted matching functions, pcre2_match() and
       pcre2_dfa_match().

       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
         uint32_t value);

       This parameter adjusts the limit, set when PCRE2 is built (default
       250), on the depth of parenthesis nesting in a pattern. This limit
       stops rogue patterns using up too much system stack when being com-
       piled. The limit applies to parentheses of all kinds, not just captur-
       ing parentheses.

       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
         int (*guard_function)(uint32_t, void *), void *user_data);

       There is at least one application that runs PCRE2 in threads with very
       limited system stack, where running out of stack is to be avoided at
       all costs. The parenthesis limit above cannot take account of how much
       stack is actually available during compilation. For finer control,
       you can supply a function that is called whenever pcre2_compile()
       starts to compile a parenthesized part of a pattern. This function can
       check the actual stack size (or anything else that it wants to, of
       course).

       The first argument to the guard function gives the current depth of
       nesting, and the second is user data that is set up by the last argu-
       ment of pcre2_set_compile_recursion_guard(). The guard function
       should return zero if all is well, or non-zero to force an error.

   The match context

       A match context is required if you want to:

         Set up a callout function
         Set an offset limit for matching an unanchored pattern
         Change the limit on the amount of heap used when matching
         Change the backtracking match limit
         Change the backtracking depth limit
         Set custom memory management specifically for the match

       If none of these apply, just pass NULL as the context argument of
       pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().

       A match context is created, copied, and freed by the following func-
       tions:

       pcre2_match_context *pcre2_match_context_create(
         pcre2_general_context *gcontext);

       pcre2_match_context *pcre2_match_context_copy(
         pcre2_match_context *mcontext);

       void pcre2_match_context_free(pcre2_match_context *mcontext);

       A match context is created with default values for its parameters.
       These can be changed by calling the following functions, which return 0
       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.

       int pcre2_set_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_callout_block *, void *),
         void *callout_data);

       This sets up a callout function for PCRE2 to call at specified points
       during a matching operation. Details are given in the pcre2callout doc-
       umentation.

       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_substitute_callout_block *, void *),
         void *callout_data);

       This sets up a callout function for PCRE2 to call after each substitu-
       tion made by pcre2_substitute(). Details are given in the section enti-
       tled "Creating a new string with substitutions" below.

       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
         PCRE2_SIZE value);

       The offset_limit parameter limits how far an unanchored search can ad-
       vance in the subject string. The default value is PCRE2_UNSET. The
       pcre2_match() and pcre2_dfa_match() functions return
       PCRE2_ERROR_NOMATCH if a match with a starting point before or at the
       given offset is not found. The pcre2_substitute() function makes no
       more substitutions.

       For example, if the pattern /abc/ is matched against "123abc" with an
       offset limit less than 3, the result is PCRE2_ERROR_NOMATCH. A match
       can never be found if the startoffset argument of pcre2_match(),
       pcre2_dfa_match(), or pcre2_substitute() is greater than the offset
       limit set in the match context.

       When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT op-
       tion when calling pcre2_compile() so that when JIT is in use, different
       code can be compiled. If a match is started with a non-default offset
       limit when PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.

       The offset limit facility can be used to track progress when searching
       large subject strings or to limit the extent of global substitutions.
       See also the PCRE2_FIRSTLINE option, which requires a match to start
       before or at the first newline that follows the start of matching in
       the subject. If this is set with an offset limit, a match must occur in
       the first line and also within the offset limit. In other words,
       whichever limit comes first is used.

       int pcre2_set_heap_limit(pcre2_match_context *mcontext,
         uint32_t value);

       The heap_limit parameter specifies, in units of kibibytes (1024 bytes),
       the maximum amount of heap memory that pcre2_match() may use to hold
       backtracking information when running an interpretive match. This limit
       also applies to pcre2_dfa_match(), which may use the heap when process-
       ing patterns with a lot of nested pattern recursion or lookarounds or
       atomic groups. This limit does not apply to matching with the JIT opti-
       mization, which has its own memory control arrangements (see the
       pcre2jit documentation for more details). If the limit is reached, the
       negative error code PCRE2_ERROR_HEAPLIMIT is returned. The default
       limit can be set when PCRE2 is built; if it is not, the default is set
       very large and is essentially unlimited.

       A value for the heap limit may also be supplied by an item at the start
       of a pattern of the form

         (*LIMIT_HEAP=ddd)

       where ddd is a decimal number. However, such a setting is ignored un-
       less ddd is less than the limit set by the caller of pcre2_match() or,
       if no such limit is set, less than the default.

       The pcre2_match() function always needs some heap memory, so setting a
       value of zero guarantees a "heap limit exceeded" error. Details of how
       pcre2_match() uses the heap are given in the pcre2perform documenta-
       tion.

       For pcre2_dfa_match(), a vector on the system stack is used when pro-
       cessing pattern recursions, lookarounds, or atomic groups, and only if
       this is not big enough is heap memory used. In this case, setting a
       value of zero disables the use of the heap.

       int pcre2_set_match_limit(pcre2_match_context *mcontext,
         uint32_t value);

       The match_limit parameter provides a means of preventing PCRE2 from us-
       ing up too many computing resources when processing patterns that are
       not going to match, but which have a very large number of possibilities
       in their search trees. The classic example is a pattern that uses
       nested unlimited repeats.

       There is an internal counter in pcre2_match() that is incremented each
       time round its main matching loop. If this value reaches the match
       limit, pcre2_match() returns the negative value PCRE2_ERROR_MATCHLIMIT.
       This has the effect of limiting the amount of backtracking that can
       take place. For patterns that are not anchored, the count restarts from
       zero for each position in the subject string.
       This limit also applies
       to pcre2_dfa_match(), though the counting is done in a different way.

       When pcre2_match() is called with a pattern that was successfully
       processed by pcre2_jit_compile(), the way in which matching is executed
       is entirely different. However, there is still the possibility of run-
       away matching that goes on for a very long time, and so the match_limit
       value is also used in this case (but in a different way) to limit how
       long the matching can continue.

       The default value for the limit can be set when PCRE2 is built; the de-
       fault is 10 million, which handles all but the most extreme cases. A
       value for the match limit may also be supplied by an item at the start
       of a pattern of the form

         (*LIMIT_MATCH=ddd)

       where ddd is a decimal number. However, such a setting is ignored un-
       less ddd is less than the limit set by the caller of pcre2_match() or
       pcre2_dfa_match() or, if no such limit is set, less than the default.

       int pcre2_set_depth_limit(pcre2_match_context *mcontext,
         uint32_t value);

       This parameter limits the depth of nested backtracking in
       pcre2_match(). Each time a nested backtracking point is passed, a new
       memory frame is used to remember the state of matching at that point.
       Thus, this parameter indirectly limits the amount of memory that is
       used in a match. However, because the size of each memory frame depends
       on the number of capturing parentheses, the actual memory limit varies
       from pattern to pattern. This limit was more useful in versions before
       10.30, where function recursion was used for backtracking.

       The depth limit is not relevant, and is ignored, when matching is done
       using JIT compiled code. However, it is supported by pcre2_dfa_match(),
       which uses it to limit the depth of nested internal recursive function
       calls that implement atomic groups, lookaround assertions, and pattern
       recursions. This limits, indirectly, the amount of system stack that is
       used. It was more useful in versions before 10.32, when stack memory
       was used for local workspace vectors for recursive function calls. From
       version 10.32, only local variables are allocated on the stack and as
       each call uses only a few hundred bytes, even a small stack can support
       quite a lot of recursion.

       If the depth of internal recursive function calls is great enough, lo-
       cal workspace vectors are allocated on the heap from version 10.32 on-
       wards, so the depth limit also indirectly limits the amount of heap
       memory that is used.
       A recursive pattern such as /(.(?2))((?1)|)/, when
       matched to a very long string using pcre2_dfa_match(), can use a great
       deal of memory. However, it is probably better to limit heap usage di-
       rectly by calling pcre2_set_heap_limit().

       The default value for the depth limit can be set when PCRE2 is built;
       if it is not, the default is set to the same value as the default for
       the match limit. If the limit is exceeded, pcre2_match() or
       pcre2_dfa_match() returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth
       limit may also be supplied by an item at the start of a pattern of the
       form

         (*LIMIT_DEPTH=ddd)

       where ddd is a decimal number. However, such a setting is ignored un-
       less ddd is less than the limit set by the caller of pcre2_match() or
       pcre2_dfa_match() or, if no such limit is set, less than the default.


CHECKING BUILD-TIME OPTIONS

       int pcre2_config(uint32_t what, void *where);

       The function pcre2_config() makes it possible for a PCRE2 client to
       find the value of certain configuration parameters and to discover
       which optional features have been compiled into the PCRE2 library. The
       pcre2build documentation has more details about these features.

       The first argument for pcre2_config() specifies which information is
       required. The second argument is a pointer to memory into which the in-
       formation is placed. If NULL is passed, the function returns the amount
       of memory that is needed for the requested information. For calls that
       return numerical values, the value is in bytes; when requesting these
       values, where should point to appropriately aligned memory. For calls
       that return strings, the required length is given in code units, not
       counting the terminating zero.

       When requesting information, the returned value from pcre2_config() is
       non-negative on success, or the negative error code
       PCRE2_ERROR_BADOPTION if the value in the first argument is not
       recognized. The following information is available:

         PCRE2_CONFIG_BSR

       The output is a uint32_t integer whose value indicates what character
       sequences the \R escape sequence matches by default. A value of
       PCRE2_BSR_UNICODE means that \R matches any Unicode line ending se-
       quence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR, LF,
       or CRLF. The default can be overridden when a pattern is compiled.

         PCRE2_CONFIG_COMPILED_WIDTHS

       The output is a uint32_t integer whose lower bits indicate which code
       unit widths were selected when PCRE2 was built. The 1-bit indicates
       8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup-
       port, respectively.

         PCRE2_CONFIG_DEPTHLIMIT

       The output is a uint32_t integer that gives the default limit for the
       depth of nested backtracking in pcre2_match() or the depth of nested
       recursions, lookarounds, and atomic groups in pcre2_dfa_match(). Fur-
       ther details are given with pcre2_set_depth_limit() above.

         PCRE2_CONFIG_HEAPLIMIT

       The output is a uint32_t integer that gives, in kibibytes, the default
       limit for the amount of heap memory used by pcre2_match() or
       pcre2_dfa_match(). Further details are given with
       pcre2_set_heap_limit() above.

         PCRE2_CONFIG_JIT

       The output is a uint32_t integer that is set to one if support for
       just-in-time compiling is included in the library; otherwise it is set
       to zero. Note that having the support in the library does not guarantee
       that JIT will be used for any given match. See the pcre2jit documenta-
       tion for more details.

         PCRE2_CONFIG_JITTARGET

       The where argument should point to a buffer that is at least 48 code
       units long. (The exact length required can be found by calling
       pcre2_config() with where set to NULL.) The buffer is filled with a
       string that contains the name of the architecture for which the JIT
       compiler is configured, for example
       "x86 32bit (little endian + unaligned)". If JIT support is not
       available, PCRE2_ERROR_BADOPTION is returned, otherwise the number of
       code units used is returned. This is the length of the string, plus
       one unit for the terminating zero.

         PCRE2_CONFIG_LINKSIZE

       The output is a uint32_t integer that contains the number of bytes used
       for internal linkage in compiled regular expressions. When PCRE2 is
       configured, the value can be set to 2, 3, or 4, with the default being
       2. This is the value that is returned by pcre2_config(). However, when
       the 16-bit library is compiled, a value of 3 is rounded up to 4, and
       when the 32-bit library is compiled, internal linkages always use 4
       bytes, so the configured value is not relevant.

       The default value of 2 for the 8-bit and 16-bit libraries is sufficient
       for all but the most massive patterns, since it allows the size of the
       compiled pattern to be up to 65535 code units.
       Larger values allow
       larger regular expressions to be compiled by those two libraries, but
       at the expense of slower matching.

         PCRE2_CONFIG_MATCHLIMIT

       The output is a uint32_t integer that gives the default match limit for
       pcre2_match(). Further details are given with pcre2_set_match_limit()
       above.

         PCRE2_CONFIG_NEWLINE

       The output is a uint32_t integer whose value specifies the default
       character sequence that is recognized as meaning "newline". The values
       are:

         PCRE2_NEWLINE_CR       Carriage return (CR)
         PCRE2_NEWLINE_LF       Linefeed (LF)
         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
         PCRE2_NEWLINE_ANY      Any Unicode line ending
         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
         PCRE2_NEWLINE_NUL      The NUL character (binary zero)

       The default should normally correspond to the standard sequence for
       your operating system.

         PCRE2_CONFIG_NEVER_BACKSLASH_C

       The output is a uint32_t integer that is set to one if the use of \C
       was permanently disabled when PCRE2 was built; otherwise it is set to
       zero.

         PCRE2_CONFIG_PARENSLIMIT

       The output is a uint32_t integer that gives the maximum depth of nest-
       ing of parentheses (of any kind) in a pattern. This limit is imposed to
       cap the amount of system stack used when a pattern is compiled. It is
       specified when PCRE2 is built; the default is 250. This limit does not
       take into account the stack that may already be used by the calling ap-
       plication. For finer control over compilation stack usage, see
       pcre2_set_compile_recursion_guard().

         PCRE2_CONFIG_STACKRECURSE

       This parameter is obsolete and should not be used in new code. The out-
       put is a uint32_t integer that is always set to zero.

         PCRE2_CONFIG_TABLES_LENGTH

       The output is a uint32_t integer that gives the length of PCRE2's char-
       acter processing tables in bytes. For details of these tables see the
       section on locale support below.

         PCRE2_CONFIG_UNICODE_VERSION

       The where argument should point to a buffer that is at least 24 code
       units long. (The exact length required can be found by calling
       pcre2_config() with where set to NULL.) If PCRE2 has been compiled
       without Unicode support, the buffer is filled with the text "Unicode
       not supported".
       Otherwise, the Unicode version string (for example,
       "8.0.0") is inserted. The number of code units used is returned. This
       is the length of the string plus one unit for the terminating zero.

         PCRE2_CONFIG_UNICODE

       The output is a uint32_t integer that is set to one if Unicode support
       is available; otherwise it is set to zero. Unicode support implies UTF
       support.

         PCRE2_CONFIG_VERSION

       The where argument should point to a buffer that is at least 24 code
       units long. (The exact length required can be found by calling
       pcre2_config() with where set to NULL.) The buffer is filled with the
       PCRE2 version string, zero-terminated. The number of code units used is
       returned. This is the length of the string plus one unit for the termi-
       nating zero.


COMPILING A PATTERN

       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
         pcre2_compile_context *ccontext);

       void pcre2_code_free(pcre2_code *code);

       pcre2_code *pcre2_code_copy(const pcre2_code *code);

       pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);

       The pcre2_compile() function compiles a pattern into an internal form.
       The pattern is defined by a pointer to a string of code units and a
       length in code units. If the pattern is zero-terminated, the length can
       be specified as PCRE2_ZERO_TERMINATED. A NULL pattern pointer with a
       length of zero is treated as an empty string (NULL with a non-zero
       length causes an error return). The function returns a pointer to a
       block of memory that contains the compiled pattern and related data, or
       NULL if an error occurred.

       If the compile context argument ccontext is NULL, memory for the com-
       piled pattern is obtained by calling malloc(). Otherwise, it is ob-
       tained from the same memory function that was used for the compile con-
       text. The caller must free the memory by calling pcre2_code_free() when
       it is no longer needed.
       If pcre2_code_free() is called with a NULL ar-
       gument, it returns immediately, without doing anything.

       The function pcre2_code_copy() makes a copy of the compiled code in new
       memory, using the same memory allocator as was used for the original.
       However, if the code has been processed by the JIT compiler (see be-
       low), the JIT information cannot be copied (because it is position-de-
       pendent). The new copy can initially be used only for non-JIT match-
       ing, though it can be passed to pcre2_jit_compile() if required. If
       pcre2_code_copy() is called with a NULL argument, it returns NULL.

       The pcre2_code_copy() function provides a way for individual threads in
       a multithreaded application to acquire a private copy of shared com-
       piled code. However, it does not make a copy of the character tables
       used by the compiled pattern; the new pattern code points to the same
       tables as the original code. (See "Locale Support" below for details
       of these character tables.) In many applications the same tables are
       used throughout, so this behaviour is appropriate. Nevertheless, there
       are occasions when a copy of a compiled pattern and the relevant tables
       are needed. The pcre2_code_copy_with_tables() function provides this
       facility. Copies of both the code and the tables are made, with the
       new code pointing to the new tables.
       The memory for the new tables is automati-
       cally freed when pcre2_code_free() is called for the new copy of the
       compiled code. If pcre2_code_copy_with_tables() is called with a NULL
       argument, it returns NULL.

       NOTE: When one of the matching functions is called, pointers to the
       compiled pattern and the subject string are set in the match data block
       so that they can be referenced by the substring extraction functions
       after a successful match. After running a match, you must not free a
       compiled pattern or a subject string until after all operations on the
       match data block have taken place, unless, in the case of the subject
       string, you have used the PCRE2_COPY_MATCHED_SUBJECT option, which is
       described in the section entitled "Option bits for pcre2_match()" be-
       low.

       The options argument for pcre2_compile() contains various bit settings
       that affect the compilation. It should be zero if none of them are re-
       quired. The available options are described below. Some of them (in
       particular, those that are compatible with Perl, but some others as
       well) can also be set and unset from within the pattern (see the de-
       tailed description in the pcre2pattern documentation).

       For those options that can be different in different parts of the
       pattern, the contents of the options argument specify their settings at
       the start of compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and
       PCRE2_NO_UTF_CHECK options can be set at the time of matching as well
       as at compile time.

       Some additional options and less frequently required compile-time
       parameters (for example, the newline setting) can be provided in a
       compile context (as described above).

       If errorcode or erroroffset is NULL, pcre2_compile() returns NULL
       immediately. Otherwise, the variables to which these point are set to
       an error code and an offset (number of code units) within the pattern,
       respectively, when pcre2_compile() returns NULL because a compilation
       error has occurred.

       There are nearly 100 positive error codes that pcre2_compile() may
       return if it finds an error in the pattern. There are also some
       negative error codes that are used for invalid UTF strings when
       validity checking is in force. These are the same as given by
       pcre2_match() and pcre2_dfa_match(), and are described in the
       pcre2unicode documentation.
       There is no separate documentation for the positive error codes,
       because the textual error messages that are obtained by calling the
       pcre2_get_error_message() function (see "Obtaining a textual error
       message" below) should be self-explanatory. Macro names starting with
       PCRE2_ERROR_ are defined for both positive and negative error codes in
       pcre2.h. When compilation is successful, errorcode is set to a value
       that returns the message "no error" if passed to
       pcre2_get_error_message().

       The value returned in erroroffset is an indication of where in the
       pattern an error occurred. When there is no error, zero is returned. A
       non-zero value is not necessarily the furthest point in the pattern
       that was read. For example, after the error "lookbehind assertion is
       not fixed length", the error offset points to the start of the failing
       assertion. For an invalid UTF-8 or UTF-16 string, the offset is that of
       the first code unit of the failing character.

       Some errors are not detected until the whole pattern has been scanned;
       in these cases, the offset passed back is the length of the pattern.
       Note that the offset is in code units, not characters, even in a UTF
       mode. It may sometimes point into the middle of a UTF-8 or UTF-16
       character.

       This code fragment shows a typical straightforward call to
       pcre2_compile():

         pcre2_code *re;
         PCRE2_SIZE erroffset;
         int errorcode;
         re = pcre2_compile(
           (PCRE2_SPTR)"^A.*Z",     /* the pattern */
           PCRE2_ZERO_TERMINATED,   /* the pattern is zero-terminated */
           0,                       /* default options */
           &errorcode,              /* for error code */
           &erroffset,              /* for error offset */
           NULL);                   /* no compile context */


   Main compile options

       The following names for option bits are defined in the pcre2.h header
       file:

         PCRE2_ANCHORED

       If this bit is set, the pattern is forced to be "anchored", that is, it
       is constrained to match only at the first matching point in the string
       that is being searched (the "subject string"). This effect can also be
       achieved by appropriate constructs in the pattern itself, which is the
       only way to do it in Perl.

         PCRE2_ALLOW_EMPTY_CLASS

       By default, for compatibility with Perl, a closing square bracket that
       immediately follows an opening one is treated as a data character for
       the class.
       When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the class, which
       therefore contains no characters and so can never match.

         PCRE2_ALT_BSUX

       This option requests alternative handling of three escape sequences,
       which makes PCRE2's behaviour more like ECMAScript (aka JavaScript).
       When it is set:

       (1) \U matches an upper case "U" character; by default \U causes a
       compile time error (Perl uses \U to upper case subsequent characters).

       (2) \u matches a lower case "u" character unless it is followed by four
       hexadecimal digits, in which case the hexadecimal number defines the
       code point to match. By default, \u causes a compile time error (Perl
       uses it to upper case the following character).

       (3) \x matches a lower case "x" character unless it is followed by two
       hexadecimal digits, in which case the hexadecimal number defines the
       code point to match. By default, as in Perl, a hexadecimal number is
       always expected after \x, but it may have zero, one, or two digits (so,
       for example, \xz matches a binary zero character followed by z).

       ECMAScript 6 added additional functionality to \u. This can be accessed
       using the PCRE2_EXTRA_ALT_BSUX extra option (see "Extra compile
       options" below).
       Note that this alternative escape handling applies only to patterns.
       Neither of these options affects the processing of replacement strings
       passed to pcre2_substitute().

         PCRE2_ALT_CIRCUMFLEX

       In multiline mode (when PCRE2_MULTILINE is set), the circumflex
       metacharacter matches at the start of the subject (unless PCRE2_NOTBOL
       is set), and also after any internal newline. However, it does not
       match after a newline at the end of the subject, for compatibility with
       Perl. If you want a multiline circumflex also to match after a
       terminating newline, you must set PCRE2_ALT_CIRCUMFLEX.

         PCRE2_ALT_VERBNAMES

       By default, for compatibility with Perl, the name in any verb sequence
       such as (*MARK:NAME) is any sequence of characters that does not
       include a closing parenthesis. The name is not processed in any way,
       and it is not possible to include a closing parenthesis in the name.
       However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash
       processing is applied to verb names and only an unescaped closing
       parenthesis terminates the name. A closing parenthesis can be included
       in a name either as \) or between \Q and \E.
       If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set with
       PCRE2_ALT_VERBNAMES, unescaped whitespace in verb names is skipped and
       #-comments are recognized, exactly as in the rest of the pattern.

         PCRE2_AUTO_CALLOUT

       If this bit is set, pcre2_compile() automatically inserts callout
       items, all with number 255, before each pattern item, except
       immediately before or after an explicit callout in the pattern. For
       discussion of the callout facility, see the pcre2callout documentation.

         PCRE2_CASELESS

       If this bit is set, letters in the pattern match both upper and lower
       case letters in the subject. It is equivalent to Perl's /i option, and
       it can be changed within a pattern by a (?i) option setting. If either
       PCRE2_UTF or PCRE2_UCP is set, Unicode properties are used for all
       characters with more than one other case, and for all characters whose
       code points are greater than U+007F. Note that there are two ASCII
       characters, K and S, that, in addition to their lower case ASCII
       equivalents, are case-equivalent with U+212A (Kelvin sign) and U+017F
       (long S) respectively. If you do not want this case equivalence, you
       can suppress it by setting PCRE2_EXTRA_CASELESS_RESTRICT.

       For lower valued characters with only one other case, a lookup table is
       used for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup
       table is used for all code points less than 256, and higher code points
       (available only in 16-bit or 32-bit mode) are treated as not having
       another case.

         PCRE2_DOLLAR_ENDONLY

       If this bit is set, a dollar metacharacter in the pattern matches only
       at the end of the subject string. Without this option, a dollar also
       matches immediately before a newline at the end of the string (but not
       before any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
       if PCRE2_MULTILINE is set. There is no equivalent to this option in
       Perl, and no way to set it within a pattern.

         PCRE2_DOTALL

       If this bit is set, a dot metacharacter in the pattern matches any
       character, including one that indicates a newline. However, it only
       ever matches one character, even if newlines are coded as CRLF. Without
       this option, a dot does not match when the current position in the
       subject is at a newline. This option is equivalent to Perl's /s option,
       and it can be changed within a pattern by a (?s) option setting.
       A negative class such as [^a] always matches newline characters, and
       the \N escape sequence always matches a non-newline character,
       independent of the setting of PCRE2_DOTALL.

         PCRE2_DUPNAMES

       If this bit is set, names used to identify capture groups need not be
       unique. This can be helpful for certain types of pattern when it is
       known that only one instance of the named group can ever be matched.
       There are more details of named capture groups below; see also the
       pcre2pattern documentation.

         PCRE2_ENDANCHORED

       If this bit is set, the end of any pattern match must be right at the
       end of the string being searched (the "subject string"). If the pattern
       match succeeds by reaching (*ACCEPT), but does not reach the end of the
       subject, the match fails at the current starting point. For unanchored
       patterns, a new match is then tried at the next starting point.
       However, if the match succeeds by reaching the end of the pattern, but
       not the end of the subject, backtracking occurs and an alternative
       match may be found. Consider these two patterns:

         .(*ACCEPT)|..
         .|..

       If matched against "abc" with PCRE2_ENDANCHORED set, the first matches
       "c" whereas the second matches "bc".
       The effect of PCRE2_ENDANCHORED can also be achieved by appropriate
       constructs in the pattern itself, which is the only way to do it in
       Perl.

       For DFA matching with pcre2_dfa_match(), PCRE2_ENDANCHORED applies only
       to the first (that is, the longest) matched string. Other parallel
       matches, which are necessarily substrings of the first one, must
       obviously end before the end of the subject.

         PCRE2_EXTENDED

       If this bit is set, most white space characters in the pattern are
       totally ignored except when escaped, inside a character class, or
       inside a \Q...\E sequence. However, white space is not allowed within
       sequences such as (?> that introduce various parenthesized groups, nor
       within numerical quantifiers such as {1,3}. Ignorable white space is
       permitted between an item and a following quantifier and between a
       quantifier and a following + that indicates possessiveness.
       PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be changed
       within a pattern by a (?x) option setting.

       When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED
       recognizes as white space only those characters with code points less
       than 256 that are flagged as white space in its low-character table.
       The table is normally created by pcre2_maketables(), which uses the
       isspace() function to identify space characters. In most ASCII
       environments, the relevant characters are those with code points 0x0009
       (tab), 0x000A (linefeed), 0x000B (vertical tab), 0x000C (formfeed),
       0x000D (carriage return), and 0x0020 (space).

       When PCRE2 is compiled with Unicode support, in addition to these
       characters, five more Unicode "Pattern White Space" characters are
       recognized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E
       (left-to-right mark), U+200F (right-to-left mark), U+2028 (line
       separator), and U+2029 (paragraph separator). This set of characters is
       the same as recognized by Perl's /x option. Note that the horizontal
       and vertical space characters that are matched by the \h and \v escapes
       in patterns are a much bigger set.

       As well as ignoring most white space, PCRE2_EXTENDED also causes
       characters between an unescaped # outside a character class and the
       next newline, inclusive, to be ignored, which makes it possible to
       include comments inside complicated patterns. Note that the end of this
       type of comment is a literal newline sequence in the pattern; escape
       sequences that happen to represent a newline do not count.

       Which characters are interpreted as newlines can be specified by a
       setting in the compile context that is passed to pcre2_compile() or by
       a special sequence at the start of the pattern, as described in the
       section entitled "Newline conventions" in the pcre2pattern
       documentation. A default is defined when PCRE2 is built.

         PCRE2_EXTENDED_MORE

       This option has the effect of PCRE2_EXTENDED, but, in addition,
       unescaped space and horizontal tab characters are ignored inside a
       character class. Note: only these two characters are ignored, not the
       full set of pattern white space characters that are ignored outside a
       character class. PCRE2_EXTENDED_MORE is equivalent to Perl's /xx
       option, and it can be changed within a pattern by a (?xx) option
       setting.

         PCRE2_FIRSTLINE

       If this option is set, the start of an unanchored pattern match must be
       before or at the first newline in the subject string following the
       start of matching, though the matched text may continue over the
       newline. If startoffset is non-zero, the limiting newline is not
       necessarily the first newline in the subject.
       For example, if the subject string is "abc\nxyz" (where \n represents a
       single-character newline) a pattern match for "yz" succeeds with
       PCRE2_FIRSTLINE if startoffset is greater than 3. See also
       PCRE2_USE_OFFSET_LIMIT, which provides a more general limiting
       facility. If PCRE2_FIRSTLINE is set with an offset limit, a match must
       occur in the first line and also within the offset limit. In other
       words, whichever limit comes first is used. This option has no effect
       for anchored patterns.

         PCRE2_LITERAL

       If this option is set, all meta-characters in the pattern are disabled,
       and it is treated as a literal string. Matching literal strings with a
       regular expression engine is not the most efficient way of doing it. If
       you are doing a lot of literal matching and are worried about
       efficiency, you should consider using other approaches. The only other
       main options that are allowed with PCRE2_LITERAL are: PCRE2_ANCHORED,
       PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE,
       PCRE2_MATCH_INVALID_UTF, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK,
       PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options
       PCRE2_EXTRA_MATCH_LINE and PCRE2_EXTRA_MATCH_WORD are also supported.
       Any other options cause an error.

         PCRE2_MATCH_INVALID_UTF

       This option forces PCRE2_UTF (see below) and also enables support for
       matching by pcre2_match() in subject strings that contain invalid UTF
       sequences. Note, however, that the 16-bit and 32-bit PCRE2 libraries
       process strings as sequences of uint16_t or uint32_t code points. They
       cannot find valid UTF sequences within an arbitrary string of bytes
       unless such sequences are suitably aligned. This facility is not
       supported for DFA matching. For details, see the pcre2unicode
       documentation.

         PCRE2_MATCH_UNSET_BACKREF

       If this option is set, a backreference to an unset capture group
       matches an empty string (by default this causes the current matching
       alternative to fail). A pattern such as (\1)(a) succeeds when this
       option is set (assuming it can find an "a" in the subject), whereas it
       fails by default, for Perl compatibility. Setting this option makes
       PCRE2 behave more like ECMAScript (aka JavaScript).

         PCRE2_MULTILINE

       By default, for the purposes of matching "start of line" and "end of
       line", PCRE2 treats the subject string as consisting of a single line
       of characters, even if it actually contains newlines.
       The "start of line" metacharacter (^) matches only at the start of the
       string, and the "end of line" metacharacter ($) matches only at the end
       of the string, or before a terminating newline (except when
       PCRE2_DOLLAR_ENDONLY is set). Note, however, that unless PCRE2_DOTALL
       is set, the "any character" metacharacter (.) does not match at a
       newline. This behaviour (for ^, $, and dot) is the same as Perl.

       When PCRE2_MULTILINE is set, the "start of line" and "end of line"
       constructs match immediately following or immediately before internal
       newlines in the subject string, respectively, as well as at the very
       start and end. This is equivalent to Perl's /m option, and it can be
       changed within a pattern by a (?m) option setting. Note that the "start
       of line" metacharacter does not match after a newline at the end of the
       subject, for compatibility with Perl. However, you can change this by
       setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a
       subject string, or no occurrences of ^ or $ in a pattern, setting
       PCRE2_MULTILINE has no effect.

         PCRE2_NEVER_BACKSLASH_C

       This option locks out the use of \C in the pattern that is being
       compiled.
       This escape can cause unpredictable behaviour in UTF-8 or UTF-16 modes,
       because it may leave the current matching point in the middle of a
       multi-code-unit character. This option may be useful in applications
       that process patterns from external sources. Note that there is also a
       build-time option that permanently locks out the use of \C.

         PCRE2_NEVER_UCP

       This option locks out the use of Unicode properties for handling \B,
       \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
       described for the PCRE2_UCP option below. In particular, it prevents
       the creator of the pattern from enabling this facility by starting the
       pattern with (*UCP). This option may be useful in applications that
       process patterns from external sources. The option combination
       PCRE2_UCP and PCRE2_NEVER_UCP causes an error.

         PCRE2_NEVER_UTF

       This option locks out interpretation of the pattern as UTF-8, UTF-16,
       or UTF-32, depending on which library is in use. In particular, it
       prevents the creator of the pattern from switching to UTF
       interpretation by starting the pattern with (*UTF). This option may be
       useful in applications that process patterns from external sources. The
       combination of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.

         PCRE2_NO_AUTO_CAPTURE

       If this option is set, it disables the use of numbered capturing
       parentheses in the pattern. Any opening parenthesis that is not
       followed by ? behaves as if it were followed by ?: but named
       parentheses can still be used for capturing (and they acquire numbers
       in the usual way). This is the same as Perl's /n option. Note that,
       when this option is set, references to capture groups (backreferences
       or recursion/subroutine calls) may only refer to named groups, though
       the reference can be by name or by number.

         PCRE2_NO_AUTO_POSSESS

       If this option is set, it disables "auto-possessification", which is an
       optimization that, for example, turns a+b into a++b in order to avoid
       backtracks into a+ that can never be successful. However, if callouts
       are in use, auto-possessification means that some callouts are never
       taken. You can set this option if you want the matching functions to do
       a full unoptimized search and run all the callouts, but it is mainly
       provided for testing purposes.

         PCRE2_NO_DOTSTAR_ANCHOR

       If this option is set, it disables an optimization that is applied when
       .* is the first significant item in a top-level branch of a pattern,
       and all the other branches also start with .* or with \A or \G or ^.
       The optimization is automatically disabled for .* if it is inside an
       atomic group or a capture group that is the subject of a backreference,
       or if the pattern contains (*PRUNE) or (*SKIP). When the optimization
       is not disabled, such a pattern is automatically anchored if
       PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
       for any ^ items. Otherwise, the fact that any match must start either
       at the start of the subject or following a newline is remembered. Like
       other optimizations, this can cause callouts to be skipped.

         PCRE2_NO_START_OPTIMIZE

       This is an option whose main effect is at matching time. It does not
       change what pcre2_compile() generates, but it does affect the output of
       the JIT compiler.

       There are a number of optimizations that may occur at the start of a
       match, in order to speed up the process.
       For example, if it is known that an unanchored match must start with a
       specific code unit value, the matching code searches the subject for
       that value, and fails immediately if it cannot find it, without
       actually running the main matching function. This means that a special
       item such as (*COMMIT) at the start of a pattern is not considered
       until after a suitable starting point for the match has been found.
       Also, when callouts or (*MARK) items are in use, these "start-up"
       optimizations can cause them to be skipped if the pattern is never
       actually used. The start-up optimizations are in effect a pre-scan of
       the subject that takes place before the pattern is run.

       The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
       possibly causing performance to suffer, but ensuring that in cases
       where the result is "no match", the callouts do occur, and that items
       such as (*COMMIT) and (*MARK) are considered at every possible starting
       position in the subject string.

       Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching
       operation. Consider the pattern

         (*COMMIT)ABC

       When this is compiled, PCRE2 records the fact that a match must start
       with the character "A". Suppose the subject string is "DEFABC".
       The start-up optimization scans along the subject, finds "A" and runs
       the first match attempt from there. The (*COMMIT) item means that the
       pattern must match the current starting position, which in this case,
       it does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE
       set, the initial scan along the subject string does not happen. The
       first match attempt is run starting from "D" and when this fails,
       (*COMMIT) prevents any further matches being tried, so the overall
       result is "no match".

       Another start-up optimization makes use of a minimum length for a
       matching subject, which is recorded when possible. Consider the pattern

         (*MARK:1)B(*MARK:2)(X|Y)

       The minimum length for a match is two characters. If the subject is
       "XXBB", the "starting character" optimization skips "XX", then tries to
       match "BB", which is long enough. In the process, (*MARK:2) is
       encountered and remembered. When the match attempt fails, the next "B"
       is found, but there is only one character left, so there are no more
       attempts, and "no match" is returned with the "last mark seen" set to
       "2".
       If PCRE2_NO_START_OPTIMIZE is set, however, matches are tried at every
       possible starting position, including at the end of the subject, where
       (*MARK:1) is encountered, but there is no "B", so the "last mark seen"
       that is returned is "1". In this case, the optimizations do not affect
       the overall match result, which is still "no match", but they do affect
       the auxiliary information that is returned.

         PCRE2_NO_UTF_CHECK

       When PCRE2_UTF is set, the validity of the pattern as a UTF string is
       automatically checked. There are discussions about the validity of
       UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode
       document. If an invalid UTF sequence is found, pcre2_compile() returns
       a negative error code.

       If you know that your pattern is a valid UTF string, and you want to
       skip this check for performance reasons, you can set the
       PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an
       invalid UTF string as a pattern is undefined. It may cause your program
       to crash or loop.

       Note that this option can also be passed to pcre2_match() and
       pcre2_dfa_match(), to suppress UTF validity checking of the subject
       string.

       Note also that setting PCRE2_NO_UTF_CHECK at compile time does not
       disable the error that is given if an escape sequence for an invalid
       Unicode code point is encountered in the pattern. In particular, the
       so-called "surrogate" code points (0xd800 to 0xdfff) are invalid. If
       you want to allow escape sequences such as \x{d800} you can set the
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option, as described in the
       section entitled "Extra compile options" below. However, this is
       possible only in UTF-8 and UTF-32 modes, because these values are not
       representable in UTF-16.

         PCRE2_UCP

       This option has two effects. Firstly, it changes the way PCRE2
       processes \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX
       character classes. By default, only ASCII characters are recognized,
       but if PCRE2_UCP is set, Unicode properties are used to classify
       characters. There are some PCRE2_EXTRA options (see below) that add
       finer control to this behaviour. More details are given in the section
       on generic character types in the pcre2pattern page.

       The second effect of PCRE2_UCP is to force the use of Unicode
       properties for upper/lower casing operations, even when PCRE2_UTF is
       not set. This makes it possible to process strings in the 16-bit UCS-2
       code.
       This option is available only if PCRE2 has been compiled with Unicode
       support (which is the default). The PCRE2_EXTRA_CASELESS_RESTRICT
       option (see below) restricts caseless matching such that ASCII
       characters match only ASCII characters and non-ASCII characters match
       only non-ASCII characters.

         PCRE2_UNGREEDY

       This option inverts the "greediness" of the quantifiers so that they
       are not greedy by default, but become greedy if followed by "?". It is
       not compatible with Perl. It can also be set by a (?U) option setting
       within the pattern.

         PCRE2_USE_OFFSET_LIMIT

       This option must be set for pcre2_compile() if pcre2_set_offset_limit()
       is going to be used to set a non-default offset limit in a match
       context for matches that use this pattern. An error is generated if an
       offset limit is set without this option. For more details, see the
       description of pcre2_set_offset_limit() in the section that describes
       match contexts. See also the PCRE2_FIRSTLINE option above.

         PCRE2_UTF

       This option causes PCRE2 to regard both the pattern and the subject
       strings that are subsequently processed as strings of UTF characters
       instead of single-code-unit strings.
       It is available when PCRE2 is built to include Unicode support (which
       is the default). If Unicode support is not available, the use of this
       option provokes an error. Details of how PCRE2_UTF changes the
       behaviour of PCRE2 are given in the pcre2unicode page. In particular,
       note that it changes the way PCRE2_CASELESS works.

   Extra compile options

       The option bits that can be set in a compile context by calling the
       pcre2_set_compile_extra_options() function are as follows:

         PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK

       Since release 10.38 PCRE2 has forbidden the use of \K within lookaround
       assertions, following Perl's lead. This option is provided to re-enable
       the previous behaviour (act in positive lookarounds, ignore in negative
       ones) in case anybody is relying on it.

         PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES

       This option applies when compiling a pattern in UTF-8 or UTF-32 mode.
       It is forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode
       "surrogate" code points in the range 0xd800 to 0xdfff are used in pairs
       in UTF-16 to encode code points with values in the range 0x10000 to
       0x10ffff. The surrogates cannot therefore be represented in UTF-16.
       They can be represented in UTF-8 and UTF-32, but are defined as invalid
       code points, and cause errors if encountered in a UTF-8 or UTF-32
       string that is being checked for validity by PCRE2.

       These values also cause errors if encountered in escape sequences such
       as \x{d912} within a pattern. However, it seems that some applications,
       when using PCRE2 to check for unwanted characters in UTF-8 strings,
       explicitly test for the surrogates using escape sequences. The
       PCRE2_NO_UTF_CHECK option does not disable the error that occurs,
       because it applies only to the testing of input strings for UTF
       validity.

       If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set,
       surrogate code point values in UTF-8 and UTF-32 patterns no longer
       provoke errors and are incorporated in the compiled pattern. However,
       they can only match subject characters if the matching function is
       called with PCRE2_NO_UTF_CHECK set.

         PCRE2_EXTRA_ALT_BSUX

       The original option PCRE2_ALT_BSUX causes PCRE2 to process \U, \u, and
       \x in the way that ECMAScript (aka JavaScript) does. Additional
       functionality was defined by ECMAScript 6; setting PCRE2_EXTRA_ALT_BSUX
       has the effect of PCRE2_ALT_BSUX, but in addition it recognizes
       \u{hhh..} as a hexadecimal character code, where hhh..
       is any number of hexadecimal digits.

         PCRE2_EXTRA_ASCII_BSD

       This option forces \d to match only ASCII digits, even when PCRE2_UCP
       is set. It can be changed within a pattern by means of the (?aD)
       option setting.

         PCRE2_EXTRA_ASCII_BSS

       This option forces \s to match only ASCII space characters, even when
       PCRE2_UCP is set. It can be changed within a pattern by means of the
       (?aS) option setting.

         PCRE2_EXTRA_ASCII_BSW

       This option forces \w to match only ASCII word characters, even when
       PCRE2_UCP is set. It can be changed within a pattern by means of the
       (?aW) option setting.

         PCRE2_EXTRA_ASCII_DIGIT

       This option forces the POSIX character classes [:digit:] and [:xdigit:]
       to match only ASCII digits, even when PCRE2_UCP is set. It can be
       changed within a pattern by means of the (?aT) option setting.

         PCRE2_EXTRA_ASCII_POSIX

       This option forces all the POSIX character classes, including [:digit:]
       and [:xdigit:], to match only ASCII characters, even when PCRE2_UCP is
       set.
       It can be changed within a pattern by means of the (?aP) option
       setting, but note that this also sets PCRE2_EXTRA_ASCII_DIGIT in order
       to ensure that (?-aP) unsets all ASCII restrictions for POSIX classes.

         PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL

       This is a dangerous option. Use with care. By default, an unrecognized
       escape such as \j or a malformed one such as \x{2z} causes a
       compile-time error when detected by pcre2_compile(). Perl is somewhat
       inconsistent in handling such items: for example, \j is treated as a
       literal "j", and non-hexadecimal digits in \x{} are just ignored,
       though warnings are given in both cases if Perl's warning switch is
       enabled. However, a malformed octal number after \o{ always causes an
       error in Perl.

       If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
       pcre2_compile(), all unrecognized or malformed escape sequences are
       treated as single-character escapes. For example, \j is a literal "j"
       and \x{2z} is treated as the literal string "x{2z}". Setting this
       option means that typos in patterns may go undetected and have
       unexpected results.
       Also note that a sequence such as [\N{] is interpreted as a malformed
       attempt at [\N{...}] and so is treated as [N{] whereas [\N] gives an
       error because an unqualified \N is a valid escape sequence but is not
       supported in a character class. To reiterate: this is a dangerous
       option. Use with great care.

         PCRE2_EXTRA_CASELESS_RESTRICT

       When either PCRE2_UCP or PCRE2_UTF is set, caseless matching follows
       Unicode rules, which allow for more than two cases per character. There
       are two case-equivalent character sets that contain both ASCII and
       non-ASCII characters. The ASCII letter S is case-equivalent to U+017f
       (long S) and the ASCII letter K is case-equivalent to U+212a (Kelvin
       sign). This option disables recognition of case-equivalences that cross
       the ASCII/non-ASCII boundary. In a caseless match, either both
       characters must be ASCII or both must be non-ASCII. The option can be
       changed within a pattern by the (?r) option setting.

         PCRE2_EXTRA_ESCAPED_CR_IS_LF

       There are some legacy applications where the escape sequence \r in a
       pattern is expected to match a newline. If this option is set, \r in a
       pattern is converted to \n so that it matches a LF (linefeed) instead
       of a CR (carriage return) character.
       The option does not affect a literal CR in the pattern, nor does it
       affect CR specified as an explicit code point such as \x{0D}.

         PCRE2_EXTRA_MATCH_LINE

       This option is provided for use by the -x option of pcre2grep. It
       causes the pattern only to match complete lines. This is achieved by
       automatically inserting the code for "^(?:" at the start of the
       compiled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is
       set, the matched line may be in the middle of the subject string. This
       option can be used with PCRE2_LITERAL.

         PCRE2_EXTRA_MATCH_WORD

       This option is provided for use by the -w option of pcre2grep. It
       causes the pattern only to match strings that have a word boundary at
       the start and the end. This is achieved by automatically inserting the
       code for "\b(?:" at the start of the compiled pattern and ")\b" at the
       end. The option may be used with PCRE2_LITERAL. However, it is ignored
       if PCRE2_EXTRA_MATCH_LINE is also set.
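
       The wrapping performed by PCRE2_EXTRA_MATCH_LINE and
       PCRE2_EXTRA_MATCH_WORD can be pictured as a rewrite of the pattern
       text. The following is only a minimal sketch under that assumption:
       wrap_pattern() and the WRAP_* flags are hypothetical names invented
       for this illustration and are not part of PCRE2. PCRE2 itself inserts
       compiled code rather than editing the pattern string, which is why
       the real options can be combined with PCRE2_LITERAL whereas a textual
       rewrite like this one cannot.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flags for this sketch only; they are not PCRE2 option bits. */
#define WRAP_LINE 0x1u   /* models PCRE2_EXTRA_MATCH_LINE */
#define WRAP_WORD 0x2u   /* models PCRE2_EXTRA_MATCH_WORD */

/*
 * Write into out a copy of pattern wrapped in the prefix and suffix that
 * the text above describes. Returns 0 on success, or -1 if out is too
 * small to hold the result and its terminating zero.
 */
static int wrap_pattern(const char *pattern, unsigned flags,
                        char *out, size_t outsize)
{
  const char *prefix = "";
  const char *suffix = "";
  int n;

  assert(pattern != NULL && out != NULL);

  if (flags & WRAP_LINE) {
    /* The word option is ignored when the line option is also set. */
    prefix = "^(?:";
    suffix = ")$";
  } else if (flags & WRAP_WORD) {
    prefix = "\\b(?:";
    suffix = ")\\b";
  }

  n = snprintf(out, outsize, "%s%s%s", prefix, pattern, suffix);
  return (n < 0 || (size_t)n >= outsize) ? -1 : 0;
}
```

       The non-capturing group matters when the pattern contains top-level
       alternatives: without it, the inserted assertions would each apply to
       only one branch of a pattern such as cat|dog.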


JUST-IN-TIME (JIT) COMPILATION

       int pcre2_jit_compile(pcre2_code *code, uint32_t options);

       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       pcre2_jit_stack *pcre2_jit_stack_create(size_t startsize,
         size_t maxsize, pcre2_general_context *gcontext);

       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
         pcre2_jit_callback callback_function, void *callback_data);

       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);

       These functions provide support for JIT compilation, which, if the
       just-in-time compiler is available, further processes a compiled
       pattern into machine code that executes much faster than the
       pcre2_match() interpretive matching function. Full details are given
       in the pcre2jit documentation.

       JIT compilation is a heavyweight optimization.
       It can take some time for patterns to be analyzed, and for one-off
       matches and simple patterns the benefit of faster execution might be
       offset by a much slower compilation time. Most (but not all) patterns
       can be optimized by the JIT compiler.


LOCALE SUPPORT

       const uint8_t *pcre2_maketables(pcre2_general_context *gcontext);

       void pcre2_maketables_free(pcre2_general_context *gcontext,
         const uint8_t *tables);

       PCRE2 handles caseless matching, and determines whether characters are
       letters, digits, or whatever, by reference to a set of tables, indexed
       by character code point. However, this applies only to characters whose
       code points are less than 256. By default, higher-valued code points
       never match escapes such as \w or \d.

       When PCRE2 is built with Unicode support (the default), certain Unicode
       character properties can be tested with \p and \P, or, alternatively,
       the PCRE2_UCP option can be set when a pattern is compiled; this causes
       \w and friends to use Unicode property support instead of the built-in
       tables. PCRE2_UCP also causes upper/lower casing operations on
       characters with code points greater than 127 to use Unicode properties.
       These effects apply even when PCRE2_UTF is not set.
       There are, however, some PCRE2_EXTRA options (see above) that can be
       used to modify or suppress them.

       The use of locales with Unicode is discouraged. If you are handling
       characters with code points greater than 127, you should either use
       Unicode support, or use locales, but not try to mix the two.

       PCRE2 contains a built-in set of character tables that are used by
       default. These are sufficient for many applications. Normally, the
       internal tables recognize only ASCII characters. However, when PCRE2 is
       built, it is possible to cause the internal tables to be rebuilt in the
       default "C" locale of the local system, which may cause them to be
       different.

       The built-in tables can be overridden by tables supplied by the
       application that calls PCRE2. These may be created in a different
       locale from the default. As more and more applications change to using
       Unicode, the need for this locale support is expected to die away.

       External tables are built by calling the pcre2_maketables() function,
       in the relevant locale. The only argument to this function is a general
       context, which can be used to pass a custom memory allocator. If the
       argument is NULL, the system malloc() is used.
       The result can be passed to pcre2_compile() as often as necessary, by
       creating a compile context and calling pcre2_set_character_tables() to
       set the tables pointer therein.

       For example, to build and use tables that are appropriate for the
       French locale (where accented characters with values greater than 127
       are treated as letters), the following code could be used:

         setlocale(LC_CTYPE, "fr_FR");
         tables = pcre2_maketables(NULL);
         ccontext = pcre2_compile_context_create(NULL);
         pcre2_set_character_tables(ccontext, tables);
         re = pcre2_compile(..., ccontext);

       The locale name "fr_FR" is used on Linux and other Unix-like systems;
       if you are using Windows, the name for the French locale is "french".

       The pointer that is passed (via the compile context) to pcre2_compile()
       is saved with the compiled pattern, and the same tables are used by the
       matching functions. Thus, for any single pattern, compilation and
       matching both happen in the same locale, but different patterns can be
       processed in different locales.

       It is the caller's responsibility to ensure that the memory containing
       the tables remains available while they are still in use.
       When they are no longer needed, you can discard them using
       pcre2_maketables_free(), passing as its first parameter the same
       general context that was used to create the tables.

   Saving locale tables

       The tables described above are just a sequence of binary bytes, which
       makes them independent of hardware characteristics such as endianness
       or whether the processor is 32-bit or 64-bit. A copy of the result of
       pcre2_maketables() can therefore be saved in a file or elsewhere and
       re-used later, even in a different program or on another computer. The
       size of the tables (number of bytes) must be obtained by calling
       pcre2_config() with the PCRE2_CONFIG_TABLES_LENGTH option because
       pcre2_maketables() does not return this value. Note that the
       pcre2_dftables program, which is part of the PCRE2 build system, can be
       used stand-alone to create a file that contains a set of binary tables.
       See the pcre2build documentation for details.


INFORMATION ABOUT A COMPILED PATTERN

       int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
         void *where);

       The pcre2_pattern_info() function returns general information about a
       compiled pattern. For information about callouts, see the next section.
       The first argument for pcre2_pattern_info() is a pointer to the
       compiled pattern. The second argument specifies which piece of
       information is required, and the third argument is a pointer to a
       variable to receive the data. If the third argument is NULL, the first
       argument is ignored, and the function returns the size in bytes of the
       variable that is required for the information requested. Otherwise, the
       yield of the function is zero for success, or one of the following
       negative numbers:

         PCRE2_ERROR_NULL           the argument code was NULL
         PCRE2_ERROR_BADMAGIC       the "magic number" was not found
         PCRE2_ERROR_BADOPTION      the value of what was invalid
         PCRE2_ERROR_UNSET          the requested field is not set

       The "magic number" is placed at the start of each compiled pattern as a
       simple check against passing an arbitrary memory pointer.
       Here is a typical call of pcre2_pattern_info(), to obtain the length of
       the compiled pattern:

         int rc;
         size_t length;
         rc = pcre2_pattern_info(
           re,               /* result of pcre2_compile() */
           PCRE2_INFO_SIZE,  /* what is required */
           &length);         /* where to put the data */

       The possible values for the second argument are defined in pcre2.h, and
       are as follows:

         PCRE2_INFO_ALLOPTIONS
         PCRE2_INFO_ARGOPTIONS
         PCRE2_INFO_EXTRAOPTIONS

       Return copies of the pattern's options. The third argument should point
       to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the
       options that were passed to pcre2_compile(), whereas
       PCRE2_INFO_ALLOPTIONS returns the compile options as modified by any
       top-level (*XXX) option settings such as (*UTF) at the start of the
       pattern itself. PCRE2_INFO_EXTRAOPTIONS returns the extra options that
       were set in the compile context by calling the
       pcre2_set_compile_extra_options() function.

       For example, if the pattern /(*UTF)abc/ is compiled with the
       PCRE2_EXTENDED option, the result for PCRE2_INFO_ALLOPTIONS is
       PCRE2_EXTENDED and PCRE2_UTF.
       Option settings such as (?i) that can change within a pattern do not
       affect the result of PCRE2_INFO_ALLOPTIONS, even if they appear right
       at the start of the pattern. (This was different in some earlier
       releases.)

       A pattern compiled without PCRE2_ANCHORED is automatically anchored by
       PCRE2 if the first significant item in every top-level branch is one of
       the following:

         ^     unless PCRE2_MULTILINE is set
         \A    always
         \G    always
         .*    sometimes - see below

       When .* is the first significant item, anchoring is possible only when
       all the following are true:

         .* is not in an atomic group
         .* is not in a capture group that is the subject
              of a backreference
         PCRE2_DOTALL is in force for .*
         Neither (*PRUNE) nor (*SKIP) appears in the pattern
         PCRE2_NO_DOTSTAR_ANCHOR is not set

       For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
       the options returned for PCRE2_INFO_ALLOPTIONS.

         PCRE2_INFO_BACKREFMAX

       Return the number of the highest backreference in the pattern. The
       third argument should point to a uint32_t variable.
       Named capture groups acquire numbers as well as names, and these count
       towards the highest backreference. Backreferences such as \4 or \g{12}
       match the captured characters of the given group, but in addition, the
       check that a capture group is set in a conditional group such as
       (?(3)a|b) is also a backreference. Zero is returned if there are no
       backreferences.

         PCRE2_INFO_BSR

       The output is a uint32_t integer whose value indicates what character
       sequences the \R escape sequence matches. A value of PCRE2_BSR_UNICODE
       means that \R matches any Unicode line ending sequence; a value of
       PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF.

         PCRE2_INFO_CAPTURECOUNT

       Return the highest capture group number in the pattern. In patterns
       where (?| is not used, this is also the total number of capture groups.
       The third argument should point to a uint32_t variable.

         PCRE2_INFO_DEPTHLIMIT

       If the pattern set a backtracking depth limit by including an item of
       the form (*LIMIT_DEPTH=nnnn) at the start, the value is returned. The
       third argument should point to a uint32_t integer. If no such value has
       been set, the call to pcre2_pattern_info() returns the error
       PCRE2_ERROR_UNSET.
       Note that this limit will only be used during matching if it is less
       than the limit set or defaulted by the caller of the match function.

         PCRE2_INFO_FIRSTBITMAP

       In the absence of a single first code unit for a non-anchored pattern,
       pcre2_compile() may construct a 256-bit table that defines a fixed set
       of values for the first code unit in any match. For example, a pattern
       that starts with [abc] results in a table with three bits set. When
       code unit values greater than 255 are supported, the flag bit for 255
       means "any code unit of value 255 or above". If such a table was
       constructed, a pointer to it is returned. Otherwise NULL is returned.
       The third argument should point to a const uint8_t * variable.

         PCRE2_INFO_FIRSTCODETYPE

       Return information about the first code unit of any matched string, for
       a non-anchored pattern. The third argument should point to a uint32_t
       variable. If there is a fixed first value, for example, the letter "c"
       from a pattern such as (cat|cow|coyote), 1 is returned, and the value
       can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed
       first value, but it is known that a match can occur only at the start
       of the subject or following a newline in the subject, 2 is returned.
       Otherwise, and for anchored patterns, 0 is returned.
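
       As a rough illustration of how an application might consult the table
       returned by PCRE2_INFO_FIRSTBITMAP, the sketch below tests whether a
       given code unit could start a match. The helper names
       can_start_match() and mark_code_unit() are hypothetical and are not
       part of PCRE2. The bit layout used here (bit c % 8 of byte c / 8) is an
       assumption that this document does not state, so it should be checked
       against the library source before being relied on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for building an example 256-bit table by hand.
   ASSUMPTION: the flag for value c is bit (c % 8) of byte (c / 8). */
static void mark_code_unit(uint8_t *bitmap, uint32_t c)
{
  assert(c <= 255);
  bitmap[c / 8] |= (uint8_t)(1u << (c % 8));
}

/* Hypothetical helper: returns 1 if code unit c may start a match according
   to the table, 0 otherwise. A NULL table means that no table was
   constructed, so no starting code unit can be ruled out. */
static int can_start_match(const uint8_t *bitmap, uint32_t c)
{
  if (bitmap == NULL) return 1;
  if (c > 255) c = 255;   /* bit 255 means "any value of 255 or above" */
  return (int)((bitmap[c / 8] >> (c % 8)) & 1u);
}
```

       With a table in which only the bits for "a", "b", and "c" are set,
       can_start_match() gives 1 for those three code units and 0 for any
       other value, including values above 255 in the 16-bit and 32-bit
       libraries.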

         PCRE2_INFO_FIRSTCODEUNIT

       Return the value of the first code unit of any matched string for a
       pattern where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0.
       The third argument should point to a uint32_t variable. In the 8-bit
       library, the value is always less than 256. In the 16-bit library the
       value can be up to 0xffff. In the 32-bit library in UTF-32 mode the
       value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
       mode.

         PCRE2_INFO_FRAMESIZE

       Return the size (in bytes) of the data frames that are used to remember
       backtracking positions when the pattern is processed by pcre2_match()
       without the use of JIT. The third argument should point to a size_t
       variable. The frame size depends on the number of capturing parentheses
       in the pattern. Each additional capture group adds two PCRE2_SIZE
       variables.

         PCRE2_INFO_HASBACKSLASHC

       Return 1 if the pattern contains any instances of \C, otherwise 0. The
       third argument should point to a uint32_t variable.

         PCRE2_INFO_HASCRORLF

       Return 1 if the pattern contains any explicit matches for CR or LF
       characters, otherwise 0.
       The third argument should point to a uint32_t variable. An explicit
       match is either a literal CR or LF character, or \r or \n or one of the
       equivalent hexadecimal or octal escape sequences.

         PCRE2_INFO_HEAPLIMIT

       If the pattern set a heap memory limit by including an item of the form
       (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third
       argument should point to a uint32_t integer. If no such value has been
       set, the call to pcre2_pattern_info() returns the error
       PCRE2_ERROR_UNSET. Note that this limit will only be used during
       matching if it is less than the limit set or defaulted by the caller of
       the match function.

         PCRE2_INFO_JCHANGED

       Return 1 if the (?J) or (?-J) option setting is used in the pattern,
       otherwise 0. The third argument should point to a uint32_t variable.
       (?J) and (?-J) set and unset the local PCRE2_DUPNAMES option,
       respectively.

         PCRE2_INFO_JITSIZE

       If the compiled pattern was successfully processed by
       pcre2_jit_compile(), return the size of the JIT compiled code,
       otherwise return zero. The third argument should point to a size_t
       variable.

         PCRE2_INFO_LASTCODETYPE

       Return 1 if there is a rightmost literal code unit that must exist in
       any matched string, other than at its start. The third argument should
       point to a uint32_t variable. If there is no such value, 0 is returned.
       When 1 is returned, the code unit value itself can be retrieved using
       PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
       recorded only if it follows something of variable length. For example,
       for the pattern /^a\d+z\d+/ the returned value is 1 (with "z" returned
       from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is
       0.

         PCRE2_INFO_LASTCODEUNIT

       Return the value of the rightmost literal code unit that must exist in
       any matched string, other than at its start, for a pattern where
       PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third
       argument should point to a uint32_t variable.

         PCRE2_INFO_MATCHEMPTY

       Return 1 if the pattern might match an empty string, otherwise 0. The
       third argument should point to a uint32_t variable. When a pattern
       contains recursive subroutine calls it is not always possible to
       determine whether or not it can match an empty string. PCRE2 takes a
       cautious approach and returns 1 in such cases.

         PCRE2_INFO_MATCHLIMIT

       If the pattern set a match limit by including an item of the form
       (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third
       argument should point to a uint32_t integer. If no such value has been
       set, the call to pcre2_pattern_info() returns the error
       PCRE2_ERROR_UNSET. Note that this limit will only be used during
       matching if it is less than the limit set or defaulted by the caller of
       the match function.

         PCRE2_INFO_MAXLOOKBEHIND

       A lookbehind assertion moves back a certain number of characters (not
       code units) when it starts to process each of its branches. This
       request returns the largest of these backward moves. The third argument
       should point to a uint32_t integer. The simple assertions \b and \B
       require a one-character lookbehind and cause PCRE2_INFO_MAXLOOKBEHIND
       to return 1 in the absence of anything longer. \A also registers a
       one-character lookbehind, though it does not actually inspect the
       previous character.

       Note that this information is useful for multi-segment matching only if
       the pattern contains no nested lookbehinds.
       For example, the pattern (?<=a(?<=ba)c) returns a maximum lookbehind of
       2, but when it is processed, the first lookbehind moves back by two
       characters, matches one character, then the nested lookbehind also
       moves back by two characters. This puts the matching point three
       characters earlier than it was at the start. PCRE2_INFO_MAXLOOKBEHIND
       is really only useful as a debugging tool. See the pcre2partial
       documentation for a discussion of multi-segment matching.

         PCRE2_INFO_MINLENGTH

       If a minimum length for matching subject strings was computed, its
       value is returned. Otherwise the returned value is 0. This value is not
       computed when PCRE2_NO_START_OPTIMIZE is set. The value is a number of
       characters, which in UTF mode may be different from the number of code
       units. The third argument should point to a uint32_t variable. The
       value is a lower bound to the length of any matching string. There may
       not be any strings of that length that do actually match, but every
       string that does match is at least that long.

         PCRE2_INFO_NAMECOUNT
         PCRE2_INFO_NAMEENTRYSIZE
         PCRE2_INFO_NAMETABLE

       PCRE2 supports the use of named as well as numbered capturing
       parentheses.
       The names are just an additional way of identifying the parentheses,
       which still acquire numbers. Several convenience functions such as
       pcre2_substring_get_byname() are provided for extracting captured
       substrings by name. It is also possible to extract the data directly,
       by first converting the name to a number in order to access the correct
       pointers in the output vector (described with pcre2_match() below). To
       do the conversion, you need to use the name-to-number map, which is
       described by these three values.

       The map consists of a number of fixed-size entries.
       PCRE2_INFO_NAMECOUNT gives the number of entries, and
       PCRE2_INFO_NAMEENTRYSIZE gives the size of each entry in code units;
       both of these return a uint32_t value. The entry size depends on the
       length of the longest name.

       PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table.
       This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit
       library, the first two bytes of each entry are the number of the
       capturing parenthesis, most significant byte first. In the 16-bit
       library, the pointer points to 16-bit code units, the first of which
       contains the parenthesis number. In the 32-bit library, the pointer
       points to 32-bit code units, the first of which contains the
       parenthesis number. The rest of the entry is the corresponding name,
       zero terminated.
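
       The entry layout for the 8-bit library can be walked with a few lines
       of C. The following is a minimal sketch, not library code: the helper
       name group_number_for_name() is hypothetical, and the three arguments
       table, count, and entry_size are assumed to hold the values obtained
       with PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT, and
       PCRE2_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: look up a group name in an 8-bit name table.
   Returns the group number, or -1 if the name is not in the table.
   If the table holds duplicate names, the first entry found is used. */
static int group_number_for_name(const uint8_t *table, uint32_t count,
  uint32_t entry_size, const char *name)
{
  assert(table != NULL || count == 0);
  for (uint32_t i = 0; i < count; i++)
    {
    const uint8_t *entry = table + (size_t)i * entry_size;
    /* Bytes 0 and 1 hold the group number, most significant byte first.
       The zero-terminated name starts at byte 2. */
    if (strcmp((const char *)(entry + 2), name) == 0)
      return (entry[0] << 8) | entry[1];
    }
  return -1;
}
```

       A linear scan keeps the sketch short. In real applications the library
       function pcre2_substring_number_from_name() performs this conversion,
       so hand-written code of this kind is needed only when the table has to
       be processed directly.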

       The names are in alphabetical order. If (?| is used to create multiple
       capture groups with the same number, as described in the section on
       duplicate group numbers in the pcre2pattern page, the groups may be
       given the same name, but there is only one entry in the table.
       Different names for groups of the same number are not permitted.

       Duplicate names for capture groups with different numbers are
       permitted, but only if PCRE2_DUPNAMES is set. They appear in the table
       in the order in which they were found in the pattern. In the absence of
       (?| this is the order of increasing number; when (?| is used this is
       not necessarily the case because later capture groups may have lower
       numbers.

       As a simple example of the name/number table, consider the following
       pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED
       is set, so white space - including newlines - is ignored):

         (?<date> (?<year>(\d\d)?\d\d) -
         (?<month>\d\d) - (?<day>\d\d) )

       There are four named capture groups, so the table has four entries, and
       each entry in the table is eight bytes long. The table is as follows,
       with non-printing bytes shown in hexadecimal, and undefined bytes shown
       as ??:

         00 01 d  a  t  e  00 ??
         00 05 d  a  y  00 ?? ??
         00 04 m  o  n  t  h  00
         00 02 y  e  a  r  00 ??

       When writing code to extract data from named capture groups using the
       name-to-number map, remember that the length of the entries is likely
       to be different for each compiled pattern.

         PCRE2_INFO_NEWLINE

       The output is one of the following uint32_t values:

         PCRE2_NEWLINE_CR       Carriage return (CR)
         PCRE2_NEWLINE_LF       Linefeed (LF)
         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
         PCRE2_NEWLINE_ANY      Any Unicode line ending
         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
         PCRE2_NEWLINE_NUL      The NUL character (binary zero)

       This identifies the character sequence that will be recognized as
       meaning "newline" while matching.

         PCRE2_INFO_SIZE

       Return the size of the compiled pattern in bytes (for all three
       libraries). The third argument should point to a size_t variable. This
       value includes the size of the general data block that precedes the
       code units of the compiled pattern itself.
       The value that is used when pcre2_compile() is getting memory in which
       to place the compiled pattern may be slightly larger than the value
       returned by this option, because there are cases where the code that
       calculates the size has to over-estimate. Processing a pattern with the
       JIT compiler does not alter the value returned by this option.


INFORMATION ABOUT A PATTERN'S CALLOUTS

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       A script language that supports the use of string arguments in callouts
       might like to scan all the callouts in a pattern before running the
       match. This can be done by calling pcre2_callout_enumerate(). The first
       argument is a pointer to a compiled pattern, the second points to a
       callback function, and the third is arbitrary user data. The callback
       function is called for every callout in the pattern in the order in
       which they appear. Its first argument is a pointer to a callout
       enumeration block, and its second argument is the user_data value that
       was passed to pcre2_callout_enumerate(). The contents of the callout
       enumeration block are described in the pcre2callout documentation,
       which also gives further details about callouts.


SERIALIZATION AND PRECOMPILING

       It is possible to save compiled patterns on disc or elsewhere, and
       reload them later, subject to a number of restrictions. The host on
       which the patterns are reloaded must be running the same version of
       PCRE2, with the same code unit width, and must also have the same
       endianness, pointer width, and PCRE2_SIZE type. Before compiled
       patterns can be saved, they must be converted to a "serialized" form,
       which in the case of PCRE2 is really just a bytecode dump. The
       functions whose names begin with pcre2_serialize_ are used for
       converting to and from the serialized form. They are described in the
       pcre2serialize documentation. Note that PCRE2 serialization does not
       convert compiled patterns to an abstract format like Java or .NET
       serialization.


THE MATCH DATA BLOCK

       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
         pcre2_general_context *gcontext);

       pcre2_match_data *pcre2_match_data_create_from_pattern(
         const pcre2_code *code, pcre2_general_context *gcontext);

       void pcre2_match_data_free(pcre2_match_data *match_data);

       Information about a successful or unsuccessful match is placed in a
       match data block, which is an opaque structure that is accessed by
       function calls. In particular, the match data block contains a vector
       of offsets into the subject string that define the matched parts of the
       subject. This is known as the ovector.

       Before calling pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()
       you must create a match data block by calling one of the creation
       functions above. For pcre2_match_data_create(), the first argument is
       the number of pairs of offsets in the ovector.

       When using pcre2_match(), one pair of offsets is required to identify
       the string that matched the whole pattern, with an additional pair for
       each captured substring. For example, a value of 4 creates enough space
       to record the matched portion of the subject plus three captured
       substrings.
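
       The pair arithmetic can be sketched as follows. Both helper names are
       hypothetical, size_t stands in for PCRE2_SIZE, and the sketch assumes
       the usual layout in which pair n occupies elements 2n and 2n+1 of the
       ovector, holding the start offset and the offset just past the end of
       the substring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of ovector pairs that pcre2_match() needs for
   a pattern whose PCRE2_INFO_CAPTURECOUNT value is capture_count, that is,
   one pair for the whole match plus one pair per capture group. */
static uint32_t ovector_pairs_needed(uint32_t capture_count)
{
  assert(capture_count < UINT32_MAX);
  return capture_count + 1;
}

/* Hypothetical helper: length in code units of the substring recorded in
   pair n. ASSUMPTION: the pair is set, so its end is not before its start. */
static size_t captured_length(const size_t *ovector, uint32_t n)
{
  assert(ovector[2 * n + 1] >= ovector[2 * n]);
  return ovector[2 * n + 1] - ovector[2 * n];
}
```

       For a pattern with three capture groups, ovector_pairs_needed(3) gives
       4, which is the value used in the example above.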

       When using pcre2_dfa_match() there may be multiple matched substrings
       of different lengths at the same point in the subject. The ovector
       should be made large enough to hold as many as are expected.

       A minimum of 1 pair is imposed by pcre2_match_data_create(), so it is
       always possible to return the overall matched string in the case of
       pcre2_match() or the longest match in the case of pcre2_dfa_match().
       The maximum number of pairs is 65535; if the first argument of
       pcre2_match_data_create() is greater than this, 65535 is used.

       The second argument of pcre2_match_data_create() is a pointer to a
       general context, which can specify custom memory management for
       obtaining the memory for the match data block. If you are not using
       custom memory management, pass NULL, which causes malloc() to be used.

       For pcre2_match_data_create_from_pattern(), the first argument is a
       pointer to a compiled pattern. The ovector is created to be exactly the
       right size to hold all the substrings a pattern might capture when
       matched using pcre2_match(). You should not use this call when matching
       with pcre2_dfa_match().
       The second argument is again a pointer to a general context, but in
       this case if NULL is passed, the memory is obtained using the same
       allocator that was used for the compiled pattern (custom or default).

       A match data block can be used many times, with the same or different
       compiled patterns. You can extract information from a match data block
       after a match operation has finished, using functions that are
       described in the sections on matched strings and other match data
       below.

       When a call of pcre2_match() fails, valid data is available in the
       match block only when the error is PCRE2_ERROR_NOMATCH,
       PCRE2_ERROR_PARTIAL, or one of the error codes for an invalid UTF
       string. Exactly what is available depends on the error, and is detailed
       below.

       When one of the matching functions is called, pointers to the compiled
       pattern and the subject string are set in the match data block so that
       they can be referenced by the extraction functions after a successful
       match. After running a match, you must not free a compiled pattern or a
       subject string until after all operations on the match data block (for
       that match) have taken place, unless, in the case of the subject
       string, you have used the PCRE2_COPY_MATCHED_SUBJECT option, which is
       described in the section entitled "Option bits for pcre2_match()"
       below.

       When a match data block itself is no longer needed, it should be freed
       by calling pcre2_match_data_free(). If this function is called with a
       NULL argument, it returns immediately, without doing anything.


MEMORY USE FOR MATCH DATA BLOCKS

       PCRE2_SIZE pcre2_get_match_data_size(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_match_data_heapframes_size(
         pcre2_match_data *match_data);

       The size of a match data block depends on the size of the ovector that
       it contains. The function pcre2_get_match_data_size() returns the size,
       in bytes, of the block that is its argument.

       When pcre2_match() runs interpretively (that is, without using JIT), it
       makes use of a vector of data frames for remembering backtracking
       positions. The size of each individual frame depends on the number of
       capturing parentheses in the pattern and can be obtained by calling
       pcre2_pattern_info() with the PCRE2_INFO_FRAMESIZE option (see the
       section entitled "Information about a compiled pattern" above).

       Heap memory is used for the frames vector; if the initial memory block
       turns out to be too small during matching, it is automatically
       expanded.
       When pcre2_match() returns, the memory is not freed, but remains
       attached to the match data block, for use by any subsequent matches
       that use the same block. It is automatically freed when the match data
       block itself is freed.

       You can find the current size of the frames vector that a match data
       block owns by calling pcre2_get_match_data_heapframes_size(). For a
       newly created match data block the size will be zero. Some types of
       match may require a lot of frames and thus a large vector; applications
       that run in environments where memory is constrained can check this and
       free the match data block if the heap frames vector has become too big.


MATCHING A PATTERN: THE TRADITIONAL FUNCTION

       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       The function pcre2_match() is called to match a subject string against
       a compiled pattern, which is passed in the code argument. You can call
       pcre2_match() with the same code argument as many times as you like, in
       order to find multiple matches in the subject string or to match
       different subject strings with the same pattern.

       This function is the main matching facility of the library, and it
       operates in a Perl-like manner. For specialist use there is also an
       alternative matching function, which is described below in the section
       about the pcre2_dfa_match() function.

       Here is an example of a simple call to pcre2_match():

         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
         int rc = pcre2_match(
           re,             /* result of pcre2_compile() */
           "some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           md,             /* the match data block */
           NULL);          /* a match context; NULL means use defaults */

       If the subject string is zero-terminated, the length can be given as
       PCRE2_ZERO_TERMINATED. A match context must be provided if certain less
       common matching parameters are to be changed. For details, see the
       section on the match context above.

   The string to be matched by pcre2_match()

       The subject string is passed to pcre2_match() as a pointer in subject,
       a length in length, and a starting offset in startoffset. The length
       and offset are in code units, not characters.
       That is, they are in bytes for the 8-bit library, 16-bit code units
       for the 16-bit library, and 32-bit code units for the 32-bit library,
       whether or not UTF processing is enabled. As a special case, if subject
       is NULL and length is zero, the subject is assumed to be an empty
       string. If length is non-zero, an error occurs if subject is NULL.

       If startoffset is greater than the length of the subject, pcre2_match()
       returns PCRE2_ERROR_BADOFFSET. When the starting offset is zero, the
       search for a match starts at the beginning of the subject, and this is
       by far the most common case. In UTF-8 or UTF-16 mode, the starting
       offset must point to the start of a character, or to the end of the
       subject (in UTF-32 mode, one code unit equals one character, so all
       offsets are valid). Like the pattern string, the subject may contain
       binary zeros.

       A non-zero starting offset is useful when searching for another match
       in the same subject by calling pcre2_match() again after a previous
       success. Setting startoffset differs from passing over a shortened
       string and setting PCRE2_NOTBOL in the case of a pattern that begins
       with any kind of lookbehind. For example, consider the pattern

         \Biss\B

       which finds occurrences of "iss" in the middle of words.
       (\B matches only if the current position in the subject is not a word
       boundary.) When applied to the string "Mississippi" the first call to
       pcre2_match() finds the first occurrence. If pcre2_match() is called
       again with just the remainder of the subject, namely "issippi", it does
       not match, because \B is always false at the start of the subject,
       which is deemed to be a word boundary. However, if pcre2_match() is
       passed the entire string again, but with startoffset set to 4, it finds
       the second occurrence of "iss" because it is able to look behind the
       starting point to discover that it is preceded by a letter.

       Finding all the matches in a subject is tricky when the pattern can
       match an empty string. It is possible to emulate Perl's /g behaviour by
       first trying the match again at the same offset, with the
       PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED options, and then if that
       fails, advancing the starting offset and trying an ordinary match
       again. There is some code that demonstrates how to do this in the
       pcre2demo sample program. In the most general case, you have to check
       to see if the newline convention recognizes CRLF as a newline, and if
       so, and the current character is CR followed by LF, advance the
       starting offset by two characters instead of one.

       If a non-zero starting offset is passed when the pattern is anchored, a
       single attempt to match at the given offset is made. This can only
       succeed if the pattern does not require the match to be at the start of
       the subject. In other words, the anchoring must be the result of
       setting the PCRE2_ANCHORED option or the use of .* with PCRE2_DOTALL,
       not by starting the pattern with ^ or \A.

   Option bits for pcre2_match()

       The unused bits of the options argument for pcre2_match() must be zero.
       The only bits that may be set are PCRE2_ANCHORED,
       PCRE2_COPY_MATCHED_SUBJECT, PCRE2_DISABLE_RECURSELOOP_CHECK,
       PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
       PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK,
       PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described
       below.

       Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not
       supported by the just-in-time (JIT) compiler. If it is set, JIT
       matching is disabled and the interpretive code in pcre2_match() is run.
       PCRE2_DISABLE_RECURSELOOP_CHECK is ignored by JIT, but apart from
       PCRE2_NO_JIT (obviously), the remaining options are supported for JIT
       matching.

         PCRE2_ANCHORED

       The PCRE2_ANCHORED option limits pcre2_match() to matching at the first
       matching position. If a pattern was compiled with PCRE2_ANCHORED, or
       turned out to be anchored by virtue of its contents, it cannot be made
       unanchored at matching time. Note that setting the option at match time
       disables JIT matching.

         PCRE2_COPY_MATCHED_SUBJECT

       By default, a pointer to the subject is remembered in the match data
       block so that, after a successful match, it can be referenced by the
       substring extraction functions. This means that the subject's memory
       must not be freed until all such operations are complete. For some
       applications where the lifetime of the subject string is not
       guaranteed, it may be necessary to make a copy of the subject string,
       but it is wasteful to do this unless the match is successful. After a
       successful match, if PCRE2_COPY_MATCHED_SUBJECT is set, the subject is
       copied and the new pointer is remembered in the match data block
       instead of the original subject pointer. The memory allocator that was
       used for the match block itself is used. The copy is automatically
       freed when pcre2_match_data_free() is called to free the match data
       block. It is also automatically freed if the match data block is
       re-used for another match operation.

         PCRE2_DISABLE_RECURSELOOP_CHECK

       This option is relevant only to pcre2_match() for interpretive
       matching. It is ignored when JIT is used, and is forbidden for
       pcre2_dfa_match().

       The use of recursion in patterns can lead to infinite loops. In the
       interpretive matcher these would be eventually caught by the match or
       heap limits, but this could take a long time and/or use a lot of memory
       if the limits are large. There is therefore a check at the start of
       each recursion. If the same group is still active from a previous
       call, and the current subject pointer is the same as it was at the
       start of that group, and the furthest inspected character of the
       subject has not changed, an error is generated.

       There are rare cases of matches that would complete, but nevertheless
       trigger this error. This option disables the check. It is provided
       mainly for testing when comparing JIT and interpretive behaviour.

         PCRE2_ENDANCHORED

       If the PCRE2_ENDANCHORED option is set, any string that pcre2_match()
       matches must be right at the end of the subject string. Note that
       setting the option at match time disables JIT matching.

         PCRE2_NOTBOL

       This option specifies that the first character of the subject string
       is not the beginning of a line, so the circumflex metacharacter should
       not match before it. Setting this without having set PCRE2_MULTILINE at
       compile time causes circumflex never to match. This option affects only
       the behaviour of the circumflex metacharacter. It does not affect \A.

         PCRE2_NOTEOL

       This option specifies that the end of the subject string is not the end
       of a line, so the dollar metacharacter should not match it nor (except
       in multiline mode) a newline immediately before it. Setting this
       without having set PCRE2_MULTILINE at compile time causes dollar never
       to match. This option affects only the behaviour of the dollar
       metacharacter. It does not affect \Z or \z.

         PCRE2_NOTEMPTY

       An empty string is not considered to be a valid match if this option is
       set. If there are alternatives in the pattern, they are tried. If all
       the alternatives match the empty string, the entire match fails. For
       example, if the pattern

         a?b?

       is applied to a string not beginning with "a" or "b", it matches an
       empty string at the start of the subject.
       With PCRE2_NOTEMPTY set, this match is not valid, so pcre2_match()
       searches further into the string for occurrences of "a" or "b".

         PCRE2_NOTEMPTY_ATSTART

       This is like PCRE2_NOTEMPTY, except that it locks out an empty string
       match only at the first matching position, that is, at the start of the
       subject plus the starting offset. An empty string match later in the
       subject is permitted. If the pattern is anchored, such a match can
       occur only if the pattern contains \K.

         PCRE2_NO_JIT

       By default, if a pattern has been successfully processed by
       pcre2_jit_compile(), JIT is automatically used when pcre2_match() is
       called with options that JIT supports. Setting PCRE2_NO_JIT disables
       the use of JIT; it forces matching to be done by the interpreter.

         PCRE2_NO_UTF_CHECK

       When PCRE2_UTF is set at compile time, the validity of the subject as a
       UTF string is checked unless PCRE2_NO_UTF_CHECK is passed to
       pcre2_match() or PCRE2_MATCH_INVALID_UTF was passed to pcre2_compile().
       The latter special case is discussed in detail in the pcre2unicode
       documentation.

       In the default case, if a non-zero starting offset is given, the check
       is applied only to that part of the subject that could be inspected
       during matching, and there is a check that the starting offset points
       to the first code unit of a character or to the end of the subject. If
       there are no lookbehind assertions in the pattern, the check starts at
       the starting offset. Otherwise, it starts at the length of the longest
       lookbehind before the starting offset, or at the start of the subject
       if there are not that many characters before the starting offset. Note
       that the sequences \b and \B are one-character lookbehinds.

       The check is carried out before any other processing takes place, and a
       negative error code is returned if the check fails. There are several
       UTF error codes for each code unit width, corresponding to different
       problems with the code unit sequence. There are discussions about the
       validity of UTF-8 strings, UTF-16 strings, and UTF-32 strings in the
       pcre2unicode documentation.

       If you know that your subject is valid, and you want to skip this check
       for performance reasons, you can set the PCRE2_NO_UTF_CHECK option when
       calling pcre2_match(). You might want to do this for the second and
       subsequent calls to pcre2_match() if you are making repeated calls to
       find multiple matches in the same subject string.

       Warning: Unless PCRE2_MATCH_INVALID_UTF was set at compile time, when
       PCRE2_NO_UTF_CHECK is set at match time the effect of passing an
       invalid string as a subject, or an invalid value of startoffset, is
       undefined. Your program may crash or loop indefinitely or give wrong
       results.

         PCRE2_PARTIAL_HARD
         PCRE2_PARTIAL_SOFT

       These options turn on the partial matching feature. A partial match
       occurs if the end of the subject string is reached successfully, but
       there are not enough subject characters to complete the match. In
       addition, either at least one character must have been inspected or the
       pattern must contain a lookbehind, or the pattern must be one that
       could match an empty string.

       If this situation arises when PCRE2_PARTIAL_SOFT (but not
       PCRE2_PARTIAL_HARD) is set, matching continues by testing any remaining
       alternatives. Only if no complete match can be found is
       PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other
       words, PCRE2_PARTIAL_SOFT specifies that the caller is prepared to
       handle a partial match, but only if no complete match can be found.

       If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT.
       In this case, if a partial match is found, pcre2_match() immediately
       returns PCRE2_ERROR_PARTIAL, without considering any other
       alternatives. In other words, when PCRE2_PARTIAL_HARD is set, a partial
       match is considered to be more important than an alternative complete
       match.

       There is a more detailed discussion of partial and multi-segment
       matching, with examples, in the pcre2partial documentation.


NEWLINE HANDLING WHEN MATCHING

       When PCRE2 is built, a default newline convention is set; this is
       usually the standard convention for the operating system. The default
       can be overridden in a compile context by calling pcre2_set_newline().
       It can also be overridden by starting a pattern string with, for
       example, (*CRLF), as described in the section on newline conventions in
       the pcre2pattern page. During matching, the newline choice affects the
       behaviour of the dot, circumflex, and dollar metacharacters. It may
       also alter the way the match starting position is advanced after a
       match failure for an unanchored pattern.

       When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
       set as the newline convention, and a match attempt for an unanchored
       pattern fails when the current starting position is at a CRLF sequence,
       and the pattern contains no explicit matches for CR or LF characters,
       the match position is advanced by two characters instead of one, in
       other words, to after the CRLF.

       The above rule is a compromise that makes the most common cases work as
       expected. For example, if the pattern is .+A (and the PCRE2_DOTALL
       option is not set), it does not match the string "\r\nA" because, after
       failing at the start, it skips both the CR and the LF before retrying.
       However, the pattern [\r\n]A does match that string, because it
       contains an explicit CR or LF reference, and so advances only by one
       character after the first failure.

       An explicit match for CR or LF is either a literal appearance of one of
       those characters in the pattern, or one of the \r or \n or equivalent
       octal or hexadecimal escape sequences. Implicit matches such as [^X] do
       not count, nor does \s, even though it includes CR and LF in the
       characters that it matches.

       Notwithstanding the above, anomalous effects may still occur when CRLF
       is a valid newline sequence and explicit \r or \n escapes appear in the
       pattern.


HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS

       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);

       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);

       In general, a pattern matches a certain portion of the subject, and in
       addition, further substrings from the subject may be picked out by
       parenthesized parts of the pattern. Following the usage in Jeffrey
       Friedl's book, this is called "capturing" in what follows, and the
       phrase "capture group" (Perl terminology) is used for a fragment of a
       pattern that picks out a substring. PCRE2 supports several other kinds
       of parenthesized group that do not cause substrings to be captured. The
       pcre2_pattern_info() function can be used to find out how many capture
       groups there are in a compiled pattern.

       You can use auxiliary functions for accessing captured substrings by
       number or by name, as described in sections below.

       Alternatively, you can make direct use of the vector of PCRE2_SIZE
       values, called the ovector, which contains the offsets of captured
       strings. It is part of the match data block. The function
       pcre2_get_ovector_pointer() returns the address of the ovector, and
       pcre2_get_ovector_count() returns the number of pairs of values it
       contains.

       Within the ovector, the first in each pair of values is set to the
       offset of the first code unit of a substring, and the second is set to
       the offset of the first code unit after the end of a substring. These
       values are always code unit offsets, not character offsets. That is,
       they are byte offsets in the 8-bit library, 16-bit offsets in the
       16-bit library, and 32-bit offsets in the 32-bit library.

       After a partial match (error return PCRE2_ERROR_PARTIAL), only the
       first pair of offsets (that is, ovector[0] and ovector[1]) are set.
       They identify the part of the subject that was partially matched. See
       the pcre2partial documentation for details of partial matching.

       After a fully successful match, the first pair of offsets identifies
       the portion of the subject string that was matched by the entire
       pattern. The next pair is used for the first captured substring, and so
       on.
       The value returned by pcre2_match() is one more than the highest
       numbered pair that has been set. For example, if two substrings have
       been captured, the returned value is 3. If there are no captured
       substrings, the return value from a successful match is 1, indicating
       that just the first pair of offsets has been set.

       If a pattern uses the \K escape sequence within a positive assertion,
       the reported start of a successful match can be greater than the end of
       the match. For example, if the pattern (?=ab\K) is matched against
       "ab", the start and end offset values for the match are 2 and 0.

       If a capture group is matched repeatedly within a single match
       operation, it is the last portion of the subject that it matched that
       is returned.

       If the ovector is too small to hold all the captured substring offsets,
       as much as possible is filled in, and the function returns a value of
       zero. If captured substrings are not of interest, pcre2_match() may be
       called with a match data block whose ovector is of minimum length (that
       is, one pair).

       It is possible for capture group number n+1 to match some part of the
       subject when group n has not been used at all.
       For example, if the string "abc" is matched against the pattern
       (a|(z))(bc) the return from the function is 4, and groups 1 and 3 are
       matched, but 2 is not. When this happens, both values in the offset
       pairs corresponding to unused groups are set to PCRE2_UNSET.

       Offset values that correspond to unused groups at the end of the
       expression are also set to PCRE2_UNSET. For example, if the string
       "abc" is matched against the pattern (abc)(x(yz)?)? groups 2 and 3 are
       not matched. The return from the function is 2, because the highest
       used capture group number is 1. The offsets for the second and third
       capture groups (assuming the vector is large enough, of course) are set
       to PCRE2_UNSET.

       Elements in the ovector that do not correspond to capturing parentheses
       in the pattern are never changed. That is, if a pattern contains n
       capturing parentheses, no more than ovector[0] to ovector[2n+1] are set
       by pcre2_match(). The other elements retain whatever values they
       previously had. After a failed match attempt, the contents of the
       ovector are unchanged.


OTHER INFORMATION ABOUT A MATCH

       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);

       As well as the offsets in the ovector, other information about a match
       is retained in the match data block and can be retrieved by the above
       functions in appropriate circumstances. If they are called at other
       times, the result is undefined.

       After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a
       failure to match (PCRE2_ERROR_NOMATCH), a mark name may be available.
       The function pcre2_get_mark() can be called to access this name, which
       can be specified in the pattern by any of the backtracking control
       verbs, not just (*MARK). The same function applies to all the verbs. It
       returns a pointer to the zero-terminated name, which is within the
       compiled pattern. If no name is available, NULL is returned. The length
       of the name (excluding the terminating zero) is stored in the code unit
       that precedes the name. You should use this length instead of relying
       on the terminating zero if the name might contain a binary zero.

       After a successful match, the name that is returned is the last mark
       name encountered on the matching path through the pattern.
       Instances of backtracking verbs without names do not count. Thus, for
       example, if the matching path contains (*MARK:A)(*PRUNE), the name "A"
       is returned. After a "no match" or a partial match, the last
       encountered name is returned. For example, consider this pattern:

         ^(*MARK:A)((*MARK:B)a|b)c

       When it matches "bc", the returned name is A. The B mark is "seen" in
       the first branch of the group, but it is not on the matching path. On
       the other hand, when this pattern fails to match "bx", the returned
       name is B.

       Warning: By default, certain start-of-match optimizations are used to
       give a fast "no match" result in some situations. For example, if the
       anchoring is removed from the pattern above, there is an initial check
       for the presence of "c" in the subject before running the matching
       engine. This check fails for "bx", causing a match failure without
       seeing any marks. You can disable the start-of-match optimizations by
       setting the PCRE2_NO_START_OPTIMIZE option for pcre2_compile() or by
       starting the pattern with (*NO_START_OPT).

       After a successful match, a partial match, or one of the invalid UTF
       errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can
       be called.
       After a successful or partial match it returns the code unit offset of
       the character at which the match started. For a non-partial match, this
       can be different to the value of ovector[0] if the pattern contains the
       \K escape sequence. After a partial match, however, this value is
       always the same as ovector[0] because \K does not affect the result of
       a partial match.

       After a UTF check failure, pcre2_get_startchar() can be used to obtain
       the code unit offset of the invalid UTF character. Details are given in
       the pcre2unicode page.


ERROR RETURNS FROM pcre2_match()

       If pcre2_match() fails, it returns a negative number. This can be
       converted to a text string by calling the pcre2_get_error_message()
       function (see "Obtaining a textual error message" below). Negative
       error codes are also returned by other functions, and are documented
       with them. The codes are given names in the header file. If UTF
       checking is in force and an invalid UTF subject string is detected, one
       of a number of UTF-specific negative error codes is returned. Details
       are given in the pcre2unicode page. The following are the other errors
       that may be returned by pcre2_match():

         PCRE2_ERROR_NOMATCH

       The subject string did not match the pattern.

         PCRE2_ERROR_PARTIAL

       The subject string did not match, but it did match partially. See the
       pcre2partial documentation for details of partial matching.

         PCRE2_ERROR_BADMAGIC

       PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
       to catch the case when it is passed a junk pointer. This is the error
       that is returned when the magic number is not present.

         PCRE2_ERROR_BADMODE

       This error is given when a compiled pattern is passed to a function in
       a library of a different code unit width, for example, a pattern
       compiled by the 8-bit library is passed to a 16-bit or 32-bit library
       function.

         PCRE2_ERROR_BADOFFSET

       The value of startoffset was greater than the length of the subject.

         PCRE2_ERROR_BADOPTION

       An unrecognized bit was set in the options argument.

         PCRE2_ERROR_BADUTFOFFSET

       The UTF code unit sequence that was passed as a subject was checked and
       found to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the
       value of startoffset did not point to the beginning of a UTF character
       or the end of the subject.

         PCRE2_ERROR_CALLOUT

       This error is never generated by pcre2_match() itself. It is provided
       for use by callout functions that want to cause pcre2_match() or
       pcre2_callout_enumerate() to return a distinctive error code. See the
       pcre2callout documentation for details.

         PCRE2_ERROR_DEPTHLIMIT

       The nested backtracking depth limit was reached.

         PCRE2_ERROR_HEAPLIMIT

       The heap limit was reached.

         PCRE2_ERROR_INTERNAL

       An unexpected internal error has occurred. This error could be caused
       by a bug in PCRE2 or by overwriting of the compiled pattern.

         PCRE2_ERROR_JIT_STACKLIMIT

       This error is returned when a pattern that was successfully processed
       by the JIT compiler is being matched, but the memory available for the
       just-in-time processing stack is not large enough. See the pcre2jit
       documentation for more details.

         PCRE2_ERROR_MATCHLIMIT

       The backtracking match limit was reached.

         PCRE2_ERROR_NOMEMORY

       Heap memory is used to remember backtracking points.
       This error is given when the memory allocation function (default or
       custom) fails. Note that a different error, PCRE2_ERROR_HEAPLIMIT, is
       given if the amount of memory needed exceeds the heap limit.
       PCRE2_ERROR_NOMEMORY is also returned if PCRE2_COPY_MATCHED_SUBJECT is
       set and memory allocation fails.

         PCRE2_ERROR_NULL

       Either the code, subject, or match_data argument was passed as NULL.

         PCRE2_ERROR_RECURSELOOP

       This error is returned when pcre2_match() detects a recursion loop
       within the pattern. Specifically, it means that either the whole
       pattern or a capture group has been called recursively for the second
       time at the same position in the subject string. Some simple patterns
       that might do this are detected and faulted at compile time, but more
       complicated cases, in particular mutual recursions between two
       different groups, cannot be detected until matching is attempted.


OBTAINING A TEXTUAL ERROR MESSAGE

       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
         PCRE2_SIZE bufflen);

       A text message for an error code from any PCRE2 function (compile,
       match, or auxiliary) can be obtained by calling
       pcre2_get_error_message().
       The code is passed as the first argument, with the remaining
       two arguments specifying a code unit buffer and its length in code
       units, into which the text message is placed. The message is returned
       in code units of the appropriate width for the library that is being
       used.

       The returned message is terminated with a trailing zero, and the
       function returns the number of code units used, excluding the trailing
       zero. If the error number is unknown, the negative error code
       PCRE2_ERROR_BADDATA is returned. If the buffer is too small, the
       message is truncated (but still with a trailing zero), and the negative
       error code PCRE2_ERROR_NOMEMORY is returned. None of the messages are
       very long; a buffer size of 120 code units is ample.


EXTRACTING CAPTURED SUBSTRINGS BY NUMBER

       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_SIZE *length);

       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR *buffer,
         PCRE2_SIZE *bufflen);

       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR **bufferptr,
         PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       Captured substrings can be accessed directly by using the ovector as
       described above. For convenience, auxiliary functions are provided for
       extracting captured substrings as new, separate, zero-terminated
       strings. A substring that contains a binary zero is correctly extracted
       and has a further zero added on the end, but the result is not, of
       course, a C string.

       The functions in this section identify substrings by number. The number
       zero refers to the entire matched substring, with higher numbers
       referring to substrings captured by parenthesized groups. After a
       partial match, only substring zero is available. An attempt to extract
       any other substring gives the error PCRE2_ERROR_PARTIAL.
       The next section
       describes similar functions for extracting captured substrings by name.

       If a pattern uses the \K escape sequence within a positive assertion,
       the reported start of a successful match can be greater than the end of
       the match. For example, if the pattern (?=ab\K) is matched against
       "ab", the start and end offset values for the match are 2 and 0. In
       this situation, calling these functions with a zero substring number
       extracts a zero-length empty string.

       You can find the length in code units of a captured substring without
       extracting it by calling pcre2_substring_length_bynumber(). The first
       argument is a pointer to the match data block, the second is the group
       number, and the third is a pointer to a variable into which the length
       is placed. If you just want to know whether or not the substring has
       been captured, you can pass the third argument as NULL.

       The pcre2_substring_copy_bynumber() function copies a captured
       substring into a supplied buffer, whereas pcre2_substring_get_bynumber()
       copies it into new memory, obtained using the same memory allocation
       function that was used for the match data block. The first two
       arguments of these functions are a pointer to the match data block and
       a capture group number.

       The final arguments of pcre2_substring_copy_bynumber() are a pointer to
       the buffer and a pointer to a variable that contains its length in code
       units. This is updated to contain the actual number of code units used
       for the extracted substring, excluding the terminating zero.

       For pcre2_substring_get_bynumber() the third and fourth arguments point
       to variables that are updated with a pointer to the new memory and the
       number of code units that comprise the substring, again excluding the
       terminating zero. When the substring is no longer needed, the memory
       should be freed by calling pcre2_substring_free().

       The return value from all these functions is zero for success, or a
       negative error code. If the pattern match failed, the match failure
       code is returned. If a substring number greater than zero is used
       after a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible
       error codes are:

         PCRE2_ERROR_NOMEMORY

       The buffer was too small for pcre2_substring_copy_bynumber(), or the
       attempt to get memory failed for pcre2_substring_get_bynumber().

         PCRE2_ERROR_NOSUBSTRING

       There is no substring with that number in the pattern, that is, the
       number is greater than the number of capturing parentheses.

         PCRE2_ERROR_UNAVAILABLE

       The substring number, though not greater than the number of captures in
       the pattern, is greater than the number of slots in the ovector, so the
       substring could not be captured.

         PCRE2_ERROR_UNSET

       The substring did not participate in the match. For example, if the
       pattern is (abc)|(def) and the subject is "def", and the ovector
       contains at least two capturing slots, substring number 1 is unset.


EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS

       int pcre2_substring_list_get(pcre2_match_data *match_data,
         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);

       void pcre2_substring_list_free(PCRE2_UCHAR **list);

       The pcre2_substring_list_get() function extracts all available
       substrings and builds a list of pointers to them. It also (optionally)
       builds a second list that contains their lengths (in code units),
       excluding a terminating zero that is added to each of them. All this is
       done in a single block of memory that is obtained using the same memory
       allocation function that was used to get the match data block.

       This function must be called only after a successful match.
       If called
       after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.

       The address of the memory block is returned via listptr, which is also
       the start of the list of string pointers. The end of the list is marked
       by a NULL pointer. The address of the list of lengths is returned via
       lengthsptr. If your strings do not contain binary zeros and you do not
       therefore need the lengths, you may supply NULL as the lengthsptr
       argument to disable the creation of a list of lengths. The yield of the
       function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the
       memory block could not be obtained. When the list is no longer needed,
       it should be freed by calling pcre2_substring_list_free().

       If this function encounters a substring that is unset, which can happen
       when capture group number n+1 matches some part of the subject, but
       group n has not been used at all, it returns an empty string. This can
       be distinguished from a genuine zero-length substring by inspecting the
       appropriate offset in the ovector, which contains PCRE2_UNSET for unset
       substrings, or by calling pcre2_substring_length_bynumber().


EXTRACTING CAPTURED SUBSTRINGS BY NAME

       int pcre2_substring_number_from_name(const pcre2_code *code,
         PCRE2_SPTR name);

       int pcre2_substring_length_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_SIZE *length);

       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

       int pcre2_substring_get_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       To extract a substring by name, you first have to find the associated
       number. For example, for this pattern:

         (a+)b(?<xxx>\d+)...

       the number of the capture group called "xxx" is 2. If the name is known
       to be unique (PCRE2_DUPNAMES was not set), you can find the number from
       the name by calling pcre2_substring_number_from_name(). The first
       argument is the compiled pattern, and the second is the name. The yield
       of the function is the group number, PCRE2_ERROR_NOSUBSTRING if there
       is no group with that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if there
       is more than one group with that name.
       Given the number, you can extract
       the substring directly from the ovector, or use one of the "bynumber"
       functions described above.

       For convenience, there are also "byname" functions that correspond to
       the "bynumber" functions, the only difference being that the second
       argument is a name instead of a number. If PCRE2_DUPNAMES is set and
       there are duplicate names, these functions scan all the groups with the
       given name, and return the captured substring from the first named
       group that is set.

       If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
       returned. If all groups with the name have numbers that are greater
       than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is
       returned. If there is at least one group with a slot in the ovector,
       but no group is found to be set, PCRE2_ERROR_UNSET is returned.

       Warning: If the pattern uses the (?| feature to set up multiple capture
       groups with the same number, as described in the section on duplicate
       group numbers in the pcre2pattern page, you cannot use names to
       distinguish the different capture groups, because names are not
       included in the compiled code. The matching process uses only numbers.
       For this reason, the use of different names for groups with the same
       number causes an error at compile time.


CREATING A NEW STRING WITH SUBSTITUTIONS

       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext, PCRE2_SPTR replacement,
         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
         PCRE2_SIZE *outlengthptr);

       This function optionally calls pcre2_match() and then makes a copy of
       the subject string in outputbuffer, replacing parts that were matched
       with the replacement string, whose length is supplied in rlength, which
       can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. As
       a special case, if replacement is NULL and rlength is zero, the
       replacement is assumed to be an empty string. If rlength is non-zero,
       an error occurs if replacement is NULL.

       There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to
       return just the replacement string(s). The default action is to perform
       just one replacement if the pattern matches, but there is an option
       that requests multiple replacements (see PCRE2_SUBSTITUTE_GLOBAL
       below).

       If successful, pcre2_substitute() returns the number of substitutions
       that were carried out.
       This may be zero if no match was found, and is
       never greater than one unless PCRE2_SUBSTITUTE_GLOBAL is set. A
       negative value is returned if an error is detected.

       Matches in which a \K item in a lookahead in the pattern causes the
       match to end before it starts are not supported, and give rise to an
       error return. For global replacements, matches in which \K in a
       lookbehind causes the match to start earlier than the point that was
       reached in the previous iteration are also not supported.

       The first seven arguments of pcre2_substitute() are the same as for
       pcre2_match(), except that the partial matching options are not
       permitted, and match_data may be passed as NULL, in which case a match
       data block is obtained and freed within this function, using memory
       management functions from the match context, if provided, or else those
       that were used to allocate memory for the compiled code.

       If match_data is not NULL and PCRE2_SUBSTITUTE_MATCHED is not set, the
       provided block is used for all calls to pcre2_match(), and its contents
       afterwards are the result of the final call. For global changes, this
       will always be a no-match error. The contents of the ovector within the
       match data block may or may not have been changed.

       As well as the usual options for pcre2_match(), a number of additional
       options can be set in the options argument of pcre2_substitute(). One
       such option is PCRE2_SUBSTITUTE_MATCHED. When this is set, an external
       match_data block must be provided, and it must have already been used
       for an external call to pcre2_match() with the same pattern and subject
       arguments. The data in the match_data block (return code, offset
       vector) is then used for the first substitution instead of calling
       pcre2_match() from within pcre2_substitute(). This allows an
       application to check for a match before choosing to substitute, without
       having to repeat the match.

       The contents of the externally supplied match data block are not
       changed when PCRE2_SUBSTITUTE_MATCHED is set. If
       PCRE2_SUBSTITUTE_GLOBAL is also set, pcre2_match() is called after the
       first substitution to check for further matches, but this is done using
       an internally obtained match data block, thus always leaving the
       external block unchanged.

       The code argument is not used for matching before the first
       substitution when PCRE2_SUBSTITUTE_MATCHED is set, but it must be
       provided, even when PCRE2_SUBSTITUTE_GLOBAL is not set, because it
       contains information such as the UTF setting and the number of
       capturing parentheses in the pattern.

       The default action of pcre2_substitute() is to return a copy of the
       subject string with matched substrings replaced. However, if
       PCRE2_SUBSTITUTE_REPLACEMENT_ONLY is set, only the replacement
       substrings are returned. In the global case, multiple replacements are
       concatenated in the output buffer. Substitution callouts (see below)
       can be used to separate them if necessary.

       The outlengthptr argument of pcre2_substitute() must point to a
       variable that contains the length, in code units, of the output buffer.
       If the function is successful, the value is updated to contain the
       length in code units of the new string, excluding the trailing zero
       that is automatically added.

       If the function is not successful, the value set via outlengthptr
       depends on the type of error. For syntax errors in the replacement
       string, the value is the offset in the replacement string where the
       error was detected. For other errors, the value is PCRE2_UNSET by
       default.
       This includes the case of the output buffer being too small,
       unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set.

       PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output
       buffer is too small. The default action is to return
       PCRE2_ERROR_NOMEMORY immediately. If this option is set, however,
       pcre2_substitute() continues to go through the motions of matching and
       substituting (without, of course, writing anything) in order to compute
       the size of buffer that is needed. This value is passed back via the
       outlengthptr variable, with the result of the function still being
       PCRE2_ERROR_NOMEMORY.

       Passing a buffer size of zero is a permitted way of finding out how
       much memory is needed for a given substitution. However, this does mean
       that the entire operation is carried out twice. Depending on the
       application, it may be more efficient to allocate a large buffer and
       free the excess afterwards, instead of using
       PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.

       The replacement string, which is interpreted as a UTF string in UTF
       mode, is checked for UTF validity unless PCRE2_NO_UTF_CHECK is set. An
       invalid UTF replacement string causes an immediate return with the
       relevant UTF error code.

       If PCRE2_SUBSTITUTE_LITERAL is set, the replacement string is not
       interpreted in any way. By default, however, a dollar character is an
       escape character that can specify the insertion of characters from
       capture groups and names from (*MARK) or other control verbs in the
       pattern. Dollar is the only escape character (backslash is treated as
       literal). The following forms are always recognized:

         $$                  insert a dollar character
         $<n> or ${<n>}      insert the contents of group <n>
         $*MARK or ${*MARK}  insert a control verb name

       Either a group number or a group name can be given for <n>. Curly
       brackets are required only if the following character would be
       interpreted as part of the number or name. The number may be zero to
       include the entire matched string. For example, if the pattern a(b)c is
       matched with "=abc=" and the replacement string "+$1$0$1+", the result
       is "=+babcb+=".

       $*MARK inserts the name from the last encountered backtracking control
       verb on the matching path that has a name. (*MARK) must always include
       a name, but the other verbs need not. For example, in the case of
       (*MARK:A)(*PRUNE) the name inserted is "A", but for (*MARK:A)(*PRUNE:B)
       the relevant name is "B".
       This facility can be used to perform simple
       simultaneous substitutions, as this pcre2test example shows:

         /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK}
             apple lemon
          2: pear orange

       PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
       string, replacing every matching substring. If this option is not set,
       only the first matching substring is replaced. The search for matches
       takes place in the original subject string (that is, previous
       replacements do not affect it). Iteration is implemented by advancing
       the startoffset value for each search, which is always passed the
       entire subject string. If an offset limit is set in the match context,
       searching stops when that limit is reached.

       You can restrict the effect of a global substitution to a portion of
       the subject string by setting either or both of startoffset and an
       offset limit. Here is a pcre2test example:

         /B/g,replace=!,use_offset_limit
         ABC ABC ABC ABC\=offset=3,offset_limit=12
          2: ABC A!C A!C ABC

       When continuing with global substitutions after matching a substring
       with zero length, an attempt to find a non-empty match at the same
       offset is performed.
       If this is not successful, the offset is advanced by
       one character except when CRLF is a valid newline sequence and the next
       two characters are CR, LF. In this case, the offset is advanced by two
       characters.

       PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capture groups that
       do not appear in the pattern to be treated as unset groups. This option
       should be used with care, because it means that a typo in a group name
       or number no longer causes the PCRE2_ERROR_NOSUBSTRING error.

       PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including un-
       known groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated
       as empty strings when inserted as described above. If this option is
       not set, an attempt to insert an unset group causes the PCRE2_ERROR_UN-
       SET error. This option does not influence the extended substitution
       syntax described below.

       PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
       replacement string. Without this option, only the dollar character is
       special, and only the group insertion forms listed above are valid.
       When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:

       Firstly, backslash in a replacement string is interpreted as an escape
       character.
       The usual forms such as \n or \x{ddd} can be used to specify
       particular character codes, and backslash followed by any non-alphanu-
       meric character quotes that character. Extended quoting can be coded
       using \Q...\E, exactly as in pattern strings.

       There are also four escape sequences for forcing the case of inserted
       letters. The insertion mechanism has three states: no case forcing,
       force upper case, and force lower case. The escape sequences change the
       current state: \U and \L change to upper or lower case forcing, respec-
       tively, and \E (when not terminating a \Q quoted sequence) reverts to
       no case forcing. The sequences \u and \l force the next character (if
       it is a letter) to upper or lower case, respectively, and then the
       state automatically reverts to no case forcing. Case forcing applies to
       all inserted characters, including those from capture groups and let-
       ters within \Q...\E quoted sequences. If either PCRE2_UTF or PCRE2_UCP
       was set when the pattern was compiled, Unicode properties are used for
       case forcing characters whose code points are greater than 127.

       Note that case forcing sequences such as \U...\E do not nest. For exam-
       ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
       \E has no effect. Note also that the PCRE2_ALT_BSUX and PCRE2_EX-
       TRA_ALT_BSUX options do not apply to replacement strings.
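
       The case forcing rules above can be pictured as a small state machine.
       The following is only an illustrative model, not PCRE2 code: the
       function name model_case_forcing() is hypothetical, it handles ASCII
       literals only, and it leaves out \Q...\E quoting, group insertion, and
       all other escapes. It also assumes that the one-shot effect of \u or \l
       is used up by the next character whether or not that character is a
       letter.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* The three case forcing states described in the text. */
enum force { FORCE_NONE, FORCE_UPPER, FORCE_LOWER };

/*
 * Hypothetical helper (not part of PCRE2): applies \U, \L, \E, \u and \l
 * to a literal ASCII string. Any other backslash pair is modelled as
 * "backslash quotes the following character". Returns the number of
 * characters written, excluding the terminating zero.
 */
static size_t model_case_forcing(const char *in, char *out, size_t outsize)
{
    enum force state = FORCE_NONE;  /* applies to the next character */
    enum force reset = FORCE_NONE;  /* state restored after each character */
    size_t n = 0;

    if (outsize == 0) return 0;

    while (*in != '\0' && n + 1 < outsize) {
        unsigned char c = (unsigned char)*in++;

        if (c == '\\' && *in != '\0') {
            unsigned char e = (unsigned char)*in++;
            switch (e) {
            case 'U': state = reset = FORCE_UPPER; continue;
            case 'L': state = reset = FORCE_LOWER; continue;
            case 'E': state = reset = FORCE_NONE;  continue;
            case 'u': state = FORCE_UPPER; reset = FORCE_NONE; continue;
            case 'l': state = FORCE_LOWER; reset = FORCE_NONE; continue;
            default:  c = e; break;  /* model: the character is quoted */
            }
        }

        if (state == FORCE_UPPER) c = (unsigned char)toupper(c);
        else if (state == FORCE_LOWER) c = (unsigned char)tolower(c);
        out[n++] = (char)c;
        state = reset;  /* \u and \l revert to no forcing; \U and \L persist */
    }
    out[n] = '\0';
    return n;
}
```

       Because \L simply replaces the state set by \U, and \E clears it, this
       model reproduces the non-nesting behaviour in the "\Uaa\LBB\Ecc\E"
       example: the second \E finds nothing left to cancel.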

       The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
       flexibility to capture group substitution. The syntax is similar to
       that used by Bash:

         ${<n>:-<string>}
         ${<n>:+<string1>:<string2>}

       As before, <n> may be a group number or a name. The first form speci-
       fies a default value. If group <n> is set, its value is inserted; if
       not, <string> is expanded and the result inserted. The second form
       specifies strings that are expanded and inserted when group <n> is set
       or unset, respectively. The first form is just a convenient shorthand
       for

         ${<n>:+${<n>}:<string>}

       Backslash can be used to escape colons and closing curly brackets in
       the replacement strings. A change of the case forcing state within a
       replacement string remains in force afterwards, as shown in this
       pcre2test example:

         /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
             body
          1: hello
             somebody
          1: HELLO

       The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
       substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause un-
       known groups in the extended syntax forms to be treated as unset.

       If PCRE2_SUBSTITUTE_LITERAL is set, PCRE2_SUBSTITUTE_UNKNOWN_UNSET,
       PCRE2_SUBSTITUTE_UNSET_EMPTY, and PCRE2_SUBSTITUTE_EXTENDED are irrele-
       vant and are ignored.

   Substitution errors

       In the event of an error, pcre2_substitute() returns a negative error
       code. Except for PCRE2_ERROR_NOMATCH (which is never returned), errors
       from pcre2_match() are passed straight back.

       PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
       tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.

       PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
       ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set)
       when the simple (non-extended) syntax is used and PCRE2_SUBSTITUTE_UN-
       SET_EMPTY is not set.

       PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big
       enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size
       of buffer that is needed is returned via outlengthptr. Note that this
       does not happen by default.

       PCRE2_ERROR_NULL is returned if PCRE2_SUBSTITUTE_MATCHED is set but the
       match_data argument is NULL or if the subject or replacement arguments
       are NULL.
       For backward compatibility reasons an exception is made for
       the replacement argument if the rlength argument is also 0.

       PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in
       the replacement string, with more particular errors being PCRE2_ER-
       ROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REPMISSINGBRACE
       (closing curly bracket not found), PCRE2_ERROR_BADSUBSTITUTION (syntax
       error in extended group substitution), and PCRE2_ERROR_BADSUBSPATTERN
       (the pattern match ended before it started or the match started earlier
       than the current position in the subject, which can happen if \K is
       used in an assertion).

       As for all PCRE2 errors, a text message that describes the error can be
       obtained by calling the pcre2_get_error_message() function (see "Ob-
       taining a textual error message" above).

   Substitution callouts

       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_substitute_callout_block *, void *),
         void *callout_data);

       The pcre2_set_substitute_callout() function can be used to specify a
       callout function for pcre2_substitute(). This information is passed in
       a match context. The callout function is called after each substitution
       has been processed, but it can cause the replacement not to happen.
       The callout function is not called for simulated substitutions that
       happen as a result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option.

       The first argument of the callout function is a pointer to a substitute
       callout block structure, which contains the following fields, not nec-
       essarily in this order:

         uint32_t    version;
         uint32_t    subscount;
         PCRE2_SPTR  input;
         PCRE2_SPTR  output;
         PCRE2_SIZE *ovector;
         uint32_t    oveccount;
         PCRE2_SIZE  output_offsets[2];

       The version field contains the version number of the block format. The
       current version is 0. The version number will increase in future if
       more fields are added, but the intention is never to remove any of the
       existing fields.

       The subscount field is the number of the current match. It is 1 for the
       first callout, 2 for the second, and so on. The input and output point-
       ers are copies of the values passed to pcre2_substitute().

       The ovector field points to the ovector, which contains the result of
       the most recent match. The oveccount field contains the number of pairs
       that are set in the ovector, and is always greater than zero.

       The output_offsets vector contains the offsets of the replacement in
       the output string. This has already been processed for dollar and (if
       requested) backslash substitutions as described above.

       The second argument of the callout function is the value passed as
       callout_data when the function was registered. The value returned by
       the callout function is interpreted as follows:

       If the value is zero, the replacement is accepted, and, if PCRE2_SUB-
       STITUTE_GLOBAL is set, processing continues with a search for the next
       match. If the value is not zero, the current replacement is not ac-
       cepted. If the value is greater than zero, processing continues when
       PCRE2_SUBSTITUTE_GLOBAL is set. Otherwise (the value is less than zero
       or PCRE2_SUBSTITUTE_GLOBAL is not set), the rest of the input is copied
       to the output and the call to pcre2_substitute() exits, returning the
       number of matches so far.


DUPLICATE CAPTURE GROUP NAMES

       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

       When a pattern is compiled with the PCRE2_DUPNAMES option, names for
       capture groups are not required to be unique.
       Duplicate names are always allowed for groups with the same number,
       created by using the (?| feature. Indeed, if such groups are named, they
       are required to use the same names.

       Normally, patterns that use duplicate names are such that in any one
       match, only one of each set of identically-named groups participates.
       An example is shown in the pcre2pattern documentation.

       When duplicates are present, pcre2_substring_copy_byname() and
       pcre2_substring_get_byname() return the first substring corresponding
       to the given name that is set. Only if none are set is
       PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name()
       function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are
       duplicate names.

       If you want to get full details of all captured substrings for a given
       name, you must use the pcre2_substring_nametable_scan() function. The
       first argument is the compiled pattern, and the second is the name. If
       the third and fourth arguments are NULL, the function returns a group
       number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.

       When the third and fourth arguments are not NULL, they must be pointers
       to variables that are updated by the function.
       After it has run, they point to the first and last entries in the
       name-to-number table for the given name, and the function returns the
       length of each entry in code units. In both cases,
       PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the
       given name.

       The format of the name table is described above in the section entitled
       Information about a pattern. Given all the relevant entries for the
       name, you can extract each of their numbers, and hence the captured
       data.


FINDING ALL POSSIBLE MATCHES AT ONE POSITION

       The traditional matching function uses a similar algorithm to Perl,
       which stops when it finds the first match at a given point in the sub-
       ject. If you want to find all possible matches, or the longest possible
       match at a given position, consider using the alternative matching
       function (see below) instead. If you cannot use the alternative func-
       tion, you can kludge it up by making use of the callout facility, which
       is described in the pcre2callout documentation.

       What you have to do is to insert a callout right at the end of the pat-
       tern. When your callout function is called, extract and save the cur-
       rent matched substring. Then return 1, which forces pcre2_match() to
       backtrack and try other alternatives.
       Ultimately, when it runs out of matches, pcre2_match() will yield
       PCRE2_ERROR_NOMATCH.


MATCHING A PATTERN: THE ALTERNATIVE FUNCTION

       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext,
         int *workspace, PCRE2_SIZE wscount);

       The function pcre2_dfa_match() is called to match a subject string
       against a compiled pattern, using a matching algorithm that scans the
       subject string just once (not counting lookaround assertions), and does
       not backtrack (except when processing lookaround assertions). This has
       different characteristics to the normal algorithm, and is not compati-
       ble with Perl. Some of the features of PCRE2 patterns are not sup-
       ported. Nevertheless, there are times when this kind of matching can be
       useful. For a discussion of the two matching algorithms, and a list of
       features that pcre2_dfa_match() does not support, see the pcre2matching
       documentation.

       The arguments for the pcre2_dfa_match() function are the same as for
       pcre2_match(), plus two extras. The ovector within the match data block
       is used in a different way, and this is described below.
       The other common arguments are used in the same way as for
       pcre2_match(), so their description is not repeated here.

       The two additional arguments provide workspace for the function. The
       workspace vector should contain at least 20 elements. It is used for
       keeping track of multiple paths through the pattern tree. More work-
       space is needed for patterns and subjects where there are a lot of po-
       tential matches.

       Here is an example of a simple call to pcre2_dfa_match():

         int wspace[20];
         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
         int rc = pcre2_dfa_match(
           re,             /* result of pcre2_compile() */
           "some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           md,             /* the match data block */
           NULL,           /* a match context; NULL means use defaults */
           wspace,         /* working space vector */
           20);            /* number of elements (NOT size in bytes) */

   Option bits for pcre2_dfa_match()

       The unused bits of the options argument for pcre2_dfa_match() must be
       zero.
       The only bits that may be set are PCRE2_ANCHORED,
       PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO-
       TEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
       PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and
       PCRE2_DFA_RESTART. All but the last four of these are exactly the same
       as for pcre2_match(), so their description is not repeated here.

         PCRE2_PARTIAL_HARD
         PCRE2_PARTIAL_SOFT

       These have the same general effect as they do for pcre2_match(), but
       the details are slightly different. When PCRE2_PARTIAL_HARD is set for
       pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the
       subject is reached and there is still at least one matching possibility
       that requires additional characters. This happens even if some complete
       matches have already been found. When PCRE2_PARTIAL_SOFT is set, the
       return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
       if the end of the subject is reached, there have been no complete
       matches, but there is still at least one matching possibility. The por-
       tion of the string that was inspected when the longest partial match
       was found is set as the first matching string in both cases. There is a
       more detailed discussion of partial and multi-segment matching, with
       examples, in the pcre2partial documentation.
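
       The difference between the two partial options can be summarised as a
       decision table. The sketch below is a model of the rules in the
       paragraph above, not library code: model_dfa_result(), the partial_mode
       enum, and the MODEL_ constants are hypothetical stand-ins, and the
       model ignores every other reason for an error or a zero return.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the two return codes discussed in the text. */
enum { MODEL_NOMATCH = -1, MODEL_PARTIAL = -2 };

enum partial_mode { PARTIAL_OFF, PARTIAL_SOFT, PARTIAL_HARD };

/*
 * Hypothetical model (not a PCRE2 function).
 *   complete_matches: number of complete matches found so far
 *   pending_at_end:   true if the end of the subject was reached while at
 *                     least one possibility still needed more characters
 */
static int model_dfa_result(enum partial_mode mode, int complete_matches,
                            bool pending_at_end)
{
    /* Hard: a pending possibility wins even over complete matches. */
    if (mode == PARTIAL_HARD && pending_at_end) return MODEL_PARTIAL;

    /* Otherwise complete matches are reported as a positive count. */
    if (complete_matches > 0) return complete_matches;

    /* Soft: only a would-be "no match" is converted to "partial". */
    if (mode == PARTIAL_SOFT && pending_at_end) return MODEL_PARTIAL;

    return MODEL_NOMATCH;
}
```

       In this model, two complete matches plus a pending possibility yield a
       partial return under the hard option but a return of 2 under the soft
       option.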

         PCRE2_DFA_SHORTEST

       Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to
       stop as soon as it has found one match. Because of the way the alterna-
       tive algorithm works, this is necessarily the shortest possible match
       at the first possible matching point in the subject string.

         PCRE2_DFA_RESTART

       When pcre2_dfa_match() returns a partial match, it is possible to call
       it again, with additional subject characters, and have it continue with
       the same match. The PCRE2_DFA_RESTART option requests this action; when
       it is set, the workspace and wscount arguments must reference the same
       vector as before because data about the match so far is left in them
       after a partial match. There is more discussion of this facility in the
       pcre2partial documentation.

   Successful returns from pcre2_dfa_match()

       When pcre2_dfa_match() succeeds, it may have matched more than one sub-
       string in the subject. Note, however, that all the matches from one run
       of the function start at the same point in the subject. The shorter
       matches are all initial substrings of the longer matches.
       For example, if the pattern

         <.*>

       is matched against the string

         This is <something> <something else> <something further> no more

       the three matched strings are

         <something> <something else> <something further>
         <something> <something else>
         <something>

       On success, the yield of the function is a number greater than zero,
       which is the number of matched substrings. The offsets of the sub-
       strings are returned in the ovector, and can be extracted by number in
       the same way as for pcre2_match(), but the numbers bear no relation to
       any capture groups that may exist in the pattern, because DFA matching
       does not support capturing.

       Calls to the convenience functions that extract substrings by name re-
       turn the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used af-
       ter a DFA match. The convenience functions that extract substrings by
       number never return PCRE2_ERROR_NOSUBSTRING.

       The matched strings are stored in the ovector in reverse order of
       length; that is, the longest matching string is first.
       If there were too many matches to fit into the ovector, the yield of the
       function is zero, and the vector is filled with the longest matches.

       NOTE: PCRE2's "auto-possessification" optimization usually applies to
       character repeats at the end of a pattern (as well as internally). For
       example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA
       matching, this means that only one possible match is found. If you re-
       ally do want multiple matches in such cases, either use an ungreedy re-
       peat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when com-
       piling.

   Error returns from pcre2_dfa_match()

       The pcre2_dfa_match() function returns a negative number when it fails.
       Many of the errors are the same as for pcre2_match(), as described
       above. There are in addition the following errors that are specific to
       pcre2_dfa_match():

         PCRE2_ERROR_DFA_UITEM

       This return is given if pcre2_dfa_match() encounters an item in the
       pattern that it does not support, for instance, the use of \C in a UTF
       mode or a backreference.

         PCRE2_ERROR_DFA_UCOND

       This return is given if pcre2_dfa_match() encounters a condition item
       that uses a backreference for the condition, or a test for recursion in
       a specific capture group. These are not supported.

         PCRE2_ERROR_DFA_UINVALID_UTF

       This return is given if pcre2_dfa_match() is called for a pattern that
       was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for
       DFA matching.

         PCRE2_ERROR_DFA_WSSIZE

       This return is given if pcre2_dfa_match() runs out of space in the
       workspace vector.

         PCRE2_ERROR_DFA_RECURSE

       When a recursion or subroutine call is processed, the matching function
       calls itself recursively, using private memory for the ovector and
       workspace. This error is given if the internal ovector is not large
       enough. This should be extremely rare, as a vector of size 1000 is
       used.

         PCRE2_ERROR_DFA_BADRESTART

       When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option,
       some plausibility checks are made on the contents of the workspace,
       which should contain data about the previous partial match.
       If any of these checks fail, this error is given.


SEE ALSO

       pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3),
       pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3).


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 24 April 2024
       Copyright (c) 1997-2024 University of Cambridge.


PCRE2 10.44                      24 April 2024                     PCRE2API(3)
------------------------------------------------------------------------------



PCRE2BUILD(3)              Library Functions Manual              PCRE2BUILD(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


BUILDING PCRE2

       PCRE2 is distributed with a configure script that can be used to build
       the library in Unix-like environments using the applications known as
       Autotools. Also in the distribution are files to support building using
       CMake instead of configure.
       The text file README contains general in-
       formation about building with Autotools (some of which is repeated be-
       low), and also has some comments about building on various operating
       systems. The files in the vms directory support building under OpenVMS.
       There is a lot more information about building PCRE2 without using Au-
       totools (including information about using CMake and building "by
       hand") in the text file called NON-AUTOTOOLS-BUILD. You should consult
       this file as well as the README file if you are building in a non-Unix-
       like environment.


PCRE2 BUILD-TIME OPTIONS

       The rest of this document describes the optional features of PCRE2 that
       can be selected when the library is compiled. It assumes use of the
       configure script, where the optional features are selected or dese-
       lected by providing options to configure before running the make com-
       mand. However, the same options can be selected in both Unix-like and
       non-Unix-like environments if you are using CMake instead of configure
       to build PCRE2.

       If you are not using Autotools or CMake, option selection can be done
       by editing the config.h file, or by passing parameter settings to the
       compiler, as described in NON-AUTOTOOLS-BUILD.

       The complete list of options for configure (which includes the standard
       ones such as the selection of the installation directory) can be ob-
       tained by running

         ./configure --help

       The following sections include descriptions of "on/off" options whose
       names begin with --enable or --disable. Because of the way that config-
       ure works, --enable and --disable always come in pairs, so the comple-
       mentary option always exists as well, but as it specifies the default,
       it is not described. Options that specify values have names that start
       with --with. At the end of a configure run, a summary of the configura-
       tion is output.


BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES

       By default, a library called libpcre2-8 is built, containing functions
       that take string arguments contained in arrays of bytes, interpreted
       either as single-byte characters, or UTF-8 strings. You can also build
       two other libraries, called libpcre2-16 and libpcre2-32, which process
       strings that are contained in arrays of 16-bit and 32-bit code units,
       respectively. These can be interpreted either as single-unit characters
       or UTF-16/UTF-32 strings.
       To build these additional libraries, add one
       or both of the following to the configure command:

         --enable-pcre2-16
         --enable-pcre2-32

       If you do not want the 8-bit library, add

         --disable-pcre2-8

       as well. At least one of the three libraries must be built. Note that
       the POSIX wrapper is for the 8-bit library only, and that pcre2grep is
       an 8-bit program. Neither of these is built if you select only the
       16-bit or 32-bit libraries.


BUILDING SHARED AND STATIC LIBRARIES

       The Autotools PCRE2 building process uses libtool to build both shared
       and static libraries by default. You can suppress an unwanted library
       by adding one of

         --disable-shared
         --disable-static

       to the configure command. Setting --disable-shared ensures that PCRE2
       libraries are built as static libraries. The binaries that are then
       created as part of the build process (for example, pcre2test and
       pcre2grep) are linked statically with one or more PCRE2 libraries, but
       may also be dynamically linked with other libraries such as libc.
       If you want these binaries to be fully statically linked, you can set
       LDFLAGS like this:

         LDFLAGS=--static ./configure --disable-shared

       Note the two hyphens in --static. Of course, this works only if static
       versions of all the relevant libraries are available for linking.


UNICODE AND UTF SUPPORT

       By default, PCRE2 is built with support for Unicode and UTF character
       strings. To build it without Unicode support, add

         --disable-unicode

       to the configure command. This setting applies to all three libraries.
       It is not possible to build one library with Unicode support and an-
       other without in the same configuration.

       Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
       UTF-16 or UTF-32. To do that, applications that use the library can set
       the PCRE2_UTF option when they call pcre2_compile() to compile a pat-
       tern. Alternatively, patterns may be started with (*UTF) unless the
       application has locked this out by setting PCRE2_NEVER_UTF.

       UTF support allows the libraries to process character code points up to
       0x10ffff in the strings that they handle.
       Unicode support also gives
       access to the Unicode properties of characters, using pattern escapes
       such as \P, \p, and \X. Only the general category properties such as Lu
       and Nd, script names, and some bi-directional properties are supported.
       Details are given in the pcre2pattern documentation.

       Pattern escapes such as \d and \w do not by default make use of Unicode
       properties. The application can request that they do by setting the
       PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a
       pattern may also request this by starting with (*UCP).


DISABLING THE USE OF \C

       The \C escape sequence, which matches a single code unit, even in a UTF
       mode, can cause unpredictable behaviour because it may leave the cur-
       rent matching point in the middle of a multi-code-unit character. The
       application can lock it out by setting the PCRE2_NEVER_BACKSLASH_C op-
       tion when calling pcre2_compile(). There is also a build-time option

         --enable-never-backslash-C

       (note the upper case C) which locks out the use of \C entirely.
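
       Why \C is risky in a UTF mode can be seen with a short C sketch. It is
       an illustration only and is not PCRE2 code: the helper names
       utf8_is_char_start() and utf8_count_mid_char_offsets() are
       hypothetical. The sketch assumes the 8-bit library, where a code unit
       is one byte, and marks the offsets at which a match position advanced
       by a single code unit would lie inside a multi-byte UTF-8 character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of PCRE2): returns 1 if the byte at
   s[offset] starts a UTF-8 character, 0 if it is a continuation byte
   (bit pattern 10xxxxxx), that is, the middle of a multi-byte character. */
static int utf8_is_char_start(const unsigned char *s, size_t offset)
{
    assert(s != NULL);
    return (s[offset] & 0xC0u) != 0x80u;
}

/* Counts the offsets in s[0..len-1] that lie inside a character. These are
   the positions a matcher could be left at if it advanced one code unit at
   a time, which is what the backslash-C escape does. */
static size_t utf8_count_mid_char_offsets(const unsigned char *s, size_t len)
{
    size_t i, count = 0;
    for (i = 0; i < len; i++)
        if (!utf8_is_char_start(s, i)) count++;
    return count;
}
```

       For the four bytes 0x61 0xC3 0xA9 0x62 (the letter a, U+00E9, the
       letter b), only offset 2 is reported as being inside a character.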


JUST-IN-TIME COMPILER SUPPORT

       Just-in-time (JIT) compiler support is included in the build by speci-
       fying

         --enable-jit

       This support is available only for certain hardware architectures. If
       this option is set for an unsupported architecture, a building error
       occurs. If in doubt, use

         --enable-jit=auto

       which enables JIT only if the current hardware is supported. You can
       check if JIT is enabled in the configuration summary that is output at
       the end of a configure run. If you are enabling JIT under SELinux you
       may also want to add

         --enable-jit-sealloc

       which enables the use of an execmem allocator in JIT that is compatible
       with SELinux. This has no effect if JIT is not enabled. See the
       pcre2jit documentation for a discussion of JIT usage. When JIT support
       is enabled, pcre2grep automatically makes use of it, unless you add

         --disable-pcre2grep-jit

       to the configure command.


NEWLINE RECOGNITION

       By default, PCRE2 interprets the linefeed (LF) character as indicating
       the end of a line. This is the normal newline character on Unix-like
       systems. You can compile PCRE2 to use carriage return (CR) instead, by
       adding

         --enable-newline-is-cr

       to the configure command. There is also an --enable-newline-is-lf op-
       tion, which explicitly specifies linefeed as the newline character.

       Alternatively, you can specify that line endings are to be indicated by
       the two-character sequence CRLF (CR immediately followed by LF). If you
       want this, add

         --enable-newline-is-crlf

       to the configure command. There is a fourth option, specified by

         --enable-newline-is-anycrlf

       which causes PCRE2 to recognize any of the three sequences CR, LF, or
       CRLF as indicating a line ending. A fifth option, specified by

         --enable-newline-is-any

       causes PCRE2 to recognize any Unicode newline sequence.
       The Unicode
       newline sequences are the three just mentioned, plus the single charac-
       ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
       U+0085), LS (line separator, U+2028), and PS (paragraph separator,
       U+2029). The final option is

         --enable-newline-is-nul

       which causes NUL (binary zero) to be set as the default line-ending
       character.

       Whatever default line ending convention is selected when PCRE2 is built
       can be overridden by applications that use the library. At build time
       it is recommended to use the standard for your operating system.


WHAT \R MATCHES

       By default, the sequence \R in a pattern matches any Unicode newline
       sequence, independently of what has been selected as the line ending
       sequence. If you specify

         --enable-bsr-anycrlf

       the default is changed so that \R matches only CR, LF, or CRLF. What-
       ever is selected when PCRE2 is built can be overridden by applications
       that use the library.
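
       The difference between the two defaults for \R can be sketched in C as
       follows. This is a simplified illustration that works on an array of
       code points; it is not the PCRE2 implementation, and the names
       bsr_mode, BSR_ANYCRLF, BSR_UNICODE and newline_length() are invented
       for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Invented for this sketch; these are not PCRE2 identifiers. */
enum bsr_mode { BSR_ANYCRLF, BSR_UNICODE };

/* Returns the number of code points (0, 1, or 2) that form a newline
   sequence starting at p[0] under the given convention. n is the number
   of code points available at p. */
static size_t newline_length(const uint32_t *p, size_t n, enum bsr_mode mode)
{
    assert(p != NULL || n == 0);
    if (n == 0) return 0;
    switch (p[0]) {
    case 0x000D:                        /* CR, possibly followed by LF */
        return (n > 1 && p[1] == 0x000A) ? 2 : 1;
    case 0x000A:                        /* LF */
        return 1;
    case 0x000B:                        /* VT */
    case 0x000C:                        /* FF */
    case 0x0085:                        /* NEL */
    case 0x2028:                        /* LS */
    case 0x2029:                        /* PS */
        return (mode == BSR_UNICODE) ? 1 : 0;
    default:
        return 0;
    }
}
```

       In this sketch CRLF counts as one two-unit sequence in both modes,
       whereas NEL (U+0085) is a newline only in BSR_UNICODE mode.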


HANDLING VERY LARGE PATTERNS

       Within a compiled pattern, offset values are used to point from one
       part to another (for example, from an opening parenthesis to an alter-
       nation metacharacter). By default, in the 8-bit and 16-bit libraries,
       two-byte values are used for these offsets, leading to a maximum size
       for a compiled pattern of around 64 thousand code units. This is suffi-
       cient to handle all but the most gigantic patterns. Nevertheless, some
       people do want to process truly enormous patterns, so it is possible to
       compile PCRE2 to use three-byte or four-byte offsets by adding a set-
       ting such as

         --with-link-size=3

       to the configure command. The value given must be 2, 3, or 4. For the
       16-bit library, a value of 3 is rounded up to 4. In these libraries,
       using longer offsets slows down the operation of PCRE2 because it has
       to load additional data when handling them. For the 32-bit library the
       value is always 4 and cannot be overridden; the value of --with-link-
       size is ignored.


LIMITING PCRE2 RESOURCE USAGE

       The pcre2_match() function increments a counter each time it goes round
       its main loop.
       Putting a limit on this counter controls the amount of
       computing resource used by a single call to pcre2_match(). The limit
       can be changed at run time, as described in the pcre2api documentation.
       The default is 10 million, but this can be changed by adding a setting
       such as

         --with-match-limit=500000

       to the configure command. This setting also applies to the
       pcre2_dfa_match() matching function, and to JIT matching (though the
       counting is done differently).

       The pcre2_match() function uses heap memory to record backtracking
       points. The more nested backtracking points there are (that is, the
       deeper the search tree), the more memory is needed. There is an upper
       limit, specified in kibibytes (units of 1024 bytes). This limit can be
       changed at run time, as described in the pcre2api documentation. The
       default limit (in effect unlimited) is 20 million. You can change this
       by a setting such as

         --with-heap-limit=500

       which limits the amount of heap to 500 KiB. This limit applies only to
       interpretive matching in pcre2_match() and pcre2_dfa_match(), which may
       also use the heap for internal workspace when processing complicated
       patterns. This limit does not apply when JIT (which has its own memory
       arrangements) is used.
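
       Because the heap limit is given in kibibytes, it can help to convert
       it to bytes when reasoning about it. The following C sketch does only
       that arithmetic; the function names heap_limit_bytes() and
       within_heap_limit() are hypothetical and are not part of PCRE2.

```c
#include <stdint.h>

/* Hypothetical helper: converts a heap limit given in kibibytes, the unit
   used by --with-heap-limit, into bytes. */
static uint64_t heap_limit_bytes(uint32_t limit_kib)
{
    return (uint64_t)limit_kib * 1024u;
}

/* Returns 1 if a request for needed_bytes of heap stays within the limit,
   0 otherwise. */
static int within_heap_limit(uint64_t needed_bytes, uint32_t limit_kib)
{
    return needed_bytes <= heap_limit_bytes(limit_kib);
}
```

       A limit of 500 corresponds to 512000 bytes. The default of 20 million
       corresponds to 20480000000 bytes, roughly 19 GiB, which is why it is
       in effect unlimited.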

       You can also explicitly limit the depth of nested backtracking in the
       pcre2_match() interpreter. This limit defaults to the value that is set
       for --with-match-limit. You can set a lower default limit by adding,
       for example,

         --with-match-limit-depth=10000

       to the configure command. This value can be overridden at run time.
       This depth limit indirectly limits the amount of heap memory that is
       used, but because the size of each backtracking "frame" depends on the
       number of capturing parentheses in a pattern, the amount of heap that
       is used before the limit is reached varies from pattern to pattern.
       This limit was more useful in versions before 10.30, where function re-
       cursion was used for backtracking.

       As well as applying to pcre2_match(), the depth limit also controls the
       depth of recursive function calls in pcre2_dfa_match(). These are used
       for lookaround assertions, atomic groups, and recursion within pat-
       terns. The limit does not apply to JIT matching.


LIMITING VARIABLE-LENGTH LOOKBEHIND ASSERTIONS

       Lookbehind assertions in which one or more branches can match a vari-
       able number of characters are supported only if there is a maximum
       matching length for each top-level branch.
       There is a limit to this
       maximum that defaults to 255 characters. You can alter this default by
       a setting such as

         --with-max-varlookbehind=100

       The limit can be changed at runtime by calling
       pcre2_set_max_varlookbehind(). Lookbehind assertions in which every
       branch matches a fixed number of characters (not necessarily all the
       same) are not constrained by this limit.


CREATING CHARACTER TABLES AT BUILD TIME

       PCRE2 uses fixed tables for processing characters whose code points are
       less than 256. By default, PCRE2 is built with a set of tables that are
       distributed in the file src/pcre2_chartables.c.dist. These tables are
       for ASCII codes only. If you add

         --enable-rebuild-chartables

       to the configure command, the distributed tables are no longer used.
       Instead, a program called pcre2_dftables is compiled and run. This out-
       puts the source for a new set of tables, created in the default locale
       of your C run-time system. This method of replacing the tables does not
       work if you are cross compiling, because pcre2_dftables needs to be run
       on the local host and therefore not compiled with the cross compiler.

       If you need to create alternative tables when cross compiling, you will
       have to do so "by hand". There may also be other reasons for creating
       tables manually. To cause pcre2_dftables to be built on the local
       host, run a normal compiling command, and then run the program with the
       output file as its argument, for example:

         cc src/pcre2_dftables.c -o pcre2_dftables
         ./pcre2_dftables src/pcre2_chartables.c

       This builds the tables in the default locale of the local host. If you
       want to specify a locale, you must use the -L option:

         LC_ALL=fr_FR ./pcre2_dftables -L src/pcre2_chartables.c

       You can also specify -b (with or without -L). This causes the tables to
       be written in binary instead of as source code. A set of binary tables
       can be loaded into memory by an application and passed to
       pcre2_compile() in the same way as tables created by calling
       pcre2_maketables(). The tables are just a string of bytes, independent
       of hardware characteristics such as endianness. This means they can be
       bundled with an application that runs in different environments, to
       ensure consistent behaviour.


USING EBCDIC CODE

       PCRE2 assumes by default that it will run in an environment where the
       character code is ASCII or Unicode, which is a superset of ASCII. This
       is the case for most computer operating systems. PCRE2 can, however, be
       compiled to run in an 8-bit EBCDIC environment by adding

         --enable-ebcdic --disable-unicode

       to the configure command. This setting implies
       --enable-rebuild-chartables. You should only use it if you know that
       you are in an EBCDIC environment (for example, an IBM mainframe
       operating system).

       It is not possible to support both EBCDIC and UTF-8 codes in the same
       version of the library. Consequently, --enable-unicode and --enable-
       ebcdic are mutually exclusive.

       The EBCDIC character that corresponds to an ASCII LF is assumed to have
       the value 0x15 by default. However, in some EBCDIC environments, 0x25
       is used. In such an environment you should use

         --enable-ebcdic-nl25

       as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
       has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
       0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
       acter (which, in Unicode, is 0x85).
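
       The mapping just described can be written out as a small C sketch. It
       only restates the values given above; the struct ebcdic_newlines and
       the function ebcdic_newline_values() are hypothetical names, and the
       nl25 argument stands for whether --enable-ebcdic-nl25 was selected.

```c
/* Hypothetical container for the three EBCDIC newline-related values. */
struct ebcdic_newlines {
    unsigned char cr;   /* carriage return: same value as in ASCII */
    unsigned char lf;   /* the character treated as linefeed */
    unsigned char nel;  /* the character treated as Unicode NEL */
};

/* nl25 is non-zero when 0x25 is to be used as LF instead of 0x15. The
   value not chosen as LF is the one that corresponds to NEL. */
static struct ebcdic_newlines ebcdic_newline_values(int nl25)
{
    struct ebcdic_newlines v;
    v.cr = 0x0d;
    v.lf = nl25 ? 0x25 : 0x15;
    v.nel = nl25 ? 0x15 : 0x25;
    return v;
}
```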

       The options that select newline behaviour, such as --enable-newline-is-
       cr, and equivalent run-time options, refer to these character values in
       an EBCDIC environment.


PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS

       By default pcre2grep supports the use of callouts with string arguments
       within the patterns it is matching. There are two kinds: one that gen-
       erates output using local code, and another that calls an external pro-
       gram or script. If --disable-pcre2grep-callout-fork is added to the
       configure command, only the first kind of callout is supported; if
       --disable-pcre2grep-callout is used, all callouts are completely ig-
       nored. For more details of pcre2grep callouts, see the pcre2grep docu-
       mentation.


PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT

       By default, pcre2grep reads all files as plain text. You can build it
       so that it recognizes files whose names end in .gz or .bz2, and reads
       them with libz or libbz2, respectively, by adding one or both of

         --enable-pcre2grep-libz
         --enable-pcre2grep-libbz2

       to the configure command. These options naturally require that the rel-
       evant libraries are installed on your system.
       Configuration will fail if they are not.


PCRE2GREP BUFFER SIZE

       pcre2grep uses an internal buffer to hold a "window" on the file it is
       scanning, in order to be able to output "before" and "after" lines when
       it finds a match. The default starting size of the buffer is 20KiB. The
       buffer itself is three times this size, but because of the way it is
       used for holding "before" lines, the longest line that is guaranteed to
       be processable is the notional buffer size. If a longer line is encoun-
       tered, pcre2grep automatically expands the buffer, up to a specified
       maximum size, whose default is 1MiB or the starting size, whichever is
       the larger. You can change the default parameter values by adding, for
       example,

         --with-pcre2grep-bufsize=51200
         --with-pcre2grep-max-bufsize=2097152

       to the configure command. The caller of pcre2grep can override these
       values by using --buffer-size and --max-buffer-size on the command
       line.
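
       The relationship between the sizes described above can be checked with
       a few lines of C. This is a sketch of the arithmetic only, not code
       taken from pcre2grep; DEFAULT_MAX_BUFSIZE, allocated_buffer_size() and
       default_max_bufsize() are hypothetical names.

```c
#include <stddef.h>

#define DEFAULT_MAX_BUFSIZE ((size_t)1024 * 1024)   /* 1 MiB */

/* Size of the real buffer: three times the notional (starting) size. */
static size_t allocated_buffer_size(size_t bufsize)
{
    return 3 * bufsize;
}

/* Default ceiling for automatic expansion: 1 MiB or the starting size,
   whichever is the larger. */
static size_t default_max_bufsize(size_t bufsize)
{
    return (bufsize > DEFAULT_MAX_BUFSIZE) ? bufsize : DEFAULT_MAX_BUFSIZE;
}
```

       With the default starting size of 20 KiB (20480 bytes), the buffer
       itself occupies 61440 bytes and the default maximum is 1048576 bytes.
       A starting size of 2097152 bytes raises the default maximum to the
       same value.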


PCRE2TEST OPTION FOR LIBREADLINE SUPPORT

       If you add one of

         --enable-pcre2test-libreadline
         --enable-pcre2test-libedit

       to the configure command, pcre2test is linked with the libreadline or
       libedit library, respectively, and when its input is from a terminal,
       it reads it using the readline() function. This provides line-editing
       and history facilities. Note that libreadline is GPL-licensed, so if
       you distribute a binary of pcre2test linked in this way, there may be
       licensing issues. These can be avoided by linking instead with libedit,
       which has a BSD licence.

       Setting --enable-pcre2test-libreadline causes the -lreadline option to
       be added to the pcre2test build. In many operating environments with a
       system-installed readline library this is sufficient. However, in some
       environments (e.g. if an unmodified distribution version of readline is
       in use), some extra configuration may be necessary. The INSTALL file
       for libreadline says this:

         "Readline uses the termcap functions, but does not link with
         the termcap or curses library itself, allowing applications
         which link with readline the to choose an appropriate library."

       If your environment has not been set up so that an appropriate library
       is automatically included, you may need to add something like

         LIBS="-lncurses"

       immediately before the configure command.


INCLUDING DEBUGGING CODE

       If you add

         --enable-debug

       to the configure command, additional debugging code is included in the
       build. This feature is intended for use by the PCRE2 maintainers.


DEBUGGING WITH VALGRIND SUPPORT

       If you add

         --enable-valgrind

       to the configure command, PCRE2 will use valgrind annotations to mark
       certain memory regions as unaddressable. This allows it to detect in-
       valid memory accesses, and is mostly useful for debugging PCRE2 itself.


CODE COVERAGE REPORTING

       If your C compiler is gcc, you can build a version of PCRE2 that can
       generate a code coverage report for its test suite. To enable this, you
       must install lcov version 1.6 or above.
       Then specify

         --enable-coverage

       to the configure command and build PCRE2 in the usual way.

       Note that using ccache (a caching C compiler) is incompatible with code
       coverage reporting. If you have configured ccache to run automatically
       on your system, you must set the environment variable

         CCACHE_DISABLE=1

       before running make to build PCRE2, so that ccache is not used.

       When --enable-coverage is used, the following additional targets are
       added to the Makefile:

         make coverage

       This creates a fresh coverage report for the PCRE2 test suite. It is
       equivalent to running "make coverage-reset", "make coverage-baseline",
       "make check", and then "make coverage-report".

         make coverage-reset

       This zeroes the coverage counters, but does nothing else.

         make coverage-baseline

       This captures baseline coverage information.

         make coverage-report

       This creates the coverage report.

         make coverage-clean-report

       This removes the generated coverage report without cleaning the
       coverage data itself.

         make coverage-clean-data

       This removes the captured coverage data without removing the coverage
       files created at compile time (*.gcno).

         make coverage-clean

       This cleans all coverage data including the generated coverage report.
       For more information about code coverage, see the gcov and lcov
       documentation.


DISABLING THE Z AND T FORMATTING MODIFIERS

       The C99 standard defines formatting modifiers z and t for size_t and
       ptrdiff_t values, respectively. By default, PCRE2 uses these modifiers
       in environments other than old versions of Microsoft Visual Studio when
       __STDC_VERSION__ is defined and has a value greater than or equal to
       199901L (indicating support for C99). However, there is at least one
       environment that claims to be C99 but does not support these modifiers.
       If

         --disable-percent-zt

       is specified, no use is made of the z or t modifiers.
       Instead of %td or %zu, a suitable format is used depending on the size
       of long for the platform.


SUPPORT FOR FUZZERS

       There is a special option for use by people who want to run fuzzing
       tests on PCRE2:

         --enable-fuzz-support

       At present this applies only to the 8-bit library. If set, it causes an
       extra library called libpcre2-fuzzsupport.a to be built, but not
       installed. This contains a single function called
       LLVMFuzzerTestOneInput() whose arguments are a pointer to a string and
       the length of the string. When called, this function tries to compile
       the string as a pattern, and if that succeeds, to match it. This is
       done both with no options and with some random option bits that are
       generated from the string.

       Setting --enable-fuzz-support also causes a binary called
       pcre2fuzzcheck to be created. This is normally run under valgrind or
       used when PCRE2 is compiled with address sanitizing enabled. It calls
       the fuzzing function and outputs information about what it is doing.
       The input strings are specified by arguments: if an argument starts
       with "=" the rest of it is a literal input string. Otherwise, it is
       assumed to be a file name, and the contents of the file are the test
       string.
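
       The argument convention described for pcre2fuzzcheck can be sketched
       in C as follows. This only illustrates the "=" rule and is not taken
       from the pcre2fuzzcheck source; the names input_kind and
       classify_argument are hypothetical.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only; not part of PCRE2. */
typedef enum { INPUT_LITERAL, INPUT_FILE } input_kind;

/* Classify one command-line argument using the rule stated above:
   a leading "=" marks a literal input string, anything else is taken
   to be a file name. On return, *payload points at the literal text
   (after the "=") or at the file name. */
input_kind classify_argument(const char *arg, const char **payload)
{
    if (arg[0] == '=') {
        *payload = arg + 1;   /* skip the "=" marker */
        return INPUT_LITERAL;
    }
    *payload = arg;
    return INPUT_FILE;
}
```

       A driver built along these lines would pass either the literal text or
       the contents of the named file, together with its length, to the
       fuzzing function.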


OBSOLETE OPTION

       In versions of PCRE2 prior to 10.30, there were two ways of handling
       backtracking in the pcre2_match() function. The default was to use the
       system stack, but if

         --disable-stack-for-recursion

       was set, memory on the heap was used. From release 10.30 onwards this
       has changed (the stack is no longer used) and this option now does
       nothing except give a warning.


SEE ALSO

       pcre2api(3), pcre2-config(3).


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 15 April 2024
       Copyright (c) 1997-2024 University of Cambridge.


PCRE2 10.44                      15 April 2024                   PCRE2BUILD(3)
------------------------------------------------------------------------------



PCRE2CALLOUT(3)            Library Functions Manual            PCRE2CALLOUT(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


SYNOPSIS

       #include <pcre2.h>

       int (*pcre2_callout)(pcre2_callout_block *, void *);

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);


DESCRIPTION

       PCRE2 provides a feature called "callout", which is a means of
       temporarily passing control to the caller of PCRE2 in the middle of
       pattern matching. The caller of PCRE2 provides an external function by
       putting its entry point in a match context (see pcre2_set_callout() in
       the pcre2api documentation).

       When using the pcre2_substitute() function, an additional callout
       feature is available.
       This does a callout after each change to the subject string and is
       described in the pcre2api documentation; the rest of this document is
       concerned with callouts during pattern matching.

       Within a regular expression, (?C<arg>) indicates a point at which the
       external function is to be called. Different callout points can be
       identified by putting a number less than 256 after the letter C. The
       default value is zero. Alternatively, the argument may be a delimited
       string. The starting delimiter must be one of ` ' " ^ % # $ { and the
       ending delimiter is the same as the start, except for {, where the
       ending delimiter is }. If the ending delimiter is needed within the
       string, it must be doubled. For example, this pattern has two callout
       points:

         (?C1)abc(?C"some ""arbitrary"" text")def

       If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
       PCRE2 automatically inserts callouts, all with number 255, before each
       item in the pattern except for immediately before or after an explicit
       callout.
       For example, if PCRE2_AUTO_CALLOUT is used with the pattern

         A(?C3)B

       it is processed as if it were

         (?C255)A(?C3)B(?C255)

       Here is a more complicated example:

         A(\d{2}|--)

       With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were

         (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)

       Notice that there is a callout before and after each parenthesis and
       alternation bar. If the pattern contains a conditional group whose
       condition is an assertion, an automatic callout is inserted immediately
       before the condition. Such a callout may also be inserted explicitly,
       for example:

         (?(?C9)(?=a)ab|de)  (?(?C%text%)(?!=d)ab|de)

       This applies only to assertion conditions (because they are themselves
       independent groups).

       Callouts can be useful for tracking the progress of pattern matching.
       The pcre2test program has a pattern qualifier (/auto_callout) that sets
       automatic callouts. When any callouts are present, the output from
       pcre2test indicates how the pattern is being matched.
       This is useful information when you are trying to optimize the
       performance of a particular pattern.


MISSING CALLOUTS

       You should be aware that, because of optimizations in the way PCRE2
       compiles and matches patterns, callouts sometimes do not happen exactly
       as you might expect.

   Auto-possessification

       At compile time, PCRE2 "auto-possessifies" repeated items when it knows
       that what follows cannot be part of the repeat. For example, a+[bc] is
       compiled as if it were a++[bc]. The pcre2test output when this pattern
       is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
       to the string "aaaa" is:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
         No match

       This indicates that when matching [bc] fails, there is no backtracking
       into a+ (because it is being treated as a++) and therefore the callouts
       that would be taken for the backtracks do not occur. You can disable
       the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
       pcre2_compile(), or starting the pattern with (*NO_AUTO_POSSESS).
       In this case, the output changes to this:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
          +2 ^  ^     [bc]
          +2 ^ ^      [bc]
          +2 ^^       [bc]
         No match

       This time, when matching [bc] fails, the matcher backtracks into a+ and
       tries again, repeatedly, until a+ itself fails.

   Automatic .* anchoring

       By default, an optimization is applied when .* is the first significant
       item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
       any character, the pattern is automatically anchored. If PCRE2_DOTALL
       is not set, a match can start only after an internal newline or at the
       beginning of the subject, and pcre2_compile() remembers this. If a
       pattern has more than one top-level branch, automatic anchoring occurs
       if all branches are anchorable.

       This optimization is disabled, however, if .* is in an atomic group or
       if there is a backreference to the capture group in which it appears.
       It is also disabled if the pattern contains (*PRUNE) or (*SKIP).
       However, the presence of callouts does not affect it.

       For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
       and applied to the string "aa", the pcre2test output is:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
         No match

       This shows that all match attempts start at the beginning of the
       subject. In other words, the pattern is anchored. You can disable this
       optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
       starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the
       output changes to:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
          +0  ^     .*
          +2  ^^    \d
          +2  ^     \d
         No match

       This shows more match attempts, starting at the second subject
       character. Another optimization, described in the next section, means
       that there is no subsequent attempt to match with an empty subject.

   Other optimizations

       Other optimizations that provide fast "no match" results also affect
       callouts.
       For example, if the pattern is

         ab(?C4)cd

       PCRE2 knows that any matching string must contain the letter "d". If
       the subject string is "abyz", the lack of "d" means that matching
       doesn't ever start, and the callout is never reached. However, with
       "abyd", though the result is still no match, the callout is obeyed.

       For most patterns PCRE2 also knows the minimum length of a matching
       string, and will immediately give a "no match" return without actually
       running a match if the subject is not long enough, or, for unanchored
       patterns, if it has been scanned far enough.

       You can disable these optimizations by passing the
       PCRE2_NO_START_OPTIMIZE option to pcre2_compile(), or by starting the
       pattern with (*NO_START_OPT). This slows down the matching process, but
       does ensure that callouts such as the example above are obeyed.


THE CALLOUT INTERFACE

       During matching, when PCRE2 reaches a callout point, if an external
       function is provided in the match context, it is called. This applies
       to normal, DFA, and JIT matching alike. The first argument to the
       callout function is a pointer to a pcre2_callout_block.
       The second argument is the void * callout data that was supplied when
       the callout was set up by calling pcre2_set_callout() (see the pcre2api
       documentation). The callout block structure contains the following
       fields, not necessarily in this order:

         uint32_t      version;
         uint32_t      callout_number;
         uint32_t      capture_top;
         uint32_t      capture_last;
         uint32_t      callout_flags;
         PCRE2_SIZE   *offset_vector;
         PCRE2_SPTR    mark;
         PCRE2_SPTR    subject;
         PCRE2_SIZE    subject_length;
         PCRE2_SIZE    start_match;
         PCRE2_SIZE    current_position;
         PCRE2_SIZE    pattern_position;
         PCRE2_SIZE    next_item_length;
         PCRE2_SIZE    callout_string_offset;
         PCRE2_SIZE    callout_string_length;
         PCRE2_SPTR    callout_string;

       The version field contains the version number of the block format. The
       current version is 2; the three callout string fields were added for
       version 1, and the callout_flags field for version 2. If you are
       writing an application that might use an earlier release of PCRE2, you
       should check the version number before accessing any of these fields.
       The version number will increase in future if more fields are added,
       but the intention is never to remove any of the existing fields.
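
       The version check recommended above might look like the following
       sketch. To keep it compilable without the library, it uses a
       hypothetical stand-in type, demo_callout_block, holding only the two
       fields involved; real code would include pcre2.h and apply the same
       test to a pcre2_callout_block. The helper name flags_if_available is
       also an assumption.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two pcre2_callout_block fields used
   here; real code would use the library's own structure. */
typedef struct {
    uint32_t version;         /* block format version */
    uint32_t callout_flags;   /* only meaningful from version 2 onwards */
} demo_callout_block;

/* Return callout_flags when the block format is new enough to carry
   it, and 0 otherwise, so older releases are handled safely. */
uint32_t flags_if_available(const demo_callout_block *cb)
{
    return (cb->version >= 2) ? cb->callout_flags : 0;
}
```

       The same pattern applies to the three callout string fields, which
       should be read only when version is at least 1.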

   Fields for numerical callouts

       For a numerical callout, callout_string is NULL, and callout_number
       contains the number of the callout, in the range 0-255. This is the
       number that follows (?C for callouts that are part of the pattern; it
       is 255 for automatically generated callouts.

   Fields for string callouts

       For callouts with string arguments, callout_number is always zero, and
       callout_string points to the string that is contained within the
       compiled pattern. Its length is given by callout_string_length.
       Duplicated ending delimiters that were present in the original pattern
       string have been turned into single characters, but there is no other
       processing of the callout string argument. An additional code unit
       containing binary zero is present after the string, but is not included
       in the length. The delimiter that was used to start the string is also
       stored within the pattern, immediately before the string itself. You
       can access this delimiter as callout_string[-1] if you need it.

       The callout_string_offset field is the code unit offset to the start of
       the callout argument string within the original pattern string. This is
       provided for the benefit of applications such as script languages that
       might need to report errors in the callout string within the pattern.
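
       As an illustration of telling the two kinds of callout apart, the
       sketch below formats a short description of a callout. It assumes
       8-bit code units and uses a hypothetical stand-in structure,
       demo_callout_id, with plain C types in place of PCRE2_SIZE and
       PCRE2_SPTR; describe_callout is likewise an invented name.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for three pcre2_callout_block fields, using
   plain C types (8-bit code units assumed). */
typedef struct {
    uint32_t    callout_number;
    size_t      callout_string_length;
    const char *callout_string;        /* NULL for a numerical callout */
} demo_callout_id;

/* Write a short description of the callout into buf and return the
   value returned by snprintf(). */
int describe_callout(const demo_callout_id *cb, char *buf, size_t buflen)
{
    if (cb->callout_string == NULL)
        return snprintf(buf, buflen, "number %u",
                        (unsigned)cb->callout_number);

    /* The opening delimiter is stored immediately before the string,
       so callout_string[-1] is a valid access for string callouts. */
    return snprintf(buf, buflen, "string %c%.*s",
                    cb->callout_string[-1],
                    (int)cb->callout_string_length,
                    cb->callout_string);
}
```

       The length is passed explicitly rather than relying on the trailing
       zero, since callout_string_length is the documented size of the
       argument.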

   Fields for all callouts

       The remaining fields in the callout block are the same for both kinds
       of callout.

       The offset_vector field is a pointer to a vector of capturing offsets
       (the "ovector"). You may read the elements in this vector, but you must
       not change any of them.

       For calls to pcre2_match(), the offset_vector field is not (since
       release 10.30) a pointer to the actual ovector that was passed to the
       matching function in the match data block. Instead it points to an
       internal ovector of a size large enough to hold all possible captured
       substrings in the pattern. Note that whenever a recursion or subroutine
       call within a pattern completes, the capturing state is reset to what
       it was before.

       The capture_last field contains the number of the most recently
       captured substring, and the capture_top field contains one more than
       the number of the highest numbered captured substring so far. If no
       substrings have yet been captured, the value of capture_last is 0 and
       the value of capture_top is 1. The values of these fields do not always
       differ by one; for example, when the callout in the pattern
       ((a)(b))(?C2) is taken, capture_last is 1 but capture_top is 4.
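
       A sketch of how capture_top bounds the part of the offset vector worth
       inspecting: group n occupies the pair ovector[2n] and ovector[2n+1],
       and a group that has not been captured has both slots unset. The code
       uses size_t in place of PCRE2_SIZE and a local DEMO_UNSET that mirrors
       the value of PCRE2_UNSET, so that it compiles without the library;
       both DEMO_UNSET and count_set_groups are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors the value of PCRE2_UNSET (all bits set in a size_t); defined
   locally so that this sketch does not need pcre2.h. */
#define DEMO_UNSET (~(size_t)0)

/* Count the groups numbered 1 .. capture_top-1 whose offset pair has
   been set. Pair 0 is skipped because the overall match is not yet
   complete when a callout is taken. */
uint32_t count_set_groups(const size_t *ovector, uint32_t capture_top)
{
    uint32_t count = 0;
    for (uint32_t n = 1; n < capture_top; n++) {
        if (ovector[2 * n] != DEMO_UNSET && ovector[2 * n + 1] != DEMO_UNSET)
            count++;
    }
    return count;
}
```

       For the ((a)(b))(?C2) example with the subject "ab", capture_top is 4
       and all three groups are set, so a callout using this helper would
       obtain 3.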

       The contents of ovector[2] to ovector[<capture_top>*2-1] can be
       inspected in order to extract substrings that have been matched so far,
       in the same way as extracting substrings after a match has completed.
       The values in ovector[0] and ovector[1] are always PCRE2_UNSET because
       the match is by definition not complete. Substrings that have not been
       captured but whose numbers are less than capture_top also have both of
       their ovector slots set to PCRE2_UNSET.

       For DFA matching, the offset_vector field points to the ovector that
       was passed to the matching function in the match data block for
       callouts at the top level, but to an internal ovector during the
       processing of pattern recursions, lookarounds, and atomic groups.
       However, these ovectors hold no useful information because
       pcre2_dfa_match() does not support substring capturing. The value of
       capture_top is always 1 and the value of capture_last is always 0 for
       DFA matching.

       The subject and subject_length fields contain copies of the values that
       were passed to the matching function.

       The start_match field normally contains the offset within the subject
       at which the current match attempt started. However, if the escape
       sequence \K has been encountered, this value is changed to reflect the
       modified starting point.
       If the pattern is not anchored, the callout function may be called
       several times from the same point in the pattern for different starting
       points in the subject.

       The current_position field contains the offset within the subject of
       the current match pointer.

       The pattern_position field contains the offset in the pattern string to
       the next item to be matched.

       The next_item_length field contains the length of the next item to be
       processed in the pattern string. When the callout is at the end of the
       pattern, the length is zero. When the callout precedes an opening
       parenthesis, the length includes meta characters that follow the
       parenthesis. For example, in a callout before an assertion such as
       (?=ab) the length is 3. For an alternation bar or a closing
       parenthesis, the length is one, unless a closing parenthesis is
       followed by a quantifier, in which case its length is included. (This
       changed in release 10.23. In earlier releases, before an opening
       parenthesis the length was that of the entire group, and before an
       alternation bar or a closing parenthesis the length was zero.)

       The pattern_position and next_item_length fields are intended to help
       in distinguishing between different automatic callouts, which all have
       the same callout number.
       However, they are set for all callouts, and are used by pcre2test to
       show the next item to be matched when displaying callout information.

       In callouts from pcre2_match() the mark field contains a pointer to the
       zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
       (*THEN) item in the match, or NULL if no such items have been passed.
       Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
       previous (*MARK). In callouts from the DFA matching function this field
       always contains NULL.

       The callout_flags field is always zero in callouts from
       pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
       JIT is used, the following bits may be set:

         PCRE2_CALLOUT_STARTMATCH

       This is set for the first callout after the start of matching for each
       new starting position in the subject.

         PCRE2_CALLOUT_BACKTRACK

       This is set if there has been a matching backtrack since the previous
       callout, or since the start of matching if this is the first callout
       from a pcre2_match() run.

       Both bits are set when a backtrack has caused a "bumpalong" to a new
       starting position in the subject.
       Output from pcre2test does not indicate the presence of these bits
       unless the callout_extra modifier is set.

       The information in the callout_flags field is provided so that
       applications can track and tell their users how matching with
       backtracking is done. This can be useful when trying to optimize
       patterns, or just to understand how PCRE2 works. There is no support in
       pcre2_dfa_match() because there is no backtracking in DFA matching, and
       there is no support in JIT because JIT is all about maximizing matching
       performance. In both these cases the callout_flags field is always
       zero.


RETURN VALUES FROM CALLOUTS

       The external callout function returns an integer to PCRE2. If the value
       is zero, matching proceeds as normal. If the value is greater than
       zero, matching fails at the current point, but the testing of other
       matching possibilities goes ahead, just as if a lookahead assertion had
       failed. If the value is less than zero, the match is abandoned, and the
       matching function returns the negative value.

       Negative values should normally be chosen from the set of
       PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
       standard "no match" failure.
       The error number PCRE2_ERROR_CALLOUT is reserved for use by callout
       functions; it will never be used by PCRE2 itself.


CALLOUT ENUMERATION

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       A script language that supports the use of string arguments in callouts
       might like to scan all the callouts in a pattern before running the
       match. This can be done by calling pcre2_callout_enumerate(). The first
       argument is a pointer to a compiled pattern, the second points to a
       callback function, and the third is arbitrary user data. The callback
       function is called for every callout in the pattern in the order in
       which they appear. Its first argument is a pointer to a callout
       enumeration block, and its second argument is the user_data value that
       was passed to pcre2_callout_enumerate().
       The data block contains the following fields:

         version                Block version number
         pattern_position       Offset to next item in pattern
         next_item_length       Length of next item in pattern
         callout_number         Number for numbered callouts
         callout_string_offset  Offset to string within pattern
         callout_string_length  Length of callout string
         callout_string         Points to callout string or is NULL

       The version number is currently 0. It will increase if new fields are
       ever added to the block. The remaining fields are the same as their
       namesakes in the pcre2_callout block that is used for callouts during
       matching, as described above.

       Note that the value of pattern_position is unique for each callout.
       However, if a callout occurs inside a group that is quantified with a
       non-zero minimum or a fixed maximum, the group is replicated inside the
       compiled pattern. For example, a pattern such as /(a){2}/ is compiled
       as if it were /(a)(a)/. This means that the callout will be enumerated
       more than once, but with the same value for pattern_position in each
       case.

       The callback function should normally return zero. If it returns a
       non-zero value, scanning the pattern stops, and that value is returned
       from pcre2_callout_enumerate().


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 19 January 2024
       Copyright (c) 1997-2024 University of Cambridge.


PCRE2 10.43                   19 January 2024                  PCRE2CALLOUT(3)
------------------------------------------------------------------------------



PCRE2COMPAT(3)             Library Functions Manual             PCRE2COMPAT(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


DIFFERENCES BETWEEN PCRE2 AND PERL

       This document describes some of the known differences in the ways that
       PCRE2 and Perl handle regular expressions. The differences described
       here are with respect to Perl version 5.38.0, but as both Perl and
       PCRE2 are continually changing, the information may at times be out of
       date.

       1. When PCRE2_DOTALL (equivalent to Perl's /s qualifier) is not set,
       the behaviour of the '.' metacharacter differs from Perl. In PCRE2, '.'
       matches the next character unless it is the start of a newline
       sequence. This means that, if the newline setting is CR, CRLF, or NUL,
       '.' will match the code point LF (0x0A) in ASCII/Unicode environments,
       and NL (either 0x15 or 0x25) when using EBCDIC. In Perl, '.' appears
       never to match LF, even when 0x0A is not a newline indicator.

       2. PCRE2 has only a subset of Perl's Unicode support. Details of what
       it does have are given in the pcre2unicode page.

       3. Like Perl, PCRE2 allows repeat quantifiers on parenthesized
       assertions, but they do not mean what you might think. For example,
       (?!a){3} does not assert that the next three characters are not "a". It
       just asserts that the next character is not "a" three times (in
       principle; PCRE2 optimizes this to run the assertion just once). Perl
       allows some repeat quantifiers on other assertions, for example, \b* ,
       but these do not seem to have any use. PCRE2 does not allow any kind of
       quantifier on non-lookaround assertions.

       4. If a braced quantifier such as {1,2} appears where there is nothing
       to repeat (for example, at the start of a branch), PCRE2 raises an
       error whereas Perl treats the quantifier characters as literal.

       5. Capture groups that occur inside negative lookaround assertions are
       counted, but their entries in the offsets vector are set only when a
       negative assertion is a condition that has a matching branch (that is,
       the condition is false). Perl may set such capture groups in other
       circumstances.

       6. The following Perl escape sequences are not supported: \F, \l, \L,
       \u, \U, and \N when followed by a character name. \N on its own,
       matching a non-newline character, and \N{U+dd..}, matching a Unicode
       code point, are supported. The escapes that modify the case of
       following letters are implemented by Perl's general string-handling and
       are not part of its pattern matching engine. If any of these are
       encountered by PCRE2, an error is generated by default. However, if
       either of the PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U
       and \u are interpreted as ECMAScript interprets them.

       7. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
       is built with Unicode support (the default). The properties that can be
       tested with \p and \P are limited to the general category properties
       such as Lu and Nd, the derived properties Any and LC (synonym L&),
       script names such as Greek or Han, Bidi_Class, Bidi_Control, and a few
       binary properties. Both PCRE2 and Perl support the Cs (surrogate)
       property, but in PCRE2 its use is limited.
       See the pcre2pattern documentation for details. The long synonyms for
       property names that Perl supports (such as \p{Letter}) are not
       supported by PCRE2, nor is it permitted to prefix any of these
       properties with "Is".

       8. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
       in between are treated as literals. However, this is slightly different
       from Perl in that $ and @ are also handled as literals inside the
       quotes. In Perl, they cause variable interpolation (PCRE2 does not have
       variables). Also, Perl does "double-quotish backslash interpolation" on
       any backslashes between \Q and \E which, its documentation says, "may
       lead to confusing results". PCRE2 treats a backslash between \Q and \E
       just like any other character. Note the following examples:

           Pattern            PCRE2 matches     Perl matches

           \Qabc$xyz\E        abc$xyz           abc followed by the
                                                  contents of $xyz
           \Qabc\$xyz\E       abc\$xyz          abc\$xyz
           \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
           \QA\B\E            A\B               A\B
           \Q\\E              \                 \\E

       The \Q...\E sequence is recognized both inside and outside character
       classes by both PCRE2 and Perl.

       9. Fairly obviously, PCRE2 does not support the (?{code}) and
       (??{code}) constructions.
       However, PCRE2 does have a "callout" feature, which allows an external
       function to be called during pattern matching. See the pcre2callout
       documentation for details.

       10. Subroutine calls (whether recursive or not) were treated as atomic
       groups up to PCRE2 release 10.23, but from release 10.30 this changed,
       and backtracking into subroutine calls is now supported, as in Perl.

       11. In PCRE2, if any of the backtracking control verbs are used in a
       group that is called as a subroutine (whether or not recursively),
       their effect is confined to that group; it does not extend to the
       surrounding pattern. This is not always the case in Perl. In
       particular, if (*THEN) is present in a group that is called as a
       subroutine, its action is limited to that group, even if the group does
       not contain any | characters. Note that such groups are processed as
       anchored at the point where they are tested.

       12. If a pattern contains more than one backtracking control verb, the
       first one that is backtracked onto acts. For example, in the pattern
       A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure
       in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
       it is the same as PCRE2, but there are cases where it differs.

       13. There are some differences that are concerned with the settings of
       captured strings when part of a pattern is repeated. For example,
       matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
       unset, but in PCRE2 it is set to "b".

       14. PCRE2's handling of duplicate capture group numbers and names is
       not as general as Perl's. This is a consequence of the fact that PCRE2
       works internally just with numbers, using an external table to
       translate between numbers and names. In particular, a pattern such as
       (?|(?<a>A)|(?<b>B)), where the two capture groups have the same number
       but different names, is not supported, and causes an error at compile
       time. If it were allowed, it would not be possible to distinguish which
       group matched, because both names map to capture group number 1. To
       avoid this confusing situation, an error is given at compile time.

       15. Perl used to recognize comments in some places that PCRE2 does not,
       for example, between the ( and ? at the start of a group. If the /x
       modifier is set, Perl allowed white space between ( and ? though the
       latest Perls give an error (for a while it was just deprecated). There
       may still be some cases where Perl behaves differently.

       16. Perl, when in warning mode, gives warnings for character classes
       such as [A-\d] or [a-[:digit:]].
       It then treats the hyphens as literals. PCRE2 has no warning features,
       so it gives an error in these cases because they are almost certainly
       user mistakes.

       17. In PCRE2, the upper/lower case character properties Lu and Ll are
       not affected when case-independent matching is specified. For example,
       \p{Lu} always matches an upper case letter. I think Perl has changed in
       this respect; in the release at the time of writing (5.38), \p{Lu} and
       \p{Ll} match all letters, regardless of case, when case independence is
       specified.

       18. From release 5.32.0, Perl locks out the use of \K in lookaround
       assertions. From release 10.38 PCRE2 does the same by default. However,
       there is an option for re-enabling the previous behaviour. When this
       option is set, \K is acted on when it occurs in positive assertions,
       but is ignored in negative assertions.

       19. PCRE2 provides some extensions to the Perl regular expression
       facilities. Perl 5.10 included new features that were not in earlier
       versions of Perl, some of which (such as named parentheses) were in
       PCRE2 for some time before. This list is with respect to Perl 5.38:

       (a) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the
       $ meta-character matches only at the very end of the string.

       (b) A backslash followed by a letter with no special meaning is
       faulted. (Perl can be made to issue a warning.)

       (c) If PCRE2_UNGREEDY is set, the greediness of the repetition
       quantifiers is inverted, that is, by default they are not greedy, but
       if followed by a question mark they are.

       (d) PCRE2_ANCHORED can be used at matching time to force a pattern to
       be tried only at the first matching position in the subject string.

       (e) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY and
       PCRE2_NOTEMPTY_ATSTART options have no Perl equivalents.

       (f) The \R escape sequence can be restricted to match only CR, LF, or
       CRLF by the PCRE2_BSR_ANYCRLF option.

       (g) The callout facility is PCRE2-specific. Perl supports codeblocks
       and variable interpolation, but not general hooks on every match.

       (h) The partial matching facility is PCRE2-specific.

       (i) The alternative matching function (pcre2_dfa_match()) matches in a
       different way and is not Perl-compatible.

       (j) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT)
       at the start of a pattern. These set overall options that cannot be
       changed within the pattern.

       (k) PCRE2 supports non-atomic positive lookaround assertions. This is
       an extension to the lookaround facilities. The default, Perl-compatible
       lookarounds are atomic.

       (l) There are three syntactical items in patterns that can refer to a
       capturing group by number: back references such as \g{2}, subroutine
       calls such as (?3), and condition references such as (?(4)...). PCRE2
       supports relative group numbers such as +2 and -4 in all three cases.
       Perl supports both plus and minus for subroutine calls, but only minus
       for back references, and no relative numbering at all for conditions.

       20. Perl has different limits than PCRE2. See the pcre2limit
       documentation for details. With release 5.10, Perl changed from
       recursion to iteration, keeping the intermediate matches on the heap,
       which is ~10% slower but does not fall into any stack-overflow limit.
       PCRE2 made a similar change at release 10.30, and also has many
       build-time and run-time customizable limits.

       21. Unlike Perl, PCRE2 doesn't have character set modifiers and, in
       particular, has no way to set characters by context as Perl's "/d"
       does. A regular expression using PCRE2_UTF and PCRE2_UCP will use
       similar rules to Perl's "/u"; something closer to "/a" could be
       selected by adding other PCRE2_EXTRA_ASCII* options on top.

       22. Some recursive patterns that Perl diagnoses as infinite recursions
       can be handled by PCRE2, either by the interpreter or the JIT. An
       example is /(?:|(?0)abcd)(?(R)|\z)/, which matches a sequence of any
       number of repeated "abcd" substrings at the end of the subject.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 30 November 2023
       Copyright (c) 1997-2023 University of Cambridge.


PCRE2 10.43                  30 November 2023                   PCRE2COMPAT(3)
------------------------------------------------------------------------------



PCRE2JIT(3)                Library Functions Manual                PCRE2JIT(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


PCRE2 JUST-IN-TIME COMPILER SUPPORT

       Just-in-time compiling is a heavyweight optimization that can greatly
       speed up pattern matching.
       However, it comes at the cost of extra processing before the match is
       performed, so it is of most benefit when the same pattern is going to
       be matched many times. This does not necessarily mean many calls of a
       matching function; if the pattern is not anchored, matching attempts
       may take place many times at various positions in the subject, even for
       a single call. Therefore, if the subject string is very long, it may
       still pay to use JIT even for one-off matches. JIT support is available
       for all of the 8-bit, 16-bit and 32-bit PCRE2 libraries.

       JIT support applies only to the traditional Perl-compatible matching
       function. It does not apply when the DFA matching function is being
       used. The code for JIT support was written by Zoltan Herczeg.


AVAILABILITY OF JIT SUPPORT

       JIT support is an optional feature of PCRE2. The "configure" option
       --enable-jit (or equivalent CMake option) must be set when PCRE2 is
       built if you want to use JIT.
       The support is limited to the following hardware platforms:

         ARM 32-bit (v7, and Thumb2)
         ARM 64-bit
         IBM s390x 64-bit
         Intel x86 32-bit and 64-bit
         LoongArch 64-bit
         MIPS 32-bit and 64-bit
         Power PC 32-bit and 64-bit
         RISC-V 32-bit and 64-bit

       If --enable-jit is set on an unsupported platform, compilation fails.

       A client program can tell if JIT support is available by calling
       pcre2_config() with the PCRE2_CONFIG_JIT option. The result is one if
       PCRE2 was built with JIT support, and zero otherwise. However, having
       the JIT code available does not guarantee that it will be used for any
       particular match. One reason for this is that there are a number of
       options and pattern items that are not supported by JIT (see below).
       Another reason is that in some environments JIT is unable to get memory
       in which to build its compiled code. The only guarantee from
       pcre2_config() is that if it returns zero, JIT will definitely not be
       used.

       A simple program does not need to check availability in order to use
       JIT when possible. The API is implemented in a way that falls back to
       the interpretive code if JIT is not available or cannot be used for a
       given match.
       For programs that need the best possible performance, there is a "fast
       path" API that is JIT-specific.


SIMPLE USE OF JIT

       To make use of the JIT support in the simplest way, all you have to do
       is to call pcre2_jit_compile() after successfully compiling a pattern
       with pcre2_compile(). This function has two arguments: the first is the
       compiled pattern pointer that was returned by pcre2_compile(), and the
       second is zero or more of the following option bits:
       PCRE2_JIT_COMPLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.

       If JIT support is not available, a call to pcre2_jit_compile() does
       nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
       pattern is passed to the JIT compiler, which turns it into machine code
       that executes much faster than the normal interpretive code, but yields
       exactly the same results. The returned value from pcre2_jit_compile()
       is zero on success, or a negative error code.

       There is a limit to the size of pattern that JIT supports, imposed by
       the size of machine stack that it uses. The exact rules are not
       documented because they may change at any time, in particular, when new
       optimizations are introduced. If a pattern is too big, a call to
       pcre2_jit_compile() returns PCRE2_ERROR_NOMEMORY.

       PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for
       complete matches. If you want to run partial matches using the
       PCRE2_PARTIAL_HARD or PCRE2_PARTIAL_SOFT options of pcre2_match(), you
       should set one or both of the other options as well as, or instead of
       PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code
       for each of the three modes (normal, soft partial, hard partial). When
       pcre2_match() is called, the appropriate code is run if it is
       available. Otherwise, the pattern is matched using interpretive code.

       You can call pcre2_jit_compile() multiple times for the same compiled
       pattern. It does nothing if it has previously compiled code for any of
       the option bits. For example, you can call it once with
       PCRE2_JIT_COMPLETE and (perhaps later, when you find you need partial
       matching) again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD.
       This time it will ignore PCRE2_JIT_COMPLETE and just compile code for
       partial matching. If pcre2_jit_compile() is called with no option bits
       set, it immediately returns zero. This is an alternative way of testing
       whether JIT is available.

       At present, it is not possible to free JIT compiled code except when
       the entire compiled pattern is freed by calling pcre2_code_free().

       In some circumstances you may need to call additional functions. These
       are described in the section entitled "Controlling the JIT stack"
       below.

       There are some pcre2_match() options that are not supported by JIT, and
       there are also some pattern items that JIT cannot handle. Details are
       given below. In both cases, matching automatically falls back to the
       interpretive code. If you want to know whether JIT was actually used
       for a particular match, you should arrange for a JIT callback function
       to be set up as described in the section entitled "Controlling the JIT
       stack" below, even if you do not need to supply a non-default JIT
       stack. Such a callback function is called whenever JIT code is about to
       be obeyed. If the match-time options are not right for JIT execution,
       the callback function is not obeyed.

       If the JIT compiler finds an unsupported item, no JIT data is
       generated. You can find out if JIT compilation was successful for a
       compiled pattern by calling pcre2_pattern_info() with the
       PCRE2_INFO_JITSIZE option. A non-zero result means that JIT compilation
       was successful. A result of 0 means that JIT support is not available,
       or the pattern was not processed by pcre2_jit_compile(), or the JIT
       compiler was not able to handle the pattern.
       Successful JIT compilation does not, however, guarantee the use of JIT
       at match time because there are some match time options that are not
       supported by JIT.


MATCHING SUBJECTS CONTAINING INVALID UTF

       When a pattern is compiled with the PCRE2_UTF option, subject strings
       are normally expected to be a valid sequence of UTF code units. By
       default, this is checked at the start of matching and an error is
       generated if invalid UTF is detected. The PCRE2_NO_UTF_CHECK option can
       be passed to pcre2_match() to skip the check (for improved performance)
       if you are sure that a subject string is valid. If this option is used
       with an invalid string, the result is undefined. The calling program
       may crash or loop or otherwise misbehave.

       However, a way of running matches on strings that may contain invalid
       UTF sequences is available. Calling pcre2_compile() with the
       PCRE2_MATCH_INVALID_UTF option has two effects: it tells the
       interpreter in pcre2_match() to support invalid UTF, and, if
       pcre2_jit_compile() is subsequently called, the compiled JIT code also
       supports invalid UTF. Details of how this support works, in both the
       JIT and the interpretive cases, are given in the pcre2unicode
       documentation.

       There is also an obsolete option for pcre2_jit_compile() called
       PCRE2_JIT_INVALID_UTF, which currently exists only for backward
       compatibility. It is superseded by the pcre2_compile() option
       PCRE2_MATCH_INVALID_UTF and should no longer be used. It may be removed
       in future.


UNSUPPORTED OPTIONS AND PATTERN ITEMS

       The pcre2_match() options that are supported for JIT matching are
       PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
       PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and
       PCRE2_PARTIAL_SOFT. The PCRE2_ANCHORED and PCRE2_ENDANCHORED options
       are not supported at match time.

       If the PCRE2_NO_JIT option is passed to pcre2_match() it disables the
       use of JIT, forcing matching by the interpreter code.

       The only unsupported pattern items are \C (match a single data unit)
       when running in a UTF mode, and a callout immediately before an
       assertion condition in a conditional group.


RETURN VALUES FROM JIT MATCHING

       When a pattern is matched using JIT, the return values are the same as
       those given by the interpretive pcre2_match() code, with the addition
       of one new error code: PCRE2_ERROR_JIT_STACKLIMIT.
       This means that the memory used for the JIT stack was insufficient. See
       "Controlling the JIT stack" below for a discussion of JIT stack usage.

       The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
       searching a very large pattern tree goes on for too long, as it is in
       the same circumstance when JIT is not used, but the details of exactly
       what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
       is never returned when JIT matching is used.


CONTROLLING THE JIT STACK

       When the compiled JIT code runs, it needs a block of memory to use as a
       stack. By default, it uses 32KiB on the machine stack. However, some
       large or complicated patterns need more than this. The error
       PCRE2_ERROR_JIT_STACKLIMIT is given when there is not enough stack.
       Three functions are provided for managing blocks of memory for use as
       JIT stacks. There is further discussion about the use of JIT stacks in
       the section entitled "JIT stack FAQ" below.

       The pcre2_jit_stack_create() function creates a JIT stack. Its
       arguments are a starting size, a maximum size, and a general context
       (for memory allocation functions, or NULL for standard memory
       allocation). It returns a pointer to an opaque structure of type
       pcre2_jit_stack, or NULL if there is an error.
       The pcre2_jit_stack_free() function is used to free a stack that is no
       longer needed. If its argument is NULL, this function returns
       immediately, without doing anything. (For the technically minded: the
       address space is allocated by mmap or VirtualAlloc.) A maximum stack
       size of 512KiB to 1MiB should be more than enough for any pattern.

       The pcre2_jit_stack_assign() function specifies which stack JIT code
       should use. Its arguments are as follows:

         pcre2_match_context *mcontext
         pcre2_jit_callback callback
         void *data

       The first argument is a pointer to a match context. When this is
       subsequently passed to a matching function, its information determines
       which JIT stack is used. If this argument is NULL, the function returns
       immediately, without doing anything. There are three cases for the
       values of the other two arguments:

         (1) If callback is NULL and data is NULL, an internal 32KiB block
             on the machine stack is used. This is the default when a match
             context is created.

         (2) If callback is NULL and data is not NULL, data must be
             a pointer to a valid JIT stack, the result of calling
             pcre2_jit_stack_create().

         (3) If callback is not NULL, it must point to a function that is
             called with data as an argument at the start of matching, in
             order to set up a JIT stack. If the return from the callback
             function is NULL, the internal 32KiB stack is used; otherwise the
             return value must be a valid JIT stack, the result of calling
             pcre2_jit_stack_create().

       A callback function is obeyed whenever JIT code is about to be run; it
       is not obeyed when pcre2_match() is called with options that are
       incompatible with JIT matching. A callback function can therefore be
       used to determine whether a match operation was executed by JIT or by
       the interpreter.

       You may safely use the same JIT stack for more than one pattern (either
       by assigning directly or by callback), as long as the patterns are
       matched sequentially in the same thread. Currently, the only way to set
       up non-sequential matches in one thread is to use callouts: if a
       callout function starts another match, that match must use a different
       JIT stack to the one used for currently suspended match(es).

       In a multithreaded application, if you do not specify a JIT stack, or
       if you assign or pass back NULL from a callback, that is thread-safe,
       because each thread has its own machine stack.
       However, if you assign or pass back a non-NULL JIT stack, this must be
       a different stack for each thread so that the application is
       thread-safe.

       Strictly speaking, even more is allowed. You can assign the same
       non-NULL stack to a match context that is used by any number of
       patterns, as long as they are not used for matching by multiple threads
       at the same time. For example, you could use the same stack in all
       compiled patterns, with a global mutex in the callback to wait until
       the stack is available for use. However, this is an inefficient
       solution, and not recommended.

       This is a suggestion for how a multithreaded program that needs to set
       up non-default JIT stacks might operate:

         During thread initialization
           thread_local_var = pcre2_jit_stack_create(...)

         During thread exit
           pcre2_jit_stack_free(thread_local_var)

         Use a one-line callback function
           return thread_local_var

       All the functions described in this section do nothing if JIT is not
       available.


JIT STACK FAQ

       (1) Why do we need JIT stacks?

       PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
       where the local data of the current node is pushed before checking its
       child nodes. Allocating real machine stack on some platforms is
       difficult. For example, on PowerPC the stack chain needs to be updated
       every time the stack is extended. Although this is possible, the time
       overhead of the updates decreases performance. So we do the recursion
       in memory.

       (2) Why don't we simply allocate blocks of memory with malloc()?

       Modern operating systems have a nice feature: they can reserve an
       address space instead of allocating memory. We can safely allocate
       memory pages inside this address space, so the stack could grow without
       moving memory data (this is important because of pointers). Thus we can
       allocate 1MiB address space, and use only a single memory page (usually
       4KiB) if that is enough. However, we can still grow up to 1MiB at any
       time if needed.

       (3) Who "owns" a JIT stack?

       The owner of the stack is the user program, not the JIT-compiled
       pattern or anything else.
       The user program must ensure that if a stack is being used by
       pcre2_match() (that is, it is assigned to a match context that is
       passed to the pattern currently running), that stack must not be used
       by any other threads (to avoid overwriting the same memory area). The
       best practice for multithreaded programs is to allocate a stack for
       each thread, and return this stack through the JIT callback function.

       (4) When should a JIT stack be freed?

       You can free a JIT stack at any time, as long as it will not be used by
       pcre2_match() again. When you assign the stack to a match context, only
       a pointer is set. There is no reference counting or any other magic.
       You can free compiled patterns, contexts, and stacks in any order, at
       any time. Just do not call pcre2_match() with a match context pointing
       to an already freed stack, as that will cause a segmentation fault.
       (Also, do not free a stack currently used by pcre2_match() in another
       thread.) You can also replace the stack in a context at any time when
       it is not in use. You should free the previous stack before assigning a
       replacement.

       (5) Should I allocate/free a stack every time before/after calling
       pcre2_match()?

       No, because this is too costly in terms of resources.
       However, you could implement some clever idea which releases the stack
       if it has not been used for, let's say, two minutes. The JIT callback
       can help to achieve this without keeping a list of patterns.

       (6) OK, the stack is for long term memory allocation. But what happens
       if a pattern causes stack overflow with a stack of 1MiB? Is that 1MiB
       kept until the stack is freed?

       Especially on embedded systems, it might be a good idea to release
       memory sometimes without freeing the stack. There is no API for this at
       the moment. Probably a function call which returns the currently
       allocated memory for any stack and another which allows releasing
       memory (shrinking the stack) would be a good idea if someone needs
       this.

       (7) This is too much of a headache. Isn't there any better solution for
       JIT stack handling?

       No, thanks to Windows. If POSIX threads were used everywhere, we could
       throw out this complicated API.


FREEING JIT SPECULATIVE MEMORY

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       The JIT executable allocator does not free all memory when it is
       possible. It expects new allocations, and keeps some free memory around
       to improve allocation speed.
       However, in low memory conditions, it might be better to free all
       possible memory. You can cause this to happen by calling
       pcre2_jit_free_unused_memory(). Its argument is a general context, for
       custom memory management, or NULL for standard memory management.


EXAMPLE CODE

       This is a single-threaded example that specifies a JIT stack without
       using a callback. A real program should include error checking after
       all the function calls.

         /* pattern, subject, and length are supplied by the caller */
         int rc;
         int errornumber;
         PCRE2_SIZE erroffset;
         pcre2_code *re;
         pcre2_match_data *match_data;
         pcre2_match_context *mcontext;
         pcre2_jit_stack *jit_stack;

         re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
           &errornumber, &erroffset, NULL);
         rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
         mcontext = pcre2_match_context_create(NULL);
         jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
         pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
         match_data = pcre2_match_data_create(10, NULL);
         rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
         /* Process result */

         pcre2_code_free(re);
         pcre2_match_data_free(match_data);
         pcre2_match_context_free(mcontext);
         pcre2_jit_stack_free(jit_stack);


JIT FAST PATH API

       Because the API described above falls back to interpreted matching when
       JIT is not available, it is convenient for programs that are written
       for general use in many environments. However, calling JIT via
       pcre2_match() does have a performance impact. Programs that are written
       for use where JIT is known to be available, and which need the best
       possible performance, can instead use a "fast path" API to call JIT
       matching directly instead of calling pcre2_match() (obviously only for
       patterns that have been successfully processed by pcre2_jit_compile()).

       The fast path function is called pcre2_jit_match(), and it takes
       exactly the same arguments as pcre2_match(). However, the subject
       string must be specified with a length; PCRE2_ZERO_TERMINATED is not
       supported. Unsupported option bits (for example, PCRE2_ANCHORED and
       PCRE2_ENDANCHORED) are ignored, as is the PCRE2_NO_JIT option. The
       return values are also the same as for pcre2_match(), plus
       PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or complete) is
       requested that was not compiled.

       When you call pcre2_match(), as well as testing for invalid options, a
       number of other sanity checks are performed on the arguments.
       For example, if the subject pointer is NULL but the length is non-zero,
       an immediate error is given. Also, unless PCRE2_NO_UTF_CHECK is set, a
       UTF subject string is tested for validity. In the interests of speed,
       these checks do not happen on the JIT fast path. If invalid UTF data is
       passed when PCRE2_MATCH_INVALID_UTF was not set for pcre2_compile(),
       the result is undefined. The program may crash or loop or give wrong
       results. In the absence of PCRE2_MATCH_INVALID_UTF you should call
       pcre2_jit_match() in UTF mode only if you are sure the subject is
       valid.

       Bypassing the sanity checks and the pcre2_match() wrapping can give
       speedups of more than 10%.


SEE ALSO

       pcre2api(3), pcre2unicode(3)


AUTHOR

       Philip Hazel (FAQ by Zoltan Herczeg)
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 21 February 2024
       Copyright (c) 1997-2024 University of Cambridge.


PCRE2 10.43                     21 February 2024                   PCRE2JIT(3)
------------------------------------------------------------------------------



PCRE2LIMITS(3)             Library Functions Manual             PCRE2LIMITS(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


SIZE AND OTHER LIMITATIONS

       There are some size limitations in PCRE2 but it is hoped that they will
       never in practice be relevant.

       The maximum size of a compiled pattern is approximately 64 thousand
       code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with
       the default internal linkage size, which is 2 bytes for these
       libraries. If you want to process regular expressions that are truly
       enormous, you can compile PCRE2 with an internal linkage size of 3 or 4
       (when building the 16-bit library, 3 is rounded up to 4). See the
       README file in the source distribution and the pcre2build documentation
       for details. In these cases the limit is substantially larger. However,
       the speed of execution is slower. In the 32-bit library, the internal
       linkage size is always 4.

       The maximum length of a source pattern string is essentially unlimited;
       it is the largest number a PCRE2_SIZE variable can hold. However, the
       program that calls pcre2_compile() can specify a smaller limit.

       The maximum length (in code units) of a subject string is one less than
       the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an
       unsigned integer type, usually defined as size_t. Its maximum value
       (that is ~(PCRE2_SIZE)0) is reserved as a special indicator for
       zero-terminated strings and unset offsets.

       All values in repeating quantifiers must be less than 65536.

       There are two different limits that apply to branches of lookbehind
       assertions. If every branch in such an assertion matches a fixed number
       of characters, the maximum length of any branch is 65535 characters. If
       any branch matches a variable number of characters, then the maximum
       matching length for every branch is limited. The default limit is set
       at compile time, defaulting to 255, but can be changed by the calling
       program.

       There is no limit to the number of parenthesized groups, but there can
       be no more than 65535 capture groups, and there is a limit to the depth
       of nesting of parenthesized subpatterns of all kinds.
       This is imposed in order to limit the amount of system stack used at
       compile time. The default limit can be specified when PCRE2 is built;
       if not, the default is set to 250. An application can change this limit
       by calling pcre2_set_parens_nest_limit() to set the limit in a compile
       context.

       The maximum length of a name for a named capture group is 32 code
       units, and the maximum number of such groups is 10000.

       The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or
       (*THEN) verb is 255 code units for the 8-bit library and 65535 code
       units for the 16-bit and 32-bit libraries.

       The maximum length of a string argument to a callout is the largest
       number a 32-bit unsigned integer can hold.

       The maximum amount of heap memory used for matching is controlled by
       the heap limit, which can be set in a pattern or in a match context.
       The default is a very large number, effectively unlimited.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: August 2023
       Copyright (c) 1997-2023 University of Cambridge.


PCRE2 10.43                      1 August 2023                  PCRE2LIMITS(3)
------------------------------------------------------------------------------



PCRE2MATCHING(3)           Library Functions Manual           PCRE2MATCHING(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


PCRE2 MATCHING ALGORITHMS

       This document describes the two different algorithms that are available
       in PCRE2 for matching a compiled regular expression against a given
       subject string. The "standard" algorithm is the one provided by the
       pcre2_match() function. This works in the same way as Perl's matching
       function, and provides a Perl-compatible matching operation. The
       just-in-time (JIT) optimization that is described in the pcre2jit
       documentation is compatible with this function.

       An alternative algorithm is provided by the pcre2_dfa_match() function;
       it operates in a different way, and is not Perl-compatible. This
       alternative has advantages and disadvantages compared with the standard
       algorithm, and these are described below.

       When there is only one possible way in which a given subject string can
       match a pattern, the two algorithms give the same answer. A difference
       arises, however, when there are multiple possibilities. For example, if
       the pattern

         ^<.*>

       is matched against the string

         <something> <something else> <something further>

       there are three possible answers. The standard algorithm finds only one
       of them, whereas the alternative algorithm finds all three.


REGULAR EXPRESSIONS AS TREES

       The set of strings that are matched by a regular expression can be
       represented as a tree structure. An unlimited repetition in the pattern
       makes the tree of infinite size, but it is still a tree. Matching the
       pattern to a given subject string (from a given starting point) can be
       thought of as a search of the tree. There are two ways to search a
       tree: depth-first and breadth-first, and these correspond to the two
       matching algorithms provided by PCRE2.


THE STANDARD MATCHING ALGORITHM

       In the terminology of Jeffrey Friedl's book "Mastering Regular
       Expressions", the standard algorithm is an "NFA algorithm".
       It conducts a depth-first search of the pattern tree. That is, it
       proceeds along a single path through the tree, checking that the
       subject matches what is required. When there is a mismatch, the
       algorithm tries any alternatives at the current point, and if they all
       fail, it backs up to the previous branch point in the tree, and tries
       the next alternative branch at that level. This often involves backing
       up (moving to the left) in the subject string as well. The order in
       which repetition branches are tried is controlled by the greedy or
       ungreedy nature of the quantifier.

       If a leaf node is reached, a matching string has been found, and at
       that point the algorithm stops. Thus, if there is more than one
       possible match, this algorithm returns the first one that it finds.
       Whether this is the shortest, the longest, or some intermediate length
       depends on the way the alternations and the greedy or ungreedy
       repetition quantifiers are specified in the pattern.

       Because it ends up with a single path through the tree, it is
       relatively straightforward for this algorithm to keep track of the
       substrings that are matched by portions of the pattern in parentheses.
       This provides support for capturing parentheses and backreferences.
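
       The depth-first search described above can be sketched with a toy
       backtracking matcher. This is an illustration only, not PCRE2 code:
       the names toy_single(), toy_star(), and toy_match_here() are
       hypothetical, and the toy supports only literal characters, '.', and a
       '*' suffix, with a greedy flag standing in for the greedy or ungreedy
       nature of the quantifier.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy depth-first matcher (hypothetical, not PCRE2). Returns the number of
   characters matched at the start of text, or -1 if there is no match. */
int toy_match_here(const char *pat, const char *text, int greedy);

/* Does pattern character p match subject character c? */
int toy_single(char p, char c)
{
    return c != '\0' && (p == '.' || p == c);
}

/* Handle "p*" followed by rest. The order in which repeat counts are tried
   is the only difference between the greedy and ungreedy cases. */
int toy_star(char p, const char *rest, const char *text, int greedy)
{
    size_t max = 0, n;
    int tail;

    while (toy_single(p, text[max])) max++;

    if (greedy) {
        for (n = max; ; n--) {              /* longest first, then back up */
            tail = toy_match_here(rest, text + n, greedy);
            if (tail >= 0) return (int)n + tail;
            if (n == 0) break;
        }
    } else {
        for (n = 0; n <= max; n++) {        /* shortest first */
            tail = toy_match_here(rest, text + n, greedy);
            if (tail >= 0) return (int)n + tail;
        }
    }
    return -1;                              /* every branch failed */
}

int toy_match_here(const char *pat, const char *text, int greedy)
{
    int tail;

    assert(pat != NULL && text != NULL);
    if (pat[0] == '\0') return 0;           /* leaf reached: stop here */
    if (pat[1] == '*') return toy_star(pat[0], pat + 2, text, greedy);
    if (!toy_single(pat[0], text[0])) return -1;
    tail = toy_match_here(pat + 1, text + 1, greedy);
    return tail < 0 ? -1 : 1 + tail;
}
```

       With the pattern <.*> and the three-item subject shown in the
       introduction, the greedy search returns the whole subject and the
       ungreedy search returns only <something>. In both cases the function
       returns as soon as the first complete path is found, so the other
       possible matches are never reported.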


THE ALTERNATIVE MATCHING ALGORITHM

       This algorithm conducts a breadth-first search of the tree. Starting
       from the first matching point in the subject, it scans the subject
       string from left to right, once, character by character, and as it
       does this, it remembers all the paths through the tree that
       represent valid matches. In Friedl's terminology, this is a kind of
       "DFA algorithm", though it is not implemented as a traditional
       finite state machine (it keeps multiple states active
       simultaneously).

       Although the general principle of this matching algorithm is that it
       scans the subject string only once, without backtracking, there is
       one exception: when a lookaround assertion is encountered, the
       characters following or preceding the current point have to be
       independently inspected.

       The scan continues until either the end of the subject is reached,
       or there are no more unterminated paths. At this point, terminated
       paths represent the different matching possibilities (if there are
       none, the match has failed). Thus, if there is more than one
       possible match, this algorithm finds all of them, and in particular,
       it finds the longest. The matches are returned in the output vector
       in decreasing order of length.
       There is an option (PCRE2_DFA_SHORTEST) to stop the algorithm after
       the first match (which is necessarily the shortest) is found.

       Note that the size of the vector needed to contain all the results
       depends on the number of simultaneous matches, not on the number of
       parentheses in the pattern. Using
       pcre2_match_data_create_from_pattern() to create the match data
       block is therefore not advisable when doing DFA matching.

       Note also that all the matches that are found start at the same
       point in the subject. If the pattern

         cat(er(pillar)?)?

       is matched against the string "the caterpillar catchment", the
       result is the three strings "caterpillar", "cater", and "cat" that
       start at the fifth character of the subject. The algorithm does not
       automatically move on to find matches that start at later positions.

       PCRE2's "auto-possessification" optimization usually applies to
       character repeats at the end of a pattern (as well as internally).
       For example, the pattern "a\d+" is compiled as if it were "a\d++"
       because there is no point even considering the possibility of
       backtracking into the repeated digits. For DFA matching, this means
       that only one possible match is found.
       If you really do want multiple matches in such cases, either use an
       ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POSSESS option
       when compiling.

       There are a number of features of PCRE2 regular expressions that are
       not supported or behave differently in the alternative matching
       function. Those that are not supported cause an error if
       encountered.

       1. Because the algorithm finds all possible matches, the greedy or
       ungreedy nature of repetition quantifiers is not relevant (though it
       may affect auto-possessification, as just described). During
       matching, greedy and ungreedy quantifiers are treated in exactly the
       same way. However, possessive quantifiers can make a difference when
       what follows could also match what is quantified, for example in a
       pattern like this:

         ^a++\w!

       This pattern matches "aaab!" but not "aaa!", which would be matched
       by a non-possessive quantifier. Similarly, if an atomic group is
       present, it is matched as if it were a standalone pattern at the
       current point, and the longest match is then "locked in" for the
       rest of the overall pattern.

       2. When dealing with multiple paths through the tree simultaneously,
       it is not straightforward to keep track of captured substrings for
       the different matching possibilities, and PCRE2's implementation of
       this algorithm does not attempt to do this. This means that no
       captured substrings are available.

       3. Because no substrings are captured, backreferences within the
       pattern are not supported.

       4. For the same reason, conditional expressions that use a
       backreference as the condition or test for a specific group
       recursion are not supported.

       5. Again for the same reason, script runs are not supported.

       6. Because many paths through the tree may be active, the \K escape
       sequence, which resets the start of the match when encountered (but
       may be on some paths and not on others), is not supported.

       7. Callouts are supported, but the value of the capture_top field is
       always 1, and the value of the capture_last field is always 0.

       8. The \C escape sequence, which (in the standard algorithm) always
       matches a single code unit, even in a UTF mode, is not supported in
       UTF modes, because the alternative algorithm moves through the
       subject string one character (not code unit) at a time, for all
       active paths through the tree.

       9. Except for (*FAIL), the backtracking control verbs such as
       (*PRUNE) are not supported. (*FAIL) is supported, and behaves like a
       failing negative assertion.

       10. The PCRE2_MATCH_INVALID_UTF option for pcre2_compile() is not
       supported by pcre2_dfa_match().


ADVANTAGES OF THE ALTERNATIVE ALGORITHM

       The main advantage of the alternative algorithm is that all possible
       matches (at a single point in the subject) are automatically found,
       and in particular, the longest match is found. To find more than one
       match at the same point using the standard algorithm, you have to do
       kludgy things with callouts.

       Partial matching is possible with this algorithm, though it has some
       limitations. The pcre2partial documentation gives details of partial
       matching and discusses multi-segment matching.


DISADVANTAGES OF THE ALTERNATIVE ALGORITHM

       The alternative algorithm suffers from a number of disadvantages:

       1. It is substantially slower than the standard algorithm. This is
       partly because it has to search for all possible matches, but is
       also because it is less susceptible to optimization.

       2. Capturing parentheses, backreferences, script runs, and matching
       within invalid UTF strings are not supported.

       3. Although atomic groups are supported, their use does not provide
       the performance advantage that it does for the standard algorithm.

       4. JIT optimization is not supported.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 19 January 2024
       Copyright (c) 1997-2024 University of Cambridge.


PCRE2 10.43                   19 January 2024                 PCRE2MATCHING(3)
------------------------------------------------------------------------------



PCRE2PARTIAL(3)            Library Functions Manual            PCRE2PARTIAL(3)


NAME
       PCRE2 - Perl-compatible regular expressions


PARTIAL MATCHING IN PCRE2

       In normal use of PCRE2, if there is a match up to the end of a
       subject string, but more characters are needed to match the entire
       pattern, PCRE2_ERROR_NOMATCH is returned, just like any other
       failing match. There are circumstances where it might be helpful to
       distinguish this "partial match" case.

       One example is an application where the subject string is very long,
       and not all available at once. The requirement here is to be able to
       do the matching segment by segment, but special action is needed
       when a matched substring spans the boundary between two segments.

       Another example is checking a user input string as it is typed, to
       ensure that it conforms to a required format. Invalid characters can
       be immediately diagnosed and rejected, giving instant feedback.

       Partial matching is a PCRE2-specific feature; it is not
       Perl-compatible. It is requested by setting one of the
       PCRE2_PARTIAL_HARD or PCRE2_PARTIAL_SOFT options when calling a
       matching function. The difference between the two options is whether
       or not a partial match is preferred to an alternative complete
       match, though the details differ between the two types of matching
       function. If both options are set, PCRE2_PARTIAL_HARD takes
       precedence.

       If you want to use partial matching with just-in-time optimized
       code, as well as setting a partial match option for the matching
       function, you must also call pcre2_jit_compile() with one or both of
       these options:

         PCRE2_JIT_PARTIAL_HARD
         PCRE2_JIT_PARTIAL_SOFT

       PCRE2_JIT_COMPLETE should also be set if you are going to run
       non-partial matches on the same pattern. Separate code is compiled
       for each mode. If the appropriate JIT mode has not been compiled,
       interpretive matching code is used.

       Setting a partial matching option disables two of PCRE2's standard
       optimization hints. PCRE2 remembers the last literal code unit in a
       pattern, and abandons matching immediately if it is not present in
       the subject string.
       This optimization cannot be used for a subject string that might
       match only partially. PCRE2 also remembers a minimum length of a
       matching string, and does not bother to run the matching function on
       shorter strings. This optimization is also disabled for partial
       matching.


REQUIREMENTS FOR A PARTIAL MATCH

       A possible partial match occurs during matching when the end of the
       subject string is reached successfully, but either more characters
       are needed to complete the match, or the addition of more characters
       might change what is matched.

       Example 1: if the pattern is /abc/ and the subject is "ab", more
       characters are definitely needed to complete a match. In this case
       both hard and soft matching options yield a partial match.

       Example 2: if the pattern is /ab+/ and the subject is "ab", a
       complete match can be found, but the addition of more characters
       might change what is matched. In this case, only PCRE2_PARTIAL_HARD
       returns a partial match; PCRE2_PARTIAL_SOFT returns the complete
       match.

       On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set,
       if the next pattern item is \z, \Z, \b, \B, or $ there is always a
       partial match.
       Otherwise, for both options, the next pattern item must be one that
       inspects a character, and at least one of the following must be
       true:

       (1) At least one character has already been inspected. An inspected
       character need not form part of the final matched string; lookbehind
       assertions and the \K escape sequence provide ways of inspecting
       characters before the start of a matched string.

       (2) The pattern contains one or more lookbehind assertions. This
       condition exists in case there is a lookbehind that inspects
       characters before the start of the match.

       (3) There is a special case when the whole pattern can match an
       empty string. When the starting point is at the end of the subject,
       the empty string match is a possibility, and if PCRE2_PARTIAL_SOFT
       is set and neither of the above conditions is true, it is returned.
       However, because adding more characters might result in a non-empty
       match, PCRE2_PARTIAL_HARD returns a partial match, which in this
       case means "there is going to be a match at this point, but until
       some more characters are added, we do not know if it will be an
       empty string or something longer".
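
       Examples 1 and 2 and condition (1) can be mimicked with a small toy
       matcher. The Python sketch below is an illustration only, not PCRE2
       code, and every name in it is hypothetical. It is anchored at the
       start of the subject, accepts only literal characters that may each
       be followed by "+", and does no backtracking. It does not model
       conditions (2) and (3), nor the special handling of \z, \Z, \b, \B,
       and $.

```python
MATCH, PARTIAL, NOMATCH = "match", "partial", "no match"


def toy_partial_match(pattern, subject, hard):
    """Anchored toy matcher that models hard and soft partial matching.

    'pattern' may contain only literal characters, each optionally
    followed by '+'. There is no backtracking, so a pattern such as
    'a+a' is outside what this sketch handles. Not PCRE2 code.
    """
    # Split the pattern into (character, repeated?) items.
    items = []
    i = 0
    while i < len(pattern):
        repeated = pattern[i + 1:i + 2] == "+"
        items.append((pattern[i], repeated))
        i += 2 if repeated else 1

    pos = 0
    for char, repeated in items:
        if pos == len(subject):
            # The subject ran out while the pattern still needs a
            # character. This counts as a partial match only if at least
            # one character has already been inspected (condition 1).
            return PARTIAL if pos > 0 else NOMATCH
        if subject[pos] != char:
            return NOMATCH
        pos += 1
        if repeated:
            while pos < len(subject) and subject[pos] == char:
                pos += 1
            if hard and pos == len(subject):
                # More input might extend the repeat, so the hard option
                # reports a partial match instead of a complete one.
                return PARTIAL
    return MATCH


print(toy_partial_match("abc", "ab", hard=False))
print(toy_partial_match("ab+", "ab", hard=True))
print(toy_partial_match("ab+", "ab", hard=False))
```

       The only place where the two options diverge in this sketch is the
       repeat that runs up to the end of the subject, which corresponds to
       Example 2.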


PARTIAL MATCHING USING pcre2_match()

       When a partial matching option is set, the result of calling
       pcre2_match() can be one of the following:

         A successful match
           A complete match has been found, starting and ending within this
           subject.

         PCRE2_ERROR_NOMATCH
           No match can start anywhere in this subject.

         PCRE2_ERROR_PARTIAL
           Adding more characters may result in a complete match that uses
           one or more characters from the end of this subject.

       When a partial match is returned, the first two elements in the
       ovector point to the portion of the subject that was matched, but
       the values in the rest of the ovector are undefined. The appearance
       of \K in the pattern has no effect for a partial match. Consider
       this pattern:

         /abc\K123/

       If it is matched against "456abc123xyz" the result is a complete
       match, and the ovector defines the matched string as "123", because
       \K resets the "start of match" point.
       However, if a partial match is requested and the subject string is
       "456abc12", a partial match is found for the string "abc12", because
       all these characters are needed for a subsequent re-match with
       additional characters.

       If there is more than one partial match, the first one that was
       found provides the data that is returned. Consider this pattern:

         /123\w+X|dogY/

       If this is matched against the subject string "abc123dog", both
       alternatives fail to match, but the end of the subject is reached
       during matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are
       set to 3 and 9, identifying "123dog" as the first partial match. (In
       this example, there are two partial matches, because "dog" on its
       own partially matches the second alternative.)

   How a partial match is processed by pcre2_match()

       What happens when a partial match is identified depends on which of
       the two partial matching options is set.

       If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as
       soon as a partial match is found, without continuing to search for
       possible complete matches. This option is "hard" because it prefers
       an earlier partial match over a later complete match.
       For this reason, the assumption is made that the end of the supplied
       subject string is not the true end of the available data, which is
       why \z, \Z, \b, \B, and $ always give a partial match.

       If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but
       matching continues as normal, and other alternatives in the pattern
       are tried. If no complete match can be found, PCRE2_ERROR_PARTIAL is
       returned instead of PCRE2_ERROR_NOMATCH. This option is "soft"
       because it prefers a complete match over a partial match. All the
       various matching items in a pattern behave as if the subject string
       is potentially complete; \z, \Z, and $ match at the end of the
       subject, as normal, and for \b and \B the end of the subject is
       treated as a non-alphanumeric.

       The difference between the two partial matching options can be
       illustrated by a pattern such as:

         /dog(sbody)?/

       This matches either "dog" or "dogsbody", greedily (that is, it
       prefers the longer string if possible). If it is matched against the
       string "dog" with PCRE2_PARTIAL_SOFT, it yields a complete match for
       "dog". However, if PCRE2_PARTIAL_HARD is set, the result is
       PCRE2_ERROR_PARTIAL.
       On the other hand, if the pattern is made ungreedy the result is
       different:

         /dog(sbody)??/

       In this case the result is always a complete match because that is
       found first, and matching never continues after finding a complete
       match. It might be easier to follow this explanation by thinking of
       the two patterns like this:

         /dog(sbody)?/    is the same as  /dogsbody|dog/
         /dog(sbody)??/   is the same as  /dog|dogsbody/

       The second pattern will never match "dogsbody", because it will
       always find the shorter match first.

   Example of partial matching using pcre2test

       The pcre2test data modifiers partial_hard (or ph) and partial_soft
       (or ps) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT, respectively,
       when calling pcre2_match(). Here is a run of pcre2test using a
       pattern that matches the whole subject in the form of a date:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25dec3\=ph
         Partial match: 25dec3
         data> 3ju\=ph
         Partial match: 3ju
         data> 3juj\=ph
         No match

       This example gives the same results for both hard and soft partial
       matching options.
       Here is an example where there is a difference:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25jun04\=ps
          0: 25jun04
          1: jun
         data> 25jun04\=ph
         Partial match: 25jun04

       With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
       PCRE2_PARTIAL_HARD, however, the subject is assumed not to be
       complete, so there is only a partial match.


MULTI-SEGMENT MATCHING WITH pcre2_match()

       PCRE was not originally designed with multi-segment matching in
       mind. However, over time, features (including partial matching) that
       make multi-segment matching possible have been added. A very long
       string can be searched segment by segment by calling pcre2_match()
       repeatedly, with the aim of achieving the same results that would
       happen if the entire string were available for searching all the
       time. Normally, the strings that are being sought are much shorter
       than each individual segment, and are in the middle of very long
       strings, so the pattern is normally not anchored.

       Special logic must be implemented to handle a matched substring that
       spans a segment boundary.
       PCRE2_PARTIAL_HARD should be used, because it returns a partial
       match at the end of a segment whenever there is the possibility of
       changing the match by adding more characters. The PCRE2_NOTBOL
       option should also be set for all but the first segment.

       When a partial match occurs, the next segment must be added to the
       current subject and the match re-run, using the startoffset argument
       of pcre2_match() to begin at the point where the partial match
       started. For example:

           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
         data> ...the date is 23ja\=ph
         Partial match: 23ja
         data> ...the date is 23jan19 and on that day...\=offset=15
          0: 23jan19
          1: jan

       Note the use of the offset modifier to start the new match where the
       partial match was found. In this example, the next segment was added
       to the one in which the partial match was found. This is the most
       straightforward approach, typically using a memory buffer that is
       twice the size of each segment. After a partial match, the first
       half of the buffer is discarded, the second half is moved to the
       start of the buffer, and a new segment is added before repeating the
       match as in the example above. After a no match, the entire buffer
       can be discarded.
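
       The buffering scheme just described can be sketched as follows. This
       is a Python illustration under stated assumptions, not PCRE2 code:
       search() is a hypothetical stand-in for a pcre2_match() call with
       PCRE2_PARTIAL_HARD and a start offset, and it looks only for a
       non-empty literal string, so no lookbehinds are involved. The
       function and variable names are invented for this sketch.

```python
def search(needle, buffer, start):
    """Toy stand-in for a hard partial search beginning at 'start'.

    Returns ("match", s, e), ("partial", s, len(buffer)), or
    ("nomatch", len(buffer), len(buffer)). Assumes a non-empty needle.
    """
    for s in range(start, len(buffer)):
        window = buffer[s:s + len(needle)]
        if window == needle:
            return ("match", s, s + len(needle))
        # The window ran off the end of the buffer: a partial match exists
        # if what is left could still be the beginning of the needle.
        if len(window) < len(needle) and needle.startswith(window):
            return ("partial", s, len(buffer))
    return ("nomatch", len(buffer), len(buffer))


def find_all(needle, segments):
    """Return absolute (start, end) offsets of needle across segments."""
    found = []
    buffer = ""   # retained text plus the newest segment
    base = 0      # absolute offset of buffer[0] in the whole text
    start = 0     # where the next search begins (like startoffset)
    for segment in segments:
        buffer += segment
        kind, s, e = search(needle, buffer, start)
        while kind == "match":
            found.append((base + s, base + e))
            kind, s, e = search(needle, buffer, e)
        if kind == "partial":
            # Keep the newest segment (and the partial match, if it
            # started earlier); discard everything before that.
            drop = min(s, len(buffer) - len(segment))
        else:
            # No match: the entire buffer can be discarded.
            drop = len(buffer)
        buffer = buffer[drop:]
        base += drop
        start = s - drop if kind == "partial" else 0
    return found


print(find_all("23jan19", ["...the date is 23ja", "n19 and on that day..."]))
```

       Restarting at the saved partial-match offset, rather than at the
       beginning of the retained text, mirrors the use of the offset
       modifier in the pcre2test example.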

       If there are memory constraints, you may want to discard text that
       precedes a partial match before adding the next segment.
       Unfortunately, this is not at present straightforward. In cases such
       as the above, where the pattern does not contain any lookbehinds, it
       is sufficient to retain only the partially matched substring.
       However, if the pattern contains a lookbehind assertion, characters
       that precede the start of the partial match may have been inspected
       during the matching process. When pcre2test displays a partial
       match, it indicates these characters with '<' if the allusedtext
       modifier is set:

           re> "(?<=123)abc"
         data> xx123ab\=ph,allusedtext
         Partial match: 123ab
                        <<<

       However, the allusedtext modifier is not available for JIT matching,
       because JIT matching does not record the first (or last) consulted
       characters. For this reason, this information is not available via
       the API. It is therefore not possible in general to obtain the exact
       number of characters that must be retained in order to get the right
       match result. If you cannot retain the entire segment, you must find
       some heuristic way of choosing.

       If you know the approximate length of the matching substrings, you
       can use that to decide how much text to retain.
       The only lookbehind information that is currently available via the
       API is the length of the longest individual lookbehind in a pattern,
       but this can be misleading if there are nested lookbehinds. The
       value returned by calling pcre2_pattern_info() with the
       PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters
       (not code units) that any individual lookbehind moves back when it
       is processed. A pattern such as "(?<=(?<!b)a)" has a maximum
       lookbehind value of one, but inspects two characters before its
       starting point.

       In a non-UTF or a 32-bit case, moving back is just a subtraction,
       but in UTF-8 or UTF-16 you have to count characters while moving
       back through the code units.


PARTIAL MATCHING USING pcre2_dfa_match()

       The DFA function moves along the subject string character by
       character, without backtracking, searching for all possible matches
       simultaneously. If the end of the subject is reached before the end
       of the pattern, there is the possibility of a partial match.

       When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only
       if there have been no complete matches. Otherwise, the complete
       matches are returned. If PCRE2_PARTIAL_HARD is set, a partial match
       takes precedence over any complete matches.
       The portion of the string that was matched when the longest partial
       match was found is set as the first matching string.

       Because the DFA function always searches for all possible matches, and
       there is no difference between greedy and ungreedy repetition, its
       behaviour is different from that of pcre2_match(). Consider the string
       "dog" matched against this ungreedy pattern:

         /dog(sbody)??/

       Whereas the standard function stops as soon as it finds the complete
       match for "dog", the DFA function also finds the partial match for
       "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.


MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()

       When a partial match has been found using the DFA matching function, it
       is possible to continue the match by providing additional subject data
       and calling the function again with the same compiled regular
       expression, this time setting the PCRE2_DFA_RESTART option. You must
       pass the same working space as before, because this is where details of
       the previous partial match are stored. You can set the
       PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART
       to continue partial matching over multiple segments.
       Here is an example using pcre2test:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\=dfa,ps
         Partial match: 23ja
         data> n05\=dfa,dfa_restart
          0: n05

       The first call has "23ja" as the subject, and requests partial
       matching; the second call has "n05" as the subject for the continued
       (restarted) match. Notice that when the match is complete, only the
       last part is shown; PCRE2 does not retain the previously
       partially-matched string. It is up to the calling program to do that if
       it needs to. This means that, for an unanchored pattern, if a continued
       match fails, it is not possible to try again at a new starting point.
       All this facility is capable of doing is continuing with the previous
       match attempt. For example, consider this pattern:

         1234|3789

       If the first part of the subject is "ABC123", a partial match of the
       first alternative is found at offset 3. There is no partial match for
       the second alternative, because such a match does not start at the same
       point in the subject string. Attempting to continue with the string
       "7890" does not yield a match because only those alternatives that
       match at one point in the subject are remembered.
       Depending on the application, this may or may not be what you want.

       If you do want to allow for starting again at the next character, one
       way of doing it is to retain some or all of the segment and try a new
       complete match, as described for pcre2_match() above. Another
       possibility is to work with two buffers. If a partial match at offset n
       in the first buffer is followed by "no match" when PCRE2_DFA_RESTART is
       used on the second buffer, you can then try a new match starting at
       offset n+1 in the first buffer.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 04 September 2019
       Copyright (c) 1997-2019 University of Cambridge.


PCRE2 10.34                    04 September 2019               PCRE2PARTIAL(3)
------------------------------------------------------------------------------



PCRE2PATTERN(3)            Library Functions Manual            PCRE2PATTERN(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


PCRE2 REGULAR EXPRESSION DETAILS

       The syntax and semantics of the regular expressions that are supported
       by PCRE2 are described in detail below. There is a quick-reference
       syntax summary in the pcre2syntax page. PCRE2 tries to match Perl
       syntax and semantics as closely as it can. PCRE2 also supports some
       alternative regular expression syntax (which does not conflict with the
       Perl syntax) in order to provide some compatibility with regular
       expressions in Python, .NET, and Oniguruma.

       Perl's regular expressions are described in its own documentation, and
       regular expressions in general are covered in a number of books, some
       of which have copious examples. Jeffrey Friedl's "Mastering Regular
       Expressions", published by O'Reilly, covers regular expressions in
       great detail. This description of PCRE2's regular expressions is
       intended as reference material.

       This document discusses the regular expression patterns that are
       supported by PCRE2 when its main matching function, pcre2_match(), is
       used. PCRE2 also has an alternative matching function,
       pcre2_dfa_match(), which matches using a different algorithm that is
       not Perl-compatible. Some of the features discussed below are not
       available when DFA matching is used. The advantages and disadvantages
       of the alternative function, and how it differs from the normal
       function, are discussed in the pcre2matching page.


SPECIAL START-OF-PATTERN ITEMS

       A number of options that can be passed to pcre2_compile() can also be
       set by special items at the start of a pattern. These are not
       Perl-compatible, but are provided to make these options accessible to
       pattern writers who are not able to change the program that processes
       the pattern. Any number of these items may appear, but they must all be
       together right at the start of the pattern string, and the letters must
       be in upper case.

   UTF support

       In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
       as single code units, or as multiple UTF-8 or UTF-16 code units.
       UTF-32 can be specified for the 32-bit library, in which case it
       constrains the character values to valid Unicode code points. To
       process UTF strings, PCRE2 must be built to include Unicode support
       (which is the default). When using UTF strings you must either call the
       compiling function with one or both of the PCRE2_UTF or
       PCRE2_MATCH_INVALID_UTF options, or the pattern must start with the
       special sequence (*UTF), which is equivalent to setting the PCRE2_UTF
       option. How setting a UTF mode affects pattern matching is mentioned in
       several places below. There is also a summary of features in the
       pcre2unicode page.

       Some applications that allow their users to supply patterns may wish to
       restrict them to non-UTF data for security reasons. If the
       PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not
       allowed, and its appearance in a pattern causes an error.

   Unicode property support

       Another special sequence that may appear at the start of a pattern is
       (*UCP). This has the same effect as setting the PCRE2_UCP option: it
       causes sequences such as \d and \w to use Unicode properties to
       determine character types, instead of recognizing only characters with
       codes less than 256 via a lookup table.
       It also causes upper/lower casing operations to use Unicode properties
       for characters with code points greater than 127, even when UTF is not
       set. These behaviours can be changed within the pattern; see the
       section entitled "Internal Option Setting" below.

       Some applications that allow their users to supply patterns may wish to
       restrict them for security reasons. If the PCRE2_NEVER_UCP option is
       passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in
       a pattern causes an error.

   Locking out empty string matching

       Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same
       effect as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option
       to whichever matching function is subsequently called to match the
       pattern. These options lock out the matching of empty strings, either
       entirely, or only at the start of the subject.

   Disabling auto-possessification

       If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as
       setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making
       quantifiers possessive when what follows cannot match the repeated
       item. For example, by default a+b is treated as a++b. For more details,
       see the pcre2api documentation.

   Disabling start-up optimizations

       If a pattern starts with (*NO_START_OPT), it has the same effect as
       setting the PCRE2_NO_START_OPTIMIZE option. This disables several
       optimizations for quickly reaching "no match" results. For more
       details, see the pcre2api documentation.

   Disabling automatic anchoring

       If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect
       as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables
       optimizations that apply to patterns whose top-level branches all start
       with .* (match any number of arbitrary characters). For more details,
       see the pcre2api documentation.

   Disabling JIT compilation

       If a pattern that starts with (*NO_JIT) is successfully compiled, an
       attempt by the application to apply the JIT optimization by calling
       pcre2_jit_compile() is ignored.

   Setting match resource limits

       The pcre2_match() function contains a counter that is incremented every
       time it goes round its main loop. The caller of pcre2_match() can set a
       limit on this counter, which therefore limits the amount of computing
       resource used for a match.
       The maximum depth of nested backtracking can also be limited; this
       indirectly restricts the amount of heap memory that is used, but there
       is also an explicit memory limit that can be set.

       These facilities are provided to catch runaway matches that are
       provoked by patterns with huge matching trees. A common example is a
       pattern with nested unlimited repeats applied to a long string that
       does not match. When one of these limits is reached, pcre2_match()
       gives an error return. The limits can also be set by items at the start
       of the pattern of the form

         (*LIMIT_HEAP=d)
         (*LIMIT_MATCH=d)
         (*LIMIT_DEPTH=d)

       where d is any number of decimal digits. However, the value of the
       setting must be less than the value set (or defaulted) by the caller of
       pcre2_match() for it to have any effect. In other words, the pattern
       writer can lower the limits set by the programmer, but not raise them.
       If there is more than one setting of one of these limits, the lower
       value is used. The heap limit is specified in kibibytes (units of 1024
       bytes).

       Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This
       name is still recognized for backwards compatibility.
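
       The rule that a pattern item can lower a limit but never raise it can
       be sketched in C as follows. This is only an illustration, not PCRE2
       code: effective_limit() and its parameters are hypothetical names, and
       the values from the (*LIMIT_...=d) items are assumed to have been
       extracted from the pattern already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE2 API: combine the limit set
   (or defaulted) by the caller of pcre2_match() with the values given by
   (*LIMIT_...=d) items at the start of a pattern. */
uint32_t effective_limit(uint32_t caller_limit,
                         const uint32_t *pattern_limits, size_t count)
{
    uint32_t limit = caller_limit;

    assert(pattern_limits != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        /* A pattern item takes effect only if it is lower than the
           current limit; with several items, the lowest value wins. */
        if (pattern_limits[i] < limit)
            limit = pattern_limits[i];
    }
    return limit;
}
```

       With a caller limit of 10000, the items (*LIMIT_MATCH=5000) and
       (*LIMIT_MATCH=2000) give an effective limit of 2000 in this sketch,
       whereas a single (*LIMIT_MATCH=50000) leaves the limit at 10000.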

       The heap limit applies only when the pcre2_match() or pcre2_dfa_match()
       interpreters are used for matching. It does not apply to JIT. The match
       limit is used (but in a different way) when JIT is being used, or when
       pcre2_dfa_match() is called, to limit computing resource usage by those
       matching functions. The depth limit is ignored by JIT but is relevant
       for DFA matching, which uses function recursion for recursions within
       the pattern and for lookaround assertions and atomic groups. In this
       case, the depth limit controls the depth of such recursion.

   Newline conventions

       PCRE2 supports six different conventions for indicating line breaks in
       strings: a single CR (carriage return) character, a single LF
       (linefeed) character, the two-character sequence CRLF, any of the three
       preceding, any Unicode newline sequence, or the NUL character (binary
       zero). The pcre2api page has further discussion about newlines, and
       shows how to set the newline convention when calling pcre2_compile().

       It is also possible to specify a newline convention by starting a
       pattern string with one of the following sequences:

         (*CR)        carriage return
         (*LF)        linefeed
         (*CRLF)      carriage return, followed by linefeed
         (*ANYCRLF)   any of the three above
         (*ANY)       all Unicode newline sequences
         (*NUL)       the NUL character (binary zero)

       These override the default and the options given to the compiling
       function. For example, on a Unix system where LF is the default newline
       sequence, the pattern

         (*CR)a.b

       changes the convention to CR. That pattern matches "a\nb" because LF is
       no longer a newline. If more than one of these settings is present, the
       last one is used.

       The newline convention affects where the circumflex and dollar
       assertions are true. It also affects the interpretation of the dot
       metacharacter when PCRE2_DOTALL is not set, and the behaviour of \N
       when not followed by an opening brace. However, it does not affect what
       the \R escape sequence matches. By default, this is any Unicode newline
       sequence, for Perl compatibility. However, this can be changed; see the
       next section and the description of \R in the section entitled "Newline
       sequences" below.
       A change of \R setting can be combined with a change of newline
       convention.

   Specifying what \R matches

       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
       the complete set of Unicode line endings) by setting the option
       PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by
       starting a pattern with (*BSR_ANYCRLF). For completeness,
       (*BSR_UNICODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.


EBCDIC CHARACTER CODES

       PCRE2 can be compiled to run in an environment that uses EBCDIC as its
       character code instead of ASCII or Unicode (typically a mainframe
       system). In the sections below, character code values are ASCII or
       Unicode; in an EBCDIC environment these characters may have different
       code values, and there are no code points greater than 255.


CHARACTERS AND METACHARACTERS

       A regular expression is a pattern that is matched against a subject
       string from left to right. Most characters stand for themselves in a
       pattern, and match the corresponding characters in the subject.
       As a trivial example, the pattern

         The quick brown fox

       matches a portion of a subject string that is identical to itself. When
       caseless matching is specified (the PCRE2_CASELESS option or (?i)
       within the pattern), letters are matched independently of case. Note
       that there are two ASCII characters, K and S, that, in addition to
       their lower case ASCII equivalents, are case-equivalent with Unicode
       U+212A (Kelvin sign) and U+017F (long S) respectively when either
       PCRE2_UTF or PCRE2_UCP is set, unless the PCRE2_EXTRA_CASELESS_RESTRICT
       option is in force (either passed to pcre2_compile() or set by (?r)
       within the pattern).

       The power of regular expressions comes from the ability to include wild
       cards, character classes, alternatives, and repetitions in the pattern.
       These are encoded in the pattern by the use of metacharacters, which do
       not stand for themselves but instead are interpreted in some special
       way.

       There are two different sets of metacharacters: those that are
       recognized anywhere in the pattern except within square brackets, and
       those that are recognized within square brackets.
       Outside square brackets, the metacharacters are as follows:

         \      general escape character with several uses
         ^      assert start of string (or line, in multiline mode)
         $      assert end of string (or line, in multiline mode)
         .      match any character except newline (by default)
         [      start character class definition
         |      start of alternative branch
         (      start group or control verb
         )      end group or control verb
         *      0 or more quantifier
         +      1 or more quantifier; also "possessive quantifier"
         ?      0 or 1 quantifier; also quantifier minimizer
         {      potential start of min/max quantifier

       Brace characters { and } are also used to enclose data for
       constructions such as \g{2} or \k{name}. In almost all uses of braces,
       space and/or horizontal tab characters that follow { or precede } are
       allowed and are ignored. In the case of quantifiers, they may also
       appear before or after the comma. The exception to this is \u{...}
       which is an ECMAScript compatibility feature that is recognized only
       when the PCRE2_EXTRA_ALT_BSUX option is set. ECMAScript does not ignore
       such white space; it causes the item to be interpreted as literal.

       Part of a pattern that is in square brackets is called a "character
       class".
       In a character class the only metacharacters are:

         \      general escape character
         ^      negate the class, but only if the first character
         -      indicates character range
         [      POSIX character class (if followed by POSIX syntax)
         ]      terminates the character class

       If a pattern is compiled with the PCRE2_EXTENDED option, most white
       space in the pattern is ignored, except in a character class or within
       a \Q...\E sequence; characters between a # outside a character class
       and the next newline, inclusive, are also ignored. An escaping
       backslash can be used to include a white space or a # character as part
       of the pattern. If the PCRE2_EXTENDED_MORE option is set, the same
       applies, but in addition unescaped space and horizontal tab characters
       are ignored inside a character class. Note: only these two characters
       are ignored, not the full set of pattern white space characters that
       are ignored outside a character class. Option settings can be changed
       within a pattern; see the section entitled "Internal Option Setting"
       below.

       The following sections describe the use of each of the metacharacters.


BACKSLASH

       The backslash character has several uses.
       Firstly, if it is followed by a character that is not a digit or a
       letter, it takes away any special meaning that character may have. This
       use of backslash as an escape character applies both inside and outside
       character classes.

       For example, if you want to match a * character, you must write \* in
       the pattern. This escaping action applies whether or not the following
       character would otherwise be interpreted as a metacharacter, so it is
       always safe to precede a non-alphanumeric with backslash to specify
       that it stands for itself. In particular, if you want to match a
       backslash, you write \\.

       Only ASCII digits and letters have any special meaning after a
       backslash. All other characters (in particular, those whose code points
       are greater than 127) are treated as literals.

       If you want to treat all characters in a sequence as literals, you can
       do so by putting them between \Q and \E. Note that this includes white
       space even when the PCRE2_EXTENDED option is set so that most other
       white space is ignored. The behaviour is different from Perl in that $
       and @ are handled as literals in \Q...\E sequences in PCRE2, whereas in
       Perl, $ and @ cause variable interpolation.
       Also, Perl does "double-quotish backslash interpolation" on any
       backslashes between \Q and \E which, its documentation says, "may lead
       to confusing results". PCRE2 treats a backslash between \Q and \E just
       like any other character. Note the following examples:

         Pattern            PCRE2 matches   Perl matches

         \Qabc$xyz\E        abc$xyz         abc followed by the
                                              contents of $xyz
         \Qabc\$xyz\E       abc\$xyz        abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz
         \QA\B\E            A\B             A\B
         \Q\\E              \               \\E

       The \Q...\E sequence is recognized both inside and outside character
       classes. An isolated \E that is not preceded by \Q is ignored. If \Q
       is not followed by \E later in the pattern, the literal interpretation
       continues to the end of the pattern (that is, \E is assumed at the
       end). If the isolated \Q is inside a character class, this causes an
       error, because the character class is then not terminated by a closing
       square bracket.

   Non-printing characters

       A second use of backslash provides a way of encoding non-printing
       characters in patterns in a visible manner.
       There is no restriction on the appearance of non-printing characters in
       a pattern, but when a pattern is being prepared by text editing, it is
       often easier to use one of the following escape sequences instead of
       the binary character it represents. In an ASCII or Unicode environment,
       these escapes are as follows:

         \a          alarm, that is, the BEL character (hex 07)
         \cx         "control-x", where x is a non-control ASCII character
         \e          escape (hex 1B)
         \f          form feed (hex 0C)
         \n          linefeed (hex 0A)
         \r          carriage return (hex 0D) (but see below)
         \t          tab (hex 09)
         \0dd        character with octal code 0dd
         \ddd        character with octal code ddd, or backreference
         \o{ddd..}   character with octal code ddd..
         \xhh        character with hex code hh
         \x{hhh..}   character with hex code hhh..
         \N{U+hhh..} character with Unicode hex code point hhh..

       By default, after \x that is not followed by {, from zero to two
       hexadecimal digits are read (letters can be in upper or lower case).
       Any number of hexadecimal digits may appear between \x{ and }. If a
       character other than a hexadecimal digit appears between \x{ and }, or
       if there is no terminating }, an error occurs.
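
       The default reading of \x described above can be sketched in C as
       follows. This is an illustration under stated assumptions, not the
       PCRE2 parser: parse_x_escape() and hex_value() are hypothetical names,
       a \x with no digits is taken to give the value zero, and empty braces
       and values that do not fit in 32 bits are rejected by the sketch's own
       choice. Space and tab are skipped after { and before }, as the section
       on braces above allows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value of one hexadecimal digit (either case), or -1 if c is not one. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: s points just past "\x" in a NUL-terminated
   pattern. Returns the number of characters consumed, or -1 on error. */
int parse_x_escape(const char *s, uint32_t *value)
{
    uint32_t v = 0;
    int i = 0, digits = 0;

    assert(s != NULL && value != NULL);

    if (s[0] != '{') {
        /* Unbraced form: read zero, one, or two hexadecimal digits.
           Assumption: zero digits yields the value 0. */
        while (i < 2 && hex_value(s[i]) >= 0)
            v = v * 16 + (uint32_t)hex_value(s[i++]);
        *value = v;
        return i;
    }

    i = 1;
    while (s[i] == ' ' || s[i] == '\t') i++;      /* space after {  */
    while (hex_value(s[i]) >= 0) {
        if (v > 0x0FFFFFFFu) return -1;           /* sketch's overflow guard */
        v = v * 16 + (uint32_t)hex_value(s[i++]);
        digits++;
    }
    while (s[i] == ' ' || s[i] == '\t') i++;      /* space before } */

    /* Anything other than } here is a non-hex character or a missing
       terminator; empty braces are rejected by assumption. */
    if (digits == 0 || s[i] != '}') return -1;

    *value = v;
    return i + 1;
}
```

       Given the text after \x, this sketch reads "dc" and "{dc}" as the same
       value (hex DC), reads only the "4" of "4g", and reports an error for
       "{12g}" and for "{123" with no closing brace.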

       Characters whose code points are less than 256 can be defined by
       either of the two syntaxes for \x or by an octal sequence. There is
       no difference in the way they are handled. For example, \xdc is
       exactly the same as \x{dc} or \334. However, using the braced
       versions does make such sequences easier to read.

       Support is available for some ECMAScript (aka JavaScript) escape
       sequences via two compile-time options. If PCRE2_ALT_BSUX is set,
       the sequence \x followed by { is not recognized. Only if \x is
       followed by two hexadecimal digits is it recognized as a character
       escape. Otherwise it is interpreted as a literal "x" character. In
       this mode, support for code points greater than 255 is provided by
       \u, which must be followed by four hexadecimal digits; otherwise it
       is interpreted as a literal "u" character.

       PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in
       addition, \u{hhh..} is recognized as the character specified by
       hexadecimal code point. There may be any number of hexadecimal
       digits, but unlike other places that also use curly brackets, spaces
       are not allowed and would result in the string being interpreted as
       a literal. This syntax is from ECMAScript 6.
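
       The two ECMAScript-style modes can be modelled in the same way.
       The following Python sketch is only an illustration of the rules
       stated above; the function alt_bsux_escape and its "extra"
       parameter (standing for PCRE2_EXTRA_ALT_BSUX) are hypothetical
       names, and the function does not call PCRE2.

```python
HEX = set("0123456789abcdefABCDEF")


def alt_bsux_escape(pattern, pos, extra=False):
    # `pos` is the index of the letter after the backslash, which the
    # caller guarantees is "x" or "u".
    # Returns (code_point, next_pos). When the escape is not recognized,
    # the letter itself is returned as a literal.
    letter = pattern[pos]
    width = {"x": 2, "u": 4}[letter]
    digits = pattern[pos + 1:pos + 1 + width]
    if len(digits) == width and all(c in HEX for c in digits):
        return int(digits, 16), pos + 1 + width

    # \u{hhh..} is accepted only in the "extra" mode; a space or any
    # other non-hex character makes the whole thing a literal.
    if extra and letter == "u" and pattern[pos + 1:pos + 2] == "{":
        end = pattern.find("}", pos + 2)
        body = pattern[pos + 2:end] if end != -1 else ""
        if body and all(c in HEX for c in body):
            return int(body, 16), end + 1

    return ord(letter), pos + 1
```

       For example, "\x41" gives code point 65, whereas "\x{41}" gives a
       literal "x" followed by the remaining characters, which then stand
       for themselves.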

       The \N{U+hhh..} escape sequence is recognized only when PCRE2 is
       operating in UTF mode. Perl also uses \N{name} to specify characters
       by Unicode name; PCRE2 does not support this. Note that when \N is
       not followed by an opening brace (curly bracket) it has an entirely
       different meaning, matching any character that is not a newline.

       There are some legacy applications where the escape sequence \r is
       expected to match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF
       option is set, \r in a pattern is converted to \n so that it matches
       a LF (linefeed) instead of a CR (carriage return) character.

       An error occurs if \c is not followed by a character whose ASCII
       code point is in the range 32 to 126. The precise effect of \cx is
       as follows: if x is a lower case letter, it is converted to upper
       case. Then bit 6 of the character (hex 40) is inverted. Thus \cA to
       \cZ become hex 01 to hex 1A (A is 41, Z is 5A), but \c{ becomes hex
       3B ({ is 7B), and \c; becomes hex 7B (; is 3B). If the code unit
       following \c has a code point less than 32 or greater than 126, a
       compile-time error occurs.

       When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported.
       \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code
       values.
       The \c escape is processed as specified for Perl in the perlebcdic
       document. The only characters that are allowed after \c are A-Z,
       a-z, or one of @, [, \, ], ^, _, or ?. Any other character provokes
       a compile-time error. The sequence \c@ encodes character code 0;
       after \c the letters (in either case) encode characters 1-26 (hex 01
       to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex
       1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).

       Thus, apart from \c?, these escapes generate the same character code
       values as they do in an ASCII environment, though the meanings of
       the values mostly differ. For example, \cG always generates code
       value 7, which is BEL in ASCII but DEL in EBCDIC.

       The sequence \c? generates DEL (127, hex 7F) in an ASCII
       environment, but because 127 is not a control character in EBCDIC,
       Perl makes it generate the APC character. Unfortunately, there are
       several variants of EBCDIC. In most of them the APC character has
       the value 255 (hex FF), but in the one Perl calls POSIX-BC its value
       is 95 (hex 5F). If certain other characters have POSIX-BC values,
       PCRE2 makes \c? generate 95; otherwise it generates 255.

       After \0 up to two further octal digits are read. If there are fewer
       than two digits, just those that are present are used.
       Thus the sequence \0\x\015 specifies two binary zeros followed by a
       CR character (code value 13). Make sure you supply two digits after
       the initial zero if the pattern character that follows is itself an
       octal digit.

       The escape \o must be followed by a sequence of octal digits,
       enclosed in braces. An error occurs if this is not the case. This
       escape is a recent addition to Perl; it provides a way of specifying
       character code points as octal numbers greater than 0777, and it
       also allows octal numbers and backreferences to be unambiguously
       specified.

       For greater clarity and unambiguity, it is best to avoid following \
       by a digit greater than zero. Instead, use \o{...} or \x{...} to
       specify numerical character code points, and \g{...} to specify
       backreferences. The following paragraphs describe the old, ambiguous
       syntax.

       The handling of a backslash followed by a digit other than 0 is
       complicated, and Perl has changed over time, causing PCRE2 also to
       change.

       Outside a character class, PCRE2 reads the digit and any following
       digits as a decimal number. If the number is less than 10, begins
       with the digit 8 or 9, or if there are at least that many previous
       capture groups in the expression, the entire sequence is taken as a
       backreference.
       A description of how this works is given later, following the
       discussion of parenthesized groups. Otherwise, up to three octal
       digits are read to form a character code.

       Inside a character class, PCRE2 handles \8 and \9 as the literal
       characters "8" and "9", and otherwise reads up to three octal digits
       following the backslash, using them to generate a data character.
       Any subsequent digits stand for themselves. For example, outside a
       character class:

         \040   is another way of writing an ASCII space
         \40    is the same, provided there are fewer than 40
                  previous capture groups
         \7     is always a backreference
         \11    might be a backreference, or another way of
                  writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a backreference, otherwise the
                  character with octal code 113
         \377   might be a backreference, otherwise
                  the value 255 (decimal)
         \81    is always a backreference

       Note that octal values of 100 or greater that are specified using
       this syntax must not be introduced by a leading zero, because no
       more than three octal digits are ever read.
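
       The decision between a backreference and an octal character code
       for the old syntax, outside a character class, can be summarized
       by the following Python sketch. It is a model of the rules stated
       above rather than PCRE2's actual implementation; the function name
       classify_digit_escape and its arguments are hypothetical, and the
       limits on character values for each mode are not checked.

```python
OCTAL = set("01234567")


def classify_digit_escape(digits, group_count):
    # `digits` is the non-empty run of decimal digits after the backslash;
    # `group_count` is the number of previous capture groups.
    # Returns (kind, value, consumed). Digits beyond `consumed` stand for
    # themselves.
    if digits[0] != "0":
        number = int(digits)
        if number < 10 or digits[0] in "89" or number <= group_count:
            return "backreference", number, len(digits)

    # Either \0 plus up to two further octal digits, or the fallback for
    # a number that is not a backreference: at most three octal digits.
    end = 0
    while end < 3 and end < len(digits) and digits[end] in OCTAL:
        end += 1
    return "octal", int(digits[:end], 8), end
```

       With no capture groups, classify_digit_escape("0113", 0) reports
       an octal code of 9 (a tab) using three digits, so the final "3" is
       a literal, which agrees with the \0113 entry in the table.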

   Constraints on character values

       Characters that are specified using octal or hexadecimal numbers are
       limited to certain values, as follows:

         8-bit non-UTF mode    no greater than 0xff
         16-bit non-UTF mode   no greater than 0xffff
         32-bit non-UTF mode   no greater than 0xffffffff
         All UTF modes         no greater than 0x10ffff and a valid code point

       Invalid Unicode code points are all those in the range 0xd800 to
       0xdfff (the so-called "surrogate" code points). The check for these
       can be disabled by the caller of pcre2_compile() by setting the
       option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is
       possible only in UTF-8 and UTF-32 modes, because these values are
       not representable in UTF-16.

   Escape sequences in character classes

       All the sequences that define a single character value can be used
       both inside and outside character classes. In addition, inside a
       character class, \b is interpreted as the backspace character (hex
       08).

       When not followed by an opening brace, \N is not allowed in a
       character class. \B, \R, and \X are not special inside a character
       class. Like other unrecognized alphabetic escape sequences, they
       cause an error.
       Outside a character class, these sequences have different meanings.

   Unsupported escape sequences

       In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its
       string handler and used to modify the case of following characters.
       By default, PCRE2 does not support these escape sequences in
       patterns. However, if either of the PCRE2_ALT_BSUX or
       PCRE2_EXTRA_ALT_BSUX options is set, \U matches a "U" character, and
       \u can be used to define a character by code point, as described
       above.

   Absolute and relative backreferences

       The sequence \g followed by a signed or unsigned number, optionally
       enclosed in braces, is an absolute or relative backreference. A
       named backreference can be coded as \g{name}. Backreferences are
       discussed later, following the discussion of parenthesized groups.

   Absolute and relative subroutine calls

       For compatibility with Oniguruma, the non-Perl syntax \g followed by
       a name or a number enclosed either in angle brackets or single
       quotes, is an alternative syntax for referencing a capture group as
       a subroutine. Details are discussed later. Note that \g{...} (Perl
       syntax) and \g<...> (Oniguruma syntax) are not synonymous. The
       former is a backreference; the latter is a subroutine call.
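
       Because the two \g forms are easy to confuse, the distinction can
       be shown with a small classifier. This Python sketch uses Python's
       own re module merely to recognize the delimiters; it does not call
       PCRE2. The function name classify_g_escape is hypothetical, and the
       group-name syntax it accepts (an ASCII letter or underscore followed
       by ASCII word characters) is a simplifying assumption.

```python
import re

_REF = r"[+-]?\d+|[A-Za-z_]\w*"

# \g{...} or \g followed by a bare signed or unsigned number
_BACKREFERENCE = re.compile(r"\\g(?:\{(%s)\}|([+-]?\d+))" % _REF, re.ASCII)
# \g<...> or \g'...'
_SUBROUTINE = re.compile(r"\\g(?:<(%s)>|'(%s)')" % (_REF, _REF), re.ASCII)


def classify_g_escape(text):
    # Returns (kind, reference) or None if `text` is neither form.
    m = _BACKREFERENCE.fullmatch(text)
    if m:
        return "backreference", m.group(1) or m.group(2)
    m = _SUBROUTINE.fullmatch(text)
    if m:
        return "subroutine call", m.group(1) or m.group(2)
    return None
```

       Under these assumptions \g{name} and \g-1 are reported as
       backreferences, whereas \g<name> and \g'3' are reported as
       subroutine calls.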

   Generic character types

       Another use of backslash is for specifying generic character types:

         \d     any decimal digit
         \D     any character that is not a decimal digit
         \h     any horizontal white space character
         \H     any character that is not a horizontal white space character
         \N     any character that is not a newline
         \s     any white space character
         \S     any character that is not a white space character
         \v     any vertical white space character
         \V     any character that is not a vertical white space character
         \w     any "word" character
         \W     any "non-word" character

       The \N escape sequence has the same meaning as the "." metacharacter
       when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not
       change the meaning of \N. Note that when \N is followed by an
       opening brace it has a different meaning. See the section entitled
       "Non-printing characters" above for details. Perl also uses \N{name}
       to specify characters by Unicode name; PCRE2 does not support this.

       Each pair of lower and upper case escape sequences partitions the
       complete set of characters into two disjoint sets. Any given
       character matches one, and only one, of each pair. The sequences can
       appear both inside and outside character classes.
       They each match one character of the appropriate type. If the
       current matching point is at the end of the subject string, all of
       them fail, because there is no character to match.

       The default \s characters are HT (9), LF (10), VT (11), FF (12), CR
       (13), and space (32), which are defined as white space in the "C"
       locale. This list may vary if locale-specific matching is taking
       place. For example, in some locales the "non-breaking space"
       character (\xA0) is recognized as white space, and in others the VT
       character is not.

       A "word" character is an underscore or any character that is a
       letter or digit. By default, the definition of letters and digits is
       controlled by PCRE2's low-valued character tables, and may vary if
       locale-specific matching is taking place (see "Locale support" in
       the pcre2api page). For example, in a French locale such as "fr_FR"
       in Unix-like systems, or "french" in Windows, some character codes
       greater than 127 are used for accented letters, and these are then
       matched by \w. The use of locales with Unicode is discouraged.

       By default, characters whose code points are greater than 127 never
       match \d, \s, or \w, and always match \D, \S, and \W, although this
       may be different for characters in the range 128-255 when
       locale-specific matching is happening.
       These escape sequences retain their original meanings from before
       Unicode support was available, mainly for efficiency reasons. If the
       PCRE2_UCP option is set, the behaviour is changed so that Unicode
       properties are used to determine character types, as follows:

         \d  any character that matches \p{Nd} (decimal digit)
         \s  any character that matches \p{Z} or \h or \v
         \w  any character that matches \p{L}, \p{N}, \p{Mn}, or \p{Pc}

       The addition of \p{Mn} (non-spacing mark) and the replacement of an
       explicit test for underscore with a test for \p{Pc} (connector
       punctuation) happened in PCRE2 release 10.43. This brings PCRE2 into
       line with Perl.

       The upper case escapes match the inverse sets of characters. Note
       that \d matches only decimal digits, whereas \w matches any Unicode
       digit, as well as other character categories. Note also that
       PCRE2_UCP affects \b and \B because they are defined in terms of \w
       and \W. Matching these sequences is noticeably slower when PCRE2_UCP
       is set.

       The effect of PCRE2_UCP on any one of these escape sequences can be
       negated by the options PCRE2_EXTRA_ASCII_BSD, PCRE2_EXTRA_ASCII_BSS,
       and PCRE2_EXTRA_ASCII_BSW, respectively. These options can be set
       and reset within a pattern by means of an internal option setting
       (see below).
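
       The difference between the default and the PCRE2_UCP definitions
       of \d and \w can be illustrated with Python's unicodedata module
       standing in for PCRE2's property tables. This is a sketch under
       assumptions: the four function names are hypothetical, the default
       functions model the "C" locale only, and Python's Unicode database
       may be a different version from the one a given PCRE2 build uses.
       \s is omitted because it also depends on the \h and \v lists.

```python
import unicodedata


def ascii_digit(ch):
    # Default \d: ASCII decimal digits only.
    return "0" <= ch <= "9"


def ascii_word(ch):
    # Default \w: underscore, or an ASCII letter or digit.
    return ch == "_" or (ch.isascii() and ch.isalnum())


def ucp_digit(ch):
    # \d under PCRE2_UCP: general category Nd.
    return unicodedata.category(ch) == "Nd"


def ucp_word(ch):
    # \w under PCRE2_UCP: \p{L}, \p{N}, \p{Mn}, or \p{Pc}.
    category = unicodedata.category(ch)
    return category[0] in "LN" or category in ("Mn", "Pc")
```

       For instance, superscript two (U+00B2, category No) satisfies
       ucp_word but not ucp_digit, which is the asymmetry noted above:
       \d matches only decimal digits, whereas \w matches any Unicode
       digit.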

       The sequences \h, \H, \v, and \V, in contrast to the other
       sequences, which match only ASCII characters by default, always
       match a specific list of code points, whether or not PCRE2_UCP is
       set. The horizontal space characters are:

         U+0009     Horizontal tab (HT)
         U+0020     Space
         U+00A0     Non-break space
         U+1680     Ogham space mark
         U+180E     Mongolian vowel separator
         U+2000     En quad
         U+2001     Em quad
         U+2002     En space
         U+2003     Em space
         U+2004     Three-per-em space
         U+2005     Four-per-em space
         U+2006     Six-per-em space
         U+2007     Figure space
         U+2008     Punctuation space
         U+2009     Thin space
         U+200A     Hair space
         U+202F     Narrow no-break space
         U+205F     Medium mathematical space
         U+3000     Ideographic space

       The vertical space characters are:

         U+000A     Linefeed (LF)
         U+000B     Vertical tab (VT)
         U+000C     Form feed (FF)
         U+000D     Carriage return (CR)
         U+0085     Next line (NEL)
         U+2028     Line separator
         U+2029     Paragraph separator

       In 8-bit, non-UTF-8 mode, only the characters with code points less
       than 256 are relevant.

   Newline sequences

       Outside a character class, by default, the escape sequence \R
       matches any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is
       equivalent to the following:

         (?>\r\n|\n|\x0b|\f|\r|\x85)

       This is an example of an "atomic group", details of which are given
       below. This particular group matches either the two-character
       sequence CR followed by LF, or one of the single characters LF
       (linefeed, U+000A), VT (vertical tab, U+000B), FF (form feed,
       U+000C), CR (carriage return, U+000D), or NEL (next line, U+0085).
       Because this is an atomic group, the two-character sequence is
       treated as a single unit that cannot be split.

       In other modes, two additional characters whose code points are
       greater than 255 are added: LS (line separator, U+2028) and PS
       (paragraph separator, U+2029). Unicode support is not needed for
       these characters to be recognized.

       It is possible to restrict \R to match only CR, LF, or CRLF (instead
       of the complete set of Unicode line endings) by setting the option
       PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for
       "backslash R".)
       This can be made the default when PCRE2 is built; if this is the
       case, the other behaviour can be requested via the PCRE2_BSR_UNICODE
       option. It is also possible to specify these settings by starting a
       pattern string with one of the following sequences:

         (*BSR_ANYCRLF)   CR, LF, or CRLF only
         (*BSR_UNICODE)   any Unicode newline sequence

       These override the default and the options given to the compiling
       function. Note that these special settings, which are not
       Perl-compatible, are recognized only at the very start of a pattern,
       and that they must be in upper case. If more than one of them is
       present, the last one is used. They can be combined with a change of
       newline convention; for example, a pattern can start with:

         (*ANY)(*BSR_ANYCRLF)

       They can also be combined with the (*UTF) or (*UCP) special
       sequences. Inside a character class, \R is treated as an
       unrecognized escape sequence, and causes an error.

   Unicode character properties

       When PCRE2 is built with Unicode support (the default), three
       additional escape sequences that match characters with specific
       properties are available.
       They can be used in any mode, though in 8-bit and 16-bit non-UTF
       modes these sequences are of course limited to testing characters
       whose code points are less than U+0100 and U+10000, respectively. In
       32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode
       limit) may be encountered. These are all treated as being in the
       Unknown script and with an unassigned type.

       Matching characters by Unicode property is not fast, because PCRE2
       has to do a multistage table lookup in order to find a character's
       property. That is why the traditional escape sequences such as \d
       and \w do not use Unicode properties in PCRE2 by default, though you
       can make them do so by setting the PCRE2_UCP option or by starting
       the pattern with (*UCP).

       The extra escape sequences that provide property support are:

         \p{xx}   a character with the xx property
         \P{xx}   a character without the xx property
         \X       a Unicode extended grapheme cluster

       The property names represented by xx above are not case-sensitive,
       and in accordance with Unicode's "loose matching" rules, spaces,
       hyphens, and underscores are ignored.
       There is support for Unicode script names, Unicode general category
       properties, "Any", which matches any character (including newline),
       Bidi_Class, a number of binary (yes/no) properties, and some special
       PCRE2 properties (described below). Certain other Perl properties
       such as "InMusicalSymbols" are not supported by PCRE2. Note that
       \P{Any} does not match any characters, so always causes a match
       failure.

   Script properties for \p and \P

       There are three different syntax forms for matching a script. Each
       Unicode character has a basic script and, optionally, a list of
       other scripts ("Script Extensions") with which it is commonly used.
       Using the Adlam script as an example, \p{sc:Adlam} matches
       characters whose basic script is Adlam, whereas \p{scx:Adlam}
       matches, in addition, characters that have Adlam in their extensions
       list. The full names "script" and "script extensions" for the
       property types are recognized, and an equals sign is an alternative
       to the colon. If a script name is given without a property type, for
       example, \p{Adlam}, it is treated as \p{scx:Adlam}. Perl changed to
       this interpretation at release 5.26 and PCRE2 changed at release
       10.40.

       Unassigned characters (and in non-UTF 32-bit mode, characters with
       code points greater than 0x10FFFF) are assigned the "Unknown"
       script.
       Others that are not part of an identified script are lumped together
       as "Common". The current list of recognized script names and their
       4-character abbreviations can be obtained by running this command:

         pcre2test -LS

   The general category property for \p and \P

       Each character has exactly one Unicode general category property,
       specified by a two-letter abbreviation. For compatibility with Perl,
       negation can be specified by including a circumflex between the
       opening brace and the property name. For example, \p{^Lu} is the
       same as \P{Lu}.

       If only one letter is specified with \p or \P, it includes all the
       general category properties that start with that letter.
       In this case, in the absence of negation, the curly brackets in the
       escape sequence are optional; these two examples have the same
       effect:

         \p{L}
         \pL

       The following general category property codes are supported:

         C     Other
         Cc    Control
         Cf    Format
         Cn    Unassigned
         Co    Private use
         Cs    Surrogate

         L     Letter
         Ll    Lower case letter
         Lm    Modifier letter
         Lo    Other letter
         Lt    Title case letter
         Lu    Upper case letter

         M     Mark
         Mc    Spacing mark
         Me    Enclosing mark
         Mn    Non-spacing mark

         N     Number
         Nd    Decimal number
         Nl    Letter number
         No    Other number

         P     Punctuation
         Pc    Connector punctuation
         Pd    Dash punctuation
         Pe    Close punctuation
         Pf    Final punctuation
         Pi    Initial punctuation
         Po    Other punctuation
         Ps    Open punctuation

         S     Symbol
         Sc    Currency symbol
         Sk    Modifier symbol
         Sm    Mathematical symbol
         So    Other symbol

         Z     Separator
         Zl    Line separator
         Zp    Paragraph separator
         Zs    Space separator

       The special property LC, which has the synonym L&, is also
       supported: it matches a character that has the Lu, Ll, or Lt
       property, in other words, a letter that is not classified as a
       modifier or "other".

       The Cs (Surrogate) property applies only to characters whose code
       points are in the range U+D800 to U+DFFF. These characters are no
       different to any other character when PCRE2 is not in UTF mode
       (using the 16-bit or 32-bit library). However, they are not valid in
       Unicode strings and so cannot be tested by PCRE2 in UTF mode, unless
       UTF validity checking has been turned off (see the discussion of
       PCRE2_NO_UTF_CHECK in the pcre2api page).

       The long synonyms for property names that Perl supports (such as
       \p{Letter}) are not supported by PCRE2, nor is it permitted to
       prefix any of these properties with "Is".

       No character that is in the Unicode table has the Cn (unassigned)
       property. Instead, this property is assumed for any code point that
       is not in the Unicode table.
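
       The way one-letter codes, two-letter codes, LC, and circumflex
       negation relate to each other can be sketched in Python, again
       using the unicodedata module as a stand-in for PCRE2's tables. The
       function has_general_category is a hypothetical helper for
       illustration only; it takes the text that would appear between the
       braces of \p{...} and assumes it names a general category.

```python
import unicodedata


def has_general_category(ch, prop):
    # A leading circumflex negates the test, so "^Lu" behaves like \P{Lu}.
    negate = prop.startswith("^")
    name = (prop[1:] if negate else prop).lower()  # names are caseless
    category = unicodedata.category(ch).lower()

    if name in ("lc", "l&"):
        # Special property: a cased letter (Lu, Ll, or Lt).
        result = category in ("lu", "ll", "lt")
    elif len(name) == 1:
        # One letter covers every category that starts with it.
        result = category[0] == name
    else:
        result = category == name
    return result != negate
```

       Hebrew alef (U+05D0, category Lo) therefore satisfies "L" but not
       "LC", and a code point with no table entry, such as the
       noncharacter U+FFFF, is reported as Cn.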

       Specifying caseless matching does not affect these escape sequences.
       For example, \p{Lu} always matches only upper case letters. This is
       different from the behaviour of current versions of Perl.

   Binary (yes/no) properties for \p and \P

       Unicode defines a number of binary properties, that is, properties
       whose only values are true or false. You can obtain a list of those
       that are recognized by \p and \P, along with their abbreviations, by
       running this command:

         pcre2test -LP


   The Bidi_Class property for \p and \P

         \p{Bidi_Class:<class>}   matches a character with the given class
         \p{BC:<class>}           matches a character with the given class

       The recognized classes are:

         AL    Arabic letter
         AN    Arabic number
         B     paragraph separator
         BN    boundary neutral
         CS    common separator
         EN    European number
         ES    European separator
         ET    European terminator
         FSI   first strong isolate
         L     left-to-right
         LRE   left-to-right embedding
         LRI   left-to-right isolate
         LRO   left-to-right override
         NSM   non-spacing mark
         ON    other neutral
         PDF   pop directional format
         PDI   pop directional isolate
         R     right-to-left
         RLE   right-to-left embedding
         RLI   right-to-left isolate
         RLO   right-to-left override
         S     segment separator
         WS    white space

       An equals sign may be used instead of a colon. The class names are
       case-insensitive; only the short names listed above are recognized.

   Extended grapheme clusters

       The \X escape matches any number of Unicode characters that form an
       "extended grapheme cluster", and treats the sequence as an atomic group
       (see below). Unicode supports various kinds of composite character by
       giving each character a grapheme breaking property, and having rules
       that use these properties to define the boundaries of extended grapheme
       clusters. The rules are defined in Unicode Standard Annex 29, "Unicode
       Text Segmentation". Unicode 11.0.0 abandoned the use of some previous
       properties that had been used for emojis. Instead it introduced various
       emoji-specific properties. PCRE2 uses only the Extended Pictographic
       property.

       \X always matches at least one character.
       Then it decides whether to add additional characters according to the
       following rules for ending a cluster:

       1. End at the end of the subject string.

       2. Do not end between CR and LF; otherwise end after any control
       character.

       3. Do not break Hangul (a Korean script) syllable sequences. Hangul
       characters are of five types: L, V, T, LV, and LVT. An L character may
       be followed by an L, V, LV, or LVT character; an LV or V character may
       be followed by a V or T character; an LVT or T character may be
       followed only by a T character.

       4. Do not end before extending characters or spacing marks or the
       zero-width joiner (ZWJ) character. Characters with the "mark" property
       always have the "extend" grapheme breaking property.

       5. Do not end after prepend characters.

       6. Do not end within emoji modifier sequences or emoji ZWJ (zero-width
       joiner) sequences. An emoji ZWJ sequence consists of a character with
       the Extended_Pictographic property, optionally followed by one or more
       characters with the Extend property, followed by the ZWJ character,
       followed by another Extended_Pictographic character.

       7. Do not break within emoji flag sequences.
       That is, do not break between regional indicator (RI) characters if
       there are an odd number of RI characters before the break point.

       8. Otherwise, end the cluster.

   PCRE2's additional properties

       As well as the standard Unicode properties described above, PCRE2
       supports four more that make it possible to convert traditional escape
       sequences such as \w and \s to use Unicode properties. PCRE2 uses these
       non-standard, non-Perl properties internally when PCRE2_UCP is set.
       However, they may also be used explicitly. These properties are:

         Xan   Any alphanumeric character
         Xps   Any POSIX space character
         Xsp   Any Perl space character
         Xwd   Any Perl "word" character

       Xan matches characters that have either the L (letter) or the N
       (number) property. Xps matches the characters tab, linefeed, vertical
       tab, form feed, or carriage return, and any other character that has
       the Z (separator) property. Xsp is the same as Xps; in PCRE1 it used to
       exclude vertical tab, for Perl compatibility, but Perl changed. Xwd
       matches the same characters as Xan, plus those that match Mn
       (non-spacing mark) or Pc (connector punctuation, which includes
       underscore).
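
       The Xps definition above can be modelled by hand for a single code
       point. The following C sketch is an illustration only and is not how
       PCRE2 implements the property: both function names are hypothetical,
       and the list of Z (separator) code points is written out from the
       Unicode Zs, Zl, and Zp categories rather than taken from PCRE2's
       tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): true if cp has the Unicode
   Z (separator) general category. The code points are listed by hand,
   which is an assumption of this sketch. */
bool has_separator_property(uint32_t cp)
{
    return cp == 0x0020 || cp == 0x00A0 || cp == 0x1680 ||   /* Zs */
           (cp >= 0x2000 && cp <= 0x200A) ||                 /* Zs */
           cp == 0x202F || cp == 0x205F || cp == 0x3000 ||   /* Zs */
           cp == 0x2028 ||                                   /* Zl */
           cp == 0x2029;                                     /* Zp */
}

/* Model of Xps (and therefore Xsp) as described in the text: tab,
   linefeed, vertical tab, form feed, carriage return, or any character
   with the Z property. */
bool is_xps(uint32_t cp)
{
    assert(cp <= 0x10FFFF);   /* only valid Unicode code points */
    return (cp >= 0x09 && cp <= 0x0D) || has_separator_property(cp);
}
```

       Because Xsp is now the same as Xps, the same test serves for both.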

       There is another non-standard property, Xuc, which matches any
       character that can be represented by a Universal Character Name in C++
       and other programming languages. These are the characters $, @, `
       (grave accent), and all characters with Unicode code points greater
       than or equal to U+00A0, except for the surrogates U+D800 to U+DFFF.
       Note that most base (ASCII) characters are excluded. (Universal
       Character Names are of the form \uHHHH or \UHHHHHHHH where H is a
       hexadecimal digit. Note that the Xuc property does not match these
       sequences but the characters that they represent.)

   Resetting the match start

       In normal use, the escape sequence \K causes any previously matched
       characters not to be included in the final matched sequence that is
       returned. For example, the pattern:

         foo\Kbar

       matches "foobar", but reports that it has matched "bar". \K does not
       interact with anchoring in any way. The pattern:

         ^foo\Kbar

       matches only when the subject begins with "foobar" (in single line
       mode), though it again reports the matched string as "bar".
       This feature is similar to a lookbehind assertion (described below), but
       the part of the pattern that precedes \K is not constrained to match a
       limited number of characters, as is required for a lookbehind assertion.
       The use of \K does not interfere with the setting of captured
       substrings. For example, when the pattern

         (foo)\Kbar

       matches "foobar", the first substring is still set to "foo".

       From version 5.32.0 Perl forbids the use of \K in lookaround assertions.
       From release 10.38 PCRE2 also forbids this by default. However, the
       PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK option can be used when calling
       pcre2_compile() to re-enable the previous behaviour. When this option is
       set, \K is acted upon when it occurs inside positive assertions, but is
       ignored in negative assertions. Note that when a pattern such as
       (?=ab\K) matches, the reported start of the match can be greater than
       the end of the match. Using \K in a lookbehind assertion at the start of
       a pattern can also lead to odd effects.
       For example, consider this pattern:

         (?<=\Kfoo)bar

       If the subject is "foobar", a call to pcre2_match() with a starting
       offset of 3 succeeds and reports the matching string as "foobar", that
       is, the start of the reported match is earlier than where the match
       started.

   Simple assertions

       The final use of backslash is for certain simple assertions. An
       assertion specifies a condition that has to be met at a particular
       point in a match, without consuming any characters from the subject
       string. The use of groups for more complicated assertions is described
       below. The backslashed assertions are:

         \b     matches at a word boundary
         \B     matches when not at a word boundary
         \A     matches at the start of the subject
         \Z     matches at the end of the subject
                 also matches before a newline at the end of the subject
         \z     matches only at the end of the subject
         \G     matches at the first matching position in the subject

       Inside a character class, \b has a different meaning; it matches the
       backspace character. If any other of these assertions appears in a
       character class, an "invalid escape sequence" error is generated.
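
       The two end-of-subject assertions in the table above differ only in
       how they treat a final newline. The following C sketch models that
       difference for a given position in the subject. It is a simplified
       illustration, not PCRE2 code: the function names are hypothetical, and
       it assumes that a single linefeed is the only recognized newline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of \z: true only at the very end of the subject. */
bool at_end_only(size_t len, size_t pos)
{
    assert(pos <= len);
    return pos == len;
}

/* Model of \Z: true at the very end of the subject, and also immediately
   before a newline that is the last character of the subject.
   Assumption: the newline convention is a single LF. */
bool at_end_or_before_final_newline(const char *subject, size_t len,
                                    size_t pos)
{
    assert(subject != NULL && pos <= len);
    if (pos == len)
        return true;
    return pos + 1 == len && subject[pos] == '\n';
}
```

       For the subject "abc\n" (where \n represents a newline), the \Z model
       is true at offsets 3 and 4, whereas the \z model is true only at
       offset 4.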

       A word boundary is a position in the subject string where the current
       character and the previous character do not both match \w or \W (i.e.
       one matches \w and the other matches \W), or the start or end of the
       string if the first or last character matches \w, respectively. When
       PCRE2 is built with Unicode support, the meanings of \w and \W can be
       changed by setting the PCRE2_UCP option. When this is done, it also
       affects \b and \B. Neither PCRE2 nor Perl has a separate "start of
       word" or "end of word" metasequence. However, whatever follows \b
       normally determines which it is. For example, the fragment \ba matches
       "a" at the start of a word.

       The \A, \Z, and \z assertions differ from the traditional circumflex
       and dollar (described in the next section) in that they only ever match
       at the very start and end of the subject string, whatever options are
       set. Thus, they are independent of multiline mode. These three
       assertions are not affected by the PCRE2_NOTBOL or PCRE2_NOTEOL
       options, which affect only the behaviour of the circumflex and dollar
       metacharacters. However, if the startoffset argument of pcre2_match()
       is non-zero, indicating that matching is to start at a point other than
       the beginning of the subject, \A can never match.
       The difference between \Z and \z is that \Z matches before a newline at
       the end of the string as well as at the very end, whereas \z matches
       only at the end.

       The \G assertion is true only when the current matching position is at
       the start point of the matching process, as specified by the
       startoffset argument of pcre2_match(). It differs from \A when the
       value of startoffset is non-zero. By calling pcre2_match() multiple
       times with appropriate arguments, you can mimic Perl's /g option, and
       it is in this kind of implementation where \G can be useful.

       Note, however, that PCRE2's implementation of \G, being true at the
       starting character of the matching process, is subtly different from
       Perl's, which defines it as true at the end of the previous match. In
       Perl, these can be different when the previously matched string was
       empty. Because PCRE2 does just one match at a time, it cannot reproduce
       this behaviour.

       If all the alternatives of a pattern begin with \G, the expression is
       anchored to the starting match position, and the "anchored" flag is set
       in the compiled regular expression.


CIRCUMFLEX AND DOLLAR

       The circumflex and dollar metacharacters are zero-width assertions.
       That is, they test for a particular condition being true without
       consuming any characters from the subject string. These two
       metacharacters are concerned with matching the starts and ends of
       lines. If the newline convention is set so that only the two-character
       sequence CRLF is recognized as a newline, isolated CR and LF characters
       are treated as ordinary data characters, and are not recognized as
       newlines.

       Outside a character class, in the default matching mode, the circumflex
       character is an assertion that is true only if the current matching
       point is at the start of the subject string. If the startoffset
       argument of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set,
       circumflex can never match if the PCRE2_MULTILINE option is unset.
       Inside a character class, circumflex has an entirely different meaning
       (see below).

       Circumflex need not be the first character of the pattern if a number
       of alternatives are involved, but it should be the first thing in each
       alternative in which it appears if the pattern is ever to match that
       branch. If all possible alternatives start with a circumflex, that is,
       if the pattern is constrained to match only at the start of the
       subject, it is said to be an "anchored" pattern. (There are also other
       constructs that can cause a pattern to be anchored.)

       The dollar character is an assertion that is true only if the current
       matching point is at the end of the subject string, or immediately
       before a newline at the end of the string (by default), unless
       PCRE2_NOTEOL is set. Note, however, that it does not actually match the
       newline. Dollar need not be the last character of the pattern if a
       number of alternatives are involved, but it should be the last item in
       any branch in which it appears. Dollar has no special meaning in a
       character class.

       The meaning of dollar can be changed so that it matches only at the
       very end of the string, by setting the PCRE2_DOLLAR_ENDONLY option at
       compile time. This does not affect the \Z assertion.

       The meanings of the circumflex and dollar metacharacters are changed if
       the PCRE2_MULTILINE option is set. When this is the case, a dollar
       character matches before any newlines in the string, as well as at the
       very end, and a circumflex matches immediately after internal newlines
       as well as at the start of the subject string. It does not match after
       a newline that ends the string, for compatibility with Perl. However,
       this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.
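
       The three behaviours of dollar described above can be summarized in a
       small decision function. The following C sketch is a model for
       illustration, not PCRE2's implementation. The function name and the
       SKETCH_ flag values are hypothetical (they are not the PCRE2 option
       bits), PCRE2_NOTEOL is left out, and a single linefeed is assumed to be
       the only newline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flags for this sketch only; these are NOT the real PCRE2 option bits. */
enum {
    SKETCH_MULTILINE      = 1 << 0,
    SKETCH_DOLLAR_ENDONLY = 1 << 1
};

/* Model of where a dollar assertion is true at offset pos.
   Assumptions: LF is the only newline, and PCRE2_NOTEOL is not set. */
bool dollar_matches(const char *subject, size_t len, size_t pos,
                    unsigned flags)
{
    assert(subject != NULL && pos <= len);
    if (pos == len)
        return true;                    /* the very end always qualifies */
    if (subject[pos] != '\n')
        return false;                   /* otherwise a newline must follow */
    if (flags & SKETCH_MULTILINE)
        return true;                    /* before any newline */
    if (flags & SKETCH_DOLLAR_ENDONLY)
        return false;                   /* only the very end is accepted */
    return pos + 1 == len;              /* before a newline at the end */
}
```

       Testing the multiline flag first means that the end-only flag has no
       effect when both are given, which is the precedence described in this
       section.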

       For example, the pattern /^abc$/ matches the subject string "def\nabc"
       (where \n represents a newline) in multiline mode, but not otherwise.
       Consequently, patterns that are anchored in single line mode because
       all branches start with ^ are not anchored in multiline mode, and a
       match for circumflex is possible when the startoffset argument of
       pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
       if PCRE2_MULTILINE is set.

       When the newline convention (see "Newline conventions" above)
       recognizes the two-character sequence CRLF as a newline, this is
       preferred, even if the single characters CR and LF are also recognized
       as newlines. For example, if the newline convention is "any", a
       multiline mode circumflex matches before "xyz" in the string
       "abc\r\nxyz" rather than after CR, even though CR on its own is a valid
       newline. (It also matches at the very start of the string, of course.)

       Note that the sequences \A, \Z, and \z can be used to match the start
       and end of the subject in both modes, and if all branches of a pattern
       start with \A it is always anchored, whether or not PCRE2_MULTILINE is
       set.
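
       The preference for CRLF over an isolated CR can be expressed as a
       position test. The following C sketch models a multiline mode
       circumflex; it is an illustration rather than PCRE2's own code. The
       function name is hypothetical, only CR, LF, and CRLF are treated as
       newlines (the other line endings recognized by the "any" convention
       are ignored), and PCRE2_ALT_CIRCUMFLEX is assumed to be unset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of a multiline-mode circumflex at offset pos.
   Assumptions: CR, LF and CRLF are the recognized newlines, and
   PCRE2_ALT_CIRCUMFLEX is not set. */
bool multiline_circumflex_matches(const char *subject, size_t len,
                                  size_t pos)
{
    assert(subject != NULL && pos <= len);
    if (pos == 0)
        return true;     /* start of the subject */
    if (pos == len)
        return false;    /* not after a newline that ends the string */
    if (subject[pos - 1] == '\n')
        return true;     /* after LF, alone or as the end of CRLF */
    /* After CR only when that CR is not the first half of a CRLF pair. */
    return subject[pos - 1] == '\r' && subject[pos] != '\n';
}
```

       For "abc\r\nxyz" the model is true at offset 0 and at offset 5 (before
       "xyz"), but false at offset 4, which lies between CR and LF.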


FULL STOP (PERIOD, DOT) AND \N

       Outside a character class, a dot in the pattern matches any one
       character in the subject string except (by default) a character that
       signifies the end of a line. One or more characters may be specified as
       line terminators (see "Newline conventions" above).

       Dot never matches a single line-ending character. When the
       two-character sequence CRLF is the only line ending, dot does not match
       CR if it is immediately followed by LF, but otherwise it matches all
       characters (including isolated CRs and LFs). When ANYCRLF is selected
       for line endings, no occurrences of CR or LF match dot. When all
       Unicode line endings are being recognized, dot does not match CR or LF
       or any of the other line ending characters.

       The behaviour of dot with regard to newlines can be changed. If the
       PCRE2_DOTALL option is set, a dot matches any one character, without
       exception. If the two-character sequence CRLF is present in the
       subject string, it takes two dots to match it.

       The handling of dot is entirely independent of the handling of
       circumflex and dollar, the only relationship being that they both
       involve newlines. Dot has no special meaning in a character class.
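
       The rules for dot under two of the newline conventions mentioned above
       can be modelled as follows. This C sketch is illustrative only: the
       enum, its values, and the function name are hypothetical and are not
       part of the PCRE2 API, and it looks at single-byte characters only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for two newline conventions; not PCRE2 identifiers. */
typedef enum { SKETCH_NL_CRLF_ONLY, SKETCH_NL_ANYCRLF } sketch_newline;

/* Model of whether a dot matches the character at offset pos. */
bool dot_matches(const char *subject, size_t len, size_t pos,
                 sketch_newline nl, bool dotall)
{
    assert(subject != NULL && pos < len);
    if (dotall)
        return true;                 /* any one character, no exceptions */

    char c = subject[pos];
    if (nl == SKETCH_NL_ANYCRLF)
        return c != '\r' && c != '\n';

    /* CRLF is the only line ending: reject a CR that is immediately
       followed by LF; isolated CRs and LFs are ordinary characters. */
    bool cr_before_lf = c == '\r' && pos + 1 < len &&
                        subject[pos + 1] == '\n';
    return !cr_before_lf;
}
```

       With the dotall argument set, the CR and the LF of a CRLF pair are each
       matched separately, which is why two dots are needed to match the
       pair.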

       The escape sequence \N when not followed by an opening brace behaves
       like a dot, except that it is not affected by the PCRE2_DOTALL option.
       In other words, it matches any character except one that signifies the
       end of a line.

       When \N is followed by an opening brace it has a different meaning. See
       the section entitled "Non-printing characters" above for details. Perl
       also uses \N{name} to specify characters by Unicode name; PCRE2 does
       not support this.


MATCHING A SINGLE CODE UNIT

       Outside a character class, the escape sequence \C matches any one code
       unit, whether or not a UTF mode is set. In the 8-bit library, one code
       unit is one byte; in the 16-bit library it is a 16-bit unit; in the
       32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
       line-ending characters. The feature is provided in Perl in order to
       match individual bytes in UTF-8 mode, but it is unclear how it can
       usefully be used.

       Because \C breaks up characters into individual code units, matching
       one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
       string may start with a malformed UTF character.
       This has undefined results, because PCRE2 assumes that it is matching
       character by character in a valid UTF string (by default it checks the
       subject string's validity at the start of processing unless the
       PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF option is used).

       An application can lock out the use of \C by setting the
       PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also
       possible to build PCRE2 with the use of \C permanently disabled.

       PCRE2 does not allow \C to appear in lookbehind assertions (described
       below) in UTF-8 or UTF-16 modes, because this would make it impossible
       to calculate the length of the lookbehind. Neither the alternative
       matching function pcre2_dfa_match() nor the JIT optimizer supports \C
       in these UTF modes. The former gives a match-time error; the latter
       fails to optimize and so the match is always run using the interpreter.

       In the 32-bit library, however, \C is always supported (when not
       explicitly locked out) because it always matches a single code unit,
       whether or not UTF-32 is specified.

       In general, the \C escape sequence is best avoided.
       However, one way of using it that avoids the problem of malformed UTF-8
       or UTF-16 characters is to use a lookahead to check the length of the
       next character, as in this pattern, which could be used with a UTF-8
       string (ignore white space and line breaks):

         (?| (?=[\x00-\x7f])(\C) |
             (?=[\x80-\x{7ff}])(\C)(\C) |
             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))

       In this example, a group that starts with (?| resets the capturing
       parentheses numbers in each alternative (see "Duplicate Group Numbers"
       below). The assertions at the start of each branch check the next UTF-8
       character for values whose encoding uses 1, 2, 3, or 4 bytes,
       respectively. The character's individual bytes are then captured by the
       appropriate number of \C groups.


SQUARE BRACKETS AND CHARACTER CLASSES

       An opening square bracket introduces a character class, terminated by a
       closing square bracket. A closing square bracket on its own is not
       special by default. If a closing square bracket is required as a member
       of the class, it should be the first data character in the class (after
       an initial circumflex, if present) or escaped with a backslash. This
       means that, by default, an empty class cannot be defined.
       However, if the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square
       bracket at the start does end the (empty) class.

       A character class matches a single character in the subject. A matched
       character must be in the set of characters defined by the class, unless
       the first character in the class definition is a circumflex, in which
       case the subject character must not be in the set defined by the class.
       If a circumflex is actually required as a member of the class, ensure
       it is not the first character, or escape it with a backslash.

       For example, the character class [aeiou] matches any lower case vowel,
       while [^aeiou] matches any character that is not a lower case vowel.
       Note that a circumflex is just a convenient notation for specifying the
       characters that are in the class by enumerating those that are not. A
       class that starts with a circumflex is not an assertion; it still
       consumes a character from the subject string, and therefore it fails if
       the current pointer is at the end of the string.

       Characters in a class may be specified by their code points using \o,
       \x, or \N{U+hh..} in the usual way.
       When caseless matching is set, any letters in a class represent both
       their upper case and lower case versions, so for example, a caseless
       [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
       match "A", whereas a caseful version would. Note that there are two
       ASCII characters, K and S, that, in addition to their lower case ASCII
       equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and
       U+017F (long S) respectively when either PCRE2_UTF or PCRE2_UCP is set.

       Characters that might indicate line breaks are never treated in any
       special way when matching character classes, whatever line-ending
       sequence is in use, and whatever setting of the PCRE2_DOTALL and
       PCRE2_MULTILINE options is used. A class such as [^a] always matches
       one of these characters.

       The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
       \S, \v, \V, \w, and \W may appear in a character class, and add the
       characters that they match to the class. For example, [\dABCDEF]
       matches any hexadecimal digit. In UTF modes, the PCRE2_UCP option
       affects the meanings of \d, \s, \w and their upper case partners, just
       as it does when they appear outside a character class, as described in
       the section entitled "Generic character types" above.
The escape sequence 7732*22dc650dSSadaf Ebrahimi \b has a different meaning inside a character class; it matches the 7733*22dc650dSSadaf Ebrahimi backspace character. The sequences \B, \R, and \X are not special in- 7734*22dc650dSSadaf Ebrahimi side a character class. Like any other unrecognized escape sequences, 7735*22dc650dSSadaf Ebrahimi they cause an error. The same is true for \N when not followed by an 7736*22dc650dSSadaf Ebrahimi opening brace. 7737*22dc650dSSadaf Ebrahimi 7738*22dc650dSSadaf Ebrahimi The minus (hyphen) character can be used to specify a range of charac- 7739*22dc650dSSadaf Ebrahimi ters in a character class. For example, [d-m] matches any letter be- 7740*22dc650dSSadaf Ebrahimi tween d and m, inclusive. If a minus character is required in a class, 7741*22dc650dSSadaf Ebrahimi it must be escaped with a backslash or appear in a position where it 7742*22dc650dSSadaf Ebrahimi cannot be interpreted as indicating a range, typically as the first or 7743*22dc650dSSadaf Ebrahimi last character in the class, or immediately after a range. For example, 7744*22dc650dSSadaf Ebrahimi [b-d-z] matches letters in the range b to d, a hyphen character, or z. 7745*22dc650dSSadaf Ebrahimi 7746*22dc650dSSadaf Ebrahimi Perl treats a hyphen as a literal if it appears before or after a POSIX 7747*22dc650dSSadaf Ebrahimi class (see below) or before or after a character type escape such as \d 7748*22dc650dSSadaf Ebrahimi or \H. However, unless the hyphen is the last character in the class, 7749*22dc650dSSadaf Ebrahimi Perl outputs a warning in its warning mode, as this is most likely a 7750*22dc650dSSadaf Ebrahimi user error. As PCRE2 has no facility for warning, an error is given in 7751*22dc650dSSadaf Ebrahimi these cases. 7752*22dc650dSSadaf Ebrahimi 7753*22dc650dSSadaf Ebrahimi It is not possible to have the literal character "]" as the end charac- 7754*22dc650dSSadaf Ebrahimi ter of a range. 
A pattern such as [W-]46] is interpreted as a class of 7755*22dc650dSSadaf Ebrahimi two characters ("W" and "-") followed by a literal string "46]", so it 7756*22dc650dSSadaf Ebrahimi would match "W46]" or "-46]". However, if the "]" is escaped with a 7757*22dc650dSSadaf Ebrahimi backslash it is interpreted as the end of range, so [W-\]46] is inter- 7758*22dc650dSSadaf Ebrahimi preted as a class containing a range followed by two other characters. 7759*22dc650dSSadaf Ebrahimi The octal or hexadecimal representation of "]" can also be used to end 7760*22dc650dSSadaf Ebrahimi a range. 7761*22dc650dSSadaf Ebrahimi 7762*22dc650dSSadaf Ebrahimi Ranges normally include all code points between the start and end char- 7763*22dc650dSSadaf Ebrahimi acters, inclusive. They can also be used for code points specified nu- 7764*22dc650dSSadaf Ebrahimi merically, for example [\000-\037]. Ranges can include any characters 7765*22dc650dSSadaf Ebrahimi that are valid for the current mode. In any UTF mode, the so-called 7766*22dc650dSSadaf Ebrahimi "surrogate" characters (those whose code points lie between 0xd800 and 7767*22dc650dSSadaf Ebrahimi 0xdfff inclusive) may not be specified explicitly by default (the 7768*22dc650dSSadaf Ebrahimi PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables this check). How- 7769*22dc650dSSadaf Ebrahimi ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates, 7770*22dc650dSSadaf Ebrahimi are always permitted. 7771*22dc650dSSadaf Ebrahimi 7772*22dc650dSSadaf Ebrahimi There is a special case in EBCDIC environments for ranges whose end 7773*22dc650dSSadaf Ebrahimi points are both specified as literal letters in the same case. For com- 7774*22dc650dSSadaf Ebrahimi patibility with Perl, EBCDIC code points within the range that are not 7775*22dc650dSSadaf Ebrahimi letters are omitted. 
For example, [h-k] matches only four characters, 7776*22dc650dSSadaf Ebrahimi even though the codes for h and k are 0x88 and 0x92, a range of 11 code 7777*22dc650dSSadaf Ebrahimi points. However, if the range is specified numerically, for example, 7778*22dc650dSSadaf Ebrahimi [\x88-\x92] or [h-\x92], all code points are included. 7779*22dc650dSSadaf Ebrahimi 7780*22dc650dSSadaf Ebrahimi If a range that includes letters is used when caseless matching is set, 7781*22dc650dSSadaf Ebrahimi it matches the letters in either case. For example, [W-c] is equivalent 7782*22dc650dSSadaf Ebrahimi to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if 7783*22dc650dSSadaf Ebrahimi character tables for a French locale are in use, [\xc8-\xcb] matches 7784*22dc650dSSadaf Ebrahimi accented E characters in both cases. 7785*22dc650dSSadaf Ebrahimi 7786*22dc650dSSadaf Ebrahimi A circumflex can conveniently be used with the upper case character 7787*22dc650dSSadaf Ebrahimi types to specify a more restricted set of characters than the matching 7788*22dc650dSSadaf Ebrahimi lower case type. For example, the class [^\W_] matches any letter or 7789*22dc650dSSadaf Ebrahimi digit, but not underscore, whereas [\w] includes underscore. A positive 7790*22dc650dSSadaf Ebrahimi character class should be read as "something OR something OR ..." and a 7791*22dc650dSSadaf Ebrahimi negative class as "NOT something AND NOT something AND NOT ...". 
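
       Several of the class rules described above can be spot-checked without
       linking PCRE2. The sketch below uses Python's re module purely as a
       stand-in, on the assumption that it agrees with PCRE2 for these
       particular cases (it does not in general; for example, it has no POSIX
       classes). The function name class_demo is hypothetical.

```python
import re


def class_demo():
    """Spot-check a few character-class rules using Python's re module.

    Assumption: Python's re is only a stand-in here; it is not PCRE2, and
    the two engines differ in many other respects.
    """
    results = {}
    # A negated class still consumes a character, so it fails at end of subject.
    results["negated_at_end"] = re.search(r"x[^aeiou]", "x") is None
    # Line-break characters get no special treatment inside a class.
    results["negated_newline"] = re.fullmatch(r"[^a]", "\n") is not None
    # A hyphen immediately after a range is a literal hyphen.
    results["b_d_hyphen_z"] = sorted(
        c for c in "abcde-yz" if re.fullmatch(r"[b-d-z]", c)
    )
    # A caseless range covers both cases of the letters it contains.
    results["caseless_W_c"] = sorted(
        c for c in "VWXabcdwx^_" if re.fullmatch(r"[W-c]", c, re.IGNORECASE)
    )
    # [^\W_] is \w without the underscore.
    results["word_no_underscore"] = [
        c for c in "a9_-" if re.fullmatch(r"[^\W_]", c)
    ]
    return results
```

       Calling class_demo() returns a dictionary keyed by rule, so each
       result can be compared with the corresponding statement in the text.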

       The only metacharacters that are recognized in character classes are
       backslash, hyphen (only where it can be interpreted as specifying a
       range), circumflex (only at the start), opening square bracket (only
       when it can be interpreted as introducing a POSIX class name, or for a
       special compatibility feature - see the next two sections), and the
       terminating closing square bracket. However, escaping other
       non-alphanumeric characters does no harm.


POSIX CHARACTER CLASSES

       Perl supports the POSIX notation for character classes. This uses names
       enclosed by [: and :] within the enclosing square brackets. PCRE2 also
       supports this notation. For example,

         [01[:alpha:]%]

       matches "0", "1", any alphabetic character, or "%". The supported class
       names are:

         alnum    letters and digits
         alpha    letters
         ascii    character codes 0 - 127
         blank    space or tab only
         cntrl    control characters
         digit    decimal digits (same as \d)
         graph    printing characters, excluding space
         lower    lower case letters
         print    printing characters, including space
         punct    printing characters, excluding letters and digits and space
         space    white space (the same as \s from PCRE 8.34)
         upper    upper case letters
         word     "word" characters (same as \w)
         xdigit   hexadecimal digits

       The default "space" characters are HT (9), LF (10), VT (11), FF (12),
       CR (13), and space (32). If locale-specific matching is taking place,
       the list of space characters may be different; there may be fewer or
       more of them. "Space" and \s match the same set of characters, as do
       "word" and \w.

       The name "word" is a Perl extension, and "blank" is a GNU extension
       from Perl 5.8. Another Perl extension is negation, which is indicated
       by a ^ character after the colon. For example,

         [12[:^digit:]]

       matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
       these are not supported, and an error is given if they are encountered.

       By default, characters with values greater than 127 do not match any of
       the POSIX character classes, although this may be different for
       characters in the range 128-255 when locale-specific matching is
       happening. However, in UCP mode, unless certain options are set (see
       below), some of the classes are changed so that Unicode character
       properties are used. This is achieved by replacing POSIX classes with
       other sequences, as follows:

         [:alnum:]  becomes  \p{Xan}
         [:alpha:]  becomes  \p{L}
         [:blank:]  becomes  \h
         [:cntrl:]  becomes  \p{Cc}
         [:digit:]  becomes  \p{Nd}
         [:lower:]  becomes  \p{Ll}
         [:space:]  becomes  \p{Xps}
         [:upper:]  becomes  \p{Lu}
         [:word:]   becomes  \p{Xwd}

       Negated versions, such as [:^alpha:] use \P instead of \p. Four other
       POSIX classes are handled specially in UCP mode:

       [:graph:] This matches characters that have glyphs that mark the page
                 when printed. In Unicode property terms, it matches all
                 characters with the L, M, N, P, S, or Cf properties, except
                 for:

                   U+061C           Arabic Letter Mark
                   U+180E           Mongolian Vowel Separator
                   U+2066 - U+2069  Various "isolate"s

       [:print:] This matches the same characters as [:graph:] plus space
                 characters that are not controls, that is, characters with
                 the Zs property.

       [:punct:] This matches all characters that have the Unicode P
                 (punctuation) property, plus those characters with code
                 points less than 256 that have the S (Symbol) property.

       [:xdigit:]
                 In addition to the ASCII hexadecimal digits, this also
                 matches the "fullwidth" versions of those characters, whose
                 Unicode code points start at U+FF10. This is a change that
                 was made in PCRE2 release 10.43 for Perl compatibility.

       The other POSIX classes are unchanged by PCRE2_UCP, and match only
       characters with code points less than 256.

       There are two options that can be used to restrict the POSIX classes to
       ASCII characters when PCRE2_UCP is set. The option
       PCRE2_EXTRA_ASCII_DIGIT affects just [:digit:] and [:xdigit:]. Within a
       pattern, this can be set and unset by (?aT) and (?-aT). The
       PCRE2_EXTRA_ASCII_POSIX option disables UCP processing for all POSIX
       classes, including [:digit:] and [:xdigit:]. Within a pattern, (?aP)
       and (?-aP) set and unset both these options for consistency.


COMPATIBILITY FEATURE FOR WORD BOUNDARIES

       In the POSIX.2 compliant library that was included in 4.4BSD Unix, the
       ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
       and "end of word". PCRE2 treats these items as follows:

         [[:<:]]  is converted to  \b(?=\w)
         [[:>:]]  is converted to  \b(?<=\w)

       Only these exact character sequences are recognized. A sequence such as
       [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
       support is not compatible with Perl. It is provided to help migrations
       from other environments, and is best not used in any new patterns. Note
       that \b matches at the start and the end of a word (see "Simple
       assertions" above), and in a Perl-style pattern the preceding or
       following character normally shows which is wanted, without the need
       for the assertions that are used above in order to give exactly the
       POSIX behaviour. Note also that the PCRE2_UCP option changes the
       meaning of \w (and therefore \b) by default, so it also affects these
       POSIX sequences.
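
       The effect of the two replacement expressions given in this section
       can be seen by listing the offsets at which each one matches. This is
       a minimal sketch that uses Python's re module as a stand-in, on the
       assumption that \b, (?=\w) and (?<=\w) behave the same way there for
       ASCII subjects; Python does not accept [[:<:]] or [[:>:]] themselves.
       The names START_OF_WORD, END_OF_WORD and boundary_offsets are
       hypothetical.

```python
import re

# The two replacement expressions from the text, written out in full.
# Assumption: Python's re stands in for PCRE2; only the expanded forms
# are used, because re does not recognize [[:<:]] or [[:>:]].
START_OF_WORD = re.compile(r"\b(?=\w)")
END_OF_WORD = re.compile(r"\b(?<=\w)")


def boundary_offsets(pattern, subject):
    """Return every offset at which the zero-width pattern matches."""
    return [m.start() for m in pattern.finditer(subject)]
```

       For the subject "ab cd", the first expression is expected to match
       only at the offsets where a word begins and the second only where a
       word ends, whereas a bare \b would match at all four offsets.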


VERTICAL BAR

       Vertical bar characters are used to separate alternative patterns. For
       example, the pattern

         gilbert|sullivan

       matches either "gilbert" or "sullivan". Any number of alternatives may
       appear, and an empty alternative is permitted (matching the empty
       string). The matching process tries each alternative in turn, from left
       to right, and the first one that succeeds is used. If the alternatives
       are within a group (defined below), "succeeds" means matching the rest
       of the main pattern as well as the alternative in the group.


INTERNAL OPTION SETTING

       The settings of several options can be changed within a pattern by a
       sequence of letters enclosed between "(?" and ")". The following are
       Perl-compatible, and are described in detail in the pcre2api
       documentation. The option letters are:

         i  for PCRE2_CASELESS
         m  for PCRE2_MULTILINE
         n  for PCRE2_NO_AUTO_CAPTURE
         s  for PCRE2_DOTALL
         x  for PCRE2_EXTENDED
         xx for PCRE2_EXTENDED_MORE

       For example, (?im) sets caseless, multiline matching. It is also
       possible to unset these options by preceding the relevant letters with
       a hyphen, for example (?-im). The two "extended" options are not
       independent; unsetting either one cancels the effects of both of them.

       A combined setting and unsetting such as (?im-sx), which sets
       PCRE2_CASELESS and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and
       PCRE2_EXTENDED, is also permitted. Only one hyphen may appear in the
       options string. If a letter appears both before and after the hyphen,
       the option is unset. An empty options setting "(?)" is allowed.
       Needless to say, it has no effect.

       If the first character following (? is a circumflex, it causes all of
       the above options to be unset. Letters may follow the circumflex to
       cause some options to be re-instated, but a hyphen may not appear.

       Some PCRE2-specific options can be changed by the same mechanism using
       these pairs or individual letters:

         aD for PCRE2_EXTRA_ASCII_BSD
         aS for PCRE2_EXTRA_ASCII_BSS
         aW for PCRE2_EXTRA_ASCII_BSW
         aP for PCRE2_EXTRA_ASCII_POSIX and PCRE2_EXTRA_ASCII_DIGIT
         aT for PCRE2_EXTRA_ASCII_DIGIT
         r  for PCRE2_EXTRA_CASELESS_RESTRICT
         J  for PCRE2_DUPNAMES
         U  for PCRE2_UNGREEDY

       However, except for 'r', these are not unset by (?^), which is
       equivalent to (?-imnrsx). If 'a' is not followed by any of the upper
       case letters shown above, it sets (or unsets) all the ASCII options.

       PCRE2_EXTRA_ASCII_DIGIT has no additional effect when
       PCRE2_EXTRA_ASCII_POSIX is set, but including it in (?aP) means that
       (?-aP) suppresses all ASCII restrictions for POSIX classes.

       When one of these option changes occurs at top level (that is, not
       inside group parentheses), the change applies until a subsequent
       change, or the end of the pattern. An option change within a group (see
       below for a description of groups) affects only that part of the group
       that follows it. At the end of the group these options are reset to the
       state they were before the group. For example,

         (a(?i)b)c

       matches abc and aBc and no other strings (assuming PCRE2_CASELESS is
       not set externally). Any changes made in one alternative do carry on
       into subsequent branches within the same group. For example,

         (a(?i)b|c)

       matches "ab", "aB", "c", and "C", even though when matching "C" the
       first branch is abandoned before the option setting. This is because
       the effects of option settings happen at compile time. There would be
       some very weird behaviour otherwise.

       As a convenient shorthand, if any option settings are required at the
       start of a non-capturing group (see the next section), the option
       letters may appear between the "?" and the ":". Thus the two patterns

         (?i:saturday|sunday)
         (?:(?i)saturday|sunday)

       match exactly the same set of strings.

       Note: There are other PCRE2-specific options, applying to the whole
       pattern, which can be set by the application when the compiling
       function is called. In addition, the pattern can contain special
       leading sequences such as (*CRLF) to override what the application has
       set or what has been defaulted. Details are given in the section
       entitled "Newline sequences" above. There are also the (*UTF) and
       (*UCP) leading sequences that can be used to set UTF and Unicode
       property modes; they are equivalent to setting the PCRE2_UTF and
       PCRE2_UCP options, respectively. However, the application can set the
       PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, which lock out the use of
       the (*UTF) and (*UCP) sequences.


GROUPS

       Groups are delimited by parentheses (round brackets), which can be
       nested. Turning part of a pattern into a group does two things:

       1. It localizes a set of alternatives. For example, the pattern

         cat(aract|erpillar|)

       matches "cataract", "caterpillar", or "cat". Without the parentheses,
       it would match "cataract", "erpillar" or an empty string.

       2. It creates a "capture group". This means that, when the whole
       pattern matches, the portion of the subject string that matched the
       group is passed back to the caller, separately from the portion that
       matched the whole pattern. (This applies only to the traditional
       matching function; the DFA matching function does not support
       capturing.)

       Opening parentheses are counted from left to right (starting from 1) to
       obtain numbers for capture groups. For example, if the string "the red
       king" is matched against the pattern

         the ((red|white) (king|queen))

       the captured substrings are "red king", "red", and "king", and are
       numbered 1, 2, and 3, respectively.

       The fact that plain parentheses fulfil two functions is not always
       helpful. There are often times when grouping is required without
       capturing. If an opening parenthesis is followed by a question mark and
       a colon, the group does not do any capturing, and is not counted when
       computing the number of any subsequent capture groups. For example, if
       the string "the white queen" is matched against the pattern

         the ((?:red|white) (king|queen))

       the captured substrings are "white queen" and "queen", and are numbered
       1 and 2. The maximum number of capture groups is 65535.

       As a convenient shorthand, if any option settings are required at the
       start of a non-capturing group, the option letters may appear between
       the "?" and the ":". Thus the two patterns

         (?i:saturday|sunday)
         (?:(?i)saturday|sunday)

       match exactly the same set of strings. Because alternative branches are
       tried from left to right, and options are not reset until the end of
       the group is reached, an option setting in one branch does affect
       subsequent branches, so the above patterns match "SUNDAY" as well as
       "Saturday".


DUPLICATE GROUP NUMBERS

       Perl 5.10 introduced a feature whereby each alternative in a group uses
       the same numbers for its capturing parentheses. Such a group starts
       with (?| and is itself a non-capturing group. For example, consider
       this pattern:

         (?|(Sat)ur|(Sun))day

       Because the two alternatives are inside a (?| group, both sets of
       capturing parentheses are numbered one. Thus, when the pattern matches,
       you can look at captured substring number one, whichever alternative
       matched. This construct is useful when you want to capture part, but
       not all, of one of a number of alternatives. Inside a (?| group,
       parentheses are numbered as usual, but the number is reset at the start
       of each branch. The numbers of any capturing parentheses that follow
       the whole group start after the highest number used in any branch. The
       following example is taken from the Perl documentation. The numbers
       underneath show in which buffer the captured content will be stored.

         # before  ---------------branch-reset----------- after
         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
         # 1            2         2  3        2     3     4

       A backreference to a capture group uses the most recent value that is
       set for the group. The following pattern matches "abcabc" or "defdef":

         /(?|(abc)|(def))\1/

       In contrast, a subroutine call to a capture group always refers to the
       first one in the pattern with the given number. The following pattern
       matches "abcabc" or "defabc":

         /(?|(abc)|(def))(?1)/

       A relative reference such as (?-1) is no different: it is just a
       convenient way of computing an absolute group number.

       If a condition test for a group's having matched refers to a non-unique
       number, the test is true if any group with that number has matched.

       An alternative approach to using this "branch reset" feature is to use
       duplicate named groups, as described in the next section.


NAMED CAPTURE GROUPS

       Identifying capture groups by number is simple, but it can be very hard
       to keep track of the numbers in complicated patterns. Furthermore, if
       an expression is modified, the numbers may change. To help with this
       difficulty, PCRE2 supports the naming of capture groups. This feature
       was not added to Perl until release 5.10. Python had the feature
       earlier, and PCRE1 introduced it at release 4.0, using the Python
       syntax. PCRE2 supports both the Perl and the Python syntax.

       In PCRE2, a capture group can be named in one of three ways:
       (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python.
       Names may be up to 128 code units long. When PCRE2_UTF is not set, they
       may contain only ASCII alphanumeric characters and underscores, but
       must start with a non-digit. When PCRE2_UTF is set, the syntax of group
       names is extended to allow any Unicode letter or Unicode decimal digit.
       In other words, group names must match one of these patterns:

         ^[_A-Za-z][_A-Za-z0-9]*\z   when PCRE2_UTF is not set
         ^[_\p{L}][_\p{L}\p{Nd}]*\z  when PCRE2_UTF is set

       References to capture groups from other parts of the pattern, such as
       backreferences, recursion, and conditions, can all be made by name as
       well as by number.

       Named capture groups are allocated numbers as well as names, exactly as
       if the names were not present. In both PCRE2 and Perl, capture groups
       are primarily identified by numbers; any names are just aliases for
       these numbers. The PCRE2 API provides function calls for extracting the
       complete name-to-number translation table from a compiled pattern, as
       well as convenience functions for extracting captured substrings by
       name.

       Warning: When more than one capture group has the same number, as
       described in the previous section, a name given to one of them applies
       to all of them. Perl allows identically numbered groups to have
       different names. Consider this pattern, where there are two capture
       groups, both numbered 1:

         (?|(?<AA>aa)|(?<BB>bb))

       Perl allows this, with both names AA and BB as aliases of group 1.
       Thus, after a successful match, both names yield the same value (either
       "aa" or "bb").

       In an attempt to reduce confusion, PCRE2 does not allow the same group
       number to be associated with more than one name. The example above
       provokes a compile-time error. However, there is still scope for
       confusion. Consider this pattern:

         (?|(?<AA>aa)|(bb))

       Although the second group number 1 is not explicitly named, the name AA
       is still an alias for any group 1. Whether the pattern matches "aa" or
       "bb", a reference by name to group AA yields the matched string.
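
       The statement earlier in this section that names are just aliases for
       group numbers can be illustrated with the Python-style (?P<name>...)
       syntax, which PCRE2 also accepts. This is a rough sketch that uses
       Python's re module as a stand-in and assumes its numbering rule
       matches the one described here; it does not exercise the PCRE2 API or
       branch-reset groups. The names DATE, name_table and
       by_name_and_number are hypothetical.

```python
import re

# Assumption: Python's re stands in for PCRE2; both accept (?P<name>...).
# Two named groups and one unnamed group, numbered 1, 2 and 3 left to right.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")


def name_table(pattern):
    """Return the name-to-number table of a compiled pattern as a dict."""
    return dict(pattern.groupindex)


def by_name_and_number(subject):
    """Return the 'month' capture fetched by name and by number, or None."""
    m = DATE.fullmatch(subject)
    if m is None:
        return None
    return m.group("month"), m.group(DATE.groupindex["month"])
```

       The table returned by name_table(DATE) maps each name to the number
       the group would have had without a name, and by_name_and_number
       returns the same substring twice, once for each way of referring to
       the group.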

       By default, a name must be unique within a pattern, except that dupli-
       cate names are permitted for groups with the same number, for example:

         (?|(?<AA>aa)|(?<AA>bb))

       The duplicate name constraint can be disabled by setting the PCRE2_DUP-
       NAMES option at compile time, or by the use of (?J) within the pattern,
       as described in the section entitled "Internal Option Setting" above.

       Duplicate names can be useful for patterns where only one instance of
       the named capture group can match. Suppose you want to match the name
       of a weekday, either as a 3-letter abbreviation or as the full name,
       and in both cases you want to extract the abbreviation. This pattern
       (ignoring the line breaks) does the job:

         (?J)
         (?<DN>Mon|Fri|Sun)(?:day)?|
         (?<DN>Tue)(?:sday)?|
         (?<DN>Wed)(?:nesday)?|
         (?<DN>Thu)(?:rsday)?|
         (?<DN>Sat)(?:urday)?

       There are five capture groups, but only one is ever set after a match.
       The convenience functions for extracting the data by name return the
       substring for the first (and in this example, the only) group of that
       name that matched. This saves searching to find which numbered group it
       was.
       (An alternative way of solving this problem is to use a "branch
       reset" group, as described in the previous section.)

       If you make a backreference to a non-unique named group from elsewhere
       in the pattern, the groups to which the name refers are checked in the
       order in which they appear in the overall pattern. The first one that
       is set is used for the reference. For example, this pattern matches
       both "foofoo" and "barbar" but not "foobar" or "barfoo":

         (?J)(?:(?<n>foo)|(?<n>bar))\k<n>

       If you make a subroutine call to a non-unique named group, the one that
       corresponds to the first occurrence of the name is used. In the absence
       of duplicate numbers this is the one with the lowest number.

       If you use a named reference in a condition test (see the section about
       conditions below), either to check whether a capture group has matched,
       or to check for recursion, all groups with the same name are tested. If
       the condition is true for any one of them, the overall condition is
       true. This is the same behaviour as testing by number. For further de-
       tails of the interfaces for handling named capture groups, see the
       pcre2api documentation.


REPETITION

       Repetition is specified by quantifiers, which may follow any one of
       these items:

         a literal data character
         the dot metacharacter
         the \C escape sequence
         the \R escape sequence
         the \X escape sequence
         any escape sequence that matches a single character
         a character class
         a backreference
         a parenthesized group (including lookaround assertions)
         a subroutine call (recursive or otherwise)

       If a quantifier does not follow a repeatable item, an error occurs. The
       general repetition quantifier specifies a minimum and maximum number of
       permitted matches by giving two numbers in curly brackets (braces),
       separated by a comma. The numbers must be less than 65536, and the
       first must be less than or equal to the second. For example,

         z{2,4}

       matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
       special character. If the second number is omitted, but the comma is
       present, there is no upper limit; if the second number and the comma
       are both omitted, the quantifier specifies an exact number of required
       matches.
       Thus

         [aeiou]{3,}

       matches at least 3 successive vowels, but may match many more, whereas

         \d{8}

       matches exactly 8 digits. If the first number is omitted, the lower
       limit is taken as zero; in this case the upper limit must be present.

         X{,4} is interpreted as X{0,4}

       This is a change in behaviour that happened in Perl 5.34.0 and PCRE2
       10.43. In earlier versions such a sequence was not interpreted as a
       quantifier. Other regular expression engines may behave either way.

       If the characters that follow an opening brace do not match the syntax
       of a quantifier, the brace is taken as a literal character. In particu-
       lar, this means that {,} is a literal string of three characters.

       Note that not every opening brace is potentially the start of a quanti-
       fier because braces are used in other items such as \N{U+345} or
       \k{name}.

       In UTF modes, quantifiers apply to characters rather than to individual
       code units. Thus, for example, \x{100}{2} matches two characters, each
       of which is represented by a two-byte sequence in a UTF-8 string.
       Similarly, \X{3} matches three Unicode extended grapheme clusters, each
       of which may be several code units long (and they may be of different
       lengths).

       The quantifier {0} is permitted, causing the expression to behave as if
       the previous item and the quantifier were not present. This may be use-
       ful for capture groups that are referenced as subroutines from else-
       where in the pattern (but see also the section entitled "Defining cap-
       ture groups for use by reference only" below). Except for parenthesized
       groups, items that have a {0} quantifier are omitted from the compiled
       pattern.

       For convenience, the three most common quantifiers have single-charac-
       ter abbreviations:

         *    is equivalent to {0,}
         +    is equivalent to {1,}
         ?    is equivalent to {0,1}

       It is possible to construct infinite loops by following a group that
       can match no characters with a quantifier that has no upper limit, for
       example:

         (a?)*

       Earlier versions of Perl and PCRE1 used to give an error at compile
       time for such patterns.
       However, because there are cases where this can
       be useful, such patterns are now accepted, but whenever an iteration of
       such a group matches no characters, matching moves on to the next item
       in the pattern instead of repeatedly matching an empty string. This
       does not prevent backtracking into any of the iterations if a subse-
       quent item fails to match.

       By default, quantifiers are "greedy", that is, they match as much as
       possible (up to the maximum number of permitted repetitions), without
       causing the rest of the pattern to fail. The classic example of where
       this gives problems is in trying to match comments in C programs. These
       appear between /* and */ and within the comment, individual * and /
       characters may appear. An attempt to match C comments by applying the
       pattern

         /\*.*\*/

       to the string

         /* first comment */  not comment  /* second comment */

       fails, because it matches the entire string owing to the greediness of
       the .* item. However, if a quantifier is followed by a question mark,
       it ceases to be greedy, and instead matches the minimum number of times
       possible, so the pattern

         /\*.*?\*/

       does the right thing with C comments.
       The meaning of the various quantifiers is not otherwise changed, just
       the preferred number of matches. Do not confuse this use of question
       mark with its use as a quantifier in its own right. Because it has two
       uses, it can sometimes appear doubled, as in

         \d??\d

       which matches one digit by preference, but can match two if that is the
       only way the rest of the pattern matches.

       If the PCRE2_UNGREEDY option is set (an option that is not available in
       Perl), the quantifiers are not greedy by default, but individual ones
       can be made greedy by following them with a question mark. In other
       words, it inverts the default behaviour.

       When a parenthesized group is quantified with a minimum repeat count
       that is greater than 1 or with a limited maximum, more memory is re-
       quired for the compiled pattern, in proportion to the size of the mini-
       mum or maximum.

       If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option
       (equivalent to Perl's /s) is set, thus allowing the dot to match new-
       lines, the pattern is implicitly anchored, because whatever follows
       will be tried against every character position in the subject string,
       so there is no point in retrying the overall match at any position af-
       ter the first.
       PCRE2 normally treats such a pattern as though it were preceded by \A.

       In cases where it is known that the subject string contains no new-
       lines, it is worth setting PCRE2_DOTALL in order to obtain this opti-
       mization, or alternatively, using ^ to indicate anchoring explicitly.

       However, there are some cases where the optimization cannot be used.
       When .* is inside capturing parentheses that are the subject of a
       backreference elsewhere in the pattern, a match at the start may fail
       where a later one succeeds. Consider, for example:

         (.*)abc\1

       If the subject is "xyz123abc123" the match point is the fourth charac-
       ter. For this reason, such a pattern is not implicitly anchored.

       Another case where implicit anchoring is not applied is when the lead-
       ing .* is inside an atomic group. Once again, a match at the start may
       fail where a later one succeeds. Consider this pattern:

         (?>.*?a)b

       It matches "ab" in the subject "aab". The use of the backtracking con-
       trol verbs (*PRUNE) and (*SKIP) also disables this optimization, and
       there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
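
       The backreference case above can be checked with a short sketch. It
       uses Python's standard re module as a stand-in engine, which is an
       assumption of the example (re is not PCRE2 and has no PCRE2_DOTALL or
       PCRE2_NO_DOTSTAR_ANCHOR options); for this particular pattern the
       matching rules are the same.

```python
import re

# Python's re module stands in for PCRE2 here (an assumption of this sketch).
subject = "xyz123abc123"

m = re.search(r"(.*)abc\1", subject)
start = m.start()        # zero-based offset of the match point
matched = m.group(0)
captured = m.group(1)

# Forcing an anchor at the start of the subject makes the same pattern fail,
# which is why a leading .* cannot be treated as anchored in this case.
anchored = re.search(r"\A(.*)abc\1", subject)
```

       The unanchored search finds its match at offset 3, that is, at the
       fourth character, whereas the explicitly anchored version finds
       nothing.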

       When a capture group is repeated, the value captured is the substring
       that matched the final iteration. For example, after

         (tweedle[dume]{3}\s*)+

       has matched "tweedledum tweedledee" the value of the captured substring
       is "tweedledee". However, if there are nested capture groups, the cor-
       responding captured values may have been set in previous iterations.
       For example, after

         (a|(b))+

       matches "aba" the value of the second captured substring is "b".


ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

       With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
       repetition, failure of what follows normally causes the repeated item
       to be re-evaluated to see if a different number of repeats allows the
       rest of the pattern to match. Sometimes it is useful to prevent this,
       either to change the nature of the match, or to cause it to fail
       earlier than it otherwise might, when the author of the pattern knows
       there is no point in carrying on.

       Consider, for example, the pattern \d+foo when applied to the subject
       line

         123456bar

       After matching all 6 digits and then failing to match "foo", the normal
       action of the matcher is to try again with only 5 digits matching the
       \d+ item, and then with 4, and so on, before ultimately failing.
       "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
       the means for specifying that once a group has matched, it is not to be
       re-evaluated in this way.

       If we use atomic grouping for the previous example, the matcher gives
       up immediately on failing to match "foo" the first time. The notation
       is a kind of special parenthesis, starting with (?> as in this example:

         (?>\d+)foo

       Perl 5.28 introduced an experimental alphabetic form starting with (*
       which may be easier to remember:

         (*atomic:\d+)foo

       This kind of parenthesized group "locks up" the part of the pattern it
       contains once it has matched, and a failure further into the pattern is
       prevented from backtracking into it. Backtracking past it to previous
       items, however, works as normal.
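
       A minimal sketch of how an atomic group can change the nature of a
       match, not just its speed. It uses Python's standard re module as a
       stand-in for PCRE2, which is an assumption of the example. The helper
       atomic() is hypothetical: Python accepts (?>...) only from version
       3.11, so on older versions the helper swaps in a lookahead followed by
       a backreference, which is a different construct that gives the same
       result here.

```python
import re

def atomic(inner):
    """Return a pattern fragment that matches `inner` atomically.

    Hypothetical helper for this sketch. Python 3.11+ accepts (?>...)
    directly; older versions reject it, so fall back to a lookahead plus
    a named backreference, relying on lookaheads not being backtracked into.
    """
    try:
        re.compile(r"(?>x)")
        return "(?>" + inner + ")"
    except re.error:
        return "(?=(?P<atom>" + inner + "))(?P=atom)"

# Ordinary repeat: \d+ gives back the final "5" so that the literal 5 matches.
plain = re.search(r"\d+5", "12345")

# Atomic repeat: the digits are locked up, so no "5" is left to match.
locked = re.search(atomic(r"\d+") + "5", "12345")

# The example from the text fails either way; only the work done differs.
no_foo = re.search(atomic(r"\d+") + "foo", "123456bar")
```

       The ordinary pattern matches the whole of "12345", while the atomic
       version finds no match at all.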

       An alternative description is that a group of this type matches exactly
       the string of characters that an identical standalone pattern would
       match, if anchored at the current point in the subject string.

       Atomic groups are not capture groups. Simple cases such as the above
       example can be thought of as a maximizing repeat that must swallow
       everything it can. So, while both \d+ and \d+? are prepared to adjust
       the number of digits they match in order to make the rest of the pat-
       tern match, (?>\d+) can only match an entire sequence of digits.

       Atomic groups in general can of course contain arbitrarily complicated
       expressions, and can be nested. However, when the contents of an atomic
       group are just a single repeated item, as in the example above, a sim-
       pler notation, called a "possessive quantifier", can be used. This con-
       sists of an additional + character following a quantifier. Using this
       notation, the previous example can be rewritten as

         \d++foo

       Note that a possessive quantifier can be used with an entire group, for
       example:

         (abc|xyz){2,3}+

       Possessive quantifiers are always greedy; the setting of the PCRE2_UN-
       GREEDY option is ignored.
       They are a convenient notation for the simpler forms of atomic group.
       However, there is no difference in the meaning of a possessive
       quantifier and the equivalent atomic group, though there may be a
       performance difference; possessive quantifiers should be slightly
       faster.

       The possessive quantifier syntax is an extension to the Perl 5.8 syn-
       tax. Jeffrey Friedl originated the idea (and the name) in the first
       edition of his book. Mike McCloskey liked it, so implemented it when he
       built Sun's Java package, and PCRE1 copied it from there. It found its
       way into Perl at release 5.10.

       PCRE2 has an optimization that automatically "possessifies" certain
       simple pattern constructs. For example, the sequence A+B is treated as
       A++B because there is no point in backtracking into a sequence of A's
       when B must follow. This feature can be disabled by the PCRE2_NO_AUTO-
       POSSESS option, or by starting the pattern with (*NO_AUTO_POSSESS).

       When a pattern contains an unlimited repeat inside a group that can it-
       self be repeated an unlimited number of times, the use of an atomic
       group is the only way to avoid some failing matches taking a very long
       time indeed. The pattern

         (\D+|<\d+>)*[!?]

       matches an unlimited number of substrings that either consist of non-
       digits, or digits enclosed in <>, followed by either ! or ?. When it
       matches, it runs quickly. However, if it is applied to

         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

       it takes a long time before reporting failure. This is because the
       string can be divided between the internal \D+ repeat and the external
       * repeat in a large number of ways, and all have to be tried. (The ex-
       ample uses [!?] rather than a single character at the end, because both
       PCRE2 and Perl have an optimization that allows for fast failure when a
       single character is used. They remember the last single character that
       is required for a match, and fail early if it is not present in the
       string.) If the pattern is changed so that it uses an atomic group,
       like this:

         ((?>\D+)|<\d+>)*[!?]

       sequences of non-digits cannot be broken, and failure happens quickly.


BACKREFERENCES

       Outside a character class, a backslash followed by a digit greater than
       0 (and possibly further digits) is a backreference to a capture group
       earlier (that is, to its left) in the pattern, provided there have been
       that many previous capture groups.

       However, if the decimal number following the backslash is less than 8,
       it is always taken as a backreference, and causes an error only if
       there are not that many capture groups in the entire pattern. In other
       words, the group that is referenced need not be to the left of the ref-
       erence for numbers less than 8. A "forward backreference" of this type
       can make sense when a repetition is involved and the group to the right
       has participated in an earlier iteration.

       It is not possible to have a numerical "forward backreference" to a
       group whose number is 8 or more using this syntax because a sequence
       such as \50 is interpreted as a character defined in octal. See the
       subsection entitled "Non-printing characters" above for further details
       of the handling of digits following a backslash. Other forms of back-
       referencing do not suffer from this restriction. In particular, there
       is no problem when named capture groups are used (see below).

       Another way of avoiding the ambiguity inherent in the use of digits
       following a backslash is to use the \g escape sequence. This escape
       must be followed by a signed or unsigned number, optionally enclosed in
       braces.
       These examples are all equivalent:

         (ring), \1
         (ring), \g1
         (ring), \g{1}

       An unsigned number specifies an absolute reference without the ambigu-
       ity that is present in the older syntax. It is also useful when literal
       digits follow the reference. A signed number is a relative reference.
       Consider this example:

         (abc(def)ghi)\g{-1}

       The sequence \g{-1} is a reference to the capture group whose number is
       one less than the number of the next group to be started, so in this
       example (where the next group would be numbered 3) it is equivalent to
       \2, and \g{-2} would be equivalent to \1. Note that if this construct
       is inside a capture group, that group is included in the count, so in
       this example \g{-2} also refers to group 1:

         (A)(\g{-2}B)

       The use of relative references can be helpful in long patterns, and
       also in patterns that are created by joining together fragments that
       contain references within themselves.

       The sequence \g{+1} is a reference to the next capture group that is
       started after this item, and \g{+2} refers to the one after that, and
       so on. This kind of forward reference can be useful in patterns that
       repeat.
       Perl does not support the use of + in this way.

       A backreference matches whatever actually most recently matched the
       capture group in the current subject string, rather than anything at
       all that matches the group (see "Groups as subroutines" below for a way
       of doing that). So the pattern

         (sens|respons)e and \1ibility

       matches "sense and sensibility" and "response and responsibility", but
       not "sense and responsibility". If caseful matching is in force at the
       time of the backreference, the case of letters is relevant. For exam-
       ple,

         ((?i)rah)\s+\1

       matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
       original capture group is matched caselessly.

       There are several different ways of writing backreferences to named
       capture groups. The .NET syntax is \k{name}, the Python syntax is
       (?P=name), and the original Perl syntax is \k<name> or \k'name'. All of
       these are now supported by both Perl and PCRE2. Perl 5.10's unified
       backreference syntax, in which \g can be used for both numeric and
       named references, is also supported by PCRE2.
       We could rewrite the above example in any of the following ways:

         (?<p1>(?i)rah)\s+\k<p1>
         (?'p1'(?i)rah)\s+\k{p1}
         (?P<p1>(?i)rah)\s+(?P=p1)
         (?<p1>(?i)rah)\s+\g{p1}

       A capture group that is referenced by name may appear in the pattern
       before or after the reference.

       There may be more than one backreference to the same group. If a group
       has not actually been used in a particular match, backreferences to it
       always fail by default. For example, the pattern

         (a|(bc))\2

       always fails if it starts to match "a" rather than "bc". However, if
       the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref-
       erence to an unset value matches an empty string.

       Because there may be many capture groups in a pattern, all digits fol-
       lowing a backslash are taken as part of a potential backreference num-
       ber. If the pattern continues with a digit character, some delimiter
       must be used to terminate the backreference. If the PCRE2_EXTENDED or
       PCRE2_EXTENDED_MORE option is set, this can be white space. Otherwise,
       the \g{} syntax or an empty comment (see "Comments" below) can be used.
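
       The backreference rules described above can be tried out in a short
       sketch. It uses Python's standard re module as a stand-in engine,
       which is an assumption of the example: re is not PCRE2, it has no
       equivalent of PCRE2_MATCH_UNSET_BACKREF, and recent versions require
       the scoped form (?i:...) in place of an inline (?i) in the middle of a
       pattern. For these three patterns the default behaviour is the same as
       that described in the text.

```python
import re

# Python's re module stands in for PCRE2 here (an assumption of this sketch).

# A backreference matches what the group actually matched, not the group's
# pattern, so the two halves must agree.
sense = re.compile(r"(sens|respons)e and \1ibility")
sense_hits = [s for s in ("sense and sensibility",
                          "response and responsibility",
                          "sense and responsibility") if sense.fullmatch(s)]

# Case matters at the backreference even though the group matched caselessly.
rah = re.compile(r"((?i:rah))\s+\1")
rah_hits = [s for s in ("rah rah", "RAH RAH", "RAH rah") if rah.fullmatch(s)]

# A backreference to a group that did not take part in the match fails.
unset = re.compile(r"(a|(bc))\2")
unset_hits = [s for s in ("a", "bcbc") if unset.fullmatch(s)]
```

       Each list keeps only the subjects that matched, so the rejected
       strings are the ones the text says must not match.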

   Recursive backreferences

       A backreference that occurs inside the group to which it refers fails
       when the group is first used, so, for example, (a\1) never matches.
       However, such references can be useful inside repeated groups. For ex-
       ample, the pattern

         (a|b\1)+

       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
       ation of the group, the backreference matches the character string cor-
       responding to the previous iteration. In order for this to work, the
       pattern must be such that the first iteration does not need to match
       the backreference. This can be done using alternation, as in the exam-
       ple above, or by a quantifier with a minimum of zero.

       For versions of PCRE2 earlier than 10.25, backreferences of this type
       used to cause the group that they reference to be treated as an atomic
       group. This restriction no longer applies, and backtracking into such
       groups can occur as normal.


ASSERTIONS

       An assertion is a test on the characters following or preceding the
       current matching point that does not consume any characters. The simple
       assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
       above.

       More complicated assertions are coded as parenthesized groups. There
       are two kinds: those that look ahead of the current position in the
       subject string, and those that look behind it, and in each case an
       assertion may be positive (must match for the assertion to be true) or
       negative (must not match for the assertion to be true). An assertion
       group is matched in the normal way, and if it is true, matching
       continues after it, but with the matching position in the subject
       string reset to what it was before the assertion was processed.

       The Perl-compatible lookaround assertions are atomic. If an assertion
       is true, but there is a subsequent matching failure, there is no
       backtracking into the assertion. However, there are some cases where
       non-atomic assertions can be useful. PCRE2 has some support for these,
       described in the section entitled "Non-atomic assertions" below, but
       they are not Perl-compatible.

       A lookaround assertion may appear as the condition in a conditional
       group (see below). In this case, the result of matching the assertion
       determines which branch of the condition is followed.

       Assertion groups are not capture groups. If an assertion contains
       capture groups within it, these are counted for the purposes of
       numbering the capture groups in the whole pattern. Within each branch
       of an assertion, locally captured substrings may be referenced in the
       usual way. For example, a sequence such as (.)\g{-1} can be used to
       check that two adjacent characters are the same.

       When a branch within an assertion fails to match, any substrings that
       were captured are discarded (as happens with any pattern branch that
       fails to match). A negative assertion is true only when all its
       branches fail to match; this means that no captured substrings are
       ever retained after a successful negative assertion. When an assertion
       contains a matching branch, what happens depends on the type of
       assertion.

       For a positive assertion, internally captured substrings in the
       successful branch are retained, and matching continues with the next
       pattern item after the assertion. For a negative assertion, a matching
       branch means that the assertion is not true. If such an assertion is
       being used as a condition in a conditional group (see below), captured
       substrings are retained, because matching continues with the "no"
       branch of the condition. For other failing negative assertions,
       control passes to the previous backtracking point, thus discarding any
       captured strings within the assertion.
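
       The rules for captures inside assertions can be observed directly. The
       sketch below assumes Python's re module as a stand-in for PCRE2 (the
       two agree for these simple patterns), and it writes the relative
       reference \g{-1} as \1 because re has no relative form. The helper
       name probe is hypothetical.

```python
import re


def probe(pattern, subject):
    """Return (start offset, group 1) of the first match, or None."""
    m = re.search(pattern, subject)
    return None if m is None else (m.start(), m.group(1))


# Positive lookahead: group 1 is kept even though the assertion itself
# consumed nothing.
print(probe(r"(?=(\d+))\d", "42"))

# Reference to a local capture inside the assertion: finds two equal
# adjacent characters (PCRE2 would also accept (.)\g{-1} here).
print(probe(r"(?=(.)\1)", "abccd"))

# Successful negative lookahead: group 1 is unset afterwards.
print(probe(r"(?!(\d))[a-z]", "x"))
```

       In each call the start offset shows where the overall match begins,
       which makes it visible that the assertion did not move the matching
       position.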

       Most assertion groups may be repeated; though it makes no sense to
       assert the same thing several times, the side effect of capturing in
       positive assertions may occasionally be useful. However, an assertion
       that forms the condition for a conditional group may not be
       quantified. PCRE2 used to restrict the repetition of assertions, but
       from release 10.35 the only restriction is that an unlimited maximum
       repetition is changed to be one more than the minimum. For example,
       {3,} is treated as {3,4}.

   Alphabetic assertion names

       Traditionally, symbolic sequences such as (?= and (?<= have been used
       to specify lookaround assertions. Perl 5.28 introduced some
       experimental alphabetic alternatives which might be easier to
       remember. They all start with (* instead of (? and must be written
       using lower case letters. PCRE2 supports the following synonyms:

         (*positive_lookahead:  or (*pla: is the same as (?=
         (*negative_lookahead:  or (*nla: is the same as (?!
         (*positive_lookbehind: or (*plb: is the same as (?<=
         (*negative_lookbehind: or (*nlb: is the same as (?<!

       For example, (*pla:foo) is the same assertion as (?=foo). In the
       following sections, the various assertions are described using the
       original symbolic forms.
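
       The table of synonyms is a one-to-one textual mapping, which the
       following sketch spells out. It is a deliberately naive illustration
       in Python, not anything PCRE2 provides: the names
       ALPHABETIC_TO_SYMBOLIC and to_symbolic are invented here, and the
       plain text substitution assumes that the alphabetic sequences do not
       occur after an escaped parenthesis or inside a character class.

```python
import re

# Mapping copied from the list of synonyms above.
ALPHABETIC_TO_SYMBOLIC = {
    "(*positive_lookahead:": "(?=",
    "(*pla:": "(?=",
    "(*negative_lookahead:": "(?!",
    "(*nla:": "(?!",
    "(*positive_lookbehind:": "(?<=",
    "(*plb:": "(?<=",
    "(*negative_lookbehind:": "(?<!",
    "(*nlb:": "(?<!",
}


def to_symbolic(pattern):
    """Rewrite alphabetic assertion names as their symbolic forms.

    Naive text replacement: escaped parentheses and character classes
    are not taken into account.
    """
    for name, symbol in ALPHABETIC_TO_SYMBOLIC.items():
        pattern = pattern.replace(name, symbol)
    return pattern


print(to_symbolic("(*pla:foo)"))
# The rewritten pattern can be handed to an engine that only knows the
# symbolic forms, such as Python's re (used here as a stand-in).
print(re.search(to_symbolic(r"\w+(*pla:;)"), "abc;def").group(0))
```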

   Lookahead assertions

       Lookahead assertions start with (?= for positive assertions and (?!
       for negative assertions. For example,

         \w+(?=;)

       matches a word followed by a semicolon, but does not include the
       semicolon in the match, and

         foo(?!bar)

       matches any occurrence of "foo" that is not followed by "bar". Note
       that the apparently similar pattern

         (?!foo)bar

       does not find an occurrence of "bar" that is preceded by something
       other than "foo"; it finds any occurrence of "bar" whatsoever, because
       the assertion (?!foo) is always true when the next three characters
       are "bar". A lookbehind assertion is needed to achieve the other
       effect.

       If you want to force a matching failure at some point in a pattern,
       the most convenient way to do it is with (?!) because an empty string
       always matches, so an assertion that requires there not to be an empty
       string must always fail. The backtracking control verb (*FAIL) or (*F)
       is a synonym for (?!).

   Lookbehind assertions

       Lookbehind assertions start with (?<= for positive assertions and
       (?<! for negative assertions. For example,

         (?<!foo)bar

       does find an occurrence of "bar" that is not preceded by "foo". The
       contents of a lookbehind assertion are restricted such that there must
       be a known maximum to the lengths of all the strings it matches. There
       are two cases:

       If every top-level alternative matches a fixed length, for example

         (?<=colour|color)

       there is a limit of 65535 characters to the lengths, which do not have
       to be the same, as this example demonstrates. This is the only kind of
       lookbehind supported by PCRE2 versions earlier than 10.43 and by the
       alternative matching function pcre2_dfa_match().

       In PCRE2 10.43 and later, pcre2_match() supports lookbehind assertions
       in which one or more top-level alternatives can match more than one
       string length, for example

         (?<=colou?r)

       The maximum matching length for any branch of the lookbehind is
       limited to a value set by the calling program (default 255
       characters). Unlimited repetition (for example \d*) is not supported.
       In some cases, the escape sequence \K (see above) can be used instead
       of a lookbehind assertion at the start of a pattern to get round the
       length limit restriction.

       In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which
       matches a single code unit even in a UTF mode) to appear in lookbehind
       assertions, because it makes it impossible to calculate the length of
       the lookbehind. The \X and \R escapes, which can match different
       numbers of code units, are never permitted in lookbehinds.

       "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
       lookbehinds, as long as the called capture group matches a
       limited-length string. However, recursion, that is, a "subroutine"
       call into a group that is already active, is not supported.

       PCRE2 supports backreferences in lookbehinds, but only if certain
       conditions are met. The PCRE2_MATCH_UNSET_BACKREF option must not be
       set, there must be no use of (?| in the pattern (it creates duplicate
       group numbers), and if the backreference is by name, the name must be
       unique. Of course, the referenced group must itself match a limited
       length substring. The following pattern matches words containing at
       least two characters that begin and end with the same character:

         \b(\w)\w++(?<=\1)

       Possessive quantifiers can be used in conjunction with lookbehind
       assertions to specify efficient matching at the end of subject
       strings. Consider a simple pattern such as

         abcd$

       when applied to a long string that does not match. Because matching
       proceeds from left to right, PCRE2 will look for each "a" in the
       subject and then see if what follows matches the rest of the pattern.
       If the pattern is specified as

         ^.*abcd$

       the initial .* matches the entire string at first, but when this fails
       (because there is no following "a"), it backtracks to match all but
       the last character, then all but the last two characters, and so on.
       Once again the search for "a" covers the entire string, from right to
       left, so we are no better off. However, if the pattern is written as

         ^.*+(?<=abcd)

       there can be no backtracking for the .*+ item because of the
       possessive quantifier; it can match only the entire string. The
       subsequent lookbehind assertion does a single test on the last four
       characters.
       If it fails, the match fails immediately. For long strings, this
       approach makes a significant difference to the processing time.

   Using multiple assertions

       Several assertions (of any sort) may occur in succession. For example,

         (?<=\d{3})(?<!999)foo

       matches "foo" preceded by three digits that are not "999". Notice that
       each of the assertions is applied independently at the same point in
       the subject string. First there is a check that the previous three
       characters are all digits, and then there is a check that the same
       three characters are not "999". This pattern does not match "foo"
       preceded by six characters, the first three of which are digits and
       the last three of which are not "999". For example, it doesn't match
       "123abcfoo". A pattern to do that is

         (?<=\d{3}...)(?<!999)foo

       This time the first assertion looks at the preceding six characters,
       checking that the first three are digits, and then the second
       assertion checks that the preceding three characters are not "999".

       Assertions can be nested in any combination.
       For example,

         (?<=(?<!foo)bar)baz

       matches an occurrence of "baz" that is preceded by "bar" which in turn
       is not preceded by "foo", while

         (?<=\d{3}(?!999)...)foo

       is another pattern that matches "foo" preceded by three digits and any
       three characters that are not "999".


NON-ATOMIC ASSERTIONS

       Traditional lookaround assertions are atomic. That is, if an assertion
       is true, but there is a subsequent matching failure, there is no
       backtracking into the assertion. However, there are some cases where
       non-atomic positive assertions can be useful. PCRE2 provides these
       using the following syntax:

         (*non_atomic_positive_lookahead:  or (*napla: or (?*
         (*non_atomic_positive_lookbehind: or (*naplb: or (?<*

       Consider the problem of finding the right-most word in a string that
       also appears earlier in the string, that is, it must appear at least
       twice in total. This pattern returns the required result as captured
       substring 1:

         ^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2}

       For a subject such as "word1 word2 word3 word2 word3 word4" the result
       is "word3". How does it work? At the start, ^(?x) anchors the pattern
       and sets the "x" option, which causes white space (introduced for
       readability) to be ignored. Inside the assertion, the greedy .* at
       first consumes the entire string, but then has to backtrack until the
       rest of the assertion can match a word, which is captured by group 1.
       In other words, when the assertion first succeeds, it captures the
       right-most word in the string.

       The current matching point is then reset to the start of the subject,
       and the rest of the pattern match checks for two occurrences of the
       captured word, using an ungreedy .*? to scan from the left. If this
       succeeds, we are done, but if the last word in the string does not
       occur twice, this part of the pattern fails. If a traditional atomic
       lookahead (?= or (*pla: had been used, the assertion could not be
       re-entered, and the whole match would fail. The pattern would succeed
       only if the very last word in the subject was found twice.

       Using a non-atomic lookahead, however, means that when the last word
       does not occur twice in the string, the lookahead can backtrack and
       find the second-last word, and so on, until either the match succeeds,
       or all words have been tested.
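
       The difference between the atomic and non-atomic forms can be made
       concrete with a small emulation. Python's re module, used here only as
       a stand-in, has atomic lookaheads but no non-atomic ones, so the
       sketch shows the atomic pattern failing and then imitates the
       backtracking into (*napla:...) with an explicit loop. The names ATOMIC
       and rightmost_repeated_word are invented for the illustration, and a
       single-line subject is assumed.

```python
import re

SUBJECT = "word1 word2 word3 word2 word3 word4"

# The pattern from the text with an ordinary (atomic) lookahead. Once the
# lookahead has captured the last word it cannot be re-entered.
ATOMIC = re.compile(r"^(?=.*\b(\w+))(?:.*?\b\1\b){2}")


def rightmost_repeated_word(subject):
    """Imitate backtracking into a non-atomic lookahead with a loop."""
    # Candidate words from right to left: the order in which the greedy .*
    # inside the assertion would offer them on successive backtracks.
    for candidate in reversed(re.findall(r"\b\w+", subject)):
        # Stand-in for (?> .*? \b\1\b ){2}: two whole-word occurrences.
        whole_word = r"\b" + re.escape(candidate) + r"\b"
        if len(re.findall(whole_word, subject)) >= 2:
            return candidate
    return None


# The atomic form fails because "word4" occurs only once.
print(ATOMIC.search(SUBJECT))
# The emulated non-atomic form moves on to the next candidate.
print(rightmost_repeated_word(SUBJECT))
```

       The loop makes the cost visible as well: each candidate word triggers
       a fresh scan of the subject, which is what re-entering the assertion
       amounts to.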

       Two conditions must be met for a non-atomic assertion to be useful:
       the contents of one or more capturing groups must change after a
       backtrack into the assertion, and there must be a backreference to a
       changed group later in the pattern. If this is not the case, the rest
       of the pattern match fails exactly as before because nothing has
       changed, so using a non-atomic assertion just wastes resources.

       There is one exception to backtracking into a non-atomic assertion. If
       an (*ACCEPT) control verb is triggered, the assertion succeeds
       atomically. That is, a subsequent match failure cannot backtrack into
       the assertion.

       Non-atomic assertions are not supported by the alternative matching
       function pcre2_dfa_match(). They are supported by JIT, but only if
       they do not contain any control verbs such as (*ACCEPT). (This may
       change in future). Note that assertions that appear as conditions for
       conditional groups (see below) must be atomic.


SCRIPT RUNS

       In concept, a script run is a sequence of characters that are all from
       the same Unicode script such as Latin or Greek. However, because some
       scripts are commonly used together, and because some diacritical and
       other marks are used with multiple scripts, it is not that simple.
       There is a full description of the rules that PCRE2 uses in the
       section entitled "Script Runs" in the pcre2unicode documentation.

       If part of a pattern is enclosed between (*script_run: or (*sr: and a
       closing parenthesis, it fails if the sequence of characters that it
       matches is not a script run. After a failure, normal backtracking
       occurs. Script runs can be used to detect spoofing attacks using
       characters that look the same, but are from different scripts. The
       string "paypal.com" is an infamous example, where the letters could be
       a mixture of Latin and Cyrillic. This pattern ensures that the matched
       characters in a sequence of non-spaces that follow white space are a
       script run:

         \s+(*sr:\S+)

       To be sure that they are all from the Latin script (for example), a
       lookahead can be used:

         \s+(?=\p{Latin})(*sr:\S+)

       This works as long as the first character is expected to be a
       character in that script, and not (for example) punctuation, which is
       allowed with any script. If this is not the case, a more creative
       lookahead is needed. For example, if digits, underscore, and dots are
       permitted at the start:

         \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)

       In many cases, backtracking into a script run pattern fragment is not
       desirable. The script run can employ an atomic group to prevent this.
       Because this is a common requirement, a shorthand notation is provided
       by (*atomic_script_run: or (*asr:

         (*asr:...) is the same as (*sr:(?>...))

       Note that the atomic group is inside the script run. Putting it
       outside would not prevent backtracking into the script run pattern.

       Support for script runs is not available if PCRE2 is compiled without
       Unicode support. A compile-time error is given if any of the above
       constructs is encountered. Script runs are not supported by the
       alternate matching function, pcre2_dfa_match() because they use the
       same mechanism as capturing parentheses.

       Warning: The (*ACCEPT) control verb (see below) should not be used
       within a script run group, because it causes an immediate exit from
       the group, bypassing the script run checking.


CONDITIONAL GROUPS

       It is possible to cause the matching process to obey a pattern
       fragment conditionally or to choose between two alternative fragments,
       depending on the result of an assertion, or whether a specific capture
       group has already been matched. The two possible forms of conditional
       group are:

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

       If the condition is satisfied, the yes-pattern is used; otherwise the
       no-pattern (if present) is used. An absent no-pattern is equivalent to
       an empty string (it always matches). If there are more than two
       alternatives in the group, a compile-time error occurs. Each of the
       two alternatives may itself contain nested groups of any form,
       including conditional groups; the restriction to two alternatives
       applies only at the level of the condition itself. This pattern
       fragment is an example where the alternatives are complex:

         (?(1) (A|B|C) | (D | (?(2)E|F) | E) )

       There are five kinds of condition: references to capture groups,
       references to recursion, two pseudo-conditions called DEFINE and
       VERSION, and assertions.
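
       The two forms of conditional group can be compared side by side. The
       sketch below uses Python's re module as a stand-in for PCRE2, which is
       an assumption of the example; re supports only the first kind of
       condition, a reference to a capture group. The patterns and the helper
       names accepts and compiles are made up for the illustration.

```python
import re

# yes-pattern only: a closing ">" is required only if "<" was captured.
YES_ONLY = re.compile(r"(<)?\w+(?(1)>)")
# yes-pattern and no-pattern: ">" after "<", otherwise a trailing ";".
YES_NO = re.compile(r"(<)?\w+(?(1)>|;)")


def accepts(regex, subject):
    """True if the whole subject matches the compiled pattern."""
    return regex.fullmatch(subject) is not None


def compiles(pattern):
    """True if the pattern compiles without error."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


print(accepts(YES_ONLY, "<tag>"), accepts(YES_ONLY, "tag"))
print(accepts(YES_NO, "<tag>"), accepts(YES_NO, "tag;"))
# A third alternative at the level of the condition is rejected when the
# pattern is compiled.
print(compiles(r"(a)?(?(1)b|c|d)"))
```

       The absent no-pattern in YES_ONLY behaves as an empty string, so the
       bare word is accepted without any closing character.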

   Checking for a used capture group by number

       If the text between the parentheses consists of a sequence of digits,
       the condition is true if a capture group of that number has previously
       matched. If there is more than one capture group with the same number
       (see the earlier section about duplicate group numbers), the condition
       is true if any of them have matched. An alternative notation, which is
       a PCRE2 extension, not supported by Perl, is to precede the digits
       with a plus or minus sign. In this case, the group number is relative
       rather than absolute. The most recently opened capture group (which
       could be enclosing this condition) can be referenced by (?(-1), the
       next most recent by (?(-2), and so on. Inside loops it can also make
       sense to refer to subsequent groups. The next capture group to be
       opened can be referenced as (?(+1), and so on. The value zero in any
       of these forms is not used; it provokes a compile-time error.

       Consider the following pattern, which contains non-significant white
       space to make it more readable (assume the PCRE2_EXTENDED option) and
       to divide it into three parts for ease of discussion:

         ( \( )?    [^()]+    (?(1) \) )

       The first part matches an optional opening parenthesis, and if that
       character is present, sets it as the first captured substring. The
       second part matches one or more characters that are not parentheses.
       The third part is a conditional group that tests whether or not the
       first capture group matched. If it did, that is, if the subject
       started with an opening parenthesis, the condition is true, and so the
       yes-pattern is executed and a closing parenthesis is required.
       Otherwise, since no-pattern is not present, the conditional group
       matches nothing. In other words, this pattern matches a sequence of
       non-parentheses, optionally enclosed in parentheses.

       If you were embedding this pattern in a larger one, you could use a
       relative reference:

         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...

       This makes the fragment independent of the parentheses in the larger
       pattern.

   Checking for a used capture group by name

       Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
       used capture group by name. For compatibility with earlier versions of
       PCRE1, which had this facility before Perl, the syntax (?(name)...) is
       also recognized.
       Note, however, that undelimited names consisting of the letter R
       followed by digits are ambiguous (see the following section).
       Rewriting the above example to use a named group gives this:

         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )

       If the name used in a condition of this kind is a duplicate, the test
       is applied to all groups of the same name, and is true if any one of
       them has matched.

   Checking for pattern recursion

       "Recursion" in this sense refers to any subroutine-like call from one
       part of the pattern to another, whether or not it is actually
       recursive. See the sections entitled "Recursive patterns" and "Groups
       as subroutines" below for details of recursion and subroutine calls.

       If a condition is the string (R), and there is no capture group with
       the name R, the condition is true if matching is currently in a
       recursion or subroutine call to the whole pattern or any capture
       group. If digits follow the letter R, and there is no group with that
       name, the condition is true if the most recent call is into a group
       with the given number, which must exist somewhere in the overall
       pattern. This is a contrived example that is equivalent to a+b:

         ((?(R1)a+|(?1)b))

       However, in both cases, if there is a capture group with a matching
       name, the condition tests for its being set, as described in the
       section above, instead of testing for recursion. For example, creating
       a group with the name R1 by adding (?<R1>) to the above pattern
       completely changes its meaning.

       If a name preceded by ampersand follows the letter R, for example:

         (?(R&name)...)

       the condition is true if the most recent recursion is into a group of
       that name (which must exist within the pattern).

       This condition does not check the entire recursion stack. It tests
       only the current level. If the name used in a condition of this kind
       is a duplicate, the test is applied to all groups of the same name,
       and is true if any one of them is the most recent recursion.

       At "top level", all these recursion test conditions are false.

   Defining capture groups for use by reference only

       If the condition is the string (DEFINE), the condition is always
       false, even if there is a group with the name DEFINE.
       In this case, there may be only one alternative in the rest of the
       conditional group. It is always skipped if control reaches this
       point in the pattern; the idea of DEFINE is that it can be used to
       define subroutines that can be referenced from elsewhere. (The use
       of subroutines is described below.) For example, a pattern to match
       an IPv4 address such as "192.168.23.245" could be written like this
       (ignore white space and line breaks):

         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
         \b (?&byte) (\.(?&byte)){3} \b

       The first part of the pattern is a DEFINE group inside which another
       group named "byte" is defined. This matches an individual component
       of an IPv4 address (a number less than 256). When matching takes
       place, this part of the pattern is skipped because DEFINE acts like
       a false condition. The rest of the pattern uses references to the
       named group to match the four dot-separated components of an IPv4
       address, insisting on a word boundary at each end.

   Checking the PCRE2 version

       Programs that link with a PCRE2 library can check the version by
       calling pcre2_config() with appropriate arguments. Users of
       applications that do not have access to the underlying code cannot
       do this.
       A special "condition" called VERSION exists to allow such users to
       discover which version of PCRE2 they are dealing with by using this
       condition to match a string such as "yesno". VERSION must be
       followed either by "=" or ">=" and a version number. For example:

         (?(VERSION>=10.4)yes|no)

       This pattern matches "yes" if the PCRE2 version is greater than or
       equal to 10.4, or "no" otherwise. The fractional part of the version
       number may not contain more than two digits.

   Assertion conditions

       If the condition is not in any of the above formats, it must be a
       parenthesized assertion. This may be a positive or negative
       lookahead or lookbehind assertion. However, it must be a traditional
       atomic assertion, not one of the non-atomic assertions.

       Consider this pattern, again containing non-significant white space,
       and with the two alternatives on the second line:

         (?(?=[^a-z]*[a-z])
         \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )

       The condition is a positive lookahead assertion that matches an
       optional sequence of non-letters followed by a letter. In other
       words, it tests for the presence of at least one letter in the
       subject.
       If a letter is found, the subject is matched against the first
       alternative; otherwise it is matched against the second. This
       pattern matches strings in one of the two forms dd-aaa-dd or
       dd-dd-dd, where aaa are letters and dd are digits.

       When an assertion that is a condition contains capture groups, any
       capturing that occurs in a matching branch is retained afterwards,
       for both positive and negative assertions, because matching always
       continues after the assertion, whether it succeeds or fails.
       (Compare non-conditional assertions, for which captures are retained
       only for positive assertions that succeed.)


COMMENTS

       There are two ways of including comments in patterns that are
       processed by PCRE2. In both cases, the start of the comment must not
       be in a character class, nor in the middle of any other sequence of
       related characters such as (?: or a group name or number. The
       characters that make up a comment play no part in the pattern
       matching.

       The sequence (?# marks the start of a comment that continues up to
       the next closing parenthesis. Nested parentheses are not permitted.
       If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an
       unescaped # character also introduces a comment, which in this case
       continues to immediately after the next newline character or
       character sequence in the pattern. Which characters are interpreted
       as newlines is controlled by an option passed to the compiling
       function or by a special sequence at the start of the pattern, as
       described in the section entitled "Newline conventions" above. Note
       that the end of this type of comment is a literal newline sequence
       in the pattern; escape sequences that happen to represent a newline
       do not count. For example, consider this pattern when PCRE2_EXTENDED
       is set, and the default newline convention (a single linefeed
       character) is in force:

         abc #comment \n still comment

       On encountering the # character, pcre2_compile() skips along,
       looking for a newline in the pattern. The sequence \n is still
       literal at this stage, so it does not terminate the comment. Only an
       actual character with the code value 0x0a (the default newline) does
       so.


RECURSIVE PATTERNS

       Consider the problem of matching a string in parentheses, allowing
       for unlimited nested parentheses.
       Without the use of recursion, the best that can be done is to use a
       pattern that matches up to some fixed depth of nesting. It is not
       possible to handle an arbitrary nesting depth.

       For some time, Perl has provided a facility that allows regular
       expressions to recurse (amongst other things). It does this by
       interpolating Perl code in the expression at run time, and the code
       can refer to the expression itself. A Perl pattern using code
       interpolation to solve the parentheses problem can be created like
       this:

         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

       The (?p{...}) item interpolates Perl code at run time, and in this
       case refers recursively to the pattern in which it appears.

       Obviously, PCRE2 cannot support the interpolation of Perl code.
       Instead, it supports special syntax for recursion of the entire
       pattern, and also for individual capture group recursion. After its
       introduction in PCRE1 and Python, this kind of recursion was
       subsequently introduced into Perl at release 5.10.

       A special item that consists of (? followed by a number greater than
       zero and a closing parenthesis is a recursive subroutine call of the
       capture group of the given number, provided that it occurs inside
       that group.
       (If not, it is a non-recursive subroutine call, which is described
       in the next section.) The special item (?R) or (?0) is a recursive
       call of the entire regular expression.

       This PCRE2 pattern solves the nested parentheses problem (assume the
       PCRE2_EXTENDED option is set so that white space is ignored):

         \( ( [^()]++ | (?R) )* \)

       First it matches an opening parenthesis. Then it matches any number
       of substrings which can either be a sequence of non-parentheses, or
       a recursive match of the pattern itself (that is, a correctly
       parenthesized substring). Finally there is a closing parenthesis.
       Note the use of a possessive quantifier to avoid backtracking into
       sequences of non-parentheses.

       If this were part of a larger pattern, you would not want to recurse
       the entire pattern, so instead you could use this:

         ( \( ( [^()]++ | (?1) )* \) )

       We have put the pattern into parentheses, and caused the recursion
       to refer to them instead of the whole pattern.

       In a larger pattern, keeping track of parenthesis numbers can be
       tricky. This is made easier by the use of relative references.
       Instead of (?1) in the pattern above you can write (?-2) to refer to
       the second most recently opened parentheses preceding the recursion.
       In other words, a negative number counts capturing parentheses
       leftwards from the point at which it is encountered.

       Be aware, however, that if duplicate capture group numbers are in
       use, relative references refer to the earliest group with the
       appropriate number. Consider, for example:

         (?|(a)|(b)) (c) (?-2)

       The first two capture groups (a) and (b) are both numbered 1, and
       group (c) is number 2. When the reference (?-2) is encountered, the
       second most recently opened parentheses has the number 1, but it is
       the first such group (the (a) group) to which the recursion refers.
       This would be the same if an absolute reference (?1) was used. In
       other words, relative references are just a shorthand for computing
       a group number.

       It is also possible to refer to subsequent capture groups, by
       writing references such as (?+2). However, these cannot be recursive
       because the reference is not inside the parentheses that are
       referenced. They are always non-recursive subroutine calls, as
       described in the next section.

       An alternative approach is to use named parentheses.
       The Perl syntax for this is (?&name); PCRE1's earlier syntax
       (?P>name) is also supported. We could rewrite the above example as
       follows:

         (?<pn> \( ( [^()]++ | (?&pn) )* \) )

       If there is more than one group with the same name, the earliest one
       is used.

       The example pattern that we have been looking at contains nested
       unlimited repeats, and so the use of a possessive quantifier for
       matching strings of non-parentheses is important when applying the
       pattern to strings that do not match. For example, when this pattern
       is applied to

         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

       it yields "no match" quickly. However, if a possessive quantifier is
       not used, the match runs for a very long time indeed because there
       are so many different ways the + and * repeats can carve up the
       subject, and all have to be tested before failure can be reported.

       At the end of a match, the values of capturing parentheses are those
       from the outermost level. If you want to obtain intermediate values,
       a callout function can be used (see below and the pcre2callout
       documentation).
       If the pattern above is matched against

         (ab(cd)ef)

       the value for the inner capturing parentheses (numbered 2) is "ef",
       which is the last value taken on at the top level. If a capture
       group is not matched at the top level, its final captured value is
       unset, even if it was (temporarily) set at a deeper level during the
       matching process.

       Do not confuse the (?R) item with the condition (R), which tests for
       recursion. Consider this pattern, which matches text in angle
       brackets, allowing for arbitrary nesting. Only digits are allowed in
       nested brackets (that is, when recursing), whereas any characters
       are permitted at the outer level.

         < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >

       In this pattern, (?(R) is the start of a conditional group, with two
       different alternatives for the recursive and non-recursive cases.
       The (?R) item is the actual recursive call.

   Differences in recursion processing between PCRE2 and Perl

       Some former differences between PCRE2 and Perl no longer exist.

       Before release 10.30, recursion processing in PCRE2 differed from
       Perl in that a recursive subroutine call was always treated as an
       atomic group.
       That is, once it had matched some of the subject string, it was
       never re-entered, even if it contained untried alternatives and
       there was a subsequent matching failure. (Historical note: PCRE
       implemented recursion before Perl did.)

       Starting with release 10.30, recursive subroutine calls are no
       longer treated as atomic. That is, they can be re-entered to try
       unused alternatives if there is a matching failure later in the
       pattern. This is now compatible with the way Perl works. If you want
       a subroutine call to be atomic, you must explicitly enclose it in an
       atomic group.

       Supporting backtracking into recursions simplifies certain types of
       recursive pattern. For example, this pattern matches palindromic
       strings:

         ^((.)(?1)\2|.?)$

       The second branch in the group matches a single central character in
       the palindrome when there are an odd number of characters, or
       nothing when there are an even number of characters, but in order to
       work it has to be able to try the second case when the rest of the
       pattern match fails.
       If you want to match typical palindromic phrases, the pattern has to
       ignore all non-word characters, which can be done like this:

         ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$

       If run with the PCRE2_CASELESS option, this pattern matches phrases
       such as "A man, a plan, a canal: Panama!". Note the use of the
       possessive quantifier *+ to avoid backtracking into sequences of
       non-word characters. Without this, PCRE2 takes a great deal longer
       (ten times or more) to match typical phrases, and Perl takes so long
       that you think it has gone into a loop.

       Another way in which PCRE2 and Perl used to differ in their
       recursion processing is in the handling of captured values. Formerly
       in Perl, when a group was called recursively or as a subroutine (see
       the next section), it had no access to any values that were captured
       outside the recursion, whereas in PCRE2 these values can be
       referenced. Consider this pattern:

         ^(.)(\1|a(?2))

       This pattern matches "bab". The first capturing parentheses match
       "b", then in the second group, when the backreference \1 fails to
       match "b", the second alternative matches "a" and then recurses. In
       the recursion, \1 does now match "b" and so the whole match
       succeeds.
       This match used to fail in Perl, but in later versions (I tried
       5.024) it now works.


GROUPS AS SUBROUTINES

       If the syntax for a recursive group call (either by number or by
       name) is used outside the parentheses to which it refers, it
       operates a bit like a subroutine in a programming language. More
       accurately, PCRE2 treats the referenced group as an independent
       subpattern which it tries to match at the current matching position.
       The called group may be defined before or after the reference. A
       numbered reference can be absolute or relative, as in these
       examples:

         (...(absolute)...)...(?2)...
         (...(relative)...)...(?-1)...
         (...(?+1)...(relative)...

       An earlier example pointed out that the pattern

         (sens|respons)e and \1ibility

       matches "sense and sensibility" and "response and responsibility",
       but not "sense and responsibility". If instead the pattern

         (sens|respons)e and (?1)ibility

       is used, it does match "sense and responsibility" as well as the
       other two strings. Another example is given in the discussion of
       DEFINE above.
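
       The difference between the backreference \1 and the subroutine call
       (?1) can be sketched outside PCRE2. The following is an illustration
       only, written for Python's standard re module, which has no (?1)
       syntax. It assumes that a non-recursive subroutine call can be
       approximated by splicing the source text of the group back into the
       pattern as a non-capturing group. The names GROUP_1, backref and
       subroutine are hypothetical and do not come from PCRE2.

```python
import re

# Source text of capture group 1, kept in one place so that it can be
# reused. GROUP_1 is a hypothetical helper name for this sketch.
GROUP_1 = r"sens|respons"

# Backreference: \1 must repeat the exact text that group 1 captured.
backref = re.compile(rf"({GROUP_1})e and \1ibility")

# Emulated subroutine call: Python's re has no (?1), so the group's
# source is spliced in again. The alternation is re-run at the new
# position; a non-capturing group is used so that group 1 keeps the
# value it had before the "call".
subroutine = re.compile(rf"({GROUP_1})e and (?:{GROUP_1})ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

backref_hits = [s for s in subjects if backref.fullmatch(s)]
subroutine_hits = [s for s in subjects if subroutine.fullmatch(s)]

print("backreference matches:", backref_hits)
print("subroutine-style matches:", subroutine_hits)
```

       Only the subroutine-style pattern accepts "sense and
       responsibility", because it re-matches the alternation instead of
       comparing against the captured text. Textual splicing cannot express
       a recursive call, so this approach does not extend to the patterns
       in the previous section.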

       Like recursions, subroutine calls used to be treated as atomic, but
       this changed at PCRE2 release 10.30, so backtracking into subroutine
       calls can now occur. However, any capturing parentheses that are set
       during the subroutine call revert to their previous values
       afterwards.

       Processing options such as case-independence are fixed when a group
       is defined, so if it is used as a subroutine, such options cannot be
       changed for different calls. For example, consider this pattern:

         (abc)(?i:(?-1))

       It matches "abcabc". It does not match "abcABC" because the change
       of processing option does not affect the called group.

       The behaviour of backtracking control verbs in groups when called as
       subroutines is described in the section entitled "Backtracking verbs
       in subroutines" below.


ONIGURUMA SUBROUTINE SYNTAX

       For compatibility with Oniguruma, the non-Perl syntax \g followed by
       a name or a number enclosed either in angle brackets or single
       quotes, is an alternative syntax for calling a group as a
       subroutine, possibly recursively.
       Here are two of the examples used above, rewritten using this
       syntax:

         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
         (sens|respons)e and \g'1'ibility

       PCRE2 supports an extension to Oniguruma: if a number is preceded by
       a plus or a minus sign it is taken as a relative reference. For
       example:

         (abc)(?i:\g<-1>)

       Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are
       not synonymous. The former is a backreference; the latter is a
       subroutine call.


CALLOUTS

       Perl has a feature whereby using the sequence (?{...}) causes
       arbitrary Perl code to be obeyed in the middle of matching a regular
       expression. This makes it possible, amongst other things, to extract
       different substrings that match the same pair of parentheses when
       there is a repetition.

       PCRE2 provides a similar feature, but of course it cannot obey
       arbitrary Perl code. The feature is called "callout". The caller of
       PCRE2 provides an external function by putting its entry point in a
       match context using the function pcre2_set_callout(), and then
       passing that context to pcre2_match() or pcre2_dfa_match().
       If no match context is passed, or if the callout entry point is set
       to NULL, callouts are disabled.

       Within a regular expression, (?C<arg>) indicates a point at which
       the external function is to be called. There are two kinds of
       callout: those with a numerical argument and those with a string
       argument. (?C) on its own with no argument is treated as (?C0). A
       numerical argument allows the application to distinguish between
       different callouts. String arguments were added for release 10.20 to
       make it possible for script languages that use PCRE2 to embed short
       scripts within patterns in a similar way to Perl.

       During matching, when PCRE2 reaches a callout point, the external
       function is called. It is provided with the number or string
       argument of the callout, the position in the pattern, and one item
       of data that is also set in the match block. The callout function
       may cause matching to proceed, to backtrack, or to fail.

       By default, PCRE2 implements a number of optimizations at matching
       time, and one side-effect is that sometimes callouts are skipped. If
       you need all possible callouts to happen, you need to set options
       that disable the relevant optimizations.
       More details, including a complete description of the programming
       interface to the callout function, are given in the pcre2callout
       documentation.

   Callouts with numerical arguments

       If you just want to have a means of identifying different callout
       points, put a number less than 256 after the letter C. For example,
       this pattern has two callout points:

         (?C1)abc(?C2)def

       If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(),
       numerical callouts are automatically installed before each item in
       the pattern. They are all numbered 255. If there is a conditional
       group in the pattern whose condition is an assertion, an additional
       callout is inserted just before the condition. An explicit callout
       may also be set at this position, as in this example:

         (?(?C9)(?=a)abc|def)

       Note that this applies only to assertion conditions, not to other
       types of condition.

   Callouts with string arguments

       A delimited string may be used instead of a number as a callout
       argument. The starting delimiter must be one of ` ' " ^ % # $ { and
       the ending delimiter is the same as the start, except for {, where
       the ending delimiter is }.
       If the ending delimiter is needed within the string, it must be
       doubled. For example:

         (?C'ab ''c'' d')xyz(?C{any text})pqr

       The doubling is removed before the string is passed to the callout
       function.


BACKTRACKING CONTROL

       There are a number of special "Backtracking Control Verbs" (to use
       Perl's terminology) that modify the behaviour of backtracking during
       matching. They are generally of the form (*VERB) or (*VERB:NAME).
       Some verbs take either form, and may behave differently depending on
       whether or not a name argument is present. The names are not
       required to be unique within the pattern.

       By default, for compatibility with Perl, a name is any sequence of
       characters that does not include a closing parenthesis. The name is
       not processed in any way, and it is not possible to include a
       closing parenthesis in the name. This can be changed by setting the
       PCRE2_ALT_VERBNAMES option, but the result is no longer
       Perl-compatible.

       When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to
       verb names and only an unescaped closing parenthesis terminates the
       name.
       However, the only backslash items that are permitted are \Q, \E, and
       sequences such as \x{100} that define character code points.
       Character type escapes such as \d are faulted.

       A closing parenthesis can be included in a name either as \) or
       between \Q and \E. In addition to backslash processing, if the
       PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is also set, unescaped
       whitespace in verb names is skipped, and #-comments are recognized,
       exactly as in the rest of the pattern. PCRE2_EXTENDED and
       PCRE2_EXTENDED_MORE do not affect verb names unless
       PCRE2_ALT_VERBNAMES is also set.

       The maximum length of a name is 255 in the 8-bit library and 65535
       in the 16-bit and 32-bit libraries. If the name is empty, that is,
       if the closing parenthesis immediately follows the colon, the effect
       is as if the colon were not there. Any number of these verbs may
       occur in a pattern. Except for (*ACCEPT), they may not be
       quantified.

       Since these verbs are specifically related to backtracking, most of
       them can be used only when the pattern is to be matched using the
       traditional matching function, because that uses a backtracking
       algorithm. With the exception of (*FAIL), which behaves like a
       failing negative assertion, the backtracking control verbs cause an
       error if encountered by the DFA matching function.

       The behaviour of these verbs in repeated groups, assertions, and in
       capture groups called as subroutines (whether or not recursively) is
       documented below.

   Optimizations that affect backtracking verbs

       PCRE2 contains some optimizations that are used to speed up matching by
       running some checks at the start of each match attempt. For example, it
       may know the minimum length of matching subject, or that a particular
       character must be present. When one of these optimizations bypasses the
       running of a match, any included backtracking verbs will not, of
       course, be processed. You can suppress the start-of-match optimizations
       by setting the PCRE2_NO_START_OPTIMIZE option when calling
       pcre2_compile(), or by starting the pattern with (*NO_START_OPT). There
       is more discussion of this option in the section entitled "Compiling a
       pattern" in the pcre2api documentation.

       Experiments with Perl suggest that it too has similar optimizations,
       and like PCRE2, turning them off can change the result of a match.

   Verbs that act immediately

       The following verbs act as soon as they are encountered.

         (*ACCEPT) or (*ACCEPT:NAME)

       This verb causes the match to end successfully, skipping the remainder
       of the pattern. However, when it is inside a capture group that is
       called as a subroutine, only that group is ended successfully. Matching
       then continues at the outer level. If (*ACCEPT) is triggered in a
       positive assertion, the assertion succeeds; in a negative assertion,
       the assertion fails.

       If (*ACCEPT) is inside capturing parentheses, the data so far is
       captured. For example:

         A((?:A|B(*ACCEPT)|C)D)

       This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is
       captured by the outer parentheses.

       (*ACCEPT) is the only backtracking verb that is allowed to be
       quantified because an ungreedy quantification with a minimum of zero
       acts only when a backtrack happens. Consider, for example,

         (A(*ACCEPT)??B)C

       where A, B, and C may be complex expressions. After matching "A", the
       matcher processes "BC"; if that fails, causing a backtrack, (*ACCEPT)
       is triggered and the match succeeds. In both cases, all but C is
       captured.
       Whereas (*COMMIT) (see below) means "fail on backtrack", a
       repeated (*ACCEPT) of this type means "succeed on backtrack".

       Warning: (*ACCEPT) should not be used within a script run group,
       because it causes an immediate exit from the group, bypassing the
       script run checking.

         (*FAIL) or (*FAIL:NAME)

       This verb causes a matching failure, forcing backtracking to occur. It
       may be abbreviated to (*F). It is equivalent to (?!) but easier to
       read. The Perl documentation notes that it is probably useful only when
       combined with (?{}) or (??{}). Those are, of course, Perl features that
       are not present in PCRE2. The nearest equivalent is the callout
       feature, as for example in this pattern:

         a+(?C)(*FAIL)

       A match with the string "aaaa" always fails, but the callout is taken
       before each backtrack happens (in this example, 10 times).

       (*ACCEPT:NAME) and (*FAIL:NAME) behave the same as
       (*MARK:NAME)(*ACCEPT) and (*MARK:NAME)(*FAIL), respectively, that is, a
       (*MARK) is recorded just before the verb acts.
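
       The figure of 10 callouts quoted for a+(?C)(*FAIL) against "aaaa" can
       be checked with a little arithmetic: at each starting offset the greedy
       a+ first takes the longest run of "a", and the callout is reached once
       for every length it then gives back down to one character. The sketch
       below is a toy model of that counting only; it does not call PCRE2, and
       the function name count_callouts is an assumption made for
       illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model, not PCRE2 code: counts how many times the callout in
 * a+(?C)(*FAIL) would be reached for a given subject. */
static size_t count_callouts(const char *subject)
{
    assert(subject != NULL);

    size_t n = strlen(subject);
    size_t callouts = 0;

    /* "Bumpalong": try each starting offset in turn. */
    for (size_t start = 0; start < n; start++) {
        /* Greedy a+ first takes the longest run of 'a' at this offset. */
        size_t run = 0;
        while (start + run < n && subject[start + run] == 'a')
            run++;

        /* The callout is reached once for each length a+ settles on
         * (run, run-1, ..., 1); (*FAIL) then forces the next backtrack. */
        callouts += run;
    }
    return callouts;
}
```

       For "aaaa" the runs at offsets 0 to 3 are 4, 3, 2, and 1, so
       count_callouts("aaaa") returns 4 + 3 + 2 + 1 = 10.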

   Recording which path was taken

       There is one verb whose main purpose is to track how a match was
       arrived at, though it also has a secondary use in conjunction with
       advancing the match starting point (see (*SKIP) below).

         (*MARK:NAME) or (*:NAME)

       A name is always required with this verb. For all the other
       backtracking control verbs, a NAME argument is optional.

       When a match succeeds, the last-encountered mark name on the matching
       path is passed back to the caller as described in the section entitled
       "Other information about the match" in the pcre2api documentation. This
       applies to all instances of (*MARK) and other verbs, including those
       inside assertions and atomic groups. However, there are differences in
       those cases when (*MARK) is used in conjunction with (*SKIP) as
       described below.

       The mark name that was last encountered on the matching path is passed
       back. A verb without a NAME argument is ignored for this purpose.
       Here is an example of pcre2test output, where the "mark" modifier
       requests the retrieval and outputting of (*MARK) data:

           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
         data> XY
          0: XY
         MK: A
         XZ
          0: XZ
         MK: B

       The (*MARK) name is tagged with "MK:" in this output, and in this
       example it indicates which of the two alternatives matched. This is a
       more efficient way of obtaining this information than putting each
       alternative in its own capturing parentheses.

       If a verb with a name is encountered in a positive assertion that is
       true, the name is recorded and passed back if it is the
       last-encountered. This does not happen for negative assertions or
       failing positive assertions.

       After a partial match or a failed match, the last encountered name in
       the entire match process is returned. For example:

           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
         data> XP
         No match, mark = B

       Note that in this unanchored example the mark is retained from the
       match attempt that started at the letter "X" in the subject.
       Subsequent match attempts starting at "P" and then with an empty
       string do not get as far as the (*MARK) item, but nevertheless do not
       reset it.

       If you are interested in (*MARK) values after failed matches, you
       should probably set the PCRE2_NO_START_OPTIMIZE option (see above) to
       ensure that the match is always attempted.

   Verbs that act after backtracking

       The following verbs do nothing when they are encountered. Matching
       continues with what follows, but if there is a subsequent match
       failure, causing a backtrack to the verb, a failure is forced. That is,
       backtracking cannot pass to the left of the verb. However, when one of
       these verbs appears inside an atomic group or in a lookaround assertion
       that is true, its effect is confined to that group, because once the
       group has been matched, there is never any backtracking into it.
       Backtracking from beyond an assertion or an atomic group ignores the
       entire group, and seeks a preceding backtracking point.

       These verbs differ in exactly what kind of failure occurs when
       backtracking reaches them. The behaviour described below is what
       happens when the verb is not in a subroutine or an assertion.
       Subsequent sections cover these special cases.

         (*COMMIT) or (*COMMIT:NAME)

       This verb causes the whole match to fail outright if there is a later
       matching failure that causes backtracking to reach it. Even if the
       pattern is unanchored, no further attempts to find a match by advancing
       the starting point take place. If (*COMMIT) is the only backtracking
       verb that is encountered, once it has been passed pcre2_match() is
       committed to finding a match at the current starting point, or not at
       all. For example:

         a+(*COMMIT)b

       This matches "xxaab" but not "aacaab". It can be thought of as a kind
       of dynamic anchor, or "I've started, so I must finish."

       The behaviour of (*COMMIT:NAME) is not the same as
       (*MARK:NAME)(*COMMIT). It is like (*MARK:NAME) in that the name is
       remembered for passing back to the caller. However, (*SKIP:NAME)
       searches only for names that are set with (*MARK), ignoring those set
       by any of the other backtracking verbs.

       If there is more than one backtracking verb in a pattern, a different
       one that follows (*COMMIT) may be triggered first, so merely passing
       (*COMMIT) during a match does not always guarantee that a match must be
       at this starting point.
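
       To make the a+(*COMMIT)b example concrete, the following sketch models
       the outcome for "xxaab" and "aacaab" by hand. It is a toy model
       hard-wired to this one pattern, not a use of the PCRE2 API, and the
       function name match_a_plus_b and its with_commit flag are assumptions
       made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model, not PCRE2 code: matches a+b (with_commit == 0) or
 * a+(*COMMIT)b (with_commit != 0) against subject.
 * Returns the offset where the match starts, or -1 for no match. */
static long match_a_plus_b(const char *subject, int with_commit)
{
    assert(subject != NULL);

    size_t n = strlen(subject);

    for (size_t start = 0; start < n; start++) {
        /* a+ : longest run of 'a' at this starting offset. */
        size_t run = 0;
        while (start + run < n && subject[start + run] == 'a')
            run++;

        if (run == 0)
            continue;   /* a+ failed before (*COMMIT) was reached: bump along */

        /* Every character inside the run is 'a', so giving characters back
         * from a+ can never expose a 'b'; only the character that follows
         * the whole run matters. */
        if (start + run < n && subject[start + run] == 'b')
            return (long)start;

        if (with_commit)
            return -1;  /* backtracking reached (*COMMIT): fail outright */
    }
    return -1;
}
```

       In this model "xxaab" still matches with the verb present, because a+
       fails at the two "x" positions before (*COMMIT) is ever passed, whereas
       "aacaab" fails outright at offset 0 with the verb and matches at offset
       3 without it.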

       Note that (*COMMIT) at the start of a pattern is not the same as an
       anchor, unless PCRE2's start-of-match optimizations are turned off, as
       shown in this output from pcre2test:

           re> /(*COMMIT)abc/
         data> xyzabc
          0: abc
         data>
           re> /(*COMMIT)abc/no_start_optimize
         data> xyzabc
         No match

       For the first pattern, PCRE2 knows that any match must start with "a",
       so the optimization skips along the subject to "a" before applying the
       pattern to the first set of data. The match attempt then succeeds. The
       second pattern disables the optimization that skips along to the first
       character. The pattern is now applied starting at "x", and so the
       (*COMMIT) causes the match to fail without trying any other starting
       points.

         (*PRUNE) or (*PRUNE:NAME)

       This verb causes the match to fail at the current starting position in
       the subject if there is a later matching failure that causes
       backtracking to reach it. If the pattern is unanchored, the normal
       "bumpalong" advance to the next starting character then happens.
       Backtracking can occur as usual to the left of (*PRUNE), before it is
       reached, or when matching to the right of (*PRUNE), but if there is no
       match to the right, backtracking cannot cross (*PRUNE). In simple
       cases, the use of (*PRUNE) is just an alternative to an atomic group or
       possessive quantifier, but there are some uses of (*PRUNE) that cannot
       be expressed in any other way. In an anchored pattern (*PRUNE) has the
       same effect as (*COMMIT).

       The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
       It is like (*MARK:NAME) in that the name is remembered for passing back
       to the caller. However, (*SKIP:NAME) searches only for names set with
       (*MARK), ignoring those set by other backtracking verbs.

         (*SKIP)

       This verb, when given without a name, is like (*PRUNE), except that if
       the pattern is unanchored, the "bumpalong" advance is not to the next
       character, but to the position in the subject where (*SKIP) was
       encountered. (*SKIP) signifies that whatever text was matched leading
       up to it cannot be part of a successful match if there is a later
       mismatch.
       Consider:

         a+(*SKIP)b

       If the subject is "aaaac...", after the first match attempt fails
       (starting at the first character in the string), the starting point
       skips on to start the next attempt at "c". Note that a possessive
       quantifier does not have the same effect as this example; although it
       would suppress backtracking during the first match attempt, the second
       attempt would start at the second character instead of skipping on to
       "c".

       If (*SKIP) is used to specify a new starting position that is the same
       as the starting position of the current match, or (by being inside a
       lookbehind) earlier, the position specified by (*SKIP) is ignored, and
       instead the normal "bumpalong" occurs.

         (*SKIP:NAME)

       When (*SKIP) has an associated name, its behaviour is modified. When
       such a (*SKIP) is triggered, the previous path through the pattern is
       searched for the most recent (*MARK) that has the same name. If one is
       found, the "bumpalong" advance is to the subject position that
       corresponds to that (*MARK) instead of to where (*SKIP) was
       encountered. If no (*MARK) with a matching name is found, the (*SKIP)
       is ignored.

       The search for a (*MARK) name uses the normal backtracking mechanism,
       which means that it does not see (*MARK) settings that are inside
       atomic groups or assertions, because they are never re-entered by
       backtracking. Compare the following pcre2test examples:

           re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
         data> abc
          0: a
          1: a
         data>
           re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
         data> abc
          0: b
          1: b

       In the first example, the (*MARK) setting is in an atomic group, so it
       is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored.
       This allows the second branch of the pattern to be tried at the first
       character position. In the second example, the (*MARK) setting is not
       in an atomic group. This allows (*SKIP:X) to find the (*MARK) when it
       backtracks, and this causes a new matching attempt to start at the
       second character. This time, the (*MARK) is never seen because "a" does
       not match "b", so the matcher immediately jumps to the second branch of
       the pattern.

       Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It
       ignores names that are set by other backtracking verbs.
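
       Returning to the unnamed form, the difference between a+(*SKIP)b and
       the possessive a++b lies only in where the next match attempt starts
       after a failure. The sketch below records the starting offsets that
       would be tried for each. It is a toy model hard-wired to these two
       patterns, not PCRE2 code; the name starts_tried and its parameters are
       assumptions made for illustration, and unlike the real matcher it does
       not make a final attempt at the very end of the subject.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model, not PCRE2 code: records the starting offsets tried for
 * a++b (with_skip == 0) or a+(*SKIP)b (with_skip != 0).
 * Stores at most max_starts offsets in starts[] and returns the number of
 * match attempts made, including a successful one if there is a match. */
static size_t starts_tried(const char *subject, int with_skip,
                           size_t *starts, size_t max_starts)
{
    assert(subject != NULL && starts != NULL);

    size_t n = strlen(subject);
    size_t attempts = 0;
    size_t start = 0;

    while (start < n) {
        if (attempts < max_starts)
            starts[attempts] = start;
        attempts++;

        /* a+ (or a++): longest run of 'a' at this starting offset. */
        size_t run = 0;
        while (start + run < n && subject[start + run] == 'a')
            run++;

        if (run > 0 && start + run < n && subject[start + run] == 'b')
            break;              /* matched at this start */

        if (with_skip && run > 0)
            start += run;       /* resume where (*SKIP) was encountered */
        else
            start += 1;         /* normal bumpalong to the next character */
    }
    return attempts;
}
```

       For the subject "aaaac" the model tries offsets 0 and 4 with (*SKIP),
       but offsets 0, 1, 2, 3, and 4 with the possessive quantifier.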

         (*THEN) or (*THEN:NAME)

       This verb causes a skip to the next innermost alternative when
       backtracking reaches it. That is, it cancels any further backtracking
       within the current alternative. Its name comes from the observation
       that it can be used for a pattern-based if-then-else block:

         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...

       If the COND1 pattern matches, FOO is tried (and possibly further items
       after the end of the group if FOO succeeds); on failure, the matcher
       skips to the second alternative and tries COND2, without backtracking
       into COND1. If that succeeds and BAR fails, COND3 is tried. If
       subsequently BAZ fails, there are no more alternatives, so there is a
       backtrack to whatever came before the entire group. If (*THEN) is not
       inside an alternation, it acts like (*PRUNE).

       The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
       It is like (*MARK:NAME) in that the name is remembered for passing back
       to the caller. However, (*SKIP:NAME) searches only for names set with
       (*MARK), ignoring those set by other backtracking verbs.

       A group that does not contain a | character is just a part of the
       enclosing alternative; it is not a nested alternation with only one
       alternative.
       The effect of (*THEN) extends beyond such a group to the enclosing
       alternative. Consider this pattern, where A, B, etc. are complex
       pattern fragments that do not contain any | characters at this level:

         A (B(*THEN)C) | D

       If A and B are matched, but there is a failure in C, matching does not
       backtrack into A; instead it moves to the next alternative, that is, D.
       However, if the group containing (*THEN) is given an alternative, it
       behaves differently:

         A (B(*THEN)C | (*FAIL)) | D

       The effect of (*THEN) is now confined to the inner group. After a
       failure in C, matching moves to (*FAIL), which causes the whole group
       to fail because there are no more alternatives to try. In this case,
       matching does backtrack into A.

       Note that a conditional group is not considered as having two
       alternatives, because only one is ever used. In other words, the |
       character in a conditional group has a different meaning. Ignoring
       white space, consider:

         ^.*? (?(?=a) a | b(*THEN)c )

       If the subject is "ba", this pattern does not match. Because .*? is
       ungreedy, it initially matches zero characters. The condition (?=a)
       then fails, the character "b" is matched, but "c" is not.
       At this point, matching does not backtrack to .*? as might perhaps be
       expected from the presence of the | character. The conditional group is
       part of the single alternative that comprises the whole pattern, and so
       the match fails. (If there were a backtrack into .*?, allowing it to
       match "b", the match would succeed.)

       The verbs just described provide four different "strengths" of control
       when subsequent matching fails. (*THEN) is the weakest, carrying on the
       match at the next alternative. (*PRUNE) comes next, failing the match
       at the current starting position, but allowing an advance to the next
       character (for an unanchored pattern). (*SKIP) is similar, except that
       the advance may be more than one character. (*COMMIT) is the strongest,
       causing the entire match to fail.

   More than one backtracking verb

       If more than one backtracking verb is present in a pattern, the one
       that is backtracked onto first acts. For example, consider this
       pattern, where A, B, etc. are complex pattern fragments:

         (A(*COMMIT)B(*THEN)C|ABD)

       If A matches but B fails, the backtrack to (*COMMIT) causes the entire
       match to fail. However, if A and B match, but C fails, the backtrack to
       (*THEN) causes the next alternative (ABD) to be tried.
       This behaviour is consistent, but is not always the same as Perl's. It
       means that if two or more backtracking verbs appear in succession, all
       but the last of them have no effect. Consider this example:

         ...(*COMMIT)(*PRUNE)...

       If there is a matching failure to the right, backtracking onto (*PRUNE)
       causes it to be triggered, and its action is taken. There can never be
       a backtrack onto (*COMMIT).

   Backtracking verbs in repeated groups

       PCRE2 sometimes differs from Perl in its handling of backtracking verbs
       in repeated groups. For example, consider:

         /(a(*COMMIT)b)+ac/

       If the subject is "abac", Perl matches unless its optimizations are
       disabled, but PCRE2 always fails because the (*COMMIT) in the second
       repeat of the group acts.

   Backtracking verbs in assertions

       (*FAIL) in any assertion has its normal effect: it forces an immediate
       backtrack. The behaviour of the other backtracking verbs depends on
       whether the assertion is standalone or acting as the condition in a
       conditional group.

       (*ACCEPT) in a standalone positive assertion causes the assertion to
       succeed without any further processing; captured strings and a mark
       name (if set) are retained. In a standalone negative assertion,
       (*ACCEPT) causes the assertion to fail without any further processing;
       captured substrings and any mark name are discarded.

       If the assertion is a condition, (*ACCEPT) causes the condition to be
       true for a positive assertion and false for a negative one; captured
       substrings are retained in both cases.

       The remaining verbs act only when a later failure causes a backtrack to
       reach them. This means that, for the Perl-compatible assertions, their
       effect is confined to the assertion, because Perl lookaround assertions
       are atomic. A backtrack that occurs after such an assertion is complete
       does not jump back into the assertion. Note in particular that a
       (*MARK) name that is set in an assertion is not "seen" by an instance
       of (*SKIP:NAME) later in the pattern.

       PCRE2 now supports non-atomic positive assertions, as described in the
       section entitled "Non-atomic assertions" above. These assertions must
       be standalone (not used as conditions). They are not Perl-compatible.
       For these assertions, a later backtrack does jump back into the
       assertion, and therefore verbs such as (*COMMIT) can be triggered by
       backtracks from later in the pattern.

       The effect of (*THEN) is not allowed to escape beyond an assertion. If
       there are no more branches to try, (*THEN) causes a positive assertion
       to be false, and a negative assertion to be true.

       The other backtracking verbs are not treated specially if they appear
       in a standalone positive assertion. In a conditional positive
       assertion, backtracking (from within the assertion) into (*COMMIT),
       (*SKIP), or (*PRUNE) causes the condition to be false. However, for
       both standalone and conditional negative assertions, backtracking into
       (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be true,
       without considering any further alternative branches.

   Backtracking verbs in subroutines

       These behaviours occur whether or not the group is called recursively.

       (*ACCEPT) in a group called as a subroutine causes the subroutine match
       to succeed without any further processing. Matching then continues
       after the subroutine call. Perl documents this behaviour. Perl's
       treatment of the other verbs in subroutines is different in some cases.

       (*FAIL) in a group called as a subroutine has its normal effect: it
       forces an immediate backtrack.

       (*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail
       when triggered by being backtracked to in a group called as a
       subroutine. There is then a backtrack at the outer level.

       (*THEN), when triggered, skips to the next alternative in the innermost
       enclosing group that has alternatives (its normal behaviour). However,
       if there is no such group within the subroutine's group, the subroutine
       match fails and there is a backtrack at the outer level.


SEE ALSO

       pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3),
       pcre2(3).


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 04 June 2024
       Copyright (c) 1997-2024 University of Cambridge.


PCRE2 10.44                      04 June 2024                  PCRE2PATTERN(3)
------------------------------------------------------------------------------



PCRE2PERFORM(3)            Library Functions Manual            PCRE2PERFORM(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


PCRE2 PERFORMANCE

       Two aspects of performance are discussed below: memory usage and
       processing time. The way you express your pattern as a regular
       expression can affect both of them.


COMPILED PATTERN MEMORY USAGE

       Patterns are compiled by PCRE2 into a reasonably efficient interpretive
       code, so that most simple patterns do not use much memory for storing
       the compiled version. However, there is one case where the memory usage
       of a compiled pattern can be unexpectedly large. If a parenthesized
       group has a quantifier with a minimum greater than 1 and/or a limited
       maximum, the whole group is repeated in the compiled code.
       For example, the pattern

         (abc|def){2,4}

       is compiled as if it were

         (abc|def)(abc|def)((abc|def)(abc|def)?)?

       (Technical aside: It is done this way so that backtrack points within
       each of the repetitions can be independently maintained.)

       For regular expressions whose quantifiers use only small numbers, this
       is not usually a problem. However, if the numbers are large, and
       particularly if such repetitions are nested, the memory usage can
       become an embarrassment. For example, the very simple pattern

         ((ab){1,1000}c){1,3}

       uses over 50KiB when compiled using the 8-bit library. When PCRE2 is
       compiled with its default internal pointer size of two bytes, the size
       limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
       libraries, and this is reached with the above pattern if the outer
       repetition is increased from 3 to 4. PCRE2 can be compiled to use
       larger internal pointers and thus handle larger compiled patterns, but
       it is better to try to rewrite your pattern to use less memory if you
       can.

       One way of reducing the memory usage for such patterns is to make use
       of PCRE2's "subroutine" facility.
       Re-writing the above pattern as

         ((ab)(?2){0,999}c)(?1){0,2}

       reduces the memory requirements to around 16KiB, and indeed it remains
       under 20KiB even with the outer repetition increased to 100. However,
       this kind of pattern is not always exactly equivalent, because any
       captures within subroutine calls are lost when the subroutine
       completes. If this is not a problem, this kind of rewriting will allow
       you to process patterns that PCRE2 cannot otherwise handle. The
       matching performance of the two different versions of the pattern is
       roughly the same. (This applies from release 10.30 - things were
       different in earlier releases.)


STACK AND HEAP USAGE AT RUN TIME

       From release 10.30, the interpretive (non-JIT) version of pcre2_match()
       uses very little system stack at run time. In earlier releases
       recursive function calls could use a great deal of stack, and this
       could cause problems, but this usage has been eliminated. Backtracking
       positions are now explicitly remembered in memory frames controlled by
       the code.

       The size of each frame depends on the size of pointer variables and the
       number of capturing parenthesized groups in the pattern being matched.
       On a 64-bit system the frame size for a pattern with no captures is 128
       bytes. For each capturing group the size increases by 16 bytes.

       Until release 10.41, an initial 20KiB frames vector was allocated on
       the system stack, but this still caused some issues for multi-thread
       applications where each thread has a very small stack. From release
       10.41 backtracking memory frames are always held in heap memory. An
       initial heap allocation is obtained the first time any match data block
       is passed to pcre2_match(). This is remembered with the match data
       block and re-used if that block is used for another match. It is freed
       when the match data block itself is freed.

       The size of the initial block is the larger of 20KiB or ten times the
       pattern's frame size, unless the heap limit is less than this, in which
       case the heap limit is used. If the initial block proves to be too
       small during matching, it is replaced by a larger block, subject to the
       heap limit. The heap limit is checked only when a new block is to be
       allocated. Reducing the heap limit between calls to pcre2_match() with
       the same match data block does not affect the saved block.
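
       The arithmetic in the preceding paragraphs can be sketched in a few
       lines of C. This is only an illustration of the figures quoted above
       for a 64-bit system, not PCRE2's own code: the function names are
       hypothetical, and the heap limit is taken in bytes here purely for
       simplicity.

```c
#include <assert.h>
#include <stddef.h>

/* Figures quoted in the text for a 64-bit system. */
#define BASE_FRAME_BYTES   128u             /* frame size with no captures */
#define PER_CAPTURE_BYTES  16u              /* added per capturing group   */
#define MIN_INITIAL_BYTES  (20u * 1024u)    /* 20KiB                       */

/* Hypothetical helper: frame size for a pattern with capture_count groups. */
size_t frame_size(size_t capture_count)
{
    return BASE_FRAME_BYTES + PER_CAPTURE_BYTES * capture_count;
}

/* Hypothetical helper: initial heap block size. It is the larger of 20KiB
   and ten frames, unless the heap limit (in bytes here) is smaller. */
size_t initial_block_size(size_t capture_count, size_t heap_limit_bytes)
{
    size_t wanted = 10 * frame_size(capture_count);
    if (wanted < MIN_INITIAL_BYTES) wanted = MIN_INITIAL_BYTES;
    if (heap_limit_bytes < wanted) wanted = heap_limit_bytes;
    return wanted;
}

/* Worked values under the assumptions above. */
void frame_size_examples(void)
{
    assert(frame_size(0) == 128);
    assert(frame_size(3) == 176);                        /* 128 + 3*16      */
    assert(initial_block_size(3, 1u << 30) == 20480);    /* 10*176 < 20KiB  */
    assert(initial_block_size(200, 1u << 30) == 33280);  /* 10*3328 > 20KiB */
    assert(initial_block_size(3, 4096) == 4096);         /* capped by limit */
}
```

       With these figures, ten frames exceed 20KiB only once a pattern has
       more than 120 capturing groups, so for most patterns the initial
       block is simply 20KiB.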

       In contrast to pcre2_match(), pcre2_dfa_match() does use recursive
       function calls, but only for processing atomic groups, lookaround
       assertions, and recursion within the pattern. The original version of
       the code used to allocate quite large internal workspace vectors on the
       stack, which caused some problems for some patterns in environments
       with small stacks. From release 10.32 the code for pcre2_dfa_match()
       has been re-factored to use heap memory when necessary for internal
       workspace when recursing, though recursive function calls are still
       used.

       The "match depth" parameter can be used to limit the depth of function
       recursion, and the "match heap" parameter to limit heap memory in
       pcre2_dfa_match().


PROCESSING TIME

       Certain items in regular expression patterns are processed more
       efficiently than others. It is more efficient to use a character class
       like [aeiou] than a set of single-character alternatives such as
       (a|e|i|o|u). In general, the simplest construction that provides the
       required behaviour is usually the most efficient. Jeffrey Friedl's book
       contains a lot of useful general discussion about optimizing regular
       expressions for efficient performance.
       This document contains a few observations about PCRE2.

       Using Unicode character properties (the \p, \P, and \X escapes) is
       slow, because PCRE2 has to use a multi-stage table lookup whenever it
       needs a character's property. If you can find an alternative pattern
       that does not use character properties, it will probably be faster.

       By default, the escape sequences \b, \d, \s, and \w, and the POSIX
       character classes such as [:alpha:] do not use Unicode properties,
       partly for backwards compatibility, and partly for performance reasons.
       However, you can set the PCRE2_UCP option or start the pattern with
       (*UCP) if you want Unicode character properties to be used. This can
       double the matching time for items such as \d, when matched with
       pcre2_match(); the performance loss is less with a DFA matching
       function, and in both cases there is not much difference for \b.

       When a pattern begins with .* not in atomic parentheses, nor in
       parentheses that are the subject of a backreference, and the
       PCRE2_DOTALL option is set, the pattern is implicitly anchored by
       PCRE2, since it can match only at the start of a subject string. If the
       pattern has multiple top-level branches, they must all be anchorable.
       The optimization can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option,
       and is automatically disabled if the pattern contains (*PRUNE) or
       (*SKIP).

       If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization,
       because the dot metacharacter does not then match a newline, and if the
       subject string contains newlines, the pattern may match from the
       character immediately following one of them instead of from the very
       start. For example, the pattern

         .*second

       matches the subject "first\nand second" (where \n stands for a newline
       character), with the match starting at the seventh character. In order
       to do this, PCRE2 has to retry the match starting after every newline
       in the subject.

       If you are using such a pattern with subject strings that do not
       contain newlines, the best performance is obtained by setting
       PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate
       explicit anchoring. That saves PCRE2 from having to scan along the
       subject looking for a newline to restart at.

       Beware of patterns that contain nested indefinite repeats. These can
       take a long time to run when applied to a string that does not match.
       Consider the pattern fragment

         ^(a+)*

       This can match "aaaa" in 16 different ways, and this number increases
       very rapidly as the string gets longer. (The * repeat can match 0, 1,
       2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
       repeats can match different numbers of times.) When the remainder of
       the pattern is such that the entire match is going to fail, PCRE2 has
       in principle to try every possible variation, and this can take an
       extremely long time, even for relatively short strings.

       An optimization catches some of the more simple cases such as

         (a+)*b

       where a literal character follows. Before embarking on the standard
       matching procedure, PCRE2 checks that there is a "b" later in the
       subject string, and if there is not, it fails the match immediately.
       However, when there is no following literal this optimization cannot be
       used. You can see the difference by comparing the behaviour of

         (a+)*\d

       with the pattern above. When applied to a whole line of "a" characters,
       the pattern above, (a+)*b, gives a failure almost instantly, whereas
       (a+)*\d takes an appreciable time with strings longer than about 20
       characters.
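
       The figure of 16 quoted above for ^(a+)* against "aaaa" can be checked
       with a short counting routine. The sketch below is an illustration
       only (the function name is hypothetical and it does not call PCRE2):
       it counts, for every prefix length, the ways of splitting that prefix
       into consecutive non-empty runs, one run per iteration of the group.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_RUN 62   /* keeps the total (2^n) within a uint64_t */

/* Count the ways ^(a+)* can match some prefix of a run of n "a" characters.
   ways[k] is the number of ways the group iterations can consume exactly k
   characters; ways[0] = 1 stands for zero iterations. */
uint64_t count_match_ways(int n)
{
    uint64_t ways[MAX_RUN + 1];
    uint64_t total = 0;

    assert(n >= 0 && n <= MAX_RUN);

    ways[0] = 1;
    for (int k = 1; k <= n; k++) {
        ways[k] = 0;
        for (int j = 1; j <= k; j++)   /* last iteration consumes j chars */
            ways[k] += ways[k - j];
    }
    for (int k = 0; k <= n; k++)       /* the match may stop at any prefix */
        total += ways[k];
    return total;
}
```

       For n = 4 the per-prefix counts are 1, 1, 2, 4, and 8, which sum to 16.
       The total doubles with each extra character, which is why a failing
       match over a run of only a few dozen characters can become very slow.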

       In many cases, the solution to this kind of performance issue is to use
       an atomic group or a possessive quantifier. This can often reduce
       memory requirements as well. As another example, consider this pattern:

         ([^<]|<(?!inet))+

       It matches from wherever it starts until it encounters "<inet" or the
       end of the data, and is the kind of pattern that might be used when
       processing an XML file. Each iteration of the outer parentheses matches
       either one character that is not "<" or a "<" that is not followed by
       "inet". However, each time a parenthesis is processed, a backtracking
       position is passed, so this formulation uses a memory frame for each
       matched character. For a long string, a lot of memory is required.
       Consider now this rewritten pattern, which matches exactly the same
       strings:

         ([^<]++|<(?!inet))+

       This runs much faster, because sequences of characters that do not
       contain "<" are "swallowed" in one item inside the parentheses, and a
       possessive quantifier is used to stop any backtracking into the runs of
       non-"<" characters.
       This version also uses a lot less memory because entry to a new set of
       parentheses happens only when a "<" character that is not followed by
       "inet" is encountered (and we assume this is relatively rare).

       This example shows that one way of optimizing performance when matching
       long subject strings is to write repeated parenthesized subpatterns to
       match more than one character whenever possible.

   SETTING RESOURCE LIMITS

       You can set limits on the amount of processing that takes place when
       matching, and on the amount of heap memory that is used. The default
       values of the limits are very large, and unlikely ever to operate. They
       can be changed when PCRE2 is built, and they can also be set when
       pcre2_match() or pcre2_dfa_match() is called. For details of these
       interfaces, see the pcre2build documentation and the section entitled
       "The match context" in the pcre2api documentation.

       The pcre2test test program has a modifier called "find_limits" which,
       if applied to a subject line, causes it to find the smallest limits
       that allow a pattern to match. This is done by repeatedly matching with
       different limits.
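
       The idea of finding a smallest workable limit by repeated matching can
       be sketched as follows. The text above does not say which limits
       pcre2test tries or in what order, so the doubling-then-bisection
       strategy here is an assumption, and every name in the sketch is
       hypothetical. The callback stands in for "run the match with this
       limit and report whether it succeeded"; the stub provided simply
       pretends that a fixed minimum limit is required.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: nonzero if the match succeeds under 'limit'. */
typedef int (*try_match_fn)(uint32_t limit, void *user_data);

/* Return the smallest limit in [1, max_limit] for which try_match succeeds,
   assuming that success at one limit implies success at every larger one.
   Returns 0 if the match fails even at max_limit. */
uint32_t find_smallest_limit(try_match_fn try_match, void *user_data,
                             uint32_t max_limit)
{
    uint32_t lo = 0;   /* largest limit known to fail (0 = none tried) */
    uint32_t hi = 1;   /* candidate limit */

    assert(try_match != NULL && max_limit >= 1);

    /* Phase 1: double the limit until the match succeeds. */
    while (!try_match(hi, user_data)) {
        if (hi == max_limit) return 0;
        lo = hi;
        hi = (hi > max_limit / 2) ? max_limit : hi * 2;
    }

    /* Phase 2: bisect between the failing and the succeeding limit. */
    while (hi - lo > 1) {
        uint32_t mid = lo + (hi - lo) / 2;
        if (try_match(mid, user_data)) hi = mid; else lo = mid;
    }
    return hi;
}

/* Stand-in for a real match: succeeds once the limit reaches *user_data. */
int stub_needs_at_least(uint32_t limit, void *user_data)
{
    return limit >= *(const uint32_t *)user_data;
}
```

       In a real program the callback would set the limit in a match context
       and call the matching function; the stub only makes the search logic
       easy to exercise on its own.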


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 27 July 2022
       Copyright (c) 1997-2022 University of Cambridge.


PCRE2 10.41                      27 July 2022                  PCRE2PERFORM(3)
------------------------------------------------------------------------------



PCRE2POSIX(3)              Library Functions Manual              PCRE2POSIX(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


SYNOPSIS

       #include <pcre2posix.h>

       int pcre2_regcomp(regex_t *preg, const char *pattern,
            int cflags);

       int pcre2_regexec(const regex_t *preg, const char *string,
            size_t nmatch, regmatch_t pmatch[], int eflags);

       size_t pcre2_regerror(int errcode, const regex_t *preg,
            char *errbuf, size_t errbuf_size);

       void pcre2_regfree(regex_t *preg);


DESCRIPTION

       This set of functions provides a POSIX-style API for the PCRE2 regular
       expression 8-bit library. There are no POSIX-style wrappers for PCRE2's
       16-bit and 32-bit libraries. See the pcre2api documentation for a
       description of PCRE2's native API, which contains much additional
       functionality.

       IMPORTANT NOTE: The functions described here are NOT thread-safe, and
       should not be used in multi-threaded applications. They are also
       limited to processing subjects that are not bigger than 2GB. Use the
       native API instead.

       These functions are wrapper functions that ultimately call the PCRE2
       native API. Their prototypes are defined in the pcre2posix.h header
       file, and they all have unique names starting with pcre2_. However, the
       pcre2posix.h header also contains macro definitions that convert the
       standard POSIX names such as regcomp() into pcre2_regcomp() etc. This
       means that a program can use the usual POSIX names without running the
       risk of accidentally linking with POSIX functions from a different
       library.

       On Unix-like systems the PCRE2 POSIX library is called libpcre2-posix,
       so can be accessed by adding -lpcre2-posix to the command for linking
       an application.
       Because the POSIX functions call the native ones, it is also necessary
       to add -lpcre2-8.

       On Windows systems, if you are linking to a DLL version of the library,
       it is recommended that PCRE2POSIX_SHARED is defined before including
       the pcre2posix.h header, as it will allow for a more efficient way to
       invoke the functions by adding the __declspec(dllimport) decorator.

       Although they were not defined as prototypes in pcre2posix.h, releases
       10.33 to 10.36 of the library contained functions with the POSIX names
       regcomp() etc. These simply passed their arguments to the PCRE2
       functions. These functions were provided for backwards compatibility
       with earlier versions of PCRE2, which had only POSIX names. However,
       this has proved troublesome in situations where a program links with
       several libraries, some of which use PCRE2's POSIX interface while
       others use the real POSIX functions. For this reason, the POSIX names
       have been removed since release 10.37.

       Calling the header file pcre2posix.h avoids any conflict with other
       POSIX libraries. It can, of course, be renamed or aliased as regex.h,
       which is the "correct" name, if there is no clash. It provides two
       structure types, regex_t for compiled internal forms, and regmatch_t
       for returning captured substrings.
       It also defines some constants whose names start with "REG_"; these are
       used for setting options and identifying error codes.


USING THE POSIX FUNCTIONS

       Note that these functions are just POSIX-style wrappers for PCRE2's
       native API. They do not give POSIX regular expression behaviour, and
       they are not thread-safe or even POSIX compatible.

       Those POSIX option bits that can reasonably be mapped to PCRE2 native
       options have been implemented. In addition, the option REG_EXTENDED is
       defined with the value zero. This has no effect, but since programs
       that are written to the POSIX interface often use it, this makes it
       easier to slot in PCRE2 as a replacement library. Other POSIX options
       are not even defined.

       There are also some options that are not defined by POSIX. These have
       been added at the request of users who want to make use of certain
       PCRE2-specific features via the POSIX calling interface or to add BSD
       or GNU functionality.

       When PCRE2 is called via these functions, it is only the API that is
       POSIX-like in style. The syntax and semantics of the regular
       expressions themselves are still those of Perl, subject to the setting
       of various PCRE2 options, as described below.
       "POSIX-like in style" means that the API approximates to the POSIX
       definition; it is not fully POSIX-compatible, and in multi-unit
       encoding domains it is probably even less compatible.

       The descriptions below use the actual names of the functions, but, as
       described above, the standard POSIX names (without the pcre2_ prefix)
       may also be used.


COMPILING A PATTERN

       The function pcre2_regcomp() is called to compile a pattern into an
       internal form. By default, the pattern is a C string terminated by a
       binary zero (but see REG_PEND below). The preg argument is a pointer to
       a regex_t structure that is used as a base for storing information
       about the compiled regular expression. It is also used for input when
       REG_PEND is set. The regex_t structure used by pcre2_regcomp() is
       defined in pcre2posix.h and is not the same as the structure used by
       other libraries that provide POSIX-style matching.

       The argument cflags is either zero, or contains one or more of the bits
       defined by the following macros:

         REG_DOTALL

       The PCRE2_DOTALL option is set when the regular expression is passed
       for compilation to the native function.
       Note that REG_DOTALL is not part of the POSIX standard.

         REG_ICASE

       The PCRE2_CASELESS option is set when the regular expression is passed
       for compilation to the native function.

         REG_NEWLINE

       The PCRE2_MULTILINE option is set when the regular expression is passed
       for compilation to the native function. Note that this does not mimic
       the defined POSIX behaviour for REG_NEWLINE (see the following
       section).

         REG_NOSPEC

       The PCRE2_LITERAL option is set when the regular expression is passed
       for compilation to the native function. This disables all meta
       characters in the pattern, causing it to be treated as a literal
       string. The only other options that are allowed with REG_NOSPEC are
       REG_ICASE, REG_NOSUB, REG_PEND, and REG_UTF. Note that REG_NOSPEC is
       not part of the POSIX standard.

         REG_NOSUB

       When a pattern that is compiled with this flag is passed to
       pcre2_regexec() for matching, the nmatch and pmatch arguments are
       ignored, and no captured strings are returned.
       Versions of the PCRE2 library prior to 10.22 used to set the
       PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens
       because it disables the use of backreferences.

         REG_PEND

       If this option is set, the re_endp field in the preg structure (which
       has the type const char *) must be set to point to the character beyond
       the end of the pattern before calling pcre2_regcomp(). The pattern
       itself may now contain binary zeros, which are treated as data
       characters. Without REG_PEND, a binary zero terminates the pattern and
       the re_endp field is ignored. This is a GNU extension to the POSIX
       standard and should be used with caution in software intended to be
       portable to other systems.

         REG_UCP

       The PCRE2_UCP option is set when the regular expression is passed for
       compilation to the native function. This causes PCRE2 to use Unicode
       properties when matching \d, \w, etc., instead of just recognizing
       ASCII values. Note that REG_UCP is not part of the POSIX standard.

         REG_UNGREEDY

       The PCRE2_UNGREEDY option is set when the regular expression is passed
       for compilation to the native function. Note that REG_UNGREEDY is not
       part of the POSIX standard.

         REG_UTF

       The PCRE2_UTF option is set when the regular expression is passed for
       compilation to the native function. This causes the pattern itself and
       all data strings used for matching it to be treated as UTF-8 strings.
       Note that REG_UTF is not part of the POSIX standard.

       In the absence of these flags, no options are passed to the native
       function. This means that the regex is compiled with PCRE2 default
       semantics. In particular, the way it handles newline characters in the
       subject string is the Perl way, not the POSIX way. Note that setting
       PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
       It does not affect the way newlines are matched by the dot
       metacharacter (they are not) or by a negative class such as [^a] (they
       are).

       The yield of pcre2_regcomp() is zero on success, and non-zero
       otherwise. The preg structure is filled in on success, and one other
       member of the structure (as well as re_endp) is public: re_nsub
       contains the number of capturing subpatterns in the regular expression.
       Various error codes are defined in the header file.

       NOTE: If the yield of pcre2_regcomp() is non-zero, you must not attempt
       to use the contents of the preg structure.
       If, for example, you pass it to pcre2_regexec(), the result is
       undefined and your program is likely to crash.


MATCHING NEWLINE CHARACTERS

       This area is not simple, because POSIX and Perl take different views of
       things. It is not possible to get PCRE2 to obey POSIX semantics, but
       then PCRE2 was never intended to be a POSIX engine. The following table
       lists the different possibilities for matching newline characters in
       Perl and PCRE2:

                                 Default   Change with

         . matches newline          no     PCRE2_DOTALL
         newline matches [^a]       yes    not changeable
         $ matches \n at end        yes    PCRE2_DOLLAR_ENDONLY
         $ matches \n in middle     no     PCRE2_MULTILINE
         ^ matches \n in middle     no     PCRE2_MULTILINE

       This is the equivalent table for a POSIX-compatible pattern matcher:

                                 Default   Change with

         . matches newline          yes    REG_NEWLINE
         newline matches [^a]       yes    REG_NEWLINE
         $ matches \n at end        no     REG_NEWLINE
         $ matches \n in middle     no     REG_NEWLINE
         ^ matches \n in middle     no     REG_NEWLINE

       This behaviour is not what happens when PCRE2 is called via its POSIX
       API.
       By default, PCRE2's behaviour is the same as Perl's, except that there
       is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and
       Perl, there is no way to stop newline from matching [^a].

       Default POSIX newline handling can be obtained by setting PCRE2_DOTALL
       and PCRE2_DOLLAR_ENDONLY when calling pcre2_compile() directly, but
       there is no way to make PCRE2 behave exactly as for the REG_NEWLINE
       action. When using the POSIX API, passing REG_NEWLINE to PCRE2's
       pcre2_regcomp() function causes PCRE2_MULTILINE to be passed to
       pcre2_compile(), and REG_DOTALL passes PCRE2_DOTALL. There is no way to
       pass PCRE2_DOLLAR_ENDONLY.


MATCHING A PATTERN

       The function pcre2_regexec() is called to match a compiled pattern preg
       against a given string, which is by default terminated by a zero byte
       (but see REG_STARTEND below), subject to the options in eflags. These
       can be:

         REG_NOTBOL

       The PCRE2_NOTBOL option is set when calling the underlying PCRE2
       matching function.

         REG_NOTEMPTY

       The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2
       matching function. Note that REG_NOTEMPTY is not part of the POSIX
       standard.
       However, setting this option can give more POSIX-like behaviour in
       some situations.

         REG_NOTEOL

       The PCRE2_NOTEOL option is set when calling the underlying PCRE2
       matching function.

         REG_STARTEND

       When this option is set, the subject string starts at string +
       pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which should
       point to the first character beyond the string. There may be binary
       zeros within the subject string, and indeed, using REG_STARTEND is the
       only way to pass a subject string that contains a binary zero.

       Whatever the value of pmatch[0].rm_so, the offsets of the matched
       string and any captured substrings are still given relative to the
       start of string itself. (Before PCRE2 release 10.30 these were given
       relative to string + pmatch[0].rm_so, but this differs from other
       implementations.)

       This is a BSD extension, compatible with but not specified by IEEE
       Standard 1003.2 (POSIX.2), and should be used with caution in software
       intended to be portable to other systems. Note that a non-zero rm_so
       does not imply REG_NOTBOL; REG_STARTEND affects only the location and
       length of the string, not how it is matched.
       Setting REG_STARTEND and passing pmatch as NULL are mutually exclusive;
       if both are done, the error REG_INVARG is returned.

       If the pattern was compiled with the REG_NOSUB flag, no data about any
       matched strings is returned. The nmatch and pmatch arguments of
       pcre2_regexec() are ignored (except possibly as input for
       REG_STARTEND).

       The value of nmatch may be zero, and the value of pmatch may be NULL
       (unless REG_STARTEND is set); in both these cases no data about any
       matched strings is returned.

       Otherwise, the portion of the string that was matched, and also any
       captured substrings, are returned via the pmatch argument, which points
       to an array of nmatch structures of type regmatch_t, containing the
       members rm_so and rm_eo. These contain the byte offset to the first
       character of each substring and the offset to the first character after
       the end of each substring, respectively. The 0th element of the vector
       relates to the entire portion of string that was matched; subsequent
       elements relate to the capturing subpatterns of the regular expression.
       Unused entries in the array have both structure members set to -1.
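
       As a minimal sketch of how the pmatch array is read, the helper below
       compiles a pattern, runs one match, and leaves the offsets in the
       caller's array. The name first_match() is hypothetical. The sketch is
       written against the POSIX names regcomp(), regexec() and regfree() from
       the system <regex.h> so that it builds without PCRE2 installed; it
       assumes that, when pcre2posix.h is included instead, the same calls
       reach pcre2_regcomp(), pcre2_regexec() and pcre2_regfree().

```c
#include <regex.h>   /* assumption: swap for <pcre2posix.h> to use PCRE2 */
#include <stddef.h>

/* Hypothetical helper: compile `pattern`, match it once against `subject`,
   and store up to `nmatch` offset pairs in `pmatch`. Returns 0 on a match,
   REG_NOMATCH when nothing matches, or another non-zero error code. */
int first_match(const char *pattern, const char *subject,
                regmatch_t *pmatch, size_t nmatch)
{
    regex_t preg;
    int rc = regcomp(&preg, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;            /* preg must not be used after a failed compile */

    rc = regexec(&preg, subject, nmatch, pmatch, 0);
    regfree(&preg);           /* releases the memory tied to preg */
    return rc;
}
```

       For the pattern (a)|(b) and the subject "xxb", pmatch[0] and pmatch[2]
       both cover offsets 2 to 3, while pmatch[1], whose group took no part in
       the match, has both members set to -1.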

       regmatch_t as well as the regoff_t typedef it uses are defined in
       pcre2posix.h and are not warranted to have the same size or layout as
       other similarly named types from other libraries that provide
       POSIX-style matching.

       A successful match yields a zero return; various error codes are
       defined in the header file, of which REG_NOMATCH is the "expected"
       failure code.


ERROR MESSAGES

       The pcre2_regerror() function maps a non-zero errorcode from either
       pcre2_regcomp() or pcre2_regexec() to a printable message. If preg is
       not NULL, the error should have arisen from the use of that structure.
       A message terminated by a binary zero is placed in errbuf. If the
       buffer is too short, only the first errbuf_size - 1 characters of the
       error message are used. The yield of the function is the size of buffer
       needed to hold the whole message, including the terminating zero. This
       value is greater than errbuf_size if the message was truncated.


MEMORY USAGE

       Compiling a regular expression causes memory to be allocated and
       associated with the preg structure.
       The function pcre2_regfree() frees all such memory, after which preg
       may no longer be used as a compiled expression.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 19 January 2024
       Copyright (c) 1997-2024 University of Cambridge.


PCRE2 10.43                     19 January 2024                  PCRE2POSIX(3)
------------------------------------------------------------------------------



PCRE2SAMPLE(3)             Library Functions Manual             PCRE2SAMPLE(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


PCRE2 SAMPLE PROGRAM

       A simple, complete demonstration program to get you started with using
       PCRE2 is supplied in the file pcre2demo.c in the src directory in the
       PCRE2 distribution. A listing of this program is given in the pcre2demo
       documentation. If you do not have a copy of the PCRE2 distribution, you
       can save this listing to re-create the contents of pcre2demo.c.

       The demonstration program compiles the regular expression that is its
       first argument, and matches it against the subject string in its second
       argument. No PCRE2 options are set, and default character tables are
       used. If matching succeeds, the program outputs the portion of the
       subject that matched, together with the contents of any captured
       substrings.

       If the -g option is given on the command line, the program then goes on
       to check for further matches of the same regular expression in the same
       subject string. The logic is a little bit tricky because of the
       possibility of matching an empty string. Comments in the code explain
       what is going on.

       The code in pcre2demo.c is an 8-bit program that uses the PCRE2 8-bit
       library. It handles strings and characters that are stored in 8-bit
       code units. By default, one character corresponds to one code unit, but
       if the pattern starts with "(*UTF)", both it and the subject are
       treated as UTF-8 strings, where characters may occupy multiple code
       units.

       If PCRE2 is installed in the standard include and library directories
       for your operating system, you should be able to compile the
       demonstration program using a command like this:

         cc -o pcre2demo pcre2demo.c -lpcre2-8

       If PCRE2 is installed elsewhere, you may need to add additional options
       to the command line. For example, on a Unix-like system that has PCRE2
       installed in /usr/local, you can compile the demonstration program
       using a command like this:

         cc -o pcre2demo -I/usr/local/include pcre2demo.c \
            -L/usr/local/lib -lpcre2-8

       Once you have built the demonstration program, you can run simple tests
       like this:

         ./pcre2demo 'cat|dog' 'the cat sat on the mat'
         ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'

       Note that there is a much more comprehensive test program, called
       pcre2test, which supports many more facilities for testing regular
       expressions using all three PCRE2 libraries (8-bit, 16-bit, and 32-bit,
       though not all three need be installed). The pcre2demo program is
       provided as a relatively simple coding example.

       If you try to run pcre2demo when PCRE2 is not installed in the standard
       library directory, you may get an error like this on some operating
       systems (e.g. Solaris):

         ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file
         or directory

       This is caused by the way shared library support works on those
       systems. You need to add

         -R/usr/local/lib

       (for example) to the compile command to get round this problem.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 02 February 2016
       Copyright (c) 1997-2016 University of Cambridge.


PCRE2 10.22                    02 February 2016                 PCRE2SAMPLE(3)
------------------------------------------------------------------------------

PCRE2SERIALIZE(3)          Library Functions Manual          PCRE2SERIALIZE(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS

       int32_t pcre2_serialize_decode(pcre2_code **codes,
         int32_t number_of_codes, const uint8_t *bytes,
         pcre2_general_context *gcontext);

       int32_t pcre2_serialize_encode(const pcre2_code **codes,
         int32_t number_of_codes, uint8_t **serialized_bytes,
         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);

       void pcre2_serialize_free(uint8_t *bytes);

       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);

       If you are running an application that uses a large number of regular
       expression patterns, it may be useful to store them in a precompiled
       form instead of having to compile them every time the application is
       run.
       However, if you are using the just-in-time optimization feature, it is
       not possible to save and reload the JIT data, because it is
       position-dependent. The host on which the patterns are reloaded must be
       running the same version of PCRE2, with the same code unit width, and
       must also have the same endianness, pointer width and PCRE2_SIZE type.
       For example, patterns compiled on a 32-bit system using PCRE2's 16-bit
       library cannot be reloaded on a 64-bit system, nor can they be reloaded
       using the 8-bit library.

       Note that "serialization" in PCRE2 does not convert compiled patterns
       to an abstract format like Java or .NET serialization. The serialized
       output is really just a bytecode dump, which is why it can only be
       reloaded in the same environment as the one that created it. Hence the
       restrictions mentioned above. Applications that are not statically
       linked with a fixed version of PCRE2 must be prepared to recompile
       patterns from their sources, in order to be immune to PCRE2 upgrades.


SECURITY CONCERNS

       The facility for saving and restoring compiled patterns is intended for
       use within individual applications. As such, the data supplied to
       pcre2_serialize_decode() is expected to be trusted data, not data from
       arbitrary external sources.
       There is only some simple consistency checking, not complete validation
       of what is being re-loaded. Corrupted data may cause undefined results.
       For example, if the length field of a pattern in the serialized data is
       corrupted, the deserializing code may read beyond the end of the byte
       stream that is passed to it.


SAVING COMPILED PATTERNS

       Before compiled patterns can be saved they must be serialized, which in
       PCRE2 means converting the pattern to a stream of bytes. A single byte
       stream may contain any number of compiled patterns, but they must all
       use the same character tables. A single copy of the tables is included
       in the byte stream (its size is 1088 bytes). For more details of
       character tables, see the section on locale support in the pcre2api
       documentation.

       The function pcre2_serialize_encode() creates a serialized byte stream
       from a list of compiled patterns. Its first two arguments specify the
       list, being a pointer to a vector of pointers to compiled patterns, and
       the length of the vector. The third and fourth arguments point to
       variables which are set to point to the created byte stream and its
       length, respectively. The final argument is a pointer to a general
       context, which can be used to specify custom memory management
       functions.
       If this argument is NULL, malloc() is used to obtain memory for the
       byte stream. The yield of the function is the number of serialized
       patterns, or one of the following negative error codes:

         PCRE2_ERROR_BADDATA      the number of patterns is zero or less
         PCRE2_ERROR_BADMAGIC     mismatch of id bytes in one of the patterns
         PCRE2_ERROR_NOMEMORY     memory allocation failed
         PCRE2_ERROR_MIXEDTABLES  the patterns do not all use the same tables
         PCRE2_ERROR_NULL         the 1st, 3rd, or 4th argument is NULL

       PCRE2_ERROR_BADMAGIC means either that a pattern's code has been
       corrupted, or that a slot in the vector does not point to a compiled
       pattern.

       Once a set of patterns has been serialized you can save the data in any
       appropriate manner. Here is sample code that compiles two patterns and
       writes them to a file. It assumes that the variable fd refers to a file
       that is open for output. The error checking that should be present in a
       real application has been omitted for simplicity.

         int errorcode;
         uint8_t *bytes;
         PCRE2_SIZE erroroffset;
         PCRE2_SIZE bytescount;
         pcre2_code *list_of_codes[2];
         /* Compile the two patterns; PCRE2_SPTR is the code unit pointer type. */
         list_of_codes[0] = pcre2_compile((PCRE2_SPTR)"first pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
         list_of_codes[1] = pcre2_compile((PCRE2_SPTR)"second pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
         /* Serialize both patterns into one malloc()ed byte stream. */
         errorcode = pcre2_serialize_encode((const pcre2_code **)list_of_codes,
           2, &bytes, &bytescount, NULL);
         /* fd is a FILE * that is open for binary output. */
         errorcode = fwrite(bytes, 1, bytescount, fd);

       Note that the serialized data is binary data that may contain any of
       the 256 possible byte values. On systems that make a distinction
       between binary and non-binary data, be sure that the file is opened for
       binary output.

       Serializing a set of patterns leaves the original data untouched, so
       they can still be used for matching. Their memory must eventually be
       freed in the usual way by calling pcre2_code_free(). When you have
       finished with the byte stream, it too must be freed by calling
       pcre2_serialize_free(). If this function is called with a NULL
       argument, it returns immediately without doing anything.
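
       The point about binary output can be sketched as follows. The helper
       names save_stream() and load_stream() are hypothetical, and only
       standard C I/O is used; in an application, the bytes passed to
       save_stream() would be those produced by pcre2_serialize_encode(). The
       loader is included only so that the round trip can be checked.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write `size` bytes to `path` in binary mode.
   Returns 0 on success and -1 on any failure. */
int save_stream(const char *path, const uint8_t *bytes, size_t size)
{
    FILE *f = fopen(path, "wb");   /* "b" matters where text files differ */
    if (f == NULL)
        return -1;
    size_t written = fwrite(bytes, 1, size, f);
    int close_failed = (fclose(f) != 0);
    return (written == size && !close_failed) ? 0 : -1;
}

/* Hypothetical helper: read the whole of `path` into a malloc()ed buffer.
   On success the length is stored in *size_out and the caller must free()
   the result; on failure NULL is returned. */
uint8_t *load_stream(const char *path, size_t *size_out)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    uint8_t *buf = NULL;
    size_t used = 0, cap = 0;
    for (;;) {
        if (used == cap) {                       /* grow the buffer */
            size_t new_cap = cap ? cap * 2 : 4096;
            uint8_t *tmp = realloc(buf, new_cap);
            if (tmp == NULL) { free(buf); fclose(f); return NULL; }
            buf = tmp;
            cap = new_cap;
        }
        size_t n = fread(buf + used, 1, cap - used, f);
        used += n;
        if (n == 0)                              /* end of file or error */
            break;
    }
    int failed = ferror(f);
    fclose(f);
    if (failed) { free(buf); return NULL; }
    *size_out = used;
    return buf;
}
```

       Opening with "wb" and "rb" keeps bytes such as 0x0A, 0x0D and 0x1A
       intact on systems that would otherwise translate them in text mode.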


RE-USING PRECOMPILED PATTERNS

       In order to re-use a set of saved patterns you must first make the
       serialized byte stream available in main memory (for example, by
       reading from a file). The management of this memory block is up to the
       application. You can use the pcre2_serialize_get_number_of_codes()
       function to find out how many compiled patterns are in the serialized
       data without actually decoding the patterns:

         uint8_t *bytes = <serialized data>;
         int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes);

       The pcre2_serialize_decode() function reads a byte stream and recreates
       the compiled patterns in new memory blocks, setting pointers to them in
       a vector. The first two arguments are a pointer to a suitable vector
       and its length, and the third argument points to a byte stream. The
       final argument is a pointer to a general context, which can be used to
       specify custom memory management functions for the decoded patterns. If
       this argument is NULL, malloc() and free() are used. After
       deserialization, the byte stream is no longer needed and can be
       discarded.

         pcre2_code *list_of_codes[2];
         uint8_t *bytes = <serialized data>;
         int32_t number_of_codes =
           pcre2_serialize_decode(list_of_codes, 2, bytes, NULL);

       If the vector is not large enough for all the patterns in the byte
       stream, it is filled with those that fit, and the remainder are
       ignored. The yield of the function is the number of decoded patterns,
       or one of the following negative error codes:

         PCRE2_ERROR_BADDATA            second argument is zero or less
         PCRE2_ERROR_BADMAGIC           mismatch of id bytes in the data
         PCRE2_ERROR_BADMODE            mismatch of code unit size or PCRE2 version
         PCRE2_ERROR_BADSERIALIZEDDATA  other sanity check failure
         PCRE2_ERROR_NOMEMORY           memory allocation failed
         PCRE2_ERROR_NULL               first or third argument is NULL

       PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was
       compiled on a system with different endianness.

       Decoded patterns can be used for matching in the usual way, and must be
       freed by calling pcre2_code_free(). However, be aware that there is a
       potential race issue if you are using multiple patterns that were
       decoded from a single byte stream in a multithreaded application.
       A single copy of the character tables is used by all the decoded
       patterns and a reference count is used to arrange for its memory to be
       automatically freed when the last pattern is freed, but there is no
       locking on this reference count. Therefore, if you want to call
       pcre2_code_free() for these patterns in different threads, you must
       arrange your own locking, and ensure that pcre2_code_free() cannot be
       called by two threads at the same time.

       If a pattern was processed by pcre2_jit_compile() before being
       serialized, the JIT data is discarded and so is no longer available
       after a save/restore cycle. You can, however, process a restored
       pattern with pcre2_jit_compile() if you wish.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 27 June 2018
       Copyright (c) 1997-2018 University of Cambridge.


PCRE2 10.32                      27 June 2018                PCRE2SERIALIZE(3)
------------------------------------------------------------------------------



PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY

       The full syntax and semantics of the regular expressions that are
       supported by PCRE2 are described in the pcre2pattern documentation.
       This document contains a quick-reference summary of the syntax.


QUOTING

         \x         where x is non-alphanumeric is a literal x
         \Q...\E    treat enclosed characters as literal

       Note that white space inside \Q...\E is always treated as literal, even
       if PCRE2_EXTENDED is set, causing most other white space to be ignored.
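
       When a pattern has to contain text taken from elsewhere, \Q...\E can
       carry it as a literal. The sketch below shows one way to build such a
       fragment; quote_literal() is a hypothetical helper and not part of
       PCRE2. Any \E inside the text is handled by closing the quoted run,
       adding an escaped backslash followed by E, and reopening the run.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc()ed pattern fragment that matches
   `literal` exactly by wrapping it in \Q...\E. The caller must free() the
   result. Returns NULL if memory cannot be obtained. */
char *quote_literal(const char *literal)
{
    size_t len = strlen(literal);
    size_t count = 0;

    /* Count the non-overlapping occurrences of "\E" in the input. */
    for (const char *p = literal; (p = strstr(p, "\\E")) != NULL; p += 2)
        count++;

    /* Each 2-character "\E" becomes the 7-character "\E\\E\Q" (5 extra),
       plus 4 characters for the outer \Q and \E, plus the final zero. */
    char *out = malloc(len + 5 * count + 4 + 1);
    if (out == NULL)
        return NULL;

    char *w = out;
    memcpy(w, "\\Q", 2);
    w += 2;

    const char *p = literal;
    const char *hit;
    while ((hit = strstr(p, "\\E")) != NULL) {
        size_t n = (size_t)(hit - p);
        memcpy(w, p, n);                 /* text before the embedded \E */
        w += n;
        memcpy(w, "\\E\\\\E\\Q", 7);     /* close, literal \E, reopen */
        w += 7;
        p = hit + 2;
    }
    strcpy(w, p);                        /* remaining text */
    w += strlen(p);
    memcpy(w, "\\E", 3);                 /* closing \E and terminating zero */
    return out;
}
```

       For the input a.b the result is \Qa.b\E, in which the dot no longer
       acts as a metacharacter.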


BRACED ITEMS

       With one exception, wherever brace characters { and } are required to
       enclose data for constructions such as \g{2} or \k{name}, space and/or
       horizontal tab characters that follow { or precede } are allowed and
       are ignored. In the case of quantifiers, they may also appear before or
       after the comma. The exception is \u{...} which is not Perl-compatible
       and is recognized only when PCRE2_EXTRA_ALT_BSUX is set. This is an
       ECMAScript compatibility feature, and follows ECMAScript's behaviour.


ESCAPED CHARACTERS

       This table applies to ASCII and Unicode environments. An unrecognized
       escape sequence causes an error.

         \a         alarm, that is, the BEL character (hex 07)
         \cx        "control-x", where x is a non-control ASCII character
         \e         escape (hex 1B)
         \f         form feed (hex 0C)
         \n         newline (hex 0A)
         \r         carriage return (hex 0D)
         \t         tab (hex 09)
         \0dd       character with octal code 0dd
         \ddd       character with octal code ddd, or backreference
         \o{ddd..}  character with octal code ddd..
         \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
         \xhh       character with hex code hh
         \x{hh..}   character with hex code hh..

       If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
       following are also recognized:

         \U         the character "U"
         \uhhhh     character with hex code hhhh
         \u{hh..}   character with hex code hh.. but only for EXTRA_ALT_BSUX

       When \x is not followed by {, from zero to two hexadecimal digits are
       read, but in ALT_BSUX mode \x must be followed by two hexadecimal
       digits to be recognized as a hexadecimal escape; otherwise it matches a
       literal "x". Likewise, if \u (in ALT_BSUX mode) is not followed by
       four hexadecimal digits or (in EXTRA_ALT_BSUX mode) a sequence of hex
       digits in curly brackets, it matches a literal "u".

       Note that \0dd is always an octal code. The treatment of backslash
       followed by a non-zero digit is complicated; for details see the
       section "Non-printing characters" in the pcre2pattern documentation,
       where details of escape processing in EBCDIC environments are also
       given. \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not
       supported in EBCDIC environments. Note that \N not followed by an
       opening curly bracket has a different meaning (see below).
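
       The rule for \x without a following { can be modelled in a few lines
       of C. This is only a sketch of the rule as described above, not code
       taken from PCRE2; the names parse_backslash_x and hex_value are
       hypothetical, and treating zero digits as character code 0 in the
       default mode is an assumption of the sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Value of one hexadecimal digit; the caller guarantees isxdigit(c). */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    return tolower(c) - 'a' + 10;
}

/* Hypothetical model of \x when it is NOT followed by "{".
   `s` points just past the "\x". The return value is the code of the
   character that the escape stands for, and *used receives the number of
   hexadecimal digits consumed.
     default mode:  zero to two hex digits are read (zero digits is
                    modelled as code 0, an assumption of this sketch)
     ALT_BSUX mode: exactly two hex digits are required; otherwise the
                    result is a literal 'x' and no digits are consumed */
static int parse_backslash_x(const char *s, int alt_bsux, size_t *used)
{
    size_t n = 0;
    int value = 0;

    assert(s != NULL && used != NULL);
    while (n < 2 && isxdigit((unsigned char)s[n])) {
        value = value * 16 + hex_value((unsigned char)s[n]);
        n++;
    }
    if (alt_bsux && n < 2) {
        *used = 0;
        return 'x';
    }
    *used = n;
    return value;
}
```

       Given the text 4Z after \x, the sketch reports code 4 with one digit
       consumed in the default mode, and a literal x with no digits consumed
       in ALT_BSUX mode.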


CHARACTER TYPES

         .          any character except newline;
                      in dotall mode, any character whatsoever
         \C         one code unit, even in UTF mode (best avoided)
         \d         a decimal digit
         \D         a character that is not a decimal digit
         \h         a horizontal white space character
         \H         a character that is not a horizontal white space character
         \N         a character that is not a newline
         \p{xx}     a character with the xx property
         \P{xx}     a character without the xx property
         \R         a newline sequence
         \s         a white space character
         \S         a character that is not a white space character
         \v         a vertical white space character
         \V         a character that is not a vertical white space character
         \w         a "word" character
         \W         a "non-word" character
         \X         a Unicode extended grapheme cluster

       \C is dangerous because it may leave the current matching point in the
       middle of a UTF-8 or UTF-16 character. The application can lock out the
       use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also
       possible to build PCRE2 with the use of \C permanently disabled.

       By default, \d, \s, and \w match only ASCII characters, even in UTF-8
       mode or in the 16-bit and 32-bit libraries.
       However, if locale-specific matching is happening, \s and \w may also
       match characters with code points in the range 128-255. If the
       PCRE2_UCP option is set, the behaviour of these escape sequences is
       changed to use Unicode properties and they match many more characters,
       but there are some option settings that can restrict individual
       sequences to matching only ASCII characters.

       Property descriptions in \p and \P are matched caselessly; hyphens,
       underscores, and white space are ignored, in accordance with Unicode's
       "loose matching" rules.


GENERAL CATEGORY PROPERTIES FOR \p AND \P

         C          Other
         Cc         Control
         Cf         Format
         Cn         Unassigned
         Co         Private use
         Cs         Surrogate

         L          Letter
         Ll         Lower case letter
         Lm         Modifier letter
         Lo         Other letter
         Lt         Title case letter
         Lu         Upper case letter
         Lc         Ll, Lu, or Lt
         L&         Ll, Lu, or Lt

         M          Mark
         Mc         Spacing mark
         Me         Enclosing mark
         Mn         Non-spacing mark

         N          Number
         Nd         Decimal number
         Nl         Letter number
         No         Other number

         P          Punctuation
         Pc         Connector punctuation
         Pd         Dash punctuation
         Pe         Close punctuation
         Pf         Final punctuation
         Pi         Initial punctuation
         Po         Other punctuation
         Ps         Open punctuation

         S          Symbol
         Sc         Currency symbol
         Sk         Modifier symbol
         Sm         Mathematical symbol
         So         Other symbol

         Z          Separator
         Zl         Line separator
         Zp         Paragraph separator
         Zs         Space separator


PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p AND \P

         Xan        Alphanumeric: union of properties L and N
         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
         Xsp        Perl space: property Z or tab, NL, VT, FF, CR
         Xuc        Universally-named character: one that can be
                      represented by a Universal Character Name
         Xwd        Perl word: property Xan or underscore

       Perl and POSIX space are now the same. Perl added VT to its space
       character set at release 5.18.


BINARY PROPERTIES FOR \p AND \P

       Unicode defines a number of binary properties, that is, properties
       whose only values are true or false. You can obtain a list of those
       that are recognized by \p and \P, along with their abbreviations, by
       running this command:

         pcre2test -LP


SCRIPT MATCHING WITH \p AND \P

       Many script names and their 4-letter abbreviations are recognized in
       \p{sc:...} or \p{scx:...} items, or on their own with \p (and also \P
       of course). You can obtain a list of these scripts by running this
       command:

         pcre2test -LS


THE BIDI_CLASS PROPERTY FOR \p AND \P

         \p{Bidi_Class:<class>}  matches a character with the given class
         \p{BC:<class>}          matches a character with the given class

       The recognized classes are:

         AL         Arabic letter
         AN         Arabic number
         B          paragraph separator
         BN         boundary neutral
         CS         common separator
         EN         European number
         ES         European separator
         ET         European terminator
         FSI        first strong isolate
         L          left-to-right
         LRE        left-to-right embedding
         LRI        left-to-right isolate
         LRO        left-to-right override
         NSM        non-spacing mark
         ON         other neutral
         PDF        pop directional format
         PDI        pop directional isolate
         R          right-to-left
         RLE        right-to-left embedding
         RLI        right-to-left isolate
         RLO        right-to-left override
         S          segment separator
         WS         white space


CHARACTER CLASSES

         [...]       positive character class
         [^...]      negative character class
         [x-y]       range (can be used for hex characters)
         [[:xxx:]]   positive POSIX named set
         [[:^xxx:]]  negative POSIX named set

         alnum       alphanumeric
         alpha       alphabetic
         ascii       0-127
         blank       space or tab
         cntrl       control character
         digit       decimal digit
         graph       printing, excluding space
         lower       lower case letter
         print       printing, including space
         punct       printing, excluding alphanumeric
         space       white space
         upper       upper case letter
         word        same as \w
         xdigit      hexadecimal digit

       In PCRE2, POSIX character set names recognize only ASCII characters by
       default, but some of them use Unicode properties if PCRE2_UCP is set.
       You can use \Q...\E inside a character class.


QUANTIFIERS

         ?           0 or 1, greedy
         ?+          0 or 1, possessive
         ??          0 or 1, lazy
         *           0 or more, greedy
         *+          0 or more, possessive
         *?          0 or more, lazy
         +           1 or more, greedy
         ++          1 or more, possessive
         +?          1 or more, lazy
         {n}         exactly n
         {n,m}       at least n, no more than m, greedy
         {n,m}+      at least n, no more than m, possessive
         {n,m}?      at least n, no more than m, lazy
         {n,}        n or more, greedy
         {n,}+       n or more, possessive
         {n,}?       n or more, lazy
         {,m}        zero up to m, greedy
         {,m}+       zero up to m, possessive
         {,m}?       zero up to m, lazy


ANCHORS AND SIMPLE ASSERTIONS

         \b          word boundary
         \B          not a word boundary
         ^           start of subject
                       also after an internal newline in multiline mode
                       (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
         \A          start of subject
         $           end of subject
                       also before newline at end of subject
                       also before internal newline in multiline mode
         \Z          end of subject
                       also before newline at end of subject
         \z          end of subject
         \G          first matching position in subject


REPORTED MATCH POINT SETTING

         \K          set reported start of match

       From release 10.38 \K is not permitted by default in lookaround
       assertions, for compatibility with Perl. However, if the
       PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK option is set, the previous behaviour
       is re-enabled. When this option is set, \K is honoured in positive
       assertions, but ignored in negative ones.


ALTERNATION

         expr|expr|expr...


CAPTURING

         (...)           capture group
         (?<name>...)    named capture group (Perl)
         (?'name'...)    named capture group (Perl)
         (?P<name>...)   named capture group (Python)
         (?:...)         non-capture group
         (?|...)         non-capture group; reset group numbers for
                          capture groups in each alternative

       In non-UTF modes, names may contain underscores and ASCII letters and
       digits; in UTF modes, any Unicode letters and Unicode decimal digits
       are permitted. In both cases, a name must not start with a digit.


ATOMIC GROUPS

         (?>...)         atomic non-capture group
         (*atomic:...)   atomic non-capture group


COMMENT

         (?#....)        comment (not nestable)


OPTION SETTING

       Changes of these options within a group are automatically cancelled at
       the end of the group.

         (?a)            all ASCII options
         (?aD)           restrict \d to ASCII in UCP mode
         (?aS)           restrict \s to ASCII in UCP mode
         (?aW)           restrict \w to ASCII in UCP mode
         (?aP)           restrict all POSIX classes to ASCII in UCP mode
         (?aT)           restrict POSIX digit classes to ASCII in UCP mode
         (?i)            caseless
         (?J)            allow duplicate named groups
         (?m)            multiline
         (?n)            no auto capture
         (?r)            restrict caseless to either ASCII or non-ASCII
         (?s)            single line (dotall)
         (?U)            default ungreedy (lazy)
         (?x)            ignore white space except in classes or \Q...\E
         (?xx)           as (?x) but also ignore space and tab in classes
         (?-...)         unset the given option(s)
         (?^)            unset imnrsx options

       (?aP) implies (?aT) as well, though this has no additional effect.
       However, it means that (?-aP) is really (?-PT) which disables all ASCII
       restrictions for POSIX classes.

       Unsetting x or xx unsets both.
       Several options may be set at once, and a mixture of setting and
       unsetting such as (?i-x) is allowed, but there may be only one hyphen.
       Setting (but no unsetting) is allowed after (?^ for example (?^in). An
       option setting may appear at the start of a non-capture group, for
       example (?i:...).

       The following are recognized only at the very start of a pattern or
       after one of the newline or \R options with similar syntax. More than
       one of them may appear. For the first three, d is a decimal number.

         (*LIMIT_DEPTH=d)     set the backtracking limit to d
         (*LIMIT_HEAP=d)      set the heap size limit to d * 1024 bytes
         (*LIMIT_MATCH=d)     set the match limit to d
         (*NOTEMPTY)          set PCRE2_NOTEMPTY when matching
         (*NOTEMPTY_ATSTART)  set PCRE2_NOTEMPTY_ATSTART when matching
         (*NO_AUTO_POSSESS)   no auto-possessification (PCRE2_NO_AUTO_POSSESS)
         (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
         (*NO_JIT)            disable JIT optimization
         (*NO_START_OPT)      no start-match optimization
                                (PCRE2_NO_START_OPTIMIZE)
         (*UTF)               set appropriate UTF mode for the library in use
         (*UCP)               set PCRE2_UCP (use Unicode properties for \d etc)

       Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
       value of the limits set by the caller of pcre2_match() or
       pcre2_dfa_match(), not increase them.
       LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The application
       can lock out the use of (*UTF) and (*UCP) by setting the
       PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile
       time.


NEWLINE CONVENTION

       These are recognized only at the very start of the pattern or after
       option settings with a similar syntax.

         (*CR)        carriage return only
         (*LF)        linefeed only
         (*CRLF)      carriage return followed by linefeed
         (*ANYCRLF)   all three of the above
         (*ANY)       any Unicode newline sequence
         (*NUL)       the NUL character (binary zero)


WHAT \R MATCHES

       These are recognized only at the very start of the pattern or after
       option settings with a similar syntax.

         (*BSR_ANYCRLF)   CR, LF, or CRLF
         (*BSR_UNICODE)   any Unicode newline sequence


LOOKAHEAD AND LOOKBEHIND ASSERTIONS

         (?=...)                     )
         (*pla:...)                  ) positive lookahead
         (*positive_lookahead:...)   )

         (?!...)                     )
         (*nla:...)                  ) negative lookahead
         (*negative_lookahead:...)   )

         (?<=...)                    )
         (*plb:...)                  ) positive lookbehind
         (*positive_lookbehind:...)  )

         (?<!...)                    )
         (*nlb:...)                  ) negative lookbehind
         (*negative_lookbehind:...)  )

       Each top-level branch of a lookbehind must have a limit for the number
       of characters it matches. If any branch can match a variable number of
       characters, the maximum for each branch is limited to a value set by
       the caller of pcre2_compile() or defaulted. The default is set when
       PCRE2 is built (ultimate default 255). If every branch matches a fixed
       number of characters, the limit for each branch is 65535 characters.


NON-ATOMIC LOOKAROUND ASSERTIONS

       These assertions are specific to PCRE2 and are not Perl-compatible.

         (?*...)                                )
         (*napla:...)                           ) synonyms
         (*non_atomic_positive_lookahead:...)   )

         (?<*...)                               )
         (*naplb:...)                           ) synonyms
         (*non_atomic_positive_lookbehind:...)  )


SCRIPT RUNS

         (*script_run:...)           ) script run, can be backtracked into
         (*sr:...)                   )

         (*atomic_script_run:...)    ) atomic script run
         (*asr:...)                  )


BACKREFERENCES

         \n              reference by number (can be ambiguous)
         \gn             reference by number
         \g{n}           reference by number
         \g+n            relative reference by number (PCRE2 extension)
         \g-n            relative reference by number
         \g{+n}          relative reference by number (PCRE2 extension)
         \g{-n}          relative reference by number
         \k<name>        reference by name (Perl)
         \k'name'        reference by name (Perl)
         \g{name}        reference by name (Perl)
         \k{name}        reference by name (.NET)
         (?P=name)       reference by name (Python)


SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

         (?R)            recurse whole pattern
         (?n)            call subroutine by absolute number
         (?+n)           call subroutine by relative number
         (?-n)           call subroutine by relative number
         (?&name)        call subroutine by name (Perl)
         (?P>name)       call subroutine by name (Python)
         \g<name>        call subroutine by name (Oniguruma)
         \g'name'        call subroutine by name (Oniguruma)
         \g<n>           call subroutine by absolute number (Oniguruma)
         \g'n'           call subroutine by absolute number (Oniguruma)
         \g<+n>          call subroutine by relative number (PCRE2 extension)
         \g'+n'          call subroutine by relative number (PCRE2 extension)
         \g<-n>          call subroutine by relative number (PCRE2 extension)
         \g'-n'          call subroutine by relative number (PCRE2 extension)


CONDITIONAL PATTERNS

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

         (?(n)               absolute reference condition
         (?(+n)              relative reference condition (PCRE2 extension)
         (?(-n)              relative reference condition (PCRE2 extension)
         (?(<name>)          named reference condition (Perl)
         (?('name')          named reference condition (Perl)
         (?(name)            named reference condition (PCRE2, deprecated)
         (?(R)               overall recursion condition
         (?(Rn)              specific numbered group recursion condition
         (?(R&name)          specific named group recursion condition
         (?(DEFINE)          define groups for reference
         (?(VERSION[>]=n.m)  test PCRE2 version
         (?(assert)          assertion condition

       Note the ambiguity of (?(R) and (?(Rn) which might be named reference
       conditions or recursion tests. Such a condition is interpreted as a
       reference condition if the relevant named group exists.


BACKTRACKING CONTROL

       All backtracking control verbs may be in the form (*VERB:NAME).
       For (*MARK) the name is mandatory, for the others it is optional.
       (*SKIP) changes its behaviour if :NAME is present. The others just set
       a name for passing back to the caller, but this is not a name that
       (*SKIP) can see. The following act immediately they are reached:

         (*ACCEPT)      force successful match
         (*FAIL)        force backtrack; synonym (*F)
         (*MARK:NAME)   set name to be passed back; synonym (*:NAME)

       The following act only when a subsequent match failure causes a
       backtrack to reach them. They all force a match failure, but they
       differ in what happens afterwards. Those that advance the
       start-of-match point do so only if the pattern is not anchored.

         (*COMMIT)      overall failure, no advance of starting point
         (*PRUNE)       advance to next starting character
         (*SKIP)        advance to current matching position
         (*SKIP:NAME)   advance to position corresponding to an earlier
                        (*MARK:NAME); if not found, the (*SKIP) is ignored
         (*THEN)        local failure, backtrack to next alternation

       The effect of one of these verbs in a group called as a subroutine is
       confined to the subroutine call.


CALLOUTS

         (?C)            callout (assumed number 0)
         (?Cn)           callout with numerical data n
         (?C"text")      callout with string data

       The allowed string delimiters are ` ' " ^ % # $ (which are the same for
       the start and the end), and the starting delimiter { matched with the
       ending delimiter }. To encode the ending delimiter within the string,
       double it.


SEE ALSO

       pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
       pcre2(3).


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 12 October 2023
       Copyright (c) 1997-2023 University of Cambridge.


PCRE2 10.43                     12 October 2023                 PCRE2SYNTAX(3)
------------------------------------------------------------------------------



PCRE2UNICODE(3)            Library Functions Manual            PCRE2UNICODE(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


UNICODE AND UTF SUPPORT

       PCRE2 is normally built with Unicode support, though if you do not need
       it, you can build it without, in which case the library will be
       smaller. With Unicode support, PCRE2 has knowledge of Unicode character
       properties and can process strings of text in UTF-8, UTF-16, and UTF-32
       format (depending on the code unit width), but UTF processing is not
       the default. Unless specifically requested, PCRE2 treats each code unit
       in a string as one character.

       There are two ways of telling PCRE2 to switch to UTF mode, where
       characters may consist of more than one code unit and the range of
       values is constrained. The program can call pcre2_compile() with the
       PCRE2_UTF option, or the pattern may start with the sequence (*UTF).
       However, the latter facility can be locked out by the PCRE2_NEVER_UTF
       option.
       That is, the programmer can prevent the supplier of the pattern from
       switching to UTF mode.

       Note that the PCRE2_MATCH_INVALID_UTF option (see below) forces
       PCRE2_UTF to be set.

       In UTF mode, both the pattern and any subject strings that are matched
       against it are treated as UTF strings instead of strings of individual
       one-code-unit characters. There are also some other changes to the way
       characters are handled, as documented below.


UNICODE PROPERTY SUPPORT

       When PCRE2 is built with Unicode support, the escape sequences \p{..},
       \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF
       setting. The Unicode properties that can be tested are a subset of
       those that Perl supports. Currently they are limited to the general
       category properties such as Lu for an upper case letter or Nd for a
       decimal number, the derived properties Any and LC (synonym L&), the
       Unicode script names such as Arabic or Han, Bidi_Class, Bidi_Control,
       and a few binary properties.

       The full lists are given in the pcre2pattern and pcre2syntax
       documentation. In general, only the short names for properties are
       supported. For example, \p{L} matches a letter.
       Its longer synonym, \p{Letter}, is not supported. Furthermore, in Perl,
       many properties may optionally be prefixed by "Is", for compatibility
       with Perl 5.6. PCRE2 does not support this.


WIDE CHARACTERS AND UTF MODES

       Code points less than 256 can be specified in patterns by either braced
       or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
       Larger values have to use braced sequences. Unbraced octal code points
       up to \777 are also recognized; larger ones can be coded using \o{...}.

       The escape sequence \N{U+<hex digits>} is recognized as another way of
       specifying a Unicode character by code point in a UTF mode. It is not
       allowed in non-UTF mode.

       In UTF mode, repeat quantifiers apply to complete UTF characters, not
       to individual code units.

       In UTF mode, the dot metacharacter matches one UTF character instead of
       a single code unit.

       In UTF mode, capture group names are not restricted to ASCII, and may
       contain any Unicode letters and decimal digits, as well as underscore.
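
       The difference between code units and characters that underlies the
       quantifier and dot rules above can be illustrated with a small
       stand-alone sketch. It does not use PCRE2 and is not how PCRE2 is
       implemented; the function name utf8_char_count is hypothetical, and
       the input is assumed to be valid UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the characters in a buffer that is ASSUMED to hold valid UTF-8.
 * Every byte that is not a continuation byte (binary 10xxxxxx) starts a
 * new character, so counting those bytes counts the characters. */
size_t utf8_char_count(const uint8_t *s, size_t len_in_code_units)
{
    size_t count = 0;
    for (size_t i = 0; i < len_in_code_units; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

       For the six code units 0x61, 0xC3 0xA9, 0xE2 0x82 0xAC the function
       returns 3. In UTF mode a dot consumes one of those three characters at
       a time; in non-UTF mode it consumes one of the six code units.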

       The escape sequence \C can be used to match a single code unit in UTF
       mode, but its use can lead to some strange effects because it breaks up
       multi-unit characters (see the description of \C in the pcre2pattern
       documentation). For this reason, there is a build-time option that
       disables support for \C completely. There is also a less draconian
       compile-time option for locking out the use of \C when a pattern is
       compiled.

       The use of \C is not supported by the alternative matching function
       pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a
       character may consist of more than one code unit. The use of \C in
       these modes provokes a match-time error. Also, the JIT optimization
       does not support \C in these modes. If JIT optimization is requested
       for a UTF-8 or UTF-16 pattern that contains \C, it will not succeed,
       and so when pcre2_match() is called, the matching will be carried out
       by the interpretive function.

       The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
       characters of any code value, but, by default, the characters that
       PCRE2 recognizes as digits, spaces, or word characters remain the same
       set as in non-UTF mode, all with code points less than 256.
       This remains true even when PCRE2 is built to include Unicode support,
       because to do otherwise would slow down matching in many common cases.
       Note that this also applies to \b and \B, because they are defined in
       terms of \w and \W. If you want to test for a wider sense of, say,
       "digit", you can use explicit Unicode property tests such as \p{Nd}.
       Alternatively, if you set the PCRE2_UCP option, the way that the
       character escapes work is changed so that Unicode properties are used
       to determine which characters match, though there are some options
       that suppress this for individual escapes. For details see the section
       on generic character types in the pcre2pattern documentation.

       Like the escapes, characters that match the POSIX named character
       classes are all low-valued characters unless the PCRE2_UCP option is
       set, but there is an option to override this.

       In contrast to the character escapes and character classes, the special
       horizontal and vertical white space escapes (\h, \H, \v, and \V) do
       match all the appropriate Unicode characters, whether or not PCRE2_UCP
       is set.
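
       As an illustration of what "all the appropriate Unicode characters"
       means for \h and \v, the following stand-alone sketch tests a code
       point against fixed lists. The lists are reproduced from memory of the
       pcre2pattern documentation and should be treated as an assumption to
       be checked against that page; the function names are hypothetical and
       the code is not taken from PCRE2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Horizontal white space: the code points assumed to be matched by \h. */
bool is_horizontal_space(uint32_t cp)
{
    switch (cp) {
    case 0x0009:    /* horizontal tab */
    case 0x0020:    /* space */
    case 0x00A0:    /* non-break space */
    case 0x1680:    /* Ogham space mark */
    case 0x180E:    /* Mongolian vowel separator */
    case 0x202F:    /* narrow no-break space */
    case 0x205F:    /* medium mathematical space */
    case 0x3000:    /* ideographic space */
        return true;
    default:
        /* U+2000 (en quad) through U+200A (hair space) */
        return cp >= 0x2000 && cp <= 0x200A;
    }
}

/* Vertical white space: the code points assumed to be matched by \v. */
bool is_vertical_space(uint32_t cp)
{
    return (cp >= 0x000A && cp <= 0x000D) ||    /* LF, VT, FF, CR */
           cp == 0x0085 ||                      /* next line */
           cp == 0x2028 ||                      /* line separator */
           cp == 0x2029;                        /* paragraph separator */
}
```

       Because the sets are fixed, no option is needed to widen them, which
       is consistent with \h and \v behaving the same with or without
       PCRE2_UCP.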


UNICODE CASE-EQUIVALENCE

       If either PCRE2_UTF or PCRE2_UCP is set, upper/lower case processing
       makes use of Unicode properties except for characters whose code points
       are less than 128 and that have at most two case-equivalent values. For
       these, a direct table lookup is used for speed. A few Unicode
       characters such as Greek sigma have more than two code points that are
       case-equivalent, and these are treated specially. Setting PCRE2_UCP
       without PCRE2_UTF allows Unicode-style case processing for non-UTF
       character encodings such as UCS-2.

       There are two ASCII characters (S and K) that, in addition to their
       ASCII lower case equivalents, have a non-ASCII one as well (long S and
       Kelvin sign). Recognition of these non-ASCII characters as
       case-equivalent to their ASCII counterparts can be disabled by setting
       the PCRE2_EXTRA_CASELESS_RESTRICT option. When this is set, the
       characters in a case equivalence must be either all ASCII or all
       non-ASCII; there can be no mixing.


SCRIPT RUNS

       The pattern constructs (*script_run:...) and (*atomic_script_run:...),
       with synonyms (*sr:...) and (*asr:...), verify that the string matched
       within the parentheses is a script run.
       In concept, a script run is a sequence of characters that are all from
       the same Unicode script. However, because some scripts are commonly
       used together, and because some diacritical and other marks are used
       with multiple scripts, it is not that simple.

       Every Unicode character has a Script property, mostly with a value
       corresponding to the name of a script, such as Latin, Greek, or
       Cyrillic. There are also three special values:

       "Unknown" is used for code points that have not been assigned, and also
       for the surrogate code points. In the PCRE2 32-bit library, characters
       whose code points are greater than the Unicode maximum (U+10FFFF),
       which are accessible only in non-UTF mode, are assigned the Unknown
       script.

       "Common" is used for characters that are used with many scripts. These
       include punctuation, emoji, mathematical, musical, and currency
       symbols, and the ASCII digits 0 to 9.

       "Inherited" is used for characters such as diacritical marks that
       modify a previous character. These are considered to take on the script
       of the character that they modify.

       Some Inherited characters are used with many scripts, but many of them
       are only normally used with a small number of scripts.
       For example, U+102E0 (Coptic Epact thousands mark) is used only with
       Arabic and Coptic. In order to make it possible to check this, a
       Unicode property called Script Extension exists. Its value is a list of
       scripts that apply to the character. For the majority of characters,
       the list contains just one script, the same one as the Script property.
       However, for characters such as U+102E0 more than one Script is listed.
       There are also some Common characters that have a single, non-Common
       script in their Script Extension list.

       The next section describes the basic rules for deciding whether a given
       string of characters is a script run. Note, however, that there are
       some special cases involving the Chinese Han script, and an additional
       constraint for decimal digits. These are covered in subsequent
       sections.

   Basic script run rules

       A string that is less than two characters long is a script run. This is
       the only case in which an Unknown character can be part of a script
       run. Longer strings are checked using only the Script Extensions
       property, not the basic Script property.

       If a character's Script Extension property is the single value
       "Inherited", it is always accepted as part of a script run.
       This is also true for the property "Common", subject to the checking of
       decimal digits described below. All the remaining characters in a
       script run must have at least one script in common in their Script
       Extension lists. In set-theoretic terminology, the intersection of all
       the sets of scripts must not be empty.

       A simple example is an Internet name such as "google.com". The letters
       are all in the Latin script, and the dot is Common, so this string is a
       script run. However, the Cyrillic letter "o" looks exactly the same as
       the Latin "o"; a string that looks the same, but with Cyrillic "o"s is
       not a script run.

       More interesting examples involve characters with more than one script
       in their Script Extension. Consider the following characters:

         U+060C  Arabic comma
         U+06D4  Arabic full stop

       The first has the Script Extension list Arabic, Hanifi Rohingya,
       Syriac, and Thaana; the second has just Arabic and Hanifi Rohingya.
       Both of them could appear in script runs of either Arabic or Hanifi
       Rohingya. The first could also appear in Syriac or Thaana script runs,
       but the second could not.
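
       The basic rules above amount to intersecting sets, which can be
       sketched with one bit per script. This is a simplified illustration
       only: the SC_* identifiers and the function is_script_run are
       hypothetical, the caller is assumed to supply each character's Script
       Extension set, an Unknown character is represented by an empty set,
       and the Han special cases and the decimal digit constraint are not
       modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical script identifiers, one bit each. Real Unicode data has
 * many more scripts; these cover only the examples in the text. */
enum {
    SC_LATIN     = 1u << 0,
    SC_CYRILLIC  = 1u << 1,
    SC_ARABIC    = 1u << 2,
    SC_ROHINGYA  = 1u << 3,
    SC_SYRIAC    = 1u << 4,
    SC_THAANA    = 1u << 5,
    SC_COMMON    = 1u << 6,
    SC_INHERITED = 1u << 7
};

/* ext[i] is the Script Extension set of character i as a bitmask.
 * An Unknown character is represented by the empty mask 0. */
bool is_script_run(const uint32_t *ext, size_t n)
{
    if (n < 2)
        return true;                /* short strings always qualify */

    uint32_t candidates = ~(uint32_t)0;     /* every script still possible */
    for (size_t i = 0; i < n; i++) {
        if (ext[i] == SC_COMMON || ext[i] == SC_INHERITED)
            continue;               /* accepted alongside any script */
        candidates &= ext[i];       /* keep only scripts shared so far */
        if (candidates == 0)
            return false;           /* the intersection is empty */
    }
    return true;
}
```

       With these masks, the Arabic comma (Arabic, Hanifi Rohingya, Syriac,
       Thaana) next to a Syriac letter leaves Syriac in the intersection,
       whereas the Arabic full stop (Arabic, Hanifi Rohingya) next to the
       same letter leaves nothing, so the second string is rejected.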

   The Chinese Han script

       The Chinese Han script is commonly used in conjunction with other
       scripts for writing certain languages. Japanese uses the Hiragana and
       Katakana scripts together with Han; Korean uses Hangul and Han;
       Taiwanese Mandarin uses Bopomofo and Han. These three combinations are
       treated as special cases when checking script runs and are, in effect,
       "virtual scripts". Thus, a script run may contain a mixture of
       Hiragana, Katakana, and Han, or a mixture of Hangul and Han, or a
       mixture of Bopomofo and Han, but not, for example, a mixture of Hangul
       and Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical
       Standard 39 ("Unicode Security Mechanisms",
       http://unicode.org/reports/tr39/) in allowing such mixtures.

   Decimal digits

       Unicode contains many sets of 10 decimal digits in different scripts,
       and some scripts (including the Common script) contain more than one
       set. Some of these decimal digits are visually indistinguishable from
       the common ASCII digits. In addition to the script checking described
       above, if a script run contains any decimal digits, they must all come
       from the same set of 10 adjacent characters.
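
       The decimal digit constraint can be sketched as follows. This is an
       illustration under stated assumptions, not PCRE2's code: the table
       digit_zero lists only a handful of digit sets chosen for the example
       (a real check would cover every decimal digit set in Unicode), and the
       function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Code point of the digit zero in a few decimal digit sets. Each set
 * occupies ten adjacent code points starting at its zero. */
static const uint32_t digit_zero[] = {
    0x0030,     /* ASCII */
    0x0660,     /* Arabic-Indic */
    0x06F0,     /* Extended Arabic-Indic */
    0x0966,     /* Devanagari */
    0xFF10      /* Fullwidth */
};

/* Return the zero of the set that contains cp, or 0 if cp is not a
 * decimal digit known to this sketch. */
static uint32_t digit_set_of(uint32_t cp)
{
    for (size_t i = 0; i < sizeof digit_zero / sizeof digit_zero[0]; i++) {
        if (cp >= digit_zero[i] && cp <= digit_zero[i] + 9)
            return digit_zero[i];
    }
    return 0;
}

/* True if every decimal digit in cps[] comes from one set of ten. */
bool digits_from_one_set(const uint32_t *cps, size_t n)
{
    uint32_t seen = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t set = digit_set_of(cps[i]);
        if (set == 0)
            continue;               /* not a digit: no constraint */
        if (seen != 0 && set != seen)
            return false;           /* digits drawn from two sets */
        seen = set;
    }
    return true;
}
```

       A string containing ASCII "12" followed by U+0663 (Arabic-Indic digit
       three) fails this test even though every character might pass the
       script check, because Common digits are otherwise accepted with any
       script.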


VALIDITY OF UTF STRINGS

       When the PCRE2_UTF option is set, the strings passed as patterns and
       subjects are (by default) checked for validity on entry to the relevant
       functions. If an invalid UTF string is passed, a negative error code is
       returned. The code unit offset to the offending character can be
       extracted from the match data block by calling pcre2_get_startchar(),
       which is used for this purpose after a UTF error.

       In some situations, you may already know that your strings are valid,
       and therefore want to skip these checks in order to improve
       performance, for example in the case of a long subject string that is
       being scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at
       compile time or at match time, PCRE2 assumes that the pattern or
       subject it is given (respectively) contains only valid UTF code unit
       sequences.

       If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
       result is undefined and your program may crash or loop indefinitely or
       give incorrect results. There is, however, one mode of matching that
       can handle invalid UTF subject strings. This is enabled by passing
       PCRE2_MATCH_INVALID_UTF to pcre2_compile() and is discussed in the next
       section.
       The rest of this section covers the case when PCRE2_MATCH_INVALID_UTF
       is not set.

       Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the UTF
       check for the pattern; it does not also apply to subject strings. If
       you want to disable the check for a subject string you must pass this
       same option to pcre2_match() or pcre2_dfa_match().

       UTF-16 and UTF-32 strings can indicate their endianness by a special
       code known as a byte-order mark (BOM). The PCRE2 functions do not
       handle this, expecting strings to be in host byte order.

       Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any
       other processing takes place. In the case of pcre2_match() and
       pcre2_dfa_match() calls with a non-zero starting offset, the check is
       applied only to that part of the subject that could be inspected during
       matching, and there is a check that the starting offset points to the
       first code unit of a character or to the end of the subject. If there
       are no lookbehind assertions in the pattern, the check starts at the
       starting offset. Otherwise, it starts at the length of the longest
       lookbehind before the starting offset, or at the start of the subject
       if there are not that many characters before the starting offset. Note
       that the sequences \b and \B are one-character lookbehinds.
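
       The rule for where the check begins can be sketched for the 8-bit
       case. This is an illustration, not the library's code: both function
       names are hypothetical, max_lookbehind stands for the length in
       characters of the longest lookbehind in the pattern, and the bytes
       before the starting offset are assumed to be well formed enough for
       character starts to be located.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if offset is the end of the subject or the first code unit of a
 * character, that is, not a UTF-8 continuation byte (binary 10xxxxxx). */
bool utf8_offset_on_boundary(const uint8_t *subject, size_t length,
                             size_t offset)
{
    if (offset > length)
        return false;
    return offset == length || (subject[offset] & 0xC0) != 0x80;
}

/* Step back max_lookbehind characters from start_offset, stopping early
 * at the start of the subject. The result is the offset at which a
 * validity check would begin under the rule described above. */
size_t utf8_check_start(const uint8_t *subject, size_t start_offset,
                        size_t max_lookbehind)
{
    size_t pos = start_offset;
    while (max_lookbehind > 0 && pos > 0) {
        pos--;
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;              /* move back over continuation bytes */
        max_lookbehind--;
    }
    return pos;
}
```

       With no lookbehind (max_lookbehind of 0) the result is the starting
       offset itself. A pattern containing \b has a one-character lookbehind,
       so the check would begin one whole character earlier, however many
       code units that character occupies.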

       In addition to checking the format of the string, there is a check to
       ensure that all code points lie in the range U+0 to U+10FFFF, excluding
       the surrogate area. The so-called "non-character" code points are not
       excluded because Unicode corrigendum #9 makes it clear that they should
       not be.

       Characters in the "Surrogate Area" of Unicode are reserved for use by
       UTF-16, where they are used in pairs to encode code points with values
       greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
       are available independently in the UTF-8 and UTF-32 encodings. (In
       other words, the whole surrogate thing is a fudge for UTF-16 which
       unfortunately messes up UTF-8 and UTF-32.)

       Setting PCRE2_NO_UTF_CHECK at compile time does not disable the error
       that is given if an escape sequence for an invalid Unicode code point
       is encountered in the pattern. If you want to allow escape sequences
       such as \x{d800} (a surrogate code point) you can set the
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is
       possible only in UTF-8 and UTF-32 modes, because these values are not
       representable in UTF-16.
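
       The code point range rule and the role of surrogate pairs can be
       sketched in a few lines. The function names are hypothetical and the
       code is a stand-alone illustration of the rules stated above, using
       the standard UTF-16 pairing arithmetic; it is not taken from PCRE2.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if cp is acceptable in a UTF string: within U+0 to U+10FFFF and
 * outside the surrogate area U+D800 to U+DFFF. "Non-character" code
 * points such as U+FFFE are deliberately accepted. */
bool is_valid_utf_code_point(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false;
    return cp < 0xD800 || cp > 0xDFFF;
}

/* Combine a UTF-16 high and low surrogate into the code point they
 * encode. Returns false, leaving *cp unchanged, if the two code units
 * do not form a valid pair. */
bool utf16_decode_pair(uint16_t high, uint16_t low, uint32_t *cp)
{
    if (high < 0xD800 || high > 0xDBFF)
        return false;           /* not a high surrogate */
    if (low < 0xDC00 || low > 0xDFFF)
        return false;           /* not a low surrogate */
    *cp = 0x10000 + (((uint32_t)(high - 0xD800) << 10) |
                     (uint32_t)(low - 0xDC00));
    return true;
}
```

       For example, the pair 0xD83D, 0xDE00 decodes to U+1F600, a code point
       above 0xFFFF. The value 0xD83D on its own is rejected by
       is_valid_utf_code_point() because it lies in the surrogate area.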

   Errors in UTF-8 strings

       The following negative error codes are given for invalid UTF-8 strings:

         PCRE2_ERROR_UTF8_ERR1
         PCRE2_ERROR_UTF8_ERR2
         PCRE2_ERROR_UTF8_ERR3
         PCRE2_ERROR_UTF8_ERR4
         PCRE2_ERROR_UTF8_ERR5

       The string ends with a truncated UTF-8 character; the code specifies
       how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
       characters to be no longer than 4 bytes, the encoding scheme
       (originally defined by RFC 2279) allows for up to 6 bytes, and this is
       checked first; hence the possibility of 4 or 5 missing bytes.

         PCRE2_ERROR_UTF8_ERR6
         PCRE2_ERROR_UTF8_ERR7
         PCRE2_ERROR_UTF8_ERR8
         PCRE2_ERROR_UTF8_ERR9
         PCRE2_ERROR_UTF8_ERR10

       The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
       the character do not have the binary value 0b10 (that is, either the
       most significant bit is 0, or the next bit is 1).

         PCRE2_ERROR_UTF8_ERR11
         PCRE2_ERROR_UTF8_ERR12

       A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
       long; these code points are excluded by RFC 3629.

         PCRE2_ERROR_UTF8_ERR13

       A 4-byte character has a value greater than 0x10ffff; these code points
       are excluded by RFC 3629.

         PCRE2_ERROR_UTF8_ERR14

       A 3-byte character has a value in the range 0xd800 to 0xdfff; this
       range of code points is reserved by RFC 3629 for use with UTF-16, and
       so is excluded from UTF-8.

         PCRE2_ERROR_UTF8_ERR15
         PCRE2_ERROR_UTF8_ERR16
         PCRE2_ERROR_UTF8_ERR17
         PCRE2_ERROR_UTF8_ERR18
         PCRE2_ERROR_UTF8_ERR19

       A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
       for a value that can be represented by fewer bytes, which is invalid.
       For example, the two bytes 0xc0, 0xae give the value 0x2e, whose
       correct coding uses just one byte.

         PCRE2_ERROR_UTF8_ERR20

       The two most significant bits of the first byte of a character have the
       binary value 0b10 (that is, the most significant bit is 1 and the
       second is 0). Such a byte can only validly occur as the second or
       subsequent byte of a multi-byte character.

         PCRE2_ERROR_UTF8_ERR21

       The first byte of a character has the value 0xfe or 0xff. These values
       can never occur in a valid UTF-8 string.

   Errors in UTF-16 strings

       The following negative error codes are given for invalid UTF-16
       strings:

         PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
         PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
         PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate


   Errors in UTF-32 strings

       The following negative error codes are given for invalid UTF-32
       strings:

         PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff)
         PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff


MATCHING IN INVALID UTF STRINGS

       You can run pattern matches on subject strings that may contain invalid
       UTF sequences if you call pcre2_compile() with the
       PCRE2_MATCH_INVALID_UTF option. This is supported by pcre2_match(),
       including JIT matching, but not by pcre2_dfa_match(). When
       PCRE2_MATCH_INVALID_UTF is set, it forces PCRE2_UTF to be set as well.
       Note, however, that the pattern itself must be a valid UTF string.

       If you do not set PCRE2_MATCH_INVALID_UTF when calling pcre2_compile(),
       and you are not certain that your subject strings are valid UTF
       sequences, you should not make use of the JIT "fast path" function
       pcre2_jit_match() because it bypasses sanity checks, including the one
       for UTF validity. An invalid string may cause undefined behaviour,
       including looping, crashing, or giving the wrong answer.

       Setting PCRE2_MATCH_INVALID_UTF does not affect what pcre2_compile()
       generates, but if pcre2_jit_compile() is subsequently called, it does
       generate different code. If JIT is not used, the option affects the
       behaviour of the interpretive code in pcre2_match(). When
       PCRE2_MATCH_INVALID_UTF is set at compile time, PCRE2_NO_UTF_CHECK is
       ignored at match time.

       In this mode, an invalid code unit sequence in the subject never
       matches any pattern item. It does not match dot, it does not match
       \p{Any}, it does not even match negative items such as [^X]. A
       lookbehind assertion fails if it encounters an invalid sequence while
       moving the current point backwards. In other words, an invalid UTF code
       unit sequence acts as a barrier which no match can cross.

       You can also think of this as the subject being split up into fragments
       of valid UTF, delimited internally by invalid code unit sequences. The
       pattern is matched fragment by fragment. The result of a successful
       match, however, is given as code unit offsets in the entire subject
       string in the usual way. There are a few points to consider:

       The internal boundaries are not interpreted as the beginnings or ends
       of lines and so do not match circumflex or dollar characters in the
       pattern.

       If pcre2_match() is called with an offset that points to an invalid
       UTF sequence, that sequence is skipped, and the match starts at the
       next valid UTF character, or the end of the subject.

       At internal fragment boundaries, \b and \B behave in the same way as at
       the beginning and end of the subject. For example, a sequence such as
       \bWORD\b would match an instance of WORD that is surrounded by invalid
       UTF code units.

       Using PCRE2_MATCH_INVALID_UTF, an application can run matches on
       arbitrary data, knowing that any matched strings that are returned are
       valid UTF. This can be useful when searching for UTF text in executable
       or other binary files.
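
       The "fragments" view of the subject can be made concrete for the
       8-bit case with a stand-alone sketch. It does not call PCRE2 and does
       not claim to reproduce how PCRE2 locates invalid sequences: the names
       fragment, utf8_valid_char_len and utf8_fragments are hypothetical, and
       treating each invalid byte as a separate one-byte barrier is an
       assumption of the sketch. Validity follows the RFC 3629 rules (no
       overlong forms, no surrogates, nothing above 0x10ffff).

```c
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the valid UTF-8 character starting at s[0], or 0
 * if the bytes there do not form one. */
static size_t utf8_valid_char_len(const uint8_t *s, size_t avail)
{
    if (avail == 0)
        return 0;

    uint8_t b = s[0];
    size_t len;
    uint32_t cp, min;

    if (b < 0x80)
        return 1;
    else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
    else
        return 0;               /* continuation byte, or 0xf8 to 0xff */

    if (avail < len)
        return 0;               /* truncated character */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min)
        return 0;               /* overlong encoding */
    if (cp > 0x10FFFF)
        return 0;               /* beyond the Unicode maximum */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogate */
    return len;
}

typedef struct { size_t start, end; } fragment;     /* [start, end) */

/* Find the maximal runs of valid UTF-8 in buf. The first max_out runs
 * are stored in out[]; the return value is the total number found. */
size_t utf8_fragments(const uint8_t *buf, size_t len,
                      fragment *out, size_t max_out)
{
    size_t count = 0, i = 0;
    while (i < len) {
        size_t n = utf8_valid_char_len(buf + i, len - i);
        if (n == 0) {
            i++;                /* skip one byte of the barrier */
            continue;
        }
        size_t start = i;
        while (i < len && (n = utf8_valid_char_len(buf + i, len - i)) != 0)
            i += n;             /* extend the run over valid characters */
        if (count < max_out) {
            out[count].start = start;
            out[count].end = i;
        }
        count++;
    }
    return count;
}
```

       For the nine bytes "WORD", 0xff, 0xc3 0xa9, 0x80, "x" the sketch
       reports three fragments at offsets [0,4), [5,7) and [8,9). The first
       fragment is the situation in the \bWORD\b example above: WORD is
       bounded by the start of the subject on one side and an invalid byte on
       the other.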

       Note, however, that the 16-bit and 32-bit PCRE2 libraries process
       strings as sequences of uint16_t or uint32_t code units. They cannot
       find valid UTF sequences within an arbitrary string of bytes unless
       such sequences are suitably aligned.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 12 October 2023
       Copyright (c) 1997-2023 University of Cambridge.


PCRE2 10.43                    04 February 2023                PCRE2UNICODE(3)
------------------------------------------------------------------------------

