-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE2 man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. Neither has
the pcre2demo program. There are separate text files for the pcre2grep and
pcre2test commands.
-----------------------------------------------------------------------------



PCRE2(3)                   Library Functions Manual                   PCRE2(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


INTRODUCTION

       PCRE2 is the name used for a revised API for the PCRE library, which is
       a set of functions, written in C, that implement regular expression
       pattern matching using the same syntax and semantics as Perl, with just
       a few differences. After nearly two decades, the limitations of the
       original API were making development increasingly difficult. The new
       API is more extensible, and it was simplified by abolishing the sepa-
       rate "study" optimizing function; in PCRE2, patterns are automatically
       optimized where possible. Since forking from PCRE1, the code has been
       extensively refactored and new features introduced. The old library is
       now obsolete and is no longer maintained.

       As well as Perl-style regular expression patterns, some features that
       appeared in Python and the original PCRE before they appeared in Perl
       are available using the Python syntax. There is also some support for
       one or two .NET and Oniguruma syntax items, and there are options for
       requesting some minor changes that give better ECMAScript (aka
       JavaScript) compatibility.

       The source code for PCRE2 can be compiled to support strings of 8-bit,
       16-bit, or 32-bit code units, which means that up to three separate li-
       braries may be installed, one for each code unit size. The size of code
       unit is not related to the bit size of the underlying hardware. In a
       64-bit environment that also supports 32-bit applications, versions of
       PCRE2 that are compiled in both 64-bit and 32-bit modes may be needed.

       The original work to extend PCRE to 16-bit and 32-bit code units was
       done by Zoltan Herczeg and Christian Persch, respectively. In all three
       cases, strings can be interpreted either as one character per code
       unit, or as UTF-encoded Unicode, with support for Unicode general cate-
       gory properties. Unicode support is optional at build time (but is the
       default). However, processing strings as UTF code units must be enabled
       explicitly at run time. The version of Unicode in use can be discovered
       by running

         pcre2test -C

       The three libraries contain identical sets of functions, with names
       ending in _8, _16, or _32, respectively (for example, pcre2_com-
       pile_8()). However, by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or
       32, a program that uses just one code unit width can be written using
       generic names such as pcre2_compile(), and the documentation is written
       assuming that this is the case.

       In addition to the Perl-compatible matching function, PCRE2 contains an
       alternative function that matches the same compiled patterns in a dif-
       ferent way. In certain circumstances, the alternative function has some
       advantages. For a discussion of the two matching algorithms, see the
       pcre2matching page.

       Details of exactly which Perl regular expression features are and are
       not supported by PCRE2 are given in separate documents. See the
       pcre2pattern and pcre2compat pages. There is a syntax summary in the
       pcre2syntax page.

       Some features of PCRE2 can be included, excluded, or changed when the
       library is built. The pcre2_config() function makes it possible for a
       client to discover which features are available. The features them-
       selves are described in the pcre2build page. Documentation about build-
       ing PCRE2 for various operating systems can be found in the README and
       NON-AUTOTOOLS_BUILD files in the source distribution.

       The libraries contain a number of undocumented internal functions and
       data tables that are used by more than one of the exported external
       functions, but which are not intended for use by external callers.
       Their names all begin with "_pcre2", which hopefully will not provoke
       any name clashes. In some environments, it is possible to control which
       external symbols are exported when a shared library is built, and in
       these cases the undocumented symbols are not exported.


SECURITY CONSIDERATIONS

       If you are using PCRE2 in a non-UTF application that permits users to
       supply arbitrary patterns for compilation, you should be aware of a
       feature that allows users to turn on UTF support from within a pattern.
       For example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
       mode, which interprets patterns and subjects as strings of UTF-8 code
       units instead of individual 8-bit characters. This causes both the pat-
       tern and any data against which it is matched to be checked for UTF-8
       validity. If the data string is very long, such a check might use suf-
       ficiently many resources as to cause your application to lose perfor-
       mance.

       One way of guarding against this possibility is to use the pcre2_pat-
       tern_info() function to check the compiled pattern's options for
       PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when
       calling pcre2_compile(). This causes a compile time error if the pat-
       tern contains a UTF-setting sequence.

       The use of Unicode properties for character types such as \d can also
       be enabled from within the pattern, by specifying "(*UCP)". This fea-
       ture can be disallowed by setting the PCRE2_NEVER_UCP option.

       If your application is one that supports UTF, be aware that validity
       checking can take time. If the same data string is to be matched many
       times, you can use the PCRE2_NO_UTF_CHECK option for the second and
       subsequent matches to avoid running redundant checks.

       The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
       to problems, because it may leave the current matching point in the
       middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C op-
       tion can be used by an application to lock out the use of \C, causing a
       compile-time error if it is encountered. It is also possible to build
       PCRE2 with the use of \C permanently disabled.

       Another way that performance can be hit is by running a pattern that
       has a very large search tree against a string that will never match.
       Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
       vides some protection against this: see the pcre2_set_match_limit()
       function in the pcre2api page. There is a similar function called
       pcre2_set_depth_limit() that can be used to restrict the amount of mem-
       ory that is used.


USER DOCUMENTATION

       The user documentation for PCRE2 comprises a number of different sec-
       tions. In the "man" format, each of these is a separate "man page". In
       the HTML format, each is a separate page, linked from the index page.
       In the plain text format, the descriptions of the pcre2grep and
       pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
       respectively.
       The remaining sections, except for the pcre2demo section
       (which is a program listing), and the short pages for individual func-
       tions, are concatenated in pcre2.txt, for ease of searching. The sec-
       tions are as follows:

         pcre2              this document
         pcre2-config       show PCRE2 installation configuration information
         pcre2api           details of PCRE2's native C API
         pcre2build         building PCRE2
         pcre2callout       details of the pattern callout feature
         pcre2compat        discussion of Perl compatibility
         pcre2convert       details of pattern conversion functions
         pcre2demo          a demonstration C program that uses PCRE2
         pcre2grep          description of the pcre2grep command (8-bit only)
         pcre2jit           discussion of just-in-time optimization support
         pcre2limits        details of size and other limits
         pcre2matching      discussion of the two matching algorithms
         pcre2partial       details of the partial matching facility
         pcre2pattern       syntax and semantics of supported regular
                              expression patterns
         pcre2perform       discussion of performance issues
         pcre2posix         the POSIX-compatible C API for the 8-bit library
         pcre2sample        discussion of the pcre2demo program
         pcre2serialize     details of pattern serialization
         pcre2syntax        quick syntax reference
         pcre2test          description of the pcre2test command
         pcre2unicode       discussion of Unicode and UTF support

       In the "man" and HTML formats, there is also a short page for each C
       library function, listing its arguments and results.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.

       Putting an actual email address here is a spam magnet. If you want to
       email me, use my two names separated by a dot at gmail.com.


REVISION

       Last updated: 27 August 2021
       Copyright (c) 1997-2021 University of Cambridge.


PCRE2 10.38                     27 August 2021                        PCRE2(3)
------------------------------------------------------------------------------



PCRE2API(3)                Library Functions Manual                PCRE2API(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

       #include <pcre2.h>

       PCRE2 is a new API for PCRE, starting at release 10.0. This document
       contains a description of all its native functions. See the pcre2 docu-
       ment for an overview of all the PCRE2 documentation.


PCRE2 NATIVE API BASIC FUNCTIONS

       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
         pcre2_compile_context *ccontext);

       void pcre2_code_free(pcre2_code *code);

       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
         pcre2_general_context *gcontext);

       pcre2_match_data *pcre2_match_data_create_from_pattern(
         const pcre2_code *code, pcre2_general_context *gcontext);

       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext,
         int *workspace, PCRE2_SIZE wscount);

       void pcre2_match_data_free(pcre2_match_data *match_data);


PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS

       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_match_data_size(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_match_data_heapframes_size(
         pcre2_match_data *match_data);

       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);

       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);

       PCRE2_SIZE
         pcre2_get_startchar(pcre2_match_data *match_data);


PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       void pcre2_general_context_free(pcre2_general_context *gcontext);


PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
         const uint8_t *tables);

       int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
         uint32_t extra_options);

       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
         PCRE2_SIZE value);

       int pcre2_set_max_pattern_compiled_length(
         pcre2_compile_context *ccontext, PCRE2_SIZE value);

       int pcre2_set_max_varlookbehind(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_newline(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
         int (*guard_function)(uint32_t, void *), void *user_data);


PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS

       pcre2_match_context *pcre2_match_context_create(
         pcre2_general_context *gcontext);

       pcre2_match_context *pcre2_match_context_copy(
         pcre2_match_context *mcontext);

       void pcre2_match_context_free(pcre2_match_context *mcontext);

       int
         pcre2_set_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_callout_block *, void *),
         void *callout_data);

       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_substitute_callout_block *, void *),
         void *callout_data);

       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
         PCRE2_SIZE value);

       int pcre2_set_heap_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_match_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_depth_limit(pcre2_match_context *mcontext,
         uint32_t value);


PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS

       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR *buffer,
         PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       int pcre2_substring_get_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR **bufferptr,
         PCRE2_SIZE *bufflen);

       int pcre2_substring_length_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_SIZE *length);

       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_SIZE *length);

       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

       int pcre2_substring_number_from_name(const pcre2_code *code,
         PCRE2_SPTR name);

       void pcre2_substring_list_free(PCRE2_UCHAR **list);

       int pcre2_substring_list_get(pcre2_match_data *match_data,
         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);


PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION

       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext, PCRE2_SPTR replacementz,
         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
         PCRE2_SIZE *outlengthptr);


PCRE2 NATIVE API JIT FUNCTIONS

       int pcre2_jit_compile(pcre2_code *code, uint32_t options);

       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       pcre2_jit_stack *pcre2_jit_stack_create(size_t startsize,
         size_t maxsize, pcre2_general_context *gcontext);

       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
         pcre2_jit_callback callback_function, void *callback_data);

       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);


PCRE2 NATIVE API SERIALIZATION FUNCTIONS

       int32_t pcre2_serialize_decode(pcre2_code **codes,
         int32_t number_of_codes, const uint8_t *bytes,
         pcre2_general_context *gcontext);

       int32_t pcre2_serialize_encode(const pcre2_code **codes,
         int32_t number_of_codes, uint8_t **serialized_bytes,
         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);

       void pcre2_serialize_free(uint8_t *bytes);

       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);


PCRE2 NATIVE API AUXILIARY FUNCTIONS

       pcre2_code *pcre2_code_copy(const pcre2_code *code);

       pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);

       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
         PCRE2_SIZE bufflen);

       const uint8_t *pcre2_maketables(pcre2_general_context *gcontext);

       void pcre2_maketables_free(pcre2_general_context *gcontext,
         const uint8_t *tables);

       int
         pcre2_pattern_info(const pcre2_code *code, uint32_t what,
         void *where);

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       int pcre2_config(uint32_t what, void *where);


PCRE2 NATIVE API OBSOLETE FUNCTIONS

       int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_recursion_memory_management(
         pcre2_match_context *mcontext,
         void *(*private_malloc)(size_t, void *),
         void (*private_free)(void *, void *), void *memory_data);

       These functions became obsolete at release 10.30 and are retained only
       for backward compatibility. They should not be used in new code. The
       first is replaced by pcre2_set_depth_limit(); the second is no longer
       needed and has no effect (it always returns zero).


PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS

       pcre2_convert_context *pcre2_convert_context_create(
         pcre2_general_context *gcontext);

       pcre2_convert_context *pcre2_convert_context_copy(
         pcre2_convert_context *cvcontext);

       void pcre2_convert_context_free(pcre2_convert_context *cvcontext);

       int pcre2_set_glob_escape(pcre2_convert_context *cvcontext,
         uint32_t escape_char);

       int pcre2_set_glob_separator(pcre2_convert_context *cvcontext,
         uint32_t separator_char);

       int pcre2_pattern_convert(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, PCRE2_UCHAR **buffer,
         PCRE2_SIZE *blength, pcre2_convert_context *cvcontext);

       void pcre2_converted_pattern_free(PCRE2_UCHAR *converted_pattern);

       These functions provide a way of converting non-PCRE2 patterns into
       patterns that can be processed by pcre2_compile(). This facility is ex-
       perimental and may be changed in future releases. At present, "globs"
       and POSIX basic and extended patterns can be converted.
       Details are
       given in the pcre2convert documentation.


PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES

       There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
       code units, respectively. However, there is just one header file,
       pcre2.h. This contains the function prototypes and other definitions
       for all three libraries. One, two, or all three can be installed simul-
       taneously. On Unix-like systems the libraries are called libpcre2-8,
       libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
       inal PCRE libraries. Every PCRE2 function comes in three different
       forms, one for each library, for example:

         pcre2_compile_8()
         pcre2_compile_16()
         pcre2_compile_32()

       There are also three different sets of data types:

         PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
         PCRE2_SPTR8,  PCRE2_SPTR16,  PCRE2_SPTR32

       The UCHAR types define unsigned code units of the appropriate widths.
       For example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR
       types are pointers to constants of the equivalent UCHAR types, that is,
       they are pointers to vectors of unsigned code units.

       Character strings are passed to a PCRE2 library as sequences of un-
       signed integers in code units of the appropriate width. The length of a
       string may be given as a number of code units, or the string may be
       specified as zero-terminated.

       Many applications use only one code unit width. For their convenience,
       macros are defined whose names are the generic forms such as pcre2_com-
       pile() and PCRE2_SPTR. These macros use the value of the macro
       PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific func-
       tion and macro names. PCRE2_CODE_UNIT_WIDTH is not defined by default.
       An application must define it to be 8, 16, or 32 before including
       pcre2.h in order to make use of the generic names.

       Applications that use more than one code unit width can be linked with
       more than one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to
       be 0 before including pcre2.h, and then use the real function names.
       Any code that is to be included in an environment where the value of
       PCRE2_CODE_UNIT_WIDTH is unknown should also use the real function
       names. (Unfortunately, it is not possible in C code to save and restore
       the value of a macro.)

       If PCRE2_CODE_UNIT_WIDTH is not defined before including pcre2.h, a
       compiler error occurs.

       When using multiple libraries in an application, you must take care
       when processing any particular pattern to use only functions from a
       single library. For example, if you want to run a match using a pat-
       tern that was compiled with pcre2_compile_16(), you must do so with
       pcre2_match_16(), not pcre2_match_8() or pcre2_match_32().

       In the function summaries above, and in the rest of this document and
       other PCRE2 documents, functions and data types are described using
       their generic names, without the _8, _16, or _32 suffix.


PCRE2 API OVERVIEW

       PCRE2 has its own native API, which is described in this document.
       There are also some wrapper functions for the 8-bit library that corre-
       spond to the POSIX regular expression API, but they do not give access
       to all the functionality of PCRE2 and they are not thread-safe. They
       are described in the pcre2posix documentation. Both these APIs define a
       set of C function calls.

       The native API C data types, function prototypes, option values, and
       error codes are defined in the header file pcre2.h, which also contains
       definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release
       numbers for the library. Applications can use these to include support
       for different releases of PCRE2.

       In a Windows environment, if you want to statically link an application
       program against a non-dll PCRE2 library, you must define PCRE2_STATIC
       before including pcre2.h.

       The functions pcre2_compile() and pcre2_match() are used for compiling
       and matching regular expressions in a Perl-compatible manner. A sample
       program that demonstrates the simplest way of using them is provided in
       the file called pcre2demo.c in the PCRE2 source distribution. A listing
       of this program is given in the pcre2demo documentation, and the
       pcre2sample documentation describes how to compile and run it.

       The compiling and matching functions recognize various options that are
       passed as bits in an options argument. There are also some more compli-
       cated parameters such as custom memory management functions and re-
       source limits that are passed in "contexts" (which are just memory
       blocks, described below). Simple applications do not need to make use
       of contexts.

       Just-in-time (JIT) compiler support is an optional feature of PCRE2
       that can be built in appropriate hardware environments. It greatly
       speeds up the matching performance of many patterns. Programs can re-
       quest that it be used if available by calling pcre2_jit_compile() after
       a pattern has been successfully compiled by pcre2_compile(). This does
       nothing if JIT support is not available.

       More complicated programs might need to make use of the specialist
       functions pcre2_jit_stack_create(), pcre2_jit_stack_free(), and
       pcre2_jit_stack_assign() in order to control the JIT code's memory us-
       age.

       JIT matching is automatically used by pcre2_match() if it is available,
       unless the PCRE2_NO_JIT option is set. There is also a direct interface
       for JIT matching, which gives improved performance at the expense of
       less sanity checking. The JIT-specific functions are discussed in the
       pcre2jit documentation.

       A second matching function, pcre2_dfa_match(), which is not Perl-com-
       patible, is also provided. This uses a different algorithm for the
       matching. The alternative algorithm finds all possible matches (at a
       given point in the subject), and scans the subject just once (unless
       there are lookaround assertions). However, this algorithm does not re-
       turn captured substrings. A description of the two matching algorithms
       and their advantages and disadvantages is given in the pcre2matching
       documentation. There is no JIT support for pcre2_dfa_match().

       In addition to the main compiling and matching functions, there are
       convenience functions for extracting captured substrings from a subject
       string that has been matched by pcre2_match(). They are:

         pcre2_substring_copy_byname()
         pcre2_substring_copy_bynumber()
         pcre2_substring_get_byname()
         pcre2_substring_get_bynumber()
         pcre2_substring_list_get()
         pcre2_substring_length_byname()
         pcre2_substring_length_bynumber()
         pcre2_substring_nametable_scan()
         pcre2_substring_number_from_name()

       pcre2_substring_free() and pcre2_substring_list_free() are also pro-
       vided, to free memory used for extracted strings. If either of these
       functions is called with a NULL argument, the function returns immedi-
       ately without doing anything.

       The function pcre2_substitute() can be called to match a pattern and
       return a copy of the subject string with substitutions for parts that
       were matched.

       Functions whose names begin with pcre2_serialize_ are used for saving
       compiled patterns on disc or elsewhere, and reloading them later.

       Finally, there are functions for finding out information about a com-
       piled pattern (pcre2_pattern_info()) and about the configuration with
       which PCRE2 was built (pcre2_config()).

       Functions with names ending with _free() are used for freeing memory
       blocks of various sorts.
       In all cases, if one of these functions is
       called with a NULL argument, it does nothing.


STRING LENGTHS AND OFFSETS

       The PCRE2 API uses string lengths and offsets into strings of code
       units in several places. These values are always of type PCRE2_SIZE,
       which is an unsigned integer type, currently always defined as size_t.
       The largest value that can be stored in such a type (that is
       ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated
       strings and unset offsets. Therefore, the longest string that can be
       handled is one less than this maximum. Note that string lengths are al-
       ways given in code units. Only in the 8-bit library is such a length
       the same as the number of bytes in the string.


NEWLINES

       PCRE2 supports five different conventions for indicating line breaks in
       strings: a single CR (carriage return) character, a single LF (line-
       feed) character, the two-character sequence CRLF, any of the three pre-
       ceding, or any Unicode newline sequence. The Unicode newline sequences
       are the three just mentioned, plus the single characters VT (vertical
       tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
       separator, U+2028), and PS (paragraph separator, U+2029).

       Each of the first three conventions is used by at least one operating
       system as its standard newline sequence. When PCRE2 is built, a default
       can be specified. If it is not, the default is set to LF, which is the
       Unix standard. However, the newline convention can be changed by an ap-
       plication when calling pcre2_compile(), or it can be specified by spe-
       cial text at the start of the pattern itself; this overrides any other
       settings. See the pcre2pattern page for details of the special charac-
       ter sequences.

       In the PCRE2 documentation the word "newline" is used to mean "the
       character or pair of characters that indicate a line break".
       The choice
       of newline convention affects the handling of the dot, circumflex, and
       dollar metacharacters, the handling of #-comments in /x mode, and, when
       CRLF is a recognized line ending sequence, the match position advance-
       ment for a non-anchored pattern. There is more detail about this in the
       section on pcre2_match() options below.

       The choice of newline convention does not affect the interpretation of
       the \n or \r escape sequences, nor does it affect what \R matches; this
       has its own separate convention.


MULTITHREADING

       In a multithreaded application it is important to keep thread-specific
       data separate from data that can be shared between threads. The PCRE2
       library code itself is thread-safe: it contains no static or global
       variables. The API is designed to be fairly simple for non-threaded ap-
       plications while at the same time ensuring that multithreaded applica-
       tions can use it.

       There are several different blocks of data that are used to pass infor-
       mation between the application and the PCRE2 libraries.

   The compiled pattern

       A pointer to the compiled form of a pattern is returned to the user
       when pcre2_compile() is successful. The data in the compiled pattern is
       fixed, and does not change when the pattern is matched. Therefore, it
       is thread-safe, that is, the same compiled pattern can be used by more
       than one thread simultaneously. For example, an application can compile
       all its patterns at the start, before forking off multiple threads that
       use them. However, if the just-in-time (JIT) optimization feature is
       being used, it needs separate memory stack areas for each thread. See
       the pcre2jit documentation for more details.

       In a more complicated situation, where patterns are compiled only when
       they are first needed, but are still shared between threads, pointers
       to compiled patterns must be protected from simultaneous writing by
       multiple threads. This is somewhat tricky to do correctly. If you know
       that writing to a pointer is atomic in your environment, you can use
       logic like this:

         Get a read-only (shared) lock (mutex) for pointer
         if (pointer == NULL)
           {
           Get a write (unique) lock for pointer
           if (pointer == NULL) pointer = pcre2_compile(...
           }
         Release the lock
         Use pointer in pcre2_match()

       Of course, testing for compilation errors should also be included in
       the code.

       The reason for checking the pointer a second time is as follows: Sev-
       eral threads may have acquired the shared lock and tested the pointer
       for being NULL, but only one of them will be given the write lock, with
       the rest kept waiting. The winning thread will compile the pattern and
       store the result. After this thread releases the write lock, another
       thread will get it, and if it does not retest pointer for being NULL,
       will recompile the pattern and overwrite the pointer, creating a memory
       leak and possibly causing other issues.

       In an environment where writing to a pointer may not be atomic, the
       above logic is not sufficient. The thread that is doing the compiling
       may be descheduled after writing only part of the pointer, which could
       cause other threads to use an invalid value. Instead of checking the
       pointer itself, a separate "pointer is valid" flag (that can be updated
       atomically) must be used:

         Get a read-only (shared) lock (mutex) for pointer
         if (!pointer_is_valid)
           {
           Get a write (unique) lock for pointer
           if (!pointer_is_valid)
             {
             pointer = pcre2_compile(...
749 pointer_is_valid = TRUE 750 } 751 } 752 Release the lock 753 Use pointer in pcre2_match() 754 755 If JIT is being used, but the JIT compilation is not being done immedi- 756 ately (perhaps waiting to see if the pattern is used often enough), 757 similar logic is required. JIT compilation updates a value within the 758 compiled code block, so a thread must gain unique write access to the 759 pointer before calling pcre2_jit_compile(). Alternatively, 760 pcre2_code_copy() or pcre2_code_copy_with_tables() can be used to ob- 761 tain a private copy of the compiled code before calling the JIT com- 762 piler. 763 764 Context blocks 765 766 The next main section below introduces the idea of "contexts" in which 767 PCRE2 functions are called. A context is nothing more than a collection 768 of parameters that control the way PCRE2 operates. Grouping a number of 769 parameters together in a context is a convenient way of passing them to 770 a PCRE2 function without using lots of arguments. The parameters that 771 are stored in contexts are in some sense "advanced features" of the 772 API. Many straightforward applications will not need to use contexts. 773 774 In a multithreaded application, if the parameters in a context are val- 775 ues that are never changed, the same context can be used by all the 776 threads. However, if any thread needs to change any value in a context, 777 it must make its own thread-specific copy. 778 779 Match blocks 780 781 The matching functions need a block of memory for storing the results 782 of a match. This includes details of what was matched, as well as addi- 783 tional information such as the name of a (*MARK) setting. Each thread 784 must provide its own copy of this memory. 785 786 787PCRE2 CONTEXTS 788 789 Some PCRE2 functions have a lot of parameters, many of which are used 790 only by specialist applications, for example, those that use custom 791 memory management or non-standard character tables. 
       To keep function argument lists at a reasonable size, and at the same
       time to keep the API extensible, "uncommon" parameters are passed to
       certain functions in a context instead of directly. A context is just
       a block of memory that holds the parameter values. Applications that
       do not need to adjust any of the context parameters can pass NULL
       when a context pointer is required.

       There are three different types of context: a general context that is
       relevant for several PCRE2 operations, a compile-time context, and a
       match-time context.

   The general context

       At present, this context just contains pointers to (and data for)
       external memory management functions that are called from several
       places in the PCRE2 library. The context is named "general" rather
       than specifically "memory" because in future other fields may be
       added. If you do not want to supply your own custom memory management
       functions, you do not need to bother with a general context. A
       general context is created by:

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       The two function pointers specify custom memory management functions,
       whose prototypes are:

         void *private_malloc(PCRE2_SIZE, void *);
         void  private_free(void *, void *);

       Whenever code in PCRE2 calls these functions, the final argument is
       the value of memory_data. Either of the first two arguments of the
       creation function may be NULL, in which case the system memory
       management functions malloc() and free() are used. (This is not
       currently useful, as there are no other fields in a general context,
       but in future there might be.) The private_malloc() function is used
       (if supplied) to obtain memory for storing the context, and all three
       values are saved as part of the context.

       Whenever PCRE2 creates a data block of any kind, the block contains a
       pointer to the free() function that matches the malloc() function
       that was used. When the time comes to free the block, this function
       is called.

       A general context can be copied by calling:

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       The memory used for a general context should be freed by calling:

       void pcre2_general_context_free(pcre2_general_context *gcontext);

       If this function is passed a NULL argument, it returns immediately
       without doing anything.

   The compile context

       A compile context is required if you want to provide an external
       function for stack checking during compilation or to change the
       default values of any of the following compile-time parameters:

         What \R matches (Unicode newlines or CR, LF, CRLF only)
         PCRE2's character tables
         The newline character sequence
         The compile time nested parentheses limit
         The maximum length of the pattern string
         The extra options bits (none set by default)

       A compile context is also required if you are using custom memory
       management. If none of these apply, just pass NULL as the context
       argument of pcre2_compile().

       A compile context is created, copied, and freed by the following
       functions:

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       A compile context is created with default values for its parameters.
       These can be changed by calling the following functions, which return
       0 on success, or PCRE2_ERROR_BADDATA if invalid data is detected.

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only
       CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any
       Unicode line ending sequence. The value is used by the JIT compiler
       and by the two interpreted matching functions, pcre2_match() and
       pcre2_dfa_match().

       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
         const uint8_t *tables);

       The value must be the result of a call to pcre2_maketables(), whose
       only argument is a general context. This function builds a set of
       character tables in the current locale.

       int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
         uint32_t extra_options);

       As PCRE2 has developed, almost all the 32 option bits that are
       available in the options argument of pcre2_compile() have been used
       up. To avoid running out, the compile context contains a set of extra
       option bits which are used for some newer, assumed rarer, options.
       This function sets those bits. It always sets all the bits (either on
       or off): the new value replaces the existing setting rather than
       being combined with it. The available options are defined in the
       section entitled "Extra compile options" below.

       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
         PCRE2_SIZE value);

       This sets a maximum length, in code units, for any pattern string
       that is compiled with this context. If the pattern is longer, an
       error is generated. This facility is provided so that applications
       that accept patterns from external sources can limit their size. The
       default is the largest number that a PCRE2_SIZE variable can hold,
       which is effectively unlimited.

       int pcre2_set_max_pattern_compiled_length(
         pcre2_compile_context *ccontext, PCRE2_SIZE value);

       This sets a maximum size, in bytes, for the memory needed to hold the
       compiled version of a pattern that is compiled with this context. If
       the pattern needs more memory, an error is generated. This facility
       is provided so that applications that accept patterns from external
       sources can limit the amount of memory they use. The default is the
       largest number that a PCRE2_SIZE variable can hold, which is
       effectively unlimited.

       int pcre2_set_max_varlookbehind(pcre2_compile_context *ccontext,
         uint32_t value);

       This sets a maximum length for the number of characters matched by a
       variable-length lookbehind assertion. The default is set when PCRE2
       is built, with the ultimate default being 255, the same as Perl.
       Lookbehind assertions without a bounding length are not supported.

       int pcre2_set_newline(pcre2_compile_context *ccontext,
         uint32_t value);

       This specifies which characters or character sequences are to be
       recognized as newlines. The value must be one of PCRE2_NEWLINE_CR
       (carriage return only), PCRE2_NEWLINE_LF (linefeed only),
       PCRE2_NEWLINE_CRLF (the two-character sequence CR followed by LF),
       PCRE2_NEWLINE_ANYCRLF (any of the above), PCRE2_NEWLINE_ANY (any
       Unicode newline sequence), or PCRE2_NEWLINE_NUL (the NUL character,
       that is a binary zero).

       A pattern can override the value set in the compile context by
       starting with a sequence such as (*CRLF). See the pcre2pattern page
       for details.

       When a pattern is compiled with the PCRE2_EXTENDED or
       PCRE2_EXTENDED_MORE option, the newline convention affects the
       recognition of the end of internal comments starting with #.
       The value is saved with the compiled pattern for subsequent use by
       the JIT compiler and by the two interpreted matching functions,
       pcre2_match() and pcre2_dfa_match().

       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
         uint32_t value);

       This parameter adjusts the limit, set when PCRE2 is built (default
       250), on the depth of parenthesis nesting in a pattern. This limit
       stops rogue patterns using up too much system stack when being
       compiled. The limit applies to parentheses of all kinds, not just
       capturing parentheses.

       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
         int (*guard_function)(uint32_t, void *), void *user_data);

       There is at least one application that runs PCRE2 in threads with
       very limited system stack, where running out of stack is to be
       avoided at all costs. The parenthesis limit above cannot take account
       of how much stack is actually available during compilation. For a
       finer control, you can supply a function that is called whenever
       pcre2_compile() starts to compile a parenthesized part of a pattern.
       This function can check the actual stack size (or anything else that
       it wants to, of course).

       The first argument to the callout function gives the current depth of
       nesting, and the second is user data that is set up by the last
       argument of pcre2_set_compile_recursion_guard(). The callout function
       should return zero if all is well, or non-zero to force an error.

   The match context

       A match context is required if you want to:

         Set up a callout function
         Set an offset limit for matching an unanchored pattern
         Change the limit on the amount of heap used when matching
         Change the backtracking match limit
         Change the backtracking depth limit
         Set custom memory management specifically for the match

       If none of these apply, just pass NULL as the context argument of
       pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().

       A match context is created, copied, and freed by the following
       functions:

       pcre2_match_context *pcre2_match_context_create(
         pcre2_general_context *gcontext);

       pcre2_match_context *pcre2_match_context_copy(
         pcre2_match_context *mcontext);

       void pcre2_match_context_free(pcre2_match_context *mcontext);

       A match context is created with default values for its parameters.
       These can be changed by calling the following functions, which return
       0 on success, or PCRE2_ERROR_BADDATA if invalid data is detected.

       int pcre2_set_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_callout_block *, void *),
         void *callout_data);

       This sets up a callout function for PCRE2 to call at specified points
       during a matching operation. Details are given in the pcre2callout
       documentation.

       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_substitute_callout_block *, void *),
         void *callout_data);

       This sets up a callout function for PCRE2 to call after each
       substitution made by pcre2_substitute(). Details are given in the
       section entitled "Creating a new string with substitutions" below.

       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
         PCRE2_SIZE value);

       The offset_limit parameter limits how far an unanchored search can
       advance in the subject string. The default value is PCRE2_UNSET. The
       pcre2_match() and pcre2_dfa_match() functions return
       PCRE2_ERROR_NOMATCH if a match with a starting point before or at the
       given offset is not found. The pcre2_substitute() function makes no
       more substitutions.

       For example, if the pattern /abc/ is matched against "123abc" with an
       offset limit less than 3, the result is PCRE2_ERROR_NOMATCH. A match
       can never be found if the startoffset argument of pcre2_match(),
       pcre2_dfa_match(), or pcre2_substitute() is greater than the offset
       limit set in the match context.

       When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT
       option when calling pcre2_compile() so that when JIT is in use,
       different code can be compiled. If a match is started with a
       non-default offset limit when PCRE2_USE_OFFSET_LIMIT is not set, an
       error is generated.

       The offset limit facility can be used to track progress when
       searching large subject strings or to limit the extent of global
       substitutions. See also the PCRE2_FIRSTLINE option, which requires a
       match to start before or at the first newline that follows the start
       of matching in the subject. If this is set with an offset limit, a
       match must occur in the first line and also within the offset limit.
       In other words, whichever limit comes first is used.

       int pcre2_set_heap_limit(pcre2_match_context *mcontext,
         uint32_t value);

       The heap_limit parameter specifies, in units of kibibytes (1024
       bytes), the maximum amount of heap memory that pcre2_match() may use
       to hold backtracking information when running an interpretive match.
       This limit also applies to pcre2_dfa_match(), which may use the heap
       when processing patterns with a lot of nested pattern recursion or
       lookarounds or atomic groups. This limit does not apply to matching
       with the JIT optimization, which has its own memory control
       arrangements (see the pcre2jit documentation for more details). If
       the limit is reached, the negative error code PCRE2_ERROR_HEAPLIMIT
       is returned. The default limit can be set when PCRE2 is built; if it
       is not, the default is set very large and is essentially unlimited.

       A value for the heap limit may also be supplied by an item at the
       start of a pattern of the form

         (*LIMIT_HEAP=ddd)

       where ddd is a decimal number. However, such a setting is ignored
       unless ddd is less than the limit set by the caller of pcre2_match()
       or, if no such limit is set, less than the default.

       The pcre2_match() function always needs some heap memory, so setting
       a value of zero guarantees a "heap limit exceeded" error. Details of
       how pcre2_match() uses the heap are given in the pcre2perform
       documentation.

       For pcre2_dfa_match(), a vector on the system stack is used when
       processing pattern recursions, lookarounds, or atomic groups, and
       only if this is not big enough is heap memory used. In this case,
       setting a value of zero disables the use of the heap.

       int pcre2_set_match_limit(pcre2_match_context *mcontext,
         uint32_t value);

       The match_limit parameter provides a means of preventing PCRE2 from
       using up too many computing resources when processing patterns that
       are not going to match, but which have a very large number of
       possibilities in their search trees. The classic example is a pattern
       that uses nested unlimited repeats.

       There is an internal counter in pcre2_match() that is incremented
       each time round its main matching loop.
       If this value reaches the match limit, pcre2_match() returns the
       negative value PCRE2_ERROR_MATCHLIMIT. This has the effect of
       limiting the amount of backtracking that can take place. For patterns
       that are not anchored, the count restarts from zero for each position
       in the subject string. This limit also applies to pcre2_dfa_match(),
       though the counting is done in a different way.

       When pcre2_match() is called with a pattern that was successfully
       processed by pcre2_jit_compile(), the way in which matching is
       executed is entirely different. However, there is still the
       possibility of runaway matching that goes on for a very long time,
       and so the match_limit value is also used in this case (but in a
       different way) to limit how long the matching can continue.

       The default value for the limit can be set when PCRE2 is built; the
       default is 10 million, which handles all but the most extreme cases.
       A value for the match limit may also be supplied by an item at the
       start of a pattern of the form

         (*LIMIT_MATCH=ddd)

       where ddd is a decimal number. However, such a setting is ignored
       unless ddd is less than the limit set by the caller of pcre2_match()
       or pcre2_dfa_match() or, if no such limit is set, less than the
       default.

       int pcre2_set_depth_limit(pcre2_match_context *mcontext,
         uint32_t value);

       This parameter limits the depth of nested backtracking in
       pcre2_match(). Each time a nested backtracking point is passed, a new
       memory frame is used to remember the state of matching at that point.
       Thus, this parameter indirectly limits the amount of memory that is
       used in a match. However, because the size of each memory frame
       depends on the number of capturing parentheses, the actual memory
       limit varies from pattern to pattern.
       This limit was more useful in versions before 10.30, where function
       recursion was used for backtracking.

       The depth limit is not relevant, and is ignored, when matching is
       done using JIT compiled code. However, it is supported by
       pcre2_dfa_match(), which uses it to limit the depth of nested
       internal recursive function calls that implement atomic groups,
       lookaround assertions, and pattern recursions. This limits,
       indirectly, the amount of system stack that is used. It was more
       useful in versions before 10.32, when stack memory was used for local
       workspace vectors for recursive function calls. From version 10.32,
       only local variables are allocated on the stack and as each call uses
       only a few hundred bytes, even a small stack can support quite a lot
       of recursion.

       If the depth of internal recursive function calls is great enough,
       local workspace vectors are allocated on the heap from version 10.32
       onwards, so the depth limit also indirectly limits the amount of heap
       memory that is used. A recursive pattern such as /(.(?2))((?1)|)/,
       when matched to a very long string using pcre2_dfa_match(), can use a
       great deal of memory. However, it is probably better to limit heap
       usage directly by calling pcre2_set_heap_limit().

       The default value for the depth limit can be set when PCRE2 is built;
       if it is not, the default is set to the same value as the default for
       the match limit. If the limit is exceeded, pcre2_match() or
       pcre2_dfa_match() returns PCRE2_ERROR_DEPTHLIMIT. A value for the
       depth limit may also be supplied by an item at the start of a pattern
       of the form

         (*LIMIT_DEPTH=ddd)

       where ddd is a decimal number. However, such a setting is ignored
       unless ddd is less than the limit set by the caller of pcre2_match()
       or pcre2_dfa_match() or, if no such limit is set, less than the
       default.


CHECKING BUILD-TIME OPTIONS

       int pcre2_config(uint32_t what, void *where);

       The function pcre2_config() makes it possible for a PCRE2 client to
       find the value of certain configuration parameters and to discover
       which optional features have been compiled into the PCRE2 library.
       The pcre2build documentation has more details about these features.

       The first argument for pcre2_config() specifies which information is
       required. The second argument is a pointer to memory into which the
       information is placed. If NULL is passed, the function returns the
       amount of memory that is needed for the requested information. For
       calls that return numerical values, the value is in bytes; when
       requesting these values, where should point to appropriately aligned
       memory. For calls that return strings, the required length is given
       in code units, not counting the terminating zero.

       When requesting information, the returned value from pcre2_config()
       is non-negative on success, or the negative error code
       PCRE2_ERROR_BADOPTION if the value in the first argument is not
       recognized. The following information is available:

         PCRE2_CONFIG_BSR

       The output is a uint32_t integer whose value indicates what character
       sequences the \R escape sequence matches by default. A value of
       PCRE2_BSR_UNICODE means that \R matches any Unicode line ending
       sequence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR,
       LF, or CRLF. The default can be overridden when a pattern is
       compiled.

         PCRE2_CONFIG_COMPILED_WIDTHS

       The output is a uint32_t integer whose lower bits indicate which code
       unit widths were selected when PCRE2 was built. The 1-bit indicates
       8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit
       support, respectively.

         PCRE2_CONFIG_DEPTHLIMIT

       The output is a uint32_t integer that gives the default limit for the
       depth of nested backtracking in pcre2_match() or the depth of nested
       recursions, lookarounds, and atomic groups in pcre2_dfa_match().
       Further details are given with pcre2_set_depth_limit() above.

         PCRE2_CONFIG_HEAPLIMIT

       The output is a uint32_t integer that gives, in kibibytes, the
       default limit for the amount of heap memory used by pcre2_match() or
       pcre2_dfa_match(). Further details are given with
       pcre2_set_heap_limit() above.

         PCRE2_CONFIG_JIT

       The output is a uint32_t integer that is set to one if support for
       just-in-time compiling is included in the library; otherwise it is
       set to zero. Note that having the support in the library does not
       guarantee that JIT will be used for any given match. See the pcre2jit
       documentation for more details.

         PCRE2_CONFIG_JITTARGET

       The where argument should point to a buffer that is at least 48 code
       units long. (The exact length required can be found by calling
       pcre2_config() with where set to NULL.) The buffer is filled with a
       string that contains the name of the architecture for which the JIT
       compiler is configured, for example "x86 32bit (little endian +
       unaligned)". If JIT support is not available, PCRE2_ERROR_BADOPTION
       is returned, otherwise the number of code units used is returned.
       This is the length of the string, plus one unit for the terminating
       zero.

         PCRE2_CONFIG_LINKSIZE

       The output is a uint32_t integer that contains the number of bytes
       used for internal linkage in compiled regular expressions. When PCRE2
       is configured, the value can be set to 2, 3, or 4, with the default
       being 2. This is the value that is returned by pcre2_config().
       However, when the 16-bit library is compiled, a value of 3 is rounded
       up to 4, and when the 32-bit library is compiled, internal linkages
       always use 4 bytes, so the configured value is not relevant.

       The default value of 2 for the 8-bit and 16-bit libraries is
       sufficient for all but the most massive patterns, since it allows the
       size of the compiled pattern to be up to 65535 code units. Larger
       values allow larger regular expressions to be compiled by those two
       libraries, but at the expense of slower matching.

         PCRE2_CONFIG_MATCHLIMIT

       The output is a uint32_t integer that gives the default match limit
       for pcre2_match(). Further details are given with
       pcre2_set_match_limit() above.

         PCRE2_CONFIG_NEWLINE

       The output is a uint32_t integer whose value specifies the default
       character sequence that is recognized as meaning "newline". The
       values are:

         PCRE2_NEWLINE_CR        Carriage return (CR)
         PCRE2_NEWLINE_LF        Linefeed (LF)
         PCRE2_NEWLINE_CRLF      Carriage return, linefeed (CRLF)
         PCRE2_NEWLINE_ANY       Any Unicode line ending
         PCRE2_NEWLINE_ANYCRLF   Any of CR, LF, or CRLF
         PCRE2_NEWLINE_NUL       The NUL character (binary zero)

       The default should normally correspond to the standard sequence for
       your operating system.

         PCRE2_CONFIG_NEVER_BACKSLASH_C

       The output is a uint32_t integer that is set to one if the use of \C
       was permanently disabled when PCRE2 was built; otherwise it is set to
       zero.

         PCRE2_CONFIG_PARENSLIMIT

       The output is a uint32_t integer that gives the maximum depth of
       nesting of parentheses (of any kind) in a pattern. This limit is
       imposed to cap the amount of system stack used when a pattern is
       compiled. It is specified when PCRE2 is built; the default is 250.
       This limit does not take into account the stack that may already be
       used by the calling application.
       For finer control over compilation stack usage, see
       pcre2_set_compile_recursion_guard().

         PCRE2_CONFIG_STACKRECURSE

       This parameter is obsolete and should not be used in new code. The
       output is a uint32_t integer that is always set to zero.

         PCRE2_CONFIG_TABLES_LENGTH

       The output is a uint32_t integer that gives the length of PCRE2's
       character processing tables in bytes. For details of these tables see
       the section on locale support below.

         PCRE2_CONFIG_UNICODE_VERSION

       The where argument should point to a buffer that is at least 24 code
       units long. (The exact length required can be found by calling
       pcre2_config() with where set to NULL.) If PCRE2 has been compiled
       without Unicode support, the buffer is filled with the text "Unicode
       not supported". Otherwise, the Unicode version string (for example,
       "8.0.0") is inserted. The number of code units used is returned. This
       is the length of the string plus one unit for the terminating zero.

         PCRE2_CONFIG_UNICODE

       The output is a uint32_t integer that is set to one if Unicode
       support is available; otherwise it is set to zero. Unicode support
       implies UTF support.

         PCRE2_CONFIG_VERSION

       The where argument should point to a buffer that is at least 24 code
       units long. (The exact length required can be found by calling
       pcre2_config() with where set to NULL.) The buffer is filled with the
       PCRE2 version string, zero-terminated. The number of code units used
       is returned. This is the length of the string plus one unit for the
       terminating zero.


COMPILING A PATTERN

       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
         pcre2_compile_context *ccontext);

       void pcre2_code_free(pcre2_code *code);

       pcre2_code *pcre2_code_copy(const pcre2_code *code);

       pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);

       The pcre2_compile() function compiles a pattern into an internal
       form. The pattern is defined by a pointer to a string of code units
       and a length in code units. If the pattern is zero-terminated, the
       length can be specified as PCRE2_ZERO_TERMINATED. A NULL pattern
       pointer with a length of zero is treated as an empty string (NULL
       with a non-zero length causes an error return). The function returns
       a pointer to a block of memory that contains the compiled pattern and
       related data, or NULL if an error occurred.

       If the compile context argument ccontext is NULL, memory for the
       compiled pattern is obtained by calling malloc(). Otherwise, it is
       obtained from the same memory function that was used for the compile
       context. The caller must free the memory by calling pcre2_code_free()
       when it is no longer needed. If pcre2_code_free() is called with a
       NULL argument, it returns immediately, without doing anything.

       The function pcre2_code_copy() makes a copy of the compiled code in
       new memory, using the same memory allocator as was used for the
       original. However, if the code has been processed by the JIT compiler
       (see below), the JIT information cannot be copied (because it is
       position-dependent). The new copy can initially be used only for
       non-JIT matching, though it can be passed to pcre2_jit_compile() if
       required. If pcre2_code_copy() is called with a NULL argument, it
       returns NULL.

       The pcre2_code_copy() function provides a way for individual threads
       in a multithreaded application to acquire a private copy of shared
       compiled code. However, it does not make a copy of the character
       tables used by the compiled pattern; the new pattern code points to
       the same tables as the original code. (See "Locale Support" below for
       details of these character tables.) In many applications the same
       tables are used throughout, so this behaviour is appropriate.
       Nevertheless, there are occasions when a copy of a compiled pattern
       and the relevant tables are needed. The pcre2_code_copy_with_tables()
       function provides this facility. Copies of both the code and the
       tables are made, with the new code pointing to the new tables. The
       memory for the new tables is automatically freed when
       pcre2_code_free() is called for the new copy of the compiled code. If
       pcre2_code_copy_with_tables() is called with a NULL argument, it
       returns NULL.

       NOTE: When one of the matching functions is called, pointers to the
       compiled pattern and the subject string are set in the match data
       block so that they can be referenced by the substring extraction
       functions after a successful match. After running a match, you must
       not free a compiled pattern or a subject string until after all
       operations on the match data block have taken place, unless, in the
       case of the subject string, you have used the
       PCRE2_COPY_MATCHED_SUBJECT option, which is described in the section
       entitled "Option bits for pcre2_match()" below.

       The options argument for pcre2_compile() contains various bit
       settings that affect the compilation. It should be zero if none of
       them are required. The available options are described below.
       Some of them (in particular, those that are compatible with Perl, but
       some others as well) can also be set and unset from within the
       pattern (see the detailed description in the pcre2pattern
       documentation).

       For those options that can be different in different parts of the
       pattern, the contents of the options argument specifies their
       settings at the start of compilation. The PCRE2_ANCHORED,
       PCRE2_ENDANCHORED, and PCRE2_NO_UTF_CHECK options can be set at the
       time of matching as well as at compile time.

       Some additional options and less frequently required compile-time
       parameters (for example, the newline setting) can be provided in a
       compile context (as described above).

       If errorcode or erroroffset is NULL, pcre2_compile() returns NULL
       immediately. Otherwise, the variables to which these point are set to
       an error code and an offset (number of code units) within the
       pattern, respectively, when pcre2_compile() returns NULL because a
       compilation error has occurred.

       There are nearly 100 positive error codes that pcre2_compile() may
       return if it finds an error in the pattern. There are also some
       negative error codes that are used for invalid UTF strings when
       validity checking is in force. These are the same as given by
       pcre2_match() and pcre2_dfa_match(), and are described in the
       pcre2unicode documentation. There is no separate documentation for
       the positive error codes, because the textual error messages that are
       obtained by calling the pcre2_get_error_message() function (see
       "Obtaining a textual error message" below) should be
       self-explanatory. Macro names starting with PCRE2_ERROR_ are defined
       for both positive and negative error codes in pcre2.h. When
       compilation is successful errorcode is set to a value that returns
       the message "no error" if passed to pcre2_get_error_message().

       The value returned in erroroffset is an indication of where in the
       pattern an error occurred. When there is no error, zero is returned. A
       non-zero value is not necessarily the furthest point in the pattern
       that was read. For example, after the error "lookbehind assertion is
       not fixed length", the error offset points to the start of the failing
       assertion. For an invalid UTF-8 or UTF-16 string, the offset is that of
       the first code unit of the failing character.

       Some errors are not detected until the whole pattern has been scanned;
       in these cases, the offset passed back is the length of the pattern.
       Note that the offset is in code units, not characters, even in a UTF
       mode. It may sometimes point into the middle of a UTF-8 or UTF-16
       character.

       This code fragment shows a typical straightforward call to
       pcre2_compile():

         pcre2_code *re;
         PCRE2_SIZE erroffset;
         int errorcode;
         re = pcre2_compile(
           "^A.*Z",                /* the pattern */
           PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
           0,                      /* default options */
           &errorcode,             /* for error code */
           &erroffset,             /* for error offset */
           NULL);                  /* no compile context */


   Main compile options

       The following names for option bits are defined in the pcre2.h header
       file:

         PCRE2_ANCHORED

       If this bit is set, the pattern is forced to be "anchored", that is, it
       is constrained to match only at the first matching point in the string
       that is being searched (the "subject string"). This effect can also be
       achieved by appropriate constructs in the pattern itself, which is the
       only way to do it in Perl.

         PCRE2_ALLOW_EMPTY_CLASS

       By default, for compatibility with Perl, a closing square bracket that
       immediately follows an opening one is treated as a data character for
       the class.
       When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the class, which
       therefore contains no characters and so can never match.

         PCRE2_ALT_BSUX

       This option requests alternative handling of three escape sequences,
       which makes PCRE2's behaviour more like ECMAScript (aka JavaScript).
       When it is set:

       (1) \U matches an upper case "U" character; by default \U causes a
       compile time error (Perl uses \U to upper case subsequent characters).

       (2) \u matches a lower case "u" character unless it is followed by four
       hexadecimal digits, in which case the hexadecimal number defines the
       code point to match. By default, \u causes a compile time error (Perl
       uses it to upper case the following character).

       (3) \x matches a lower case "x" character unless it is followed by two
       hexadecimal digits, in which case the hexadecimal number defines the
       code point to match. By default, as in Perl, a hexadecimal number is
       always expected after \x, but it may have zero, one, or two digits (so,
       for example, \xz matches a binary zero character followed by z).

       ECMAScript 6 added additional functionality to \u. This can be accessed
       using the PCRE2_EXTRA_ALT_BSUX extra option (see "Extra compile
       options" below). Note that this alternative escape handling applies
       only to patterns. Neither of these options affects the processing of
       replacement strings passed to pcre2_substitute().

         PCRE2_ALT_CIRCUMFLEX

       In multiline mode (when PCRE2_MULTILINE is set), the circumflex
       metacharacter matches at the start of the subject (unless PCRE2_NOTBOL
       is set), and also after any internal newline. However, it does not
       match after a newline at the end of the subject, for compatibility with
       Perl. If you want a multiline circumflex also to match after a
       terminating newline, you must set PCRE2_ALT_CIRCUMFLEX.

         PCRE2_ALT_VERBNAMES

       By default, for compatibility with Perl, the name in any verb sequence
       such as (*MARK:NAME) is any sequence of characters that does not
       include a closing parenthesis. The name is not processed in any way,
       and it is not possible to include a closing parenthesis in the name.
       However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash
       processing is applied to verb names and only an unescaped closing
       parenthesis terminates the name. A closing parenthesis can be included
       in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED or
       PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
       whitespace in verb names is skipped and #-comments are recognized,
       exactly as in the rest of the pattern.

         PCRE2_AUTO_CALLOUT

       If this bit is set, pcre2_compile() automatically inserts callout
       items, all with number 255, before each pattern item, except
       immediately before or after an explicit callout in the pattern. For
       discussion of the callout facility, see the pcre2callout documentation.

         PCRE2_CASELESS

       If this bit is set, letters in the pattern match both upper and lower
       case letters in the subject. It is equivalent to Perl's /i option, and
       it can be changed within a pattern by a (?i) option setting. If either
       PCRE2_UTF or PCRE2_UCP is set, Unicode properties are used for all
       characters with more than one other case, and for all characters whose
       code points are greater than U+007F. Note that there are two ASCII
       characters, K and S, that, in addition to their lower case ASCII
       equivalents, are case-equivalent with U+212A (Kelvin sign) and U+017F
       (long S) respectively. If you do not want this case equivalence, you
       can suppress it by setting PCRE2_EXTRA_CASELESS_RESTRICT.

       For lower valued characters with only one other case, a lookup table is
       used for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup
       table is used for all code points less than 256, and higher code points
       (available only in 16-bit or 32-bit mode) are treated as not having
       another case.

         PCRE2_DOLLAR_ENDONLY

       If this bit is set, a dollar metacharacter in the pattern matches only
       at the end of the subject string. Without this option, a dollar also
       matches immediately before a newline at the end of the string (but not
       before any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
       if PCRE2_MULTILINE is set. There is no equivalent to this option in
       Perl, and no way to set it within a pattern.

         PCRE2_DOTALL

       If this bit is set, a dot metacharacter in the pattern matches any
       character, including one that indicates a newline. However, it only
       ever matches one character, even if newlines are coded as CRLF. Without
       this option, a dot does not match when the current position in the
       subject is at a newline. This option is equivalent to Perl's /s option,
       and it can be changed within a pattern by a (?s) option setting. A
       negative class such as [^a] always matches newline characters, and the
       \N escape sequence always matches a non-newline character, independent
       of the setting of PCRE2_DOTALL.

         PCRE2_DUPNAMES

       If this bit is set, names used to identify capture groups need not be
       unique. This can be helpful for certain types of pattern when it is
       known that only one instance of the named group can ever be matched.
       There are more details of named capture groups below; see also the
       pcre2pattern documentation.

         PCRE2_ENDANCHORED

       If this bit is set, the end of any pattern match must be right at the
       end of the string being searched (the "subject string").
       If the pattern match succeeds by reaching (*ACCEPT), but does not reach
       the end of the subject, the match fails at the current starting point.
       For unanchored patterns, a new match is then tried at the next starting
       point. However, if the match succeeds by reaching the end of the
       pattern, but not the end of the subject, backtracking occurs and an
       alternative match may be found. Consider these two patterns:

         .(*ACCEPT)|..
         .|..

       If matched against "abc" with PCRE2_ENDANCHORED set, the first matches
       "c" whereas the second matches "bc". The effect of PCRE2_ENDANCHORED
       can also be achieved by appropriate constructs in the pattern itself,
       which is the only way to do it in Perl.

       For DFA matching with pcre2_dfa_match(), PCRE2_ENDANCHORED applies only
       to the first (that is, the longest) matched string. Other parallel
       matches, which are necessarily substrings of the first one, must
       obviously end before the end of the subject.

         PCRE2_EXTENDED

       If this bit is set, most white space characters in the pattern are
       totally ignored except when escaped, inside a character class, or
       inside a \Q...\E sequence. However, white space is not allowed within
       sequences such as (?> that introduce various parenthesized groups, nor
       within numerical quantifiers such as {1,3}. Ignorable white space is
       permitted between an item and a following quantifier and between a
       quantifier and a following + that indicates possessiveness.
       PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be changed
       within a pattern by a (?x) option setting.

       When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED
       recognizes as white space only those characters with code points less
       than 256 that are flagged as white space in its low-character table.
       The table is normally created by pcre2_maketables(), which uses the
       isspace() function to identify space characters. In most ASCII
       environments, the relevant characters are those with code points 0x0009
       (tab), 0x000A (linefeed), 0x000B (vertical tab), 0x000C (formfeed),
       0x000D (carriage return), and 0x0020 (space).

       When PCRE2 is compiled with Unicode support, in addition to these
       characters, five more Unicode "Pattern White Space" characters are
       recognized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E
       (left-to-right mark), U+200F (right-to-left mark), U+2028 (line
       separator), and U+2029 (paragraph separator). This set of characters is
       the same as recognized by Perl's /x option. Note that the horizontal
       and vertical space characters that are matched by the \h and \v escapes
       in patterns are a much bigger set.

       As well as ignoring most white space, PCRE2_EXTENDED also causes
       characters between an unescaped # outside a character class and the
       next newline, inclusive, to be ignored, which makes it possible to
       include comments inside complicated patterns. Note that the end of this
       type of comment is a literal newline sequence in the pattern; escape
       sequences that happen to represent a newline do not count.

       Which characters are interpreted as newlines can be specified by a
       setting in the compile context that is passed to pcre2_compile() or by
       a special sequence at the start of the pattern, as described in the
       section entitled "Newline conventions" in the pcre2pattern
       documentation. A default is defined when PCRE2 is built.

         PCRE2_EXTENDED_MORE

       This option has the effect of PCRE2_EXTENDED, but, in addition,
       unescaped space and horizontal tab characters are ignored inside a
       character class.
       Note: only these two characters are ignored, not the full set of
       pattern white space characters that are ignored outside a character
       class. PCRE2_EXTENDED_MORE is equivalent to Perl's /xx option, and it
       can be changed within a pattern by a (?xx) option setting.

         PCRE2_FIRSTLINE

       If this option is set, the start of an unanchored pattern match must be
       before or at the first newline in the subject string following the
       start of matching, though the matched text may continue over the
       newline. If startoffset is non-zero, the limiting newline is not
       necessarily the first newline in the subject. For example, if the
       subject string is "abc\nxyz" (where \n represents a single-character
       newline) a pattern match for "yz" succeeds with PCRE2_FIRSTLINE if
       startoffset is greater than 3. See also PCRE2_USE_OFFSET_LIMIT, which
       provides a more general limiting facility. If PCRE2_FIRSTLINE is set
       with an offset limit, a match must occur in the first line and also
       within the offset limit. In other words, whichever limit comes first is
       used. This option has no effect for anchored patterns.

         PCRE2_LITERAL

       If this option is set, all meta-characters in the pattern are disabled,
       and it is treated as a literal string. Matching literal strings with a
       regular expression engine is not the most efficient way of doing it. If
       you are doing a lot of literal matching and are worried about
       efficiency, you should consider using other approaches. The only other
       main options that are allowed with PCRE2_LITERAL are: PCRE2_ANCHORED,
       PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE,
       PCRE2_MATCH_INVALID_UTF, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK,
       PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options
       PCRE2_EXTRA_MATCH_LINE and PCRE2_EXTRA_MATCH_WORD are also supported.
       Any other options cause an error.

         PCRE2_MATCH_INVALID_UTF

       This option forces PCRE2_UTF (see below) and also enables support for
       matching by pcre2_match() in subject strings that contain invalid UTF
       sequences. Note, however, that the 16-bit and 32-bit PCRE2 libraries
       process strings as sequences of uint16_t or uint32_t code points. They
       cannot find valid UTF sequences within an arbitrary string of bytes
       unless such sequences are suitably aligned. This facility is not
       supported for DFA matching. For details, see the pcre2unicode
       documentation.

         PCRE2_MATCH_UNSET_BACKREF

       If this option is set, a backreference to an unset capture group
       matches an empty string (by default this causes the current matching
       alternative to fail). A pattern such as (\1)(a) succeeds when this
       option is set (assuming it can find an "a" in the subject), whereas it
       fails by default, for Perl compatibility. Setting this option makes
       PCRE2 behave more like ECMAScript (aka JavaScript).

         PCRE2_MULTILINE

       By default, for the purposes of matching "start of line" and "end of
       line", PCRE2 treats the subject string as consisting of a single line
       of characters, even if it actually contains newlines. The "start of
       line" metacharacter (^) matches only at the start of the string, and
       the "end of line" metacharacter ($) matches only at the end of the
       string, or before a terminating newline (except when
       PCRE2_DOLLAR_ENDONLY is set). Note, however, that unless PCRE2_DOTALL
       is set, the "any character" metacharacter (.) does not match at a
       newline. This behaviour (for ^, $, and dot) is the same as Perl.

       When PCRE2_MULTILINE is set, the "start of line" and "end of line"
       constructs match immediately following or immediately before internal
       newlines in the subject string, respectively, as well as at the very
       start and end.
       This is equivalent to Perl's /m option, and it can be changed within a
       pattern by a (?m) option setting. Note that the "start of line"
       metacharacter does not match after a newline at the end of the subject,
       for compatibility with Perl. However, you can change this by setting
       the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a subject
       string, or no occurrences of ^ or $ in a pattern, setting
       PCRE2_MULTILINE has no effect.

         PCRE2_NEVER_BACKSLASH_C

       This option locks out the use of \C in the pattern that is being
       compiled. This escape can cause unpredictable behaviour in UTF-8 or
       UTF-16 modes, because it may leave the current matching point in the
       middle of a multi-code-unit character. This option may be useful in
       applications that process patterns from external sources. Note that
       there is also a build-time option that permanently locks out the use of
       \C.

         PCRE2_NEVER_UCP

       This option locks out the use of Unicode properties for handling \B,
       \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
       described for the PCRE2_UCP option below. In particular, it prevents
       the creator of the pattern from enabling this facility by starting the
       pattern with (*UCP). This option may be useful in applications that
       process patterns from external sources. The option combination
       PCRE2_UCP and PCRE2_NEVER_UCP causes an error.

         PCRE2_NEVER_UTF

       This option locks out interpretation of the pattern as UTF-8, UTF-16,
       or UTF-32, depending on which library is in use. In particular, it
       prevents the creator of the pattern from switching to UTF
       interpretation by starting the pattern with (*UTF). This option may be
       useful in applications that process patterns from external sources. The
       combination of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.

         PCRE2_NO_AUTO_CAPTURE

       If this option is set, it disables the use of numbered capturing
       parentheses in the pattern. Any opening parenthesis that is not
       followed by ? behaves as if it were followed by ?: but named
       parentheses can still be used for capturing (and they acquire numbers
       in the usual way). This is the same as Perl's /n option. Note that,
       when this option is set, references to capture groups (backreferences
       or recursion/subroutine calls) may only refer to named groups, though
       the reference can be by name or by number.

         PCRE2_NO_AUTO_POSSESS

       If this option is set, it disables "auto-possessification", which is an
       optimization that, for example, turns a+b into a++b in order to avoid
       backtracks into a+ that can never be successful. However, if callouts
       are in use, auto-possessification means that some callouts are never
       taken. You can set this option if you want the matching functions to do
       a full unoptimized search and run all the callouts, but it is mainly
       provided for testing purposes.

         PCRE2_NO_DOTSTAR_ANCHOR

       If this option is set, it disables an optimization that is applied when
       .* is the first significant item in a top-level branch of a pattern,
       and all the other branches also start with .* or with \A or \G or ^.
       The optimization is automatically disabled for .* if it is inside an
       atomic group or a capture group that is the subject of a backreference,
       or if the pattern contains (*PRUNE) or (*SKIP). When the optimization
       is not disabled, such a pattern is automatically anchored if
       PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
       for any ^ items. Otherwise, the fact that any match must start either
       at the start of the subject or following a newline is remembered. Like
       other optimizations, this can cause callouts to be skipped.

         PCRE2_NO_START_OPTIMIZE

       This is an option whose main effect is at matching time. It does not
       change what pcre2_compile() generates, but it does affect the output of
       the JIT compiler.

       There are a number of optimizations that may occur at the start of a
       match, in order to speed up the process. For example, if it is known
       that an unanchored match must start with a specific code unit value,
       the matching code searches the subject for that value, and fails
       immediately if it cannot find it, without actually running the main
       matching function. This means that a special item such as (*COMMIT) at
       the start of a pattern is not considered until after a suitable
       starting point for the match has been found. Also, when callouts or
       (*MARK) items are in use, these "start-up" optimizations can cause them
       to be skipped if the pattern is never actually used. The start-up
       optimizations are in effect a pre-scan of the subject that takes place
       before the pattern is run.

       The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
       possibly causing performance to suffer, but ensuring that in cases
       where the result is "no match", the callouts do occur, and that items
       such as (*COMMIT) and (*MARK) are considered at every possible starting
       position in the subject string.

       Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching
       operation. Consider the pattern

         (*COMMIT)ABC

       When this is compiled, PCRE2 records the fact that a match must start
       with the character "A". Suppose the subject string is "DEFABC". The
       start-up optimization scans along the subject, finds "A" and runs the
       first match attempt from there. The (*COMMIT) item means that the
       pattern must match the current starting position, which in this case,
       it does.
       However, if the same match is run with PCRE2_NO_START_OPTIMIZE set, the
       initial scan along the subject string does not happen. The first match
       attempt is run starting from "D" and when this fails, (*COMMIT)
       prevents any further matches being tried, so the overall result is "no
       match".

       Another start-up optimization makes use of a minimum length for a
       matching subject, which is recorded when possible. Consider the pattern

         (*MARK:1)B(*MARK:2)(X|Y)

       The minimum length for a match is two characters. If the subject is
       "XXBB", the "starting character" optimization skips "XX", then tries to
       match "BB", which is long enough. In the process, (*MARK:2) is
       encountered and remembered. When the match attempt fails, the next "B"
       is found, but there is only one character left, so there are no more
       attempts, and "no match" is returned with the "last mark seen" set to
       "2". If PCRE2_NO_START_OPTIMIZE is set, however, matches are tried at
       every possible starting position, including at the end of the subject,
       where (*MARK:1) is encountered, but there is no "B", so the "last mark
       seen" that is returned is "1". In this case, the optimizations do not
       affect the overall match result, which is still "no match", but they do
       affect the auxiliary information that is returned.

         PCRE2_NO_UTF_CHECK

       When PCRE2_UTF is set, the validity of the pattern as a UTF string is
       automatically checked. There are discussions about the validity of
       UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode
       document. If an invalid UTF sequence is found, pcre2_compile() returns
       a negative error code.

       If you know that your pattern is a valid UTF string, and you want to
       skip this check for performance reasons, you can set the
       PCRE2_NO_UTF_CHECK option.
       When it is set, the effect of passing an invalid UTF string as a
       pattern is undefined. It may cause your program to crash or loop.

       Note that this option can also be passed to pcre2_match() and
       pcre2_dfa_match(), to suppress UTF validity checking of the subject
       string.

       Note also that setting PCRE2_NO_UTF_CHECK at compile time does not
       disable the error that is given if an escape sequence for an invalid
       Unicode code point is encountered in the pattern. In particular, the
       so-called "surrogate" code points (0xd800 to 0xdfff) are invalid. If
       you want to allow escape sequences such as \x{d800} you can set the
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option, as described in the
       section entitled "Extra compile options" below. However, this is
       possible only in UTF-8 and UTF-32 modes, because these values are not
       representable in UTF-16.

         PCRE2_UCP

       This option has two effects. Firstly, it changes the way PCRE2
       processes \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX
       character classes. By default, only ASCII characters are recognized,
       but if PCRE2_UCP is set, Unicode properties are used to classify
       characters. There are some PCRE2_EXTRA options (see below) that add
       finer control to this behaviour. More details are given in the section
       on generic character types in the pcre2pattern page.

       The second effect of PCRE2_UCP is to force the use of Unicode
       properties for upper/lower casing operations, even when PCRE2_UTF is
       not set. This makes it possible to process strings in the 16-bit UCS-2
       code. This option is available only if PCRE2 has been compiled with
       Unicode support (which is the default). The
       PCRE2_EXTRA_CASELESS_RESTRICT option (see below) restricts caseless
       matching such that ASCII characters match only ASCII characters and
       non-ASCII characters match only non-ASCII characters.

         PCRE2_UNGREEDY

       This option inverts the "greediness" of the quantifiers so that they
       are not greedy by default, but become greedy if followed by "?". It is
       not compatible with Perl. It can also be set by a (?U) option setting
       within the pattern.

         PCRE2_USE_OFFSET_LIMIT

       This option must be set for pcre2_compile() if pcre2_set_offset_limit()
       is going to be used to set a non-default offset limit in a match
       context for matches that use this pattern. An error is generated if an
       offset limit is set without this option. For more details, see the
       description of pcre2_set_offset_limit() in the section that describes
       match contexts. See also the PCRE2_FIRSTLINE option above.

         PCRE2_UTF

       This option causes PCRE2 to regard both the pattern and the subject
       strings that are subsequently processed as strings of UTF characters
       instead of single-code-unit strings. It is available when PCRE2 is
       built to include Unicode support (which is the default). If Unicode
       support is not available, the use of this option provokes an error.
       Details of how PCRE2_UTF changes the behaviour of PCRE2 are given in
       the pcre2unicode page. In particular, note that it changes the way
       PCRE2_CASELESS works.

   Extra compile options

       The option bits that can be set in a compile context by calling the
       pcre2_set_compile_extra_options() function are as follows:

         PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK

       Since release 10.38 PCRE2 has forbidden the use of \K within lookaround
       assertions, following Perl's lead. This option is provided to re-enable
       the previous behaviour (act in positive lookarounds, ignore in negative
       ones) in case anybody is relying on it.

         PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES

       This option applies when compiling a pattern in UTF-8 or UTF-32 mode.
       It is forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode
       "surrogate" code points in the range 0xd800 to 0xdfff are used in pairs
       in UTF-16 to encode code points with values in the range 0x10000 to
       0x10ffff. The surrogates cannot therefore be represented in UTF-16.
       They can be represented in UTF-8 and UTF-32, but are defined as invalid
       code points, and cause errors if encountered in a UTF-8 or UTF-32
       string that is being checked for validity by PCRE2.

       These values also cause errors if encountered in escape sequences such
       as \x{d912} within a pattern. However, it seems that some applications,
       when using PCRE2 to check for unwanted characters in UTF-8 strings,
       explicitly test for the surrogates using escape sequences. The
       PCRE2_NO_UTF_CHECK option does not disable the error that occurs,
       because it applies only to the testing of input strings for UTF
       validity.

       If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set,
       surrogate code point values in UTF-8 and UTF-32 patterns no longer
       provoke errors and are incorporated in the compiled pattern. However,
       they can only match subject characters if the matching function is
       called with PCRE2_NO_UTF_CHECK set.

         PCRE2_EXTRA_ALT_BSUX

       The original option PCRE2_ALT_BSUX causes PCRE2 to process \U, \u, and
       \x in the way that ECMAScript (aka JavaScript) does. Additional
       functionality was defined by ECMAScript 6; setting PCRE2_EXTRA_ALT_BSUX
       has the effect of PCRE2_ALT_BSUX, but in addition it recognizes
       \u{hhh..} as a hexadecimal character code, where hhh.. is any number of
       hexadecimal digits.

         PCRE2_EXTRA_ASCII_BSD

       This option forces \d to match only ASCII digits, even when PCRE2_UCP
       is set. It can be changed within a pattern by means of the (?aD)
       option setting.

         PCRE2_EXTRA_ASCII_BSS

       This option forces \s to match only ASCII space characters, even when
       PCRE2_UCP is set. It can be changed within a pattern by means of the
       (?aS) option setting.

         PCRE2_EXTRA_ASCII_BSW

       This option forces \w to match only ASCII word characters, even when
       PCRE2_UCP is set. It can be changed within a pattern by means of the
       (?aW) option setting.

         PCRE2_EXTRA_ASCII_DIGIT

       This option forces the POSIX character classes [:digit:] and [:xdigit:]
       to match only ASCII digits, even when PCRE2_UCP is set. It can be
       changed within a pattern by means of the (?aT) option setting.

         PCRE2_EXTRA_ASCII_POSIX

       This option forces all the POSIX character classes, including [:digit:]
       and [:xdigit:], to match only ASCII characters, even when PCRE2_UCP is
       set. It can be changed within a pattern by means of the (?aP) option
       setting, but note that this also sets PCRE2_EXTRA_ASCII_DIGIT in order
       to ensure that (?-aP) unsets all ASCII restrictions for POSIX classes.

         PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL

       This is a dangerous option. Use with care. By default, an unrecognized
       escape such as \j or a malformed one such as \x{2z} causes a
       compile-time error when detected by pcre2_compile(). Perl is somewhat
       inconsistent in handling such items: for example, \j is treated as a
       literal "j", and non-hexadecimal digits in \x{} are just ignored,
       though warnings are given in both cases if Perl's warning switch is
       enabled. However, a malformed octal number after \o{ always causes an
       error in Perl.

       If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
       pcre2_compile(), all unrecognized or malformed escape sequences are
       treated as single-character escapes. For example, \j is a literal "j"
       and \x{2z} is treated as the literal string "x{2z}".
       Setting this option means that typos in patterns may go undetected and
       have unexpected results. Also note that a sequence such as [\N{] is
       interpreted as a malformed attempt at [\N{...}] and so is treated as
       [N{] whereas [\N] gives an error because an unqualified \N is a valid
       escape sequence but is not supported in a character class. To
       reiterate: this is a dangerous option. Use with great care.

         PCRE2_EXTRA_CASELESS_RESTRICT

       When either PCRE2_UCP or PCRE2_UTF is set, caseless matching follows
       Unicode rules, which allow for more than two cases per character. There
       are two case-equivalent character sets that contain both ASCII and
       non-ASCII characters. The ASCII letter S is case-equivalent to U+017f
       (long S) and the ASCII letter K is case-equivalent to U+212a (Kelvin
       sign). This option disables recognition of case-equivalences that cross
       the ASCII/non-ASCII boundary. In a caseless match, both characters must
       either be ASCII or non-ASCII. The option can be changed within a
       pattern by the (?r) option setting.

         PCRE2_EXTRA_ESCAPED_CR_IS_LF

       There are some legacy applications where the escape sequence \r in a
       pattern is expected to match a newline. If this option is set, \r in a
       pattern is converted to \n so that it matches a LF (linefeed) instead
       of a CR (carriage return) character. The option does not affect a
       literal CR in the pattern, nor does it affect CR specified as an
       explicit code point such as \x{0D}.

         PCRE2_EXTRA_MATCH_LINE

       This option is provided for use by the -x option of pcre2grep. It
       causes the pattern only to match complete lines. This is achieved by
       automatically inserting the code for "^(?:" at the start of the
       compiled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is
       set, the matched line may be in the middle of the subject string.
       This option can be used with PCRE2_LITERAL.

         PCRE2_EXTRA_MATCH_WORD

       This option is provided for use by the -w option of pcre2grep. It
       causes the pattern only to match strings that have a word boundary at
       the start and the end. This is achieved by automatically inserting the
       code for "\b(?:" at the start of the compiled pattern and ")\b" at the
       end. The option may be used with PCRE2_LITERAL. However, it is ignored
       if PCRE2_EXTRA_MATCH_LINE is also set.


JUST-IN-TIME (JIT) COMPILATION

       int pcre2_jit_compile(pcre2_code *code, uint32_t options);

       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       pcre2_jit_stack *pcre2_jit_stack_create(size_t startsize,
         size_t maxsize, pcre2_general_context *gcontext);

       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
         pcre2_jit_callback callback_function, void *callback_data);

       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);

       These functions provide support for JIT compilation, which, if the
       just-in-time compiler is available, further processes a compiled
       pattern into machine code that executes much faster than the
       pcre2_match() interpretive matching function. Full details are given in
       the pcre2jit documentation.

       JIT compilation is a heavyweight optimization. It can take some time
       for patterns to be analyzed, and for one-off matches and simple
       patterns the benefit of faster execution might be offset by a much
       slower compilation time. Most (but not all) patterns can be optimized
       by the JIT compiler.


LOCALE SUPPORT

       const uint8_t *pcre2_maketables(pcre2_general_context *gcontext);

       void pcre2_maketables_free(pcre2_general_context *gcontext,
         const uint8_t *tables);

       PCRE2 handles caseless matching, and determines whether characters are
       letters, digits, and so on, by reference to a set of tables, indexed by
       character code point. However, this applies only to characters whose
       code points are less than 256. By default, higher-valued code points
       never match escapes such as \w or \d.

       When PCRE2 is built with Unicode support (the default), certain Unicode
       character properties can be tested with \p and \P, or, alternatively,
       the PCRE2_UCP option can be set when a pattern is compiled; this causes
       \w and friends to use Unicode property support instead of the built-in
       tables. PCRE2_UCP also causes upper/lower casing operations on
       characters with code points greater than 127 to use Unicode properties.
       These effects apply even when PCRE2_UTF is not set. There are, however,
       some PCRE2_EXTRA options (see above) that can be used to modify or
       suppress them.

       The use of locales with Unicode is discouraged. If you are handling
       characters with code points greater than 127, you should either use
       Unicode support, or use locales, but not try to mix the two.

       PCRE2 contains a built-in set of character tables that are used by
       default. These are sufficient for many applications. Normally, the
       internal tables recognize only ASCII characters. However, when PCRE2 is
       built, it is possible to cause the internal tables to be rebuilt in the
       default "C" locale of the local system, which may cause them to be
       different.

       The built-in tables can be overridden by tables supplied by the
       application that calls PCRE2. These may be created in a different
       locale from the default.
       As more and more applications change to using Unicode, the need for
       this locale support is expected to die away.

       External tables are built by calling the pcre2_maketables() function,
       in the relevant locale. The only argument to this function is a general
       context, which can be used to pass a custom memory allocator. If the
       argument is NULL, the system malloc() is used. The result can be passed
       to pcre2_compile() as often as necessary, by creating a compile context
       and calling pcre2_set_character_tables() to set the tables pointer
       therein.

       For example, to build and use tables that are appropriate for the
       French locale (where accented characters with values greater than 127
       are treated as letters), the following code could be used:

         setlocale(LC_CTYPE, "fr_FR");
         tables = pcre2_maketables(NULL);
         ccontext = pcre2_compile_context_create(NULL);
         pcre2_set_character_tables(ccontext, tables);
         re = pcre2_compile(..., ccontext);

       The locale name "fr_FR" is used on Linux and other Unix-like systems;
       if you are using Windows, the name for the French locale is "french".

       The pointer that is passed (via the compile context) to pcre2_compile()
       is saved with the compiled pattern, and the same tables are used by the
       matching functions. Thus, for any single pattern, compilation and
       matching both happen in the same locale, but different patterns can be
       processed in different locales.

       It is the caller's responsibility to ensure that the memory containing
       the tables remains available while they are still in use. When they are
       no longer needed, you can discard them using pcre2_maketables_free(),
       passing as its first parameter the same general context that was used
       to create the tables.

   Saving locale tables

       The tables described above are just a sequence of binary bytes, which
       makes them independent of hardware characteristics such as endianness
       or whether the processor is 32-bit or 64-bit. A copy of the result of
       pcre2_maketables() can therefore be saved in a file or elsewhere and
       re-used later, even in a different program or on another computer. The
       size of the tables (number of bytes) must be obtained by calling
       pcre2_config() with the PCRE2_CONFIG_TABLES_LENGTH option because
       pcre2_maketables() does not return this value. Note that the
       pcre2_dftables program, which is part of the PCRE2 build system, can be
       used stand-alone to create a file that contains a set of binary tables.
       See the pcre2build documentation for details.


INFORMATION ABOUT A COMPILED PATTERN

       int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
         void *where);

       The pcre2_pattern_info() function returns general information about a
       compiled pattern. For information about callouts, see the next section.
       The first argument for pcre2_pattern_info() is a pointer to the
       compiled pattern. The second argument specifies which piece of
       information is required, and the third argument is a pointer to a
       variable to receive the data. If the third argument is NULL, the first
       argument is ignored, and the function returns the size in bytes of the
       variable that is required for the information requested.
       Otherwise, the yield of the function is zero for success, or one of the
       following negative numbers:

         PCRE2_ERROR_NULL           the argument code was NULL
         PCRE2_ERROR_BADMAGIC       the "magic number" was not found
         PCRE2_ERROR_BADOPTION      the value of what was invalid
         PCRE2_ERROR_UNSET          the requested field is not set

       The "magic number" is placed at the start of each compiled pattern as a
       simple check against passing an arbitrary memory pointer. Here is a
       typical call of pcre2_pattern_info(), to obtain the length of the
       compiled pattern:

         int rc;
         size_t length;
         rc = pcre2_pattern_info(
           re,               /* result of pcre2_compile() */
           PCRE2_INFO_SIZE,  /* what is required */
           &length);         /* where to put the data */

       The possible values for the second argument are defined in pcre2.h, and
       are as follows:

         PCRE2_INFO_ALLOPTIONS
         PCRE2_INFO_ARGOPTIONS
         PCRE2_INFO_EXTRAOPTIONS

       Return copies of the pattern's options. The third argument should point
       to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the
       options that were passed to pcre2_compile(), whereas
       PCRE2_INFO_ALLOPTIONS returns the compile options as modified by any
       top-level (*XXX) option settings such as (*UTF) at the start of the
       pattern itself. PCRE2_INFO_EXTRAOPTIONS returns the extra options that
       were set in the compile context by calling the
       pcre2_set_compile_extra_options() function.

       For example, if the pattern /(*UTF)abc/ is compiled with the
       PCRE2_EXTENDED option, the result for PCRE2_INFO_ALLOPTIONS is
       PCRE2_EXTENDED and PCRE2_UTF. Option settings such as (?i) that can
       change within a pattern do not affect the result of
       PCRE2_INFO_ALLOPTIONS, even if they appear right at the start of the
       pattern. (This was different in some earlier releases.)

       A pattern compiled without PCRE2_ANCHORED is automatically anchored by
       PCRE2 if the first significant item in every top-level branch is one of
       the following:

         ^     unless PCRE2_MULTILINE is set
         \A    always
         \G    always
         .*    sometimes - see below

       When .* is the first significant item, anchoring is possible only when
       all the following are true:

         .* is not in an atomic group
         .* is not in a capture group that is the subject
              of a backreference
         PCRE2_DOTALL is in force for .*
         Neither (*PRUNE) nor (*SKIP) appears in the pattern
         PCRE2_NO_DOTSTAR_ANCHOR is not set

       For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
       the options returned for PCRE2_INFO_ALLOPTIONS.

         PCRE2_INFO_BACKREFMAX

       Return the number of the highest backreference in the pattern. The
       third argument should point to a uint32_t variable. Named capture
       groups acquire numbers as well as names, and these count towards the
       highest backreference. Backreferences such as \4 or \g{12} match the
       captured characters of the given group, but in addition, the check that
       a capture group is set in a conditional group such as (?(3)a|b) is also
       a backreference. Zero is returned if there are no backreferences.

         PCRE2_INFO_BSR

       The output is a uint32_t integer whose value indicates what character
       sequences the \R escape sequence matches. A value of PCRE2_BSR_UNICODE
       means that \R matches any Unicode line ending sequence; a value of
       PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF.

         PCRE2_INFO_CAPTURECOUNT

       Return the highest capture group number in the pattern. In patterns
       where (?| is not used, this is also the total number of capture groups.
       The third argument should point to a uint32_t variable.
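
       The capture count relates directly to the ovector size discussed in
       the section on the match data block below: pcre2_match() needs one
       pair of offsets for the whole match plus one pair per capture group,
       and pcre2_match_data_create() imposes a minimum of 1 pair and a
       maximum of 65535. The following C sketch performs only that
       arithmetic. The names ovector_pairs_for() and MAX_OVECTOR_PAIRS are
       hypothetical and are not part of the PCRE2 API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for the documented upper limit on offset pairs. */
#define MAX_OVECTOR_PAIRS 65535u

/* Given the value reported for PCRE2_INFO_CAPTURECOUNT, return the number
   of offset pairs needed to record the whole match and every capture
   group, clamped to the documented maximum. The result is never below 1. */
static uint32_t ovector_pairs_for(uint32_t capture_count)
{
    uint32_t pairs;

    /* Compare before adding so that the addition cannot wrap around. */
    if (capture_count >= MAX_OVECTOR_PAIRS - 1u)
        return MAX_OVECTOR_PAIRS;

    pairs = capture_count + 1u;   /* pair 0 is the whole match */
    assert(pairs >= 1u && pairs <= MAX_OVECTOR_PAIRS);
    return pairs;
}
```

       In most programs this calculation is unnecessary, because
       pcre2_match_data_create_from_pattern() sizes the ovector from the
       compiled pattern itself.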

         PCRE2_INFO_DEPTHLIMIT

       If the pattern set a backtracking depth limit by including an item of
       the form (*LIMIT_DEPTH=nnnn) at the start, the value is returned. The
       third argument should point to a uint32_t integer. If no such value has
       been set, the call to pcre2_pattern_info() returns the error
       PCRE2_ERROR_UNSET. Note that this limit will only be used during
       matching if it is less than the limit set or defaulted by the caller of
       the match function.

         PCRE2_INFO_FIRSTBITMAP

       In the absence of a single first code unit for a non-anchored pattern,
       pcre2_compile() may construct a 256-bit table that defines a fixed set
       of values for the first code unit in any match. For example, a pattern
       that starts with [abc] results in a table with three bits set. When
       code unit values greater than 255 are supported, the flag bit for 255
       means "any code unit of value 255 or above". If such a table was
       constructed, a pointer to it is returned. Otherwise NULL is returned.
       The third argument should point to a const uint8_t * variable.

         PCRE2_INFO_FIRSTCODETYPE

       Return information about the first code unit of any matched string, for
       a non-anchored pattern. The third argument should point to a uint32_t
       variable. If there is a fixed first value, for example, the letter "c"
       from a pattern such as (cat|cow|coyote), 1 is returned, and the value
       can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed
       first value, but it is known that a match can occur only at the start
       of the subject or following a newline in the subject, 2 is returned.
       Otherwise, and for anchored patterns, 0 is returned.

         PCRE2_INFO_FIRSTCODEUNIT

       Return the value of the first code unit of any matched string for a
       pattern where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0.
       The third argument should point to a uint32_t variable. In the 8-bit
       library, the value is always less than 256. In the 16-bit library the
       value can be up to 0xffff. In the 32-bit library in UTF-32 mode the
       value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
       mode.

         PCRE2_INFO_FRAMESIZE

       Return the size (in bytes) of the data frames that are used to remember
       backtracking positions when the pattern is processed by pcre2_match()
       without the use of JIT. The third argument should point to a size_t
       variable. The frame size depends on the number of capturing parentheses
       in the pattern. Each additional capture group adds two PCRE2_SIZE
       variables.

         PCRE2_INFO_HASBACKSLASHC

       Return 1 if the pattern contains any instances of \C, otherwise 0. The
       third argument should point to a uint32_t variable.

         PCRE2_INFO_HASCRORLF

       Return 1 if the pattern contains any explicit matches for CR or LF
       characters, otherwise 0. The third argument should point to a uint32_t
       variable. An explicit match is either a literal CR or LF character, or
       \r or \n or one of the equivalent hexadecimal or octal escape
       sequences.

         PCRE2_INFO_HEAPLIMIT

       If the pattern set a heap memory limit by including an item of the form
       (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third
       argument should point to a uint32_t integer. If no such value has been
       set, the call to pcre2_pattern_info() returns the error
       PCRE2_ERROR_UNSET. Note that this limit will only be used during
       matching if it is less than the limit set or defaulted by the caller of
       the match function.

         PCRE2_INFO_JCHANGED

       Return 1 if the (?J) or (?-J) option setting is used in the pattern,
       otherwise 0. The third argument should point to a uint32_t variable.
       (?J) and (?-J) set and unset the local PCRE2_DUPNAMES option,
       respectively.

         PCRE2_INFO_JITSIZE

       If the compiled pattern was successfully processed by
       pcre2_jit_compile(), return the size of the JIT compiled code,
       otherwise return zero. The third argument should point to a size_t
       variable.

         PCRE2_INFO_LASTCODETYPE

       Returns 1 if there is a rightmost literal code unit that must exist in
       any matched string, other than at its start. The third argument should
       point to a uint32_t variable. If there is no such value, 0 is returned.
       When 1 is returned, the code unit value itself can be retrieved using
       PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
       recorded only if it follows something of variable length. For example,
       for the pattern /^a\d+z\d+/ the returned value is 1 (with "z" returned
       from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is
       0.

         PCRE2_INFO_LASTCODEUNIT

       Return the value of the rightmost literal code unit that must exist in
       any matched string, other than at its start, for a pattern where
       PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third
       argument should point to a uint32_t variable.

         PCRE2_INFO_MATCHEMPTY

       Return 1 if the pattern might match an empty string, otherwise 0. The
       third argument should point to a uint32_t variable. When a pattern
       contains recursive subroutine calls it is not always possible to
       determine whether or not it can match an empty string. PCRE2 takes a
       cautious approach and returns 1 in such cases.

         PCRE2_INFO_MATCHLIMIT

       If the pattern set a match limit by including an item of the form
       (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third
       argument should point to a uint32_t integer.
       If no such value has been set, the call to pcre2_pattern_info() returns
       the error PCRE2_ERROR_UNSET. Note that this limit will only be used
       during matching if it is less than the limit set or defaulted by the
       caller of the match function.

         PCRE2_INFO_MAXLOOKBEHIND

       A lookbehind assertion moves back a certain number of characters (not
       code units) when it starts to process each of its branches. This
       request returns the largest of these backward moves. The third argument
       should point to a uint32_t integer. The simple assertions \b and \B
       require a one-character lookbehind and cause PCRE2_INFO_MAXLOOKBEHIND
       to return 1 in the absence of anything longer. \A also registers a
       one-character lookbehind, though it does not actually inspect the
       previous character.

       Note that this information is useful for multi-segment matching only if
       the pattern contains no nested lookbehinds. For example, the pattern
       (?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is
       processed, the first lookbehind moves back by two characters, matches
       one character, then the nested lookbehind also moves back by two
       characters. This puts the matching point three characters earlier than
       it was at the start. PCRE2_INFO_MAXLOOKBEHIND is really only useful as
       a debugging tool. See the pcre2partial documentation for a discussion
       of multi-segment matching.

         PCRE2_INFO_MINLENGTH

       If a minimum length for matching subject strings was computed, its
       value is returned. Otherwise the returned value is 0. This value is not
       computed when PCRE2_NO_START_OPTIMIZE is set. The value is a number of
       characters, which in UTF mode may be different from the number of code
       units. The third argument should point to a uint32_t variable. The
       value is a lower bound to the length of any matching string.
       There may not be any strings of that length that do actually match, but
       every string that does match is at least that long.

         PCRE2_INFO_NAMECOUNT
         PCRE2_INFO_NAMEENTRYSIZE
         PCRE2_INFO_NAMETABLE

       PCRE2 supports the use of named as well as numbered capturing
       parentheses. The names are just an additional way of identifying the
       parentheses, which still acquire numbers. Several convenience functions
       such as pcre2_substring_get_byname() are provided for extracting
       captured substrings by name. It is also possible to extract the data
       directly, by first converting the name to a number in order to access
       the correct pointers in the output vector (described with pcre2_match()
       below). To do the conversion, you need to use the name-to-number map,
       which is described by these three values.

       The map consists of a number of fixed-size entries.
       PCRE2_INFO_NAMECOUNT gives the number of entries, and
       PCRE2_INFO_NAMEENTRYSIZE gives the size of each entry in code units;
       both of these return a uint32_t value. The entry size depends on the
       length of the longest name.

       PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table.
       This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit
       library, the first two bytes of each entry are the number of the
       capturing parenthesis, most significant byte first. In the 16-bit
       library, the pointer points to 16-bit code units, the first of which
       contains the parenthesis number. In the 32-bit library, the pointer
       points to 32-bit code units, the first of which contains the
       parenthesis number. The rest of the entry is the corresponding name,
       zero terminated.

       The names are in alphabetical order.
       If (?| is used to create multiple capture groups with the same number,
       as described in the section on duplicate group numbers in the
       pcre2pattern page, the groups may be given the same name, but there is
       only one entry in the table. Different names for groups of the same
       number are not permitted.

       Duplicate names for capture groups with different numbers are
       permitted, but only if PCRE2_DUPNAMES is set. They appear in the table
       in the order in which they were found in the pattern. In the absence of
       (?| this is the order of increasing number; when (?| is used this is
       not necessarily the case because later capture groups may have lower
       numbers.

       As a simple example of the name/number table, consider the following
       pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED
       is set, so white space - including newlines - is ignored):

         (?<date> (?<year>(\d\d)?\d\d) -
         (?<month>\d\d) - (?<day>\d\d) )

       There are four named capture groups, so the table has four entries, and
       each entry in the table is eight bytes long. The table is as follows,
       with non-printing bytes shown in hexadecimal, and undefined bytes shown
       as ??:

         00 01 d  a  t  e  00 ??
         00 05 d  a  y  00 ?? ??
         00 04 m  o  n  t  h  00
         00 02 y  e  a  r  00 ??

       When writing code to extract data from named capture groups using the
       name-to-number map, remember that the length of the entries is likely
       to be different for each compiled pattern.
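
       The 8-bit entry layout described above (a two-byte group number, most
       significant byte first, followed by the zero-terminated name) can be
       decoded along the following lines. This is a minimal sketch rather
       than library code: the function group_number_by_name() and the array
       example_table are hypothetical names, and the table is a hand-written
       copy of the example above with the undefined bytes set to zero. In a
       real program the table pointer, entry count, and entry size would come
       from pcre2_pattern_info() with PCRE2_INFO_NAMETABLE,
       PCRE2_INFO_NAMECOUNT, and PCRE2_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Search an 8-bit name table for a group name. Returns the group number,
   or 0 if the name is not present (group 0 is never a named group). If
   duplicate names are present, the first matching entry wins. */
static uint32_t group_number_by_name(const uint8_t *table, uint32_t count,
                                     uint32_t entry_size, const char *name)
{
    uint32_t i;

    assert(name != NULL);
    assert(table != NULL || count == 0);

    for (i = 0; i < count; i++) {
        /* Entries are fixed-size, so entry i starts at i * entry_size. */
        const uint8_t *entry = table + (size_t)i * entry_size;

        /* First two bytes: group number, most significant byte first. */
        uint32_t number = ((uint32_t)entry[0] << 8) | (uint32_t)entry[1];

        /* The remaining bytes hold the zero-terminated name. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return number;
    }
    return 0;
}

/* Hand-written copy of the example table: four entries of eight bytes.
   Bytes shown as ?? in the text are written as zero here. */
static const uint8_t example_table[4 * 8] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00
};
```

       A linear scan is used for simplicity. Because the entry size is passed
       in rather than assumed, the same function works for any compiled
       pattern, whatever the length of its longest name.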

         PCRE2_INFO_NEWLINE

       The output is one of the following uint32_t values:

         PCRE2_NEWLINE_CR       Carriage return (CR)
         PCRE2_NEWLINE_LF       Linefeed (LF)
         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
         PCRE2_NEWLINE_ANY      Any Unicode line ending
         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
         PCRE2_NEWLINE_NUL      The NUL character (binary zero)

       This identifies the character sequence that will be recognized as
       meaning "newline" while matching.

         PCRE2_INFO_SIZE

       Return the size of the compiled pattern in bytes (for all three
       libraries). The third argument should point to a size_t variable. This
       value includes the size of the general data block that precedes the
       code units of the compiled pattern itself. The value that is used when
       pcre2_compile() is getting memory in which to place the compiled
       pattern may be slightly larger than the value returned by this option,
       because there are cases where the code that calculates the size has to
       over-estimate. Processing a pattern with the JIT compiler does not
       alter the value returned by this option.


INFORMATION ABOUT A PATTERN'S CALLOUTS

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       A script language that supports the use of string arguments in callouts
       might like to scan all the callouts in a pattern before running the
       match. This can be done by calling pcre2_callout_enumerate(). The first
       argument is a pointer to a compiled pattern, the second points to a
       callback function, and the third is arbitrary user data. The callback
       function is called for every callout in the pattern in the order in
       which they appear.
       Its first argument is a pointer to a callout enumeration block, and its
       second argument is the user_data value that was passed to
       pcre2_callout_enumerate(). The contents of the callout enumeration
       block are described in the pcre2callout documentation, which also gives
       further details about callouts.


SERIALIZATION AND PRECOMPILING

       It is possible to save compiled patterns on disc or elsewhere, and
       reload them later, subject to a number of restrictions. The host on
       which the patterns are reloaded must be running the same version of
       PCRE2, with the same code unit width, and must also have the same
       endianness, pointer width, and PCRE2_SIZE type. Before compiled
       patterns can be saved, they must be converted to a "serialized" form,
       which in the case of PCRE2 is really just a bytecode dump. The
       functions whose names begin with pcre2_serialize_ are used for
       converting to and from the serialized form. They are described in the
       pcre2serialize documentation. Note that PCRE2 serialization does not
       convert compiled patterns to an abstract format like Java or .NET
       serialization.


THE MATCH DATA BLOCK

       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
         pcre2_general_context *gcontext);

       pcre2_match_data *pcre2_match_data_create_from_pattern(
         const pcre2_code *code, pcre2_general_context *gcontext);

       void pcre2_match_data_free(pcre2_match_data *match_data);

       Information about a successful or unsuccessful match is placed in a
       match data block, which is an opaque structure that is accessed by
       function calls. In particular, the match data block contains a vector
       of offsets into the subject string that define the matched parts of the
       subject. This is known as the ovector.
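
       To illustrate how offsets in the ovector identify parts of the
       subject, the following C sketch copies out the substring described by
       one pair of offsets. Pair 0 describes the whole match and pair n
       describes capture group n; each pair holds the offset of the first
       code unit and the offset just past the last one. The sketch makes
       several assumptions that are not part of the library: plain size_t
       stands in for PCRE2_SIZE, OFFSET_UNSET stands in for PCRE2_UNSET, the
       subject uses 8-bit code units, and copy_from_ovector() and the sample
       data are hypothetical. A real program obtains the vector from the
       match data block with pcre2_get_ovector_pointer().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for PCRE2_UNSET, which marks a group that did not take part. */
#define OFFSET_UNSET SIZE_MAX

/* Copy the substring described by offset pair `pair` into buf and
   zero-terminate it. Returns the substring length, or -1 if the pair is
   unset, its end precedes its start, or buf is too small. */
static long copy_from_ovector(const char *subject, const size_t *ovector,
                              uint32_t pair, char *buf, size_t bufsize)
{
    size_t start, end, len;

    assert(subject != NULL && ovector != NULL && buf != NULL);

    start = ovector[2 * (size_t)pair];       /* first code unit */
    end   = ovector[2 * (size_t)pair + 1];   /* one past the last code unit */

    if (start == OFFSET_UNSET || end == OFFSET_UNSET || end < start)
        return -1;

    len = end - start;
    if (len >= bufsize)
        return -1;                           /* no room for the terminator */

    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (long)len;
}

/* Hand-written sample: what a match of (\d{4})-\d\d-\d\d(Z)? against this
   subject would be expected to record. Group 2 did not participate. */
static const char sample_subject[] = "date: 2024-05-17";
static const size_t sample_ovector[6] = {
    6, 16,                       /* pair 0: "2024-05-17" */
    6, 10,                       /* pair 1: "2024" */
    OFFSET_UNSET, OFFSET_UNSET   /* pair 2: unset */
};
```

       The library's own substring functions, such as
       pcre2_substring_get_byname() mentioned earlier, do this work for you
       and are normally preferable to handling the offsets directly.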

       Before calling pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()
       you must create a match data block by calling one of the creation
       functions above. For pcre2_match_data_create(), the first argument is
       the number of pairs of offsets in the ovector.

       When using pcre2_match(), one pair of offsets is required to identify
       the string that matched the whole pattern, with an additional pair for
       each captured substring. For example, a value of 4 creates enough space
       to record the matched portion of the subject plus three captured
       substrings.

       When using pcre2_dfa_match() there may be multiple matched substrings
       of different lengths at the same point in the subject. The ovector
       should be made large enough to hold as many as are expected.

       A minimum of 1 pair is imposed by pcre2_match_data_create(), so it is
       always possible to return the overall matched string in the case of
       pcre2_match() or the longest match in the case of pcre2_dfa_match().
       The maximum number of pairs is 65535; if the first argument of
       pcre2_match_data_create() is greater than this, 65535 is used.

       The second argument of pcre2_match_data_create() is a pointer to a
       general context, which can specify custom memory management for
       obtaining the memory for the match data block. If you are not using
       custom memory management, pass NULL, which causes malloc() to be used.

       For pcre2_match_data_create_from_pattern(), the first argument is a
       pointer to a compiled pattern. The ovector is created to be exactly the
       right size to hold all the substrings a pattern might capture when
       matched using pcre2_match(). You should not use this call when matching
       with pcre2_dfa_match().
       The second argument is again a pointer to a general context, but in
       this case if NULL is passed, the memory is obtained using the same
       allocator that was used for the compiled pattern (custom or default).

       A match data block can be used many times, with the same or different
       compiled patterns. You can extract information from a match data block
       after a match operation has finished, using functions that are
       described in the sections on matched strings and other match data
       below.

       When a call of pcre2_match() fails, valid data is available in the
       match block only when the error is PCRE2_ERROR_NOMATCH,
       PCRE2_ERROR_PARTIAL, or one of the error codes for an invalid UTF
       string. Exactly what is available depends on the error, and is detailed
       below.

       When one of the matching functions is called, pointers to the compiled
       pattern and the subject string are set in the match data block so that
       they can be referenced by the extraction functions after a successful
       match. After running a match, you must not free a compiled pattern or a
       subject string until after all operations on the match data block (for
       that match) have taken place, unless, in the case of the subject
       string, you have used the PCRE2_COPY_MATCHED_SUBJECT option, which is
       described in the section entitled "Option bits for pcre2_match()"
       below.

       When a match data block itself is no longer needed, it should be freed
       by calling pcre2_match_data_free(). If this function is called with a
       NULL argument, it returns immediately, without doing anything.


MEMORY USE FOR MATCH DATA BLOCKS

       PCRE2_SIZE pcre2_get_match_data_size(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_match_data_heapframes_size(
         pcre2_match_data *match_data);

       The size of a match data block depends on the size of the ovector that
       it contains.
       The function pcre2_get_match_data_size() returns the size, in bytes, of
       the block that is its argument.

       When pcre2_match() runs interpretively (that is, without using JIT), it
       makes use of a vector of data frames for remembering backtracking
       positions. The size of each individual frame depends on the number of
       capturing parentheses in the pattern and can be obtained by calling
       pcre2_pattern_info() with the PCRE2_INFO_FRAMESIZE option (see the
       section entitled "Information about a compiled pattern" above).

       Heap memory is used for the frames vector; if the initial memory block
       turns out to be too small during matching, it is automatically
       expanded. When pcre2_match() returns, the memory is not freed, but
       remains attached to the match data block, for use by any subsequent
       matches that use the same block. It is automatically freed when the
       match data block itself is freed.

       You can find the current size of the frames vector that a match data
       block owns by calling pcre2_get_match_data_heapframes_size(). For a
       newly created match data block the size will be zero. Some types of
       match may require a lot of frames and thus a large vector; applications
       that run in environments where memory is constrained can check this and
       free the match data block if the heap frames vector has become too big.


MATCHING A PATTERN: THE TRADITIONAL FUNCTION

       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       The function pcre2_match() is called to match a subject string against
       a compiled pattern, which is passed in the code argument.
       You can call pcre2_match() with the same code argument as many times as
       you like, in order to find multiple matches in the subject string or to
       match different subject strings with the same pattern.

       This function is the main matching facility of the library, and it
       operates in a Perl-like manner. For specialist use there is also an
       alternative matching function, which is described below in the section
       about the pcre2_dfa_match() function.

       Here is an example of a simple call to pcre2_match():

         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
         int rc = pcre2_match(
           re,                         /* result of pcre2_compile() */
           (PCRE2_SPTR)"some string",  /* the subject string */
           11,                         /* the length of the subject string */
           0,                          /* start at offset 0 in the subject */
           0,                          /* default options */
           md,                         /* the match data block */
           NULL);                      /* a match context; NULL means use defaults */

       If the subject string is zero-terminated, the length can be given as
       PCRE2_ZERO_TERMINATED. A match context must be provided if certain less
       common matching parameters are to be changed. For details, see the
       section on the match context above.

   The string to be matched by pcre2_match()

       The subject string is passed to pcre2_match() as a pointer in subject,
       a length in length, and a starting offset in startoffset. The length
       and offset are in code units, not characters. That is, they are in
       bytes for the 8-bit library, 16-bit code units for the 16-bit library,
       and 32-bit code units for the 32-bit library, whether or not UTF
       processing is enabled. As a special case, if subject is NULL and length
       is zero, the subject is assumed to be an empty string. If length is
       non-zero, an error occurs if subject is NULL.

       If startoffset is greater than the length of the subject, pcre2_match()
       returns PCRE2_ERROR_BADOFFSET.
       When the starting offset is zero, the search for a match starts at the
       beginning of the subject, and this is by far the most common case. In
       UTF-8 or UTF-16 mode, the starting offset must point to the start of a
       character, or to the end of the subject (in UTF-32 mode, one code unit
       equals one character, so all offsets are valid). Like the pattern
       string, the subject may contain binary zeros.

       A non-zero starting offset is useful when searching for another match
       in the same subject by calling pcre2_match() again after a previous
       success. Setting startoffset differs from passing over a shortened
       string and setting PCRE2_NOTBOL in the case of a pattern that begins
       with any kind of lookbehind. For example, consider the pattern

         \Biss\B

       which finds occurrences of "iss" in the middle of words. (\B matches
       only if the current position in the subject is not a word boundary.)
       When applied to the string "Mississippi" the first call to
       pcre2_match() finds the first occurrence. If pcre2_match() is called
       again with just the remainder of the subject, namely "issippi", it does
       not match, because \B is always false at the start of the subject,
       which is deemed to be a word boundary. However, if pcre2_match() is
       passed the entire string again, but with startoffset set to 4, it finds
       the second occurrence of "iss" because it is able to look behind the
       starting point to discover that it is preceded by a letter.

       Finding all the matches in a subject is tricky when the pattern can
       match an empty string. It is possible to emulate Perl's /g behaviour by
       first trying the match again at the same offset, with the
       PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED options, and then if that
       fails, advancing the starting offset and trying an ordinary match
       again.
       There is some code that demonstrates how to do this in the pcre2demo
       sample program. In the most general case, you have to check to see if
       the newline convention recognizes CRLF as a newline, and if so, and the
       current character is CR followed by LF, advance the starting offset by
       two characters instead of one.

       If a non-zero starting offset is passed when the pattern is anchored, a
       single attempt to match at the given offset is made. This can only
       succeed if the pattern does not require the match to be at the start of
       the subject. In other words, the anchoring must be the result of
       setting the PCRE2_ANCHORED option or the use of .* with PCRE2_DOTALL,
       not by starting the pattern with ^ or \A.

   Option bits for pcre2_match()

       The unused bits of the options argument for pcre2_match() must be zero.
       The only bits that may be set are PCRE2_ANCHORED,
       PCRE2_COPY_MATCHED_SUBJECT, PCRE2_DISABLE_RECURSELOOP_CHECK,
       PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
       PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK,
       PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described
       below.

       Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not
       supported by the just-in-time (JIT) compiler. If either is set, JIT
       matching is disabled and the interpretive code in pcre2_match() is run.
       PCRE2_DISABLE_RECURSELOOP_CHECK is ignored by JIT, but apart from
       PCRE2_NO_JIT (obviously), the remaining options are supported for JIT
       matching.

         PCRE2_ANCHORED

       The PCRE2_ANCHORED option limits pcre2_match() to matching at the first
       matching position. If a pattern was compiled with PCRE2_ANCHORED, or
       turned out to be anchored by virtue of its contents, it cannot be made
       unanchored at matching time. Note that setting the option at match time
       disables JIT matching.

         PCRE2_COPY_MATCHED_SUBJECT

       By default, a pointer to the subject is remembered in the match data
       block so that, after a successful match, it can be referenced by the
       substring extraction functions. This means that the subject's memory
       must not be freed until all such operations are complete. For some
       applications where the lifetime of the subject string is not
       guaranteed, it may be necessary to make a copy of the subject string,
       but it is wasteful to do this unless the match is successful. After a
       successful match, if PCRE2_COPY_MATCHED_SUBJECT is set, the subject is
       copied and the new pointer is remembered in the match data block
       instead of the original subject pointer. The memory allocator that was
       used for the match block itself is used. The copy is automatically
       freed when pcre2_match_data_free() is called to free the match data
       block. It is also automatically freed if the match data block is
       re-used for another match operation.

         PCRE2_DISABLE_RECURSELOOP_CHECK

       This option is relevant only to pcre2_match() for interpretive
       matching. It is ignored when JIT is used, and is forbidden for
       pcre2_dfa_match().

       The use of recursion in patterns can lead to infinite loops. In the
       interpretive matcher these would be eventually caught by the match or
       heap limits, but this could take a long time and/or use a lot of memory
       if the limits are large. There is therefore a check at the start of
       each recursion. If the same group is still active from a previous
       call, and the current subject pointer is the same as it was at the
       start of that group, and the furthest inspected character of the
       subject has not changed, an error is generated.

       There are rare cases of matches that would complete, but nevertheless
       trigger this error. This option disables the check.
       It is provided mainly for testing when comparing JIT and interpretive
       behaviour.

         PCRE2_ENDANCHORED

       If the PCRE2_ENDANCHORED option is set, any string that pcre2_match()
       matches must be right at the end of the subject string. Note that
       setting the option at match time disables JIT matching.

         PCRE2_NOTBOL

       This option specifies that the first character of the subject string
       is not the beginning of a line, so the circumflex metacharacter should
       not match before it. Setting this without having set PCRE2_MULTILINE at
       compile time causes circumflex never to match. This option affects only
       the behaviour of the circumflex metacharacter. It does not affect \A.

         PCRE2_NOTEOL

       This option specifies that the end of the subject string is not the end
       of a line, so the dollar metacharacter should not match it nor (except
       in multiline mode) a newline immediately before it. Setting this
       without having set PCRE2_MULTILINE at compile time causes dollar never
       to match. This option affects only the behaviour of the dollar
       metacharacter. It does not affect \Z or \z.

         PCRE2_NOTEMPTY

       An empty string is not considered to be a valid match if this option is
       set. If there are alternatives in the pattern, they are tried. If all
       the alternatives match the empty string, the entire match fails. For
       example, if the pattern

         a?b?

       is applied to a string not beginning with "a" or "b", it matches an
       empty string at the start of the subject. With PCRE2_NOTEMPTY set, this
       match is not valid, so pcre2_match() searches further into the string
       for occurrences of "a" or "b".

         PCRE2_NOTEMPTY_ATSTART

       This is like PCRE2_NOTEMPTY, except that it locks out an empty string
       match only at the first matching position, that is, at the start of the
       subject plus the starting offset.
       An empty string match later in the subject is permitted. If the
       pattern is anchored, such a match can occur only if the pattern
       contains \K.

         PCRE2_NO_JIT

       By default, if a pattern has been successfully processed by
       pcre2_jit_compile(), JIT is automatically used when pcre2_match() is
       called with options that JIT supports. Setting PCRE2_NO_JIT disables
       the use of JIT; it forces matching to be done by the interpreter.

         PCRE2_NO_UTF_CHECK

       When PCRE2_UTF is set at compile time, the validity of the subject as a
       UTF string is checked unless PCRE2_NO_UTF_CHECK is passed to
       pcre2_match() or PCRE2_MATCH_INVALID_UTF was passed to pcre2_compile().
       The latter special case is discussed in detail in the pcre2unicode
       documentation.

       In the default case, if a non-zero starting offset is given, the check
       is applied only to that part of the subject that could be inspected
       during matching, and there is a check that the starting offset points
       to the first code unit of a character or to the end of the subject. If
       there are no lookbehind assertions in the pattern, the check starts at
       the starting offset. Otherwise, it starts at the length of the longest
       lookbehind before the starting offset, or at the start of the subject
       if there are not that many characters before the starting offset. Note
       that the sequences \b and \B are one-character lookbehinds.

       The check is carried out before any other processing takes place, and a
       negative error code is returned if the check fails. There are several
       UTF error codes for each code unit width, corresponding to different
       problems with the code unit sequence. There are discussions about the
       validity of UTF-8 strings, UTF-16 strings, and UTF-32 strings in the
       pcre2unicode documentation.

       If you know that your subject is valid, and you want to skip this check
       for performance reasons, you can set the PCRE2_NO_UTF_CHECK option when
       calling pcre2_match(). You might want to do this for the second and
       subsequent calls to pcre2_match() if you are making repeated calls to
       find multiple matches in the same subject string.

       Warning: Unless PCRE2_MATCH_INVALID_UTF was set at compile time, when
       PCRE2_NO_UTF_CHECK is set at match time the effect of passing an
       invalid string as a subject, or an invalid value of startoffset, is
       undefined. Your program may crash or loop indefinitely or give wrong
       results.

         PCRE2_PARTIAL_HARD
         PCRE2_PARTIAL_SOFT

       These options turn on the partial matching feature. A partial match
       occurs if the end of the subject string is reached successfully, but
       there are not enough subject characters to complete the match. In
       addition, either at least one character must have been inspected or the
       pattern must contain a lookbehind, or the pattern must be one that
       could match an empty string.

       If this situation arises when PCRE2_PARTIAL_SOFT (but not
       PCRE2_PARTIAL_HARD) is set, matching continues by testing any remaining
       alternatives. Only if no complete match can be found is
       PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other
       words, PCRE2_PARTIAL_SOFT specifies that the caller is prepared to
       handle a partial match, but only if no complete match can be found.

       If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this
       case, if a partial match is found, pcre2_match() immediately returns
       PCRE2_ERROR_PARTIAL, without considering any other alternatives. In
       other words, when PCRE2_PARTIAL_HARD is set, a partial match is
       considered to be more important than an alternative complete match.

       There is a more detailed discussion of partial and multi-segment
       matching, with examples, in the pcre2partial documentation.


NEWLINE HANDLING WHEN MATCHING

       When PCRE2 is built, a default newline convention is set; this is
       usually the standard convention for the operating system. The default
       can be overridden in a compile context by calling pcre2_set_newline().
       It can also be overridden by starting a pattern string with, for
       example, (*CRLF), as described in the section on newline conventions in
       the pcre2pattern page. During matching, the newline choice affects the
       behaviour of the dot, circumflex, and dollar metacharacters. It may
       also alter the way the match starting position is advanced after a
       match failure for an unanchored pattern.

       When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
       set as the newline convention, and a match attempt for an unanchored
       pattern fails when the current starting position is at a CRLF sequence,
       and the pattern contains no explicit matches for CR or LF characters,
       the match position is advanced by two characters instead of one, in
       other words, to after the CRLF.

       The above rule is a compromise that makes the most common cases work as
       expected. For example, if the pattern is .+A (and the PCRE2_DOTALL
       option is not set), it does not match the string "\r\nA" because, after
       failing at the start, it skips both the CR and the LF before retrying.
       However, the pattern [\r\n]A does match that string, because it
       contains an explicit CR or LF reference, and so advances only by one
       character after the first failure.

       An explicit match for CR or LF is either a literal appearance of one of
       those characters in the pattern, or one of the \r or \n or equivalent
       octal or hexadecimal escape sequences.
       Implicit matches such as [^X] do not count, nor does \s, even though it
       includes CR and LF in the characters that it matches.

       Notwithstanding the above, anomalous effects may still occur when CRLF
       is a valid newline sequence and explicit \r or \n escapes appear in the
       pattern.


HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS

       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);

       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);

       In general, a pattern matches a certain portion of the subject, and in
       addition, further substrings from the subject may be picked out by
       parenthesized parts of the pattern. Following the usage in Jeffrey
       Friedl's book, this is called "capturing" in what follows, and the
       phrase "capture group" (Perl terminology) is used for a fragment of a
       pattern that picks out a substring. PCRE2 supports several other kinds
       of parenthesized group that do not cause substrings to be captured. The
       pcre2_pattern_info() function can be used to find out how many capture
       groups there are in a compiled pattern.

       You can use auxiliary functions for accessing captured substrings by
       number or by name, as described in sections below.

       Alternatively, you can make direct use of the vector of PCRE2_SIZE
       values, called the ovector, which contains the offsets of captured
       strings. It is part of the match data block. The function
       pcre2_get_ovector_pointer() returns the address of the ovector, and
       pcre2_get_ovector_count() returns the number of pairs of values it
       contains.

       Within the ovector, the first in each pair of values is set to the
       offset of the first code unit of a substring, and the second is set to
       the offset of the first code unit after the end of a substring. These
       values are always code unit offsets, not character offsets.
       That is, they are byte offsets in the 8-bit library, 16-bit offsets in
       the 16-bit library, and 32-bit offsets in the 32-bit library.

       After a partial match (error return PCRE2_ERROR_PARTIAL), only the
       first pair of offsets (that is, ovector[0] and ovector[1]) are set.
       They identify the part of the subject that was partially matched. See
       the pcre2partial documentation for details of partial matching.

       After a fully successful match, the first pair of offsets identifies
       the portion of the subject string that was matched by the entire
       pattern. The next pair is used for the first captured substring, and so
       on. The value returned by pcre2_match() is one more than the highest
       numbered pair that has been set. For example, if two substrings have
       been captured, the returned value is 3. If there are no captured
       substrings, the return value from a successful match is 1, indicating
       that just the first pair of offsets has been set.

       If a pattern uses the \K escape sequence within a positive assertion,
       the reported start of a successful match can be greater than the end of
       the match. For example, if the pattern (?=ab\K) is matched against
       "ab", the start and end offset values for the match are 2 and 0.

       If a capture group is matched repeatedly within a single match
       operation, it is the last portion of the subject that it matched that
       is returned.

       If the ovector is too small to hold all the captured substring offsets,
       as much as possible is filled in, and the function returns a value of
       zero. If captured substrings are not of interest, pcre2_match() may be
       called with a match data block whose ovector is of minimum length (that
       is, one pair).

       It is possible for capture group number n+1 to match some part of the
       subject when group n has not been used at all.
       For example, if the string "abc" is matched against the pattern
       (a|(z))(bc) the return from the function is 4, and groups 1 and 3 are
       matched, but 2 is not. When this happens, both values in the offset
       pairs corresponding to unused groups are set to PCRE2_UNSET.

       Offset values that correspond to unused groups at the end of the
       expression are also set to PCRE2_UNSET. For example, if the string
       "abc" is matched against the pattern (abc)(x(yz)?)? groups 2 and 3 are
       not matched. The return from the function is 2, because the highest
       used capture group number is 1. The offsets for the second and third
       capture groups (assuming the vector is large enough, of course) are set
       to PCRE2_UNSET.

       Elements in the ovector that do not correspond to capturing parentheses
       in the pattern are never changed. That is, if a pattern contains n
       capturing parentheses, no more than ovector[0] to ovector[2n+1] are set
       by pcre2_match(). The other elements retain whatever values they
       previously had. After a failed match attempt, the contents of the
       ovector are unchanged.


OTHER INFORMATION ABOUT A MATCH

       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);

       As well as the offsets in the ovector, other information about a match
       is retained in the match data block and can be retrieved by the above
       functions in appropriate circumstances. If they are called at other
       times, the result is undefined.

       After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a
       failure to match (PCRE2_ERROR_NOMATCH), a mark name may be available.
       The function pcre2_get_mark() can be called to access this name, which
       can be specified in the pattern by any of the backtracking control
       verbs, not just (*MARK). The same function applies to all the verbs.
       It returns a pointer to the zero-terminated name, which is within the
       compiled pattern. If no name is available, NULL is returned. The length
       of the name (excluding the terminating zero) is stored in the code unit
       that precedes the name. You should use this length instead of relying
       on the terminating zero if the name might contain a binary zero.

       After a successful match, the name that is returned is the last mark
       name encountered on the matching path through the pattern. Instances of
       backtracking verbs without names do not count. Thus, for example, if
       the matching path contains (*MARK:A)(*PRUNE), the name "A" is returned.
       After a "no match" or a partial match, the last encountered name is
       returned. For example, consider this pattern:

         ^(*MARK:A)((*MARK:B)a|b)c

       When it matches "bc", the returned name is A. The B mark is "seen" in
       the first branch of the group, but it is not on the matching path. On
       the other hand, when this pattern fails to match "bx", the returned
       name is B.

       Warning: By default, certain start-of-match optimizations are used to
       give a fast "no match" result in some situations. For example, if the
       anchoring is removed from the pattern above, there is an initial check
       for the presence of "c" in the subject before running the matching
       engine. This check fails for "bx", causing a match failure without
       seeing any marks. You can disable the start-of-match optimizations by
       setting the PCRE2_NO_START_OPTIMIZE option for pcre2_compile() or by
       starting the pattern with (*NO_START_OPT).

       After a successful match, a partial match, or one of the invalid UTF
       errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can
       be called. After a successful or partial match it returns the code unit
       offset of the character at which the match started.
       For a non-partial match, this can be different to the value of
       ovector[0] if the pattern contains the \K escape sequence. After a
       partial match, however, this value is always the same as ovector[0]
       because \K does not affect the result of a partial match.

       After a UTF check failure, pcre2_get_startchar() can be used to obtain
       the code unit offset of the invalid UTF character. Details are given in
       the pcre2unicode page.


ERROR RETURNS FROM pcre2_match()

       If pcre2_match() fails, it returns a negative number. This can be
       converted to a text string by calling the pcre2_get_error_message()
       function (see "Obtaining a textual error message" below). Negative
       error codes are also returned by other functions, and are documented
       with them. The codes are given names in the header file. If UTF
       checking is in force and an invalid UTF subject string is detected, one
       of a number of UTF-specific negative error codes is returned. Details
       are given in the pcre2unicode page. The following are the other errors
       that may be returned by pcre2_match():

         PCRE2_ERROR_NOMATCH

       The subject string did not match the pattern.

         PCRE2_ERROR_PARTIAL

       The subject string did not match, but it did match partially. See the
       pcre2partial documentation for details of partial matching.

         PCRE2_ERROR_BADMAGIC

       PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
       to catch the case when it is passed a junk pointer. This is the error
       that is returned when the magic number is not present.

         PCRE2_ERROR_BADMODE

       This error is given when a compiled pattern is passed to a function in
       a library of a different code unit width, for example, a pattern
       compiled by the 8-bit library is passed to a 16-bit or 32-bit library
       function.

         PCRE2_ERROR_BADOFFSET

       The value of startoffset was greater than the length of the subject.

         PCRE2_ERROR_BADOPTION

       An unrecognized bit was set in the options argument.

         PCRE2_ERROR_BADUTFOFFSET

       The UTF code unit sequence that was passed as a subject was checked and
       found to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the
       value of startoffset did not point to the beginning of a UTF character
       or the end of the subject.

         PCRE2_ERROR_CALLOUT

       This error is never generated by pcre2_match() itself. It is provided
       for use by callout functions that want to cause pcre2_match() or
       pcre2_callout_enumerate() to return a distinctive error code. See the
       pcre2callout documentation for details.

         PCRE2_ERROR_DEPTHLIMIT

       The nested backtracking depth limit was reached.

         PCRE2_ERROR_HEAPLIMIT

       The heap limit was reached.

         PCRE2_ERROR_INTERNAL

       An unexpected internal error has occurred. This error could be caused
       by a bug in PCRE2 or by overwriting of the compiled pattern.

         PCRE2_ERROR_JIT_STACKLIMIT

       This error is returned when a pattern that was successfully processed
       by pcre2_jit_compile() is being matched, but the memory available for
       the just-in-time processing stack is not large enough. See the pcre2jit
       documentation for more details.

         PCRE2_ERROR_MATCHLIMIT

       The backtracking match limit was reached.

         PCRE2_ERROR_NOMEMORY

       Heap memory is used to remember backtracking points. This error is
       given when the memory allocation function (default or custom) fails.
       Note that a different error, PCRE2_ERROR_HEAPLIMIT, is given if the
       amount of memory needed exceeds the heap limit. PCRE2_ERROR_NOMEMORY is
       also returned if PCRE2_COPY_MATCHED_SUBJECT is set and memory
       allocation fails.

         PCRE2_ERROR_NULL

       Either the code, subject, or match_data argument was passed as NULL.

         PCRE2_ERROR_RECURSELOOP

       This error is returned when pcre2_match() detects a recursion loop
       within the pattern. Specifically, it means that either the whole
       pattern or a capture group has been called recursively for the second
       time at the same position in the subject string. Some simple patterns
       that might do this are detected and faulted at compile time, but more
       complicated cases, in particular mutual recursions between two
       different groups, cannot be detected until matching is attempted.


OBTAINING A TEXTUAL ERROR MESSAGE

       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
         PCRE2_SIZE bufflen);

       A text message for an error code from any PCRE2 function (compile,
       match, or auxiliary) can be obtained by calling
       pcre2_get_error_message(). The code is passed as the first argument,
       with the remaining two arguments specifying a code unit buffer and its
       length in code units, into which the text message is placed. The
       message is returned in code units of the appropriate width for the
       library that is being used.

       The returned message is terminated with a trailing zero, and the
       function returns the number of code units used, excluding the trailing
       zero. If the error number is unknown, the negative error code
       PCRE2_ERROR_BADDATA is returned. If the buffer is too small, the
       message is truncated (but still with a trailing zero), and the negative
       error code PCRE2_ERROR_NOMEMORY is returned. None of the messages are
       very long; a buffer size of 120 code units is ample.


EXTRACTING CAPTURED SUBSTRINGS BY NUMBER

       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_SIZE *length);

       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR *buffer,
         PCRE2_SIZE *bufflen);

       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR **bufferptr,
         PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       Captured substrings can be accessed directly by using the ovector as
       described above. For convenience, auxiliary functions are provided for
       extracting captured substrings as new, separate, zero-terminated
       strings. A substring that contains a binary zero is correctly extracted
       and has a further zero added on the end, but the result is not, of
       course, a C string.

       The functions in this section identify substrings by number. The number
       zero refers to the entire matched substring, with higher numbers
       referring to substrings captured by parenthesized groups. After a
       partial match, only substring zero is available. An attempt to extract
       any other substring gives the error PCRE2_ERROR_PARTIAL. A later
       section describes similar functions for extracting captured substrings
       by name.

       If a pattern uses the \K escape sequence within a positive assertion,
       the reported start of a successful match can be greater than the end of
       the match. For example, if the pattern (?=ab\K) is matched against
       "ab", the start and end offset values for the match are 2 and 0. In
       this situation, calling these functions with a zero substring number
       extracts a zero-length empty string.

       You can find the length in code units of a captured substring without
       extracting it by calling pcre2_substring_length_bynumber().
       The first argument is a pointer to the match data block, the second is
       the group number, and the third is a pointer to a variable into which
       the length is placed. If you just want to know whether or not the
       substring has been captured, you can pass the third argument as NULL.

       The pcre2_substring_copy_bynumber() function copies a captured
       substring into a supplied buffer, whereas
       pcre2_substring_get_bynumber() copies it into new memory, obtained
       using the same memory allocation function that was used for the match
       data block. The first two arguments of these functions are a pointer to
       the match data block and a capture group number.

       The final arguments of pcre2_substring_copy_bynumber() are a pointer to
       the buffer and a pointer to a variable that contains its length in code
       units. This is updated to contain the actual number of code units used
       for the extracted substring, excluding the terminating zero.

       For pcre2_substring_get_bynumber() the third and fourth arguments point
       to variables that are updated with a pointer to the new memory and the
       number of code units that comprise the substring, again excluding the
       terminating zero. When the substring is no longer needed, the memory
       should be freed by calling pcre2_substring_free().

       The return value from all these functions is zero for success, or a
       negative error code. If the pattern match failed, the match failure
       code is returned. If a substring number greater than zero is used
       after a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible
       error codes are:

         PCRE2_ERROR_NOMEMORY

       The buffer was too small for pcre2_substring_copy_bynumber(), or the
       attempt to get memory failed for pcre2_substring_get_bynumber().

         PCRE2_ERROR_NOSUBSTRING

       There is no substring with that number in the pattern, that is, the
       number is greater than the number of capturing parentheses.

         PCRE2_ERROR_UNAVAILABLE

       The substring number, though not greater than the number of captures in
       the pattern, is greater than the number of slots in the ovector, so the
       substring could not be captured.

         PCRE2_ERROR_UNSET

       The substring did not participate in the match. For example, if the
       pattern is (abc)|(def) and the subject is "def", and the ovector
       contains at least two capturing slots, substring number 1 is unset.


EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS

       int pcre2_substring_list_get(pcre2_match_data *match_data,
         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);

       void pcre2_substring_list_free(PCRE2_UCHAR **list);

       The pcre2_substring_list_get() function extracts all available
       substrings and builds a list of pointers to them. It also (optionally)
       builds a second list that contains their lengths (in code units),
       excluding a terminating zero that is added to each of them. All this is
       done in a single block of memory that is obtained using the same memory
       allocation function that was used to get the match data block.

       This function must be called only after a successful match. If called
       after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.

       The address of the memory block is returned via listptr, which is also
       the start of the list of string pointers. The end of the list is marked
       by a NULL pointer. The address of the list of lengths is returned via
       lengthsptr. If your strings do not contain binary zeros and you do not
       therefore need the lengths, you may supply NULL as the lengthsptr
       argument to disable the creation of a list of lengths.
       The yield of the function is zero if all went well, or
       PCRE2_ERROR_NOMEMORY if the memory block could not be obtained. When
       the list is no longer needed, it should be freed by calling
       pcre2_substring_list_free().

       If this function encounters a substring that is unset, which can happen
       when capture group number n+1 matches some part of the subject, but
       group n has not been used at all, it returns an empty string. This can
       be distinguished from a genuine zero-length substring by inspecting the
       appropriate offset in the ovector, which contains PCRE2_UNSET for unset
       substrings, or by calling pcre2_substring_length_bynumber().


EXTRACTING CAPTURED SUBSTRINGS BY NAME

       int pcre2_substring_number_from_name(const pcre2_code *code,
         PCRE2_SPTR name);

       int pcre2_substring_length_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_SIZE *length);

       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

       int pcre2_substring_get_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       To extract a substring by name, you first have to find the associated
       number. For example, for this pattern:

         (a+)b(?<xxx>\d+)...

       the number of the capture group called "xxx" is 2. If the name is known
       to be unique (PCRE2_DUPNAMES was not set), you can find the number from
       the name by calling pcre2_substring_number_from_name(). The first
       argument is the compiled pattern, and the second is the name. The yield
       of the function is the group number, PCRE2_ERROR_NOSUBSTRING if there
       is no group with that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if there
       is more than one group with that name.
Given the number, you can extract 3402 the substring directly from the ovector, or use one of the "bynumber" 3403 functions described above. 3404 3405 For convenience, there are also "byname" functions that correspond to 3406 the "bynumber" functions, the only difference being that the second ar- 3407 gument is a name instead of a number. If PCRE2_DUPNAMES is set and 3408 there are duplicate names, these functions scan all the groups with the 3409 given name, and return the captured substring from the first named 3410 group that is set. 3411 3412 If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is 3413 returned. If all groups with the name have numbers that are greater 3414 than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is re- 3415 turned. If there is at least one group with a slot in the ovector, but 3416 no group is found to be set, PCRE2_ERROR_UNSET is returned. 3417 3418 Warning: If the pattern uses the (?| feature to set up multiple capture 3419 groups with the same number, as described in the section on duplicate 3420 group numbers in the pcre2pattern page, you cannot use names to distin- 3421 guish the different capture groups, because names are not included in 3422 the compiled code. The matching process uses only numbers. For this 3423 reason, the use of different names for groups with the same number 3424 causes an error at compile time. 
3425 3426 3427CREATING A NEW STRING WITH SUBSTITUTIONS 3428 3429 int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject, 3430 PCRE2_SIZE length, PCRE2_SIZE startoffset, 3431 uint32_t options, pcre2_match_data *match_data, 3432 pcre2_match_context *mcontext, PCRE2_SPTR replacement, 3433 PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer, 3434 PCRE2_SIZE *outlengthptr); 3435 3436 This function optionally calls pcre2_match() and then makes a copy of 3437 the subject string in outputbuffer, replacing parts that were matched 3438 with the replacement string, whose length is supplied in rlength, which 3439 can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. As 3440 a special case, if replacement is NULL and rlength is zero, the re- 3441 placement is assumed to be an empty string. If rlength is non-zero, an 3442 error occurs if replacement is NULL. 3443 3444 There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to re- 3445 turn just the replacement string(s). The default action is to perform 3446 just one replacement if the pattern matches, but there is an option 3447 that requests multiple replacements (see PCRE2_SUBSTITUTE_GLOBAL be- 3448 low). 3449 3450 If successful, pcre2_substitute() returns the number of substitutions 3451 that were carried out. This may be zero if no match was found, and is 3452 never greater than one unless PCRE2_SUBSTITUTE_GLOBAL is set. A nega- 3453 tive value is returned if an error is detected. 3454 3455 Matches in which a \K item in a lookahead in the pattern causes the 3456 match to end before it starts are not supported, and give rise to an 3457 error return. For global replacements, matches in which \K in a lookbe- 3458 hind causes the match to start earlier than the point that was reached 3459 in the previous iteration are also not supported. 
3460 3461 The first seven arguments of pcre2_substitute() are the same as for 3462 pcre2_match(), except that the partial matching options are not permit- 3463 ted, and match_data may be passed as NULL, in which case a match data 3464 block is obtained and freed within this function, using memory manage- 3465 ment functions from the match context, if provided, or else those that 3466 were used to allocate memory for the compiled code. 3467 3468 If match_data is not NULL and PCRE2_SUBSTITUTE_MATCHED is not set, the 3469 provided block is used for all calls to pcre2_match(), and its contents 3470 afterwards are the result of the final call. For global changes, this 3471 will always be a no-match error. The contents of the ovector within the 3472 match data block may or may not have been changed. 3473 3474 As well as the usual options for pcre2_match(), a number of additional 3475 options can be set in the options argument of pcre2_substitute(). One 3476 such option is PCRE2_SUBSTITUTE_MATCHED. When this is set, an external 3477 match_data block must be provided, and it must have already been used 3478 for an external call to pcre2_match() with the same pattern and subject 3479 arguments. The data in the match_data block (return code, offset vec- 3480 tor) is then used for the first substitution instead of calling 3481 pcre2_match() from within pcre2_substitute(). This allows an applica- 3482 tion to check for a match before choosing to substitute, without having 3483 to repeat the match. 3484 3485 The contents of the externally supplied match data block are not 3486 changed when PCRE2_SUBSTITUTE_MATCHED is set. If PCRE2_SUBSTI- 3487 TUTE_GLOBAL is also set, pcre2_match() is called after the first sub- 3488 stitution to check for further matches, but this is done using an in- 3489 ternally obtained match data block, thus always leaving the external 3490 block unchanged. 
3491 3492 The code argument is not used for matching before the first substitu- 3493 tion when PCRE2_SUBSTITUTE_MATCHED is set, but it must be provided, 3494 even when PCRE2_SUBSTITUTE_GLOBAL is not set, because it contains in- 3495 formation such as the UTF setting and the number of capturing parenthe- 3496 ses in the pattern. 3497 3498 The default action of pcre2_substitute() is to return a copy of the 3499 subject string with matched substrings replaced. However, if PCRE2_SUB- 3500 STITUTE_REPLACEMENT_ONLY is set, only the replacement substrings are 3501 returned. In the global case, multiple replacements are concatenated in 3502 the output buffer. Substitution callouts (see below) can be used to 3503 separate them if necessary. 3504 3505 The outlengthptr argument of pcre2_substitute() must point to a vari- 3506 able that contains the length, in code units, of the output buffer. If 3507 the function is successful, the value is updated to contain the length 3508 in code units of the new string, excluding the trailing zero that is 3509 automatically added. 3510 3511 If the function is not successful, the value set via outlengthptr de- 3512 pends on the type of error. For syntax errors in the replacement 3513 string, the value is the offset in the replacement string where the er- 3514 ror was detected. For other errors, the value is PCRE2_UNSET by de- 3515 fault. This includes the case of the output buffer being too small, un- 3516 less PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set. 3517 3518 PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output 3519 buffer is too small. The default action is to return PCRE2_ERROR_NOMEM- 3520 ORY immediately. If this option is set, however, pcre2_substitute() 3521 continues to go through the motions of matching and substituting (with- 3522 out, of course, writing anything) in order to compute the size of 3523 buffer that is needed. 
       This value is passed back via the outlengthptr variable, with the
       result of the function still being PCRE2_ERROR_NOMEMORY.

       Passing a buffer size of zero is a permitted way of finding out how
       much memory is needed for a given substitution. However, this does mean
       that the entire operation is carried out twice. Depending on the
       application, it may be more efficient to allocate a large buffer and
       free the excess afterwards, instead of using
       PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.

       The replacement string, which is interpreted as a UTF string in UTF
       mode, is checked for UTF validity unless PCRE2_NO_UTF_CHECK is set. An
       invalid UTF replacement string causes an immediate return with the
       relevant UTF error code.

       If PCRE2_SUBSTITUTE_LITERAL is set, the replacement string is not
       interpreted in any way. By default, however, a dollar character is an
       escape character that can specify the insertion of characters from
       capture groups and names from (*MARK) or other control verbs in the
       pattern. Dollar is the only escape character (backslash is treated as
       literal). The following forms are always recognized:

         $$                  insert a dollar character
         $<n> or ${<n>}      insert the contents of group <n>
         $*MARK or ${*MARK}  insert a control verb name

       Either a group number or a group name can be given for <n>. Curly
       brackets are required only if the following character would be
       interpreted as part of the number or name. The number may be zero to
       include the entire matched string. For example, if the pattern a(b)c is
       matched with "=abc=" and the replacement string "+$1$0$1+", the result
       is "=+babcb+=".

       $*MARK inserts the name from the last encountered backtracking control
       verb on the matching path that has a name. (*MARK) must always include
       a name, but the other verbs need not.
For example, in the case of 3560 (*MARK:A)(*PRUNE) the name inserted is "A", but for (*MARK:A)(*PRUNE:B) 3561 the relevant name is "B". This facility can be used to perform simple 3562 simultaneous substitutions, as this pcre2test example shows: 3563 3564 /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK} 3565 apple lemon 3566 2: pear orange 3567 3568 PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject 3569 string, replacing every matching substring. If this option is not set, 3570 only the first matching substring is replaced. The search for matches 3571 takes place in the original subject string (that is, previous replace- 3572 ments do not affect it). Iteration is implemented by advancing the 3573 startoffset value for each search, which is always passed the entire 3574 subject string. If an offset limit is set in the match context, search- 3575 ing stops when that limit is reached. 3576 3577 You can restrict the effect of a global substitution to a portion of 3578 the subject string by setting either or both of startoffset and an off- 3579 set limit. Here is a pcre2test example: 3580 3581 /B/g,replace=!,use_offset_limit 3582 ABC ABC ABC ABC\=offset=3,offset_limit=12 3583 2: ABC A!C A!C ABC 3584 3585 When continuing with global substitutions after matching a substring 3586 with zero length, an attempt to find a non-empty match at the same off- 3587 set is performed. If this is not successful, the offset is advanced by 3588 one character except when CRLF is a valid newline sequence and the next 3589 two characters are CR, LF. In this case, the offset is advanced by two 3590 characters. 3591 3592 PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capture groups that 3593 do not appear in the pattern to be treated as unset groups. This option 3594 should be used with care, because it means that a typo in a group name 3595 or number no longer causes the PCRE2_ERROR_NOSUBSTRING error. 
3596 3597 PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including un- 3598 known groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated 3599 as empty strings when inserted as described above. If this option is 3600 not set, an attempt to insert an unset group causes the PCRE2_ERROR_UN- 3601 SET error. This option does not influence the extended substitution 3602 syntax described below. 3603 3604 PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the 3605 replacement string. Without this option, only the dollar character is 3606 special, and only the group insertion forms listed above are valid. 3607 When PCRE2_SUBSTITUTE_EXTENDED is set, two things change: 3608 3609 Firstly, backslash in a replacement string is interpreted as an escape 3610 character. The usual forms such as \n or \x{ddd} can be used to specify 3611 particular character codes, and backslash followed by any non-alphanu- 3612 meric character quotes that character. Extended quoting can be coded 3613 using \Q...\E, exactly as in pattern strings. 3614 3615 There are also four escape sequences for forcing the case of inserted 3616 letters. The insertion mechanism has three states: no case forcing, 3617 force upper case, and force lower case. The escape sequences change the 3618 current state: \U and \L change to upper or lower case forcing, respec- 3619 tively, and \E (when not terminating a \Q quoted sequence) reverts to 3620 no case forcing. The sequences \u and \l force the next character (if 3621 it is a letter) to upper or lower case, respectively, and then the 3622 state automatically reverts to no case forcing. Case forcing applies to 3623 all inserted characters, including those from capture groups and let- 3624 ters within \Q...\E quoted sequences. If either PCRE2_UTF or PCRE2_UCP 3625 was set when the pattern was compiled, Unicode properties are used for 3626 case forcing characters whose code points are greater than 127. 
3627 3628 Note that case forcing sequences such as \U...\E do not nest. For exam- 3629 ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final 3630 \E has no effect. Note also that the PCRE2_ALT_BSUX and PCRE2_EX- 3631 TRA_ALT_BSUX options do not apply to replacement strings. 3632 3633 The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more 3634 flexibility to capture group substitution. The syntax is similar to 3635 that used by Bash: 3636 3637 ${<n>:-<string>} 3638 ${<n>:+<string1>:<string2>} 3639 3640 As before, <n> may be a group number or a name. The first form speci- 3641 fies a default value. If group <n> is set, its value is inserted; if 3642 not, <string> is expanded and the result inserted. The second form 3643 specifies strings that are expanded and inserted when group <n> is set 3644 or unset, respectively. The first form is just a convenient shorthand 3645 for 3646 3647 ${<n>:+${<n>}:<string>} 3648 3649 Backslash can be used to escape colons and closing curly brackets in 3650 the replacement strings. A change of the case forcing state within a 3651 replacement string remains in force afterwards, as shown in this 3652 pcre2test example: 3653 3654 /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo 3655 body 3656 1: hello 3657 somebody 3658 1: HELLO 3659 3660 The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended 3661 substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause un- 3662 known groups in the extended syntax forms to be treated as unset. 3663 3664 If PCRE2_SUBSTITUTE_LITERAL is set, PCRE2_SUBSTITUTE_UNKNOWN_UNSET, 3665 PCRE2_SUBSTITUTE_UNSET_EMPTY, and PCRE2_SUBSTITUTE_EXTENDED are irrele- 3666 vant and are ignored. 3667 3668 Substitution errors 3669 3670 In the event of an error, pcre2_substitute() returns a negative error 3671 code. Except for PCRE2_ERROR_NOMATCH (which is never returned), errors 3672 from pcre2_match() are passed straight back. 
3673 3674 PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser- 3675 tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set. 3676 3677 PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ- 3678 ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) 3679 when the simple (non-extended) syntax is used and PCRE2_SUBSTITUTE_UN- 3680 SET_EMPTY is not set. 3681 3682 PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big 3683 enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size 3684 of buffer that is needed is returned via outlengthptr. Note that this 3685 does not happen by default. 3686 3687 PCRE2_ERROR_NULL is returned if PCRE2_SUBSTITUTE_MATCHED is set but the 3688 match_data argument is NULL or if the subject or replacement arguments 3689 are NULL. For backward compatibility reasons an exception is made for 3690 the replacement argument if the rlength argument is also 0. 3691 3692 PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in 3693 the replacement string, with more particular errors being PCRE2_ER- 3694 ROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REPMISSINGBRACE 3695 (closing curly bracket not found), PCRE2_ERROR_BADSUBSTITUTION (syntax 3696 error in extended group substitution), and PCRE2_ERROR_BADSUBSPATTERN 3697 (the pattern match ended before it started or the match started earlier 3698 than the current position in the subject, which can happen if \K is 3699 used in an assertion). 3700 3701 As for all PCRE2 errors, a text message that describes the error can be 3702 obtained by calling the pcre2_get_error_message() function (see "Ob- 3703 taining a textual error message" above). 

   Substitution callouts

       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_substitute_callout_block *, void *),
         void *callout_data);

       The pcre2_set_substitute_callout() function can be used to specify a
       callout function for pcre2_substitute(). This information is passed in
       a match context. The callout function is called after each substitution
       has been processed, but it can cause the replacement not to happen. The
       callout function is not called for simulated substitutions that happen
       as a result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option.

       The first argument of the callout function is a pointer to a substitute
       callout block structure, which contains the following fields, not
       necessarily in this order:

         uint32_t    version;
         uint32_t    subscount;
         PCRE2_SPTR  input;
         PCRE2_SPTR  output;
         PCRE2_SIZE *ovector;
         uint32_t    oveccount;
         PCRE2_SIZE  output_offsets[2];

       The version field contains the version number of the block format. The
       current version is 0. The version number will increase in future if
       more fields are added, but the intention is never to remove any of the
       existing fields.

       The subscount field is the number of the current match. It is 1 for the
       first callout, 2 for the second, and so on. The input and output
       pointers are copies of the values passed to pcre2_substitute().

       The ovector field points to the ovector, which contains the result of
       the most recent match. The oveccount field contains the number of pairs
       that are set in the ovector, and is always greater than zero.

       The output_offsets vector contains the offsets of the replacement in
       the output string. This has already been processed for dollar and (if
       requested) backslash substitutions as described above.

       The second argument of the callout function is the value passed as
       callout_data when the function was registered. The value returned by
       the callout function is interpreted as follows:

       If the value is zero, the replacement is accepted, and, if
       PCRE2_SUBSTITUTE_GLOBAL is set, processing continues with a search for
       the next match. If the value is not zero, the current replacement is
       not accepted. If the value is greater than zero, processing continues
       when PCRE2_SUBSTITUTE_GLOBAL is set. Otherwise (the value is less than
       zero or PCRE2_SUBSTITUTE_GLOBAL is not set), the rest of the input is
       copied to the output and the call to pcre2_substitute() exits,
       returning the number of matches so far.


DUPLICATE CAPTURE GROUP NAMES

       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

       When a pattern is compiled with the PCRE2_DUPNAMES option, names for
       capture groups are not required to be unique. Duplicate names are
       always allowed for groups with the same number, created by using the
       (?| feature. Indeed, if such groups are named, they are required to use
       the same names.

       Normally, patterns that use duplicate names are such that in any one
       match, only one of each set of identically-named groups participates.
       An example is shown in the pcre2pattern documentation.

       When duplicates are present, pcre2_substring_copy_byname() and
       pcre2_substring_get_byname() return the first substring corresponding
       to the given name that is set. Only if none are set is
       PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name()
       function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are
       duplicate names.
3782 3783 If you want to get full details of all captured substrings for a given 3784 name, you must use the pcre2_substring_nametable_scan() function. The 3785 first argument is the compiled pattern, and the second is the name. If 3786 the third and fourth arguments are NULL, the function returns a group 3787 number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise. 3788 3789 When the third and fourth arguments are not NULL, they must be pointers 3790 to variables that are updated by the function. After it has run, they 3791 point to the first and last entries in the name-to-number table for the 3792 given name, and the function returns the length of each entry in code 3793 units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are 3794 no entries for the given name. 3795 3796 The format of the name table is described above in the section entitled 3797 Information about a pattern. Given all the relevant entries for the 3798 name, you can extract each of their numbers, and hence the captured 3799 data. 3800 3801 3802FINDING ALL POSSIBLE MATCHES AT ONE POSITION 3803 3804 The traditional matching function uses a similar algorithm to Perl, 3805 which stops when it finds the first match at a given point in the sub- 3806 ject. If you want to find all possible matches, or the longest possible 3807 match at a given position, consider using the alternative matching 3808 function (see below) instead. If you cannot use the alternative func- 3809 tion, you can kludge it up by making use of the callout facility, which 3810 is described in the pcre2callout documentation. 3811 3812 What you have to do is to insert a callout right at the end of the pat- 3813 tern. When your callout function is called, extract and save the cur- 3814 rent matched substring. Then return 1, which forces pcre2_match() to 3815 backtrack and try other alternatives. Ultimately, when it runs out of 3816 matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH. 
3817 3818 3819MATCHING A PATTERN: THE ALTERNATIVE FUNCTION 3820 3821 int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject, 3822 PCRE2_SIZE length, PCRE2_SIZE startoffset, 3823 uint32_t options, pcre2_match_data *match_data, 3824 pcre2_match_context *mcontext, 3825 int *workspace, PCRE2_SIZE wscount); 3826 3827 The function pcre2_dfa_match() is called to match a subject string 3828 against a compiled pattern, using a matching algorithm that scans the 3829 subject string just once (not counting lookaround assertions), and does 3830 not backtrack (except when processing lookaround assertions). This has 3831 different characteristics to the normal algorithm, and is not compati- 3832 ble with Perl. Some of the features of PCRE2 patterns are not sup- 3833 ported. Nevertheless, there are times when this kind of matching can be 3834 useful. For a discussion of the two matching algorithms, and a list of 3835 features that pcre2_dfa_match() does not support, see the pcre2matching 3836 documentation. 3837 3838 The arguments for the pcre2_dfa_match() function are the same as for 3839 pcre2_match(), plus two extras. The ovector within the match data block 3840 is used in a different way, and this is described below. The other com- 3841 mon arguments are used in the same way as for pcre2_match(), so their 3842 description is not repeated here. 3843 3844 The two additional arguments provide workspace for the function. The 3845 workspace vector should contain at least 20 elements. It is used for 3846 keeping track of multiple paths through the pattern tree. More work- 3847 space is needed for patterns and subjects where there are a lot of po- 3848 tential matches. 

       Here is an example of a simple call to pcre2_dfa_match():

         int wspace[20];
         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
         int rc = pcre2_dfa_match(
           re,             /* result of pcre2_compile() */
           "some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           md,             /* the match data block */
           NULL,           /* a match context; NULL means use defaults */
           wspace,         /* working space vector */
           20);            /* number of elements (NOT size in bytes) */

   Option bits for pcre2_dfa_match()

       The unused bits of the options argument for pcre2_dfa_match() must be
       zero. The only bits that may be set are PCRE2_ANCHORED,
       PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL,
       PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
       PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT,
       PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but the last four of
       these are exactly the same as for pcre2_match(), so their description
       is not repeated here.

         PCRE2_PARTIAL_HARD
         PCRE2_PARTIAL_SOFT

       These have the same general effect as they do for pcre2_match(), but
       the details are slightly different. When PCRE2_PARTIAL_HARD is set for
       pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the
       subject is reached and there is still at least one matching possibility
       that requires additional characters. This happens even if some complete
       matches have already been found. When PCRE2_PARTIAL_SOFT is set, the
       return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
       if the end of the subject is reached, there have been no complete
       matches, but there is still at least one matching possibility.
The por- 3887 tion of the string that was inspected when the longest partial match 3888 was found is set as the first matching string in both cases. There is a 3889 more detailed discussion of partial and multi-segment matching, with 3890 examples, in the pcre2partial documentation. 3891 3892 PCRE2_DFA_SHORTEST 3893 3894 Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to 3895 stop as soon as it has found one match. Because of the way the alterna- 3896 tive algorithm works, this is necessarily the shortest possible match 3897 at the first possible matching point in the subject string. 3898 3899 PCRE2_DFA_RESTART 3900 3901 When pcre2_dfa_match() returns a partial match, it is possible to call 3902 it again, with additional subject characters, and have it continue with 3903 the same match. The PCRE2_DFA_RESTART option requests this action; when 3904 it is set, the workspace and wscount options must reference the same 3905 vector as before because data about the match so far is left in them 3906 after a partial match. There is more discussion of this facility in the 3907 pcre2partial documentation. 3908 3909 Successful returns from pcre2_dfa_match() 3910 3911 When pcre2_dfa_match() succeeds, it may have matched more than one sub- 3912 string in the subject. Note, however, that all the matches from one run 3913 of the function start at the same point in the subject. The shorter 3914 matches are all initial substrings of the longer matches. For example, 3915 if the pattern 3916 3917 <.*> 3918 3919 is matched against the string 3920 3921 This is <something> <something else> <something further> no more 3922 3923 the three matched strings are 3924 3925 <something> <something else> <something further> 3926 <something> <something else> 3927 <something> 3928 3929 On success, the yield of the function is a number greater than zero, 3930 which is the number of matched substrings. 
The offsets of the sub- 3931 strings are returned in the ovector, and can be extracted by number in 3932 the same way as for pcre2_match(), but the numbers bear no relation to 3933 any capture groups that may exist in the pattern, because DFA matching 3934 does not support capturing. 3935 3936 Calls to the convenience functions that extract substrings by name re- 3937 turn the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used af- 3938 ter a DFA match. The convenience functions that extract substrings by 3939 number never return PCRE2_ERROR_NOSUBSTRING. 3940 3941 The matched strings are stored in the ovector in reverse order of 3942 length; that is, the longest matching string is first. If there were 3943 too many matches to fit into the ovector, the yield of the function is 3944 zero, and the vector is filled with the longest matches. 3945 3946 NOTE: PCRE2's "auto-possessification" optimization usually applies to 3947 character repeats at the end of a pattern (as well as internally). For 3948 example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA 3949 matching, this means that only one possible match is found. If you re- 3950 ally do want multiple matches in such cases, either use an ungreedy re- 3951 peat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when com- 3952 piling. 3953 3954 Error returns from pcre2_dfa_match() 3955 3956 The pcre2_dfa_match() function returns a negative number when it fails. 3957 Many of the errors are the same as for pcre2_match(), as described 3958 above. There are in addition the following errors that are specific to 3959 pcre2_dfa_match(): 3960 3961 PCRE2_ERROR_DFA_UITEM 3962 3963 This return is given if pcre2_dfa_match() encounters an item in the 3964 pattern that it does not support, for instance, the use of \C in a UTF 3965 mode or a backreference. 
3966 3967 PCRE2_ERROR_DFA_UCOND 3968 3969 This return is given if pcre2_dfa_match() encounters a condition item 3970 that uses a backreference for the condition, or a test for recursion in 3971 a specific capture group. These are not supported. 3972 3973 PCRE2_ERROR_DFA_UINVALID_UTF 3974 3975 This return is given if pcre2_dfa_match() is called for a pattern that 3976 was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for 3977 DFA matching. 3978 3979 PCRE2_ERROR_DFA_WSSIZE 3980 3981 This return is given if pcre2_dfa_match() runs out of space in the 3982 workspace vector. 3983 3984 PCRE2_ERROR_DFA_RECURSE 3985 3986 When a recursion or subroutine call is processed, the matching function 3987 calls itself recursively, using private memory for the ovector and 3988 workspace. This error is given if the internal ovector is not large 3989 enough. This should be extremely rare, as a vector of size 1000 is 3990 used. 3991 3992 PCRE2_ERROR_DFA_BADRESTART 3993 3994 When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option, 3995 some plausibility checks are made on the contents of the workspace, 3996 which should contain data about the previous partial match. If any of 3997 these checks fail, this error is given. 3998 3999 4000SEE ALSO 4001 4002 pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3), 4003 pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3). 4004 4005 4006AUTHOR 4007 4008 Philip Hazel 4009 Retired from University Computing Service 4010 Cambridge, England. 4011 4012 4013REVISION 4014 4015 Last updated: 24 April 2024 4016 Copyright (c) 1997-2024 University of Cambridge. 
4017 4018 4019PCRE2 10.44 24 April 2024 PCRE2API(3) 4020------------------------------------------------------------------------------ 4021 4022 4023 4024PCRE2BUILD(3) Library Functions Manual PCRE2BUILD(3) 4025 4026 4027NAME 4028 PCRE2 - Perl-compatible regular expressions (revised API) 4029 4030 4031BUILDING PCRE2 4032 4033 PCRE2 is distributed with a configure script that can be used to build 4034 the library in Unix-like environments using the applications known as 4035 Autotools. Also in the distribution are files to support building using 4036 CMake instead of configure. The text file README contains general in- 4037 formation about building with Autotools (some of which is repeated be- 4038 low), and also has some comments about building on various operating 4039 systems. The files in the vms directory support building under OpenVMS. 4040 There is a lot more information about building PCRE2 without using Au- 4041 totools (including information about using CMake and building "by 4042 hand") in the text file called NON-AUTOTOOLS-BUILD. You should consult 4043 this file as well as the README file if you are building in a non-Unix- 4044 like environment. 4045 4046 4047PCRE2 BUILD-TIME OPTIONS 4048 4049 The rest of this document describes the optional features of PCRE2 that 4050 can be selected when the library is compiled. It assumes use of the 4051 configure script, where the optional features are selected or dese- 4052 lected by providing options to configure before running the make com- 4053 mand. However, the same options can be selected in both Unix-like and 4054 non-Unix-like environments if you are using CMake instead of configure 4055 to build PCRE2. 4056 4057 If you are not using Autotools or CMake, option selection can be done 4058 by editing the config.h file, or by passing parameter settings to the 4059 compiler, as described in NON-AUTOTOOLS-BUILD. 
4060 4061 The complete list of options for configure (which includes the standard 4062 ones such as the selection of the installation directory) can be ob- 4063 tained by running 4064 4065 ./configure --help 4066 4067 The following sections include descriptions of "on/off" options whose 4068 names begin with --enable or --disable. Because of the way that config- 4069 ure works, --enable and --disable always come in pairs, so the comple- 4070 mentary option always exists as well, but as it specifies the default, 4071 it is not described. Options that specify values have names that start 4072 with --with. At the end of a configure run, a summary of the configura- 4073 tion is output. 4074 4075 4076BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES 4077 4078 By default, a library called libpcre2-8 is built, containing functions 4079 that take string arguments contained in arrays of bytes, interpreted 4080 either as single-byte characters, or UTF-8 strings. You can also build 4081 two other libraries, called libpcre2-16 and libpcre2-32, which process 4082 strings that are contained in arrays of 16-bit and 32-bit code units, 4083 respectively. These can be interpreted either as single-unit characters 4084 or UTF-16/UTF-32 strings. To build these additional libraries, add one 4085 or both of the following to the configure command: 4086 4087 --enable-pcre2-16 4088 --enable-pcre2-32 4089 4090 If you do not want the 8-bit library, add 4091 4092 --disable-pcre2-8 4093 4094 as well. At least one of the three libraries must be built. Note that 4095 the POSIX wrapper is for the 8-bit library only, and that pcre2grep is 4096 an 8-bit program. Neither of these are built if you select only the 4097 16-bit or 32-bit libraries. 4098 4099 4100BUILDING SHARED AND STATIC LIBRARIES 4101 4102 The Autotools PCRE2 building process uses libtool to build both shared 4103 and static libraries by default. 
You can suppress an unwanted library 4104 by adding one of 4105 4106 --disable-shared 4107 --disable-static 4108 4109 to the configure command. Setting --disable-shared ensures that PCRE2 4110 libraries are built as static libraries. The binaries that are then 4111 created as part of the build process (for example, pcre2test and 4112 pcre2grep) are linked statically with one or more PCRE2 libraries, but 4113 may also be dynamically linked with other libraries such as libc. If 4114 you want these binaries to be fully statically linked, you can set LD- 4115 FLAGS like this: 4116 4117 LDFLAGS=--static ./configure --disable-shared 4118 4119 Note the two hyphens in --static. Of course, this works only if static 4120 versions of all the relevant libraries are available for linking. 4121 4122 4123UNICODE AND UTF SUPPORT 4124 4125 By default, PCRE2 is built with support for Unicode and UTF character 4126 strings. To build it without Unicode support, add 4127 4128 --disable-unicode 4129 4130 to the configure command. This setting applies to all three libraries. 4131 It is not possible to build one library with Unicode support and an- 4132 other without in the same configuration. 4133 4134 Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, 4135 UTF-16 or UTF-32. To do that, applications that use the library can set 4136 the PCRE2_UTF option when they call pcre2_compile() to compile a pat- 4137 tern. Alternatively, patterns may be started with (*UTF) unless the 4138 application has locked this out by setting PCRE2_NEVER_UTF. 4139 4140 UTF support allows the libraries to process character code points up to 4141 0x10ffff in the strings that they handle. Unicode support also gives 4142 access to the Unicode properties of characters, using pattern escapes 4143 such as \P, \p, and \X. Only the general category properties such as Lu 4144 and Nd, script names, and some bi-directional properties are supported. 
4145 Details are given in the pcre2pattern documentation. 4146 4147 Pattern escapes such as \d and \w do not by default make use of Unicode 4148 properties. The application can request that they do by setting the 4149 PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a 4150 pattern may also request this by starting with (*UCP). 4151 4152 4153DISABLING THE USE OF \C 4154 4155 The \C escape sequence, which matches a single code unit, even in a UTF 4156 mode, can cause unpredictable behaviour because it may leave the cur- 4157 rent matching point in the middle of a multi-code-unit character. The 4158 application can lock it out by setting the PCRE2_NEVER_BACKSLASH_C op- 4159 tion when calling pcre2_compile(). There is also a build-time option 4160 4161 --enable-never-backslash-C 4162 4163 (note the upper case C) which locks out the use of \C entirely. 4164 4165 4166JUST-IN-TIME COMPILER SUPPORT 4167 4168 Just-in-time (JIT) compiler support is included in the build by speci- 4169 fying 4170 4171 --enable-jit 4172 4173 This support is available only for certain hardware architectures. If 4174 this option is set for an unsupported architecture, a building error 4175 occurs. If in doubt, use 4176 4177 --enable-jit=auto 4178 4179 which enables JIT only if the current hardware is supported. You can 4180 check if JIT is enabled in the configuration summary that is output at 4181 the end of a configure run. If you are enabling JIT under SELinux you 4182 may also want to add 4183 4184 --enable-jit-sealloc 4185 4186 which enables the use of an execmem allocator in JIT that is compatible 4187 with SELinux. This has no effect if JIT is not enabled. See the 4188 pcre2jit documentation for a discussion of JIT usage. When JIT support 4189 is enabled, pcre2grep automatically makes use of it, unless you add 4190 4191 --disable-pcre2grep-jit 4192 4193 to the configure command. 
4194 4195 4196NEWLINE RECOGNITION 4197 4198 By default, PCRE2 interprets the linefeed (LF) character as indicating 4199 the end of a line. This is the normal newline character on Unix-like 4200 systems. You can compile PCRE2 to use carriage return (CR) instead, by 4201 adding 4202 4203 --enable-newline-is-cr 4204 4205 to the configure command. There is also an --enable-newline-is-lf op- 4206 tion, which explicitly specifies linefeed as the newline character. 4207 4208 Alternatively, you can specify that line endings are to be indicated by 4209 the two-character sequence CRLF (CR immediately followed by LF). If you 4210 want this, add 4211 4212 --enable-newline-is-crlf 4213 4214 to the configure command. There is a fourth option, specified by 4215 4216 --enable-newline-is-anycrlf 4217 4218 which causes PCRE2 to recognize any of the three sequences CR, LF, or 4219 CRLF as indicating a line ending. A fifth option, specified by 4220 4221 --enable-newline-is-any 4222 4223 causes PCRE2 to recognize any Unicode newline sequence. The Unicode 4224 newline sequences are the three just mentioned, plus the single charac- 4225 ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line, 4226 U+0085), LS (line separator, U+2028), and PS (paragraph separator, 4227 U+2029). The final option is 4228 4229 --enable-newline-is-nul 4230 4231 which causes NUL (binary zero) to be set as the default line-ending 4232 character. 4233 4234 Whatever default line ending convention is selected when PCRE2 is built 4235 can be overridden by applications that use the library. At build time 4236 it is recommended to use the standard for your operating system. 4237 4238 4239WHAT \R MATCHES 4240 4241 By default, the sequence \R in a pattern matches any Unicode newline 4242 sequence, independently of what has been selected as the line ending 4243 sequence. If you specify 4244 4245 --enable-bsr-anycrlf 4246 4247 the default is changed so that \R matches only CR, LF, or CRLF. 
What- 4248 ever is selected when PCRE2 is built can be overridden by applications 4249 that use the library. 4250 4251 4252HANDLING VERY LARGE PATTERNS 4253 4254 Within a compiled pattern, offset values are used to point from one 4255 part to another (for example, from an opening parenthesis to an alter- 4256 nation metacharacter). By default, in the 8-bit and 16-bit libraries, 4257 two-byte values are used for these offsets, leading to a maximum size 4258 for a compiled pattern of around 64 thousand code units. This is suffi- 4259 cient to handle all but the most gigantic patterns. Nevertheless, some 4260 people do want to process truly enormous patterns, so it is possible to 4261 compile PCRE2 to use three-byte or four-byte offsets by adding a set- 4262 ting such as 4263 4264 --with-link-size=3 4265 4266 to the configure command. The value given must be 2, 3, or 4. For the 4267 16-bit library, a value of 3 is rounded up to 4. In these libraries, 4268 using longer offsets slows down the operation of PCRE2 because it has 4269 to load additional data when handling them. For the 32-bit library the 4270 value is always 4 and cannot be overridden; the value of --with-link- 4271 size is ignored. 4272 4273 4274LIMITING PCRE2 RESOURCE USAGE 4275 4276 The pcre2_match() function increments a counter each time it goes round 4277 its main loop. Putting a limit on this counter controls the amount of 4278 computing resource used by a single call to pcre2_match(). The limit 4279 can be changed at run time, as described in the pcre2api documentation. 4280 The default is 10 million, but this can be changed by adding a setting 4281 such as 4282 4283 --with-match-limit=500000 4284 4285 to the configure command. This setting also applies to the 4286 pcre2_dfa_match() matching function, and to JIT matching (though the 4287 counting is done differently). 4288 4289 The pcre2_match() function uses heap memory to record backtracking 4290 points. 
The more nested backtracking points there are (that is, the 4291 deeper the search tree), the more memory is needed. There is an upper 4292 limit, specified in kibibytes (units of 1024 bytes). This limit can be 4293 changed at run time, as described in the pcre2api documentation. The 4294 default limit (in effect unlimited) is 20 million. You can change this 4295 by a setting such as 4296 4297 --with-heap-limit=500 4298 4299 which limits the amount of heap to 500 KiB. This limit applies only to 4300 interpretive matching in pcre2_match() and pcre2_dfa_match(), which may 4301 also use the heap for internal workspace when processing complicated 4302 patterns. This limit does not apply when JIT (which has its own memory 4303 arrangements) is used. 4304 4305 You can also explicitly limit the depth of nested backtracking in the 4306 pcre2_match() interpreter. This limit defaults to the value that is set 4307 for --with-match-limit. You can set a lower default limit by adding, 4308 for example, 4309 4310 --with-match-limit-depth=10000 4311 4312 to the configure command. This value can be overridden at run time. 4313 This depth limit indirectly limits the amount of heap memory that is 4314 used, but because the size of each backtracking "frame" depends on the 4315 number of capturing parentheses in a pattern, the amount of heap that 4316 is used before the limit is reached varies from pattern to pattern. 4317 This limit was more useful in versions before 10.30, where function re- 4318 cursion was used for backtracking. 4319 4320 As well as applying to pcre2_match(), the depth limit also controls the 4321 depth of recursive function calls in pcre2_dfa_match(). These are used 4322 for lookaround assertions, atomic groups, and recursion within pat- 4323 terns. The limit does not apply to JIT matching. 


LIMITING VARIABLE-LENGTH LOOKBEHIND ASSERTIONS

       Lookbehind assertions in which one or more branches can match a
       variable number of characters are supported only if there is a maximum
       matching length for each top-level branch. There is a limit to this
       maximum that defaults to 255 characters. You can alter this default by
       a setting such as

         --with-max-varlookbehind=100

       The limit can be changed at run time by calling
       pcre2_set_max_varlookbehind(). Lookbehind assertions in which every
       branch matches a fixed number of characters (not necessarily all the
       same) are not constrained by this limit.


CREATING CHARACTER TABLES AT BUILD TIME

       PCRE2 uses fixed tables for processing characters whose code points are
       less than 256. By default, PCRE2 is built with a set of tables that are
       distributed in the file src/pcre2_chartables.c.dist. These tables are
       for ASCII codes only. If you add

         --enable-rebuild-chartables

       to the configure command, the distributed tables are no longer used.
       Instead, a program called pcre2_dftables is compiled and run. This
       outputs the source for a new set of tables, created in the default
       locale of your C run-time system. This method of replacing the tables
       does not work if you are cross compiling, because pcre2_dftables needs
       to be run on the local host and therefore must not be compiled with the
       cross compiler.

       If you need to create alternative tables when cross compiling, you will
       have to do so "by hand". There may also be other reasons for creating
       tables manually.
To cause pcre2_dftables to be built on the local 4361 host, run a normal compiling command, and then run the program with the 4362 output file as its argument, for example: 4363 4364 cc src/pcre2_dftables.c -o pcre2_dftables 4365 ./pcre2_dftables src/pcre2_chartables.c 4366 4367 This builds the tables in the default locale of the local host. If you 4368 want to specify a locale, you must use the -L option: 4369 4370 LC_ALL=fr_FR ./pcre2_dftables -L src/pcre2_chartables.c 4371 4372 You can also specify -b (with or without -L). This causes the tables to 4373 be written in binary instead of as source code. A set of binary tables 4374 can be loaded into memory by an application and passed to pcre2_com- 4375 pile() in the same way as tables created by calling pcre2_maketables(). 4376 The tables are just a string of bytes, independent of hardware charac- 4377 teristics such as endianness. This means they can be bundled with an 4378 application that runs in different environments, to ensure consistent 4379 behaviour. 4380 4381 4382USING EBCDIC CODE 4383 4384 PCRE2 assumes by default that it will run in an environment where the 4385 character code is ASCII or Unicode, which is a superset of ASCII. This 4386 is the case for most computer operating systems. PCRE2 can, however, be 4387 compiled to run in an 8-bit EBCDIC environment by adding 4388 4389 --enable-ebcdic --disable-unicode 4390 4391 to the configure command. This setting implies --enable-rebuild-charta- 4392 bles. You should only use it if you know that you are in an EBCDIC en- 4393 vironment (for example, an IBM mainframe operating system). 4394 4395 It is not possible to support both EBCDIC and UTF-8 codes in the same 4396 version of the library. Consequently, --enable-unicode and --enable- 4397 ebcdic are mutually exclusive. 4398 4399 The EBCDIC character that corresponds to an ASCII LF is assumed to have 4400 the value 0x15 by default. However, in some EBCDIC environments, 0x25 4401 is used. 
In such an environment you should use 4402 4403 --enable-ebcdic-nl25 4404 4405 as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR 4406 has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and 4407 0x25 is not chosen as LF is made to correspond to the Unicode NEL char- 4408 acter (which, in Unicode, is 0x85). 4409 4410 The options that select newline behaviour, such as --enable-newline-is- 4411 cr, and equivalent run-time options, refer to these character values in 4412 an EBCDIC environment. 4413 4414 4415PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS 4416 4417 By default pcre2grep supports the use of callouts with string arguments 4418 within the patterns it is matching. There are two kinds: one that gen- 4419 erates output using local code, and another that calls an external pro- 4420 gram or script. If --disable-pcre2grep-callout-fork is added to the 4421 configure command, only the first kind of callout is supported; if 4422 --disable-pcre2grep-callout is used, all callouts are completely ig- 4423 nored. For more details of pcre2grep callouts, see the pcre2grep docu- 4424 mentation. 4425 4426 4427PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT 4428 4429 By default, pcre2grep reads all files as plain text. You can build it 4430 so that it recognizes files whose names end in .gz or .bz2, and reads 4431 them with libz or libbz2, respectively, by adding one or both of 4432 4433 --enable-pcre2grep-libz 4434 --enable-pcre2grep-libbz2 4435 4436 to the configure command. These options naturally require that the rel- 4437 evant libraries are installed on your system. Configuration will fail 4438 if they are not. 4439 4440 4441PCRE2GREP BUFFER SIZE 4442 4443 pcre2grep uses an internal buffer to hold a "window" on the file it is 4444 scanning, in order to be able to output "before" and "after" lines when 4445 it finds a match. The default starting size of the buffer is 20KiB. 
       The buffer itself is three times this size, but because of the way it
       is used for holding "before" lines, the longest line that is guaranteed
       to be processable is the notional buffer size. If a longer line is
       encountered, pcre2grep automatically expands the buffer, up to a
       specified maximum size, whose default is 1MiB or the starting size,
       whichever is the larger. You can change the default parameter values
       by adding, for example,

         --with-pcre2grep-bufsize=51200
         --with-pcre2grep-max-bufsize=2097152

       to the configure command. The caller of pcre2grep can override these
       values by using --buffer-size and --max-buffer-size on the command
       line.


PCRE2TEST OPTION FOR LIBREADLINE SUPPORT

       If you add one of

         --enable-pcre2test-libreadline
         --enable-pcre2test-libedit

       to the configure command, pcre2test is linked with the libreadline or
       libedit library, respectively, and when its input is from a terminal,
       it reads it using the readline() function. This provides line-editing
       and history facilities. Note that libreadline is GPL-licensed, so if
       you distribute a binary of pcre2test linked in this way, there may be
       licensing issues. These can be avoided by linking instead with libedit,
       which has a BSD licence.

       Setting --enable-pcre2test-libreadline causes the -lreadline option to
       be added to the pcre2test build. In many operating environments with a
       system-installed readline library this is sufficient. However, in some
       environments (e.g. if an unmodified distribution version of readline is
       in use), some extra configuration may be necessary. The INSTALL file
       for libreadline says this:

         "Readline uses the termcap functions, but does not link with
         the termcap or curses library itself, allowing applications
         which link with readline the to choose an appropriate library."

       If your environment has not been set up so that an appropriate library
       is automatically included, you may need to add something like

         LIBS="-lncurses"

       immediately before the configure command.


INCLUDING DEBUGGING CODE

       If you add

         --enable-debug

       to the configure command, additional debugging code is included in the
       build. This feature is intended for use by the PCRE2 maintainers.


DEBUGGING WITH VALGRIND SUPPORT

       If you add

         --enable-valgrind

       to the configure command, PCRE2 will use valgrind annotations to mark
       certain memory regions as unaddressable. This allows it to detect
       invalid memory accesses, and is mostly useful for debugging PCRE2
       itself.


CODE COVERAGE REPORTING

       If your C compiler is gcc, you can build a version of PCRE2 that can
       generate a code coverage report for its test suite. To enable this, you
       must install lcov version 1.6 or above. Then add

         --enable-coverage

       to the configure command and build PCRE2 in the usual way.

       Note that using ccache (a caching C compiler) is incompatible with code
       coverage reporting. If you have configured ccache to run automatically
       on your system, you must set the environment variable

         CCACHE_DISABLE=1

       before running make to build PCRE2, so that ccache is not used.

       When --enable-coverage is used, the following additional targets are
       added to the Makefile:

         make coverage

       This creates a fresh coverage report for the PCRE2 test suite. It is
       equivalent to running "make coverage-reset", "make coverage-baseline",
       "make check", and then "make coverage-report".

         make coverage-reset

       This zeroes the coverage counters, but does nothing else.

         make coverage-baseline

       This captures baseline coverage information.

         make coverage-report

       This creates the coverage report.

         make coverage-clean-report

       This removes the generated coverage report without cleaning the
       coverage data itself.

         make coverage-clean-data

       This removes the captured coverage data without removing the coverage
       files created at compile time (*.gcno).

         make coverage-clean

       This cleans all coverage data including the generated coverage report.
       For more information about code coverage, see the gcov and lcov
       documentation.


DISABLING THE Z AND T FORMATTING MODIFIERS

       The C99 standard defines formatting modifiers z and t for size_t and
       ptrdiff_t values, respectively. By default, PCRE2 uses these modifiers
       in environments other than old versions of Microsoft Visual Studio when
       __STDC_VERSION__ is defined and has a value greater than or equal to
       199901L (indicating support for C99). However, there is at least one
       environment that claims to be C99 but does not support these modifiers.
       If

         --disable-percent-zt

       is specified, no use is made of the z or t modifiers. Instead of %td or
       %zu, a suitable format is used depending on the size of long for the
       platform.


SUPPORT FOR FUZZERS

       There is a special option for use by people who want to run fuzzing
       tests on PCRE2:

         --enable-fuzz-support

       At present this applies only to the 8-bit library. If set, it causes an
       extra library called libpcre2-fuzzsupport.a to be built, but not
       installed. This contains a single function called
       LLVMFuzzerTestOneInput() whose arguments are a pointer to a string and
       the length of the string. When called, this function tries to compile
       the string as a pattern, and if that succeeds, to match it.
       This is done both with no options and with some random option bits that
       are generated from the string.

       Setting --enable-fuzz-support also causes a binary called
       pcre2fuzzcheck to be created. This is normally run under valgrind or
       used when PCRE2 is compiled with address sanitizing enabled. It calls
       the fuzzing function and outputs information about what it is doing.
       The input strings are specified by arguments: if an argument starts
       with "=" the rest of it is a literal input string. Otherwise, it is
       assumed to be a file name, and the contents of the file are the test
       string.


OBSOLETE OPTION

       In versions of PCRE2 prior to 10.30, there were two ways of handling
       backtracking in the pcre2_match() function. The default was to use the
       system stack, but if

         --disable-stack-for-recursion

       was set, memory on the heap was used. From release 10.30 onwards this
       has changed (the stack is no longer used) and this option now does
       nothing except give a warning.


SEE ALSO

       pcre2api(3), pcre2-config(3).


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 15 April 2024
       Copyright (c) 1997-2024 University of Cambridge.
4644 4645 4646PCRE2 10.44 15 April 2024 PCRE2BUILD(3) 4647------------------------------------------------------------------------------ 4648 4649 4650 4651PCRE2CALLOUT(3) Library Functions Manual PCRE2CALLOUT(3) 4652 4653 4654NAME 4655 PCRE2 - Perl-compatible regular expressions (revised API) 4656 4657 4658SYNOPSIS 4659 4660 #include <pcre2.h> 4661 4662 int (*pcre2_callout)(pcre2_callout_block *, void *); 4663 4664 int pcre2_callout_enumerate(const pcre2_code *code, 4665 int (*callback)(pcre2_callout_enumerate_block *, void *), 4666 void *user_data); 4667 4668 4669DESCRIPTION 4670 4671 PCRE2 provides a feature called "callout", which is a means of tem- 4672 porarily passing control to the caller of PCRE2 in the middle of pat- 4673 tern matching. The caller of PCRE2 provides an external function by 4674 putting its entry point in a match context (see pcre2_set_callout() in 4675 the pcre2api documentation). 4676 4677 When using the pcre2_substitute() function, an additional callout fea- 4678 ture is available. This does a callout after each change to the subject 4679 string and is described in the pcre2api documentation; the rest of this 4680 document is concerned with callouts during pattern matching. 4681 4682 Within a regular expression, (?C<arg>) indicates a point at which the 4683 external function is to be called. Different callout points can be 4684 identified by putting a number less than 256 after the letter C. The 4685 default value is zero. Alternatively, the argument may be a delimited 4686 string. The starting delimiter must be one of ` ' " ^ % # $ { and the 4687 ending delimiter is the same as the start, except for {, where the end- 4688 ing delimiter is }. If the ending delimiter is needed within the 4689 string, it must be doubled. 
For example, this pattern has two callout 4690 points: 4691 4692 (?C1)abc(?C"some ""arbitrary"" text")def 4693 4694 If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, 4695 PCRE2 automatically inserts callouts, all with number 255, before each 4696 item in the pattern except for immediately before or after an explicit 4697 callout. For example, if PCRE2_AUTO_CALLOUT is used with the pattern 4698 4699 A(?C3)B 4700 4701 it is processed as if it were 4702 4703 (?C255)A(?C3)B(?C255) 4704 4705 Here is a more complicated example: 4706 4707 A(\d{2}|--) 4708 4709 With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were 4710 4711 (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255) 4712 4713 Notice that there is a callout before and after each parenthesis and 4714 alternation bar. If the pattern contains a conditional group whose con- 4715 dition is an assertion, an automatic callout is inserted immediately 4716 before the condition. Such a callout may also be inserted explicitly, 4717 for example: 4718 4719 (?(?C9)(?=a)ab|de) (?(?C%text%)(?!=d)ab|de) 4720 4721 This applies only to assertion conditions (because they are themselves 4722 independent groups). 4723 4724 Callouts can be useful for tracking the progress of pattern matching. 4725 The pcre2test program has a pattern qualifier (/auto_callout) that sets 4726 automatic callouts. When any callouts are present, the output from 4727 pcre2test indicates how the pattern is being matched. This is useful 4728 information when you are trying to optimize the performance of a par- 4729 ticular pattern. 4730 4731 4732MISSING CALLOUTS 4733 4734 You should be aware that, because of optimizations in the way PCRE2 4735 compiles and matches patterns, callouts sometimes do not happen exactly 4736 as you might expect. 4737 4738 Auto-possessification 4739 4740 At compile time, PCRE2 "auto-possessifies" repeated items when it knows 4741 that what follows cannot be part of the repeat. 
       For example, a+[bc] is compiled as if it were a++[bc]. The pcre2test
       output when this pattern is compiled with PCRE2_ANCHORED and
       PCRE2_AUTO_CALLOUT and then applied to the string "aaaa" is:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
         No match

       This indicates that when matching [bc] fails, there is no backtracking
       into a+ (because it is being treated as a++) and therefore the callouts
       that would be taken for the backtracks do not occur. You can disable
       the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
       pcre2_compile(), or starting the pattern with (*NO_AUTO_POSSESS). In
       this case, the output changes to this:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
          +2 ^  ^     [bc]
          +2 ^ ^      [bc]
          +2 ^^       [bc]
         No match

       This time, when matching [bc] fails, the matcher backtracks into a+ and
       tries again, repeatedly, until a+ itself fails.

   Automatic .* anchoring

       By default, an optimization is applied when .* is the first significant
       item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
       any character, the pattern is automatically anchored. If PCRE2_DOTALL
       is not set, a match can start only after an internal newline or at the
       beginning of the subject, and pcre2_compile() remembers this. If a
       pattern has more than one top-level branch, automatic anchoring occurs
       if all branches are anchorable.

       This optimization is disabled, however, if .* is in an atomic group or
       if there is a backreference to the capture group in which it appears.
       It is also disabled if the pattern contains (*PRUNE) or (*SKIP).
       However, the presence of callouts does not affect it.

       For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
       and applied to the string "aa", the pcre2test output is:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
         No match

       This shows that all match attempts start at the beginning of the
       subject. In other words, the pattern is anchored. You can disable this
       optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
       starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the
       output changes to:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
          +0  ^     .*
          +2  ^^    \d
          +2  ^     \d
         No match

       This shows more match attempts, starting at the second subject
       character. Another optimization, described in the next section, means
       that there is no subsequent attempt to match with an empty subject.

   Other optimizations

       Other optimizations that provide fast "no match" results also affect
       callouts. For example, if the pattern is

         ab(?C4)cd

       PCRE2 knows that any matching string must contain the letter "d". If
       the subject string is "abyz", the lack of "d" means that matching
       doesn't ever start, and the callout is never reached. However, with
       "abyd", though the result is still no match, the callout is obeyed.

       For most patterns PCRE2 also knows the minimum length of a matching
       string, and will immediately give a "no match" return without actually
       running a match if the subject is not long enough, or, for unanchored
       patterns, if it has been scanned far enough.

       You can disable these optimizations by passing the
       PCRE2_NO_START_OPTIMIZE option to pcre2_compile(), or by starting the
       pattern with (*NO_START_OPT). This slows down the matching process, but
       does ensure that callouts such as the example above are obeyed.


THE CALLOUT INTERFACE

       During matching, when PCRE2 reaches a callout point, if an external
       function is provided in the match context, it is called. This applies
       to normal, DFA, and JIT matching. The first argument to the callout
       function is a pointer to a pcre2_callout block. The second argument
       is the void * callout data that was supplied when the callout was set
       up by calling pcre2_set_callout() (see the pcre2api documentation). The
       callout block structure contains the following fields, not necessarily
       in this order:

         uint32_t      version;
         uint32_t      callout_number;
         uint32_t      capture_top;
         uint32_t      capture_last;
         uint32_t      callout_flags;
         PCRE2_SIZE   *offset_vector;
         PCRE2_SPTR    mark;
         PCRE2_SPTR    subject;
         PCRE2_SIZE    subject_length;
         PCRE2_SIZE    start_match;
         PCRE2_SIZE    current_position;
         PCRE2_SIZE    pattern_position;
         PCRE2_SIZE    next_item_length;
         PCRE2_SIZE    callout_string_offset;
         PCRE2_SIZE    callout_string_length;
         PCRE2_SPTR    callout_string;

       The version field contains the version number of the block format. The
       current version is 2; the three callout string fields were added for
       version 1, and the callout_flags field for version 2. If you are writ-
       ing an application that might use an earlier release of PCRE2, you
       should check the version number before accessing any of these fields.
       The version number will increase in future if more fields are added,
       but the intention is never to remove any of the existing fields.

   Fields for numerical callouts

       For a numerical callout, callout_string is NULL, and callout_number
       contains the number of the callout, in the range 0-255. This is the
       number that follows (?C for callouts that are part of the pattern; it
       is 255 for automatically generated callouts.

   Fields for string callouts

       For callouts with string arguments, callout_number is always zero, and
       callout_string points to the string that is contained within the com-
       piled pattern. Its length is given by callout_string_length. Duplicated
       ending delimiters that were present in the original pattern string have
       been turned into single characters, but there is no other processing of
       the callout string argument. An additional code unit containing binary
       zero is present after the string, but is not included in the length.
       The delimiter that was used to start the string is also stored within
       the pattern, immediately before the string itself. You can access this
       delimiter as callout_string[-1] if you need it.

       The callout_string_offset field is the code unit offset to the start of
       the callout argument string within the original pattern string. This is
       provided for the benefit of applications such as script languages that
       might need to report errors in the callout string within the pattern.

   Fields for all callouts

       The remaining fields in the callout block are the same for both kinds
       of callout.

       The offset_vector field is a pointer to a vector of capturing offsets
       (the "ovector"). You may read the elements in this vector, but you must
       not change any of them.

       For calls to pcre2_match(), the offset_vector field is not (since re-
       lease 10.30) a pointer to the actual ovector that was passed to the
       matching function in the match data block. Instead it points to an in-
       ternal ovector of a size large enough to hold all possible captured
       substrings in the pattern. Note that whenever a recursion or subroutine
       call within a pattern completes, the capturing state is reset to what
       it was before.

       The capture_last field contains the number of the most recently cap-
       tured substring, and the capture_top field contains one more than the
       number of the highest numbered captured substring so far. If no sub-
       strings have yet been captured, the value of capture_last is 0 and the
       value of capture_top is 1. The values of these fields do not always
       differ by one; for example, when the callout in the pattern
       ((a)(b))(?C2) is taken, capture_last is 1 but capture_top is 4.

       The contents of ovector[2] to ovector[<capture_top>*2-1] can be in-
       spected in order to extract substrings that have been matched so far,
       in the same way as extracting substrings after a match has completed.
       The values in ovector[0] and ovector[1] are always PCRE2_UNSET because
       the match is by definition not complete. Substrings that have not been
       captured but whose numbers are less than capture_top also have both of
       their ovector slots set to PCRE2_UNSET.

       For DFA matching, the offset_vector field points to the ovector that
       was passed to the matching function in the match data block for call-
       outs at the top level, but to an internal ovector during the processing
       of pattern recursions, lookarounds, and atomic groups. However, these
       ovectors hold no useful information because pcre2_dfa_match() does not
       support substring capturing. The value of capture_top is always 1 and
       the value of capture_last is always 0 for DFA matching.

       The subject and subject_length fields contain copies of the values that
       were passed to the matching function.

       The start_match field normally contains the offset within the subject
       at which the current match attempt started. However, if the escape se-
       quence \K has been encountered, this value is changed to reflect the
       modified starting point. If the pattern is not anchored, the callout
       function may be called several times from the same point in the pattern
       for different starting points in the subject.

       The current_position field contains the offset within the subject of
       the current match pointer.

       The pattern_position field contains the offset in the pattern string to
       the next item to be matched.

       The next_item_length field contains the length of the next item to be
       processed in the pattern string. When the callout is at the end of the
       pattern, the length is zero. When the callout precedes an opening
       parenthesis, the length includes meta characters that follow the paren-
       thesis. For example, in a callout before an assertion such as (?=ab)
       the length is 3. For an alternation bar or a closing parenthesis, the
       length is one, unless a closing parenthesis is followed by a quanti-
       fier, in which case its length is included. (This changed in release
       10.23. In earlier releases, before an opening parenthesis the length
       was that of the entire group, and before an alternation bar or a clos-
       ing parenthesis the length was zero.)

       The pattern_position and next_item_length fields are intended to help
       in distinguishing between different automatic callouts, which all have
       the same callout number. However, they are set for all callouts, and
       are used by pcre2test to show the next item to be matched when display-
       ing callout information.

       In callouts from pcre2_match() the mark field contains a pointer to the
       zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
       (*THEN) item in the match, or NULL if no such items have been passed.
       Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
       previous (*MARK). In callouts from the DFA matching function this field
       always contains NULL.

       The callout_flags field is always zero in callouts from
       pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
       JIT is used, the following bits may be set:

         PCRE2_CALLOUT_STARTMATCH

       This is set for the first callout after the start of matching for each
       new starting position in the subject.

         PCRE2_CALLOUT_BACKTRACK

       This is set if there has been a matching backtrack since the previous
       callout, or since the start of matching if this is the first callout
       from a pcre2_match() run.

       Both bits are set when a backtrack has caused a "bumpalong" to a new
       starting position in the subject. Output from pcre2test does not indi-
       cate the presence of these bits unless the callout_extra modifier is
       set.

       The information in the callout_flags field is provided so that applica-
       tions can track and tell their users how matching with backtracking is
       done. This can be useful when trying to optimize patterns, or just to
       understand how PCRE2 works. There is no support in pcre2_dfa_match()
       because there is no backtracking in DFA matching, and there is no sup-
       port in JIT because JIT is all about maximizing matching performance.
       In both these cases the callout_flags field is always zero.


RETURN VALUES FROM CALLOUTS

       The external callout function returns an integer to PCRE2. If the value
       is zero, matching proceeds as normal. If the value is greater than
       zero, matching fails at the current point, but the testing of other
       matching possibilities goes ahead, just as if a lookahead assertion had
       failed. If the value is less than zero, the match is abandoned, and the
       matching function returns the negative value.

       Negative values should normally be chosen from the set of PCRE2_ER-
       ROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a standard
       "no match" failure.
       The error number PCRE2_ERROR_CALLOUT is reserved
       for use by callout functions; it will never be used by PCRE2 itself.


CALLOUT ENUMERATION

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       A script language that supports the use of string arguments in callouts
       might like to scan all the callouts in a pattern before running the
       match. This can be done by calling pcre2_callout_enumerate(). The first
       argument is a pointer to a compiled pattern, the second points to a
       callback function, and the third is arbitrary user data. The callback
       function is called for every callout in the pattern in the order in
       which they appear. Its first argument is a pointer to a callout enumer-
       ation block, and its second argument is the user_data value that was
       passed to pcre2_callout_enumerate(). The data block contains the fol-
       lowing fields:

         version                Block version number
         pattern_position       Offset to next item in pattern
         next_item_length       Length of next item in pattern
         callout_number         Number for numbered callouts
         callout_string_offset  Offset to string within pattern
         callout_string_length  Length of callout string
         callout_string         Points to callout string or is NULL

       The version number is currently 0. It will increase if new fields are
       ever added to the block. The remaining fields are the same as their
       namesakes in the pcre2_callout block that is used for callouts during
       matching, as described above.

       Note that the value of pattern_position is unique for each callout.
       However, if a callout occurs inside a group that is quantified with a
       non-zero minimum or a fixed maximum, the group is replicated inside the
       compiled pattern. For example, a pattern such as /(a){2}/ is compiled
       as if it were /(a)(a)/.
       This means that the callout will be enumerated
       more than once, but with the same value for pattern_position in each
       case.

       The callback function should normally return zero. If it returns a non-
       zero value, scanning the pattern stops, and that value is returned from
       pcre2_callout_enumerate().


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 19 January 2024
       Copyright (c) 1997-2024 University of Cambridge.


PCRE2 10.43                     19 January 2024                PCRE2CALLOUT(3)
------------------------------------------------------------------------------



PCRE2COMPAT(3)             Library Functions Manual             PCRE2COMPAT(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


DIFFERENCES BETWEEN PCRE2 AND PERL

       This document describes some of the known differences in the ways that
       PCRE2 and Perl handle regular expressions. The differences described
       here are with respect to Perl version 5.38.0, but as both Perl and
       PCRE2 are continually changing, the information may at times be out of
       date.

       1. When PCRE2_DOTALL (equivalent to Perl's /s qualifier) is not set,
       the behaviour of the '.' metacharacter differs from Perl. In PCRE2, '.'
       matches the next character unless it is the start of a newline se-
       quence. This means that, if the newline setting is CR, CRLF, or NUL,
       '.' will match the code point LF (0x0A) in ASCII/Unicode environments,
       and NL (either 0x15 or 0x25) when using EBCDIC. In Perl, '.' appears
       never to match LF, even when 0x0A is not a newline indicator.

       2. PCRE2 has only a subset of Perl's Unicode support. Details of what
       it does have are given in the pcre2unicode page.

       3. Like Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
       tions, but they do not mean what you might think. For example, (?!a){3}
       does not assert that the next three characters are not "a". It just as-
       serts that the next character is not "a" three times (in principle;
       PCRE2 optimizes this to run the assertion just once). Perl allows some
       repeat quantifiers on other assertions, for example, \b* , but these do
       not seem to have any use. PCRE2 does not allow any kind of quantifier
       on non-lookaround assertions.

       4. If a braced quantifier such as {1,2} appears where there is nothing
       to repeat (for example, at the start of a branch), PCRE2 raises an er-
       ror whereas Perl treats the quantifier characters as literal.

       5. Capture groups that occur inside negative lookaround assertions are
       counted, but their entries in the offsets vector are set only when a
       negative assertion is a condition that has a matching branch (that is,
       the condition is false). Perl may set such capture groups in other
       circumstances.

       6. The following Perl escape sequences are not supported: \F, \l, \L,
       \u, \U, and \N when followed by a character name. \N on its own, match-
       ing a non-newline character, and \N{U+dd..}, matching a Unicode code
       point, are supported. The escapes that modify the case of following
       letters are implemented by Perl's general string-handling and are not
       part of its pattern matching engine. If any of these are encountered by
       PCRE2, an error is generated by default. However, if either of the
       PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U and \u are
       interpreted as ECMAScript interprets them.

       7. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
       is built with Unicode support (the default).
       The properties that can be
       tested with \p and \P are limited to the general category properties
       such as Lu and Nd, the derived properties Any and LC (synonym L&),
       script names such as Greek or Han, Bidi_Class, Bidi_Control, and a few
       binary properties. Both PCRE2 and Perl support the Cs (surrogate) prop-
       erty, but in PCRE2 its use is limited. See the pcre2pattern documenta-
       tion for details. The long synonyms for property names that Perl sup-
       ports (such as \p{Letter}) are not supported by PCRE2, nor is it per-
       mitted to prefix any of these properties with "Is".

       8. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
       in between are treated as literals. However, this is slightly different
       from Perl in that $ and @ are also handled as literals inside the
       quotes. In Perl, they cause variable interpolation (PCRE2 does not have
       variables). Also, Perl does "double-quotish backslash interpolation" on
       any backslashes between \Q and \E which, its documentation says, "may
       lead to confusing results". PCRE2 treats a backslash between \Q and \E
       just like any other character. Note the following examples:

         Pattern            PCRE2 matches     Perl matches

         \Qabc$xyz\E        abc$xyz           abc followed by the
                                                contents of $xyz
         \Qabc\$xyz\E       abc\$xyz          abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
         \QA\B\E            A\B               A\B
         \Q\\E              \                 \\E

       The \Q...\E sequence is recognized both inside and outside character
       classes by both PCRE2 and Perl.

       9. Fairly obviously, PCRE2 does not support the (?{code}) and
       (??{code}) constructions. However, PCRE2 does have a "callout" feature,
       which allows an external function to be called during pattern matching.
       See the pcre2callout documentation for details.

       10. Subroutine calls (whether recursive or not) were treated as atomic
       groups up to PCRE2 release 10.23, but from release 10.30 this changed,
       and backtracking into subroutine calls is now supported, as in Perl.

       11. In PCRE2, if any of the backtracking control verbs are used in a
       group that is called as a subroutine (whether or not recursively),
       their effect is confined to that group; it does not extend to the sur-
       rounding pattern. This is not always the case in Perl. In particular,
       if (*THEN) is present in a group that is called as a subroutine, its
       action is limited to that group, even if the group does not contain any
       | characters. Note that such groups are processed as anchored at the
       point where they are tested.

       12. If a pattern contains more than one backtracking control verb, the
       first one that is backtracked onto acts. For example, in the pattern
       A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure
       in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
       it is the same as PCRE2, but there are cases where it differs.

       13. There are some differences that are concerned with the settings of
       captured strings when part of a pattern is repeated. For example,
       matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2 un-
       set, but in PCRE2 it is set to "b".

       14. PCRE2's handling of duplicate capture group numbers and names is
       not as general as Perl's. This is a consequence of the fact that PCRE2
       works internally just with numbers, using an external table to trans-
       late between numbers and names. In particular, a pattern such as
       (?|(?<a>A)|(?<b>B)), where the two capture groups have the same number
       but different names, is not supported, and causes an error at compile
       time.
       If it were allowed, it would not be possible to distinguish which
       group matched, because both names map to capture group number 1. To
       avoid this confusing situation, an error is given at compile time.

       15. Perl used to recognize comments in some places that PCRE2 does not,
       for example, between the ( and ? at the start of a group. If the /x
       modifier is set, Perl allowed white space between ( and ? though the
       latest Perls give an error (for a while it was just deprecated). There
       may still be some cases where Perl behaves differently.

       16. Perl, when in warning mode, gives warnings for character classes
       such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
       als. PCRE2 has no warning features, so it gives an error in these cases
       because they are almost certainly user mistakes.

       17. In PCRE2, the upper/lower case character properties Lu and Ll are
       not affected when case-independent matching is specified. For example,
       \p{Lu} always matches an upper case letter. I think Perl has changed in
       this respect; in the release at the time of writing (5.38), \p{Lu} and
       \p{Ll} match all letters, regardless of case, when case independence is
       specified.

       18. From release 5.32.0, Perl locks out the use of \K in lookaround as-
       sertions. From release 10.38 PCRE2 does the same by default. However,
       there is an option for re-enabling the previous behaviour. When this
       option is set, \K is acted on when it occurs in positive assertions,
       but is ignored in negative assertions.

       19. PCRE2 provides some extensions to the Perl regular expression fa-
       cilities. Perl 5.10 included new features that were not in earlier
       versions of Perl, some of which (such as named parentheses) were in
       PCRE2 for some time before.
       This list is with respect to Perl 5.38:

       (a) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the
       $ meta-character matches only at the very end of the string.

       (b) A backslash followed by a letter with no special meaning is
       faulted. (Perl can be made to issue a warning.)

       (c) If PCRE2_UNGREEDY is set, the greediness of the repetition quanti-
       fiers is inverted, that is, by default they are not greedy, but if fol-
       lowed by a question mark they are.

       (d) PCRE2_ANCHORED can be used at matching time to force a pattern to
       be tried only at the first matching position in the subject string.

       (e) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY and
       PCRE2_NOTEMPTY_ATSTART options have no Perl equivalents.

       (f) The \R escape sequence can be restricted to match only CR, LF, or
       CRLF by the PCRE2_BSR_ANYCRLF option.

       (g) The callout facility is PCRE2-specific. Perl supports codeblocks
       and variable interpolation, but not general hooks on every match.

       (h) The partial matching facility is PCRE2-specific.

       (i) The alternative matching function (pcre2_dfa_match()) matches in a
       different way and is not Perl-compatible.

       (j) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT)
       at the start of a pattern. These set overall options that cannot be
       changed within the pattern.

       (k) PCRE2 supports non-atomic positive lookaround assertions. This is
       an extension to the lookaround facilities. The default, Perl-compatible
       lookarounds are atomic.

       (l) There are three syntactical items in patterns that can refer to a
       capturing group by number: back references such as \g{2}, subroutine
       calls such as (?3), and condition references such as (?(4)...). PCRE2
       supports relative group numbers such as +2 and -4 in all three cases.
       Perl supports both plus and minus for subroutine calls, but only minus
       for back references, and no relative numbering at all for conditions.

       20. Perl has different limits than PCRE2. See the pcre2limit documenta-
       tion for details. With release 5.10, Perl changed from recursion to
       iteration, keeping the intermediate matches on the heap, which is about
       10% slower but is not subject to any stack-overflow limit. PCRE2 made a
       similar change at release 10.30, and also has many build-time and
       run-time customizable limits.

       21. Unlike Perl, PCRE2 does not have character set modifiers, and in
       particular it has no way to choose the character set by context, as
       Perl's "/d" does. A regular expression using PCRE2_UTF and PCRE2_UCP
       will use similar rules to Perl's "/u"; something closer to "/a" could
       be selected by adding other PCRE2_EXTRA_ASCII* options on top.

       22. Some recursive patterns that Perl diagnoses as infinite recursions
       can be handled by PCRE2, either by the interpreter or the JIT. An exam-
       ple is /(?:|(?0)abcd)(?(R)|\z)/, which matches a sequence of any number
       of repeated "abcd" substrings at the end of the subject.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 30 November 2023
       Copyright (c) 1997-2023 University of Cambridge.


PCRE2 10.43                    30 November 2023                 PCRE2COMPAT(3)
------------------------------------------------------------------------------



PCRE2JIT(3)                Library Functions Manual                PCRE2JIT(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


PCRE2 JUST-IN-TIME COMPILER SUPPORT

       Just-in-time compiling is a heavyweight optimization that can greatly
       speed up pattern matching.
       However, it comes at the cost of extra pro-
       cessing before the match is performed, so it is of most benefit when
       the same pattern is going to be matched many times. This does not nec-
       essarily mean many calls of a matching function; if the pattern is not
       anchored, matching attempts may take place many times at various posi-
       tions in the subject, even for a single call. Therefore, if the subject
       string is very long, it may still pay to use JIT even for one-off
       matches. JIT support is available for all of the 8-bit, 16-bit and
       32-bit PCRE2 libraries.

       JIT support applies only to the traditional Perl-compatible matching
       function. It does not apply when the DFA matching function is being
       used. The code for JIT support was written by Zoltan Herczeg.


AVAILABILITY OF JIT SUPPORT

       JIT support is an optional feature of PCRE2. The "configure" option
       --enable-jit (or equivalent CMake option) must be set when PCRE2 is
       built if you want to use JIT. The support is limited to the following
       hardware platforms:

         ARM 32-bit (v7, and Thumb2)
         ARM 64-bit
         IBM s390x 64 bit
         Intel x86 32-bit and 64-bit
         LoongArch 64 bit
         MIPS 32-bit and 64-bit
         Power PC 32-bit and 64-bit
         RISC-V 32-bit and 64-bit

       If --enable-jit is set on an unsupported platform, compilation fails.

       A client program can tell if JIT support is available by calling
       pcre2_config() with the PCRE2_CONFIG_JIT option. The result is one if
       PCRE2 was built with JIT support, and zero otherwise. However, having
       the JIT code available does not guarantee that it will be used for any
       particular match. One reason for this is that there are a number of op-
       tions and pattern items that are not supported by JIT (see below). An-
       other reason is that in some environments JIT is unable to get memory
       in which to build its compiled code.
       The only guarantee from pcre2_con-
       fig() is that if it returns zero, JIT will definitely not be used.

       A simple program does not need to check availability in order to use
       JIT when possible. The API is implemented in a way that falls back to
       the interpretive code if JIT is not available or cannot be used for a
       given match. For programs that need the best possible performance,
       there is a "fast path" API that is JIT-specific.


SIMPLE USE OF JIT

       To make use of the JIT support in the simplest way, all you have to do
       is to call pcre2_jit_compile() after successfully compiling a pattern
       with pcre2_compile(). This function has two arguments: the first is the
       compiled pattern pointer that was returned by pcre2_compile(), and the
       second is zero or more of the following option bits: PCRE2_JIT_COM-
       PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.

       If JIT support is not available, a call to pcre2_jit_compile() does
       nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
       pattern is passed to the JIT compiler, which turns it into machine code
       that executes much faster than the normal interpretive code, but yields
       exactly the same results. The returned value from pcre2_jit_compile()
       is zero on success, or a negative error code.

       There is a limit to the size of pattern that JIT supports, imposed by
       the size of machine stack that it uses. The exact rules are not docu-
       mented because they may change at any time, in particular, when new op-
       timizations are introduced. If a pattern is too big, a call to
       pcre2_jit_compile() returns PCRE2_ERROR_NOMEMORY.

       PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for com-
       plete matches.
       If you want to run partial matches using the PCRE2_PAR-
       TIAL_HARD or PCRE2_PARTIAL_SOFT options of pcre2_match(), you should
       set one or both of the other options as well as, or instead of
       PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code
       for each of the three modes (normal, soft partial, hard partial). When
       pcre2_match() is called, the appropriate code is run if it is avail-
       able. Otherwise, the pattern is matched using interpretive code.

       You can call pcre2_jit_compile() multiple times for the same compiled
       pattern. It does nothing if it has previously compiled code for any of
       the option bits. For example, you can call it once with PCRE2_JIT_COM-
       PLETE and (perhaps later, when you find you need partial matching)
       again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it
       will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
       ing. If pcre2_jit_compile() is called with no option bits set, it imme-
       diately returns zero. This is an alternative way of testing whether JIT
       is available.

       At present, it is not possible to free JIT compiled code except when
       the entire compiled pattern is freed by calling pcre2_code_free().

       In some circumstances you may need to call additional functions. These
       are described in the section entitled "Controlling the JIT stack" be-
       low.

       There are some pcre2_match() options that are not supported by JIT, and
       there are also some pattern items that JIT cannot handle. Details are
       given below. In both cases, matching automatically falls back to the
       interpretive code. If you want to know whether JIT was actually used
       for a particular match, you should arrange for a JIT callback function
       to be set up as described in the section entitled "Controlling the JIT
       stack" below, even if you do not need to supply a non-default JIT
       stack.
       Such a callback function is called whenever JIT code is about to
       be obeyed. If the match-time options are not right for JIT execution,
       the callback function is not obeyed.

       If the JIT compiler finds an unsupported item, no JIT data is gener-
       ated. You can find out if JIT compilation was successful for a compiled
       pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JITSIZE op-
       tion. A non-zero result means that JIT compilation was successful. A
       result of 0 means that JIT support is not available, or the pattern was
       not processed by pcre2_jit_compile(), or the JIT compiler was not able
       to handle the pattern. Successful JIT compilation does not, however,
       guarantee the use of JIT at match time because there are some match
       time options that are not supported by JIT.


MATCHING SUBJECTS CONTAINING INVALID UTF

       When a pattern is compiled with the PCRE2_UTF option, subject strings
       are normally expected to be a valid sequence of UTF code units. By de-
       fault, this is checked at the start of matching and an error is gener-
       ated if invalid UTF is detected. The PCRE2_NO_UTF_CHECK option can be
       passed to pcre2_match() to skip the check (for improved performance) if
       you are sure that a subject string is valid. If this option is used
       with an invalid string, the result is undefined. The calling program
       may crash or loop or otherwise misbehave.

       However, a way of running matches on strings that may contain invalid
       UTF sequences is available. Calling pcre2_compile() with the
       PCRE2_MATCH_INVALID_UTF option has two effects: it tells the inter-
       preter in pcre2_match() to support invalid UTF, and, if pcre2_jit_com-
       pile() is subsequently called, the compiled JIT code also supports in-
       valid UTF. Details of how this support works, in both the JIT and the
       interpretive cases, are given in the pcre2unicode documentation.

       There is also an obsolete option for pcre2_jit_compile() called
       PCRE2_JIT_INVALID_UTF, which currently exists only for backward
       compatibility. It is superseded by the pcre2_compile() option
       PCRE2_MATCH_INVALID_UTF and should no longer be used. It may be
       removed in future.


UNSUPPORTED OPTIONS AND PATTERN ITEMS

       The pcre2_match() options that are supported for JIT matching are
       PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL,
       PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
       PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The PCRE2_ANCHORED and
       PCRE2_ENDANCHORED options are not supported at match time.

       If the PCRE2_NO_JIT option is passed to pcre2_match() it disables
       the use of JIT, forcing matching by the interpreter code.

       The only unsupported pattern items are \C (match a single data unit)
       when running in a UTF mode, and a callout immediately before an
       assertion condition in a conditional group.


RETURN VALUES FROM JIT MATCHING

       When a pattern is matched using JIT, the return values are the same
       as those given by the interpretive pcre2_match() code, with the
       addition of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This
       means that the memory used for the JIT stack was insufficient. See
       "Controlling the JIT stack" below for a discussion of JIT stack
       usage.

       The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
       searching a very large pattern tree goes on for too long, as it is
       in the same circumstance when JIT is not used, but the details of
       exactly what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT
       error code is never returned when JIT matching is used.


CONTROLLING THE JIT STACK

       When the compiled JIT code runs, it needs a block of memory to use
       as a stack. By default, it uses 32KiB on the machine stack.
       However, some large or complicated patterns need more than this. The
       error PCRE2_ERROR_JIT_STACKLIMIT is given when there is not enough
       stack. Three functions are provided for managing blocks of memory
       for use as JIT stacks. There is further discussion about the use of
       JIT stacks in the section entitled "JIT stack FAQ" below.

       The pcre2_jit_stack_create() function creates a JIT stack. Its
       arguments are a starting size, a maximum size, and a general context
       (for memory allocation functions, or NULL for standard memory
       allocation). It returns a pointer to an opaque structure of type
       pcre2_jit_stack, or NULL if there is an error. The
       pcre2_jit_stack_free() function is used to free a stack that is no
       longer needed. If its argument is NULL, this function returns
       immediately, without doing anything. (For the technically minded:
       the address space is allocated by mmap or VirtualAlloc.) A maximum
       stack size of 512KiB to 1MiB should be more than enough for any
       pattern.

       The pcre2_jit_stack_assign() function specifies which stack JIT code
       should use. Its arguments are as follows:

         pcre2_match_context  *mcontext
         pcre2_jit_callback    callback
         void                 *data

       The first argument is a pointer to a match context. When this is
       subsequently passed to a matching function, its information
       determines which JIT stack is used. If this argument is NULL, the
       function returns immediately, without doing anything. There are
       three cases for the values of the other two arguments:

         (1) If callback is NULL and data is NULL, an internal 32KiB block
             on the machine stack is used. This is the default when a match
             context is created.

         (2) If callback is NULL and data is not NULL, data must be
             a pointer to a valid JIT stack, the result of calling
             pcre2_jit_stack_create().

         (3) If callback is not NULL, it must point to a function that is
             called with data as an argument at the start of matching, in
             order to set up a JIT stack. If the return from the callback
             function is NULL, the internal 32KiB stack is used; otherwise
             the return value must be a valid JIT stack, the result of
             calling pcre2_jit_stack_create().

       A callback function is called whenever JIT code is about to be run;
       it is not called when pcre2_match() is called with options that are
       incompatible with JIT matching. A callback function can therefore be
       used to determine whether a match operation was executed by JIT or
       by the interpreter.

       You may safely use the same JIT stack for more than one pattern
       (either by assigning directly or by callback), as long as the
       patterns are matched sequentially in the same thread. Currently, the
       only way to set up non-sequential matches in one thread is to use
       callouts: if a callout function starts another match, that match
       must use a different JIT stack to the one used for currently
       suspended match(es).

       In a multithreaded application, if you do not specify a JIT stack,
       or if you assign or pass back NULL from a callback, that is
       thread-safe, because each thread has its own machine stack. However,
       if you assign or pass back a non-NULL JIT stack, this must be a
       different stack for each thread so that the application is
       thread-safe.

       Strictly speaking, even more is allowed. You can assign the same
       non-NULL stack to a match context that is used by any number of
       patterns, as long as they are not used for matching by multiple
       threads at the same time. For example, you could use the same stack
       in all compiled patterns, with a global mutex in the callback to
       wait until the stack is available for use. However, this is an
       inefficient solution, and not recommended.

       This is a suggestion for how a multithreaded program that needs to
       set up non-default JIT stacks might operate:

         During thread initialization
           thread_local_var = pcre2_jit_stack_create(...)

         During thread exit
           pcre2_jit_stack_free(thread_local_var)

         Use a one-line callback function
           return thread_local_var

       All the functions described in this section do nothing if JIT is not
       available.


JIT STACK FAQ

       (1) Why do we need JIT stacks?

       PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a
       stack where the local data of the current node is pushed before
       checking its child nodes. Allocating real machine stack on some
       platforms is difficult. For example, on PowerPC the stack chain
       needs to be updated every time the stack is extended. Although this
       is possible, the time spent updating it reduces performance. So we
       do the recursion in memory.

       (2) Why don't we simply allocate blocks of memory with malloc()?

       Modern operating systems have a nice feature: they can reserve an
       address space instead of allocating memory. We can safely allocate
       memory pages inside this address space, so the stack can grow
       without moving memory data (this is important because of pointers).
       Thus we can allocate 1MiB of address space, and use only a single
       memory page (usually 4KiB) if that is enough. However, we can still
       grow up to 1MiB at any time if needed.

       (3) Who "owns" a JIT stack?

       The owner of the stack is the user program, not the JIT-compiled
       pattern or anything else. The user program must ensure that if a
       stack is being used by pcre2_match() (that is, it is assigned to a
       match context that is passed to the pattern currently running), that
       stack is not used by any other threads (to avoid overwriting the
       same memory area).
       The best practice for multithreaded programs is to allocate a stack
       for each thread, and return this stack through the JIT callback
       function.

       (4) When should a JIT stack be freed?

       You can free a JIT stack at any time, as long as it will not be used
       by pcre2_match() again. When you assign the stack to a match
       context, only a pointer is set. There is no reference counting or
       any other magic. You can free compiled patterns, contexts, and
       stacks in any order, at any time. Just do not call pcre2_match()
       with a match context pointing to an already freed stack, as that can
       cause a segmentation fault. (Also, do not free a stack currently
       used by pcre2_match() in another thread.) You can also replace the
       stack in a context at any time when it is not in use. You should
       free the previous stack before assigning a replacement.

       (5) Should I allocate/free a stack every time before/after calling
       pcre2_match()?

       No, because this is too costly in terms of resources. However, you
       could implement a scheme that releases the stack if it has not been
       used for, say, two minutes. The JIT callback can help to achieve
       this without keeping a list of patterns.

       (6) OK, the stack is for long term memory allocation. But what
       happens if a pattern causes stack overflow with a stack of 1MiB? Is
       that 1MiB kept until the stack is freed?

       Especially on embedded systems, it might be a good idea to release
       memory sometimes without freeing the stack. There is no API for this
       at the moment. A function that returns the memory currently
       allocated for a stack, and another that releases memory (shrinking
       the stack), would probably be a good idea if someone needs this.

       (7) This is too much of a headache. Isn't there any better solution
       for JIT stack handling?

       No, thanks to Windows.
       If POSIX threads were used everywhere, we could throw out this
       complicated API.


FREEING JIT SPECULATIVE MEMORY

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       The JIT executable allocator does not free all memory as soon as it
       could. It expects new allocations, and keeps some free memory around
       to improve allocation speed. However, in low memory conditions, it
       might be better to free all possible memory. You can cause this to
       happen by calling pcre2_jit_free_unused_memory(). Its argument is a
       general context, for custom memory management, or NULL for standard
       memory management.


EXAMPLE CODE

       This is a single-threaded example that specifies a JIT stack without
       using a callback. The variables pattern, subject, and length are
       assumed to be defined elsewhere. A real program should include error
       checking after all the function calls.

         int rc;
         int errornumber;
         PCRE2_SIZE erroffset;
         pcre2_code *re;
         pcre2_match_data *match_data;
         pcre2_match_context *mcontext;
         pcre2_jit_stack *jit_stack;

         re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
           &errornumber, &erroffset, NULL);
         rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
         mcontext = pcre2_match_context_create(NULL);
         jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
         pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
         match_data = pcre2_match_data_create(re, 10);
         rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
         /* Process result */

         pcre2_code_free(re);
         pcre2_match_data_free(match_data);
         pcre2_match_context_free(mcontext);
         pcre2_jit_stack_free(jit_stack);


JIT FAST PATH API

       Because the API described above falls back to interpreted matching
       when JIT is not available, it is convenient for programs that are
       written for general use in many environments. However, calling JIT
       via pcre2_match() does have a performance impact.
Programs that are written 5713 for use where JIT is known to be available, and which need the best 5714 possible performance, can instead use a "fast path" API to call JIT 5715 matching directly instead of calling pcre2_match() (obviously only for 5716 patterns that have been successfully processed by pcre2_jit_compile()). 5717 5718 The fast path function is called pcre2_jit_match(), and it takes ex- 5719 actly the same arguments as pcre2_match(). However, the subject string 5720 must be specified with a length; PCRE2_ZERO_TERMINATED is not sup- 5721 ported. Unsupported option bits (for example, PCRE2_ANCHORED and 5722 PCRE2_ENDANCHORED) are ignored, as is the PCRE2_NO_JIT option. The re- 5723 turn values are also the same as for pcre2_match(), plus PCRE2_ER- 5724 ROR_JIT_BADOPTION if a matching mode (partial or complete) is requested 5725 that was not compiled. 5726 5727 When you call pcre2_match(), as well as testing for invalid options, a 5728 number of other sanity checks are performed on the arguments. For exam- 5729 ple, if the subject pointer is NULL but the length is non-zero, an im- 5730 mediate error is given. Also, unless PCRE2_NO_UTF_CHECK is set, a UTF 5731 subject string is tested for validity. In the interests of speed, these 5732 checks do not happen on the JIT fast path. If invalid UTF data is 5733 passed when PCRE2_MATCH_INVALID_UTF was not set for pcre2_compile(), 5734 the result is undefined. The program may crash or loop or give wrong 5735 results. In the absence of PCRE2_MATCH_INVALID_UTF you should call 5736 pcre2_jit_match() in UTF mode only if you are sure the subject is 5737 valid. 5738 5739 Bypassing the sanity checks and the pcre2_match() wrapping can give 5740 speedups of more than 10%. 5741 5742 5743SEE ALSO 5744 5745 pcre2api(3), pcre2unicode(3) 5746 5747 5748AUTHOR 5749 5750 Philip Hazel (FAQ by Zoltan Herczeg) 5751 Retired from University Computing Service 5752 Cambridge, England. 
5753 5754 5755REVISION 5756 5757 Last updated: 21 February 2024 5758 Copyright (c) 1997-2024 University of Cambridge. 5759 5760 5761PCRE2 10.43 21 February 2024 PCRE2JIT(3) 5762------------------------------------------------------------------------------ 5763 5764 5765 5766PCRE2LIMITS(3) Library Functions Manual PCRE2LIMITS(3) 5767 5768 5769NAME 5770 PCRE2 - Perl-compatible regular expressions (revised API) 5771 5772 5773SIZE AND OTHER LIMITATIONS 5774 5775 There are some size limitations in PCRE2 but it is hoped that they will 5776 never in practice be relevant. 5777 5778 The maximum size of a compiled pattern is approximately 64 thousand 5779 code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with 5780 the default internal linkage size, which is 2 bytes for these li- 5781 braries. If you want to process regular expressions that are truly 5782 enormous, you can compile PCRE2 with an internal linkage size of 3 or 4 5783 (when building the 16-bit library, 3 is rounded up to 4). See the 5784 README file in the source distribution and the pcre2build documentation 5785 for details. In these cases the limit is substantially larger. How- 5786 ever, the speed of execution is slower. In the 32-bit library, the in- 5787 ternal linkage size is always 4. 5788 5789 The maximum length of a source pattern string is essentially unlimited; 5790 it is the largest number a PCRE2_SIZE variable can hold. However, the 5791 program that calls pcre2_compile() can specify a smaller limit. 5792 5793 The maximum length (in code units) of a subject string is one less than 5794 the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an un- 5795 signed integer type, usually defined as size_t. Its maximum value (that 5796 is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-termi- 5797 nated strings and unset offsets. 5798 5799 All values in repeating quantifiers must be less than 65536. 
5800 5801 There are two different limits that apply to branches of lookbehind as- 5802 sertions. If every branch in such an assertion matches a fixed number 5803 of characters, the maximum length of any branch is 65535 characters. If 5804 any branch matches a variable number of characters, then the maximum 5805 matching length for every branch is limited. The default limit is set 5806 at compile time, defaulting to 255, but can be changed by the calling 5807 program. 5808 5809 There is no limit to the number of parenthesized groups, but there can 5810 be no more than 65535 capture groups, and there is a limit to the depth 5811 of nesting of parenthesized subpatterns of all kinds. This is imposed 5812 in order to limit the amount of system stack used at compile time. The 5813 default limit can be specified when PCRE2 is built; if not, the default 5814 is set to 250. An application can change this limit by calling 5815 pcre2_set_parens_nest_limit() to set the limit in a compile context. 5816 5817 The maximum length of name for a named capture group is 32 code units, 5818 and the maximum number of such groups is 10000. 5819 5820 The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or 5821 (*THEN) verb is 255 code units for the 8-bit library and 65535 code 5822 units for the 16-bit and 32-bit libraries. 5823 5824 The maximum length of a string argument to a callout is the largest 5825 number a 32-bit unsigned integer can hold. 5826 5827 The maximum amount of heap memory used for matching is controlled by 5828 the heap limit, which can be set in a pattern or in a match context. 5829 The default is a very large number, effectively unlimited. 5830 5831 5832AUTHOR 5833 5834 Philip Hazel 5835 Retired from University Computing Service 5836 Cambridge, England. 5837 5838 5839REVISION 5840 5841 Last updated: August 2023 5842 Copyright (c) 1997-2023 University of Cambridge. 


PCRE2 10.43                      1 August 2023                  PCRE2LIMITS(3)
------------------------------------------------------------------------------



PCRE2MATCHING(3)           Library Functions Manual           PCRE2MATCHING(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


PCRE2 MATCHING ALGORITHMS

       This document describes the two different algorithms that are
       available in PCRE2 for matching a compiled regular expression
       against a given subject string. The "standard" algorithm is the one
       provided by the pcre2_match() function. This works in the same way
       as Perl's matching function, and provides a Perl-compatible matching
       operation. The just-in-time (JIT) optimization that is described in
       the pcre2jit documentation is compatible with this function.

       An alternative algorithm is provided by the pcre2_dfa_match()
       function; it operates in a different way, and is not
       Perl-compatible. This alternative has advantages and disadvantages
       compared with the standard algorithm, and these are described below.

       When there is only one possible way in which a given subject string
       can match a pattern, the two algorithms give the same answer. A
       difference arises, however, when there are multiple possibilities.
       For example, if the pattern

         ^<.*>

       is matched against the string

         <something> <something else> <something further>

       there are three possible answers. The standard algorithm finds only
       one of them, whereas the alternative algorithm finds all three.


REGULAR EXPRESSIONS AS TREES

       The set of strings that are matched by a regular expression can be
       represented as a tree structure. An unlimited repetition in the
       pattern makes the tree of infinite size, but it is still a tree.
Matching the 5892 pattern to a given subject string (from a given starting point) can be 5893 thought of as a search of the tree. There are two ways to search a 5894 tree: depth-first and breadth-first, and these correspond to the two 5895 matching algorithms provided by PCRE2. 5896 5897 5898THE STANDARD MATCHING ALGORITHM 5899 5900 In the terminology of Jeffrey Friedl's book "Mastering Regular Expres- 5901 sions", the standard algorithm is an "NFA algorithm". It conducts a 5902 depth-first search of the pattern tree. That is, it proceeds along a 5903 single path through the tree, checking that the subject matches what is 5904 required. When there is a mismatch, the algorithm tries any alterna- 5905 tives at the current point, and if they all fail, it backs up to the 5906 previous branch point in the tree, and tries the next alternative 5907 branch at that level. This often involves backing up (moving to the 5908 left) in the subject string as well. The order in which repetition 5909 branches are tried is controlled by the greedy or ungreedy nature of 5910 the quantifier. 5911 5912 If a leaf node is reached, a matching string has been found, and at 5913 that point the algorithm stops. Thus, if there is more than one possi- 5914 ble match, this algorithm returns the first one that it finds. Whether 5915 this is the shortest, the longest, or some intermediate length depends 5916 on the way the alternations and the greedy or ungreedy repetition quan- 5917 tifiers are specified in the pattern. 5918 5919 Because it ends up with a single path through the tree, it is rela- 5920 tively straightforward for this algorithm to keep track of the sub- 5921 strings that are matched by portions of the pattern in parentheses. 5922 This provides support for capturing parentheses and backreferences. 5923 5924 5925THE ALTERNATIVE MATCHING ALGORITHM 5926 5927 This algorithm conducts a breadth-first search of the tree. 
Starting 5928 from the first matching point in the subject, it scans the subject 5929 string from left to right, once, character by character, and as it does 5930 this, it remembers all the paths through the tree that represent valid 5931 matches. In Friedl's terminology, this is a kind of "DFA algorithm", 5932 though it is not implemented as a traditional finite state machine (it 5933 keeps multiple states active simultaneously). 5934 5935 Although the general principle of this matching algorithm is that it 5936 scans the subject string only once, without backtracking, there is one 5937 exception: when a lookaround assertion is encountered, the characters 5938 following or preceding the current point have to be independently in- 5939 spected. 5940 5941 The scan continues until either the end of the subject is reached, or 5942 there are no more unterminated paths. At this point, terminated paths 5943 represent the different matching possibilities (if there are none, the 5944 match has failed). Thus, if there is more than one possible match, 5945 this algorithm finds all of them, and in particular, it finds the 5946 longest. The matches are returned in the output vector in decreasing 5947 order of length. There is an option to stop the algorithm after the 5948 first match (which is necessarily the shortest) is found. 5949 5950 Note that the size of vector needed to contain all the results depends 5951 on the number of simultaneous matches, not on the number of parentheses 5952 in the pattern. Using pcre2_match_data_create_from_pattern() to create 5953 the match data block is therefore not advisable when doing DFA match- 5954 ing. 5955 5956 Note also that all the matches that are found start at the same point 5957 in the subject. If the pattern 5958 5959 cat(er(pillar)?)? 5960 5961 is matched against the string "the caterpillar catchment", the result 5962 is the three strings "caterpillar", "cater", and "cat" that start at 5963 the fifth character of the subject. 
The algorithm does not automati- 5964 cally move on to find matches that start at later positions. 5965 5966 PCRE2's "auto-possessification" optimization usually applies to charac- 5967 ter repeats at the end of a pattern (as well as internally). For exam- 5968 ple, the pattern "a\d+" is compiled as if it were "a\d++" because there 5969 is no point even considering the possibility of backtracking into the 5970 repeated digits. For DFA matching, this means that only one possible 5971 match is found. If you really do want multiple matches in such cases, 5972 either use an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS- 5973 SESS option when compiling. 5974 5975 There are a number of features of PCRE2 regular expressions that are 5976 not supported or behave differently in the alternative matching func- 5977 tion. Those that are not supported cause an error if encountered. 5978 5979 1. Because the algorithm finds all possible matches, the greedy or un- 5980 greedy nature of repetition quantifiers is not relevant (though it may 5981 affect auto-possessification, as just described). During matching, 5982 greedy and ungreedy quantifiers are treated in exactly the same way. 5983 However, possessive quantifiers can make a difference when what follows 5984 could also match what is quantified, for example in a pattern like 5985 this: 5986 5987 ^a++\w! 5988 5989 This pattern matches "aaab!" but not "aaa!", which would be matched by 5990 a non-possessive quantifier. Similarly, if an atomic group is present, 5991 it is matched as if it were a standalone pattern at the current point, 5992 and the longest match is then "locked in" for the rest of the overall 5993 pattern. 5994 5995 2. When dealing with multiple paths through the tree simultaneously, it 5996 is not straightforward to keep track of captured substrings for the 5997 different matching possibilities, and PCRE2's implementation of this 5998 algorithm does not attempt to do this. 
This means that no captured sub- 5999 strings are available. 6000 6001 3. Because no substrings are captured, backreferences within the pat- 6002 tern are not supported. 6003 6004 4. For the same reason, conditional expressions that use a backrefer- 6005 ence as the condition or test for a specific group recursion are not 6006 supported. 6007 6008 5. Again for the same reason, script runs are not supported. 6009 6010 6. Because many paths through the tree may be active, the \K escape se- 6011 quence, which resets the start of the match when encountered (but may 6012 be on some paths and not on others), is not supported. 6013 6014 7. Callouts are supported, but the value of the capture_top field is 6015 always 1, and the value of the capture_last field is always 0. 6016 6017 8. The \C escape sequence, which (in the standard algorithm) always 6018 matches a single code unit, even in a UTF mode, is not supported in 6019 these modes, because the alternative algorithm moves through the sub- 6020 ject string one character (not code unit) at a time, for all active 6021 paths through the tree. 6022 6023 9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) 6024 are not supported. (*FAIL) is supported, and behaves like a failing 6025 negative assertion. 6026 6027 10. The PCRE2_MATCH_INVALID_UTF option for pcre2_compile() is not sup- 6028 ported by pcre2_dfa_match(). 6029 6030 6031ADVANTAGES OF THE ALTERNATIVE ALGORITHM 6032 6033 The main advantage of the alternative algorithm is that all possible 6034 matches (at a single point in the subject) are automatically found, and 6035 in particular, the longest match is found. To find more than one match 6036 at the same point using the standard algorithm, you have to do kludgy 6037 things with callouts. 6038 6039 Partial matching is possible with this algorithm, though it has some 6040 limitations. The pcre2partial documentation gives details of partial 6041 matching and discusses multi-segment matching. 


DISADVANTAGES OF THE ALTERNATIVE ALGORITHM

       The alternative algorithm suffers from a number of disadvantages:

       1. It is substantially slower than the standard algorithm. This is
       partly because it has to search for all possible matches, but is
       also because it is less susceptible to optimization.

       2. Capturing parentheses, backreferences, script runs, and matching
       within invalid UTF strings are not supported.

       3. Although atomic groups are supported, their use does not provide
       the performance advantage that it does for the standard algorithm.

       4. JIT optimization is not supported.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 19 January 2024
       Copyright (c) 1997-2024 University of Cambridge.


PCRE2 10.43                     19 January 2024               PCRE2MATCHING(3)
------------------------------------------------------------------------------



PCRE2PARTIAL(3)            Library Functions Manual            PCRE2PARTIAL(3)


NAME
       PCRE2 - Perl-compatible regular expressions


PARTIAL MATCHING IN PCRE2

       In normal use of PCRE2, if there is a match up to the end of a
       subject string, but more characters are needed to match the entire
       pattern, PCRE2_ERROR_NOMATCH is returned, just like any other
       failing match. There are circumstances where it might be helpful to
       distinguish this "partial match" case.

       One example is an application where the subject string is very long,
       and not all available at once. The requirement here is to be able to
       do the matching segment by segment, but special action is needed
       when a matched substring spans the boundary between two segments.

       Another example is checking a user input string as it is typed, to
       ensure that it conforms to a required format.
Invalid characters can be 6101 immediately diagnosed and rejected, giving instant feedback. 6102 6103 Partial matching is a PCRE2-specific feature; it is not Perl-compati- 6104 ble. It is requested by setting one of the PCRE2_PARTIAL_HARD or 6105 PCRE2_PARTIAL_SOFT options when calling a matching function. The dif- 6106 ference between the two options is whether or not a partial match is 6107 preferred to an alternative complete match, though the details differ 6108 between the two types of matching function. If both options are set, 6109 PCRE2_PARTIAL_HARD takes precedence. 6110 6111 If you want to use partial matching with just-in-time optimized code, 6112 as well as setting a partial match option for the matching function, 6113 you must also call pcre2_jit_compile() with one or both of these op- 6114 tions: 6115 6116 PCRE2_JIT_PARTIAL_HARD 6117 PCRE2_JIT_PARTIAL_SOFT 6118 6119 PCRE2_JIT_COMPLETE should also be set if you are going to run non-par- 6120 tial matches on the same pattern. Separate code is compiled for each 6121 mode. If the appropriate JIT mode has not been compiled, interpretive 6122 matching code is used. 6123 6124 Setting a partial matching option disables two of PCRE2's standard op- 6125 timization hints. PCRE2 remembers the last literal code unit in a pat- 6126 tern, and abandons matching immediately if it is not present in the 6127 subject string. This optimization cannot be used for a subject string 6128 that might match only partially. PCRE2 also remembers a minimum length 6129 of a matching string, and does not bother to run the matching function 6130 on shorter strings. This optimization is also disabled for partial 6131 matching. 6132 6133 6134REQUIREMENTS FOR A PARTIAL MATCH 6135 6136 A possible partial match occurs during matching when the end of the 6137 subject string is reached successfully, but either more characters are 6138 needed to complete the match, or the addition of more characters might 6139 change what is matched. 
6140 6141 Example 1: if the pattern is /abc/ and the subject is "ab", more char- 6142 acters are definitely needed to complete a match. In this case both 6143 hard and soft matching options yield a partial match. 6144 6145 Example 2: if the pattern is /ab+/ and the subject is "ab", a complete 6146 match can be found, but the addition of more characters might change 6147 what is matched. In this case, only PCRE2_PARTIAL_HARD returns a par- 6148 tial match; PCRE2_PARTIAL_SOFT returns the complete match. 6149 6150 On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if 6151 the next pattern item is \z, \Z, \b, \B, or $ there is always a partial 6152 match. Otherwise, for both options, the next pattern item must be one 6153 that inspects a character, and at least one of the following must be 6154 true: 6155 6156 (1) At least one character has already been inspected. An inspected 6157 character need not form part of the final matched string; lookbehind 6158 assertions and the \K escape sequence provide ways of inspecting char- 6159 acters before the start of a matched string. 6160 6161 (2) The pattern contains one or more lookbehind assertions. This condi- 6162 tion exists in case there is a lookbehind that inspects characters be- 6163 fore the start of the match. 6164 6165 (3) There is a special case when the whole pattern can match an empty 6166 string. When the starting point is at the end of the subject, the 6167 empty string match is a possibility, and if PCRE2_PARTIAL_SOFT is set 6168 and neither of the above conditions is true, it is returned. However, 6169 because adding more characters might result in a non-empty match, 6170 PCRE2_PARTIAL_HARD returns a partial match, which in this case means 6171 "there is going to be a match at this point, but until some more char- 6172 acters are added, we do not know if it will be an empty string or some- 6173 thing longer". 
6174 6175 6176PARTIAL MATCHING USING pcre2_match() 6177 6178 When a partial matching option is set, the result of calling 6179 pcre2_match() can be one of the following: 6180 6181 A successful match 6182 A complete match has been found, starting and ending within this sub- 6183 ject. 6184 6185 PCRE2_ERROR_NOMATCH 6186 No match can start anywhere in this subject. 6187 6188 PCRE2_ERROR_PARTIAL 6189 Adding more characters may result in a complete match that uses one 6190 or more characters from the end of this subject. 6191 6192 When a partial match is returned, the first two elements in the ovector 6193 point to the portion of the subject that was matched, but the values in 6194 the rest of the ovector are undefined. The appearance of \K in the pat- 6195 tern has no effect for a partial match. Consider this pattern: 6196 6197 /abc\K123/ 6198 6199 If it is matched against "456abc123xyz" the result is a complete match, 6200 and the ovector defines the matched string as "123", because \K resets 6201 the "start of match" point. However, if a partial match is requested 6202 and the subject string is "456abc12", a partial match is found for the 6203 string "abc12", because all these characters are needed for a subse- 6204 quent re-match with additional characters. 6205 6206 If there is more than one partial match, the first one that was found 6207 provides the data that is returned. Consider this pattern: 6208 6209 /123\w+X|dogY/ 6210 6211 If this is matched against the subject string "abc123dog", both alter- 6212 natives fail to match, but the end of the subject is reached during 6213 matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 6214 and 9, identifying "123dog" as the first partial match. (In this exam- 6215 ple, there are two partial matches, because "dog" on its own partially 6216 matches the second alternative.) 
6217 6218 How a partial match is processed by pcre2_match() 6219 6220 What happens when a partial match is identified depends on which of the 6221 two partial matching options is set. 6222 6223 If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon 6224 as a partial match is found, without continuing to search for possible 6225 complete matches. This option is "hard" because it prefers an earlier 6226 partial match over a later complete match. For this reason, the assump- 6227 tion is made that the end of the supplied subject string is not the 6228 true end of the available data, which is why \z, \Z, \b, \B, and $ al- 6229 ways give a partial match. 6230 6231 If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but 6232 matching continues as normal, and other alternatives in the pattern are 6233 tried. If no complete match can be found, PCRE2_ERROR_PARTIAL is re- 6234 turned instead of PCRE2_ERROR_NOMATCH. This option is "soft" because it 6235 prefers a complete match over a partial match. All the various matching 6236 items in a pattern behave as if the subject string is potentially com- 6237 plete; \z, \Z, and $ match at the end of the subject, as normal, and 6238 for \b and \B the end of the subject is treated as a non-alphanumeric. 6239 6240 The difference between the two partial matching options can be illus- 6241 trated by a pattern such as: 6242 6243 /dog(sbody)?/ 6244 6245 This matches either "dog" or "dogsbody", greedily (that is, it prefers 6246 the longer string if possible). If it is matched against the string 6247 "dog" with PCRE2_PARTIAL_SOFT, it yields a complete match for "dog". 6248 However, if PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR- 6249 TIAL. On the other hand, if the pattern is made ungreedy the result is 6250 different: 6251 6252 /dog(sbody)??/ 6253 6254 In this case the result is always a complete match because that is 6255 found first, and matching never continues after finding a complete 6256 match. 
It might be easier to follow this explanation by thinking of the 6257 two patterns like this: 6258 6259 /dog(sbody)?/ is the same as /dogsbody|dog/ 6260 /dog(sbody)??/ is the same as /dog|dogsbody/ 6261 6262 The second pattern will never match "dogsbody", because it will always 6263 find the shorter match first. 6264 6265 Example of partial matching using pcre2test 6266 6267 The pcre2test data modifiers partial_hard (or ph) and partial_soft (or 6268 ps) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT, respectively, when 6269 calling pcre2_match(). Here is a run of pcre2test using a pattern that 6270 matches the whole subject in the form of a date: 6271 6272 re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ 6273 data> 25dec3\=ph 6274 Partial match: 25dec3 6275 data> 3ju\=ph 6276 Partial match: 3ju 6277 data> 3juj\=ph 6278 No match 6279 6280 This example gives the same results for both hard and soft partial 6281 matching options. Here is an example where there is a difference: 6282 6283 re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ 6284 data> 25jun04\=ps 6285 0: 25jun04 6286 1: jun 6287 data> 25jun04\=ph 6288 Partial match: 25jun04 6289 6290 With PCRE2_PARTIAL_SOFT, the subject is matched completely. For 6291 PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, 6292 so there is only a partial match. 6293 6294 6295MULTI-SEGMENT MATCHING WITH pcre2_match() 6296 6297 PCRE was not originally designed with multi-segment matching in mind. 6298 However, over time, features (including partial matching) that make 6299 multi-segment matching possible have been added. A very long string can 6300 be searched segment by segment by calling pcre2_match() repeatedly, 6301 with the aim of achieving the same results that would happen if the en- 6302 tire string was available for searching all the time.
Normally, the 6303 strings that are being sought are much shorter than each individual 6304 segment, and are in the middle of very long strings, so the pattern is 6305 normally not anchored. 6306 6307 Special logic must be implemented to handle a matched substring that 6308 spans a segment boundary. PCRE2_PARTIAL_HARD should be used, because it 6309 returns a partial match at the end of a segment whenever there is the 6310 possibility of changing the match by adding more characters. The 6311 PCRE2_NOTBOL option should also be set for all but the first segment. 6312 6313 When a partial match occurs, the next segment must be added to the cur- 6314 rent subject and the match re-run, using the startoffset argument of 6315 pcre2_match() to begin at the point where the partial match started. 6316 For example: 6317 6318 re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/ 6319 data> ...the date is 23ja\=ph 6320 Partial match: 23ja 6321 data> ...the date is 23jan19 and on that day...\=offset=15 6322 0: 23jan19 6323 1: jan 6324 6325 Note the use of the offset modifier to start the new match where the 6326 partial match was found. In this example, the next segment was added to 6327 the one in which the partial match was found. This is the most 6328 straightforward approach, typically using a memory buffer that is twice 6329 the size of each segment. After a partial match, the first half of the 6330 buffer is discarded, the second half is moved to the start of the 6331 buffer, and a new segment is added before repeating the match as in the 6332 example above. After a no match, the entire buffer can be discarded. 6333 6334 If there are memory constraints, you may want to discard text that pre- 6335 cedes a partial match before adding the next segment. Unfortunately, 6336 this is not at present straightforward. In cases such as the above, 6337 where the pattern does not contain any lookbehinds, it is sufficient to 6338 retain only the partially matched substring. 
However, if the pattern 6339 contains a lookbehind assertion, characters that precede the start of 6340 the partial match may have been inspected during the matching process. 6341 When pcre2test displays a partial match, it indicates these characters 6342 with '<' if the allusedtext modifier is set: 6343 6344 re> "(?<=123)abc" 6345 data> xx123ab\=ph,allusedtext 6346 Partial match: 123ab 6347 <<< 6348 6349 However, the allusedtext modifier is not available for JIT matching, 6350 because JIT matching does not record the first (or last) consulted 6351 characters. For this reason, this information is not available via the 6352 API. It is therefore not possible in general to obtain the exact number 6353 of characters that must be retained in order to get the right match re- 6354 sult. If you cannot retain the entire segment, you must find some 6355 heuristic way of choosing. 6356 6357 If you know the approximate length of the matching substrings, you can 6358 use that to decide how much text to retain. The only lookbehind infor- 6359 mation that is currently available via the API is the length of the 6360 longest individual lookbehind in a pattern, but this can be misleading 6361 if there are nested lookbehinds. The value returned by calling 6362 pcre2_pattern_info() with the PCRE2_INFO_MAXLOOKBEHIND option is the 6363 maximum number of characters (not code units) that any individual look- 6364 behind moves back when it is processed. A pattern such as 6365 "(?<=(?<!b)a)" has a maximum lookbehind value of one, but inspects two 6366 characters before its starting point. 6367 6368 In a non-UTF or a 32-bit case, moving back is just a subtraction, but 6369 in UTF-8 or UTF-16 you have to count characters while moving back 6370 through the code units. 6371 6372 6373PARTIAL MATCHING USING pcre2_dfa_match() 6374 6375 The DFA function moves along the subject string character by character, 6376 without backtracking, searching for all possible matches simultane- 6377 ously. 
If the end of the subject is reached before the end of the pat- 6378 tern, there is the possibility of a partial match. 6379 6380 When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if 6381 there have been no complete matches. Otherwise, the complete matches 6382 are returned. If PCRE2_PARTIAL_HARD is set, a partial match takes 6383 precedence over any complete matches. The portion of the string that 6384 was matched when the longest partial match was found is set as the 6385 first matching string. 6386 6387 Because the DFA function always searches for all possible matches, and 6388 there is no difference between greedy and ungreedy repetition, its be- 6389 haviour is different from that of pcre2_match(). Consider the string "dog" 6390 matched against this ungreedy pattern: 6391 6392 /dog(sbody)??/ 6393 6394 Whereas the standard function stops as soon as it finds the complete 6395 match for "dog", the DFA function also finds the partial match for 6396 "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set. 6397 6398 6399MULTI-SEGMENT MATCHING WITH pcre2_dfa_match() 6400 6401 When a partial match has been found using the DFA matching function, it 6402 is possible to continue the match by providing additional subject data 6403 and calling the function again with the same compiled regular expres- 6404 sion, this time setting the PCRE2_DFA_RESTART option. You must pass the 6405 same working space as before, because this is where details of the pre- 6406 vious partial match are stored. You can set the PCRE2_PARTIAL_SOFT or 6407 PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART to continue partial 6408 matching over multiple segments.
Here is an example using pcre2test: 6409 6410 re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ 6411 data> 23ja\=dfa,ps 6412 Partial match: 23ja 6413 data> n05\=dfa,dfa_restart 6414 0: n05 6415 6416 The first call has "23ja" as the subject, and requests partial match- 6417 ing; the second call has "n05" as the subject for the continued 6418 (restarted) match. Notice that when the match is complete, only the 6419 last part is shown; PCRE2 does not retain the previously partially- 6420 matched string. It is up to the calling program to do that if it needs 6421 to. This means that, for an unanchored pattern, if a continued match 6422 fails, it is not possible to try again at a new starting point. All 6423 this facility is capable of doing is continuing with the previous match 6424 attempt. For example, consider this pattern: 6425 6426 1234|3789 6427 6428 If the first part of the subject is "ABC123", a partial match of the 6429 first alternative is found at offset 3. There is no partial match for 6430 the second alternative, because such a match does not start at the same 6431 point in the subject string. Attempting to continue with the string 6432 "7890" does not yield a match because only those alternatives that 6433 match at one point in the subject are remembered. Depending on the ap- 6434 plication, this may or may not be what you want. 6435 6436 If you do want to allow for starting again at the next character, one 6437 way of doing it is to retain some or all of the segment and try a new 6438 complete match, as described for pcre2_match() above. Another possibil- 6439 ity is to work with two buffers. If a partial match at offset n in the 6440 first buffer is followed by "no match" when PCRE2_DFA_RESTART is used 6441 on the second buffer, you can then try a new match starting at offset 6442 n+1 in the first buffer. 6443 6444 6445AUTHOR 6446 6447 Philip Hazel 6448 Retired from University Computing Service 6449 Cambridge, England. 
6450 6451 6452REVISION 6453 6454 Last updated: 04 September 2019 6455 Copyright (c) 1997-2019 University of Cambridge. 6456 6457 6458PCRE2 10.34 04 September 2019 PCRE2PARTIAL(3) 6459------------------------------------------------------------------------------ 6460 6461 6462 6463PCRE2PATTERN(3) Library Functions Manual PCRE2PATTERN(3) 6464 6465 6466NAME 6467 PCRE2 - Perl-compatible regular expressions (revised API) 6468 6469 6470PCRE2 REGULAR EXPRESSION DETAILS 6471 6472 The syntax and semantics of the regular expressions that are supported 6473 by PCRE2 are described in detail below. There is a quick-reference syn- 6474 tax summary in the pcre2syntax page. PCRE2 tries to match Perl syntax 6475 and semantics as closely as it can. PCRE2 also supports some alterna- 6476 tive regular expression syntax (which does not conflict with the Perl 6477 syntax) in order to provide some compatibility with regular expressions 6478 in Python, .NET, and Oniguruma. 6479 6480 Perl's regular expressions are described in its own documentation, and 6481 regular expressions in general are covered in a number of books, some 6482 of which have copious examples. Jeffrey Friedl's "Mastering Regular Ex- 6483 pressions", published by O'Reilly, covers regular expressions in great 6484 detail. This description of PCRE2's regular expressions is intended as 6485 reference material. 6486 6487 This document discusses the regular expression patterns that are sup- 6488 ported by PCRE2 when its main matching function, pcre2_match(), is 6489 used. PCRE2 also has an alternative matching function, 6490 pcre2_dfa_match(), which matches using a different algorithm that is 6491 not Perl-compatible. Some of the features discussed below are not 6492 available when DFA matching is used. The advantages and disadvantages 6493 of the alternative function, and how it differs from the normal func- 6494 tion, are discussed in the pcre2matching page. 
6495 6496 6497SPECIAL START-OF-PATTERN ITEMS 6498 6499 A number of options that can be passed to pcre2_compile() can also be 6500 set by special items at the start of a pattern. These are not Perl-com- 6501 patible, but are provided to make these options accessible to pattern 6502 writers who are not able to change the program that processes the pat- 6503 tern. Any number of these items may appear, but they must all be to- 6504 gether right at the start of the pattern string, and the letters must 6505 be in upper case. 6506 6507 UTF support 6508 6509 In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either 6510 as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 6511 can be specified for the 32-bit library, in which case it constrains 6512 the character values to valid Unicode code points. To process UTF 6513 strings, PCRE2 must be built to include Unicode support (which is the 6514 default). When using UTF strings you must either call the compiling 6515 function with one or both of the PCRE2_UTF or PCRE2_MATCH_INVALID_UTF 6516 options, or the pattern must start with the special sequence (*UTF), 6517 which is equivalent to setting the PCRE2_UTF option. How setting a 6518 UTF mode affects pattern matching is mentioned in several places below. 6519 There is also a summary of features in the pcre2unicode page. 6520 6521 Some applications that allow their users to supply patterns may wish to 6522 restrict them to non-UTF data for security reasons. If the 6523 PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not al- 6524 lowed, and its appearance in a pattern causes an error. 6525 6526 Unicode property support 6527 6528 Another special sequence that may appear at the start of a pattern is 6529 (*UCP).
This has the same effect as setting the PCRE2_UCP option: it 6530 causes sequences such as \d and \w to use Unicode properties to deter- 6531 mine character types, instead of recognizing only characters with codes 6532 less than 256 via a lookup table. It also causes upper/lower casing op- 6533 erations to use Unicode properties for characters with code points 6534 greater than 127, even when UTF is not set. These behaviours can be 6535 changed within the pattern; see the section entitled "Internal Option 6536 Setting" below. 6537 6538 Some applications that allow their users to supply patterns may wish to 6539 restrict them for security reasons. If the PCRE2_NEVER_UCP option is 6540 passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in 6541 a pattern causes an error. 6542 6543 Locking out empty string matching 6544 6545 Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same 6546 effect as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option 6547 to whichever matching function is subsequently called to match the pat- 6548 tern. These options lock out the matching of empty strings, either en- 6549 tirely, or only at the start of the subject. 6550 6551 Disabling auto-possessification 6552 6553 If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as 6554 setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making 6555 quantifiers possessive when what follows cannot match the repeated 6556 item. For example, by default a+b is treated as a++b. For more details, 6557 see the pcre2api documentation. 6558 6559 Disabling start-up optimizations 6560 6561 If a pattern starts with (*NO_START_OPT), it has the same effect as 6562 setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti- 6563 mizations for quickly reaching "no match" results. For more details, 6564 see the pcre2api documentation.
6565 6566 Disabling automatic anchoring 6567 6568 If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect 6569 as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza- 6570 tions that apply to patterns whose top-level branches all start with .* 6571 (match any number of arbitrary characters). For more details, see the 6572 pcre2api documentation. 6573 6574 Disabling JIT compilation 6575 6576 If a pattern that starts with (*NO_JIT) is successfully compiled, an 6577 attempt by the application to apply the JIT optimization by calling 6578 pcre2_jit_compile() is ignored. 6579 6580 Setting match resource limits 6581 6582 The pcre2_match() function contains a counter that is incremented every 6583 time it goes round its main loop. The caller of pcre2_match() can set a 6584 limit on this counter, which therefore limits the amount of computing 6585 resource used for a match. The maximum depth of nested backtracking can 6586 also be limited; this indirectly restricts the amount of heap memory 6587 that is used, but there is also an explicit memory limit that can be 6588 set. 6589 6590 These facilities are provided to catch runaway matches that are pro- 6591 voked by patterns with huge matching trees. A common example is a pat- 6592 tern with nested unlimited repeats applied to a long string that does 6593 not match. When one of these limits is reached, pcre2_match() gives an 6594 error return. The limits can also be set by items at the start of the 6595 pattern of the form 6596 6597 (*LIMIT_HEAP=d) 6598 (*LIMIT_MATCH=d) 6599 (*LIMIT_DEPTH=d) 6600 6601 where d is any number of decimal digits. However, the value of the set- 6602 ting must be less than the value set (or defaulted) by the caller of 6603 pcre2_match() for it to have any effect. In other words, the pattern 6604 writer can lower the limits set by the programmer, but not raise them. 6605 If there is more than one setting of one of these limits, the lower 6606 value is used. 
The heap limit is specified in kibibytes (units of 1024 6607 bytes). 6608 6609 Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This 6610 name is still recognized for backwards compatibility. 6611 6612 The heap limit applies only when the pcre2_match() or pcre2_dfa_match() 6613 interpreters are used for matching. It does not apply to JIT. The match 6614 limit is used (but in a different way) when JIT is being used, or when 6615 pcre2_dfa_match() is called, to limit computing resource usage by those 6616 matching functions. The depth limit is ignored by JIT but is relevant 6617 for DFA matching, which uses function recursion for recursions within 6618 the pattern and for lookaround assertions and atomic groups. In this 6619 case, the depth limit controls the depth of such recursion. 6620 6621 Newline conventions 6622 6623 PCRE2 supports six different conventions for indicating line breaks in 6624 strings: a single CR (carriage return) character, a single LF (line- 6625 feed) character, the two-character sequence CRLF, any of the three pre- 6626 ceding, any Unicode newline sequence, or the NUL character (binary 6627 zero). The pcre2api page has further discussion about newlines, and 6628 shows how to set the newline convention when calling pcre2_compile(). 6629 6630 It is also possible to specify a newline convention by starting a pat- 6631 tern string with one of the following sequences: 6632 6633 (*CR) carriage return 6634 (*LF) linefeed 6635 (*CRLF) carriage return, followed by linefeed 6636 (*ANYCRLF) any of the three above 6637 (*ANY) all Unicode newline sequences 6638 (*NUL) the NUL character (binary zero) 6639 6640 These override the default and the options given to the compiling func- 6641 tion. For example, on a Unix system where LF is the default newline se- 6642 quence, the pattern 6643 6644 (*CR)a.b 6645 6646 changes the convention to CR. That pattern matches "a\nb" because LF is 6647 no longer a newline. 
If more than one of these settings is present, the 6648 last one is used. 6649 6650 The newline convention affects where the circumflex and dollar asser- 6651 tions are true. It also affects the interpretation of the dot metachar- 6652 acter when PCRE2_DOTALL is not set, and the behaviour of \N when not 6653 followed by an opening brace. However, it does not affect what the \R 6654 escape sequence matches. By default, this is any Unicode newline se- 6655 quence, for Perl compatibility. However, this can be changed; see the 6656 next section and the description of \R in the section entitled "Newline 6657 sequences" below. A change of \R setting can be combined with a change 6658 of newline convention. 6659 6660 Specifying what \R matches 6661 6662 It is possible to restrict \R to match only CR, LF, or CRLF (instead of 6663 the complete set of Unicode line endings) by setting the option 6664 PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by 6665 starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI- 6666 CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE. 6667 6668 6669EBCDIC CHARACTER CODES 6670 6671 PCRE2 can be compiled to run in an environment that uses EBCDIC as its 6672 character code instead of ASCII or Unicode (typically a mainframe sys- 6673 tem). In the sections below, character code values are ASCII or Uni- 6674 code; in an EBCDIC environment these characters may have different code 6675 values, and there are no code points greater than 255. 6676 6677 6678CHARACTERS AND METACHARACTERS 6679 6680 A regular expression is a pattern that is matched against a subject 6681 string from left to right. Most characters stand for themselves in a 6682 pattern, and match the corresponding characters in the subject. As a 6683 trivial example, the pattern 6684 6685 The quick brown fox 6686 6687 matches a portion of a subject string that is identical to itself. 
When 6688 caseless matching is specified (the PCRE2_CASELESS option or (?i) 6689 within the pattern), letters are matched independently of case. Note 6690 that there are two ASCII characters, K and S, that, in addition to 6691 their lower case ASCII equivalents, are case-equivalent with Unicode 6692 U+212A (Kelvin sign) and U+017F (long S) respectively when either 6693 PCRE2_UTF or PCRE2_UCP is set, unless the PCRE2_EXTRA_CASELESS_RESTRICT 6694 option is in force (either passed to pcre2_compile() or set by (?r) 6695 within the pattern). 6696 6697 The power of regular expressions comes from the ability to include wild 6698 cards, character classes, alternatives, and repetitions in the pattern. 6699 These are encoded in the pattern by the use of metacharacters, which do 6700 not stand for themselves but instead are interpreted in some special 6701 way. 6702 6703 There are two different sets of metacharacters: those that are recog- 6704 nized anywhere in the pattern except within square brackets, and those 6705 that are recognized within square brackets. Outside square brackets, 6706 the metacharacters are as follows: 6707 6708 \ general escape character with several uses 6709 ^ assert start of string (or line, in multiline mode) 6710 $ assert end of string (or line, in multiline mode) 6711 . match any character except newline (by default) 6712 [ start character class definition 6713 | start of alternative branch 6714 ( start group or control verb 6715 ) end group or control verb 6716 * 0 or more quantifier 6717 + 1 or more quantifier; also "possessive quantifier" 6718 ? 0 or 1 quantifier; also quantifier minimizer 6719 { potential start of min/max quantifier 6720 6721 Brace characters { and } are also used to enclose data for construc- 6722 tions such as \g{2} or \k{name}. In almost all uses of braces, space 6723 and/or horizontal tab characters that follow { or precede } are allowed 6724 and are ignored. 
In the case of quantifiers, they may also appear be- 6725 fore or after the comma. The exception to this is \u{...} which is an 6726 ECMAScript compatibility feature that is recognized only when the 6727 PCRE2_EXTRA_ALT_BSUX option is set. ECMAScript does not ignore such 6728 white space; it causes the item to be interpreted as literal. 6729 6730 Part of a pattern that is in square brackets is called a "character 6731 class". In a character class the only metacharacters are: 6732 6733 \ general escape character 6734 ^ negate the class, but only if it is the first character 6735 - indicates character range 6736 [ POSIX character class (if followed by POSIX syntax) 6737 ] terminates the character class 6738 6739 If a pattern is compiled with the PCRE2_EXTENDED option, most white 6740 space in the pattern is ignored, except in a character class or within a \Q...\E 6741 sequence, and so are the characters between a # outside a character class and the next new- 6742 line, inclusive. An escaping backslash can be used to in- 6743 clude a white space or a # character as part of the pattern. If the 6744 PCRE2_EXTENDED_MORE option is set, the same applies, but in addition 6745 unescaped space and horizontal tab characters are ignored inside a 6746 character class. Note: only these two characters are ignored, not the 6747 full set of pattern white space characters that are ignored outside a 6748 character class. Option settings can be changed within a pattern; see 6749 the section entitled "Internal Option Setting" below. 6750 6751 The following sections describe the use of each of the metacharacters. 6752 6753 6754BACKSLASH 6755 6756 The backslash character has several uses. Firstly, if it is followed by 6757 a character that is not a digit or a letter, it takes away any special 6758 meaning that character may have. This use of backslash as an escape 6759 character applies both inside and outside character classes.
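The escaping rule in the paragraph above can be sketched in C as follows. This is only an illustration, assuming an ASCII environment and a NUL-terminated input: the function name quote_literal is hypothetical and is not part of the PCRE2 API. It puts a backslash before every ASCII character that is not a letter or a digit, and copies everything else (including bytes above 127) unchanged.

```c
#include <stddef.h>
#include <ctype.h>

/* Hypothetical helper, not part of PCRE2. Writes to `out` a copy of the
   NUL-terminated string `in` with a backslash inserted before every ASCII
   character that is not a letter or a digit. Returns the number of
   characters written (excluding the terminating NUL), or (size_t)-1 if
   `out` (of size `outsize`) is too small. */
size_t quote_literal(const char *in, char *out, size_t outsize)
{
    size_t n = 0;

    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;
        int needs_escape = c < 128 && !isalnum(c);
        size_t needed = needs_escape ? 2 : 1;

        /* Always leave room for the terminating NUL. */
        if (n + needed >= outsize)
            return (size_t)-1;
        if (needs_escape)
            out[n++] = '\\';
        out[n++] = (char)c;
    }
    if (n >= outsize)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

With a sufficiently large buffer, the input a*b would be rewritten as a\*b, which a pattern compiler then treats as three literal characters. Letters and digits are deliberately left alone, because a backslash in front of them can introduce an escape sequence with a different meaning.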
6760 6761 For example, if you want to match a * character, you must write \* in 6762 the pattern. This escaping action applies whether or not the following 6763 character would otherwise be interpreted as a metacharacter, so it is 6764 always safe to precede a non-alphanumeric with backslash to specify 6765 that it stands for itself. In particular, if you want to match a back- 6766 slash, you write \\. 6767 6768 Only ASCII digits and letters have any special meaning after a back- 6769 slash. All other characters (in particular, those whose code points are 6770 greater than 127) are treated as literals. 6771 6772 If you want to treat all characters in a sequence as literals, you can 6773 do so by putting them between \Q and \E. Note that this includes white 6774 space even when the PCRE2_EXTENDED option is set so that most other 6775 white space is ignored. The behaviour is different from Perl in that $ 6776 and @ are handled as literals in \Q...\E sequences in PCRE2, whereas in 6777 Perl, $ and @ cause variable interpolation. Also, Perl does "double- 6778 quotish backslash interpolation" on any backslashes between \Q and \E 6779 which, its documentation says, "may lead to confusing results". PCRE2 6780 treats a backslash between \Q and \E just like any other character. 6781 Note the following examples: 6782 6783 Pattern PCRE2 matches Perl matches 6784 6785 \Qabc$xyz\E abc$xyz abc followed by the 6786 contents of $xyz 6787 \Qabc\$xyz\E abc\$xyz abc\$xyz 6788 \Qabc\E\$\Qxyz\E abc$xyz abc$xyz 6789 \QA\B\E A\B A\B 6790 \Q\\E \ \\E 6791 6792 The \Q...\E sequence is recognized both inside and outside character 6793 classes. An isolated \E that is not preceded by \Q is ignored. If \Q 6794 is not followed by \E later in the pattern, the literal interpretation 6795 continues to the end of the pattern (that is, \E is assumed at the 6796 end). 
If the isolated \Q is inside a character class, this causes an 6797 error, because the character class is then not terminated by a closing 6798 square bracket. 6799 6800 Non-printing characters 6801 6802 A second use of backslash provides a way of encoding non-printing char- 6803 acters in patterns in a visible manner. There is no restriction on the 6804 appearance of non-printing characters in a pattern, but when a pattern 6805 is being prepared by text editing, it is often easier to use one of the 6806 following escape sequences instead of the binary character it repre- 6807 sents. In an ASCII or Unicode environment, these escapes are as fol- 6808 lows: 6809 6810 \a alarm, that is, the BEL character (hex 07) 6811 \cx "control-x", where x is a non-control ASCII character 6812 \e escape (hex 1B) 6813 \f form feed (hex 0C) 6814 \n linefeed (hex 0A) 6815 \r carriage return (hex 0D) (but see below) 6816 \t tab (hex 09) 6817 \0dd character with octal code 0dd 6818 \ddd character with octal code ddd, or backreference 6819 \o{ddd..} character with octal code ddd.. 6820 \xhh character with hex code hh 6821 \x{hhh..} character with hex code hhh.. 6822 \N{U+hhh..} character with Unicode hex code point hhh.. 6823 6824 By default, after \x that is not followed by {, from zero to two hexa- 6825 decimal digits are read (letters can be in upper or lower case). Any 6826 number of hexadecimal digits may appear between \x{ and }. If a charac- 6827 ter other than a hexadecimal digit appears between \x{ and }, or if 6828 there is no terminating }, an error occurs. 6829 6830 Characters whose code points are less than 256 can be defined by either 6831 of the two syntaxes for \x or by an octal sequence. There is no differ- 6832 ence in the way they are handled. For example, \xdc is exactly the same 6833 as \x{dc} or \334. However, using the braced versions does make such 6834 sequences easier to read. 
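As a rough illustration of the default rule for \x without braces (zero, one, or two hexadecimal digits, in either case), the following C sketch reads the digits that follow an unbraced \x. It is not PCRE2's own parser; the function name parse_x_unbraced is hypothetical, and the braced form \x{hhh..} and the ECMAScript compatibility options are deliberately left out.

```c
#include <ctype.h>

/* Illustrative only; this is not PCRE2 source code. `s` points at the text
   immediately after an unbraced "\x". Reads zero, one, or two hexadecimal
   digits (upper or lower case), stores the resulting value in *value, and
   returns the number of digits consumed. */
int parse_x_unbraced(const char *s, unsigned int *value)
{
    int n = 0;

    *value = 0;
    while (n < 2 && isxdigit((unsigned char)s[n])) {
        int c = tolower((unsigned char)s[n]);
        unsigned int digit = isdigit(c) ? (unsigned int)(c - '0')
                                        : (unsigned int)(c - 'a' + 10);
        *value = *value * 16 + digit;
        n++;
    }
    return n;
}
```

Given the text "dc", the sketch consumes two digits and yields 220, which is the same value as hexadecimal dc and octal 334, in line with the statement that \xdc, \x{dc}, and \334 are handled identically. When no hexadecimal digit follows, no digits are consumed and the value is zero.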
6835 6836 Support is available for some ECMAScript (aka JavaScript) escape se- 6837 quences via two compile-time options. If PCRE2_ALT_BSUX is set, the se- 6838 quence \x followed by { is not recognized. Only if \x is followed by 6839 two hexadecimal digits is it recognized as a character escape. Other- 6840 wise it is interpreted as a literal "x" character. In this mode, sup- 6841 port for code points greater than 256 is provided by \u, which must be 6842 followed by four hexadecimal digits; otherwise it is interpreted as a 6843 literal "u" character. 6844 6845 PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in ad- 6846 dition, \u{hhh..} is recognized as the character specified by hexadeci- 6847 mal code point. There may be any number of hexadecimal digits, but un- 6848 like other places that also use curly brackets, spaces are not allowed 6849 and would result in the string being interpreted as a literal. This 6850 syntax is from ECMAScript 6. 6851 6852 The \N{U+hhh..} escape sequence is recognized only when PCRE2 is oper- 6853 ating in UTF mode. Perl also uses \N{name} to specify characters by 6854 Unicode name; PCRE2 does not support this. Note that when \N is not 6855 followed by an opening brace (curly bracket) it has an entirely differ- 6856 ent meaning, matching any character that is not a newline. 6857 6858 There are some legacy applications where the escape sequence \r is ex- 6859 pected to match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option 6860 is set, \r in a pattern is converted to \n so that it matches a LF 6861 (linefeed) instead of a CR (carriage return) character. 6862 6863 An error occurs if \c is not followed by a character whose ASCII code 6864 point is in the range 32 to 126. The precise effect of \cx is as fol- 6865 lows: if x is a lower case letter, it is converted to upper case. Then 6866 bit 6 of the character (hex 40) is inverted. 
Thus \cA to \cZ become hex 6867 01 to hex 1A (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and 6868 \c; becomes hex 7B (; is 3B). If the code unit following \c has a code 6869 point less than 32 or greater than 126, a compile-time error occurs. 6870 6871 When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported. 6872 \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values. 6873 The \c escape is processed as specified for Perl in the perlebcdic doc- 6874 ument. The only characters that are allowed after \c are A-Z, a-z, or 6875 one of @, [, \, ], ^, _, or ?. Any other character provokes a compile- 6876 time error. The sequence \c@ encodes character code 0; after \c the 6877 letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [, 6878 \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c? be- 6879 comes either 255 (hex FF) or 95 (hex 5F). 6880 6881 Thus, apart from \c?, these escapes generate the same character code 6882 values as they do in an ASCII environment, though the meanings of the 6883 values mostly differ. For example, \cG always generates code value 7, 6884 which is BEL in ASCII but DEL in EBCDIC. 6885 6886 The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, 6887 but because 127 is not a control character in EBCDIC, Perl makes it 6888 generate the APC character. Unfortunately, there are several variants 6889 of EBCDIC. In most of them the APC character has the value 255 (hex 6890 FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If 6891 certain other characters have POSIX-BC values, PCRE2 makes \c? generate 6892 95; otherwise it generates 255. 6893 6894 After \0 up to two further octal digits are read. If there are fewer 6895 than two digits, just those that are present are used. Thus the se- 6896 quence \0\x\015 specifies two binary zeros followed by a CR character 6897 (code value 13). 
Make sure you supply two digits after the initial zero 6898 if the pattern character that follows is itself an octal digit. 6899 6900 The escape \o must be followed by a sequence of octal digits, enclosed 6901 in braces. An error occurs if this is not the case. This escape is a 6902 recent addition to Perl; it provides a way of specifying character code 6903 points as octal numbers greater than 0777, and it also allows octal 6904 numbers and backreferences to be unambiguously specified. 6905 6906 For greater clarity and unambiguity, it is best to avoid following \ by 6907 a digit greater than zero. Instead, use \o{...} or \x{...} to specify 6908 numerical character code points, and \g{...} to specify backreferences. 6909 The following paragraphs describe the old, ambiguous syntax. 6910 6911 The handling of a backslash followed by a digit other than 0 is compli- 6912 cated, and Perl has changed over time, causing PCRE2 also to change. 6913 6914 Outside a character class, PCRE2 reads the digit and any following dig- 6915 its as a decimal number. If the number is less than 10, begins with the 6916 digit 8 or 9, or if there are at least that many previous capture 6917 groups in the expression, the entire sequence is taken as a backrefer- 6918 ence. A description of how this works is given later, following the 6919 discussion of parenthesized groups. Otherwise, up to three octal dig- 6920 its are read to form a character code. 6921 6922 Inside a character class, PCRE2 handles \8 and \9 as the literal char- 6923 acters "8" and "9", and otherwise reads up to three octal digits fol- 6924 lowing the backslash, using them to generate a data character. Any sub- 6925 sequent digits stand for themselves.
For example, outside a character 6926 class: 6927 6928 \040 is another way of writing an ASCII space 6929 \40 is the same, provided there are fewer than 40 6930 previous capture groups 6931 \7 is always a backreference 6932 \11 might be a backreference, or another way of 6933 writing a tab 6934 \011 is always a tab 6935 \0113 is a tab followed by the character "3" 6936 \113 might be a backreference, otherwise the 6937 character with octal code 113 6938 \377 might be a backreference, otherwise 6939 the value 255 (decimal) 6940 \81 is always a backreference 6941 6942 Note that octal values of 100 or greater that are specified using this 6943 syntax must not be introduced by a leading zero, because no more than 6944 three octal digits are ever read. 6945 6946 Constraints on character values 6947 6948 Characters that are specified using octal or hexadecimal numbers are 6949 limited to certain values, as follows: 6950 6951 8-bit non-UTF mode no greater than 0xff 6952 16-bit non-UTF mode no greater than 0xffff 6953 32-bit non-UTF mode no greater than 0xffffffff 6954 All UTF modes no greater than 0x10ffff and a valid code point 6955 6956 Invalid Unicode code points are all those in the range 0xd800 to 0xdfff 6957 (the so-called "surrogate" code points). The check for these can be 6958 disabled by the caller of pcre2_compile() by setting the option 6959 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in 6960 UTF-8 and UTF-32 modes, because these values are not representable in 6961 UTF-16. 6962 6963 Escape sequences in character classes 6964 6965 All the sequences that define a single character value can be used both 6966 inside and outside character classes. In addition, inside a character 6967 class, \b is interpreted as the backspace character (hex 08). 6968 6969 When not followed by an opening brace, \N is not allowed in a character 6970 class. \B, \R, and \X are not special inside a character class. 
Like 6971 other unrecognized alphabetic escape sequences, they cause an error. 6972 Outside a character class, these sequences have different meanings. 6973 6974 Unsupported escape sequences 6975 6976 In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its 6977 string handler and used to modify the case of following characters. By 6978 default, PCRE2 does not support these escape sequences in patterns. 6979 However, if either of the PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX op- 6980 tions is set, \U matches a "U" character, and \u can be used to define 6981 a character by code point, as described above. 6982 6983 Absolute and relative backreferences 6984 6985 The sequence \g followed by a signed or unsigned number, optionally en- 6986 closed in braces, is an absolute or relative backreference. A named 6987 backreference can be coded as \g{name}. Backreferences are discussed 6988 later, following the discussion of parenthesized groups. 6989 6990 Absolute and relative subroutine calls 6991 6992 For compatibility with Oniguruma, the non-Perl syntax \g followed by a 6993 name or a number enclosed either in angle brackets or single quotes, is 6994 an alternative syntax for referencing a capture group as a subroutine. 6995 Details are discussed later. Note that \g{...} (Perl syntax) and 6996 \g<...> (Oniguruma syntax) are not synonymous. The former is a backref- 6997 erence; the latter is a subroutine call. 
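The old, ambiguous rule for a backslash followed by a non-zero digit outside a character class, described earlier in this section, can be sketched in Python as follows. This is an illustration of the stated rule only, not PCRE2's implementation; the function name classify_backslash_digits and its return format are hypothetical.

```python
def classify_backslash_digits(digits, capture_groups):
    """Interpret the decimal digits that follow a backslash.

    digits: the run of digits after the backslash (first digit is 1-9).
    capture_groups: number of capture groups earlier in the pattern.
    Returns ("backreference", n) or ("octal", code_point, literal_digits).
    """
    number = int(digits)
    if number < 10 or digits[0] in "89" or number <= capture_groups:
        return ("backreference", number)
    # Otherwise up to three octal digits form a character code, and any
    # digits that remain stand for themselves.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    return ("octal", int(octal, 8), digits[len(octal):])
```

For example, classify_backslash_digits("40", 0) gives ("octal", 32, ""), an ASCII space, whereas with 40 or more earlier capture groups the same digits are a backreference.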
6998 6999 Generic character types 7000 7001 Another use of backslash is for specifying generic character types: 7002 7003 \d any decimal digit 7004 \D any character that is not a decimal digit 7005 \h any horizontal white space character 7006 \H any character that is not a horizontal white space character 7007 \N any character that is not a newline 7008 \s any white space character 7009 \S any character that is not a white space character 7010 \v any vertical white space character 7011 \V any character that is not a vertical white space character 7012 \w any "word" character 7013 \W any "non-word" character 7014 7015 The \N escape sequence has the same meaning as the "." metacharacter 7016 when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change 7017 the meaning of \N. Note that when \N is followed by an opening brace it 7018 has a different meaning. See the section entitled "Non-printing charac- 7019 ters" above for details. Perl also uses \N{name} to specify characters 7020 by Unicode name; PCRE2 does not support this. 7021 7022 Each pair of lower and upper case escape sequences partitions the com- 7023 plete set of characters into two disjoint sets. Any given character 7024 matches one, and only one, of each pair. The sequences can appear both 7025 inside and outside character classes. They each match one character of 7026 the appropriate type. If the current matching point is at the end of 7027 the subject string, all of them fail, because there is no character to 7028 match. 7029 7030 The default \s characters are HT (9), LF (10), VT (11), FF (12), CR 7031 (13), and space (32), which are defined as white space in the "C" lo- 7032 cale. This list may vary if locale-specific matching is taking place. 7033 For example, in some locales the "non-breaking space" character (\xA0) 7034 is recognized as white space, and in others the VT character is not. 7035 7036 A "word" character is an underscore or any character that is a letter 7037 or digit. 
By default, the definition of letters and digits is con- 7038 trolled by PCRE2's low-valued character tables, and may vary if locale- 7039 specific matching is taking place (see "Locale support" in the pcre2api 7040 page). For example, in a French locale such as "fr_FR" in Unix-like 7041 systems, or "french" in Windows, some character codes greater than 127 7042 are used for accented letters, and these are then matched by \w. The 7043 use of locales with Unicode is discouraged. 7044 7045 By default, characters whose code points are greater than 127 never 7046 match \d, \s, or \w, and always match \D, \S, and \W, although this may 7047 be different for characters in the range 128-255 when locale-specific 7048 matching is happening. These escape sequences retain their original 7049 meanings from before Unicode support was available, mainly for effi- 7050 ciency reasons. If the PCRE2_UCP option is set, the behaviour is 7051 changed so that Unicode properties are used to determine character 7052 types, as follows: 7053 7054 \d any character that matches \p{Nd} (decimal digit) 7055 \s any character that matches \p{Z} or \h or \v 7056 \w any character that matches \p{L}, \p{N}, \p{Mn}, or \p{Pc} 7057 7058 The addition of \p{Mn} (non-spacing mark) and the replacement of an ex- 7059 plicit test for underscore with a test for \p{Pc} (connector punctua- 7060 tion) happened in PCRE2 release 10.43. This brings PCRE2 into line with 7061 Perl. 7062 7063 The upper case escapes match the inverse sets of characters. Note that 7064 \d matches only decimal digits, whereas \w matches any Unicode digit, 7065 as well as other character categories. Note also that PCRE2_UCP affects 7066 \b and \B because they are defined in terms of \w and \W. Matching 7067 these sequences is noticeably slower when PCRE2_UCP is set.
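The difference between the default and PCRE2_UCP meanings of \d and \w can be illustrated with Python's unicodedata module. This is a sketch under stated assumptions: the function names are hypothetical, the default case assumes the "C" locale tables, and Python's Unicode data may be a different Unicode version from the one a given PCRE2 build uses. The UCP form of \s is omitted because it also depends on the \h and \v lists.

```python
import unicodedata

# Default \s set from the text: HT, LF, VT, FF, CR and space.
DEFAULT_SPACE = {9, 10, 11, 12, 13, 32}


def is_space_default(ch):
    """Default \\s (no locale-specific tables, no UCP)."""
    return ord(ch) in DEFAULT_SPACE


def is_digit(ch, ucp=False):
    """\\d: ASCII 0-9 by default, \\p{Nd} when ucp is True."""
    if ucp:
        return unicodedata.category(ch) == "Nd"
    return "0" <= ch <= "9"


def is_word(ch, ucp=False):
    """\\w: ASCII letter, digit or underscore by default;
    \\p{L}, \\p{N}, \\p{Mn} or \\p{Pc} when ucp is True."""
    if ucp:
        cat = unicodedata.category(ch)
        return cat[0] in "LN" or cat in ("Mn", "Pc")
    return ch == "_" or (ch.isascii() and ch.isalnum())
```

In this sketch U+00B2 (superscript two, category No) is a word character under UCP but not a digit, which is the distinction between \d and \w noted above.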
7068 7069 The effect of PCRE2_UCP on any one of these escape sequences can be 7070 negated by the options PCRE2_EXTRA_ASCII_BSD, PCRE2_EXTRA_ASCII_BSS, 7071 and PCRE2_EXTRA_ASCII_BSW, respectively. These options can be set and 7072 reset within a pattern by means of an internal option setting (see be- 7073 low). 7074 7075 The sequences \h, \H, \v, and \V, in contrast to the other sequences, 7076 which match only ASCII characters by default, always match a specific 7077 list of code points, whether or not PCRE2_UCP is set. The horizontal 7078 space characters are: 7079 7080 U+0009 Horizontal tab (HT) 7081 U+0020 Space 7082 U+00A0 Non-break space 7083 U+1680 Ogham space mark 7084 U+180E Mongolian vowel separator 7085 U+2000 En quad 7086 U+2001 Em quad 7087 U+2002 En space 7088 U+2003 Em space 7089 U+2004 Three-per-em space 7090 U+2005 Four-per-em space 7091 U+2006 Six-per-em space 7092 U+2007 Figure space 7093 U+2008 Punctuation space 7094 U+2009 Thin space 7095 U+200A Hair space 7096 U+202F Narrow no-break space 7097 U+205F Medium mathematical space 7098 U+3000 Ideographic space 7099 7100 The vertical space characters are: 7101 7102 U+000A Linefeed (LF) 7103 U+000B Vertical tab (VT) 7104 U+000C Form feed (FF) 7105 U+000D Carriage return (CR) 7106 U+0085 Next line (NEL) 7107 U+2028 Line separator 7108 U+2029 Paragraph separator 7109 7110 In 8-bit, non-UTF-8 mode, only the characters with code points less 7111 than 256 are relevant. 7112 7113 Newline sequences 7114 7115 Outside a character class, by default, the escape sequence \R matches 7116 any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent 7117 to the following: 7118 7119 (?>\r\n|\n|\x0b|\f|\r|\x85) 7120 7121 This is an example of an "atomic group", details of which are given be- 7122 low. 
This particular group matches either the two-character sequence 7123 CR followed by LF, or one of the single characters LF (linefeed, 7124 U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car- 7125 riage return, U+000D), or NEL (next line, U+0085). Because this is an 7126 atomic group, the two-character sequence is treated as a single unit 7127 that cannot be split. 7128 7129 In other modes, two additional characters whose code points are greater 7130 than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- 7131 rator, U+2029). Unicode support is not needed for these characters to 7132 be recognized. 7133 7134 It is possible to restrict \R to match only CR, LF, or CRLF (instead of 7135 the complete set of Unicode line endings) by setting the option 7136 PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for "back- 7137 slash R".) This can be made the default when PCRE2 is built; if this is 7138 the case, the other behaviour can be requested via the PCRE2_BSR_UNI- 7139 CODE option. It is also possible to specify these settings by starting 7140 a pattern string with one of the following sequences: 7141 7142 (*BSR_ANYCRLF) CR, LF, or CRLF only 7143 (*BSR_UNICODE) any Unicode newline sequence 7144 7145 These override the default and the options given to the compiling func- 7146 tion. Note that these special settings, which are not Perl-compatible, 7147 are recognized only at the very start of a pattern, and that they must 7148 be in upper case. If more than one of them is present, the last one is 7149 used. They can be combined with a change of newline convention; for ex- 7150 ample, a pattern can start with: 7151 7152 (*ANY)(*BSR_ANYCRLF) 7153 7154 They can also be combined with the (*UTF) or (*UCP) special sequences. 7155 Inside a character class, \R is treated as an unrecognized escape se- 7156 quence, and causes an error. 
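The behaviour of \R described above can be sketched as a small Python function that reports how many characters \R would consume at a given position. The function name newline_length and the anycrlf parameter (standing in for PCRE2_BSR_ANYCRLF) are hypothetical, and the sketch assumes a mode other than 8-bit non-UTF, so LS and PS are included.

```python
# Single characters that \R accepts when any Unicode newline is allowed.
UNICODE_NEWLINES = {"\n", "\x0b", "\f", "\r", "\x85", "\u2028", "\u2029"}


def newline_length(subject, pos, anycrlf=False):
    """Return the length of the \\R match at subject[pos], or 0 if none."""
    if subject.startswith("\r\n", pos):
        # CRLF is tried first and taken as one unit that is never split.
        return 2
    if pos >= len(subject):
        return 0
    singles = {"\r", "\n"} if anycrlf else UNICODE_NEWLINES
    return 1 if subject[pos] in singles else 0
```

For example, newline_length("a\r\nb", 1) is 2, while newline_length("a\x85b", 1, anycrlf=True) is 0 because NEL is excluded when \R is restricted to CR, LF, or CRLF.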
7157 7158 Unicode character properties 7159 7160 When PCRE2 is built with Unicode support (the default), three addi- 7161 tional escape sequences that match characters with specific properties 7162 are available. They can be used in any mode, though in 8-bit and 16-bit 7163 non-UTF modes these sequences are of course limited to testing charac- 7164 ters whose code points are less than U+0100 and U+10000, respectively. 7165 In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode 7166 limit) may be encountered. These are all treated as being in the Un- 7167 known script and with an unassigned type. 7168 7169 Matching characters by Unicode property is not fast, because PCRE2 has 7170 to do a multistage table lookup in order to find a character's prop- 7171 erty. That is why the traditional escape sequences such as \d and \w do 7172 not use Unicode properties in PCRE2 by default, though you can make 7173 them do so by setting the PCRE2_UCP option or by starting the pattern 7174 with (*UCP). 7175 7176 The extra escape sequences that provide property support are: 7177 7178 \p{xx} a character with the xx property 7179 \P{xx} a character without the xx property 7180 \X a Unicode extended grapheme cluster 7181 7182 The property names represented by xx above are not case-sensitive, and 7183 in accordance with Unicode's "loose matching" rules, spaces, hyphens, 7184 and underscores are ignored. There is support for Unicode script names, 7185 Unicode general category properties, "Any", which matches any character 7186 (including newline), Bidi_Class, a number of binary (yes/no) proper- 7187 ties, and some special PCRE2 properties (described below). Certain 7188 other Perl properties such as "InMusicalSymbols" are not supported by 7189 PCRE2. Note that \P{Any} does not match any characters, so always 7190 causes a match failure. 7191 7192 Script properties for \p and \P 7193 7194 There are three different syntax forms for matching a script. 
Each Uni- 7195 code character has a basic script and, optionally, a list of other 7196 scripts ("Script Extensions") with which it is commonly used. Using the 7197 Adlam script as an example, \p{sc:Adlam} matches characters whose basic 7198 script is Adlam, whereas \p{scx:Adlam} matches, in addition, characters 7199 that have Adlam in their extensions list. The full names "script" and 7200 "script extensions" for the property types are recognized, and an equals 7201 sign is an alternative to the colon. If a script name is given without 7202 a property type, for example, \p{Adlam}, it is treated as \p{scx:Ad- 7203 lam}. Perl changed to this interpretation at release 5.26 and PCRE2 7204 changed at release 10.40. 7205 7206 Unassigned characters (and in non-UTF 32-bit mode, characters with code 7207 points greater than 0x10FFFF) are assigned the "Unknown" script. Others 7208 that are not part of an identified script are lumped together as "Com- 7209 mon". The current list of recognized script names and their 4-character 7210 abbreviations can be obtained by running this command: 7211 7212 pcre2test -LS 7213 7214 7215 The general category property for \p and \P 7216 7217 Each character has exactly one Unicode general category property, spec- 7218 ified by a two-letter abbreviation. For compatibility with Perl, nega- 7219 tion can be specified by including a circumflex between the opening 7220 brace and the property name. For example, \p{^Lu} is the same as 7221 \P{Lu}. 7222 7223 If only one letter is specified with \p or \P, it includes all the gen- 7224 eral category properties that start with that letter.
In this case, in 7225 the absence of negation, the curly brackets in the escape sequence are 7226 optional; these two examples have the same effect: 7227 7228 \p{L} 7229 \pL 7230 7231 The following general category property codes are supported: 7232 7233 C Other 7234 Cc Control 7235 Cf Format 7236 Cn Unassigned 7237 Co Private use 7238 Cs Surrogate 7239 7240 L Letter 7241 Ll Lower case letter 7242 Lm Modifier letter 7243 Lo Other letter 7244 Lt Title case letter 7245 Lu Upper case letter 7246 7247 M Mark 7248 Mc Spacing mark 7249 Me Enclosing mark 7250 Mn Non-spacing mark 7251 7252 N Number 7253 Nd Decimal number 7254 Nl Letter number 7255 No Other number 7256 7257 P Punctuation 7258 Pc Connector punctuation 7259 Pd Dash punctuation 7260 Pe Close punctuation 7261 Pf Final punctuation 7262 Pi Initial punctuation 7263 Po Other punctuation 7264 Ps Open punctuation 7265 7266 S Symbol 7267 Sc Currency symbol 7268 Sk Modifier symbol 7269 Sm Mathematical symbol 7270 So Other symbol 7271 7272 Z Separator 7273 Zl Line separator 7274 Zp Paragraph separator 7275 Zs Space separator 7276 7277 The special property LC, which has the synonym L&, is also supported: 7278 it matches a character that has the Lu, Ll, or Lt property, in other 7279 words, a letter that is not classified as a modifier or "other". 7280 7281 The Cs (Surrogate) property applies only to characters whose code 7282 points are in the range U+D800 to U+DFFF. These characters are no dif- 7283 ferent to any other character when PCRE2 is not in UTF mode (using the 7284 16-bit or 32-bit library). However, they are not valid in Unicode 7285 strings and so cannot be tested by PCRE2 in UTF mode, unless UTF valid- 7286 ity checking has been turned off (see the discussion of 7287 PCRE2_NO_UTF_CHECK in the pcre2api page). 7288 7289 The long synonyms for property names that Perl supports (such as 7290 \p{Letter}) are not supported by PCRE2, nor is it permitted to prefix 7291 any of these properties with "Is". 
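The general category rules above (two-letter codes, one-letter groups, the LC or L& synonym, and circumflex negation) map directly onto the category codes returned by Python's unicodedata module. The following is a hedged sketch, not PCRE2 code: the function name matches_category is hypothetical, the property name is compared case-sensitively for brevity, and Python's Unicode tables may differ in version from those of a given PCRE2 build.

```python
import unicodedata


def matches_category(ch, prop):
    """Emulate \\p{prop} for general category codes, e.g. "Lu", "L", "^Lu"."""
    negate = prop.startswith("^")
    name = prop[1:] if negate else prop
    cat = unicodedata.category(ch)  # two-letter code such as "Lu" or "Nd"
    if name in ("LC", "L&"):
        # A cased letter: upper case, lower case or title case.
        hit = cat in ("Lu", "Ll", "Lt")
    elif len(name) == 1:
        # One letter covers every category that starts with it.
        hit = cat[0] == name
    else:
        hit = cat == name
    return hit != negate
```

For example, U+02B0 (a modifier letter, Lm) satisfies matches_category(ch, "L") but not matches_category(ch, "LC").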
7292 7293 No character that is in the Unicode table has the Cn (unassigned) prop- 7294 erty. Instead, this property is assumed for any code point that is not 7295 in the Unicode table. 7296 7297 Specifying caseless matching does not affect these escape sequences. 7298 For example, \p{Lu} always matches only upper case letters. This is 7299 different from the behaviour of current versions of Perl. 7300 7301 Binary (yes/no) properties for \p and \P 7302 7303 Unicode defines a number of binary properties, that is, properties 7304 whose only values are true or false. You can obtain a list of those 7305 that are recognized by \p and \P, along with their abbreviations, by 7306 running this command: 7307 7308 pcre2test -LP 7309 7310 7311 The Bidi_Class property for \p and \P 7312 7313 \p{Bidi_Class:<class>} matches a character with the given class 7314 \p{BC:<class>} matches a character with the given class 7315 7316 The recognized classes are: 7317 7318 AL Arabic letter 7319 AN Arabic number 7320 B paragraph separator 7321 BN boundary neutral 7322 CS common separator 7323 EN European number 7324 ES European separator 7325 ET European terminator 7326 FSI first strong isolate 7327 L left-to-right 7328 LRE left-to-right embedding 7329 LRI left-to-right isolate 7330 LRO left-to-right override 7331 NSM non-spacing mark 7332 ON other neutral 7333 PDF pop directional format 7334 PDI pop directional isolate 7335 R right-to-left 7336 RLE right-to-left embedding 7337 RLI right-to-left isolate 7338 RLO right-to-left override 7339 S segment separator 7340 WS white space 7341 7342 An equals sign may be used instead of a colon. The class names are 7343 case-insensitive; only the short names listed above are recognized. 7344 7345 Extended grapheme clusters 7346 7347 The \X escape matches any number of Unicode characters that form an 7348 "extended grapheme cluster", and treats the sequence as an atomic group 7349 (see below).
Unicode supports various kinds of composite character by 7350 giving each character a grapheme breaking property, and having rules 7351 that use these properties to define the boundaries of extended grapheme 7352 clusters. The rules are defined in Unicode Standard Annex 29, "Unicode 7353 Text Segmentation". Unicode 11.0.0 abandoned the use of some previous 7354 properties that had been used for emojis. Instead it introduced vari- 7355 ous emoji-specific properties. PCRE2 uses only the Extended Picto- 7356 graphic property. 7357 7358 \X always matches at least one character. Then it decides whether to 7359 add additional characters according to the following rules for ending a 7360 cluster: 7361 7362 1. End at the end of the subject string. 7363 7364 2. Do not end between CR and LF; otherwise end after any control char- 7365 acter. 7366 7367 3. Do not break Hangul (a Korean script) syllable sequences. Hangul 7368 characters are of five types: L, V, T, LV, and LVT. An L character may 7369 be followed by an L, V, LV, or LVT character; an LV or V character may 7370 be followed by a V or T character; an LVT or T character may be fol- 7371 lowed only by a T character. 7372 7373 4. Do not end before extending characters or spacing marks or the zero- 7374 width joiner (ZWJ) character. Characters with the "mark" property al- 7375 ways have the "extend" grapheme breaking property. 7376 7377 5. Do not end after prepend characters. 7378 7379 6. Do not end within emoji modifier sequences or emoji ZWJ (zero-width 7380 joiner) sequences. An emoji ZWJ sequence consists of a character with 7381 the Extended_Pictographic property, optionally followed by one or more 7382 characters with the Extend property, followed by the ZWJ character, 7383 followed by another Extended_Pictographic character. 7384 7385 7. Do not break within emoji flag sequences. 
That is, do not break be- 7386 tween regional indicator (RI) characters if there are an odd number of 7387 RI characters before the break point. 7388 7389 8. Otherwise, end the cluster. 7390 7391 PCRE2's additional properties 7392 7393 As well as the standard Unicode properties described above, PCRE2 sup- 7394 ports four more that make it possible to convert traditional escape se- 7395 quences such as \w and \s to use Unicode properties. PCRE2 uses these 7396 non-standard, non-Perl properties internally when PCRE2_UCP is set. 7397 However, they may also be used explicitly. These properties are: 7398 7399 Xan Any alphanumeric character 7400 Xps Any POSIX space character 7401 Xsp Any Perl space character 7402 Xwd Any Perl "word" character 7403 7404 Xan matches characters that have either the L (letter) or the N (num- 7405 ber) property. Xps matches the characters tab, linefeed, vertical tab, 7406 form feed, or carriage return, and any other character that has the Z 7407 (separator) property. Xsp is the same as Xps; in PCRE1 it used to ex- 7408 clude vertical tab, for Perl compatibility, but Perl changed. Xwd 7409 matches the same characters as Xan, plus those that match Mn (non-spac- 7410 ing mark) or Pc (connector punctuation, which includes underscore). 7411 7412 There is another non-standard property, Xuc, which matches any charac- 7413 ter that can be represented by a Universal Character Name in C++ and 7414 other programming languages. These are the characters $, @, ` (grave 7415 accent), and all characters with Unicode code points greater than or 7416 equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that 7417 most base (ASCII) characters are excluded. (Universal Character Names 7418 are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit. 7419 Note that the Xuc property does not match these sequences but the char- 7420 acters that they represent.) 
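The definitions of Xan, Xps, Xwd, and Xuc given above can be written as simple predicates. This is a sketch that follows the wording of the text, with hypothetical function names and Python's unicodedata tables standing in for PCRE2's own (the Unicode versions may differ).

```python
import unicodedata


def xan(ch):
    """Xan: a character with the L (letter) or N (number) property."""
    return unicodedata.category(ch)[0] in "LN"


def xwd(ch):
    """Xwd: Xan plus Mn (non-spacing mark) and Pc (connector punctuation)."""
    return xan(ch) or unicodedata.category(ch) in ("Mn", "Pc")


def xps(ch):
    """Xps (and Xsp): tab, LF, VT, FF, CR or any Z (separator) character."""
    return ch in "\t\n\x0b\f\r" or unicodedata.category(ch)[0] == "Z"


def xuc(ch):
    """Xuc: $, @, grave accent, or any code point from U+00A0 upwards
    other than the surrogates U+D800 to U+DFFF."""
    cp = ord(ch)
    return ch in "$@`" or (cp >= 0xA0 and not 0xD800 <= cp <= 0xDFFF)
```

Note that xan("_") is false while xwd("_") is true, because underscore is connector punctuation rather than a letter or number.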
7421 7422 Resetting the match start 7423 7424 In normal use, the escape sequence \K causes any previously matched 7425 characters not to be included in the final matched sequence that is re- 7426 turned. For example, the pattern: 7427 7428 foo\Kbar 7429 7430 matches "foobar", but reports that it has matched "bar". \K does not 7431 interact with anchoring in any way. The pattern: 7432 7433 ^foo\Kbar 7434 7435 matches only when the subject begins with "foobar" (in single line 7436 mode), though it again reports the matched string as "bar". This fea- 7437 ture is similar to a lookbehind assertion (described below), but the 7438 part of the pattern that precedes \K is not constrained to match a lim- 7439 ited number of characters, as is required for a lookbehind assertion. 7440 The use of \K does not interfere with the setting of captured sub- 7441 strings. For example, when the pattern 7442 7443 (foo)\Kbar 7444 7445 matches "foobar", the first substring is still set to "foo". 7446 7447 From version 5.32.0 Perl forbids the use of \K in lookaround asser- 7448 tions. From release 10.38 PCRE2 also forbids this by default. However, 7449 the PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK option can be used when calling 7450 pcre2_compile() to re-enable the previous behaviour. When this option 7451 is set, \K is acted upon when it occurs inside positive assertions, but 7452 is ignored in negative assertions. Note that when a pattern such as 7453 (?=ab\K) matches, the reported start of the match can be greater than 7454 the end of the match. Using \K in a lookbehind assertion at the start 7455 of a pattern can also lead to odd effects. For example, consider this 7456 pattern: 7457 7458 (?<=\Kfoo)bar 7459 7460 If the subject is "foobar", a call to pcre2_match() with a starting 7461 offset of 3 succeeds and reports the matching string as "foobar", that 7462 is, the start of the reported match is earlier than where the match 7463 started. 
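Python's re module has no \K, so the effect of \K in normal use can only be emulated there. The sketch below is an approximation for simple patterns, not a description of how PCRE2 implements \K: it takes the parts of the pattern before and after \K as separate strings and reports the span of the second part only. The function name reported_match and the group name kept are hypothetical.

```python
import re


def reported_match(before_k, after_k, subject):
    """Emulate before_k\\Kafter_k: match both parts, report only the second.

    Returns the (start, end) span that would be reported, or None.
    """
    pattern = "(?:%s)(?P<kept>%s)" % (before_k, after_k)
    m = re.search(pattern, subject)
    return None if m is None else m.span("kept")
```

For example, reported_match("foo", "bar", "xxfoobar") returns (5, 8), the span of "bar". Unlike a lookbehind, the part before \K may have variable length, as in reported_match("fo+", "bar", "fooooobar").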
7464 7465 Simple assertions 7466 7467 The final use of backslash is for certain simple assertions. An asser- 7468 tion specifies a condition that has to be met at a particular point in 7469 a match, without consuming any characters from the subject string. The 7470 use of groups for more complicated assertions is described below. The 7471 backslashed assertions are: 7472 7473 \b matches at a word boundary 7474 \B matches when not at a word boundary 7475 \A matches at the start of the subject 7476 \Z matches at the end of the subject 7477 also matches before a newline at the end of the subject 7478 \z matches only at the end of the subject 7479 \G matches at the first matching position in the subject 7480 7481 Inside a character class, \b has a different meaning; it matches the 7482 backspace character. If any other of these assertions appears in a 7483 character class, an "invalid escape sequence" error is generated. 7484 7485 A word boundary is a position in the subject string where the current 7486 character and the previous character do not both match \w or \W (i.e. 7487 one matches \w and the other matches \W), or the start or end of the 7488 string if the first or last character matches \w, respectively. When 7489 PCRE2 is built with Unicode support, the meanings of \w and \W can be 7490 changed by setting the PCRE2_UCP option. When this is done, it also af- 7491 fects \b and \B. Neither PCRE2 nor Perl has a separate "start of word" 7492 or "end of word" metasequence. However, whatever follows \b normally 7493 determines which it is. For example, the fragment \ba matches "a" at 7494 the start of a word. 7495 7496 The \A, \Z, and \z assertions differ from the traditional circumflex 7497 and dollar (described in the next section) in that they only ever match 7498 at the very start and end of the subject string, whatever options are 7499 set. Thus, they are independent of multiline mode. 
These three asser- 7500 tions are not affected by the PCRE2_NOTBOL or PCRE2_NOTEOL options, 7501 which affect only the behaviour of the circumflex and dollar metachar- 7502 acters. However, if the startoffset argument of pcre2_match() is non- 7503 zero, indicating that matching is to start at a point other than the 7504 beginning of the subject, \A can never match. The difference between 7505 \Z and \z is that \Z matches before a newline at the end of the string 7506 as well as at the very end, whereas \z matches only at the end. 7507 7508 The \G assertion is true only when the current matching position is at 7509 the start point of the matching process, as specified by the startoff- 7510 set argument of pcre2_match(). It differs from \A when the value of 7511 startoffset is non-zero. By calling pcre2_match() multiple times with 7512 appropriate arguments, you can mimic Perl's /g option, and it is in 7513 this kind of implementation where \G can be useful. 7514 7515 Note, however, that PCRE2's implementation of \G, being true at the 7516 starting character of the matching process, is subtly different from 7517 Perl's, which defines it as true at the end of the previous match. In 7518 Perl, these can be different when the previously matched string was 7519 empty. Because PCRE2 does just one match at a time, it cannot reproduce 7520 this behaviour. 7521 7522 If all the alternatives of a pattern begin with \G, the expression is 7523 anchored to the starting match position, and the "anchored" flag is set 7524 in the compiled regular expression. 7525 7526 7527CIRCUMFLEX AND DOLLAR 7528 7529 The circumflex and dollar metacharacters are zero-width assertions. 7530 That is, they test for a particular condition being true without con- 7531 suming any characters from the subject string. These two metacharacters 7532 are concerned with matching the starts and ends of lines. 
       If the newline convention is set so that only the two-character
       sequence CRLF is recognized as a newline, isolated CR and LF characters
       are treated as ordinary data characters, and are not recognized as
       newlines.

       Outside a character class, in the default matching mode, the circumflex
       character is an assertion that is true only if the current matching
       point is at the start of the subject string. If the startoffset
       argument of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set,
       circumflex can never match if the PCRE2_MULTILINE option is unset.
       Inside a character class, circumflex has an entirely different meaning
       (see below).

       Circumflex need not be the first character of the pattern if a number
       of alternatives are involved, but it should be the first thing in each
       alternative in which it appears if the pattern is ever to match that
       branch. If all possible alternatives start with a circumflex, that is,
       if the pattern is constrained to match only at the start of the
       subject, it is said to be an "anchored" pattern. (There are also other
       constructs that can cause a pattern to be anchored.)

       The dollar character is an assertion that is true only if the current
       matching point is at the end of the subject string, or immediately
       before a newline at the end of the string (by default), unless
       PCRE2_NOTEOL is set. Note, however, that it does not actually match the
       newline. Dollar need not be the last character of the pattern if a
       number of alternatives are involved, but it should be the last item in
       any branch in which it appears. Dollar has no special meaning in a
       character class.

       The meaning of dollar can be changed so that it matches only at the
       very end of the string, by setting the PCRE2_DOLLAR_ENDONLY option at
       compile time. This does not affect the \Z assertion.
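
The default (non-multiline) behaviour of circumflex and dollar described above can be tried out with a short sketch. It uses Python's re module as a stand-in, which is an assumption made only for illustration: Python is not PCRE2, but for a subject that uses LF newlines its default ^ and $ follow the same rules. Python has no counterpart of PCRE2_DOLLAR_ENDONLY, PCRE2_NOTBOL, or PCRE2_NOTEOL, and its \Z behaves like PCRE2's \z, so none of those are shown. The helper name match_span is hypothetical.

```python
import re

def match_span(pattern, subject):
    """Return (start, end) of the first match, or None if there is none."""
    m = re.search(pattern, subject)
    return m.span() if m else None

# Dollar is true at the very end of the subject ...
at_end = match_span(r"abc$", "abc")
# ... and immediately before a newline that ends the subject,
# but the newline itself is not part of the match.
before_final_newline = match_span(r"abc$", "abc\n")
# A newline in the middle of the subject does not satisfy dollar by default.
before_inner_newline = match_span(r"abc$", "abc\ndef")
# Circumflex is true only at the start of the subject by default.
after_inner_newline = match_span(r"^def", "abc\ndef")

print(at_end, before_final_newline, before_inner_newline, after_inner_newline)
```

In the second case the match ends at offset 3, which shows that dollar asserts a position and does not consume the newline.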

       The meanings of the circumflex and dollar metacharacters are changed if
       the PCRE2_MULTILINE option is set. When this is the case, a dollar
       character matches before any newlines in the string, as well as at the
       very end, and a circumflex matches immediately after internal newlines
       as well as at the start of the subject string. It does not match after
       a newline that ends the string, for compatibility with Perl. However,
       this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.

       For example, the pattern /^abc$/ matches the subject string "def\nabc"
       (where \n represents a newline) in multiline mode, but not otherwise.
       Consequently, patterns that are anchored in single line mode because
       all branches start with ^ are not anchored in multiline mode, and a
       match for circumflex is possible when the startoffset argument of
       pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
       if PCRE2_MULTILINE is set.

       When the newline convention (see "Newline conventions" above)
       recognizes the two-character sequence CRLF as a newline, this is
       preferred, even if the single characters CR and LF are also recognized
       as newlines. For example, if the newline convention is "any", a
       multiline mode circumflex matches before "xyz" in the string
       "abc\r\nxyz" rather than after CR, even though CR on its own is a valid
       newline. (It also matches at the very start of the string, of course.)

       Note that the sequences \A, \Z, and \z can be used to match the start
       and end of the subject in both modes, and if all branches of a pattern
       start with \A it is always anchored, whether or not PCRE2_MULTILINE is
       set.


FULL STOP (PERIOD, DOT) AND \N

       Outside a character class, a dot in the pattern matches any one
       character in the subject string except (by default) a character that
       signifies the end of a line.
       One or more characters may be specified as line terminators (see
       "Newline conventions" above).

       Dot never matches a single line-ending character. When the
       two-character sequence CRLF is the only line ending, dot does not match
       CR if it is immediately followed by LF, but otherwise it matches all
       characters (including isolated CRs and LFs). When ANYCRLF is selected
       for line endings, no occurrences of CR or LF match dot. When all
       Unicode line endings are being recognized, dot does not match CR or LF
       or any of the other line ending characters.

       The behaviour of dot with regard to newlines can be changed. If the
       PCRE2_DOTALL option is set, a dot matches any one character, without
       exception. If the two-character sequence CRLF is present in the
       subject string, it takes two dots to match it.

       The handling of dot is entirely independent of the handling of
       circumflex and dollar, the only relationship being that they both
       involve newlines. Dot has no special meaning in a character class.

       The escape sequence \N when not followed by an opening brace behaves
       like a dot, except that it is not affected by the PCRE2_DOTALL option.
       In other words, it matches any character except one that signifies the
       end of a line.

       When \N is followed by an opening brace it has a different meaning. See
       the section entitled "Non-printing characters" above for details. Perl
       also uses \N{name} to specify characters by Unicode name; PCRE2 does
       not support this.


MATCHING A SINGLE CODE UNIT

       Outside a character class, the escape sequence \C matches any one code
       unit, whether or not a UTF mode is set. In the 8-bit library, one code
       unit is one byte; in the 16-bit library it is a 16-bit unit; in the
       32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
       line-ending characters.
       The feature is provided in Perl in order to match individual bytes in
       UTF-8 mode, but it is unclear how it can usefully be used.

       Because \C breaks up characters into individual code units, matching
       one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
       string may start with a malformed UTF character. This has undefined
       results, because PCRE2 assumes that it is matching character by
       character in a valid UTF string (by default it checks the subject
       string's validity at the start of processing unless the
       PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF option is used).

       An application can lock out the use of \C by setting the
       PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also
       possible to build PCRE2 with the use of \C permanently disabled.

       PCRE2 does not allow \C to appear in lookbehind assertions (described
       below) in UTF-8 or UTF-16 modes, because this would make it impossible
       to calculate the length of the lookbehind. Neither the alternative
       matching function pcre2_dfa_match() nor the JIT optimizer supports \C
       in these UTF modes. The former gives a match-time error; the latter
       fails to optimize and so the match is always run using the interpreter.

       In the 32-bit library, however, \C is always supported (when not
       explicitly locked out) because it always matches a single code unit,
       whether or not UTF-32 is specified.

       In general, the \C escape sequence is best avoided.
       However, one way of using it that avoids the problem of malformed UTF-8
       or UTF-16 characters is to use a lookahead to check the length of the
       next character, as in this pattern, which could be used with a UTF-8
       string (ignore white space and line breaks):

         (?| (?=[\x00-\x7f])(\C) |
             (?=[\x80-\x{7ff}])(\C)(\C) |
             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))

       In this example, a group that starts with (?| resets the capturing
       parentheses numbers in each alternative (see "Duplicate Group Numbers"
       below). The assertions at the start of each branch check the next UTF-8
       character for values whose encoding uses 1, 2, 3, or 4 bytes,
       respectively. The character's individual bytes are then captured by the
       appropriate number of \C groups.


SQUARE BRACKETS AND CHARACTER CLASSES

       An opening square bracket introduces a character class, terminated by a
       closing square bracket. A closing square bracket on its own is not
       special by default. If a closing square bracket is required as a member
       of the class, it should be the first data character in the class (after
       an initial circumflex, if present) or escaped with a backslash. This
       means that, by default, an empty class cannot be defined. However, if
       the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
       the start does end the (empty) class.

       A character class matches a single character in the subject. A matched
       character must be in the set of characters defined by the class, unless
       the first character in the class definition is a circumflex, in which
       case the subject character must not be in the set defined by the class.
       If a circumflex is actually required as a member of the class, ensure
       it is not the first character, or escape it with a backslash.
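
The rules above for making a closing square bracket or a circumflex a member of a class can be sketched as follows. Python's re module is used here as an assumed stand-in for PCRE2; it parses these particular class forms in the same way, but it has no equivalent of PCRE2_ALLOW_EMPTY_CLASS, so that option is not shown.

```python
import re

# "]" as the first data character is a literal member of the class.
bracket_first = re.findall(r"[]ab]", "a]b[c")
# After an initial circumflex, a leading "]" is still a literal member,
# so this class matches any character except "]" and "a".
negated_bracket = re.findall(r"[^]a]", "a]b")
# A circumflex that is not first, or that is escaped, is an ordinary member.
caret_not_first = re.findall(r"[a^]", "a^b")
caret_escaped = re.findall(r"[\^a]", "a^b")

print(bracket_first, negated_bracket, caret_not_first, caret_escaped)
```

Escaping with a backslash is the more readable choice when a pattern is maintained by several people, because it does not depend on the position of the character within the class.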

       For example, the character class [aeiou] matches any lower case vowel,
       while [^aeiou] matches any character that is not a lower case vowel.
       Note that a circumflex is just a convenient notation for specifying the
       characters that are in the class by enumerating those that are not. A
       class that starts with a circumflex is not an assertion; it still
       consumes a character from the subject string, and therefore it fails if
       the current pointer is at the end of the string.

       Characters in a class may be specified by their code points using \o,
       \x, or \N{U+hh..} in the usual way. When caseless matching is set, any
       letters in a class represent both their upper case and lower case
       versions, so for example, a caseless [aeiou] matches "A" as well as
       "a", and a caseless [^aeiou] does not match "A", whereas a caseful
       version would. Note that there are two ASCII characters, K and S, that,
       in addition to their lower case ASCII equivalents, are case-equivalent
       with Unicode U+212A (Kelvin sign) and U+017F (long S) respectively when
       either PCRE2_UTF or PCRE2_UCP is set.

       Characters that might indicate line breaks are never treated in any
       special way when matching character classes, whatever line-ending
       sequence is in use, and whatever setting of the PCRE2_DOTALL and
       PCRE2_MULTILINE options is used. A class such as [^a] always matches
       one of these characters.

       The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
       \S, \v, \V, \w, and \W may appear in a character class, and add the
       characters that they match to the class. For example, [\dABCDEF]
       matches any hexadecimal digit. In UTF modes, the PCRE2_UCP option
       affects the meanings of \d, \s, \w and their upper case partners, just
       as it does when they appear outside a character class, as described in
       the section entitled "Generic character types" above.
       The escape sequence \b has a different meaning inside a character
       class; it matches the backspace character. The sequences \B, \R, and \X
       are not special inside a character class. Like any other unrecognized
       escape sequences, they cause an error. The same is true for \N when not
       followed by an opening brace.

       The minus (hyphen) character can be used to specify a range of
       characters in a character class. For example, [d-m] matches any letter
       between d and m, inclusive. If a minus character is required in a
       class, it must be escaped with a backslash or appear in a position
       where it cannot be interpreted as indicating a range, typically as the
       first or last character in the class, or immediately after a range. For
       example, [b-d-z] matches letters in the range b to d, a hyphen
       character, or z.

       Perl treats a hyphen as a literal if it appears before or after a POSIX
       class (see below) or before or after a character type escape such as \d
       or \H. However, unless the hyphen is the last character in the class,
       Perl outputs a warning in its warning mode, as this is most likely a
       user error. As PCRE2 has no facility for warning, an error is given in
       these cases.

       It is not possible to have the literal character "]" as the end
       character of a range. A pattern such as [W-]46] is interpreted as a
       class of two characters ("W" and "-") followed by a literal string
       "46]", so it would match "W46]" or "-46]". However, if the "]" is
       escaped with a backslash it is interpreted as the end of range, so
       [W-\]46] is interpreted as a class containing a range followed by two
       other characters. The octal or hexadecimal representation of "]" can
       also be used to end a range.

       Ranges normally include all code points between the start and end
       characters, inclusive.
       They can also be used for code points specified numerically, for
       example [\000-\037]. Ranges can include any characters that are valid
       for the current mode. In any UTF mode, the so-called "surrogate"
       characters (those whose code points lie between 0xd800 and 0xdfff
       inclusive) may not be specified explicitly by default (the
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables this check).
       However, ranges such as [\x{d7ff}-\x{e000}], which include the
       surrogates, are always permitted.

       There is a special case in EBCDIC environments for ranges whose end
       points are both specified as literal letters in the same case. For
       compatibility with Perl, EBCDIC code points within the range that are
       not letters are omitted. For example, [h-k] matches only four
       characters, even though the codes for h and k are 0x88 and 0x92, a
       range of 11 code points. However, if the range is specified
       numerically, for example, [\x88-\x92] or [h-\x92], all code points are
       included.

       If a range that includes letters is used when caseless matching is set,
       it matches the letters in either case. For example, [W-c] is equivalent
       to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
       character tables for a French locale are in use, [\xc8-\xcb] matches
       accented E characters in both cases.

       A circumflex can conveniently be used with the upper case character
       types to specify a more restricted set of characters than the matching
       lower case type. For example, the class [^\W_] matches any letter or
       digit, but not underscore, whereas [\w] includes underscore. A positive
       character class should be read as "something OR something OR ..." and a
       negative class as "NOT something AND NOT something AND NOT ...".
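
The "NOT something AND NOT something" reading of a negated class, and the caseless range example, can be checked with a small sketch. It uses Python's re module purely as an assumed stand-in for PCRE2. The re.ASCII flag is passed so that \w and \W cover only ASCII, which is roughly comparable to PCRE2 without PCRE2_UCP; the subject strings are made up for the example.

```python
import re

# [\w] includes underscore.
word_chars = re.findall(r"[\w]", "a_1-B", re.ASCII)
# [^\W_] reads as "NOT a non-word character AND NOT underscore",
# which leaves only letters and digits.
letters_digits = re.findall(r"[^\W_]", "a_1-B", re.ASCII)
# A caseless [W-c] covers W..Z, the punctuation between "Z" and "a",
# and a..c, in either case; "v" lies outside the range in both cases.
caseless_range = re.findall(r"[W-c]", "wC^v", re.IGNORECASE)

print(word_chars, letters_digits, caseless_range)
```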

       The only metacharacters that are recognized in character classes are
       backslash, hyphen (only where it can be interpreted as specifying a
       range), circumflex (only at the start), opening square bracket (only
       when it can be interpreted as introducing a POSIX class name, or for a
       special compatibility feature - see the next two sections), and the
       terminating closing square bracket. However, escaping other
       non-alphanumeric characters does no harm.


POSIX CHARACTER CLASSES

       Perl supports the POSIX notation for character classes. This uses names
       enclosed by [: and :] within the enclosing square brackets. PCRE2 also
       supports this notation. For example,

         [01[:alpha:]%]

       matches "0", "1", any alphabetic character, or "%". The supported class
       names are:

         alnum    letters and digits
         alpha    letters
         ascii    character codes 0 - 127
         blank    space or tab only
         cntrl    control characters
         digit    decimal digits (same as \d)
         graph    printing characters, excluding space
         lower    lower case letters
         print    printing characters, including space
         punct    printing characters, excluding letters and digits and space
         space    white space (the same as \s from PCRE 8.34)
         upper    upper case letters
         word     "word" characters (same as \w)
         xdigit   hexadecimal digits

       The default "space" characters are HT (9), LF (10), VT (11), FF (12),
       CR (13), and space (32). If locale-specific matching is taking place,
       the list of space characters may be different; there may be fewer or
       more of them. "Space" and \s match the same set of characters, as do
       "word" and \w.

       The name "word" is a Perl extension, and "blank" is a GNU extension
       from Perl 5.8. Another Perl extension is negation, which is indicated
       by a ^ character after the colon. For example,

         [12[:^digit:]]

       matches "1", "2", or any non-digit.
       PCRE2 (and Perl) also recognize the POSIX syntax [.ch.] and [=ch=]
       where "ch" is a "collating element", but these are not supported, and
       an error is given if they are encountered.

       By default, characters with values greater than 127 do not match any of
       the POSIX character classes, although this may be different for
       characters in the range 128-255 when locale-specific matching is
       happening. However, in UCP mode, unless certain options are set (see
       below), some of the classes are changed so that Unicode character
       properties are used. This is achieved by replacing POSIX classes with
       other sequences, as follows:

         [:alnum:]  becomes  \p{Xan}
         [:alpha:]  becomes  \p{L}
         [:blank:]  becomes  \h
         [:cntrl:]  becomes  \p{Cc}
         [:digit:]  becomes  \p{Nd}
         [:lower:]  becomes  \p{Ll}
         [:space:]  becomes  \p{Xps}
         [:upper:]  becomes  \p{Lu}
         [:word:]   becomes  \p{Xwd}

       Negated versions, such as [:^alpha:] use \P instead of \p. Four other
       POSIX classes are handled specially in UCP mode:

       [:graph:] This matches characters that have glyphs that mark the page
                 when printed. In Unicode property terms, it matches all
                 characters with the L, M, N, P, S, or Cf properties, except
                 for:

                   U+061C           Arabic Letter Mark
                   U+180E           Mongolian Vowel Separator
                   U+2066 - U+2069  Various "isolate"s


       [:print:] This matches the same characters as [:graph:] plus space
                 characters that are not controls, that is, characters with
                 the Zs property.

       [:punct:] This matches all characters that have the Unicode P
                 (punctuation) property, plus those characters with code
                 points less than 256 that have the S (Symbol) property.

       [:xdigit:]
                 In addition to the ASCII hexadecimal digits, this also
                 matches the "fullwidth" versions of those characters, whose
                 Unicode code points start at U+FF10.
                 This is a change that was made in PCRE2 release 10.43 for
                 Perl compatibility.

       The other POSIX classes are unchanged by PCRE2_UCP, and match only
       characters with code points less than 256.

       There are two options that can be used to restrict the POSIX classes to
       ASCII characters when PCRE2_UCP is set. The option
       PCRE2_EXTRA_ASCII_DIGIT affects just [:digit:] and [:xdigit:]. Within a
       pattern, this can be set and unset by (?aT) and (?-aT). The
       PCRE2_EXTRA_ASCII_POSIX option disables UCP processing for all POSIX
       classes, including [:digit:] and [:xdigit:]. Within a pattern, (?aP) and
       (?-aP) set and unset both these options for consistency.


COMPATIBILITY FEATURE FOR WORD BOUNDARIES

       In the POSIX.2 compliant library that was included in 4.4BSD Unix, the
       ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
       and "end of word". PCRE2 treats these items as follows:

         [[:<:]]  is converted to  \b(?=\w)
         [[:>:]]  is converted to  \b(?<=\w)

       Only these exact character sequences are recognized. A sequence such as
       [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
       support is not compatible with Perl. It is provided to help migrations
       from other environments, and is best not used in any new patterns. Note
       that \b matches at the start and the end of a word (see "Simple
       assertions" above), and in a Perl-style pattern the preceding or
       following character normally shows which is wanted, without the need
       for the assertions that are used above in order to give exactly the
       POSIX behaviour. Note also that the PCRE2_UCP option changes the
       meaning of \w (and therefore \b) by default, so it also affects these
       POSIX sequences.


VERTICAL BAR

       Vertical bar characters are used to separate alternative patterns.
       For example, the pattern

         gilbert|sullivan

       matches either "gilbert" or "sullivan". Any number of alternatives may
       appear, and an empty alternative is permitted (matching the empty
       string). The matching process tries each alternative in turn, from left
       to right, and the first one that succeeds is used. If the alternatives
       are within a group (defined below), "succeeds" means matching the rest
       of the main pattern as well as the alternative in the group.


INTERNAL OPTION SETTING

       The settings of several options can be changed within a pattern by a
       sequence of letters enclosed between "(?" and ")". The following are
       Perl-compatible, and are described in detail in the pcre2api
       documentation. The option letters are:

         i  for PCRE2_CASELESS
         m  for PCRE2_MULTILINE
         n  for PCRE2_NO_AUTO_CAPTURE
         s  for PCRE2_DOTALL
         x  for PCRE2_EXTENDED
         xx for PCRE2_EXTENDED_MORE

       For example, (?im) sets caseless, multiline matching. It is also
       possible to unset these options by preceding the relevant letters with
       a hyphen, for example (?-im). The two "extended" options are not
       independent; unsetting either one cancels the effects of both of them.

       A combined setting and unsetting such as (?im-sx), which sets
       PCRE2_CASELESS and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and
       PCRE2_EXTENDED, is also permitted. Only one hyphen may appear in the
       options string. If a letter appears both before and after the hyphen,
       the option is unset. An empty options setting "(?)" is allowed.
       Needless to say, it has no effect.

       If the first character following (? is a circumflex, it causes all of
       the above options to be unset. Letters may follow the circumflex to
       cause some options to be re-instated, but a hyphen may not appear.
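
The effect of the option letters can be illustrated with a short sketch. It uses Python's re module as an assumed stand-in, and the overlap with PCRE2 is only partial: Python accepts an unscoped item such as (?im) only at the very start of a pattern, allows combined setting and unsetting only in the scoped (?flags-flags:...) form, and has no (?^...), n, or xx. The subject strings are made up for the example.

```python
import re

subject = "def\nABC"

# Without option letters the pattern is caseful, and ^ and $ see only the
# start and end of the whole subject, so nothing matches.
plain = re.search(r"^abc$", subject)
# (?im) at the start sets caseless, multiline matching for the pattern.
with_im = re.search(r"(?im)^abc$", subject)
# Setting and unsetting in one item: caseless matching is turned on and
# dot-matches-newline is turned off for the enclosed part, even though
# DOTALL was requested by the caller.
scoped = re.search(r"(?i-s:a.c)", "A\nC  AxC", re.DOTALL)

print(plain, with_im.group(), scoped.group())
```

The third search skips "A\nC" because the dot inside the group no longer matches a newline, and finds "AxC" instead.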

       Some PCRE2-specific options can be changed by the same mechanism using
       these pairs or individual letters:

         aD for PCRE2_EXTRA_ASCII_BSD
         aS for PCRE2_EXTRA_ASCII_BSS
         aW for PCRE2_EXTRA_ASCII_BSW
         aP for PCRE2_EXTRA_ASCII_POSIX and PCRE2_EXTRA_ASCII_DIGIT
         aT for PCRE2_EXTRA_ASCII_DIGIT
         r  for PCRE2_EXTRA_CASELESS_RESTRICT
         J  for PCRE2_DUPNAMES
         U  for PCRE2_UNGREEDY

       However, except for 'r', these are not unset by (?^), which is
       equivalent to (?-imnrsx). If 'a' is not followed by any of the upper
       case letters shown above, it sets (or unsets) all the ASCII options.

       PCRE2_EXTRA_ASCII_DIGIT has no additional effect when
       PCRE2_EXTRA_ASCII_POSIX is set, but including it in (?aP) means that
       (?-aP) suppresses all ASCII restrictions for POSIX classes.

       When one of these option changes occurs at top level (that is, not
       inside group parentheses), the change applies until a subsequent
       change, or the end of the pattern. An option change within a group (see
       below for a description of groups) affects only that part of the group
       that follows it. At the end of the group these options are reset to the
       state they were before the group. For example,

         (a(?i)b)c

       matches abc and aBc and no other strings (assuming PCRE2_CASELESS is
       not set externally). Any changes made in one alternative do carry on
       into subsequent branches within the same group. For example,

         (a(?i)b|c)

       matches "ab", "aB", "c", and "C", even though when matching "C" the
       first branch is abandoned before the option setting. This is because
       the effects of option settings happen at compile time. There would be
       some very weird behaviour otherwise.

       As a convenient shorthand, if any option settings are required at the
       start of a non-capturing group (see the next section), the option
       letters may appear between the "?" and the ":". Thus the two patterns

         (?i:saturday|sunday)
         (?:(?i)saturday|sunday)

       match exactly the same set of strings.

       Note: There are other PCRE2-specific options, applying to the whole
       pattern, which can be set by the application when the compiling
       function is called. In addition, the pattern can contain special
       leading sequences such as (*CRLF) to override what the application has
       set or what has been defaulted. Details are given in the section
       entitled "Newline sequences" above. There are also the (*UTF) and
       (*UCP) leading sequences that can be used to set UTF and Unicode
       property modes; they are equivalent to setting the PCRE2_UTF and
       PCRE2_UCP options, respectively. However, the application can set the
       PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, which lock out the use of
       the (*UTF) and (*UCP) sequences.


GROUPS

       Groups are delimited by parentheses (round brackets), which can be
       nested. Turning part of a pattern into a group does two things:

       1. It localizes a set of alternatives. For example, the pattern

         cat(aract|erpillar|)

       matches "cataract", "caterpillar", or "cat". Without the parentheses,
       it would match "cataract", "erpillar" or an empty string.

       2. It creates a "capture group". This means that, when the whole
       pattern matches, the portion of the subject string that matched the
       group is passed back to the caller, separately from the portion that
       matched the whole pattern. (This applies only to the traditional
       matching function; the DFA matching function does not support
       capturing.)
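
The two effects of grouping listed above can be checked with a short sketch. Python's re module is used as an assumed stand-in for PCRE2, since both treat plain parentheses and vertical bar in the same way for this pattern. The helper name full_matches and the candidate list are hypothetical.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["cataract", "caterpillar", "cat", "erpillar", ""]

# With parentheses, the alternatives apply only to what follows "cat".
grouped = full_matches(r"cat(aract|erpillar|)", candidates)
# Without them, the alternatives are "cataract", "erpillar", and empty.
ungrouped = full_matches(r"cataract|erpillar|", candidates)
# The group also captures: the part that matched inside the parentheses
# is available separately from the whole match.
captured = re.fullmatch(r"cat(aract|erpillar|)", "cataract").group(1)

print(grouped, ungrouped, captured)
```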

       Opening parentheses are counted from left to right (starting from 1) to
       obtain numbers for capture groups. For example, if the string "the red
       king" is matched against the pattern

         the ((red|white) (king|queen))

       the captured substrings are "red king", "red", and "king", and are
       numbered 1, 2, and 3, respectively.

       The fact that plain parentheses fulfil two functions is not always
       helpful. There are often times when grouping is required without
       capturing. If an opening parenthesis is followed by a question mark and
       a colon, the group does not do any capturing, and is not counted when
       computing the number of any subsequent capture groups. For example, if
       the string "the white queen" is matched against the pattern

         the ((?:red|white) (king|queen))

       the captured substrings are "white queen" and "queen", and are numbered
       1 and 2. The maximum number of capture groups is 65535.

       As a convenient shorthand, if any option settings are required at the
       start of a non-capturing group, the option letters may appear between
       the "?" and the ":". Thus the two patterns

         (?i:saturday|sunday)
         (?:(?i)saturday|sunday)

       match exactly the same set of strings. Because alternative branches are
       tried from left to right, and options are not reset until the end of
       the group is reached, an option setting in one branch does affect
       subsequent branches, so the above patterns match "SUNDAY" as well as
       "Saturday".


DUPLICATE GROUP NUMBERS

       Perl 5.10 introduced a feature whereby each alternative in a group uses
       the same numbers for its capturing parentheses. Such a group starts
       with (?| and is itself a non-capturing group.
       For example, consider this pattern:

         (?|(Sat)ur|(Sun))day

       Because the two alternatives are inside a (?| group, both sets of
       capturing parentheses are numbered one. Thus, when the pattern matches,
       you can look at captured substring number one, whichever alternative
       matched. This construct is useful when you want to capture part, but
       not all, of one of a number of alternatives. Inside a (?| group,
       parentheses are numbered as usual, but the number is reset at the start
       of each branch. The numbers of any capturing parentheses that follow
       the whole group start after the highest number used in any branch. The
       following example is taken from the Perl documentation. The numbers
       underneath show in which buffer the captured content will be stored.

         # before  ---------------branch-reset----------- after
         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
         # 1            2         2  3        2     3     4

       A backreference to a capture group uses the most recent value that is
       set for the group. The following pattern matches "abcabc" or "defdef":

         /(?|(abc)|(def))\1/

       In contrast, a subroutine call to a capture group always refers to the
       first one in the pattern with the given number. The following pattern
       matches "abcabc" or "defabc":

         /(?|(abc)|(def))(?1)/

       A relative reference such as (?-1) is no different: it is just a
       convenient way of computing an absolute group number.

       If a condition test for a group's having matched refers to a non-unique
       number, the test is true if any group with that number has matched.

       An alternative approach to using this "branch reset" feature is to use
       duplicate named groups, as described in the next section.


NAMED CAPTURE GROUPS

       Identifying capture groups by number is simple, but it can be very hard
       to keep track of the numbers in complicated patterns.
       Furthermore, if an expression is modified, the numbers may change. To
       help with this difficulty, PCRE2 supports the naming of capture groups.
       This feature was not added to Perl until release 5.10. Python had the
       feature earlier, and PCRE1 introduced it at release 4.0, using the
       Python syntax. PCRE2 supports both the Perl and the Python syntax.

       In PCRE2, a capture group can be named in one of three ways:
       (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python.
       Names may be up to 128 code units long. When PCRE2_UTF is not set, they
       may contain only ASCII alphanumeric characters and underscores, but
       must start with a non-digit. When PCRE2_UTF is set, the syntax of group
       names is extended to allow any Unicode letter or Unicode decimal digit.
       In other words, group names must match one of these patterns:

         ^[_A-Za-z][_A-Za-z0-9]*\z   when PCRE2_UTF is not set
         ^[_\p{L}][_\p{L}\p{Nd}]*\z  when PCRE2_UTF is set

       References to capture groups from other parts of the pattern, such as
       backreferences, recursion, and conditions, can all be made by name as
       well as by number.

       Named capture groups are allocated numbers as well as names, exactly as
       if the names were not present. In both PCRE2 and Perl, capture groups
       are primarily identified by numbers; any names are just aliases for
       these numbers. The PCRE2 API provides function calls for extracting the
       complete name-to-number translation table from a compiled pattern, as
       well as convenience functions for extracting captured substrings by
       name.

       Warning: When more than one capture group has the same number, as
       described in the previous section, a name given to one of them applies
       to all of them. Perl allows identically numbered groups to have
       different names.
       Consider this pattern, where there are two capture groups, both
       numbered 1:

         (?|(?<AA>aa)|(?<BB>bb))

       Perl allows this, with both names AA and BB as aliases of group 1.
       Thus, after a successful match, both names yield the same value (either
       "aa" or "bb").

       In an attempt to reduce confusion, PCRE2 does not allow the same group
       number to be associated with more than one name. The example above
       provokes a compile-time error. However, there is still scope for
       confusion. Consider this pattern:

         (?|(?<AA>aa)|(bb))

       Although the second group number 1 is not explicitly named, the name AA
       is still an alias for any group 1. Whether the pattern matches "aa" or
       "bb", a reference by name to group AA yields the matched string.

       By default, a name must be unique within a pattern, except that
       duplicate names are permitted for groups with the same number, for
       example:

         (?|(?<AA>aa)|(?<AA>bb))

       The duplicate name constraint can be disabled by setting the
       PCRE2_DUPNAMES option at compile time, or by the use of (?J) within the
       pattern, as described in the section entitled "Internal Option Setting"
       above.

       Duplicate names can be useful for patterns where only one instance of
       the named capture group can match. Suppose you want to match the name
       of a weekday, either as a 3-letter abbreviation or as the full name,
       and in both cases you want to extract the abbreviation. This pattern
       (ignoring the line breaks) does the job:

         (?J)
         (?<DN>Mon|Fri|Sun)(?:day)?|
         (?<DN>Tue)(?:sday)?|
         (?<DN>Wed)(?:nesday)?|
         (?<DN>Thu)(?:rsday)?|
         (?<DN>Sat)(?:urday)?

       There are five capture groups, but only one is ever set after a match.
       The convenience functions for extracting the data by name return the
       substring for the first (and in this example, the only) group of that
       name that matched.
       This saves searching to find which numbered group it was. (An
       alternative way of solving this problem is to use a "branch reset"
       group, as described in the previous section.)

       If you make a backreference to a non-unique named group from elsewhere
       in the pattern, the groups to which the name refers are checked in the
       order in which they appear in the overall pattern. The first one that
       is set is used for the reference. For example, this pattern matches
       both "foofoo" and "barbar" but not "foobar" or "barfoo":

         (?J)(?:(?<n>foo)|(?<n>bar))\k<n>

       If you make a subroutine call to a non-unique named group, the one that
       corresponds to the first occurrence of the name is used. In the absence
       of duplicate numbers this is the one with the lowest number.

       If you use a named reference in a condition test (see the section about
       conditions below), either to check whether a capture group has matched,
       or to check for recursion, all groups with the same name are tested. If
       the condition is true for any one of them, the overall condition is
       true. This is the same behaviour as testing by number. For further
       details of the interfaces for handling named capture groups, see the
       pcre2api documentation.


REPETITION

       Repetition is specified by quantifiers, which may follow any one of
       these items:

         a literal data character
         the dot metacharacter
         the \C escape sequence
         the \R escape sequence
         the \X escape sequence
         any escape sequence that matches a single character
         a character class
         a backreference
         a parenthesized group (including lookaround assertions)
         a subroutine call (recursive or otherwise)

       If a quantifier does not follow a repeatable item, an error occurs.
       The general repetition quantifier specifies a minimum and maximum
       number of permitted matches by giving two numbers in curly brackets
       (braces), separated by a comma. The numbers must be less than 65536,
       and the first must be less than or equal to the second. For example,

         z{2,4}

       matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
       special character. If the second number is omitted, but the comma is
       present, there is no upper limit; if the second number and the comma
       are both omitted, the quantifier specifies an exact number of required
       matches. Thus

         [aeiou]{3,}

       matches at least 3 successive vowels, but may match many more, whereas

         \d{8}

       matches exactly 8 digits. If the first number is omitted, the lower
       limit is taken as zero; in this case the upper limit must be present.

         X{,4} is interpreted as X{0,4}

       This is a change in behaviour that happened in Perl 5.34.0 and PCRE2
       10.43. In earlier versions such a sequence was not interpreted as a
       quantifier. Other regular expression engines may behave either way.

       If the characters that follow an opening brace do not match the syntax
       of a quantifier, the brace is taken as a literal character. In
       particular, this means that {,} is a literal string of three
       characters.

       Note that not every opening brace is potentially the start of a
       quantifier because braces are used in other items such as \N{U+345} or
       \k{name}.

       In UTF modes, quantifiers apply to characters rather than to individual
       code units. Thus, for example, \x{100}{2} matches two characters, each
       of which is represented by a two-byte sequence in a UTF-8 string.
       Similarly, \X{3} matches three Unicode extended grapheme clusters, each
       of which may be several code units long (and they may be of different
       lengths).
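
       The bounded quantifiers described above can be tried out with a short
       script. The sketch below is only an illustration: it assumes Python's
       standard re module as a stand-in engine (it is not PCRE2, although the
       two agree for these particular patterns), and the helper name
       fully_matches is hypothetical.

```python
import re

def fully_matches(pattern: str, subject: str) -> bool:
    """Return True when the pattern matches the whole subject."""
    return re.fullmatch(pattern, subject) is not None

# z{2,4}: at least two and at most four repetitions.
print([fully_matches(r"z{2,4}", "z" * n) for n in range(1, 6)])
# [False, True, True, True, False]

# [aeiou]{3,}: three or more, with no upper limit.
print(fully_matches(r"[aeiou]{3,}", "ae"), fully_matches(r"[aeiou]{3,}", "aeiouaeiou"))
# False True

# \d{8}: exactly eight digits.
print(fully_matches(r"\d{8}", "2024013"), fully_matches(r"\d{8}", "20240131"))
# False True

# A quantifier counts characters, not code units: U+0100 needs two bytes
# in UTF-8, yet {2} still means two characters.
print(len("\u0100".encode("utf-8")), fully_matches("\u0100{2}", "\u0100\u0100"))
# 2 True
```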

       The quantifier {0} is permitted, causing the expression to behave as if
       the previous item and the quantifier were not present. This may be
       useful for capture groups that are referenced as subroutines from
       elsewhere in the pattern (but see also the section entitled "Defining
       capture groups for use by reference only" below). Except for
       parenthesized groups, items that have a {0} quantifier are omitted from
       the compiled pattern.

       For convenience, the three most common quantifiers have
       single-character abbreviations:

         *    is equivalent to {0,}
         +    is equivalent to {1,}
         ?    is equivalent to {0,1}

       It is possible to construct infinite loops by following a group that
       can match no characters with a quantifier that has no upper limit, for
       example:

         (a?)*

       Earlier versions of Perl and PCRE1 used to give an error at compile
       time for such patterns. However, because there are cases where this can
       be useful, such patterns are now accepted, but whenever an iteration of
       such a group matches no characters, matching moves on to the next item
       in the pattern instead of repeatedly matching an empty string. This
       does not prevent backtracking into any of the iterations if a
       subsequent item fails to match.

       By default, quantifiers are "greedy", that is, they match as much as
       possible (up to the maximum number of permitted repetitions), without
       causing the rest of the pattern to fail. The classic example of where
       this gives problems is in trying to match comments in C programs. These
       appear between /* and */ and within the comment, individual * and /
       characters may appear.
       An attempt to match C comments by applying the pattern

         /\*.*\*/

       to the string

         /* first comment */ not comment /* second comment */

       fails, because it matches the entire string owing to the greediness of
       the .* item. However, if a quantifier is followed by a question mark,
       it ceases to be greedy, and instead matches the minimum number of times
       possible, so the pattern

         /\*.*?\*/

       does the right thing with C comments. The meaning of the various
       quantifiers is not otherwise changed, just the preferred number of
       matches. Do not confuse this use of question mark with its use as a
       quantifier in its own right. Because it has two uses, it can sometimes
       appear doubled, as in

         \d??\d

       which matches one digit by preference, but can match two if that is the
       only way the rest of the pattern matches.

       If the PCRE2_UNGREEDY option is set (an option that is not available in
       Perl), the quantifiers are not greedy by default, but individual ones
       can be made greedy by following them with a question mark. In other
       words, it inverts the default behaviour.

       When a parenthesized group is quantified with a minimum repeat count
       that is greater than 1 or with a limited maximum, more memory is
       required for the compiled pattern, in proportion to the size of the
       minimum or maximum.

       If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option
       (equivalent to Perl's /s) is set, thus allowing the dot to match
       newlines, the pattern is implicitly anchored, because whatever follows
       will be tried against every character position in the subject string,
       so there is no point in retrying the overall match at any position
       after the first. PCRE2 normally treats such a pattern as though it were
       preceded by \A.
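
       The greedy and minimizing forms of the C comment pattern discussed
       earlier in this section can be compared side by side. This is a minimal
       sketch that assumes Python's standard re module in place of PCRE2 (the
       two behave the same way for these patterns); the names SUBJECT and
       first_match are hypothetical.

```python
import re

SUBJECT = "/* first comment */ not comment /* second comment */"

def first_match(pattern: str, subject: str = SUBJECT) -> str:
    """Return the text of the first match, or '' when there is none."""
    m = re.search(pattern, subject)
    return m.group(0) if m else ""

greedy = first_match(r"/\*.*\*/")   # .* takes as much as it can
lazy = first_match(r"/\*.*?\*/")    # .*? takes as little as it can

print(greedy == SUBJECT)            # True: one match spans both comments
print(lazy)                         # /* first comment */

# A doubled question mark: an optional digit, matched lazily.
print(first_match(r"\d??\d", "12")) # 1
```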

       In cases where it is known that the subject string contains no
       newlines, it is worth setting PCRE2_DOTALL in order to obtain this
       optimization, or alternatively, using ^ to indicate anchoring
       explicitly.

       However, there are some cases where the optimization cannot be used.
       When .* is inside capturing parentheses that are the subject of a
       backreference elsewhere in the pattern, a match at the start may fail
       where a later one succeeds. Consider, for example:

         (.*)abc\1

       If the subject is "xyz123abc123" the match point is the fourth
       character. For this reason, such a pattern is not implicitly anchored.

       Another case where implicit anchoring is not applied is when the
       leading .* is inside an atomic group. Once again, a match at the start
       may fail where a later one succeeds. Consider this pattern:

         (?>.*?a)b

       It matches "ab" in the subject "aab". The use of the backtracking
       control verbs (*PRUNE) and (*SKIP) also disables this optimization, and
       there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.

       When a capture group is repeated, the value captured is the substring
       that matched the final iteration. For example, after

         (tweedle[dume]{3}\s*)+

       has matched "tweedledum tweedledee" the value of the captured substring
       is "tweedledee". However, if there are nested capture groups, the
       corresponding captured values may have been set in previous iterations.
       For example, after

         (a|(b))+

       matches "aba" the value of the second captured substring is "b".


ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

       With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
       repetition, failure of what follows normally causes the repeated item
       to be re-evaluated to see if a different number of repeats allows the
       rest of the pattern to match.
       Sometimes it is useful to prevent this, either to change the nature of
       the match, or to cause it to fail earlier than it otherwise might, when
       the author of the pattern knows there is no point in carrying on.

       Consider, for example, the pattern \d+foo when applied to the subject
       line

         123456bar

       After matching all 6 digits and then failing to match "foo", the normal
       action of the matcher is to try again with only 5 digits matching the
       \d+ item, and then with 4, and so on, before ultimately failing.
       "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
       the means for specifying that once a group has matched, it is not to be
       re-evaluated in this way.

       If we use atomic grouping for the previous example, the matcher gives
       up immediately on failing to match "foo" the first time. The notation
       is a kind of special parenthesis, starting with (?> as in this example:

         (?>\d+)foo

       Perl 5.28 introduced an experimental alphabetic form starting with (*
       which may be easier to remember:

         (*atomic:\d+)foo

       This kind of parenthesized group "locks up" the part of the pattern it
       contains once it has matched, and a failure further into the pattern is
       prevented from backtracking into it. Backtracking past it to previous
       items, however, works as normal.

       An alternative description is that a group of this type matches exactly
       the string of characters that an identical standalone pattern would
       match, if anchored at the current point in the subject string.

       Atomic groups are not capture groups. Simple cases such as the above
       example can be thought of as a maximizing repeat that must swallow
       everything it can. So, while both \d+ and \d+? are prepared to adjust
       the number of digits they match in order to make the rest of the
       pattern match, (?>\d+) can only match an entire sequence of digits.

       Atomic groups in general can of course contain arbitrarily complicated
       expressions, and can be nested. However, when the content of an atomic
       group is just a single repeated item, as in the example above, a
       simpler notation, called a "possessive quantifier" can be used. This
       consists of an additional + character following a quantifier. Using
       this notation, the previous example can be rewritten as

         \d++foo

       Note that a possessive quantifier can be used with an entire group, for
       example:

         (abc|xyz){2,3}+

       Possessive quantifiers are always greedy; the setting of the
       PCRE2_UNGREEDY option is ignored. They are a convenient notation for
       the simpler forms of atomic group. However, there is no difference in
       the meaning of a possessive quantifier and the equivalent atomic group,
       though there may be a performance difference; possessive quantifiers
       should be slightly faster.

       The possessive quantifier syntax is an extension to the Perl 5.8
       syntax. Jeffrey Friedl originated the idea (and the name) in the first
       edition of his book. Mike McCloskey liked it, so implemented it when he
       built Sun's Java package, and PCRE1 copied it from there. It found its
       way into Perl at release 5.10.

       PCRE2 has an optimization that automatically "possessifies" certain
       simple pattern constructs. For example, the sequence A+B is treated as
       A++B because there is no point in backtracking into a sequence of A's
       when B must follow. This feature can be disabled by the
       PCRE2_NO_AUTO_POSSESS option, or by starting the pattern with
       (*NO_AUTO_POSSESS).
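
       The difference that atomic grouping makes can be observed with a small
       experiment. The sketch below assumes Python's standard re module as a
       stand-in engine, not PCRE2. Because older Python versions lack the (?>
       and ++ syntax, it does not use an atomic group at all: it swaps in the
       well-known emulation (?=(\d+))\1, which relies on lookahead assertions
       being atomic. The helper name matches is hypothetical.

```python
import re

def matches(pattern: str, subject: str) -> bool:
    """Return True when the pattern matches anywhere in the subject."""
    return re.search(pattern, subject) is not None

# Ordinary greedy repeat: \d+ gives back one digit so the final \d can match.
print(matches(r"^\d+\d$", "123456"))            # True

# Emulated (?>\d+): the lookahead captures the whole digit run and \1
# consumes it. Nothing can be given back, so the final \d cannot match.
print(matches(r"^(?=(\d+))\1\d$", "123456"))    # False

# When no backtracking is needed, both forms accept the subject.
print(matches(r"\d+foo", "123456foo"))          # True
print(matches(r"(?=(\d+))\1foo", "123456foo"))  # True

# Both forms reject the subject from the text above.
print(matches(r"\d+foo", "123456bar"))          # False
print(matches(r"(?=(\d+))\1foo", "123456bar"))  # False
```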

       When a pattern contains an unlimited repeat inside a group that can
       itself be repeated an unlimited number of times, the use of an atomic
       group is the only way to avoid some failing matches taking a very long
       time indeed. The pattern

         (\D+|<\d+>)*[!?]

       matches an unlimited number of substrings that either consist of
       non-digits, or digits enclosed in <>, followed by either ! or ?. When
       it matches, it runs quickly. However, if it is applied to

         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

       it takes a long time before reporting failure. This is because the
       string can be divided between the internal \D+ repeat and the external
       * repeat in a large number of ways, and all have to be tried. (The
       example uses [!?] rather than a single character at the end, because
       both PCRE2 and Perl have an optimization that allows for fast failure
       when a single character is used. They remember the last single
       character that is required for a match, and fail early if it is not
       present in the string.) If the pattern is changed so that it uses an
       atomic group, like this:

         ((?>\D+)|<\d+>)*[!?]

       sequences of non-digits cannot be broken, and failure happens quickly.


BACKREFERENCES

       Outside a character class, a backslash followed by a digit greater than
       0 (and possibly further digits) is a backreference to a capture group
       earlier (that is, to its left) in the pattern, provided there have been
       that many previous capture groups.

       However, if the decimal number following the backslash is less than 8,
       it is always taken as a backreference, and causes an error only if
       there are not that many capture groups in the entire pattern. In other
       words, the group that is referenced need not be to the left of the
       reference for numbers less than 8.
       A "forward backreference" of this type can make sense when a repetition
       is involved and the group to the right has participated in an earlier
       iteration.

       It is not possible to have a numerical "forward backreference" to a
       group whose number is 8 or more using this syntax because a sequence
       such as \50 is interpreted as a character defined in octal. See the
       subsection entitled "Non-printing characters" above for further details
       of the handling of digits following a backslash. Other forms of
       backreferencing do not suffer from this restriction. In particular,
       there is no problem when named capture groups are used (see below).

       Another way of avoiding the ambiguity inherent in the use of digits
       following a backslash is to use the \g escape sequence. This escape
       must be followed by a signed or unsigned number, optionally enclosed in
       braces. These examples are all identical:

         (ring), \1
         (ring), \g1
         (ring), \g{1}

       An unsigned number specifies an absolute reference without the
       ambiguity that is present in the older syntax. It is also useful when
       literal digits follow the reference. A signed number is a relative
       reference. Consider this example:

         (abc(def)ghi)\g{-1}

       The sequence \g{-1} is a reference to the capture group whose number is
       one less than the number of the next group to be started, so in this
       example (where the next group would be numbered 3) it is equivalent to
       \2, and \g{-2} would be equivalent to \1. Note that if this construct
       is inside a capture group, that group is included in the count, so in
       this example \g{-2} also refers to group 1:

         (A)(\g{-2}B)

       The use of relative references can be helpful in long patterns, and
       also in patterns that are created by joining together fragments that
       contain references within themselves.
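
       The arithmetic behind a negative relative reference can be written out
       as a few lines of code. This is a deliberately simplified sketch in
       Python, not how PCRE2 itself resolves references: the function
       resolve_relative is hypothetical, and it counts only plain (...)
       groups, ignoring named groups, character classes, and escaped
       backslashes.

```python
import re

# An unescaped "(" that opens a plain capture group: not "(?" and not "(*".
_PLAIN_GROUP_OPEN = re.compile(r"(?<!\\)\((?![?*])")

def resolve_relative(pattern: str, ref_pos: int, offset: int) -> int:
    """Absolute group number for a negative relative reference at ref_pos."""
    if offset >= 0:
        raise ValueError("this sketch handles negative offsets only")
    # Groups already started before the reference, including any still open.
    opened = len(_PLAIN_GROUP_OPEN.findall(pattern[:ref_pos]))
    next_group = opened + 1      # the number the next group to start would get
    target = next_group + offset
    if target < 1:
        raise ValueError("reference points before the first group")
    return target

p1 = r"(abc(def)ghi)\g{-1}"
print(resolve_relative(p1, p1.index(r"\g"), -1))   # 2
print(resolve_relative(p1, p1.index(r"\g"), -2))   # 1

# The enclosing group 2 is already open, so it is included in the count.
p2 = r"(A)(\g{-2}B)"
print(resolve_relative(p2, p2.index(r"\g"), -2))   # 1
```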

       The sequence \g{+1} is a reference to the next capture group that is
       started after this item, and \g{+2} refers to the one after that, and
       so on. This kind of forward reference can be useful in patterns that
       repeat. Perl does not support the use of + in this way.

       A backreference matches whatever actually most recently matched the
       capture group in the current subject string, rather than anything at
       all that matches the group (see "Groups as subroutines" below for a way
       of doing that). So the pattern

         (sens|respons)e and \1ibility

       matches "sense and sensibility" and "response and responsibility", but
       not "sense and responsibility". If caseful matching is in force at the
       time of the backreference, the case of letters is relevant. For
       example,

         ((?i)rah)\s+\1

       matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
       original capture group is matched caselessly.

       There are several different ways of writing backreferences to named
       capture groups. The .NET syntax is \k{name}, the Python syntax is
       (?P=name), and the original Perl syntax is \k<name> or \k'name'. All of
       these are now supported by both Perl and PCRE2. Perl 5.10's unified
       backreference syntax, in which \g can be used for both numeric and
       named references, is also supported by PCRE2. We could rewrite the
       above example in any of the following ways:

         (?<p1>(?i)rah)\s+\k<p1>
         (?'p1'(?i)rah)\s+\k{p1}
         (?P<p1>(?i)rah)\s+(?P=p1)
         (?<p1>(?i)rah)\s+\g{p1}

       A capture group that is referenced by name may appear in the pattern
       before or after the reference.

       There may be more than one backreference to the same group. If a group
       has not actually been used in a particular match, backreferences to it
       always fail by default.
       For example, the pattern

         (a|(bc))\2

       always fails if it starts to match "a" rather than "bc". However, if
       the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a
       backreference to an unset value matches an empty string.

       Because there may be many capture groups in a pattern, all digits
       following a backslash are taken as part of a potential backreference
       number. If the pattern continues with a digit character, some delimiter
       must be used to terminate the backreference. If the PCRE2_EXTENDED or
       PCRE2_EXTENDED_MORE option is set, this can be white space. Otherwise,
       the \g{} syntax or an empty comment (see "Comments" below) can be used.

   Recursive backreferences

       A backreference that occurs inside the group to which it refers fails
       when the group is first used, so, for example, (a\1) never matches.
       However, such references can be useful inside repeated groups. For
       example, the pattern

         (a|b\1)+

       matches any number of "a"s and also "aba", "ababbaa" etc. At each
       iteration of the group, the backreference matches the character string
       corresponding to the previous iteration. In order for this to work, the
       pattern must be such that the first iteration does not need to match
       the backreference. This can be done using alternation, as in the
       example above, or by a quantifier with a minimum of zero.

       For versions of PCRE2 less than 10.25, backreferences of this type used
       to cause the group that they reference to be treated as an atomic
       group. This restriction no longer applies, and backtracking into such
       groups can occur as normal.


ASSERTIONS

       An assertion is a test on the characters following or preceding the
       current matching point that does not consume any characters. The simple
       assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
       above.
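
       That an assertion consumes no characters shows up directly in the
       offsets of a match. The sketch below assumes Python's standard re
       module as a stand-in for PCRE2, which treats these simple assertions
       the same way; the helper name span_of is hypothetical.

```python
import re

def span_of(pattern: str, subject: str):
    """Return (start, end) of the first match, or None when there is none."""
    m = re.search(pattern, subject)
    return m.span() if m else None

# A pattern made only of an assertion matches an empty string.
print(span_of(r"\b", "foo bar"))       # (0, 0)
print(span_of(r"$", "foo bar"))        # (7, 7)

# Around literal text, the assertions add nothing to the matched length.
print(span_of(r"\bbar\b", "foo bar"))  # (4, 7)
print(span_of(r"\Bo", "foo bar"))      # (1, 2)
```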

       More complicated assertions are coded as parenthesized groups. There
       are two kinds: those that look ahead of the current position in the
       subject string, and those that look behind it, and in each case an
       assertion may be positive (must match for the assertion to be true) or
       negative (must not match for the assertion to be true). An assertion
       group is matched in the normal way, and if it is true, matching
       continues after it, but with the matching position in the subject
       string reset to what it was before the assertion was processed.

       The Perl-compatible lookaround assertions are atomic. If an assertion
       is true, but there is a subsequent matching failure, there is no
       backtracking into the assertion. However, there are some cases where
       non-atomic assertions can be useful. PCRE2 has some support for these,
       described in the section entitled "Non-atomic assertions" below, but
       they are not Perl-compatible.

       A lookaround assertion may appear as the condition in a conditional
       group (see below). In this case, the result of matching the assertion
       determines which branch of the condition is followed.

       Assertion groups are not capture groups. If an assertion contains
       capture groups within it, these are counted for the purposes of
       numbering the capture groups in the whole pattern. Within each branch
       of an assertion, locally captured substrings may be referenced in the
       usual way. For example, a sequence such as (.)\g{-1} can be used to
       check that two adjacent characters are the same.

       When a branch within an assertion fails to match, any substrings that
       were captured are discarded (as happens with any pattern branch that
       fails to match). A negative assertion is true only when all its
       branches fail to match; this means that no captured substrings are ever
       retained after a successful negative assertion.
       When an assertion contains a matching branch, what happens depends on
       the type of assertion.

       For a positive assertion, internally captured substrings in the
       successful branch are retained, and matching continues with the next
       pattern item after the assertion. For a negative assertion, a matching
       branch means that the assertion is not true. If such an assertion is
       being used as a condition in a conditional group (see below), captured
       substrings are retained, because matching continues with the "no"
       branch of the condition. For other failing negative assertions, control
       passes to the previous backtracking point, thus discarding any captured
       strings within the assertion.

       Most assertion groups may be repeated; though it makes no sense to
       assert the same thing several times, the side effect of capturing in
       positive assertions may occasionally be useful. However, an assertion
       that forms the condition for a conditional group may not be quantified.
       PCRE2 used to restrict the repetition of assertions, but from release
       10.35 the only restriction is that an unlimited maximum repetition is
       changed to be one more than the minimum. For example, {3,} is treated
       as {3,4}.

   Alphabetic assertion names

       Traditionally, symbolic sequences such as (?= and (?<= have been used
       to specify lookaround assertions. Perl 5.28 introduced some
       experimental alphabetic alternatives which might be easier to remember.
       They all start with (* instead of (? and must be written using lower
       case letters. PCRE2 supports the following synonyms:

         (*positive_lookahead:  or (*pla: is the same as (?=
         (*negative_lookahead:  or (*nla: is the same as (?!
         (*positive_lookbehind: or (*plb: is the same as (?<=
         (*negative_lookbehind: or (*nlb: is the same as (?<!

       For example, (*pla:foo) is the same assertion as (?=foo).
       In the following sections, the various assertions are described using
       the original symbolic forms.

   Lookahead assertions

       Lookahead assertions start with (?= for positive assertions and (?! for
       negative assertions. For example,

         \w+(?=;)

       matches a word followed by a semicolon, but does not include the
       semicolon in the match, and

         foo(?!bar)

       matches any occurrence of "foo" that is not followed by "bar". Note
       that the apparently similar pattern

         (?!foo)bar

       does not find an occurrence of "bar" that is preceded by something
       other than "foo"; it finds any occurrence of "bar" whatsoever, because
       the assertion (?!foo) is always true when the next three characters are
       "bar". A lookbehind assertion is needed to achieve the other effect.

       If you want to force a matching failure at some point in a pattern, the
       most convenient way to do it is with (?!) because an empty string
       always matches, so an assertion that requires there not to be an empty
       string must always fail. The backtracking control verb (*FAIL) or (*F)
       is a synonym for (?!).

   Lookbehind assertions

       Lookbehind assertions start with (?<= for positive assertions and (?<!
       for negative assertions. For example,

         (?<!foo)bar

       does find an occurrence of "bar" that is not preceded by "foo". The
       contents of a lookbehind assertion are restricted such that there must
       be a known maximum to the lengths of all the strings it matches. There
       are two cases:

       If every top-level alternative matches a fixed length, for example

         (?<=colour|color)

       there is a limit of 65535 characters to the lengths, which do not have
       to be the same, as this example demonstrates.
       This is the only kind of lookbehind supported by PCRE2 versions earlier
       than 10.43 and by the alternative matching function pcre2_dfa_match().

       In PCRE2 10.43 and later, pcre2_match() supports lookbehind assertions
       in which one or more top-level alternatives can match more than one
       string length, for example

         (?<=colou?r)

       The maximum matching length for any branch of the lookbehind is limited
       to a value set by the calling program (default 255 characters).
       Unlimited repetition (for example \d*) is not supported. In some cases,
       the escape sequence \K (see above) can be used instead of a lookbehind
       assertion at the start of a pattern to get round the length limit
       restriction.

       In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which
       matches a single code unit even in a UTF mode) to appear in lookbehind
       assertions, because it makes it impossible to calculate the length of
       the lookbehind. The \X and \R escapes, which can match different
       numbers of code units, are never permitted in lookbehinds.

       "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
       lookbehinds, as long as the called capture group matches a
       limited-length string. However, recursion, that is, a "subroutine" call
       into a group that is already active, is not supported.

       PCRE2 supports backreferences in lookbehinds, but only if certain
       conditions are met. The PCRE2_MATCH_UNSET_BACKREF option must not be
       set, there must be no use of (?| in the pattern (it creates duplicate
       group numbers), and if the backreference is by name, the name must be
       unique. Of course, the referenced group must itself match a
       limited-length substring.
       The following pattern matches words containing at least two characters
       that begin and end with the same character:

         \b(\w)\w++(?<=\1)

       Possessive quantifiers can be used in conjunction with lookbehind
       assertions to specify efficient matching at the end of subject strings.
       Consider a simple pattern such as

         abcd$

       when applied to a long string that does not match. Because matching
       proceeds from left to right, PCRE2 will look for each "a" in the
       subject and then see if what follows matches the rest of the pattern.
       If the pattern is specified as

         ^.*abcd$

       the initial .* matches the entire string at first, but when this fails
       (because there is no following "a"), it backtracks to match all but the
       last character, then all but the last two characters, and so on. Once
       again the search for "a" covers the entire string, from right to left,
       so we are no better off. However, if the pattern is written as

         ^.*+(?<=abcd)

       there can be no backtracking for the .*+ item because of the possessive
       quantifier; it can match only the entire string. The subsequent
       lookbehind assertion does a single test on the last four characters. If
       it fails, the match fails immediately. For long strings, this approach
       makes a significant difference to the processing time.

   Using multiple assertions

       Several assertions (of any sort) may occur in succession. For example,

         (?<=\d{3})(?<!999)foo

       matches "foo" preceded by three digits that are not "999". Notice that
       each of the assertions is applied independently at the same point in
       the subject string. First there is a check that the previous three
       characters are all digits, and then there is a check that the same
       three characters are not "999".
This pattern does not match "foo" pre- 8848 ceded by six characters, the first of which are digits and the last 8849 three of which are not "999". For example, it doesn't match "123abc- 8850 foo". A pattern to do that is 8851 8852 (?<=\d{3}...)(?<!999)foo 8853 8854 This time the first assertion looks at the preceding six characters, 8855 checking that the first three are digits, and then the second assertion 8856 checks that the preceding three characters are not "999". 8857 8858 Assertions can be nested in any combination. For example, 8859 8860 (?<=(?<!foo)bar)baz 8861 8862 matches an occurrence of "baz" that is preceded by "bar" which in turn 8863 is not preceded by "foo", while 8864 8865 (?<=\d{3}(?!999)...)foo 8866 8867 is another pattern that matches "foo" preceded by three digits and any 8868 three characters that are not "999". 8869 8870 8871NON-ATOMIC ASSERTIONS 8872 8873 Traditional lookaround assertions are atomic. That is, if an assertion 8874 is true, but there is a subsequent matching failure, there is no back- 8875 tracking into the assertion. However, there are some cases where non- 8876 atomic positive assertions can be useful. PCRE2 provides these using 8877 the following syntax: 8878 8879 (*non_atomic_positive_lookahead: or (*napla: or (?* 8880 (*non_atomic_positive_lookbehind: or (*naplb: or (?<* 8881 8882 Consider the problem of finding the right-most word in a string that 8883 also appears earlier in the string, that is, it must appear at least 8884 twice in total. This pattern returns the required result as captured 8885 substring 1: 8886 8887 ^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2} 8888 8889 For a subject such as "word1 word2 word3 word2 word3 word4" the result 8890 is "word3". How does it work? At the start, ^(?x) anchors the pattern 8891 and sets the "x" option, which causes white space (introduced for read- 8892 ability) to be ignored. 
Inside the assertion, the greedy .* at first 8893 consumes the entire string, but then has to backtrack until the rest of 8894 the assertion can match a word, which is captured by group 1. In other 8895 words, when the assertion first succeeds, it captures the right-most 8896 word in the string. 8897 8898 The current matching point is then reset to the start of the subject, 8899 and the rest of the pattern match checks for two occurrences of the 8900 captured word, using an ungreedy .*? to scan from the left. If this 8901 succeeds, we are done, but if the last word in the string does not oc- 8902 cur twice, this part of the pattern fails. If a traditional atomic 8903 lookahead (?= or (*pla: had been used, the assertion could not be re- 8904 entered, and the whole match would fail. The pattern would succeed only 8905 if the very last word in the subject was found twice. 8906 8907 Using a non-atomic lookahead, however, means that when the last word 8908 does not occur twice in the string, the lookahead can backtrack and 8909 find the second-last word, and so on, until either the match succeeds, 8910 or all words have been tested. 8911 8912 Two conditions must be met for a non-atomic assertion to be useful: the 8913 contents of one or more capturing groups must change after a backtrack 8914 into the assertion, and there must be a backreference to a changed 8915 group later in the pattern. If this is not the case, the rest of the 8916 pattern match fails exactly as before because nothing has changed, so 8917 using a non-atomic assertion just wastes resources. 8918 8919 There is one exception to backtracking into a non-atomic assertion. If 8920 an (*ACCEPT) control verb is triggered, the assertion succeeds atomi- 8921 cally. That is, a subsequent match failure cannot backtrack into the 8922 assertion. 8923 8924 Non-atomic assertions are not supported by the alternative matching 8925 function pcre2_dfa_match(). 
       They are supported by JIT, but only if they do not contain any control
       verbs such as (*ACCEPT). (This may change in the future.) Note that
       assertions that appear as conditions for conditional groups (see
       below) must be atomic.


SCRIPT RUNS

       In concept, a script run is a sequence of characters that are all from
       the same Unicode script such as Latin or Greek. However, because some
       scripts are commonly used together, and because some diacritical and
       other marks are used with multiple scripts, it is not that simple.
       There is a full description of the rules that PCRE2 uses in the
       section entitled "Script Runs" in the pcre2unicode documentation.

       If part of a pattern is enclosed between (*script_run: or (*sr: and a
       closing parenthesis, it fails if the sequence of characters that it
       matches is not a script run. After a failure, normal backtracking
       occurs. Script runs can be used to detect spoofing attacks using
       characters that look the same, but are from different scripts. The
       string "paypal.com" is an infamous example, where the letters could be
       a mixture of Latin and Cyrillic. This pattern ensures that the matched
       characters in a sequence of non-spaces that follow white space are a
       script run:

         \s+(*sr:\S+)

       To be sure that they are all from the Latin script (for example), a
       lookahead can be used:

         \s+(?=\p{Latin})(*sr:\S+)

       This works as long as the first character is expected to be a
       character in that script, and not (for example) punctuation, which is
       allowed with any script. If this is not the case, a more creative
       lookahead is needed. For example, if digits, underscore, and dots are
       permitted at the start:

         \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)

       In many cases, backtracking into a script run pattern fragment is not
       desirable.
The script run can employ an atomic group to prevent this. 8968 Because this is a common requirement, a shorthand notation is provided 8969 by (*atomic_script_run: or (*asr: 8970 8971 (*asr:...) is the same as (*sr:(?>...)) 8972 8973 Note that the atomic group is inside the script run. Putting it outside 8974 would not prevent backtracking into the script run pattern. 8975 8976 Support for script runs is not available if PCRE2 is compiled without 8977 Unicode support. A compile-time error is given if any of the above con- 8978 structs is encountered. Script runs are not supported by the alternate 8979 matching function, pcre2_dfa_match() because they use the same mecha- 8980 nism as capturing parentheses. 8981 8982 Warning: The (*ACCEPT) control verb (see below) should not be used 8983 within a script run group, because it causes an immediate exit from the 8984 group, bypassing the script run checking. 8985 8986 8987CONDITIONAL GROUPS 8988 8989 It is possible to cause the matching process to obey a pattern fragment 8990 conditionally or to choose between two alternative fragments, depending 8991 on the result of an assertion, or whether a specific capture group has 8992 already been matched. The two possible forms of conditional group are: 8993 8994 (?(condition)yes-pattern) 8995 (?(condition)yes-pattern|no-pattern) 8996 8997 If the condition is satisfied, the yes-pattern is used; otherwise the 8998 no-pattern (if present) is used. An absent no-pattern is equivalent to 8999 an empty string (it always matches). If there are more than two alter- 9000 natives in the group, a compile-time error occurs. Each of the two al- 9001 ternatives may itself contain nested groups of any form, including con- 9002 ditional groups; the restriction to two alternatives applies only at 9003 the level of the condition itself. 
This pattern fragment is an example 9004 where the alternatives are complex: 9005 9006 (?(1) (A|B|C) | (D | (?(2)E|F) | E) ) 9007 9008 9009 There are five kinds of condition: references to capture groups, refer- 9010 ences to recursion, two pseudo-conditions called DEFINE and VERSION, 9011 and assertions. 9012 9013 Checking for a used capture group by number 9014 9015 If the text between the parentheses consists of a sequence of digits, 9016 the condition is true if a capture group of that number has previously 9017 matched. If there is more than one capture group with the same number 9018 (see the earlier section about duplicate group numbers), the condition 9019 is true if any of them have matched. An alternative notation, which is 9020 a PCRE2 extension, not supported by Perl, is to precede the digits with 9021 a plus or minus sign. In this case, the group number is relative rather 9022 than absolute. The most recently opened capture group (which could be 9023 enclosing this condition) can be referenced by (?(-1), the next most 9024 recent by (?(-2), and so on. Inside loops it can also make sense to re- 9025 fer to subsequent groups. The next capture group to be opened can be 9026 referenced as (?(+1), and so on. The value zero in any of these forms 9027 is not used; it provokes a compile-time error. 9028 9029 Consider the following pattern, which contains non-significant white 9030 space to make it more readable (assume the PCRE2_EXTENDED option) and 9031 to divide it into three parts for ease of discussion: 9032 9033 ( \( )? [^()]+ (?(1) \) ) 9034 9035 The first part matches an optional opening parenthesis, and if that 9036 character is present, sets it as the first captured substring. The sec- 9037 ond part matches one or more characters that are not parentheses. The 9038 third part is a conditional group that tests whether or not the first 9039 capture group matched. 
       If it did, that is, if the subject started with an opening
       parenthesis, the condition is true, and so the yes-pattern is executed
       and a closing parenthesis is required. Otherwise, since the no-pattern
       is not present, the conditional group matches nothing. In other words,
       this pattern matches a sequence of non-parentheses, optionally
       enclosed in parentheses.

       If you were embedding this pattern in a larger one, you could use a
       relative reference:

         ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...

       This makes the fragment independent of the parentheses in the larger
       pattern.

   Checking for a used capture group by name

       Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
       used capture group by name. For compatibility with earlier versions of
       PCRE1, which had this facility before Perl, the syntax (?(name)...) is
       also recognized. Note, however, that undelimited names consisting of
       the letter R followed by digits are ambiguous (see the following
       section). Rewriting the above example to use a named group gives this:

         (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )

       If the name used in a condition of this kind is a duplicate, the test
       is applied to all groups of the same name, and is true if any one of
       them has matched.

   Checking for pattern recursion

       "Recursion" in this sense refers to any subroutine-like call from one
       part of the pattern to another, whether or not it is actually
       recursive. See the sections entitled "Recursive patterns" and "Groups
       as subroutines" below for details of recursion and subroutine calls.

       If a condition is the string (R), and there is no capture group with
       the name R, the condition is true if matching is currently in a
       recursion or subroutine call to the whole pattern or any capture
       group.
If 9079 digits follow the letter R, and there is no group with that name, the 9080 condition is true if the most recent call is into a group with the 9081 given number, which must exist somewhere in the overall pattern. This 9082 is a contrived example that is equivalent to a+b: 9083 9084 ((?(R1)a+|(?1)b)) 9085 9086 However, in both cases, if there is a capture group with a matching 9087 name, the condition tests for its being set, as described in the sec- 9088 tion above, instead of testing for recursion. For example, creating a 9089 group with the name R1 by adding (?<R1>) to the above pattern com- 9090 pletely changes its meaning. 9091 9092 If a name preceded by ampersand follows the letter R, for example: 9093 9094 (?(R&name)...) 9095 9096 the condition is true if the most recent recursion is into a group of 9097 that name (which must exist within the pattern). 9098 9099 This condition does not check the entire recursion stack. It tests only 9100 the current level. If the name used in a condition of this kind is a 9101 duplicate, the test is applied to all groups of the same name, and is 9102 true if any one of them is the most recent recursion. 9103 9104 At "top level", all these recursion test conditions are false. 9105 9106 Defining capture groups for use by reference only 9107 9108 If the condition is the string (DEFINE), the condition is always false, 9109 even if there is a group with the name DEFINE. In this case, there may 9110 be only one alternative in the rest of the conditional group. It is al- 9111 ways skipped if control reaches this point in the pattern; the idea of 9112 DEFINE is that it can be used to define subroutines that can be refer- 9113 enced from elsewhere. (The use of subroutines is described below.) 
       For example, a pattern to match an IPv4 address such as
       "192.168.23.245" could be written like this (ignore white space and
       line breaks):

         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
         \b (?&byte) (\.(?&byte)){3} \b

       The first part of the pattern is a DEFINE group inside which another
       group named "byte" is defined. This matches an individual component of
       an IPv4 address (a number less than 256). When matching takes place,
       this part of the pattern is skipped because DEFINE acts like a false
       condition. The rest of the pattern uses references to the named group
       to match the four dot-separated components of an IPv4 address,
       insisting on a word boundary at each end.

   Checking the PCRE2 version

       Programs that link with a PCRE2 library can check the version by
       calling pcre2_config() with appropriate arguments. Users of
       applications that do not have access to the underlying code cannot do
       this. A special "condition" called VERSION exists to allow such users
       to discover which version of PCRE2 they are dealing with by using this
       condition to match a string such as "yesno". VERSION must be followed
       either by "=" or ">=" and a version number. For example:

         (?(VERSION>=10.4)yes|no)

       This pattern matches "yes" if the PCRE2 version is greater than or
       equal to 10.4, or "no" otherwise. The fractional part of the version
       number may not contain more than two digits.

   Assertion conditions

       If the condition is not in any of the above formats, it must be a
       parenthesized assertion. This may be a positive or negative lookahead
       or lookbehind assertion. However, it must be a traditional atomic
       assertion, not one of the non-atomic assertions.
9150 9151 Consider this pattern, again containing non-significant white space, 9152 and with the two alternatives on the second line: 9153 9154 (?(?=[^a-z]*[a-z]) 9155 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) 9156 9157 The condition is a positive lookahead assertion that matches an op- 9158 tional sequence of non-letters followed by a letter. In other words, it 9159 tests for the presence of at least one letter in the subject. If a let- 9160 ter is found, the subject is matched against the first alternative; 9161 otherwise it is matched against the second. This pattern matches 9162 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are 9163 letters and dd are digits. 9164 9165 When an assertion that is a condition contains capture groups, any cap- 9166 turing that occurs in a matching branch is retained afterwards, for 9167 both positive and negative assertions, because matching always contin- 9168 ues after the assertion, whether it succeeds or fails. (Compare non- 9169 conditional assertions, for which captures are retained only for posi- 9170 tive assertions that succeed.) 9171 9172 9173COMMENTS 9174 9175 There are two ways of including comments in patterns that are processed 9176 by PCRE2. In both cases, the start of the comment must not be in a 9177 character class, nor in the middle of any other sequence of related 9178 characters such as (?: or a group name or number. The characters that 9179 make up a comment play no part in the pattern matching. 9180 9181 The sequence (?# marks the start of a comment that continues up to the 9182 next closing parenthesis. Nested parentheses are not permitted. If the 9183 PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # 9184 character also introduces a comment, which in this case continues to 9185 immediately after the next newline character or character sequence in 9186 the pattern. 
Which characters are interpreted as newlines is controlled 9187 by an option passed to the compiling function or by a special sequence 9188 at the start of the pattern, as described in the section entitled "New- 9189 line conventions" above. Note that the end of this type of comment is a 9190 literal newline sequence in the pattern; escape sequences that happen 9191 to represent a newline do not count. For example, consider this pattern 9192 when PCRE2_EXTENDED is set, and the default newline convention (a sin- 9193 gle linefeed character) is in force: 9194 9195 abc #comment \n still comment 9196 9197 On encountering the # character, pcre2_compile() skips along, looking 9198 for a newline in the pattern. The sequence \n is still literal at this 9199 stage, so it does not terminate the comment. Only an actual character 9200 with the code value 0x0a (the default newline) does so. 9201 9202 9203RECURSIVE PATTERNS 9204 9205 Consider the problem of matching a string in parentheses, allowing for 9206 unlimited nested parentheses. Without the use of recursion, the best 9207 that can be done is to use a pattern that matches up to some fixed 9208 depth of nesting. It is not possible to handle an arbitrary nesting 9209 depth. 9210 9211 For some time, Perl has provided a facility that allows regular expres- 9212 sions to recurse (amongst other things). It does this by interpolating 9213 Perl code in the expression at run time, and the code can refer to the 9214 expression itself. A Perl pattern using code interpolation to solve the 9215 parentheses problem can be created like this: 9216 9217 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x; 9218 9219 The (?p{...}) item interpolates Perl code at run time, and in this case 9220 refers recursively to the pattern in which it appears. 9221 9222 Obviously, PCRE2 cannot support the interpolation of Perl code. 
In- 9223 stead, it supports special syntax for recursion of the entire pattern, 9224 and also for individual capture group recursion. After its introduction 9225 in PCRE1 and Python, this kind of recursion was subsequently introduced 9226 into Perl at release 5.10. 9227 9228 A special item that consists of (? followed by a number greater than 9229 zero and a closing parenthesis is a recursive subroutine call of the 9230 capture group of the given number, provided that it occurs inside that 9231 group. (If not, it is a non-recursive subroutine call, which is de- 9232 scribed in the next section.) The special item (?R) or (?0) is a recur- 9233 sive call of the entire regular expression. 9234 9235 This PCRE2 pattern solves the nested parentheses problem (assume the 9236 PCRE2_EXTENDED option is set so that white space is ignored): 9237 9238 \( ( [^()]++ | (?R) )* \) 9239 9240 First it matches an opening parenthesis. Then it matches any number of 9241 substrings which can either be a sequence of non-parentheses, or a re- 9242 cursive match of the pattern itself (that is, a correctly parenthesized 9243 substring). Finally there is a closing parenthesis. Note the use of a 9244 possessive quantifier to avoid backtracking into sequences of non- 9245 parentheses. 9246 9247 If this were part of a larger pattern, you would not want to recurse 9248 the entire pattern, so instead you could use this: 9249 9250 ( \( ( [^()]++ | (?1) )* \) ) 9251 9252 We have put the pattern into parentheses, and caused the recursion to 9253 refer to them instead of the whole pattern. 9254 9255 In a larger pattern, keeping track of parenthesis numbers can be 9256 tricky. This is made easier by the use of relative references. Instead 9257 of (?1) in the pattern above you can write (?-2) to refer to the second 9258 most recently opened parentheses preceding the recursion. In other 9259 words, a negative number counts capturing parentheses leftwards from 9260 the point at which it is encountered. 
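
       To make the control flow of the recursive pattern concrete, the
       following is a minimal sketch in Python of a hand-written matcher
       that mirrors its three parts. It is an illustration only: it is not
       PCRE2 code, it says nothing about how PCRE2 implements recursion, and
       the function name match_parens is hypothetical. Unlike the pattern,
       it tests a single starting position rather than searching the
       subject.

```python
def match_parens(s: str, pos: int = 0) -> int:
    # Mirrors the structure of:  \( ( [^()]++ | (?R) )* \)
    # Returns the index just past the balanced "(...)" that starts at pos,
    # or -1 if there is no such match at that position.
    if pos >= len(s) or s[pos] != "(":
        return -1                        # \(  : an opening parenthesis is required
    pos += 1
    while pos < len(s) and s[pos] != ")":
        if s[pos] == "(":
            end = match_parens(s, pos)   # (?R) : recurse for a nested group
            if end == -1:
                return -1
            pos = end
        else:
            # [^()]++ : consume a whole run of non-parentheses in one step
            while pos < len(s) and s[pos] not in "()":
                pos += 1
    if pos < len(s) and s[pos] == ")":
        return pos + 1                   # \)  : the closing parenthesis
    return -1


if __name__ == "__main__":
    for subject in ("(ab(cd)ef)", "()", "(a(b)", "abc"):
        print(repr(subject), "->", match_parens(subject))
```

       Each call to match_parens corresponds to one level of recursion in
       the pattern, so the depth of nesting is limited only by the subject
       (and, in this sketch, by Python's own recursion limit).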
9261 9262 Be aware however, that if duplicate capture group numbers are in use, 9263 relative references refer to the earliest group with the appropriate 9264 number. Consider, for example: 9265 9266 (?|(a)|(b)) (c) (?-2) 9267 9268 The first two capture groups (a) and (b) are both numbered 1, and group 9269 (c) is number 2. When the reference (?-2) is encountered, the second 9270 most recently opened parentheses has the number 1, but it is the first 9271 such group (the (a) group) to which the recursion refers. This would be 9272 the same if an absolute reference (?1) was used. In other words, rela- 9273 tive references are just a shorthand for computing a group number. 9274 9275 It is also possible to refer to subsequent capture groups, by writing 9276 references such as (?+2). However, these cannot be recursive because 9277 the reference is not inside the parentheses that are referenced. They 9278 are always non-recursive subroutine calls, as described in the next 9279 section. 9280 9281 An alternative approach is to use named parentheses. The Perl syntax 9282 for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup- 9283 ported. We could rewrite the above example as follows: 9284 9285 (?<pn> \( ( [^()]++ | (?&pn) )* \) ) 9286 9287 If there is more than one group with the same name, the earliest one is 9288 used. 9289 9290 The example pattern that we have been looking at contains nested unlim- 9291 ited repeats, and so the use of a possessive quantifier for matching 9292 strings of non-parentheses is important when applying the pattern to 9293 strings that do not match. For example, when this pattern is applied to 9294 9295 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() 9296 9297 it yields "no match" quickly. 
However, if a possessive quantifier is 9298 not used, the match runs for a very long time indeed because there are 9299 so many different ways the + and * repeats can carve up the subject, 9300 and all have to be tested before failure can be reported. 9301 9302 At the end of a match, the values of capturing parentheses are those 9303 from the outermost level. If you want to obtain intermediate values, a 9304 callout function can be used (see below and the pcre2callout documenta- 9305 tion). If the pattern above is matched against 9306 9307 (ab(cd)ef) 9308 9309 the value for the inner capturing parentheses (numbered 2) is "ef", 9310 which is the last value taken on at the top level. If a capture group 9311 is not matched at the top level, its final captured value is unset, 9312 even if it was (temporarily) set at a deeper level during the matching 9313 process. 9314 9315 Do not confuse the (?R) item with the condition (R), which tests for 9316 recursion. Consider this pattern, which matches text in angle brack- 9317 ets, allowing for arbitrary nesting. Only digits are allowed in nested 9318 brackets (that is, when recursing), whereas any characters are permit- 9319 ted at the outer level. 9320 9321 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * > 9322 9323 In this pattern, (?(R) is the start of a conditional group, with two 9324 different alternatives for the recursive and non-recursive cases. The 9325 (?R) item is the actual recursive call. 9326 9327 Differences in recursion processing between PCRE2 and Perl 9328 9329 Some former differences between PCRE2 and Perl no longer exist. 9330 9331 Before release 10.30, recursion processing in PCRE2 differed from Perl 9332 in that a recursive subroutine call was always treated as an atomic 9333 group. That is, once it had matched some of the subject string, it was 9334 never re-entered, even if it contained untried alternatives and there 9335 was a subsequent matching failure. 
(Historical note: PCRE implemented 9336 recursion before Perl did.) 9337 9338 Starting with release 10.30, recursive subroutine calls are no longer 9339 treated as atomic. That is, they can be re-entered to try unused alter- 9340 natives if there is a matching failure later in the pattern. This is 9341 now compatible with the way Perl works. If you want a subroutine call 9342 to be atomic, you must explicitly enclose it in an atomic group. 9343 9344 Supporting backtracking into recursions simplifies certain types of re- 9345 cursive pattern. For example, this pattern matches palindromic strings: 9346 9347 ^((.)(?1)\2|.?)$ 9348 9349 The second branch in the group matches a single central character in 9350 the palindrome when there are an odd number of characters, or nothing 9351 when there are an even number of characters, but in order to work it 9352 has to be able to try the second case when the rest of the pattern 9353 match fails. If you want to match typical palindromic phrases, the pat- 9354 tern has to ignore all non-word characters, which can be done like 9355 this: 9356 9357 ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$ 9358 9359 If run with the PCRE2_CASELESS option, this pattern matches phrases 9360 such as "A man, a plan, a canal: Panama!". Note the use of the posses- 9361 sive quantifier *+ to avoid backtracking into sequences of non-word 9362 characters. Without this, PCRE2 takes a great deal longer (ten times or 9363 more) to match typical phrases, and Perl takes so long that you think 9364 it has gone into a loop. 9365 9366 Another way in which PCRE2 and Perl used to differ in their recursion 9367 processing is in the handling of captured values. Formerly in Perl, 9368 when a group was called recursively or as a subroutine (see the next 9369 section), it had no access to any values that were captured outside the 9370 recursion, whereas in PCRE2 these values can be referenced. 
       Consider this pattern:

         ^(.)(\1|a(?2))

       This pattern matches "bab". The first capturing parentheses match "b",
       then in the second group, when the backreference \1 (which now holds
       "b") fails to match the next character "a", the second alternative
       matches "a" and then recurses. In the recursion, \1 does now match "b"
       and so the whole match succeeds. This match used to fail in Perl, but
       it succeeds in later versions (5.024 was tested).


GROUPS AS SUBROUTINES

       If the syntax for a recursive group call (either by number or by name)
       is used outside the parentheses to which it refers, it operates a bit
       like a subroutine in a programming language. More accurately, PCRE2
       treats the referenced group as an independent subpattern which it
       tries to match at the current matching position. The called group may
       be defined before or after the reference. A numbered reference can be
       absolute or relative, as in these examples:

         (...(absolute)...)...(?2)...
         (...(relative)...)...(?-1)...
         (...(?+1)...(relative)...

       An earlier example pointed out that the pattern

         (sens|respons)e and \1ibility

       matches "sense and sensibility" and "response and responsibility", but
       not "sense and responsibility". If instead the pattern

         (sens|respons)e and (?1)ibility

       is used, it does match "sense and responsibility" as well as the other
       two strings. Another example is given in the discussion of DEFINE
       above.

       Like recursions, subroutine calls used to be treated as atomic, but
       this changed at PCRE2 release 10.30, so backtracking into subroutine
       calls can now occur. However, any capturing parentheses that are set
       during the subroutine call revert to their previous values afterwards.
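
       The difference between a backreference and a subroutine call in the
       "sense and responsibility" example can be sketched with Python's re
       module. This is an emulation under stated assumptions, not PCRE2
       itself: Python's re has no (?1) syntax, so the call is imitated by
       repeating the text of the group as a non-capturing group, which is
       the "independent subpattern" behaviour described above. The names
       GROUP_BODY, backref and subroutine are hypothetical.

```python
import re

GROUP_BODY = r"sens|respons"

# Backreference: \1 must repeat the exact text that group 1 captured.
backref = re.compile(rf"({GROUP_BODY})e and \1ibility")

# Emulated subroutine call: the subpattern of group 1 is matched afresh.
# PCRE2 would write this as  (sens|respons)e and (?1)ibility
subroutine = re.compile(rf"({GROUP_BODY})e and (?:{GROUP_BODY})ibility")

if __name__ == "__main__":
    subjects = (
        "sense and sensibility",
        "response and responsibility",
        "sense and responsibility",
    )
    for s in subjects:
        print(f"{s!r}: backref={backref.fullmatch(s) is not None}, "
              f"subroutine={subroutine.fullmatch(s) is not None}")
```

       Because the repeated subpattern is non-capturing, group 1 still holds
       "sens" after "sense and responsibility" has matched, which
       corresponds to captures reverting to their previous values after a
       subroutine call.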
9413 9414 Processing options such as case-independence are fixed when a group is 9415 defined, so if it is used as a subroutine, such options cannot be 9416 changed for different calls. For example, consider this pattern: 9417 9418 (abc)(?i:(?-1)) 9419 9420 It matches "abcabc". It does not match "abcABC" because the change of 9421 processing option does not affect the called group. 9422 9423 The behaviour of backtracking control verbs in groups when called as 9424 subroutines is described in the section entitled "Backtracking verbs in 9425 subroutines" below. 9426 9427 9428ONIGURUMA SUBROUTINE SYNTAX 9429 9430 For compatibility with Oniguruma, the non-Perl syntax \g followed by a 9431 name or a number enclosed either in angle brackets or single quotes, is 9432 an alternative syntax for calling a group as a subroutine, possibly re- 9433 cursively. Here are two of the examples used above, rewritten using 9434 this syntax: 9435 9436 (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) 9437 (sens|respons)e and \g'1'ibility 9438 9439 PCRE2 supports an extension to Oniguruma: if a number is preceded by a 9440 plus or a minus sign it is taken as a relative reference. For example: 9441 9442 (abc)(?i:\g<-1>) 9443 9444 Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not 9445 synonymous. The former is a backreference; the latter is a subroutine 9446 call. 9447 9448 9449CALLOUTS 9450 9451 Perl has a feature whereby using the sequence (?{...}) causes arbitrary 9452 Perl code to be obeyed in the middle of matching a regular expression. 9453 This makes it possible, amongst other things, to extract different sub- 9454 strings that match the same pair of parentheses when there is a repeti- 9455 tion. 9456 9457 PCRE2 provides a similar feature, but of course it cannot obey arbi- 9458 trary Perl code. The feature is called "callout". 
The caller of PCRE2 9459 provides an external function by putting its entry point in a match 9460 context using the function pcre2_set_callout(), and then passing that 9461 context to pcre2_match() or pcre2_dfa_match(). If no match context is 9462 passed, or if the callout entry point is set to NULL, callouts are dis- 9463 abled. 9464 9465 Within a regular expression, (?C<arg>) indicates a point at which the 9466 external function is to be called. There are two kinds of callout: 9467 those with a numerical argument and those with a string argument. (?C) 9468 on its own with no argument is treated as (?C0). A numerical argument 9469 allows the application to distinguish between different callouts. 9470 String arguments were added for release 10.20 to make it possible for 9471 script languages that use PCRE2 to embed short scripts within patterns 9472 in a similar way to Perl. 9473 9474 During matching, when PCRE2 reaches a callout point, the external func- 9475 tion is called. It is provided with the number or string argument of 9476 the callout, the position in the pattern, and one item of data that is 9477 also set in the match block. The callout function may cause matching to 9478 proceed, to backtrack, or to fail. 9479 9480 By default, PCRE2 implements a number of optimizations at matching 9481 time, and one side-effect is that sometimes callouts are skipped. If 9482 you need all possible callouts to happen, you need to set options that 9483 disable the relevant optimizations. More details, including a complete 9484 description of the programming interface to the callout function, are 9485 given in the pcre2callout documentation. 9486 9487 Callouts with numerical arguments 9488 9489 If you just want to have a means of identifying different callout 9490 points, put a number less than 256 after the letter C. 
For example, 9491 this pattern has two callout points: 9492 9493 (?C1)abc(?C2)def 9494 9495 If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical 9496 callouts are automatically installed before each item in the pattern. 9497 They are all numbered 255. If there is a conditional group in the pat- 9498 tern whose condition is an assertion, an additional callout is inserted 9499 just before the condition. An explicit callout may also be set at this 9500 position, as in this example: 9501 9502 (?(?C9)(?=a)abc|def) 9503 9504 Note that this applies only to assertion conditions, not to other types 9505 of condition. 9506 9507 Callouts with string arguments 9508 9509 A delimited string may be used instead of a number as a callout argu- 9510 ment. The starting delimiter must be one of ` ' " ^ % # $ { and the 9511 ending delimiter is the same as the start, except for {, where the end- 9512 ing delimiter is }. If the ending delimiter is needed within the 9513 string, it must be doubled. For example: 9514 9515 (?C'ab ''c'' d')xyz(?C{any text})pqr 9516 9517 The doubling is removed before the string is passed to the callout 9518 function. 9519 9520 9521BACKTRACKING CONTROL 9522 9523 There are a number of special "Backtracking Control Verbs" (to use 9524 Perl's terminology) that modify the behaviour of backtracking during 9525 matching. They are generally of the form (*VERB) or (*VERB:NAME). Some 9526 verbs take either form, and may behave differently depending on whether 9527 or not a name argument is present. The names are not required to be 9528 unique within the pattern. 9529 9530 By default, for compatibility with Perl, a name is any sequence of 9531 characters that does not include a closing parenthesis. The name is not 9532 processed in any way, and it is not possible to include a closing 9533 parenthesis in the name. This can be changed by setting the 9534 PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati- 9535 ble. 

       When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to
       verb names and only an unescaped closing parenthesis terminates the
       name. However, the only backslash items that are permitted are \Q, \E,
       and sequences such as \x{100} that define character code points.
       Character type escapes such as \d are faulted.

       A closing parenthesis can be included in a name either as \) or between
       \Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED
       or PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb
       names is skipped, and #-comments are recognized, exactly as in the rest
       of the pattern. PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect
       verb names unless PCRE2_ALT_VERBNAMES is also set.

       The maximum length of a name is 255 in the 8-bit library and 65535 in
       the 16-bit and 32-bit libraries. If the name is empty, that is, if the
       closing parenthesis immediately follows the colon, the effect is as if
       the colon were not there. Any number of these verbs may occur in a
       pattern. Except for (*ACCEPT), they may not be quantified.

       Since these verbs are specifically related to backtracking, most of
       them can be used only when the pattern is to be matched using the
       traditional matching function, because that uses a backtracking
       algorithm. With the exception of (*FAIL), which behaves like a failing
       negative assertion, the backtracking control verbs cause an error if
       encountered by the DFA matching function.

       The behaviour of these verbs in repeated groups, assertions, and in
       capture groups called as subroutines (whether or not recursively) is
       documented below.

   Optimizations that affect backtracking verbs

       PCRE2 contains some optimizations that are used to speed up matching by
       running some checks at the start of each match attempt.
       For example, it may know the minimum length of a matching subject, or
       that a particular character must be present. When one of these
       optimizations bypasses the running of a match, any included
       backtracking verbs will not, of course, be processed. You can suppress
       the start-of-match optimizations by setting the PCRE2_NO_START_OPTIMIZE
       option when calling pcre2_compile(), or by starting the pattern with
       (*NO_START_OPT). There is more discussion of this option in the section
       entitled "Compiling a pattern" in the pcre2api documentation.

       Experiments with Perl suggest that it too has similar optimizations,
       and like PCRE2, turning them off can change the result of a match.

   Verbs that act immediately

       The following verbs act as soon as they are encountered.

         (*ACCEPT) or (*ACCEPT:NAME)

       This verb causes the match to end successfully, skipping the remainder
       of the pattern. However, when it is inside a capture group that is
       called as a subroutine, only that group is ended successfully. Matching
       then continues at the outer level. If (*ACCEPT) is triggered in a
       positive assertion, the assertion succeeds; in a negative assertion,
       the assertion fails.

       If (*ACCEPT) is inside capturing parentheses, the data so far is
       captured. For example:

         A((?:A|B(*ACCEPT)|C)D)

       This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is
       captured by the outer parentheses.

       (*ACCEPT) is the only backtracking verb that is allowed to be
       quantified because an ungreedy quantification with a minimum of zero
       acts only when a backtrack happens. Consider, for example,

         (A(*ACCEPT)??B)C

       where A, B, and C may be complex expressions. After matching "A", the
       matcher processes "BC"; if that fails, causing a backtrack, (*ACCEPT)
       is triggered and the match succeeds.
       In both cases, all but C is captured. Whereas (*COMMIT) (see below)
       means "fail on backtrack", a repeated (*ACCEPT) of this type means
       "succeed on backtrack".

       Warning: (*ACCEPT) should not be used within a script run group,
       because it causes an immediate exit from the group, bypassing the
       script run checking.

         (*FAIL) or (*FAIL:NAME)

       This verb causes a matching failure, forcing backtracking to occur. It
       may be abbreviated to (*F). It is equivalent to (?!) but easier to
       read. The Perl documentation notes that it is probably useful only when
       combined with (?{}) or (??{}). Those are, of course, Perl features that
       are not present in PCRE2. The nearest equivalent is the callout
       feature, as for example in this pattern:

         a+(?C)(*FAIL)

       A match with the string "aaaa" always fails, but the callout is taken
       before each backtrack happens (in this example, 10 times).

       (*ACCEPT:NAME) and (*FAIL:NAME) behave the same as
       (*MARK:NAME)(*ACCEPT) and (*MARK:NAME)(*FAIL), respectively, that is, a
       (*MARK) is recorded just before the verb acts.

   Recording which path was taken

       There is one verb whose main purpose is to track how a match was
       arrived at, though it also has a secondary use in conjunction with
       advancing the match starting point (see (*SKIP) below).

         (*MARK:NAME) or (*:NAME)

       A name is always required with this verb. For all the other
       backtracking control verbs, a NAME argument is optional.

       When a match succeeds, the last-encountered mark name on the matching
       path is passed back to the caller as described in the section entitled
       "Other information about the match" in the pcre2api documentation. This
       applies to all instances of (*MARK) and other verbs, including those
       inside assertions and atomic groups.
       However, there are differences in those cases when (*MARK) is used in
       conjunction with (*SKIP) as described below.

       The mark name that was last encountered on the matching path is passed
       back. A verb without a NAME argument is ignored for this purpose. Here
       is an example of pcre2test output, where the "mark" modifier requests
       the retrieval and outputting of (*MARK) data:

           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
         data> XY
          0: XY
         MK: A
         XZ
          0: XZ
         MK: B

       The (*MARK) name is tagged with "MK:" in this output, and in this
       example it indicates which of the two alternatives matched. This is a
       more efficient way of obtaining this information than putting each
       alternative in its own capturing parentheses.

       If a verb with a name is encountered in a positive assertion that is
       true, the name is recorded and passed back if it is the
       last-encountered. This does not happen for negative assertions or
       failing positive assertions.

       After a partial match or a failed match, the last encountered name in
       the entire match process is returned. For example:

           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
         data> XP
         No match, mark = B

       Note that in this unanchored example the mark is retained from the
       match attempt that started at the letter "X" in the subject. Subsequent
       match attempts starting at "P" and then with an empty string do not get
       as far as the (*MARK) item, but nevertheless do not reset it.

       If you are interested in (*MARK) values after failed matches, you
       should probably set the PCRE2_NO_START_OPTIMIZE option (see above) to
       ensure that the match is always attempted.

   Verbs that act after backtracking

       The following verbs do nothing when they are encountered.
       Matching continues with what follows, but if there is a subsequent
       match failure, causing a backtrack to the verb, a failure is forced.
       That is, backtracking cannot pass to the left of the verb. However,
       when one of these verbs appears inside an atomic group or in a
       lookaround assertion that is true, its effect is confined to that
       group, because once the group has been matched, there is never any
       backtracking into it. Backtracking from beyond an assertion or an
       atomic group ignores the entire group, and seeks a preceding
       backtracking point.

       These verbs differ in exactly what kind of failure occurs when
       backtracking reaches them. The behaviour described below is what
       happens when the verb is not in a subroutine or an assertion.
       Subsequent sections cover these special cases.

         (*COMMIT) or (*COMMIT:NAME)

       This verb causes the whole match to fail outright if there is a later
       matching failure that causes backtracking to reach it. Even if the
       pattern is unanchored, no further attempts to find a match by advancing
       the starting point take place. If (*COMMIT) is the only backtracking
       verb that is encountered, once it has been passed, pcre2_match() is
       committed to finding a match at the current starting point, or not at
       all. For example:

         a+(*COMMIT)b

       This matches "xxaab" but not "aacaab". It can be thought of as a kind
       of dynamic anchor, or "I've started, so I must finish."

       The behaviour of (*COMMIT:NAME) is not the same as
       (*MARK:NAME)(*COMMIT). It is like (*MARK:NAME) in that the name is
       remembered for passing back to the caller. However, (*SKIP:NAME)
       searches only for names that are set with (*MARK), ignoring those set
       by any of the other backtracking verbs.

       If there is more than one backtracking verb in a pattern, a different
       one that follows (*COMMIT) may be triggered first, so merely passing
       (*COMMIT) during a match does not always guarantee that a match must be
       at this starting point.

       Note that (*COMMIT) at the start of a pattern is not the same as an
       anchor, unless PCRE2's start-of-match optimizations are turned off, as
       shown in this output from pcre2test:

           re> /(*COMMIT)abc/
         data> xyzabc
          0: abc
         data>
           re> /(*COMMIT)abc/no_start_optimize
         data> xyzabc
         No match

       For the first pattern, PCRE2 knows that any match must start with "a",
       so the optimization skips along the subject to "a" before applying the
       pattern to the first set of data. The match attempt then succeeds. The
       second pattern disables the optimization that skips along to the first
       character. The pattern is now applied starting at "x", and so the
       (*COMMIT) causes the match to fail without trying any other starting
       points.

         (*PRUNE) or (*PRUNE:NAME)

       This verb causes the match to fail at the current starting position in
       the subject if there is a later matching failure that causes
       backtracking to reach it. If the pattern is unanchored, the normal
       "bumpalong" advance to the next starting character then happens.
       Backtracking can occur as usual to the left of (*PRUNE), before it is
       reached, or when matching to the right of (*PRUNE), but if there is no
       match to the right, backtracking cannot cross (*PRUNE). In simple
       cases, the use of (*PRUNE) is just an alternative to an atomic group or
       possessive quantifier, but there are some uses of (*PRUNE) that cannot
       be expressed in any other way. In an anchored pattern (*PRUNE) has the
       same effect as (*COMMIT).

       The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
       It is like (*MARK:NAME) in that the name is remembered for passing back
       to the caller. However, (*SKIP:NAME) searches only for names set with
       (*MARK), ignoring those set by other backtracking verbs.

         (*SKIP)

       This verb, when given without a name, is like (*PRUNE), except that if
       the pattern is unanchored, the "bumpalong" advance is not to the next
       character, but to the position in the subject where (*SKIP) was
       encountered. (*SKIP) signifies that whatever text was matched leading
       up to it cannot be part of a successful match if there is a later
       mismatch. Consider:

         a+(*SKIP)b

       If the subject is "aaaac...", after the first match attempt fails
       (starting at the first character in the string), the starting point
       skips on to start the next attempt at "c". Note that a possessive
       quantifier does not have the same effect as this example; although it
       would suppress backtracking during the first match attempt, the second
       attempt would start at the second character instead of skipping on to
       "c".

       If (*SKIP) is used to specify a new starting position that is the same
       as the starting position of the current match, or (by being inside a
       lookbehind) earlier, the position specified by (*SKIP) is ignored, and
       instead the normal "bumpalong" occurs.

         (*SKIP:NAME)

       When (*SKIP) has an associated name, its behaviour is modified. When
       such a (*SKIP) is triggered, the previous path through the pattern is
       searched for the most recent (*MARK) that has the same name. If one is
       found, the "bumpalong" advance is to the subject position that
       corresponds to that (*MARK) instead of to where (*SKIP) was
       encountered. If no (*MARK) with a matching name is found, the (*SKIP)
       is ignored.

       The search for a (*MARK) name uses the normal backtracking mechanism,
       which means that it does not see (*MARK) settings that are inside
       atomic groups or assertions, because they are never re-entered by
       backtracking. Compare the following pcre2test examples:

           re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
         data> abc
          0: a
          1: a
         data>
           re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
         data> abc
          0: b
          1: b

       In the first example, the (*MARK) setting is in an atomic group, so it
       is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored.
       This allows the second branch of the pattern to be tried at the first
       character position. In the second example, the (*MARK) setting is not
       in an atomic group. This allows (*SKIP:X) to find the (*MARK) when it
       backtracks, and this causes a new matching attempt to start at the
       second character. This time, the (*MARK) is never seen because "a" does
       not match "b", so the matcher immediately jumps to the second branch of
       the pattern.

       Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It
       ignores names that are set by other backtracking verbs.

         (*THEN) or (*THEN:NAME)

       This verb causes a skip to the next innermost alternative when
       backtracking reaches it. That is, it cancels any further backtracking
       within the current alternative. Its name comes from the observation
       that it can be used for a pattern-based if-then-else block:

         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...

       If the COND1 pattern matches, FOO is tried (and possibly further items
       after the end of the group if FOO succeeds); on failure, the matcher
       skips to the second alternative and tries COND2, without backtracking
       into COND1. If that succeeds and BAR fails, COND3 is tried.
       If subsequently BAZ fails, there are no more alternatives, so there is
       a backtrack to whatever came before the entire group. If (*THEN) is not
       inside an alternation, it acts like (*PRUNE).

       The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
       It is like (*MARK:NAME) in that the name is remembered for passing back
       to the caller. However, (*SKIP:NAME) searches only for names set with
       (*MARK), ignoring those set by other backtracking verbs.

       A group that does not contain a | character is just a part of the
       enclosing alternative; it is not a nested alternation with only one
       alternative. The effect of (*THEN) extends beyond such a group to the
       enclosing alternative. Consider this pattern, where A, B, etc. are
       complex pattern fragments that do not contain any | characters at this
       level:

         A (B(*THEN)C) | D

       If A and B are matched, but there is a failure in C, matching does not
       backtrack into A; instead it moves to the next alternative, that is, D.
       However, if the group containing (*THEN) is given an alternative, it
       behaves differently:

         A (B(*THEN)C | (*FAIL)) | D

       The effect of (*THEN) is now confined to the inner group. After a
       failure in C, matching moves to (*FAIL), which causes the whole group
       to fail because there are no more alternatives to try. In this case,
       matching does backtrack into A.

       Note that a conditional group is not considered as having two
       alternatives, because only one is ever used. In other words, the |
       character in a conditional group has a different meaning. Ignoring
       white space, consider:

         ^.*? (?(?=a) a | b(*THEN)c )

       If the subject is "ba", this pattern does not match. Because .*? is
       ungreedy, it initially matches zero characters. The condition (?=a)
       then fails, the character "b" is matched, but "c" is not.
       At this point, matching does not backtrack to .*? as might perhaps be
       expected from the presence of the | character. The conditional group is
       part of the single alternative that comprises the whole pattern, and so
       the match fails. (If there were a backtrack into .*?, allowing it to
       match "b", the match would succeed.)

       The verbs just described provide four different "strengths" of control
       when subsequent matching fails. (*THEN) is the weakest, carrying on the
       match at the next alternative. (*PRUNE) comes next, failing the match
       at the current starting position, but allowing an advance to the next
       character (for an unanchored pattern). (*SKIP) is similar, except that
       the advance may be more than one character. (*COMMIT) is the strongest,
       causing the entire match to fail.

   More than one backtracking verb

       If more than one backtracking verb is present in a pattern, the one
       that is backtracked onto first acts. For example, consider this
       pattern, where A, B, etc. are complex pattern fragments:

         (A(*COMMIT)B(*THEN)C|ABD)

       If A matches but B fails, the backtrack to (*COMMIT) causes the entire
       match to fail. However, if A and B match, but C fails, the backtrack to
       (*THEN) causes the next alternative (ABD) to be tried. This behaviour
       is consistent, but is not always the same as Perl's. It means that if
       two or more backtracking verbs appear in succession, all but the last
       of them have no effect. Consider this example:

         ...(*COMMIT)(*PRUNE)...

       If there is a matching failure to the right, backtracking onto (*PRUNE)
       causes it to be triggered, and its action is taken. There can never be
       a backtrack onto (*COMMIT).

   Backtracking verbs in repeated groups

       PCRE2 sometimes differs from Perl in its handling of backtracking verbs
       in repeated groups.
       For example, consider:

         /(a(*COMMIT)b)+ac/

       If the subject is "abac", Perl matches unless its optimizations are
       disabled, but PCRE2 always fails because the (*COMMIT) in the second
       repeat of the group acts.

   Backtracking verbs in assertions

       (*FAIL) in any assertion has its normal effect: it forces an immediate
       backtrack. The behaviour of the other backtracking verbs depends on
       whether the assertion is standalone or is acting as the condition in a
       conditional group.

       (*ACCEPT) in a standalone positive assertion causes the assertion to
       succeed without any further processing; captured strings and a mark
       name (if set) are retained. In a standalone negative assertion,
       (*ACCEPT) causes the assertion to fail without any further processing;
       captured substrings and any mark name are discarded.

       If the assertion is a condition, (*ACCEPT) causes the condition to be
       true for a positive assertion and false for a negative one; captured
       substrings are retained in both cases.

       The remaining verbs act only when a later failure causes a backtrack to
       reach them. This means that, for the Perl-compatible assertions, their
       effect is confined to the assertion, because Perl lookaround assertions
       are atomic. A backtrack that occurs after such an assertion is complete
       does not jump back into the assertion. Note in particular that a
       (*MARK) name that is set in an assertion is not "seen" by an instance
       of (*SKIP:NAME) later in the pattern.

       PCRE2 now supports non-atomic positive assertions, as described in the
       section entitled "Non-atomic assertions" above. These assertions must
       be standalone (not used as conditions). They are not Perl-compatible.
       For these assertions, a later backtrack does jump back into the
       assertion, and therefore verbs such as (*COMMIT) can be triggered by
       backtracks from later in the pattern.

       The effect of (*THEN) is not allowed to escape beyond an assertion. If
       there are no more branches to try, (*THEN) causes a positive assertion
       to be false, and a negative assertion to be true.

       The other backtracking verbs are not treated specially if they appear
       in a standalone positive assertion. In a conditional positive
       assertion, backtracking (from within the assertion) into (*COMMIT),
       (*SKIP), or (*PRUNE) causes the condition to be false. However, for
       both standalone and conditional negative assertions, backtracking into
       (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be true,
       without considering any further alternative branches.

   Backtracking verbs in subroutines

       These behaviours occur whether or not the group is called recursively.

       (*ACCEPT) in a group called as a subroutine causes the subroutine match
       to succeed without any further processing. Matching then continues
       after the subroutine call. Perl documents this behaviour. Perl's
       treatment of the other verbs in subroutines is different in some cases.

       (*FAIL) in a group called as a subroutine has its normal effect: it
       forces an immediate backtrack.

       (*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail
       when triggered by being backtracked to in a group called as a
       subroutine. There is then a backtrack at the outer level.

       (*THEN), when triggered, skips to the next alternative in the innermost
       enclosing group that has alternatives (its normal behaviour). However,
       if there is no such group within the subroutine's group, the subroutine
       match fails and there is a backtrack at the outer level.


SEE ALSO

       pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3),
       pcre2(3).


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 04 June 2024
       Copyright (c) 1997-2024 University of Cambridge.


PCRE2 10.44                      04 June 2024                  PCRE2PATTERN(3)
------------------------------------------------------------------------------



PCRE2PERFORM(3)            Library Functions Manual            PCRE2PERFORM(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


PCRE2 PERFORMANCE

       Two aspects of performance are discussed below: memory usage and
       processing time. The way you express your pattern as a regular
       expression can affect both of them.


COMPILED PATTERN MEMORY USAGE

       Patterns are compiled by PCRE2 into a reasonably efficient interpretive
       code, so that most simple patterns do not use much memory for storing
       the compiled version. However, there is one case where the memory usage
       of a compiled pattern can be unexpectedly large. If a parenthesized
       group has a quantifier with a minimum greater than 1 and/or a limited
       maximum, the whole group is repeated in the compiled code. For example,
       the pattern

         (abc|def){2,4}

       is compiled as if it were

         (abc|def)(abc|def)((abc|def)(abc|def)?)?

       (Technical aside: It is done this way so that backtrack points within
       each of the repetitions can be independently maintained.)

       For regular expressions whose quantifiers use only small numbers, this
       is not usually a problem. However, if the numbers are large, and
       particularly if such repetitions are nested, the memory usage can
       become an embarrassment.
       For example, the very simple pattern

         ((ab){1,1000}c){1,3}

       uses over 50KiB when compiled using the 8-bit library. When PCRE2 is
       compiled with its default internal pointer size of two bytes, the size
       limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
       libraries, and this is reached with the above pattern if the outer
       repetition is increased from 3 to 4. PCRE2 can be compiled to use
       larger internal pointers and thus handle larger compiled patterns, but
       it is better to try to rewrite your pattern to use less memory if you
       can.

       One way of reducing the memory usage for such patterns is to make use
       of PCRE2's "subroutine" facility. Re-writing the above pattern as

         ((ab)(?2){0,999}c)(?1){0,2}

       reduces the memory requirements to around 16KiB, and indeed it remains
       under 20KiB even with the outer repetition increased to 100. However,
       this kind of pattern is not always exactly equivalent, because any
       captures within subroutine calls are lost when the subroutine
       completes. If this is not a problem, this kind of rewriting will allow
       you to process patterns that PCRE2 cannot otherwise handle. The
       matching performance of the two different versions of the pattern is
       roughly the same. (This applies from release 10.30 - things were
       different in earlier releases.)


STACK AND HEAP USAGE AT RUN TIME

       From release 10.30, the interpretive (non-JIT) version of pcre2_match()
       uses very little system stack at run time. In earlier releases
       recursive function calls could use a great deal of stack, and this
       could cause problems, but this usage has been eliminated. Backtracking
       positions are now explicitly remembered in memory frames controlled by
       the code.

       The size of each frame depends on the size of pointer variables and the
       number of capturing parenthesized groups in the pattern being matched.
       On a 64-bit system the frame size for a pattern with no captures is 128
       bytes. For each capturing group the size increases by 16 bytes.

       Until release 10.41, an initial 20KiB frames vector was allocated on
       the system stack, but this still caused some issues for multi-thread
       applications where each thread has a very small stack. From release
       10.41 backtracking memory frames are always held in heap memory. An
       initial heap allocation is obtained the first time any match data block
       is passed to pcre2_match(). This is remembered with the match data
       block and re-used if that block is used for another match. It is freed
       when the match data block itself is freed.

       The size of the initial block is the larger of 20KiB or ten times the
       pattern's frame size, unless the heap limit is less than this, in which
       case the heap limit is used. If the initial block proves to be too
       small during matching, it is replaced by a larger block, subject to the
       heap limit. The heap limit is checked only when a new block is to be
       allocated. Reducing the heap limit between calls to pcre2_match() with
       the same match data block does not affect the saved block.

       In contrast to pcre2_match(), pcre2_dfa_match() does use recursive
       function calls, but only for processing atomic groups, lookaround
       assertions, and recursion within the pattern. The original version of
       the code used to allocate quite large internal workspace vectors on the
       stack, which caused some problems for some patterns in environments
       with small stacks.
       From release 10.32 the code for pcre2_dfa_match() has been re-factored
       to use heap memory when necessary for internal workspace when
       recursing, though recursive function calls are still used.

       The "match depth" parameter can be used to limit the depth of function
       recursion, and the "match heap" parameter to limit heap memory in
       pcre2_dfa_match().


PROCESSING TIME

       Certain items in regular expression patterns are processed more
       efficiently than others. It is more efficient to use a character class
       like [aeiou] than a set of single-character alternatives such as
       (a|e|i|o|u). In general, the simplest construction that provides the
       required behaviour is usually the most efficient. Jeffrey Friedl's book
       "Mastering Regular Expressions" contains a lot of useful general
       discussion about optimizing regular expressions for efficient
       performance. This document contains a few observations about PCRE2.

       Using Unicode character properties (the \p, \P, and \X escapes) is
       slow, because PCRE2 has to use a multi-stage table lookup whenever it
       needs a character's property. If you can find an alternative pattern
       that does not use character properties, it will probably be faster.

       By default, the escape sequences \b, \d, \s, and \w, and the POSIX
       character classes such as [:alpha:] do not use Unicode properties,
       partly for backwards compatibility, and partly for performance reasons.
       However, you can set the PCRE2_UCP option or start the pattern with
       (*UCP) if you want Unicode character properties to be used. This can
       double the matching time for items such as \d, when matched with
       pcre2_match(); the performance loss is less with the DFA matching
       function, and in both cases there is not much difference for \b.

       When a pattern begins with .* not in atomic parentheses, nor in
       parentheses that are the subject of a backreference, and the
       PCRE2_DOTALL option is set, the pattern is implicitly anchored by
       PCRE2, since it can match only at the start of a subject string. If the
       pattern has multiple top-level branches, they must all be anchorable.
       The optimization can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option,
       and is automatically disabled if the pattern contains (*PRUNE) or
       (*SKIP).

       If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization,
       because the dot metacharacter does not then match a newline, and if the
       subject string contains newlines, the pattern may match from the
       character immediately following one of them instead of from the very
       start. For example, the pattern

         .*second

       matches the subject "first\nand second" (where \n stands for a newline
       character), with the match starting at the seventh character. In order
       to do this, PCRE2 has to retry the match starting after every newline
       in the subject.

       If you are using such a pattern with subject strings that do not
       contain newlines, the best performance is obtained by setting
       PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate
       explicit anchoring. That saves PCRE2 from having to scan along the
       subject looking for a newline to restart at.

       Beware of patterns that contain nested indefinite repeats. These can
       take a long time to run when applied to a string that does not match.
       Consider the pattern fragment

         ^(a+)*

       This can match "aaaa" in 16 different ways, and this number increases
       very rapidly as the string gets longer.
       (The * repeat can match 0, 1, 2, 3, or 4 times, and for each of those
       cases other than 0 or 4, the + repeats can match different numbers of
       times.) When the remainder of the pattern is such that the entire
       match is going to fail, PCRE2 has in principle to try every possible
       variation, and this can take an extremely long time, even for
       relatively short strings.

       An optimization catches some of the simpler cases such as

         (a+)*b

       where a literal character follows. Before embarking on the standard
       matching procedure, PCRE2 checks that there is a "b" later in the
       subject string, and if there is not, it fails the match immediately.
       However, when there is no following literal this optimization cannot
       be used. You can see the difference by comparing the behaviour of

         (a+)*\d

       with the pattern above. The pattern above, (a+)*b, gives a failure
       almost instantly when applied to a whole line of "a" characters,
       whereas (a+)*\d takes an appreciable time with strings longer than
       about 20 characters.

       In many cases, the solution to this kind of performance issue is to use
       an atomic group or a possessive quantifier. This can often reduce
       memory requirements as well. As another example, consider this pattern:

         ([^<]|<(?!inet))+

       It matches from wherever it starts until it encounters "<inet" or the
       end of the data, and is the kind of pattern that might be used when
       processing an XML file. Each iteration of the outer parentheses matches
       either one character that is not "<" or a "<" that is not followed by
       "inet". However, each time a parenthesis is processed, a backtracking
       position is passed, so this formulation uses a memory frame for each
       matched character. For a long string, a lot of memory is required.
       Consider now this rewritten pattern, which matches exactly the same
       strings:

         ([^<]++|<(?!inet))+

       This runs much faster, because sequences of characters that do not
       contain "<" are "swallowed" in one item inside the parentheses, and a
       possessive quantifier is used to stop any backtracking into the runs
       of non-"<" characters. This version also uses a lot less memory
       because entry to a new set of parentheses happens only when a "<"
       character that is not followed by "inet" is encountered (and we assume
       this is relatively rare).

       This example shows that one way of optimizing performance when matching
       long subject strings is to write repeated parenthesized subpatterns to
       match more than one character whenever possible.

   SETTING RESOURCE LIMITS

       You can set limits on the amount of processing that takes place when
       matching, and on the amount of heap memory that is used. The default
       values of the limits are very large, and unlikely ever to operate. They
       can be changed when PCRE2 is built, and they can also be set when
       pcre2_match() or pcre2_dfa_match() is called. For details of these
       interfaces, see the pcre2build documentation and the section entitled
       "The match context" in the pcre2api documentation.

       The pcre2test test program has a modifier called "find_limits" which,
       if applied to a subject line, causes it to find the smallest limits
       that allow a pattern to match. This is done by repeatedly matching with
       different limits.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 27 July 2022
       Copyright (c) 1997-2022 University of Cambridge.


PCRE2 10.41                      27 July 2022                  PCRE2PERFORM(3)
------------------------------------------------------------------------------



PCRE2POSIX(3)              Library Functions Manual              PCRE2POSIX(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


SYNOPSIS

       #include <pcre2posix.h>

       int pcre2_regcomp(regex_t *preg, const char *pattern,
            int cflags);

       int pcre2_regexec(const regex_t *preg, const char *string,
            size_t nmatch, regmatch_t pmatch[], int eflags);

       size_t pcre2_regerror(int errcode, const regex_t *preg,
            char *errbuf, size_t errbuf_size);

       void pcre2_regfree(regex_t *preg);


DESCRIPTION

       This set of functions provides a POSIX-style API for the PCRE2 regular
       expression 8-bit library. There are no POSIX-style wrappers for PCRE2's
       16-bit and 32-bit libraries. See the pcre2api documentation for a
       description of PCRE2's native API, which contains much additional
       functionality.

       IMPORTANT NOTE: The functions described here are NOT thread-safe, and
       should not be used in multi-threaded applications. They are also
       limited to processing subjects that are not bigger than 2GB. Use the
       native API instead.

       These functions are wrapper functions that ultimately call the PCRE2
       native API. Their prototypes are defined in the pcre2posix.h header
       file, and they all have unique names starting with pcre2_. However, the
       pcre2posix.h header also contains macro definitions that convert the
       standard POSIX names such as regcomp() into pcre2_regcomp() etc. This
       means that a program can use the usual POSIX names without running the
       risk of accidentally linking with POSIX functions from a different
       library.

       On Unix-like systems the PCRE2 POSIX library is called libpcre2-posix,
       so can be accessed by adding -lpcre2-posix to the command for linking
       an application. Because the POSIX functions call the native ones, it is
       also necessary to add -lpcre2-8.

       On Windows systems, if you are linking to a DLL version of the library,
       it is recommended that PCRE2POSIX_SHARED is defined before including
       the pcre2posix.h header, as it will allow for a more efficient way to
       invoke the functions by adding the __declspec(dllimport) decorator.

       Although they were not defined as prototypes in pcre2posix.h, releases
       10.33 to 10.36 of the library contained functions with the POSIX names
       regcomp() etc. These simply passed their arguments to the PCRE2
       functions. These functions were provided for backwards compatibility
       with earlier versions of PCRE2, which had only POSIX names. However,
       this has proved troublesome in situations where a program links with
       several libraries, some of which use PCRE2's POSIX interface while
       others use the real POSIX functions. For this reason, the POSIX names
       have been removed since release 10.37.

       Calling the header file pcre2posix.h avoids any conflict with other
       POSIX libraries. It can, of course, be renamed or aliased as regex.h,
       which is the "correct" name, if there is no clash. It provides two
       structure types, regex_t for compiled internal forms, and regmatch_t
       for returning captured substrings. It also defines some constants whose
       names start with "REG_"; these are used for setting options and
       identifying error codes.


USING THE POSIX FUNCTIONS

       Note that these functions are just POSIX-style wrappers for PCRE2's
       native API. They do not give POSIX regular expression behaviour, and
       they are not thread-safe or even POSIX compatible.

       Those POSIX option bits that can reasonably be mapped to PCRE2 native
       options have been implemented. In addition, the option REG_EXTENDED is
       defined with the value zero. This has no effect, but since programs
       that are written to the POSIX interface often use it, this makes it
       easier to slot in PCRE2 as a replacement library. Other POSIX options
       are not even defined.

       There are also some options that are not defined by POSIX. These have
       been added at the request of users who want to make use of certain
       PCRE2-specific features via the POSIX calling interface or to add BSD
       or GNU functionality.

       When PCRE2 is called via these functions, it is only the API that is
       POSIX-like in style. The syntax and semantics of the regular
       expressions themselves are still those of Perl, subject to the setting
       of various PCRE2 options, as described below. "POSIX-like in style"
       means that the API approximates to the POSIX definition; it is not
       fully POSIX-compatible, and in multi-unit encoding domains it is
       probably even less compatible.

       The descriptions below use the actual names of the functions, but, as
       described above, the standard POSIX names (without the pcre2_ prefix)
       may also be used.


COMPILING A PATTERN

       The function pcre2_regcomp() is called to compile a pattern into an
       internal form. By default, the pattern is a C string terminated by a
       binary zero (but see REG_PEND below). The preg argument is a pointer to
       a regex_t structure that is used as a base for storing information
       about the compiled regular expression. It is also used for input when
       REG_PEND is set. The regex_t structure used by pcre2_regcomp() is
       defined in pcre2posix.h and is not the same as the structure used by
       other libraries that provide POSIX-style matching.

       The argument cflags is either zero, or contains one or more of the bits
       defined by the following macros:

         REG_DOTALL

       The PCRE2_DOTALL option is set when the regular expression is passed
       for compilation to the native function. Note that REG_DOTALL is not
       part of the POSIX standard.

         REG_ICASE

       The PCRE2_CASELESS option is set when the regular expression is passed
       for compilation to the native function.

         REG_NEWLINE

       The PCRE2_MULTILINE option is set when the regular expression is passed
       for compilation to the native function. Note that this does not mimic
       the defined POSIX behaviour for REG_NEWLINE (see the following
       section).

         REG_NOSPEC

       The PCRE2_LITERAL option is set when the regular expression is passed
       for compilation to the native function. This disables all meta
       characters in the pattern, causing it to be treated as a literal
       string. The only other options that are allowed with REG_NOSPEC are
       REG_ICASE, REG_NOSUB, REG_PEND, and REG_UTF. Note that REG_NOSPEC is
       not part of the POSIX standard.

         REG_NOSUB

       When a pattern that is compiled with this flag is passed to
       pcre2_regexec() for matching, the nmatch and pmatch arguments are
       ignored, and no captured strings are returned. Versions of the PCRE2
       library prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile
       option, but this no longer happens because it disables the use of
       backreferences.

         REG_PEND

       If this option is set, the re_endp field in the preg structure (which
       has the type const char *) must be set to point to the character beyond
       the end of the pattern before calling pcre2_regcomp(). The pattern
       itself may now contain binary zeros, which are treated as data
       characters.
       Without REG_PEND, a binary zero terminates the pattern and the
       re_endp field is ignored. This is a GNU extension to the POSIX standard
       and should be used with caution in software intended to be portable to
       other systems.

         REG_UCP

       The PCRE2_UCP option is set when the regular expression is passed for
       compilation to the native function. This causes PCRE2 to use Unicode
       properties when matching \d, \w, etc., instead of just recognizing
       ASCII values. Note that REG_UCP is not part of the POSIX standard.

         REG_UNGREEDY

       The PCRE2_UNGREEDY option is set when the regular expression is passed
       for compilation to the native function. Note that REG_UNGREEDY is not
       part of the POSIX standard.

         REG_UTF

       The PCRE2_UTF option is set when the regular expression is passed for
       compilation to the native function. This causes the pattern itself and
       all data strings used for matching it to be treated as UTF-8 strings.
       Note that REG_UTF is not part of the POSIX standard.

       In the absence of these flags, no options are passed to the native
       function. This means that the regex is compiled with PCRE2 default
       semantics. In particular, the way it handles newline characters in the
       subject string is the Perl way, not the POSIX way. Note that setting
       PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
       It does not affect the way newlines are matched by the dot
       metacharacter (they are not) or by a negative class such as [^a] (they
       are).

       The yield of pcre2_regcomp() is zero on success, and non-zero
       otherwise. The preg structure is filled in on success, and one other
       member of the structure (as well as re_endp) is public: re_nsub
       contains the number of capturing subpatterns in the regular
       expression. Various error codes are defined in the header file.

       NOTE: If the yield of pcre2_regcomp() is non-zero, you must not attempt
       to use the contents of the preg structure. If, for example, you pass it
       to pcre2_regexec(), the result is undefined and your program is likely
       to crash.


MATCHING NEWLINE CHARACTERS

       This area is not simple, because POSIX and Perl take different views of
       things. It is not possible to get PCRE2 to obey POSIX semantics, but
       then PCRE2 was never intended to be a POSIX engine. The following table
       lists the different possibilities for matching newline characters in
       Perl and PCRE2:

                                 Default   Change with

         . matches newline          no     PCRE2_DOTALL
         newline matches [^a]       yes    not changeable
         $ matches \n at end        yes    PCRE2_DOLLAR_ENDONLY
         $ matches \n in middle     no     PCRE2_MULTILINE
         ^ matches \n in middle     no     PCRE2_MULTILINE

       This is the equivalent table for a POSIX-compatible pattern matcher:

                                 Default   Change with

         . matches newline          yes    REG_NEWLINE
         newline matches [^a]       yes    REG_NEWLINE
         $ matches \n at end        no     REG_NEWLINE
         $ matches \n in middle     no     REG_NEWLINE
         ^ matches \n in middle     no     REG_NEWLINE

       This behaviour is not what happens when PCRE2 is called via its POSIX
       API. By default, PCRE2's behaviour is the same as Perl's, except that
       there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
       and Perl, there is no way to stop newline from matching [^a].

       Default POSIX newline handling can be obtained by setting PCRE2_DOTALL
       and PCRE2_DOLLAR_ENDONLY when calling pcre2_compile() directly, but
       there is no way to make PCRE2 behave exactly as for the REG_NEWLINE
       action. When using the POSIX API, passing REG_NEWLINE to PCRE2's
       pcre2_regcomp() function causes PCRE2_MULTILINE to be passed to
       pcre2_compile(), and REG_DOTALL passes PCRE2_DOTALL.
       There is no way to pass PCRE2_DOLLAR_ENDONLY.


MATCHING A PATTERN

       The function pcre2_regexec() is called to match a compiled pattern preg
       against a given string, which is by default terminated by a zero byte
       (but see REG_STARTEND below), subject to the options in eflags. These
       can be:

         REG_NOTBOL

       The PCRE2_NOTBOL option is set when calling the underlying PCRE2
       matching function.

         REG_NOTEMPTY

       The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2
       matching function. Note that REG_NOTEMPTY is not part of the POSIX
       standard. However, setting this option can give more POSIX-like
       behaviour in some situations.

         REG_NOTEOL

       The PCRE2_NOTEOL option is set when calling the underlying PCRE2
       matching function.

         REG_STARTEND

       When this option is set, the subject string starts at string +
       pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which should
       point to the first character beyond the string. There may be binary
       zeros within the subject string, and indeed, using REG_STARTEND is the
       only way to pass a subject string that contains a binary zero.

       Whatever the value of pmatch[0].rm_so, the offsets of the matched
       string and any captured substrings are still given relative to the
       start of string itself. (Before PCRE2 release 10.30 these were given
       relative to string + pmatch[0].rm_so, but this differs from other
       implementations.)

       This is a BSD extension, compatible with but not specified by IEEE
       Standard 1003.2 (POSIX.2), and should be used with caution in software
       intended to be portable to other systems. Note that a non-zero rm_so
       does not imply REG_NOTBOL; REG_STARTEND affects only the location and
       length of the string, not how it is matched.
       Setting REG_STARTEND and passing pmatch as NULL are mutually exclusive;
       the error REG_INVARG is returned.

       If the pattern was compiled with the REG_NOSUB flag, no data about any
       matched strings is returned. The nmatch and pmatch arguments of
       pcre2_regexec() are ignored (except possibly as input for
       REG_STARTEND).

       The value of nmatch may be zero, and the value of pmatch may be NULL
       (unless REG_STARTEND is set); in both these cases no data about any
       matched strings is returned.

       Otherwise, the portion of the string that was matched, and also any
       captured substrings, are returned via the pmatch argument, which points
       to an array of nmatch structures of type regmatch_t, containing the
       members rm_so and rm_eo. These contain the byte offset to the first
       character of each substring and the offset to the first character after
       the end of each substring, respectively. The 0th element of the vector
       relates to the entire portion of string that was matched; subsequent
       elements relate to the capturing subpatterns of the regular expression.
       Unused entries in the array have both structure members set to -1.

       regmatch_t as well as the regoff_t typedef it uses are defined in
       pcre2posix.h and are not warranted to have the same size or layout as
       other similarly named types from other libraries that provide
       POSIX-style matching.

       A successful match yields a zero return; various error codes are
       defined in the header file, of which REG_NOMATCH is the "expected"
       failure code.


ERROR MESSAGES

       The pcre2_regerror() function maps a non-zero errorcode from either
       pcre2_regcomp() or pcre2_regexec() to a printable message. If preg is
       not NULL, the error should have arisen from the use of that structure.
       A message terminated by a binary zero is placed in errbuf.
       If the buffer is too short, only the first errbuf_size - 1 characters
       of the error message are used. The yield of the function is the size of
       buffer needed to hold the whole message, including the terminating
       zero. This value is greater than errbuf_size if the message was
       truncated.


MEMORY USAGE

       Compiling a regular expression causes memory to be allocated and
       associated with the preg structure. The function pcre2_regfree() frees
       all such memory, after which preg may no longer be used as a compiled
       expression.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 19 January 2024
       Copyright (c) 1997-2024 University of Cambridge.


PCRE2 10.43                     19 January 2024                  PCRE2POSIX(3)
------------------------------------------------------------------------------



PCRE2SAMPLE(3)             Library Functions Manual             PCRE2SAMPLE(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


PCRE2 SAMPLE PROGRAM

       A simple, complete demonstration program to get you started with using
       PCRE2 is supplied in the file pcre2demo.c in the src directory in the
       PCRE2 distribution. A listing of this program is given in the pcre2demo
       documentation. If you do not have a copy of the PCRE2 distribution, you
       can save this listing to re-create the contents of pcre2demo.c.

       The demonstration program compiles the regular expression that is its
       first argument, and matches it against the subject string in its second
       argument. No PCRE2 options are set, and default character tables are
       used. If matching succeeds, the program outputs the portion of the
       subject that matched, together with the contents of any captured
       substrings.

       If the -g option is given on the command line, the program then goes on
       to check for further matches of the same regular expression in the same
       subject string. The logic is a little bit tricky because of the
       possibility of matching an empty string. Comments in the code explain
       what is going on.

       The code in pcre2demo.c is an 8-bit program that uses the PCRE2 8-bit
       library. It handles strings and characters that are stored in 8-bit
       code units. By default, one character corresponds to one code unit,
       but if the pattern starts with "(*UTF)", both it and the subject are
       treated as UTF-8 strings, where characters may occupy multiple code
       units.

       If PCRE2 is installed in the standard include and library directories
       for your operating system, you should be able to compile the
       demonstration program using a command like this:

         cc -o pcre2demo pcre2demo.c -lpcre2-8

       If PCRE2 is installed elsewhere, you may need to add additional options
       to the command line. For example, on a Unix-like system that has PCRE2
       installed in /usr/local, you can compile the demonstration program
       using a command like this:

         cc -o pcre2demo -I/usr/local/include pcre2demo.c \
            -L/usr/local/lib -lpcre2-8

       Once you have built the demonstration program, you can run simple tests
       like this:

         ./pcre2demo 'cat|dog' 'the cat sat on the mat'
         ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'

       Note that there is a much more comprehensive test program, called
       pcre2test, which supports many more facilities for testing regular
       expressions using all three PCRE2 libraries (8-bit, 16-bit, and 32-bit,
       though not all three need be installed). The pcre2demo program is
       provided as a relatively simple coding example.

       If you try to run pcre2demo when PCRE2 is not installed in the standard
       library directory, you may get an error like this on some operating
       systems (e.g. Solaris):

         ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file
         or directory

       This is caused by the way shared library support works on those
       systems. You need to add

         -R/usr/local/lib

       (for example) to the compile command to get round this problem.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 02 February 2016
       Copyright (c) 1997-2016 University of Cambridge.


PCRE2 10.22                    02 February 2016                 PCRE2SAMPLE(3)
------------------------------------------------------------------------------

PCRE2SERIALIZE(3)          Library Functions Manual          PCRE2SERIALIZE(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS

       int32_t pcre2_serialize_decode(pcre2_code **codes,
         int32_t number_of_codes, const uint8_t *bytes,
         pcre2_general_context *gcontext);

       int32_t pcre2_serialize_encode(const pcre2_code **codes,
         int32_t number_of_codes, uint8_t **serialized_bytes,
         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);

       void pcre2_serialize_free(uint8_t *bytes);

       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);

       If you are running an application that uses a large number of regular
       expression patterns, it may be useful to store them in a precompiled
       form instead of having to compile them every time the application is
       run.
       However, if you are using the just-in-time optimization feature,
       it is not possible to save and reload the JIT data, because it is
       position-dependent. The host on which the patterns are reloaded must be
       running the same version of PCRE2, with the same code unit width, and
       must also have the same endianness, pointer width and PCRE2_SIZE type.
       For example, patterns compiled on a 32-bit system using PCRE2's 16-bit
       library cannot be reloaded on a 64-bit system, nor can they be reloaded
       using the 8-bit library.

       Note that "serialization" in PCRE2 does not convert compiled patterns
       to an abstract format like Java or .NET serialization. The serialized
       output is really just a bytecode dump, which is why it can only be
       reloaded in the same environment as the one that created it. Hence the
       restrictions mentioned above. Applications that are not statically
       linked with a fixed version of PCRE2 must be prepared to recompile
       patterns from their sources, in order to be immune to PCRE2 upgrades.


SECURITY CONCERNS

       The facility for saving and restoring compiled patterns is intended for
       use within individual applications. As such, the data supplied to
       pcre2_serialize_decode() is expected to be trusted data, not data from
       arbitrary external sources. There is only some simple consistency
       checking, not complete validation of what is being re-loaded. Corrupted
       data may cause undefined results. For example, if the length field of a
       pattern in the serialized data is corrupted, the deserializing code may
       read beyond the end of the byte stream that is passed to it.


SAVING COMPILED PATTERNS

       Before compiled patterns can be saved they must be serialized, which in
       PCRE2 means converting the pattern to a stream of bytes.
       A single byte stream may contain any number of compiled patterns, but
       they must all use the same character tables. A single copy of the
       tables is included in the byte stream (its size is 1088 bytes). For
       more details of character tables, see the section on locale support in
       the pcre2api documentation.

       The function pcre2_serialize_encode() creates a serialized byte stream
       from a list of compiled patterns. Its first two arguments specify the
       list, being a pointer to a vector of pointers to compiled patterns, and
       the length of the vector. The third and fourth arguments point to
       variables which are set to point to the created byte stream and its
       length, respectively. The final argument is a pointer to a general
       context, which can be used to specify custom memory management
       functions. If this argument is NULL, malloc() is used to obtain memory
       for the byte stream. The yield of the function is the number of
       serialized patterns, or one of the following negative error codes:

         PCRE2_ERROR_BADDATA      the number of patterns is zero or less
         PCRE2_ERROR_BADMAGIC     mismatch of id bytes in one of the patterns
         PCRE2_ERROR_NOMEMORY     memory allocation failed
         PCRE2_ERROR_MIXEDTABLES  the patterns do not all use the same tables
         PCRE2_ERROR_NULL         the 1st, 3rd, or 4th argument is NULL

       PCRE2_ERROR_BADMAGIC means either that a pattern's code has been
       corrupted, or that a slot in the vector does not point to a compiled
       pattern.

       Once a set of patterns has been serialized you can save the data in any
       appropriate manner. Here is sample code that compiles two patterns and
       writes them to a file. It assumes that the variable fd refers to a file
       that is open for output. The error checking that should be present in a
       real application has been omitted for simplicity.

         int errorcode;
         uint8_t *bytes;
         PCRE2_SIZE erroroffset;
         PCRE2_SIZE bytescount;
         pcre2_code *list_of_codes[2];

         /* Compile the two patterns that are to be saved. */
         list_of_codes[0] = pcre2_compile((PCRE2_SPTR)"first pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
         list_of_codes[1] = pcre2_compile((PCRE2_SPTR)"second pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);

         /* Serialize both patterns into one byte stream and write it out. */
         errorcode = pcre2_serialize_encode(
           (const pcre2_code **)list_of_codes, 2, &bytes, &bytescount, NULL);
         errorcode = fwrite(bytes, 1, bytescount, fd);

       Note that the serialized data is binary data that may contain any of
       the 256 possible byte values. On systems that make a distinction
       between binary and non-binary data, be sure that the file is opened for
       binary output.

       Serializing a set of patterns leaves the original data untouched, so
       they can still be used for matching. Their memory must eventually be
       freed in the usual way by calling pcre2_code_free(). When you have
       finished with the byte stream, it too must be freed by calling
       pcre2_serialize_free(). If this function is called with a NULL
       argument, it returns immediately without doing anything.


RE-USING PRECOMPILED PATTERNS

       In order to re-use a set of saved patterns you must first make the
       serialized byte stream available in main memory (for example, by
       reading from a file). The management of this memory block is up to the
       application.
       You can use the pcre2_serialize_get_number_of_codes() function to
       find out how many compiled patterns are in the serialized data without
       actually decoding the patterns:

         uint8_t *bytes = <serialized data>;
         int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes);

       The pcre2_serialize_decode() function reads a byte stream and recreates
       the compiled patterns in new memory blocks, setting pointers to them in
       a vector. The first two arguments are a pointer to a suitable vector
       and its length, and the third argument points to a byte stream. The
       final argument is a pointer to a general context, which can be used to
       specify custom memory management functions for the decoded patterns. If
       this argument is NULL, malloc() and free() are used. After
       deserialization, the byte stream is no longer needed and can be
       discarded.

         pcre2_code *list_of_codes[2];
         uint8_t *bytes = <serialized data>;
         int32_t number_of_codes =
           pcre2_serialize_decode(list_of_codes, 2, bytes, NULL);

       If the vector is not large enough for all the patterns in the byte
       stream, it is filled with those that fit, and the remainder are
       ignored. The yield of the function is the number of decoded patterns,
       or one of the following negative error codes:

         PCRE2_ERROR_BADDATA            second argument is zero or less
         PCRE2_ERROR_BADMAGIC           mismatch of id bytes in the data
         PCRE2_ERROR_BADMODE            mismatch of code unit size or PCRE2 version
         PCRE2_ERROR_BADSERIALIZEDDATA  other sanity check failure
         PCRE2_ERROR_NOMEMORY           memory allocation failed
         PCRE2_ERROR_NULL               first or third argument is NULL

       PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was
       compiled on a system with different endianness.

       Decoded patterns can be used for matching in the usual way, and must be
       freed by calling pcre2_code_free(). However, be aware that there is a
       potential race issue if you are using multiple patterns that were de-
       coded from a single byte stream in a multithreaded application. A sin-
       gle copy of the character tables is used by all the decoded patterns
       and a reference count is used to arrange for its memory to be automati-
       cally freed when the last pattern is freed, but there is no locking on
       this reference count. Therefore, if you want to call pcre2_code_free()
       for these patterns in different threads, you must arrange your own
       locking, and ensure that pcre2_code_free() cannot be called by two
       threads at the same time.

       If a pattern was processed by pcre2_jit_compile() before being serial-
       ized, the JIT data is discarded and so is no longer available after a
       save/restore cycle. You can, however, process a restored pattern with
       pcre2_jit_compile() if you wish.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 27 June 2018
       Copyright (c) 1997-2018 University of Cambridge.


PCRE2 10.32                      27 June 2018                PCRE2SERIALIZE(3)
------------------------------------------------------------------------------



PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY

       The full syntax and semantics of the regular expressions that are sup-
       ported by PCRE2 are described in the pcre2pattern documentation. This
       document contains a quick-reference summary of the syntax.


QUOTING

         \x         where x is non-alphanumeric is a literal x
         \Q...\E    treat enclosed characters as literal

       Note that white space inside \Q...\E is always treated as literal, even
       if PCRE2_EXTENDED is set, causing most other white space to be ignored.


BRACED ITEMS

       With one exception, wherever brace characters { and } are required to
       enclose data for constructions such as \g{2} or \k{name}, space and/or
       horizontal tab characters that follow { or precede } are allowed and
       are ignored. In the case of quantifiers, they may also appear before or
       after the comma. The exception is \u{...} which is not Perl-compatible
       and is recognized only when PCRE2_EXTRA_ALT_BSUX is set. This is an EC-
       MAScript compatibility feature, and follows ECMAScript's behaviour.


ESCAPED CHARACTERS

       This table applies to ASCII and Unicode environments. An unrecognized
       escape sequence causes an error.

         \a         alarm, that is, the BEL character (hex 07)
         \cx        "control-x", where x is a non-control ASCII character
         \e         escape (hex 1B)
         \f         form feed (hex 0C)
         \n         newline (hex 0A)
         \r         carriage return (hex 0D)
         \t         tab (hex 09)
         \0dd       character with octal code 0dd
         \ddd       character with octal code ddd, or backreference
         \o{ddd..}  character with octal code ddd..
         \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
         \xhh       character with hex code hh
         \x{hh..}   character with hex code hh..

       If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
       following are also recognized:

         \U         the character "U"
         \uhhhh     character with hex code hhhh
         \u{hh..}   character with hex code hh.. but only for EXTRA_ALT_BSUX

       When \x is not followed by {, from zero to two hexadecimal digits are
       read, but in ALT_BSUX mode \x must be followed by two hexadecimal dig-
       its to be recognized as a hexadecimal escape; otherwise it matches a
       literal "x". Likewise, if \u (in ALT_BSUX mode) is not followed by
       four hexadecimal digits or (in EXTRA_ALT_BSUX mode) a sequence of hex
       digits in curly brackets, it matches a literal "u".

       Note that \0dd is always an octal code. The treatment of backslash fol-
       lowed by a non-zero digit is complicated; for details see the section
       "Non-printing characters" in the pcre2pattern documentation, where de-
       tails of escape processing in EBCDIC environments are also given.
       \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in
       EBCDIC environments. Note that \N not followed by an opening curly
       bracket has a different meaning (see below).


CHARACTER TYPES

         .          any character except newline;
                      in dotall mode, any character whatsoever
         \C         one code unit, even in UTF mode (best avoided)
         \d         a decimal digit
         \D         a character that is not a decimal digit
         \h         a horizontal white space character
         \H         a character that is not a horizontal white space character
         \N         a character that is not a newline
         \p{xx}     a character with the xx property
         \P{xx}     a character without the xx property
         \R         a newline sequence
         \s         a white space character
         \S         a character that is not a white space character
         \v         a vertical white space character
         \V         a character that is not a vertical white space character
         \w         a "word" character
         \W         a "non-word" character
         \X         a Unicode extended grapheme cluster

       \C is dangerous because it may leave the current matching point in the
       middle of a UTF-8 or UTF-16 character. The application can lock out the
       use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also
       possible to build PCRE2 with the use of \C permanently disabled.

       By default, \d, \s, and \w match only ASCII characters, even in UTF-8
       mode or in the 16-bit and 32-bit libraries. However, if locale-specific
       matching is happening, \s and \w may also match characters with code
       points in the range 128-255. If the PCRE2_UCP option is set, the behav-
       iour of these escape sequences is changed to use Unicode properties and
       they match many more characters, but there are some option settings
       that can restrict individual sequences to matching only ASCII charac-
       ters.

       Property descriptions in \p and \P are matched caselessly; hyphens, un-
       derscores, and white space are ignored, in accordance with Unicode's
       "loose matching" rules.


GENERAL CATEGORY PROPERTIES FOR \p and \P

         C          Other
         Cc         Control
         Cf         Format
         Cn         Unassigned
         Co         Private use
         Cs         Surrogate

         L          Letter
         Ll         Lower case letter
         Lm         Modifier letter
         Lo         Other letter
         Lt         Title case letter
         Lu         Upper case letter
         Lc         Ll, Lu, or Lt
         L&         Ll, Lu, or Lt

         M          Mark
         Mc         Spacing mark
         Me         Enclosing mark
         Mn         Non-spacing mark

         N          Number
         Nd         Decimal number
         Nl         Letter number
         No         Other number

         P          Punctuation
         Pc         Connector punctuation
         Pd         Dash punctuation
         Pe         Close punctuation
         Pf         Final punctuation
         Pi         Initial punctuation
         Po         Other punctuation
         Ps         Open punctuation

         S          Symbol
         Sc         Currency symbol
         Sk         Modifier symbol
         Sm         Mathematical symbol
         So         Other symbol

         Z          Separator
         Zl         Line separator
         Zp         Paragraph separator
         Zs         Space separator


PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P

         Xan        Alphanumeric: union of properties L and N
         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
         Xsp        Perl space: property Z or tab, NL, VT, FF, CR
         Xuc        Universally-named character: one that can be
                      represented by a Universal Character Name
         Xwd        Perl word: property Xan or underscore

       Perl and POSIX space are now the same. Perl added VT to its space char-
       acter set at release 5.18.


BINARY PROPERTIES FOR \p AND \P

       Unicode defines a number of binary properties, that is, properties
       whose only values are true or false. You can obtain a list of those
       that are recognized by \p and \P, along with their abbreviations, by
       running this command:

         pcre2test -LP


SCRIPT MATCHING WITH \p AND \P

       Many script names and their 4-letter abbreviations are recognized in
       \p{sc:...} or \p{scx:...} items, or on their own with \p (and also \P
       of course). You can obtain a list of these scripts by running this com-
       mand:

         pcre2test -LS


THE BIDI_CLASS PROPERTY FOR \p AND \P

         \p{Bidi_Class:<class>}  matches a character with the given class
         \p{BC:<class>}          matches a character with the given class

       The recognized classes are:

         AL          Arabic letter
         AN          Arabic number
         B           paragraph separator
         BN          boundary neutral
         CS          common separator
         EN          European number
         ES          European separator
         ET          European terminator
         FSI         first strong isolate
         L           left-to-right
         LRE         left-to-right embedding
         LRI         left-to-right isolate
         LRO         left-to-right override
         NSM         non-spacing mark
         ON          other neutral
         PDF         pop directional format
         PDI         pop directional isolate
         R           right-to-left
         RLE         right-to-left embedding
         RLI         right-to-left isolate
         RLO         right-to-left override
         S           segment separator
         WS          white space


CHARACTER CLASSES

         [...]       positive character class
         [^...]      negative character class
         [x-y]       range (can be used for hex characters)
         [[:xxx:]]   positive POSIX named set
         [[:^xxx:]]  negative POSIX named set

         alnum       alphanumeric
         alpha       alphabetic
         ascii       0-127
         blank       space or tab
         cntrl       control character
         digit       decimal digit
         graph       printing, excluding space
         lower       lower case letter
         print       printing, including space
         punct       printing, excluding alphanumeric
         space       white space
         upper       upper case letter
         word        same as \w
         xdigit      hexadecimal digit

       In PCRE2, POSIX character set names recognize only ASCII characters by
       default, but some of them use Unicode properties if PCRE2_UCP is set.
       You can use \Q...\E inside a character class.


QUANTIFIERS

         ?           0 or 1, greedy
         ?+          0 or 1, possessive
         ??          0 or 1, lazy
         *           0 or more, greedy
         *+          0 or more, possessive
         *?          0 or more, lazy
         +           1 or more, greedy
         ++          1 or more, possessive
         +?          1 or more, lazy
         {n}         exactly n
         {n,m}       at least n, no more than m, greedy
         {n,m}+      at least n, no more than m, possessive
         {n,m}?      at least n, no more than m, lazy
         {n,}        n or more, greedy
         {n,}+       n or more, possessive
         {n,}?       n or more, lazy
         {,m}        zero up to m, greedy
         {,m}+       zero up to m, possessive
         {,m}?       zero up to m, lazy


ANCHORS AND SIMPLE ASSERTIONS

         \b          word boundary
         \B          not a word boundary
         ^           start of subject
                       also after an internal newline in multiline mode
                       (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
         \A          start of subject
         $           end of subject
                       also before newline at end of subject
                       also before internal newline in multiline mode
         \Z          end of subject
                       also before newline at end of subject
         \z          end of subject
         \G          first matching position in subject


REPORTED MATCH POINT SETTING

         \K          set reported start of match

       From release 10.38 \K is not permitted by default in lookaround asser-
       tions, for compatibility with Perl. However, if the
       PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK option is set, the previous behaviour
       is re-enabled. When this option is set, \K is honoured in positive
       assertions, but ignored in negative ones.


ALTERNATION

         expr|expr|expr...


CAPTURING

         (...)           capture group
         (?<name>...)    named capture group (Perl)
         (?'name'...)    named capture group (Perl)
         (?P<name>...)   named capture group (Python)
         (?:...)         non-capture group
         (?|...)         non-capture group; reset group numbers for
                          capture groups in each alternative

       In non-UTF modes, names may contain underscores and ASCII letters and
       digits; in UTF modes, any Unicode letters and Unicode decimal digits
       are permitted. In both cases, a name must not start with a digit.


ATOMIC GROUPS

         (?>...)         atomic non-capture group
         (*atomic:...)   atomic non-capture group


COMMENT

         (?#....)        comment (not nestable)


OPTION SETTING

       Changes of these options within a group are automatically cancelled at
       the end of the group.

         (?a)            all ASCII options
         (?aD)           restrict \d to ASCII in UCP mode
         (?aS)           restrict \s to ASCII in UCP mode
         (?aW)           restrict \w to ASCII in UCP mode
         (?aP)           restrict all POSIX classes to ASCII in UCP mode
         (?aT)           restrict POSIX digit classes to ASCII in UCP mode
         (?i)            caseless
         (?J)            allow duplicate named groups
         (?m)            multiline
         (?n)            no auto capture
         (?r)            restrict caseless to either ASCII or non-ASCII
         (?s)            single line (dotall)
         (?U)            default ungreedy (lazy)
         (?x)            ignore white space except in classes or \Q...\E
         (?xx)           as (?x) but also ignore space and tab in classes
         (?-...)         unset the given option(s)
         (?^)            unset imnrsx options

       (?aP) implies (?aT) as well, though this has no additional effect. How-
       ever, it means that (?-aP) is really (?-PT) which disables all ASCII
       restrictions for POSIX classes.

       Unsetting x or xx unsets both. Several options may be set at once, and
       a mixture of setting and unsetting such as (?i-x) is allowed, but there
       may be only one hyphen. Setting (but not unsetting) is allowed after
       (?^, for example (?^in). An option setting may appear at the start of a
       non-capture group, for example (?i:...).

       The following are recognized only at the very start of a pattern or af-
       ter one of the newline or \R options with similar syntax. More than one
       of them may appear. For the first three, d is a decimal number.

         (*LIMIT_DEPTH=d)     set the backtracking limit to d
         (*LIMIT_HEAP=d)      set the heap size limit to d * 1024 bytes
         (*LIMIT_MATCH=d)     set the match limit to d
         (*NOTEMPTY)          set PCRE2_NOTEMPTY when matching
         (*NOTEMPTY_ATSTART)  set PCRE2_NOTEMPTY_ATSTART when matching
         (*NO_AUTO_POSSESS)   no auto-possessification (PCRE2_NO_AUTO_POSSESS)
         (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
         (*NO_JIT)            disable JIT optimization
         (*NO_START_OPT)      no start-match optimization (PCRE2_NO_START_OPTIMIZE)
         (*UTF)               set appropriate UTF mode for the library in use
         (*UCP)               set PCRE2_UCP (use Unicode properties for \d etc)

       Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
       value of the limits set by the caller of pcre2_match() or
       pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
       synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
       and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
       respectively, at compile time.


NEWLINE CONVENTION

       These are recognized only at the very start of the pattern or after op-
       tion settings with a similar syntax.

         (*CR)           carriage return only
         (*LF)           linefeed only
         (*CRLF)         carriage return followed by linefeed
         (*ANYCRLF)      all three of the above
         (*ANY)          any Unicode newline sequence
         (*NUL)          the NUL character (binary zero)


WHAT \R MATCHES

       These are recognized only at the very start of the pattern or after op-
       tion settings with a similar syntax.

         (*BSR_ANYCRLF)  CR, LF, or CRLF
         (*BSR_UNICODE)  any Unicode newline sequence


LOOKAHEAD AND LOOKBEHIND ASSERTIONS

         (?=...)                     )
         (*pla:...)                  ) positive lookahead
         (*positive_lookahead:...)   )

         (?!...)                     )
         (*nla:...)                  ) negative lookahead
         (*negative_lookahead:...)   )

         (?<=...)                    )
         (*plb:...)                  ) positive lookbehind
         (*positive_lookbehind:...)  )

         (?<!...)                    )
         (*nlb:...)                  ) negative lookbehind
         (*negative_lookbehind:...)  )

       Each top-level branch of a lookbehind must have a limit for the number
       of characters it matches. If any branch can match a variable number of
       characters, the maximum for each branch is limited to a value set by
       the caller of pcre2_compile() or defaulted. The default is set when
       PCRE2 is built (ultimate default 255). If every branch matches a fixed
       number of characters, the limit for each branch is 65535 characters.


NON-ATOMIC LOOKAROUND ASSERTIONS

       These assertions are specific to PCRE2 and are not Perl-compatible.

         (?*...)                                )
         (*napla:...)                           ) synonyms
         (*non_atomic_positive_lookahead:...)   )

         (?<*...)                               )
         (*naplb:...)                           ) synonyms
         (*non_atomic_positive_lookbehind:...)  )


SCRIPT RUNS

         (*script_run:...)           ) script run, can be backtracked into
         (*sr:...)                   )

         (*atomic_script_run:...)    ) atomic script run
         (*asr:...)                  )


BACKREFERENCES

         \n              reference by number (can be ambiguous)
         \gn             reference by number
         \g{n}           reference by number
         \g+n            relative reference by number (PCRE2 extension)
         \g-n            relative reference by number
         \g{+n}          relative reference by number (PCRE2 extension)
         \g{-n}          relative reference by number
         \k<name>        reference by name (Perl)
         \k'name'        reference by name (Perl)
         \g{name}        reference by name (Perl)
         \k{name}        reference by name (.NET)
         (?P=name)       reference by name (Python)


SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

         (?R)            recurse whole pattern
         (?n)            call subroutine by absolute number
         (?+n)           call subroutine by relative number
         (?-n)           call subroutine by relative number
         (?&name)        call subroutine by name (Perl)
         (?P>name)       call subroutine by name (Python)
         \g<name>        call subroutine by name (Oniguruma)
         \g'name'        call subroutine by name (Oniguruma)
         \g<n>           call subroutine by absolute number (Oniguruma)
         \g'n'           call subroutine by absolute number (Oniguruma)
         \g<+n>          call subroutine by relative number (PCRE2 extension)
         \g'+n'          call subroutine by relative number (PCRE2 extension)
         \g<-n>          call subroutine by relative number (PCRE2 extension)
         \g'-n'          call subroutine by relative number (PCRE2 extension)


CONDITIONAL PATTERNS

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

         (?(n)               absolute reference condition
         (?(+n)              relative reference condition (PCRE2 extension)
         (?(-n)              relative reference condition (PCRE2 extension)
         (?(<name>)          named reference condition (Perl)
         (?('name')          named reference condition (Perl)
         (?(name)            named reference condition (PCRE2, deprecated)
         (?(R)               overall recursion condition
         (?(Rn)              specific numbered group recursion condition
         (?(R&name)          specific named group recursion condition
         (?(DEFINE)          define groups for reference
         (?(VERSION[>]=n.m)  test PCRE2 version
         (?(assert)          assertion condition

       Note the ambiguity of (?(R) and (?(Rn) which might be named reference
       conditions or recursion tests. Such a condition is interpreted as a
       reference condition if the relevant named group exists.


BACKTRACKING CONTROL

       All backtracking control verbs may be in the form (*VERB:NAME). For
       (*MARK) the name is mandatory, for the others it is optional. (*SKIP)
       changes its behaviour if :NAME is present. The others just set a name
       for passing back to the caller, but this is not a name that (*SKIP) can
       see. The following act as soon as they are reached:

         (*ACCEPT)       force successful match
         (*FAIL)         force backtrack; synonym (*F)
         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)

       The following act only when a subsequent match failure causes a back-
       track to reach them. They all force a match failure, but they differ in
       what happens afterwards. Those that advance the start-of-match point do
       so only if the pattern is not anchored.

         (*COMMIT)       overall failure, no advance of starting point
         (*PRUNE)        advance to next starting character
         (*SKIP)         advance to current matching position
         (*SKIP:NAME)    advance to position corresponding to an earlier
                         (*MARK:NAME); if not found, the (*SKIP) is ignored
         (*THEN)         local failure, backtrack to next alternation

       The effect of one of these verbs in a group called as a subroutine is
       confined to the subroutine call.


CALLOUTS

         (?C)            callout (assumed number 0)
         (?Cn)           callout with numerical data n
         (?C"text")      callout with string data

       The allowed string delimiters are ` ' " ^ % # $ (which are the same for
       the start and the end), and the starting delimiter { matched with the
       ending delimiter }. To encode the ending delimiter within the string,
       double it.


SEE ALSO

       pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
       pcre2(3).


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 12 October 2023
       Copyright (c) 1997-2023 University of Cambridge.


PCRE2 10.43                     12 October 2023                 PCRE2SYNTAX(3)
------------------------------------------------------------------------------



PCRE2UNICODE(3)            Library Functions Manual            PCRE2UNICODE(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


UNICODE AND UTF SUPPORT

       PCRE2 is normally built with Unicode support, though if you do not need
       it, you can build it without, in which case the library will be
       smaller. With Unicode support, PCRE2 has knowledge of Unicode character
       properties and can process strings of text in UTF-8, UTF-16, and UTF-32
       format (depending on the code unit width), but this is not the default.
       Unless specifically requested, PCRE2 treats each code unit in a string
       as one character.

       There are two ways of telling PCRE2 to switch to UTF mode, where char-
       acters may consist of more than one code unit and the range of values
       is constrained. The program can call pcre2_compile() with the PCRE2_UTF
       option, or the pattern may start with the sequence (*UTF). However,
       the latter facility can be locked out by the PCRE2_NEVER_UTF option.
       That is, the programmer can prevent the supplier of the pattern from
       switching to UTF mode.

       Note that the PCRE2_MATCH_INVALID_UTF option (see below) forces
       PCRE2_UTF to be set.

       In UTF mode, both the pattern and any subject strings that are matched
       against it are treated as UTF strings instead of strings of individual
       one-code-unit characters. There are also some other changes to the way
       characters are handled, as documented below.


UNICODE PROPERTY SUPPORT

       When PCRE2 is built with Unicode support, the escape sequences \p{..},
       \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set-
       ting. The Unicode properties that can be tested are a subset of those
       that Perl supports. Currently they are limited to the general category
       properties such as Lu for an upper case letter or Nd for a decimal num-
       ber, the derived properties Any and LC (synonym L&), the Unicode script
       names such as Arabic or Han, Bidi_Class, Bidi_Control, and a few binary
       properties.

       The full lists are given in the pcre2pattern and pcre2syntax documenta-
       tion. In general, only the short names for properties are supported.
       For example, \p{L} matches a letter. Its longer synonym, \p{Letter}, is
       not supported. Furthermore, in Perl, many properties may optionally be
       prefixed by "Is", for compatibility with Perl 5.6. PCRE2 does not sup-
       port this.


WIDE CHARACTERS AND UTF MODES

       Code points less than 256 can be specified in patterns by either braced
       or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
       Larger values have to use braced sequences. Unbraced octal code points
       up to \777 are also recognized; larger ones can be coded using \o{...}.

       The escape sequence \N{U+<hex digits>} is recognized as another way of
       specifying a Unicode character by code point in a UTF mode. It is not
       allowed in non-UTF mode.

       In UTF mode, repeat quantifiers apply to complete UTF characters, not
       to individual code units.
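
       To make the difference between code units and characters concrete,
       here is a minimal standalone sketch in C that counts the characters in
       a UTF-8 string. It does not call PCRE2 and is not how PCRE2 itself
       counts; the function name utf8_count_chars() is hypothetical, and the
       input is assumed to be valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: count the characters (code points) in a valid
   UTF-8 string of `length` code units. Continuation bytes have the bit
   pattern 10xxxxxx, so every byte that is not a continuation byte is the
   first code unit of a character. */
static size_t utf8_count_chars(const unsigned char *s, size_t length)
{
  size_t count = 0;
  for (size_t i = 0; i < length; i++)
    if ((s[i] & 0xC0) != 0x80) count++;
  return count;
}
```

       For the three code units 0x61 0xC3 0xA9 (the letter "a" followed by
       U+00E9) the function returns 2. A repeat quantifier in UTF mode counts
       in these whole characters, whereas in non-UTF mode each of the three
       code units is a separate character.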
11582 11583 In UTF mode, the dot metacharacter matches one UTF character instead of 11584 a single code unit. 11585 11586 In UTF mode, capture group names are not restricted to ASCII, and may 11587 contain any Unicode letters and decimal digits, as well as underscore. 11588 11589 The escape sequence \C can be used to match a single code unit in UTF 11590 mode, but its use can lead to some strange effects because it breaks up 11591 multi-unit characters (see the description of \C in the pcre2pattern 11592 documentation). For this reason, there is a build-time option that dis- 11593 ables support for \C completely. There is also a less draconian com- 11594 pile-time option for locking out the use of \C when a pattern is com- 11595 piled. 11596 11597 The use of \C is not supported by the alternative matching function 11598 pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac- 11599 ter may consist of more than one code unit. The use of \C in these 11600 modes provokes a match-time error. Also, the JIT optimization does not 11601 support \C in these modes. If JIT optimization is requested for a UTF-8 11602 or UTF-16 pattern that contains \C, it will not succeed, and so when 11603 pcre2_match() is called, the matching will be carried out by the inter- 11604 pretive function. 11605 11606 The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test 11607 characters of any code value, but, by default, the characters that 11608 PCRE2 recognizes as digits, spaces, or word characters remain the same 11609 set as in non-UTF mode, all with code points less than 256. This re- 11610 mains true even when PCRE2 is built to include Unicode support, because 11611 to do otherwise would slow down matching in many common cases. Note 11612 that this also applies to \b and \B, because they are defined in terms 11613 of \w and \W. If you want to test for a wider sense of, say, "digit", 11614 you can use explicit Unicode property tests such as \p{Nd}. 
Alterna- 11615 tively, if you set the PCRE2_UCP option, the way that the character es- 11616 capes work is changed so that Unicode properties are used to determine 11617 which characters match, though there are some options that suppress 11618 this for individual escapes. For details see the section on generic 11619 character types in the pcre2pattern documentation. 11620 11621 Like the escapes, characters that match the POSIX named character 11622 classes are all low-valued characters unless the PCRE2_UCP option is 11623 set, but there is an option to override this. 11624 11625 In contrast to the character escapes and character classes, the special 11626 horizontal and vertical white space escapes (\h, \H, \v, and \V) do 11627 match all the appropriate Unicode characters, whether or not PCRE2_UCP 11628 is set. 11629 11630 11631UNICODE CASE-EQUIVALENCE 11632 11633 If either PCRE2_UTF or PCRE2_UCP is set, upper/lower case processing 11634 makes use of Unicode properties except for characters whose code points 11635 are less than 128 and that have at most two case-equivalent values. For 11636 these, a direct table lookup is used for speed. A few Unicode charac- 11637 ters such as Greek sigma have more than two code points that are case- 11638 equivalent, and these are treated specially. Setting PCRE2_UCP without 11639 PCRE2_UTF allows Unicode-style case processing for non-UTF character 11640 encodings such as UCS-2. 11641 11642 There are two ASCII characters (S and K) that, in addition to their 11643 ASCII lower case equivalents, have a non-ASCII one as well (long S and 11644 Kelvin sign). Recognition of these non-ASCII characters as case-equiv- 11645 alent to their ASCII counterparts can be disabled by setting the 11646 PCRE2_EXTRA_CASELESS_RESTRICT option. When this is set, all characters 11647 in a case equivalence must either be ASCII or non-ASCII; there can be 11648 no mixing. 11649 11650 11651SCRIPT RUNS 11652 11653 The pattern constructs (*script_run:...) 
and (*atomic_script_run:...), 11654 with synonyms (*sr:...) and (*asr:...), verify that the string matched 11655 within the parentheses is a script run. In concept, a script run is a 11656 sequence of characters that are all from the same Unicode script. How- 11657 ever, because some scripts are commonly used together, and because some 11658 diacritical and other marks are used with multiple scripts, it is not 11659 that simple. 11660 11661 Every Unicode character has a Script property, mostly with a value cor- 11662 responding to the name of a script, such as Latin, Greek, or Cyrillic. 11663 There are also three special values: 11664 11665 "Unknown" is used for code points that have not been assigned, and also 11666 for the surrogate code points. In the PCRE2 32-bit library, characters 11667 whose code points are greater than the Unicode maximum (U+10FFFF), 11668 which are accessible only in non-UTF mode, are assigned the Unknown 11669 script. 11670 11671 "Common" is used for characters that are used with many scripts. These 11672 include punctuation, emoji, mathematical, musical, and currency sym- 11673 bols, and the ASCII digits 0 to 9. 11674 11675 "Inherited" is used for characters such as diacritical marks that mod- 11676 ify a previous character. These are considered to take on the script of 11677 the character that they modify. 11678 11679 Some Inherited characters are used with many scripts, but many of them 11680 are only normally used with a small number of scripts. For example, 11681 U+102E0 (Coptic Epact thousands mark) is used only with Arabic and Cop- 11682 tic. In order to make it possible to check this, a Unicode property 11683 called Script Extension exists. Its value is a list of scripts that ap- 11684 ply to the character. For the majority of characters, the list contains 11685 just one script, the same one as the Script property. However, for 11686 characters such as U+102E0 more than one Script is listed. 
There are 11687 also some Common characters that have a single, non-Common script in 11688 their Script Extension list. 11689 11690 The next section describes the basic rules for deciding whether a given 11691 string of characters is a script run. Note, however, that there are 11692 some special cases involving the Chinese Han script, and an additional 11693 constraint for decimal digits. These are covered in subsequent sec- 11694 tions. 11695 11696 Basic script run rules 11697 11698 A string that is less than two characters long is a script run. This is 11699 the only case in which an Unknown character can be part of a script 11700 run. Longer strings are checked using only the Script Extensions prop- 11701 erty, not the basic Script property. 11702 11703 If a character's Script Extension property is the single value "Inher- 11704 ited", it is always accepted as part of a script run. This is also true 11705 for the property "Common", subject to the checking of decimal digits 11706 described below. All the remaining characters in a script run must have 11707 at least one script in common in their Script Extension lists. In set- 11708 theoretic terminology, the intersection of all the sets of scripts must 11709 not be empty. 11710 11711 A simple example is an Internet name such as "google.com". The letters 11712 are all in the Latin script, and the dot is Common, so this string is a 11713 script run. However, the Cyrillic letter "o" looks exactly the same as 11714 the Latin "o"; a string that looks the same, but with Cyrillic "o"s is 11715 not a script run. 11716 11717 More interesting examples involve characters with more than one script 11718 in their Script Extension. Consider the following characters: 11719 11720 U+060C Arabic comma 11721 U+06D4 Arabic full stop 11722 11723 The first has the Script Extension list Arabic, Hanifi Rohingya, Syr- 11724 iac, and Thaana; the second has just Arabic and Hanifi Rohingya. 
       Both of them could appear in script runs of either Arabic or Hanifi
       Rohingya. The first could also appear in Syriac or Thaana script
       runs, but the second could not.

   The Chinese Han script

       The Chinese Han script is commonly used in conjunction with other
       scripts for writing certain languages. Japanese uses the Hiragana and
       Katakana scripts together with Han; Korean uses Hangul and Han;
       Taiwanese Mandarin uses Bopomofo and Han. These three combinations
       are treated as special cases when checking script runs and are, in
       effect, "virtual scripts". Thus, a script run may contain a mixture
       of Hiragana, Katakana, and Han, or a mixture of Hangul and Han, or a
       mixture of Bopomofo and Han, but not, for example, a mixture of
       Hangul and Bopomofo and Han. PCRE2 (like Perl) follows Unicode's
       Technical Standard 39 ("Unicode Security Mechanisms",
       http://unicode.org/reports/tr39/) in allowing such mixtures.

   Decimal digits

       Unicode contains many sets of 10 decimal digits in different scripts,
       and some scripts (including the Common script) contain more than one
       set. Some of these decimal digits are visually indistinguishable from
       the common ASCII digits. In addition to the script checking described
       above, if a script run contains any decimal digits, they must all
       come from the same set of 10 adjacent characters.


VALIDITY OF UTF STRINGS

       When the PCRE2_UTF option is set, the strings passed as patterns and
       subjects are (by default) checked for validity on entry to the
       relevant functions. If an invalid UTF string is passed, a negative
       error code is returned. The code unit offset to the offending
       character can be extracted from the match data block by calling
       pcre2_get_startchar(), which is used for this purpose after a UTF
       error.
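
       As a rough illustration of what such a validity check involves, the
       following C sketch scans a UTF-8 buffer and reports the offset of the
       first offending character. It is not PCRE2's implementation: the
       function name check_utf8() and its 0/-1 return convention are
       assumptions made for this example, and it applies only the RFC 3629
       rules rather than distinguishing individual error codes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. Returns 0 if s[0..len) is valid
   UTF-8; otherwise returns -1 and stores in *bad_offset the code unit
   offset of the first byte of the first offending character. */
static int check_utf8(const uint8_t *s, size_t len, size_t *bad_offset)
{
    size_t i = 0;

    assert(bad_offset != NULL && (s != NULL || len == 0));

    while (i < len) {
        uint8_t b = s[i];
        size_t extra;      /* number of continuation bytes expected */
        uint32_t cp, min;  /* decoded value; smallest non-overlong value */

        if (b < 0x80) { i++; continue; }  /* single-byte character */
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else { *bad_offset = i; return -1; }  /* stray 0b10xxxxxx or >= 0xF8 */

        /* The string must not end with a truncated character. */
        if (len - i - 1 < extra) { *bad_offset = i; return -1; }

        /* Each following byte must have 0b10 as its top two bits. */
        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) { *bad_offset = i; return -1; }
            cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }

        /* Reject overlong forms, values above 0x10ffff, and surrogates. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            *bad_offset = i;
            return -1;
        }
        i += extra + 1;
    }
    return 0;
}
```

       A caller might treat a -1 result much as it would a negative UTF
       error from a matching function, using the stored offset to locate
       the bad character.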

       In some situations, you may already know that your strings are valid,
       and therefore want to skip these checks in order to improve
       performance, for example in the case of a long subject string that is
       being scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at
       compile time or at match time, PCRE2 assumes that the pattern or
       subject it is given (respectively) contains only valid UTF code unit
       sequences.

       If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
       result is undefined and your program may crash or loop indefinitely
       or give incorrect results. There is, however, one mode of matching
       that can handle invalid UTF subject strings. This is enabled by
       passing PCRE2_MATCH_INVALID_UTF to pcre2_compile() and is discussed
       in the next section. The rest of this section covers the case when
       PCRE2_MATCH_INVALID_UTF is not set.

       Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the UTF
       check for the pattern; it does not also apply to subject strings. If
       you want to disable the check for a subject string you must pass this
       same option to pcre2_match() or pcre2_dfa_match().

       UTF-16 and UTF-32 strings can indicate their endianness by a special
       code known as a byte-order mark (BOM). The PCRE2 functions do not
       handle this, expecting strings to be in host byte order.

       Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any
       other processing takes place. In the case of pcre2_match() and
       pcre2_dfa_match() calls with a non-zero starting offset, the check is
       applied only to that part of the subject that could be inspected
       during matching, and there is a check that the starting offset points
       to the first code unit of a character or to the end of the subject.
       If there are no lookbehind assertions in the pattern, the check
       starts at the starting offset. Otherwise, it starts at the length of
       the longest lookbehind before the starting offset, or at the start of
       the subject if there are not that many characters before the starting
       offset. Note that the sequences \b and \B are one-character
       lookbehinds.

       In addition to checking the format of the string, there is a check to
       ensure that all code points lie in the range U+0 to U+10FFFF,
       excluding the surrogate area. The so-called "non-character" code
       points are not excluded because Unicode corrigendum #9 makes it clear
       that they should not be.

       Characters in the "Surrogate Area" of Unicode are reserved for use by
       UTF-16, where they are used in pairs to encode code points with
       values greater than 0xFFFF. The code points that are encoded by
       UTF-16 pairs are available independently in the UTF-8 and UTF-32
       encodings. (In other words, the whole surrogate thing is a fudge for
       UTF-16 which unfortunately messes up UTF-8 and UTF-32.)

       Setting PCRE2_NO_UTF_CHECK at compile time does not disable the error
       that is given if an escape sequence for an invalid Unicode code point
       is encountered in the pattern. If you want to allow escape sequences
       such as \x{d800} (a surrogate code point) you can set the
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is
       possible only in UTF-8 and UTF-32 modes, because these values are not
       representable in UTF-16.
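
       The code point rules described in this section can be summarised in a
       few lines of C. This is a sketch only, not PCRE2 code: the names
       is_valid_utf_code_point() and utf16_encode_pair() are hypothetical.
       The first function applies the range check (U+0 to U+10FFFF,
       excluding the surrogate area, with non-characters accepted); the
       second shows how UTF-16 uses a pair of surrogates to encode a code
       point greater than 0xFFFF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): true if cp lies in the range
   U+0 to U+10FFFF and is not in the surrogate area. Non-character code
   points such as U+FFFE are deliberately accepted. */
static bool is_valid_utf_code_point(uint32_t cp)
{
    if (cp > 0x10FFFF) return false;                 /* above the maximum */
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* surrogate area */
    return true;
}

/* Hypothetical helper: encode a code point greater than 0xFFFF as a UTF-16
   surrogate pair, high surrogate first. */
static void utf16_encode_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp > 0xFFFF && cp <= 0x10FFFF);
    cp -= 0x10000;                              /* leaves a 20-bit value */
    *high = (uint16_t)(0xD800 + (cp >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* bottom 10 bits */
}
```

       Because every value from 0xd800 to 0xdfff is consumed by this pairing
       scheme, none of them can stand for a character in its own right,
       which is why the check rejects them in all three encodings.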

   Errors in UTF-8 strings

       The following negative error codes are given for invalid UTF-8
       strings:

         PCRE2_ERROR_UTF8_ERR1
         PCRE2_ERROR_UTF8_ERR2
         PCRE2_ERROR_UTF8_ERR3
         PCRE2_ERROR_UTF8_ERR4
         PCRE2_ERROR_UTF8_ERR5

       The string ends with a truncated UTF-8 character; the code specifies
       how many bytes are missing (1 to 5). Although RFC 3629 restricts
       UTF-8 characters to be no longer than 4 bytes, the encoding scheme
       (originally defined by RFC 2279) allows for up to 6 bytes, and this
       is checked first; hence the possibility of 4 or 5 missing bytes.

         PCRE2_ERROR_UTF8_ERR6
         PCRE2_ERROR_UTF8_ERR7
         PCRE2_ERROR_UTF8_ERR8
         PCRE2_ERROR_UTF8_ERR9
         PCRE2_ERROR_UTF8_ERR10

       The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte
       of the character do not have the binary value 0b10 (that is, either
       the most significant bit is 0, or the next bit is 1).

         PCRE2_ERROR_UTF8_ERR11
         PCRE2_ERROR_UTF8_ERR12

       A character that is valid by the RFC 2279 rules is either 5 or 6
       bytes long; these code points are excluded by RFC 3629.

         PCRE2_ERROR_UTF8_ERR13

       A 4-byte character has a value greater than 0x10ffff; these code
       points are excluded by RFC 3629.

         PCRE2_ERROR_UTF8_ERR14

       A 3-byte character has a value in the range 0xd800 to 0xdfff; this
       range of code points is reserved by RFC 3629 for use with UTF-16, and
       so is excluded from UTF-8.

         PCRE2_ERROR_UTF8_ERR15
         PCRE2_ERROR_UTF8_ERR16
         PCRE2_ERROR_UTF8_ERR17
         PCRE2_ERROR_UTF8_ERR18
         PCRE2_ERROR_UTF8_ERR19

       A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it
       codes for a value that can be represented by fewer bytes, which is
       invalid.
       For example, the two bytes 0xc0, 0xae give the value 0x2e, whose
       correct coding uses just one byte.

         PCRE2_ERROR_UTF8_ERR20

       The two most significant bits of the first byte of a character have
       the binary value 0b10 (that is, the most significant bit is 1 and the
       second is 0). Such a byte can only validly occur as the second or
       subsequent byte of a multi-byte character.

         PCRE2_ERROR_UTF8_ERR21

       The first byte of a character has the value 0xfe or 0xff. These
       values can never occur in a valid UTF-8 string.

   Errors in UTF-16 strings

       The following negative error codes are given for invalid UTF-16
       strings:

         PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
         PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
         PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate


   Errors in UTF-32 strings

       The following negative error codes are given for invalid UTF-32
       strings:

         PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff)
         PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff


MATCHING IN INVALID UTF STRINGS

       You can run pattern matches on subject strings that may contain
       invalid UTF sequences if you call pcre2_compile() with the
       PCRE2_MATCH_INVALID_UTF option. This is supported by pcre2_match(),
       including JIT matching, but not by pcre2_dfa_match(). When
       PCRE2_MATCH_INVALID_UTF is set, it forces PCRE2_UTF to be set as
       well. Note, however, that the pattern itself must be a valid UTF
       string.

       If you do not set PCRE2_MATCH_INVALID_UTF when calling
       pcre2_compile(), and you are not certain that your subject strings
       are valid UTF sequences, you should not make use of the JIT "fast
       path" function pcre2_jit_match() because it bypasses sanity checks,
       including the one for UTF validity. An invalid string may cause
       undefined behaviour, including looping, crashing, or giving the wrong
       answer.

       Setting PCRE2_MATCH_INVALID_UTF does not affect what pcre2_compile()
       generates, but if pcre2_jit_compile() is subsequently called, it does
       generate different code. If JIT is not used, the option affects the
       behaviour of the interpretive code in pcre2_match(). When
       PCRE2_MATCH_INVALID_UTF is set at compile time, PCRE2_NO_UTF_CHECK is
       ignored at match time.

       In this mode, an invalid code unit sequence in the subject never
       matches any pattern item. It does not match dot, it does not match
       \p{Any}, it does not even match negative items such as [^X]. A
       lookbehind assertion fails if it encounters an invalid sequence while
       moving the current point backwards. In other words, an invalid UTF
       code unit sequence acts as a barrier which no match can cross.

       You can also think of this as the subject being split up into
       fragments of valid UTF, delimited internally by invalid code unit
       sequences. The pattern is matched fragment by fragment. The result of
       a successful match, however, is given as code unit offsets in the
       entire subject string in the usual way. There are a few points to
       consider:

       The internal boundaries are not interpreted as the beginnings or ends
       of lines and so do not match circumflex or dollar characters in the
       pattern.

       If pcre2_match() is called with an offset that points to an invalid
       UTF-sequence, that sequence is skipped, and the match starts at the
       next valid UTF character, or the end of the subject.

       At internal fragment boundaries, \b and \B behave in the same way as
       at the beginning and end of the subject. For example, a sequence such
       as \bWORD\b would match an instance of WORD that is surrounded by
       invalid UTF code units.

       Using PCRE2_MATCH_INVALID_UTF, an application can run matches on
       arbitrary data, knowing that any matched strings that are returned
       are valid UTF. This can be useful when searching for UTF text in
       executable or other binary files.

       Note, however, that the 16-bit and 32-bit PCRE2 libraries process
       strings as sequences of uint16_t or uint32_t code units. They cannot
       find valid UTF sequences within an arbitrary string of bytes unless
       such sequences are suitably aligned.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 12 October 2023
       Copyright (c) 1997-2023 University of Cambridge.


PCRE2 10.43                    04 February 2023               PCRE2UNICODE(3)
------------------------------------------------------------------------------

