-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE2 man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. Neither has
the pcre2demo program. There are separate text files for the pcre2grep and
pcre2test commands.
-----------------------------------------------------------------------------



PCRE2(3)                   Library Functions Manual                   PCRE2(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


INTRODUCTION

       PCRE2 is the name used for a revised API for the PCRE library, which is
       a  set  of  functions,  written in C, that implement regular expression
       pattern matching using the same syntax and semantics as Perl, with just
       a few differences. After nearly two decades,  the  limitations  of  the
       original  API  were  making development increasingly difficult. The new
       API is more extensible, and it was simplified by abolishing  the  sepa-
       rate  "study" optimizing function; in PCRE2, patterns are automatically
       optimized where possible. Since forking from PCRE1, the code  has  been
       extensively  refactored and new features introduced. The old library is
       now obsolete and is no longer maintained.

       As well as Perl-style regular expression patterns, some  features  that
       appeared  in  Python and the original PCRE before they appeared in Perl
       are available using the Python syntax. There is also some  support  for
       one  or  two .NET and Oniguruma syntax items, and there are options for
       requesting  some  minor  changes  that  give  better  ECMAScript   (aka
       JavaScript) compatibility.

       The  source code for PCRE2 can be compiled to support strings of 8-bit,
       16-bit, or 32-bit code units, which means that up to three separate li-
       braries may be installed, one for each code unit size. The size of code
       unit is not related to the bit size of the underlying  hardware.  In  a
       64-bit  environment that also supports 32-bit applications, versions of
       PCRE2 that are compiled in both 64-bit and 32-bit modes may be needed.

       The original work to extend PCRE to 16-bit and 32-bit  code  units  was
       done by Zoltan Herczeg and Christian Persch, respectively. In all three
       cases,  strings  can  be  interpreted  either as one character per code
       unit, or as UTF-encoded Unicode, with support for Unicode general cate-
       gory properties. Unicode support is optional at build time (but is  the
       default). However, processing strings as UTF code units must be enabled
       explicitly at run time. The version of Unicode in use can be discovered
       by running

         pcre2test -C

       The  three  libraries  contain  identical sets of functions, with names
       ending in _8,  _16,  or  _32,  respectively  (for  example,  pcre2_com-
       pile_8()).  However,  by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or
       32, a program that uses just one code unit width can be  written  using
       generic names such as pcre2_compile(), and the documentation is written
       assuming that this is the case.

       In addition to the Perl-compatible matching function, PCRE2 contains an
       alternative  function that matches the same compiled patterns in a dif-
       ferent way. In certain circumstances, the alternative function has some
       advantages.  For a discussion of the two matching algorithms,  see  the
       pcre2matching page.

       Details  of  exactly which Perl regular expression features are and are
       not supported by  PCRE2  are  given  in  separate  documents.  See  the
       pcre2pattern  and  pcre2compat  pages. There is a syntax summary in the
       pcre2syntax page.

       Some features of PCRE2 can be included, excluded, or changed  when  the
       library  is  built. The pcre2_config() function makes it possible for a
       client to discover which features are  available.  The  features  them-
       selves are described in the pcre2build page. Documentation about build-
       ing  PCRE2 for various operating systems can be found in the README and
       NON-AUTOTOOLS_BUILD files in the source distribution.

       The libraries contain a number of undocumented internal  functions  and
       data  tables  that  are  used by more than one of the exported external
       functions, but which are not intended  for  use  by  external  callers.
       Their  names  all begin with "_pcre2", which hopefully will not provoke
       any name clashes. In some environments, it is possible to control which
       external symbols are exported when a shared library is  built,  and  in
       these cases the undocumented symbols are not exported.


SECURITY CONSIDERATIONS

       If  you  are using PCRE2 in a non-UTF application that permits users to
       supply arbitrary patterns for compilation, you should  be  aware  of  a
       feature that allows users to turn on UTF support from within a pattern.
       For  example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
       mode, which interprets patterns and subjects as strings of  UTF-8  code
       units instead of individual 8-bit characters. This causes both the pat-
       tern  and  any data against which it is matched to be checked for UTF-8
       validity. If the data string is very long, such a check might use enough
       resources to degrade your application's performance.

       One way of guarding against this possibility is to use  the  pcre2_pat-
       tern_info()  function  to  check  the  compiled  pattern's  options for
       PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF  option  when
       calling  pcre2_compile().  This causes a compile time error if the pat-
       tern contains a UTF-setting sequence.

       The use of Unicode properties for character types such as \d  can  also
       be  enabled  from within the pattern, by specifying "(*UCP)". This fea-
       ture can be disallowed by setting the PCRE2_NEVER_UCP option.

       If your application is one that supports UTF, be  aware  that  validity
       checking  can  take time. If the same data string is to be matched many
       times, you can use the PCRE2_NO_UTF_CHECK option  for  the  second  and
       subsequent matches to avoid running redundant checks.

       The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
       to  problems,  because  it  may leave the current matching point in the
       middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C  op-
       tion can be used by an application to lock out the use of \C, causing a
       compile-time  error  if it is encountered. It is also possible to build
       PCRE2 with the use of \C permanently disabled.

       Another way that performance can be hit is by running  a  pattern  that
       has  a  very  large search tree against a string that will never match.
       Nested unlimited repeats in a pattern are a common example. PCRE2  pro-
       vides  some  protection  against  this: see the pcre2_set_match_limit()
       function in the pcre2api page.  There  is  a  similar  function  called
       pcre2_set_depth_limit() that can be used to restrict the amount of mem-
       ory that is used.


USER DOCUMENTATION

       The  user  documentation for PCRE2 comprises a number of different sec-
       tions. In the "man" format, each of these is a separate "man page".  In
       the  HTML  format, each is a separate page, linked from the index page.
       In the plain  text  format,  the  descriptions  of  the  pcre2grep  and
       pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
       respectively.  The remaining sections, except for the pcre2demo section
       (which is a program listing), and the short pages for individual  func-
       tions,  are  concatenated in pcre2.txt, for ease of searching. The sec-
       tions are as follows:

         pcre2              this document
         pcre2-config       show PCRE2 installation configuration information
         pcre2api           details of PCRE2's native C API
         pcre2build         building PCRE2
         pcre2callout       details of the pattern callout feature
         pcre2compat        discussion of Perl compatibility
         pcre2convert       details of pattern conversion functions
         pcre2demo          a demonstration C program that uses PCRE2
         pcre2grep          description of the pcre2grep command (8-bit only)
         pcre2jit           discussion of just-in-time optimization support
         pcre2limits        details of size and other limits
         pcre2matching      discussion of the two matching algorithms
         pcre2partial       details of the partial matching facility
         pcre2pattern       syntax and semantics of supported regular
                              expression patterns
         pcre2perform       discussion of performance issues
         pcre2posix         the POSIX-compatible C API for the 8-bit library
         pcre2sample        discussion of the pcre2demo program
         pcre2serialize     details of pattern serialization
         pcre2syntax        quick syntax reference
         pcre2test          description of the pcre2test command
         pcre2unicode       discussion of Unicode and UTF support

       In the "man" and HTML formats, there is also a short page  for  each  C
       library function, listing its arguments and results.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.

       Putting  an  actual email address here is a spam magnet. If you want to
       email me, use my two names separated by a dot at gmail.com.


REVISION

       Last updated: 27 August 2021
       Copyright (c) 1997-2021 University of Cambridge.


PCRE2 10.38                     27 August 2021                        PCRE2(3)
------------------------------------------------------------------------------



PCRE2API(3)                Library Functions Manual                PCRE2API(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

       #include <pcre2.h>

       PCRE2  is  a  new API for PCRE, starting at release 10.0. This document
       contains a description of all its native functions. See the pcre2 docu-
       ment for an overview of all the PCRE2 documentation.


PCRE2 NATIVE API BASIC FUNCTIONS

       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
         pcre2_compile_context *ccontext);

       void pcre2_code_free(pcre2_code *code);

       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
         pcre2_general_context *gcontext);

       pcre2_match_data *pcre2_match_data_create_from_pattern(
         const pcre2_code *code, pcre2_general_context *gcontext);

       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext,
         int *workspace, PCRE2_SIZE wscount);

       void pcre2_match_data_free(pcre2_match_data *match_data);


PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS

       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_match_data_size(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_match_data_heapframes_size(
         pcre2_match_data *match_data);

       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);

       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);


PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       void pcre2_general_context_free(pcre2_general_context *gcontext);


PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
         const uint8_t *tables);

       int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
         uint32_t extra_options);

       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
         PCRE2_SIZE value);

       int pcre2_set_max_pattern_compiled_length(
         pcre2_compile_context *ccontext, PCRE2_SIZE value);

       int pcre2_set_max_varlookbehind(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_newline(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
         int (*guard_function)(uint32_t, void *), void *user_data);


PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS

       pcre2_match_context *pcre2_match_context_create(
         pcre2_general_context *gcontext);

       pcre2_match_context *pcre2_match_context_copy(
         pcre2_match_context *mcontext);

       void pcre2_match_context_free(pcre2_match_context *mcontext);

       int pcre2_set_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_callout_block *, void *),
         void *callout_data);

       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_substitute_callout_block *, void *),
         void *callout_data);

       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
         PCRE2_SIZE value);

       int pcre2_set_heap_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_match_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_depth_limit(pcre2_match_context *mcontext,
         uint32_t value);


PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS

       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR *buffer,
         PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       int pcre2_substring_get_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR **bufferptr,
         PCRE2_SIZE *bufflen);

       int pcre2_substring_length_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_SIZE *length);

       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_SIZE *length);

       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

       int pcre2_substring_number_from_name(const pcre2_code *code,
         PCRE2_SPTR name);

       void pcre2_substring_list_free(PCRE2_UCHAR **list);

       int pcre2_substring_list_get(pcre2_match_data *match_data,
         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);


PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION

       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext, PCRE2_SPTR replacementz,
         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
         PCRE2_SIZE *outlengthptr);


PCRE2 NATIVE API JIT FUNCTIONS

       int pcre2_jit_compile(pcre2_code *code, uint32_t options);

       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       pcre2_jit_stack *pcre2_jit_stack_create(size_t startsize,
         size_t maxsize, pcre2_general_context *gcontext);

       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
         pcre2_jit_callback callback_function, void *callback_data);

       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);


PCRE2 NATIVE API SERIALIZATION FUNCTIONS

       int32_t pcre2_serialize_decode(pcre2_code **codes,
         int32_t number_of_codes, const uint8_t *bytes,
         pcre2_general_context *gcontext);

       int32_t pcre2_serialize_encode(const pcre2_code **codes,
         int32_t number_of_codes, uint8_t **serialized_bytes,
         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);

       void pcre2_serialize_free(uint8_t *bytes);

       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);


PCRE2 NATIVE API AUXILIARY FUNCTIONS

       pcre2_code *pcre2_code_copy(const pcre2_code *code);

       pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);

       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
         PCRE2_SIZE bufflen);

       const uint8_t *pcre2_maketables(pcre2_general_context *gcontext);

       void pcre2_maketables_free(pcre2_general_context *gcontext,
         const uint8_t *tables);

       int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
         void *where);

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       int pcre2_config(uint32_t what, void *where);


PCRE2 NATIVE API OBSOLETE FUNCTIONS

       int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_recursion_memory_management(
         pcre2_match_context *mcontext,
         void *(*private_malloc)(size_t, void *),
         void (*private_free)(void *, void *), void *memory_data);

       These functions became obsolete at release 10.30 and are retained  only
       for  backward  compatibility.  They should not be used in new code. The
       first is replaced by pcre2_set_depth_limit(); the second is  no  longer
       needed and has no effect (it always returns zero).


PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS

       pcre2_convert_context *pcre2_convert_context_create(
         pcre2_general_context *gcontext);

       pcre2_convert_context *pcre2_convert_context_copy(
         pcre2_convert_context *cvcontext);

       void pcre2_convert_context_free(pcre2_convert_context *cvcontext);

       int pcre2_set_glob_escape(pcre2_convert_context *cvcontext,
         uint32_t escape_char);

       int pcre2_set_glob_separator(pcre2_convert_context *cvcontext,
         uint32_t separator_char);

       int pcre2_pattern_convert(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, PCRE2_UCHAR **buffer,
         PCRE2_SIZE *blength, pcre2_convert_context *cvcontext);

       void pcre2_converted_pattern_free(PCRE2_UCHAR *converted_pattern);

       These  functions  provide  a  way of converting non-PCRE2 patterns into
       patterns that can be processed by pcre2_compile(). This facility is ex-
       perimental and may be changed in future releases. At  present,  "globs"
       and  POSIX  basic  and  extended patterns can be converted. Details are
       given in the pcre2convert documentation.


PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES

       There are three PCRE2 libraries, supporting 8-bit, 16-bit,  and  32-bit
       code  units,  respectively.  However,  there  is  just one header file,
       pcre2.h.  This contains the function prototypes and  other  definitions
       for all three libraries. One, two, or all three can be installed simul-
       taneously.  On  Unix-like  systems the libraries are called libpcre2-8,
       libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
       inal PCRE libraries.  Every PCRE2 function  comes  in  three  different
       forms, one for each library, for example:

         pcre2_compile_8()
         pcre2_compile_16()
         pcre2_compile_32()

       There are also three different sets of data types:

         PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
         PCRE2_SPTR8,  PCRE2_SPTR16,  PCRE2_SPTR32

       The  UCHAR  types define unsigned code units of the appropriate widths.
       For example, PCRE2_UCHAR16 is usually defined as `uint16_t'.  The  SPTR
       types are pointers to constants of the equivalent UCHAR types, that is,
       they are pointers to vectors of unsigned code units.

       Character  strings  are  passed  to a PCRE2 library as sequences of un-
       signed integers in code units of the appropriate width. The length of a
       string may be given as a number of code units, or  the  string  may  be
       specified as zero-terminated.
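
       To make the relationship between the UCHAR and SPTR types concrete, here
       is  a  small  sketch that does not use pcre2.h at all. The demo_ type-
       defs are hypothetical stand-ins that assume the usual definition of  a
       16-bit code unit as uint16_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for PCRE2_UCHAR16 and PCRE2_SPTR16. */
typedef uint16_t demo_uchar16;            /* one unsigned 16-bit code unit */
typedef const demo_uchar16 *demo_sptr16;  /* pointer to constant code units */

/* Counts the code units that precede the terminating zero code unit. */
size_t demo_length_in_code_units(demo_sptr16 s)
{
    size_t n = 0;

    while (s[n] != 0)
        n++;
    return n;
}

/* Converts a length in 16-bit code units to a length in bytes. */
size_t demo_length_in_bytes(size_t code_units)
{
    return code_units * sizeof(demo_uchar16);
}
```

       A zero-terminated 16-bit string holding "cat" has a  length  of  three
       code units, which occupy six bytes.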

       Many  applications use only one code unit width. For their convenience,
       macros are defined whose names are the generic forms such as pcre2_com-
       pile() and  PCRE2_SPTR.  These  macros  use  the  value  of  the  macro
       PCRE2_CODE_UNIT_WIDTH  to generate the appropriate width-specific func-
       tion and macro names.  PCRE2_CODE_UNIT_WIDTH is not defined by default.
       An application must define it to be  8,  16,  or  32  before  including
       pcre2.h in order to make use of the generic names.
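
       The general technique of deriving a width-specific name from a  generic
       one  can  be  illustrated  with  preprocessor  token pasting. This is a
       simplified sketch, not the contents of pcre2.h: every DEMO_ and  demo_
       name below is hypothetical.

```c
/* Simplified illustration only; these names are hypothetical and are not
   taken from pcre2.h. */
#define DEMO_CODE_UNIT_WIDTH 16   /* an application would choose 8, 16, or 32 */

/* Two levels of macro are needed so that DEMO_CODE_UNIT_WIDTH is expanded
   to its value before the ## operator pastes the tokens together. */
#define DEMO_JOIN(a, b)   a ## b
#define DEMO_GLUE(a, b)   DEMO_JOIN(a, b)
#define DEMO_SUFFIX(name) DEMO_GLUE(name, DEMO_CODE_UNIT_WIDTH)

/* Width-specific functions, one per imaginary library. */
int demo_unit_bits_8(void)  { return 8; }
int demo_unit_bits_16(void) { return 16; }
int demo_unit_bits_32(void) { return 32; }

/* Generic name: expands to demo_unit_bits_16 with the width chosen above. */
#define demo_unit_bits DEMO_SUFFIX(demo_unit_bits_)
```

       With the width set to 16 as above, a call  written  as  demo_unit_bits()
       is compiled as a call to demo_unit_bits_16().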

       Applications  that use more than one code unit width can be linked with
       more than one PCRE2 library, but must define  PCRE2_CODE_UNIT_WIDTH  to
       be  0  before  including pcre2.h, and then use the real function names.
       Any code that is to be included in an environment where  the  value  of
       PCRE2_CODE_UNIT_WIDTH  is  unknown  should  also  use the real function
       names. (Unfortunately, it is not possible in C code to save and restore
       the value of a macro.)

       If PCRE2_CODE_UNIT_WIDTH is not defined  before  including  pcre2.h,  a
       compiler error occurs.

       When  using  multiple  libraries  in an application, you must take care
       when processing any particular pattern to use  only  functions  from  a
       single  library.   For example, if you want to run a match using a pat-
       tern that was compiled with pcre2_compile_16(), you  must  do  so  with
       pcre2_match_16(), not pcre2_match_8() or pcre2_match_32().

       In  the  function summaries above, and in the rest of this document and
       other PCRE2 documents, functions and data  types  are  described  using
       their generic names, without the _8, _16, or _32 suffix.


PCRE2 API OVERVIEW

       PCRE2  has  its  own  native  API, which is described in this document.
       There are also some wrapper functions for the 8-bit library that corre-
       spond to the POSIX regular expression API, but they do not give  access
       to  all  the  functionality of PCRE2 and they are not thread-safe. They
       are described in the pcre2posix documentation. Both these APIs define a
       set of C function calls.

       The native API C data types, function prototypes,  option  values,  and
       error codes are defined in the header file pcre2.h, which also contains
       definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release
       numbers  for the library. Applications can use these to include support
       for different releases of PCRE2.

       In a Windows environment, if you want to statically link an application
       program against a non-dll PCRE2 library, you must  define  PCRE2_STATIC
       before including pcre2.h.

       The  functions pcre2_compile() and pcre2_match() are used for compiling
       and matching regular expressions in a Perl-compatible manner. A  sample
       program that demonstrates the simplest way of using them is provided in
       the file called pcre2demo.c in the PCRE2 source distribution. A listing
       of  this  program  is  given  in  the  pcre2demo documentation, and the
       pcre2sample documentation describes how to compile and run it.

       The compiling and matching functions recognize various options that are
       passed as bits in an options argument. There are also some more compli-
       cated parameters such as custom memory  management  functions  and  re-
       source  limits  that  are  passed  in "contexts" (which are just memory
       blocks, described below). Simple applications do not need to  make  use
       of contexts.

       Just-in-time  (JIT)  compiler  support  is an optional feature of PCRE2
       that can be built in  appropriate  hardware  environments.  It  greatly
       speeds  up  the matching performance of many patterns. Programs can re-
       quest that it be used if available by calling pcre2_jit_compile() after
       a pattern has been successfully compiled by pcre2_compile(). This  does
       nothing if JIT support is not available.

       More  complicated  programs  might  need  to make use of the specialist
       functions   pcre2_jit_stack_create(),    pcre2_jit_stack_free(),    and
       pcre2_jit_stack_assign()  in order to control the JIT code's memory us-
       age.

       JIT matching is automatically used by pcre2_match() if it is available,
       unless the PCRE2_NO_JIT option is set. There is also a direct interface
       for JIT matching, which gives improved performance at  the  expense  of
       less  sanity  checking. The JIT-specific functions are discussed in the
       pcre2jit documentation.

       A second matching function, pcre2_dfa_match(), which is  not  Perl-com-
       patible,  is  also  provided.  This  uses a different algorithm for the
       matching. The alternative algorithm finds all possible  matches  (at  a
       given  point  in  the subject), and scans the subject just once (unless
       there are lookaround assertions). However, this algorithm does not  re-
       turn  captured substrings. A description of the two matching algorithms
       and their advantages and disadvantages is given  in  the  pcre2matching
       documentation. There is no JIT support for pcre2_dfa_match().

       In  addition  to  the  main compiling and matching functions, there are
       convenience functions for extracting captured substrings from a subject
       string that has been matched by pcre2_match(). They are:

         pcre2_substring_copy_byname()
         pcre2_substring_copy_bynumber()
         pcre2_substring_get_byname()
         pcre2_substring_get_bynumber()
         pcre2_substring_list_get()
         pcre2_substring_length_byname()
         pcre2_substring_length_bynumber()
         pcre2_substring_nametable_scan()
         pcre2_substring_number_from_name()

       pcre2_substring_free() and pcre2_substring_list_free()  are  also  pro-
       vided,  to  free  memory used for extracted strings. If either of these
       functions is called with a NULL argument, the function returns  immedi-
       ately without doing anything.

       The  function  pcre2_substitute()  can be called to match a pattern and
       return a copy of the subject string with substitutions for  parts  that
       were matched.

       Functions  whose  names begin with pcre2_serialize_ are used for saving
       compiled patterns on disc or elsewhere, and reloading them later.

       Finally, there are functions for finding out information about  a  com-
       piled  pattern  (pcre2_pattern_info()) and about the configuration with
       which PCRE2 was built (pcre2_config()).

       Functions with names ending with _free() are used  for  freeing  memory
       blocks  of  various  sorts.  In all cases, if one of these functions is
       called with a NULL argument, it does nothing.


STRING LENGTHS AND OFFSETS

       The PCRE2 API uses string lengths and  offsets  into  strings  of  code
       units  in  several  places. These values are always of type PCRE2_SIZE,
       which is an unsigned integer type, currently always defined as  size_t.
       The  largest  value  that  can  be  stored  in  such  a  type  (that is
       ~(PCRE2_SIZE)0) is reserved as a special indicator for  zero-terminated
       strings  and  unset offsets.  Therefore, the longest string that can be
       handled is one less than this maximum. Note that string lengths are al-
       ways given in code units. Only in the 8-bit library is  such  a  length
       the same as the number of bytes in the string.
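
       The reserved-maximum convention can be sketched without pcre2.h.  In the
       fragment  below,  DEMO_ZERO_TERMINATED  and demo_effective_length() are
       hypothetical names; in pcre2.h itself  the  reserved  value  is  exposed
       as PCRE2_ZERO_TERMINATED (for lengths) and PCRE2_UNSET (for offsets).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the reserved value: all bits set, which is the
   largest value that a size_t can hold. */
#define DEMO_ZERO_TERMINATED (~(size_t)0)

/* Returns the effective length in code units of an 8-bit string: the
   explicit length, or the distance to the terminating zero when the
   reserved value is passed instead. */
size_t demo_effective_length(const char *s, size_t length)
{
    return (length == DEMO_ZERO_TERMINATED) ? strlen(s) : length;
}
```

       Because the maximum value is taken  by  the  indicator,  it  can  never
       be  mistaken for a real length, which is why the longest usable string
       is one code unit shorter than that maximum.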


NEWLINES

       PCRE2 supports five different conventions for indicating line breaks in
       strings:  a  single  CR (carriage return) character, a single LF (line-
       feed) character, the two-character sequence CRLF, any of the three pre-
       ceding, or any Unicode newline sequence. The Unicode newline  sequences
       are  the  three just mentioned, plus the single characters VT (vertical
       tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
       separator, U+2028), and PS (paragraph separator, U+2029).
660
661       Each of the first three conventions is used by at least  one  operating
662       system as its standard newline sequence. When PCRE2 is built, a default
663       can be specified.  If it is not, the default is set to LF, which is the
664       Unix standard. However, the newline convention can be changed by an ap-
665       plication  when calling pcre2_compile(), or it can be specified by spe-
666       cial text at the start of the pattern itself; this overrides any  other
667       settings.  See the pcre2pattern page for details of the special charac-
668       ter sequences.
669
670       In the PCRE2 documentation the word "newline"  is  used  to  mean  "the
671       character or pair of characters that indicate a line break". The choice
672       of  newline convention affects the handling of the dot, circumflex, and
673       dollar metacharacters, the handling of #-comments in /x mode, and, when
674       CRLF is a recognized line ending sequence, the match position  advance-
675       ment for a non-anchored pattern. There is more detail about this in the
676       section on pcre2_match() options below.
677
678       The  choice of newline convention does not affect the interpretation of
679       the \n or \r escape sequences, nor does it affect what \R matches; this
680       has its own separate convention.
681
682
683MULTITHREADING
684
685       In a multithreaded application it is important to keep  thread-specific
686       data  separate  from data that can be shared between threads. The PCRE2
687       library code itself is thread-safe: it contains  no  static  or  global
688       variables. The API is designed to be fairly simple for non-threaded ap-
689       plications  while at the same time ensuring that multithreaded applica-
690       tions can use it.
691
692       There are several different blocks of data that are used to pass infor-
693       mation between the application and the PCRE2 libraries.
694
695   The compiled pattern
696
697       A pointer to the compiled form of a pattern is  returned  to  the  user
698       when pcre2_compile() is successful. The data in the compiled pattern is
699       fixed,  and  does not change when the pattern is matched. Therefore, it
700       is thread-safe, that is, the same compiled pattern can be used by  more
701       than one thread simultaneously. For example, an application can compile
702       all its patterns at the start, before forking off multiple threads that
703       use  them.  However,  if the just-in-time (JIT) optimization feature is
704       being used, it needs separate memory stack areas for each  thread.  See
705       the pcre2jit documentation for more details.
706
707       In  a more complicated situation, where patterns are compiled only when
708       they are first needed, but are still shared between  threads,  pointers
709       to  compiled  patterns  must  be protected from simultaneous writing by
710       multiple threads. This is somewhat tricky to do correctly. If you  know
711       that  writing  to  a pointer is atomic in your environment, you can use
712       logic like this:
713
714         Get a read-only (shared) lock (mutex) for pointer
715         if (pointer == NULL)
716           {
717           Get a write (unique) lock for pointer
718           if (pointer == NULL) pointer = pcre2_compile(...
719           }
720         Release the lock
721         Use pointer in pcre2_match()
722
723       Of course, testing for compilation errors should also  be  included  in
724       the code.
725
726       The  reason  for checking the pointer a second time is as follows: Sev-
727       eral threads may have acquired the shared lock and tested  the  pointer
728       for being NULL, but only one of them will be given the write lock, with
729       the  rest kept waiting. The winning thread will compile the pattern and
730       store the result.  After this thread releases the write  lock,  another
731       thread  will  get it, and if it does not retest pointer for being NULL,
732       will recompile the pattern and overwrite the pointer, creating a memory
733       leak and possibly causing other issues.
734
735       In an environment where writing to a pointer may  not  be  atomic,  the
736       above  logic  is not sufficient. The thread that is doing the compiling
737       may be descheduled after writing only part of the pointer, which  could
738       cause  other  threads  to use an invalid value. Instead of checking the
739       pointer itself, a separate "pointer is valid" flag (that can be updated
740       atomically) must be used:
741
742         Get a read-only (shared) lock (mutex) for pointer
743         if (!pointer_is_valid)
744           {
745           Get a write (unique) lock for pointer
746           if (!pointer_is_valid)
747             {
748             pointer = pcre2_compile(...
749             pointer_is_valid = TRUE
750             }
751           }
752         Release the lock
753         Use pointer in pcre2_match()
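
       The logic above might be sketched in C with a POSIX read-write lock as
       follows.  This is an illustration under stated assumptions, not part of
       the PCRE2 API: lazy_pattern, compile_fn, lazy_pattern_get(),  and  de-
       mo_compile()  are  hypothetical  names,  and  the compile step is a
       callback standing in for pcre2_compile(). POSIX read-write locks cannot
       be upgraded, so the shared lock is released before the write  lock  is
       taken, which is exactly why the second test is needed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical callback type: stands in for a call to pcre2_compile(). */
typedef void *(*compile_fn)(void *arg);

typedef struct {
  pthread_rwlock_t lock;
  void *pointer;            /* compiled pattern; meaningful only if valid */
  int pointer_is_valid;     /* read and written only while holding lock */
} lazy_pattern;

#define LAZY_PATTERN_INIT { PTHREAD_RWLOCK_INITIALIZER, NULL, 0 }

/* Return the shared compiled pattern, compiling it on first use.
   Returns NULL if compilation fails; a later call will try again. */
void *lazy_pattern_get(lazy_pattern *lp, compile_fn compile, void *arg)
{
  void *result;
  assert(lp != NULL && compile != NULL);

  pthread_rwlock_rdlock(&lp->lock);        /* shared lock: fast path */
  if (lp->pointer_is_valid) {
    result = lp->pointer;
    pthread_rwlock_unlock(&lp->lock);
    return result;
  }
  pthread_rwlock_unlock(&lp->lock);        /* no upgrade: release first */

  pthread_rwlock_wrlock(&lp->lock);        /* unique lock */
  if (!lp->pointer_is_valid) {             /* retest: another thread may have won */
    lp->pointer = compile(arg);
    lp->pointer_is_valid = (lp->pointer != NULL);
  }
  result = lp->pointer;
  pthread_rwlock_unlock(&lp->lock);
  return result;
}

/* Demonstration stand-in for a compiler: counts how often it is called. */
void *demo_compile(void *arg)
{
  int *calls = arg;
  (*calls)++;
  return calls;
}
```

       Since  the  pointer and the flag are only ever read under the shared or
       unique lock, this sketch does not rely on pointer writes being  atomic.
       A  real  application would also report the error code and offset from a
       failed compilation.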
754
755       If JIT is being used, but the JIT compilation is not being done immedi-
756       ately (perhaps waiting to see if the pattern  is  used  often  enough),
757       similar  logic  is required. JIT compilation updates a value within the
758       compiled code block, so a thread must gain unique write access  to  the
759       pointer     before    calling    pcre2_jit_compile().    Alternatively,
760       pcre2_code_copy() or pcre2_code_copy_with_tables() can be used  to  ob-
761       tain  a  private  copy of the compiled code before calling the JIT com-
762       piler.
763
764   Context blocks
765
766       The next main section below introduces the idea of "contexts" in  which
767       PCRE2 functions are called. A context is nothing more than a collection
768       of parameters that control the way PCRE2 operates. Grouping a number of
769       parameters together in a context is a convenient way of passing them to
770       a  PCRE2  function without using lots of arguments. The parameters that
771       are stored in contexts are in some sense  "advanced  features"  of  the
772       API. Many straightforward applications will not need to use contexts.
773
774       In a multithreaded application, if the parameters in a context are val-
775       ues  that  are  never  changed, the same context can be used by all the
776       threads. However, if any thread needs to change any value in a context,
777       it must make its own thread-specific copy.
778
779   Match blocks
780
781       The matching functions need a block of memory for storing  the  results
782       of a match. This includes details of what was matched, as well as addi-
783       tional  information  such as the name of a (*MARK) setting. Each thread
784       must provide its own copy of this memory.
785
786
787PCRE2 CONTEXTS
788
789       Some PCRE2 functions have a lot of parameters, many of which  are  used
790       only  by  specialist  applications,  for example, those that use custom
791       memory management or non-standard character tables.  To  keep  function
792       argument  lists  at a reasonable size, and at the same time to keep the
793       API extensible, "uncommon" parameters are passed to  certain  functions
794       in  a  context instead of directly. A context is just a block of memory
795       that holds the parameter values.  Applications that do not need to  ad-
796       just any of the context parameters can pass NULL when a context pointer
797       is required.
798
799       There  are  three different types of context: a general context that is
800       relevant for several PCRE2 operations, a compile-time  context,  and  a
801       match-time context.
802
803   The general context
804
805       At  present,  this context just contains pointers to (and data for) ex-
806       ternal memory management functions that are called from several  places
807       in  the  PCRE2  library.  The  context  is  named `general' rather than
808       specifically `memory' because in future other fields may be  added.  If
809       you  do not want to supply your own custom memory management functions,
810       you do not need to bother with a general context. A general context  is
811       created by:
812
813       pcre2_general_context *pcre2_general_context_create(
814         void *(*private_malloc)(PCRE2_SIZE, void *),
815         void (*private_free)(void *, void *), void *memory_data);
816
817       The  two  function pointers specify custom memory management functions,
818       whose prototypes are:
819
820         void *private_malloc(PCRE2_SIZE, void *);
821         void  private_free(void *, void *);
822
823       Whenever code in PCRE2 calls these functions, the final argument is the
824       value of memory_data. Either of the first two arguments of the creation
825       function may be NULL, in which case the system memory management  func-
826       tions  malloc()  and free() are used. (This is not currently useful, as
827       there are no other fields in a general context,  but  in  future  there
828       might  be.)  The private_malloc() function is used (if supplied) to ob-
829       tain memory for storing the context, and all three values are saved  as
830       part of the context.
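
       As an illustration of the two prototypes, the following sketch shows  a
       pair  of  memory  management functions that count live blocks. The names
       alloc_stats, counting_malloc(), and counting_free() are hypothetical,
       and size_t is written in place of PCRE2_SIZE (currently the same type)
       so that the fragment does not depend on the PCRE2 header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping record, passed to PCRE2 as memory_data. */
typedef struct {
  size_t live_blocks;   /* blocks obtained and not yet freed */
  size_t total_bytes;   /* bytes requested over the whole run */
} alloc_stats;

/* Matches the private_malloc prototype: the final argument is the
   memory_data value that was given when the context was created. */
void *counting_malloc(size_t size, void *memory_data)
{
  alloc_stats *stats = memory_data;
  void *block;
  assert(stats != NULL);
  block = malloc(size);
  if (block != NULL) {
    stats->live_blocks++;
    stats->total_bytes += size;
  }
  return block;
}

/* Matches the private_free prototype. A NULL block is ignored. */
void counting_free(void *block, void *memory_data)
{
  alloc_stats *stats = memory_data;
  assert(stats != NULL);
  if (block == NULL) return;
  stats->live_blocks--;
  free(block);
}
```

       With the PCRE2 header included, these  could  be  installed  by  passing
       counting_malloc,  counting_free,  and  a  pointer  to  an alloc_stats
       variable to pcre2_general_context_create(). The alloc_stats variable
       must outlive every block that was allocated through the context.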
831
832       Whenever  PCRE2  creates a data block of any kind, the block contains a
833       pointer to the free() function that matches the malloc() function  that
834       was  used.  When  the  time  comes  to free the block, this function is
835       called.
836
837       A general context can be copied by calling:
838
839       pcre2_general_context *pcre2_general_context_copy(
840         pcre2_general_context *gcontext);
841
842       The memory used for a general context should be freed by calling:
843
844       void pcre2_general_context_free(pcre2_general_context *gcontext);
845
846       If this function is passed a  NULL  argument,  it  returns  immediately
847       without doing anything.
848
849   The compile context
850
851       A  compile context is required if you want to provide an external func-
852       tion for stack checking during compilation or  to  change  the  default
853       values of any of the following compile-time parameters:
854
855         What \R matches (Unicode newlines or CR, LF, CRLF only)
856         PCRE2's character tables
857         The newline character sequence
858         The compile time nested parentheses limit
859         The maximum length of the pattern string
860         The extra options bits (none set by default)
861
862       A  compile context is also required if you are using custom memory man-
863       agement.  If none of these apply, just pass NULL as the  context  argu-
864       ment of pcre2_compile().
865
866       A  compile context is created, copied, and freed by the following func-
867       tions:
868
869       pcre2_compile_context *pcre2_compile_context_create(
870         pcre2_general_context *gcontext);
871
872       pcre2_compile_context *pcre2_compile_context_copy(
873         pcre2_compile_context *ccontext);
874
875       void pcre2_compile_context_free(pcre2_compile_context *ccontext);
876
877       A compile context is created with default values  for  its  parameters.
878       These can be changed by calling the following functions, which return 0
879       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
880
881       int pcre2_set_bsr(pcre2_compile_context *ccontext,
882         uint32_t value);
883
884       The  value  must  be PCRE2_BSR_ANYCRLF, to specify that \R matches only
885       CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R  matches  any
886       Unicode line ending sequence. The value is used by the JIT compiler and
887       by   the   two   interpreted   matching  functions,  pcre2_match()  and
888       pcre2_dfa_match().
889
890       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
891         const uint8_t *tables);
892
893       The value must be the result of a  call  to  pcre2_maketables(),  whose
894       only argument is a general context. This function builds a set of char-
895       acter tables in the current locale.
896
897       int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
898         uint32_t extra_options);
899
900       As  PCRE2  has developed, almost all the 32 option bits that are avail-
901       able in the options argument of pcre2_compile() have been used  up.  To
902       avoid  running  out, the compile context contains a set of extra option
903       bits which are used for some newer, assumed rarer, options. This  func-
904       tion sets those bits. It always sets all the bits (either on  or  off),
905       replacing rather than adding to any existing setting. The available op-
906       tions are defined in the section entitled "Extra compile options" below.
907
908       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
909         PCRE2_SIZE value);
910
911       This  sets a maximum length, in code units, for any pattern string that
912       is compiled with this context. If the pattern is longer,  an  error  is
913       generated.   This facility is provided so that applications that accept
914       patterns from external sources can limit their size. The default is the
915       largest number that a PCRE2_SIZE variable can  hold,  which  is  effec-
916       tively unlimited.
917
918       int pcre2_set_max_pattern_compiled_length(
919         pcre2_compile_context *ccontext, PCRE2_SIZE value);
920
921       This  sets  a maximum size, in bytes, for the memory needed to hold the
922       compiled version of a pattern that is compiled with  this  context.  If
923       the  pattern needs more memory, an error is generated. This facility is
924       provided so  that  applications  that  accept  patterns  from  external
925       sources  can  limit  the  amount of memory they use. The default is the
926       largest number that a PCRE2_SIZE variable can  hold,  which  is  effec-
927       tively unlimited.
928
929       int pcre2_set_max_varlookbehind(pcre2_compile_context *ccontext,
930         uint32_t value);
931
932       This sets a maximum for the  number  of  characters  matched  by  a
933       variable-length lookbehind assertion. The default is set when PCRE2  is
934       built,  with  the ultimate default being 255, the same as Perl. Lookbe-
935       hind assertions without a bounding length are not supported.
936
937       int pcre2_set_newline(pcre2_compile_context *ccontext,
938         uint32_t value);
939
940       This specifies which characters or character sequences are to be recog-
941       nized as newlines. The value must be one of PCRE2_NEWLINE_CR  (carriage
942       return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the
943       two-character  sequence  CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
944       of the above), PCRE2_NEWLINE_ANY (any  Unicode  newline  sequence),  or
945       PCRE2_NEWLINE_NUL (the NUL character, that is a binary zero).
946
947       A pattern can override the value set in the compile context by starting
948       with a sequence such as (*CRLF). See the pcre2pattern page for details.
949
950       When  a  pattern  is  compiled  with  the  PCRE2_EXTENDED  or PCRE2_EX-
951       TENDED_MORE option, the newline convention affects the  recognition  of
952       the  end  of internal comments starting with #. The value is saved with
953       the compiled pattern for subsequent use by the JIT compiler and by  the
954       two     interpreted     matching     functions,    pcre2_match()    and
955       pcre2_dfa_match().
956
957       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
958         uint32_t value);
959
960       This parameter adjusts the limit, set  when  PCRE2  is  built  (default
961       250),  on  the  depth  of  parenthesis nesting in a pattern. This limit
962       stops rogue patterns using up too much system  stack  when  being  com-
963       piled.  The limit applies to parentheses of all kinds, not just captur-
964       ing parentheses.
965
966       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
967         int (*guard_function)(uint32_t, void *), void *user_data);
968
969       There is at least one application that runs PCRE2 in threads with  very
970       limited  system  stack,  where running out of stack is to be avoided at
971       all costs. The parenthesis limit above cannot take account of how  much
972       stack  is  actually  available during compilation. For a finer control,
973       you can supply a  function  that  is  called  whenever  pcre2_compile()
974       starts  to compile a parenthesized part of a pattern. This function can
975       check the actual stack size (or anything else  that  it  wants  to,  of
976       course).
977
978       The  first  argument  to the guard function gives the current depth of
979       nesting, and the second is user data that is set up by the  last  argu-
980       ment  of  pcre2_set_compile_recursion_guard(). The guard function should
981       return zero if all is well, or non-zero to force an error.
982
983   The match context
984
985       A match context is required if you want to:
986
987         Set up a callout function
988         Set an offset limit for matching an unanchored pattern
989         Change the limit on the amount of heap used when matching
990         Change the backtracking match limit
991         Change the backtracking depth limit
992         Set custom memory management specifically for the match
993
994       If none of these apply, just pass  NULL  as  the  context  argument  of
995       pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().
996
997       A  match  context  is created, copied, and freed by the following func-
998       tions:
999
1000       pcre2_match_context *pcre2_match_context_create(
1001         pcre2_general_context *gcontext);
1002
1003       pcre2_match_context *pcre2_match_context_copy(
1004         pcre2_match_context *mcontext);
1005
1006       void pcre2_match_context_free(pcre2_match_context *mcontext);
1007
1008       A match context is created with  default  values  for  its  parameters.
1009       These can be changed by calling the following functions, which return 0
1010       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
1011
1012       int pcre2_set_callout(pcre2_match_context *mcontext,
1013         int (*callout_function)(pcre2_callout_block *, void *),
1014         void *callout_data);
1015
1016       This  sets  up a callout function for PCRE2 to call at specified points
1017       during a matching operation. Details are given in the pcre2callout doc-
1018       umentation.
1019
1020       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
1021         int (*callout_function)(pcre2_substitute_callout_block *, void *),
1022         void *callout_data);
1023
1024       This sets up a callout function for PCRE2 to call after each  substitu-
1025       tion made by pcre2_substitute(). Details are given in the section enti-
1026       tled "Creating a new string with substitutions" below.
1027
1028       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
1029         PCRE2_SIZE value);
1030
1031       The  offset_limit parameter limits how far an unanchored search can ad-
1032       vance in the subject string. The  default  value  is  PCRE2_UNSET.  The
1033       pcre2_match()  and  pcre2_dfa_match()  functions return PCRE2_ERROR_NO-
1034       MATCH if a match with a starting point before or at the given offset is
1035       not found. The pcre2_substitute() function makes no more substitutions.
1036
1037       For example, if the pattern /abc/ is matched against "123abc"  with  an
1038       offset  limit  less  than 3, the result is PCRE2_ERROR_NOMATCH. A match
1039       can never be  found  if  the  startoffset  argument  of  pcre2_match(),
1040       pcre2_dfa_match(),  or  pcre2_substitute()  is  greater than the offset
1041       limit set in the match context.
1042
1043       When using this facility, you must set the  PCRE2_USE_OFFSET_LIMIT  op-
1044       tion when calling pcre2_compile() so that when JIT is in use, different
1045       code  can  be compiled. If a match is started with a non-default offset
1046       limit when PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
1047
1048       The offset limit facility can be used to track progress when  searching
1049       large  subject  strings or to limit the extent of global substitutions.
1050       See also the PCRE2_FIRSTLINE option, which requires a  match  to  start
1051       before  or  at  the first newline that follows the start of matching in
1052       the subject. If this is set with an offset limit, a match must occur in
1053       the first line and also  within  the  offset  limit.  In  other  words,
1054       whichever limit comes first is used.
1055
1056       int pcre2_set_heap_limit(pcre2_match_context *mcontext,
1057         uint32_t value);
1058
1059       The heap_limit parameter specifies, in units of kibibytes (1024 bytes),
1060       the  maximum  amount  of heap memory that pcre2_match() may use to hold
1061       backtracking information when running an interpretive match. This limit
1062       also applies to pcre2_dfa_match(), which may use the heap when process-
1063       ing patterns with a lot of nested pattern recursion or  lookarounds  or
1064       atomic groups. This limit does not apply to matching with the JIT opti-
1065       mization,  which  has  its  own  memory  control  arrangements (see the
1066       pcre2jit documentation for more details). If the limit is reached,  the
1067       negative  error  code  PCRE2_ERROR_HEAPLIMIT  is  returned. The default
1068       limit can be set when PCRE2 is built; if it is not, the default is  set
1069       very large and is essentially unlimited.
1070
1071       A value for the heap limit may also be supplied by an item at the start
1072       of a pattern of the form
1073
1074         (*LIMIT_HEAP=ddd)
1075
1076       where  ddd  is a decimal number. However, such a setting is ignored un-
1077       less ddd is less than the limit set by the caller of pcre2_match()  or,
1078       if no such limit is set, less than the default.
1079
1080       The  pcre2_match() function always needs some heap memory, so setting a
1081       value of zero guarantees a "heap limit exceeded" error. Details of  how
1082       pcre2_match()  uses  the  heap are given in the pcre2perform documenta-
1083       tion.
1084
1085       For pcre2_dfa_match(), a vector on the system stack is used  when  pro-
1086       cessing  pattern recursions, lookarounds, or atomic groups, and only if
1087       this is not big enough is heap memory used. In  this  case,  setting  a
1088       value of zero disables the use of the heap.
1089
1090       int pcre2_set_match_limit(pcre2_match_context *mcontext,
1091         uint32_t value);
1092
1093       The match_limit parameter provides a means of preventing PCRE2 from us-
1094       ing  up  too many computing resources when processing patterns that are
1095       not going to match, but which have a very large number of possibilities
1096       in their search trees. The classic  example  is  a  pattern  that  uses
1097       nested unlimited repeats.
1098
1099       There  is an internal counter in pcre2_match() that is incremented each
1100       time round its main matching loop. If  this  value  reaches  the  match
1101       limit, pcre2_match() returns the negative value PCRE2_ERROR_MATCHLIMIT.
1102       This  has  the  effect  of limiting the amount of backtracking that can
1103       take place. For patterns that are not anchored, the count restarts from
1104       zero for each position in the subject string. This limit  also  applies
1105       to pcre2_dfa_match(), though the counting is done in a different way.
1106
1107       When  pcre2_match()  is  called  with  a  pattern that was successfully
1108       processed by pcre2_jit_compile(), the way in which matching is executed
1109       is entirely different. However, there is still the possibility of  run-
1110       away matching that goes on for a very long time, and so the match_limit
1111       value  is  also used in this case (but in a different way) to limit how
1112       long the matching can continue.
1113
1114       The default value for the limit can be set when PCRE2 is built; the de-
1115       fault is 10 million, which handles all but the most  extreme  cases.  A
1116       value  for the match limit may also be supplied by an item at the start
1117       of a pattern of the form
1118
1119         (*LIMIT_MATCH=ddd)
1120
1121       where ddd is a decimal number. However, such a setting is  ignored  un-
1122       less  ddd  is less than the limit set by the caller of pcre2_match() or
1123       pcre2_dfa_match() or, if no such limit is set, less than the default.
1124
1125       int pcre2_set_depth_limit(pcre2_match_context *mcontext,
1126         uint32_t value);
1127
1128       This  parameter  limits   the   depth   of   nested   backtracking   in
1129       pcre2_match().   Each time a nested backtracking point is passed, a new
1130       memory frame is used to remember the state of matching at  that  point.
1131       Thus,  this  parameter  indirectly  limits the amount of memory that is
1132       used in a match. However, because the size of each memory frame depends
1133       on the number of capturing parentheses, the actual memory limit  varies
1134       from  pattern to pattern. This limit was more useful in versions before
1135       10.30, where function recursion was used for backtracking.
1136
1137       The depth limit is not relevant, and is ignored, when matching is  done
1138       using JIT compiled code. However, it is supported by pcre2_dfa_match(),
1139       which  uses it to limit the depth of nested internal recursive function
1140       calls that implement atomic groups, lookaround assertions, and  pattern
1141       recursions. This limits, indirectly, the amount of system stack that is
1142       used.  It  was  more useful in versions before 10.32, when stack memory
1143       was used for local workspace vectors for recursive function calls. From
1144       version 10.32, only local variables are allocated on the stack  and  as
1145       each call uses only a few hundred bytes, even a small stack can support
1146       quite a lot of recursion.
1147
1148       If  the depth of internal recursive function calls is great enough, lo-
1149       cal workspace vectors are allocated on the heap from version 10.32  on-
1150       wards,  so  the  depth  limit also indirectly limits the amount of heap
1151       memory that is used. A recursive pattern such as /(.(?2))((?1)|)/, when
1152       matched to a very long string using pcre2_dfa_match(), can use a  great
1153       deal  of memory. However, it is probably better to limit heap usage di-
1154       rectly by calling pcre2_set_heap_limit().
1155
1156       The default value for the depth limit can be set when PCRE2  is  built;
1157       if  it  is not, the default is set to the same value as the default for
1158       the  match  limit.   If  the  limit  is  exceeded,   pcre2_match()   or
1159       pcre2_dfa_match() returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth
1160       limit  may also be supplied by an item at the start of a pattern of the
1161       form
1162
1163         (*LIMIT_DEPTH=ddd)
1164
1165       where ddd is a decimal number. However, such a setting is  ignored  un-
1166       less  ddd  is less than the limit set by the caller of pcre2_match() or
1167       pcre2_dfa_match() or, if no such limit is set, less than the default.
1168
1169
1170CHECKING BUILD-TIME OPTIONS
1171
1172       int pcre2_config(uint32_t what, void *where);
1173
1174       The function pcre2_config() makes it possible for  a  PCRE2  client  to
1175       find  the  value  of  certain  configuration parameters and to discover
1176       which optional features have been compiled into the PCRE2 library.  The
1177       pcre2build documentation has more details about these features.
1178
1179       The  first  argument  for pcre2_config() specifies which information is
1180       required. The second argument is a pointer to memory into which the in-
1181       formation is placed. If NULL is passed, the function returns the amount
1182       of memory that is needed for the requested information. For calls  that
1183       return numerical values, the amount is in bytes;  when  requesting  these
1184       values, where should point to appropriately aligned memory.  For  calls
1185       that  return  strings,  the required length is given in code units, not
1186       counting the terminating zero.
1187
1188       When requesting information, the returned value from pcre2_config()  is
1189       non-negative  on success, or the negative error code PCRE2_ERROR_BADOP-
1190       TION if the value in the first argument is not recognized. The  follow-
1191       ing information is available:
1192
1193         PCRE2_CONFIG_BSR
1194
1195       The  output  is a uint32_t integer whose value indicates what character
1196       sequences the \R  escape  sequence  matches  by  default.  A  value  of
1197       PCRE2_BSR_UNICODE  means  that  \R  matches any Unicode line ending se-
1198       quence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR, LF,
1199       or CRLF. The default can be overridden when a pattern is compiled.
1200
1201         PCRE2_CONFIG_COMPILED_WIDTHS
1202
1203       The output is a uint32_t integer whose lower bits indicate  which  code
1204       unit  widths  were  selected  when PCRE2 was built. The 1-bit indicates
1205       8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit  sup-
1206       port, respectively.
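
       The bit layout just described can be decoded with a helper such as  the
       one  sketched below. The name width_is_supported() is hypothetical;
       the widths argument is assumed to be the uint32_t value  obtained  from
       pcre2_config() with PCRE2_CONFIG_COMPILED_WIDTHS.

```c
#include <assert.h>
#include <stdint.h>

/* Test whether a code unit width (8, 16, or 32) is flagged in the
   value reported for PCRE2_CONFIG_COMPILED_WIDTHS. */
int width_is_supported(uint32_t widths, unsigned int code_unit_width)
{
  assert(code_unit_width == 8 || code_unit_width == 16 ||
         code_unit_width == 32);
  switch (code_unit_width) {
    case 8:  return (widths & 1u) != 0;   /* the 1-bit: 8-bit support */
    case 16: return (widths & 2u) != 0;   /* the 2-bit: 16-bit support */
    default: return (widths & 4u) != 0;   /* the 4-bit: 32-bit support */
  }
}
```

       For example, a reported value of 5 would mean that the 8-bit and 32-bit
       libraries were built, but not the 16-bit one.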
1207
1208         PCRE2_CONFIG_DEPTHLIMIT
1209
1210       The  output  is a uint32_t integer that gives the default limit for the
1211       depth of nested backtracking in pcre2_match() or the  depth  of  nested
1212       recursions,  lookarounds,  and atomic groups in pcre2_dfa_match(). Fur-
1213       ther details are given with pcre2_set_depth_limit() above.
1214
1215         PCRE2_CONFIG_HEAPLIMIT
1216
1217       The output is a uint32_t integer that gives, in kibibytes, the  default
1218       limit   for  the  amount  of  heap  memory  used  by  pcre2_match()  or
1219       pcre2_dfa_match().     Further     details     are      given      with
1220       pcre2_set_heap_limit() above.
1221
1222         PCRE2_CONFIG_JIT
1223
1224       The  output  is  a  uint32_t  integer that is set to one if support for
1225       just-in-time compiling is included in the library; otherwise it is  set
1226       to zero. Note that having the support in the library does not guarantee
1227       that  JIT will be used for any given match. See the pcre2jit documenta-
1228       tion for more details.
1229
1230         PCRE2_CONFIG_JITTARGET
1231
1232       The where argument should point to a buffer that is at  least  48  code
1233       units  long.  (The  exact  length  required  can  be  found  by calling
1234       pcre2_config() with where set to NULL.) The buffer  is  filled  with  a
1235       string  that  contains  the  name of the architecture for which the JIT
1236       compiler is configured, for example "x86 32bit  (little  endian  +  un-
1237       aligned)".  If  JIT  support is not available, PCRE2_ERROR_BADOPTION is
1238       returned, otherwise the number of code units used is returned. This  is
1239       the length of the string, plus one unit for the terminating zero.
1240
1241         PCRE2_CONFIG_LINKSIZE
1242
1243       The output is a uint32_t integer that contains the number of bytes used
1244       for  internal  linkage  in  compiled regular expressions. When PCRE2 is
1245       configured, the value can be set to 2, 3, or 4, with the default  being
1246       2.  This is the value that is returned by pcre2_config(). However, when
1247       the 16-bit library is compiled, a value of 3 is rounded up  to  4,  and
1248       when  the  32-bit  library  is compiled, internal linkages always use 4
1249       bytes, so the configured value is not relevant.
1250
1251       The default value of 2 for the 8-bit and 16-bit libraries is sufficient
1252       for all but the most massive patterns, since it allows the size of  the
1253       compiled  pattern  to  be  up  to 65535 code units. Larger values allow
1254       larger regular expressions to be compiled by those two  libraries,  but
1255       at the expense of slower matching.
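
       The rounding rules above can be summarized in a few lines of code. This
       is only a sketch of the rules as stated in this section, not code taken
       from  the library; effective_link_size() is a hypothetical helper name.

```c
#include <stdint.h>

/* Returns the number of bytes actually used for internal linkage, given
   the configured link size (2, 3, or 4) and the code unit width of the
   library (8, 16, or 32), following the rules stated in the text above.
   Returns 0 for arguments outside those ranges. */
uint32_t effective_link_size(uint32_t configured, int width)
{
  if (configured < 2 || configured > 4) return 0;
  switch (width)
    {
    case 8:  return configured;                          /* used as given */
    case 16: return (configured == 3) ? 4 : configured;  /* 3 rounds to 4 */
    case 32: return 4;                                   /* always 4 */
    default: return 0;
    }
}
```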
1256
1257         PCRE2_CONFIG_MATCHLIMIT
1258
1259       The output is a uint32_t integer that gives the default match limit for
1260       pcre2_match().  Further  details are given with pcre2_set_match_limit()
1261       above.
1262
1263         PCRE2_CONFIG_NEWLINE
1264
1265       The output is a uint32_t integer  whose  value  specifies  the  default
1266       character  sequence that is recognized as meaning "newline". The values
1267       are:
1268
1269         PCRE2_NEWLINE_CR       Carriage return (CR)
1270         PCRE2_NEWLINE_LF       Linefeed (LF)
1271         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
1272         PCRE2_NEWLINE_ANY      Any Unicode line ending
1273         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
1274         PCRE2_NEWLINE_NUL      The NUL character (binary zero)
1275
1276       The default should normally correspond to  the  standard  sequence  for
1277       your operating system.
1278
1279         PCRE2_CONFIG_NEVER_BACKSLASH_C
1280
1281       The  output  is  a uint32_t integer that is set to one if the use of \C
1282       was permanently disabled when PCRE2 was built; otherwise it is  set  to
1283       zero.
1284
1285         PCRE2_CONFIG_PARENSLIMIT
1286
1287       The  output is a uint32_t integer that gives the maximum depth of nest-
1288       ing of parentheses (of any kind) in a pattern. This limit is imposed to
1289       cap the amount of system stack used when a pattern is compiled.  It  is
1290       specified  when PCRE2 is built; the default is 250. This limit does not
1291       take into account the stack that may already be used by the calling ap-
1292       plication.  For  finer  control  over  compilation  stack  usage,   see
1293       pcre2_set_compile_recursion_guard().
1294
1295         PCRE2_CONFIG_STACKRECURSE
1296
1297       This parameter is obsolete and should not be used in new code. The out-
1298       put is a uint32_t integer that is always set to zero.
1299
1300         PCRE2_CONFIG_TABLES_LENGTH
1301
1302       The output is a uint32_t integer that gives the length of PCRE2's char-
1303       acter  processing  tables in bytes. For details of these tables see the
1304       section on locale support below.
1305
1306         PCRE2_CONFIG_UNICODE_VERSION
1307
1308       The where argument should point to a buffer that is at  least  24  code
1309       units  long.  (The  exact  length  required  can  be  found  by calling
1310       pcre2_config() with where set to NULL.)  If  PCRE2  has  been  compiled
1311       without  Unicode  support,  the buffer is filled with the text "Unicode
1312       not supported". Otherwise, the Unicode  version  string  (for  example,
1313       "8.0.0")  is  inserted. The number of code units used is returned. This
1314       is the length of the string plus one unit for the terminating zero.
1315
1316         PCRE2_CONFIG_UNICODE
1317
1318       The output is a uint32_t integer that is set to one if Unicode  support
1319       is  available; otherwise it is set to zero. Unicode support implies UTF
1320       support.
1321
1322         PCRE2_CONFIG_VERSION
1323
1324       The where argument should point to a buffer that is at  least  24  code
1325       units  long.  (The  exact  length  required  can  be  found  by calling
1326       pcre2_config() with where set to NULL.) The buffer is filled  with  the
1327       PCRE2 version string, zero-terminated. The number of code units used is
1328       returned. This is the length of the string plus one unit for the termi-
1329       nating zero.
1330
1331
1332COMPILING A PATTERN
1333
1334       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
1335         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
1336         pcre2_compile_context *ccontext);
1337
1338       void pcre2_code_free(pcre2_code *code);
1339
1340       pcre2_code *pcre2_code_copy(const pcre2_code *code);
1341
1342       pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);
1343
1344       The  pcre2_compile() function compiles a pattern into an internal form.
1345       The pattern is defined by a pointer to a string of  code  units  and  a
1346       length in code units. If the pattern is zero-terminated, the length can
1347       be  specified  as  PCRE2_ZERO_TERMINATED. A NULL pattern pointer with a
1348       length of zero is treated as an empty  string  (NULL  with  a  non-zero
1349       length  causes  an  error  return). The function returns a pointer to a
1350       block of memory that contains the compiled pattern and related data, or
1351       NULL if an error occurred.
1352
1353       If the compile context argument ccontext is NULL, memory for  the  com-
1354       piled  pattern  is  obtained  by calling malloc(). Otherwise, it is ob-
1355       tained from the same memory function that was used for the compile con-
1356       text. The caller must free the memory by calling pcre2_code_free() when
1357       it is no longer needed.  If pcre2_code_free() is called with a NULL ar-
1358       gument, it returns immediately, without doing anything.
1359
1360       The function pcre2_code_copy() makes a copy of the compiled code in new
1361       memory, using the same memory allocator as was used for  the  original.
1362       However,  if  the  code has been processed by the JIT compiler (see be-
1363       low), the JIT information cannot be copied (because it is  position-de-
1364       pendent).   The  new copy can initially be used only for non-JIT match-
1365       ing, though it can be passed to  pcre2_jit_compile()  if  required.  If
1366       pcre2_code_copy() is called with a NULL argument, it returns NULL.
1367
1368       The pcre2_code_copy() function provides a way for individual threads in
1369       a  multithreaded  application  to acquire a private copy of shared com-
1370       piled code.  However, it does not make a copy of the  character  tables
1371       used  by  the compiled pattern; the new pattern code points to the same
1372       tables as the original code.  (See "Locale Support" below  for  details
1373       of  these  character  tables.) In many applications the same tables are
1374       used throughout, so this behaviour is appropriate. Nevertheless,  there
1375       are occasions when a copy of a compiled pattern and the relevant tables
1376       are  needed.  The pcre2_code_copy_with_tables() provides this facility.
1377       Copies of both the code and the tables are  made,  with  the  new  code
1378       pointing  to the new tables. The memory for the new tables is automati-
1379       cally freed when pcre2_code_free() is called for the new  copy  of  the
1380       compiled  code.  If pcre2_code_copy_with_tables() is called with a NULL
1381       argument, it returns NULL.
1382
1383       NOTE: When one of the matching functions is  called,  pointers  to  the
1384       compiled pattern and the subject string are set in the match data block
1385       so  that  they  can be referenced by the substring extraction functions
1386       after a successful match.  After running a match, you must not  free  a
1387       compiled  pattern or a subject string until after all operations on the
1388       match data block have taken place, unless, in the case of  the  subject
1389       string,  you  have used the PCRE2_COPY_MATCHED_SUBJECT option, which is
1390       described in the section entitled "Option bits for  pcre2_match()"  be-
1391       low.
1392
1393       The  options argument for pcre2_compile() contains various bit settings
1394       that affect the compilation. It should be zero if none of them are  re-
1395       quired.  The  available  options  are described below. Some of them (in
1396       particular, those that are compatible with Perl,  but  some  others  as
1397       well)  can  also  be set and unset from within the pattern (see the de-
1398       tailed description in the pcre2pattern documentation).
1399
1400       For those options that can be different in different parts of the  pat-
1401       tern,  the contents of the options argument specifies their settings at
1402       the start of compilation. The  PCRE2_ANCHORED,  PCRE2_ENDANCHORED,  and
1403       PCRE2_NO_UTF_CHECK  options  can be set at the time of matching as well
1404       as at compile time.
1405
1406       Some additional options and less frequently required compile-time para-
1407       meters (for example, the newline setting) can be provided in a  compile
1408       context (as described above).
1409
1410       If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
1411       diately.  Otherwise,  the  variables to which these point are set to an
1412       error code and an offset (number of code units) within the pattern, re-
1413       spectively, when pcre2_compile() returns NULL because a compilation er-
1414       ror has occurred.
1415
1416       There are nearly 100 positive error codes that pcre2_compile() may  re-
1417       turn  if it finds an error in the pattern. There are also some negative
1418       error codes that are used for invalid UTF strings when validity  check-
1419       ing  is  in  force.  These  are  the same as given by pcre2_match() and
1420       pcre2_dfa_match(), and are described in the pcre2unicode documentation.
1421       There is no separate documentation for the positive  error  codes,  be-
1422       cause  the  textual  error  messages  that  are obtained by calling the
1423       pcre2_get_error_message() function (see "Obtaining a textual error mes-
1424       sage" below) should be  self-explanatory.  Macro  names  starting  with
1425       PCRE2_ERROR_  are defined for both positive and negative error codes in
1426       pcre2.h. When compilation is successful errorcode is  set  to  a  value
1427       that  returns  the message "no error" if passed to pcre2_get_error_mes-
1428       sage().
1429
1430       The value returned in erroroffset is an indication of where in the pat-
1431       tern an error occurred. When there is no error,  zero  is  returned.  A
1432       non-zero  value  is  not  necessarily the furthest point in the pattern
1433       that was read. For example, after the error  "lookbehind  assertion  is
1434       not  fixed length", the error offset points to the start of the failing
1435       assertion. For an invalid UTF-8 or UTF-16 string, the offset is that of
1436       the first code unit of the failing character.
1437
1438       Some errors are not detected until the whole pattern has been  scanned;
1439       in  these  cases,  the offset passed back is the length of the pattern.
1440       Note that the offset is in code units, not characters, even  in  a  UTF
1441       mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
1442       acter.
1443
1444       This  code  fragment shows a typical straightforward call to pcre2_com-
1445       pile():
1446
1447         pcre2_code *re;
1448         PCRE2_SIZE erroffset;
1449         int errorcode;
1450         re = pcre2_compile(
1451           (PCRE2_SPTR)"^A.*Z",    /* the pattern */
1452           PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
1453           0,                      /* default options */
1454           &errorcode,             /* for error code */
1455           &erroffset,             /* for error offset */
1456           NULL);                  /* no compile context */
1457
1458
1459   Main compile options
1460
1461       The following names for option bits are defined in the  pcre2.h  header
1462       file:
1463
1464         PCRE2_ANCHORED
1465
1466       If this bit is set, the pattern is forced to be "anchored", that is, it
1467       is  constrained to match only at the first matching point in the string
1468       that is being searched (the "subject string"). This effect can also  be
1469       achieved  by appropriate constructs in the pattern itself, which is the
1470       only way to do it in Perl.
1471
1472         PCRE2_ALLOW_EMPTY_CLASS
1473
1474       By default, for compatibility with Perl, a closing square bracket  that
1475       immediately  follows  an opening one is treated as a data character for
1476       the class. When  PCRE2_ALLOW_EMPTY_CLASS  is  set,  it  terminates  the
1477       class, which therefore contains no characters and so can never match.
1478
1479         PCRE2_ALT_BSUX
1480
1481       This  option  requests alternative  handling of three escape sequences,
1482       which makes PCRE2's behaviour more like  ECMAScript  (aka  JavaScript).
1483       When it is set:
1484
1485       (1) \U matches an upper case "U" character; by default \U causes a com-
1486       pile time error (Perl uses \U to upper case subsequent characters).
1487
1488       (2) \u matches a lower case "u" character unless it is followed by four
1489       hexadecimal  digits,  in  which case the hexadecimal number defines the
1490       code point to match. By default, \u causes a compile time  error  (Perl
1491       uses it to upper case the following character).
1492
1493       (3)  \x matches a lower case "x" character unless it is followed by two
1494       hexadecimal digits, in which case the hexadecimal  number  defines  the
1495       code  point  to  match. By default, as in Perl, a hexadecimal number is
1496       always expected after \x, but it may have zero, one, or two digits (so,
1497       for example, \xz matches a binary zero character followed by z).
1498
1499       ECMAScript 6 added additional functionality to \u. This can be accessed
1500       using the PCRE2_EXTRA_ALT_BSUX extra option  (see  "Extra  compile  op-
1501       tions" below).  Note that this alternative escape handling applies only
1502       to  patterns.  Neither  of  these options affects the processing of re-
1503       placement strings passed to pcre2_substitute().
1504
1505         PCRE2_ALT_CIRCUMFLEX
1506
1507       In  multiline  mode  (when  PCRE2_MULTILINE  is  set),  the  circumflex
1508       metacharacter  matches at the start of the subject (unless PCRE2_NOTBOL
1509       is set), and also after any internal  newline.  However,  it  does  not
1510       match after a newline at the end of the subject, for compatibility with
1511       Perl.  If  you want a multiline circumflex also to match after a termi-
1512       nating newline, you must set PCRE2_ALT_CIRCUMFLEX.
1513
1514         PCRE2_ALT_VERBNAMES
1515
1516       By default, for compatibility with Perl, the name in any verb  sequence
1517       such  as  (*MARK:NAME)  is any sequence of characters that does not in-
1518       clude a closing parenthesis. The name is not processed in any way,  and
1519       it  is  not possible to include a closing parenthesis in the name. How-
1520       ever, if the PCRE2_ALT_VERBNAMES option is set, normal  backslash  pro-
1521       cessing  is  applied to verb names and only an unescaped closing paren-
1522       thesis terminates the name. A closing parenthesis can be included in  a
1523       name  either  as  \)  or  between  \Q  and \E. If the PCRE2_EXTENDED or
1524       PCRE2_EXTENDED_MORE option is set with  PCRE2_ALT_VERBNAMES,  unescaped
1525       whitespace  in verb names is skipped and #-comments are recognized, ex-
1526       actly as in the rest of the pattern.
1527
1528         PCRE2_AUTO_CALLOUT
1529
1530       If this bit  is  set,  pcre2_compile()  automatically  inserts  callout
1531       items,  all  with  number 255, before each pattern item, except immedi-
1532       ately before or after an explicit callout in the pattern.  For  discus-
1533       sion of the callout facility, see the pcre2callout documentation.
1534
1535         PCRE2_CASELESS
1536
1537       If  this  bit is set, letters in the pattern match both upper and lower
1538       case letters in the subject. It is equivalent to Perl's /i option,  and
1539       it  can be changed within a pattern by a (?i) option setting. If either
1540       PCRE2_UTF or PCRE2_UCP is set, Unicode  properties  are  used  for  all
1541       characters  with more than one other case, and for all characters whose
1542       code points are greater than U+007F. Note  that  there  are  two  ASCII
1543       characters, K and S, that, in addition to their lower case ASCII equiv-
1544       alents,  are case-equivalent with U+212A (Kelvin sign) and U+017F (long
1545       S) respectively. If you do not want this case equivalence, you can sup-
1546       press it by setting PCRE2_EXTRA_CASELESS_RESTRICT.
1547
1548       For lower valued characters with only one other case, a lookup table is
1549       used for speed. When neither PCRE2_UTF nor PCRE2_UCP is set,  a  lookup
1550       table is used for all code points less than 256, and higher code points
1551       (available only in 16-bit or 32-bit mode) are treated as not having an-
1552       other case.
1553
1554         PCRE2_DOLLAR_ENDONLY
1555
1556       If  this bit is set, a dollar metacharacter in the pattern matches only
1557       at the end of the subject string. Without this option,  a  dollar  also
1558       matches  immediately before a newline at the end of the string (but not
1559       before any other newlines). The PCRE2_DOLLAR_ENDONLY option is  ignored
1560       if  PCRE2_MULTILINE  is  set.  There is no equivalent to this option in
1561       Perl, and no way to set it within a pattern.
1562
1563         PCRE2_DOTALL
1564
1565       If this bit is set, a dot metacharacter  in  the  pattern  matches  any
1566       character,  including  one  that  indicates a newline. However, it only
1567       ever matches one character, even if newlines are coded as CRLF. Without
1568       this option, a dot does not match when the current position in the sub-
1569       ject is at a newline. This option is equivalent to  Perl's  /s  option,
1570       and it can be changed within a pattern by a (?s) option setting. A neg-
1571       ative  class such as [^a] always matches newline characters, and the \N
1572       escape sequence always matches a non-newline character, independent  of
1573       the setting of PCRE2_DOTALL.
1574
1575         PCRE2_DUPNAMES
1576
1577       If  this  bit is set, names used to identify capture groups need not be
1578       unique.  This can be helpful for certain types of pattern  when  it  is
1579       known  that  only  one instance of the named group can ever be matched.
1580       There are more details of named capture  groups  below;  see  also  the
1581       pcre2pattern documentation.
1582
1583         PCRE2_ENDANCHORED
1584
1585       If  this  bit is set, the end of any pattern match must be right at the
1586       end of the string being searched (the "subject string"). If the pattern
1587       match succeeds by reaching (*ACCEPT), but does not reach the end of the
1588       subject, the match fails at the current starting point. For  unanchored
1589       patterns,  a  new  match is then tried at the next starting point. How-
1590       ever, if the match succeeds by reaching the end of the pattern, but not
1591       the end of the subject, backtracking occurs and  an  alternative  match
1592       may be found. Consider these two patterns:
1593
1594         .(*ACCEPT)|..
1595         .|..
1596
1597       If  matched against "abc" with PCRE2_ENDANCHORED set, the first matches
1598       "c" whereas the second matches "bc". The  effect  of  PCRE2_ENDANCHORED
1599       can  also  be achieved by appropriate constructs in the pattern itself,
1600       which is the only way to do it in Perl.
1601
1602       For DFA matching with pcre2_dfa_match(), PCRE2_ENDANCHORED applies only
1603       to the first (that is, the  longest)  matched  string.  Other  parallel
1604       matches,  which are necessarily substrings of the first one, must obvi-
1605       ously end before the end of the subject.
1606
1607         PCRE2_EXTENDED
1608
1609       If this bit is set, most white space characters in the pattern are  to-
1610       tally  ignored except when escaped, inside a character class, or inside
1611       a \Q...\E sequence. However, white space  is  not  allowed  within  se-
1612       quences  such  as  (?> that introduce various parenthesized groups, nor
1613       within numerical quantifiers such as {1,3}. Ignorable  white  space  is
1614       permitted  between  an  item  and  a following quantifier and between a
1615       quantifier and a following + that indicates  possessiveness.  PCRE2_EX-
1616       TENDED  is equivalent to Perl's /x option, and it can be changed within
1617       a pattern by a (?x) option setting.
1618
1619       When PCRE2 is compiled without Unicode support,  PCRE2_EXTENDED  recog-
1620       nizes  as  white space only those characters with code points less than
1621       256 that are flagged as white space in its low-character table. The ta-
1622       ble is normally created by pcre2_maketables(), which uses the isspace()
1623       function to identify space characters. In most ASCII environments,  the
1624       relevant  characters  are  those  with code points 0x0009 (tab), 0x000A
1625       (linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D  (carriage
1626       return), and 0x0020 (space).
1627
1628       When PCRE2 is compiled with Unicode support, in addition to these char-
1629       acters,  five  more Unicode "Pattern White Space" characters are recog-
1630       nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-
1631       right mark), U+200F (right-to-left mark), U+2028 (line separator),  and
1632       U+2029  (paragraph  separator).  This  set of characters is the same as
1633       recognized by Perl's /x option. Note that the horizontal  and  vertical
1634       space  characters that are matched by the \h and \v escapes in patterns
1635       are a much bigger set.
1636
1637       As well as ignoring most white space, PCRE2_EXTENDED also causes  char-
1638       acters  between  an  unescaped # outside a character class and the next
1639       newline, inclusive, to be ignored, which makes it possible  to  include
1640       comments inside complicated patterns. Note that the end of this type of
1641       comment  is a literal newline sequence in the pattern; escape sequences
1642       that happen to represent a newline do not count.
1643
1644       Which characters are interpreted as newlines can be specified by a set-
1645       ting in the compile context that is passed to pcre2_compile() or  by  a
1646       special  sequence at the start of the pattern, as described in the sec-
1647       tion entitled "Newline conventions" in the pcre2pattern  documentation.
1648       A default is defined when PCRE2 is built.
1649
1650         PCRE2_EXTENDED_MORE
1651
1652       This  option  has  the  effect of PCRE2_EXTENDED, but, in addition, un-
1653       escaped space and horizontal tab characters are ignored inside a  char-
1654       acter  class. Note: only these two characters are ignored, not the full
1655       set of pattern white space characters that are ignored outside a  char-
1656       acter  class.  PCRE2_EXTENDED_MORE  is equivalent to Perl's /xx option,
1657       and it can be changed within a pattern by a (?xx) option setting.
1658
1659         PCRE2_FIRSTLINE
1660
1661       If this option is set, the start of an unanchored pattern match must be
1662       before or at the first newline in  the  subject  string  following  the
1663       start  of  matching, though the matched text may continue over the new-
1664       line. If startoffset is non-zero, the limiting newline is not necessar-
1665       ily the first newline in the  subject.  For  example,  if  the  subject
1666       string is "abc\nxyz" (where \n represents a single-character newline) a
1667       pattern  match for "yz" succeeds with PCRE2_FIRSTLINE if startoffset is
1668       greater than 3. See also PCRE2_USE_OFFSET_LIMIT, which provides a  more
1669       general  limiting  facility.  If  PCRE2_FIRSTLINE is set with an offset
1670       limit, a match must occur in the first line and also within the  offset
1671       limit. In other words, whichever limit comes first is used. This option
1672       has no effect for anchored patterns.
1673
1674         PCRE2_LITERAL
1675
1676       If this option is set, all meta-characters in the pattern are disabled,
1677       and  it is treated as a literal string. Matching literal strings with a
1678       regular expression engine is not the most efficient way of doing it. If
1679       you are doing a lot of literal matching and  are  worried  about  effi-
1680       ciency, you should consider using other approaches. The only other main
1681       options  that  are  allowed  with  PCRE2_LITERAL  are:  PCRE2_ANCHORED,
1682       PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE,
1683       PCRE2_MATCH_INVALID_UTF,  PCRE2_NO_START_OPTIMIZE,  PCRE2_NO_UTF_CHECK,
1684       PCRE2_UTF,  and  PCRE2_USE_OFFSET_LIMIT.  The  extra  options PCRE2_EX-
1685       TRA_MATCH_LINE and PCRE2_EXTRA_MATCH_WORD are also supported. Any other
1686       options cause an error.
1687
1688         PCRE2_MATCH_INVALID_UTF
1689
1690       This option forces PCRE2_UTF (see below) and also enables  support  for
1691       matching  by  pcre2_match() in subject strings that contain invalid UTF
1692       sequences.  Note, however, that the 16-bit and 32-bit  PCRE2  libraries
1693       process  strings as sequences of uint16_t or uint32_t code units.  They
1694       cannot find valid UTF sequences within an arbitrary string of bytes un-
1695       less such sequences are suitably aligned. This  facility  is  not  sup-
1696       ported  for  DFA matching. For details, see the pcre2unicode documenta-
1697       tion.
1698
1699         PCRE2_MATCH_UNSET_BACKREF
1700
1701       If this option is set,  a  backreference  to  an  unset  capture  group
1702       matches  an  empty  string (by default this causes the current matching
1703       alternative to fail).  A pattern such as (\1)(a) succeeds when this op-
1704       tion is set (assuming it can find an "a" in the  subject),  whereas  it
1705       fails  by  default,  for  Perl compatibility. Setting this option makes
1706       PCRE2 behave more like ECMAScript (aka JavaScript).
1707
1708         PCRE2_MULTILINE
1709
1710       By default, for the purposes of matching "start of line"  and  "end  of
1711       line",  PCRE2  treats the subject string as consisting of a single line
1712       of characters, even if it actually contains  newlines.  The  "start  of
1713       line"  metacharacter  (^)  matches only at the start of the string, and
1714       the "end of line" metacharacter ($) matches only  at  the  end  of  the
1715       string,  or  before a terminating newline (except when PCRE2_DOLLAR_EN-
1716       DONLY is set). Note, however, that unless PCRE2_DOTALL is set, the "any
1717       character" metacharacter (.) does not match at a newline.  This  behav-
1718       iour (for ^, $, and dot) is the same as Perl.
1719
1720       When  PCRE2_MULTILINE  is  set,  the  "start of line" and "end of line"
1721       constructs match immediately following or immediately  before  internal
1722       newlines  in  the  subject string, respectively, as well as at the very
1723       start and end. This is equivalent to Perl's /m option, and  it  can  be
1724       changed within a pattern by a (?m) option setting. Note that the "start
1725       of line" metacharacter does not match after a newline at the end of the
1726       subject,  for compatibility with Perl.  However, you can change this by
1727       setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in  a
1728       subject  string,  or  no  occurrences  of  ^ or $ in a pattern, setting
1729       PCRE2_MULTILINE has no effect.
1730
1731         PCRE2_NEVER_BACKSLASH_C
1732
1733       This option locks out the use of \C in the pattern that is  being  com-
1734       piled.   This  escape  can  cause  unpredictable  behaviour in UTF-8 or
1735       UTF-16 modes, because it may leave the current matching  point  in  the
1736       middle of a multi-code-unit character. This option may be useful in ap-
1737       plications that process patterns from external sources. Note that there
1738       is also a build-time option that permanently locks out the use of \C.
1739
1740         PCRE2_NEVER_UCP
1741
1742       This  option  locks  out the use of Unicode properties for handling \B,
1743       \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
1744       described for the PCRE2_UCP option below. In  particular,  it  prevents
1745       the  creator of the pattern from enabling this facility by starting the
1746       pattern with (*UCP). This option may be  useful  in  applications  that
1747       process patterns from external sources. The combination  of  PCRE2_UCP
1748       and PCRE2_NEVER_UCP causes an error.
1749
1750         PCRE2_NEVER_UTF
1751
1752       This  option  locks out interpretation of the pattern as UTF-8, UTF-16,
1753       or UTF-32, depending on which library is in use. In particular, it pre-
1754       vents the creator of the pattern from switching to  UTF  interpretation
1755       by  starting  the pattern with (*UTF). This option may be useful in ap-
1756       plications that process patterns from external sources. The combination
1757       of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.
1758
1759         PCRE2_NO_AUTO_CAPTURE
1760
1761       If this option is set, it disables the use of numbered capturing paren-
1762       theses in the pattern. Any opening parenthesis that is not followed  by
1763       ?  behaves as if it were followed by ?: but named parentheses can still
1764       be used for capturing (and they acquire numbers in the usual way). This
1765       is the same as Perl's /n option.  Note that, when this option  is  set,
1766       references  to  capture  groups (backreferences or recursion/subroutine
1767       calls) may only refer to named groups, though the reference can  be  by
1768       name or by number.
1769
1770         PCRE2_NO_AUTO_POSSESS
1771
1772       If this option is set, it disables "auto-possessification", which is an
1773       optimization  that,  for example, turns a+b into a++b in order to avoid
1774       backtracks into a+ that can never be successful. However,  if  callouts
1775       are  in  use,  auto-possessification means that some callouts are never
1776       taken. You can set this option if you want the matching functions to do
1777       a full unoptimized search and run all the callouts, but  it  is  mainly
1778       provided for testing purposes.
1779
1780         PCRE2_NO_DOTSTAR_ANCHOR
1781
1782       If this option is set, it disables an optimization that is applied when
1783       .*  is  the  first significant item in a top-level branch of a pattern,
1784       and all the other branches also start with .* or with \A or  \G  or  ^.
1785       The  optimization  is  automatically disabled for .* if it is inside an
1786       atomic group or a capture group that is the subject of a backreference,
1787       or if the pattern contains (*PRUNE) or (*SKIP). When  the  optimization
1788       is   not   disabled,  such  a  pattern  is  automatically  anchored  if
1789       PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
1790       for any ^ items. Otherwise, the fact that any match must  start  either
1791       at  the start of the subject or following a newline is remembered. Like
1792       other optimizations, this can cause callouts to be skipped.
1793
1794         PCRE2_NO_START_OPTIMIZE
1795
1796       This is an option whose main effect is at matching time.  It  does  not
1797       change what pcre2_compile() generates, but it does affect the output of
1798       the JIT compiler.
1799
1800       There  are  a  number of optimizations that may occur at the start of a
1801       match, in order to speed up the process. For example, if  it  is  known
1802       that  an  unanchored  match must start with a specific code unit value,
1803       the matching code searches the subject for that value, and fails  imme-
1804       diately  if it cannot find it, without actually running the main match-
1805       ing function. This means that a special item such as (*COMMIT)  at  the
1806       start  of  a  pattern is not considered until after a suitable starting
1807       point for the match has been found.  Also,  when  callouts  or  (*MARK)
1808       items  are  in use, these "start-up" optimizations can cause them to be
1809       skipped if the pattern is never actually used. The  start-up  optimiza-
1810       tions  are  in effect a pre-scan of the subject that takes place before
1811       the pattern is run.
1812
1813       The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
1814       possibly causing performance to suffer,  but  ensuring  that  in  cases
1815       where  the  result is "no match", the callouts do occur, and that items
1816       such as (*COMMIT) and (*MARK) are considered at every possible starting
1817       position in the subject string.
1818
1819       Setting PCRE2_NO_START_OPTIMIZE may change the outcome  of  a  matching
1820       operation.  Consider the pattern
1821
1822         (*COMMIT)ABC
1823
1824       When  this  is compiled, PCRE2 records the fact that a match must start
1825       with the character "A". Suppose the subject  string  is  "DEFABC".  The
1826       start-up  optimization  scans along the subject, finds "A" and runs the
1827       first match attempt from there. The (*COMMIT) item means that the  pat-
1828       tern  must  match the current starting position, which in this case, it
1829       does. However, if the same match is  run  with  PCRE2_NO_START_OPTIMIZE
1830       set,  the  initial  scan  along the subject string does not happen. The
1831       first match attempt is run starting  from  "D"  and  when  this  fails,
1832       (*COMMIT)  prevents any further matches being tried, so the overall re-
1833       sult is "no match".
1834
1835       Another start-up optimization makes use of a minimum  length  for  a
1836       matching subject, which is recorded when possible. Consider the pattern
1837
1838         (*MARK:1)B(*MARK:2)(X|Y)
1839
1840       The  minimum  length  for  a match is two characters. If the subject is
1841       "XXBB", the "starting character" optimization skips "XX", then tries to
1842       match "BB", which is long enough. In the process, (*MARK:2) is  encoun-
1843       tered  and  remembered.  When  the match attempt fails, the next "B" is
1844       found, but there is only one character left, so there are no  more  at-
1845       tempts,  and  "no  match"  is returned with the "last mark seen" set to
1846       "2". If PCRE2_NO_START_OPTIMIZE is set, however, matches are tried at every
1847       possible  starting position, including at the end of the subject, where
1848       (*MARK:1) is encountered, but there is no "B", so the "last mark  seen"
1849       that  is returned is "1". In this case, the optimizations do not affect
1850       the overall match result, which is still "no match", but they do affect
1851       the auxiliary information that is returned.
1852
1853         PCRE2_NO_UTF_CHECK
1854
1855       When PCRE2_UTF is set, the validity of the pattern as a UTF  string  is
1856       automatically  checked.  There  are  discussions  about the validity of
1857       UTF-8 strings, UTF-16 strings, and UTF-32 strings in  the  pcre2unicode
1858       document.  If an invalid UTF sequence is found, pcre2_compile() returns
1859       NULL and sets a negative error code.
1860
1861       If you know that your pattern is a valid UTF string, and  you  want  to
1862       skip   this   check   for   performance   reasons,   you  can  set  the
1863       PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an in-
1864       valid UTF string as a pattern is undefined. It may cause  your  program
1865       to crash or loop.
1866
1867       Note  that  this  option  can  also  be  passed  to  pcre2_match()  and
1868       pcre2_dfa_match(), to suppress UTF validity  checking  of  the  subject
1869       string.
1870
1871       Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis-
1872       able  the error that is given if an escape sequence for an invalid Uni-
1873       code code point is encountered in the pattern. In particular,  the  so-
1874       called  "surrogate"  code points (0xd800 to 0xdfff) are invalid. If you
1875       want to allow escape  sequences  such  as  \x{d800}  you  can  set  the
1876       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES  extra  option, as described in the
1877       section entitled "Extra compile options" below.  However, this is  pos-
1878       sible only in UTF-8 and UTF-32 modes, because these values are not rep-
1879       resentable in UTF-16.
1880
1881         PCRE2_UCP
1882
1883       This option has two effects. Firstly, it changes the way PCRE2 processes
1884       \B,  \b,  \D,  \d,  \S,  \s,  \W,  \w,  and some of the POSIX character
1885       classes. By default, only  ASCII  characters  are  recognized,  but  if
1886       PCRE2_UCP  is  set, Unicode properties are used to classify characters.
1887       There are some PCRE2_EXTRA options (see below) that add  finer  control
1888       to  this  behaviour.  More  details are given in the section on generic
1889       character types in the pcre2pattern page.
1890
1891       The second effect of PCRE2_UCP is to force the use of  Unicode  proper-
1892       ties for upper/lower casing operations, even when PCRE2_UTF is not set.
1893       This  makes  it  possible  to process strings in the 16-bit UCS-2 code.
1894       This option is available only if PCRE2 has been compiled  with  Unicode
1895       support  (which is the default).  The PCRE2_EXTRA_CASELESS_RESTRICT op-
1896       tion (see below) restricts caseless matching such that ASCII characters
1897       match only ASCII characters and non-ASCII characters  match  only  non-
1898       ASCII characters.
1899
1900         PCRE2_UNGREEDY
1901
1902       This  option  inverts  the "greediness" of the quantifiers so that they
1903       are not greedy by default, but become greedy if followed by "?". It  is
1904       not  compatible  with Perl. It can also be set by a (?U) option setting
1905       within the pattern.
1906
1907         PCRE2_USE_OFFSET_LIMIT
1908
1909       This option must be set for pcre2_compile() if pcre2_set_offset_limit()
1910       is going to be used to set a non-default offset limit in a  match  con-
1911       text  for  matches  that  use this pattern. An error is generated if an
1912       offset limit is set without this option. For more details, see the  de-
1913       scription  of  pcre2_set_offset_limit()  in  the section that describes
1914       match contexts. See also the PCRE2_FIRSTLINE option above.
1915
1916         PCRE2_UTF
1917
1918       This option causes PCRE2 to regard both the  pattern  and  the  subject
1919       strings  that  are  subsequently processed as strings of UTF characters
1920       instead of single-code-unit strings. It  is  available  when  PCRE2  is
1921       built  to  include  Unicode  support (which is the default). If Unicode
1922       support is not available, the use of this option provokes an error. De-
1923       tails of how PCRE2_UTF changes the behaviour of PCRE2 are given in  the
1924       pcre2unicode  page.  In  particular,  note  that  it  changes  the  way
1925       PCRE2_CASELESS works.
1926
1927   Extra compile options
1928
1929       The option bits that can be set in a compile  context  by  calling  the
1930       pcre2_set_compile_extra_options() function are as follows:
1931
1932         PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
1933
1934       Since release 10.38 PCRE2 has forbidden the use of \K within lookaround
1935       assertions, following Perl's lead. This option is provided to re-enable
1936       the previous behaviour (act in positive lookarounds, ignore in negative
1937       ones) in case anybody is relying on it.
1938
1939         PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
1940
1941       This  option  applies when compiling a pattern in UTF-8 or UTF-32 mode.
1942       It is forbidden in UTF-16 mode, and ignored in non-UTF  modes.  Unicode
1943       "surrogate" code points in the range 0xd800 to 0xdfff are used in pairs
1944       in  UTF-16  to  encode  code points with values in the range 0x10000 to
1945       0x10ffff. The surrogates cannot therefore  be  represented  in  UTF-16.
1946       They can be represented in UTF-8 and UTF-32, but are defined as invalid
1947       code  points,  and  cause  errors  if  encountered in a UTF-8 or UTF-32
1948       string that is being checked for validity by PCRE2.
1949
1950       These values also cause errors if encountered in escape sequences  such
1951       as \x{d912} within a pattern. However, it seems that some applications,
1952       when using PCRE2 to check for unwanted characters in UTF-8 strings, ex-
1953       plicitly   test   for   the  surrogates  using  escape  sequences.  The
1954       PCRE2_NO_UTF_CHECK option does not disable the error that  occurs,  be-
1955       cause it applies only to the testing of input strings for UTF validity.
1956
1957       If  the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surro-
1958       gate code point values in UTF-8 and UTF-32 patterns no  longer  provoke
1959       errors  and are incorporated in the compiled pattern. However, they can
1960       only match subject characters if the matching function is  called  with
1961       PCRE2_NO_UTF_CHECK set.
1962
1963         PCRE2_EXTRA_ALT_BSUX
1964
1965       The  original option PCRE2_ALT_BSUX causes PCRE2 to process \U, \u, and
1966       \x in the way that ECMAScript (aka JavaScript) does.  Additional  func-
1967       tionality was defined by ECMAScript 6; setting PCRE2_EXTRA_ALT_BSUX has
1968       the  effect  of PCRE2_ALT_BSUX, but in addition it recognizes \u{hhh..}
1969       as a hexadecimal character code, where hhh.. is any number of hexadeci-
1970       mal digits.
1971
1972         PCRE2_EXTRA_ASCII_BSD
1973
1974       This option forces \d to match only ASCII digits, even  when  PCRE2_UCP
1975       is  set.   It can be changed within a pattern by means of the (?aD) op-
1976       tion setting.
1977
1978         PCRE2_EXTRA_ASCII_BSS
1979
1980       This option forces \s to match only ASCII space characters,  even  when
1981       PCRE2_UCP  is  set.  It can be changed within a pattern by means of the
1982       (?aS) option setting.
1983
1984         PCRE2_EXTRA_ASCII_BSW
1985
1986       This option forces \w to match only ASCII word  characters,  even  when
1987       PCRE2_UCP  is  set.  It can be changed within a pattern by means of the
1988       (?aW) option setting.
1989
1990         PCRE2_EXTRA_ASCII_DIGIT
1991
1992       This option forces the POSIX character classes [:digit:] and [:xdigit:]
1993       to match only ASCII digits, even when  PCRE2_UCP  is  set.  It  can  be
1994       changed within a pattern by means of the (?aT) option setting.
1995
1996         PCRE2_EXTRA_ASCII_POSIX
1997
1998       This option forces all the POSIX character classes, including [:digit:]
1999       and  [:xdigit:], to match only ASCII characters, even when PCRE2_UCP is
2000       set. It can be changed within a pattern by means of  the  (?aP)  option
2001       setting,  but note that this also sets PCRE2_EXTRA_ASCII_DIGIT in order
2002       to ensure that (?-aP) unsets all ASCII restrictions for POSIX classes.
2003
2004         PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
2005
2006       This is a dangerous option. Use with care. By default, an  unrecognized
2007       escape  such  as \j or a malformed one such as \x{2z} causes a compile-
2008       time error when detected by pcre2_compile(). Perl is somewhat inconsis-
2009       tent in handling such items: for example, \j is treated  as  a  literal
2010       "j",  and non-hexadecimal digits in \x{} are just ignored, though warn-
2011       ings are given in both cases if Perl's warning switch is enabled.  How-
2012       ever,  a  malformed  octal  number  after \o{ always causes an error in
2013       Perl.
2014
2015       If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL  extra  option  is  passed  to
2016       pcre2_compile(),  all  unrecognized  or  malformed escape sequences are
2017       treated as single-character escapes. For example, \j is a  literal  "j"
2018       and  \x{2z}  is treated as the literal string "x{2z}". Setting this op-
2019       tion means that typos in patterns may go undetected and have unexpected
2020       results. Also note that a sequence such as [\N{] is  interpreted  as  a
2021       malformed  attempt  at [\N{...}] and so is treated as [N{] whereas [\N]
2022       gives an error because an unqualified \N is a valid escape sequence but
2023       is not supported in a character class. To reiterate: this is a  danger-
2024       ous option. Use with great care.
2025
2026         PCRE2_EXTRA_CASELESS_RESTRICT
2027
2028       When  either  PCRE2_UCP  or PCRE2_UTF is set, caseless matching follows
2029       Unicode rules, which allow for more than two cases per character. There
2030       are two case-equivalent character sets that contain both ASCII and non-
2031       ASCII characters. The ASCII letter S is case-equivalent to U+017f (long
2032       S) and the ASCII letter K is case-equivalent to U+212a  (Kelvin  sign).
2033       This  option  disables  recognition of case-equivalences that cross the
2034       ASCII/non-ASCII boundary. In a caseless match, both characters must  be
2035       ASCII or both must be non-ASCII. The option can be changed  within  a
2036       pattern by the (?r) option setting.
2037
2038         PCRE2_EXTRA_ESCAPED_CR_IS_LF
2039
2040       There are some legacy applications where the escape sequence  \r  in  a
2041       pattern  is expected to match a newline. If this option is set, \r in a
2042       pattern is converted to \n so that it matches a LF  (linefeed)  instead
2043       of  a CR (carriage return) character. The option does not affect a lit-
2044       eral CR in the pattern, nor does it affect CR specified as an  explicit
2045       code point such as \x{0D}.
2046
2047         PCRE2_EXTRA_MATCH_LINE
2048
2049       This  option  is  provided  for  use  by the -x option of pcre2grep. It
2050       causes the pattern only to match complete lines. This  is  achieved  by
2051       automatically  inserting  the  code for "^(?:" at the start of the com-
2052       piled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE  is  set,
2053       the  matched  line may be in the middle of the subject string. This op-
2054       tion can be used with PCRE2_LITERAL.
2055
2056         PCRE2_EXTRA_MATCH_WORD
2057
2058       This option is provided for use by  the  -w  option  of  pcre2grep.  It
2059       causes  the  pattern only to match strings that have a word boundary at
2060       the start and the end. This is achieved by automatically inserting  the
2061       code  for "\b(?:" at the start of the compiled pattern and ")\b" at the
2062       end. The option may be used with PCRE2_LITERAL. However, it is  ignored
2063       if PCRE2_EXTRA_MATCH_LINE is also set.
2064
2065
2066JUST-IN-TIME (JIT) COMPILATION
2067
2068       int pcre2_jit_compile(pcre2_code *code, uint32_t options);
2069
2070       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
2071         PCRE2_SIZE length, PCRE2_SIZE startoffset,
2072         uint32_t options, pcre2_match_data *match_data,
2073         pcre2_match_context *mcontext);
2074
2075       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
2076
2077       pcre2_jit_stack *pcre2_jit_stack_create(size_t startsize,
2078         size_t maxsize, pcre2_general_context *gcontext);
2079
2080       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
2081         pcre2_jit_callback callback_function, void *callback_data);
2082
2083       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
2084
2085       These  functions  provide  support  for  JIT compilation, which, if the
2086       just-in-time compiler is available, further processes a  compiled  pat-
2087       tern into machine code that executes much faster than the pcre2_match()
2088       interpretive  matching function. Full details are given in the pcre2jit
2089       documentation.
2090
2091       JIT compilation is a heavyweight optimization. It can  take  some  time
2092       for  patterns  to  be analyzed, and for one-off matches and simple pat-
2093       terns the benefit of faster execution might be offset by a much  slower
2094       compilation  time.  Most (but not all) patterns can be optimized by the
2095       JIT compiler.
2096
2097
2098LOCALE SUPPORT
2099
2100       const uint8_t *pcre2_maketables(pcre2_general_context *gcontext);
2101
2102       void pcre2_maketables_free(pcre2_general_context *gcontext,
2103         const uint8_t *tables);
2104
2105       PCRE2 handles caseless matching, and determines whether characters  are
2106       letters,  digits, or whatever, by reference to a set of tables, indexed
2107       by character code point. However, this applies only to characters whose
2108       code points are less than 256. By default,  higher-valued  code  points
2109       never match escapes such as \w or \d.
2110
2111       When PCRE2 is built with Unicode support (the default), certain Unicode
2112       character  properties  can be tested with \p and \P, or, alternatively,
2113       the PCRE2_UCP option can be set when a pattern is compiled; this causes
2114       \w and friends to use Unicode property support instead of the  built-in
2115       tables.  PCRE2_UCP also causes upper/lower casing operations on charac-
2116       ters with code points greater than 127 to use Unicode properties. These
2117       effects  apply even when PCRE2_UTF is not set. There are, however, some
2118       PCRE2_EXTRA options (see above) that can be used to modify or  suppress
2119       them.
2120
2121       The  use  of  locales  with Unicode is discouraged. If you are handling
2122       characters with code points greater than 127,  you  should  either  use
2123       Unicode support, or use locales, but not try to mix the two.
2124
2125       PCRE2  contains a built-in set of character tables that are used by de-
2126       fault.  These are sufficient for many applications. Normally,  the  in-
2127       ternal  tables  recognize only ASCII characters. However, when PCRE2 is
2128       built, it is possible to cause the internal tables to be rebuilt in the
2129       default "C" locale of the local system, which may cause them to be dif-
2130       ferent.
2131
2132       The built-in tables can be overridden by tables supplied by the  appli-
2133       cation  that  calls  PCRE2.  These may be created in a different locale
2134       from the default.  As more and more applications change to  using  Uni-
2135       code, the need for this locale support is expected to die away.
2136
2137       External  tables  are built by calling the pcre2_maketables() function,
2138       in the relevant locale. The only argument to this function is a general
2139       context, which can be used to pass a custom memory  allocator.  If  the
2140       argument is NULL, the system malloc() is used. The result can be passed
2141       to pcre2_compile() as often as necessary, by creating a compile context
2142       and  calling  pcre2_set_character_tables()  to  set  the tables pointer
2143       therein.
2144
2145       For example, to build and use  tables  that  are  appropriate  for  the
2146       French  locale  (where accented characters with values greater than 127
2147       are treated as letters), the following code could be used:
2148
2149         setlocale(LC_CTYPE, "fr_FR");
2150         tables = pcre2_maketables(NULL);
2151         ccontext = pcre2_compile_context_create(NULL);
2152         pcre2_set_character_tables(ccontext, tables);
2153         re = pcre2_compile(..., ccontext);
2154
2155       The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
2156       if you are using Windows, the name for the French locale is "french".
2157
2158       The pointer that is passed (via the compile context) to pcre2_compile()
2159       is saved with the compiled pattern, and the same tables are used by the
2160       matching  functions.  Thus,  for  any  single  pattern, compilation and
2161       matching both happen in the same locale, but different patterns can  be
2162       processed in different locales.
2163
2164       It  is the caller's responsibility to ensure that the memory containing
2165       the tables remains available while they are still in use. When they are
2166       no longer needed, you can discard them  using  pcre2_maketables_free(),
2167       passing as its first argument the same general context that was used to
2168       create the tables.
2169
2170   Saving locale tables
2171
2172       The tables described above are just a sequence of binary  bytes,  which
2173       makes  them  independent of hardware characteristics such as endianness
2174       or whether the processor is 32-bit or 64-bit. A copy of the  result  of
2175       pcre2_maketables()  can  therefore  be saved in a file or elsewhere and
2176       re-used later, even in a different program or on another computer.  The
2177       size  of  the  tables  (number  of  bytes)  must be obtained by calling
2178       pcre2_config()  with  the  PCRE2_CONFIG_TABLES_LENGTH  option   because
2179       pcre2_maketables()   does   not   return  this  value.  Note  that  the
2180       pcre2_dftables program, which is part of the PCRE2 build system, can be
2181       used stand-alone to create a file that contains a set of binary tables.
2182       See the pcre2build documentation for details.
2183
2184
2185INFORMATION ABOUT A COMPILED PATTERN
2186
2187       int pcre2_pattern_info(const pcre2_code *code, uint32_t what, void *where);
2188
2189       The pcre2_pattern_info() function returns general information  about  a
2190       compiled pattern. For information about callouts, see the next section.
2191       The  first  argument  for pcre2_pattern_info() is a pointer to the com-
2192       piled pattern. The second argument specifies which piece of information
2193       is required, and the third argument is a pointer to a variable  to  re-
2194       ceive  the  data.  If the third argument is NULL, the first argument is
2195       ignored, and the function returns the size in  bytes  of  the  variable
2196       that is required for the information requested. Otherwise, the yield of
2197       the function is zero for success, or one of the following negative num-
2198       bers:
2199
2200         PCRE2_ERROR_NULL           the argument code was NULL
2201         PCRE2_ERROR_BADMAGIC       the "magic number" was not found
2202         PCRE2_ERROR_BADOPTION      the value of what was invalid
2203         PCRE2_ERROR_UNSET          the requested field is not set
2204
2205       The "magic number" is placed at the start of each compiled pattern as a
2206       simple  check  against  passing  an arbitrary memory pointer. Here is a
2207       typical call of pcre2_pattern_info(), to obtain the length of the  com-
2208       piled pattern:
2209
2210         int rc;
2211         size_t length;
2212         rc = pcre2_pattern_info(
2213           re,               /* result of pcre2_compile() */
2214           PCRE2_INFO_SIZE,  /* what is required */
2215           &length);         /* where to put the data */
2216
2217       The possible values for the second argument are defined in pcre2.h, and
2218       are as follows:
2219
2220         PCRE2_INFO_ALLOPTIONS
2221         PCRE2_INFO_ARGOPTIONS
2222         PCRE2_INFO_EXTRAOPTIONS
2223
2224       Return copies of the pattern's options. The third argument should point
2225       to  a  uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the op-
2226       tions that were passed to  pcre2_compile(),  whereas  PCRE2_INFO_ALLOP-
2227       TIONS  returns  the compile options as modified by any top-level (*XXX)
2228       option settings such as (*UTF) at the  start  of  the  pattern  itself.
2229       PCRE2_INFO_EXTRAOPTIONS  returns the extra options that were set in the
2230       compile context by calling the pcre2_set_compile_extra_options()  func-
2231       tion.
2232
2233       For  example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EX-
2234       TENDED option, the result for PCRE2_INFO_ALLOPTIONS  is  PCRE2_EXTENDED
2235       and  PCRE2_UTF.   Option settings such as (?i) that can change within a
2236       pattern do not affect the result of PCRE2_INFO_ALLOPTIONS, even if they
2237       appear right at the start of the pattern. (This was different  in  some
2238       earlier releases.)
2239
2240       A  pattern compiled without PCRE2_ANCHORED is automatically anchored by
2241       PCRE2 if the first significant item in every top-level branch is one of
2242       the following:
2243
2244         ^     unless PCRE2_MULTILINE is set
2245         \A    always
2246         \G    always
2247         .*    sometimes - see below
2248
2249       When .* is the first significant item, anchoring is possible only  when
2250       all the following are true:
2251
2252         .* is not in an atomic group
2253         .* is not in a capture group that is the subject
2254              of a backreference
2255         PCRE2_DOTALL is in force for .*
2256         Neither (*PRUNE) nor (*SKIP) appears in the pattern
2257         PCRE2_NO_DOTSTAR_ANCHOR is not set
2258
2259       For  patterns  that are auto-anchored, the PCRE2_ANCHORED bit is set in
2260       the options returned for PCRE2_INFO_ALLOPTIONS.
2261
2262         PCRE2_INFO_BACKREFMAX
2263
2264       Return the number of the highest  backreference  in  the  pattern.  The
2265       third  argument  should  point  to  a  uint32_t variable. Named capture
2266       groups acquire numbers as well as names, and these  count  towards  the
2267       highest  backreference.  Backreferences  such as \4 or \g{12} match the
2268       captured characters of the given group, but in addition, the check that
2269       a capture group is set in a conditional group such as (?(3)a|b) is also
2270       a backreference.  Zero is returned if there are no backreferences.
2271
2272         PCRE2_INFO_BSR
2273
2274       The output is a uint32_t integer whose value indicates  what  character
2275       sequences  the \R escape sequence matches. A value of PCRE2_BSR_UNICODE
2276       means that \R matches any Unicode line  ending  sequence;  a  value  of
2277       PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF.
2278
2279         PCRE2_INFO_CAPTURECOUNT
2280
2281       Return  the  highest  capture  group number in the pattern. In patterns
2282       where (?| is not used, this is also the total number of capture groups.
2283       The third argument should point to a uint32_t variable.
2284
2285         PCRE2_INFO_DEPTHLIMIT
2286
2287       If the pattern set a backtracking depth limit by including an  item  of
2288       the  form  (*LIMIT_DEPTH=nnnn) at the start, the value is returned. The
2289       third argument should point to a uint32_t integer. If no such value has
2290       been set, the call to pcre2_pattern_info() returns the error  PCRE2_ER-
2291       ROR_UNSET. Note that this limit will only be used during matching if it
2292       is  less  than  the  limit  set or defaulted by the caller of the match
2293       function.
2294
2295         PCRE2_INFO_FIRSTBITMAP
2296
2297       In the absence of a single first code unit for a non-anchored  pattern,
2298       pcre2_compile()  may construct a 256-bit table that defines a fixed set
2299       of values for the first code unit in any match. For example, a  pattern
2300       that  starts  with  [abc]  results in a table with three bits set. When
2301       code unit values greater than 255 are supported, the flag bit  for  255
2302       means  "any  code unit of value 255 or above". If such a table was con-
2303       structed, a pointer to it is returned. Otherwise NULL is returned.  The
2304       third argument should point to a const uint8_t * variable.
2305
2306         PCRE2_INFO_FIRSTCODETYPE
2307
2308       Return information about the first code unit of any matched string, for
2309       a  non-anchored  pattern. The third argument should point to a uint32_t
2310       variable. If there is a fixed first value, for example, the letter  "c"
2311       from  a  pattern such as (cat|cow|coyote), 1 is returned, and the value
2312       can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is  no  fixed
2313       first  value,  but it is known that a match can occur only at the start
2314       of the subject or following a newline in the subject,  2  is  returned.
2315       Otherwise, and for anchored patterns, 0 is returned.
2316
2317         PCRE2_INFO_FIRSTCODEUNIT
2318
2319       Return  the  value  of  the first code unit of any matched string for a
2320       pattern where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise  return  0.
2321       The  third  argument  should point to a uint32_t variable. In the 8-bit
2322       library, the value is always less than 256. In the 16-bit  library  the
2323       value  can  be  up  to 0xffff. In the 32-bit library in UTF-32 mode the
2324       value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
2325       mode.
2326
2327         PCRE2_INFO_FRAMESIZE
2328
2329       Return the size (in bytes) of the data frames that are used to remember
2330       backtracking positions when the pattern is processed  by  pcre2_match()
2331       without  the  use  of  JIT. The third argument should point to a size_t
2332       variable. The frame size depends on the number of capturing parentheses
2333       in the pattern. Each additional capture group adds two PCRE2_SIZE vari-
2334       ables.
2335
2336         PCRE2_INFO_HASBACKSLASHC
2337
2338       Return 1 if the pattern contains any instances of \C, otherwise 0.  The
2339       third argument should point to a uint32_t variable.
2340
2341         PCRE2_INFO_HASCRORLF
2342
2343       Return  1  if  the  pattern  contains any explicit matches for CR or LF
2344       characters, otherwise 0. The third argument should point to a  uint32_t
2345       variable.  An explicit match is either a literal CR or LF character, or
2346       \r or \n or one of the  equivalent  hexadecimal  or  octal  escape  se-
2347       quences.
2348
2349         PCRE2_INFO_HEAPLIMIT
2350
2351       If the pattern set a heap memory limit by including an item of the form
2352       (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu-
2353       ment should point to a uint32_t integer. If no such value has been set,
2354       the  call  to pcre2_pattern_info() returns the error PCRE2_ERROR_UNSET.
2355       Note that this limit will only be used during matching if  it  is  less
2356       than the limit set or defaulted by the caller of the match function.
2357
2358         PCRE2_INFO_JCHANGED
2359
2360       Return  1  if  the (?J) or (?-J) option setting is used in the pattern,
2361       otherwise 0. The third argument should point to  a  uint32_t  variable.
2362       (?J)  and  (?-J) set and unset the local PCRE2_DUPNAMES option, respec-
2363       tively.
2364
2365         PCRE2_INFO_JITSIZE
2366
2367       If the compiled pattern was successfully  processed  by  pcre2_jit_com-
2368       pile(),  return  the  size  of  the JIT compiled code, otherwise return
2369       zero. The third argument should point to a size_t variable.
2370
2371         PCRE2_INFO_LASTCODETYPE
2372
2373       Return 1 if there is a rightmost literal code unit that must  exist  in
2374       any  matched string, other than at its start. The third argument should
2375       point to a uint32_t variable. If there is no such value, 0 is returned.
2376       When 1 is returned, the code unit value itself can be  retrieved  using
2377       PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
2378       recorded  only if it follows something of variable length. For example,
2379       for the pattern /^a\d+z\d+/ the returned value is 1 (with "z"  returned
2380       from  PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is
2381       0.
2382
2383         PCRE2_INFO_LASTCODEUNIT
2384
2385       Return the value of the rightmost literal code unit that must exist  in
2386       any  matched  string,  other  than  at  its  start, for a pattern where
2387       PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu-
2388       ment should point to a uint32_t variable.
2389
2390         PCRE2_INFO_MATCHEMPTY
2391
2392       Return 1 if the pattern might match an empty string, otherwise  0.  The
2393       third argument should point to a uint32_t variable. When a pattern con-
2394       tains recursive subroutine calls it is not always possible to determine
2395       whether or not it can match an empty string. PCRE2 takes a cautious ap-
2396       proach and returns 1 in such cases.
2397
2398         PCRE2_INFO_MATCHLIMIT
2399
2400       If  the  pattern  set  a  match  limit by including an item of the form
2401       (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third  ar-
2402       gument  should  point  to a uint32_t integer. If no such value has been
2403       set, the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UN-
2404       SET. Note that this limit will only be used during matching  if  it  is
2405       less  than  the limit set or defaulted by the caller of the match func-
2406       tion.
2407
2408         PCRE2_INFO_MAXLOOKBEHIND
2409
2410       A lookbehind assertion moves back a certain number of  characters  (not
2411       code  units)  when  it starts to process each of its branches. This re-
2412       quest returns the largest of these backward moves. The  third  argument
2413       should point to a uint32_t integer. The simple assertions \b and \B re-
2414       quire  a one-character lookbehind and cause PCRE2_INFO_MAXLOOKBEHIND to
2415       return 1 in the absence of anything longer. \A also  registers  a  one-
2416       character  lookbehind, though it does not actually inspect the previous
2417       character.
2418
2419       Note that this information is useful for multi-segment matching only if
2420       the pattern contains no nested lookbehinds. For  example,  the  pattern
2421       (?<=a(?<=ba)c)  returns  a  maximum  lookbehind  of  2,  but when it is
2422       processed, the first lookbehind moves back by two  characters,  matches
2423       one  character, then the nested lookbehind also moves back by two char-
2424       acters. This puts the matching point three characters earlier  than  it
2425       was  at the start.  PCRE2_INFO_MAXLOOKBEHIND is really only useful as a
2426       debugging tool. See the pcre2partial documentation for a discussion  of
2427       multi-segment matching.
2428
2429         PCRE2_INFO_MINLENGTH
2430
2431       If  a  minimum  length  for  matching subject strings was computed, its
2432       value is returned. Otherwise the returned value is 0. This value is not
2433       computed when PCRE2_NO_START_OPTIMIZE is set. The value is a number  of
2434       characters,  which in UTF mode may be different from the number of code
2435       units. The third argument should point  to  a  uint32_t  variable.  The
2436       value  is a lower bound to the length of any matching string. There may
2437       not be any strings of that length that do  actually  match,  but  every
2438       string that does match is at least that long.
2439
2440         PCRE2_INFO_NAMECOUNT
2441         PCRE2_INFO_NAMEENTRYSIZE
2442         PCRE2_INFO_NAMETABLE
2443
2444       PCRE2 supports the use of named as well as numbered capturing parenthe-
2445       ses.  The names are just an additional way of identifying the parenthe-
2446       ses, which still acquire numbers. Several convenience functions such as
2447       pcre2_substring_get_byname() are provided for extracting captured  sub-
2448       strings  by  name. It is also possible to extract the data directly, by
2449       first converting the name to a number in order to  access  the  correct
2450       pointers  in the output vector (described with pcre2_match() below). To
2451       do the conversion, you need to use the name-to-number map, which is de-
2452       scribed by these three values.
2453
2454       The map consists of a number of  fixed-size  entries.  PCRE2_INFO_NAME-
2455       COUNT  gives  the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives
2456       the size of each entry in code units; both of these return  a  uint32_t
2457       value. The entry size depends on the length of the longest name.
2458
2459       PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table.
2460       This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit li-
2461       brary,  the first two bytes of each entry are the number of the captur-
2462       ing parenthesis, most significant byte first. In  the  16-bit  library,
2463       the  pointer  points  to 16-bit code units, the first of which contains
2464       the parenthesis number. In the 32-bit library, the  pointer  points  to
2465       32-bit  code units, the first of which contains the parenthesis number.
2466       The rest of the entry is the corresponding name, zero terminated.
2467
2468       The names are in alphabetical order. If (?| is used to create  multiple
2469       capture groups with the same number, as described in the section on du-
2470       plicate group numbers in the pcre2pattern page, the groups may be given
2471       the  same  name,  but  there  is only one entry in the table. Different
2472       names for groups of the same number are not permitted.
2473
2474       Duplicate names for capture groups with different numbers  are  permit-
2475       ted, but only if PCRE2_DUPNAMES is set. They appear in the table in the
2476       order  in  which  they were found in the pattern. In the absence of (?|
2477       this is the order of increasing number; when (?| is used  this  is  not
2478       necessarily  the  case because later capture groups may have lower num-
2479       bers.
2480
2481       As a simple example of the name/number table,  consider  the  following
2482       pattern  after  compilation by the 8-bit library (assume PCRE2_EXTENDED
2483       is set, so white space - including newlines - is ignored):
2484
2485         (?<date> (?<year>(\d\d)?\d\d) -
2486         (?<month>\d\d) - (?<day>\d\d) )
2487
2488       There are four named capture groups, so the table has four entries, and
2489       each entry in the table is eight bytes long. The table is  as  follows,
2490       with non-printing bytes shown in hexadecimal, and undefined bytes shown
2491       as ??:
2492
2493         00 01 d  a  t  e  00 ??
2494         00 05 d  a  y  00 ?? ??
2495         00 04 m  o  n  t  h  00
2496         00 02 y  e  a  r  00 ??
2497
2498       When  writing  code to extract data from named capture groups using the
2499       name-to-number map, remember that the length of the entries  is  likely
2500       to be different for each compiled pattern.
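
       The lookup described above can be sketched as follows for  the  8-bit
       library.  This  is  an  illustration  only: it works on a hard-coded
       copy of the example table (with the undefined bytes set to zero) in-
       stead  of  calling pcre2_pattern_info(), and the helper name group_num-
       ber_for_name is an assumption, not a library function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The example table from the text: 4 entries of 8 bytes each (8-bit
   library layout). Undefined padding bytes are filled with zero here. */
static const uint8_t example_table[4 * 8] = {
  0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
  0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
  0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
  0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00
};

/* Hypothetical helper: return the group number for name, or -1 if the
   name is not in the table. count and entry_size stand for the values
   reported by PCRE2_INFO_NAMECOUNT and PCRE2_INFO_NAMEENTRYSIZE. */
static int group_number_for_name(const uint8_t *table, uint32_t count,
                                 uint32_t entry_size, const char *name)
{
  assert(entry_size >= 3);  /* two number bytes plus a terminating zero */

  for (uint32_t i = 0; i < count; i++) {
    const uint8_t *entry = table + (size_t)i * entry_size;
    /* The name starts after the two number bytes and is zero terminated. */
    if (strcmp((const char *)(entry + 2), name) == 0)
      return (entry[0] << 8) | entry[1];  /* most significant byte first */
  }
  return -1;
}
```

       With  the  example  table,  looking up "month" yields 4 and "day"
       yields 5. When PCRE2_DUPNAMES is in use, this sketch returns only the
       first entry for a duplicated name.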
2501
2502         PCRE2_INFO_NEWLINE
2503
2504       The output is one of the following uint32_t values:
2505
2506         PCRE2_NEWLINE_CR       Carriage return (CR)
2507         PCRE2_NEWLINE_LF       Linefeed (LF)
2508         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
2509         PCRE2_NEWLINE_ANY      Any Unicode line ending
2510         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
2511         PCRE2_NEWLINE_NUL      The NUL character (binary zero)
2512
2513       This identifies the character sequence that will be recognized as mean-
2514       ing "newline" while matching.
2515
2516         PCRE2_INFO_SIZE
2517
2518       Return  the  size  of  the compiled pattern in bytes (for all three li-
2519       braries). The third argument should point to a  size_t  variable.  This
2520       value  includes  the  size  of the general data block that precedes the
2521       code units of the compiled pattern itself. The value that is used  when
2522       pcre2_compile()  is  getting memory in which to place the compiled pat-
2523       tern may be slightly larger than the value returned by this option, be-
2524       cause there are cases where the code that calculates the  size  has  to
2525       over-estimate.  Processing a pattern with the JIT compiler does not al-
2526       ter the value returned by this option.
2527
2528
2529INFORMATION ABOUT A PATTERN'S CALLOUTS
2530
2531       int pcre2_callout_enumerate(const pcre2_code *code,
2532         int (*callback)(pcre2_callout_enumerate_block *, void *),
2533         void *user_data);
2534
2535       A script language that supports the use of string arguments in callouts
2536       might like to scan all the callouts in a  pattern  before  running  the
2537       match. This can be done by calling pcre2_callout_enumerate(). The first
2538       argument  is  a  pointer  to a compiled pattern, the second points to a
2539       callback function, and the third is arbitrary user data.  The  callback
2540       function  is  called  for  every callout in the pattern in the order in
2541       which they appear. Its first argument is a pointer to a callout enumer-
2542       ation block, and its second argument is the user_data  value  that  was
2543       passed  to  pcre2_callout_enumerate(). The contents of the callout enu-
2544       meration block are described in the pcre2callout  documentation,  which
2545       also gives further details about callouts.
2546
2547
2548SERIALIZATION AND PRECOMPILING
2549
2550       It  is possible to save compiled patterns on disc or elsewhere, and re-
2551       load them later, subject to a number of restrictions. The host on which
2552       the patterns are reloaded must be running the same  version  of  PCRE2,
2553       with  the same code unit width, and must also have the same endianness,
2554       pointer width, and PCRE2_SIZE type. Before  compiled  patterns  can  be
2555       saved, they must be converted to a "serialized" form, which in the case
2556       of PCRE2 is really just a bytecode dump.  The functions whose names be-
2557       gin with pcre2_serialize_ are used for converting to and from the seri-
2558       alized  form.  They  are described in the pcre2serialize documentation.
2559       Note that PCRE2 serialization does not convert compiled patterns to  an
2560       abstract format like Java or .NET serialization.
2561
2562
2563THE MATCH DATA BLOCK
2564
2565       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
2566         pcre2_general_context *gcontext);
2567
2568       pcre2_match_data *pcre2_match_data_create_from_pattern(
2569         const pcre2_code *code, pcre2_general_context *gcontext);
2570
2571       void pcre2_match_data_free(pcre2_match_data *match_data);
2572
2573       Information  about  a  successful  or unsuccessful match is placed in a
2574       match data block, which is an opaque  structure  that  is  accessed  by
2575       function  calls.  In particular, the match data block contains a vector
2576       of offsets into the subject string that define the matched parts of the
2577       subject. This is known as the ovector.
2578
2579       Before calling pcre2_match(), pcre2_dfa_match(),  or  pcre2_jit_match()
2580       you must create a match data block by calling one of the creation func-
2581       tions  above.  For pcre2_match_data_create(), the first argument is the
2582       number of pairs of offsets in the ovector.
2583
2584       When using pcre2_match(), one pair of offsets is required  to  identify
2585       the  string that matched the whole pattern, with an additional pair for
2586       each captured substring. For example, a value of 4 creates enough space
2587       to record the matched portion of the subject plus three  captured  sub-
2588       strings.
2589
2590       When  using  pcre2_dfa_match() there may be multiple matched substrings
2591       of different lengths at the same point  in  the  subject.  The  ovector
2592       should be made large enough to hold as many as are expected.
2593
2594       A  minimum  of  1  pair  is  imposed  by pcre2_match_data_create(),
2595       so it is always possible to return the overall matched  string  in  the
2596       case   of   pcre2_match()   or   the  longest  match  in  the  case  of
2597       pcre2_dfa_match(). The maximum number of pairs is 65535; if  the  first
2598       argument  of  pcre2_match_data_create()  is greater than this, 65535 is
2599       used.
2600
2601       The second argument of pcre2_match_data_create() is a pointer to a gen-
2602       eral context, which can specify custom memory management for  obtaining
2603       the memory for the match data block. If you are not using custom memory
2604       management, pass NULL, which causes malloc() to be used.
2605
2606       For  pcre2_match_data_create_from_pattern(),  the  first  argument is a
2607       pointer to a compiled pattern. The ovector is created to be exactly the
2608       right size to hold all the substrings  a  pattern  might  capture  when
2609       matched using pcre2_match(). You should not use this call when matching
2610       with  pcre2_dfa_match().  The  second  argument is again a pointer to a
2611       general context, but in this case if NULL is passed, the memory is  ob-
2612       tained  using the same allocator that was used for the compiled pattern
2613       (custom or default).
2614
2615       A match data block can be used many times, with the same  or  different
2616       compiled  patterns. You can extract information from a match data block
2617       after a match operation has finished,  using  functions  that  are  de-
2618       scribed in the sections on matched strings and other match data below.
2619
2620       When  a  call  of  pcre2_match()  fails, valid data is available in the
2621       match block only  when  the  error  is  PCRE2_ERROR_NOMATCH,  PCRE2_ER-
2622       ROR_PARTIAL,  or  one of the error codes for an invalid UTF string. Ex-
2623       actly what is available depends on the error, and is detailed below.
2624
2625       When one of the matching functions is called, pointers to the  compiled
2626       pattern  and the subject string are set in the match data block so that
2627       they can be referenced by the extraction functions after  a  successful
2628       match. After running a match, you must not free a compiled pattern or a
2629       subject  string until after all operations on the match data block (for
2630       that match) have taken place,  unless,  in  the  case  of  the  subject
2631       string,  you  have used the PCRE2_COPY_MATCHED_SUBJECT option, which is
2632       described in the section entitled "Option bits for  pcre2_match()"  be-
2633       low.
2634
2635       When  a match data block itself is no longer needed, it should be freed
2636       by calling pcre2_match_data_free(). If this function is called  with  a
2637       NULL argument, it returns immediately, without doing anything.
2638
2639
2640MEMORY USE FOR MATCH DATA BLOCKS
2641
2642       PCRE2_SIZE pcre2_get_match_data_size(pcre2_match_data *match_data);
2643
2644       PCRE2_SIZE pcre2_get_match_data_heapframes_size(
2645         pcre2_match_data *match_data);
2646
2647       The  size of a match data block depends on the size of the ovector that
2648       it contains. The function pcre2_get_match_data_size() returns the size,
2649       in bytes, of the block that is its argument.
2650
2651       When pcre2_match() runs interpretively (that is, without using JIT), it
2652       makes use of a vector of data frames for remembering backtracking posi-
2653       tions.  The size of each individual frame depends on the number of cap-
2654       turing parentheses in the  pattern  and  can  be  obtained  by  calling
2655       pcre2_pattern_info() with the PCRE2_INFO_FRAMESIZE option (see the sec-
2656       tion entitled "Information about a compiled pattern" above).
2657
2658       Heap  memory is used for the frames vector; if the initial memory block
2659       turns out to be too small during  matching,  it  is  automatically  ex-
2660       panded.  When  pcre2_match()  returns, the memory is not freed, but re-
2661       mains attached to the match data  block,  for  use  by  any  subsequent
2662       matches  that  use  the  same block. It is automatically freed when the
2663       match data block itself is freed.
2664
2665       You can find the current size of the frames vector that  a  match  data
2666       block  owns  by  calling  pcre2_get_match_data_heapframes_size(). For a
2667       newly created match data block the size will be  zero.  Some  types  of
2668       match may require a lot of frames and thus a large vector; applications
2669       that run in environments where memory is constrained can check this and
2670       free the match data block if the heap frames vector has become too big.
2671
2672
2673MATCHING A PATTERN: THE TRADITIONAL FUNCTION
2674
2675       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
2676         PCRE2_SIZE length, PCRE2_SIZE startoffset,
2677         uint32_t options, pcre2_match_data *match_data,
2678         pcre2_match_context *mcontext);
2679
2680       The  function pcre2_match() is called to match a subject string against
2681       a compiled pattern, which is passed in the code argument. You can  call
2682       pcre2_match() with the same code argument as many times as you like, in
2683       order  to  find multiple matches in the subject string or to match dif-
2684       ferent subject strings with the same pattern.
2685
2686       This function is the main matching facility of the library, and it  op-
2687       erates  in  a Perl-like manner. For specialist use there is also an al-
2688       ternative matching function, which is described below  in  the  section
2689       about the pcre2_dfa_match() function.
2690
2691       Here is an example of a simple call to pcre2_match():
2692
2693         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
2694         int rc = pcre2_match(
2695           re,             /* result of pcre2_compile() */
2696           (PCRE2_SPTR)"some string",  /* the subject string */
2697           11,             /* the length of the subject string */
2698           0,              /* start at offset 0 in the subject */
2699           0,              /* default options */
2700           md,             /* the match data block */
2701           NULL);          /* a match context; NULL means use defaults */
2702
2703       If  the  subject  string is zero-terminated, the length can be given as
2704       PCRE2_ZERO_TERMINATED. A match context must be provided if certain less
2705       common matching parameters are to be changed. For details, see the sec-
2706       tion on the match context above.
2707
2708   The string to be matched by pcre2_match()
2709
2710       The subject string is passed to pcre2_match() as a pointer in  subject,
2711       a  length  in  length, and a starting offset in startoffset. The length
2712       and offset are in code units, not characters.  That  is,  they  are  in
2713       bytes  for the 8-bit library, 16-bit code units for the 16-bit library,
2714       and 32-bit code units for the 32-bit library, whether or not  UTF  pro-
2715       cessing is enabled. As a special case, if subject is NULL and length is
2716       zero,  the  subject is assumed to be an empty string. If length is non-
2717       zero, an error occurs if subject is NULL.
2718
2719       If startoffset is greater than the length of the subject, pcre2_match()
2720       returns PCRE2_ERROR_BADOFFSET. When the starting offset  is  zero,  the
2721       search  for a match starts at the beginning of the subject, and this is
2722       by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
2723       set must point to the start of a character, or to the end of  the  sub-
2724       ject  (in  UTF-32 mode, one code unit equals one character, so all off-
2725       sets are valid). Like the pattern string, the subject may  contain  bi-
2726       nary zeros.
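
       The  rule  for  starting offsets in UTF-8 mode can be illustrated with
       a small check that a caller might run before calling pcre2_match().
       This  is  a sketch under the assumption that the subject is already
       valid UTF-8; it is not the check PCRE2 itself performs, and the  name
       utf8_offset_is_char_start is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if offset is acceptable as a starting
   offset for a valid UTF-8 subject of the given length (in bytes),
   otherwise 0. */
static int utf8_offset_is_char_start(const unsigned char *subject,
                                     size_t length, size_t offset)
{
  assert(subject != NULL || length == 0);

  if (offset > length) return 0;   /* past the end of the subject */
  if (offset == length) return 1;  /* the end of the subject is allowed */

  /* Continuation bytes have the form 10xxxxxx and never start a
     character. */
  return (subject[offset] & 0xc0) != 0x80;
}
```

       For  the three-byte subject consisting of U+00E9 (bytes 0xc3 0xa9)
       followed by "x", offsets 0, 2, and 3 are accepted and offset 1 is re-
       jected.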
2727
2728       A  non-zero  starting offset is useful when searching for another match
2729       in the same subject by calling pcre2_match()  again  after  a  previous
2730       success.   Setting  startoffset  differs  from passing over a shortened
2731       string and setting PCRE2_NOTBOL in the case of a  pattern  that  begins
2732       with any kind of lookbehind. For example, consider the pattern
2733
2734         \Biss\B
2735
2736       which  finds  occurrences  of "iss" in the middle of words. (\B matches
2737       only if the current position in the subject is not  a  word  boundary.)
2738       When   applied   to   the   string  "Mississippi"  the  first  call  to
2739       pcre2_match() finds the first occurrence. If  pcre2_match()  is  called
2740       again with just the remainder of the subject, namely "issippi", it does
2741       not  match,  because  \B  is  always false at the start of the subject,
2742       which is deemed to be a word boundary.  However,  if  pcre2_match()  is
2743       passed the entire string again, but with startoffset set to 4, it finds
2744       the  second  occurrence  of "iss" because it is able to look behind the
2745       starting point to discover that it is preceded by a letter.
2746
2747       Finding all the matches in a subject is tricky  when  the  pattern  can
2748       match an empty string. It is possible to emulate Perl's /g behaviour by
2749       first   trying   the   match   again  at  the  same  offset,  with  the
2750       PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED options,  and  then  if  that
2751       fails,  advancing  the  starting  offset  and  trying an ordinary match
2752       again. There is some code that demonstrates  how  to  do  this  in  the
2753       pcre2demo  sample  program. In the most general case, you have to check
2754       to see if the newline convention recognizes CRLF as a newline,  and  if
2755       so,  and the current character is CR followed by LF, advance the start-
2756       ing offset by two characters instead of one.
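
       The  "advance the starting offset" step might look like the following
       sketch for an 8-bit subject. It is modelled loosely on the  approach
       described above, not copied from pcre2demo. The helper name and its
       two flag parameters are assumptions: the caller is expected  to  de-
       rive crlf_is_newline from PCRE2_INFO_NEWLINE and utf8 from the pat-
       tern's options.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: after an empty match at offset could not be
   replaced by a non-empty one, return the offset at which to try an
   ordinary match next. */
static size_t next_start_offset(const unsigned char *subject, size_t length,
                                size_t offset, int crlf_is_newline, int utf8)
{
  assert(offset < length);  /* the caller stops at the end of the subject */

  /* Step over CR LF as a unit when the convention recognizes CRLF. */
  if (crlf_is_newline && offset + 1 < length &&
      subject[offset] == '\r' && subject[offset + 1] == '\n')
    return offset + 2;

  offset++;
  if (utf8) {
    /* Skip UTF-8 continuation bytes (10xxxxxx) to reach the start of the
       next character. */
    while (offset < length && (subject[offset] & 0xc0) == 0x80)
      offset++;
  }
  return offset;
}
```

       For the subject "a\r\nb" and offset 1, the result is 3  when  CRLF  is
       recognized as a newline and 2 when it is not.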
2757
2758       If a non-zero starting offset is passed when the pattern is anchored, a
2759       single attempt to match at the given offset is made. This can only suc-
2760       ceed if the pattern does not require the match to be at  the  start  of
2761       the  subject.  In other words, the anchoring must be the result of set-
2762       ting the PCRE2_ANCHORED option or the use of .* with PCRE2_DOTALL,  not
2763       by starting the pattern with ^ or \A.
2764
2765   Option bits for pcre2_match()
2766
2767       The unused bits of the options argument for pcre2_match() must be zero.
2768       The    only    bits    that    may    be    set   are   PCRE2_ANCHORED,
2769       PCRE2_COPY_MATCHED_SUBJECT, PCRE2_DISABLE_RECURSELOOP_CHECK,  PCRE2_EN-
2770       DANCHORED,       PCRE2_NOTBOL,       PCRE2_NOTEOL,      PCRE2_NOTEMPTY,
2771       PCRE2_NOTEMPTY_ATSTART,  PCRE2_NO_JIT,  PCRE2_NO_UTF_CHECK,  PCRE2_PAR-
2772       TIAL_HARD, and PCRE2_PARTIAL_SOFT.  Their action is described below.
2773
2774       Setting  PCRE2_ANCHORED  or PCRE2_ENDANCHORED at match time is not sup-
2775       ported by the just-in-time (JIT) compiler. If it is set,  JIT  matching
2776       is  disabled  and  the  interpretive  code  in  pcre2_match()  is  run.
2777       PCRE2_DISABLE_RECURSELOOP_CHECK is  ignored  by  JIT,  but  apart  from
2778       PCRE2_NO_JIT  (obviously),  the remaining options are supported for JIT
2779       matching.
2780
2781         PCRE2_ANCHORED
2782
2783       The PCRE2_ANCHORED option limits pcre2_match() to matching at the first
2784       matching position. If a pattern was compiled  with  PCRE2_ANCHORED,  or
2785       turned  out to be anchored by virtue of its contents, it cannot be made
2786       unanchored at matching time. Note that setting the option at match time
2787       disables JIT matching.
2788
2789         PCRE2_COPY_MATCHED_SUBJECT
2790
2791       By  default,  a  pointer to the subject is remembered in the match data
2792       block so that, after a successful match, it can be  referenced  by  the
2793       substring  extraction  functions.  This means that the subject's memory
2794       must not be freed until all such operations are complete. For some  ap-
2795       plications  where the lifetime of the subject string is not guaranteed,
2796       it may be necessary to make a copy of the subject  string,  but  it  is
2797       wasteful  to do this unless the match is successful. After a successful
2798       match, if PCRE2_COPY_MATCHED_SUBJECT is set, the subject is copied  and
2799       the  new  pointer  is remembered in the match data block instead of the
2800       original subject pointer. The memory allocator that was  used  for  the
2801       match  block  itself  is  used.  The  copy  is automatically freed when
2802       pcre2_match_data_free() is called to free the match data block.  It  is
2803       also automatically freed if the match data block is re-used for another
2804       match operation.
2805
2806         PCRE2_DISABLE_RECURSELOOP_CHECK
2807
2808       This  option  is relevant only to pcre2_match() for interpretive match-
2809       ing.   It  is  ignored  when  JIT  is  used,  and  is   forbidden   for
2810       pcre2_dfa_match().
2811
2812       The use of recursion in patterns can lead to infinite loops. In the in-
2813       terpretive  matcher  these  would  be eventually caught by the match or
2814       heap limits, but this could take a long time and/or use a lot of memory
2815       if the limits are large. There is therefore a check  at  the  start  of
2816       each  recursion.   If  the  same  group is still active from a previous
2817       call, and the current subject pointer is the same  as  it  was  at  the
2818       start  of  that group, and the furthest inspected character of the sub-
2819       ject has not changed, an error is generated.
2820
2821       There are rare cases of matches that would complete,  but  nevertheless
2822       trigger  this  error.  This  option  disables the check. It is provided
2823       mainly for testing when comparing JIT and interpretive behaviour.
2824
2825         PCRE2_ENDANCHORED
2826
2827       If the PCRE2_ENDANCHORED option is set, any string  that  pcre2_match()
2828       matches  must be right at the end of the subject string. Note that set-
2829       ting the option at match time disables JIT matching.
2830
2831         PCRE2_NOTBOL
2832
2833       This option specifies that the first character of the subject string is
2834       not the beginning of a line, so the circumflex metacharacter should not
2835       match  before  it.  Setting  this without having set PCRE2_MULTILINE at
2836       compile time causes circumflex never to match. This option affects only
2837       the behaviour of the circumflex metacharacter. It does not affect \A.
2838
2839         PCRE2_NOTEOL
2840
2841       This option specifies that the end of the subject string is not the end
2842       of a line, so the dollar metacharacter should not match it nor  (except
2843       in  multiline mode) a newline immediately before it. Setting this with-
2844       out having set PCRE2_MULTILINE at compile time causes dollar  never  to
2845       match. This option affects only the behaviour of the dollar metacharac-
2846       ter. It does not affect \Z or \z.
2847
2848         PCRE2_NOTEMPTY
2849
2850       An empty string is not considered to be a valid match if this option is
2851       set.  If  there are alternatives in the pattern, they are tried. If all
2852       the alternatives match the empty string, the entire  match  fails.  For
2853       example, if the pattern
2854
2855         a?b?
2856
2857       is  applied  to  a  string not beginning with "a" or "b", it matches an
2858       empty string at the start of the subject. With PCRE2_NOTEMPTY set, this
2859       match is not valid, so pcre2_match() searches further into  the  string
2860       for occurrences of "a" or "b".
2861
2862         PCRE2_NOTEMPTY_ATSTART
2863
2864       This  is  like PCRE2_NOTEMPTY, except that it locks out an empty string
2865       match only at the first matching position, that is, at the start of the
2866       subject plus the starting offset. An empty string match  later  in  the
2867       subject is permitted.  If the pattern is anchored, such a match can oc-
2868       cur only if the pattern contains \K.
2869
2870         PCRE2_NO_JIT
2871
2872       By   default,   if   a  pattern  has  been  successfully  processed  by
2873       pcre2_jit_compile(), JIT is automatically used  when  pcre2_match()  is
2874       called  with  options  that JIT supports. Setting PCRE2_NO_JIT disables
2875       the use of JIT; it forces matching to be done by the interpreter.
2876
2877         PCRE2_NO_UTF_CHECK
2878
2879       When PCRE2_UTF is set at compile time, the validity of the subject as a
2880       UTF  string  is  checked  unless  PCRE2_NO_UTF_CHECK   is   passed   to
2881       pcre2_match() or PCRE2_MATCH_INVALID_UTF was passed to pcre2_compile().
2882       The latter special case is discussed in detail in the pcre2unicode doc-
2883       umentation.
2884
2885       In  the default case, if a non-zero starting offset is given, the check
2886       is applied only to that part of the subject  that  could  be  inspected
2887       during  matching,  and there is a check that the starting offset points
2888       to the first code unit of a character or to the end of the subject.  If
2889       there  are no lookbehind assertions in the pattern, the check starts at
2890       the starting offset.  Otherwise, it starts at the length of the longest
2891       lookbehind before the starting offset, or at the start of  the  subject
2892       if  there are not that many characters before the starting offset. Note
2893       that the sequences \b and \B are one-character lookbehinds.
2894
2895       The check is carried out before any other processing takes place, and a
2896       negative error code is returned if the check fails. There  are  several
2897       UTF  error  codes  for each code unit width, corresponding to different
2898       problems with the code unit sequence. There are discussions  about  the
2899       validity  of  UTF-8  strings, UTF-16 strings, and UTF-32 strings in the
2900       pcre2unicode documentation.
2901
2902       If you know that your subject is valid, and you want to skip this check
2903       for performance reasons, you can set the PCRE2_NO_UTF_CHECK option when
2904       calling pcre2_match(). You might want to do this  for  the  second  and
2905       subsequent  calls  to pcre2_match() if you are making repeated calls to
2906       find multiple matches in the same subject string.
2907
2908       Warning: Unless PCRE2_MATCH_INVALID_UTF was set at compile  time,  when
2909       PCRE2_NO_UTF_CHECK  is  set  at match time the effect of passing an in-
2910       valid string as a subject, or an invalid value of startoffset, is unde-
2911       fined.  Your program may crash or loop indefinitely or give  wrong  re-
2912       sults.
2913
2914         PCRE2_PARTIAL_HARD
2915         PCRE2_PARTIAL_SOFT
2916
2917       These options turn on the partial matching feature. A partial match oc-
2918       curs  if  the  end  of  the subject string is reached successfully, but
2919       there are not enough subject characters to complete the match. In addi-
2920       tion, either at least one character must have  been  inspected  or  the
2921       pattern  must  contain  a  lookbehind,  or the pattern must be one that
2922       could match an empty string.
2923
2924       If this situation arises when PCRE2_PARTIAL_SOFT  (but  not  PCRE2_PAR-
2925       TIAL_HARD) is set, matching continues by testing any remaining alterna-
2926       tives.  Only  if  no complete match can be found is PCRE2_ERROR_PARTIAL
2927       returned instead of PCRE2_ERROR_NOMATCH.  In  other  words,  PCRE2_PAR-
2928       TIAL_SOFT  specifies  that  the  caller is prepared to handle a partial
2929       match, but only if no complete match can be found.
2930
2931       If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In  this
2932       case,  if  a  partial match is found, pcre2_match() immediately returns
2933       PCRE2_ERROR_PARTIAL, without considering  any  other  alternatives.  In
2934       other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid-
2935       ered to be more important than an alternative complete match.
2936
2937       There is a more detailed discussion of partial and multi-segment match-
2938       ing, with examples, in the pcre2partial documentation.
2939
2940
2941NEWLINE HANDLING WHEN MATCHING
2942
2943       When  PCRE2 is built, a default newline convention is set; this is usu-
2944       ally the standard convention for the operating system. The default  can
2945       be  overridden  in a compile context by calling pcre2_set_newline(). It
2946       can also be overridden by starting a pattern string with, for  example,
2947       (*CRLF),  as  described  in  the  section on newline conventions in the
2948       pcre2pattern page. During matching, the newline choice affects the  be-
2949       haviour  of the dot, circumflex, and dollar metacharacters. It may also
2950       alter the way the match starting position is  advanced  after  a  match
2951       failure for an unanchored pattern.
2952
2953       When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
2954       set  as  the  newline convention, and a match attempt for an unanchored
2955       pattern fails when the current starting position is at a CRLF sequence,
2956       and the pattern contains no explicit matches for CR or  LF  characters,
2957       the  match  position  is  advanced by two characters instead of one, in
2958       other words, to after the CRLF.
2959
2960       The above rule is a compromise that makes the most common cases work as
2961       expected. For example, if the pattern is .+A (and the PCRE2_DOTALL  op-
2962       tion  is  not set), it does not match the string "\r\nA" because, after
2963       failing at the start, it skips both the CR and the LF before  retrying.
2964       However,  the  pattern  [\r\n]A does match that string, because it con-
2965       tains an explicit CR or LF reference, and so advances only by one char-
2966       acter after the first failure.
2967
2968       An explicit match for CR or LF is either a literal appearance of one of
2969       those characters in the pattern, or one of the \r or \n  or  equivalent
2970       octal or hexadecimal escape sequences. Implicit matches such as [^X] do
2971       not  count, nor does \s, even though it includes CR and LF in the char-
2972       acters that it matches.
2973
2974       Notwithstanding the above, anomalous effects may still occur when  CRLF
2975       is a valid newline sequence and explicit \r or \n escapes appear in the
2976       pattern.
2977
2978
2979HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
2980
2981       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
2982
2983       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
2984
2985       In  general, a pattern matches a certain portion of the subject, and in
2986       addition, further substrings from the subject  may  be  picked  out  by
2987       parenthesized  parts  of  the  pattern.  Following the usage in Jeffrey
2988       Friedl's book, this is called "capturing"  in  what  follows,  and  the
2989       phrase  "capture  group" (Perl terminology) is used for a fragment of a
2990       pattern that picks out a substring. PCRE2 supports several other  kinds
2991       of parenthesized group that do not cause substrings to be captured. The
2992       pcre2_pattern_info()  function can be used to find out how many capture
2993       groups there are in a compiled pattern.
2994
2995       You can use auxiliary functions for accessing  captured  substrings  by
2996       number or by name, as described in sections below.
2997
2998       Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
2999       ues,  called  the  ovector,  which  contains  the  offsets  of captured
3000       strings.  It  is  part  of  the  match  data   block.    The   function
3001       pcre2_get_ovector_pointer()  returns  the  address  of the ovector, and
3002       pcre2_get_ovector_count() returns the number of pairs of values it con-
3003       tains.
3004
3005       Within the ovector, the first in each pair of values is set to the off-
3006       set of the first code unit of a substring, and the second is set to the
3007       offset of the first code unit after the end of a substring. These  val-
3008       ues  are always code unit offsets, not character offsets. That is, they
3009       are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit li-
3010       brary, and 32-bit offsets in the 32-bit library.
3011
3012       After a partial match  (error  return  PCRE2_ERROR_PARTIAL),  only  the
3013       first  pair  of  offsets  (that is, ovector[0] and ovector[1]) are set.
3014       They identify the part of the subject that was partially  matched.  See
3015       the pcre2partial documentation for details of partial matching.
3016
3017       After  a  fully  successful match, the first pair of offsets identifies
3018       the portion of the subject string that was matched by the  entire  pat-
3019       tern.  The  next  pair is used for the first captured substring, and so
3020       on. The value returned by pcre2_match() is one more  than  the  highest
3021       numbered  pair  that  has been set. For example, if two substrings have
3022       been captured, the returned value is 3. If there are no  captured  sub-
3023       strings, the return value from a successful match is 1, indicating that
3024       just the first pair of offsets has been set.
3025
3026       If  a  pattern uses the \K escape sequence within a positive assertion,
3027       the reported start of a successful match can be greater than the end of
3028       the match.  For example, if the pattern  (?=ab\K)  is  matched  against
3029       "ab", the start and end offset values for the match are 2 and 0.
3030
3031       If  a  capture group is matched repeatedly within a single match opera-
3032       tion, it is the last portion of the subject that it matched that is re-
3033       turned.
3034
3035       If the ovector is too small to hold all the captured substring offsets,
3036       as much as possible is filled in, and the function returns a  value  of
3037       zero.  If captured substrings are not of interest, pcre2_match() may be
3038       called with a match data block whose ovector is of minimum length (that
3039       is, one pair).
3040
3041       It is possible for capture group number n+1 to match some part  of  the
3042       subject  when  group  n  has  not been used at all. For example, if the
3043       string "abc" is matched against the pattern (a|(z))(bc) the return from
3044       the function is 4, and groups 1 and 3 are matched, but 2 is  not.  When
3045       this  happens,  both values in the offset pairs corresponding to unused
3046       groups are set to PCRE2_UNSET.
3047
3048       Offset values that correspond to unused groups at the end  of  the  ex-
3049       pression  are also set to PCRE2_UNSET. For example, if the string "abc"
3050       is matched against the pattern (abc)(x(yz)?)? groups 2 and  3  are  not
3051       matched.  The  return  from the function is 2, because the highest used
3052       capture group number is 1. The offsets for the second and third capture
3053       groups (assuming the vector is large enough,  of  course)  are  set  to
3054       PCRE2_UNSET.
3055
3056       Elements in the ovector that do not correspond to capturing parentheses
3057       in the pattern are never changed. That is, if a pattern contains n cap-
3058       turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
3059       pcre2_match().  The  other  elements retain whatever values they previ-
3060       ously had. After a failed match attempt, the contents  of  the  ovector
3061       are unchanged.
3062
3063
3064OTHER INFORMATION ABOUT A MATCH
3065
3066       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
3067
3068       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
3069
3070       As  well as the offsets in the ovector, other information about a match
3071       is retained in the match data block and can be retrieved by  the  above
3072       functions  in  appropriate  circumstances.  If they are called at other
3073       times, the result is undefined.
3074
3075       After a successful match, a partial match (PCRE2_ERROR_PARTIAL),  or  a
3076       failure  to  match (PCRE2_ERROR_NOMATCH), a mark name may be available.
3077       The function pcre2_get_mark() can be called to access this name,  which
3078       can  be  specified  in  the  pattern by any of the backtracking control
3079       verbs, not just (*MARK). The same function applies to all the verbs. It
3080       returns a pointer to the zero-terminated name, which is within the com-
3081       piled pattern. If no name is available, NULL is returned. The length of
3082       the name (excluding the terminating zero) is stored in  the  code  unit
3083       that  precedes  the name. You should use this length instead of relying
3084       on the terminating zero if the name might contain a binary zero.
3085
3086       After a successful match, the name that is returned is  the  last  mark
3087       name encountered on the matching path through the pattern. Instances of
3088       backtracking  verbs  without  names do not count. Thus, for example, if
3089       the matching path contains (*MARK:A)(*PRUNE), the name "A" is returned.
3090       After a "no match" or a partial match, the last encountered name is re-
3091       turned. For example, consider this pattern:
3092
3093         ^(*MARK:A)((*MARK:B)a|b)c
3094
3095       When it matches "bc", the returned name is A. The B mark is  "seen"  in
3096       the  first  branch of the group, but it is not on the matching path. On
3097       the other hand, when this pattern fails to  match  "bx",  the  returned
3098       name is B.
3099
3100       Warning:  By  default, certain start-of-match optimizations are used to
3101       give a fast "no match" result in some situations. For example,  if  the
3102       anchoring  is removed from the pattern above, there is an initial check
3103       for the presence of "c" in the subject before running the matching  en-
3104       gine. This check fails for "bx", causing a match failure without seeing
3105       any  marks. You can disable the start-of-match optimizations by setting
3106       the PCRE2_NO_START_OPTIMIZE option for pcre2_compile() or  by  starting
3107       the pattern with (*NO_START_OPT).
3108
3109       After  a  successful  match, a partial match, or one of the invalid UTF
3110       errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar()  can
3111       be called. After a successful or partial match it returns the code unit
3112       offset  of  the character at which the match started. For a non-partial
3113       match, this can be different to the value of ovector[0] if the  pattern
3114       contains  the  \K escape sequence. After a partial match, however, this
3115       value is always the same as ovector[0] because \K does not  affect  the
3116       result of a partial match.
3117
3118       After  a UTF check failure, pcre2_get_startchar() can be used to obtain
3119       the code unit offset of the invalid UTF character. Details are given in
3120       the pcre2unicode page.
3121
3122
3123ERROR RETURNS FROM pcre2_match()
3124
3125       If pcre2_match() fails, it returns a negative number. This can be  con-
3126       verted  to a text string by calling the pcre2_get_error_message() func-
3127       tion (see "Obtaining a textual error message" below).   Negative  error
3128       codes  are  also  returned  by other functions, and are documented with
3129       them. The codes are given names in the header file. If UTF checking  is
3130       in force and an invalid UTF subject string is detected, one of a number
3131       of  UTF-specific negative error codes is returned. Details are given in
3132       the pcre2unicode page. The following are the other errors that  may  be
3133       returned by pcre2_match():
3134
3135         PCRE2_ERROR_NOMATCH
3136
3137       The subject string did not match the pattern.
3138
3139         PCRE2_ERROR_PARTIAL
3140
3141       The  subject  string did not match, but it did match partially. See the
3142       pcre2partial documentation for details of partial matching.
3143
3144         PCRE2_ERROR_BADMAGIC
3145
3146       PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
3147       to catch the case when it is passed a junk pointer. This is  the  error
3148       that is returned when the magic number is not present.
3149
3150         PCRE2_ERROR_BADMODE
3151
3152       This  error is given when a compiled pattern is passed to a function in
3153       a library of a different code unit width, for example, a  pattern  com-
3154       piled  by  the  8-bit  library  is passed to a 16-bit or 32-bit library
3155       function.
3156
3157         PCRE2_ERROR_BADOFFSET
3158
3159       The value of startoffset was greater than the length of the subject.
3160
3161         PCRE2_ERROR_BADOPTION
3162
3163       An unrecognized bit was set in the options argument.
3164
3165         PCRE2_ERROR_BADUTFOFFSET
3166
3167       The UTF code unit sequence that was passed as a subject was checked and
3168       found to be valid (the PCRE2_NO_UTF_CHECK option was not set), but  the
3169       value  of startoffset did not point to the beginning of a UTF character
3170       or the end of the subject.
3171
3172         PCRE2_ERROR_CALLOUT
3173
3174       This error is never generated by pcre2_match() itself. It  is  provided
3175       for  use  by  callout  functions  that  want  to cause pcre2_match() or
3176       pcre2_callout_enumerate() to return a distinctive error code.  See  the
3177       pcre2callout documentation for details.
3178
3179         PCRE2_ERROR_DEPTHLIMIT
3180
3181       The nested backtracking depth limit was reached.
3182
3183         PCRE2_ERROR_HEAPLIMIT
3184
3185       The heap limit was reached.
3186
3187         PCRE2_ERROR_INTERNAL
3188
3189       An  unexpected  internal error has occurred. This error could be caused
3190       by a bug in PCRE2 or by overwriting of the compiled pattern.
3191
3192         PCRE2_ERROR_JIT_STACKLIMIT
3193
3194       This error is returned when a pattern that was successfully compiled
3195       using JIT is being matched, but the memory available for the just-in-time
3196       processing stack is not large enough. See  the  pcre2jit  documentation
3197       for more details.
3198
3199         PCRE2_ERROR_MATCHLIMIT
3200
3201       The backtracking match limit was reached.
3202
3203         PCRE2_ERROR_NOMEMORY
3204
3205       Heap  memory  is  used  to  remember backtracking points. This error is
3206       given when the memory allocation function (default  or  custom)  fails.
3207       Note  that  a  different  error, PCRE2_ERROR_HEAPLIMIT, is given if the
3208       amount of memory needed exceeds the heap limit. PCRE2_ERROR_NOMEMORY is
3209       also returned if PCRE2_COPY_MATCHED_SUBJECT is set and  memory  alloca-
3210       tion fails.
3211
3212         PCRE2_ERROR_NULL
3213
3214       Either the code, subject, or match_data argument was passed as NULL.
3215
3216         PCRE2_ERROR_RECURSELOOP
3217
3218       This  error  is  returned  when  pcre2_match() detects a recursion loop
3219       within the pattern. Specifically, it means that either the  whole  pat-
3220       tern or a capture group has been called recursively for the second time
3221       at  the  same position in the subject string. Some simple patterns that
3222       might do this are detected and faulted at compile time, but  more  com-
3223       plicated  cases,  in particular mutual recursions between two different
3224       groups, cannot be detected until matching is attempted.
3225
3226
3227OBTAINING A TEXTUAL ERROR MESSAGE
3228
3229       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
3230         PCRE2_SIZE bufflen);
3231
3232       A text message for an error code  from  any  PCRE2  function  (compile,
3233       match,  or  auxiliary)  can be obtained by calling pcre2_get_error_mes-
3234       sage(). The code is passed as the first argument,  with  the  remaining
3235       two  arguments  specifying  a  code  unit buffer and its length in code
3236       units, into which the text message is placed. The message  is  returned
3237       in  code  units  of the appropriate width for the library that is being
3238       used.
3239
3240       The returned message is terminated with a trailing zero, and the  func-
3241       tion  returns  the  number  of  code units used, excluding the trailing
3242       zero. If the error number is unknown, the negative error code PCRE2_ER-
3243       ROR_BADDATA is returned. If the buffer is too  small,  the  message  is
3244       truncated (but still with a trailing zero), and the negative error code
3245       PCRE2_ERROR_NOMEMORY  is returned.  None of the messages are very long;
3246       a buffer size of 120 code units is ample.
3247
3248
3249EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
3250
3251       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
3252         uint32_t number, PCRE2_SIZE *length);
3253
3254       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
3255         uint32_t number, PCRE2_UCHAR *buffer,
3256         PCRE2_SIZE *bufflen);
3257
3258       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
3259         uint32_t number, PCRE2_UCHAR **bufferptr,
3260         PCRE2_SIZE *bufflen);
3261
3262       void pcre2_substring_free(PCRE2_UCHAR *buffer);
3263
3264       Captured substrings can be accessed directly by using  the  ovector  as
3265       described above.  For convenience, auxiliary functions are provided for
3266       extracting   captured  substrings  as  new,  separate,  zero-terminated
3267       strings. A substring that contains a binary zero is correctly extracted
3268       and has a further zero added on the end, but  the  result  is  not,  of
3269       course, a C string.
3270
3271       The functions in this section identify substrings by number. The number
3272       zero refers to the entire matched substring, with higher numbers refer-
3273       ring  to  substrings  captured by parenthesized groups. After a partial
3274       match, only substring zero is available.  An  attempt  to  extract  any
3275       other  substring  gives the error PCRE2_ERROR_PARTIAL. The next section
3276       describes similar functions for extracting captured substrings by name.
3277
3278       If a pattern uses the \K escape sequence within a  positive  assertion,
3279       the reported start of a successful match can be greater than the end of
3280       the  match.   For  example,  if the pattern (?=ab\K) is matched against
3281       "ab", the start and end offset values for the match are  2  and  0.  In
3282       this  situation,  calling  these functions with a zero substring number
3283       extracts a zero-length empty string.
3284
3285       You can find the length in code units of a captured  substring  without
3286       extracting  it  by calling pcre2_substring_length_bynumber(). The first
3287       argument is a pointer to the match data block, the second is the  group
3288       number,  and the third is a pointer to a variable into which the length
3289       is placed. If you just want to know whether or not  the  substring  has
3290       been captured, you can pass the third argument as NULL.
3291
3292       The  pcre2_substring_copy_bynumber()  function  copies  a captured sub-
3293       string into a supplied buffer,  whereas  pcre2_substring_get_bynumber()
3294       copies  it  into  new memory, obtained using the same memory allocation
3295       function that was used for the match data block. The  first  two  argu-
3296       ments  of  these  functions are a pointer to the match data block and a
3297       capture group number.
3298
3299       The final arguments of pcre2_substring_copy_bynumber() are a pointer to
3300       the buffer and a pointer to a variable that contains its length in code
3301       units.  This is updated to contain the actual number of code units used
3302       for the extracted substring, excluding the terminating zero.
3303
3304       For pcre2_substring_get_bynumber() the third and fourth arguments point
3305       to variables that are updated with a pointer to the new memory and  the
3306       number  of  code units that comprise the substring, again excluding the
3307       terminating zero. When the substring is no longer  needed,  the  memory
3308       should be freed by calling pcre2_substring_free().
3309
3310       The  return  value  from  all these functions is zero for success, or a
3311       negative error code. If the pattern match  failed,  the  match  failure
3312       code  is returned.  If a substring number greater than zero is used af-
3313       ter a partial match, PCRE2_ERROR_PARTIAL is  returned.  Other  possible
3314       error codes are:
3315
3316         PCRE2_ERROR_NOMEMORY
3317
3318       The  buffer  was  too small for pcre2_substring_copy_bynumber(), or the
3319       attempt to get memory failed for pcre2_substring_get_bynumber().
3320
3321         PCRE2_ERROR_NOSUBSTRING
3322
3323       There is no substring with that number in the  pattern,  that  is,  the
3324       number is greater than the number of capturing parentheses.
3325
3326         PCRE2_ERROR_UNAVAILABLE
3327
3328       The substring number, though not greater than the number of captures in
3329       the pattern, is greater than the number of slots in the ovector, so the
3330       substring could not be captured.
3331
3332         PCRE2_ERROR_UNSET
3333
3334       The  substring  did  not  participate in the match. For example, if the
3335       pattern is (abc)|(def) and the subject is "def", and the  ovector  con-
3336       tains at least two capturing slots, substring number 1 is unset.
3337
3338
3339EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS
3340
3341       int pcre2_substring_list_get(pcre2_match_data *match_data,
3342         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);
3343
3344       void pcre2_substring_list_free(PCRE2_UCHAR **list);
3345
3346       The  pcre2_substring_list_get()  function  extracts  all available sub-
3347       strings and builds a list of pointers to  them.  It  also  (optionally)
3348       builds  a  second list that contains their lengths (in code units), ex-
3349       cluding a terminating zero that is added to each of them. All  this  is
3350       done in a single block of memory that is obtained using the same memory
3351       allocation function that was used to get the match data block.
3352
3353       This  function  must be called only after a successful match. If called
3354       after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.
3355
3356       The address of the memory block is returned via listptr, which is  also
3357       the start of the list of string pointers. The end of the list is marked
3358       by  a  NULL pointer. The address of the list of lengths is returned via
3359       lengthsptr. If your strings do not contain binary zeros and you do  not
3360       therefore need the lengths, you may supply NULL as the lengthsptr argu-
3361       ment  to  disable  the  creation of a list of lengths. The yield of the
3362       function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the  mem-
3363       ory  block could not be obtained. When the list is no longer needed, it
3364       should be freed by calling pcre2_substring_list_free().
3365
3366       If this function encounters a substring that is unset, which can happen
3367       when capture group number n+1 matches some part  of  the  subject,  but
3368       group  n has not been used at all, it returns an empty string. This can
3369       be distinguished from a genuine zero-length substring by inspecting the
3370       appropriate offset in the ovector, which contains PCRE2_UNSET for unset
3371       substrings, or by calling pcre2_substring_length_bynumber().
3372
3373
3374EXTRACTING CAPTURED SUBSTRINGS BY NAME
3375
3376       int pcre2_substring_number_from_name(const pcre2_code *code,
3377         PCRE2_SPTR name);
3378
3379       int pcre2_substring_length_byname(pcre2_match_data *match_data,
3380         PCRE2_SPTR name, PCRE2_SIZE *length);
3381
3382       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
3383         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);
3384
3385       int pcre2_substring_get_byname(pcre2_match_data *match_data,
3386         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);
3387
3388       void pcre2_substring_free(PCRE2_UCHAR *buffer);
3389
3390       To  extract a substring by name, you first have to find the associated
3391       number. For example, for this pattern:
3392
3393         (a+)b(?<xxx>\d+)...
3394
3395       the number of the capture group called "xxx" is 2. If the name is known
3396       to be unique (PCRE2_DUPNAMES was not set), you can find the number from
3397       the name by calling pcre2_substring_number_from_name(). The first argu-
3398       ment is the compiled pattern, and the second is the name. The yield  of
3399       the  function  is the group number, PCRE2_ERROR_NOSUBSTRING if there is
3400       no group with that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if  there  is
3401       more  than one group with that name.  Given the number, you can extract
3402       the substring directly from the ovector, or use one of  the  "bynumber"
3403       functions described above.
3404
3405       For  convenience,  there are also "byname" functions that correspond to
3406       the "bynumber" functions, the only difference being that the second ar-
3407       gument is a name instead of a number.  If  PCRE2_DUPNAMES  is  set  and
3408       there are duplicate names, these functions scan all the groups with the
3409       given  name,  and  return  the  captured substring from the first named
3410       group that is set.
3411
3412       If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING  is
3413       returned.  If  all  groups  with the name have numbers that are greater
3414       than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is re-
3415       turned. If there is at least one group with a slot in the ovector,  but
3416       no group is found to be set, PCRE2_ERROR_UNSET is returned.
3417
3418       Warning: If the pattern uses the (?| feature to set up multiple capture
3419       groups  with  the same number, as described in the section on duplicate
3420       group numbers in the pcre2pattern page, you cannot use names to distin-
3421       guish the different capture groups, because names are not  included  in
3422       the  compiled  code.  The  matching process uses only numbers. For this
3423       reason, the use of different names for  groups  with  the  same  number
3424       causes an error at compile time.
3425
3426
3427CREATING A NEW STRING WITH SUBSTITUTIONS
3428
3429       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
3430         PCRE2_SIZE length, PCRE2_SIZE startoffset,
3431         uint32_t options, pcre2_match_data *match_data,
3432         pcre2_match_context *mcontext, PCRE2_SPTR replacement,
3433         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
3434         PCRE2_SIZE *outlengthptr);
3435
3436       This  function  optionally calls pcre2_match() and then makes a copy of
3437       the subject string in outputbuffer, replacing parts that  were  matched
3438       with the replacement string, whose length is supplied in rlength, which
3439       can  be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. As
3440       a special case, if replacement is NULL and rlength  is  zero,  the  re-
3441       placement  is assumed to be an empty string. If rlength is non-zero, an
3442       error occurs if replacement is NULL.
3443
3444       There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to re-
3445       turn just the replacement string(s). The default action is  to  perform
3446       just  one  replacement  if  the pattern matches, but there is an option
3447       that requests multiple replacements  (see  PCRE2_SUBSTITUTE_GLOBAL  be-
3448       low).
3449
3450       If  successful,  pcre2_substitute() returns the number of substitutions
3451       that were carried out. This may be zero if no match was found,  and  is
3452       never  greater  than one unless PCRE2_SUBSTITUTE_GLOBAL is set. A nega-
3453       tive value is returned if an error is detected.
3454
3455       Matches in which a \K item in a lookahead in  the  pattern  causes  the
3456       match  to  end  before it starts are not supported, and give rise to an
3457       error return. For global replacements, matches in which \K in a lookbe-
3458       hind causes the match to start earlier than the point that was  reached
3459       in the previous iteration are also not supported.
3460
3461       The  first  seven  arguments  of pcre2_substitute() are the same as for
3462       pcre2_match(), except that the partial matching options are not permit-
3463       ted, and match_data may be passed as NULL, in which case a  match  data
3464       block  is obtained and freed within this function, using memory manage-
3465       ment functions from the match context, if provided, or else those  that
3466       were used to allocate memory for the compiled code.
3467
3468       If  match_data is not NULL and PCRE2_SUBSTITUTE_MATCHED is not set, the
3469       provided block is used for all calls to pcre2_match(), and its contents
3470       afterwards are the result of the final call. For global  changes,  this
3471       will always be a no-match error. The contents of the ovector within the
3472       match data block may or may not have been changed.
3473
3474       As  well as the usual options for pcre2_match(), a number of additional
3475       options can be set in the options argument of pcre2_substitute().   One
3476       such  option is PCRE2_SUBSTITUTE_MATCHED. When this is set, an external
3477       match_data block must be provided, and it must have already  been  used
3478       for an external call to pcre2_match() with the same pattern and subject
3479       arguments.  The  data in the match_data block (return code, offset vec-
3480       tor) is then  used  for  the  first  substitution  instead  of  calling
3481       pcre2_match()  from  within pcre2_substitute(). This allows an applica-
3482       tion to check for a match before choosing to substitute, without having
3483       to repeat the match.
3484
3485       The contents of the  externally  supplied  match  data  block  are  not
3486       changed   when   PCRE2_SUBSTITUTE_MATCHED   is  set.  If  PCRE2_SUBSTI-
3487       TUTE_GLOBAL is also set, pcre2_match() is called after the  first  sub-
3488       stitution  to  check for further matches, but this is done using an in-
3489       ternally obtained match data block, thus always  leaving  the  external
3490       block unchanged.
3491
3492       The  code  argument is not used for matching before the first substitu-
3493       tion when PCRE2_SUBSTITUTE_MATCHED is set, but  it  must  be  provided,
3494       even  when  PCRE2_SUBSTITUTE_GLOBAL is not set, because it contains in-
3495       formation such as the UTF setting and the number of capturing parenthe-
3496       ses in the pattern.
3497
3498       The default action of pcre2_substitute() is to return  a  copy  of  the
3499       subject string with matched substrings replaced. However, if PCRE2_SUB-
3500       STITUTE_REPLACEMENT_ONLY  is  set,  only the replacement substrings are
3501       returned. In the global case, multiple replacements are concatenated in
3502       the output buffer. Substitution callouts (see below)  can  be  used  to
3503       separate them if necessary.
3504
3505       The  outlengthptr  argument of pcre2_substitute() must point to a vari-
3506       able that contains the length, in code units, of the output buffer.  If
3507       the  function is successful, the value is updated to contain the length
3508       in code units of the new string, excluding the trailing  zero  that  is
3509       automatically added.
3510
3511       If  the  function is not successful, the value set via outlengthptr de-
3512       pends on the type of  error.  For  syntax  errors  in  the  replacement
3513       string, the value is the offset in the replacement string where the er-
3514       ror  was  detected.  For  other errors, the value is PCRE2_UNSET by de-
3515       fault. This includes the case of the output buffer being too small, un-
3516       less PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set.
3517
3518       PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when  the  output
3519       buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
3520       ORY  immediately.  If  this  option is set, however, pcre2_substitute()
3521       continues to go through the motions of matching and substituting (with-
3522       out, of course, writing anything) in  order  to  compute  the  size  of
3523       buffer  that  is needed. This value is passed back via the outlengthptr
3524       variable, with  the  result  of  the  function  still  being  PCRE2_ER-
3525       ROR_NOMEMORY.
3526
       Passing a buffer size of zero is a permitted way of finding out how
       much memory is needed for a given substitution. However, this does
       mean that the entire operation is carried out twice. Depending on the
       application, it may be more efficient to allocate a large buffer and
       free the excess afterwards, instead of using
       PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.
3533
3534       The  replacement  string,  which  is interpreted as a UTF string in UTF
3535       mode, is checked for UTF validity unless PCRE2_NO_UTF_CHECK is set.  An
3536       invalid UTF replacement string causes an immediate return with the rel-
3537       evant UTF error code.
3538
3539       If  PCRE2_SUBSTITUTE_LITERAL  is set, the replacement string is not in-
3540       terpreted in any way. By default, however, a dollar character is an es-
3541       cape character that can specify the insertion of characters  from  cap-
3542       ture  groups  and names from (*MARK) or other control verbs in the pat-
3543       tern. Dollar is the only escape character (backslash is treated as lit-
3544       eral). The following forms are always recognized:
3545
3546         $$                  insert a dollar character
3547         $<n> or ${<n>}      insert the contents of group <n>
3548         $*MARK or ${*MARK}  insert a control verb name
3549
3550       Either a group number or a group name  can  be  given  for  <n>.  Curly
3551       brackets  are  required only if the following character would be inter-
3552       preted as part of the number or name. The number may be zero to include
3553       the entire matched string.   For  example,  if  the  pattern  a(b)c  is
3554       matched  with "=abc=" and the replacement string "+$1$0$1+", the result
3555       is "=+babcb+=".
3556
3557       $*MARK inserts the name from the last encountered backtracking  control
3558       verb  on the matching path that has a name. (*MARK) must always include
3559       a name, but the other verbs need not.  For  example,  in  the  case  of
3560       (*MARK:A)(*PRUNE) the name inserted is "A", but for (*MARK:A)(*PRUNE:B)
3561       the  relevant  name is "B". This facility can be used to perform simple
3562       simultaneous substitutions, as this pcre2test example shows:
3563
3564         /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK}
3565             apple lemon
3566          2: pear orange
3567
3568       PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
3569       string, replacing every matching substring. If this option is not  set,
3570       only  the  first matching substring is replaced. The search for matches
3571       takes place in the original subject string (that is, previous  replace-
3572       ments  do  not  affect  it).  Iteration is implemented by advancing the
3573       startoffset value for each search, which is always  passed  the  entire
3574       subject string. If an offset limit is set in the match context, search-
3575       ing stops when that limit is reached.
3576
3577       You  can  restrict  the effect of a global substitution to a portion of
3578       the subject string by setting either or both of startoffset and an off-
3579       set limit. Here is a pcre2test example:
3580
3581         /B/g,replace=!,use_offset_limit
3582         ABC ABC ABC ABC\=offset=3,offset_limit=12
3583          2: ABC A!C A!C ABC
3584
3585       When continuing with global substitutions after  matching  a  substring
3586       with zero length, an attempt to find a non-empty match at the same off-
3587       set is performed.  If this is not successful, the offset is advanced by
3588       one character except when CRLF is a valid newline sequence and the next
3589       two  characters are CR, LF. In this case, the offset is advanced by two
3590       characters.
3591
3592       PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capture groups that
3593       do not appear in the pattern to be treated as unset groups. This option
3594       should be used with care, because it means that a typo in a group  name
3595       or number no longer causes the PCRE2_ERROR_NOSUBSTRING error.
3596
3597       PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including un-
3598       known  groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated
3599       as empty strings when inserted as described above. If  this  option  is
3600       not set, an attempt to insert an unset group causes the PCRE2_ERROR_UN-
3601       SET  error.  This  option  does not influence the extended substitution
3602       syntax described below.
3603
3604       PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to  the
3605       replacement  string.  Without this option, only the dollar character is
3606       special, and only the group insertion forms  listed  above  are  valid.
3607       When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
3608
3609       Firstly,  backslash in a replacement string is interpreted as an escape
3610       character. The usual forms such as \n or \x{ddd} can be used to specify
3611       particular character codes, and backslash followed by any  non-alphanu-
3612       meric  character  quotes  that character. Extended quoting can be coded
3613       using \Q...\E, exactly as in pattern strings.
3614
3615       There are also four escape sequences for forcing the case  of  inserted
3616       letters.   The  insertion  mechanism has three states: no case forcing,
3617       force upper case, and force lower case. The escape sequences change the
3618       current state: \U and \L change to upper or lower case forcing, respec-
3619       tively, and \E (when not terminating a \Q quoted sequence)  reverts  to
3620       no  case  forcing. The sequences \u and \l force the next character (if
3621       it is a letter) to upper or lower  case,  respectively,  and  then  the
3622       state automatically reverts to no case forcing. Case forcing applies to
3623       all  inserted  characters, including those from capture groups and let-
3624       ters within \Q...\E quoted sequences. If either PCRE2_UTF or  PCRE2_UCP
3625       was  set when the pattern was compiled, Unicode properties are used for
3626       case forcing characters whose code points are greater than 127.
3627
3628       Note that case forcing sequences such as \U...\E do not nest. For exam-
3629       ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc";  the  final
3630       \E  has  no  effect.  Note  also  that the PCRE2_ALT_BSUX and PCRE2_EX-
3631       TRA_ALT_BSUX options do not apply to replacement strings.
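
       The three-state case forcing mechanism can be modelled in a few lines
       of C. The function below is a toy model written for this explanation,
       not PCRE2's implementation: it handles ASCII letters only, recognizes
       just the five escapes \U, \L, \E, \u and \l, and copies everything
       else unchanged. The name apply_case_forcing() is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum force { FORCE_NONE, FORCE_UPPER, FORCE_LOWER,
             FORCE_UPPER_ONE, FORCE_LOWER_ONE };

/* Toy model: copy `in` to `out`, applying the case forcing escapes.
   Returns the output length, or -1 if `out` is too small. */
static int apply_case_forcing(const char *in, char *out, size_t outsize)
{
  enum force state = FORCE_NONE;
  size_t n = 0;

  assert(in != NULL && out != NULL);
  for (; *in != '\0'; in++)
    {
    unsigned char c = (unsigned char)*in;

    /* An escape only changes the state; it produces no output. */
    if (c == '\\' && in[1] != '\0' && strchr("ULEul", in[1]) != NULL)
      {
      switch (*++in)
        {
        case 'U': state = FORCE_UPPER; break;
        case 'L': state = FORCE_LOWER; break;
        case 'E': state = FORCE_NONE; break;
        case 'u': state = FORCE_UPPER_ONE; break;
        default:  state = FORCE_LOWER_ONE; break;   /* 'l' */
        }
      continue;
      }

    if (state == FORCE_UPPER || state == FORCE_UPPER_ONE)
      c = (unsigned char)toupper(c);
    else if (state == FORCE_LOWER || state == FORCE_LOWER_ONE)
      c = (unsigned char)tolower(c);

    /* \u and \l affect one character, then revert to no forcing. */
    if (state == FORCE_UPPER_ONE || state == FORCE_LOWER_ONE)
      state = FORCE_NONE;

    if (n + 1 >= outsize) return -1;   /* keep room for the final zero */
    out[n++] = (char)c;
    }

  if (n >= outsize) return -1;
  out[n] = '\0';
  return (int)n;
}
```

       Because there is a single current state rather than a stack, \L
       simply replaces the effect of an earlier \U, which is why the
       sequences do not nest.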
3632
3633       The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to  add  more
3634       flexibility  to  capture  group  substitution. The syntax is similar to
3635       that used by Bash:
3636
3637         ${<n>:-<string>}
3638         ${<n>:+<string1>:<string2>}
3639
3640       As before, <n> may be a group number or a name. The first  form  speci-
3641       fies  a  default  value. If group <n> is set, its value is inserted; if
3642       not, <string> is expanded and the  result  inserted.  The  second  form
3643       specifies  strings that are expanded and inserted when group <n> is set
3644       or unset, respectively. The first form is just a  convenient  shorthand
3645       for
3646
3647         ${<n>:+${<n>}:<string>}
3648
3649       Backslash  can  be  used to escape colons and closing curly brackets in
3650       the replacement strings. A change of the case forcing  state  within  a
3651       replacement  string  remains  in  force  afterwards,  as  shown in this
3652       pcre2test example:
3653
3654         /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
3655             body
3656          1: hello
3657             somebody
3658          1: HELLO
3659
3660       The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these  extended
3661       substitutions.  However,  PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause un-
3662       known groups in the extended syntax forms to be treated as unset.
3663
3664       If  PCRE2_SUBSTITUTE_LITERAL  is  set,  PCRE2_SUBSTITUTE_UNKNOWN_UNSET,
3665       PCRE2_SUBSTITUTE_UNSET_EMPTY, and PCRE2_SUBSTITUTE_EXTENDED are irrele-
3666       vant and are ignored.
3667
3668   Substitution errors
3669
3670       In  the  event of an error, pcre2_substitute() returns a negative error
3671       code. Except for PCRE2_ERROR_NOMATCH (which is never returned),  errors
3672       from pcre2_match() are passed straight back.
3673
3674       PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
3675       tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
3676
3677       PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
3678       ing  an  unknown  substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set)
3679       when the simple (non-extended) syntax is used and  PCRE2_SUBSTITUTE_UN-
3680       SET_EMPTY is not set.
3681
3682       PCRE2_ERROR_NOMEMORY  is  returned  if  the  output  buffer  is not big
3683       enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size
3684       of buffer that is needed is returned via outlengthptr. Note  that  this
3685       does not happen by default.
3686
3687       PCRE2_ERROR_NULL is returned if PCRE2_SUBSTITUTE_MATCHED is set but the
3688       match_data  argument is NULL or if the subject or replacement arguments
3689       are NULL. For backward compatibility reasons an exception is  made  for
3690       the replacement argument if the rlength argument is also 0.
3691
3692       PCRE2_ERROR_BADREPLACEMENT  is  used for miscellaneous syntax errors in
3693       the replacement string, with more  particular  errors  being  PCRE2_ER-
3694       ROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REPMISSINGBRACE
3695       (closing  curly bracket not found), PCRE2_ERROR_BADSUBSTITUTION (syntax
3696       error in extended group substitution),  and  PCRE2_ERROR_BADSUBSPATTERN
3697       (the pattern match ended before it started or the match started earlier
3698       than  the  current  position  in the subject, which can happen if \K is
3699       used in an assertion).
3700
3701       As for all PCRE2 errors, a text message that describes the error can be
3702       obtained by calling the pcre2_get_error_message()  function  (see  "Ob-
3703       taining a textual error message" above).
3704
3705   Substitution callouts
3706
3707       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
3708         int (*callout_function)(pcre2_substitute_callout_block *, void *),
3709         void *callout_data);
3710
       The pcre2_set_substitute_callout() function can be used to specify a
       callout function for pcre2_substitute(). This information is passed in
       a match context. The callout function is called after each
       substitution has been processed, but it can cause the replacement not
       to happen. The callout function is not called for simulated
       substitutions that happen as a result of the
       PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option.
3717
3718       The first argument of the callout function is a pointer to a substitute
3719       callout block structure, which contains the following fields, not  nec-
3720       essarily in this order:
3721
3722         uint32_t    version;
3723         uint32_t    subscount;
3724         PCRE2_SPTR  input;
3725         PCRE2_SPTR  output;
3726         PCRE2_SIZE *ovector;
3727         uint32_t    oveccount;
3728         PCRE2_SIZE  output_offsets[2];
3729
3730       The  version field contains the version number of the block format. The
3731       current version is 0. The version number will  increase  in  future  if
3732       more  fields are added, but the intention is never to remove any of the
3733       existing fields.
3734
3735       The subscount field is the number of the current match. It is 1 for the
3736       first callout, 2 for the second, and so on. The input and output point-
3737       ers are copies of the values passed to pcre2_substitute().
3738
3739       The ovector field points to the ovector, which contains the  result  of
3740       the most recent match. The oveccount field contains the number of pairs
3741       that are set in the ovector, and is always greater than zero.
3742
3743       The  output_offsets  vector  contains the offsets of the replacement in
3744       the output string. This has already been processed for dollar  and  (if
3745       requested) backslash substitutions as described above.
3746
3747       The  second  argument  of  the  callout function is the value passed as
3748       callout_data when the function was registered. The  value  returned  by
3749       the callout function is interpreted as follows:
3750
3751       If  the  value is zero, the replacement is accepted, and, if PCRE2_SUB-
3752       STITUTE_GLOBAL is set, processing continues with a search for the  next
3753       match.  If  the  value  is not zero, the current replacement is not ac-
3754       cepted. If the value is greater than zero,  processing  continues  when
3755       PCRE2_SUBSTITUTE_GLOBAL  is set. Otherwise (the value is less than zero
3756       or PCRE2_SUBSTITUTE_GLOBAL is not set), the rest of the input is copied
3757       to the output and the call to pcre2_substitute() exits,  returning  the
3758       number of matches so far.
3759
3760
3761DUPLICATE CAPTURE GROUP NAMES
3762
3763       int pcre2_substring_nametable_scan(const pcre2_code *code,
3764         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
3765
3766       When  a  pattern  is compiled with the PCRE2_DUPNAMES option, names for
3767       capture groups are not required to be unique. Duplicate names  are  al-
3768       ways  allowed for groups with the same number, created by using the (?|
3769       feature. Indeed, if such groups are named, they are required to use the
3770       same names.
3771
3772       Normally, patterns that use duplicate names are such that  in  any  one
3773       match,  only  one of each set of identically-named groups participates.
3774       An example is shown in the pcre2pattern documentation.
3775
       When duplicates are present, pcre2_substring_copy_byname() and
       pcre2_substring_get_byname() return the first substring corresponding
       to the given name that is set. Only if none are set is
       PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name()
       function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there
       are duplicate names.
3782
3783       If  you want to get full details of all captured substrings for a given
3784       name, you must use the pcre2_substring_nametable_scan()  function.  The
3785       first  argument is the compiled pattern, and the second is the name. If
3786       the third and fourth arguments are NULL, the function returns  a  group
3787       number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
3788
3789       When the third and fourth arguments are not NULL, they must be pointers
3790       to  variables  that are updated by the function. After it has run, they
3791       point to the first and last entries in the name-to-number table for the
3792       given name, and the function returns the length of each entry  in  code
3793       units.  In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
3794       no entries for the given name.
3795
3796       The format of the name table is described above in the section entitled
3797       Information about a pattern. Given all the  relevant  entries  for  the
3798       name,  you  can  extract  each of their numbers, and hence the captured
3799       data.
3800
3801
3802FINDING ALL POSSIBLE MATCHES AT ONE POSITION
3803
3804       The traditional matching function uses a  similar  algorithm  to  Perl,
3805       which  stops when it finds the first match at a given point in the sub-
3806       ject. If you want to find all possible matches, or the longest possible
3807       match at a given position,  consider  using  the  alternative  matching
3808       function  (see  below) instead. If you cannot use the alternative func-
3809       tion, you can kludge it up by making use of the callout facility, which
3810       is described in the pcre2callout documentation.
3811
3812       What you have to do is to insert a callout right at the end of the pat-
3813       tern.  When your callout function is called, extract and save the  cur-
3814       rent  matched  substring.  Then return 1, which forces pcre2_match() to
3815       backtrack and try other alternatives. Ultimately, when it runs  out  of
3816       matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.
3817
3818
3819MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
3820
3821       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
3822         PCRE2_SIZE length, PCRE2_SIZE startoffset,
3823         uint32_t options, pcre2_match_data *match_data,
3824         pcre2_match_context *mcontext,
3825         int *workspace, PCRE2_SIZE wscount);
3826
3827       The  function  pcre2_dfa_match()  is  called  to match a subject string
3828       against a compiled pattern, using a matching algorithm that  scans  the
3829       subject string just once (not counting lookaround assertions), and does
3830       not  backtrack (except when processing lookaround assertions). This has
3831       different characteristics to the normal algorithm, and is not  compati-
3832       ble  with  Perl.  Some  of  the features of PCRE2 patterns are not sup-
3833       ported. Nevertheless, there are times when this kind of matching can be
3834       useful. For a discussion of the two matching algorithms, and a list  of
3835       features that pcre2_dfa_match() does not support, see the pcre2matching
3836       documentation.
3837
3838       The  arguments  for  the pcre2_dfa_match() function are the same as for
3839       pcre2_match(), plus two extras. The ovector within the match data block
3840       is used in a different way, and this is described below. The other com-
3841       mon arguments are used in the same way as for pcre2_match(),  so  their
3842       description is not repeated here.
3843
3844       The  two  additional  arguments provide workspace for the function. The
3845       workspace vector should contain at least 20 elements. It  is  used  for
3846       keeping  track  of  multiple paths through the pattern tree. More work-
3847       space is needed for patterns and subjects where there are a lot of  po-
3848       tential matches.
3849
3850       Here is an example of a simple call to pcre2_dfa_match():
3851
         int wspace[20];
         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
         int rc = pcre2_dfa_match(
           re,                         /* result of pcre2_compile() */
           (PCRE2_SPTR)"some string",  /* the subject string */
           11,                         /* the length of the subject string */
           0,                          /* start at offset 0 in the subject */
           0,                          /* default options */
           md,                         /* the match data block */
           NULL,                       /* match context; NULL means defaults */
           wspace,                     /* working space vector */
           20);                        /* number of elements (NOT bytes) */
3864
3865   Option bits for pcre2_dfa_match()
3866
3867       The  unused  bits of the options argument for pcre2_dfa_match() must be
3868       zero.  The  only   bits   that   may   be   set   are   PCRE2_ANCHORED,
3869       PCRE2_COPY_MATCHED_SUBJECT,  PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO-
3870       TEOL,   PCRE2_NOTEMPTY,   PCRE2_NOTEMPTY_ATSTART,   PCRE2_NO_UTF_CHECK,
3871       PCRE2_PARTIAL_HARD,    PCRE2_PARTIAL_SOFT,    PCRE2_DFA_SHORTEST,   and
3872       PCRE2_DFA_RESTART. All but the last four of these are exactly the  same
3873       as for pcre2_match(), so their description is not repeated here.
3874
3875         PCRE2_PARTIAL_HARD
3876         PCRE2_PARTIAL_SOFT
3877
3878       These  have  the  same general effect as they do for pcre2_match(), but
3879       the details are slightly different. When PCRE2_PARTIAL_HARD is set  for
3880       pcre2_dfa_match(),  it  returns  PCRE2_ERROR_PARTIAL  if the end of the
3881       subject is reached and there is still at least one matching possibility
3882       that requires additional characters. This happens even if some complete
3883       matches have already been found. When PCRE2_PARTIAL_SOFT  is  set,  the
3884       return  code  PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
3885       if the end of the subject is  reached,  there  have  been  no  complete
3886       matches, but there is still at least one matching possibility. The por-
3887       tion  of  the  string that was inspected when the longest partial match
3888       was found is set as the first matching string in both cases. There is a
3889       more detailed discussion of partial and  multi-segment  matching,  with
3890       examples, in the pcre2partial documentation.
3891
3892         PCRE2_DFA_SHORTEST
3893
3894       Setting  the PCRE2_DFA_SHORTEST option causes the matching algorithm to
3895       stop as soon as it has found one match. Because of the way the alterna-
3896       tive algorithm works, this is necessarily the shortest  possible  match
3897       at the first possible matching point in the subject string.
3898
3899         PCRE2_DFA_RESTART
3900
       When pcre2_dfa_match() returns a partial match, it is possible to call
       it again, with additional subject characters, and have it continue
       with the same match. The PCRE2_DFA_RESTART option requests this
       action; when it is set, the workspace and wscount arguments must
       reference the same vector as before because data about the match so
       far is left in it after a partial match. There is more discussion of
       this facility in the pcre2partial documentation.
3908
3909   Successful returns from pcre2_dfa_match()
3910
3911       When pcre2_dfa_match() succeeds, it may have matched more than one sub-
3912       string in the subject. Note, however, that all the matches from one run
3913       of the function start at the same point in  the  subject.  The  shorter
3914       matches  are all initial substrings of the longer matches. For example,
3915       if the pattern
3916
3917         <.*>
3918
3919       is matched against the string
3920
3921         This is <something> <something else> <something further> no more
3922
3923       the three matched strings are
3924
3925         <something> <something else> <something further>
3926         <something> <something else>
3927         <something>
3928
3929       On success, the yield of the function is a number  greater  than  zero,
3930       which  is  the  number  of  matched substrings. The offsets of the sub-
3931       strings are returned in the ovector, and can be extracted by number  in
3932       the  same way as for pcre2_match(), but the numbers bear no relation to
3933       any capture groups that may exist in the pattern, because DFA  matching
3934       does not support capturing.
3935
3936       Calls  to the convenience functions that extract substrings by name re-
3937       turn the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used af-
3938       ter a DFA match. The convenience functions that extract  substrings  by
3939       number never return PCRE2_ERROR_NOSUBSTRING.
3940
3941       The  matched  strings  are  stored  in  the ovector in reverse order of
3942       length; that is, the longest matching string is first.  If  there  were
3943       too  many matches to fit into the ovector, the yield of the function is
3944       zero, and the vector is filled with the longest matches.
3945
3946       NOTE: PCRE2's "auto-possessification" optimization usually  applies  to
3947       character  repeats at the end of a pattern (as well as internally). For
3948       example, the pattern "a\d+" is compiled as if it were "a\d++". For  DFA
3949       matching,  this means that only one possible match is found. If you re-
3950       ally do want multiple matches in such cases, either use an ungreedy re-
3951       peat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when  com-
3952       piling.
3953
3954   Error returns from pcre2_dfa_match()
3955
3956       The pcre2_dfa_match() function returns a negative number when it fails.
3957       Many  of  the  errors  are  the same as for pcre2_match(), as described
3958       above.  There are in addition the following errors that are specific to
3959       pcre2_dfa_match():
3960
3961         PCRE2_ERROR_DFA_UITEM
3962
       This return is given if pcre2_dfa_match() encounters an item in the
       pattern that it does not support, for instance, the use of \C in UTF
       mode or a backreference.
3966
3967         PCRE2_ERROR_DFA_UCOND
3968
3969       This return is given if pcre2_dfa_match() encounters a  condition  item
3970       that uses a backreference for the condition, or a test for recursion in
3971       a specific capture group. These are not supported.
3972
3973         PCRE2_ERROR_DFA_UINVALID_UTF
3974
3975       This  return is given if pcre2_dfa_match() is called for a pattern that
3976       was compiled with PCRE2_MATCH_INVALID_UTF. This is  not  supported  for
3977       DFA matching.
3978
3979         PCRE2_ERROR_DFA_WSSIZE
3980
3981       This  return  is  given  if  pcre2_dfa_match() runs out of space in the
3982       workspace vector.
3983
3984         PCRE2_ERROR_DFA_RECURSE
3985
3986       When a recursion or subroutine call is processed, the matching function
3987       calls itself recursively, using private  memory  for  the  ovector  and
3988       workspace.   This  error  is given if the internal ovector is not large
3989       enough. This should be extremely rare, as a  vector  of  size  1000  is
3990       used.
3991
3992         PCRE2_ERROR_DFA_BADRESTART
3993
3994       When  pcre2_dfa_match()  is  called  with the PCRE2_DFA_RESTART option,
3995       some plausibility checks are made on the  contents  of  the  workspace,
3996       which  should  contain data about the previous partial match. If any of
3997       these checks fail, this error is given.
3998
3999
4000SEE ALSO
4001
4002       pcre2build(3),   pcre2callout(3),    pcre2demo(3),    pcre2matching(3),
4003       pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3).
4004
4005
4006AUTHOR
4007
4008       Philip Hazel
4009       Retired from University Computing Service
4010       Cambridge, England.
4011
4012
4013REVISION
4014
4015       Last updated: 24 April 2024
4016       Copyright (c) 1997-2024 University of Cambridge.
4017
4018
4019PCRE2 10.44                      24 April 2024                     PCRE2API(3)
4020------------------------------------------------------------------------------
4021
4022
4023
4024PCRE2BUILD(3)              Library Functions Manual              PCRE2BUILD(3)
4025
4026
4027NAME
4028       PCRE2 - Perl-compatible regular expressions (revised API)
4029
4030
4031BUILDING PCRE2
4032
4033       PCRE2  is distributed with a configure script that can be used to build
4034       the library in Unix-like environments using the applications  known  as
4035       Autotools. Also in the distribution are files to support building using
4036       CMake  instead  of configure. The text file README contains general in-
4037       formation about building with Autotools (some of which is repeated  be-
4038       low),  and  also  has some comments about building on various operating
4039       systems. The files in the vms directory support building under OpenVMS.
4040       There is a lot more information about building PCRE2 without using  Au-
4041       totools  (including  information  about  using  CMake  and building "by
4042       hand") in the text file called NON-AUTOTOOLS-BUILD.  You should consult
4043       this file as well as the README file if you are building in a non-Unix-
4044       like environment.
4045
4046
4047PCRE2 BUILD-TIME OPTIONS
4048
4049       The rest of this document describes the optional features of PCRE2 that
4050       can be selected when the library is compiled. It  assumes  use  of  the
4051       configure  script,  where  the  optional features are selected or dese-
4052       lected by providing options to configure before running the  make  com-
4053       mand.  However,  the same options can be selected in both Unix-like and
4054       non-Unix-like environments if you are using CMake instead of  configure
4055       to build PCRE2.
4056
4057       If  you  are not using Autotools or CMake, option selection can be done
4058       by editing the config.h file, or by passing parameter settings  to  the
4059       compiler, as described in NON-AUTOTOOLS-BUILD.
4060
4061       The complete list of options for configure (which includes the standard
4062       ones  such  as  the selection of the installation directory) can be ob-
4063       tained by running
4064
4065         ./configure --help
4066
4067       The following sections include descriptions of "on/off"  options  whose
4068       names begin with --enable or --disable. Because of the way that config-
4069       ure  works, --enable and --disable always come in pairs, so the comple-
4070       mentary option always exists as well, but as it specifies the  default,
4071       it is not described.  Options that specify values have names that start
4072       with --with. At the end of a configure run, a summary of the configura-
4073       tion is output.
4074
4075
4076BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
4077
4078       By  default, a library called libpcre2-8 is built, containing functions
4079       that take string arguments contained in arrays  of  bytes,  interpreted
4080       either  as single-byte characters, or UTF-8 strings. You can also build
4081       two other libraries, called libpcre2-16 and libpcre2-32, which  process
4082       strings  that  are contained in arrays of 16-bit and 32-bit code units,
4083       respectively. These can be interpreted either as single-unit characters
4084       or UTF-16/UTF-32 strings. To build these additional libraries, add  one
4085       or both of the following to the configure command:
4086
4087         --enable-pcre2-16
4088         --enable-pcre2-32
4089
4090       If you do not want the 8-bit library, add
4091
4092         --disable-pcre2-8
4093
4094       as  well.  At least one of the three libraries must be built. Note that
4095       the POSIX wrapper is for the 8-bit library only, and that pcre2grep  is
       an 8-bit program. Neither of these is built if you select only the
       16-bit or 32-bit libraries.
4098
4099
4100BUILDING SHARED AND STATIC LIBRARIES
4101
4102       The Autotools PCRE2 building process uses libtool to build both  shared
4103       and  static  libraries by default. You can suppress an unwanted library
4104       by adding one of
4105
4106         --disable-shared
4107         --disable-static
4108
4109       to the configure command. Setting --disable-shared ensures  that  PCRE2
4110       libraries  are  built  as  static libraries. The binaries that are then
4111       created as part of  the  build  process  (for  example,  pcre2test  and
4112       pcre2grep)  are linked statically with one or more PCRE2 libraries, but
4113       may also be dynamically linked with other libraries such  as  libc.  If
4114       you  want these binaries to be fully statically linked, you can set LD-
4115       FLAGS like this:
4116
         LDFLAGS=--static ./configure --disable-shared
4118
4119       Note the two hyphens in --static. Of course, this works only if  static
4120       versions of all the relevant libraries are available for linking.
4121
4122
4123UNICODE AND UTF SUPPORT
4124
4125       By  default,  PCRE2 is built with support for Unicode and UTF character
4126       strings.  To build it without Unicode support, add
4127
4128         --disable-unicode
4129
4130       to the configure command. This setting applies to all three  libraries.
4131       It  is  not  possible to build one library with Unicode support and an-
4132       other without in the same configuration.
4133
4134       Of itself, Unicode support does not make PCRE2 treat strings as  UTF-8,
4135       UTF-16 or UTF-32. To do that, applications that use the library can set
4136       the  PCRE2_UTF  option when they call pcre2_compile() to compile a pat-
4137       tern.  Alternatively, patterns may be started with  (*UTF)  unless  the
4138       application has locked this out by setting PCRE2_NEVER_UTF.
4139
4140       UTF support allows the libraries to process character code points up to
4141       0x10ffff  in  the  strings that they handle. Unicode support also gives
4142       access to the Unicode properties of characters, using  pattern  escapes
4143       such as \P, \p, and \X. Only the general category properties such as Lu
4144       and Nd, script names, and some bi-directional properties are supported.
4145       Details are given in the pcre2pattern documentation.
4146
4147       Pattern escapes such as \d and \w do not by default make use of Unicode
4148       properties.  The  application  can  request that they do by setting the
4149       PCRE2_UCP option. Unless the application  has  set  PCRE2_NEVER_UCP,  a
4150       pattern may also request this by starting with (*UCP).
4151
4152
4153DISABLING THE USE OF \C
4154
4155       The \C escape sequence, which matches a single code unit, even in a UTF
4156       mode,  can  cause unpredictable behaviour because it may leave the cur-
4157       rent matching point in the middle of a multi-code-unit  character.  The
4158       application  can lock it out by setting the PCRE2_NEVER_BACKSLASH_C op-
4159       tion when calling pcre2_compile(). There is also a build-time option
4160
4161         --enable-never-backslash-C
4162
4163       (note the upper case C) which locks out the use of \C entirely.
4164
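       The hazard described above can be made concrete with a small check.
       The following is only a sketch: at_utf8_boundary() is a hypothetical
       helper, not part of the PCRE2 API, and it assumes the subject is valid
       UTF-8 in the 8-bit library. It reports whether an offset falls on a
       character boundary or in the middle of a multi-code-unit character,
       which is where \C can leave the current matching point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function). Returns 1 if 'offset' lies on
   a UTF-8 character boundary in 'subject', and 0 if it points at a
   continuation byte (the middle of a multi-code-unit character) or lies
   beyond the end of the subject. The subject is assumed to be valid UTF-8. */
int at_utf8_boundary(const uint8_t *subject, size_t length, size_t offset)
{
    /* The end of the subject counts as a boundary; anything past it does not. */
    if (offset >= length) return offset == length;

    /* Continuation bytes have the bit pattern 10xxxxxx. */
    return (subject[offset] & 0xC0) != 0x80;
}
```

       For the four bytes "a", 0xC3, 0xA9, "b" (the letter e-acute between
       two ASCII letters), offsets 0, 1, 3, and 4 are boundaries, but offset
       2 is not: a single \C applied at offset 1 would stop there.
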
4165
4166JUST-IN-TIME COMPILER SUPPORT
4167
4168       Just-in-time (JIT) compiler support is included in the build by  speci-
4169       fying
4170
4171         --enable-jit
4172
4173       This  support  is available only for certain hardware architectures. If
4174       this option is set for an unsupported architecture,  a  building  error
4175       occurs.  If in doubt, use
4176
4177         --enable-jit=auto
4178
4179       which  enables  JIT  only if the current hardware is supported. You can
4180       check if JIT is enabled in the configuration summary that is output  at
4181       the  end  of a configure run. If you are enabling JIT under SELinux you
4182       may also want to add
4183
4184         --enable-jit-sealloc
4185
4186       which enables the use of an execmem allocator in JIT that is compatible
4187       with SELinux. This has no  effect  if  JIT  is  not  enabled.  See  the
4188       pcre2jit  documentation for a discussion of JIT usage. When JIT support
4189       is enabled, pcre2grep automatically makes use of it, unless you add
4190
4191         --disable-pcre2grep-jit
4192
4193       to the configure command.
4194
4195
4196NEWLINE RECOGNITION
4197
4198       By default, PCRE2 interprets the linefeed (LF) character as  indicating
4199       the  end  of  a line. This is the normal newline character on Unix-like
4200       systems. You can compile PCRE2 to use carriage return (CR) instead,  by
4201       adding
4202
4203         --enable-newline-is-cr
4204
4205       to  the  configure command. There is also an --enable-newline-is-lf op-
4206       tion, which explicitly specifies linefeed as the newline character.
4207
4208       Alternatively, you can specify that line endings are to be indicated by
4209       the two-character sequence CRLF (CR immediately followed by LF). If you
4210       want this, add
4211
4212         --enable-newline-is-crlf
4213
4214       to the configure command. There is a fourth option, specified by
4215
4216         --enable-newline-is-anycrlf
4217
4218       which causes PCRE2 to recognize any of the three sequences CR,  LF,  or
4219       CRLF as indicating a line ending. A fifth option, specified by
4220
4221         --enable-newline-is-any
4222
4223       causes  PCRE2  to  recognize  any Unicode newline sequence. The Unicode
4224       newline sequences are the three just mentioned, plus the single charac-
4225       ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
4226       U+0085), LS (line separator,  U+2028),  and  PS  (paragraph  separator,
4227       U+2029). The final option is
4228
4229         --enable-newline-is-nul
4230
4231       which  causes  NUL  (binary  zero) to be set as the default line-ending
4232       character.
4233
4234       Whatever default line ending convention is selected when PCRE2 is built
4235       can be overridden by applications that use the library. At  build  time
4236       it is recommended to use the standard for your operating system.
4237
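       The six conventions can be compared side by side with a short sketch.
       The enum and the function newline_length() below are hypothetical
       names invented for illustration; they are not PCRE2 identifiers and
       do not reflect how the library is implemented. The subject is assumed
       to be an array of code points, so that LS and PS can be represented
       directly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the six build-time newline conventions. */
enum newline_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY, NL_NUL };

/* Returns the number of code points in the newline sequence that starts at
   s[pos] under the given convention, or 0 if no newline starts there. */
size_t newline_length(const uint32_t *s, size_t len, size_t pos,
                      enum newline_convention conv)
{
    uint32_t c;
    int crlf;

    if (pos >= len) return 0;
    c = s[pos];
    crlf = (c == 0x0D && pos + 1 < len && s[pos + 1] == 0x0A);

    switch (conv) {
    case NL_CR:   return c == 0x0D ? 1 : 0;
    case NL_LF:   return c == 0x0A ? 1 : 0;
    case NL_CRLF: return crlf ? 2 : 0;
    case NL_NUL:  return c == 0x00 ? 1 : 0;
    case NL_ANYCRLF:
        if (crlf) return 2;
        return (c == 0x0D || c == 0x0A) ? 1 : 0;
    case NL_ANY:
        if (crlf) return 2;
        switch (c) {
        case 0x0A: case 0x0B: case 0x0C: case 0x0D:    /* LF, VT, FF, CR */
        case 0x85: case 0x2028: case 0x2029:           /* NEL, LS, PS */
            return 1;
        default:
            return 0;
        }
    }
    return 0;
}
```

       Note that under the CRLF convention a lone CR or a lone LF is not a
       line ending, whereas ANYCRLF and ANY accept either of them as well as
       the two-character sequence.
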
4238
4239WHAT \R MATCHES
4240
4241       By  default,  the  sequence \R in a pattern matches any Unicode newline
4242       sequence, independently of what has been selected as  the  line  ending
4243       sequence. If you specify
4244
4245         --enable-bsr-anycrlf
4246
4247       the  default  is changed so that \R matches only CR, LF, or CRLF. What-
4248       ever is selected when PCRE2 is built can be overridden by  applications
4249       that use the library.
4250
4251
4252HANDLING VERY LARGE PATTERNS
4253
4254       Within  a  compiled  pattern,  offset values are used to point from one
4255       part to another (for example, from an opening parenthesis to an  alter-
4256       nation  metacharacter).  By default, in the 8-bit and 16-bit libraries,
4257       two-byte values are used for these offsets, leading to a  maximum  size
4258       for a compiled pattern of around 64 thousand code units. This is suffi-
4259       cient  to handle all but the most gigantic patterns. Nevertheless, some
4260       people do want to process truly enormous patterns, so it is possible to
4261       compile PCRE2 to use three-byte or four-byte offsets by adding  a  set-
4262       ting such as
4263
4264         --with-link-size=3
4265
4266       to  the  configure command. The value given must be 2, 3, or 4. For the
4267       16-bit library, a value of 3 is rounded up to 4.  In  these  libraries,
4268       using  longer  offsets slows down the operation of PCRE2 because it has
4269       to load additional data when handling them. For the 32-bit library  the
4270       value  is  always 4 and cannot be overridden; the value of --with-link-
4271       size is ignored.
4272
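       The rounding rules for --with-link-size can be summarized in a few
       lines of C. This is a sketch only: effective_link_size() and
       offset_value_count() are hypothetical helpers, not PCRE2 functions.
       The second one merely counts the values that an offset of a given
       byte width can hold; it is not a statement of the maximum pattern
       size that the library enforces.

```c
#include <stdint.h>

/* Hypothetical helper. Given the code unit width of a library (8, 16, or 32)
   and the value passed to --with-link-size, returns the link size in bytes
   that is used, or 0 if either argument is invalid. */
int effective_link_size(int code_unit_bits, int requested)
{
    if (requested < 2 || requested > 4) return 0;   /* must be 2, 3, or 4 */
    if (code_unit_bits == 32) return 4;             /* always 4; setting ignored */
    if (code_unit_bits == 16 && requested == 3) return 4;   /* rounded up */
    if (code_unit_bits == 8 || code_unit_bits == 16) return requested;
    return 0;
}

/* Number of distinct values that an offset of 'link_bytes' bytes can hold. */
uint64_t offset_value_count(int link_bytes)
{
    return (uint64_t)1 << (8 * link_bytes);
}
```

       With the default of two bytes, offset_value_count(2) is 65536, which
       corresponds to the limit of around 64 thousand code units mentioned
       above.
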
4273
4274LIMITING PCRE2 RESOURCE USAGE
4275
4276       The pcre2_match() function increments a counter each time it goes round
4277       its main loop. Putting a limit on this counter controls the  amount  of
4278       computing  resource  used  by a single call to pcre2_match(). The limit
4279       can be changed at run time, as described in the pcre2api documentation.
4280       The default is 10 million, but this can be changed by adding a  setting
4281       such as
4282
4283         --with-match-limit=500000
4284
4285       to   the   configure   command.   This  setting  also  applies  to  the
4286       pcre2_dfa_match() matching function, and to JIT  matching  (though  the
4287       counting is done differently).
4288
4289       The  pcre2_match()  function  uses  heap  memory to record backtracking
4290       points. The more nested backtracking points there  are  (that  is,  the
4291       deeper  the  search tree), the more memory is needed. There is an upper
4292       limit, specified in kibibytes (units of 1024 bytes). This limit can  be
       changed at run time, as described in the pcre2api documentation. The
       default limit is 20 million KiB, which is in effect unlimited. You can
       change this by a setting such as
4296
4297         --with-heap-limit=500
4298
4299       which  limits the amount of heap to 500 KiB. This limit applies only to
4300       interpretive matching in pcre2_match() and pcre2_dfa_match(), which may
4301       also use the heap for internal workspace  when  processing  complicated
4302       patterns.  This limit does not apply when JIT (which has its own memory
4303       arrangements) is used.
4304
4305       You can also explicitly limit the depth of nested backtracking  in  the
4306       pcre2_match() interpreter. This limit defaults to the value that is set
4307       for  --with-match-limit.  You  can set a lower default limit by adding,
4308       for example,
4309
4310         --with-match-limit-depth=10000
4311
4312       to the configure command. This value can be  overridden  at  run  time.
4313       This  depth  limit  indirectly limits the amount of heap memory that is
4314       used, but because the size of each backtracking "frame" depends on  the
4315       number  of  capturing parentheses in a pattern, the amount of heap that
4316       is used before the limit is reached varies  from  pattern  to  pattern.
4317       This limit was more useful in versions before 10.30, where function re-
4318       cursion was used for backtracking.
4319
4320       As well as applying to pcre2_match(), the depth limit also controls the
4321       depth  of recursive function calls in pcre2_dfa_match(). These are used
4322       for lookaround assertions, atomic groups,  and  recursion  within  pat-
4323       terns.  The limit does not apply to JIT matching.
4324
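       How the three build-time defaults relate to one another can be shown
       with a small sketch. The structure and functions below are
       hypothetical and are not part of PCRE2. In this sketch only, a value
       of zero stands for "the option was not given to configure"; that
       convention is an assumption made for the example.

```c
#include <stdint.h>

/* Hypothetical record of the three build-time resource limits. */
struct build_limits {
    uint32_t match_limit;      /* --with-match-limit */
    uint32_t depth_limit;      /* --with-match-limit-depth */
    uint32_t heap_limit_kib;   /* --with-heap-limit, in kibibytes */
};

/* Applies the documented defaults. A zero argument means "not specified"
   (a convention of this sketch only). */
struct build_limits resolve_build_limits(uint32_t match_limit,
                                         uint32_t depth_limit,
                                         uint32_t heap_limit_kib)
{
    struct build_limits r;
    r.match_limit = match_limit ? match_limit : 10000000u;
    /* The depth limit defaults to whatever the match limit is. */
    r.depth_limit = depth_limit ? depth_limit : r.match_limit;
    r.heap_limit_kib = heap_limit_kib ? heap_limit_kib : 20000000u;
    return r;
}

/* Converts the heap limit from kibibytes to bytes. */
uint64_t heap_limit_bytes(const struct build_limits *limits)
{
    return (uint64_t)limits->heap_limit_kib * 1024u;
}
```

       For example, resolve_build_limits(500000, 0, 500) gives a depth limit
       of 500000 (inherited from the match limit) and a heap limit of 512000
       bytes.
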
4325
4326LIMITING VARIABLE-LENGTH LOOKBEHIND ASSERTIONS
4327
4328       Lookbehind  assertions  in which one or more branches can match a vari-
4329       able number of characters are supported only  if  there  is  a  maximum
4330       matching  length  for  each  top-level branch. There is a limit to this
4331       maximum that defaults to 255 characters. You can alter this default  by
4332       a setting such as
4333
4334         --with-max-varlookbehind=100
4335
4336       The limit can be changed at runtime by calling pcre2_set_max_varlookbe-
4337       hind().  Lookbehind  assertions  in  which every branch matches a fixed
4338       number of characters (not necessarily all the same) are not constrained
4339       by this limit.
4340
4341
4342CREATING CHARACTER TABLES AT BUILD TIME
4343
4344       PCRE2 uses fixed tables for processing characters whose code points are
4345       less than 256. By default, PCRE2 is built with a set of tables that are
4346       distributed in the file src/pcre2_chartables.c.dist. These  tables  are
4347       for ASCII codes only. If you add
4348
4349         --enable-rebuild-chartables
4350
4351       to  the  configure  command, the distributed tables are no longer used.
       Instead, a program called pcre2_dftables is compiled and run. This
       outputs the source for a new set of tables, created in the default
       locale of your C run-time system. This method of replacing the tables
       does not work if you are cross compiling, because pcre2_dftables needs
       to be run on the local host and therefore must not be compiled with
       the cross compiler.
4357
4358       If you need to create alternative tables when cross compiling, you will
4359       have  to  do so "by hand". There may also be other reasons for creating
4360       tables manually.  To cause pcre2_dftables to  be  built  on  the  local
4361       host, run a normal compiling command, and then run the program with the
4362       output file as its argument, for example:
4363
4364         cc src/pcre2_dftables.c -o pcre2_dftables
4365         ./pcre2_dftables src/pcre2_chartables.c
4366
4367       This  builds the tables in the default locale of the local host. If you
4368       want to specify a locale, you must use the -L option:
4369
4370         LC_ALL=fr_FR ./pcre2_dftables -L src/pcre2_chartables.c
4371
4372       You can also specify -b (with or without -L). This causes the tables to
4373       be written in binary instead of as source code. A set of binary  tables
4374       can  be  loaded  into memory by an application and passed to pcre2_com-
4375       pile() in the same way as tables created by calling pcre2_maketables().
4376       The tables are just a string of bytes, independent of hardware  charac-
4377       teristics  such  as  endianness. This means they can be bundled with an
4378       application that runs in different environments, to  ensure  consistent
4379       behaviour.
4380
4381
4382USING EBCDIC CODE
4383
4384       PCRE2  assumes  by default that it will run in an environment where the
4385       character code is ASCII or Unicode, which is a superset of ASCII.  This
4386       is the case for most computer operating systems. PCRE2 can, however, be
4387       compiled to run in an 8-bit EBCDIC environment by adding
4388
4389         --enable-ebcdic --disable-unicode
4390
4391       to the configure command. This setting implies --enable-rebuild-charta-
4392       bles.  You should only use it if you know that you are in an EBCDIC en-
4393       vironment (for example, an IBM mainframe operating system).
4394
4395       It is not possible to support both EBCDIC and UTF-8 codes in  the  same
4396       version  of  the  library. Consequently, --enable-unicode and --enable-
4397       ebcdic are mutually exclusive.
4398
4399       The EBCDIC character that corresponds to an ASCII LF is assumed to have
4400       the value 0x15 by default. However, in some EBCDIC  environments,  0x25
4401       is used. In such an environment you should use
4402
4403         --enable-ebcdic-nl25
4404
4405       as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
4406       has  the  same  value  as in ASCII, namely, 0x0d. Whichever of 0x15 and
4407       0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
4408       acter (which, in Unicode, is 0x85).
4409
4410       The options that select newline behaviour, such as --enable-newline-is-
4411       cr, and equivalent run-time options, refer to these character values in
4412       an EBCDIC environment.
4413
4414
4415PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS
4416
4417       By default pcre2grep supports the use of callouts with string arguments
4418       within the patterns it is matching. There are two kinds: one that  gen-
4419       erates output using local code, and another that calls an external pro-
4420       gram  or  script.   If --disable-pcre2grep-callout-fork is added to the
4421       configure command, only the first kind  of  callout  is  supported;  if
4422       --disable-pcre2grep-callout  is  used,  all callouts are completely ig-
4423       nored. For more details of pcre2grep callouts, see the pcre2grep  docu-
4424       mentation.
4425
4426
4427PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
4428
4429       By  default,  pcre2grep reads all files as plain text. You can build it
4430       so that it recognizes files whose names end in .gz or .bz2,  and  reads
4431       them with libz or libbz2, respectively, by adding one or both of
4432
4433         --enable-pcre2grep-libz
4434         --enable-pcre2grep-libbz2
4435
4436       to the configure command. These options naturally require that the rel-
4437       evant  libraries  are installed on your system. Configuration will fail
4438       if they are not.
4439
4440
4441PCRE2GREP BUFFER SIZE
4442
4443       pcre2grep uses an internal buffer to hold a "window" on the file it  is
4444       scanning, in order to be able to output "before" and "after" lines when
4445       it finds a match. The default starting size of the buffer is 20KiB. The
4446       buffer  itself  is  three times this size, but because of the way it is
4447       used for holding "before" lines, the longest line that is guaranteed to
4448       be processable is the notional buffer size. If a longer line is encoun-
4449       tered, pcre2grep automatically expands the buffer, up  to  a  specified
4450       maximum  size, whose default is 1MiB or the starting size, whichever is
4451       the larger. You can change the default parameter values by adding,  for
4452       example,
4453
4454         --with-pcre2grep-bufsize=51200
4455         --with-pcre2grep-max-bufsize=2097152
4456
4457       to  the  configure  command. The caller of pcre2grep can override these
4458       values by using --buffer-size  and  --max-buffer-size  on  the  command
4459       line.
4460
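       The arithmetic behind the two buffer parameters is simple enough to
       sketch. The function names below are hypothetical and do not appear
       in the pcre2grep source; they only restate the rules given above.

```c
#include <stddef.h>

#define ONE_MIB ((size_t)1024 * 1024)

/* Hypothetical helper: the memory actually allocated is three times the
   notional buffer size given by --with-pcre2grep-bufsize. */
size_t allocated_buffer_bytes(size_t notional_size)
{
    return 3 * notional_size;
}

/* Hypothetical helper: the default maximum is 1 MiB or the starting size,
   whichever is the larger. */
size_t default_max_bufsize(size_t starting_size)
{
    return starting_size > ONE_MIB ? starting_size : ONE_MIB;
}
```

       With the default starting size of 20 KiB (20480 bytes), the allocated
       buffer is 61440 bytes and the default maximum is 1 MiB.
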
4461
4462PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
4463
4464       If you add one of
4465
4466         --enable-pcre2test-libreadline
4467         --enable-pcre2test-libedit
4468
       to the configure command, pcre2test is linked with the libreadline or
       libedit library, respectively, and when its input is from a terminal,
4471       it  reads  it using the readline() function. This provides line-editing
4472       and history facilities. Note that libreadline is  GPL-licensed,  so  if
4473       you  distribute  a binary of pcre2test linked in this way, there may be
4474       licensing issues. These can be avoided by linking instead with libedit,
4475       which has a BSD licence.
4476
4477       Setting --enable-pcre2test-libreadline causes the -lreadline option  to
4478       be  added to the pcre2test build. In many operating environments with a
4479       system-installed readline library this is sufficient. However, in  some
4480       environments (e.g. if an unmodified distribution version of readline is
4481       in  use),  some  extra configuration may be necessary. The INSTALL file
4482       for libreadline says this:
4483
4484         "Readline uses the termcap functions, but does not link with
4485         the termcap or curses library itself, allowing applications
4486         which link with readline the to choose an appropriate library."
4487
4488       If your environment has not been set up so that an appropriate  library
4489       is automatically included, you may need to add something like
4490
         LIBS="-lncurses"
4492
4493       immediately before the configure command.
4494
4495
4496INCLUDING DEBUGGING CODE
4497
4498       If you add
4499
4500         --enable-debug
4501
4502       to  the configure command, additional debugging code is included in the
4503       build. This feature is intended for use by the PCRE2 maintainers.
4504
4505
4506DEBUGGING WITH VALGRIND SUPPORT
4507
4508       If you add
4509
4510         --enable-valgrind
4511
4512       to the configure command, PCRE2 will use valgrind annotations  to  mark
4513       certain  memory  regions as unaddressable. This allows it to detect in-
4514       valid memory accesses, and is mostly useful for debugging PCRE2 itself.
4515
4516
4517CODE COVERAGE REPORTING
4518
4519       If your C compiler is gcc, you can build a version of  PCRE2  that  can
4520       generate a code coverage report for its test suite. To enable this, you
4521       must install lcov version 1.6 or above. Then specify
4522
4523         --enable-coverage
4524
4525       to the configure command and build PCRE2 in the usual way.
4526
4527       Note that using ccache (a caching C compiler) is incompatible with code
4528       coverage  reporting. If you have configured ccache to run automatically
4529       on your system, you must set the environment variable
4530
4531         CCACHE_DISABLE=1
4532
4533       before running make to build PCRE2, so that ccache is not used.
4534
       When --enable-coverage is used, the following additional targets are
       added to the Makefile:
4537
4538         make coverage
4539
4540       This  creates  a  fresh coverage report for the PCRE2 test suite. It is
4541       equivalent to running "make coverage-reset", "make  coverage-baseline",
4542       "make check", and then "make coverage-report".
4543
4544         make coverage-reset
4545
4546       This zeroes the coverage counters, but does nothing else.
4547
4548         make coverage-baseline
4549
4550       This captures baseline coverage information.
4551
4552         make coverage-report
4553
4554       This creates the coverage report.
4555
4556         make coverage-clean-report
4557
4558       This  removes the generated coverage report without cleaning the cover-
4559       age data itself.
4560
4561         make coverage-clean-data
4562
4563       This removes the captured coverage data without removing  the  coverage
4564       files created at compile time (*.gcno).
4565
4566         make coverage-clean
4567
4568       This  cleans all coverage data including the generated coverage report.
4569       For more information about code coverage, see the gcov and  lcov  docu-
4570       mentation.
4571
4572
4573DISABLING THE Z AND T FORMATTING MODIFIERS
4574
4575       The  C99  standard  defines formatting modifiers z and t for size_t and
4576       ptrdiff_t values, respectively. By default, PCRE2 uses these  modifiers
4577       in environments other than old versions of Microsoft Visual Studio when
4578       __STDC_VERSION__  is  defined  and has a value greater than or equal to
4579       199901L (indicating support for C99).  However, there is at  least  one
4580       environment that claims to be C99 but does not support these modifiers.
4581       If
4582
4583         --disable-percent-zt
4584
       is specified, no use is made of the z or t modifiers. Instead of %td or
       %zu, a suitable format is used depending on the size of long for the
       platform.
4588
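       One way of formatting a size_t value without the z modifier is
       sketched below. This is an illustration of the general technique, not
       the code that PCRE2 uses: format_size_without_z() is a hypothetical
       function, and the choice between %lu and %llu is an assumption about
       what "a suitable format" might look like.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes 'value' into 'buf' as a decimal number without
   using %zu. The format is chosen according to the size of long. Returns the
   result of snprintf(). */
int format_size_without_z(char *buf, size_t bufsize, size_t value)
{
    if (sizeof(size_t) <= sizeof(unsigned long))
        return snprintf(buf, bufsize, "%lu", (unsigned long)value);

    /* size_t is wider than long (for example, on 64-bit Windows). */
    return snprintf(buf, bufsize, "%llu", (unsigned long long)value);
}
```

       The cast is what makes this safe: the argument type always matches
       the conversion specifier, whichever branch is taken.
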
4589
4590SUPPORT FOR FUZZERS
4591
4592       There is a special option for use by people who  want  to  run  fuzzing
4593       tests on PCRE2:
4594
4595         --enable-fuzz-support
4596
4597       At present this applies only to the 8-bit library. If set, it causes an
4598       extra  library  called  libpcre2-fuzzsupport.a to be built, but not in-
4599       stalled. This contains a single  function  called  LLVMFuzzerTestOneIn-
4600       put()  whose  arguments are a pointer to a string and the length of the
4601       string. When called, this function tries to compile  the  string  as  a
4602       pattern,  and if that succeeds, to match it.  This is done both with no
       options and with some random option bits that are generated from the
4604       string.
4605
4606       Setting  --enable-fuzz-support  also  causes  a binary called pcre2fuz-
4607       zcheck to be created. This is normally run under valgrind or used  when
4608       PCRE2 is compiled with address sanitizing enabled. It calls the fuzzing
4609       function  and  outputs  information  about  what it is doing. The input
4610       strings are specified by arguments: if an argument starts with "="  the
4611       rest  of it is a literal input string. Otherwise, it is assumed to be a
4612       file name, and the contents of the file are the test string.
4613
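       The argument convention used by pcre2fuzzcheck can be sketched as
       follows. The names here are hypothetical and are not taken from the
       pcre2fuzzcheck source; the sketch shows only how an argument is
       classified, not how the fuzzing function is then called.

```c
#include <stddef.h>

/* Hypothetical classification of a command line argument. */
enum input_kind { INPUT_LITERAL, INPUT_FILE };

/* If 'arg' starts with "=", the rest of it is a literal input string;
   otherwise the whole argument is treated as a file name. On return,
   '*payload' points at the literal string or at the file name. */
enum input_kind classify_input_arg(const char *arg, const char **payload)
{
    if (arg[0] == '=') {
        *payload = arg + 1;
        return INPUT_LITERAL;
    }
    *payload = arg;
    return INPUT_FILE;
}
```

       Under this rule, the argument "=a+b" supplies the literal string
       "a+b", and an argument consisting of "=" alone supplies an empty
       string.
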
4614
4615OBSOLETE OPTION
4616
4617       In versions of PCRE2 prior to 10.30, there were two  ways  of  handling
4618       backtracking  in the pcre2_match() function. The default was to use the
4619       system stack, but if
4620
4621         --disable-stack-for-recursion
4622
4623       was set, memory on the heap was used. From release 10.30  onwards  this
4624       has  changed  (the  stack  is  no longer used) and this option now does
4625       nothing except give a warning.
4626
4627
4628SEE ALSO
4629
4630       pcre2api(3), pcre2-config(3).
4631
4632
4633AUTHOR
4634
4635       Philip Hazel
4636       Retired from University Computing Service
4637       Cambridge, England.
4638
4639
4640REVISION
4641
4642       Last updated: 15 April 2024
4643       Copyright (c) 1997-2024 University of Cambridge.
4644
4645
4646PCRE2 10.44                      15 April 2024                   PCRE2BUILD(3)
4647------------------------------------------------------------------------------
4648
4649
4650
4651PCRE2CALLOUT(3)            Library Functions Manual            PCRE2CALLOUT(3)
4652
4653
4654NAME
4655       PCRE2 - Perl-compatible regular expressions (revised API)
4656
4657
4658SYNOPSIS
4659
4660       #include <pcre2.h>
4661
4662       int (*pcre2_callout)(pcre2_callout_block *, void *);
4663
4664       int pcre2_callout_enumerate(const pcre2_code *code,
4665         int (*callback)(pcre2_callout_enumerate_block *, void *),
4666         void *user_data);
4667
4668
4669DESCRIPTION
4670
4671       PCRE2  provides  a  feature  called "callout", which is a means of tem-
4672       porarily passing control to the caller of PCRE2 in the middle  of  pat-
4673       tern  matching.  The  caller  of PCRE2 provides an external function by
4674       putting its entry point in a match context (see pcre2_set_callout()  in
4675       the pcre2api documentation).
4676
4677       When  using the pcre2_substitute() function, an additional callout fea-
4678       ture is available. This does a callout after each change to the subject
4679       string and is described in the pcre2api documentation; the rest of this
4680       document is concerned with callouts during pattern matching.
4681
4682       Within a regular expression, (?C<arg>) indicates a point at  which  the
4683       external  function  is  to  be  called. Different callout points can be
4684       identified by putting a number less than 256 after the  letter  C.  The
4685       default  value is zero.  Alternatively, the argument may be a delimited
4686       string. The starting delimiter must be one of ` ' " ^ % # $ {  and  the
4687       ending delimiter is the same as the start, except for {, where the end-
4688       ing  delimiter  is  }.  If  the  ending  delimiter is needed within the
4689       string, it must be doubled. For example, this pattern has  two  callout
4690       points:
4691
4692         (?C1)abc(?C"some ""arbitrary"" text")def
4693
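       When a callout string is generated by a program, the doubling rule
       has to be applied mechanically. The following sketch shows one way of
       doing so. make_callout_item() is a hypothetical helper, not a PCRE2
       function; it builds a complete (?C...) item from arbitrary text and a
       chosen starting delimiter.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a PCRE2 function). Writes a callout item with a
   string argument into 'out', doubling the ending delimiter wherever it
   occurs in 'text'. 'open' must be one of ` ' " ^ % # $ {. Returns the
   length written, excluding the terminating NUL, or 0 if the delimiter is
   invalid or the buffer is too small. */
size_t make_callout_item(const char *text, char open, char *out, size_t outsize)
{
    static const char valid_open[] = "`'\"^%#${";
    char close = (open == '{') ? '}' : open;
    size_t needed = 6;   /* "(?C", the two delimiters, and ")" */
    size_t n = 0;
    const char *p;

    if (open == '\0' || strchr(valid_open, open) == NULL) return 0;

    for (p = text; *p != '\0'; p++)
        needed += (*p == close) ? 2 : 1;
    if (needed + 1 > outsize) return 0;      /* +1 for the terminating NUL */

    memcpy(out, "(?C", 3);
    n = 3;
    out[n++] = open;
    for (p = text; *p != '\0'; p++) {
        if (*p == close) out[n++] = close;   /* double the ending delimiter */
        out[n++] = *p;
    }
    out[n++] = close;
    out[n++] = ')';
    out[n] = '\0';
    return n;
}
```
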
4694       If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
4695       PCRE2  automatically inserts callouts, all with number 255, before each
4696       item in the pattern except for immediately before or after an  explicit
4697       callout. For example, if PCRE2_AUTO_CALLOUT is used with the pattern
4698
4699         A(?C3)B
4700
4701       it is processed as if it were
4702
4703         (?C255)A(?C3)B(?C255)
4704
4705       Here is a more complicated example:
4706
4707         A(\d{2}|--)
4708
4709       With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
4710
4711         (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
4712
4713       Notice  that  there  is a callout before and after each parenthesis and
4714       alternation bar. If the pattern contains a conditional group whose con-
4715       dition is an assertion, an automatic callout  is  inserted  immediately
4716       before  the  condition. Such a callout may also be inserted explicitly,
4717       for example:
4718
4719         (?(?C9)(?=a)ab|de)  (?(?C%text%)(?!=d)ab|de)
4720
4721       This applies only to assertion conditions (because they are  themselves
4722       independent groups).
4723
4724       Callouts  can  be useful for tracking the progress of pattern matching.
4725       The pcre2test program has a pattern qualifier (/auto_callout) that sets
4726       automatic callouts.  When any callouts are  present,  the  output  from
4727       pcre2test  indicates  how  the pattern is being matched. This is useful
4728       information when you are trying to optimize the performance of  a  par-
4729       ticular pattern.
4730
4731
4732MISSING CALLOUTS
4733
4734       You  should  be  aware  that, because of optimizations in the way PCRE2
4735       compiles and matches patterns, callouts sometimes do not happen exactly
4736       as you might expect.
4737
4738   Auto-possessification
4739
4740       At compile time, PCRE2 "auto-possessifies" repeated items when it knows
4741       that what follows cannot be part of the repeat. For example, a+[bc]  is
4742       compiled  as if it were a++[bc]. The pcre2test output when this pattern
4743       is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
4744       to the string "aaaa" is:
4745
4746         --->aaaa
4747          +0 ^        a+
4748          +2 ^   ^    [bc]
4749         No match
4750
4751       This indicates that when matching [bc] fails, there is no  backtracking
4752       into a+ (because it is being treated as a++) and therefore the callouts
4753       that  would  be  taken for the backtracks do not occur. You can disable
4754       the  auto-possessify  feature  by  passing   PCRE2_NO_AUTO_POSSESS   to
4755       pcre2_compile(),  or  starting  the pattern with (*NO_AUTO_POSSESS). In
4756       this case, the output changes to this:
4757
4758         --->aaaa
4759          +0 ^        a+
4760          +2 ^   ^    [bc]
4761          +2 ^  ^     [bc]
4762          +2 ^ ^      [bc]
4763          +2 ^^       [bc]
4764         No match
4765
4766       This time, when matching [bc] fails, the matcher backtracks into a+ and
4767       tries again, repeatedly, until a+ itself fails.
4768
4769   Automatic .* anchoring
4770
4771       By default, an optimization is applied when .* is the first significant
4772       item in a pattern. If PCRE2_DOTALL is set, so that the  dot  can  match
4773       any  character,  the pattern is automatically anchored. If PCRE2_DOTALL
4774       is not set, a match can start only after an internal newline or at  the
4775       beginning of the subject, and pcre2_compile() remembers this. If a pat-
4776       tern  has more than one top-level branch, automatic anchoring occurs if
4777       all branches are anchorable.
4778
4779       This optimization is disabled, however, if .* is in an atomic group  or
4780       if  there  is a backreference to the capture group in which it appears.
4781       It is also disabled if the pattern contains (*PRUNE) or  (*SKIP).  How-
4782       ever, the presence of callouts does not affect it.
4783
4784       For  example,  if  the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
4785       and applied to the string "aa", the pcre2test output is:
4786
4787         --->aa
4788          +0 ^      .*
4789          +2 ^ ^    \d
4790          +2 ^^     \d
4791          +2 ^      \d
4792         No match
4793
4794       This shows that all match attempts start at the beginning of  the  sub-
4795       ject. In other words, the pattern is anchored. You can disable this op-
4796       timization  by  passing  PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
4797       starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the  out-
4798       put changes to:
4799
4800         --->aa
4801          +0 ^      .*
4802          +2 ^ ^    \d
4803          +2 ^^     \d
4804          +2 ^      \d
4805          +0  ^     .*
4806          +2  ^^    \d
4807          +2  ^     \d
4808         No match
4809
4810       This  shows more match attempts, starting at the second subject charac-
4811       ter.  Another optimization, described in the next section,  means  that
4812       there is no subsequent attempt to match with an empty subject.
4813
4814   Other optimizations
4815
4816       Other  optimizations  that  provide fast "no match" results also affect
4817       callouts.  For example, if the pattern is
4818
4819         ab(?C4)cd
4820
4821       PCRE2 knows that any matching string must contain the  letter  "d".  If
4822       the  subject  string  is  "abyz",  the  lack of "d" means that matching
4823       doesn't ever start, and the callout is  never  reached.  However,  with
4824       "abyd", though the result is still no match, the callout is obeyed.
4825
4826       For  most  patterns  PCRE2  also knows the minimum length of a matching
4827       string, and will immediately give a "no match" return without  actually
4828       running  a  match if the subject is not long enough, or, for unanchored
4829       patterns, if it has been scanned far enough.
4830
4831       You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
4832       MIZE option  to  pcre2_compile(),  or  by  starting  the  pattern  with
4833       (*NO_START_OPT).  This slows down the matching process, but does ensure
4834       that callouts such as the example above are obeyed.
4835
4836
4837THE CALLOUT INTERFACE
4838
4839       During matching, when PCRE2 reaches a callout  point,  if  an  external
4840       function  is  provided in the match context, it is called. This applies
4841       to normal, DFA, and JIT matching alike. The first argument to the call-
4842       out function is a pointer to a pcre2_callout block. The second argument
4843       is  the  void * callout data that was supplied when the callout was set
4844       up by calling pcre2_set_callout() (see the pcre2api documentation). The
4845       callout block structure contains the following fields, not  necessarily
4846       in this order:
4847
4848         uint32_t      version;
4849         uint32_t      callout_number;
4850         uint32_t      capture_top;
4851         uint32_t      capture_last;
4852         uint32_t      callout_flags;
4853         PCRE2_SIZE   *offset_vector;
4854         PCRE2_SPTR    mark;
4855         PCRE2_SPTR    subject;
4856         PCRE2_SIZE    subject_length;
4857         PCRE2_SIZE    start_match;
4858         PCRE2_SIZE    current_position;
4859         PCRE2_SIZE    pattern_position;
4860         PCRE2_SIZE    next_item_length;
4861         PCRE2_SIZE    callout_string_offset;
4862         PCRE2_SIZE    callout_string_length;
4863         PCRE2_SPTR    callout_string;
4864
4865       The  version field contains the version number of the block format. The
4866       current version is 2; the three callout string fields  were  added  for
4867       version  1, and the callout_flags field for version 2. If you are writ-
4868       ing an application that might use an  earlier  release  of  PCRE2,  you
4869       should  check  the version number before accessing any of these fields.
4870       The version number will increase in future if more  fields  are  added,
4871       but the intention is never to remove any of the existing fields.
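
           A version check of this kind might look  like  the  following
           sketch.  The demo_callout_block type is a stand-in that mirrors
           only a few of the fields listed above, so that the example builds
           without the library; a real program would use the block declared
           in pcre2.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the few callout block fields this sketch touches. */
typedef struct {
    uint32_t    version;
    uint32_t    callout_number;
    uint32_t    callout_flags;          /* meaningful when version >= 2 */
    size_t      callout_string_length;  /* meaningful when version >= 1 */
    const char *callout_string;         /* meaningful when version >= 1 */
} demo_callout_block;

/* Returns the callout flags, or 0 when the block predates version 2. */
static uint32_t demo_flags(const demo_callout_block *cb)
{
    assert(cb != NULL);
    return (cb->version >= 2) ? cb->callout_flags : 0;
}

/* Returns the callout string, or NULL when the block predates version 1.
   NULL is also what a numerical callout reports. */
static const char *demo_string(const demo_callout_block *cb)
{
    assert(cb != NULL);
    return (cb->version >= 1) ? cb->callout_string : NULL;
}
```

           Wrapping the accesses in small functions keeps the version  test
           in one place instead of scattering it through the callout code.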
4872
4873   Fields for numerical callouts
4874
4875       For  a  numerical  callout,  callout_string is NULL, and callout_number
4876       contains the number of the callout, in the range  0-255.  This  is  the
4877       number that follows (?C for callouts that are part of the pattern; it is
4878       255 for automatically generated callouts.
4879
4880   Fields for string callouts
4881
4882       For callouts with string arguments, callout_number is always zero,  and
4883       callout_string  points  to the string that is contained within the com-
4884       piled pattern. Its length is given by callout_string_length. Duplicated
4885       ending delimiters that were present in the original pattern string have
4886       been turned into single characters, but there is no other processing of
4887       the callout string argument. An additional code unit containing  binary
4888       zero  is  present  after the string, but is not included in the length.
4889       The delimiter that was used to start the string is also  stored  within
4890       the  pattern, immediately before the string itself. You can access this
4891       delimiter as callout_string[-1] if you need it.
4892
4893       The callout_string_offset field is the code unit offset to the start of
4894       the callout argument string within the original pattern string. This is
4895       provided for the benefit of applications such as script languages  that
4896       might need to report errors in the callout string within the pattern.
4897
4898   Fields for all callouts
4899
4900       The  remaining  fields in the callout block are the same for both kinds
4901       of callout.
4902
4903       The offset_vector field is a pointer to a vector of  capturing  offsets
4904       (the "ovector"). You may read the elements in this vector, but you must
4905       not change any of them.
4906
4907       For  calls  to pcre2_match(), the offset_vector field is not (since re-
4908       lease 10.30) a pointer to the actual ovector that  was  passed  to  the
4909       matching  function in the match data block. Instead it points to an in-
4910       ternal ovector of a size large enough to  hold  all  possible  captured
4911       substrings in the pattern. Note that whenever a recursion or subroutine
4912       call  within  a pattern completes, the capturing state is reset to what
4913       it was before.
4914
4915       The capture_last field contains the number of the  most  recently  cap-
4916       tured  substring,  and the capture_top field contains one more than the
4917       number of the highest numbered captured substring so far.  If  no  sub-
4918       strings  have yet been captured, the value of capture_last is 0 and the
4919       value of capture_top is 1. The values of these  fields  do  not  always
4920       differ   by   one;  for  example,  when  the  callout  in  the  pattern
4921       ((a)(b))(?C2) is taken, capture_last is 1 but capture_top is 4.
4922
4923       The contents of ovector[2] to  ovector[<capture_top>*2-1]  can  be  in-
4924       spected  in  order to extract substrings that have been matched so far,
4925       in the same way as extracting substrings after a match  has  completed.
4926       The  values in ovector[0] and ovector[1] are always PCRE2_UNSET because
4927       the match is by definition not complete. Substrings that have not  been
4928       captured  but whose numbers are less than capture_top also have both of
4929       their ovector slots set to PCRE2_UNSET.
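
           As a sketch of how a callout function might read  the  ovector
           under these rules: group_span() is a hypothetical helper, and
           DEMO_UNSET is a local stand-in for PCRE2_UNSET so that the exam-
           ple builds on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PCRE2_UNSET (the maximum PCRE2_SIZE value in pcre2.h). */
#define DEMO_UNSET (~(size_t)0)

/* Hypothetical helper: fetches the offsets of a capture group during a
   callout. Returns 1 and fills *start and *end if the group is set, or 0
   if it is group 0 (the match is not complete), is not below capture_top,
   or has not been captured. */
static int group_span(const size_t *ovector, uint32_t capture_top,
                      uint32_t group, size_t *start, size_t *end)
{
    assert(ovector != NULL && start != NULL && end != NULL);

    if (group == 0 || group >= capture_top)
        return 0;

    /* Group n occupies ovector[2n] and ovector[2n+1]. */
    size_t s = ovector[2 * (size_t)group];
    size_t e = ovector[2 * (size_t)group + 1];
    if (s == DEMO_UNSET || e == DEMO_UNSET)
        return 0;

    *start = s;
    *end = e;
    return 1;
}
```

           For the pattern ((a)(b))(?C2) applied to "ab",  capture_top  is
           4, so groups 1 to 3 can be inspected and group 1 spans offsets 0
           to 2.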
4930
4931       For DFA matching, the offset_vector field points to  the  ovector  that
4932       was  passed  to the matching function in the match data block for call-
4933       outs at the top level, but to an internal ovector during the processing
4934       of pattern recursions, lookarounds, and atomic groups.  However,  these
4935       ovectors  hold no useful information because pcre2_dfa_match() does not
4936       support substring capturing. The value of capture_top is always  1  and
4937       the value of capture_last is always 0 for DFA matching.
4938
4939       The subject and subject_length fields contain copies of the values that
4940       were passed to the matching function.
4941
4942       The  start_match  field normally contains the offset within the subject
4943       at which the current match attempt started. However, if the escape  se-
4944       quence  \K  has  been encountered, this value is changed to reflect the
4945       modified starting point. If the pattern is not  anchored,  the  callout
4946       function may be called several times from the same point in the pattern
4947       for different starting points in the subject.
4948
4949       The  current_position  field  contains the offset within the subject of
4950       the current match pointer.
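
           The two offsets are enough to draw marker lines in the style  of
           the  pcre2test  output shown earlier. The sketch below is not the
           pcre2test code; render_markers() is a hypothetical helper, and it
           assumes that each code unit occupies one output column.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: fills out with a line that has '^' in the columns
   given by start_match and current_position. Returns 0 on success, or -1
   if the buffer is too small. */
static int render_markers(size_t start_match, size_t current_position,
                          char *out, size_t out_size)
{
    assert(out != NULL);

    /* \K or a lookbehind can leave either offset ahead of the other. */
    size_t last = (start_match > current_position) ? start_match
                                                   : current_position;
    if (out_size < 2 || last > out_size - 2)
        return -1;

    memset(out, ' ', last + 1);
    out[start_match] = '^';
    out[current_position] = '^';
    out[last + 1] = '\0';
    return 0;
}
```

           With start_match 0 and current_position 4, the  result  is  the
           marker text seen in the line for [bc] in the a+[bc] example.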
4951
4952       The pattern_position field contains the offset in the pattern string to
4953       the next item to be matched.
4954
4955       The next_item_length field contains the length of the next item  to  be
4956       processed  in the pattern string. When the callout is at the end of the
4957       pattern, the length is zero.  When  the  callout  precedes  an  opening
4958       parenthesis, the length includes meta characters that follow the paren-
4959       thesis.  For  example,  in a callout before an assertion such as (?=ab)
4960       the length is 3. For an alternation bar or a closing  parenthesis,  the
4961       length  is  one,  unless a closing parenthesis is followed by a quanti-
4962       fier, in which case its length is included. (This  changed  in  release
4963       10.23.  In  earlier  releases, before an opening parenthesis the length
4964       was that of the entire group, and before an alternation bar or a  clos-
4965       ing parenthesis the length was zero.)
4966
4967       The  pattern_position  and next_item_length fields are intended to help
4968       in distinguishing between different automatic callouts, which all  have
4969       the  same  callout  number. However, they are set for all callouts, and
4970       are used by pcre2test to show the next item to be matched when display-
4971       ing callout information.
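
           A callout function that has access to the original pattern  text
           could  use  the two fields to recover the next item, roughly as
           follows. copy_next_item() is a hypothetical helper, and the pat-
           tern is assumed to be a zero-terminated 8-bit string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the pattern text of the next item, as
   described by pattern_position and next_item_length, into out. Returns 0
   on success, or -1 if the range lies outside the pattern or the buffer
   is too small. */
static int copy_next_item(const char *pattern, size_t pattern_position,
                          size_t next_item_length,
                          char *out, size_t out_size)
{
    assert(pattern != NULL && out != NULL);

    size_t pattern_length = strlen(pattern);
    if (pattern_position > pattern_length ||
        next_item_length > pattern_length - pattern_position ||
        next_item_length >= out_size)
        return -1;

    memcpy(out, pattern + pattern_position, next_item_length);
    out[next_item_length] = '\0';
    return 0;
}
```

           For a callout just before the assertion in x(?=ab)y,  with  pat-
           tern_position 1 and next_item_length 3, the copied text is (?= .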
4972
4973       In callouts from pcre2_match() the mark field contains a pointer to the
4974       zero-terminated name of the most recently passed (*MARK), (*PRUNE),  or
4975       (*THEN)  item  in the match, or NULL if no such items have been passed.
4976       Instances of (*PRUNE) or (*THEN) without a name  do  not  obliterate  a
4977       previous (*MARK). In callouts from the DFA matching function this field
4978       always contains NULL.
4979
4980       The   callout_flags   field   is   always   zero   in   callouts   from
4981       pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
4982       JIT is used, the following bits may be set:
4983
4984         PCRE2_CALLOUT_STARTMATCH
4985
4986       This is set for the first callout after the start of matching for  each
4987       new starting position in the subject.
4988
4989         PCRE2_CALLOUT_BACKTRACK
4990
4991       This  is  set if there has been a matching backtrack since the previous
4992       callout, or since the start of matching if this is  the  first  callout
4993       from a pcre2_match() run.
4994
4995       Both  bits  are  set when a backtrack has caused a "bumpalong" to a new
4996       starting position in the subject. Output from pcre2test does not  indi-
4997       cate  the  presence  of these bits unless the callout_extra modifier is
4998       set.
4999
5000       The information in the callout_flags field is provided so that applica-
5001       tions can track and tell their users how matching with backtracking  is
5002       done.  This  can be useful when trying to optimize patterns, or just to
5003       understand how PCRE2 works. There is no  support  in  pcre2_dfa_match()
5004       because  there is no backtracking in DFA matching, and there is no sup-
5005       port in JIT because JIT is all about maximizing  matching  performance.
5006       In both these cases the callout_flags field is always zero.
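
           An application that reports matching progress  might  turn  the
           two  bits  into text along these lines. The DEMO_ constants are
           local stand-ins, with values assumed to match the PCRE2_CALLOUT_
           macros; real code should use the macros from pcre2.h.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local stand-ins for PCRE2_CALLOUT_STARTMATCH and PCRE2_CALLOUT_BACKTRACK;
   the numeric values are assumptions made for this sketch. */
#define DEMO_CALLOUT_STARTMATCH 0x00000001u
#define DEMO_CALLOUT_BACKTRACK  0x00000002u

/* Hypothetical helper: describes what callout_flags says about how the
   matcher arrived at this callout. */
static const char *describe_callout_flags(uint32_t flags)
{
    int start = (flags & DEMO_CALLOUT_STARTMATCH) != 0;
    int back  = (flags & DEMO_CALLOUT_BACKTRACK) != 0;

    if (start && back)
        return "bumpalong to a new start after a backtrack";
    if (start)
        return "first callout for this starting position";
    if (back)
        return "backtrack since the previous callout";
    return "no backtrack since the previous callout";
}
```

           Because the field is always zero for DFA and JIT matching,  the
           last description is the only one those matchers would produce.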
5007
5008
5009RETURN VALUES FROM CALLOUTS
5010
5011       The external callout function returns an integer to PCRE2. If the value
5012       is zero, matching proceeds as normal. If  the  value  is  greater  than
5013       zero,  matching  fails  at  the current point, but the testing of other
5014       matching possibilities goes ahead, just as if a lookahead assertion had
5015       failed. If the value is less than zero, the match is abandoned, and the
5016       matching function returns the negative value.
5017
5018       Negative values should normally be chosen from  the  set  of  PCRE2_ER-
5019       ROR_xxx  values.  In  particular, PCRE2_ERROR_NOMATCH forces a standard
5020       "no match" failure. The error number  PCRE2_ERROR_CALLOUT  is  reserved
5021       for use by callout functions; it will never be used by PCRE2 itself.
5022
5023
5024CALLOUT ENUMERATION
5025
5026       int pcre2_callout_enumerate(const pcre2_code *code,
5027         int (*callback)(pcre2_callout_enumerate_block *, void *),
5028         void *user_data);
5029
5030       A script language that supports the use of string arguments in callouts
5031       might  like  to  scan  all the callouts in a pattern before running the
5032       match. This can be done by calling pcre2_callout_enumerate(). The first
5033       argument is a pointer to a compiled pattern, the  second  points  to  a
5034       callback  function,  and the third is arbitrary user data. The callback
5035       function is called for every callout in the pattern  in  the  order  in
5036       which they appear. Its first argument is a pointer to a callout enumer-
5037       ation  block,  and  its second argument is the user_data value that was
5038       passed to pcre2_callout_enumerate(). The data block contains  the  fol-
5039       lowing fields:
5040
5041         version                Block version number
5042         pattern_position       Offset to next item in pattern
5043         next_item_length       Length of next item in pattern
5044         callout_number         Number for numbered callouts
5045         callout_string_offset  Offset to string within pattern
5046         callout_string_length  Length of callout string
5047         callout_string         Points to callout string or is NULL
5048
5049       The  version  number is currently 0. It will increase if new fields are
5050       ever added to the block. The remaining fields are  the  same  as  their
5051       namesakes  in  the pcre2_callout block that is used for callouts during
5052       matching, as described above.
5053
5054       Note that the value of pattern_position is  unique  for  each  callout.
5055       However,  if  a callout occurs inside a group that is quantified with a
5056       non-zero minimum or a fixed maximum, the group is replicated inside the
5057       compiled pattern. For example, a pattern such as /(a){2}/  is  compiled
5058       as  if it were /(a)(a)/. This means that the callout will be enumerated
5059       more than once, but with the same value for  pattern_position  in  each
5060       case.
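
           A callback that wants each callout only once can therefore  key
           on  pattern_position.  The  sketch  below uses stand-in types so
           that it builds without the library; collect_unique()  and  the
           demo_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_CALLOUTS 32

/* Stand-in for the enumeration block fields this sketch reads. */
typedef struct {
    size_t   pattern_position;
    size_t   next_item_length;
    uint32_t callout_number;
} demo_enum_block;

/* Hypothetical user data: the distinct pattern positions seen so far. */
typedef struct {
    size_t positions[DEMO_MAX_CALLOUTS];
    size_t count;
} demo_seen;

/* Records each pattern_position once, so that a callout replicated by a
   quantified group is not counted twice. Returns 0 to continue, or 1 to
   stop the enumeration when the table is full. */
static int collect_unique(const demo_enum_block *eb, void *user_data)
{
    demo_seen *seen = user_data;
    assert(eb != NULL && seen != NULL);

    for (size_t i = 0; i < seen->count; i++) {
        if (seen->positions[i] == eb->pattern_position)
            return 0;               /* replicated callout, already recorded */
    }
    if (seen->count == DEMO_MAX_CALLOUTS)
        return 1;                   /* non-zero stops the scan */

    seen->positions[seen->count++] = eb->pattern_position;
    return 0;
}
```

           The non-zero return on a full table relies on the rule  that  a
           non-zero value from the callback stops the scan.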
5061
5062       The callback function should normally return zero. If it returns a non-
5063       zero value, scanning the pattern stops, and that value is returned from
5064       pcre2_callout_enumerate().
5065
5066
5067AUTHOR
5068
5069       Philip Hazel
5070       Retired from University Computing Service
5071       Cambridge, England.
5072
5073
5074REVISION
5075
5076       Last updated: 19 January 2024
5077       Copyright (c) 1997-2024 University of Cambridge.
5078
5079
5080PCRE2 10.43                     19 January 2024                PCRE2CALLOUT(3)
5081------------------------------------------------------------------------------
5082
5083
5084
5085PCRE2COMPAT(3)             Library Functions Manual             PCRE2COMPAT(3)
5086
5087
5088NAME
5089       PCRE2 - Perl-compatible regular expressions (revised API)
5090
5091
5092DIFFERENCES BETWEEN PCRE2 AND PERL
5093
5094       This  document describes some of the known differences in the ways that
5095       PCRE2 and Perl handle regular expressions.  The  differences  described
5096       here  are  with  respect  to  Perl version 5.38.0, but as both Perl and
5097       PCRE2 are continually changing, the information may at times be out  of
5098       date.
5099
5100       1.  When  PCRE2_DOTALL  (equivalent to Perl's /s qualifier) is not set,
5101       the behaviour of the '.' metacharacter differs from Perl. In PCRE2, '.'
5102       matches the next character unless it is the  start  of  a  newline  se-
5103       quence.  This  means  that, if the newline setting is CR, CRLF, or NUL,
5104       '.' will match the code point LF (0x0A) in ASCII/Unicode  environments,
5105       and  NL  (either  0x15 or 0x25) when using EBCDIC. In Perl, '.' appears
5106       never to match LF, even when 0x0A is not a newline indicator.
5107
5108       2. PCRE2 has only a subset of Perl's Unicode support. Details  of  what
5109       it does have are given in the pcre2unicode page.
5110
5111       3.  Like  Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
5112       tions, but they do not mean what you might think. For example, (?!a){3}
5113       does not assert that the next three characters are not "a". It just as-
5114       serts that the next character is not "a"  three  times  (in  principle;
5115       PCRE2  optimizes this to run the assertion just once). Perl allows some
5116       repeat quantifiers on other assertions, for example, \b* , but these do
5117       not seem to have any use. PCRE2 does not allow any kind  of  quantifier
5118       on non-lookaround assertions.
5119
5120       4.  If a braced quantifier such as {1,2} appears where there is nothing
5121       to repeat (for example, at the start of a branch), PCRE2 raises an  er-
5122       ror whereas Perl treats the quantifier characters as literal.
5123
5124       5.  Capture groups that occur inside negative lookaround assertions are
5125       counted, but their entries in the offsets vector are set  only  when  a
5126       negative  assertion is a condition that has a matching branch (that is,
5127       the condition is false).  Perl may set such  capture  groups  in  other
5128       circumstances.
5129
5130       6.  The  following Perl escape sequences are not supported: \F, \l, \L,
5131       \u, \U, and \N when followed by a character name. \N on its own, match-
5132       ing a non-newline character, and \N{U+dd..}, matching  a  Unicode  code
5133       point,  are  supported.  The  escapes that modify the case of following
5134       letters are implemented by Perl's general string-handling and  are  not
5135       part of its pattern matching engine. If any of these are encountered by
5136       PCRE2,  an  error  is  generated  by default. However, if either of the
5137       PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U  and  \u  are
5138       interpreted as ECMAScript interprets them.
5139
5140       7. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
5141       is built with Unicode support (the default). The properties that can be
5142       tested  with  \p  and \P are limited to the general category properties
5143       such as Lu and Nd, the derived properties  Any  and  LC  (synonym  L&),
5144       script  names such as Greek or Han, Bidi_Class, Bidi_Control, and a few
5145       binary properties. Both PCRE2 and Perl support the Cs (surrogate) prop-
5146       erty, but in PCRE2 its use is limited. See the pcre2pattern  documenta-
5147       tion  for  details. The long synonyms for property names that Perl sup-
5148       ports (such as \p{Letter}) are not supported by PCRE2, nor is  it  per-
5149       mitted to prefix any of these properties with "Is".
5150
5151       8. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
5152       in between are treated as literals. However, this is slightly different
5153       from  Perl  in  that  $  and  @ are also handled as literals inside the
5154       quotes. In Perl, they cause variable interpolation (PCRE2 does not have
5155       variables). Also, Perl does "double-quotish backslash interpolation" on
5156       any backslashes between \Q and \E which, its documentation  says,  "may
5157       lead  to confusing results". PCRE2 treats a backslash between \Q and \E
5158       just like any other character. Note the following examples:
5159
5160           Pattern            PCRE2 matches     Perl matches
5161
5162           \Qabc$xyz\E        abc$xyz           abc followed by the
5163                                                  contents of $xyz
5164           \Qabc\$xyz\E       abc\$xyz          abc\$xyz
5165           \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
5166           \QA\B\E            A\B               A\B
5167           \Q\\E              \                 \\E
5168
5169       The \Q...\E sequence is recognized both inside  and  outside  character
5170       classes by both PCRE2 and Perl.
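
           Because PCRE2 treats everything between \Q and \E as literal, an
           application  can quote arbitrary text for use in a PCRE2 pattern
           with a sketch like the one below. quote_literal() is a hypothet-
           ical  helper  for  8-bit  text; it handles an embedded \E by
           closing the quote, adding an escaped backslash and  E,  and  re-
           opening the quote. It is not suitable for Perl, where $ and @ in-
           terpolate.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Appends s to out at offset *n; returns 0 if it does not fit. */
static int demo_append(char *out, size_t out_size, size_t *n, const char *s)
{
    size_t len = strlen(s);
    if (out_size == 0 || len > out_size - 1 - *n)
        return 0;
    memcpy(out + *n, s, len + 1);
    *n += len;
    return 1;
}

/* Hypothetical helper: wraps text in \Q...\E for PCRE2. Returns the length
   written, or 0 if the buffer is too small. */
static size_t quote_literal(const char *text, char *out, size_t out_size)
{
    assert(text != NULL && out != NULL);
    size_t n = 0;

    if (!demo_append(out, out_size, &n, "\\Q"))
        return 0;
    for (const char *p = text; *p != '\0'; p++) {
        if (p[0] == '\\' && p[1] == 'E') {
            /* End the quote, match a literal \E, then start quoting again. */
            if (!demo_append(out, out_size, &n, "\\E\\\\E\\Q"))
                return 0;
            p++;                    /* the E has been consumed as well */
        } else {
            char one[2] = { *p, '\0' };
            if (!demo_append(out, out_size, &n, one))
                return 0;
        }
    }
    if (!demo_append(out, out_size, &n, "\\E"))
        return 0;
    return n;
}
```

           Quoting abc$xyz gives \Qabc$xyz\E, the first pattern in the  ta-
           ble above.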
5171
5172       9.   Fairly  obviously,  PCRE2  does  not  support  the  (?{code})  and
5173       (??{code}) constructions. However, PCRE2 does have a "callout" feature,
5174       which allows an external function to be called during pattern matching.
5175       See the pcre2callout documentation for details.
5176
5177       10. Subroutine calls (whether recursive or not) were treated as  atomic
5178       groups  up to PCRE2 release 10.23, but from release 10.30 this changed,
5179       and backtracking into subroutine calls is now supported, as in Perl.
5180
5181       11. In PCRE2, if any of the backtracking control verbs are  used  in  a
5182       group  that  is  called  as  a subroutine (whether or not recursively),
5183       their effect is confined to that group; it does not extend to the  sur-
5184       rounding  pattern.  This is not always the case in Perl. In particular,
5185       if (*THEN) is present in a group that is called as  a  subroutine,  its
5186       action is limited to that group, even if the group does not contain any
5187       |  characters.  Note  that such groups are processed as anchored at the
5188       point where they are tested.
5189
5190       12. If a pattern contains more than one backtracking control verb,  the
5191       first  one  that  is backtracked onto acts. For example, in the pattern
5192       A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but  a  failure
5193       in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
5194       it is the same as PCRE2, but there are cases where it differs.
5195
5196       13.  There are some differences that are concerned with the settings of
5197       captured strings when part of  a  pattern  is  repeated.  For  example,
5198       matching  "aba"  against the pattern /^(a(b)?)+$/ in Perl leaves $2 un-
5199       set, but in PCRE2 it is set to "b".
5200
5201       14. PCRE2's handling of duplicate capture group numbers  and  names  is
5202       not as general as Perl's. This is a consequence of the fact that PCRE2
5203       works internally just with numbers, using an external table  to  trans-
5204       late  between  numbers  and  names.  In  particular,  a pattern such as
5205       (?|(?<a>A)|(?<b>B)), where the two capture groups have the same  number
5206       but  different  names, is not supported, and causes an error at compile
5207       time. If it were allowed, it would not be possible to distinguish which
5208       group matched, because both names map to capture  group  number  1.  To
5209       avoid this confusing situation, an error is given at compile time.
5210
5211       15. Perl used to recognize comments in some places that PCRE2 does not,
5212       for  example,  between  the  ( and ? at the start of a group. If the /x
5213       modifier is set, Perl allowed white space between ( and  ?  though  the
5214       latest  Perls give an error (for a while it was just deprecated). There
5215       may still be some cases where Perl behaves differently.
5216
5217       16. Perl, when in warning mode, gives warnings  for  character  classes
5218       such  as  [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
5219       als. PCRE2 has no warning features, so it gives an error in these cases
5220       because they are almost certainly user mistakes.
5221
5222       17. In PCRE2, the upper/lower case character properties Lu and  Ll  are
5223       not  affected when case-independent matching is specified. For example,
5224       \p{Lu} always matches an upper case letter. I think Perl has changed in
5225       this respect; in the release at the time of writing (5.38), \p{Lu}  and
5226       \p{Ll} match all letters, regardless of case, when case independence is
5227       specified.
5228
5229       18. From release 5.32.0, Perl locks out the use of \K in lookaround as-
5230       sertions.  From  release 10.38 PCRE2 does the same by default. However,
5231       there is an option for re-enabling the previous  behaviour.  When  this
5232       option  is  set,  \K is acted on when it occurs in positive assertions,
5233       but is ignored in negative assertions.
5234
5235       19. PCRE2 provides some extensions to the Perl regular  expression  fa-
5236       cilities.   Perl  5.10  included  new features that were not in earlier
5237       versions of Perl, some of which (such as  named  parentheses)  were  in
5238       PCRE2 for some time before. This list is with respect to Perl 5.38:
5239
5240       (a)  If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the
5241       $ meta-character matches only at the very end of the string.
5242
5243       (b) A backslash followed  by  a  letter  with  no  special  meaning  is
5244       faulted. (Perl can be made to issue a warning.)
5245
5246       (c)  If PCRE2_UNGREEDY is set, the greediness of the repetition quanti-
5247       fiers is inverted, that is, by default they are not greedy, but if fol-
5248       lowed by a question mark they are.
5249
5250       (d) PCRE2_ANCHORED can be used at matching time to force a  pattern  to
5251       be tried only at the first matching position in the subject string.
5252
5253       (e)     The     PCRE2_NOTBOL,    PCRE2_NOTEOL,    PCRE2_NOTEMPTY    and
5254       PCRE2_NOTEMPTY_ATSTART options have no Perl equivalents.
5255
5256       (f) The \R escape sequence can be restricted to match only CR,  LF,  or
5257       CRLF by the PCRE2_BSR_ANYCRLF option.
5258
5259       (g)  The  callout  facility is PCRE2-specific. Perl supports codeblocks
5260       and variable interpolation, but not general hooks on every match.
5261
5262       (h) The partial matching facility is PCRE2-specific.
5263
5264       (i) The alternative matching function (pcre2_dfa_match()) matches in  a
5265       different way and is not Perl-compatible.
5266
5267       (j)  PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT)
5268       at the start of a pattern. These set overall  options  that  cannot  be
5269       changed within the pattern.
5270
5271       (k)  PCRE2  supports non-atomic positive lookaround assertions. This is
5272       an extension to the lookaround facilities. The default, Perl-compatible
5273       lookarounds are atomic.
5274
5275       (l) There are three syntactical items in patterns that can refer  to  a
5276       capturing  group  by  number: back references such as \g{2}, subroutine
5277       calls such as (?3), and condition references such as  (?(4)...).  PCRE2
5278       supports  relative  group numbers such as +2 and -4 in all three cases.
5279       Perl supports both plus and minus for subroutine calls, but only  minus
5280       for back references, and no relative numbering at all for conditions.
5281
5282       20. Perl has different limits than PCRE2. See the pcre2limit documenta-
5283       tion for details. In release 5.10, Perl changed from recursion to iter-
5284       ation, keeping the intermediate matches on the heap, which is about 10%
5285       slower but avoids any stack overflow limit. PCRE2 made a similar change
5286       at release 10.30, and also has many build-time and run-time customizable
5287       limits.
5288
5289       21. Unlike Perl, PCRE2 does not have character set modifiers,  and  in
5290       particular it has no way to set characters by context as  Perl's  "/d"
5291       does.  A  regular expression using PCRE2_UTF and PCRE2_UCP uses rules
5292       similar to Perl's "/u"; something closer to "/a" can be  selected  by
5293       adding PCRE2_EXTRA_ASCII* options on top.
5294
5295       22. Some recursive patterns that Perl diagnoses as infinite  recursions
5296       can be handled by PCRE2, either by the interpreter or the JIT. An exam-
5297       ple is /(?:|(?0)abcd)(?(R)|\z)/, which matches a sequence of any number
5298       of repeated "abcd" substrings at the end of the subject.
5299
5300
5301AUTHOR
5302
5303       Philip Hazel
5304       Retired from University Computing Service
5305       Cambridge, England.
5306
5307
5308REVISION
5309
5310       Last updated: 30 November 2023
5311       Copyright (c) 1997-2023 University of Cambridge.
5312
5313
5314PCRE2 10.43                    30 November 2023                 PCRE2COMPAT(3)
5315------------------------------------------------------------------------------
5316
5317
5318
5319PCRE2JIT(3)                Library Functions Manual                PCRE2JIT(3)
5320
5321
5322NAME
5323       PCRE2 - Perl-compatible regular expressions (revised API)
5324
5325
5326PCRE2 JUST-IN-TIME COMPILER SUPPORT
5327
5328       Just-in-time  compiling  is a heavyweight optimization that can greatly
5329       speed up pattern matching. However, it comes at the cost of extra  pro-
5330       cessing  before  the  match is performed, so it is of most benefit when
5331       the same pattern is going to be matched many times. This does not  nec-
5332       essarily  mean many calls of a matching function; if the pattern is not
5333       anchored, matching attempts may take place many times at various  posi-
5334       tions in the subject, even for a single call. Therefore, if the subject
5335       string  is  very  long,  it  may  still pay to use JIT even for one-off
5336       matches. JIT support is available for all  of  the  8-bit,  16-bit  and
5337       32-bit PCRE2 libraries.
5338
5339       JIT  support  applies  only to the traditional Perl-compatible matching
5340       function.  It does not apply when the DFA matching  function  is  being
5341       used. The code for JIT support was written by Zoltan Herczeg.
5342
5343
5344AVAILABILITY OF JIT SUPPORT
5345
5346       JIT  support  is  an  optional feature of PCRE2. The "configure" option
5347       --enable-jit (or equivalent CMake option) must be  set  when  PCRE2  is
5348       built  if  you want to use JIT. The support is limited to the following
5349       hardware platforms:
5350
         ARM 32-bit (v7, and Thumb2)
         ARM 64-bit
         IBM s390x 64-bit
         Intel x86 32-bit and 64-bit
         LoongArch 64-bit
         MIPS 32-bit and 64-bit
         PowerPC 32-bit and 64-bit
         RISC-V 32-bit and 64-bit
5359
5360       If --enable-jit is set on an unsupported platform, compilation fails.
5361
5362       A client program can tell  if  JIT  support  is  available  by  calling
5363       pcre2_config()  with  the PCRE2_CONFIG_JIT option. The result is one if
5364       PCRE2 was built with JIT support, and zero otherwise.  However,  having
5365       the  JIT code available does not guarantee that it will be used for any
5366       particular match. One reason for this is that there are a number of op-
5367       tions and pattern items that are not supported by JIT (see below).  An-
5368       other  reason  is that in some environments JIT is unable to get memory
5369       in which to build its compiled code. The only guarantee from pcre2_con-
5370       fig() is that if it returns zero, JIT will definitely not be used.
5371
5372       A simple program does not need to check availability in  order  to  use
5373       JIT  when  possible. The API is implemented in a way that falls back to
5374       the interpretive code if JIT is not available or cannot be used  for  a
5375       given  match.  For  programs  that  need the best possible performance,
5376       there is a "fast path" API that is JIT-specific.
5377
5378
5379SIMPLE USE OF JIT
5380
5381       To make use of the JIT support in the simplest way, all you have to  do
5382       is  to  call pcre2_jit_compile() after successfully compiling a pattern
5383       with pcre2_compile(). This function has two arguments: the first is the
5384       compiled pattern pointer that was returned by pcre2_compile(), and  the
5385       second  is  zero  or  more of the following option bits: PCRE2_JIT_COM-
5386       PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
5387
5388       If JIT support is not available, a  call  to  pcre2_jit_compile()  does
5389       nothing  and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
5390       pattern is passed to the JIT compiler, which turns it into machine code
5391       that executes much faster than the normal interpretive code, but yields
5392       exactly the same results. The returned value  from  pcre2_jit_compile()
5393       is zero on success, or a negative error code.
5394
5395       There  is  a limit to the size of pattern that JIT supports, imposed by
5396       the size of machine stack that it uses. The exact rules are  not  docu-
5397       mented because they may change at any time, in particular, when new op-
5398       timizations  are  introduced.   If  a  pattern  is  too  big, a call to
5399       pcre2_jit_compile() returns PCRE2_ERROR_NOMEMORY.
5400
5401       PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for  com-
5402       plete  matches. If you want to run partial matches using the PCRE2_PAR-
5403       TIAL_HARD or PCRE2_PARTIAL_SOFT options of  pcre2_match(),  you  should
5404       set  one  or  both  of  the  other  options  as  well as, or instead of
5405       PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code
5406       for each of the three modes (normal, soft partial, hard partial).  When
5407       pcre2_match()  is  called,  the appropriate code is run if it is avail-
5408       able. Otherwise, the pattern is matched using interpretive code.
5409
5410       You can call pcre2_jit_compile() multiple times for the  same  compiled
5411       pattern.  It does nothing if it has previously compiled code for any of
5412       the option bits. For example, you can call it once with  PCRE2_JIT_COM-
5413       PLETE  and  (perhaps  later,  when  you find you need partial matching)
5414       again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time  it
5415       will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
5416       ing. If pcre2_jit_compile() is called with no option bits set, it imme-
5417       diately returns zero. This is an alternative way of testing whether JIT
5418       is available.
5419
5420       At  present,  it  is not possible to free JIT compiled code except when
5421       the entire compiled pattern is freed by calling pcre2_code_free().
5422
5423       In some circumstances you may need to call additional functions.  These
5424       are  described  in the section entitled "Controlling the JIT stack" be-
5425       low.
5426
5427       There are some pcre2_match() options that are not supported by JIT, and
5428       there are also some pattern items that JIT cannot handle.  Details  are
5429       given  below.   In both cases, matching automatically falls back to the
5430       interpretive code. If you want to know whether JIT  was  actually  used
5431       for  a particular match, you should arrange for a JIT callback function
5432       to be set up as described in the section entitled "Controlling the  JIT
5433       stack"  below,  even  if  you  do  not need to supply a non-default JIT
5434       stack. Such a callback function is called whenever JIT code is about to
5435       be obeyed. If the match-time options are not right for  JIT  execution,
5436       the callback function is not obeyed.
5437
5438       If  the  JIT  compiler finds an unsupported item, no JIT data is gener-
5439       ated. You can find out if JIT compilation was successful for a compiled
5440       pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JITSIZE op-
5441       tion. A non-zero result means that JIT compilation  was  successful.  A
5442       result of 0 means that JIT support is not available, or the pattern was
5443       not  processed by pcre2_jit_compile(), or the JIT compiler was not able
5444       to handle the pattern. Successful JIT compilation  does  not,  however,
5445       guarantee  the  use  of  JIT at match time because there are some match
5446       time options that are not supported by JIT.
5447
5448
5449MATCHING SUBJECTS CONTAINING INVALID UTF
5450
5451       When a pattern is compiled with the PCRE2_UTF option,  subject  strings
5452       are  normally expected to be a valid sequence of UTF code units. By de-
5453       fault, this is checked at the start of matching and an error is  gener-
5454       ated  if  invalid UTF is detected. The PCRE2_NO_UTF_CHECK option can be
5455       passed to pcre2_match() to skip the check (for improved performance) if
5456       you are sure that a subject string is valid. If  this  option  is  used
5457       with  an  invalid  string, the result is undefined. The calling program
5458       may crash or loop or otherwise misbehave.
5459
       However, a way of running matches on strings that may contain invalid
       UTF sequences is available. Calling pcre2_compile() with the
       PCRE2_MATCH_INVALID_UTF option has two effects: it tells the
       interpreter in pcre2_match() to support invalid UTF, and, if
       pcre2_jit_compile() is subsequently called, the compiled JIT code also
       supports invalid UTF. Details of how this support works, in both the
       JIT and the interpretive cases, are given in the pcre2unicode
       documentation.
5467
5468       There  is  also  an  obsolete  option  for  pcre2_jit_compile()  called
5469       PCRE2_JIT_INVALID_UTF, which currently exists only for backward compat-
5470       ibility.     It   is   superseded   by   the   pcre2_compile()   option
5471       PCRE2_MATCH_INVALID_UTF and should no longer be used. It may be removed
5472       in future.
5473
5474
5475UNSUPPORTED OPTIONS AND PATTERN ITEMS
5476
5477       The pcre2_match() options that  are  supported  for  JIT  matching  are
5478       PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
5479       PCRE2_NOTEMPTY_ATSTART,   PCRE2_NO_UTF_CHECK,  PCRE2_PARTIAL_HARD,  and
5480       PCRE2_PARTIAL_SOFT. The PCRE2_ANCHORED  and  PCRE2_ENDANCHORED  options
5481       are not supported at match time.
5482
5483       If  the  PCRE2_NO_JIT option is passed to pcre2_match() it disables the
5484       use of JIT, forcing matching by the interpreter code.
5485
5486       The only unsupported pattern items are \C (match a  single  data  unit)
5487       when  running in a UTF mode, and a callout immediately before an asser-
5488       tion condition in a conditional group.
5489
5490
5491RETURN VALUES FROM JIT MATCHING
5492
5493       When a pattern is matched using JIT, the return values are the same  as
5494       those  given  by the interpretive pcre2_match() code, with the addition
5495       of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means that  the
5496       memory  used  for  the JIT stack was insufficient. See "Controlling the
5497       JIT stack" below for a discussion of JIT stack usage.
5498
5499       The error code PCRE2_ERROR_MATCHLIMIT is returned by the  JIT  code  if
5500       searching  a  very large pattern tree goes on for too long, as it is in
5501       the same circumstance when JIT is not used, but the details of  exactly
5502       what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
5503       is never returned when JIT matching is used.
5504
5505
5506CONTROLLING THE JIT STACK
5507
5508       When the compiled JIT code runs, it needs a block of memory to use as a
5509       stack.   By  default, it uses 32KiB on the machine stack. However, some
5510       large or complicated patterns need more than this. The error  PCRE2_ER-
5511       ROR_JIT_STACKLIMIT is given when there is not enough stack. Three func-
5512       tions are provided for managing blocks of memory for use as JIT stacks.
5513       There  is further discussion about the use of JIT stacks in the section
5514       entitled "JIT stack FAQ" below.
5515
5516       The pcre2_jit_stack_create() function creates a JIT  stack.  Its  argu-
5517       ments  are  a starting size, a maximum size, and a general context (for
5518       memory allocation functions, or NULL for standard  memory  allocation).
5519       It returns a pointer to an opaque structure of type pcre2_jit_stack, or
5520       NULL  if there is an error. The pcre2_jit_stack_free() function is used
5521       to free a stack that is no longer needed. If its argument is NULL, this
5522       function returns immediately, without doing anything. (For the  techni-
5523       cally  minded: the address space is allocated by mmap or VirtualAlloc.)
5524       A maximum stack size of 512KiB to 1MiB should be more than  enough  for
5525       any pattern.
5526
5527       The  pcre2_jit_stack_assign()  function  specifies which stack JIT code
5528       should use. Its arguments are as follows:
5529
5530         pcre2_match_context  *mcontext
5531         pcre2_jit_callback    callback
5532         void                 *data
5533
       The first argument is a pointer to a match context. When this is
       subsequently passed to a matching function, its information determines
       which JIT stack is used. If this argument is NULL, the function returns
       immediately, without doing anything. There are three cases for the
       values of the other two arguments:
5539
5540         (1) If callback is NULL and data is NULL, an internal 32KiB block
5541             on the machine stack is used. This is the default when a match
5542             context is created.
5543
5544         (2) If callback is NULL and data is not NULL, data must be
5545             a pointer to a valid JIT stack, the result of calling
5546             pcre2_jit_stack_create().
5547
5548         (3) If callback is not NULL, it must point to a function that is
5549             called with data as an argument at the start of matching, in
5550             order to set up a JIT stack. If the return from the callback
5551             function is NULL, the internal 32KiB stack is used; otherwise the
5552             return value must be a valid JIT stack, the result of calling
5553             pcre2_jit_stack_create().
5554
       A callback function is obeyed whenever JIT code is about to be run; it
       is not obeyed when pcre2_match() is called with options that are
       incompatible with JIT matching. A callback function can therefore be
       used to determine whether a match operation was executed by JIT or by
       the interpreter.
5560
5561       You may safely use the same JIT stack for more than one pattern (either
5562       by assigning directly or by callback), as  long  as  the  patterns  are
5563       matched sequentially in the same thread. Currently, the only way to set
5564       up  non-sequential matches in one thread is to use callouts: if a call-
5565       out function starts another match, that match must use a different  JIT
5566       stack to the one used for currently suspended match(es).
5567
5568       In  a multithread application, if you do not specify a JIT stack, or if
5569       you assign or pass back NULL from a callback, that is thread-safe,  be-
5570       cause  each thread has its own machine stack. However, if you assign or
5571       pass back a non-NULL JIT stack, this must be a different stack for each
5572       thread so that the application is thread-safe.
5573
5574       Strictly speaking, even more is allowed. You can assign the  same  non-
5575       NULL  stack  to a match context that is used by any number of patterns,
5576       as long as they are not used for matching by multiple  threads  at  the
5577       same  time.  For  example, you could use the same stack in all compiled
5578       patterns, with a global mutex in the callback to wait until  the  stack
5579       is available for use. However, this is an inefficient solution, and not
5580       recommended.
5581
5582       This  is a suggestion for how a multithreaded program that needs to set
5583       up non-default JIT stacks might operate:
5584
5585         During thread initialization
5586           thread_local_var = pcre2_jit_stack_create(...)
5587
5588         During thread exit
5589           pcre2_jit_stack_free(thread_local_var)
5590
5591         Use a one-line callback function
5592           return thread_local_var
5593
5594       All the functions described in this section do nothing if  JIT  is  not
5595       available.
5596
5597
5598JIT STACK FAQ
5599
5600       (1) Why do we need JIT stacks?
5601
       PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
       where the local data of the current node is pushed before checking its
       child nodes. Allocating real machine stack on some platforms is
       difficult. For example, on PowerPC the stack chain needs to be updated
       every time the stack is extended. Although this is possible, the
       overhead of these updates decreases performance. So we do the recursion
       in memory.
5608
5609       (2) Why don't we simply allocate blocks of memory with malloc()?
5610
5611       Modern  operating  systems have a nice feature: they can reserve an ad-
5612       dress space instead of allocating memory. We can safely allocate memory
5613       pages inside this address space, so the stack could grow without moving
5614       memory data (this is important because of pointers). Thus we can  allo-
5615       cate  1MiB  address  space,  and use only a single memory page (usually
5616       4KiB) if that is enough. However, we can still grow up to 1MiB  anytime
5617       if needed.
5618
5619       (3) Who "owns" a JIT stack?
5620
       The owner of the stack is the user program, not the JIT-compiled
       pattern or anything else. While a stack is being used by pcre2_match()
       (that is, while it is assigned to a match context that is passed to the
       pattern currently running), the user program must ensure that the
       stack is not used by any other thread, to avoid overwriting the same
       memory area. The best practice for multithreaded programs is to
       allocate a stack for each thread, and return this stack through the
       JIT callback function.
5628
5629       (4) When should a JIT stack be freed?
5630
       You can free a JIT stack at any time, as long as it will not be used by
       pcre2_match() again. When you assign the stack to a match context, only
       a pointer is set. There is no reference counting or any other magic.
       You can free compiled patterns, contexts, and stacks in any order, at
       any time. Just do not call pcre2_match() with a match context pointing
       to an already freed stack, as that will cause a segmentation fault.
       (Also, do not free a stack that is currently being used by
       pcre2_match() in another thread.) You can also replace the stack in a
       context at any time when it is not in use. You should free the previous
       stack before assigning a replacement.
5640
5641       (5)  Should  I  allocate/free  a  stack every time before/after calling
5642       pcre2_match()?
5643
       No, because this is too costly in terms of resources. However, you
       could implement a scheme that releases the stack if it has not been
       used for, say, two minutes. The JIT callback can help to achieve this
       without keeping a list of patterns.
5648
5649       (6)  OK, the stack is for long term memory allocation. But what happens
5650       if a pattern causes stack overflow with a stack of 1MiB? Is  that  1MiB
5651       kept until the stack is freed?
5652
       Especially on embedded systems, it might be a good idea to release
       memory sometimes without freeing the stack. There is no API for this at
       the moment. A function that returns the amount of memory currently
       allocated for a stack, and another that releases memory (shrinking the
       stack), would probably be a good idea if someone needs this.
5658
5659       (7) This is too much of a headache. Isn't there any better solution for
5660       JIT stack handling?
5661
5662       No, thanks to Windows. If POSIX threads were used everywhere, we  could
5663       throw out this complicated API.
5664
5665
5666FREEING JIT SPECULATIVE MEMORY
5667
5668       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
5669
       The JIT executable allocator does not free all memory as soon as it is
       possible to do so. It expects new allocations, and keeps some free
       memory around to improve allocation speed. However, in low memory
       conditions, it might be better to free all possible memory. You can
       cause this to happen by calling pcre2_jit_free_unused_memory(). Its
       argument is a general context, for custom memory management, or NULL
       for standard memory management.
5677
5678
5679EXAMPLE CODE
5680
5681       This  is  a  single-threaded example that specifies a JIT stack without
5682       using a callback. A real program should include  error  checking  after
5683       all the function calls.
5684
         int rc;
         int errornumber;
         PCRE2_SIZE erroffset;
         pcre2_code *re;
         pcre2_match_data *match_data;
         pcre2_match_context *mcontext;
         pcre2_jit_stack *jit_stack;

         /* pattern and subject are PCRE2_SPTR values and length is a
            PCRE2_SIZE, all set up by the caller. */
         re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
           &errornumber, &erroffset, NULL);
         /* Generate JIT code for complete (non-partial) matches. */
         rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
         mcontext = pcre2_match_context_create(NULL);
         /* The stack starts at 32KiB and may grow up to 512KiB. */
         jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
         pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
         /* Room for 10 offset pairs; NULL means standard allocation. */
         match_data = pcre2_match_data_create(10, NULL);
         rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
         /* Process result */

         pcre2_code_free(re);
         pcre2_match_data_free(match_data);
         pcre2_match_context_free(mcontext);
         pcre2_jit_stack_free(jit_stack);
5705
5706
5707JIT FAST PATH API
5708
5709       Because the API described above falls back to interpreted matching when
5710       JIT  is  not  available, it is convenient for programs that are written
5711       for  general  use  in  many  environments.  However,  calling  JIT  via
5712       pcre2_match() does have a performance impact. Programs that are written
5713       for  use  where  JIT  is known to be available, and which need the best
5714       possible performance, can instead use a "fast path"  API  to  call  JIT
5715       matching  directly instead of calling pcre2_match() (obviously only for
5716       patterns that have been successfully processed by pcre2_jit_compile()).
5717
5718       The fast path function is called pcre2_jit_match(), and  it  takes  ex-
5719       actly  the same arguments as pcre2_match(). However, the subject string
5720       must be specified with a  length;  PCRE2_ZERO_TERMINATED  is  not  sup-
5721       ported.  Unsupported  option  bits  (for  example,  PCRE2_ANCHORED  and
5722       PCRE2_ENDANCHORED) are ignored, as is the PCRE2_NO_JIT option. The  re-
5723       turn  values  are  also  the  same as for pcre2_match(), plus PCRE2_ER-
5724       ROR_JIT_BADOPTION if a matching mode (partial or complete) is requested
5725       that was not compiled.
5726
5727       When you call pcre2_match(), as well as testing for invalid options,  a
5728       number of other sanity checks are performed on the arguments. For exam-
5729       ple,  if the subject pointer is NULL but the length is non-zero, an im-
5730       mediate error is given. Also, unless PCRE2_NO_UTF_CHECK is set,  a  UTF
5731       subject string is tested for validity. In the interests of speed, these
5732       checks  do  not  happen  on  the  JIT fast path. If invalid UTF data is
5733       passed when PCRE2_MATCH_INVALID_UTF was not  set  for  pcre2_compile(),
5734       the  result  is  undefined. The program may crash or loop or give wrong
5735       results. In the absence  of  PCRE2_MATCH_INVALID_UTF  you  should  call
5736       pcre2_jit_match()  in  UTF  mode  only  if  you are sure the subject is
5737       valid.
5738
5739       Bypassing the sanity checks and the  pcre2_match()  wrapping  can  give
5740       speedups of more than 10%.
5741
5742
5743SEE ALSO
5744
5745       pcre2api(3), pcre2unicode(3)
5746
5747
5748AUTHOR
5749
5750       Philip Hazel (FAQ by Zoltan Herczeg)
5751       Retired from University Computing Service
5752       Cambridge, England.
5753
5754
5755REVISION
5756
5757       Last updated: 21 February 2024
5758       Copyright (c) 1997-2024 University of Cambridge.
5759
5760
5761PCRE2 10.43                    21 February 2024                    PCRE2JIT(3)
5762------------------------------------------------------------------------------
5763
5764
5765
5766PCRE2LIMITS(3)             Library Functions Manual             PCRE2LIMITS(3)
5767
5768
5769NAME
5770       PCRE2 - Perl-compatible regular expressions (revised API)
5771
5772
5773SIZE AND OTHER LIMITATIONS
5774
5775       There are some size limitations in PCRE2 but it is hoped that they will
5776       never in practice be relevant.
5777
5778       The  maximum  size  of  a compiled pattern is approximately 64 thousand
5779       code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with
5780       the default internal linkage size, which  is  2  bytes  for  these  li-
5781       braries.  If  you  want  to  process regular expressions that are truly
5782       enormous, you can compile PCRE2 with an internal linkage size of 3 or 4
5783       (when building the 16-bit library, 3 is  rounded  up  to  4).  See  the
5784       README file in the source distribution and the pcre2build documentation
5785       for  details.  In  these cases the limit is substantially larger.  How-
5786       ever, the speed of execution is slower. In the 32-bit library, the  in-
5787       ternal linkage size is always 4.
5788
5789       The maximum length of a source pattern string is essentially unlimited;
5790       it  is  the largest number a PCRE2_SIZE variable can hold. However, the
5791       program that calls pcre2_compile() can specify a smaller limit.
5792
5793       The maximum length (in code units) of a subject string is one less than
5794       the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an un-
5795       signed integer type, usually defined as size_t. Its maximum value (that
5796       is ~(PCRE2_SIZE)0) is reserved as a special indicator  for  zero-termi-
5797       nated strings and unset offsets.
5798
5799       All values in repeating quantifiers must be less than 65536.
5800
       There are two different limits that apply to branches of lookbehind
       assertions. If every branch in such an assertion matches a fixed number
       of characters, the maximum length of any branch is 65535 characters. If
       any branch matches a variable number of characters, then the maximum
       matching length for every branch is limited. The default limit is set
       when PCRE2 is built, defaulting to 255, but can be changed by the
       calling program.
5808
5809       There  is no limit to the number of parenthesized groups, but there can
5810       be no more than 65535 capture groups, and there is a limit to the depth
5811       of nesting of parenthesized subpatterns of all kinds. This  is  imposed
5812       in  order to limit the amount of system stack used at compile time. The
5813       default limit can be specified when PCRE2 is built; if not, the default
5814       is set to  250.  An  application  can  change  this  limit  by  calling
5815       pcre2_set_parens_nest_limit() to set the limit in a compile context.
5816
5817       The  maximum length of name for a named capture group is 32 code units,
5818       and the maximum number of such groups is 10000.
5819
5820       The maximum length of a  name  in  a  (*MARK),  (*PRUNE),  (*SKIP),  or
5821       (*THEN)  verb  is  255  code units for the 8-bit library and 65535 code
5822       units for the 16-bit and 32-bit libraries.
5823
5824       The maximum length of a string argument to a  callout  is  the  largest
5825       number a 32-bit unsigned integer can hold.
5826
5827       The  maximum  amount  of heap memory used for matching is controlled by
5828       the heap limit, which can be set in a pattern or in  a  match  context.
5829       The default is a very large number, effectively unlimited.
5830
5831
5832AUTHOR
5833
5834       Philip Hazel
5835       Retired from University Computing Service
5836       Cambridge, England.
5837
5838
5839REVISION
5840
5841       Last updated: August 2023
5842       Copyright (c) 1997-2023 University of Cambridge.
5843
5844
5845PCRE2 10.43                      1 August 2023                  PCRE2LIMITS(3)
5846------------------------------------------------------------------------------
5847
5848
5849
5850PCRE2MATCHING(3)           Library Functions Manual           PCRE2MATCHING(3)
5851
5852
5853NAME
5854       PCRE2 - Perl-compatible regular expressions (revised API)
5855
5856
5857PCRE2 MATCHING ALGORITHMS
5858
5859       This document describes the two different algorithms that are available
5860       in  PCRE2  for  matching  a compiled regular expression against a given
5861       subject string. The "standard" algorithm is the  one  provided  by  the
       pcre2_match() function. This works in the same way as Perl's matching
       function, and provides a Perl-compatible matching operation. The
       just-in-time (JIT) optimization that is described in the pcre2jit
       documentation is compatible with this function.
5866
5867       An alternative algorithm is provided by the pcre2_dfa_match() function;
5868       it operates in a different way, and is not Perl-compatible. This alter-
5869       native has advantages and disadvantages compared with the standard  al-
5870       gorithm, and these are described below.
5871
5872       When there is only one possible way in which a given subject string can
5873       match  a pattern, the two algorithms give the same answer. A difference
5874       arises, however, when there are multiple possibilities. For example, if
5875       the pattern
5876
5877         ^<.*>
5878
5879       is matched against the string
5880
5881         <something> <something else> <something further>
5882
5883       there are three possible answers. The standard algorithm finds only one
5884       of them, whereas the alternative algorithm finds all three.
5885
5886
5887REGULAR EXPRESSIONS AS TREES
5888
5889       The set of strings that are matched by a regular expression can be rep-
5890       resented as a tree structure. An unlimited repetition  in  the  pattern
5891       makes  the  tree of infinite size, but it is still a tree. Matching the
5892       pattern to a given subject string (from a given starting point) can  be
5893       thought  of  as  a  search of the tree.  There are two ways to search a
5894       tree: depth-first and breadth-first, and these correspond  to  the  two
5895       matching algorithms provided by PCRE2.
5896
5897
5898THE STANDARD MATCHING ALGORITHM
5899
5900       In  the terminology of Jeffrey Friedl's book "Mastering Regular Expres-
5901       sions", the standard algorithm is an "NFA  algorithm".  It  conducts  a
5902       depth-first  search  of  the pattern tree. That is, it proceeds along a
5903       single path through the tree, checking that the subject matches what is
5904       required. When there is a mismatch, the algorithm  tries  any  alterna-
5905       tives  at  the  current point, and if they all fail, it backs up to the
5906       previous branch point in the  tree,  and  tries  the  next  alternative
5907       branch  at  that  level.  This often involves backing up (moving to the
5908       left) in the subject string as well.  The  order  in  which  repetition
5909       branches  are  tried  is controlled by the greedy or ungreedy nature of
5910       the quantifier.
5911
5912       If a leaf node is reached, a matching string has  been  found,  and  at
5913       that  point the algorithm stops. Thus, if there is more than one possi-
5914       ble match, this algorithm returns the first one that it finds.  Whether
5915       this  is the shortest, the longest, or some intermediate length depends
5916       on the way the alternations and the greedy or ungreedy repetition quan-
5917       tifiers are specified in the pattern.
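
       As a rough illustration of this depth-first behaviour, here is a toy
       matcher in C. It is not PCRE2 code: the names match_here() and
       match_star(), and the tiny syntax (literals, "." and "*" only), are
       assumptions made for this sketch. It tries the branches of a repeat in
       greedy or ungreedy order and returns the first match that it reaches.

```c
#include <assert.h>
#include <stddef.h>

long match_here(const char *pat, const char *text, int greedy);

/* Match c* followed by the rest of the pattern. The repeat is a branch
   point: each possible repeat count is one branch of the tree. */
long match_star(char c, const char *pat, const char *text, int greedy)
{
    size_t max = 0;
    long rest;

    /* Find how many characters the repeat could consume. In this toy,
       "." matches any character except the terminating NUL. */
    while (text[max] != '\0' && (c == '.' || text[max] == c))
        max++;

    if (greedy) {
        /* Longest branch first, backing up one character per failure. */
        for (size_t n = max + 1; n-- > 0; ) {
            rest = match_here(pat, text + n, greedy);
            if (rest >= 0)
                return (long)n + rest;
        }
    } else {
        /* Shortest branch first. */
        for (size_t n = 0; n <= max; n++) {
            rest = match_here(pat, text + n, greedy);
            if (rest >= 0)
                return (long)n + rest;
        }
    }
    return -1;                          /* every branch failed */
}

/* Return the length of the first match found at the start of text,
   or -1 if there is none. */
long match_here(const char *pat, const char *text, int greedy)
{
    long rest;

    assert(pat != NULL && text != NULL);
    if (pat[0] == '\0')
        return 0;                       /* leaf node: stop at once */
    if (pat[1] == '*')
        return match_star(pat[0], pat + 2, text, greedy);
    if (text[0] != '\0' && (pat[0] == '.' || pat[0] == text[0])) {
        rest = match_here(pat + 1, text + 1, greedy);
        return rest < 0 ? -1 : rest + 1;
    }
    return -1;                          /* mismatch: the caller backs up */
}
```

       For the subject "<something> <something else> <something further>",
       match_here("<.*>", subject, 1) returns 48 (the whole string), whereas
       passing 0 as the last argument returns 11 (just "<something>").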
5918
5919       Because it ends up with a single path through the  tree,  it  is  rela-
5920       tively  straightforward  for  this  algorithm to keep track of the sub-
5921       strings that are matched by portions of  the  pattern  in  parentheses.
5922       This provides support for capturing parentheses and backreferences.
5923
5924
5925THE ALTERNATIVE MATCHING ALGORITHM
5926
5927       This  algorithm  conducts  a breadth-first search of the tree. Starting
5928       from the first matching point in the  subject,  it  scans  the  subject
5929       string from left to right, once, character by character, and as it does
5930       this,  it remembers all the paths through the tree that represent valid
5931       matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
5932       though  it is not implemented as a traditional finite state machine (it
5933       keeps multiple states active simultaneously).
5934
5935       Although the general principle of this matching algorithm  is  that  it
5936       scans  the subject string only once, without backtracking, there is one
5937       exception: when a lookaround assertion is encountered,  the  characters
5938       following  or  preceding the current point have to be independently in-
5939       spected.
5940
5941       The scan continues until either the end of the subject is  reached,  or
5942       there  are  no more unterminated paths. At this point, terminated paths
5943       represent the different matching possibilities (if there are none,  the
5944       match  has  failed).   Thus,  if there is more than one possible match,
5945       this algorithm finds all of them,  and  in  particular,  it  finds  the
5946       longest.  The  matches  are returned in the output vector in decreasing
5947       order of length. There is an option to stop  the  algorithm  after  the
5948       first match (which is necessarily the shortest) is found.
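
       The single left-to-right scan can be sketched for one fixed pattern.
       The function below is a toy, hand-coded for the pattern ^<.*> only; it
       is not how pcre2_dfa_match() is implemented, and the name all_matches()
       is an assumption of this sketch. It records every point at which the
       pattern could end and then reports the lengths longest first.

```c
#include <assert.h>
#include <stddef.h>

/* Scan subject once for the fixed pattern ^<.*> and store the length of
   every match in lengths[], longest first. Returns the number stored.
   Simplification: once max entries are stored, later matches are dropped. */
size_t all_matches(const char *subject, size_t *lengths, size_t max)
{
    size_t count = 0;

    assert(subject != NULL && lengths != NULL);
    if (subject[0] != '<')
        return 0;                       /* the only path dies at once */

    /* "." does not match a newline, so the scan stops there. */
    for (size_t i = 1; subject[i] != '\0' && subject[i] != '\n'; i++) {
        /* Two paths are live at each character: ".*" consumes it, and,
           if it is '>', the pattern can also terminate here. */
        if (subject[i] == '>' && count < max)
            lengths[count++] = i + 1;
    }

    /* Matches were found shortest first; reverse into decreasing order. */
    for (size_t lo = 0, hi = count; lo + 1 < hi; lo++, hi--) {
        size_t tmp = lengths[lo];
        lengths[lo] = lengths[hi - 1];
        lengths[hi - 1] = tmp;
    }
    return count;
}
```

       For the subject "<something> <something else> <something further>"
       this stores the three lengths 48, 28, and 11.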
5949
5950       Note  that the size of vector needed to contain all the results depends
5951       on the number of simultaneous matches, not on the number of parentheses
5952       in the pattern. Using pcre2_match_data_create_from_pattern() to  create
5953       the  match  data block is therefore not advisable when doing DFA match-
5954       ing.
5955
5956       Note also that all the matches that are found start at the  same  point
5957       in the subject. If the pattern
5958
5959         cat(er(pillar)?)?
5960
5961       is  matched  against the string "the caterpillar catchment", the result
5962       is the three strings "caterpillar", "cater", and "cat"  that  start  at
5963       the  fifth  character  of the subject. The algorithm does not automati-
5964       cally move on to find matches that start at later positions.
5965
5966       PCRE2's "auto-possessification" optimization usually applies to charac-
5967       ter repeats at the end of a pattern (as well as internally). For  exam-
5968       ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
5969       is  no  point even considering the possibility of backtracking into the
5970       repeated digits. For DFA matching, this means that  only  one  possible
5971       match  is  found. If you really do want multiple matches in such cases,
5972       either use an ungreedy repeat ("a\d+?") or set  the  PCRE2_NO_AUTO_POS-
5973       SESS option when compiling.
5974
5975       There  are  a  number of features of PCRE2 regular expressions that are
5976       not supported or behave differently in the alternative  matching  func-
5977       tion. Those that are not supported cause an error if encountered.
5978
5979       1.  Because the algorithm finds all possible matches, the greedy or un-
5980       greedy nature of repetition quantifiers is not relevant (though it  may
5981       affect  auto-possessification,  as  just  described).  During matching,
5982       greedy and ungreedy quantifiers are treated in exactly  the  same  way.
5983       However, possessive quantifiers can make a difference when what follows
5984       could  also  match  what  is  quantified, for example in a pattern like
5985       this:
5986
5987         ^a++\w!
5988
5989       This pattern matches "aaab!" but not "aaa!", which would be matched  by
5990       a  non-possessive quantifier. Similarly, if an atomic group is present,
5991       it is matched as if it were a standalone pattern at the current  point,
5992       and  the  longest match is then "locked in" for the rest of the overall
5993       pattern.
5994
5995       2. When dealing with multiple paths through the tree simultaneously, it
5996       is not straightforward to keep track of  captured  substrings  for  the
5997       different  matching  possibilities,  and PCRE2's implementation of this
5998       algorithm does not attempt to do this. This means that no captured sub-
5999       strings are available.
6000
6001       3. Because no substrings are captured, backreferences within  the  pat-
6002       tern are not supported.
6003
6004       4.  For  the same reason, conditional expressions that use a backrefer-
6005       ence as the condition or test for a specific group  recursion  are  not
6006       supported.
6007
6008       5. Again for the same reason, script runs are not supported.
6009
6010       6. Because many paths through the tree may be active, the \K escape se-
6011       quence,  which  resets the start of the match when encountered (but may
6012       be on some paths and not on others), is not supported.
6013
6014       7. Callouts are supported, but the value of the  capture_top  field  is
6015       always 1, and the value of the capture_last field is always 0.
6016
6017       8.  The  \C  escape  sequence, which (in the standard algorithm) always
6018       matches a single code unit, even in a UTF mode,  is  not  supported  in
6019       these  modes,  because the alternative algorithm moves through the sub-
6020       ject string one character (not code unit) at a  time,  for  all  active
6021       paths through the tree.
6022
6023       9.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
6024       are not supported. (*FAIL) is supported, and  behaves  like  a  failing
6025       negative assertion.
6026
6027       10.  The PCRE2_MATCH_INVALID_UTF option for pcre2_compile() is not sup-
6028       ported by pcre2_dfa_match().
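
       The possessive quantifier example in item 1 above can be made concrete
       with two hand-coded checks. This is only a sketch: both functions are
       hypothetical, each is written for one fixed pattern, and \w is taken
       to mean an ASCII letter, digit, or underscore.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static int is_word(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* ^a++\w! : the possessive repeat never gives back an 'a'. */
int match_possessive(const char *s)
{
    size_t i = 0;

    assert(s != NULL);
    if (s[0] != 'a')
        return 0;                       /* a++ needs at least one 'a' */
    while (s[i] == 'a')
        i++;                            /* take every 'a' and keep them */
    return is_word((unsigned char)s[i]) && s[i + 1] == '!';
}

/* ^a+\w! : the ordinary repeat may give back characters one at a time. */
int match_backtracking(const char *s)
{
    size_t max = 0;

    assert(s != NULL);
    while (s[max] == 'a')
        max++;
    for (size_t n = max; n >= 1; n--) {
        /* Try the rest of the pattern after n repetitions of 'a'. */
        if (is_word((unsigned char)s[n]) && s[n + 1] == '!')
            return 1;
    }
    return 0;
}
```

       Both functions accept "aaab!", but only match_backtracking() accepts
       "aaa!", because it lets \w match the final "a".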
6029
6030
6031ADVANTAGES OF THE ALTERNATIVE ALGORITHM
6032
6033       The main advantage of the alternative algorithm is  that  all  possible
6034       matches (at a single point in the subject) are automatically found, and
6035       in  particular, the longest match is found. To find more than one match
6036       at the same point using the standard algorithm, you have to  do  kludgy
6037       things with callouts.
6038
6039       Partial  matching  is  possible with this algorithm, though it has some
6040       limitations. The pcre2partial documentation gives  details  of  partial
6041       matching and discusses multi-segment matching.
6042
6043
6044DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
6045
6046       The alternative algorithm suffers from a number of disadvantages:
6047
6048       1.  It  is  substantially  slower  than the standard algorithm. This is
6049       partly because it has to search for all possible matches, but  is  also
6050       because it is less susceptible to optimization.
6051
6052       2.  Capturing  parentheses,  backreferences,  script runs, and matching
6053       within invalid UTF strings are not supported.
6054
6055       3. Although atomic groups are supported, their use does not provide the
6056       performance advantage that it does for the standard algorithm.
6057
6058       4. JIT optimization is not supported.
6059
6060
6061AUTHOR
6062
6063       Philip Hazel
6064       Retired from University Computing Service
6065       Cambridge, England.
6066
6067
6068REVISION
6069
6070       Last updated: 19 January 2024
6071       Copyright (c) 1997-2024 University of Cambridge.
6072
6073
6074PCRE2 10.43                     19 January 2024               PCRE2MATCHING(3)
6075------------------------------------------------------------------------------
6076
6077
6078
6079PCRE2PARTIAL(3)            Library Functions Manual            PCRE2PARTIAL(3)
6080
6081
6082NAME
6083       PCRE2 - Perl-compatible regular expressions
6084
6085
6086PARTIAL MATCHING IN PCRE2
6087
6088       In  normal use of PCRE2, if there is a match up to the end of a subject
6089       string, but more characters are needed to  match  the  entire  pattern,
6090       PCRE2_ERROR_NOMATCH  is  returned,  just  like any other failing match.
6091       There are circumstances where it might be helpful to  distinguish  this
6092       "partial match" case.
6093
6094       One  example  is  an application where the subject string is very long,
6095       and not all available at once. The requirement here is to be able to do
6096       the matching segment by segment, but special action is  needed  when  a
6097       matched substring spans the boundary between two segments.
6098
6099       Another  example is checking a user input string as it is typed, to en-
6100       sure that it conforms to a required format. Invalid characters  can  be
6101       immediately diagnosed and rejected, giving instant feedback.
6102
6103       Partial  matching  is a PCRE2-specific feature; it is not Perl-compati-
6104       ble. It is requested  by  setting  one  of  the  PCRE2_PARTIAL_HARD  or
6105       PCRE2_PARTIAL_SOFT  options  when calling a matching function. The dif-
6106       ference between the two options is whether or not a  partial  match  is
6107       preferred  to  an alternative complete match, though the details differ
6108       between the two types of matching function. If both  options  are  set,
6109       PCRE2_PARTIAL_HARD takes precedence.
6110
6111       If  you  want to use partial matching with just-in-time optimized code,
6112       as well as setting a partial match option for  the  matching  function,
6113       you  must  also  call pcre2_jit_compile() with one or both of these op-
6114       tions:
6115
6116         PCRE2_JIT_PARTIAL_HARD
6117         PCRE2_JIT_PARTIAL_SOFT
6118
6119       PCRE2_JIT_COMPLETE should also be set if you are going to run  non-par-
6120       tial  matches  on  the same pattern. Separate code is compiled for each
6121       mode. If the appropriate JIT mode has not been  compiled,  interpretive
6122       matching code is used.
6123
6124       Setting  a partial matching option disables two of PCRE2's standard op-
6125       timization hints. PCRE2 remembers the last literal code unit in a  pat-
6126       tern,  and  abandons  matching  immediately if it is not present in the
6127       subject string.  This optimization cannot be used for a subject  string
6128       that  might match only partially. PCRE2 also remembers a minimum length
6129       of a matching string, and does not bother to run the matching  function
6130       on  shorter  strings.  This  optimization  is also disabled for partial
6131       matching.
6132
6133
6134REQUIREMENTS FOR A PARTIAL MATCH
6135
6136       A possible partial match occurs during matching when  the  end  of  the
6137       subject  string is reached successfully, but either more characters are
6138       needed to complete the match, or the addition of more characters  might
6139       change what is matched.
6140
6141       Example  1: if the pattern is /abc/ and the subject is "ab", more char-
6142       acters are definitely needed to complete a match.  In  this  case  both
6143       hard and soft matching options yield a partial match.
6144
6145       Example  2: if the pattern is /ab+/ and the subject is "ab", a complete
6146       match can be found, but the addition of more  characters  might  change
6147       what  is  matched. In this case, only PCRE2_PARTIAL_HARD returns a par-
6148       tial match; PCRE2_PARTIAL_SOFT returns the complete match.
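
       The two outcomes for /ab+/ can be modelled by a toy classifier. This is
       a sketch for that one pattern, tried only at the start of the subject;
       the function name and the result enum are assumptions and are not part
       of the PCRE2 API.

```c
#include <assert.h>
#include <stddef.h>

enum result { NOMATCH, PARTIAL, COMPLETE };

/* Classify subject against the fixed pattern ab+ at offset 0. A non-zero
   hard selects PCRE2_PARTIAL_HARD-like behaviour, zero selects
   PCRE2_PARTIAL_SOFT-like behaviour. */
enum result match_ab_plus(const char *subject, size_t len, int hard)
{
    size_t i = 1;

    assert(subject != NULL);
    if (len == 0 || subject[0] != 'a')
        return NOMATCH;
    if (i == len)
        return PARTIAL;         /* "a": a 'b' is definitely still needed */
    if (subject[i] != 'b')
        return NOMATCH;
    while (i < len && subject[i] == 'b')
        i++;
    if (i == len && hard)
        return PARTIAL;         /* more b's might change what is matched */
    return COMPLETE;
}
```

       With the subject "ab" the result is COMPLETE for the soft setting and
       PARTIAL for the hard one; with "a" it is PARTIAL for both.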
6149
6150       On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set,  if
6151       the next pattern item is \z, \Z, \b, \B, or $ there is always a partial
6152       match.   Otherwise, for both options, the next pattern item must be one
6153       that inspects a character, and at least one of the  following  must  be
6154       true:
6155
6156       (1)  At  least  one  character has already been inspected. An inspected
6157       character need not form part of the final  matched  string;  lookbehind
6158       assertions  and the \K escape sequence provide ways of inspecting char-
6159       acters before the start of a matched string.
6160
6161       (2) The pattern contains one or more lookbehind assertions. This condi-
6162       tion exists in case there is a lookbehind that inspects characters  be-
6163       fore the start of the match.
6164
6165       (3)  There  is a special case when the whole pattern can match an empty
6166       string.  When the starting point is at the  end  of  the  subject,  the
6167       empty  string  match is a possibility, and if PCRE2_PARTIAL_SOFT is set
6168       and neither of the above conditions is true, it is  returned.  However,
6169       because  adding  more  characters  might  result  in a non-empty match,
6170       PCRE2_PARTIAL_HARD returns a partial match, which in  this  case  means
6171       "there  is going to be a match at this point, but until some more char-
6172       acters are added, we do not know if it will be an empty string or some-
6173       thing longer".
6174
6175
6176PARTIAL MATCHING USING pcre2_match()
6177
6178       When  a  partial  matching  option  is  set,  the  result  of   calling
6179       pcre2_match() can be one of the following:
6180
6181       A successful match
6182         A complete match has been found, starting and ending within this sub-
6183         ject.
6184
6185       PCRE2_ERROR_NOMATCH
6186         No match can start anywhere in this subject.
6187
6188       PCRE2_ERROR_PARTIAL
6189         Adding  more  characters may result in a complete match that uses one
6190         or more characters from the end of this subject.
6191
6192       When a partial match is returned, the first two elements in the ovector
6193       point to the portion of the subject that was matched, but the values in
6194       the rest of the ovector are undefined. The appearance of \K in the pat-
6195       tern has no effect for a partial match. Consider this pattern:
6196
6197         /abc\K123/
6198
6199       If it is matched against "456abc123xyz" the result is a complete match,
6200       and the ovector defines the matched string as "123", because \K  resets
6201       the  "start  of  match" point. However, if a partial match is requested
6202       and the subject string is "456abc12", a partial match is found for  the
6203       string  "abc12",  because  all these characters are needed for a subse-
6204       quent re-match with additional characters.
6205
6206       If there is more than one partial match, the first one that  was  found
6207       provides the data that is returned. Consider this pattern:
6208
6209         /123\w+X|dogY/
6210
6211       If  this is matched against the subject string "abc123dog", both alter-
6212       natives fail to match, but the end of the  subject  is  reached  during
6213       matching,  so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3
6214       and 9, identifying "123dog" as the first partial match. (In this  exam-
6215       ple,  there are two partial matches, because "dog" on its own partially
6216       matches the second alternative.)
6217
6218   How a partial match is processed by pcre2_match()
6219
6220       What happens when a partial match is identified depends on which of the
6221       two partial matching options is set.
6222
6223       If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned  as  soon
6224       as  a partial match is found, without continuing to search for possible
6225       complete matches. This option is "hard" because it prefers  an  earlier
6226       partial match over a later complete match. For this reason, the assump-
6227       tion  is  made  that  the end of the supplied subject string is not the
6228       true end of the available data, which is why \z, \Z, \b, \B, and $  al-
6229       ways give a partial match.
6230
6231       If  PCRE2_PARTIAL_SOFT  is  set,  the  partial match is remembered, but
6232       matching continues as normal, and other alternatives in the pattern are
6233       tried. If no complete match can be found,  PCRE2_ERROR_PARTIAL  is  re-
6234       turned instead of PCRE2_ERROR_NOMATCH. This option is "soft" because it
6235       prefers a complete match over a partial match. All the various matching
6236       items  in a pattern behave as if the subject string is potentially com-
6237       plete; \z, \Z, and $ match at the end of the subject,  as  normal,  and
6238       for \b and \B the end of the subject is treated as a non-alphanumeric.
6239
6240       The  difference  between the two partial matching options can be illus-
6241       trated by a pattern such as:
6242
6243         /dog(sbody)?/
6244
6245       This matches either "dog" or "dogsbody", greedily (that is, it  prefers
6246       the  longer  string  if  possible). If it is matched against the string
6247       "dog" with PCRE2_PARTIAL_SOFT, it yields a complete  match  for  "dog".
6248       However,  if  PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
6249       TIAL. On the other hand, if the pattern is made ungreedy the result  is
6250       different:
6251
6252         /dog(sbody)??/
6253
6254       In  this  case  the  result  is always a complete match because that is
6255       found first, and matching never  continues  after  finding  a  complete
6256       match. It might be easier to follow this explanation by thinking of the
6257       two patterns like this:
6258
6259         /dog(sbody)?/    is the same as  /dogsbody|dog/
6260         /dog(sbody)??/   is the same as  /dog|dogsbody/
6261
6262       The  second pattern will never match "dogsbody", because it will always
6263       find the shorter match first.
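
       The same reasoning can be written as a small model in which a pattern
       is just an ordered list of literal alternatives. This is a sketch, not
       PCRE2 code; match_alternatives() and the result enum are hypothetical
       names. Alternatives are tried in order at the start of the subject.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum result { NOMATCH, PARTIAL, COMPLETE };

/* Try each literal alternative in order. A non-zero hard returns the first
   partial match at once; zero remembers it and keeps looking for a
   complete match. */
enum result match_alternatives(const char *const *alts, size_t nalts,
                               const char *subject, int hard)
{
    size_t len;
    int saw_partial = 0;

    assert(alts != NULL && subject != NULL);
    len = strlen(subject);
    for (size_t a = 0; a < nalts; a++) {
        size_t m = strlen(alts[a]);
        size_t n = len < m ? len : m;

        if (memcmp(subject, alts[a], n) != 0)
            continue;                   /* mismatch: try the next branch */
        if (n == m)
            return COMPLETE;            /* first complete match wins */
        if (n > 0) {                    /* ran off the end of the subject */
            if (hard)
                return PARTIAL;
            saw_partial = 1;
        }
    }
    return saw_partial ? PARTIAL : NOMATCH;
}
```

       With the alternatives in the order "dogsbody", "dog" and the subject
       "dog", the soft setting gives COMPLETE and the hard setting gives
       PARTIAL. With the order "dog", "dogsbody" both settings give COMPLETE.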
6264
6265   Example of partial matching using pcre2test
6266
6267       The pcre2test data modifiers partial_hard (or ph) and partial_soft  (or
6268       ps)  set  PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT, respectively, when
6269       calling pcre2_match(). Here is a run of pcre2test using a pattern  that
6270       matches the whole subject in the form of a date:
6271
6272           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
6273         data> 25dec3\=ph
6274         Partial match: 25dec3
6275         data> 3ju\=ph
6276         Partial match: 3ju
6277         data> 3juj\=ph
6278         No match
6279
6280       This  example  gives  the  same  results for both hard and soft partial
6281       matching options. Here is an example where there is a difference:
6282
6283           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
6284         data> 25jun04\=ps
6285          0: 25jun04
6286          1: jun
6287         data> 25jun04\=ph
6288         Partial match: 25jun04
6289
6290       With  PCRE2_PARTIAL_SOFT,  the  subject  is  matched  completely.   For
6291       PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete,
6292       so there is only a partial match.
6293
6294
6295MULTI-SEGMENT MATCHING WITH pcre2_match()
6296
6297       PCRE  was  not originally designed with multi-segment matching in mind.
6298       However, over time, features (including  partial  matching)  that  make
6299       multi-segment matching possible have been added. A very long string can
6300       be  searched  segment  by  segment by calling pcre2_match() repeatedly,
6301       with the aim of achieving the same results that would happen if the en-
6302       tire string was available for searching all  the  time.  Normally,  the
6303       strings  that  are  being  sought are much shorter than each individual
6304       segment, and are in the middle of very long strings, so the pattern  is
6305       normally not anchored.
6306
6307       Special  logic  must  be implemented to handle a matched substring that
6308       spans a segment boundary. PCRE2_PARTIAL_HARD should be used, because it
6309       returns a partial match at the end of a segment whenever there  is  the
6310       possibility  of  changing  the  match  by  adding  more characters. The
6311       PCRE2_NOTBOL option should also be set for all but the first segment.
6312
6313       When a partial match occurs, the next segment must be added to the cur-
6314       rent subject and the match re-run, using the  startoffset  argument  of
6315       pcre2_match()  to  begin  at the point where the partial match started.
6316       For example:
6317
6318           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
6319         data> ...the date is 23ja\=ph
6320         Partial match: 23ja
6321         data> ...the date is 23jan19 and on that day...\=offset=15
6322          0: 23jan19
6323          1: jan
6324
6325       Note the use of the offset modifier to start the new  match  where  the
6326       partial match was found. In this example, the next segment was added to
6327       the  one  in  which  the  partial  match  was  found.  This is the most
6328       straightforward approach, typically using a memory buffer that is twice
6329       the size of each segment. After a partial match, the first half of  the
6330       buffer  is  discarded,  the  second  half  is moved to the start of the
6331       buffer, and a new segment is added before repeating the match as in the
6332       example above. After a no match, the entire buffer can be discarded.
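
       The shape of such a loop can be sketched without PCRE2 itself. In the
       code below, literal_match() is a hypothetical stand-in for calling
       pcre2_match() with PCRE2_PARTIAL_HARD, and it handles literal patterns
       only. For brevity the sketch keeps just the text from the start of the
       partial match onwards instead of a full half-buffer, which is adequate
       here only because a literal inspects nothing before its own start.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum result { NOMATCH, PARTIAL, COMPLETE };

/* Find lit in buf[0..len). On COMPLETE or PARTIAL, *start is the offset at
   which the (partial) match begins. */
enum result literal_match(const char *lit, const char *buf, size_t len,
                          size_t *start)
{
    size_t m = strlen(lit);

    assert(m > 0);
    for (size_t i = 0; i < len; i++) {
        size_t avail = len - i;
        size_t n = avail < m ? avail : m;

        if (memcmp(buf + i, lit, n) == 0) {
            *start = i;
            return n == m ? COMPLETE : PARTIAL;  /* PARTIAL: hit the end */
        }
    }
    return NOMATCH;
}

/* Feed the segments one at a time and return 1 as soon as lit is found.
   cap, the size of buf, must be at least the longest segment plus
   strlen(lit). */
int stream_search(const char *lit, const char *const *segs, size_t nsegs,
                  char *buf, size_t cap)
{
    size_t used = 0;

    for (size_t s = 0; s < nsegs; s++) {
        size_t n = strlen(segs[s]);
        size_t start = 0;

        assert(used + n <= cap);
        memcpy(buf + used, segs[s], n);          /* add the next segment */
        used += n;

        switch (literal_match(lit, buf, used, &start)) {
        case COMPLETE:
            return 1;
        case PARTIAL:
            /* Keep the partial text; the re-run then begins at offset 0,
               which is where the partial match started. */
            memmove(buf, buf + start, used - start);
            used -= start;
            break;
        case NOMATCH:
            used = 0;                            /* discard everything */
            break;
        }
    }
    return 0;
}
```

       Given the two segments "...the date is 23ja" and "n19 and on that
       day...", a search for "23jan19" first sees a partial match at offset 15
       and then completes the match once the second segment has been added.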
6333
6334       If there are memory constraints, you may want to discard text that pre-
6335       cedes a partial match before adding the  next  segment.  Unfortunately,
6336       this  is  not  at  present straightforward. In cases such as the above,
6337       where the pattern does not contain any lookbehinds, it is sufficient to
6338       retain only the partially matched substring. However,  if  the  pattern
6339       contains  a  lookbehind assertion, characters that precede the start of
6340       the partial match may have been inspected during the matching  process.
6341       When  pcre2test displays a partial match, it indicates these characters
6342       with '<' if the allusedtext modifier is set:
6343
6344           re> "(?<=123)abc"
6345         data> xx123ab\=ph,allusedtext
6346         Partial match: 123ab
6347                        <<<
6348
6349       However, the allusedtext modifier is not available  for  JIT  matching,
6350       because  JIT  matching  does  not  record the first (or last) consulted
6351       characters.  For this reason, this information is not available via the
6352       API. It is therefore not possible in general to obtain the exact number
6353       of characters that must be retained in order to get the right match re-
6354       sult. If you cannot retain the  entire  segment,  you  must  find  some
6355       heuristic way of choosing.
6356
6357       If  you know the approximate length of the matching substrings, you can
6358       use that to decide how much text to retain. The only lookbehind  infor-
6359       mation  that  is  currently  available via the API is the length of the
6360       longest individual lookbehind in a pattern, but this can be  misleading
6361       if  there  are  nested  lookbehinds.  The  value  returned  by  calling
6362       pcre2_pattern_info() with the PCRE2_INFO_MAXLOOKBEHIND  option  is  the
6363       maximum number of characters (not code units) that any individual look-
6364       behind   moves   back   when   it  is  processed.  A  pattern  such  as
6365       "(?<=(?<!b)a)" has a maximum lookbehind value of one, but inspects  two
6366       characters before its starting point.
6367
6368       In  a  non-UTF or a 32-bit case, moving back is just a subtraction, but
6369       in UTF-8 or UTF-16 you have  to  count  characters  while  moving  back
6370       through the code units.
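
       For UTF-8, the counting can be done with a short helper such as the
       following sketch. The name utf8_step_back() is an assumption (it is not
       a PCRE2 function), and the subject is assumed to be valid UTF-8. The
       nchars argument might be, for example, the value obtained with
       PCRE2_INFO_MAXLOOKBEHIND.

```c
#include <assert.h>
#include <stddef.h>

/* Starting at byte offset `offset`, which must be on a character boundary,
   move back by up to nchars characters and return the new byte offset.
   The result is 0 if the start of the subject is reached first. */
size_t utf8_step_back(const unsigned char *subject, size_t offset,
                      size_t nchars)
{
    assert(subject != NULL);
    while (nchars > 0 && offset > 0) {
        offset--;
        /* Continuation bytes have the form 10xxxxxx; skip over them to
           reach the first byte of the character. */
        while (offset > 0 && (subject[offset] & 0xC0) == 0x80)
            offset--;
        nchars--;
    }
    return offset;
}
```

       In the four-byte subject "a", U+00E9 (two bytes), "b", moving back one
       character from offset 4 gives offset 3, and moving back two characters
       gives offset 1.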
6371
6372
6373PARTIAL MATCHING USING pcre2_dfa_match()
6374
6375       The DFA function moves along the subject string character by character,
6376       without  backtracking,  searching  for  all possible matches simultane-
6377       ously. If the end of the subject is reached before the end of the  pat-
6378       tern, there is the possibility of a partial match.
6379
6380       When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
6381       there  have  been  no complete matches. Otherwise, the complete matches
6382       are returned.  If PCRE2_PARTIAL_HARD is  set,  a  partial  match  takes
6383       precedence  over  any  complete matches. The portion of the string that
6384       was matched when the longest partial match was  found  is  set  as  the
6385       first matching string.
6386
6387       Because  the DFA function always searches for all possible matches, and
6388       there is no difference between greedy and ungreedy repetition, its  be-
6389       haviour is different from that of pcre2_match(). Consider the string
6390       "dog" matched against this ungreedy pattern:
6391
6392         /dog(sbody)??/
6393
6394       Whereas the standard function stops as soon as it  finds  the  complete
6395       match  for  "dog",  the  DFA  function also finds the partial match for
6396       "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.
6397
6398
6399MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
6400
6401       When a partial match has been found using the DFA matching function, it
6402       is possible to continue the match by providing additional subject  data
6403       and  calling  the function again with the same compiled regular expres-
6404       sion, this time setting the PCRE2_DFA_RESTART option. You must pass the
6405       same working space as before, because this is where details of the pre-
6406       vious partial match are stored. You can set the  PCRE2_PARTIAL_SOFT  or
6407       PCRE2_PARTIAL_HARD  options  with PCRE2_DFA_RESTART to continue partial
6408       matching over multiple segments. Here is an example using pcre2test:
6409
6410           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
6411         data> 23ja\=dfa,ps
6412         Partial match: 23ja
6413         data> n05\=dfa,dfa_restart
6414          0: n05
6415
6416       The first call has "23ja" as the subject, and requests  partial  match-
6417       ing;  the  second  call  has  "n05"  as  the  subject for the continued
6418       (restarted) match.  Notice that when the match is  complete,  only  the
6419       last  part  is  shown;  PCRE2 does not retain the previously partially-
6420       matched string. It is up to the calling program to do that if it  needs
6421       to.  This  means  that, for an unanchored pattern, if a continued match
6422       fails, it is not possible to try again at a  new  starting  point.  All
6423       this facility is capable of doing is continuing with the previous match
6424       attempt. For example, consider this pattern:
6425
6426         1234|3789
6427
6428       If  the  first  part of the subject is "ABC123", a partial match of the
6429       first alternative is found at offset 3. There is no partial  match  for
6430       the second alternative, because such a match does not start at the same
6431       point  in  the  subject  string. Attempting to continue with the string
6432       "7890" does not yield a match  because  only  those  alternatives  that
6433       match  at one point in the subject are remembered. Depending on the ap-
6434       plication, this may or may not be what you want.
6435
6436       If you do want to allow for starting again at the next  character,  one
6437       way  of  doing it is to retain some or all of the segment and try a new
6438       complete match, as described for pcre2_match() above. Another possibil-
6439       ity is to work with two buffers. If a partial match at offset n in  the
6440       first  buffer  is followed by "no match" when PCRE2_DFA_RESTART is used
6441       on the second buffer, you can then try a new match starting  at  offset
6442       n+1 in the first buffer.
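
       The two-buffer approach can be modelled with a toy that is hard-wired
       to the pattern 1234|3789. It is only a sketch: struct progress plays
       the role of the DFA workspace, advance() plays the role of a call with
       PCRE2_DFA_RESTART, and all of the names are assumptions rather than
       parts of the PCRE2 API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum result { NOMATCH, PARTIAL, COMPLETE };

#define NALT 2
static const char *const alts[NALT] = { "1234", "3789" };

/* For each alternative, the number of characters matched so far from one
   starting point, or -1 if that alternative has failed. */
struct progress { long matched[NALT]; };

/* Continue every live alternative over seg[0..len). */
enum result advance(struct progress *p, const char *seg, size_t len)
{
    int live = 0;

    for (int a = 0; a < NALT; a++) {
        size_t m = strlen(alts[a]);
        size_t done, i = 0;

        if (p->matched[a] < 0)
            continue;
        done = (size_t)p->matched[a];
        while (i < len && done < m && seg[i] == alts[a][done]) {
            i++;
            done++;
        }
        if (done == m)
            return COMPLETE;
        if (i == len) {                 /* ran out of subject: still live */
            p->matched[a] = (long)done;
            live = 1;
        } else {
            p->matched[a] = -1;         /* mismatch */
        }
    }
    return live ? PARTIAL : NOMATCH;
}

/* Try each starting offset from `from`; stop at the first one that gives a
   complete or partial match and store it in *start. */
enum result first_match(struct progress *p, const char *buf, size_t len,
                        size_t from, size_t *start)
{
    for (size_t n = from; n < len; n++) {
        enum result r;

        for (int a = 0; a < NALT; a++)
            p->matched[a] = 0;
        r = advance(p, buf + n, len - n);
        if (r != NOMATCH) {
            *start = n;
            return r;
        }
    }
    return NOMATCH;
}

/* Return 1 if the pattern matches within first followed by second. */
int search_two_buffers(const char *first, const char *second)
{
    struct progress p;
    size_t len1, len2, from = 0, n = 0;

    assert(first != NULL && second != NULL);
    len1 = strlen(first);
    len2 = strlen(second);
    for (;;) {
        enum result r = first_match(&p, first, len1, from, &n);

        if (r == COMPLETE)
            return 1;
        if (r == NOMATCH)
            break;
        if (advance(&p, second, len2) == COMPLETE)  /* the restart */
            return 1;
        from = n + 1;           /* restart failed: retry at offset n+1 */
    }
    return first_match(&p, second, len2, 0, &n) == COMPLETE;
}
```

       With "ABC123" followed by "7890", the partial match at offset 3 cannot
       be continued, but the retry finds a new partial match at offset 5 that
       the second buffer completes as "3789".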
6443
6444
6445AUTHOR
6446
6447       Philip Hazel
6448       Retired from University Computing Service
6449       Cambridge, England.
6450
6451
6452REVISION
6453
6454       Last updated: 04 September 2019
6455       Copyright (c) 1997-2019 University of Cambridge.
6456
6457
6458PCRE2 10.34                    04 September 2019               PCRE2PARTIAL(3)
6459------------------------------------------------------------------------------
6460
6461
6462
6463PCRE2PATTERN(3)            Library Functions Manual            PCRE2PATTERN(3)
6464
6465
6466NAME
6467       PCRE2 - Perl-compatible regular expressions (revised API)
6468
6469
6470PCRE2 REGULAR EXPRESSION DETAILS
6471
6472       The  syntax and semantics of the regular expressions that are supported
6473       by PCRE2 are described in detail below. There is a quick-reference syn-
6474       tax summary in the pcre2syntax page. PCRE2 tries to match  Perl  syntax
6475       and  semantics as closely as it can.  PCRE2 also supports some alterna-
6476       tive regular expression syntax (which does not conflict with  the  Perl
6477       syntax) in order to provide some compatibility with regular expressions
6478       in Python, .NET, and Oniguruma.
6479
6480       Perl's  regular expressions are described in its own documentation, and
6481       regular expressions in general are covered in a number of  books,  some
6482       of which have copious examples. Jeffrey Friedl's "Mastering Regular Ex-
6483       pressions",  published by O'Reilly, covers regular expressions in great
6484       detail. This description of PCRE2's regular expressions is intended  as
6485       reference material.
6486
6487       This  document  discusses the regular expression patterns that are sup-
6488       ported by PCRE2 when its  main  matching  function,  pcre2_match(),  is
6489       used.    PCRE2    also    has   an   alternative   matching   function,
6490       pcre2_dfa_match(), which matches using a different  algorithm  that  is
6491       not  Perl-compatible.  Some  of  the  features  discussed below are not
6492       available when DFA matching is used. The advantages  and  disadvantages
6493       of  the  alternative function, and how it differs from the normal func-
6494       tion, are discussed in the pcre2matching page.
6495
6496
6497SPECIAL START-OF-PATTERN ITEMS
6498
6499       A number of options that can be passed to pcre2_compile() can  also  be
6500       set by special items at the start of a pattern. These are not Perl-com-
6501       patible,  but  are provided to make these options accessible to pattern
6502       writers who are not able to change the program that processes the  pat-
6503       tern.  Any  number  of these items may appear, but they must all be to-
6504       gether right at the start of the pattern string, and the  letters  must
6505       be in upper case.
6506
6507   UTF support
6508
6509       In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
6510       as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
6511       can  be  specified  for the 32-bit library, in which case it constrains
6512       the character values to valid  Unicode  code  points.  To  process  UTF
6513       strings,  PCRE2  must be built to include Unicode support (which is the
6514       default). When using UTF strings you must  either  call  the  compiling
6515       function  with  one or both of the PCRE2_UTF or PCRE2_MATCH_INVALID_UTF
6516       options, or the pattern must start with the  special  sequence  (*UTF),
6517       which  is  equivalent  to  setting  the PCRE2_UTF option. How setting a
6518       UTF mode affects pattern matching is mentioned in several places below.
6519       There is also a summary of features in the pcre2unicode page.
6520
6521       Some applications that allow their users to supply patterns may wish to
6522       restrict  them  to  non-UTF  data  for   security   reasons.   If   the
6523       PCRE2_NEVER_UTF  option is passed to pcre2_compile(), (*UTF) is not al-
6524       lowed, and its appearance in a pattern causes an error.
6525
6526   Unicode property support
6527
6528       Another special sequence that may appear at the start of a  pattern  is
6529       (*UCP).   This  has the same effect as setting the PCRE2_UCP option: it
6530       causes sequences such as \d and \w to use Unicode properties to  deter-
6531       mine character types, instead of recognizing only characters with codes
6532       less than 256 via a lookup table. It also causes upper/lower casing op-
6533       erations  to  use  Unicode  properties  for characters with code points
6534       greater than 127, even when UTF is not set.  These  behaviours  can  be
6535       changed  within  the pattern; see the section entitled "Internal Option
6536       Setting" below.
6537
6538       Some applications that allow their users to supply patterns may wish to
6539       restrict them for security reasons. If the  PCRE2_NEVER_UCP  option  is
6540       passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in
6541       a pattern causes an error.
6542
6543   Locking out empty string matching
6544
6545       Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same
6546       effect  as  passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option
6547       to whichever matching function is subsequently called to match the pat-
6548       tern. These options lock out the matching of empty strings, either  en-
6549       tirely, or only at the start of the subject.
6550
6551   Disabling auto-possessification
6552
6553       If  a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as
6554       setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from  making
6555       quantifiers  possessive  when  what  follows  cannot match the repeated
6556       item. For example, by default a+b is treated as a++b. For more details,
6557       see the pcre2api documentation.
6558
6559   Disabling start-up optimizations
6560
6561       If a pattern starts with (*NO_START_OPT), it has  the  same  effect  as
6562       setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti-
6563       mizations  for  quickly  reaching "no match" results. For more details,
6564       see the pcre2api documentation.
6565
6566   Disabling automatic anchoring
6567
6568       If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the  same  effect
6569       as  setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza-
6570       tions that apply to patterns whose top-level branches all start with .*
6571       (match any number of arbitrary characters). For more details,  see  the
6572       pcre2api documentation.
6573
6574   Disabling JIT compilation
6575
6576       If  a  pattern  that starts with (*NO_JIT) is successfully compiled, an
6577       attempt by the application to apply the  JIT  optimization  by  calling
6578       pcre2_jit_compile() is ignored.
6579
6580   Setting match resource limits
6581
6582       The pcre2_match() function contains a counter that is incremented every
6583       time it goes round its main loop. The caller of pcre2_match() can set a
6584       limit  on  this counter, which therefore limits the amount of computing
6585       resource used for a match. The maximum depth of nested backtracking can
6586       also be limited; this indirectly restricts the amount  of  heap  memory
6587       that  is  used,  but there is also an explicit memory limit that can be
6588       set.
6589
6590       These facilities are provided to catch runaway matches  that  are  pro-
6591       voked  by patterns with huge matching trees. A common example is a pat-
6592       tern with nested unlimited repeats applied to a long string  that  does
6593       not  match. When one of these limits is reached, pcre2_match() gives an
6594       error return. The limits can also be set by items at the start  of  the
6595       pattern of the form
6596
6597         (*LIMIT_HEAP=d)
6598         (*LIMIT_MATCH=d)
6599         (*LIMIT_DEPTH=d)
6600
6601       where d is any number of decimal digits. However, the value of the set-
6602       ting  must  be  less than the value set (or defaulted) by the caller of
6603       pcre2_match() for it to have any effect. In other  words,  the  pattern
6604       writer  can lower the limits set by the programmer, but not raise them.
6605       If there is more than one setting of one of  these  limits,  the  lower
6606       value  is used. The heap limit is specified in kibibytes (units of 1024
6607       bytes).
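
       The rule that a pattern can lower a limit but never raise it can be
       modelled in a few lines. This is only an illustrative sketch, not
       PCRE2 code: the function name and its arguments are hypothetical, and
       it assumes the caller's limit and the pattern's settings are already
       available as integers.

```python
def effective_limit(caller_limit, pattern_settings):
    """Model how (*LIMIT_xxx=d) items combine with the caller's limit.

    caller_limit: value set (or defaulted) by the caller of pcre2_match().
    pattern_settings: values of every item for the same limit, in the
    order in which they appear at the start of the pattern.
    """
    # A pattern item takes effect only if it is lower than the caller's
    # value, and when there are several settings the lowest one is used.
    return min([caller_limit, *pattern_settings])


print(effective_limit(10000, [2000, 500]))  # lowered by the pattern
print(effective_limit(10000, [50000]))      # attempt to raise is ignored
```

       With a caller limit of 10000, the settings 2000 and 500 give an
       effective limit of 500, whereas a setting of 50000 leaves the limit
       at 10000.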
6608
6609       Prior to release 10.30, LIMIT_DEPTH was  called  LIMIT_RECURSION.  This
6610       name is still recognized for backwards compatibility.
6611
6612       The heap limit applies only when the pcre2_match() or pcre2_dfa_match()
6613       interpreters are used for matching. It does not apply to JIT. The match
6614       limit  is used (but in a different way) when JIT is being used, or when
6615       pcre2_dfa_match() is called, to limit computing resource usage by those
6616       matching functions. The depth limit is ignored by JIT but  is  relevant
6617       for  DFA  matching, which uses function recursion for recursions within
6618       the pattern and for lookaround assertions and atomic  groups.  In  this
6619       case, the depth limit controls the depth of such recursion.
6620
6621   Newline conventions
6622
6623       PCRE2  supports six different conventions for indicating line breaks in
6624       strings: a single CR (carriage return) character, a  single  LF  (line-
6625       feed) character, the two-character sequence CRLF, any of the three pre-
6626       ceding,  any  Unicode  newline  sequence,  or the NUL character (binary
6627       zero). The pcre2api page has further  discussion  about  newlines,  and
6628       shows how to set the newline convention when calling pcre2_compile().
6629
6630       It  is also possible to specify a newline convention by starting a pat-
6631       tern string with one of the following sequences:
6632
6633         (*CR)        carriage return
6634         (*LF)        linefeed
6635         (*CRLF)      carriage return, followed by linefeed
6636         (*ANYCRLF)   any of the three above
6637         (*ANY)       all Unicode newline sequences
6638         (*NUL)       the NUL character (binary zero)
6639
6640       These override the default and the options given to the compiling func-
6641       tion. For example, on a Unix system where LF is the default newline se-
6642       quence, the pattern
6643
6644         (*CR)a.b
6645
6646       changes the convention to CR. That pattern matches "a\nb" because LF is
6647       no longer a newline. If more than one of these settings is present, the
6648       last one is used.
6649
6650       The newline convention affects where the circumflex and  dollar  asser-
6651       tions are true. It also affects the interpretation of the dot metachar-
6652       acter  when  PCRE2_DOTALL  is not set, and the behaviour of \N when not
6653       followed by an opening brace. However, it does not affect what  the  \R
6654       escape  sequence  matches.  By default, this is any Unicode newline se-
6655       quence, for Perl compatibility. However, this can be changed;  see  the
6656       next section and the description of \R in the section entitled "Newline
6657       sequences"  below. A change of \R setting can be combined with a change
6658       of newline convention.
6659
6660   Specifying what \R matches
6661
6662       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
6663       the complete set  of  Unicode  line  endings)  by  setting  the  option
6664       PCRE2_BSR_ANYCRLF  at compile time. This effect can also be achieved by
6665       starting a pattern with (*BSR_ANYCRLF).  For  completeness,  (*BSR_UNI-
6666       CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.
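
       As a rough illustration of the items described in this section, the
       following Python sketch separates the leading items from the rest of
       a pattern string. It is not how PCRE2 parses patterns: the function
       names are hypothetical, the item names are simply those listed above,
       and a malformed item merely stops the scan instead of being diagnosed.

```python
import re

# Item names taken from this section. LIMIT_RECURSION is the older
# synonym for LIMIT_DEPTH.
FLAG_ITEMS = {
    "UTF", "UCP", "NOTEMPTY", "NOTEMPTY_ATSTART", "NO_AUTO_POSSESS",
    "NO_START_OPT", "NO_DOTSTAR_ANCHOR", "NO_JIT",
    "CR", "LF", "CRLF", "ANYCRLF", "ANY", "NUL",
    "BSR_ANYCRLF", "BSR_UNICODE",
}
LIMIT_ITEMS = {"LIMIT_HEAP", "LIMIT_MATCH", "LIMIT_DEPTH", "LIMIT_RECURSION"}
NEWLINE_ITEMS = {"CR", "LF", "CRLF", "ANYCRLF", "ANY", "NUL"}

# Upper case letters only, with an optional =digits part.
ITEM = re.compile(r"\(\*([A-Z_]+)(?:=([0-9]+))?\)")


def split_leading_items(pattern):
    """Return (items, rest): leading (*...) items and the remaining pattern."""
    items = []
    pos = 0
    while True:
        m = ITEM.match(pattern, pos)
        if m is None:
            break
        name, value = m.group(1), m.group(2)
        if name in FLAG_ITEMS and value is None:
            items.append((name, None))
        elif name in LIMIT_ITEMS and value is not None:
            items.append((name, int(value)))
        else:
            break  # not one of the items above; leave it in the pattern
        pos = m.end()
    return items, pattern[pos:]


def newline_setting(items, default):
    """If several newline items are present, the last one is used."""
    chosen = default
    for name, _ in items:
        if name in NEWLINE_ITEMS:
            chosen = name
    return chosen
```

       For the string "(*UTF)(*LIMIT_MATCH=1000)(*CR)a.b" the sketch reports
       three items and leaves "a.b" as the pattern proper. An item written
       in lower case, or one that does not appear right at the start, is
       left untouched.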
6667
6668
6669EBCDIC CHARACTER CODES
6670
6671       PCRE2  can be compiled to run in an environment that uses EBCDIC as its
6672       character code instead of ASCII or Unicode (typically a mainframe  sys-
6673       tem).  In  the  sections below, character code values are ASCII or Uni-
6674       code; in an EBCDIC environment these characters may have different code
6675       values, and there are no code points greater than 255.
6676
6677
6678CHARACTERS AND METACHARACTERS
6679
6680       A regular expression is a pattern that is  matched  against  a  subject
6681       string  from  left  to right. Most characters stand for themselves in a
6682       pattern, and match the corresponding characters in the  subject.  As  a
6683       trivial example, the pattern
6684
6685         The quick brown fox
6686
6687       matches a portion of a subject string that is identical to itself. When
6688       caseless  matching  is  specified  (the  PCRE2_CASELESS  option or (?i)
6689       within the pattern), letters are matched independently  of  case.  Note
6690       that  there  are  two  ASCII  characters, K and S, that, in addition to
6691       their lower case ASCII equivalents, are  case-equivalent  with  Unicode
6692       U+212A  (Kelvin  sign)  and  U+017F  (long  S) respectively when either
6693       PCRE2_UTF or PCRE2_UCP is set, unless the PCRE2_EXTRA_CASELESS_RESTRICT
6694       option is in force (either passed to pcre2_compile()  or  set  by  (?r)
6695       within the pattern).
6696
6697       The power of regular expressions comes from the ability to include wild
6698       cards, character classes, alternatives, and repetitions in the pattern.
6699       These are encoded in the pattern by the use of metacharacters, which do
6700       not  stand  for  themselves but instead are interpreted in some special
6701       way.
6702
6703       There are two different sets of metacharacters: those that  are  recog-
6704       nized  anywhere in the pattern except within square brackets, and those
6705       that are recognized within square brackets.  Outside  square  brackets,
6706       the metacharacters are as follows:
6707
6708         \      general escape character with several uses
6709         ^      assert start of string (or line, in multiline mode)
6710         $      assert end of string (or line, in multiline mode)
6711         .      match any character except newline (by default)
6712         [      start character class definition
6713         |      start of alternative branch
6714         (      start group or control verb
6715         )      end group or control verb
6716         *      0 or more quantifier
6717         +      1 or more quantifier; also "possessive quantifier"
6718         ?      0 or 1 quantifier; also quantifier minimizer
6719         {      potential start of min/max quantifier
6720
6721       Brace  characters  {  and } are also used to enclose data for construc-
6722       tions such as \g{2} or \k{name}. In almost all uses  of  braces,  space
6723       and/or horizontal tab characters that follow { or precede } are allowed
6724       and  are  ignored. In the case of quantifiers, they may also appear be-
6725       fore or after the comma. The exception to this is \u{...} which  is  an
6726       ECMAScript  compatibility  feature  that  is  recognized  only when the
6727       PCRE2_EXTRA_ALT_BSUX option is set. ECMAScript  does  not  ignore  such
6728       white space; it causes the item to be interpreted as literal.
6729
6730       Part  of  a  pattern  that is in square brackets is called a "character
6731       class". In a character class the only metacharacters are:
6732
6733         \      general escape character
6734         ^      negate the class, but only if the first character
6735         -      indicates character range
6736         [      POSIX character class (if followed by POSIX syntax)
6737         ]      terminates the character class
6738
6739       If a pattern is compiled with the  PCRE2_EXTENDED  option,  most  white
6740       space in the pattern is ignored, except in a character class or  within
6741       a  \Q...\E  sequence.  Characters between a # outside a character class
6742       and  the  next  newline,  inclusive, are ignored. An escaping backslash
6743       can  be  used to include a white space or a # in the  pattern.  If  the
6744       PCRE2_EXTENDED_MORE option is set, the same applies,  but  in  addition
6745       unescaped  space  and  horizontal  tab  characters are ignored inside a
6746       character class. Note: only these two characters are ignored,  not  the
6747       full  set  of pattern white space characters that are ignored outside a
6748       character class. Option settings can be changed within a  pattern;  see
6749       the section entitled "Internal Option Setting" below.
6750
6751       The following sections describe the use of each of the metacharacters.
6752
6753
6754BACKSLASH
6755
6756       The backslash character has several uses. Firstly, if it is followed by
6757       a  character that is not a digit or a letter, it takes away any special
6758       meaning that character may have. This use of  backslash  as  an  escape
6759       character applies both inside and outside character classes.
6760
6761       For  example,  if you want to match a * character, you must write \* in
6762       the pattern. This escaping action applies whether or not the  following
6763       character  would  otherwise be interpreted as a metacharacter, so it is
6764       always safe to precede a non-alphanumeric  with  backslash  to  specify
6765       that it stands for itself.  In particular, if you want to match a back-
6766       slash, you write \\.
6767
6768       Only  ASCII  digits  and letters have any special meaning after a back-
6769       slash. All other characters (in particular, those whose code points are
6770       greater than 127) are treated as literals.
6771
6772       If you want to treat all characters in a sequence as literals, you  can
6773       do  so by putting them between \Q and \E. Note that this includes white
6774       space even when the PCRE2_EXTENDED option is set  so  that  most  other
6775       white  space is ignored. The behaviour is different from Perl in that $
6776       and @ are handled as literals in \Q...\E sequences in PCRE2, whereas in
6777       Perl, $ and @ cause variable interpolation. Also,  Perl  does  "double-
6778       quotish  backslash  interpolation" on any backslashes between \Q and \E
6779       which, its documentation says, "may lead to confusing  results".  PCRE2
6780       treats  a  backslash  between  \Q and \E just like any other character.
6781       Note the following examples:
6782
6783         Pattern            PCRE2 matches  Perl matches
6784
6785         \Qabc$xyz\E        abc$xyz        abc followed by the
6786                                             contents of $xyz
6787         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
6788         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
6789         \QA\B\E            A\B            A\B
6790         \Q\\E              \              \\E
6791
6792       The \Q...\E sequence is recognized both inside  and  outside  character
6793       classes.   An  isolated \E that is not preceded by \Q is ignored. If \Q
6794       is not followed by \E later in the pattern, the literal  interpretation
6795       continues  to  the  end  of  the pattern (that is, \E is assumed at the
6796       end). If the isolated \Q is inside a character class,  this  causes  an
6797       error,  because the character class is then not terminated by a closing
6798       square bracket.
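
       The PCRE2 column of the table above can be reproduced with a small
       Python sketch that rewrites each \Q...\E span as individually escaped
       characters. The function name is hypothetical, and the sketch makes
       no attempt to deal with character classes; it only models the rules
       stated in this section.

```python
def expand_quotes(pattern):
    # Rewrite \Q...\E spans so that every quoted character stands for itself.
    out = []
    i = 0
    quoting = False
    while i < len(pattern):
        two = pattern[i:i + 2]
        if quoting:
            if two == "\\E":
                quoting = False
                i += 2
                continue
            ch = pattern[i]
            # A backslash before a non-alphanumeric always makes it literal.
            out.append(ch if ch.isalnum() else "\\" + ch)
            i += 1
        elif two == "\\Q":
            quoting = True
            i += 2
        elif two == "\\E":
            i += 2              # an isolated \E is ignored
        elif pattern[i] == "\\":
            out.append(two)     # ordinary escape: keep backslash and next char
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    # If \Q was never closed, quoting simply ran to the end of the pattern.
    return "".join(out)
```

       For example, \Qabc$xyz\E becomes abc\$xyz, and \Q\\E becomes \\, which
       matches a single backslash.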
6799
6800   Non-printing characters
6801
6802       A second use of backslash provides a way of encoding non-printing char-
6803       acters in patterns in a visible manner. There is no restriction on  the
6804       appearance  of non-printing characters in a pattern, but when a pattern
6805       is being prepared by text editing, it is often easier to use one of the
6806       following escape sequences instead of the binary  character  it  repre-
6807       sents.  In  an  ASCII or Unicode environment, these escapes are as fol-
6808       lows:
6809
6810         \a          alarm, that is, the BEL character (hex 07)
6811         \cx         "control-x", where x is a non-control ASCII character
6812         \e          escape (hex 1B)
6813         \f          form feed (hex 0C)
6814         \n          linefeed (hex 0A)
6815         \r          carriage return (hex 0D) (but see below)
6816         \t          tab (hex 09)
6817         \0dd        character with octal code 0dd
6818         \ddd        character with octal code ddd, or backreference
6819         \o{ddd..}   character with octal code ddd..
6820         \xhh        character with hex code hh
6821         \x{hhh..}   character with hex code hhh..
6822         \N{U+hhh..} character with Unicode hex code point hhh..
6823
6824       By default, after \x that is not followed by {, from zero to two  hexa-
6825       decimal  digits  are  read (letters can be in upper or lower case). Any
6826       number of hexadecimal digits may appear between \x{ and }. If a charac-
6827       ter other than a hexadecimal digit appears between \x{  and  },  or  if
6828       there is no terminating }, an error occurs.
6829
6830       Characters whose code points are less than 256 can be defined by either
6831       of the two syntaxes for \x or by an octal sequence. There is no differ-
6832       ence in the way they are handled. For example, \xdc is exactly the same
6833       as  \x{dc}  or \334.  However, using the braced versions does make such
6834       sequences easier to read.
6835
6836       Support is available for some ECMAScript (aka  JavaScript)  escape  se-
6837       quences via two compile-time options. If PCRE2_ALT_BSUX is set, the se-
6838       quence  \x  followed  by { is not recognized. Only if \x is followed by
6839       two hexadecimal digits is it recognized as a character  escape.  Other-
6840       wise  it  is interpreted as a literal "x" character. In this mode, sup-
6841       port for code points greater than 255 is provided by \u, which must  be
6842       followed  by  four hexadecimal digits; otherwise it is interpreted as a
6843       literal "u" character.
6844
6845       PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in  ad-
6846       dition, \u{hhh..} is recognized as the character specified by hexadeci-
6847       mal code point.  There may be any number of hexadecimal digits, but un-
6848       like  other places that also use curly brackets, spaces are not allowed
6849       and would result in the string being interpreted  as  a  literal.  This
6850       syntax is from ECMAScript 6.
6851
6852       The  \N{U+hhh..} escape sequence is recognized only when PCRE2 is oper-
6853       ating in UTF mode. Perl also uses \N{name}  to  specify  characters  by
6854       Unicode  name;  PCRE2  does  not support this. Note that when \N is not
6855       followed by an opening brace (curly bracket) it has an entirely differ-
6856       ent meaning, matching any character that is not a newline.
6857
6858       There are some legacy applications where the escape sequence \r is  ex-
6859       pected  to  match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option
6860       is set, \r in a pattern is converted to \n so  that  it  matches  a  LF
6861       (linefeed) instead of a CR (carriage return) character.
6862
6863       An  error  occurs if \c is not followed by a character whose ASCII code
6864       point is in the range 32 to 126. The precise effect of \cx is  as  fol-
6865       lows:  if x is a lower case letter, it is converted to upper case. Then
6866       bit 6 of the character (hex 40) is inverted. Thus \cA to \cZ become hex
6867       01 to hex 1A (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B),  and
6868       \c;  becomes hex 7B (; is 3B). If the code unit following \c has a code
6869       point less than 32 or greater than 126, a compile-time error occurs.
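
       The arithmetic for \cx in an ASCII or Unicode environment is small
       enough to write out. The following Python sketch uses a hypothetical
       function name and models only the rule just described; it does not
       cover the EBCDIC behaviour.

```python
def control_escape(x):
    # Return the code point produced by \cx in an ASCII/Unicode environment.
    code = ord(x)
    if not 32 <= code <= 126:
        raise ValueError("\\c must be followed by a character in 32..126")
    if "a" <= x <= "z":
        code -= 32          # convert a lower case letter to upper case
    return code ^ 0x40      # invert bit 6 (hex 40)


for ch in "AZ{;":
    print(ch, hex(control_escape(ch)))
```

       This gives hex 01 for A, hex 1A for Z, hex 3B for { and hex 7B for ;,
       as stated above.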
6870
6871       When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..}  is  not  supported.
6872       \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
6873       The \c escape is processed as specified for Perl in the perlebcdic doc-
6874       ument.  The  only characters that are allowed after \c are A-Z, a-z, or
6875       one of @, [, \, ], ^, _, or ?. Any other character provokes a  compile-
6876       time  error.  The  sequence  \c@ encodes character code 0; after \c the
6877       letters (in either case) encode characters 1-26 (hex 01 to hex 1A);  [,
6878       \,  ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c? be-
6879       comes either 255 (hex FF) or 95 (hex 5F).
6880
6881       Thus, apart from \c?, these escapes generate the  same  character  code
6882       values  as  they do in an ASCII environment, though the meanings of the
6883       values mostly differ. For example, \cG always generates code  value  7,
6884       which is BEL in ASCII but DEL in EBCDIC.
6885
6886       The  sequence  \c? generates DEL (127, hex 7F) in an ASCII environment,
6887       but because 127 is not a control character in  EBCDIC,  Perl  makes  it
6888       generate  the  APC character. Unfortunately, there are several variants
6889       of EBCDIC. In most of them the APC character has  the  value  255  (hex
6890       FF),  but  in  the one Perl calls POSIX-BC its value is 95 (hex 5F). If
6891       certain other characters have POSIX-BC values, PCRE2 makes \c? generate
6892       95; otherwise it generates 255.
6893
6894       After \0 up to two further octal digits are read. If  there  are  fewer
6895       than  two  digits,  just  those that are present are used. Thus the se-
6896       quence \0\x\015 specifies two binary zeros followed by a  CR  character
6897       (code value 13). Make sure you supply two digits after the initial zero
6898       if the pattern character that follows is itself an octal digit.
6899
6900       The  escape \o must be followed by a sequence of octal digits, enclosed
6901       in braces. An error occurs if this is not the case. This  escape  is  a
6902       recent addition to Perl; it provides a way of specifying character code
6903       points as octal numbers greater than 0777, and  it  also  allows  octal
6904       numbers and backreferences to be unambiguously specified.
6905
6906       For greater clarity and unambiguity, it is best to avoid following \ by
6907       a  digit  greater than zero. Instead, use \o{...} or \x{...} to specify
6908       numerical character code points, and \g{...} to specify backreferences.
6909       The following paragraphs describe the old, ambiguous syntax.
6910
6911       The handling of a backslash followed by a digit other than 0 is compli-
6912       cated, and Perl has changed over time, causing PCRE2 also to change.
6913
6914       Outside a character class, PCRE2 reads the digit and any following dig-
6915       its as a decimal number. If the number is less than 10, begins with the
6916       digit 8 or 9, or if there are  at  least  that  many  previous  capture
6917       groups  in the expression, the entire sequence is taken as a backrefer-
6918       ence. A description of how this works is  given  later,  following  the
6919       discussion  of parenthesized groups.  Otherwise, up to three octal dig-
6920       its are read to form a character code.
6921
6922       Inside a character class, PCRE2 handles \8 and \9 as the literal  char-
6923       acters  "8"  and "9", and otherwise reads up to three octal digits fol-
6924       lowing the backslash, using them to generate a data character. Any sub-
6925       sequent digits stand for themselves. For example, outside  a  character
6926       class:
6927
6928         \040   is another way of writing an ASCII space
6929         \40    is the same, provided there are fewer than 40
6930                   previous capture groups
6931         \7     is always a backreference
6932         \11    might be a backreference, or another way of
6933                   writing a tab
6934         \011   is always a tab
6935         \0113  is a tab followed by the character "3"
6936         \113   might be a backreference, otherwise the
6937                   character with octal code 113
6938         \377   might be a backreference, otherwise
6939                   the value 255 (decimal)
6940         \81    is always a backreference
6941
6942       Note  that octal values of 100 or greater that are specified using this
6943       syntax must not be introduced by a leading zero, because no  more  than
6944       three octal digits are ever read.
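
       The decision made outside a character class can be summarized as a
       short Python sketch. The function name and its return format are
       hypothetical, the number of previous capture groups is assumed to be
       known, and the range checks that depend on the code unit width are
       left out.

```python
OCTAL_DIGITS = "01234567"


def classify_digit_escape(digits, previous_groups):
    """Interpret a backslash followed by `digits`, outside a character class.

    `digits` is the run of decimal digits after the backslash and must not
    start with 0. Returns ("backreference", n) or ("character", code, rest),
    where `rest` holds the digits that were not consumed.
    """
    number = int(digits)
    if number < 10 or digits[0] in "89" or number <= previous_groups:
        return ("backreference", number)
    # Otherwise up to three octal digits form a character code.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in OCTAL_DIGITS:
        n += 1
    return ("character", int(digits[:n], 8), digits[n:])
```

       With three previous groups, \40 is the character with code 32 (a
       space); with forty previous groups it is a backreference to group
       40. \81 is a backreference however many groups there are.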
6945
6946   Constraints on character values
6947
6948       Characters  that  are  specified using octal or hexadecimal numbers are
6949       limited to certain values, as follows:
6950
6951         8-bit non-UTF mode    no greater than 0xff
6952         16-bit non-UTF mode   no greater than 0xffff
6953         32-bit non-UTF mode   no greater than 0xffffffff
6954         All UTF modes         no greater than 0x10ffff and a valid code point
6955
6956       Invalid Unicode code points are all those in the range 0xd800 to 0xdfff
6957       (the so-called "surrogate" code points). The check  for  these  can  be
6958       disabled  by  the  caller  of  pcre2_compile()  by  setting  the option
6959       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only  in
6960       UTF-8  and  UTF-32 modes, because these values are not representable in
6961       UTF-16.
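
       These constraints amount to a simple range check. The sketch below is
       a model of the table and the surrogate rule, written in Python with
       hypothetical names; the allow_surrogates argument stands in for the
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option.

```python
def check_code_point(value, width, utf, allow_surrogates=False):
    """Return True if `value` is acceptable in an octal or hex escape.

    width: code unit width of the library in use (8, 16 or 32).
    utf: True when a UTF mode is in force.
    """
    if width not in (8, 16, 32):
        raise ValueError("width must be 8, 16 or 32")
    if not utf:
        # Non-UTF modes: anything that fits in one code unit.
        return 0 <= value <= (1 << width) - 1
    if not 0 <= value <= 0x10FFFF:
        return False
    if 0xD800 <= value <= 0xDFFF:
        # Surrogates can be allowed only in UTF-8 and UTF-32 modes.
        return allow_surrogates and width != 16
    return True
```

       For instance, 0x100 is rejected in 8-bit non-UTF mode, and 0xd800 is
       rejected in every UTF mode unless surrogates have been explicitly
       allowed in UTF-8 or UTF-32.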
6962
6963   Escape sequences in character classes
6964
6965       All the sequences that define a single character value can be used both
6966       inside and outside character classes. In addition, inside  a  character
6967       class, \b is interpreted as the backspace character (hex 08).
6968
6969       When not followed by an opening brace, \N is not allowed in a character
6970       class.   \B,  \R, and \X are not special inside a character class. Like
6971       other unrecognized alphabetic escape sequences, they  cause  an  error.
6972       Outside a character class, these sequences have different meanings.
6973
6974   Unsupported escape sequences
6975
6976       In  Perl,  the  sequences  \F, \l, \L, \u, and \U are recognized by its
6977       string handler and used to modify the case of following characters.  By
6978       default,  PCRE2  does  not  support these escape sequences in patterns.
6979       However, if either of the PCRE2_ALT_BSUX  or  PCRE2_EXTRA_ALT_BSUX  op-
6980       tions  is set, \U matches a "U" character, and \u can be used to define
6981       a character by code point, as described above.
6982
6983   Absolute and relative backreferences
6984
6985       The sequence \g followed by a signed or unsigned number, optionally en-
6986       closed in braces, is an absolute or  relative  backreference.  A  named
6987       backreference  can  be  coded as \g{name}. Backreferences are discussed
6988       later, following the discussion of parenthesized groups.
6989
6990   Absolute and relative subroutine calls
6991
6992       For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
6993       name or a number enclosed either in angle brackets or single quotes, is
6994       an  alternative syntax for referencing a capture group as a subroutine.
6995       Details are discussed later.   Note  that  \g{...}  (Perl  syntax)  and
6996       \g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
6997       erence; the latter is a subroutine call.
6998
6999   Generic character types
7000
7001       Another use of backslash is for specifying generic character types:
7002
7003         \d     any decimal digit
7004         \D     any character that is not a decimal digit
7005         \h     any horizontal white space character
7006         \H     any character that is not a horizontal white space character
7007         \N     any character that is not a newline
7008         \s     any white space character
7009         \S     any character that is not a white space character
7010         \v     any vertical white space character
7011         \V     any character that is not a vertical white space character
7012         \w     any "word" character
7013         \W     any "non-word" character
7014
7015       The  \N  escape  sequence has the same meaning as the "." metacharacter
7016       when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not  change
7017       the meaning of \N. Note that when \N is followed by an opening brace it
7018       has a different meaning. See the section entitled "Non-printing charac-
7019       ters"  above for details. Perl also uses \N{name} to specify characters
7020       by Unicode name; PCRE2 does not support this.
7021
7022       Each pair of lower and upper case escape sequences partitions the  com-
7023       plete  set  of  characters  into two disjoint sets. Any given character
7024       matches one, and only one, of each pair. The sequences can appear  both
7025       inside  and outside character classes. They each match one character of
7026       the appropriate type. If the current matching point is at  the  end  of
7027       the  subject string, all of them fail, because there is no character to
7028       match.
7029
7030       The default \s characters are HT (9), LF (10), VT  (11),  FF  (12),  CR
7031       (13),  and  space (32), which are defined as white space in the "C" lo-
7032       cale. This list may vary if locale-specific matching is  taking  place.
7033       For  example, in some locales the "non-breaking space" character (\xA0)
7034       is recognized as white space, and in others the VT character is not.
7035
7036       A "word" character is an underscore or any character that is  a  letter
7037       or  digit.   By  default,  the definition of letters and digits is con-
7038       trolled by PCRE2's low-valued character tables, and may vary if locale-
7039       specific matching is taking place (see "Locale support" in the pcre2api
7040       page). For example, in a French locale such  as  "fr_FR"  in  Unix-like
7041       systems,  or "french" in Windows, some character codes greater than 127
7042       are used for accented letters, and these are then matched  by  \w.  The
7043       use of locales with Unicode is discouraged.
7044
7045       By  default,  characters  whose  code points are greater than 127 never
7046       match \d, \s, or \w, and always match \D, \S, and \W, although this may
7047       be different for characters in the range 128-255  when  locale-specific
7048       matching  is  happening.   These escape sequences retain their original
7049       meanings from before Unicode support was available,  mainly  for  effi-
7050       ciency  reasons.  If  the  PCRE2_UCP  option  is  set, the behaviour is
7051       changed so that Unicode properties  are  used  to  determine  character
7052       types, as follows:
7053
7054         \d  any character that matches \p{Nd} (decimal digit)
7055         \s  any character that matches \p{Z} or \h or \v
7056         \w  any character that matches \p{L}, \p{N}, \p{Mn}, or \p{Pc}
7057
7058       The addition of \p{Mn} (non-spacing mark) and the replacement of an ex-
7059       plicit  test  for underscore with a test for \p{Pc} (connector punctua-
7060       tion) happened in PCRE2 release 10.43. This brings PCRE2 into line with
7061       Perl.
7062
7063       The upper case escapes match the inverse sets of characters. Note  that
7064       \d  matches  only decimal digits, whereas \w matches any Unicode digit,
7065       as well as other character categories. Note also that PCRE2_UCP affects
7066       \b and \B, because they are defined in terms of  \w  and  \W.  Matching
7067       these sequences is noticeably slower when PCRE2_UCP is set.
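
       The PCRE2_UCP definitions of \d and \w can be modelled outside  PCRE2.
       The  following  Python  sketch  is an illustration only: it does not
       call PCRE2, the function names are hypothetical, and Python's
       unicodedata module may be based on a different Unicode version from
       the one PCRE2 was built with.

```python
import unicodedata

def ucp_digit(ch):
    """Model of \\d under PCRE2_UCP: general category Nd only."""
    return unicodedata.category(ch) == "Nd"

def ucp_word(ch):
    """Model of \\w under PCRE2_UCP: \\p{L}, \\p{N}, \\p{Mn}, or \\p{Pc}."""
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "N") or cat in ("Mn", "Pc")

# ASCII digit, Arabic-Indic digit, superscript two, underscore,
# combining acute accent, hyphen-minus
for ch in ("7", "\u0663", "\u00b2", "_", "\u0301", "-"):
    print(f"U+{ord(ch):04X}", ucp_digit(ch), ucp_word(ch))
```

       In this model, superscript two (U+00B2, category No) is  a  "word"
       character but not a \d digit, which is the distinction drawn above.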
7068
7069       The  effect  of  PCRE2_UCP  on any one of these escape sequences can be
7070       negated by the  options  PCRE2_EXTRA_ASCII_BSD,  PCRE2_EXTRA_ASCII_BSS,
7071       and  PCRE2_EXTRA_ASCII_BSW,  respectively. These options can be set and
7072       reset within a pattern by means of an internal option setting (see  be-
7073       low).
7074
7075       The  sequences  \h, \H, \v, and \V, in contrast to the other sequences,
7076       which match only ASCII characters by default, always match  a  specific
7077       list  of  code  points, whether or not PCRE2_UCP is set. The horizontal
7078       space characters are:
7079
7080         U+0009     Horizontal tab (HT)
7081         U+0020     Space
7082         U+00A0     Non-break space
7083         U+1680     Ogham space mark
7084         U+180E     Mongolian vowel separator
7085         U+2000     En quad
7086         U+2001     Em quad
7087         U+2002     En space
7088         U+2003     Em space
7089         U+2004     Three-per-em space
7090         U+2005     Four-per-em space
7091         U+2006     Six-per-em space
7092         U+2007     Figure space
7093         U+2008     Punctuation space
7094         U+2009     Thin space
7095         U+200A     Hair space
7096         U+202F     Narrow no-break space
7097         U+205F     Medium mathematical space
7098         U+3000     Ideographic space
7099
7100       The vertical space characters are:
7101
7102         U+000A     Linefeed (LF)
7103         U+000B     Vertical tab (VT)
7104         U+000C     Form feed (FF)
7105         U+000D     Carriage return (CR)
7106         U+0085     Next line (NEL)
7107         U+2028     Line separator
7108         U+2029     Paragraph separator
7109
7110       In 8-bit, non-UTF-8 mode, only the characters  with  code  points  less
7111       than 256 are relevant.
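
       The two lists above are small enough to write out as  sets.  This  is
       a minimal Python sketch, not PCRE2 code; the names H_SPACE, V_SPACE,
       and the helper functions are assumptions made for the illustration.

```python
# Code points matched by \h, copied from the horizontal space list.
H_SPACE = frozenset(
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200B))      # U+2000 to U+200A inclusive
    + [0x202F, 0x205F, 0x3000]
)

# Code points matched by \v, copied from the vertical space list.
V_SPACE = frozenset([0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029])

def is_hspace(ch):
    return ord(ch) in H_SPACE

def is_vspace(ch):
    return ord(ch) in V_SPACE

def relevant_in_8bit_non_utf(code_points):
    """Subset that can occur when every character is a single byte."""
    return sorted(cp for cp in code_points if cp < 0x100)
```

       For  example, relevant_in_8bit_non_utf(V_SPACE) keeps only LF, VT, FF,
       CR, and NEL, because U+2028 and U+2029 cannot appear in such a string.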
7112
7113   Newline sequences
7114
7115       Outside  a  character class, by default, the escape sequence \R matches
7116       any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is  equivalent
7117       to the following:
7118
7119         (?>\r\n|\n|\x0b|\f|\r|\x85)
7120
7121       This is an example of an "atomic group", details of which are given be-
7122       low.   This  particular group matches either the two-character sequence
7123       CR followed by LF, or  one  of  the  single  characters  LF  (linefeed,
7124       U+000A),  VT  (vertical  tab, U+000B), FF (form feed, U+000C), CR (car-
7125       riage return, U+000D), or NEL (next line, U+0085). Because this  is  an
7126       atomic  group,  the  two-character sequence is treated as a single unit
7127       that cannot be split.
7128
7129       In other modes, two additional characters whose code points are greater
7130       than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
7131       rator, U+2029).  Unicode support is not needed for these characters  to
7132       be recognized.
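
       As  a  rough  illustration, the expansion of \R can be written for
       Python's re module, which has no \R of its own.  This  sketch  is  an
       approximation, not PCRE2 behaviour: it uses an ordinary non-capturing
       group rather than an atomic group, and the names BSR_8BIT,  BSR_WIDE,
       and split_lines are hypothetical.

```python
import re

# The 8-bit non-UTF expansion, with a plain group standing in for (?>...).
BSR_8BIT = re.compile(r"(?:\r\n|\n|\x0b|\f|\r|\x85)")

# Other modes add LS (U+2028) and PS (U+2029).
BSR_WIDE = re.compile(r"(?:\r\n|\n|\x0b|\f|\r|\x85|\u2028|\u2029)")

def split_lines(text):
    """Split text at every newline sequence that \\R would recognize."""
    return BSR_WIDE.split(text)

print(split_lines("one\r\ntwo\nthree\u2028four"))
```

       Because the group in this sketch is not atomic, a pattern such as the
       8-bit expansion followed by \n can backtrack from CRLF to a lone CR
       and then match the LF separately. The atomic group described above
       rules that out.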
7133
7134       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
7135       the  complete  set  of  Unicode  line  endings)  by  setting the option
7136       PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation  for  "back-
7137       slash R".) This can be made the default when PCRE2 is built; if this is
7138       the  case,  the other behaviour can be requested via the PCRE2_BSR_UNI-
7139       CODE option. It is also possible to specify these settings by  starting
7140       a pattern string with one of the following sequences:
7141
7142         (*BSR_ANYCRLF)   CR, LF, or CRLF only
7143         (*BSR_UNICODE)   any Unicode newline sequence
7144
7145       These override the default and the options given to the compiling func-
7146       tion.  Note that these special settings, which are not Perl-compatible,
7147       are  recognized only at the very start of a pattern, and that they must
7148       be in upper case. If more than one of them is present, the last one  is
7149       used. They can be combined with a change of newline convention; for ex-
7150       ample, a pattern can start with:
7151
7152         (*ANY)(*BSR_ANYCRLF)
7153
7154       They  can also be combined with the (*UTF) or (*UCP) special sequences.
7155       Inside a character class, \R is treated as an unrecognized  escape  se-
7156       quence, and causes an error.
7157
7158   Unicode character properties
7159
7160       When  PCRE2  is  built  with Unicode support (the default), three addi-
7161       tional escape sequences that match characters with specific  properties
7162       are available. They can be used in any mode, though in 8-bit and 16-bit
7163       non-UTF  modes these sequences are of course limited to testing charac-
7164       ters whose code points are less than U+0100 and U+10000,  respectively.
7165       In  32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode
7166       limit) may be encountered. These are all treated as being  in  the  Un-
7167       known script and with an unassigned type.
7168
7169       Matching  characters by Unicode property is not fast, because PCRE2 has
7170       to do a multistage table lookup in order to find  a  character's  prop-
7171       erty. That is why the traditional escape sequences such as \d and \w do
7172       not  use  Unicode  properties  in PCRE2 by default, though you can make
7173       them do so by setting the PCRE2_UCP option or by starting  the  pattern
7174       with (*UCP).
7175
7176       The extra escape sequences that provide property support are:
7177
7178         \p{xx}   a character with the xx property
7179         \P{xx}   a character without the xx property
7180         \X       a Unicode extended grapheme cluster
7181
7182       The  property names represented by xx above are not case-sensitive, and
7183       in accordance with Unicode's "loose matching" rules,  spaces,  hyphens,
7184       and underscores are ignored. There is support for Unicode script names,
7185       Unicode general category properties, "Any", which matches any character
7186       (including  newline),  Bidi_Class,  a number of binary (yes/no) proper-
7187       ties, and some special PCRE2  properties  (described  below).   Certain
7188       other  Perl  properties such as "InMusicalSymbols" are not supported by
7189       PCRE2. Note that \P{Any} does  not  match  any  characters,  so  always
7190       causes a match failure.
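
       The  loose  matching  rule for property names can be sketched as a
       normalization step. The Python function below is an assumption made
       for  illustration;  it  shows the stated rule (ignore case, spaces,
       hyphens, and underscores), not PCRE2's actual lookup code.

```python
def normalize_property_name(name):
    """Fold case and drop spaces, hyphens, and underscores."""
    return "".join(c for c in name.lower() if c not in " -_")

# Spellings that compare equal under the rule.
variants = ["Bidi_Class", "bidi class", "BIDI-CLASS", "bidiclass"]
print({normalize_property_name(v) for v in variants})
```

       Under this rule all four spellings reduce to the same key.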
7191
7192   Script properties for \p and \P
7193
7194       There are three different syntax forms for matching a script. Each Uni-
7195       code  character  has  a  basic  script and, optionally, a list of other
7196       scripts ("Script Extensions") with which it is commonly used. Using the
7197       Adlam script as an example, \p{sc:Adlam} matches characters whose basic
7198       script is Adlam, whereas \p{scx:Adlam} matches, in addition, characters
7199       that have Adlam in their extensions list. The full names  "script"  and
7200       "script extensions" for the property types are recognized, and an equals
7201       sign  is an alternative to the colon. If a script name is given without
7202       a property type, for example, \p{Adlam}, it is  treated  as  \p{scx:Ad-
7203       lam}.  Perl  changed  to  this interpretation at release 5.26 and PCRE2
7204       changed at release 10.40.
7205
7206       Unassigned characters (and in non-UTF 32-bit mode, characters with code
7207       points greater than 0x10FFFF) are assigned the "Unknown" script. Others
7208       that are not part of an identified script are lumped together as  "Com-
7209       mon". The current list of recognized script names and their 4-character
7210       abbreviations can be obtained by running this command:
7211
7212         pcre2test -LS
7213
7214
7215   The general category property for \p and \P
7216
7217       Each character has exactly one Unicode general category property, spec-
7218       ified  by a two-letter abbreviation. For compatibility with Perl, nega-
7219       tion can be specified by including a  circumflex  between  the  opening
7220       brace  and  the  property  name.  For  example,  \p{^Lu} is the same as
7221       \P{Lu}.
7222
7223       If only one letter is specified with \p or \P, it includes all the gen-
7224       eral category properties that start with that letter. In this case,  in
7225       the  absence of negation, the curly brackets in the escape sequence are
7226       optional; these two examples have the same effect:
7227
7228         \p{L}
7229         \pL
7230
7231       The following general category property codes are supported:
7232
7233         C     Other
7234         Cc    Control
7235         Cf    Format
7236         Cn    Unassigned
7237         Co    Private use
7238         Cs    Surrogate
7239
7240         L     Letter
7241         Ll    Lower case letter
7242         Lm    Modifier letter
7243         Lo    Other letter
7244         Lt    Title case letter
7245         Lu    Upper case letter
7246
7247         M     Mark
7248         Mc    Spacing mark
7249         Me    Enclosing mark
7250         Mn    Non-spacing mark
7251
7252         N     Number
7253         Nd    Decimal number
7254         Nl    Letter number
7255         No    Other number
7256
7257         P     Punctuation
7258         Pc    Connector punctuation
7259         Pd    Dash punctuation
7260         Pe    Close punctuation
7261         Pf    Final punctuation
7262         Pi    Initial punctuation
7263         Po    Other punctuation
7264         Ps    Open punctuation
7265
7266         S     Symbol
7267         Sc    Currency symbol
7268         Sk    Modifier symbol
7269         Sm    Mathematical symbol
7270         So    Other symbol
7271
7272         Z     Separator
7273         Zl    Line separator
7274         Zp    Paragraph separator
7275         Zs    Space separator
7276
7277       The special property LC, which has the synonym L&, is  also  supported:
7278       it  matches  a  character that has the Lu, Ll, or Lt property, in other
7279       words, a letter that is not classified as a modifier or "other".
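
       The  relationship  between  one-letter  codes,  two-letter codes, and
       the special LC property can be sketched in Python. This is a model
       only:  has_general_category is a hypothetical helper, and Python's
       unicodedata tables may not be the same Unicode  version  as  those
       built into PCRE2.

```python
import unicodedata

def has_general_category(ch, code):
    """Model of \\p{code} for general category codes, plus LC and L&."""
    cat = unicodedata.category(ch)
    if code in ("LC", "L&"):
        return cat in ("Lu", "Ll", "Lt")   # cased letters only
    if len(code) == 1:
        return cat.startswith(code)        # e.g. L covers Ll, Lm, Lo, Lt, Lu
    return cat == code

print(has_general_category("A", "L"), has_general_category("A", "Lu"))
```

       A  modifier letter such as U+02B0 (category Lm) satisfies L in this
       model but not LC.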
7280
7281       The Cs (Surrogate) property  applies  only  to  characters  whose  code
7282       points  are in the range U+D800 to U+DFFF. These characters are no dif-
7283       ferent to any other character when PCRE2 is not in UTF mode (using  the
7284       16-bit  or  32-bit  library).   However,  they are not valid in Unicode
7285       strings and so cannot be tested by PCRE2 in UTF mode, unless UTF valid-
7286       ity  checking  has   been   turned   off   (see   the   discussion   of
7287       PCRE2_NO_UTF_CHECK in the pcre2api page).
7288
7289       The  long  synonyms  for  property  names  that  Perl supports (such as
7290       \p{Letter}) are not supported by PCRE2, nor is it permitted  to  prefix
7291       any of these properties with "Is".
7292
7293       No character that is in the Unicode table has the Cn (unassigned) prop-
7294       erty.  Instead, this property is assumed for any code point that is not
7295       in the Unicode table.
7296
7297       Specifying  caseless  matching  does not affect these escape sequences.
7298       For example, \p{Lu} always matches only upper  case  letters.  This  is
7299       different from the behaviour of current versions of Perl.
7300
7301   Binary (yes/no) properties for \p and \P
7302
7303       Unicode  defines  a  number  of  binary properties, that is, properties
7304       whose only values are true or false. You can obtain  a  list  of  those
7305       that  are  recognized  by \p and \P, along with their abbreviations, by
7306       running this command:
7307
7308         pcre2test -LP
7309
7310
7311   The Bidi_Class property for \p and \P
7312
7313         \p{Bidi_Class:<class>}   matches a character with the given class
7314         \p{BC:<class>}           matches a character with the given class
7315
7316       The recognized classes are:
7317
7318         AL          Arabic letter
7319         AN          Arabic number
7320         B           paragraph separator
7321         BN          boundary neutral
7322         CS          common separator
7323         EN          European number
7324         ES          European separator
7325         ET          European terminator
7326         FSI         first strong isolate
7327         L           left-to-right
7328         LRE         left-to-right embedding
7329         LRI         left-to-right isolate
7330         LRO         left-to-right override
7331         NSM         non-spacing mark
7332         ON          other neutral
7333         PDF         pop directional format
7334         PDI         pop directional isolate
7335         R           right-to-left
7336         RLE         right-to-left embedding
7337         RLI         right-to-left isolate
7338         RLO         right-to-left override
7339         S           segment separator
7340         WS          white space
7341
7342       An equals sign may be used instead of a  colon.  The  class  names  are
7343       case-insensitive; only the short names listed above are recognized.
7344
7345   Extended grapheme clusters
7346
7347       The  \X  escape  matches  any number of Unicode characters that form an
7348       "extended grapheme cluster", and treats the sequence as an atomic group
7349       (see below).  Unicode supports various kinds of composite character  by
7350       giving  each  character  a grapheme breaking property, and having rules
7351       that use these properties to define the boundaries of extended grapheme
7352       clusters. The rules are defined in Unicode Standard Annex 29,  "Unicode
7353       Text  Segmentation".  Unicode 11.0.0 abandoned the use of some previous
7354       properties that had been used for emojis.  Instead it introduced  vari-
7355       ous  emoji-specific  properties.  PCRE2  uses  only the Extended Picto-
7356       graphic property.
7357
7358       \X always matches at least one character. Then it  decides  whether  to
7359       add additional characters according to the following rules for ending a
7360       cluster:
7361
7362       1. End at the end of the subject string.
7363
7364       2.  Do not end between CR and LF; otherwise end after any control char-
7365       acter.
7366
7367       3. Do not break Hangul (a Korean  script)  syllable  sequences.  Hangul
7368       characters  are of five types: L, V, T, LV, and LVT. An L character may
7369       be followed by an L, V, LV, or LVT character; an LV or V character  may
7370       be  followed  by  a V or T character; an LVT or T character may be fol-
7371       lowed only by a T character.
7372
7373       4. Do not end before extending characters or spacing marks or the zero-
7374       width joiner (ZWJ) character. Characters with the "mark"  property  al-
7375       ways have the "extend" grapheme breaking property.
7376
7377       5. Do not end after prepend characters.
7378
7379       6.  Do not end within emoji modifier sequences or emoji ZWJ (zero-width
7380       joiner) sequences. An emoji ZWJ sequence consists of a  character  with
7381       the  Extended_Pictographic property, optionally followed by one or more
7382       characters with the Extend property, followed  by  the  ZWJ  character,
7383       followed by another Extended_Pictographic character.
7384
7385       7.  Do not break within emoji flag sequences. That is, do not break be-
7386       tween regional indicator (RI) characters if there are an odd number  of
7387       RI characters before the break point.
7388
7389       8. Otherwise, end the cluster.
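
       Two of the rules above lend themselves to a short sketch: rule 3 as a
       lookup table and rule 7 as a parity check. The Python below is  an
       illustration of those two rules only, not a grapheme cluster matcher
       and not PCRE2 code. The helper names are hypothetical, and the
       Hangul  types are passed in as strings rather than derived from code
       points.

```python
# Rule 3: which Hangul syllable types may follow each type without a break.
HANGUL_MAY_FOLLOW = {
    "L": {"L", "V", "LV", "LVT"},
    "LV": {"V", "T"},
    "V": {"V", "T"},
    "LVT": {"T"},
    "T": {"T"},
}

def hangul_keeps_together(prev_type, next_type):
    return next_type in HANGUL_MAY_FOLLOW.get(prev_type, set())

def is_regional_indicator(ch):
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF

def may_break_between_ri(text, pos):
    """Rule 7: text[pos - 1] and text[pos] are both RI characters.

    A break at pos is allowed only if an even number of RI characters
    immediately precedes it.
    """
    count = 0
    i = pos - 1
    while i >= 0 and is_regional_indicator(text[i]):
        count += 1
        i -= 1
    return count % 2 == 0

flags = "\U0001F1EB\U0001F1F7\U0001F1E9\U0001F1EA"   # RI letters F R D E
print([may_break_between_ri(flags, p) for p in (1, 2, 3)])
```

       For the four RI characters in the example, the only permitted  break
       is in the middle, so the string splits into two pairs.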
7390
7391   PCRE2's additional properties
7392
7393       As  well as the standard Unicode properties described above, PCRE2 sup-
7394       ports four more that make it possible to convert traditional escape se-
7395       quences such as \w and \s to use Unicode properties. PCRE2  uses  these
7396       non-standard,  non-Perl  properties  internally  when PCRE2_UCP is set.
7397       However, they may also be used explicitly. These properties are:
7398
7399         Xan   Any alphanumeric character
7400         Xps   Any POSIX space character
7401         Xsp   Any Perl space character
7402         Xwd   Any Perl "word" character
7403
7404       Xan matches characters that have either the L (letter) or the  N  (num-
7405       ber)  property. Xps matches the characters tab, linefeed, vertical tab,
7406       form feed, or carriage return, and any other character that has  the  Z
7407       (separator)  property.  Xsp is the same as Xps; in PCRE1 it used to ex-
7408       clude vertical tab, for  Perl  compatibility,  but  Perl  changed.  Xwd
7409       matches the same characters as Xan, plus those that match Mn (non-spac-
7410       ing mark) or Pc (connector punctuation, which includes underscore).
7411
7412       There  is another non-standard property, Xuc, which matches any charac-
7413       ter that can be represented by a Universal Character Name  in  C++  and
7414       other  programming  languages.  These are the characters $, @, ` (grave
7415       accent), and all characters with Unicode code points  greater  than  or
7416       equal  to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
7417       most base (ASCII) characters are excluded. (Universal  Character  Names
7418       are  of  the  form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
7419       Note that the Xuc property does not match these sequences but the char-
7420       acters that they represent.)
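
       The definitions of Xan, Xps, and Xuc given above translate directly
       into predicates. The following Python sketch is a model built from
       those  definitions,  with  hypothetical function names; it relies on
       Python's unicodedata tables, which may differ in Unicode version from
       PCRE2's.

```python
import unicodedata

def is_xan(ch):
    """Xan: L (letter) or N (number)."""
    return unicodedata.category(ch)[0] in ("L", "N")

def is_xps(ch):
    """Xps: tab, LF, VT, FF, CR, or anything with the Z property."""
    return ch in "\t\n\x0b\f\r" or unicodedata.category(ch)[0] == "Z"

def is_xuc(ch):
    """Xuc: $, @, grave accent, or U+00A0 and above, excluding surrogates."""
    cp = ord(ch)
    if ch in "$@`":
        return True
    return cp >= 0xA0 and not (0xD800 <= cp <= 0xDFFF)

for ch in ("a", "_", "\u00a0", "$", "A"):
    print(f"U+{ord(ch):04X}", is_xan(ch), is_xps(ch), is_xuc(ch))
```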
7421
7422   Resetting the match start
7423
7424       In normal use, the escape sequence \K  causes  any  previously  matched
7425       characters not to be included in the final matched sequence that is re-
7426       turned. For example, the pattern:
7427
7428         foo\Kbar
7429
7430       matches  "foobar",  but  reports that it has matched "bar". \K does not
7431       interact with anchoring in any way. The pattern:
7432
7433         ^foo\Kbar
7434
7435       matches only when the subject begins  with  "foobar"  (in  single  line
7436       mode),  though  it again reports the matched string as "bar". This fea-
7437       ture is similar to a lookbehind assertion (described  below),  but  the
7438       part of the pattern that precedes \K is not constrained to match a lim-
7439       ited  number  of characters, as is required for a lookbehind assertion.
7440       The use of \K does not interfere with  the  setting  of  captured  sub-
7441       strings.  For example, when the pattern
7442
7443         (foo)\Kbar
7444
7445       matches "foobar", the first substring is still set to "foo".
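
       Python's re module has no \K, but the reporting behaviour described
       above can be imitated to make it concrete. In this sketch, which is
       an emulation and not PCRE2, the part of the pattern after \K is
       wrapped in a named group and only that group is reported. The names
       EMULATED and match_with_reset are hypothetical.

```python
import re

# Stand-in for the PCRE2 pattern (foo)\Kbar.
EMULATED = re.compile(r"(foo)(?P<reported>bar)")

def match_with_reset(subject):
    m = EMULATED.search(subject)
    if m is None:
        return None
    return {
        "consumed": m.group(0),           # everything the pattern matched
        "reported": m.group("reported"),  # what \K leaves as the match
        "span": m.span("reported"),       # offsets of the reported part
        "group1": m.group(1),             # capture made before \K
    }

print(match_with_reset("xfoobar"))
```

       The  pattern  has  to  match "foobar" as a whole, yet only "bar" is
       reported, and the capture made before the reset point is still set.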
7446
7447       From  version  5.32.0  Perl  forbids the use of \K in lookaround asser-
7448       tions. From release 10.38 PCRE2 also forbids this by default.  However,
7449       the  PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK  option  can be used when calling
7450       pcre2_compile() to re-enable the previous behaviour. When  this  option
7451       is set, \K is acted upon when it occurs inside positive assertions, but
7452       is  ignored  in  negative  assertions. Note that when a pattern such as
7453       (?=ab\K) matches, the reported start of the match can be  greater  than
7454       the  end  of the match. Using \K in a lookbehind assertion at the start
7455       of a pattern can also lead to odd effects. For example,  consider  this
7456       pattern:
7457
7458         (?<=\Kfoo)bar
7459
7460       If  the  subject  is  "foobar", a call to pcre2_match() with a starting
7461       offset of 3 succeeds and reports the matching string as "foobar",  that
7462       is,  the  start  of  the reported match is earlier than where the match
7463       started.
7464
7465   Simple assertions
7466
7467       The final use of backslash is for certain simple assertions. An  asser-
7468       tion  specifies a condition that has to be met at a particular point in
7469       a match, without consuming any characters from the subject string.  The
7470       use  of groups for more complicated assertions is described below.  The
7471       backslashed assertions are:
7472
7473         \b     matches at a word boundary
7474         \B     matches when not at a word boundary
7475         \A     matches at the start of the subject
7476         \Z     matches at the end of the subject
7477                 also matches before a newline at the end of the subject
7478         \z     matches only at the end of the subject
7479         \G     matches at the first matching position in the subject
7480
7481       Inside a character class, \b has a different meaning;  it  matches  the
7482       backspace  character.  If  any  other  of these assertions appears in a
7483       character class, an "invalid escape sequence" error is generated.
7484
7485       A word boundary is a position in the subject string where  the  current
7486       character  and  the previous character do not both match \w or \W (i.e.
7487       one matches \w and the other matches \W), or the start or  end  of  the
7488       string  if  the  first or last character matches \w, respectively. When
7489       PCRE2 is built with Unicode support, the meanings of \w and \W  can  be
7490       changed by setting the PCRE2_UCP option. When this is done, it also af-
7491       fects  \b and \B. Neither PCRE2 nor Perl has a separate "start of word"
7492       or "end of word" metasequence. However, whatever  follows  \b  normally
7493       determines  which  it  is. For example, the fragment \ba matches "a" at
7494       the start of a word.
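
       The  word  boundary  definition  above can be written as a small
       predicate. This Python sketch assumes the default (non-UCP) meaning
       of \w, approximated here by Python's ASCII-only \w; the function
       names are hypothetical and the code is not taken from PCRE2.

```python
import re

ASCII_WORD = re.compile(r"\w", re.ASCII)

def is_word_char(ch):
    return ASCII_WORD.fullmatch(ch) is not None

def is_word_boundary(subject, pos):
    """True if \\b would match at offset pos (0 <= pos <= len(subject))."""
    # Positions outside the subject count as non-word.
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    return before != after

subject = "ab, cd"
print([p for p in range(len(subject) + 1) if is_word_boundary(subject, p)])
```

       \B is the negation of this predicate at every position.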
7495
7496       The \A, \Z, and \z assertions differ from  the  traditional  circumflex
7497       and dollar (described in the next section) in that they only ever match
7498       at  the  very start and end of the subject string, whatever options are
7499       set. Thus, they are independent of multiline mode. These  three  asser-
7500       tions  are  not  affected  by the PCRE2_NOTBOL or PCRE2_NOTEOL options,
7501       which affect only the behaviour of the circumflex and dollar  metachar-
7502       acters.  However,  if the startoffset argument of pcre2_match() is non-
7503       zero, indicating that matching is to start at a point  other  than  the
7504       beginning  of  the subject, \A can never match.  The difference between
7505       \Z and \z is that \Z matches before a newline at the end of the  string
7506       as well as at the very end, whereas \z matches only at the end.
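
       The positions at which \A, \Z, and \z are true can be modelled as
       follows.  This Python sketch is an illustration with hypothetical
       names. It assumes that the newline convention is a single known
       string (LF by default) and that offsets count characters.

```python
def at_A(pos):
    """\\A: only at the very start of the subject."""
    return pos == 0

def at_z(subject, pos):
    """\\z: only at the very end of the subject."""
    return pos == len(subject)

def at_Z(subject, pos, newline="\n"):
    """\\Z: at the very end, or just before a newline that ends the subject."""
    if pos == len(subject):
        return True
    return subject.endswith(newline) and pos == len(subject) - len(newline)

subject = "abc\n"
print([p for p in range(len(subject) + 1) if at_Z(subject, p)])
print([p for p in range(len(subject) + 1) if at_z(subject, p)])
```

       Since a match attempt never visits a position before startoffset,
       at_A can only succeed when matching starts at offset zero.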
7507
7508       The  \G assertion is true only when the current matching position is at
7509       the start point of the matching process, as specified by the  startoff-
7510       set  argument  of  pcre2_match().  It differs from \A when the value of
7511       startoffset is non-zero. By calling pcre2_match() multiple  times  with
7512       appropriate  arguments,  you  can  mimic Perl's /g option, and it is in
7513       this kind of implementation where \G can be useful.
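
       The repeated-call technique can be sketched in Python, whose
       compiled-pattern match() method takes a start position and succeeds
       only at that position, much like a pattern with a leading \G. This
       is an analogy rather than PCRE2 code: global_matches is a
       hypothetical helper, and a real application would pass the new
       offset as the startoffset argument of pcre2_match().

```python
import re

def global_matches(pattern, subject):
    """Collect successive matches, each starting where the last one ended."""
    compiled = re.compile(pattern)
    pos = 0
    results = []
    while pos <= len(subject):
        m = compiled.match(subject, pos)   # must match exactly at pos
        if m is None:
            break
        results.append(m.group(0))
        # Step past an empty match so that the loop always terminates.
        pos = m.end() if m.end() > pos else pos + 1
    return results

print(global_matches(r"\d", "123abc456"))
```

       The loop stops at the first position where no match starts, so the
       digits after "abc" are never reached.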
7514
7515       Note, however, that PCRE2's implementation of \G,  being  true  at  the
7516       starting  character  of  the matching process, is subtly different from
7517       Perl's, which defines it as true at the end of the previous  match.  In
7518       Perl,  these  can  be  different when the previously matched string was
7519       empty. Because PCRE2 does just one match at a time, it cannot reproduce
7520       this behaviour.
7521
7522       If all the alternatives of a pattern begin with \G, the  expression  is
7523       anchored to the starting match position, and the "anchored" flag is set
7524       in the compiled regular expression.
7525
7526
7527CIRCUMFLEX AND DOLLAR
7528
7529       The  circumflex  and  dollar  metacharacters are zero-width assertions.
7530       That is, they test for a particular condition being true  without  con-
7531       suming any characters from the subject string. These two metacharacters
7532       are  concerned  with matching the starts and ends of lines. If the new-
7533       line convention is set so that only the two-character sequence CRLF  is
7534       recognized  as  a newline, isolated CR and LF characters are treated as
7535       ordinary data characters, and are not recognized as newlines.
7536
7537       Outside a character class, in the default matching mode, the circumflex
7538       character is an assertion that is true only  if  the  current  matching
7539       point  is  at the start of the subject string. If the startoffset argu-
7540       ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is  set,  circum-
7541       flex  can  never match if the PCRE2_MULTILINE option is unset. Inside a
7542       character class, circumflex has an entirely different meaning (see  be-
7543       low).
7544
7545       Circumflex  need  not be the first character of the pattern if a number
7546       of alternatives are involved, but it should be the first thing in  each
7547       alternative  in  which  it appears if the pattern is ever to match that
7548       branch. If all possible alternatives start with a circumflex, that  is,
7549       if  the  pattern  is constrained to match only at the start of the sub-
7550       ject, it is said to be an "anchored" pattern.  (There  are  also  other
7551       constructs that can cause a pattern to be anchored.)
7552
7553       The  dollar  character is an assertion that is true only if the current
7554       matching point is at the end of the subject string, or immediately  be-
7555       fore  a newline at the end of the string (by default), unless PCRE2_NO-
7556       TEOL is set. Note, however, that it does not actually  match  the  new-
7557       line.  Dollar need not be the last character of the pattern if a number
7558       of alternatives are involved, but it should be the  last  item  in  any
7559       branch  in which it appears. Dollar has no special meaning in a charac-
7560       ter class.
7561
7562       The meaning of dollar can be changed so that it  matches  only  at  the
7563       very  end  of the string, by setting the PCRE2_DOLLAR_ENDONLY option at
7564       compile time. This does not affect the \Z assertion.
7565
7566       The meanings of the circumflex and dollar metacharacters are changed if
7567       the PCRE2_MULTILINE option is set. When this  is  the  case,  a  dollar
7568       character  matches before any newlines in the string, as well as at the
7569       very end, and a circumflex matches immediately after internal  newlines
7570       as  well as at the start of the subject string. It does not match after
7571       a newline that ends the string, for compatibility with  Perl.  However,
7572       this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.
7573
7574       For  example, the pattern /^abc$/ matches the subject string "def\nabc"
7575       (where \n represents a newline) in multiline mode, but  not  otherwise.
7576       Consequently,  patterns  that  are anchored in single line mode because
7577       all branches start with ^ are not anchored in  multiline  mode,  and  a
7578       match  for  circumflex  is  possible  when  the startoffset argument of
7579       pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option  is  ignored
7580       if PCRE2_MULTILINE is set.
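
       The rules for circumflex and dollar given in this section can be
       collected into two predicates. The Python sketch below is a model
       with hypothetical names, not PCRE2 code. It assumes that LF is the
       only newline, that startoffset is zero, and that PCRE2_NOTBOL and
       PCRE2_NOTEOL are unset.

```python
def circumflex_matches(subject, pos, multiline=False, alt_circumflex=False):
    if pos == 0:
        return True
    if not multiline:
        return False
    if subject[pos - 1] != "\n":
        return False
    # After a newline that ends the subject: only with PCRE2_ALT_CIRCUMFLEX.
    return pos < len(subject) or alt_circumflex

def dollar_matches(subject, pos, multiline=False, dollar_endonly=False):
    if pos == len(subject):
        return True
    if multiline:                  # PCRE2_DOLLAR_ENDONLY is ignored here
        return subject[pos] == "\n"
    if dollar_endonly:
        return False
    return pos == len(subject) - 1 and subject[pos] == "\n"

subject = "def\nabc"
print(circumflex_matches(subject, 4), circumflex_matches(subject, 4, multiline=True))
```

       With  the  subject "def\nabc", the circumflex test at offset 4
       succeeds only in multiline mode, which is why /^abc$/ needs that
       mode to match.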
7581
7582       When  the  newline  convention (see "Newline conventions" above) recog-
7583       nizes the two-character sequence CRLF as a newline, this is  preferred,
7584       even  if  the  single  characters CR and LF are also recognized as new-
7585       lines. For example, if the newline convention  is  "any",  a  multiline
7586       mode  circumflex matches before "xyz" in the string "abc\r\nxyz" rather
7587       than after CR, even though CR on its own is a valid newline.  (It  also
7588       matches at the very start of the string, of course.)
7589
7590       Note  that  the sequences \A, \Z, and \z can be used to match the start
7591       and end of the subject in both modes, and if all branches of a  pattern
7592       start  with \A it is always anchored, whether or not PCRE2_MULTILINE is
7593       set.
7594
7595
7596FULL STOP (PERIOD, DOT) AND \N
7597
7598       Outside a character class, a dot in the pattern matches any one charac-
7599       ter in the subject string except (by default) a character  that  signi-
7600       fies the end of a line. One or more characters may be specified as line
7601       terminators (see "Newline conventions" above).
7602
7603       Dot  never matches a single line-ending character. When the two-charac-
7604       ter sequence CRLF is the only line ending, dot does not match CR if  it
7605       is  immediately followed by LF, but otherwise it matches all characters
7606       (including isolated CRs and LFs). When ANYCRLF  is  selected  for  line
7607       endings,  no  occurrences  of CR or LF match dot. When all Unicode line
7608       endings are being recognized, dot does not match CR or LF or any of the
7609       other line ending characters.
7610
7611       The behaviour of dot with regard to newlines can  be  changed.  If  the
7612       PCRE2_DOTALL  option  is  set, a dot matches any one character, without
7613       exception.  If the two-character sequence CRLF is present in  the  sub-
7614       ject string, it takes two dots to match it.
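
       How dot interacts with the newline convention can be modelled as a
       single function. This Python sketch follows the wording of the
       paragraphs above; it is not PCRE2 code, and the convention labels and
       the name dot_matches are assumptions made for the illustration.

```python
# Line-ending characters when all Unicode newlines are recognized.
ANY_NEWLINE_CHARS = "\r\n\x0b\x0c\x85\u2028\u2029"

def dot_matches(subject, pos, newline="LF", dotall=False):
    """True if a dot would match the character at subject[pos]."""
    ch = subject[pos]
    if dotall:
        return True
    if newline == "CRLF":
        # Only a CR that is immediately followed by LF is refused.
        return not (ch == "\r" and subject[pos + 1:pos + 2] == "\n")
    if newline == "ANYCRLF":
        return ch not in "\r\n"
    if newline == "ANY":
        return ch not in ANY_NEWLINE_CHARS
    if newline == "CR":
        return ch != "\r"
    if newline == "LF":
        return ch != "\n"
    raise ValueError(f"unknown newline convention: {newline}")

print(dot_matches("a\r\nb", 1, "CRLF"), dot_matches("a\rb", 1, "CRLF"))
```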
7615
7616       The  handling of dot is entirely independent of the handling of circum-
7617       flex and dollar, the only relationship being  that  they  both  involve
7618       newlines. Dot has no special meaning in a character class.
7619
7620       The  escape  sequence  \N when not followed by an opening brace behaves
7621       like a dot, except that it is not affected by the PCRE2_DOTALL  option.
7622       In  other words, it matches any character except one that signifies the
7623       end of a line.
7624
7625       When \N is followed by an opening brace it has a different meaning. See
7626       the section entitled "Non-printing characters" above for details.  Perl
7627       also  uses  \N{name}  to specify characters by Unicode name; PCRE2 does
7628       not support this.
7629
7630
7631MATCHING A SINGLE CODE UNIT
7632
7633       Outside a character class, the escape sequence \C matches any one  code
7634       unit,  whether or not a UTF mode is set. In the 8-bit library, one code
7635       unit is one byte; in the 16-bit library it is a  16-bit  unit;  in  the
7636       32-bit  library  it  is  a 32-bit unit. Unlike a dot, \C always matches
7637       line-ending characters. The feature is provided in  Perl  in  order  to
7638       match individual bytes in UTF-8 mode, but it is unclear how it can use-
7639       fully be used.
7640
7641       Because  \C  breaks  up characters into individual code units, matching
7642       one unit with \C in UTF-8 or UTF-16 mode means that  the  rest  of  the
7643       string may start with a malformed UTF character. This has undefined re-
7644       sults, because PCRE2 assumes that it is matching character by character
7645       in a valid UTF string (by default it checks the subject string's valid-
7646       ity  at  the  start  of  processing  unless  the  PCRE2_NO_UTF_CHECK or
7647       PCRE2_MATCH_INVALID_UTF option is used).
7648
7649       An  application  can  lock  out  the  use  of   \C   by   setting   the
7650       PCRE2_NEVER_BACKSLASH_C  option  when  compiling  a pattern. It is also
7651       possible to build PCRE2 with the use of \C permanently disabled.
7652
7653       PCRE2 does not allow \C to appear in lookbehind  assertions  (described
7654       below)  in UTF-8 or UTF-16 modes, because this would make it impossible
7655       to calculate the length of  the  lookbehind.  Neither  the  alternative
7656       matching function pcre2_dfa_match() nor the JIT optimizer supports \C in
7657       these UTF modes.  The former gives a match-time error; the latter fails
7658       to optimize and so the match is always run using the interpreter.
7659
7660       In  the  32-bit  library, however, \C is always supported (when not ex-
7661       plicitly locked out) because it always  matches  a  single  code  unit,
7662       whether or not UTF-32 is specified.
7663
7664       In general, the \C escape sequence is best avoided. However, one way of
7665       using  it  that avoids the problem of malformed UTF-8 or UTF-16 charac-
7666       ters is to use a lookahead to check the length of the  next  character,
7667       as  in  this  pattern,  which could be used with a UTF-8 string (ignore
7668       white space and line breaks):
7669
7670         (?| (?=[\x00-\x7f])(\C) |
7671             (?=[\x80-\x{7ff}])(\C)(\C) |
7672             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
7673             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
7674
7675       In this example, a group that starts  with  (?|  resets  the  capturing
7676       parentheses  numbers in each alternative (see "Duplicate Group Numbers"
7677       below). The assertions at the start of each branch check the next UTF-8
7678       character for values whose encoding uses 1, 2, 3, or 4  bytes,  respec-
7679       tively.  The  character's individual bytes are then captured by the ap-
7680       propriate number of \C groups.
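
       The code point ranges tested by the four lookaheads correspond to the
       UTF-8 encoded lengths of 1 to 4 bytes. A minimal Python sketch of that
       correspondence follows; the helper name utf8_length is hypothetical
       and is not part of PCRE2.

```python
def utf8_length(code_point: int) -> int:
    """Return the number of bytes UTF-8 uses to encode one code point."""
    if code_point <= 0x7F:        # same range as (?=[\x00-\x7f])
        return 1
    if code_point <= 0x7FF:       # same range as (?=[\x80-\x{7ff}])
        return 2
    if code_point <= 0xFFFF:      # same range as (?=[\x{800}-\x{ffff}])
        return 3
    return 4                      # remaining code points need four bytes

# One sample character from each range.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [utf8_length(ord(ch)) for ch in samples]

# Cross-check against Python's own UTF-8 encoder.
encoded = [len(ch.encode("utf-8")) for ch in samples]
```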
7681
7682
7683SQUARE BRACKETS AND CHARACTER CLASSES
7684
7685       An opening square bracket introduces a character class, terminated by a
7686       closing square bracket. A closing square bracket on its own is not spe-
7687       cial by default.  If a closing square bracket is required as  a  member
7688       of the class, it should be the first data character in the class (after
7689       an  initial  circumflex,  if present) or escaped with a backslash. This
7690       means that, by default, an empty class cannot be defined.  However,  if
7691       the  PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
7692       the start does end the (empty) class.
7693
7694       A character class matches a single character in the subject. A  matched
7695       character must be in the set of characters defined by the class, unless
7696       the  first  character in the class definition is a circumflex, in which
7697       case the subject character must not be in the set defined by the class.
7698       If a circumflex is actually required as a member of the  class,  ensure
7699       it is not the first character, or escape it with a backslash.
7700
7701       For  example, the character class [aeiou] matches any lower case vowel,
7702       while [^aeiou] matches any character that is not a  lower  case  vowel.
7703       Note that a circumflex is just a convenient notation for specifying the
7704       characters  that  are in the class by enumerating those that are not. A
7705       class that starts with a circumflex is not an assertion; it still  con-
7706       sumes  a  character  from the subject string, and therefore it fails if
7707       the current pointer is at the end of the string.
7708
7709       Characters in a class may be specified by their code points  using  \o,
7710       \x,  or \N{U+hh..} in the usual way. When caseless matching is set, any
7711       letters in a class represent both their upper case and lower case  ver-
7712       sions,  so  for example, a caseless [aeiou] matches "A" as well as "a",
7713       and a caseless [^aeiou] does not match "A", whereas a  caseful  version
7714       would.  Note that there are two ASCII characters, K and S, that, in ad-
7715       dition to their lower case ASCII equivalents, are case-equivalent  with
7716       Unicode  U+212A (Kelvin sign) and U+017F (long S) respectively when ei-
7717       ther PCRE2_UTF or PCRE2_UCP is set.
7718
7719       Characters that might indicate line breaks are  never  treated  in  any
7720       special  way  when matching character classes, whatever line-ending se-
7721       quence is  in  use,  and  whatever  setting  of  the  PCRE2_DOTALL  and
7722       PCRE2_MULTILINE  options  is  used. A class such as [^a] always matches
7723       one of these characters.
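
       Two properties of negated classes described above can be checked with
       a short sketch. It uses Python's re module as a stand-in, on the
       assumption (true for these simple classes) that it behaves like PCRE2
       here.

```python
import re

# [^b] is not an assertion: it needs a character to consume, so the
# match fails when the current position is the end of the subject.
at_end = re.search(r"a[^b]", "a")
with_char = re.search(r"a[^b]", "ac")

# A newline gets no special treatment inside a class, so [^a] matches it.
newline = re.fullmatch(r"[^a]", "\n")
```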
7724
7725       The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
7726       \S, \v, \V, \w, and \W may appear in a character  class,  and  add  the
7727       characters  that  they  match  to  the  class.  For example, [\dABCDEF]
7728       matches any hexadecimal digit. In UTF modes, the PCRE2_UCP  option  af-
7729       fects the meanings of \d, \s, \w and their upper case partners, just as
7730       it does when they appear outside a character class, as described in the
7731       section  entitled  "Generic character types" above. The escape sequence
7732       \b has a different meaning inside a character  class;  it  matches  the
7733       backspace  character.  The sequences \B, \R, and \X are not special in-
7734       side a character class. Like any other unrecognized  escape  sequences,
7735       they  cause  an  error. The same is true for \N when not followed by an
7736       opening brace.
7737
7738       The minus (hyphen) character can be used to specify a range of  charac-
7739       ters  in  a  character class. For example, [d-m] matches any letter be-
7740       tween d and m, inclusive. If a minus character is required in a  class,
7741       it  must  be  escaped with a backslash or appear in a position where it
7742       cannot be interpreted as indicating a range, typically as the first  or
7743       last character in the class, or immediately after a range. For example,
7744       [b-d-z] matches letters in the range b to d, a hyphen character, or z.
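
       The two classes above can be explored by listing which candidate
       characters each one accepts. This sketch uses Python's re module,
       which is assumed to parse these particular classes the same way as
       PCRE2; class_members is a hypothetical helper.

```python
import re

def class_members(pattern: str, candidates: str) -> str:
    """Return the candidate characters that the class pattern matches."""
    return "".join(ch for ch in candidates if re.fullmatch(pattern, ch))

# A plain range: every letter from d to m, inclusive.
range_only = class_members(r"[d-m]", "abcdefghijklmnop-")

# A hyphen immediately after a range is a literal hyphen.
range_hyphen_z = class_members(r"[b-d-z]", "abcdefxyz-")
```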
7745
7746       Perl treats a hyphen as a literal if it appears before or after a POSIX
7747       class (see below) or before or after a character type escape such as \d
7748       or  \H.  However, unless the hyphen is the last character in the class,
7749       Perl outputs a warning in its warning mode, as this is  most  likely  a
7750       user  error. As PCRE2 has no facility for warning, an error is given in
7751       these cases.
7752
7753       It is not possible to have the literal character "]" as the end charac-
7754       ter of a range. A pattern such as [W-]46] is interpreted as a class  of
7755       two  characters ("W" and "-") followed by a literal string "46]", so it
7756       would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
7757       backslash  it is interpreted as the end of range, so [W-\]46] is inter-
7758       preted as a class containing a range followed by two other  characters.
7759       The  octal or hexadecimal representation of "]" can also be used to end
7760       a range.
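
       The difference between the escaped and unescaped forms can be seen in
       a small sketch. Python's re module is used as a stand-in and is
       assumed to treat these two classes as PCRE2 does.

```python
import re

# Unescaped "]" ends the class: [W-] is followed by the literal "46]".
literal_tail = re.compile(r"[W-]46]")

# Escaped "]" ends a range from "W" to "]"; "4" and "6" are extra members.
range_class = re.compile(r"[W-\]46]")
members = "".join(ch for ch in "VWXYZ[\\]^-46" if range_class.fullmatch(ch))
```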
7761
7762       Ranges normally include all code points between the start and end char-
7763       acters, inclusive. They can also be used for code points specified  nu-
7764       merically,  for  example [\000-\037]. Ranges can include any characters
7765       that are valid for the current mode. In any  UTF  mode,  the  so-called
7766       "surrogate"  characters (those whose code points lie between 0xd800 and
7767       0xdfff inclusive) may not  be  specified  explicitly  by  default  (the
7768       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES  option  disables this check). How-
7769       ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
7770       are always permitted.
7771
7772       There is a special case in EBCDIC environments  for  ranges  whose  end
7773       points are both specified as literal letters in the same case. For com-
7774       patibility  with Perl, EBCDIC code points within the range that are not
7775       letters are omitted. For example, [h-k] matches only  four  characters,
7776       even though the codes for h and k are 0x88 and 0x92, a range of 11 code
7777       points.  However,  if  the range is specified numerically, for example,
7778       [\x88-\x92] or [h-\x92], all code points are included.
7779
7780       If a range that includes letters is used when caseless matching is set,
7781       it matches the letters in either case. For example, [W-c] is equivalent
7782       to [][\\^_`wxyzabc], matched caselessly, and  in  a  non-UTF  mode,  if
7783       character  tables  for  a French locale are in use, [\xc8-\xcb] matches
7784       accented E characters in both cases.
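
       The stated equivalence for [W-c] can be checked over the ASCII
       characters. The sketch below uses Python's re module with its
       IGNORECASE and ASCII flags as an approximation of caseless matching
       in a non-UTF mode; it does not exercise PCRE2 itself.

```python
import re

flags = re.IGNORECASE | re.ASCII
as_range = re.compile(r"[W-c]", flags)
as_list = re.compile(r"[][\\^_`wxyzabc]", flags)

# Collect the ASCII characters that each class accepts.
ascii_chars = [chr(c) for c in range(128)]
range_set = {ch for ch in ascii_chars if as_range.fullmatch(ch)}
list_set = {ch for ch in ascii_chars if as_list.fullmatch(ch)}
```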
7785
7786       A circumflex can conveniently be used with  the  upper  case  character
7787       types  to specify a more restricted set of characters than the matching
7788       lower case type.  For example, the class [^\W_] matches any  letter  or
7789       digit, but not underscore, whereas [\w] includes underscore. A positive
7790       character class should be read as "something OR something OR ..." and a
7791       negative class as "NOT something AND NOT something AND NOT ...".
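
       A short sketch of the [^\W_] idiom follows, using Python's re module
       with the ASCII flag as a stand-in for PCRE2 without PCRE2_UCP.

```python
import re

letters_digits = re.compile(r"[^\W_]", re.ASCII)  # NOT non-word AND NOT "_"
word_chars = re.compile(r"[\w]", re.ASCII)        # letters, digits, and "_"

sample = "aZ5_-"
no_underscore = [ch for ch in sample if letters_digits.fullmatch(ch)]
with_underscore = [ch for ch in sample if word_chars.fullmatch(ch)]
```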
7792
7793       The  only  metacharacters  that are recognized in character classes are
7794       backslash, hyphen (only where it can be  interpreted  as  specifying  a
7795       range),  circumflex  (only  at the start), opening square bracket (only
7796       when it can be interpreted as introducing a POSIX class name, or for  a
7797       special  compatibility  feature  -  see the next two sections), and the
7798       terminating closing square bracket.  However,  escaping  other  non-al-
7799       phanumeric characters does no harm.
7800
7801
7802POSIX CHARACTER CLASSES
7803
7804       Perl supports the POSIX notation for character classes. This uses names
7805       enclosed  by [: and :] within the enclosing square brackets. PCRE2 also
7806       supports this notation. For example,
7807
7808         [01[:alpha:]%]
7809
7810       matches "0", "1", any alphabetic character, or "%". The supported class
7811       names are:
7812
7813         alnum    letters and digits
7814         alpha    letters
7815         ascii    character codes 0 - 127
7816         blank    space or tab only
7817         cntrl    control characters
7818         digit    decimal digits (same as \d)
7819         graph    printing characters, excluding space
7820         lower    lower case letters
7821         print    printing characters, including space
7822         punct    printing characters, excluding letters and digits and space
7823         space    white space (the same as \s from PCRE 8.34)
7824         upper    upper case letters
7825         word     "word" characters (same as \w)
7826         xdigit   hexadecimal digits
7827
7828       The default "space" characters are HT (9), LF (10), VT (11),  FF  (12),
7829       CR  (13),  and space (32). If locale-specific matching is taking place,
7830       the list of space characters may be different; there may  be  fewer  or
7831       more  of  them.  "Space" and \s match the same set of characters, as do
7832       "word" and \w.
7833
7834       The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
7835       from  Perl  5.8. Another Perl extension is negation, which is indicated
7836       by a ^ character after the colon. For example,
7837
7838         [12[:^digit:]]
7839
7840       matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
7841       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
7842       these are not supported, and an error is given if they are encountered.
7843
7844       By default, characters with values greater than 127 do not match any of
7845       the POSIX character classes, although this may be different for charac-
7846       ters in the range 128-255 when locale-specific matching  is  happening.
7847       However,  in UCP mode, unless certain options are set (see below), some
7848       of the classes are changed so that  Unicode  character  properties  are
7849       used. This is achieved by replacing POSIX classes with other sequences,
7850       as follows:
7851
7852         [:alnum:]  becomes  \p{Xan}
7853         [:alpha:]  becomes  \p{L}
7854         [:blank:]  becomes  \h
7855         [:cntrl:]  becomes  \p{Cc}
7856         [:digit:]  becomes  \p{Nd}
7857         [:lower:]  becomes  \p{Ll}
7858         [:space:]  becomes  \p{Xps}
7859         [:upper:]  becomes  \p{Lu}
7860         [:word:]   becomes  \p{Xwd}
7861
7862       Negated  versions,  such as [:^alpha:] use \P instead of \p. Four other
7863       POSIX classes are handled specially in UCP mode:
7864
7865       [:graph:] This matches characters that have glyphs that mark  the  page
7866                 when printed. In Unicode property terms, it matches all char-
7867                 acters with the L, M, N, P, S, or Cf properties, except for:
7868
7869                   U+061C           Arabic Letter Mark
7870                   U+180E           Mongolian Vowel Separator
7871                   U+2066 - U+2069  Various "isolate"s
7872
7873
7874       [:print:] This  matches  the  same  characters  as [:graph:] plus space
7875                 characters that are not controls, that  is,  characters  with
7876                 the Zs property.
7877
7878       [:punct:] This matches all characters that have the Unicode P (punctua-
7879                 tion)  property,  plus those characters with code points less
7880                 than 256 that have the S (Symbol) property.
7881
7882       [:xdigit:]
7883                 In addition  to  the  ASCII  hexadecimal  digits,  this  also
7884                 matches  the  "fullwidth" versions of those characters, whose
7885                 Unicode code points start at U+FF10. This is  a  change  that
7886                 was made in PCRE2 release 10.43 for Perl compatibility.
7887
7888       The  other  POSIX  classes  are  unchanged by PCRE2_UCP, and match only
7889       characters with code points less than 256.
7890
7891       There are two options that can be used to restrict the POSIX classes to
7892       ASCII  characters  when  PCRE2_UCP  is  set.   The   option   PCRE2_EX-
7893       TRA_ASCII_DIGIT  affects  just  [:digit:] and [:xdigit:]. Within a pat-
7894       tern, this can be set and unset by  (?aT)  and  (?-aT).  The  PCRE2_EX-
7895       TRA_ASCII_POSIX  option  disables UCP processing for all POSIX classes,
7896       including [:digit:] and [:xdigit:]. Within a pattern, (?aP) and  (?-aP)
7897       set and unset both these options for consistency.
7898
7899
7900COMPATIBILITY FEATURE FOR WORD BOUNDARIES
7901
7902       In  the POSIX.2 compliant library that was included in 4.4BSD Unix, the
7903       ugly syntax [[:<:]] and [[:>:]] is used for matching  "start  of  word"
7904       and "end of word". PCRE2 treats these items as follows:
7905
7906         [[:<:]]  is converted to  \b(?=\w)
7907         [[:>:]]  is converted to  \b(?<=\w)
7908
7909       Only these exact character sequences are recognized. A sequence such as
7910       [a[:<:]b]  provokes an error for an unrecognized POSIX class name. This
7911       support is not compatible with Perl. It is provided to help  migrations
7912       from other environments, and is best not used in any new patterns. Note
7913       that  \b matches at the start and the end of a word (see "Simple asser-
7914       tions" above), and in a Perl-style pattern the preceding  or  following
7915       character  normally shows which is wanted, without the need for the as-
7916       sertions that are used above in order to give exactly the POSIX  behav-
7917       iour.  Note  also  that  the PCRE2_UCP option changes the meaning of \w
7918       (and therefore \b) by default, so  it  also  affects  these  POSIX  se-
7919       quences.
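
       The two converted forms are ordinary assertions, so their effect can
       be sketched with any engine that supports \b and lookarounds. The
       example below uses Python's re module, which does not accept the
       [[:<:]] syntax itself; only the converted patterns are shown.

```python
import re

START_OF_WORD = r"\b(?=\w)"   # what [[:<:]] is converted to
END_OF_WORD = r"\b(?<=\w)"    # what [[:>:]] is converted to

subject = "hi there"

# Both patterns match the empty string, so record the match positions.
starts = [m.start() for m in re.finditer(START_OF_WORD, subject)]
ends = [m.start() for m in re.finditer(END_OF_WORD, subject)]
```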
7920
7921
7922VERTICAL BAR
7923
7924       Vertical  bar characters are used to separate alternative patterns. For
7925       example, the pattern
7926
7927         gilbert|sullivan
7928
7929       matches either "gilbert" or "sullivan". Any number of alternatives  may
7930       appear,  and  an  empty  alternative  is  permitted (matching the empty
7931       string). The matching process tries each alternative in turn, from left
7932       to right, and the first one that succeeds is used. If the  alternatives
7933       are  within a group (defined below), "succeeds" means matching the rest
7934       of the main pattern as well as the alternative in the group.
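
       The left-to-right rule, and the meaning of "succeeds" inside a group,
       can be sketched as follows. Python's re module is used as a stand-in;
       it is assumed to share this backtracking behaviour with the
       traditional PCRE2 matching function.

```python
import re

# The first alternative that succeeds is used, even if a later one
# would match more of the subject.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a group, an alternative only "succeeds" if the rest of the
# pattern matches too, so here the second alternative is chosen.
in_group = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"gilbert|sullivan|", "")
```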
7935
7936
7937INTERNAL OPTION SETTING
7938
7939       The settings of several options can be changed within a  pattern  by  a
7940       sequence  of  letters  enclosed between "(?" and ")". The following are
7941       Perl-compatible, and are described in detail in the pcre2api documenta-
7942       tion. The option letters are:
7943
7944         i  for PCRE2_CASELESS
7945         m  for PCRE2_MULTILINE
7946         n  for PCRE2_NO_AUTO_CAPTURE
7947         s  for PCRE2_DOTALL
7948         x  for PCRE2_EXTENDED
7949         xx for PCRE2_EXTENDED_MORE
7950
7951       For example, (?im) sets caseless, multiline matching. It is also possi-
7952       ble to unset these options by preceding the relevant letters with a hy-
7953       phen, for example (?-im). The two "extended" options are  not  indepen-
7954       dent; unsetting either one cancels the effects of both of them.
7955
7956       A   combined  setting  and  unsetting  such  as  (?im-sx),  which  sets
7957       PCRE2_CASELESS and PCRE2_MULTILINE  while  unsetting  PCRE2_DOTALL  and
7958       PCRE2_EXTENDED,  is  also  permitted. Only one hyphen may appear in the
7959       options string. If a letter appears both before and after  the  hyphen,
7960       the  option  is unset. An empty options setting "(?)" is allowed. Need-
7961       less to say, it has no effect.
7962
7963       If the first character following (? is a circumflex, it causes  all  of
7964       the  above  options  to  be unset. Letters may follow the circumflex to
7965       cause some options to be re-instated, but a hyphen may not appear.
7966
7967       Some PCRE2-specific options can be changed by the same mechanism  using
7968       these pairs or individual letters:
7969
7970         aD for PCRE2_EXTRA_ASCII_BSD
7971         aS for PCRE2_EXTRA_ASCII_BSS
7972         aW for PCRE2_EXTRA_ASCII_BSW
7973         aP for PCRE2_EXTRA_ASCII_POSIX and PCRE2_EXTRA_ASCII_DIGIT
7974         aT for PCRE2_EXTRA_ASCII_DIGIT
7975         r  for PCRE2_EXTRA_CASELESS_RESTRICT
7976         J  for PCRE2_DUPNAMES
7977         U  for PCRE2_UNGREEDY
7978
7979       However,  except for 'r', these are not unset by (?^), which is equiva-
7980       lent to (?-imnrsx). If 'a' is not followed by any  of  the  upper  case
7981       letters shown above, it sets (or unsets) all the ASCII options.
7982
7983       PCRE2_EXTRA_ASCII_DIGIT   has   no  additional  effect  when  PCRE2_EX-
7984       TRA_ASCII_POSIX is set, but including it in  (?aP)  means  that  (?-aP)
7985       suppresses all ASCII restrictions for POSIX classes.
7986
7987       When  one of these option changes occurs at top level (that is, not in-
7988       side group parentheses), the change applies until a subsequent  change,
7989       or  the  end of the pattern. An option change within a group (see below
7990       for a description of groups) affects only that part of the  group  that
7991       follows  it.  At  the  end  of the group these options are reset to the
7992       state they were before the group. For example,
7993
7994         (a(?i)b)c
7995
7996       matches abc and aBc and no other strings  (assuming  PCRE2_CASELESS  is
7997       not  set  externally).  Any changes made in one alternative do carry on
7998       into subsequent branches within the same group. For example,
7999
8000         (a(?i)b|c)
8001
8002       matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
8003       first  branch  is  abandoned before the option setting. This is because
8004       the effects of option settings happen at compile time. There  would  be
8005       some very weird behaviour otherwise.
8006
8007       As  a  convenient shorthand, if any option settings are required at the
8008       start of a non-capturing group (see the next section), the option  let-
8009       ters may appear between the "?" and the ":". Thus the two patterns
8010
8011         (?i:saturday|sunday)
8012         (?:(?i)saturday|sunday)
8013
8014       match exactly the same set of strings.
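
       The (?i:...) shorthand can be tried in Python's re module, which
       supports the same syntax. Recent Python versions reject an option
       setting such as (?i) in the middle of a pattern, so the second
       pattern below is a rewriting (an assumption of this sketch, not PCRE2
       syntax advice) of (a(?i)b)c using the scoped form.

```python
import re

# The option applies to every alternative inside the group.
weekend = re.compile(r"(?i:saturday|sunday)")

# Caseless matching applies to "b" only; "a" and "c" remain caseful.
scoped = re.compile(r"(a(?i:b))c")
accepted = [s for s in ["abc", "aBc", "Abc", "abC"] if scoped.fullmatch(s)]
```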
8015
8016       Note:  There  are  other  PCRE2-specific options, applying to the whole
8017       pattern, which can be set by the application when the  compiling  func-
8018       tion  is  called.  In addition, the pattern can contain special leading
8019       sequences such as (*CRLF) to override what the application has  set  or
8020       what  has  been  defaulted.   Details are given in the section entitled
8021       "Newline sequences" above. There are also the (*UTF) and (*UCP) leading
8022       sequences that can be used to set UTF and Unicode property modes;  they
8023       are  equivalent to setting the PCRE2_UTF and PCRE2_UCP options, respec-
8024       tively.  However,  the  application  can  set  the  PCRE2_NEVER_UTF  or
8025       PCRE2_NEVER_UCP  options,  which  lock  out  the  use of the (*UTF) and
8026       (*UCP) sequences.
8027
8028
8029GROUPS
8030
8031       Groups are delimited by parentheses  (round  brackets),  which  can  be
8032       nested.  Turning part of a pattern into a group does two things:
8033
8034       1. It localizes a set of alternatives. For example, the pattern
8035
8036         cat(aract|erpillar|)
8037
8038       matches  "cataract",  "caterpillar", or "cat". Without the parentheses,
8039       it would match "cataract", "erpillar" or an empty string.
8040
8041       2. It creates a "capture group". This means that, when the  whole  pat-
8042       tern  matches, the portion of the subject string that matched the group
8043       is passed back to the caller, separately from the portion that  matched
8044       the  whole  pattern.   (This  applies  only to the traditional matching
8045       function; the DFA matching function does not support capturing.)
8046
8047       Opening parentheses are counted from left to right (starting from 1) to
8048       obtain numbers for capture groups. For example, if the string "the  red
8049       king" is matched against the pattern
8050
8051         the ((red|white) (king|queen))
8052
8053       the captured substrings are "red king", "red", and "king", and are num-
8054       bered 1, 2, and 3, respectively.
8055
8056       The  fact  that  plain  parentheses  fulfil two functions is not always
8057       helpful.  There are often times when grouping is required without  cap-
8058       turing.  If an opening parenthesis is followed by a question mark and a
8059       colon, the group does not do any capturing, and  is  not  counted  when
8060       computing  the number of any subsequent capture groups. For example, if
8061       the string "the white queen" is matched against the pattern
8062
8063         the ((?:red|white) (king|queen))
8064
8065       the captured substrings are "white queen" and "queen", and are numbered
8066       1 and 2. The maximum number of capture groups is 65535.
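
       The numbering rules above can be checked with a short sketch. It uses
       Python's re module, which is assumed to number capture groups the
       same way as PCRE2 for these patterns.

```python
import re

# Grouping localizes the alternatives; the empty one allows plain "cat".
alts = re.compile(r"cat(aract|erpillar|)")
plain_cat = alts.fullmatch("cat")

# Opening parentheses are counted from left to right, starting from 1.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")

# A (?: group is not counted, so (king|queen) becomes group 2.
non_capturing = re.fullmatch(
    r"the ((?:red|white) (king|queen))", "the white queen"
)
```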
8067
8068       As a convenient shorthand, if any option settings are required  at  the
8069       start  of  a non-capturing group, the option letters may appear between
8070       the "?" and the ":". Thus the two patterns
8071
8072         (?i:saturday|sunday)
8073         (?:(?i)saturday|sunday)
8074
8075       match exactly the same set of strings. Because alternative branches are
8076       tried from left to right, and options are not reset until  the  end  of
8077       the  group is reached, an option setting in one branch does affect sub-
8078       sequent branches, so the above patterns match "SUNDAY" as well as "Sat-
8079       urday".
8080
8081
8082DUPLICATE GROUP NUMBERS
8083
8084       Perl 5.10 introduced a feature whereby each alternative in a group uses
8085       the same numbers for its capturing parentheses.  Such  a  group  starts
8086       with  (?|  and  is  itself a non-capturing group. For example, consider
8087       this pattern:
8088
8089         (?|(Sat)ur|(Sun))day
8090
8091       Because the two alternatives are inside a (?| group, both sets of  cap-
8092       turing  parentheses  are  numbered one. Thus, when the pattern matches,
8093       you can look at captured substring number  one,  whichever  alternative
8094       matched.  This  construct  is useful when you want to capture part, but
8095       not all, of one of a number of alternatives. Inside a (?| group, paren-
8096       theses are numbered as usual, but the number is reset at the  start  of
8097       each  branch.  The numbers of any capturing parentheses that follow the
8098       whole group start after the highest number used in any branch. The fol-
8099       lowing example is taken from the Perl documentation. The numbers under-
8100       neath show in which buffer the captured content will be stored.
8101
8102         # before  ---------------branch-reset----------- after
8103         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
8104         # 1            2         2  3        2     3     4
8105
8106       A backreference to a capture group uses the most recent value  that  is
8107       set for the group. The following pattern matches "abcabc" or "defdef":
8108
8109         /(?|(abc)|(def))\1/
8110
8111       In  contrast, a subroutine call to a capture group always refers to the
8112       first one in the pattern with the given number. The  following  pattern
8113       matches "abcabc" or "defabc":
8114
8115         /(?|(abc)|(def))(?1)/
8116
8117       A relative reference such as (?-1) is no different: it is just a conve-
8118       nient way of computing an absolute group number.
8119
8120       If a condition test for a group's having matched refers to a non-unique
8121       number, the test is true if any group with that number has matched.
8122
8123       An  alternative approach to using this "branch reset" feature is to use
8124       duplicate named groups, as described in the next section.
8125
8126
8127NAMED CAPTURE GROUPS
8128
8129       Identifying capture groups by number is simple, but it can be very hard
8130       to keep track of the numbers in complicated patterns.  Furthermore,  if
8131       an  expression  is  modified, the numbers may change. To help with this
8132       difficulty, PCRE2 supports the naming of capture groups.  This  feature
8133       was  not  added to Perl until release 5.10. Python had the feature ear-
8134       lier, and PCRE1 introduced it at release 4.0, using the Python  syntax.
8135       PCRE2 supports both the Perl and the Python syntax.
8136
8137       In  PCRE2,  a  capture  group  can  be  named  in  one  of  three ways:
8138       (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python.
8139       Names may be up to 128 code units long. When PCRE2_UTF is not set, they
8140       may contain only ASCII alphanumeric  characters  and  underscores,  but
8141       must start with a non-digit. When PCRE2_UTF is set, the syntax of group
8142       names is extended to allow any Unicode letter or Unicode decimal digit.
8143       In other words, group names must match one of these patterns:
8144
8145         ^[_A-Za-z][_A-Za-z0-9]*\z   when PCRE2_UTF is not set
8146         ^[_\p{L}][_\p{L}\p{Nd}]*\z  when PCRE2_UTF is set
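
       As a minimal sketch, the rule for the non-UTF case can be turned into
       a validity check. The helper is_valid_group_name is hypothetical and
       is not part of the PCRE2 API; it assumes ASCII names, for which one
       character is one code unit.

```python
import re

MAX_NAME_LENGTH = 128  # limit stated above, in code units
ASCII_NAME = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")

def is_valid_group_name(name: str) -> bool:
    """Check a group name against the rule for when PCRE2_UTF is not set."""
    if len(name) > MAX_NAME_LENGTH:
        return False
    return ASCII_NAME.fullmatch(name) is not None
```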
8147
8148       References  to  capture groups from other parts of the pattern, such as
8149       backreferences, recursion, and conditions, can all be made by  name  as
8150       well as by number.
8151
8152       Named capture groups are allocated numbers as well as names, exactly as
8153       if  the  names were not present. In both PCRE2 and Perl, capture groups
8154       are primarily identified by numbers; any names  are  just  aliases  for
8155       these numbers. The PCRE2 API provides function calls for extracting the
8156       complete  name-to-number  translation table from a compiled pattern, as
8157       well as convenience functions for  extracting  captured  substrings  by
8158       name.
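
       The idea that names are aliases for numbers can be sketched in Python,
       whose re module accepts the (?P<name>...) syntax. Its groupindex
       attribute is only an analogue of the PCRE2 name-to-number table, not
       the PCRE2 API itself. The pattern and subject are illustrative.

```python
import re

pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
match = pattern.fullmatch("2024-05")

# Named groups still receive numbers, in left-to-right order.
name_table = dict(pattern.groupindex)

# Extraction by name and by number yields the same substring.
by_name = match.group("year")
by_number = match.group(1)
```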
8159
8160       Warning:  When  more than one capture group has the same number, as de-
8161       scribed in the previous section, a name given to one of them applies to
8162       all of them. Perl allows identically numbered groups to have  different
8163       names.  Consider this pattern, where there are two capture groups, both
8164       numbered 1:
8165
8166         (?|(?<AA>aa)|(?<BB>bb))
8167
8168       Perl  allows  this,  with  both  names AA and BB as aliases of group 1.
8169       Thus, after a successful match, both names yield the same value (either
8170       "aa" or "bb").
8171
8172       In an attempt to reduce confusion, PCRE2 does not allow the same  group
8173       number to be associated with more than one name. The example above pro-
8174       vokes  a  compile-time  error. However, there is still scope for confu-
8175       sion. Consider this pattern:
8176
8177         (?|(?<AA>aa)|(bb))
8178
8179       Although the second group number 1 is not explicitly named, the name AA
8180       is still an alias for any group 1. Whether the pattern matches "aa"  or
8181       "bb", a reference by name to group AA yields the matched string.
8182
8183       By  default, a name must be unique within a pattern, except that dupli-
8184       cate names are permitted for groups with the same number, for example:
8185
8186         (?|(?<AA>aa)|(?<AA>bb))
8187
8188       The duplicate name constraint can be disabled by setting the PCRE2_DUP-
8189       NAMES option at compile time, or by the use of (?J) within the pattern,
8190       as described in the section entitled "Internal Option Setting" above.
8191
8192       Duplicate names can be useful for patterns where only one  instance  of
8193       the  named  capture group can match. Suppose you want to match the name
8194       of a weekday, either as a 3-letter abbreviation or as  the  full  name,
8195       and  in  both  cases you want to extract the abbreviation. This pattern
8196       (ignoring the line breaks) does the job:
8197
8198         (?J)
8199         (?<DN>Mon|Fri|Sun)(?:day)?|
8200         (?<DN>Tue)(?:sday)?|
8201         (?<DN>Wed)(?:nesday)?|
8202         (?<DN>Thu)(?:rsday)?|
8203         (?<DN>Sat)(?:urday)?
8204
8205       There are five capture groups, but only one is ever set after a  match.
8206       The  convenience  functions  for extracting the data by name return the
8207       substring for the first (and in this example, the only) group  of  that
8208       name that matched. This saves searching to find which numbered group it
8209       was.  (An  alternative  way of solving this problem is to use a "branch
8210       reset" group, as described in the previous section.)
8211
8212       If you make a backreference to a non-unique named group from  elsewhere
8213       in  the pattern, the groups to which the name refers are checked in the
8214       order in which they appear in the overall pattern. The first  one  that
8215       is  set  is  used  for the reference. For example, this pattern matches
8216       both "foofoo" and "barbar" but not "foobar" or "barfoo":
8217
8218         (?J)(?:(?<n>foo)|(?<n>bar))\k<n>
8219
8220
8221       If you make a subroutine call to a non-unique named group, the one that
8222       corresponds to the first occurrence of the name is used. In the absence
8223       of duplicate numbers this is the one with the lowest number.
8224
8225       If you use a named reference in a condition test (see the section about
8226       conditions below), either to check whether a capture group has matched,
8227       or to check for recursion, all groups with the same name are tested. If
8228       the condition is true for any one of them,  the  overall  condition  is
8229       true.  This is the same behaviour as testing by number. For further de-
8230       tails of the interfaces for handling  named  capture  groups,  see  the
8231       pcre2api documentation.
8232
8233
8234REPETITION
8235
8236       Repetition  is  specified  by  quantifiers, which may follow any one of
8237       these items:
8238
8239         a literal data character
8240         the dot metacharacter
8241         the \C escape sequence
8242         the \R escape sequence
8243         the \X escape sequence
8244         any escape sequence that matches a single character
8245         a character class
8246         a backreference
8247         a parenthesized group (including lookaround assertions)
8248         a subroutine call (recursive or otherwise)
8249
8250       If a quantifier does not follow a repeatable item, an error occurs. The
8251       general repetition quantifier specifies a minimum and maximum number of
8252       permitted matches by giving two numbers  in  curly  brackets  (braces),
8253       separated  by  a  comma.  The  numbers must be less than 65536, and the
8254       first must be less than or equal to the second. For example,
8255
8256         z{2,4}
8257
8258       matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a
8259       special  character.  If  the second number is omitted, but the comma is
8260       present, there is no upper limit; if the second number  and  the  comma
8261       are  both omitted, the quantifier specifies an exact number of required
8262       matches. Thus
8263
8264         [aeiou]{3,}
8265
8266       matches at least 3 successive vowels, but may match many more, whereas
8267
8268         \d{8}
8269
8270       matches exactly 8 digits. If the first number  is  omitted,  the  lower
8271       limit is taken as zero; in this case the upper limit must be present.
8272
8273         X{,4} is interpreted as X{0,4}
8274
8275       This  is  a  change in behaviour that happened in Perl 5.34.0 and PCRE2
8276       10.43. In earlier versions such a sequence was  not  interpreted  as  a
8277       quantifier. Other regular expression engines may behave either way.
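
       The  quantifier forms above can be tried out with the following sketch.
       It uses Python's re module rather than PCRE2; for these particular forms
       the  two  engines  are assumed to agree, including Python's reading of a
       missing lower limit as zero. The helper name and the  subject  strings
       are invented for the example.

```python
import re

def matches(pattern, subject):
    """Return the text of every non-overlapping match of pattern in subject."""
    return [m.group(0) for m in re.finditer(pattern, subject)]

# {2,4}: between two and four repetitions, as many as possible.
bounded = matches(r"z{2,4}", "z zz zzz zzzz zzzzz")

# {3,}: at least three repetitions, no upper limit.
open_ended = matches(r"[aeiou]{3,}", "queue aeiou boo")

# {8}: exactly eight repetitions.
exact = matches(r"\d{8}", "20240131 1234567")

# {,4}: Python treats the missing lower limit as zero, i.e. as {0,4}.
no_lower_limit = [bool(re.fullmatch(r"X{,4}", s)) for s in ("", "XXXX", "XXXXX")]
```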
8278
8279       If  the characters that follow an opening brace do not match the syntax
8280       of a quantifier, the brace is taken as a literal character. In particu-
8281       lar, this means that {,} is a literal string of three characters.
8282
8283       Note that not every opening brace is potentially the start of a quanti-
8284       fier because braces are used  in  other  items  such  as  \N{U+345}  or
8285       \k{name}.
8286
8287       In UTF modes, quantifiers apply to characters rather than to individual
8288       code  units. Thus, for example, \x{100}{2} matches two characters, each
8289       of which is represented by a two-byte sequence in a UTF-8 string. Simi-
8290       larly, \X{3} matches three Unicode extended grapheme clusters, each  of
8291       which  may  be  several  code  units long (and they may be of different
8292       lengths).
8293
8294       The quantifier {0} is permitted, causing the expression to behave as if
8295       the previous item and the quantifier were not present. This may be use-
8296       ful for capture groups that are referenced as  subroutines  from  else-
8297       where  in the pattern (but see also the section entitled "Defining cap-
8298       ture groups for use by reference only" below). Except for parenthesized
8299       groups, items that have a {0} quantifier are omitted from the  compiled
8300       pattern.
8301
8302       For  convenience, the three most common quantifiers have single-charac-
8303       ter abbreviations:
8304
8305         *    is equivalent to {0,}
8306         +    is equivalent to {1,}
8307         ?    is equivalent to {0,1}
8308
8309       It is possible to construct infinite loops by following  a  group  that
8310       can  match no characters with a quantifier that has no upper limit, for
8311       example:
8312
8313         (a?)*
8314
8315       Earlier versions of Perl and PCRE1 used to give  an  error  at  compile
8316       time for such patterns. However, because there are cases where this can
8317       be useful, such patterns are now accepted, but whenever an iteration of
8318       such  a group matches no characters, matching moves on to the next item
8319       in the pattern instead of repeatedly matching  an  empty  string.  This
8320       does  not  prevent  backtracking into any of the iterations if a subse-
8321       quent item fails to match.
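
       A small sketch of this behaviour, using Python's re module as a stand-in
       (it is not PCRE2, but it also accepts such patterns and also  terminates
       on them). The subject strings are invented for the example.

```python
import re

# (a?) can match no characters, and * has no upper limit,
# yet matching terminates instead of looping on the empty string.
whole = re.fullmatch(r"(a?)*", "aaa")
nothing = re.fullmatch(r"(a?)*", "")

# Backtracking into the iterations still happens when a later item
# needs it: the repeat must give back one "a" for the trailing "ab".
followed = re.fullmatch(r"(a?)*ab", "aaab")
```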
8322
8323       By default, quantifiers are "greedy", that is, they match  as  much  as
8324       possible  (up  to the maximum number of permitted repetitions), without
8325       causing the rest of the pattern to fail. The classic example  of  where
8326       this gives problems is in trying to match comments in C programs. These
8327       appear  between  /*  and  */ and within the comment, individual * and /
8328       characters may appear. An attempt to match C comments by  applying  the
8329       pattern
8330
8331         /\*.*\*/
8332
8333       to the string
8334
8335         /* first comment */  not comment  /* second comment */
8336
8337       fails,  because it matches the entire string owing to the greediness of
8338       the .*  item. However, if a quantifier is followed by a question  mark,
8339       it ceases to be greedy, and instead matches the minimum number of times
8340       possible, so the pattern
8341
8342         /\*.*?\*/
8343
8344       does  the right thing with C comments. The meaning of the various quan-
8345       tifiers is not otherwise changed, just the preferred number of matches.
8346       Do not confuse this use of question mark with its use as  a  quantifier
8347       in  its  own  right.   Because it has two uses, it can sometimes appear
8348       doubled, as in
8349
8350         \d??\d
8351
8352       which matches one digit by preference, but can match two if that is the
8353       only way the rest of the pattern matches.
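
       The  difference  between  greedy  and minimizing repeats can be checked
       with a sketch like the one below. It uses Python's re module, which  is
       assumed  to  behave  like PCRE2 for these patterns; the variable names
       are invented for the example.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last */ in the subject.
greedy = re.search(r"/\*.*\*/", subject).group(0)

# Minimizing .*? stops at the first */ it can reach.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest requires it.
prefers_one = re.match(r"\d??\d", "12").group(0)
forced_two = re.fullmatch(r"\d??\d", "12").group(0)
```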
8354
8355       If the PCRE2_UNGREEDY option is set (an option that is not available in
8356       Perl), the quantifiers are not greedy by default, but  individual  ones
8357       can  be  made  greedy  by following them with a question mark. In other
8358       words, it inverts the default behaviour.
8359
8360       When a parenthesized group is quantified with a  minimum  repeat  count
8361       that  is  greater  than 1 or with a limited maximum, more memory is re-
8362       quired for the compiled pattern, in proportion to the size of the mini-
8363       mum or maximum.
8364
8365       If a pattern starts with  .*  or  .{0,}  and  the  PCRE2_DOTALL  option
8366       (equivalent  to  Perl's /s) is set, thus allowing the dot to match new-
8367       lines, the pattern is implicitly  anchored,  because  whatever  follows
8368       will  be  tried against every character position in the subject string,
8369       so there is no point in retrying the overall match at any position  af-
8370       ter  the  first. PCRE2 normally treats such a pattern as though it were
8371       preceded by \A.
8372
8373       In cases where it is known that the subject  string  contains  no  new-
8374       lines,  it  is worth setting PCRE2_DOTALL in order to obtain this opti-
8375       mization, or alternatively, using ^ to indicate anchoring explicitly.
8376
8377       However, there are some cases where the optimization  cannot  be  used.
8378       When  .*   is  inside  capturing  parentheses that are the subject of a
8379       backreference elsewhere in the pattern, a match at the start  may  fail
8380       where a later one succeeds. Consider, for example:
8381
8382         (.*)abc\1
8383
8384       If  the subject is "xyz123abc123" the match point is the fourth charac-
8385       ter. For this reason, such a pattern is not implicitly anchored.
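
       The  following sketch shows why anchoring would be wrong for this pat-
       tern. It uses Python's re module (not PCRE2), where match() tries  only
       the  first position and search() tries every position; this stands in
       for an anchored and an unanchored PCRE2 match respectively.

```python
import re

pattern = re.compile(r"(.*)abc\1")
subject = "xyz123abc123"

# Tried at the first character only: fails, because whatever (.*)
# captures from position 0 must be repeated after "abc".
anchored = pattern.match(subject)

# Tried at every starting position: succeeds at the fourth character.
unanchored = pattern.search(subject)
```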
8386
8387       Another case where implicit anchoring is not applied is when the  lead-
8388       ing  .* is inside an atomic group. Once again, a match at the start may
8389       fail where a later one succeeds. Consider this pattern:
8390
8391         (?>.*?a)b
8392
8393       It matches "ab" in the subject "aab". The use of the backtracking  con-
8394       trol  verbs  (*PRUNE)  and (*SKIP) also disables this optimization, and
8395       there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
8396
8397       When a capture group is repeated, the value captured is  the  substring
8398       that matched the final iteration. For example, after
8399
8400         (tweedle[dume]{3}\s*)+
8401
8402       has matched "tweedledum tweedledee" the value of the captured substring
8403       is  "tweedledee". However, if there are nested capture groups, the cor-
8404       responding captured values may have been set  in  previous  iterations.
8405       For example, after
8406
8407         (a|(b))+
8408
8409       matches "aba" the value of the second captured substring is "b".
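
       Both examples can be reproduced with the sketch below. It uses Python's
       re module, which is a different engine but is assumed to keep captured
       values in the same way for these two patterns.

```python
import re

# The capture holds what the final iteration of the group matched.
repeated = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = repeated.group(1)

# Group 2 is not used in the final iteration ("a"), so it keeps the
# value that was set in the previous iteration ("b").
nested = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = nested.group(1), nested.group(2)
```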
8410
8411
8412ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
8413
8414       With  both  maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
8415       repetition, failure of what follows normally causes the  repeated  item
8416       to  be  re-evaluated to see if a different number of repeats allows the
8417       rest of the pattern to match. Sometimes it is useful to  prevent  this,
8418       either to change the nature of the match, or to cause it to fail earlier
8419       than it otherwise might, when the author of the pattern knows there  is
8420       no point in carrying on.
8421
8422       Consider,  for  example, the pattern \d+foo when applied to the subject
8423       line
8424
8425         123456bar
8426
8427       After matching all 6 digits and then failing to match "foo", the normal
8428       action of the matcher is to try again with only 5 digits  matching  the
8429       \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
8430       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
8431       the means for specifying that once a group has matched, it is not to be
8432       re-evaluated in this way.
8433
8434       If  we  use atomic grouping for the previous example, the matcher gives
8435       up immediately on failing to match "foo" the first time.  The  notation
8436       is a kind of special parenthesis, starting with (?> as in this example:
8437
8438         (?>\d+)foo
8439
8440       Perl  5.28  introduced an experimental alphabetic form starting with (*
8441       which may be easier to remember:
8442
8443         (*atomic:\d+)foo
8444
8445       This kind of parenthesized group "locks up" the part of the pattern  it
8446       contains once it has matched, and a failure further into the pattern is
8447       prevented  from  backtracking into it. Backtracking past it to previous
8448       items, however, works as normal.
8449
8450       An alternative description is that a group of this type matches exactly
8451       the string of characters that an  identical  standalone  pattern  would
8452       match, if anchored at the current point in the subject string.
8453
8454       Atomic  groups  are  not capture groups. Simple cases such as the above
8455       example can be thought of as a  maximizing  repeat  that  must  swallow
8456       everything  it can.  So, while both \d+ and \d+? are prepared to adjust
8457       the number of digits they match in order to make the rest of  the  pat-
8458       tern match, (?>\d+) can only match an entire sequence of digits.
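
       The  effect  of  an  atomic  group  on the outcome of a match can be ob-
       served with a sketch like this one, written for Python's re module  (not
       PCRE2).  Python  accepts  (?>...) only from version 3.11, so the hypo-
       thetical helper atomic() falls back to a lookahead  that  captures  and
       a  backreference  that consumes, which is an emulation and not the PCRE2
       syntax. The subjects are invented for the example.

```python
import re
import sys

def atomic(body, name="atom"):
    """Wrap body so that it cannot be backtracked into (hypothetical helper)."""
    if sys.version_info >= (3, 11):
        return "(?>" + body + ")"
    # Older Python: a lookahead is itself atomic, so capture inside it
    # and then consume exactly the captured text with a backreference.
    return "(?=(?P<{0}>{1}))(?P={0})".format(name, body)

# \d+ is prepared to give back the final digit so that "6" can match.
adjusting = re.search(r"\d+6", "123456")

# The atomic group swallows every digit and never gives one back,
# so nothing is left for the literal "6".
locked = re.search(atomic(r"\d+") + "6", "123456")

# When the text after the digits really is there, both forms agree.
still_matches = re.search(atomic(r"\d+") + "foo", "123456foo")
```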
8459
8460       Atomic  groups in general can of course contain arbitrarily complicated
8461       expressions, and can be nested. However, when the contents of an atomic
8462       group are just a single repeated item, as in the example above, a  sim-
8463       pler notation, called a "possessive quantifier", can be used. This con-
8464       sists of an additional + character following a quantifier.  Using  this
8465       notation, the previous example can be rewritten as
8466
8467         \d++foo
8468
8469       Note that a possessive quantifier can be used with an entire group, for
8470       example:
8471
8472         (abc|xyz){2,3}+
8473
8474       Possessive  quantifiers are always greedy; the setting of the PCRE2_UN-
8475       GREEDY option is ignored. They are a convenient notation for  the  sim-
8476       pler  forms  of  atomic  group.  However, there is no difference in the
8477       meaning of a possessive quantifier and  the  equivalent  atomic  group,
8478       though  there  may  be a performance difference; possessive quantifiers
8479       should be slightly faster.
8480
8481       The possessive quantifier syntax is an extension to the Perl  5.8  syn-
8482       tax.   Jeffrey  Friedl  originated the idea (and the name) in the first
8483       edition of his book. Mike McCloskey liked it, so implemented it when he
8484       built Sun's Java package, and PCRE1 copied it from there. It found  its
8485       way into Perl at release 5.10.
8486
8487       PCRE2  has  an  optimization  that automatically "possessifies" certain
8488       simple pattern constructs. For example, the sequence A+B is treated  as
8489       A++B  because  there is no point in backtracking into a sequence of A's
8490       when B must follow.  This feature can be disabled by the PCRE2_NO_AUTO-
8491       POSSESS option, or starting the pattern with (*NO_AUTO_POSSESS).
8492
8493       When a pattern contains an unlimited repeat inside a group that can it-
8494       self be repeated an unlimited number of times, the  use  of  an  atomic
8495       group  is the only way to avoid some failing matches taking a very long
8496       time indeed. The pattern
8497
8498         (\D+|<\d+>)*[!?]
8499
8500       matches an unlimited number of substrings that either consist  of  non-
8501       digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
8502       matches, it runs quickly. However, if it is applied to
8503
8504         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
8505
8506       it takes a long time before reporting  failure.  This  is  because  the
8507       string  can be divided between the internal \D+ repeat and the external
8508       * repeat in a large number of ways, and all have to be tried. (The  ex-
8509       ample uses [!?] rather than a single character at the end, because both
8510       PCRE2 and Perl have an optimization that allows for fast failure when a
8511       single  character is used. They remember the last single character that
8512       is required for a match, and fail early if it is  not  present  in  the
8513       string.)  If  the  pattern  is changed so that it uses an atomic group,
8514       like this:
8515
8516         ((?>\D+)|<\d+>)*[!?]
8517
8518       sequences of non-digits cannot be broken, and failure happens quickly.
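
       A hedged sketch of the two patterns, using Python's re  module  instead
       of  PCRE2. The plain pattern is only run on a short subject, because its
       running time grows very quickly with the length of the subject; the  52-
       character  subject  is given to the atomic form alone. The helper atom-
       ic() is hypothetical and emulates (?>...) on Python versions older than
       3.11.

```python
import re
import sys

def atomic(body, name="atom"):
    """Wrap body so that it cannot be backtracked into (hypothetical helper)."""
    if sys.version_info >= (3, 11):
        return "(?>" + body + ")"
    # Emulation for older Python: atomic lookahead plus backreference.
    return "(?=(?P<{0}>{1}))(?P={0})".format(name, body)

PLAIN = r"(\D+|<\d+>)*[!?]"
GUARDED = "(" + atomic(r"\D+") + r"|<\d+>)*[!?]"

short_subject = "a" * 16
long_subject = "a" * 52

# Every way of splitting the run of "a"s between \D+ and * is tried
# before failure is reported, so keep this subject short.
plain_short = re.match(PLAIN, short_subject)

# The run of non-digits cannot be split, so failure is reported
# after a single pass even on the long subject.
guarded_long = re.match(GUARDED, long_subject)
```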
8519
8520
8521BACKREFERENCES
8522
8523       Outside a character class, a backslash followed by a digit greater than
8524       0 (and possibly further digits) is a backreference to a  capture  group
8525       earlier (that is, to its left) in the pattern, provided there have been
8526       that many previous capture groups.
8527
8528       However,  if the decimal number following the backslash is less than 8,
8529       it is always taken as a backreference, and  causes  an  error  only  if
8530       there  are not that many capture groups in the entire pattern. In other
8531       words, the group that is referenced need not be to the left of the ref-
8532       erence for numbers less than 8. A "forward backreference" of this  type
8533       can make sense when a repetition is involved and the group to the right
8534       has participated in an earlier iteration.
8535
8536       It  is  not  possible  to have a numerical "forward backreference" to a
8537       group whose number is 8 or more using this syntax  because  a  sequence
8538       such  as  \50  is  interpreted as a character defined in octal. See the
8539       subsection entitled "Non-printing characters" above for further details
8540       of the handling of digits following a backslash. Other forms  of  back-
8541       referencing  do  not suffer from this restriction. In particular, there
8542       is no problem when named capture groups are used (see below).
8543
8544       Another way of avoiding the ambiguity inherent in  the  use  of  digits
8545       following  a  backslash  is  to use the \g escape sequence. This escape
8546       must be followed by a signed or unsigned number, optionally enclosed in
8547       braces. These examples are all identical:
8548
8549         (ring), \1
8550         (ring), \g1
8551         (ring), \g{1}
8552
8553       An unsigned number specifies an absolute reference without the  ambigu-
8554       ity that is present in the older syntax. It is also useful when literal
8555       digits  follow  the reference. A signed number is a relative reference.
8556       Consider this example:
8557
8558         (abc(def)ghi)\g{-1}
8559
8560       The sequence \g{-1} is a reference to the capture group whose number is
8561       one less than the number of the next group to be started,  so  in  this
8562       example  (where the next group would be numbered 3) it is equivalent to
8563       \2, and \g{-2} would be equivalent to \1. Note that if  this  construct
8564       is  inside  a capture group, that group is included in the count, so in
8565       this example \g{-2} also refers to group 1:
8566
8567         (A)(\g{-2}B)
8568
8569       The use of relative references can be helpful  in  long  patterns,  and
8570       also  in  patterns  that are created by joining together fragments that
8571       contain references within themselves.
8572
8573       The sequence \g{+1} is a reference to the next capture  group  that  is
8574       started  after  this item, and \g{+2} refers to the one after that, and
8575       so on. This kind of forward reference can be useful  in  patterns  that
8576       repeat. Perl does not support the use of + in this way.
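
       The numbering rule for relative references can be written out as a small
       calculation.  The function below is a hypothetical helper, not part  of
       PCRE2;  it  only restates the arithmetic described above, where next_group
       is the number that the next capture group to be started would receive.

```python
def resolve_relative(next_group, offset):
    """Absolute group number for a relative reference (hypothetical helper).

    next_group: number the next capture group to be started would receive.
    offset:     the signed number inside the braces, e.g. -1 or +2.
    """
    if offset == 0:
        raise ValueError("a relative reference needs a non-zero offset")
    if offset < 0:
        # -1 is the most recently started group, -2 the one before, ...
        target = next_group + offset
    else:
        # +1 is the next group to be started, +2 the one after, ...
        target = next_group + offset - 1
    if target < 1:
        raise ValueError("reference to a non-existent group")
    return target

# (abc(def)ghi)\g{-1}: two groups started, so the next would be 3.
after_two_groups = [resolve_relative(3, -1), resolve_relative(3, -2)]

# (A)(\g{-2}B): the reference sits inside group 2, which is already
# started and therefore counted, so the next group would again be 3.
inside_group_two = resolve_relative(3, -2)

# Forward references from the same position.
forward = [resolve_relative(3, +1), resolve_relative(3, +2)]
```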
8577
8578       A  backreference  matches  whatever  actually most recently matched the
8579       capture group in the current subject string, rather  than  anything  at
8580       all that matches the group (see "Groups as subroutines" below for a way
8581       of doing that). So the pattern
8582
8583         (sens|respons)e and \1ibility
8584
8585       matches  "sense and sensibility" and "response and responsibility", but
8586       not "sense and responsibility". If caseful matching is in force at  the
8587       time  of  the backreference, the case of letters is relevant. For exam-
8588       ple,
8589
8590         ((?i)rah)\s+\1
8591
8592       matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
8593       original capture group is matched caselessly.
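
       Both  examples  can  be checked with the sketch below, which uses Python's
       re module as a stand-in for PCRE2. Recent Python versions do not  accept
       (?i)  in  the middle of a pattern, so the second pattern uses the scoped
       form (?i:rah), which PCRE2 also supports; this substitution is  an  as-
       sumption of the example.

```python
import re

sense = re.compile(r"(sens|respons)e and \1ibility")
sense_results = [
    bool(sense.fullmatch(s))
    for s in (
        "sense and sensibility",
        "response and responsibility",
        "sense and responsibility",
    )
]

# The group is matched caselessly, but the backreference is outside
# the (?i:...) scope, so it must repeat the captured text exactly.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = [bool(rah.fullmatch(s)) for s in ("rah rah", "RAH RAH", "RAH rah")]
```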
8594
8595       There  are  several  different  ways of writing backreferences to named
8596       capture groups. The .NET syntax  is  \k{name},  the  Python  syntax  is
8597       (?P=name), and the original Perl syntax is \k<name> or \k'name'. All of
8598       these are now supported by both Perl and  PCRE2.  Perl  5.10's  unified
8599       backreference  syntax,  in  which  \g  can be used for both numeric and
8600       named references, is also supported by PCRE2.   We  could  rewrite  the
8601       above example in any of the following ways:
8602
8603         (?<p1>(?i)rah)\s+\k<p1>
8604         (?'p1'(?i)rah)\s+\k{p1}
8605         (?P<p1>(?i)rah)\s+(?P=p1)
8606         (?<p1>(?i)rah)\s+\g{p1}
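
       Of these spellings, Python's re module (a different engine from  PCRE2)
       accepts  only  the  Python one, so the sketch below is limited to that
       form. It also uses the scoped (?i:rah) in place of (?i), because  recent
       Python versions reject an option setting in the middle of a pattern.

```python
import re

# Python-syntax named group and named backreference.
named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

# The name p1 is an alias for group 1.
p1_number = named.groupindex["p1"]

named_results = [bool(named.fullmatch(s)) for s in ("rah rah", "RAH RAH", "RAH rah")]
```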
8607
8608       A  capture  group  that is referenced by name may appear in the pattern
8609       before or after the reference.
8610
8611       There may be more than one backreference to the same group. If a  group
8612       has  not actually been used in a particular match, backreferences to it
8613       always fail by default. For example, the pattern
8614
8615         (a|(bc))\2
8616
8617       always fails if it starts to match "a" rather than  "bc".  However,  if
8618       the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref-
8619       erence to an unset value matches an empty string.
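
       The  default  behaviour can be seen in this sketch, which uses Python's
       re module rather than PCRE2. Python has no counterpart  of  PCRE2_MATCH_UN-
       SET_BACKREF, so only the default case is shown.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 is unset when the first alternative is taken, so \2 fails.
via_a = pattern.search("aa")

# Group 2 is set when the second alternative is taken.
via_bc = pattern.search("bcbc")
```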
8620
8621       Because  there may be many capture groups in a pattern, all digits fol-
8622       lowing a backslash are taken as part of a potential backreference  num-
8623       ber.  If  the  pattern continues with a digit character, some delimiter
8624       must be used to terminate the backreference. If the  PCRE2_EXTENDED  or
8625       PCRE2_EXTENDED_MORE  option is set, this can be white space. Otherwise,
8626       the \g{} syntax or an empty comment (see "Comments" below) can be used.
8627
8628   Recursive backreferences
8629
8630       A backreference that occurs inside the group to which it  refers  fails
8631       when  the  group  is  first used, so, for example, (a\1) never matches.
8632       However, such references can be useful inside repeated groups. For  ex-
8633       ample, the pattern
8634
8635         (a|b\1)+
8636
8637       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
8638       ation of the group, the backreference matches the character string cor-
8639       responding  to  the  previous iteration. In order for this to work, the
8640       pattern must be such that the first iteration does not  need  to  match
8641       the  backreference. This can be done using alternation, as in the exam-
8642       ple above, or by a quantifier with a minimum of zero.
8643
8644       For versions of PCRE2 less than 10.25, backreferences of this type used
8645       to cause the group that they reference  to  be  treated  as  an  atomic
8646       group.   This restriction no longer applies, and backtracking into such
8647       groups can occur as normal.
8648
8649
8650ASSERTIONS
8651
8652       An assertion is a test on the characters  following  or  preceding  the
8653       current matching point that does not consume any characters. The simple
8654       assertions  coded  as  \b,  \B,  \A,  \G, \Z, \z, ^ and $ are described
8655       above.
8656
8657       More complicated assertions are coded as  parenthesized  groups.  There
8658       are  two  kinds:  those  that look ahead of the current position in the
8659       subject string, and those that look behind it, and in each case an  as-
8660       sertion  may  be  positive (must match for the assertion to be true) or
8661       negative (must not match for the assertion to be  true).  An  assertion
8662       group is matched in the normal way, and if it is true, matching contin-
8663       ues  after it, but with the matching position in the subject string re-
8664       set to what it was before the assertion was processed.
8665
8666       The Perl-compatible lookaround assertions are atomic. If  an  assertion
8667       is  true, but there is a subsequent matching failure, there is no back-
8668       tracking into the assertion. However, there are some cases  where  non-
8669       atomic  assertions can be useful. PCRE2 has some support for these, de-
8670       scribed in the section entitled "Non-atomic assertions" below, but they
8671       are not Perl-compatible.
8672
8673       A lookaround assertion may appear as the  condition  in  a  conditional
8674       group  (see  below). In this case, the result of matching the assertion
8675       determines which branch of the condition is followed.
8676
8677       Assertion groups are not capture groups. If an assertion contains  cap-
8678       ture  groups within it, these are counted for the purposes of numbering
8679       the capture groups in the whole pattern. Within each branch of  an  as-
8680       sertion,  locally  captured  substrings  may be referenced in the usual
8681       way. For example, a sequence such as (.)\g{-1} can  be  used  to  check
8682       that two adjacent characters are the same.
8683
8684       When  a  branch within an assertion fails to match, any substrings that
8685       were captured are discarded (as happens with any  pattern  branch  that
8686       fails  to  match).  A  negative  assertion  is  true  only when all its
8687       branches fail to match; this means that no captured substrings are ever
8688       retained after a successful negative assertion. When an assertion  con-
8689       tains a matching branch, what happens depends on the type of assertion.
8690
8691       For  a  positive  assertion, internally captured substrings in the suc-
8692       cessful branch are retained, and matching continues with the next  pat-
8693       tern  item  after  the  assertion. For a negative assertion, a matching
8694       branch means that the assertion is not true. If such  an  assertion  is
8695       being  used as a condition in a conditional group (see below), captured
8696       substrings are retained,  because  matching  continues  with  the  "no"
8697       branch of the condition. For other failing negative assertions, control
8698       passes to the previous backtracking point, thus discarding any captured
8699       strings within the assertion.
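
       A  sketch  of  the two simplest cases, using Python's re module in place
       of PCRE2. The conditional-group case is not covered, and  the  patterns
       are invented for the example.

```python
import re

# Positive assertion: the capture made inside it is retained, and
# matching resumes at the position where the assertion started.
kept = re.match(r"(?=(\w+);)\w", "abc;")
kept_values = (kept.group(0), kept.group(1))

# Successful negative assertion: its branch failed to match,
# so nothing captured inside it is retained.
discarded = re.match(r"(?!(\d))\w", "a")
discarded_value = discarded.group(1)

# Failing negative assertion: the overall match fails here.
rejected = re.match(r"(?!(\d))\w", "1")
```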
8700
8701       Most  assertion groups may be repeated; though it makes no sense to as-
8702       sert the same thing several times, the side effect of capturing in pos-
8703       itive assertions may occasionally be useful. However, an assertion that
8704       forms the condition for a conditional  group  may  not  be  quantified.
8705       PCRE2  used  to restrict the repetition of assertions, but from release
8706       10.35 the only restriction is that an unlimited maximum  repetition  is
8707       changed  to  be one more than the minimum. For example, {3,} is treated
8708       as {3,4}.
8709
8710   Alphabetic assertion names
8711
8712       Traditionally, symbolic sequences such as (?= and (?<= have  been  used
8713       to  specify lookaround assertions. Perl 5.28 introduced some experimen-
8714       tal alphabetic alternatives which might be easier to remember. They all
8715       start with (* instead of (? and must be written using lower  case  let-
8716       ters. PCRE2 supports the following synonyms:
8717
8718         (*positive_lookahead:  or (*pla: is the same as (?=
8719         (*negative_lookahead:  or (*nla: is the same as (?!
8720         (*positive_lookbehind: or (*plb: is the same as (?<=
8721         (*negative_lookbehind: or (*nlb: is the same as (?<!
8722
8723       For  example,  (*pla:foo) is the same assertion as (?=foo). In the fol-
8724       lowing sections, the various assertions are described using the  origi-
8725       nal symbolic forms.
8726
8727   Lookahead assertions
8728
8729       Lookahead assertions start with (?= for positive assertions and (?! for
8730       negative assertions. For example,
8731
8732         \w+(?=;)
8733
8734       matches  a word followed by a semicolon, but does not include the semi-
8735       colon in the match, and
8736
8737         foo(?!bar)
8738
8739       matches any occurrence of "foo" that is not  followed  by  "bar".  Note
8740       that the apparently similar pattern
8741
8742         (?!foo)bar
8743
8744       does  not  find  an  occurrence  of "bar" that is preceded by something
8745       other than "foo"; it finds any occurrence of "bar" whatsoever,  because
8746       the assertion (?!foo) is always true when the next three characters are
8747       "bar". A lookbehind assertion is needed to achieve the other effect.
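
       These  cases can be compared side by side in the sketch below. It uses
       Python's re module, which is assumed to agree with PCRE2 for these simple
       assertions; the subject strings are invented for the example.

```python
import re

# A word followed by a semicolon, without the semicolon itself.
words = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# "foo" not followed by "bar": only the second "foo" qualifies.
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar still finds "bar" in "foobar": the next three characters
# at that point are "bar", so the assertion is trivially true.
lookahead_hit = re.search(r"(?!foo)bar", "foobar").start()

# A lookbehind is what excludes a "bar" that is preceded by "foo".
lookbehind_miss = re.search(r"(?<!foo)bar", "foobar")
lookbehind_hit = re.search(r"(?<!foo)bar", "bazbar").start()
```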
8748
8749       If you want to force a matching failure at some point in a pattern, the
8750       most  convenient  way to do it is with (?!) because an empty string al-
8751       ways matches, so an assertion that requires there not to  be  an  empty
8752       string must always fail.  The backtracking control verb (*FAIL) or (*F)
8753       is a synonym for (?!).
8754
8755   Lookbehind assertions
8756
8757       Lookbehind  assertions start with (?<= for positive assertions and (?<!
8758       for negative assertions. For example,
8759
8760         (?<!foo)bar
8761
8762       does find an occurrence of "bar" that is not  preceded  by  "foo".  The
8763       contents  of a lookbehind assertion are restricted such that there must
8764       be a known maximum to the lengths of all the strings it matches.  There
8765       are two cases:
8766
8767       If every top-level alternative matches a fixed length, for example
8768
8769         (?<=colour|color)
8770
8771       there  is a limit of 65535 characters to the lengths, which do not have
8772       to be the same, as this example demonstrates. This is the only kind  of
8773       lookbehind  supported  by  PCRE2 versions earlier than 10.43 and by the
8774       alternative matching function pcre2_dfa_match().
8775
8776       In PCRE2 10.43 and later, pcre2_match() supports lookbehind  assertions
8777       in  which  one  or  more top-level alternatives can match more than one
8778       string length, for example
8779
8780         (?<=colou?r)
8781
8782       The maximum matching length for any branch of the lookbehind is limited
8783       to a value set by the calling program (default 255 characters).  Unlim-
8784       ited  repetition (for example \d*) is not supported. In some cases, the
8785       escape sequence \K (see above) can be used instead of a lookbehind  as-
8786       sertion  at  the  start  of a pattern to get round the length limit re-
8787       striction.
8788
8789       In UTF-8 and UTF-16 modes, PCRE2 does not allow the  \C  escape  (which
8790       matches  a single code unit even in a UTF mode) to appear in lookbehind
8791       assertions, because it makes it impossible to calculate the  length  of
8792       the  lookbehind.  The \X and \R escapes, which can match different num-
8793       bers of code units, are never permitted in lookbehinds.
8794
8795       "Subroutine" calls (see below) such as (?2) or (?&X) are  permitted  in
8796       lookbehinds,  as  long  as  the called capture group matches a limited-
8797       length string. However, recursion, that is, a "subroutine" call into  a
8798       group that is already active, is not supported.
8799
8800       PCRE2  supports backreferences in lookbehinds, but only if certain con-
8801       ditions are met. The PCRE2_MATCH_UNSET_BACKREF option must not be  set,
8802       there  must be no use of (?| in the pattern (it creates duplicate group
8803       numbers), and if the backreference is by name, the name must be unique.
8804       Of course, the referenced group must itself match a limited length sub-
8805       string. The following pattern matches words  containing  at  least  two
8806       characters that begin and end with the same character:
8807
8808          \b(\w)\w++(?<=\1)
8809
8810       Possessive  quantifiers  can be used in conjunction with lookbehind as-
8811       sertions to specify efficient matching at the end of  subject  strings.
8812       Consider a simple pattern such as
8813
8814         abcd$
8815
8816       when  applied  to  a  long string that does not match. Because matching
8817       proceeds from left to right, PCRE2 will look for each "a" in  the  sub-
8818       ject  and  then see if what follows matches the rest of the pattern. If
8819       the pattern is specified as
8820
8821         ^.*abcd$
8822
8823       the initial .* matches the entire string at first, but when this  fails
8824       (because there is no following "a"), it backtracks to match all but the
8825       last  character,  then all but the last two characters, and so on. Once
8826       again the search for "a" covers the entire string, from right to  left,
8827       so we are no better off. However, if the pattern is written as
8828
8829         ^.*+(?<=abcd)
8830
8831       there can be no backtracking for the .*+ item because of the possessive
8832       quantifier; it can match only the entire string. The subsequent lookbe-
8833       hind  assertion  does  a single test on the last four characters. If it
8834       fails, the match fails immediately. For  long  strings,  this  approach
8835       makes a significant difference to the processing time.
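
       The idea can be tried out with the sketch below, which again assumes
       Python's re module as a stand-in for PCRE2. Possessive quantifiers are
       available there only from Python 3.11, so for older versions the sketch
       falls back to a lookahead plus backreference, a common emulation of an
       atomic grab. The function name ends_with_abcd is hypothetical.

```python
import re
import sys

if sys.version_info >= (3, 11):
    # Possessive quantifier, written as in the text above.
    ENDS_WITH_ABCD = re.compile(r"^.*+(?<=abcd)")
else:
    # Older Pythons: capture everything inside a lookahead, then consume it
    # with a backreference, so the engine cannot give characters back.
    ENDS_WITH_ABCD = re.compile(r"^(?=(.*))\1(?<=abcd)")


def ends_with_abcd(subject):
    """True if the single-line subject ends with "abcd"."""
    return ENDS_WITH_ABCD.match(subject) is not None


print(ends_with_abcd("xxabcd"), ends_with_abcd("abcdx"))
```

       Either form makes one pass to the end of the line and then performs a
       single four-character test, instead of retrying at every position.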
8836
8837   Using multiple assertions
8838
8839       Several assertions (of any sort) may occur in succession. For example,
8840
8841         (?<=\d{3})(?<!999)foo
8842
8843       matches  "foo" preceded by three digits that are not "999". Notice that
8844       each of the assertions is applied independently at the  same  point  in
8845       the  subject  string.  First  there  is a check that the previous three
8846       characters are all digits, and then there is  a  check  that  the  same
8847       three characters are not "999".  This pattern does not match "foo" pre-
8848       ceded by six characters, the first three of which are digits  and  the
8849       last  three  of  which are not "999". For example, it does not match
8850       "123abcfoo". A pattern to do that is
8851
8852         (?<=\d{3}...)(?<!999)foo
8853
8854       This  time  the  first assertion looks at the preceding six characters,
8855       checking that the first three are digits, and then the second assertion
8856       checks that the preceding three characters are not "999".
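
       The two patterns above can be compared side by side. The sketch below
       assumes Python's re module as a stand-in engine (both lookbehinds are
       fixed-width, so it accepts them unchanged); the names are hypothetical.

```python
import re

# Both assertions are tested at the same point: just before "foo".
THREE_DIGITS_NOT_999 = re.compile(r"(?<=\d{3})(?<!999)foo")

# Six characters are inspected: three digits, then any three characters,
# and separately the last three must not be "999".
DIGITS_THEN_THREE_NOT_999 = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def found(pattern, subject):
    """True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None


for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject,
          found(THREE_DIGITS_NOT_999, subject),
          found(DIGITS_THEN_THREE_NOT_999, subject))
```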
8857
8858       Assertions can be nested in any combination. For example,
8859
8860         (?<=(?<!foo)bar)baz
8861
8862       matches an occurrence of "baz" that is preceded by "bar" which in  turn
8863       is not preceded by "foo", while
8864
8865         (?<=\d{3}(?!999)...)foo
8866
8867       is  another pattern that matches "foo" preceded by three digits and any
8868       three characters that are not "999".
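
       A minimal sketch of the nested forms follows, assuming Python's re
       module as a stand-in for PCRE2. Both outer lookbehinds have a fixed
       width (the inner assertions consume nothing), so that engine accepts
       them as written. The constant names are hypothetical.

```python
import re

# "baz" preceded by "bar", where that "bar" is itself not preceded by "foo".
BAZ_AFTER_BAR_NOT_FOO = re.compile(r"(?<=(?<!foo)bar)baz")

# "foo" preceded by three digits and then three characters that are not "999".
FOO_AFTER_DIGITS_AND_NON_999 = re.compile(r"(?<=\d{3}(?!999)...)foo")

print(bool(BAZ_AFTER_BAR_NOT_FOO.search("xxxbarbaz")))
print(bool(BAZ_AFTER_BAR_NOT_FOO.search("foobarbaz")))
print(bool(FOO_AFTER_DIGITS_AND_NON_999.search("123abcfoo")))
print(bool(FOO_AFTER_DIGITS_AND_NON_999.search("123999foo")))
```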
8869
8870
8871NON-ATOMIC ASSERTIONS
8872
8873       Traditional lookaround assertions are atomic. That is, if an  assertion
8874       is  true, but there is a subsequent matching failure, there is no back-
8875       tracking into the assertion. However, there are some cases  where  non-
8876       atomic  positive  assertions  can be useful. PCRE2 provides these using
8877       the following syntax:
8878
8879         (*non_atomic_positive_lookahead:  or (*napla: or (?*
8880         (*non_atomic_positive_lookbehind: or (*naplb: or (?<*
8881
8882       Consider the problem of finding the right-most word in  a  string  that
8883       also  appears  earlier  in the string, that is, it must appear at least
8884       twice in total.  This pattern returns the required result  as  captured
8885       substring 1:
8886
8887         ^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2}
8888
8889       For  a subject such as "word1 word2 word3 word2 word3 word4" the result
8890       is "word3". How does it work? At the start, ^(?x) anchors  the  pattern
8891       and sets the "x" option, which causes white space (introduced for read-
8892       ability)  to  be  ignored. Inside the assertion, the greedy .* at first
8893       consumes the entire string, but then has to backtrack until the rest of
8894       the assertion can match a word, which is captured by group 1. In  other
8895       words,  when  the  assertion first succeeds, it captures the right-most
8896       word in the string.
8897
8898       The current matching point is then reset to the start of  the  subject,
8899       and  the  rest  of  the pattern match checks for two occurrences of the
8900       captured word, using an ungreedy .*? to scan from  the  left.  If  this
8901       succeeds,  we are done, but if the last word in the string does not oc-
8902       cur twice, this part of the pattern  fails.  If  a  traditional  atomic
8903       lookahead  (?=  or (*pla: had been used, the assertion could not be re-
8904       entered, and the whole match would fail. The pattern would succeed only
8905       if the very last word in the subject was found twice.
8906
8907       Using a non-atomic lookahead, however, means that when  the  last  word
8908       does  not  occur  twice  in the string, the lookahead can backtrack and
8909       find the second-last word, and so on, until either the match  succeeds,
8910       or all words have been tested.
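
       The search order described above can be sketched without any regex
       backtracking at all. The following plain Python function is not how
       PCRE2 implements the match; it only reproduces the sequence of
       candidates that the non-atomic lookahead tries, assuming a single-line
       subject. The name rightmost_repeated_word is hypothetical.

```python
import re


def rightmost_repeated_word(subject):
    """Return the right-most word that occurs at least twice, or None."""
    words = re.findall(r"\w+", subject)
    # Try the last word first, then the second-last, and so on. Each step
    # corresponds to the rest of the pattern failing and the assertion being
    # re-entered to capture the next word to the left.
    for candidate in reversed(words):
        if words.count(candidate) >= 2:
            return candidate
    return None


print(rightmost_repeated_word("word1 word2 word3 word2 word3 word4"))
```

       With an atomic lookahead only the first iteration of the loop would
       ever run, which is why the atomic form succeeds only when the very
       last word is repeated.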
8911
8912       Two conditions must be met for a non-atomic assertion to be useful: the
8913       contents  of one or more capturing groups must change after a backtrack
8914       into the assertion, and there must be  a  backreference  to  a  changed
8915       group  later  in  the pattern. If this is not the case, the rest of the
8916       pattern match fails exactly as before because nothing has  changed,  so
8917       using a non-atomic assertion just wastes resources.
8918
8919       There  is one exception to backtracking into a non-atomic assertion. If
8920       an (*ACCEPT) control verb is triggered, the assertion  succeeds  atomi-
8921       cally.  That  is,  a subsequent match failure cannot backtrack into the
8922       assertion.
8923
8924       Non-atomic assertions are not supported  by  the  alternative  matching
8925       function pcre2_dfa_match(). They are supported by JIT, but only if they
8926       do not contain any control verbs such as (*ACCEPT). (This may change in
8927       the future.) Note that assertions that appear as conditions for  condi-
8928       tional groups (see below) must be atomic.
8929
8930
8931SCRIPT RUNS
8932
8933       In  concept, a script run is a sequence of characters that are all from
8934       the same Unicode script such as Latin or Greek. However,  because  some
8935       scripts  are  commonly  used together, and because some diacritical and
8936       other marks are used with multiple scripts,  it  is  not  that  simple.
8937       There is a full description of the rules that PCRE2 uses in the section
8938       entitled "Script Runs" in the pcre2unicode documentation.
8939
8940       If  part  of a pattern is enclosed between (*script_run: or (*sr: and a
8941       closing parenthesis, it fails if the sequence  of  characters  that  it
8942       matches  is  not a script run. After a failure, normal backtracking oc-
8943       curs. Script runs can be used to detect spoofing attacks using  charac-
8944       ters  that  look  the  same, but are from different scripts. The string
8945       "paypal.com" is an infamous example, where the letters could be a  mix-
8946       ture of Latin and Cyrillic. This pattern ensures that the matched char-
8947       acters in a sequence of non-spaces that follow white space are a script
8948       run:
8949
8950         \s+(*sr:\S+)
8951
8952       To  be  sure  that  they are all from the Latin script (for example), a
8953       lookahead can be used:
8954
8955         \s+(?=\p{Latin})(*sr:\S+)
8956
8957       This works as long as the first character is expected to be a character
8958       in that script, and not (for example)  punctuation,  which  is  allowed
8959       with  any script. If this is not the case, a more creative lookahead is
8960       needed. For example, if digits, underscore, and dots are  permitted  at
8961       the start:
8962
8963         \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
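
       To show the kind of spoofing a script run check is meant to catch, the
       sketch below uses a deliberately crude approximation: it takes the
       first word of each letter's Unicode character name as a script label.
       This is an assumption for illustration only. It is not the rule set
       that PCRE2 applies (see the pcre2unicode documentation), and it
       ignores shared scripts, digits, and combining marks. The function
       names are hypothetical.

```python
import unicodedata


def rough_scripts(text):
    """Collect a crude script label for each letter from its Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


def looks_like_single_script(text):
    """True if all letters share one crude script label."""
    return len(rough_scripts(text)) <= 1


print(looks_like_single_script("paypal"))
# Same visible text, but with U+0430 CYRILLIC SMALL LETTER A for each "a".
print(looks_like_single_script("p\u0430yp\u0430l"))
```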
8964
8965
8966       In  many  cases, backtracking into a script run pattern fragment is not
8967       desirable. The script run can employ an atomic group to  prevent  this.
8968       Because  this is a common requirement, a shorthand notation is provided
8969       by (*atomic_script_run: or (*asr:
8970
8971         (*asr:...) is the same as (*sr:(?>...))
8972
8973       Note that the atomic group is inside the script run. Putting it outside
8974       would not prevent backtracking into the script run pattern.
8975
8976       Support for script runs is not available if PCRE2 is  compiled  without
8977       Unicode support. A compile-time error is given if any of the above con-
8978       structs is encountered. Script runs are not supported by  the  alterna-
8979       tive  matching  function, pcre2_dfa_match(), because they use the same
8980       mechanism as capturing parentheses.
8981
8982       Warning:  The  (*ACCEPT)  control  verb  (see below) should not be used
8983       within a script run group, because it causes an immediate exit from the
8984       group, bypassing the script run checking.
8985
8986
8987CONDITIONAL GROUPS
8988
8989       It is possible to cause the matching process to obey a pattern fragment
8990       conditionally or to choose between two alternative fragments, depending
8991       on the result of an assertion, or whether a specific capture group  has
8992       already been matched. The two possible forms of conditional group are:
8993
8994         (?(condition)yes-pattern)
8995         (?(condition)yes-pattern|no-pattern)
8996
8997       If  the  condition is satisfied, the yes-pattern is used; otherwise the
8998       no-pattern (if present) is used. An absent no-pattern is equivalent  to
8999       an  empty string (it always matches). If there are more than two alter-
9000       natives in the group, a compile-time error occurs. Each of the two  al-
9001       ternatives may itself contain nested groups of any form, including con-
9002       ditional  groups;  the  restriction to two alternatives applies only at
9003       the level of the condition itself. This pattern fragment is an  example
9004       where the alternatives are complex:
9005
9006         (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
9007
9008
9009       There are five kinds of condition: references to capture groups, refer-
9010       ences  to  recursion,  two pseudo-conditions called DEFINE and VERSION,
9011       and assertions.
9012
9013   Checking for a used capture group by number
9014
9015       If the text between the parentheses consists of a sequence  of  digits,
9016       the  condition is true if a capture group of that number has previously
9017       matched. If there is more than one capture group with the  same  number
9018       (see  the earlier section about duplicate group numbers), the condition
9019       is true if any of them have matched. An alternative notation, which  is
9020       a PCRE2 extension, not supported by Perl, is to precede the digits with
9021       a plus or minus sign. In this case, the group number is relative rather
9022       than  absolute.  The most recently opened capture group (which could be
9023       enclosing this condition) can be referenced by (?(-1),  the  next  most
9024       recent by (?(-2), and so on. Inside loops it can also make sense to re-
9025       fer  to  subsequent groups.  The next capture group to be opened can be
9026       referenced as (?(+1), and so on. The value zero in any of  these  forms
9027       is not used; it provokes a compile-time error.
9028
9029       Consider  the  following  pattern, which contains non-significant white
9030       space to make it more readable (assume the PCRE2_EXTENDED  option)  and
9031       to divide it into three parts for ease of discussion:
9032
9033         ( \( )?    [^()]+    (?(1) \) )
9034
9035       The  first  part  matches  an optional opening parenthesis, and if that
9036       character is present, sets it as the first captured substring. The sec-
9037       ond part matches one or more characters that are not  parentheses.  The
9038       third  part  is a conditional group that tests whether or not the first
9039       capture group matched. If it did, that is, if the subject started with an
9040       opening  parenthesis,  the condition is true, and so the yes-pattern is
9041       executed and a closing parenthesis is required.  Otherwise,  since  no-
9042       pattern is not present, the conditional group matches nothing. In other
9043       words,  this  pattern matches a sequence of non-parentheses, optionally
9044       enclosed in parentheses.
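
       The behaviour of this pattern can be checked with the sketch below. It
       assumes Python's re module as a stand-in, with re.VERBOSE playing the
       role of PCRE2_EXTENDED; conditions on a group number use the same
       syntax there. The function name is hypothetical.

```python
import re

OPTIONALLY_PARENTHESIZED = re.compile(
    r"( \( )?    [^()]+    (?(1) \) )",
    re.VERBOSE,
)


def is_optionally_parenthesized(subject):
    """True if the whole subject is non-parentheses, optionally enclosed."""
    return OPTIONALLY_PARENTHESIZED.fullmatch(subject) is not None


for subject in ("abc", "(abc)", "(abc", "abc)"):
    print(subject, is_optionally_parenthesized(subject))
```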
9045
9046       If you were embedding this pattern in a larger one,  you  could  use  a
9047       relative reference:
9048
9049         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
9050
9051       This  makes  the  fragment independent of the parentheses in the larger
9052       pattern.
9053
9054   Checking for a used capture group by name
9055
9056       Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
9057       used  capture group by name. For compatibility with earlier versions of
9058       PCRE1, which had this facility before Perl, the syntax (?(name)...)  is
9059       also  recognized.   Note, however, that undelimited names consisting of
9060       the letter R followed by digits are ambiguous (see the  following  sec-
9061       tion). Rewriting the above example to use a named group gives this:
9062
9063         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
9064
9065       If  the  name used in a condition of this kind is a duplicate, the test
9066       is applied to all groups of the same name, and is true if  any  one  of
9067       them has matched.
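
       A named variant is sketched below, assuming Python's re module as a
       stand-in. That engine spells the named group (?P<OPEN>...) and accepts
       only the undelimited condition form (?(OPEN)...), so the pattern is
       adapted accordingly; it is not the Perl-style spelling shown above.

```python
import re

NAMED_OPTIONAL_PARENS = re.compile(
    r"(?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )",
    re.VERBOSE,
)

match = NAMED_OPTIONAL_PARENS.fullmatch("(abc)")
print(match.group("OPEN"))
```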
9068
9069   Checking for pattern recursion
9070
9071       "Recursion"  in  this sense refers to any subroutine-like call from one
9072       part of the pattern to another, whether or not it  is  actually  recur-
9073       sive.  See  the  sections  entitled "Recursive patterns" and "Groups as
9074       subroutines" below for details of recursion and subroutine calls.
9075
9076       If a condition is the string (R), and there is no  capture  group  with
9077       the  name R, the condition is true if matching is currently in a recur-
9078       sion or subroutine call to the whole pattern or any capture  group.  If
9079       digits  follow  the letter R, and there is no group with that name, the
9080       condition is true if the most recent call is  into  a  group  with  the
9081       given  number,  which must exist somewhere in the overall pattern. This
9082       is a contrived example that is equivalent to a+b:
9083
9084         ((?(R1)a+|(?1)b))
9085
9086       However, in both cases, if there is a capture  group  with  a  matching
9087       name,  the  condition tests for its being set, as described in the sec-
9088       tion above, instead of testing for recursion. For example,  creating  a
9089       group  with  the  name  R1  by adding (?<R1>) to the above pattern com-
9090       pletely changes its meaning.
9091
9092       If a name preceded by ampersand follows the letter R, for example:
9093
9094         (?(R&name)...)
9095
9096       the condition is true if the most recent recursion is into a  group  of
9097       that name (which must exist within the pattern).
9098
9099       This condition does not check the entire recursion stack. It tests only
9100       the  current  level.  If the name used in a condition of this kind is a
9101       duplicate, the test is applied to all groups of the same name,  and  is
9102       true if any one of them is the most recent recursion.
9103
9104       At "top level", all these recursion test conditions are false.
9105
9106   Defining capture groups for use by reference only
9107
9108       If the condition is the string (DEFINE), the condition is always false,
9109       even  if there is a group with the name DEFINE. In this case, there may
9110       be only one alternative in the rest of the conditional group. It is al-
9111       ways skipped if control reaches this point in the pattern; the idea  of
9112       DEFINE  is that it can be used to define subroutines that can be refer-
9113       enced from elsewhere. (The use of subroutines is described below.)  For
9114       example,  a  pattern  to match an IPv4 address such as "192.168.23.245"
9115       could be written like this (ignore white space and line breaks):
9116
9117         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
9118         \b (?&byte) (\.(?&byte)){3} \b
9119
9120       The first part of the pattern is a DEFINE group  inside  which  another
9121       group  named "byte" is defined. This matches an individual component of
9122       an IPv4 address (a number less than 256). When  matching  takes  place,
9123       this  part  of  the pattern is skipped because DEFINE acts like a false
9124       condition. The rest of the pattern uses references to the  named  group
9125       to  match the four dot-separated components of an IPv4 address, insist-
9126       ing on a word boundary at each end.
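
       Engines without DEFINE can approximate the same reuse by building the
       pattern from a shared string. The sketch below does this with Python's
       re module, which is an assumption for illustration and not PCRE2: the
       BYTE fragment is interpolated four times instead of being called as a
       subroutine, and the repeated group is non-capturing. re.ASCII keeps \d
       to the ASCII digits.

```python
import re

# One component of an IPv4 address (0 to 255), reused by interpolation.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b", re.ASCII)

print(IPV4.search("addr 10.0.0.1 ok").group())
print(IPV4.fullmatch("256.1.1.1"))
```

       Unlike DEFINE, interpolation copies the fragment, so the compiled
       pattern grows with each use.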
9127
9128   Checking the PCRE2 version
9129
9130       Programs that link with a PCRE2 library can check the version by  call-
9131       ing  pcre2_config()  with  appropriate arguments. Users of applications
9132       that do not have access to the underlying code cannot do this.  A  spe-
9133       cial  "condition" called VERSION exists to allow such users to discover
9134       which version of PCRE2 they are dealing with by using this condition to
9135       match a string such as "yesno". VERSION must be followed either by  "="
9136       or ">=" and a version number.  For example:
9137
9138         (?(VERSION>=10.4)yes|no)
9139
9140       This pattern matches "yes" if the PCRE2 version  is  greater  than  or
9141       equal  to  10.4, or "no" otherwise. The fractional part of the version
9142       number may not contain more than two digits.
9143
9144   Assertion conditions
9145
9146       If  the  condition  is  not  in  any of the above formats, it must be a
9147       parenthesized assertion. This may be a positive or  negative  lookahead
9148       or  lookbehind  assertion. However, it must be a traditional atomic as-
9149       sertion, not one of the non-atomic assertions.
9150
9151       Consider this pattern, again containing  non-significant  white  space,
9152       and with the two alternatives on the second line:
9153
9154         (?(?=[^a-z]*[a-z])
9155         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
9156
9157       The  condition  is  a  positive lookahead assertion that matches an op-
9158       tional sequence of non-letters followed by a letter. In other words, it
9159       tests for the presence of at least one letter in the subject. If a let-
9160       ter is found, the subject is matched  against  the  first  alternative;
9161       otherwise  it  is  matched  against  the  second.  This pattern matches
9162       strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
9163       letters and dd are digits.
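
       Where an engine has no assertion conditions, the same logic can be
       written as an alternation in which the first branch is guarded by the
       lookahead and the second by its negation. The sketch below uses
       Python's re module in this way; it is a rewrite for illustration, not
       the conditional syntax itself, and the name DATE_LIKE is hypothetical.

```python
import re

DATE_LIKE = re.compile(
    r"""
    (?: (?=[^a-z]*[a-z])  \d{2}-[a-z]{3}-\d{2}   # a letter exists: yes-pattern
      | (?![^a-z]*[a-z])  \d{2}-\d{2}-\d{2}      # no letter at all: no-pattern
    )
    """,
    re.VERBOSE,
)

for subject in ("12-abc-34", "12-34-56", "12-34-56 x"):
    print(subject, DATE_LIKE.match(subject) is not None)
```

       The last subject fails because the letter later in the string makes
       the condition true, so only the first alternative is tried.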
9164
9165       When an assertion that is a condition contains capture groups, any cap-
9166       turing  that  occurs  in  a matching branch is retained afterwards, for
9167       both positive and negative assertions, because matching always  contin-
9168       ues  after  the  assertion, whether it succeeds or fails. (Compare non-
9169       conditional assertions, for which captures are retained only for  posi-
9170       tive assertions that succeed.)
9171
9172
9173COMMENTS
9174
9175       There are two ways of including comments in patterns that are processed
9176       by  PCRE2.  In  both  cases,  the start of the comment must not be in a
9177       character class, nor in the middle of any  other  sequence  of  related
9178       characters  such  as (?: or a group name or number. The characters that
9179       make up a comment play no part in the pattern matching.
9180
9181       The sequence (?# marks the start of a comment that continues up to  the
9182       next  closing parenthesis. Nested parentheses are not permitted. If the
9183       PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is  set,  an  unescaped  #
9184       character  also  introduces  a comment, which in this case continues to
9185       immediately after the next newline character or character  sequence  in
9186       the pattern. Which characters are interpreted as newlines is controlled
9187       by  an option passed to the compiling function or by a special sequence
9188       at the start of the pattern, as described in the section entitled "New-
9189       line conventions" above. Note that the end of this type of comment is a
9190       literal newline sequence in the pattern; escape sequences  that  happen
9191       to represent a newline do not count. For example, consider this pattern
9192       when  PCRE2_EXTENDED is set, and the default newline convention (a sin-
9193       gle linefeed character) is in force:
9194
9195         abc #comment \n still comment
9196
9197       On encountering the # character, pcre2_compile() skips  along,  looking
9198       for  a newline in the pattern. The sequence \n is still literal at this
9199       stage, so it does not terminate the comment. Only an  actual  character
9200       with the code value 0x0a (the default newline) does so.
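
       The distinction between an escape sequence and a real newline can be
       observed with the sketch below. It assumes Python's re.VERBOSE flag as
       a stand-in for PCRE2_EXTENDED; that engine treats this particular case
       the same way. The constant names are hypothetical.

```python
import re

# Raw string: the pattern contains a backslash followed by "n", so the
# comment runs to the end of the pattern and only "abc" remains.
ESCAPED_NEWLINE = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: the pattern contains a real linefeed, which ends the
# comment, so "xyz" is part of the pattern.
REAL_NEWLINE = re.compile("abc #comment \n xyz", re.VERBOSE)

print(ESCAPED_NEWLINE.fullmatch("abc") is not None)
print(REAL_NEWLINE.fullmatch("abcxyz") is not None)
```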
9201
9202
9203RECURSIVE PATTERNS
9204
9205       Consider  the problem of matching a string in parentheses, allowing for
9206       unlimited nested parentheses. Without the use of  recursion,  the  best
9207       that  can  be  done  is  to use a pattern that matches up to some fixed
9208       depth of nesting. It is not possible to  handle  an  arbitrary  nesting
9209       depth.
9210
9211       For some time, Perl has provided a facility that allows regular expres-
9212       sions  to recurse (amongst other things). It does this by interpolating
9213       Perl code in the expression at run time, and the code can refer to  the
9214       expression itself. A Perl pattern using code interpolation to solve the
9215       parentheses problem can be created like this:
9216
9217         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
9218
9219       The (?p{...}) item interpolates Perl code at run time, and in this case
9220       refers recursively to the pattern in which it appears.
9221
9222       Obviously,  PCRE2  cannot  support  the interpolation of Perl code. In-
9223       stead, it supports special syntax for recursion of the entire  pattern,
9224       and also for individual capture group recursion. After its introduction
9225       in PCRE1 and Python, this kind of recursion was subsequently introduced
9226       into Perl at release 5.10.
9227
9228       A  special  item  that consists of (? followed by a number greater than
9229       zero and a closing parenthesis is a recursive subroutine  call  of  the
9230       capture  group of the given number, provided that it occurs inside that
9231       group. (If not, it is a non-recursive subroutine  call,  which  is  de-
9232       scribed in the next section.) The special item (?R) or (?0) is a recur-
9233       sive call of the entire regular expression.
9234
9235       This  PCRE2  pattern  solves the nested parentheses problem (assume the
9236       PCRE2_EXTENDED option is set so that white space is ignored):
9237
9238         \( ( [^()]++ | (?R) )* \)
9239
9240       First it matches an opening parenthesis. Then it matches any number  of
9241       substrings  which can either be a sequence of non-parentheses, or a re-
9242       cursive match of the pattern itself (that is, a correctly parenthesized
9243       substring).  Finally there is a closing parenthesis. Note the use of  a
9244       possessive  quantifier  to  avoid  backtracking  into sequences of non-
9245       parentheses.
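
       The structure of this pattern maps directly onto a recursive function.
       The sketch below is plain Python rather than a regular expression (the
       standard re module has no recursion), and it is only an illustration
       of what the pattern accepts when anchored at a given position, not of
       how PCRE2 performs the match. The name match_parens is hypothetical.

```python
def match_parens(subject, pos=0):
    """Return the end offset of a balanced group starting at pos, or None."""
    if pos >= len(subject) or subject[pos] != "(":
        return None                      # \( must match first
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1               # closing \)
        if ch == "(":
            end = match_parens(subject, pos)   # the (?R) alternative
            if end is None:
                return None
            pos = end
        else:
            pos += 1                     # part of a [^()]++ run
    return None                          # ran out of subject before \)


print(match_parens("(ab(cd)ef)"))
print(match_parens("(a(b)"))
```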
9246
9247       If this were part of a larger pattern, you would not  want  to  recurse
9248       the entire pattern, so instead you could use this:
9249
9250         ( \( ( [^()]++ | (?1) )* \) )
9251
9252       We  have  put the pattern into parentheses, and caused the recursion to
9253       refer to them instead of the whole pattern.
9254
9255       In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
9256       tricky.  This is made easier by the use of relative references. Instead
9257       of (?1) in the pattern above you can write (?-2) to refer to the second
9258       most recently opened parentheses  preceding  the  recursion.  In  other
9259       words,  a  negative  number counts capturing parentheses leftwards from
9260       the point at which it is encountered.
9261
9262       Be aware, however, that if duplicate capture group numbers are in  use,
9263       relative  references  refer  to the earliest group with the appropriate
9264       number. Consider, for example:
9265
9266         (?|(a)|(b)) (c) (?-2)
9267
9268       The first two capture groups (a) and (b) are both numbered 1, and group
9269       (c) is number 2. When the reference (?-2) is  encountered,  the  second
9270       most  recently opened parentheses has the number 1, but it is the first
9271       such group (the (a) group) to which the recursion refers. This would be
9272       the same if an absolute reference (?1) was used. In other words,  rela-
9273       tive references are just a shorthand for computing a group number.
9274
9275       It  is  also possible to refer to subsequent capture groups, by writing
9276       references such as (?+2). However, these cannot  be  recursive  because
9277       the  reference  is not inside the parentheses that are referenced. They
9278       are always non-recursive subroutine calls, as  described  in  the  next
9279       section.
9280
9281       An  alternative  approach  is to use named parentheses. The Perl syntax
9282       for this is (?&name); PCRE1's earlier syntax  (?P>name)  is  also  sup-
9283       ported. We could rewrite the above example as follows:
9284
9285         (?<pn> \( ( [^()]++ | (?&pn) )* \) )
9286
9287       If there is more than one group with the same name, the earliest one is
9288       used.
9289
9290       The example pattern that we have been looking at contains nested unlim-
9291       ited  repeats,  and  so the use of a possessive quantifier for matching
9292       strings of non-parentheses is important when applying  the  pattern  to
9293       strings that do not match. For example, when this pattern is applied to
9294
9295         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
9296
9297       it  yields  "no  match" quickly. However, if a possessive quantifier is
9298       not used, the match runs for a very long time indeed because there  are
9299       so  many  different  ways the + and * repeats can carve up the subject,
9300       and all have to be tested before failure can be reported.
9301
9302       At the end of a match, the values of capturing  parentheses  are  those
9303       from  the outermost level. If you want to obtain intermediate values, a
9304       callout function can be used (see below and the pcre2callout documenta-
9305       tion). If the pattern above is matched against
9306
9307         (ab(cd)ef)
9308
9309       the value for the inner capturing parentheses  (numbered  2)  is  "ef",
9310       which  is  the last value taken on at the top level. If a capture group
9311       is not matched at the top level, its final  captured  value  is  unset,
9312       even  if it was (temporarily) set at a deeper level during the matching
9313       process.
9314
9315       Do not confuse the (?R) item with the condition (R),  which  tests  for
9316       recursion.   Consider  this pattern, which matches text in angle brack-
9317       ets, allowing for arbitrary nesting. Only digits are allowed in  nested
9318       brackets  (that is, when recursing), whereas any characters are permit-
9319       ted at the outer level.
9320
9321         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
9322
9323       In this pattern, (?(R) is the start of a conditional  group,  with  two
9324       different  alternatives  for the recursive and non-recursive cases. The
9325       (?R) item is the actual recursive call.
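
       The roles of the two items can be separated in a hand-written matcher.
       In the sketch below, which is plain Python and only an illustration of
       what the pattern accepts at a given position, the in_recursion
       parameter plays the part of the (R) condition and the recursive call
       plays the part of (?R). The names are hypothetical.

```python
ASCII_DIGITS = "0123456789"


def match_angle(subject, pos=0, in_recursion=False):
    """Return the end offset of a bracketed group starting at pos, or None."""
    if pos >= len(subject) or subject[pos] != "<":
        return None
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ">":
            return pos + 1
        if ch == "<":
            # (?R): the actual recursive call; (R) is true inside it.
            end = match_angle(subject, pos, in_recursion=True)
            if end is None:
                return None
            pos = end
        elif in_recursion and ch not in ASCII_DIGITS:
            return None                  # (?(R) \d++ ...): digits only
        else:
            pos += 1                     # outer level: any non-bracket
    return None


print(match_angle("<abc<123>def>"))
print(match_angle("<abc<12x>def>"))
```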
9326
9327   Differences in recursion processing between PCRE2 and Perl
9328
9329       Some former differences between PCRE2 and Perl no longer exist.
9330
9331       Before release 10.30, recursion processing in PCRE2 differed from  Perl
9332       in  that  a  recursive  subroutine call was always treated as an atomic
9333       group. That is, once it had matched some of the subject string, it  was
9334       never  re-entered,  even if it contained untried alternatives and there
9335       was a subsequent matching failure. (Historical note:  PCRE  implemented
9336       recursion before Perl did.)
9337
9338       Starting  with  release 10.30, recursive subroutine calls are no longer
9339       treated as atomic. That is, they can be re-entered to try unused alter-
9340       natives if there is a matching failure later in the  pattern.  This  is
9341       now  compatible  with the way Perl works. If you want a subroutine call
9342       to be atomic, you must explicitly enclose it in an atomic group.
9343
9344       Supporting backtracking into recursions simplifies certain types of re-
9345       cursive pattern. For example, this pattern matches palindromic strings:
9346
9347         ^((.)(?1)\2|.?)$
9348
9349       The second branch in the group matches a single  central  character  in
9350       the  palindrome  when there are an odd number of characters, or nothing
9351       when there are an even number of characters, but in order  to  work  it
9352       has  to  be  able  to  try the second case when the rest of the pattern
9353       match fails. If you want to match typical palindromic phrases, the pat-
9354       tern has to ignore all non-word characters,  which  can  be  done  like
9355       this:
9356
9357         ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
9358
9359       If  run  with  the  PCRE2_CASELESS option, this pattern matches phrases
9360       such as "A man, a plan, a canal: Panama!". Note the use of the  posses-
9361       sive  quantifier  *+  to  avoid backtracking into sequences of non-word
9362       characters. Without this, PCRE2 takes a great deal longer (ten times or
9363       more) to match typical phrases, and Perl takes so long that  you  think
9364       it has gone into a loop.
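
       Both palindrome patterns have a simple recursive reading, sketched
       below in plain Python (the standard re module has no recursion). The
       phrase version assumes that removing all non-word characters and
       lower-casing is an acceptable approximation of the \W*+ items combined
       with PCRE2_CASELESS. The function names are hypothetical, and very
       long inputs would exceed Python's default recursion limit.

```python
import re


def is_palindrome(text):
    """Peel matching end characters, as the first pattern does."""
    if len(text) <= 1:
        return True                      # the .? branch: zero or one left
    # (.) takes the first character, (?1) handles the middle, and the
    # backreference requires the same character at the end.
    return text[0] == text[-1] and is_palindrome(text[1:-1])


def is_palindromic_phrase(phrase):
    """Ignore non-word characters and case, then test as above."""
    letters = re.sub(r"\W+", "", phrase, flags=re.ASCII).lower()
    return is_palindrome(letters)


print(is_palindrome("abcba"))
print(is_palindromic_phrase("A man, a plan, a canal: Panama!"))
```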
9365
9366       Another  way  in which PCRE2 and Perl used to differ in their recursion
9367       processing is in the handling of captured  values.  Formerly  in  Perl,
9368       when  a  group  was called recursively or as a subroutine (see the next
9369       section), it had no access to any values that were captured outside the
9370       recursion, whereas in PCRE2 these values can  be  referenced.  Consider
9371       this pattern:
9372
9373         ^(.)(\1|a(?2))
9374
9375       This  pattern matches "bab". The first capturing parentheses match "b",
9376       then in the second group, when the backreference \1 fails to match "b",
9377       the second alternative matches "a" and then recurses. In the recursion,
9378       \1 does now match "b" and so the whole match succeeds. This match  used
9379       to fail in Perl, but in later versions (I tried 5.024) it now works.
9380
9381
9382GROUPS AS SUBROUTINES
9383
9384       If  the syntax for a recursive group call (either by number or by name)
9385       is used outside the parentheses to which it refers, it operates  a  bit
9386       like  a  subroutine  in  a programming language. More accurately, PCRE2
9387       treats the referenced group as an independent subpattern which it tries
9388       to match at the current matching position. The called group may be  de-
9389       fined  before  or  after the reference. A numbered reference can be ab-
9390       solute or relative, as in these examples:
9391
9392         (...(absolute)...)...(?2)...
9393         (...(relative)...)...(?-1)...
9394         (...(?+1)...(relative)...
9395
9396       An earlier example pointed out that the pattern
9397
9398         (sens|respons)e and \1ibility
9399
9400       matches "sense and sensibility" and "response and responsibility",  but
9401       not "sense and responsibility". If instead the pattern
9402
9403         (sens|respons)e and (?1)ibility
9404
9405       is  used, it does match "sense and responsibility" as well as the other
9406       two strings. Another example is  given  in  the  discussion  of  DEFINE
9407       above.
9408
9409       Like  recursions,  subroutine  calls  used to be treated as atomic, but
9410       this changed at PCRE2 release 10.30, so  backtracking  into  subroutine
9411       calls  can  now  occur. However, any capturing parentheses that are set
9412       during the subroutine call revert to their previous values afterwards.
9413
9414       Processing options such as case-independence are fixed when a group  is
9415       defined,  so  if  it  is  used  as a subroutine, such options cannot be
9416       changed for different calls. For example, consider this pattern:
9417
9418         (abc)(?i:(?-1))
9419
9420       It matches "abcabc". It does not match "abcABC" because the  change  of
9421       processing option does not affect the called group.
9422
9423       The  behaviour  of  backtracking control verbs in groups when called as
9424       subroutines is described in the section entitled "Backtracking verbs in
9425       subroutines" below.
9426
9427
9428ONIGURUMA SUBROUTINE SYNTAX
9429
9430       For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
9431       name or a number enclosed either in angle brackets or single quotes  is
9432       an alternative syntax for calling a group as a subroutine, possibly re-
9433       cursively.  Here  are  two  of the examples used above, rewritten using
9434       this syntax:
9435
9436         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
9437         (sens|respons)e and \g'1'ibility
9438
9439       PCRE2 supports an extension to Oniguruma: if a number is preceded by  a
9440       plus or a minus sign it is taken as a relative reference. For example:
9441
9442         (abc)(?i:\g<-1>)
9443
9444       Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
9445       synonymous. The former is a backreference; the latter is  a  subroutine
9446       call.
9447
9448
9449CALLOUTS
9450
9451       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
9452       Perl  code to be obeyed in the middle of matching a regular expression.
9453       This makes it possible, amongst other things, to extract different sub-
9454       strings that match the same pair of parentheses when there is a repeti-
9455       tion.
9456
9457       PCRE2 provides a similar feature, but of course it  cannot  obey  arbi-
9458       trary  Perl  code. The feature is called "callout". The caller of PCRE2
9459       provides an external function by putting its entry  point  in  a  match
9460       context  using  the function pcre2_set_callout(), and then passing that
9461       context to pcre2_match() or pcre2_dfa_match(). If no match  context  is
9462       passed, or if the callout entry point is set to NULL, callouts are dis-
9463       abled.
9464
9465       Within  a  regular expression, (?C<arg>) indicates a point at which the
9466       external function is to be called. There  are  two  kinds  of  callout:
9467       those  with a numerical argument and those with a string argument. (?C)
9468       on its own with no argument is treated as (?C0). A  numerical  argument
9469       allows  the  application  to  distinguish  between  different callouts.
9470       String arguments were added for release 10.20 to make it  possible  for
9471       script  languages that use PCRE2 to embed short scripts within patterns
9472       in a similar way to Perl.
9473
9474       During matching, when PCRE2 reaches a callout point, the external func-
9475       tion is called. It is provided with the number or  string  argument  of
9476       the  callout, the position in the pattern, and one item of data that is
9477       also set in the match block. The callout function may cause matching to
9478       proceed, to backtrack, or to fail.
9479
9480       By default, PCRE2 implements a  number  of  optimizations  at  matching
9481       time,  and  one  side-effect is that sometimes callouts are skipped. If
9482       you need all possible callouts to happen, you need to set options  that
9483       disable  the relevant optimizations. More details, including a complete
9484       description of the programming interface to the callout  function,  are
9485       given in the pcre2callout documentation.
9486
9487   Callouts with numerical arguments
9488
9489       If  you  just  want  to  have  a means of identifying different callout
9490       points, put a number less than 256 after the  letter  C.  For  example,
9491       this pattern has two callout points:
9492
9493         (?C1)abc(?C2)def
9494
9495       If  the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical
9496       callouts are automatically installed before each item in  the  pattern.
9497       They  are all numbered 255. If there is a conditional group in the pat-
9498       tern whose condition is an assertion, an additional callout is inserted
9499       just before the condition. An explicit callout may also be set at  this
9500       position, as in this example:
9501
9502         (?(?C9)(?=a)abc|def)
9503
9504       Note that this applies only to assertion conditions, not to other types
9505       of condition.
9506
9507   Callouts with string arguments
9508
9509       A  delimited  string may be used instead of a number as a callout argu-
9510       ment. The starting delimiter must be one of ` ' " ^ % #  $  {  and  the
9511       ending delimiter is the same as the start, except for {, where the end-
9512       ing  delimiter  is  }.  If  the  ending  delimiter is needed within the
9513       string, it must be doubled. For example:
9514
9515         (?C'ab ''c'' d')xyz(?C{any text})pqr
9516
9517       The doubling is removed before the string  is  passed  to  the  callout
9518       function.
9519
9520
9521BACKTRACKING CONTROL
9522
9523       There  are  a  number  of  special "Backtracking Control Verbs" (to use
9524       Perl's terminology) that modify the behaviour  of  backtracking  during
9525       matching.  They are generally of the form (*VERB) or (*VERB:NAME). Some
9526       verbs take either form, and may behave differently depending on whether
9527       or not a name argument is present. The names are  not  required  to  be
9528       unique within the pattern.
9529
9530       By  default,  for  compatibility  with  Perl, a name is any sequence of
9531       characters that does not include a closing parenthesis. The name is not
9532       processed in any way, and it is  not  possible  to  include  a  closing
9533       parenthesis   in  the  name.   This  can  be  changed  by  setting  the
9534       PCRE2_ALT_VERBNAMES option, but the result is no  longer  Perl-compati-
9535       ble.
9536
9537       When  PCRE2_ALT_VERBNAMES  is  set,  backslash processing is applied to
9538       verb names and only an unescaped  closing  parenthesis  terminates  the
9539       name.  However, the only backslash items that are permitted are \Q, \E,
9540       and sequences such as \x{100} that define character code points.  Char-
9541       acter type escapes such as \d are faulted.
9542
9543       A closing parenthesis can be included in a name either as \) or between
9544       \Q  and  \E. In addition to backslash processing, if the PCRE2_EXTENDED
9545       or PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb
9546       names is skipped, and #-comments are recognized, exactly as in the rest
9547       of the pattern.  PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do  not  affect
9548       verb names unless PCRE2_ALT_VERBNAMES is also set.
9549
9550       The  maximum  length of a name is 255 in the 8-bit library and 65535 in
9551       the 16-bit and 32-bit libraries. If the name is empty, that is, if  the
9552       closing  parenthesis immediately follows the colon, the effect is as if
9553       the colon were not there. Any number of these verbs may occur in a pat-
9554       tern. Except for (*ACCEPT), they may not be quantified.
9555
9556       Since these verbs are specifically related  to  backtracking,  most  of
9557       them  can be used only when the pattern is to be matched using the tra-
9558       ditional matching function, because that uses a backtracking algorithm.
9559       With the exception of (*FAIL), which behaves like  a  failing  negative
9560       assertion, the backtracking control verbs cause an error if encountered
9561       by the DFA matching function.
9562
9563       The  behaviour  of  these  verbs in repeated groups, assertions, and in
9564       capture groups called as subroutines (whether or  not  recursively)  is
9565       documented below.
9566
9567   Optimizations that affect backtracking verbs
9568
9569       PCRE2 contains some optimizations that are used to speed up matching by
9570       running some checks at the start of each match attempt. For example, it
9571       may  know  the minimum length of matching subject, or that a particular
9572       character must be present. When one of these optimizations bypasses the
9573       running of a match,  any  included  backtracking  verbs  will  not,  of
9574       course, be processed. You can suppress the start-of-match optimizations
9575       by  setting  the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com-
9576       pile(), or by starting the pattern with (*NO_START_OPT). There is  more
9577       discussion of this option in the section entitled "Compiling a pattern"
9578       in the pcre2api documentation.
9579
9580       Experiments  with  Perl  suggest that it too has similar optimizations,
9581       and like PCRE2, turning them off can change the result of a match.
9582
9583   Verbs that act immediately
9584
9585       The following verbs act as soon as they are encountered.
9586
9587          (*ACCEPT) or (*ACCEPT:NAME)
9588
9589       This verb causes the match to end successfully, skipping the  remainder
9590       of  the  pattern.  However,  when  it is inside a capture group that is
9591       called as a subroutine, only that group is ended successfully. Matching
9592       then continues at the outer level. If (*ACCEPT) is triggered in a posi-
9593       tive assertion, the assertion succeeds; in a  negative  assertion,  the
9594       assertion fails.
9595
9596       If  (*ACCEPT)  is inside capturing parentheses, the data so far is cap-
9597       tured. For example:
9598
9599         A((?:A|B(*ACCEPT)|C)D)
9600
9601       This matches "AB", "AAD", or "ACD"; when it matches "AB", "B"  is  cap-
9602       tured by the outer parentheses.
9603
9604       (*ACCEPT)  is  the only backtracking verb that is allowed to be quanti-
9605       fied because an ungreedy quantification with a  minimum  of  zero  acts
9606       only when a backtrack happens. Consider, for example,
9607
9608         (A(*ACCEPT)??B)C
9609
9610       where  A,  B, and C may be complex expressions. After matching "A", the
9611       matcher processes "BC"; if that fails, causing a  backtrack,  (*ACCEPT)
9612       is  triggered  and the match succeeds. In both cases, all but C is cap-
9613       tured. Whereas (*COMMIT) (see below) means "fail on backtrack",  a  re-
9614       peated (*ACCEPT) of this type means "succeed on backtrack".
9615
9616       Warning:  (*ACCEPT)  should  not be used within a script run group, be-
9617       cause it causes an immediate exit from the group, bypassing the  script
9618       run checking.
9619
9620         (*FAIL) or (*FAIL:NAME)
9621
9622       This  verb causes a matching failure, forcing backtracking to occur. It
9623       may be abbreviated to (*F). It is equivalent  to  (?!)  but  easier  to
9624       read. The Perl documentation notes that it is probably useful only when
9625       combined with (?{}) or (??{}). Those are, of course, Perl features that
9626       are  not  present  in PCRE2. The nearest equivalent is the callout fea-
9627       ture, as for example in this pattern:
9628
9629         a+(?C)(*FAIL)
9630
9631       A match with the string "aaaa" always fails, but the callout  is  taken
9632       before each backtrack happens (in this example, 10 times).
9633
9634       (*ACCEPT:NAME)  and  (*FAIL:NAME)  behave the same as (*MARK:NAME)(*AC-
9635       CEPT) and (*MARK:NAME)(*FAIL), respectively,  that  is,  a  (*MARK)  is
9636       recorded just before the verb acts.
9637
9638   Recording which path was taken
9639
9640       There  is  one  verb whose main purpose is to track how a match was ar-
9641       rived at, though it also has a secondary use in  conjunction  with  ad-
9642       vancing the match starting point (see (*SKIP) below).
9643
9644         (*MARK:NAME) or (*:NAME)
9645
9646       A  name is always required with this verb. For all the other backtrack-
9647       ing control verbs, a NAME argument is optional.
9648
9649       When a match succeeds, the last-encountered mark name on
9650       the matching path is passed back to the caller as described in the sec-
9651       tion entitled "Other information about the match" in the pcre2api docu-
9652       mentation.  This  applies  to all instances of (*MARK) and other verbs,
9653       including those inside assertions and atomic groups. However, there are
9654       differences in those cases when (*MARK) is  used  in  conjunction  with
9655       (*SKIP) as described below.
9656
9657       The  mark name that was last encountered on the matching path is passed
9658       back. A verb without a NAME argument is ignored for this purpose.  Here
9659       is  an  example of pcre2test output, where the "mark" modifier requests
9660       the retrieval and outputting of (*MARK) data:
9661
9662           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
9663         data> XY
9664          0: XY
9665         MK: A
9666         data> XZ
9667          0: XZ
9668         MK: B
9669
9670       The (*MARK) name is tagged with "MK:" in this output, and in this exam-
9671       ple it indicates which of the two alternatives matched. This is a  more
9672       efficient  way of obtaining this information than putting each alterna-
9673       tive in its own capturing parentheses.
9674
9675       If a verb with a name is encountered in a positive  assertion  that  is
9676       true,  the  name  is recorded and passed back if it is the last-encoun-
9677       tered. This does not happen for negative assertions or failing positive
9678       assertions.
9679
9680       After a partial match or a failed match, the last encountered  name  in
9681       the entire match process is returned. For example:
9682
9683           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
9684         data> XP
9685         No match, mark = B
9686
9687       Note  that  in  this  unanchored  example the mark is retained from the
9688       match attempt that started at the letter "X" in the subject. Subsequent
9689       match attempts starting at "P" and then with an empty string do not get
9690       as far as the (*MARK) item, but nevertheless do not reset it.
9691
9692       If you are interested in  (*MARK)  values  after  failed  matches,  you
9693       should  probably  set the PCRE2_NO_START_OPTIMIZE option (see above) to
9694       ensure that the match is always attempted.
9695
9696   Verbs that act after backtracking
9697
9698       The following verbs do nothing when they are encountered. Matching con-
9699       tinues with what follows, but if there is a subsequent  match  failure,
9700       causing  a  backtrack  to the verb, a failure is forced. That is, back-
9701       tracking cannot pass to the left of the  verb.  However,  when  one  of
9702       these verbs appears inside an atomic group or in a lookaround assertion
9703       that  is  true,  its effect is confined to that group, because once the
9704       group has been matched, there is never any backtracking into it.  Back-
9705       tracking from beyond an assertion or an atomic group ignores the entire
9706       group, and seeks a preceding backtracking point.
9707
9708       These  verbs  differ  in exactly what kind of failure occurs when back-
9709       tracking reaches them. The behaviour described below  is  what  happens
9710       when  the  verb is not in a subroutine or an assertion. Subsequent sec-
9711       tions cover these special cases.
9712
9713         (*COMMIT) or (*COMMIT:NAME)
9714
9715       This verb causes the whole match to fail outright if there is  a  later
9716       matching failure that causes backtracking to reach it. Even if the pat-
9717       tern  is  unanchored,  no further attempts to find a match by advancing
9718       the starting point take place. If (*COMMIT) is  the  only  backtracking
9719       verb that is encountered, once it has been passed pcre2_match() is com-
9720       mitted to finding a match at the current starting point, or not at all.
9721       For example:
9722
9723         a+(*COMMIT)b
9724
9725       This  matches  "xxaab" but not "aacaab". It can be thought of as a kind
9726       of dynamic anchor, or "I've started, so I must finish."
9727
9728       The behaviour of (*COMMIT:NAME) is not the same  as  (*MARK:NAME)(*COM-
9729       MIT).  It is like (*MARK:NAME) in that the name is remembered for pass-
9730       ing back to the caller. However, (*SKIP:NAME) searches only  for  names
9731       that are set with (*MARK), ignoring those set by any of the other back-
9732       tracking verbs.
9733
9734       If  there  is more than one backtracking verb in a pattern, a different
9735       one that follows (*COMMIT) may be triggered first,  so  merely  passing
9736       (*COMMIT) during a match does not always guarantee that a match must be
9737       at this starting point.
9738
9739       Note that (*COMMIT) at the start of a pattern is not the same as an an-
9740       chor,  unless  PCRE2's  start-of-match optimizations are turned off, as
9741       shown in this output from pcre2test:
9742
9743           re> /(*COMMIT)abc/
9744         data> xyzabc
9745          0: abc
9746         data>
9747           re> /(*COMMIT)abc/no_start_optimize
9748         data> xyzabc
9749         No match
9750
9751       For the first pattern, PCRE2 knows that any match must start with  "a",
9752       so  the optimization skips along the subject to "a" before applying the
9753       pattern to the first set of data. The match attempt then succeeds.  The
9754       second  pattern disables the optimization that skips along to the first
9755       character. The pattern is now applied  starting  at  "x",  and  so  the
9756       (*COMMIT)  causes  the  match to fail without trying any other starting
9757       points.
9758
9759         (*PRUNE) or (*PRUNE:NAME)
9760
9761       This verb causes the match to fail at the current starting position  in
9762       the subject if there is a later matching failure that causes backtrack-
9763       ing  to  reach it. If the pattern is unanchored, the normal "bumpalong"
9764       advance to the next starting character then happens.  Backtracking  can
9765       occur  as  usual to the left of (*PRUNE), before it is reached, or when
9766       matching to the right of (*PRUNE), but if there  is  no  match  to  the
9767       right,  backtracking cannot cross (*PRUNE). In simple cases, the use of
9768       (*PRUNE) is just an alternative to an atomic group or possessive  quan-
9769       tifier, but there are some uses of (*PRUNE) that cannot be expressed in
9770       any  other  way. In an anchored pattern (*PRUNE) has the same effect as
9771       (*COMMIT).
9772
9773       The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
9774       It is like (*MARK:NAME) in that the name is remembered for passing back
9775       to the caller. However, (*SKIP:NAME) searches only for names  set  with
9776       (*MARK), ignoring those set by other backtracking verbs.
9777
9778         (*SKIP)
9779
9780       This  verb, when given without a name, is like (*PRUNE), except that if
9781       the pattern is unanchored, the "bumpalong" advance is not to  the  next
9782       character, but to the position in the subject where (*SKIP) was encoun-
9783       tered.  (*SKIP)  signifies that whatever text was matched leading up to
9784       it cannot be part of a successful match if there is a  later  mismatch.
9785       Consider:
9786
9787         a+(*SKIP)b
9788
9789       If  the  subject  is  "aaaac...",  after  the first match attempt fails
9790       (starting at the first character in the  string),  the  starting  point
9791       skips on to start the next attempt at "c". Note that a possessive quan-
9792       tifier does not have the same effect as this example; although it would
9793       suppress  backtracking  during  the first match attempt, the second at-
9794       tempt would start at the second character instead  of  skipping  on  to
9795       "c".
9796
9797       If  (*SKIP) is used to specify a new starting position that is the same
9798       as the starting position of the current match, or (by  being  inside  a
9799       lookbehind)  earlier, the position specified by (*SKIP) is ignored, and
9800       instead the normal "bumpalong" occurs.
9801
9802         (*SKIP:NAME)
9803
9804       When (*SKIP) has an associated name, its behaviour  is  modified.  When
9805       such  a  (*SKIP) is triggered, the previous path through the pattern is
9806       searched for the most recent (*MARK) that has the same name. If one  is
9807       found,  the  "bumpalong" advance is to the subject position that corre-
9808       sponds to that (*MARK) instead of to where (*SKIP) was encountered.  If
9809       no (*MARK) with a matching name is found, the (*SKIP) is ignored.
9810
9811       The  search  for a (*MARK) name uses the normal backtracking mechanism,
9812       which means that it does not  see  (*MARK)  settings  that  are  inside
9813       atomic groups or assertions, because they are never re-entered by back-
9814       tracking. Compare the following pcre2test examples:
9815
9816           re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
9817         data> abc
9818          0: a
9819          1: a
9820         data>
9821           re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
9822         data> abc
9823          0: b
9824          1: b
9825
9826       In  the first example, the (*MARK) setting is in an atomic group, so it
9827       is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored.
9828       This allows the second branch of the pattern to be tried at  the  first
9829       character  position.  In the second example, the (*MARK) setting is not
9830       in an atomic group. This allows (*SKIP:X) to find the (*MARK)  when  it
9831       backtracks, and this causes a new matching attempt to start at the sec-
9832       ond  character.  This  time, the (*MARK) is never seen because "a" does
9833       not match "b", so the matcher immediately jumps to the second branch of
9834       the pattern.
9835
9836       Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME).  It
9837       ignores names that are set by other backtracking verbs.
9838
9839         (*THEN) or (*THEN:NAME)
9840
9841       This  verb  causes  a skip to the next innermost alternative when back-
9842       tracking reaches it. That  is,  it  cancels  any  further  backtracking
9843       within  the  current  alternative.  Its name comes from the observation
9844       that it can be used for a pattern-based if-then-else block:
9845
9846         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
9847
9848       If the COND1 pattern matches, FOO is tried (and possibly further  items
9849       after  the  end  of the group if FOO succeeds); on failure, the matcher
9850       skips to the second alternative and tries COND2,  without  backtracking
9851       into  COND1.  If that succeeds and BAR fails, COND3 is tried. If subse-
9852       quently BAZ fails, there are no more alternatives, so there is a  back-
9853       track  to  whatever came before the entire group. If (*THEN) is not in-
9854       side an alternation, it acts like (*PRUNE).
9855
9856       The behaviour of (*THEN:NAME) is not the same  as  (*MARK:NAME)(*THEN).
9857       It is like (*MARK:NAME) in that the name is remembered for passing back
9858       to  the  caller. However, (*SKIP:NAME) searches only for names set with
9859       (*MARK), ignoring those set by other backtracking verbs.
9860
9861       A group that does not contain a | character is just a part of  the  en-
9862       closing  alternative;  it is not a nested alternation with only one al-
9863       ternative. The effect of (*THEN) extends beyond such a group to the en-
9864       closing alternative.  Consider this pattern, where A, B, etc. are  com-
9865       plex  pattern  fragments  that  do not contain any | characters at this
9866       level:
9867
9868         A (B(*THEN)C) | D
9869
9870       If A and B are matched, but there is a failure in C, matching does  not
9871       backtrack into A; instead it moves to the next alternative, that is, D.
9872       However,  if  the  group containing (*THEN) is given an alternative, it
9873       behaves differently:
9874
9875         A (B(*THEN)C | (*FAIL)) | D
9876
9877       The effect of (*THEN) is now confined to the inner group. After a fail-
9878       ure in C, matching moves to (*FAIL), which causes the  whole  group  to
9879       fail  because  there  are  no  more  alternatives to try. In this case,
9880       matching does backtrack into A.
9881
9882       Note that a conditional group is not considered as having two  alterna-
9883       tives,  because  only one is ever used. In other words, the | character
9884       in a conditional group has a different meaning. Ignoring  white  space,
9885       consider:
9886
9887         ^.*? (?(?=a) a | b(*THEN)c )
9888
9889       If the subject is "ba", this pattern does not match. Because .*? is un-
9890       greedy,  it initially matches zero characters. The condition (?=a) then
9891       fails, the character "b" is matched, but "c" is  not.  At  this  point,
9892       matching  does  not  backtrack to .*? as might perhaps be expected from
9893       the presence of the | character. The conditional group is part  of  the
9894       single  alternative  that comprises the whole pattern, and so the match
9895       fails. (If there were a backtrack into .*?, allowing it to match  "b",
9896       the match would succeed.)
9897
9898       The  verbs just described provide four different "strengths" of control
9899       when subsequent matching fails. (*THEN) is the weakest, carrying on the
9900       match at the next alternative. (*PRUNE) comes next, failing  the  match
9901       at  the  current starting position, but allowing an advance to the next
9902       character (for an unanchored pattern). (*SKIP) is similar, except  that
9903       the advance may be more than one character. (*COMMIT) is the strongest,
9904       causing the entire match to fail.
9905
9906   More than one backtracking verb
9907
9908       If  more  than  one  backtracking verb is present in a pattern, the one
9909       that is backtracked onto first acts. For example,  consider  this  pat-
9910       tern, where A, B, etc. are complex pattern fragments:
9911
9912         (A(*COMMIT)B(*THEN)C|ABD)
9913
9914       If  A matches but B fails, the backtrack to (*COMMIT) causes the entire
9915       match to fail. However, if A and B match, but C fails, the backtrack to
9916       (*THEN) causes the next alternative (ABD) to be tried.  This  behaviour
9917       is  consistent,  but is not always the same as Perl's. It means that if
9918       two or more backtracking verbs appear in succession, all but  the  last
9919       of them have no effect. Consider this example:
9920
9921         ...(*COMMIT)(*PRUNE)...
9922
9923       If there is a matching failure to the right, backtracking onto (*PRUNE)
9924       causes  it to be triggered, and its action is taken. There can never be
9925       a backtrack onto (*COMMIT).
9926
9927   Backtracking verbs in repeated groups
9928
9929       PCRE2 sometimes differs from Perl in its handling of backtracking verbs
9930       in repeated groups. For example, consider:
9931
9932         /(a(*COMMIT)b)+ac/
9933
9934       If the subject is "abac", Perl matches  unless  its  optimizations  are
9935       disabled,  but  PCRE2  always fails because the (*COMMIT) in the second
9936       repeat of the group acts.
9937
9938   Backtracking verbs in assertions
9939
9940       (*FAIL) in any assertion has its normal effect: it forces an  immediate
9941       backtrack.  The  behaviour  of  the other backtracking verbs depends on
9942       whether the assertion is standalone or is acting  as  the  condition
9943       in a conditional group.
9944
9945       (*ACCEPT)  in  a  standalone positive assertion causes the assertion to
9946       succeed without any further processing; captured  strings  and  a  mark
9947       name  (if  set) are retained. In a standalone negative assertion, (*AC-
9948       CEPT) causes the assertion to fail without any further processing; cap-
9949       tured substrings and any mark name are discarded.
9950
9951       If the assertion is a condition, (*ACCEPT) causes the condition  to  be
9952       true  for  a  positive assertion and false for a negative one; captured
9953       substrings are retained in both cases.
9954
9955       The remaining verbs act only when a later failure causes a backtrack to
9956       reach them. This means that, for the Perl-compatible assertions,  their
9957       effect is confined to the assertion, because Perl lookaround assertions
9958       are atomic. A backtrack that occurs after such an assertion is complete
9959       does  not  jump  back  into  the  assertion.  Note in particular that a
9960       (*MARK) name that is set in an assertion is not "seen" by  an  instance
9961       of (*SKIP:NAME) later in the pattern.
9962
9963       PCRE2  now supports non-atomic positive assertions, as described in the
9964       section entitled "Non-atomic assertions" above. These  assertions  must
9965       be  standalone  (not used as conditions). They are not Perl-compatible.
9966       For these assertions, a later backtrack does jump back into the  asser-
9967       tion,  and  therefore verbs such as (*COMMIT) can be triggered by back-
9968       tracks from later in the pattern.
9969
9970       The effect of (*THEN) is not allowed to escape beyond an assertion.  If
9971       there  are no more branches to try, (*THEN) causes a positive assertion
9972       to be false, and a negative assertion to be true.
9973
9974       The other backtracking verbs are not treated specially if  they  appear
9975       in  a  standalone  positive assertion. In a conditional positive asser-
9976       tion, backtracking (from within the assertion) into (*COMMIT), (*SKIP),
9977       or (*PRUNE) causes the condition to be false. However, for both  stand-
9978       alone and conditional negative assertions, backtracking into (*COMMIT),
9979       (*SKIP), or (*PRUNE) causes the assertion to be true, without consider-
9980       ing any further alternative branches.
9981
9982   Backtracking verbs in subroutines
9983
9984       These behaviours occur whether or not the group is called recursively.
9985
9986       (*ACCEPT) in a group called as a subroutine causes the subroutine match
9987       to  succeed without any further processing. Matching then continues af-
9988       ter the subroutine call. Perl documents this behaviour.  Perl's  treat-
9989       ment of the other verbs in subroutines is different in some cases.
9990
9991       (*FAIL)  in  a  group  called as a subroutine has its normal effect: it
9992       forces an immediate backtrack.
9993
9994       (*COMMIT), (*SKIP), and (*PRUNE) cause the  subroutine  match  to  fail
9995       when  triggered  by being backtracked to in a group called as a subrou-
9996       tine. There is then a backtrack at the outer level.
9997
9998       (*THEN), when triggered, skips to the next alternative in the innermost
9999       enclosing group that has alternatives (its normal behaviour).  However,
10000       if there is no such group within the subroutine's group, the subroutine
10001       match fails and there is a backtrack at the outer level.
10002
10003
10004SEE ALSO
10005
10006       pcre2api(3),    pcre2callout(3),    pcre2matching(3),   pcre2syntax(3),
10007       pcre2(3).
10008
10009
10010AUTHOR
10011
10012       Philip Hazel
10013       Retired from University Computing Service
10014       Cambridge, England.
10015
10016
10017REVISION
10018
10019       Last updated: 04 June 2024
10020       Copyright (c) 1997-2024 University of Cambridge.
10021
10022
10023PCRE2 10.44                      04 June 2024                  PCRE2PATTERN(3)
10024------------------------------------------------------------------------------
10025
10026
10027
10028PCRE2PERFORM(3)            Library Functions Manual            PCRE2PERFORM(3)
10029
10030
10031NAME
10032       PCRE2 - Perl-compatible regular expressions (revised API)
10033
10034
10035PCRE2 PERFORMANCE
10036
10037       Two  aspects  of performance are discussed below: memory usage and pro-
10038       cessing time. The way you express your pattern as a regular  expression
10039       can affect both of them.
10040
10041
10042COMPILED PATTERN MEMORY USAGE
10043
10044       Patterns are compiled by PCRE2 into a reasonably efficient interpretive
10045       code,  so  that most simple patterns do not use much memory for storing
10046       the compiled version. However, there is one case where the memory usage
10047       of a compiled pattern can be unexpectedly  large.  If  a  parenthesized
10048       group  has  a quantifier with a minimum greater than 1 and/or a limited
10049       maximum, the whole group is repeated in the compiled code. For example,
10050       the pattern
10051
10052         (abc|def){2,4}
10053
10054       is compiled as if it were
10055
10056         (abc|def)(abc|def)((abc|def)(abc|def)?)?
10057
10058       (Technical aside: It is done this way so that backtrack  points  within
10059       each of the repetitions can be independently maintained.)
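
       The  expansion described above can be modelled textually. The following
       is only a sketch: unroll_quantifier() and its  helpers  are  hypothetical
       names, and the string they build mirrors the "as if it were" form shown
       above, not the actual compiled code that PCRE2 produces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append src to dst, whose capacity is cap bytes including the NUL.
   Returns 0 on success, -1 if the result would not fit. */
static int append(char *dst, size_t cap, const char *src)
{
    size_t used = strlen(dst);
    size_t need = strlen(src);
    if (used + need + 1 > cap)
        return -1;
    memcpy(dst + used, src, need + 1);
    return 0;
}

/* Append the nested optional tail for n optional copies of group G:
   n == 0 gives nothing, n == 1 gives G?, n >= 2 gives (G<tail n-1>)? */
static int append_optional(char *dst, size_t cap, const char *group,
                           unsigned n)
{
    if (n == 0)
        return 0;
    if (n == 1)
        return (append(dst, cap, group) || append(dst, cap, "?")) ? -1 : 0;
    if (append(dst, cap, "(") || append(dst, cap, group))
        return -1;
    if (append_optional(dst, cap, group, n - 1))
        return -1;
    return append(dst, cap, ")?");
}

/* Write the unrolled form of group{min,max} into out.
   Returns 0 on success, -1 if out is too small. */
int unroll_quantifier(const char *group, unsigned min, unsigned max,
                      char *out, size_t cap)
{
    assert(group != NULL && out != NULL && cap > 0 && min <= max);
    out[0] = '\0';
    for (unsigned i = 0; i < min; i++)      /* the mandatory copies */
        if (append(out, cap, group))
            return -1;
    return append_optional(out, cap, group, max - min);
}
```

       Calling unroll_quantifier("(abc|def)", 2, 4, buf, sizeof buf) fills buf
       with the expanded pattern shown above. The length of the result grows
       with the maximum repeat count, which is the effect the text describes.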
10060
10061       For  regular expressions whose quantifiers use only small numbers, this
10062       is not usually a problem. However, if the numbers are large,  and  par-
10063       ticularly  if  such repetitions are nested, the memory usage can become
10064       an embarrassment. For example, the very simple pattern
10065
10066         ((ab){1,1000}c){1,3}
10067
10068       uses over 50KiB when compiled using the 8-bit library.  When  PCRE2  is
10069       compiled  with its default internal pointer size of two bytes, the size
10070       limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
10071       libraries, and this is reached with the above pattern if the outer rep-
10072       etition is increased from 3 to 4. PCRE2 can be compiled to  use  larger
10073       internal  pointers  and thus handle larger compiled patterns, but it is
10074       better to try to rewrite your pattern to use less memory if you can.
10075
10076       One way of reducing the memory usage for such patterns is to  make  use
10077       of PCRE2's "subroutine" facility. Re-writing the above pattern as
10078
10079         ((ab)(?2){0,999}c)(?1){0,2}
10080
10081       reduces  the memory requirements to around 16KiB, and indeed it remains
10082       under 20KiB even with the outer repetition increased to  100.  However,
10083       this kind of pattern is not always exactly equivalent, because any cap-
10084       tures  within  subroutine calls are lost when the subroutine completes.
10085       If this is not a problem, this kind of  rewriting  will  allow  you  to
10086       process  patterns that PCRE2 cannot otherwise handle. The matching per-
10087       formance of the two different versions of the pattern  is  roughly  the
10088       same.  (This applies from release 10.30 - things were different in ear-
10089       lier releases.)
10090
10091
10092STACK AND HEAP USAGE AT RUN TIME
10093
10094       From release 10.30, the interpretive (non-JIT) version of pcre2_match()
10095       uses very little system stack at run time. In earlier  releases  recur-
10096       sive  function  calls  could  use a great deal of stack, and this could
10097       cause problems, but this usage has been eliminated. Backtracking  posi-
10098       tions  are now explicitly remembered in memory frames controlled by the
10099       code.
10100
10101       The size of each frame depends on the size of pointer variables and the
10102       number of capturing parenthesized groups in the pattern being  matched.
10103       On a 64-bit system the frame size for a pattern with no captures is 128
10104       bytes. For each capturing group the size increases by 16 bytes.
10105
10106       Until  release  10.41,  an initial 20KiB frames vector was allocated on
10107       the system stack, but this still caused some  issues  for  multi-thread
10108       applications  where  each  thread  has a very small stack. From release
10109       10.41 backtracking memory frames are always held  in  heap  memory.  An
10110       initial heap allocation is obtained the first time any match data block
10111       is  passed  to  pcre2_match().  This  is remembered with the match data
10112       block and re-used if that block is used for another match. It is  freed
10113       when the match data block itself is freed.
10114
10115       The  size  of the initial block is the larger of 20KiB or ten times the
10116       pattern's frame size, unless the heap limit is less than this, in which
10117       case the heap limit is used. If the initial  block  proves  to  be  too
10118       small during matching, it is replaced by a larger block, subject to the
10119       heap  limit.  The  heap limit is checked only when a new block is to be
10120       allocated. Reducing the heap limit between calls to pcre2_match()  with
10121       the same match data block does not affect the saved block.
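
       The sizing rules above amount to a small piece of arithmetic. The  fol-
       lowing  sketch  restates  them  in C, using only the figures quoted for
       a 64-bit system. The function names are hypothetical, and the heap limit
       is taken in bytes purely to keep the example simple; it is not a  de-
       scription of PCRE2's internal code.

```c
#include <assert.h>
#include <stddef.h>

#define KIB ((size_t)1024)

/* Frame size in bytes on a 64-bit system, from the figures quoted above:
   128 bytes with no captures, plus 16 bytes per capturing group. */
size_t frame_size_64bit(size_t capture_groups)
{
    return 128 + 16 * capture_groups;
}

/* Size of the initial heap block: the larger of 20KiB or ten frames,
   but never more than the heap limit (given here in bytes, which is an
   assumption of this sketch). */
size_t initial_block_size(size_t frame_size, size_t heap_limit_bytes)
{
    assert(frame_size > 0);
    size_t wanted = 10 * frame_size;
    if (wanted < 20 * KIB)
        wanted = 20 * KIB;
    return wanted < heap_limit_bytes ? wanted : heap_limit_bytes;
}
```

       For a pattern with three capturing groups the frame size  is  176  bytes,
       so  ten frames are far smaller than 20KiB and the initial block is 20KiB
       unless the heap limit is lower.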
10122
10123       In  contrast  to  pcre2_match(),  pcre2_dfa_match()  does use recursive
10124       function calls, but only for processing atomic groups,  lookaround  as-
10125       sertions, and recursion within the pattern. The original version of the
10126       code  used  to  allocate  quite large internal workspace vectors on the
10127       stack, which caused some problems for  some  patterns  in  environments
10128       with  small  stacks.  From release 10.32 the code for pcre2_dfa_match()
10129       has been re-factored to use heap memory  when  necessary  for  internal
10130       workspace  when  recursing,  though  recursive function calls are still
10131       used.
10132
10133       The "match depth" parameter can be used to limit the depth of  function
10134       recursion,  and  the  "match  heap"  parameter  to limit heap memory in
10135       pcre2_dfa_match().
10136
10137
10138PROCESSING TIME
10139
10140       Certain items in regular expression patterns are processed  more  effi-
10141       ciently than others. It is more efficient to use a character class like
10142       [aeiou]   than   a   set   of  single-character  alternatives  such  as
10143       (a|e|i|o|u). In general, the simplest construction  that  provides  the
10144       required behaviour is usually the most efficient. Jeffrey Friedl's book
10145       contains  a  lot  of useful general discussion about optimizing regular
10146       expressions for efficient performance. This document contains a few ob-
10147       servations about PCRE2.
10148
10149       Using Unicode character properties (the \p,  \P,  and  \X  escapes)  is
10150       slow,  because  PCRE2 has to use a multi-stage table lookup whenever it
10151       needs a character's property. If you can find  an  alternative  pattern
10152       that does not use character properties, it will probably be faster.
10153
10154       By  default,  the  escape  sequences  \b, \d, \s, and \w, and the POSIX
10155       character classes such as [:alpha:]  do  not  use  Unicode  properties,
10156       partly for backwards compatibility, and partly for performance reasons.
10157       However,  you  can  set  the PCRE2_UCP option or start the pattern with
10158       (*UCP) if you want Unicode character properties to be  used.  This  can
10159       double  the  matching  time  for  items  such  as \d, when matched with
10160       pcre2_match(); the performance loss is less with a DFA  matching  func-
10161       tion, and in both cases there is not much difference for \b.
10162
10163       When  a pattern begins with .* not in atomic parentheses, nor in paren-
10164       theses that are the subject of a backreference,  and  the  PCRE2_DOTALL
10165       option  is  set,  the pattern is implicitly anchored by PCRE2, since it
10166       can match only at the start of a subject string.  If  the  pattern  has
10167       multiple top-level branches, they must all be anchorable. The optimiza-
10168       tion  can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is au-
10169       tomatically disabled if the pattern contains (*PRUNE) or (*SKIP).
10170
10171       If PCRE2_DOTALL is not set, PCRE2 cannot make  this  optimization,  be-
10172       cause  the  dot metacharacter does not then match a newline, and if the
10173       subject string contains newlines, the pattern may match from the  char-
10174       acter immediately following one of them instead of from the very start.
10175       For example, the pattern
10176
10177         .*second
10178
10179       matches  the subject "first\nand second" (where \n stands for a newline
10180       character), with the match starting at the seventh character. In  order
10181       to  do  this, PCRE2 has to retry the match starting after every newline
10182       in the subject.
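
       The retry-after-newline behaviour can be  modelled  without  a  regular
       expression  engine. The following sketch is a hand-written stand-in for
       a pattern of the form .*literal without PCRE2_DOTALL; the function name
       is hypothetical, and it assumes that "\n" is the only newline  sequence
       and that the literal contains no newline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Offset at which .*<literal> (without PCRE2_DOTALL) would start matching:
   the start of the first line that contains the literal, or -1 if no line
   contains it. */
long dotstar_match_start(const char *subject, const char *literal)
{
    assert(subject != NULL && literal != NULL);
    assert(strchr(literal, '\n') == NULL);
    size_t lit_len = strlen(literal);
    const char *line = subject;

    for (;;) {
        size_t line_len = strcspn(line, "\n");
        /* Look for the literal wholly inside the current line. */
        for (size_t i = 0; i + lit_len <= line_len; i++)
            if (memcmp(line + i, literal, lit_len) == 0)
                return (long)(line - subject);
        if (line[line_len] == '\0')
            return -1;              /* no more lines to retry from */
        line += line_len + 1;       /* retry just after the newline */
    }
}
```

       For the subject "first\nand second" and the literal "second" the result
       is offset 6, that is, the seventh character, as stated above.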
10183
10184       If you are using such a pattern with subject strings that do  not  con-
10185       tain   newlines,   the   best   performance   is  obtained  by  setting
10186       PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate  ex-
10187       plicit  anchoring.  That saves PCRE2 from having to scan along the sub-
10188       ject looking for a newline to restart at.
10189
10190       Beware of patterns that contain nested indefinite  repeats.  These  can
10191       take  a  long time to run when applied to a string that does not match.
10192       Consider the pattern fragment
10193
10194         ^(a+)*
10195
10196       This can match "aaaa" in 16 different ways, and this  number  increases
10197       very  rapidly  as the string gets longer. (The * repeat can match 0, 1,
10198       2, 3, or 4 times, and for each of those cases other than 0 or 4, the  +
10199       repeats  can  match  different numbers of times.) When the remainder of
10200       the pattern is such that the entire match is going to fail,  PCRE2  has
10201       in  principle to try every possible variation, and this can take an ex-
10202       tremely long time, even for relatively short strings.
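
       The  figure  of 16 can be checked by counting. The sketch below (with a
       hypothetical function name) counts, for a subject  of  n  "a"  charac-
       ters, every way of choosing a matched prefix and cutting it into one or
       more non-empty runs, one run per iteration of the group.

```c
#include <assert.h>

/* Number of distinct ways ^(a+)* can match some prefix of a subject that
   consists of n "a" characters. */
unsigned long long count_match_ways(unsigned n)
{
    assert(n <= 62);                /* keep the total within 64 bits */
    unsigned long long splits[63];  /* splits[k]: ways to cut k a's into runs */
    unsigned long long total = 0;

    splits[0] = 1;                  /* zero iterations of the group */
    for (unsigned k = 1; k <= n; k++) {
        splits[k] = 0;
        /* The last iteration of the group consumes j characters. */
        for (unsigned j = 1; j <= k; j++)
            splits[k] += splits[k - j];
    }
    for (unsigned k = 0; k <= n; k++)   /* any prefix length may be matched */
        total += splits[k];
    return total;
}
```

       The count doubles with each extra character, so a subject  of  only  20
       characters already allows more than a million variations.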
10203
10204       An optimization catches some of the simpler cases, such as
10205
10206         (a+)*b
10207
10208       where a literal character follows. Before  embarking  on  the  standard
10209       matching  procedure, PCRE2 checks that there is a "b" later in the sub-
10210       ject string, and if there is not, it fails the match immediately.  How-
10211       ever,  when  there  is no following literal this optimization cannot be
10212       used. You can see the difference by comparing the behaviour of
10213
10214         (a+)*\d
10215
10216       with the pattern above. The pattern ending in "b" fails almost instantly
10217       when applied to a whole line of "a" characters, whereas the  \d  version
10218       takes an appreciable time with strings longer than about 20 characters.
10219
10220       In many cases, the solution to this kind of performance issue is to use
10221       an atomic group or a possessive quantifier. This can often reduce  mem-
10222       ory requirements as well. As another example, consider this pattern:
10223
10224         ([^<]|<(?!inet))+
10225
10226       It  matches  from wherever it starts until it encounters "<inet" or the
10227       end of the data, and is the kind of pattern that  might  be  used  when
10228       processing an XML file. Each iteration of the outer parentheses matches
10229       either  one  character that is not "<" or a "<" that is not followed by
10230       "inet". However, each time a parenthesis is processed,  a  backtracking
10231       position  is  passed,  so this formulation uses a memory frame for each
10232       matched character. For a long string, a lot of memory is required. Con-
10233       sider now this  rewritten  pattern,  which  matches  exactly  the  same
10234       strings:
10235
10236         ([^<]++|<(?!inet))+
10237
10238       This runs much faster, because sequences of characters that do not con-
10239       tain "<" are "swallowed" in one item inside the parentheses, and a pos-
10240       sessive  quantifier  is  used to stop any backtracking into the runs of
10241       non-"<" characters. This version also uses a lot  less  memory  because
10242       entry  to  a  new  set of parentheses happens only when a "<" character
10243       that is not followed by "inet" is encountered (and we  assume  this  is
10244       relatively rare).
10245
10246       This example shows that one way of optimizing performance when matching
10247       long  subject strings is to write repeated parenthesized subpatterns to
10248       match more than one character whenever possible.
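
       The  difference  between the two formulations can be made concrete by
       counting iterations of the outer group, each of which, as noted  above,
       uses  a  memory  frame. The following sketch is a hand-written model of
       the two patterns with hypothetical function names; it does  not  call
       PCRE2 and does not measure real memory usage.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* True if s points at a "<" that is not followed by "inet". */
static int lone_angle(const char *s)
{
    return s[0] == '<' && strncmp(s + 1, "inet", 4) != 0;
}

/* Iterations of the outer group for ([^<]|<(?!inet))+ :
   one per matched character. */
size_t iterations_per_char(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    while (*s != '\0' && (*s != '<' || lone_angle(s))) {
        s++;
        count++;
    }
    return count;
}

/* Iterations for ([^<]++|<(?!inet))+ : one per run of non-"<" characters,
   plus one per "<" that is not followed by "inet". */
size_t iterations_per_run(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    while (*s != '\0') {
        if (*s != '<')
            s += strcspn(s, "<");   /* swallow the whole run at once */
        else if (lone_angle(s))
            s++;
        else
            break;                  /* reached "<inet": the match ends */
        count++;
    }
    return count;
}
```

       For the subject "abc<b>def<inet" the first function  counts  nine  iter-
       ations and the second counts three.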
10249
10250   SETTING RESOURCE LIMITS
10251
10252       You can set limits on the amount of processing that  takes  place  when
10253       matching,  and  on  the amount of heap memory that is used. The default
10254       values of the limits are very large, and unlikely ever to operate. They
10255       can be changed when PCRE2 is built, and  they  can  also  be  set  when
10256       pcre2_match()  or pcre2_dfa_match() is called. For details of these in-
10257       terfaces, see the pcre2build documentation  and  the  section  entitled
10258       "The match context" in the pcre2api documentation.
10259
10260       The  pcre2test  test program has a modifier called "find_limits" which,
10261       if applied to a subject line, causes it to  find  the  smallest  limits
10262       that allow a pattern to match. This is done by repeatedly matching with
10263       different limits.
10264
10265
10266AUTHOR
10267
10268       Philip Hazel
10269       Retired from University Computing Service
10270       Cambridge, England.
10271
10272
10273REVISION
10274
10275       Last updated: 27 July 2022
10276       Copyright (c) 1997-2022 University of Cambridge.
10277
10278
10279PCRE2 10.41                      27 July 2022                  PCRE2PERFORM(3)
10280------------------------------------------------------------------------------
10281
10282
10283
10284PCRE2POSIX(3)              Library Functions Manual              PCRE2POSIX(3)
10285
10286
10287NAME
10288       PCRE2 - Perl-compatible regular expressions (revised API)
10289
10290
10291SYNOPSIS
10292
10293       #include <pcre2posix.h>
10294
10295       int pcre2_regcomp(regex_t *preg, const char *pattern,
10296            int cflags);
10297
10298       int pcre2_regexec(const regex_t *preg, const char *string,
10299            size_t nmatch, regmatch_t pmatch[], int eflags);
10300
10301       size_t pcre2_regerror(int errcode, const regex_t *preg,
10302            char *errbuf, size_t errbuf_size);
10303
10304       void pcre2_regfree(regex_t *preg);
10305
10306
10307DESCRIPTION
10308
10309       This  set of functions provides a POSIX-style API for the PCRE2 regular
10310       expression 8-bit library. There are no POSIX-style wrappers for PCRE2's
10311       16-bit and 32-bit libraries. See the pcre2api documentation for  a  de-
10312       scription  of  PCRE2's native API, which contains much additional func-
10313       tionality.
10314
10315       IMPORTANT NOTE: The functions described here are NOT  thread-safe,  and
10316       should  not  be used in multi-threaded applications. They are also lim-
10317       ited to processing subjects that are not bigger than 2GB. Use  the  na-
10318       tive API instead.
10319
10320       These  functions  are  wrapper functions that ultimately call the PCRE2
10321       native API. Their prototypes are defined  in  the  pcre2posix.h  header
10322       file, and they all have unique names starting with pcre2_. However, the
10323       pcre2posix.h  header  also  contains macro definitions that convert the
10324       standard POSIX names such as regcomp() into  pcre2_regcomp()  etc.  This
10325       means  that a program can use the usual POSIX names without running the
10326       risk of accidentally linking with POSIX functions from a different  li-
10327       brary.
10328
10329       On  Unix-like systems the PCRE2 POSIX library is called libpcre2-posix,
10330       so can be accessed by adding -lpcre2-posix to the command  for  linking
10331       an application. Because the POSIX functions call the native ones, it is
10332       also necessary to add -lpcre2-8.
10333
10334       On Windows systems, if you are linking to a DLL version of the library,
10335       it  is  recommended  that PCRE2POSIX_SHARED is defined before including
10336       the pcre2posix.h header, as it will allow for a more efficient  way  to
10337       invoke the functions by adding the __declspec(dllimport) decorator.
10338
10339       Although  they were not defined as prototypes in pcre2posix.h, releases
10340       10.33 to 10.36 of the library contained functions with the POSIX  names
10341       regcomp()  etc.  These simply passed their arguments to the PCRE2 func-
10342       tions. These functions were provided for backwards  compatibility  with
10343       earlier  versions  of  PCRE2, which had only POSIX names. However, this
10344       has proved troublesome in situations where a program links with several
10345       libraries, some of which use PCRE2's POSIX interface while  others  use
10346       the  real  POSIX functions.  For this reason, the POSIX names have been
10347       removed since release 10.37.
10348
10349       Calling the header file pcre2posix.h avoids  any  conflict  with  other
10350       POSIX  libraries.  It can, of course, be renamed or aliased as regex.h,
10351       which is the "correct" name, if there is  no  clash.  It  provides  two
10352       structure  types,  regex_t  for compiled internal forms, and regmatch_t
10353       for returning captured substrings. It also defines some constants whose
10354       names start with "REG_"; these are used for setting options and identi-
10355       fying error codes.
10356
10357
10358USING THE POSIX FUNCTIONS
10359
10360       Note that these functions are just POSIX-style wrappers for PCRE2's na-
10361       tive API.  They do not give POSIX  regular  expression  behaviour,  and
10362       they are not thread-safe or even POSIX compatible.
10363
10364       Those  POSIX  option bits that can reasonably be mapped to PCRE2 native
10365       options have been implemented. In addition, the option REG_EXTENDED  is
10366       defined  with  the  value  zero. This has no effect, but since programs
10367       that are written to the POSIX interface often use  it,  this  makes  it
10368       easier  to  slot in PCRE2 as a replacement library. Other POSIX options
10369       are not even defined.
10370
10371       There are also some options that are not defined by POSIX.  These  have
10372       been  added  at  the  request  of users who want to make use of certain
10373       PCRE2-specific features via the POSIX calling interface or to  add  BSD
10374       or GNU functionality.
10375
10376       When  PCRE2  is  called via these functions, it is only the API that is
10377       POSIX-like in style. The syntax and semantics of  the  regular  expres-
10378       sions  themselves  are  still  those of Perl, subject to the setting of
10379       various PCRE2 options, as described below. "POSIX-like in style"  means
10380       that  the  API  approximates  to  the POSIX definition; it is not fully
10381       POSIX-compatible, and in multi-unit encoding  domains  it  is  probably
10382       even less compatible.
10383
10384       The  descriptions  below use the actual names of the functions, but, as
10385       described above, the standard POSIX names (without the  pcre2_  prefix)
10386       may also be used.
10387
10388
10389COMPILING A PATTERN
10390
10391       The function pcre2_regcomp() is called to compile a pattern into an in-
10392       ternal  form. By default, the pattern is a C string terminated by a bi-
10393       nary zero (but see REG_PEND below). The preg argument is a pointer to a
10394       regex_t structure that is used as a base for storing information  about
10395       the  compiled  regular  expression.  It  is  also  used  for input when
10396       REG_PEND is set. The regex_t structure used by pcre2_regcomp()  is  de-
10397       fined  in  pcre2posix.h  and  is  not the same as the structure used by
10398       other libraries that provide POSIX-style matching.
10399
10400       The argument cflags is either zero, or contains one or more of the bits
10401       defined by the following macros:
10402
10403         REG_DOTALL
10404
10405       The PCRE2_DOTALL option is set when the regular  expression  is  passed
10406       for  compilation  to  the  native function. Note that REG_DOTALL is not
10407       part of the POSIX standard.
10408
10409         REG_ICASE
10410
10411       The PCRE2_CASELESS option is set when the regular expression is  passed
10412       for compilation to the native function.
10413
10414         REG_NEWLINE
10415
10416       The PCRE2_MULTILINE option is set when the regular expression is passed
10417       for  compilation  to the native function. Note that this does not mimic
10418       the defined POSIX behaviour for REG_NEWLINE  (see  the  following  sec-
10419       tion).
10420
10421         REG_NOSPEC
10422
10423       The  PCRE2_LITERAL  option is set when the regular expression is passed
10424       for compilation to the native function. This disables all meta  charac-
10425       ters  in the pattern, causing it to be treated as a literal string. The
10426       only other options that are  allowed  with  REG_NOSPEC  are  REG_ICASE,
10427       REG_NOSUB,  REG_PEND,  and REG_UTF. Note that REG_NOSPEC is not part of
10428       the POSIX standard.
10429
10430         REG_NOSUB
10431
10432       When  a  pattern  that  is  compiled  with  this  flag  is  passed   to
10433       pcre2_regexec()  for  matching, the nmatch and pmatch arguments are ig-
10434       nored, and no captured strings are returned. Versions of the PCRE2  li-
10435       brary  prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile op-
10436       tion, but this no longer happens because it disables the use  of  back-
10437       references.
10438
10439         REG_PEND
10440
10441       If this option is set, the re_endp field in the preg  structure  (which
10442       has the type const char *) must be set to point to the character beyond
10443       the end of the pattern before calling pcre2_regcomp(). The pattern  it-
10444       self  may  now  contain binary zeros, which are treated as data charac-
10445       ters. Without REG_PEND, a binary zero terminates the  pattern  and  the
10446       re_endp field is ignored. This is a GNU extension to the POSIX standard
10447       and  should be used with caution in software intended to be portable to
10448       other systems.
10449
10450         REG_UCP
10451
10452       The PCRE2_UCP option is set when the regular expression is  passed  for
10453       compilation  to  the  native function. This causes PCRE2 to use Unicode
10454       properties when matching \d, \w,  etc.,  instead  of  just  recognizing
10455       ASCII values. Note that REG_UCP is not part of the POSIX standard.
10456
10457         REG_UNGREEDY
10458
10459       The  PCRE2_UNGREEDY option is set when the regular expression is passed
10460       for compilation to the native function. Note that REG_UNGREEDY  is  not
10461       part of the POSIX standard.
10462
10463         REG_UTF
10464
10465       The  PCRE2_UTF  option is set when the regular expression is passed for
10466       compilation to the native function. This causes the pattern itself  and
10467       all  data  strings used for matching it to be treated as UTF-8 strings.
10468       Note that REG_UTF is not part of the POSIX standard.
10469
10470       In the absence of these flags, no options  are  passed  to  the  native
10471       function.  This means that the regex is compiled with PCRE2 default se-
10472       mantics.  In  particular,  the way it handles newline characters in the
10473       subject string is the Perl way, not the POSIX way.  Note  that  setting
10474       PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
10475       It  does not affect the way newlines are matched by the dot metacharac-
10476       ter (they are not) or by a negative class such as [^a] (they are).
10477
10478       The yield of pcre2_regcomp() is zero on success,  and  non-zero  other-
10479       wise.  The preg structure is filled in on success, and one other member
10480       of  the  structure (as well as re_endp) is public: re_nsub contains the
10481       number of capturing subpatterns in the regular expression. Various  er-
10482       ror codes are defined in the header file.
10483
10484       NOTE: If the yield of pcre2_regcomp() is non-zero, you must not attempt
10485       to use the contents of the preg structure. If, for example, you pass it
10486       to  pcre2_regexec(), the result is undefined and your program is likely
10487       to crash.
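
       The  following  is  a  minimal sketch of the compile step with its error
       handling, written against the standard POSIX names. So that  it  builds
       without  PCRE2  installed it includes the system <regex.h>; it is an as-
       sumption of this sketch that replacing  that  include  with  pcre2posix.h
       and  linking  with  -lpcre2-posix  and -lpcre2-8 routes the same calls to
       pcre2_regcomp() etc. The function name compile_and_count() is  hypothet-
       ical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Compile pattern and report the number of capturing subpatterns.
   Returns 0 and sets *nsub on success. On failure, returns the non-zero
   error code and, if errbuf is supplied, writes a message into it. */
int compile_and_count(const char *pattern, size_t *nsub,
                      char *errbuf, size_t errbuf_size)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && nsub != NULL);
    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0) {
        /* Compilation failed: the contents of re are not used again.
           A NULL preg is passed to regerror() for the same reason. */
        if (errbuf != NULL && errbuf_size > 0)
            regerror(rc, NULL, errbuf, errbuf_size);
        return rc;
    }
    *nsub = re.re_nsub;             /* public member filled in on success */
    regfree(&re);
    return 0;
}
```

       The  sketch reads re_nsub and calls regfree() only on the success path,
       in line with the note above about a non-zero yield.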
10488
10489
10490MATCHING NEWLINE CHARACTERS
10491
10492       This area is not simple, because POSIX and Perl take different views of
10493       things.  It is not possible to get PCRE2 to obey POSIX  semantics,  but
10494       then PCRE2 was never intended to be a POSIX engine. The following table
10495       lists  the  different  possibilities for matching newline characters in
10496       Perl and PCRE2:
10497
10498                                 Default   Change with
10499
10500         . matches newline          no     PCRE2_DOTALL
10501         newline matches [^a]       yes    not changeable
10502         $ matches \n at end        yes    PCRE2_DOLLAR_ENDONLY
10503         $ matches \n in middle     no     PCRE2_MULTILINE
10504         ^ matches \n in middle     no     PCRE2_MULTILINE
10505
10506       This is the equivalent table for a POSIX-compatible pattern matcher:
10507
10508                                 Default   Change with
10509
10510         . matches newline          yes    REG_NEWLINE
10511         newline matches [^a]       yes    REG_NEWLINE
10512         $ matches \n at end        no     REG_NEWLINE
10513         $ matches \n in middle     no     REG_NEWLINE
10514         ^ matches \n in middle     no     REG_NEWLINE
10515
10516       This behaviour is not what happens when PCRE2 is called via  its  POSIX
10517       API.  By  default, PCRE2's behaviour is the same as Perl's, except that
10518       there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both  PCRE2
10519       and Perl, there is no way to stop newline from matching [^a].
10520
10521       Default  POSIX newline handling can be obtained by setting PCRE2_DOTALL
10522       and PCRE2_DOLLAR_ENDONLY when  calling  pcre2_compile()  directly,  but
10523       there is no way to make PCRE2 behave exactly as for the REG_NEWLINE ac-
10524       tion.  When  using  the  POSIX  API,  passing  REG_NEWLINE  to  PCRE2's
10525       pcre2_regcomp()  function  causes  PCRE2_MULTILINE  to  be  passed   to
10526       pcre2_compile(), and REG_DOTALL passes PCRE2_DOTALL. There is no way to
10527       pass PCRE2_DOLLAR_ENDONLY.
10528
10529
10530MATCHING A PATTERN
10531
10532       The function pcre2_regexec() is called to match a compiled pattern preg
10533       against  a  given string, which is by default terminated by a zero byte
10534       (but see REG_STARTEND below), subject to the options in eflags.   These
10535       can be:
10536
10537         REG_NOTBOL
10538
10539       The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match-
10540       ing function.
10541
10542         REG_NOTEMPTY
10543
10544       The  PCRE2_NOTEMPTY  option  is  set  when calling the underlying PCRE2
10545       matching function. Note that REG_NOTEMPTY is  not  part  of  the  POSIX
10546       standard.  However, setting this option can give more POSIX-like behav-
10547       iour in some situations.
10548
10549         REG_NOTEOL
10550
10551       The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match-
10552       ing function.
10553
10554         REG_STARTEND
10555
10556       When this option  is  set,  the  subject  string  starts  at  string  +
10557       pmatch[0].rm_so  and  ends  at  string  + pmatch[0].rm_eo, which should
10558       point to the first character beyond the string. There may be binary ze-
10559       ros within the subject string, and indeed, using  REG_STARTEND  is  the
10560       only way to pass a subject string that contains a binary zero.
10561
10562       Whatever  the  value  of  pmatch[0].rm_so,  the  offsets of the matched
10563       string and any captured substrings are  still  given  relative  to  the
10564       start  of  string  itself. (Before PCRE2 release 10.30 these were given
10565       relative to string + pmatch[0].rm_so, but this differs from  other  im-
10566       plementations.)
10567
10568       This  is  a  BSD  extension,  compatible with but not specified by IEEE
10569       Standard 1003.2 (POSIX.2), and should be used with caution in  software
10570       intended  to  be  portable to other systems. Note that a non-zero rm_so
10571       does not imply REG_NOTBOL; REG_STARTEND affects only the  location  and
       length of the string, not how it is matched. Setting REG_STARTEND and
       passing pmatch as NULL are mutually exclusive; if both are done,
       pcre2_regexec() returns the error REG_INVARG.
10575
       If the pattern was compiled with the REG_NOSUB flag, no data about any
       matched strings is returned. The nmatch and pmatch arguments of
       pcre2_regexec() are ignored (except possibly as input for
       REG_STARTEND).
10580
       The value of nmatch may be zero, and the value of pmatch may be NULL
       (unless REG_STARTEND is set); in both these cases no data about any
       matched strings is returned.
10584
10585       Otherwise, the portion of the string that was  matched,  and  also  any
10586       captured substrings, are returned via the pmatch argument, which points
10587       to  an  array  of  nmatch structures of type regmatch_t, containing the
10588       members rm_so and rm_eo. These contain the byte  offset  to  the  first
10589       character of each substring and the offset to the first character after
10590       the  end of each substring, respectively. The 0th element of the vector
10591       relates to the entire portion of string that  was  matched;  subsequent
10592       elements relate to the capturing subpatterns of the regular expression.
10593       Unused entries in the array have both structure members set to -1.
10594
10595       regmatch_t  as  well  as  the  regoff_t  typedef it uses are defined in
10596       pcre2posix.h and are not warranted to have the same size or  layout  as
10597       other  similarly  named  types from other libraries that provide POSIX-
10598       style matching.
10599
10600       A successful match yields a zero return; various error  codes  are  de-
10601       fined  in the header file, of which REG_NOMATCH is the "expected" fail-
10602       ure code.
10603
10604
10605ERROR MESSAGES
10606
10607       The pcre2_regerror() function maps a  non-zero  errorcode  from  either
10608       pcre2_regcomp()  or  pcre2_regexec() to a printable message. If preg is
10609       not NULL, the error should have arisen from the use of that  structure.
10610       A  message  terminated  by  a  binary  zero is placed in errbuf. If the
10611       buffer is too short, only the first errbuf_size - 1 characters  of  the
10612       error message are used. The yield of the function is the size of buffer
10613       needed  to hold the whole message, including the terminating zero. This
10614       value is greater than errbuf_size if the message was truncated.
10615
10616
10617MEMORY USAGE
10618
10619       Compiling a regular expression causes memory to be allocated and  asso-
10620       ciated  with the preg structure. The function pcre2_regfree() frees all
10621       such memory, after which preg may no longer be used as a  compiled  ex-
10622       pression.
10623
10624
10625AUTHOR
10626
10627       Philip Hazel
10628       Retired from University Computing Service
10629       Cambridge, England.
10630
10631
10632REVISION
10633
10634       Last updated: 19 January 2024
10635       Copyright (c) 1997-2024 University of Cambridge.
10636
10637
10638PCRE2 10.43                     19 January 2024                  PCRE2POSIX(3)
10639------------------------------------------------------------------------------
10640
10641
10642
10643PCRE2SAMPLE(3)             Library Functions Manual             PCRE2SAMPLE(3)
10644
10645
10646NAME
10647       PCRE2 - Perl-compatible regular expressions (revised API)
10648
10649
10650PCRE2 SAMPLE PROGRAM
10651
10652       A  simple, complete demonstration program to get you started with using
10653       PCRE2 is supplied in the file pcre2demo.c in the src directory  in  the
10654       PCRE2 distribution. A listing of this program is given in the pcre2demo
10655       documentation. If you do not have a copy of the PCRE2 distribution, you
10656       can save this listing to re-create the contents of pcre2demo.c.
10657
10658       The  demonstration  program compiles the regular expression that is its
10659       first argument, and matches it against the subject string in its second
10660       argument. No PCRE2 options are set, and default  character  tables  are
10661       used. If matching succeeds, the program outputs the portion of the sub-
10662       ject  that  matched,  together  with  the contents of any captured sub-
10663       strings.
10664
10665       If the -g option is given on the command line, the program then goes on
10666       to check for further matches of the same regular expression in the same
10667       subject string. The logic is a little bit tricky because of the  possi-
10668       bility  of  matching an empty string. Comments in the code explain what
10669       is going on.
10670
10671       The code in pcre2demo.c is an 8-bit program that uses the  PCRE2  8-bit
10672       library.  It  handles  strings  and characters that are stored in 8-bit
10673       code units.  By default, one character corresponds to  one  code  unit,
10674       but  if  the  pattern starts with "(*UTF)", both it and the subject are
10675       treated as UTF-8 strings, where characters  may  occupy  multiple  code
10676       units.
10677
10678       If  PCRE2  is installed in the standard include and library directories
10679       for your operating system, you should be able to compile the demonstra-
10680       tion program using a command like this:
10681
10682         cc -o pcre2demo pcre2demo.c -lpcre2-8
10683
10684       If PCRE2 is installed elsewhere, you may need to add additional options
10685       to the command line. For example, on a Unix-like system that has  PCRE2
10686       installed  in /usr/local, you can compile the demonstration program us-
10687       ing a command like this:
10688
10689         cc -o pcre2demo -I/usr/local/include pcre2demo.c \
10690            -L/usr/local/lib -lpcre2-8
10691
10692       Once you have built the demonstration program, you can run simple tests
10693       like this:
10694
10695         ./pcre2demo 'cat|dog' 'the cat sat on the mat'
10696         ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
10697
10698       Note that there is a  much  more  comprehensive  test  program,  called
10699       pcre2test,  which supports many more facilities for testing regular ex-
10700       pressions using all three PCRE2 libraries (8-bit, 16-bit,  and  32-bit,
10701       though  not all three need be installed). The pcre2demo program is pro-
10702       vided as a relatively simple coding example.
10703
10704       If you try to run pcre2demo when PCRE2 is not installed in the standard
10705       library directory, you may get an error like  this  on  some  operating
10706       systems (e.g. Solaris):
10707
         ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file or directory
10710
10711       This  is  caused  by the way shared library support works on those sys-
10712       tems. You need to add
10713
10714         -R/usr/local/lib
10715
10716       (for example) to the compile command to get round this problem.
10717
10718
10719AUTHOR
10720
10721       Philip Hazel
10722       Retired from University Computing Service
10723       Cambridge, England.
10724
10725
10726REVISION
10727
10728       Last updated: 02 February 2016
10729       Copyright (c) 1997-2016 University of Cambridge.
10730
10731
10732PCRE2 10.22                    02 February 2016                 PCRE2SAMPLE(3)
10733------------------------------------------------------------------------------
10734
10735PCRE2SERIALIZE(3)          Library Functions Manual          PCRE2SERIALIZE(3)
10736
10737
10738NAME
10739       PCRE2 - Perl-compatible regular expressions (revised API)
10740
10741
10742SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS
10743
10744       int32_t pcre2_serialize_decode(pcre2_code **codes,
10745         int32_t number_of_codes, const uint8_t *bytes,
10746         pcre2_general_context *gcontext);
10747
10748       int32_t pcre2_serialize_encode(const pcre2_code **codes,
10749         int32_t number_of_codes, uint8_t **serialized_bytes,
10750         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);
10751
10752       void pcre2_serialize_free(uint8_t *bytes);
10753
10754       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);
10755
10756       If  you  are running an application that uses a large number of regular
10757       expression patterns, it may be useful to store them  in  a  precompiled
10758       form  instead  of  having to compile them every time the application is
10759       run. However, if you are using the just-in-time  optimization  feature,
10760       it is not possible to save and reload the JIT data, because it is posi-
10761       tion-dependent.  The  host  on  which the patterns are reloaded must be
10762       running the same version of PCRE2, with the same code unit  width,  and
10763       must  also have the same endianness, pointer width and PCRE2_SIZE type.
10764       For example, patterns compiled on a 32-bit system using PCRE2's  16-bit
10765       library cannot be reloaded on a 64-bit system, nor can they be reloaded
10766       using the 8-bit library.
10767
10768       Note  that  "serialization" in PCRE2 does not convert compiled patterns
10769       to an abstract format like Java or .NET serialization.  The  serialized
10770       output  is really just a bytecode dump, which is why it can only be re-
10771       loaded in the same environment as the one that created  it.  Hence  the
10772       restrictions  mentioned  above.   Applications  that are not statically
10773       linked with a fixed version of PCRE2 must be prepared to recompile pat-
10774       terns from their sources, in order to be immune to PCRE2 upgrades.
10775
10776
10777SECURITY CONCERNS
10778
10779       The facility for saving and restoring compiled patterns is intended for
10780       use within individual applications.  As  such,  the  data  supplied  to
10781       pcre2_serialize_decode()  is expected to be trusted data, not data from
10782       arbitrary external sources.  There  is  only  some  simple  consistency
10783       checking, not complete validation of what is being re-loaded. Corrupted
10784       data may cause undefined results. For example, if the length field of a
10785       pattern in the serialized data is corrupted, the deserializing code may
10786       read beyond the end of the byte stream that is passed to it.
10787
10788
10789SAVING COMPILED PATTERNS
10790
10791       Before compiled patterns can be saved they must be serialized, which in
10792       PCRE2  means converting the pattern to a stream of bytes. A single byte
10793       stream may contain any number of compiled patterns, but they  must  all
10794       use  the same character tables. A single copy of the tables is included
10795       in the byte stream (its size is 1088 bytes). For more details of  char-
10796       acter  tables,  see the section on locale support in the pcre2api docu-
10797       mentation.
10798
10799       The function pcre2_serialize_encode() creates a serialized byte  stream
10800       from  a  list of compiled patterns. Its first two arguments specify the
10801       list, being a pointer to a vector of pointers to compiled patterns, and
10802       the length of the vector. The third and fourth arguments point to vari-
10803       ables which are set to point to the created byte stream and its length,
10804       respectively. The final argument is a pointer  to  a  general  context,
10805       which  can  be  used  to specify custom memory management functions. If
10806       this argument is NULL, malloc() is used to obtain memory for  the  byte
10807       stream. The yield of the function is the number of serialized patterns,
10808       or one of the following negative error codes:
10809
10810         PCRE2_ERROR_BADDATA      the number of patterns is zero or less
10811         PCRE2_ERROR_BADMAGIC     mismatch of id bytes in one of the patterns
10812         PCRE2_ERROR_NOMEMORY     memory allocation failed
10813         PCRE2_ERROR_MIXEDTABLES  the patterns do not all use the same tables
10814         PCRE2_ERROR_NULL         the 1st, 3rd, or 4th argument is NULL
10815
10816       PCRE2_ERROR_BADMAGIC  means  either that a pattern's code has been cor-
10817       rupted, or that a slot in the vector does not point to a compiled  pat-
10818       tern.
10819
10820       Once a set of patterns has been serialized you can save the data in any
10821       appropriate  manner. Here is sample code that compiles two patterns and
10822       writes them to a file. It assumes that the variable fd refers to a file
10823       that is open for output. The error checking that should be present in a
10824       real application has been omitted for simplicity.
10825
         int errorcode;
         int32_t count;
         size_t written;
         uint8_t *bytes;
         PCRE2_SIZE erroroffset;
         PCRE2_SIZE bytescount;
         pcre2_code *list_of_codes[2];
         /* Patterns are passed as PCRE2_SPTR, hence the casts. */
         list_of_codes[0] = pcre2_compile((PCRE2_SPTR)"first pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
         list_of_codes[1] = pcre2_compile((PCRE2_SPTR)"second pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
         /* The yield is the number of serialized patterns, or negative. */
         count = pcre2_serialize_encode((const pcre2_code **)list_of_codes,
           2, &bytes, &bytescount, NULL);
         /* fd is a FILE * that is open for output. */
         written = fwrite(bytes, 1, bytescount, fd);
10838
10839       Note that the serialized data is binary data that may  contain  any  of
10840       the  256  possible  byte values. On systems that make a distinction be-
10841       tween binary and non-binary data, be sure that the file is  opened  for
10842       binary output.
10843
       Serializing a set of patterns leaves the original patterns untouched,
       so they can still be used for matching. Their memory must eventually
       be freed in the usual way by calling pcre2_code_free(). When you have
       finished with the byte stream, it too must be freed by calling
       pcre2_serialize_free(). If this function is called with a NULL
       argument, it returns immediately without doing anything.
10850
10851
10852RE-USING PRECOMPILED PATTERNS
10853
10854       In order to re-use a set of saved patterns you must first make the  se-
10855       rialized  byte stream available in main memory (for example, by reading
10856       from a file). The management of this memory block is up to the applica-
10857       tion. You can use the pcre2_serialize_get_number_of_codes() function to
10858       find out how many compiled patterns are in the serialized data  without
10859       actually decoding the patterns:
10860
10861         uint8_t *bytes = <serialized data>;
10862         int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes);
10863
10864       The pcre2_serialize_decode() function reads a byte stream and recreates
10865       the compiled patterns in new memory blocks, setting pointers to them in
10866       a  vector.  The  first two arguments are a pointer to a suitable vector
10867       and its length, and the third argument points to a byte stream. The fi-
10868       nal argument is a pointer to a general context, which can  be  used  to
10869       specify custom memory management functions for the decoded patterns. If
10870       this argument is NULL, malloc() and free() are used. After deserializa-
10871       tion, the byte stream is no longer needed and can be discarded.
10872
10873         pcre2_code *list_of_codes[2];
10874         uint8_t *bytes = <serialized data>;
10875         int32_t number_of_codes =
10876           pcre2_serialize_decode(list_of_codes, 2, bytes, NULL);
10877
10878       If  the  vector  is  not  large enough for all the patterns in the byte
10879       stream, it is filled with those that fit, and  the  remainder  are  ig-
10880       nored.  The yield of the function is the number of decoded patterns, or
10881       one of the following negative error codes:
10882
10883         PCRE2_ERROR_BADDATA    second argument is zero or less
10884         PCRE2_ERROR_BADMAGIC   mismatch of id bytes in the data
10885         PCRE2_ERROR_BADMODE    mismatch of code unit size or PCRE2 version
10886         PCRE2_ERROR_BADSERIALIZEDDATA  other sanity check failure
         PCRE2_ERROR_NOMEMORY   memory allocation failed
10888         PCRE2_ERROR_NULL       first or third argument is NULL
10889
10890       PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it  was
10891       compiled on a system with different endianness.
10892
10893       Decoded patterns can be used for matching in the usual way, and must be
10894       freed  by  calling pcre2_code_free(). However, be aware that there is a
10895       potential race issue if you are using multiple patterns that  were  de-
10896       coded  from a single byte stream in a multithreaded application. A sin-
10897       gle copy of the character tables is used by all  the  decoded  patterns
10898       and a reference count is used to arrange for its memory to be automati-
10899       cally  freed when the last pattern is freed, but there is no locking on
10900       this reference count. Therefore, if you want to call  pcre2_code_free()
10901       for  these  patterns  in  different  threads, you must arrange your own
10902       locking, and ensure that pcre2_code_free()  cannot  be  called  by  two
10903       threads at the same time.
10904
10905       If  a pattern was processed by pcre2_jit_compile() before being serial-
10906       ized, the JIT data is discarded and so is no longer available  after  a
10907       save/restore  cycle.  You can, however, process a restored pattern with
10908       pcre2_jit_compile() if you wish.
10909
10910
10911AUTHOR
10912
10913       Philip Hazel
10914       Retired from University Computing Service
10915       Cambridge, England.
10916
10917
10918REVISION
10919
10920       Last updated: 27 June 2018
10921       Copyright (c) 1997-2018 University of Cambridge.
10922
10923
10924PCRE2 10.32                      27 June 2018                PCRE2SERIALIZE(3)
10925------------------------------------------------------------------------------
10926
10927
10928
10929PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)
10930
10931
10932NAME
10933       PCRE2 - Perl-compatible regular expressions (revised API)
10934
10935
10936PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY
10937
10938       The  full syntax and semantics of the regular expressions that are sup-
10939       ported by PCRE2 are described in the pcre2pattern  documentation.  This
10940       document contains a quick-reference summary of the syntax.
10941
10942
10943QUOTING
10944
10945         \x         where x is non-alphanumeric is a literal x
10946         \Q...\E    treat enclosed characters as literal
10947
10948       Note that white space inside \Q...\E is always treated as literal, even
10949       if PCRE2_EXTENDED is set, causing most other white space to be ignored.
10950
10951
10952BRACED ITEMS
10953
10954       With  one  exception, wherever brace characters { and } are required to
10955       enclose data for constructions such as \g{2} or \k{name}, space  and/or
10956       horizontal  tab  characters  that follow { or precede } are allowed and
10957       are ignored. In the case of quantifiers, they may also appear before or
10958       after the comma. The exception is \u{...} which is not  Perl-compatible
10959       and is recognized only when PCRE2_EXTRA_ALT_BSUX is set. This is an EC-
10960       MAScript compatibility feature, and follows ECMAScript's behaviour.
10961
10962
10963ESCAPED CHARACTERS
10964
10965       This  table  applies to ASCII and Unicode environments. An unrecognized
10966       escape sequence causes an error.
10967
10968         \a         alarm, that is, the BEL character (hex 07)
10969         \cx        "control-x", where x is a non-control ASCII character
10970         \e         escape (hex 1B)
10971         \f         form feed (hex 0C)
10972         \n         newline (hex 0A)
10973         \r         carriage return (hex 0D)
10974         \t         tab (hex 09)
10975         \0dd       character with octal code 0dd
10976         \ddd       character with octal code ddd, or backreference
10977         \o{ddd..}  character with octal code ddd..
10978         \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
10979         \xhh       character with hex code hh
10980         \x{hh..}   character with hex code hh..
10981
10982       If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
10983       following are also recognized:
10984
10985         \U         the character "U"
10986         \uhhhh     character with hex code hhhh
10987         \u{hh..}   character with hex code hh.. but only for EXTRA_ALT_BSUX
10988
10989       When \x is not followed by {, from zero to two hexadecimal  digits  are
10990       read,  but in ALT_BSUX mode \x must be followed by two hexadecimal dig-
10991       its to be recognized as a hexadecimal escape; otherwise  it  matches  a
10992       literal  "x".   Likewise,  if  \u (in ALT_BSUX mode) is not followed by
10993       four hexadecimal digits or (in EXTRA_ALT_BSUX mode) a sequence  of  hex
10994       digits in curly brackets, it matches a literal "u".
10995
10996       Note that \0dd is always an octal code. The treatment of backslash fol-
10997       lowed  by  a non-zero digit is complicated; for details see the section
10998       "Non-printing characters" in the pcre2pattern documentation, where  de-
10999       tails  of  escape  processing  in  EBCDIC  environments are also given.
11000       \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in
11001       EBCDIC environments. Note that \N not  followed  by  an  opening  curly
11002       bracket has a different meaning (see below).
11003
11004
11005CHARACTER TYPES
11006
11007         .          any character except newline;
11008                      in dotall mode, any character whatsoever
11009         \C         one code unit, even in UTF mode (best avoided)
11010         \d         a decimal digit
11011         \D         a character that is not a decimal digit
11012         \h         a horizontal white space character
11013         \H         a character that is not a horizontal white space character
11014         \N         a character that is not a newline
11015         \p{xx}     a character with the xx property
11016         \P{xx}     a character without the xx property
11017         \R         a newline sequence
11018         \s         a white space character
11019         \S         a character that is not a white space character
11020         \v         a vertical white space character
11021         \V         a character that is not a vertical white space character
11022         \w         a "word" character
11023         \W         a "non-word" character
11024         \X         a Unicode extended grapheme cluster
11025
11026       \C  is dangerous because it may leave the current matching point in the
11027       middle of a UTF-8 or UTF-16 character. The application can lock out the
11028       use of \C by setting the PCRE2_NEVER_BACKSLASH_C  option.  It  is  also
11029       possible to build PCRE2 with the use of \C permanently disabled.
11030
11031       By  default,  \d, \s, and \w match only ASCII characters, even in UTF-8
11032       mode or in the 16-bit and 32-bit libraries. However, if locale-specific
11033       matching is happening, \s and \w may also match  characters  with  code
11034       points in the range 128-255. If the PCRE2_UCP option is set, the behav-
11035       iour of these escape sequences is changed to use Unicode properties and
11036       they  match  many  more  characters, but there are some option settings
11037       that can restrict individual sequences to matching only  ASCII  charac-
11038       ters.
11039
11040       Property descriptions in \p and \P are matched caselessly; hyphens, un-
11041       derscores,  and  white  space are ignored, in accordance with Unicode's
11042       "loose matching" rules.
11043
11044
GENERAL CATEGORY PROPERTIES FOR \p AND \P
11046
11047         C          Other
11048         Cc         Control
11049         Cf         Format
11050         Cn         Unassigned
11051         Co         Private use
11052         Cs         Surrogate
11053
11054         L          Letter
11055         Ll         Lower case letter
11056         Lm         Modifier letter
11057         Lo         Other letter
11058         Lt         Title case letter
11059         Lu         Upper case letter
11060         Lc         Ll, Lu, or Lt
11061         L&         Ll, Lu, or Lt
11062
11063         M          Mark
11064         Mc         Spacing mark
11065         Me         Enclosing mark
11066         Mn         Non-spacing mark
11067
11068         N          Number
11069         Nd         Decimal number
11070         Nl         Letter number
11071         No         Other number
11072
11073         P          Punctuation
11074         Pc         Connector punctuation
11075         Pd         Dash punctuation
11076         Pe         Close punctuation
11077         Pf         Final punctuation
11078         Pi         Initial punctuation
11079         Po         Other punctuation
11080         Ps         Open punctuation
11081
11082         S          Symbol
11083         Sc         Currency symbol
11084         Sk         Modifier symbol
11085         Sm         Mathematical symbol
11086         So         Other symbol
11087
11088         Z          Separator
11089         Zl         Line separator
11090         Zp         Paragraph separator
11091         Zs         Space separator
11092
11093
PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p AND \P
11095
11096         Xan        Alphanumeric: union of properties L and N
11097         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
11098         Xsp        Perl space: property Z or tab, NL, VT, FF, CR
11099         Xuc        Universally-named character: one that can be
11100                      represented by a Universal Character Name
11101         Xwd        Perl word: property Xan or underscore
11102
11103       Perl and POSIX space are now the same. Perl added VT to its space char-
11104       acter set at release 5.18.
11105
11106
11107BINARY PROPERTIES FOR \p AND \P
11108
11109       Unicode defines a number of  binary  properties,  that  is,  properties
11110       whose  only  values  are  true or false. You can obtain a list of those
11111       that are recognized by \p and \P, along with  their  abbreviations,  by
11112       running this command:
11113
11114         pcre2test -LP
11115
11116
11117SCRIPT MATCHING WITH \p AND \P
11118
11119       Many  script  names  and their 4-letter abbreviations are recognized in
11120       \p{sc:...} or \p{scx:...} items, or on their own with \p (and  also  \P
11121       of course). You can obtain a list of these scripts by running this com-
11122       mand:
11123
11124         pcre2test -LS
11125
11126
11127THE BIDI_CLASS PROPERTY FOR \p AND \P
11128
11129         \p{Bidi_Class:<class>}   matches a character with the given class
11130         \p{BC:<class>}           matches a character with the given class
11131
11132       The recognized classes are:
11133
11134         AL          Arabic letter
11135         AN          Arabic number
11136         B           paragraph separator
11137         BN          boundary neutral
11138         CS          common separator
11139         EN          European number
11140         ES          European separator
11141         ET          European terminator
11142         FSI         first strong isolate
11143         L           left-to-right
11144         LRE         left-to-right embedding
11145         LRI         left-to-right isolate
11146         LRO         left-to-right override
11147         NSM         non-spacing mark
11148         ON          other neutral
11149         PDF         pop directional format
11150         PDI         pop directional isolate
11151         R           right-to-left
11152         RLE         right-to-left embedding
11153         RLI         right-to-left isolate
11154         RLO         right-to-left override
11155         S           segment separator
         WS          white space
11157
11158
11159CHARACTER CLASSES
11160
11161         [...]       positive character class
11162         [^...]      negative character class
11163         [x-y]       range (can be used for hex characters)
11164         [[:xxx:]]   positive POSIX named set
11165         [[:^xxx:]]  negative POSIX named set
11166
11167         alnum       alphanumeric
11168         alpha       alphabetic
11169         ascii       0-127
11170         blank       space or tab
11171         cntrl       control character
11172         digit       decimal digit
11173         graph       printing, excluding space
11174         lower       lower case letter
11175         print       printing, including space
11176         punct       printing, excluding alphanumeric
11177         space       white space
11178         upper       upper case letter
11179         word        same as \w
11180         xdigit      hexadecimal digit
11181
11182       In  PCRE2, POSIX character set names recognize only ASCII characters by
11183       default, but some of them use Unicode properties if PCRE2_UCP  is  set.
11184       You can use \Q...\E inside a character class.
11185
11186
11187QUANTIFIERS
11188
11189         ?           0 or 1, greedy
11190         ?+          0 or 1, possessive
11191         ??          0 or 1, lazy
11192         *           0 or more, greedy
11193         *+          0 or more, possessive
11194         *?          0 or more, lazy
11195         +           1 or more, greedy
11196         ++          1 or more, possessive
11197         +?          1 or more, lazy
11198         {n}         exactly n
11199         {n,m}       at least n, no more than m, greedy
11200         {n,m}+      at least n, no more than m, possessive
11201         {n,m}?      at least n, no more than m, lazy
11202         {n,}        n or more, greedy
11203         {n,}+       n or more, possessive
11204         {n,}?       n or more, lazy
11205         {,m}        zero up to m, greedy
11206         {,m}+       zero up to m, possessive
11207         {,m}?       zero up to m, lazy
11208
11209
11210ANCHORS AND SIMPLE ASSERTIONS
11211
11212         \b          word boundary
11213         \B          not a word boundary
11214         ^           start of subject
11215                       also after an internal newline in multiline mode
11216                       (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
11217         \A          start of subject
11218         $           end of subject
11219                       also before newline at end of subject
11220                       also before internal newline in multiline mode
11221         \Z          end of subject
11222                       also before newline at end of subject
11223         \z          end of subject
11224         \G          first matching position in subject
11225
11226
11227REPORTED MATCH POINT SETTING
11228
11229         \K          set reported start of match
11230
11231       From  release 10.38 \K is not permitted by default in lookaround asser-
11232       tions, for compatibility with Perl.  However,  if  the  PCRE2_EXTRA_AL-
11233       LOW_LOOKAROUND_BSK option is set, the previous behaviour is re-enabled.
11234       When this option is set, \K is honoured in positive assertions, but ig-
11235       nored in negative ones.
11236
11237
11238ALTERNATION
11239
11240         expr|expr|expr...
11241
11242
11243CAPTURING
11244
11245         (...)           capture group
11246         (?<name>...)    named capture group (Perl)
11247         (?'name'...)    named capture group (Perl)
11248         (?P<name>...)   named capture group (Python)
11249         (?:...)         non-capture group
11250         (?|...)         non-capture group; reset group numbers for
11251                          capture groups in each alternative
11252
11253       In  non-UTF  modes, names may contain underscores and ASCII letters and
11254       digits; in UTF modes, any Unicode letters and  Unicode  decimal  digits
11255       are permitted. In both cases, a name must not start with a digit.
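
       The non-UTF naming rule can be sketched as a small check in C. This  is
       a  hypothetical helper, not part of the PCRE2 API, and it assumes that
       an empty name is rejected; any length limit that PCRE2 applies is  not
       modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE2 function): checks a capture group
   name against the non-UTF rule: ASCII letters, digits and underscores
   only, and the first character must not be a digit. Rejecting the
   empty name is an assumption of this sketch. */
static bool is_valid_group_name_non_utf(const char *name)
{
    assert(name != NULL);
    if (name[0] == '\0') return false;
    if (name[0] >= '0' && name[0] <= '9') return false;
    for (size_t i = 0; name[i] != '\0'; i++) {
        char c = name[i];
        bool ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                  (c >= '0' && c <= '9') || c == '_';
        if (!ok) return false;
    }
    return true;
}
```

       With this sketch, "year_1" and "_x" are accepted, while "1st" and "a-b"
       are rejected.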
11256
11257
11258ATOMIC GROUPS
11259
11260         (?>...)         atomic non-capture group
11261         (*atomic:...)   atomic non-capture group
11262
11263
11264COMMENT
11265
11266         (?#....)        comment (not nestable)
11267
11268
11269OPTION SETTING
11270       Changes  of these options within a group are automatically cancelled at
11271       the end of the group.
11272
11273         (?a)            all ASCII options
11274         (?aD)           restrict \d to ASCII in UCP mode
11275         (?aS)           restrict \s to ASCII in UCP mode
11276         (?aW)           restrict \w to ASCII in UCP mode
11277         (?aP)           restrict all POSIX classes to ASCII in UCP mode
11278         (?aT)           restrict POSIX digit classes to ASCII in UCP mode
11279         (?i)            caseless
11280         (?J)            allow duplicate named groups
11281         (?m)            multiline
11282         (?n)            no auto capture
11283         (?r)            restrict caseless to either ASCII or non-ASCII
11284         (?s)            single line (dotall)
11285         (?U)            default ungreedy (lazy)
11286         (?x)            ignore white space except in classes or \Q...\E
11287         (?xx)           as (?x) but also ignore space and tab in classes
11288         (?-...)         unset the given option(s)
11289         (?^)            unset imnrsx options
11290
11291       (?aP) implies (?aT) as well, though this has no additional effect. How-
11292       ever, it means that (?-aP) is really (?-PT) which  disables  all  ASCII
11293       restrictions for POSIX classes.
11294
11295       Unsetting  x or xx unsets both. Several options may be set at once, and
11296       a mixture of setting and unsetting such as (?i-x) is allowed, but there
11297       may be only one hyphen. Setting (but not unsetting) is allowed after (?^,
11298       for example (?^in). An option setting may appear at the start of a non-
11299       capture group, for example (?i:...).
11300
11301       The following are recognized only at the very start of a pattern or af-
11302       ter one of the newline or \R options with similar syntax. More than one
11303       of them may appear. For the first three, d is a decimal number.
11304
11305         (*LIMIT_DEPTH=d) set the backtracking limit to d
11306         (*LIMIT_HEAP=d)  set the heap size limit to d * 1024 bytes
11307         (*LIMIT_MATCH=d) set the match limit to d
11308         (*NOTEMPTY)      set PCRE2_NOTEMPTY when matching
11309         (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
11310         (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
11311         (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
11312         (*NO_JIT)       disable JIT optimization
11313         (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
11314         (*UTF)          set appropriate UTF mode for the library in use
11315         (*UCP)          set PCRE2_UCP (use Unicode properties for \d etc)
11316
11317       Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce  the
11318       value   of   the   limits   set  by  the  caller  of  pcre2_match()  or
11319       pcre2_dfa_match(), not increase them. LIMIT_RECURSION  is  an  obsolete
11320       synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
11321       and  (*UCP)  by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
11322       respectively, at compile time.
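
       The "reduce only" rule amounts to taking the smaller of the two values.
       The following C sketch illustrates that rule  and  the  heap-limit  unit
       conversion;  both  function  names are hypothetical and do not exist in
       the PCRE2 API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: combines a limit set by the caller with one
   requested inside the pattern, for example by (*LIMIT_MATCH=d). The
   pattern can lower the limit but never raise it. */
static uint32_t effective_limit(uint32_t caller_limit, uint32_t pattern_limit)
{
    return pattern_limit < caller_limit ? pattern_limit : caller_limit;
}

/* Hypothetical helper: (*LIMIT_HEAP=d) is expressed in units of
   1024 bytes, so this converts d to a byte count. */
static uint64_t heap_limit_bytes(uint32_t d)
{
    return (uint64_t)d * 1024u;
}
```

       For example, if the caller sets a match limit of  1000,  a  pattern  re-
       questing  50  gets 50, but a pattern requesting 5000 still gets 1000.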
11323
11324
11325NEWLINE CONVENTION
11326
11327       These are recognized only at the very start of the pattern or after op-
11328       tion settings with a similar syntax.
11329
11330         (*CR)           carriage return only
11331         (*LF)           linefeed only
11332         (*CRLF)         carriage return followed by linefeed
11333         (*ANYCRLF)      all three of the above
11334         (*ANY)          any Unicode newline sequence
11335         (*NUL)          the NUL character (binary zero)
11336
11337
11338WHAT \R MATCHES
11339
11340       These are recognized only at the very start of the pattern or after op-
11341       tion settings with a similar syntax.
11342
11343         (*BSR_ANYCRLF)  CR, LF, or CRLF
11344         (*BSR_UNICODE)  any Unicode newline sequence
11345
11346
11347LOOKAHEAD AND LOOKBEHIND ASSERTIONS
11348
11349         (?=...)                     )
11350         (*pla:...)                  ) positive lookahead
11351         (*positive_lookahead:...)   )
11352
11353         (?!...)                     )
11354         (*nla:...)                  ) negative lookahead
11355         (*negative_lookahead:...)   )
11356
11357         (?<=...)                    )
11358         (*plb:...)                  ) positive lookbehind
11359         (*positive_lookbehind:...)  )
11360
11361         (?<!...)                    )
11362         (*nlb:...)                  ) negative lookbehind
11363         (*negative_lookbehind:...)  )
11364
11365       Each top-level branch of a lookbehind must have a limit for the  number
11366       of  characters it matches. If any branch can match a variable number of
11367       characters, the maximum for each branch is limited to a  value  set  by
11368       the  caller  of  pcre2_compile()  or defaulted. The default is set when
11369       PCRE2 is built (ultimate default 255). If every branch matches a  fixed
11370       number of characters, the limit for each branch is 65535 characters.
11371
11372
11373NON-ATOMIC LOOKAROUND ASSERTIONS
11374
11375       These assertions are specific to PCRE2 and are not Perl-compatible.
11376
11377         (?*...)                                )
11378         (*napla:...)                           ) synonyms
11379         (*non_atomic_positive_lookahead:...)   )
11380
11381         (?<*...)                               )
11382         (*naplb:...)                           ) synonyms
11383         (*non_atomic_positive_lookbehind:...)  )
11384
11385
11386SCRIPT RUNS
11387
11388         (*script_run:...)           ) script run, can be backtracked into
11389         (*sr:...)                   )
11390
11391         (*atomic_script_run:...)    ) atomic script run
11392         (*asr:...)                  )
11393
11394
11395BACKREFERENCES
11396
11397         \n              reference by number (can be ambiguous)
11398         \gn             reference by number
11399         \g{n}           reference by number
11400         \g+n            relative reference by number (PCRE2 extension)
11401         \g-n            relative reference by number
11402         \g{+n}          relative reference by number (PCRE2 extension)
11403         \g{-n}          relative reference by number
11404         \k<name>        reference by name (Perl)
11405         \k'name'        reference by name (Perl)
11406         \g{name}        reference by name (Perl)
11407         \k{name}        reference by name (.NET)
11408         (?P=name)       reference by name (Python)
11409
11410
11411SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
11412
11413         (?R)            recurse whole pattern
11414         (?n)            call subroutine by absolute number
11415         (?+n)           call subroutine by relative number
11416         (?-n)           call subroutine by relative number
11417         (?&name)        call subroutine by name (Perl)
11418         (?P>name)       call subroutine by name (Python)
11419         \g<name>        call subroutine by name (Oniguruma)
11420         \g'name'        call subroutine by name (Oniguruma)
11421         \g<n>           call subroutine by absolute number (Oniguruma)
11422         \g'n'           call subroutine by absolute number (Oniguruma)
11423         \g<+n>          call subroutine by relative number (PCRE2 extension)
11424         \g'+n'          call subroutine by relative number (PCRE2 extension)
11425         \g<-n>          call subroutine by relative number (PCRE2 extension)
11426         \g'-n'          call subroutine by relative number (PCRE2 extension)
11427
11428
11429CONDITIONAL PATTERNS
11430
11431         (?(condition)yes-pattern)
11432         (?(condition)yes-pattern|no-pattern)
11433
11434         (?(n)               absolute reference condition
11435         (?(+n)              relative reference condition (PCRE2 extension)
11436         (?(-n)              relative reference condition (PCRE2 extension)
11437         (?(<name>)          named reference condition (Perl)
11438         (?('name')          named reference condition (Perl)
11439         (?(name)            named reference condition (PCRE2, deprecated)
11440         (?(R)               overall recursion condition
11441         (?(Rn)              specific numbered group recursion condition
11442         (?(R&name)          specific named group recursion condition
11443         (?(DEFINE)          define groups for reference
11444         (?(VERSION[>]=n.m)  test PCRE2 version
11445         (?(assert)          assertion condition
11446
11447       Note  the  ambiguity of (?(R) and (?(Rn) which might be named reference
11448       conditions or recursion tests. Such a condition  is  interpreted  as  a
11449       reference condition if the relevant named group exists.
11450
11451
11452BACKTRACKING CONTROL
11453
11454       All  backtracking  control  verbs  may be in the form (*VERB:NAME). For
11455       (*MARK) the name is mandatory, for the others it is  optional.  (*SKIP)
11456       changes  its  behaviour if :NAME is present. The others just set a name
11457       for passing back to the caller, but this is not a name that (*SKIP) can
11458       see. The following act as soon as they are reached:
11459
11460         (*ACCEPT)       force successful match
11461         (*FAIL)         force backtrack; synonym (*F)
11462         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
11463
11464       The following act only when a subsequent match failure causes  a  back-
11465       track to reach them. They all force a match failure, but they differ in
11466       what happens afterwards. Those that advance the start-of-match point do
11467       so only if the pattern is not anchored.
11468
11469         (*COMMIT)       overall failure, no advance of starting point
11470         (*PRUNE)        advance to next starting character
11471         (*SKIP)         advance to current matching position
11472         (*SKIP:NAME)    advance to position corresponding to an earlier
11473                         (*MARK:NAME); if not found, the (*SKIP) is ignored
11474         (*THEN)         local failure, backtrack to next alternation
11475
11476       The  effect  of one of these verbs in a group called as a subroutine is
11477       confined to the subroutine call.
11478
11479
11480CALLOUTS
11481
11482         (?C)            callout (assumed number 0)
11483         (?Cn)           callout with numerical data n
11484         (?C"text")      callout with string data
11485
11486       The allowed string delimiters are ` ' " ^ % # $ (which are the same for
11487       the start and the end), and the starting delimiter { matched  with  the
11488       ending  delimiter  }. To encode the ending delimiter within the string,
11489       double it.
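
       As  an  illustration of the doubling rule, the C sketch below builds a
       callout item with string data from arbitrary text.  The  function  name
       and  its  buffer  convention  are  assumptions made for this example; it
       is not a PCRE2 function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a PCRE2 function): writes (?C<d>text<d>)
   into out, doubling every occurrence of the ending delimiter inside
   text. opener must be one of ` ' " ^ % # $ or {. Returns the length
   written, or 0 if the delimiter is not allowed or out is too small. */
static size_t build_callout(char *out, size_t out_size,
                            const char *text, char opener)
{
    assert(out != NULL && text != NULL);
    if (opener == '\0' || strchr("`'\"^%#${", opener) == NULL) return 0;
    char closer = (opener == '{') ? '}' : opener;

    /* "(?C" + opening delimiter + closing delimiter + ")" + NUL */
    size_t need = 3 + 1 + 1 + 1 + 1;
    for (const char *p = text; *p != '\0'; p++)
        need += (*p == closer) ? 2 : 1;
    if (need > out_size) return 0;

    size_t n = 3;
    memcpy(out, "(?C", 3);
    out[n++] = opener;
    for (const char *p = text; *p != '\0'; p++) {
        out[n++] = *p;
        if (*p == closer) out[n++] = closer;   /* double the ending delimiter */
    }
    out[n++] = closer;
    out[n++] = ')';
    out[n] = '\0';
    return n;
}
```

       Called  with  the  text a}b and the delimiter {, the sketch produces the
       item (?C{a}}b}).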
11490
11491
11492SEE ALSO
11493
11494       pcre2pattern(3),   pcre2api(3),   pcre2callout(3),    pcre2matching(3),
11495       pcre2(3).
11496
11497
11498AUTHOR
11499
11500       Philip Hazel
11501       Retired from University Computing Service
11502       Cambridge, England.
11503
11504
11505REVISION
11506
11507       Last updated: 12 October 2023
11508       Copyright (c) 1997-2023 University of Cambridge.
11509
11510
11511PCRE2 10.43                     12 October 2023                 PCRE2SYNTAX(3)
11512------------------------------------------------------------------------------
11513
11514
11515
11516PCRE2UNICODE(3)            Library Functions Manual            PCRE2UNICODE(3)
11517
11518
11519NAME
11520       PCRE2 - Perl-compatible regular expressions (revised API)
11521
11522
11523UNICODE AND UTF SUPPORT
11524
11525       PCRE2 is normally built with Unicode support, though if you do not need
11526       it,  you  can  build  it  without,  in  which  case the library will be
11527       smaller. With Unicode support, PCRE2 has knowledge of Unicode character
11528       properties and can process strings of text in UTF-8, UTF-16, and UTF-32
11529       format (depending on the code unit width), but this is not the default.
11530       Unless specifically requested, PCRE2 treats each code unit in a  string
11531       as one character.
11532
11533       There  are two ways of telling PCRE2 to switch to UTF mode, where char-
11534       acters may consist of more than one code unit and the range  of  values
11535       is constrained. The program can call pcre2_compile() with the PCRE2_UTF
11536       option,  or  the  pattern may start with the sequence (*UTF).  However,
11537       the latter facility can be locked out by  the  PCRE2_NEVER_UTF  option.
11538       That  is,  the  programmer can prevent the supplier of the pattern from
11539       switching to UTF mode.
11540
11541       Note  that  the  PCRE2_MATCH_INVALID_UTF  option  (see  below)   forces
11542       PCRE2_UTF to be set.
11543
11544       In  UTF mode, both the pattern and any subject strings that are matched
11545       against it are treated as UTF strings instead of strings of  individual
11546       one-code-unit  characters. There are also some other changes to the way
11547       characters are handled, as documented below.
11548
11549
11550UNICODE PROPERTY SUPPORT
11551
11552       When PCRE2 is built with Unicode support, the escape sequences  \p{..},
11553       \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set-
11554       ting.   The Unicode properties that can be tested are a subset of those
11555       that Perl supports. Currently they are limited to the general  category
11556       properties such as Lu for an upper case letter or Nd for a decimal num-
11557       ber, the derived properties Any and LC (synonym L&), the Unicode script
11558       names such as Arabic or Han, Bidi_Class, Bidi_Control, and a few binary
11559       properties.
11560
11561       The full lists are given in the pcre2pattern and pcre2syntax documenta-
11562       tion.  In  general,  only the short names for properties are supported.
11563       For example, \p{L} matches a letter. Its longer synonym, \p{Letter}, is
11564       not supported. Furthermore, in Perl, many properties may optionally  be
11565       prefixed  by "Is", for compatibility with Perl 5.6. PCRE2 does not sup-
11566       port this.
11567
11568
11569WIDE CHARACTERS AND UTF MODES
11570
11571       Code points less than 256 can be specified in patterns by either braced
11572       or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
11573       Larger values have to use braced sequences. Unbraced octal code  points
11574       up to \777 are also recognized; larger ones can be coded using \o{...}.
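
       A  program  that generates patterns can apply the hexadecimal rule me-
       chanically. The C sketch below is one way to do so; the function  name
       is hypothetical and the choice of lower case hex digits is arbitrary.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: formats a code point as a hexadecimal escape.
   Values below 256 use the unbraced two-digit form; larger values
   must use braces. Returns the value that snprintf returns. */
static int format_hex_escape(char *out, size_t out_size, uint32_t cp)
{
    assert(out != NULL);
    if (cp < 0x100)
        return snprintf(out, out_size, "\\x%02x", (unsigned)cp);
    return snprintf(out, out_size, "\\x{%x}", (unsigned)cp);
}
```

       For  code point 0xb3 this yields \xb3, and for 0x2603 it yields \x{2603}.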
11575
11576       The  escape sequence \N{U+<hex digits>} is recognized as another way of
11577       specifying a Unicode character by code point in a UTF mode. It  is  not
11578       allowed in non-UTF mode.
11579
11580       In  UTF  mode, repeat quantifiers apply to complete UTF characters, not
11581       to individual code units.
11582
11583       In UTF mode, the dot metacharacter matches one UTF character instead of
11584       a single code unit.
11585
11586       In UTF mode, capture group names are not restricted to ASCII,  and  may
11587       contain any Unicode letters and decimal digits, as well as underscore.
11588
11589       The  escape  sequence \C can be used to match a single code unit in UTF
11590       mode, but its use can lead to some strange effects because it breaks up
11591       multi-unit characters (see the description of \C  in  the  pcre2pattern
11592       documentation). For this reason, there is a build-time option that dis-
11593       ables  support  for  \C completely. There is also a less draconian com-
11594       pile-time option for locking out the use of \C when a pattern  is  com-
11595       piled.
11596
11597       The  use  of  \C  is not supported by the alternative matching function
11598       pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
11599       ter may consist of more than one code unit. The  use  of  \C  in  these
11600       modes  provokes a match-time error. Also, the JIT optimization does not
11601       support \C in these modes. If JIT optimization is requested for a UTF-8
11602       or UTF-16 pattern that contains \C, it will not succeed,  and  so  when
11603       pcre2_match() is called, the matching will be carried out by the inter-
11604       pretive function.
11605
11606       The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
11607       characters  of  any  code  value,  but, by default, the characters that
11608       PCRE2 recognizes as digits, spaces, or word characters remain the  same
11609       set  as  in  non-UTF mode, all with code points less than 256. This re-
11610       mains true even when PCRE2 is built to include Unicode support, because
11611       to do otherwise would slow down matching in  many  common  cases.  Note
11612       that  this also applies to \b and \B, because they are defined in terms
11613       of \w and \W. If you want to test for a wider sense of,  say,  "digit",
11614       you  can  use  explicit Unicode property tests such as \p{Nd}. Alterna-
11615       tively, if you set the PCRE2_UCP option, the way that the character es-
11616       capes work is changed so that Unicode properties are used to  determine
11617       which  characters  match,  though  there are some options that suppress
11618       this for individual escapes. For details see  the  section  on  generic
11619       character types in the pcre2pattern documentation.
11620
11621       Like  the  escapes,  characters  that  match  the POSIX named character
11622       classes are all low-valued characters unless the  PCRE2_UCP  option  is
11623       set, but there is an option to override this.
11624
11625       In contrast to the character escapes and character classes, the special
11626       horizontal  and  vertical  white  space escapes (\h, \H, \v, and \V) do
11627       match all the appropriate Unicode characters, whether or not  PCRE2_UCP
11628       is set.
11629
11630
11631UNICODE CASE-EQUIVALENCE
11632
11633       If  either  PCRE2_UTF  or PCRE2_UCP is set, upper/lower case processing
11634       makes use of Unicode properties except for characters whose code points
11635       are less than 128 and that have at most two case-equivalent values. For
11636       these, a direct table lookup is used for speed. A few  Unicode  charac-
11637       ters  such as Greek sigma have more than two code points that are case-
11638       equivalent, and these are treated specially. Setting PCRE2_UCP  without
11639       PCRE2_UTF  allows  Unicode-style  case processing for non-UTF character
11640       encodings such as UCS-2.
11641
11642       There are two ASCII characters (S and K) that,  in  addition  to  their
11643       ASCII  lower case equivalents, have a non-ASCII one as well (long S and
11644       Kelvin sign).  Recognition of these non-ASCII characters as case-equiv-
11645       alent to their ASCII  counterparts  can  be  disabled  by  setting  the
11646       PCRE2_EXTRA_CASELESS_RESTRICT  option. When this is set, all characters
11647       in a case equivalence must either be ASCII or non-ASCII; there  can  be
11648       no mixing.
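
       The  "no mixing" rule can be expressed as a one-line predicate. The C
       sketch below is illustrative only: the function names are hypothetical,
       and  the  predicate  does not decide whether two characters are case-
       equivalent in the first place, only whether the restriction would  per-
       mit them to be.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool is_ascii_cp(uint32_t cp) { return cp < 128; }

/* Hypothetical predicate: under the caseless restriction, two
   characters may be treated as case-equivalent only if both are
   ASCII or both are non-ASCII. */
static bool caseless_restrict_allows(uint32_t a, uint32_t b)
{
    return is_ascii_cp(a) == is_ascii_cp(b);
}
```

       So K and k remain a permitted pair, whereas k and the Kelvin sign
       (U+212A), or s and long S (U+017F), do not.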
11649
11650
11651SCRIPT RUNS
11652
11653       The  pattern constructs (*script_run:...) and (*atomic_script_run:...),
11654       with synonyms (*sr:...) and (*asr:...), verify that the string  matched
11655       within  the  parentheses is a script run. In concept, a script run is a
11656       sequence of characters that are all from the same Unicode script.  How-
11657       ever, because some scripts are commonly used together, and because some
11658       diacritical  and  other marks are used with multiple scripts, it is not
11659       that simple.
11660
11661       Every Unicode character has a Script property, mostly with a value cor-
11662       responding to the name of a script, such as Latin, Greek, or  Cyrillic.
11663       There are also three special values:
11664
11665       "Unknown" is used for code points that have not been assigned, and also
11666       for  the surrogate code points. In the PCRE2 32-bit library, characters
11667       whose code points are greater  than  the  Unicode  maximum  (U+10FFFF),
11668       which  are  accessible  only  in non-UTF mode, are assigned the Unknown
11669       script.
11670
11671       "Common" is used for characters that are used with many scripts.  These
11672       include  punctuation,  emoji,  mathematical, musical, and currency sym-
11673       bols, and the ASCII digits 0 to 9.
11674
11675       "Inherited" is used for characters such as diacritical marks that  mod-
11676       ify a previous character. These are considered to take on the script of
11677       the character that they modify.
11678
11679       Some  Inherited characters are used with many scripts, but many of them
11680       are only normally used with a small number  of  scripts.  For  example,
11681       U+102E0 (Coptic Epact thousands mark) is used only with Arabic and Cop-
11682       tic.  In  order  to  make it possible to check this, a Unicode property
11683       called Script Extension exists. Its value is a list of scripts that ap-
11684       ply to the character. For the majority of characters, the list contains
11685       just one script, the same one as  the  Script  property.  However,  for
11686       characters  such  as  U+102E0 more than one Script is listed. There are
11687       also some Common characters that have a single,  non-Common  script  in
11688       their Script Extension list.
11689
11690       The next section describes the basic rules for deciding whether a given
11691       string  of  characters  is  a script run. Note, however, that there are
11692       some special cases involving the Chinese Han script, and an  additional
11693       constraint  for  decimal  digits.  These are covered in subsequent sec-
11694       tions.
11695
11696   Basic script run rules
11697
11698       A string that is less than two characters long is a script run. This is
11699       the only case in which an Unknown character can be  part  of  a  script
11700       run.  Longer strings are checked using only the Script Extensions prop-
11701       erty, not the basic Script property.
11702
11703       If a character's Script Extension property is the single value  "Inher-
11704       ited", it is always accepted as part of a script run. This is also true
11705       for  the  property  "Common", subject to the checking of decimal digits
11706       described below. All the remaining characters in a script run must have
11707       at least one script in common in their Script Extension lists. In  set-
11708       theoretic terminology, the intersection of all the sets of scripts must
11709       not be empty.
11710
11711       A  simple example is an Internet name such as "google.com". The letters
11712       are all in the Latin script, and the dot is Common, so this string is a
11713       script run.  However, the Cyrillic letter "o" looks exactly the same as
11714       the Latin "o"; a string that looks the same, but with Cyrillic "o"s  is
11715       not a script run.
11716
11717       More  interesting examples involve characters with more than one script
11718       in their Script Extension. Consider the following characters:
11719
11720         U+060C  Arabic comma
11721         U+06D4  Arabic full stop
11722
11723       The first has the Script Extension list Arabic, Hanifi  Rohingya,  Syr-
11724       iac,  and  Thaana; the second has just Arabic and Hanifi Rohingya. Both
11725       of them could appear in script runs of  either  Arabic  or  Hanifi  Ro-
11726       hingya.  The  first  could also appear in Syriac or Thaana script runs,
11727       but the second could not.
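
       The  intersection  rule can be sketched in C by representing each Script
       Extension list as a set of bit flags. The flag names and the  function
       below  are  assumptions  made  for illustration; this document does not
       describe how PCRE2 stores script data, and the Han  and  decimal  digit
       rules are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative script identifiers, one bit each. */
#define SC_ARABIC    (1u << 0)
#define SC_ROHINGYA  (1u << 1)   /* Hanifi Rohingya */
#define SC_SYRIAC    (1u << 2)
#define SC_THAANA    (1u << 3)
#define SC_LATIN     (1u << 4)
#define SC_ANY       0xFFFFFFFFu /* stands in for Common or Inherited */

/* ext[i] is the Script Extension set of the i-th character. An Unknown
   character can be represented by the empty set 0. Returns true if the
   string passes the basic rule: fewer than two characters, or at least
   one script shared by every character. */
static bool basic_script_run(const uint32_t *ext, size_t n)
{
    assert(ext != NULL || n == 0);
    if (n < 2) return true;
    uint32_t shared = SC_ANY;
    for (size_t i = 0; i < n; i++)
        shared &= ext[i];          /* running set intersection */
    return shared != 0;
}
```

       With  these  sets,  a  Syriac  letter followed by the Arabic comma is a
       script run, but a Syriac letter followed by the Arabic full stop is not.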
11728
11729   The Chinese Han script
11730
11731       The Chinese Han script is  commonly  used  in  conjunction  with  other
11732       scripts  for  writing certain languages. Japanese uses the Hiragana and
11733       Katakana scripts together with Han; Korean uses Hangul  and  Han;  Tai-
11734       wanese  Mandarin  uses  Bopomofo  and Han. These three combinations are
11735       treated as special cases when checking script runs and are, in  effect,
11736       "virtual  scripts".  Thus,  a script run may contain a mixture of Hira-
11737       gana, Katakana, and Han, or a mixture of Hangul and Han, or  a  mixture
11738       of  Bopomofo  and  Han,  but  not, for example, a mixture of Hangul and
11739       Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical Standard
11740       39 ("Unicode Security Mechanisms", http://unicode.org/reports/tr39/) in
11741       allowing such mixtures.
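
       One way to picture the three "virtual scripts" is as  three  permitted
       sets,  with  a  candidate  mixture accepted if it fits inside any one of
       them. The C sketch below uses hypothetical flag and function names  and
       considers only the five scripts named above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HAN       (1u << 0)
#define HIRAGANA  (1u << 1)
#define KATAKANA  (1u << 2)
#define HANGUL    (1u << 3)
#define BOPOMOFO  (1u << 4)

/* present is the set of scripts seen in the candidate string.
   Returns true if it is a subset of one of the three combinations. */
static bool han_mixture_allowed(uint32_t present)
{
    static const uint32_t allowed[] = {
        HAN | HIRAGANA | KATAKANA,   /* Japanese */
        HAN | HANGUL,                /* Korean */
        HAN | BOPOMOFO               /* Taiwanese Mandarin */
    };
    assert((present & ~(HAN | HIRAGANA | KATAKANA | HANGUL | BOPOMOFO)) == 0);
    for (size_t i = 0; i < sizeof allowed / sizeof allowed[0]; i++)
        if ((present & ~allowed[i]) == 0) return true;
    return false;
}
```

       Here  Han  with Hiragana is accepted, while Han with both Hangul and
       Bopomofo is rejected.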
11742
11743   Decimal digits
11744
11745       Unicode contains many sets of 10 decimal digits in  different  scripts,
11746       and  some  scripts  (including the Common script) contain more than one
11747       set. Some of these decimal digits are visually indistinguishable
11748       from  the  common  ASCII digits. In addition to the script checking de-
11749       scribed above, if a script run contains any decimal digits,  they  must
11750       all come from the same set of 10 adjacent characters.
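
       Because each set consists of 10 adjacent characters, subtracting a dig-
       it's numeric value from its code point gives the code point of that
       set's zero. The C sketch below uses this to compare digits. It  assumes
       that  the  caller  supplies each digit's value (0 to 9) from Unicode
       data; the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* cp[i] is the code point of a decimal digit and value[i] its numeric
   value, 0 to 9 (supplied by the caller in this sketch). Returns true
   if all the digits belong to the same set of 10 adjacent characters. */
static bool digits_from_one_set(const uint32_t *cp, const unsigned *value,
                                size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(value[i] <= 9 && cp[i] >= value[i]);
        /* cp - value is the code point of the set's zero digit */
        if (cp[i] - value[i] != cp[0] - value[0]) return false;
    }
    return true;
}
```

       The  ASCII  digits  4 and 2 pass this test together, but ASCII 1 (U+0031)
       combined with Arabic-Indic digit one (U+0661) does not.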
11751
11752
11753VALIDITY OF UTF STRINGS
11754
11755       When  the  PCRE2_UTF  option is set, the strings passed as patterns and
11756       subjects are (by default) checked for validity on entry to the relevant
11757       functions. If an invalid UTF string is passed, a negative error code is
11758       returned. The code unit offset to the offending character  can  be  ex-
11759       tracted  from  the  match  data block by calling pcre2_get_startchar(),
11760       which is used for this purpose after a UTF error.
11761
11762       In some situations, you may already know that your strings  are  valid,
11763       and  therefore  want  to  skip these checks in order to improve perfor-
11764       mance, for example in the case of a long subject string that  is  being
11765       scanned  repeatedly.   If you set the PCRE2_NO_UTF_CHECK option at com-
11766       pile time or at match time, PCRE2 assumes that the pattern  or  subject
11767       it is given (respectively) contains only valid UTF code unit sequences.
11768
11769       If  you  pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
11770       result is undefined and your program may crash or loop indefinitely  or
11771       give  incorrect  results.  There is, however, one mode of matching that
11772       can handle invalid UTF subject strings.  This  is  enabled  by  passing
11773       PCRE2_MATCH_INVALID_UTF  to  pcre2_compile()  and is discussed in
11774       the next section. The  rest  of  this  section  covers  the  case  when
11775       PCRE2_MATCH_INVALID_UTF is not set.
11776
11777       Passing  PCRE2_NO_UTF_CHECK  to  pcre2_compile()  just disables the UTF
11778       check for the pattern; it does not also apply to  subject  strings.  If
11779       you  want  to disable the check for a subject string you must pass this
11780       same option to pcre2_match() or pcre2_dfa_match().
11781
11782       UTF-16 and UTF-32 strings can indicate their endianness by a special code
11783       known as a byte-order mark (BOM). The PCRE2  functions  do  not  handle
11784       this, expecting strings to be in host byte order.
11785
11786       Unless  PCRE2_NO_UTF_CHECK  is  set, a UTF string is checked before any
11787       other  processing  takes  place.  In  the  case  of  pcre2_match()  and
11788       pcre2_dfa_match()  calls  with a non-zero starting offset, the check is
11789       applied only to that part of the subject that could be inspected during
11790       matching, and there is a check that the starting offset points  to  the
11791       first  code  unit of a character or to the end of the subject. If there
11792       are no lookbehind assertions in the pattern, the check  starts  at  the
11793       starting  offset.   Otherwise,  it  starts at the length of the longest
11794       lookbehind before the starting offset, or at the start of  the  subject
11795       if  there are not that many characters before the starting offset. Note
11796       that the sequences \b and \B are one-character lookbehinds.
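
       For the 8-bit library, the two checks described above can  be  pictured
       as  follows.  This  C  sketch is not PCRE2's code: the function names are
       hypothetical, and it assumes that stepping back one character means mov-
       ing back over any UTF-8 continuation bytes to the preceding lead byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if b is a UTF-8 continuation byte (binary 10xxxxxx). */
static bool is_continuation(uint8_t b) { return (b & 0xC0) == 0x80; }

/* The starting offset must point to the first code unit of a
   character or to the end of the subject. */
static bool start_offset_ok(const uint8_t *subject, size_t length,
                            size_t start_offset)
{
    assert(subject != NULL);
    if (start_offset > length) return false;
    return start_offset == length || !is_continuation(subject[start_offset]);
}

/* Returns the byte offset where the validity check would begin:
   max_lookbehind characters before start_offset, or offset 0 if there
   are not that many characters. Pass 0 when the pattern contains no
   lookbehind assertions. */
static size_t utf_check_start(const uint8_t *subject, size_t start_offset,
                              size_t max_lookbehind)
{
    size_t pos = start_offset;
    while (max_lookbehind > 0 && pos > 0) {
        pos--;
        while (pos > 0 && is_continuation(subject[pos])) pos--;
        max_lookbehind--;
    }
    return pos;
}
```

       For a subject holding "a", a two-byte character, and "b", with a  start-
       ing  offset  of  3 and a longest lookbehind of one character, the check
       in this sketch begins at offset 1.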
11797
11798       In addition to checking the format of the string, there is a  check  to
11799       ensure that all code points lie in the range U+0 to U+10FFFF, excluding
11800       the  surrogate  area. The so-called "non-character" code points are not
11801       excluded because Unicode corrigendum #9 makes it clear that they should
11802       not be.
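
       The range test can be written directly from this  description.  The  C
       function below is a sketch with a hypothetical name, not the library's
       own code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Accepts U+0000 to U+10FFFF except the surrogate area U+D800 to
   U+DFFF. Non-character code points such as U+FFFE are accepted. */
static bool is_valid_utf_code_point(uint32_t cp)
{
    if (cp > 0x10FFFF) return false;
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;
    return true;
}
```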
11803
11804       Characters in the "Surrogate Area" of Unicode are reserved for  use  by
11805       UTF-16,  where they are used in pairs to encode code points with values
11806       greater than 0xFFFF. The code points that are encoded by  UTF-16  pairs
11807       are  available  independently  in  the  UTF-8 and UTF-32 encodings. (In
11808       other words, the whole surrogate thing is a fudge for UTF-16 which  un-
11809       fortunately messes up UTF-8 and UTF-32.)
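
       For  reference,  the standard UTF-16 arithmetic that maps a code point
       above 0xFFFF to a surrogate pair is sketched below in C.  The  function
       name is hypothetical; the arithmetic is that of UTF-16 itself and is not
       specific to PCRE2.

```c
#include <assert.h>
#include <stdint.h>

/* Splits a code point in the range 0x10000 to 0x10FFFF into a high
   (leading) and a low (trailing) surrogate. */
static void to_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp > 0xFFFF && cp <= 0x10FFFF);
    assert(high != 0 && low != 0);
    cp -= 0x10000;                              /* 20 bits remain */
    *high = (uint16_t)(0xD800 + (cp >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* bottom 10 bits */
}
```

       For example, U+1F600 becomes the pair 0xD83D, 0xDE00.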
11810
11811       Setting  PCRE2_NO_UTF_CHECK  at compile time does not disable the error
11812       that is given if an escape sequence for an invalid Unicode  code  point
11813       is  encountered  in  the pattern. If you want to allow escape sequences
11814       such as \x{d800} (a surrogate code point) you  can  set  the  PCRE2_EX-
11815       TRA_ALLOW_SURROGATE_ESCAPES  extra  option.  However,  this is possible
11816       only in UTF-8 and UTF-32 modes, because these  values  are  not  repre-
11817       sentable in UTF-16.
11818
11819   Errors in UTF-8 strings
11820
11821       The following negative error codes are given for invalid UTF-8 strings:
11822
11823         PCRE2_ERROR_UTF8_ERR1
11824         PCRE2_ERROR_UTF8_ERR2
11825         PCRE2_ERROR_UTF8_ERR3
11826         PCRE2_ERROR_UTF8_ERR4
11827         PCRE2_ERROR_UTF8_ERR5
11828
11829       The  string  ends  with a truncated UTF-8 character; the code specifies
11830       how many bytes are missing (1 to 5). Although RFC 3629 restricts  UTF-8
11831       characters  to  be  no longer than 4 bytes, the encoding scheme (origi-
11832       nally defined by RFC 2279) allows for  up  to  6  bytes,  and  this  is
11833       checked first; hence the possibility of 4 or 5 missing bytes.
11834
11835         PCRE2_ERROR_UTF8_ERR6
11836         PCRE2_ERROR_UTF8_ERR7
11837         PCRE2_ERROR_UTF8_ERR8
11838         PCRE2_ERROR_UTF8_ERR9
11839         PCRE2_ERROR_UTF8_ERR10
11840
11841       The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
11842       the  character  do  not have the binary value 0b10 (that is, either the
11843       most significant bit is 0, or the next bit is 1).
11844
11845         PCRE2_ERROR_UTF8_ERR11
11846         PCRE2_ERROR_UTF8_ERR12
11847
11848       A character that is valid by the RFC 2279 rules is either 5 or 6  bytes
11849       long; these code points are excluded by RFC 3629.
11850
11851         PCRE2_ERROR_UTF8_ERR13
11852
11853       A 4-byte character has a value greater than 0x10ffff; these code points
11854       are excluded by RFC 3629.
11855
11856         PCRE2_ERROR_UTF8_ERR14
11857
11858       A  3-byte  character  has  a  value in the range 0xd800 to 0xdfff; this
11859       range of code points is reserved by RFC 3629 for use with UTF-16,  and
11860       so is excluded from UTF-8.
11861
11862         PCRE2_ERROR_UTF8_ERR15
11863         PCRE2_ERROR_UTF8_ERR16
11864         PCRE2_ERROR_UTF8_ERR17
11865         PCRE2_ERROR_UTF8_ERR18
11866         PCRE2_ERROR_UTF8_ERR19
11867
11868       A  2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
11869       for a value that can be represented by fewer bytes, which  is  invalid.
11870       For  example,  the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
11871       rect coding uses just one byte.
11872
11873         PCRE2_ERROR_UTF8_ERR20
11874
11875       The two most significant bits of the first byte of a character have the
11876       binary value 0b10 (that is, the most significant bit is 1 and the  sec-
11877       ond  is  0). Such a byte can only validly occur as the second or subse-
11878       quent byte of a multi-byte character.
11879
11880         PCRE2_ERROR_UTF8_ERR21
11881
11882       The first byte of a character has the value 0xfe or 0xff. These  values
11883       can never occur in a valid UTF-8 string.
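
       The conditions listed above can be summarized  by  a  small  clas-
       sifier. What follows is a minimal sketch written from the descrip-
       tions in this section, not the code that PCRE2 uses. The  function
       name  is  hypothetical, it returns the small integer n for the con-
       dition described under PCRE2_ERROR_UTF8_ERRn (the real error codes
       are negative values defined in pcre2.h), and the  order  in  which
       conditions  are  tested  when  more than one applies is an assump-
       tion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classifier, not the PCRE2 implementation. Returns 0 if the
   buffer is valid UTF-8, otherwise n (1..21) for the condition described
   under PCRE2_ERROR_UTF8_ERRn. The integers are labels only. */
int utf8_error_class(const unsigned char *s, size_t len)
{
    /* Smallest value that needs ab additional bytes (index = ab). */
    static const uint32_t min_value[6] =
        { 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned c = s[i++];
        size_t ab;                   /* additional bytes after the first */
        size_t k;
        uint32_t value;

        if (c < 0x80) continue;      /* single-byte character */
        if (c < 0xc0) return 20;     /* 0b10xxxxxx as a first byte */
        if (c >= 0xfe) return 21;    /* 0xfe and 0xff are never valid */

        /* RFC 2279 lengths: up to 6 bytes, checked before RFC 3629 limits. */
        if (c < 0xe0) ab = 1;
        else if (c < 0xf0) ab = 2;
        else if (c < 0xf8) ab = 3;
        else if (c < 0xfc) ab = 4;
        else ab = 5;

        /* Truncated character: report how many bytes are missing. */
        if (len - i < ab) return (int)(ab - (len - i));     /* ERR1..ERR5 */

        /* Decode, checking that each later byte starts with 0b10. */
        value = c & (0x7fu >> (ab + 1));
        for (k = 0; k < ab; k++) {
            unsigned d = s[i + k];
            if ((d & 0xc0) != 0x80) return 6 + (int)k;      /* ERR6..ERR10 */
            value = (value << 6) | (d & 0x3f);
        }
        i += ab;

        if (value < min_value[ab]) return 14 + (int)ab;     /* ERR15..ERR19 */
        if (ab == 4) return 11;                             /* 5 bytes long */
        if (ab == 5) return 12;                             /* 6 bytes long */
        if (ab == 3 && value > 0x10ffff) return 13;
        if (ab == 2 && value >= 0xd800 && value <= 0xdfff) return 14;
    }
    return 0;
}
```

       With  this  sketch, the overlong pair 0xc0, 0xae decodes to 0x2e,
       which is below the two-byte minimum of 0x80, so the result is 15.
       The three bytes 0xef, 0xbf, 0xbf (the non-character  U+FFFF)  are
       accepted, in line with the treatment of non-characters above.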
11884
11885   Errors in UTF-16 strings
11886
11887       The  following  negative  error  codes  are  given  for  invalid UTF-16
11888       strings:
11889
11890         PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
11891         PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
11892         PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate
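
       The  three  UTF-16 conditions can be illustrated in the same way.
       This is a sketch under the same caveats: the function name is hy-
       pothetical, and the return values 1 to 3 merely  label  the  cor-
       responding PCRE2_ERROR_UTF16_ERRn descriptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classifier, not the PCRE2 implementation. Returns 0 for
   valid UTF-16, otherwise n (1..3) for the condition described under
   PCRE2_ERROR_UTF16_ERRn. */
int utf16_error_class(const uint16_t *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        uint16_t c = s[i++];
        if ((c & 0xf800) != 0xd800) continue;       /* not a surrogate */
        if ((c & 0x0400) != 0) return 3;            /* isolated low surrogate */
        if (i == len) return 1;                     /* high surrogate at end */
        if ((s[i++] & 0xfc00) != 0xdc00) return 2;  /* no low surrogate next */
    }
    return 0;
}
```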
11893
11894
11895   Errors in UTF-32 strings
11896
11897       The following  negative  error  codes  are  given  for  invalid  UTF-32
11898       strings:
11899
11900         PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff)
11901         PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff
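
       For UTF-32 the check reduces to a range test on each  code  unit.
       The  sketch  below  uses a hypothetical function name and returns
       1 or 2 as labels for the two PCRE2_ERROR_UTF32_ERRn descriptions;
       it is not the PCRE2 implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classifier, not the PCRE2 implementation. Returns 0 for
   valid UTF-32, 1 for a surrogate, 2 for a value above 0x10ffff. */
int utf32_error_class(const uint32_t *s, size_t len)
{
    size_t i;

    assert(s != NULL || len == 0);
    for (i = 0; i < len; i++) {
        uint32_t c = s[i];
        if (c >= 0xd800 && c <= 0xdfff) return 1;   /* surrogate area */
        if (c > 0x10ffff) return 2;                 /* beyond Unicode range */
    }
    return 0;   /* non-characters such as 0xfffe are deliberately accepted */
}
```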
11902
11903
11904MATCHING IN INVALID UTF STRINGS
11905
11906       You can run pattern matches on subject strings that may contain invalid
11907       UTF  sequences  if  you  call  pcre2_compile() with the PCRE2_MATCH_IN-
11908       VALID_UTF option. This is supported  by  pcre2_match(),  including  JIT
11909       matching, but not by pcre2_dfa_match(). When PCRE2_MATCH_INVALID_UTF is
11910       set,  it  forces  PCRE2_UTF  to be set as well. Note, however, that the
11911       pattern itself must be a valid UTF string.
11912
11913       If you do not set PCRE2_MATCH_INVALID_UTF when calling pcre2_compile(),
11914       and  you  are  not  certain that your subject strings are valid UTF se-
11915       quences, you should not make  use  of  the  JIT  "fast  path"  function
11916       pcre2_jit_match()  because it bypasses sanity checks, including the one
11917       for UTF validity. An invalid string may cause undefined behaviour,  in-
11918       cluding looping, crashing, or giving the wrong answer.
11919
11920       Setting  PCRE2_MATCH_INVALID_UTF  does  not affect what pcre2_compile()
11921       generates, but if pcre2_jit_compile() is subsequently called,  it  does
11922       generate different code. If JIT is not used, the option affects the be-
11923       haviour of the interpretive code in pcre2_match(). When PCRE2_MATCH_IN-
11924       VALID_UTF  is  set  at  compile  time, PCRE2_NO_UTF_CHECK is ignored at
11925       match time.
11926
11927       In this mode, an invalid  code  unit  sequence  in  the  subject  never
11928       matches  any  pattern  item.  It  does not match dot, it does not match
11929       \p{Any}, it does not even match negative items such as [^X]. A  lookbe-
11930       hind  assertion fails if it encounters an invalid sequence while moving
11931       the current point backwards. In other words, an invalid UTF  code  unit
11932       sequence acts as a barrier which no match can cross.
11933
11934       You can also think of this as the subject being split up into fragments
11935       of  valid UTF, delimited internally by invalid code unit sequences. The
11936       pattern is matched fragment by fragment. The  result  of  a  successful
11937       match,  however,  is  given  as code unit offsets in the entire subject
11938       string in the usual way. There are a few points to consider:
11939
11940       The internal boundaries are not interpreted as the beginnings  or  ends
11941       of  lines  and  so  do not match circumflex or dollar characters in the
11942       pattern.
11943
11944       If pcre2_match() is called with an offset that  points  to  an  invalid
11945       UTF sequence, that sequence is skipped, and the match  starts  at  the
11946       next valid UTF character, or at the end of the subject.
11947
11948       At internal fragment boundaries, \b and \B behave in the same way as at
11949       the beginning and end of the subject. For example, a sequence  such  as
11950       \bWORD\b  would match an instance of WORD that is surrounded by invalid
11951       UTF code units.
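
       The fragment model described above can be pictured with a  small
       UTF-8  scanner  that  reports  each  run  of valid characters. The
       sketch below is only an illustration of that model: both function
       names are hypothetical, it does not call PCRE2, and treating each
       invalid byte as its own delimiter is an assumption. Offsets  are
       relative  to  the  whole  buffer, in the same way that match offsets
       are relative to the whole subject.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length in bytes of the valid UTF-8 character at
   s[0], or 0 if the bytes there do not form one (RFC 3629 rules). */
static size_t utf8_char_len(const unsigned char *s, size_t len)
{
    unsigned c;
    size_t n, k;
    uint32_t v;

    if (len == 0) return 0;
    c = s[0];
    if (c < 0x80) return 1;
    if (c < 0xc2 || c > 0xf4) return 0;   /* continuation, overlong, too big */
    n = (c < 0xe0) ? 2 : (c < 0xf0) ? 3 : 4;
    if (len < n) return 0;                /* truncated */
    v = c & (0xffu >> (n + 1));
    for (k = 1; k < n; k++) {
        if ((s[k] & 0xc0) != 0x80) return 0;
        v = (v << 6) | (s[k] & 0x3f);
    }
    if (n == 3 && (v < 0x800 || (v >= 0xd800 && v <= 0xdfff))) return 0;
    if (n == 4 && (v < 0x10000 || v > 0x10ffff)) return 0;
    return n;
}

/* Hypothetical helper: find the next fragment of valid UTF-8 at or after
   *pos. On success, stores its start and length (in bytes, relative to
   the whole buffer), advances *pos past it, and returns 1. Returns 0
   when no valid characters remain. */
int next_valid_fragment(const unsigned char *s, size_t len, size_t *pos,
                        size_t *start, size_t *length)
{
    size_t i = *pos;
    size_t n;

    assert(s != NULL || len == 0);
    while (i < len && utf8_char_len(s + i, len - i) == 0)
        i++;                              /* skip invalid code units */
    if (i >= len) {
        *pos = len;
        return 0;
    }
    *start = i;
    while (i < len && (n = utf8_char_len(s + i, len - i)) != 0)
        i += n;                           /* extend over valid characters */
    *length = i - *start;
    *pos = i;
    return 1;
}
```

       Calling next_valid_fragment() repeatedly on the bytes  "WORD",
       0xff,  0xc3,  0xa9,  'x', 0x80 yields the fragment at offset 0 of
       length 4 and then the fragment at offset 5 of length 3.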
11952
11953       Using PCRE2_MATCH_INVALID_UTF, an application can run matches on  arbi-
11954       trary  data,  knowing  that  any  matched strings that are returned are
11955       valid UTF. This can be useful when searching for UTF text in executable
11956       or other binary files.
11957
11958       Note, however, that the  16-bit  and  32-bit  PCRE2  libraries  process
11959       strings  as  sequences of uint16_t or uint32_t code units. They cannot
11960       find valid UTF sequences within an arbitrary  string  of  bytes  unless
11961       such sequences are suitably aligned on code unit boundaries.
11962
11963
11964AUTHOR
11965
11966       Philip Hazel
11967       Retired from University Computing Service
11968       Cambridge, England.
11969
11970
11971REVISION
11972
11973       Last updated: 12 October 2023
11974       Copyright (c) 1997-2023 University of Cambridge.
11975
11976
11977PCRE2 10.43                    04 February 2023                PCRE2UNICODE(3)
11978------------------------------------------------------------------------------
11979
11980
11981