-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE2 man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. Neither has
the pcre2demo program. There are separate text files for the pcre2grep and
pcre2test commands.
-----------------------------------------------------------------------------



PCRE2(3)                   Library Functions Manual                   PCRE2(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


INTRODUCTION

       PCRE2 is the name used for a revised API for the PCRE library, which is
       a  set  of  functions,  written in C, that implement regular expression
       pattern matching using the same syntax and semantics as Perl, with just
       a few differences. After nearly two decades,  the  limitations  of  the
       original  API  were  making development increasingly difficult. The new
       API is more extensible, and it was simplified by abolishing  the  sepa-
       rate  "study" optimizing function; in PCRE2, patterns are automatically
       optimized where possible. Since forking from PCRE1, the code  has  been
       extensively  refactored and new features introduced. The old library is
       now obsolete and is no longer maintained.

       As well as Perl-style regular expression patterns, some  features  that
       appeared  in  Python and the original PCRE before they appeared in Perl
       are available using the Python syntax. There is also some  support  for
       one  or  two .NET and Oniguruma syntax items, and there are options for
       requesting  some  minor  changes  that  give  better  ECMAScript   (aka
       JavaScript) compatibility.

       The  source code for PCRE2 can be compiled to support strings of 8-bit,
       16-bit, or 32-bit code units, which means that up to three separate li-
       braries may be installed, one for each code unit size. The size of code
       unit is not related to the bit size of the underlying  hardware.  In  a
       64-bit  environment that also supports 32-bit applications, versions of
       PCRE2 that are compiled in both 64-bit and 32-bit modes may be needed.

       The original work to extend PCRE to 16-bit and 32-bit  code  units  was
       done by Zoltan Herczeg and Christian Persch, respectively. In all three
       cases,  strings  can  be  interpreted  either as one character per code
       unit, or as UTF-encoded Unicode, with support for Unicode general cate-
       gory properties. Unicode support is optional at build time (but is  the
       default). However, processing strings as UTF code units must be enabled
       explicitly at run time. The version of Unicode in use can be discovered
       by running

         pcre2test -C

       The  three  libraries  contain  identical sets of functions, with names
       ending in _8,  _16,  or  _32,  respectively  (for  example,  pcre2_com-
       pile_8()).  However,  by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or
       32, a program that uses just one code unit width can be  written  using
       generic names such as pcre2_compile(), and the documentation is written
       assuming that this is the case.

       In addition to the Perl-compatible matching function, PCRE2 contains an
       alternative  function that matches the same compiled patterns in a dif-
       ferent way. In certain circumstances, the alternative function has some
       advantages.  For a discussion of the two matching algorithms,  see  the
       pcre2matching page.

       Details  of  exactly which Perl regular expression features are and are
       not supported by  PCRE2  are  given  in  separate  documents.  See  the
       pcre2pattern  and  pcre2compat  pages. There is a syntax summary in the
       pcre2syntax page.

       Some features of PCRE2 can be included, excluded, or changed  when  the
       library  is  built. The pcre2_config() function makes it possible for a
       client to discover which features are  available.  The  features  them-
       selves are described in the pcre2build page. Documentation about build-
       ing  PCRE2 for various operating systems can be found in the README and
       NON-AUTOTOOLS_BUILD files in the source distribution.

       The libraries contain a number of undocumented internal  functions  and
       data  tables  that  are  used by more than one of the exported external
       functions, but which are not intended  for  use  by  external  callers.
       Their  names  all begin with "_pcre2", which hopefully will not provoke
       any name clashes. In some environments, it is possible to control which
       external symbols are exported when a shared library is  built,  and  in
       these cases the undocumented symbols are not exported.


SECURITY CONSIDERATIONS

       If  you  are using PCRE2 in a non-UTF application that permits users to
       supply arbitrary patterns for compilation, you should  be  aware  of  a
       feature that allows users to turn on UTF support from within a pattern.
       For  example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
       mode, which interprets patterns and subjects as strings of  UTF-8  code
       units instead of individual 8-bit characters. This causes both the pat-
       tern  and  any data against which it is matched to be checked for UTF-8
       validity. If the data string is very long, such a check might use  suf-
       ficiently  many  resources as to cause your application to lose perfor-
       mance.

       One way of guarding against this possibility is to use  the  pcre2_pat-
       tern_info()  function  to  check  the  compiled  pattern's  options for
       PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF  option  when
       calling  pcre2_compile().  This causes a compile time error if the pat-
       tern contains a UTF-setting sequence.

       The use of Unicode properties for character types such as \d  can  also
       be  enabled  from within the pattern, by specifying "(*UCP)". This fea-
       ture can be disallowed by setting the PCRE2_NEVER_UCP option.

       If your application is one that supports UTF, be  aware  that  validity
       checking  can  take time. If the same data string is to be matched many
       times, you can use the PCRE2_NO_UTF_CHECK option  for  the  second  and
       subsequent matches to avoid running redundant checks.

       The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
       to  problems,  because  it  may leave the current matching point in the
       middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C  op-
       tion can be used by an application to lock out the use of \C, causing a
       compile-time  error  if it is encountered. It is also possible to build
       PCRE2 with the use of \C permanently disabled.

       Another way that performance can be hit is by running  a  pattern  that
       has  a  very  large search tree against a string that will never match.
       Nested unlimited repeats in a pattern are a common example. PCRE2  pro-
       vides  some  protection  against  this: see the pcre2_set_match_limit()
       function in the pcre2api page.  There  is  a  similar  function  called
       pcre2_set_depth_limit() that can be used to restrict the amount of mem-
       ory that is used.


USER DOCUMENTATION

       The  user  documentation for PCRE2 comprises a number of different sec-
       tions. In the "man" format, each of these is a separate "man page".  In
       the  HTML  format, each is a separate page, linked from the index page.
       In the plain  text  format,  the  descriptions  of  the  pcre2grep  and
       pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
       respectively.  The remaining sections, except for the pcre2demo section
       (which is a program listing), and the short pages for individual  func-
       tions,  are  concatenated in pcre2.txt, for ease of searching. The sec-
       tions are as follows:

         pcre2              this document
         pcre2-config       show PCRE2 installation configuration information
         pcre2api           details of PCRE2's native C API
         pcre2build         building PCRE2
         pcre2callout       details of the pattern callout feature
         pcre2compat        discussion of Perl compatibility
         pcre2convert       details of pattern conversion functions
         pcre2demo          a demonstration C program that uses PCRE2
         pcre2grep          description of the pcre2grep command (8-bit only)
         pcre2jit           discussion of just-in-time optimization support
         pcre2limits        details of size and other limits
         pcre2matching      discussion of the two matching algorithms
         pcre2partial       details of the partial matching facility
         pcre2pattern       syntax and semantics of supported regular
                              expression patterns
         pcre2perform       discussion of performance issues
         pcre2posix         the POSIX-compatible C API for the 8-bit library
         pcre2sample        discussion of the pcre2demo program
         pcre2serialize     details of pattern serialization
         pcre2syntax        quick syntax reference
         pcre2test          description of the pcre2test command
         pcre2unicode       discussion of Unicode and UTF support

       In the "man" and HTML formats, there is also a short page  for  each  C
       library function, listing its arguments and results.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.

       Putting  an  actual email address here is a spam magnet. If you want to
       email me, use my two names separated by a dot at gmail.com.


REVISION

       Last updated: 27 August 2021
       Copyright (c) 1997-2021 University of Cambridge.


PCRE2 10.38                     27 August 2021                        PCRE2(3)
------------------------------------------------------------------------------



PCRE2API(3)                Library Functions Manual                PCRE2API(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

       #include <pcre2.h>

       PCRE2  is  a  new API for PCRE, starting at release 10.0. This document
       contains a description of all its native functions. See the pcre2 docu-
       ment for an overview of all the PCRE2 documentation.


PCRE2 NATIVE API BASIC FUNCTIONS

       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
         pcre2_compile_context *ccontext);

       void pcre2_code_free(pcre2_code *code);

       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
         pcre2_general_context *gcontext);

       pcre2_match_data *pcre2_match_data_create_from_pattern(
         const pcre2_code *code, pcre2_general_context *gcontext);

       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext,
         int *workspace, PCRE2_SIZE wscount);

       void pcre2_match_data_free(pcre2_match_data *match_data);


PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS

       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_match_data_size(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_match_data_heapframes_size(
         pcre2_match_data *match_data);

       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);

       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);


PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       void pcre2_general_context_free(pcre2_general_context *gcontext);


PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
         const uint8_t *tables);

       int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
         uint32_t extra_options);

       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
         PCRE2_SIZE value);

       int pcre2_set_max_pattern_compiled_length(
         pcre2_compile_context *ccontext, PCRE2_SIZE value);

       int pcre2_set_max_varlookbehind(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_newline(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
         int (*guard_function)(uint32_t, void *), void *user_data);


PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS

       pcre2_match_context *pcre2_match_context_create(
         pcre2_general_context *gcontext);

       pcre2_match_context *pcre2_match_context_copy(
         pcre2_match_context *mcontext);

       void pcre2_match_context_free(pcre2_match_context *mcontext);

       int pcre2_set_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_callout_block *, void *),
         void *callout_data);

       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_substitute_callout_block *, void *),
         void *callout_data);

       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
         PCRE2_SIZE value);

       int pcre2_set_heap_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_match_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_depth_limit(pcre2_match_context *mcontext,
         uint32_t value);


PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS

       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR *buffer,
         PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       int pcre2_substring_get_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR **bufferptr,
         PCRE2_SIZE *bufflen);

       int pcre2_substring_length_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_SIZE *length);

       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_SIZE *length);

       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

       int pcre2_substring_number_from_name(const pcre2_code *code,
         PCRE2_SPTR name);

       void pcre2_substring_list_free(PCRE2_UCHAR **list);

       int pcre2_substring_list_get(pcre2_match_data *match_data,
         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);


PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION

       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext, PCRE2_SPTR replacementz,
         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
         PCRE2_SIZE *outlengthptr);


PCRE2 NATIVE API JIT FUNCTIONS

       int pcre2_jit_compile(pcre2_code *code, uint32_t options);

       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       pcre2_jit_stack *pcre2_jit_stack_create(size_t startsize,
         size_t maxsize, pcre2_general_context *gcontext);

       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
         pcre2_jit_callback callback_function, void *callback_data);

       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);


PCRE2 NATIVE API SERIALIZATION FUNCTIONS

       int32_t pcre2_serialize_decode(pcre2_code **codes,
         int32_t number_of_codes, const uint8_t *bytes,
         pcre2_general_context *gcontext);

       int32_t pcre2_serialize_encode(const pcre2_code **codes,
         int32_t number_of_codes, uint8_t **serialized_bytes,
         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);

       void pcre2_serialize_free(uint8_t *bytes);

       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);


PCRE2 NATIVE API AUXILIARY FUNCTIONS

       pcre2_code *pcre2_code_copy(const pcre2_code *code);

       pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);

       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
         PCRE2_SIZE bufflen);

       const uint8_t *pcre2_maketables(pcre2_general_context *gcontext);

       void pcre2_maketables_free(pcre2_general_context *gcontext,
         const uint8_t *tables);

       int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
         void *where);

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       int pcre2_config(uint32_t what, void *where);


PCRE2 NATIVE API OBSOLETE FUNCTIONS

       int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_recursion_memory_management(
         pcre2_match_context *mcontext,
         void *(*private_malloc)(size_t, void *),
         void (*private_free)(void *, void *), void *memory_data);

       These functions became obsolete at release 10.30 and are retained  only
       for  backward  compatibility.  They should not be used in new code. The
       first is replaced by pcre2_set_depth_limit(); the second is  no  longer
       needed and has no effect (it always returns zero).


PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS

       pcre2_convert_context *pcre2_convert_context_create(
         pcre2_general_context *gcontext);

       pcre2_convert_context *pcre2_convert_context_copy(
         pcre2_convert_context *cvcontext);

       void pcre2_convert_context_free(pcre2_convert_context *cvcontext);

       int pcre2_set_glob_escape(pcre2_convert_context *cvcontext,
         uint32_t escape_char);

       int pcre2_set_glob_separator(pcre2_convert_context *cvcontext,
         uint32_t separator_char);

       int pcre2_pattern_convert(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, PCRE2_UCHAR **buffer,
         PCRE2_SIZE *blength, pcre2_convert_context *cvcontext);

       void pcre2_converted_pattern_free(PCRE2_UCHAR *converted_pattern);

       These  functions  provide  a  way of converting non-PCRE2 patterns into
       patterns that can be processed by pcre2_compile(). This facility is ex-
       perimental and may be changed in future releases. At  present,  "globs"
       and  POSIX  basic  and  extended patterns can be converted. Details are
       given in the pcre2convert documentation.


PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES

       There are three PCRE2 libraries, supporting 8-bit, 16-bit,  and  32-bit
       code  units,  respectively.  However,  there  is  just one header file,
       pcre2.h.  This contains the function prototypes and  other  definitions
       for all three libraries. One, two, or all three can be installed simul-
       taneously.  On  Unix-like  systems the libraries are called libpcre2-8,
       libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
       inal PCRE libraries.  Every PCRE2 function  comes  in  three  different
       forms, one for each library, for example:

         pcre2_compile_8()
         pcre2_compile_16()
         pcre2_compile_32()

       There are also three different sets of data types:

         PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
         PCRE2_SPTR8,  PCRE2_SPTR16,  PCRE2_SPTR32

       The  UCHAR  types define unsigned code units of the appropriate widths.
       For example, PCRE2_UCHAR16 is usually defined as `uint16_t'.  The  SPTR
       types are pointers to constants of the equivalent UCHAR types, that is,
       they are pointers to vectors of unsigned code units.

       Character  strings  are  passed  to a PCRE2 library as sequences of un-
       signed integers in code units of the appropriate width. The length of a
       string may be given as a number of code units, or  the  string  may  be
       specified as zero-terminated.

       Many  applications use only one code unit width. For their convenience,
       macros are defined whose names are the generic forms such as pcre2_com-
       pile() and  PCRE2_SPTR.  These  macros  use  the  value  of  the  macro
       PCRE2_CODE_UNIT_WIDTH  to generate the appropriate width-specific func-
       tion and macro names.  PCRE2_CODE_UNIT_WIDTH is not defined by default.
       An application must define it to be  8,  16,  or  32  before  including
       pcre2.h in order to make use of the generic names.
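
       As an illustration of how a single width macro can select width-specific
       names, here is a minimal sketch of the token-pasting technique. All
       DEMO_ and demo_ names are hypothetical stand-ins invented for this
       example; they are not part of pcre2.h, and the header's own definitions
       may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for width-specific functions. In the real library
   the corresponding names are pcre2_compile_8(), pcre2_compile_16(), and
   pcre2_compile_32(). */
const char *demo_compile_8(void)  { return "8-bit"; }
const char *demo_compile_16(void) { return "16-bit"; }
const char *demo_compile_32(void) { return "32-bit"; }

/* The application selects the width before the generic name is used. */
#define DEMO_CODE_UNIT_WIDTH 16

/* Two macro levels are needed so that DEMO_CODE_UNIT_WIDTH is replaced by
   its value before the ## operator joins the tokens. */
#define DEMO_JOIN(a, b)   a ## b
#define DEMO_GLUE(a, b)   DEMO_JOIN(a, b)
#define DEMO_SUFFIX(name) DEMO_GLUE(name, DEMO_CODE_UNIT_WIDTH)

/* Generic name; with the width above it expands to demo_compile_16. */
#define demo_compile DEMO_SUFFIX(demo_compile_)

/* Returns the label of whichever width-specific function was selected. */
const char *demo_selected_width(void)
{
    const char *label = demo_compile();
    assert(label != NULL);
    return label;
}
```

       With DEMO_CODE_UNIT_WIDTH set to 16, a call written as demo_compile()
       resolves to demo_compile_16() during preprocessing, so no run-time
       dispatch is involved.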

       Applications  that use more than one code unit width can be linked with
       more than one PCRE2 library, but must define  PCRE2_CODE_UNIT_WIDTH  to
       be  0  before  including pcre2.h, and then use the real function names.
       Any code that is to be included in an environment where  the  value  of
       PCRE2_CODE_UNIT_WIDTH  is  unknown  should  also  use the real function
       names. (Unfortunately, it is not possible in C code to save and restore
       the value of a macro.)

       If PCRE2_CODE_UNIT_WIDTH is not defined  before  including  pcre2.h,  a
       compiler error occurs.

       When  using  multiple  libraries  in an application, you must take care
       when processing any particular pattern to use  only  functions  from  a
       single  library.   For example, if you want to run a match using a pat-
       tern that was compiled with pcre2_compile_16(), you  must  do  so  with
       pcre2_match_16(), not pcre2_match_8() or pcre2_match_32().

       In  the  function summaries above, and in the rest of this document and
       other PCRE2 documents, functions and data  types  are  described  using
       their generic names, without the _8, _16, or _32 suffix.


PCRE2 API OVERVIEW

       PCRE2  has  its  own  native  API, which is described in this document.
       There are also some wrapper functions for the 8-bit library that corre-
       spond to the POSIX regular expression API, but they do not give  access
       to  all  the  functionality of PCRE2 and they are not thread-safe. They
       are described in the pcre2posix documentation. Both these APIs define a
       set of C function calls.

       The native API C data types, function prototypes,  option  values,  and
       error codes are defined in the header file pcre2.h, which also contains
       definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release
       numbers  for the library. Applications can use these to include support
       for different releases of PCRE2.

       In a Windows environment, if you want to statically link an application
       program against a non-dll PCRE2 library, you must  define  PCRE2_STATIC
       before including pcre2.h.

       The  functions pcre2_compile() and pcre2_match() are used for compiling
       and matching regular expressions in a Perl-compatible manner. A  sample
       program that demonstrates the simplest way of using them is provided in
       the file called pcre2demo.c in the PCRE2 source distribution. A listing
       of  this  program  is  given  in  the  pcre2demo documentation, and the
       pcre2sample documentation describes how to compile and run it.

       The compiling and matching functions recognize various options that are
       passed as bits in an options argument. There are also some more compli-
       cated parameters such as custom memory  management  functions  and  re-
       source  limits  that  are  passed  in "contexts" (which are just memory
       blocks, described below). Simple applications do not need to  make  use
       of contexts.

       Just-in-time  (JIT)  compiler  support  is an optional feature of PCRE2
       that can be built in  appropriate  hardware  environments.  It  greatly
       speeds  up  the matching performance of many patterns. Programs can re-
       quest that it be used if available by calling pcre2_jit_compile() after
       a pattern has been successfully compiled by pcre2_compile(). This  does
       nothing if JIT support is not available.

       More  complicated  programs  might  need  to make use of the specialist
       functions   pcre2_jit_stack_create(),    pcre2_jit_stack_free(),    and
       pcre2_jit_stack_assign()  in order to control the JIT code's memory us-
       age.

       JIT matching is automatically used by pcre2_match() if it is available,
       unless the PCRE2_NO_JIT option is set. There is also a direct interface
       for JIT matching, which gives improved performance at  the  expense  of
       less  sanity  checking. The JIT-specific functions are discussed in the
       pcre2jit documentation.

       A second matching function, pcre2_dfa_match(), which is  not  Perl-com-
       patible,  is  also  provided.  This  uses a different algorithm for the
       matching. The alternative algorithm finds all possible  matches  (at  a
       given  point  in  the subject), and scans the subject just once (unless
       there are lookaround assertions). However, this algorithm does not  re-
       turn  captured substrings. A description of the two matching algorithms
       and their advantages and disadvantages is given  in  the  pcre2matching
       documentation. There is no JIT support for pcre2_dfa_match().

       In  addition  to  the  main compiling and matching functions, there are
       convenience functions for extracting captured substrings from a subject
       string that has been matched by pcre2_match(). They are:

         pcre2_substring_copy_byname()
         pcre2_substring_copy_bynumber()
         pcre2_substring_get_byname()
         pcre2_substring_get_bynumber()
         pcre2_substring_list_get()
         pcre2_substring_length_byname()
         pcre2_substring_length_bynumber()
         pcre2_substring_nametable_scan()
         pcre2_substring_number_from_name()

       pcre2_substring_free() and pcre2_substring_list_free()  are  also  pro-
       vided,  to  free  memory used for extracted strings. If either of these
       functions is called with a NULL argument, the function returns  immedi-
       ately without doing anything.

       The  function  pcre2_substitute()  can be called to match a pattern and
       return a copy of the subject string with substitutions for  parts  that
       were matched.

       Functions  whose  names begin with pcre2_serialize_ are used for saving
       compiled patterns on disc or elsewhere, and reloading them later.

       Finally, there are functions for finding out information about  a  com-
       piled  pattern  (pcre2_pattern_info()) and about the configuration with
       which PCRE2 was built (pcre2_config()).

       Functions with names ending with _free() are used  for  freeing  memory
       blocks  of  various  sorts.  In all cases, if one of these functions is
       called with a NULL argument, it does nothing.


STRING LENGTHS AND OFFSETS

       The PCRE2 API uses string lengths and  offsets  into  strings  of  code
       units  in  several  places. These values are always of type PCRE2_SIZE,
       which is an unsigned integer type, currently always defined as  size_t.
       The  largest  value  that  can  be  stored  in  such  a  type  (that is
       ~(PCRE2_SIZE)0) is reserved as a special indicator for  zero-terminated
       strings  and  unset offsets.  Therefore, the longest string that can be
       handled is one less than this maximum. Note that string lengths are al-
       ways given in code units. Only in the 8-bit library is  such  a  length
       the same as the number of bytes in the string.
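
       The sentinel convention and the distinction between code units and
       bytes can be sketched without the library itself. In the example below,
       DEMO_ZERO_TERMINATED, demo_length16(), and demo_bytes() are hypothetical
       names, and size_t stands in for PCRE2_SIZE as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the reserved "zero-terminated" indicator: the largest value
   that the length type can hold. */
#define DEMO_ZERO_TERMINATED (~(size_t)0)

/* Resolves a length argument for a string of 16-bit code units. An explicit
   length is returned unchanged; the sentinel causes the code units before
   the terminating zero to be counted. */
size_t demo_length16(const uint16_t *s, size_t length)
{
    size_t n = 0;

    assert(s != NULL);
    if (length != DEMO_ZERO_TERMINATED) return length;
    while (s[n] != 0) n++;
    return n;
}

/* Converts a count of code units into bytes for a width of 8, 16, or 32
   bits. Only for width 8 are the two numbers equal. */
size_t demo_bytes(size_t code_units, unsigned width)
{
    return code_units * (width / 8);
}
```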


NEWLINES

       PCRE2 supports five different conventions for indicating line breaks in
       strings:  a  single  CR (carriage return) character, a single LF (line-
       feed) character, the two-character sequence CRLF, any of the three pre-
       ceding, or any Unicode newline sequence. The Unicode newline  sequences
       are  the  three just mentioned, plus the single characters VT (vertical
       tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
       separator, U+2028), and PS (paragraph separator, U+2029).

       Each of the first three conventions is used by at least  one  operating
       system as its standard newline sequence. When PCRE2 is built, a default
       can be specified.  If it is not, the default is set to LF, which is the
       Unix standard. However, the newline convention can be changed by an ap-
       plication  when calling pcre2_compile(), or it can be specified by spe-
       cial text at the start of the pattern itself; this overrides any  other
       settings.  See the pcre2pattern page for details of the special charac-
       ter sequences.

       In the PCRE2 documentation the word "newline"  is  used  to  mean  "the
       character or pair of characters that indicates a line break". The choice
       of  newline convention affects the handling of the dot, circumflex, and
       dollar metacharacters, the handling of #-comments in /x mode, and, when
       CRLF is a recognized line ending sequence, the match position  advance-
       ment for a non-anchored pattern. There is more detail about this in the
       section on pcre2_match() options below.

       The  choice of newline convention does not affect the interpretation of
       the \n or \r escape sequences, nor does it affect what \R matches; this
       has its own separate convention.


MULTITHREADING

       In a multithreaded application it is important to keep  thread-specific
       data  separate  from data that can be shared between threads. The PCRE2
       library code itself is thread-safe: it contains  no  static  or  global
       variables. The API is designed to be fairly simple for non-threaded ap-
       plications  while at the same time ensuring that multithreaded applica-
       tions can use it.

       There are several different blocks of data that are used to pass infor-
       mation between the application and the PCRE2 libraries.

   The compiled pattern

       A pointer to the compiled form of a pattern is  returned  to  the  user
       when pcre2_compile() is successful. The data in the compiled pattern is
       fixed,  and  does not change when the pattern is matched. Therefore, it
       is thread-safe, that is, the same compiled pattern can be used by  more
       than one thread simultaneously. For example, an application can compile
       all its patterns at the start, before forking off multiple threads that
       use  them.  However,  if the just-in-time (JIT) optimization feature is
       being used, it needs separate memory stack areas for each  thread.  See
       the pcre2jit documentation for more details.

       In  a more complicated situation, where patterns are compiled only when
       they are first needed, but are still shared between  threads,  pointers
       to  compiled  patterns  must  be protected from simultaneous writing by
       multiple threads. This is somewhat tricky to do correctly. If you  know
       that  writing  to  a pointer is atomic in your environment, you can use
       logic like this:

         Get a read-only (shared) lock (mutex) for pointer
         if (pointer == NULL)
           {
           Get a write (unique) lock for pointer
           if (pointer == NULL) pointer = pcre2_compile(...
           }
         Release the lock
         Use pointer in pcre2_match()

       Of course, testing for compilation errors should also  be  included  in
       the code.

       The  reason  for checking the pointer a second time is as follows: Sev-
       eral threads may have acquired the shared lock and tested  the  pointer
       for being NULL, but only one of them will be given the write lock, with
       the  rest kept waiting. The winning thread will compile the pattern and
       store the result.  After this thread releases the write  lock,  another
       thread  will  get it, and if it does not retest pointer for being NULL,
       will recompile the pattern and overwrite the pointer, creating a memory
       leak and possibly causing other issues.

       In an environment where writing to a pointer may  not  be  atomic,  the
       above  logic  is not sufficient. The thread that is doing the compiling
       may be descheduled after writing only part of the pointer, which  could
       cause  other  threads  to use an invalid value. Instead of checking the
       pointer itself, a separate "pointer is valid" flag (that can be updated
       atomically) must be used:

         Get a read-only (shared) lock (mutex) for pointer
         if (!pointer_is_valid)
           {
           Get a write (unique) lock for pointer
           if (!pointer_is_valid)
             {
             pointer = pcre2_compile(...
             pointer_is_valid = TRUE
             }
           }
         Release the lock
         Use pointer in pcre2_match()

       If JIT is being used, but the JIT compilation is not being done immedi-
       ately (perhaps waiting to see if the pattern  is  used  often  enough),
       similar  logic  is required. JIT compilation updates a value within the
       compiled code block, so a thread must gain unique write access  to  the
       pointer     before    calling    pcre2_jit_compile().    Alternatively,
       pcre2_code_copy() or pcre2_code_copy_with_tables() can be used  to  ob-
       tain  a  private  copy of the compiled code before calling the JIT com-
       piler.

   Context blocks

       The next main section below introduces the idea of "contexts" in  which
       PCRE2 functions are called. A context is nothing more than a collection
       of parameters that control the way PCRE2 operates. Grouping a number of
       parameters together in a context is a convenient way of passing them to
       a  PCRE2  function without using lots of arguments. The parameters that
       are stored in contexts are in some sense  "advanced  features"  of  the
       API. Many straightforward applications will not need to use contexts.

       In a multithreaded application, if the parameters in a context are val-
       ues  that  are  never  changed, the same context can be used by all the
       threads. However, if any thread needs to change any value in a context,
       it must make its own thread-specific copy.

   Match blocks

       The matching functions need a block of memory for storing  the  results
       of a match. This includes details of what was matched, as well as addi-
       tional  information  such as the name of a (*MARK) setting. Each thread
       must provide its own copy of this memory.


PCRE2 CONTEXTS

       Some PCRE2 functions have a lot of parameters, many of which  are  used
       only  by  specialist  applications,  for example, those that use custom
       memory management or non-standard character tables.  To  keep  function
       argument  lists  at a reasonable size, and at the same time to keep the
       API extensible, "uncommon" parameters are passed to  certain  functions
       in  a  context instead of directly. A context is just a block of memory
       that holds the parameter values.  Applications that do not need to  ad-
       just any of the context parameters can pass NULL when a context pointer
       is required.

       There  are  three different types of context: a general context that is
       relevant for several PCRE2 operations, a compile-time  context,  and  a
       match-time context.

   The general context

       At  present,  this context just contains pointers to (and data for) ex-
       ternal memory management functions that are called from several  places
       in  the  PCRE2  library.  The  context  is  named `general' rather than
       specifically `memory' because in future other fields may be  added.  If
       you  do not want to supply your own custom memory management functions,
       you do not need to bother with a general context. A general context  is
       created by:

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       The  two  function pointers specify custom memory management functions,
       whose prototypes are:

         void *private_malloc(PCRE2_SIZE, void *);
         void  private_free(void *, void *);

       Whenever code in PCRE2 calls these functions, the final argument is the
       value of memory_data. Either of the first two arguments of the creation
       function may be NULL, in which case the system memory management  func-
       tions  malloc()  and free() are used. (This is not currently useful, as
       there are no other fields in a general context,  but  in  future  there
       might  be.)  The private_malloc() function is used (if supplied) to ob-
       tain memory for storing the context, and all three values are saved  as
       part of the context.

       Whenever  PCRE2  creates a data block of any kind, the block contains a
       pointer to the free() function that matches the malloc() function  that
       was  used.  When  the  time  comes  to free the block, this function is
       called.
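
       A pair of callbacks matching the two prototypes above might be written
       as in the following sketch, which counts allocations through the
       memory_data argument. The names demo_mem_stats, demo_malloc(), and
       demo_free() are hypothetical, and size_t is used on the assumption that
       PCRE2_SIZE is defined as size_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping block, passed to the library as memory_data. */
typedef struct demo_mem_stats {
    size_t allocations;
    size_t frees;
} demo_mem_stats;

/* Has the shape of private_malloc: the final argument is memory_data. */
void *demo_malloc(size_t size, void *memory_data)
{
    demo_mem_stats *stats = memory_data;
    void *block = malloc(size);

    if (block != NULL && stats != NULL) stats->allocations++;
    return block;
}

/* Has the shape of private_free: the final argument is memory_data. */
void demo_free(void *block, void *memory_data)
{
    demo_mem_stats *stats = memory_data;

    if (block == NULL) return;
    /* A block cannot be freed more often than blocks were allocated. */
    assert(stats == NULL || stats->frees < stats->allocations);
    if (stats != NULL) stats->frees++;
    free(block);
}
```

       With pcre2.h included, such a pair would be installed by a call of the
       form pcre2_general_context_create(demo_malloc, demo_free, &stats), where
       stats is a demo_mem_stats object that outlives the context.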

       A general context can be copied by calling:

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       The memory used for a general context should be freed by calling:

       void pcre2_general_context_free(pcre2_general_context *gcontext);

       If this function is passed a  NULL  argument,  it  returns  immediately
       without doing anything.

   The compile context

       A  compile context is required if you want to provide an external func-
       tion for stack checking during compilation or  to  change  the  default
       values of any of the following compile-time parameters:

         What \R matches (Unicode newlines or CR, LF, CRLF only)
         PCRE2's character tables
         The newline character sequence
         The compile time nested parentheses limit
         The maximum length of the pattern string
         The extra options bits (none set by default)

       A  compile context is also required if you are using custom memory man-
       agement.  If none of these apply, just pass NULL as the  context  argu-
       ment of pcre2_compile().

       A  compile context is created, copied, and freed by the following func-
       tions:

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       A compile context is created with default values  for  its  parameters.
       These can be changed by calling the following functions, which return 0
       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       The  value  must  be PCRE2_BSR_ANYCRLF, to specify that \R matches only
       CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R  matches  any
       Unicode line ending sequence. The value is used by the JIT compiler and
       by   the   two   interpreted   matching  functions,  pcre2_match()  and
       pcre2_dfa_match().
889*22dc650dSSadaf Ebrahimi
890*22dc650dSSadaf Ebrahimi       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
891*22dc650dSSadaf Ebrahimi         const uint8_t *tables);
892*22dc650dSSadaf Ebrahimi
893*22dc650dSSadaf Ebrahimi       The value must be the result of a  call  to  pcre2_maketables(),  whose
894*22dc650dSSadaf Ebrahimi       only argument is a general context. This function builds a set of char-
895*22dc650dSSadaf Ebrahimi       acter tables in the current locale.
896*22dc650dSSadaf Ebrahimi
897*22dc650dSSadaf Ebrahimi       int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
898*22dc650dSSadaf Ebrahimi         uint32_t extra_options);
899*22dc650dSSadaf Ebrahimi
900*22dc650dSSadaf Ebrahimi       As  PCRE2  has developed, almost all the 32 option bits that are avail-
901*22dc650dSSadaf Ebrahimi       able in the options argument of pcre2_compile() have been used  up.  To
902*22dc650dSSadaf Ebrahimi       avoid  running  out, the compile context contains a set of extra option
       bits which are used for some newer, assumed rarer, options. This  func-
       tion  sets  those bits. It always sets all the bits (either on or off):
       the supplied value replaces any existing setting rather than being com-
       bined with it. The available options are defined in the section  enti-
       tled "Extra compile options" below.
907*22dc650dSSadaf Ebrahimi
908*22dc650dSSadaf Ebrahimi       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
909*22dc650dSSadaf Ebrahimi         PCRE2_SIZE value);
910*22dc650dSSadaf Ebrahimi
911*22dc650dSSadaf Ebrahimi       This  sets a maximum length, in code units, for any pattern string that
912*22dc650dSSadaf Ebrahimi       is compiled with this context. If the pattern is longer,  an  error  is
913*22dc650dSSadaf Ebrahimi       generated.   This facility is provided so that applications that accept
914*22dc650dSSadaf Ebrahimi       patterns from external sources can limit their size. The default is the
915*22dc650dSSadaf Ebrahimi       largest number that a PCRE2_SIZE variable can  hold,  which  is  effec-
916*22dc650dSSadaf Ebrahimi       tively unlimited.
917*22dc650dSSadaf Ebrahimi
918*22dc650dSSadaf Ebrahimi       int pcre2_set_max_pattern_compiled_length(
919*22dc650dSSadaf Ebrahimi         pcre2_compile_context *ccontext, PCRE2_SIZE value);
920*22dc650dSSadaf Ebrahimi
921*22dc650dSSadaf Ebrahimi       This  sets  a maximum size, in bytes, for the memory needed to hold the
922*22dc650dSSadaf Ebrahimi       compiled version of a pattern that is compiled with  this  context.  If
923*22dc650dSSadaf Ebrahimi       the  pattern needs more memory, an error is generated. This facility is
924*22dc650dSSadaf Ebrahimi       provided so  that  applications  that  accept  patterns  from  external
925*22dc650dSSadaf Ebrahimi       sources  can  limit  the  amount of memory they use. The default is the
926*22dc650dSSadaf Ebrahimi       largest number that a PCRE2_SIZE variable can  hold,  which  is  effec-
927*22dc650dSSadaf Ebrahimi       tively unlimited.
928*22dc650dSSadaf Ebrahimi
       int pcre2_set_max_varlookbehind(pcre2_compile_context *ccontext,
930*22dc650dSSadaf Ebrahimi         uint32_t value);
931*22dc650dSSadaf Ebrahimi
       This sets the maximum number of characters that can  be  matched  by  a
       variable-length lookbehind assertion. The default is set when PCRE2  is
       built,  with  the ultimate default being 255, the same as Perl. Lookbe-
       hind assertions without a bounding length are not supported.
936*22dc650dSSadaf Ebrahimi
937*22dc650dSSadaf Ebrahimi       int pcre2_set_newline(pcre2_compile_context *ccontext,
938*22dc650dSSadaf Ebrahimi         uint32_t value);
939*22dc650dSSadaf Ebrahimi
940*22dc650dSSadaf Ebrahimi       This specifies which characters or character sequences are to be recog-
941*22dc650dSSadaf Ebrahimi       nized as newlines. The value must be one of PCRE2_NEWLINE_CR  (carriage
942*22dc650dSSadaf Ebrahimi       return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the
943*22dc650dSSadaf Ebrahimi       two-character  sequence  CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
944*22dc650dSSadaf Ebrahimi       of the above), PCRE2_NEWLINE_ANY (any  Unicode  newline  sequence),  or
945*22dc650dSSadaf Ebrahimi       PCRE2_NEWLINE_NUL (the NUL character, that is a binary zero).
946*22dc650dSSadaf Ebrahimi
947*22dc650dSSadaf Ebrahimi       A pattern can override the value set in the compile context by starting
948*22dc650dSSadaf Ebrahimi       with a sequence such as (*CRLF). See the pcre2pattern page for details.
949*22dc650dSSadaf Ebrahimi
950*22dc650dSSadaf Ebrahimi       When  a  pattern  is  compiled  with  the  PCRE2_EXTENDED  or PCRE2_EX-
951*22dc650dSSadaf Ebrahimi       TENDED_MORE option, the newline convention affects the  recognition  of
952*22dc650dSSadaf Ebrahimi       the  end  of internal comments starting with #. The value is saved with
953*22dc650dSSadaf Ebrahimi       the compiled pattern for subsequent use by the JIT compiler and by  the
954*22dc650dSSadaf Ebrahimi       two     interpreted     matching     functions,    pcre2_match()    and
955*22dc650dSSadaf Ebrahimi       pcre2_dfa_match().
956*22dc650dSSadaf Ebrahimi
957*22dc650dSSadaf Ebrahimi       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
958*22dc650dSSadaf Ebrahimi         uint32_t value);
959*22dc650dSSadaf Ebrahimi
960*22dc650dSSadaf Ebrahimi       This parameter adjusts the limit, set  when  PCRE2  is  built  (default
961*22dc650dSSadaf Ebrahimi       250),  on  the  depth  of  parenthesis nesting in a pattern. This limit
962*22dc650dSSadaf Ebrahimi       stops rogue patterns using up too much system  stack  when  being  com-
963*22dc650dSSadaf Ebrahimi       piled.  The limit applies to parentheses of all kinds, not just captur-
964*22dc650dSSadaf Ebrahimi       ing parentheses.
965*22dc650dSSadaf Ebrahimi
966*22dc650dSSadaf Ebrahimi       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
967*22dc650dSSadaf Ebrahimi         int (*guard_function)(uint32_t, void *), void *user_data);
968*22dc650dSSadaf Ebrahimi
969*22dc650dSSadaf Ebrahimi       There is at least one application that runs PCRE2 in threads with  very
970*22dc650dSSadaf Ebrahimi       limited  system  stack,  where running out of stack is to be avoided at
971*22dc650dSSadaf Ebrahimi       all costs. The parenthesis limit above cannot take account of how  much
972*22dc650dSSadaf Ebrahimi       stack  is  actually  available during compilation. For a finer control,
973*22dc650dSSadaf Ebrahimi       you can supply a  function  that  is  called  whenever  pcre2_compile()
974*22dc650dSSadaf Ebrahimi       starts  to compile a parenthesized part of a pattern. This function can
975*22dc650dSSadaf Ebrahimi       check the actual stack size (or anything else  that  it  wants  to,  of
976*22dc650dSSadaf Ebrahimi       course).
977*22dc650dSSadaf Ebrahimi
       The first argument to the guard function gives  the  current  depth  of
       nesting, and the second is user data that is set up by the  last  argu-
       ment of pcre2_set_compile_recursion_guard(). The guard function  should
       return zero if all is well, or non-zero to force an error.
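
       A function with the required signature might look like  the  following
       sketch. It is a simplification: instead of measuring the  real  stack,
       it compares the nesting depth with a budget supplied as user data. The
       names guard_budget and depth_guard are assumptions made for  illustra-
       tion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user data: the deepest nesting the caller will tolerate. */
typedef struct {
  uint32_t max_depth;
} guard_budget;

/* Has the guard_function signature: returns zero to let compilation continue,
   or non-zero to make pcre2_compile() fail. A NULL budget means "no limit". */
int depth_guard(uint32_t depth, void *user_data)
{
  const guard_budget *budget = (const guard_budget *)user_data;
  return (budget != NULL && depth > budget->max_depth) ? 1 : 0;
}

/* Registration, given an existing compile context (shown as a comment so
   that this sketch has no library dependency):

     guard_budget budget = { 20 };
     pcre2_set_compile_recursion_guard(ccontext, depth_guard, &budget);

   The budget object must remain valid for as long as the context is used. */
```

       A guard that inspects the actual stack would replace the depth compar-
       ison with a platform-specific check.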
982*22dc650dSSadaf Ebrahimi
983*22dc650dSSadaf Ebrahimi   The match context
984*22dc650dSSadaf Ebrahimi
985*22dc650dSSadaf Ebrahimi       A match context is required if you want to:
986*22dc650dSSadaf Ebrahimi
987*22dc650dSSadaf Ebrahimi         Set up a callout function
988*22dc650dSSadaf Ebrahimi         Set an offset limit for matching an unanchored pattern
989*22dc650dSSadaf Ebrahimi         Change the limit on the amount of heap used when matching
990*22dc650dSSadaf Ebrahimi         Change the backtracking match limit
991*22dc650dSSadaf Ebrahimi         Change the backtracking depth limit
992*22dc650dSSadaf Ebrahimi         Set custom memory management specifically for the match
993*22dc650dSSadaf Ebrahimi
994*22dc650dSSadaf Ebrahimi       If none of these apply, just pass  NULL  as  the  context  argument  of
995*22dc650dSSadaf Ebrahimi       pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().
996*22dc650dSSadaf Ebrahimi
997*22dc650dSSadaf Ebrahimi       A  match  context  is created, copied, and freed by the following func-
998*22dc650dSSadaf Ebrahimi       tions:
999*22dc650dSSadaf Ebrahimi
1000*22dc650dSSadaf Ebrahimi       pcre2_match_context *pcre2_match_context_create(
1001*22dc650dSSadaf Ebrahimi         pcre2_general_context *gcontext);
1002*22dc650dSSadaf Ebrahimi
1003*22dc650dSSadaf Ebrahimi       pcre2_match_context *pcre2_match_context_copy(
1004*22dc650dSSadaf Ebrahimi         pcre2_match_context *mcontext);
1005*22dc650dSSadaf Ebrahimi
1006*22dc650dSSadaf Ebrahimi       void pcre2_match_context_free(pcre2_match_context *mcontext);
1007*22dc650dSSadaf Ebrahimi
1008*22dc650dSSadaf Ebrahimi       A match context is created with  default  values  for  its  parameters.
1009*22dc650dSSadaf Ebrahimi       These can be changed by calling the following functions, which return 0
1010*22dc650dSSadaf Ebrahimi       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
1011*22dc650dSSadaf Ebrahimi
1012*22dc650dSSadaf Ebrahimi       int pcre2_set_callout(pcre2_match_context *mcontext,
1013*22dc650dSSadaf Ebrahimi         int (*callout_function)(pcre2_callout_block *, void *),
1014*22dc650dSSadaf Ebrahimi         void *callout_data);
1015*22dc650dSSadaf Ebrahimi
1016*22dc650dSSadaf Ebrahimi       This  sets  up a callout function for PCRE2 to call at specified points
1017*22dc650dSSadaf Ebrahimi       during a matching operation. Details are given in the pcre2callout doc-
1018*22dc650dSSadaf Ebrahimi       umentation.
1019*22dc650dSSadaf Ebrahimi
1020*22dc650dSSadaf Ebrahimi       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
1021*22dc650dSSadaf Ebrahimi         int (*callout_function)(pcre2_substitute_callout_block *, void *),
1022*22dc650dSSadaf Ebrahimi         void *callout_data);
1023*22dc650dSSadaf Ebrahimi
1024*22dc650dSSadaf Ebrahimi       This sets up a callout function for PCRE2 to call after each  substitu-
1025*22dc650dSSadaf Ebrahimi       tion made by pcre2_substitute(). Details are given in the section enti-
1026*22dc650dSSadaf Ebrahimi       tled "Creating a new string with substitutions" below.
1027*22dc650dSSadaf Ebrahimi
1028*22dc650dSSadaf Ebrahimi       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
1029*22dc650dSSadaf Ebrahimi         PCRE2_SIZE value);
1030*22dc650dSSadaf Ebrahimi
1031*22dc650dSSadaf Ebrahimi       The  offset_limit parameter limits how far an unanchored search can ad-
1032*22dc650dSSadaf Ebrahimi       vance in the subject string. The  default  value  is  PCRE2_UNSET.  The
1033*22dc650dSSadaf Ebrahimi       pcre2_match()  and  pcre2_dfa_match()  functions return PCRE2_ERROR_NO-
1034*22dc650dSSadaf Ebrahimi       MATCH if a match with a starting point before or at the given offset is
1035*22dc650dSSadaf Ebrahimi       not found. The pcre2_substitute() function makes no more substitutions.
1036*22dc650dSSadaf Ebrahimi
1037*22dc650dSSadaf Ebrahimi       For example, if the pattern /abc/ is matched against "123abc"  with  an
1038*22dc650dSSadaf Ebrahimi       offset  limit  less  than 3, the result is PCRE2_ERROR_NOMATCH. A match
1039*22dc650dSSadaf Ebrahimi       can never be  found  if  the  startoffset  argument  of  pcre2_match(),
1040*22dc650dSSadaf Ebrahimi       pcre2_dfa_match(),  or  pcre2_substitute()  is  greater than the offset
1041*22dc650dSSadaf Ebrahimi       limit set in the match context.
1042*22dc650dSSadaf Ebrahimi
       When using this facility, you must set the  PCRE2_USE_OFFSET_LIMIT  op-
       tion when calling pcre2_compile() so that when JIT is in use, different
       code  can  be compiled. If a match is started with a non-default offset
       limit when PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
1047*22dc650dSSadaf Ebrahimi
1048*22dc650dSSadaf Ebrahimi       The offset limit facility can be used to track progress when  searching
1049*22dc650dSSadaf Ebrahimi       large  subject  strings or to limit the extent of global substitutions.
1050*22dc650dSSadaf Ebrahimi       See also the PCRE2_FIRSTLINE option, which requires a  match  to  start
1051*22dc650dSSadaf Ebrahimi       before  or  at  the first newline that follows the start of matching in
1052*22dc650dSSadaf Ebrahimi       the subject. If this is set with an offset limit, a match must occur in
1053*22dc650dSSadaf Ebrahimi       the first line and also  within  the  offset  limit.  In  other  words,
1054*22dc650dSSadaf Ebrahimi       whichever limit comes first is used.
1055*22dc650dSSadaf Ebrahimi
1056*22dc650dSSadaf Ebrahimi       int pcre2_set_heap_limit(pcre2_match_context *mcontext,
1057*22dc650dSSadaf Ebrahimi         uint32_t value);
1058*22dc650dSSadaf Ebrahimi
1059*22dc650dSSadaf Ebrahimi       The heap_limit parameter specifies, in units of kibibytes (1024 bytes),
1060*22dc650dSSadaf Ebrahimi       the  maximum  amount  of heap memory that pcre2_match() may use to hold
1061*22dc650dSSadaf Ebrahimi       backtracking information when running an interpretive match. This limit
1062*22dc650dSSadaf Ebrahimi       also applies to pcre2_dfa_match(), which may use the heap when process-
1063*22dc650dSSadaf Ebrahimi       ing patterns with a lot of nested pattern recursion or  lookarounds  or
1064*22dc650dSSadaf Ebrahimi       atomic groups. This limit does not apply to matching with the JIT opti-
1065*22dc650dSSadaf Ebrahimi       mization,  which  has  its  own  memory  control  arrangements (see the
1066*22dc650dSSadaf Ebrahimi       pcre2jit documentation for more details). If the limit is reached,  the
1067*22dc650dSSadaf Ebrahimi       negative  error  code  PCRE2_ERROR_HEAPLIMIT  is  returned. The default
1068*22dc650dSSadaf Ebrahimi       limit can be set when PCRE2 is built; if it is not, the default is  set
1069*22dc650dSSadaf Ebrahimi       very large and is essentially unlimited.
1070*22dc650dSSadaf Ebrahimi
1071*22dc650dSSadaf Ebrahimi       A value for the heap limit may also be supplied by an item at the start
1072*22dc650dSSadaf Ebrahimi       of a pattern of the form
1073*22dc650dSSadaf Ebrahimi
1074*22dc650dSSadaf Ebrahimi         (*LIMIT_HEAP=ddd)
1075*22dc650dSSadaf Ebrahimi
1076*22dc650dSSadaf Ebrahimi       where  ddd  is a decimal number. However, such a setting is ignored un-
1077*22dc650dSSadaf Ebrahimi       less ddd is less than the limit set by the caller of pcre2_match()  or,
1078*22dc650dSSadaf Ebrahimi       if no such limit is set, less than the default.
1079*22dc650dSSadaf Ebrahimi
1080*22dc650dSSadaf Ebrahimi       The  pcre2_match() function always needs some heap memory, so setting a
1081*22dc650dSSadaf Ebrahimi       value of zero guarantees a "heap limit exceeded" error. Details of  how
1082*22dc650dSSadaf Ebrahimi       pcre2_match()  uses  the  heap are given in the pcre2perform documenta-
1083*22dc650dSSadaf Ebrahimi       tion.
1084*22dc650dSSadaf Ebrahimi
1085*22dc650dSSadaf Ebrahimi       For pcre2_dfa_match(), a vector on the system stack is used  when  pro-
1086*22dc650dSSadaf Ebrahimi       cessing  pattern recursions, lookarounds, or atomic groups, and only if
1087*22dc650dSSadaf Ebrahimi       this is not big enough is heap memory used. In  this  case,  setting  a
1088*22dc650dSSadaf Ebrahimi       value of zero disables the use of the heap.
1089*22dc650dSSadaf Ebrahimi
1090*22dc650dSSadaf Ebrahimi       int pcre2_set_match_limit(pcre2_match_context *mcontext,
1091*22dc650dSSadaf Ebrahimi         uint32_t value);
1092*22dc650dSSadaf Ebrahimi
1093*22dc650dSSadaf Ebrahimi       The match_limit parameter provides a means of preventing PCRE2 from us-
1094*22dc650dSSadaf Ebrahimi       ing  up  too many computing resources when processing patterns that are
1095*22dc650dSSadaf Ebrahimi       not going to match, but which have a very large number of possibilities
1096*22dc650dSSadaf Ebrahimi       in their search trees. The classic  example  is  a  pattern  that  uses
1097*22dc650dSSadaf Ebrahimi       nested unlimited repeats.
1098*22dc650dSSadaf Ebrahimi
1099*22dc650dSSadaf Ebrahimi       There  is an internal counter in pcre2_match() that is incremented each
1100*22dc650dSSadaf Ebrahimi       time round its main matching loop. If  this  value  reaches  the  match
1101*22dc650dSSadaf Ebrahimi       limit, pcre2_match() returns the negative value PCRE2_ERROR_MATCHLIMIT.
1102*22dc650dSSadaf Ebrahimi       This  has  the  effect  of limiting the amount of backtracking that can
1103*22dc650dSSadaf Ebrahimi       take place. For patterns that are not anchored, the count restarts from
1104*22dc650dSSadaf Ebrahimi       zero for each position in the subject string. This limit  also  applies
1105*22dc650dSSadaf Ebrahimi       to pcre2_dfa_match(), though the counting is done in a different way.
1106*22dc650dSSadaf Ebrahimi
1107*22dc650dSSadaf Ebrahimi       When  pcre2_match()  is  called  with  a  pattern that was successfully
1108*22dc650dSSadaf Ebrahimi       processed by pcre2_jit_compile(), the way in which matching is executed
1109*22dc650dSSadaf Ebrahimi       is entirely different. However, there is still the possibility of  run-
1110*22dc650dSSadaf Ebrahimi       away matching that goes on for a very long time, and so the match_limit
1111*22dc650dSSadaf Ebrahimi       value  is  also used in this case (but in a different way) to limit how
1112*22dc650dSSadaf Ebrahimi       long the matching can continue.
1113*22dc650dSSadaf Ebrahimi
1114*22dc650dSSadaf Ebrahimi       The default value for the limit can be set when PCRE2 is built; the de-
1115*22dc650dSSadaf Ebrahimi       fault is 10 million, which handles all but the most  extreme  cases.  A
1116*22dc650dSSadaf Ebrahimi       value  for the match limit may also be supplied by an item at the start
1117*22dc650dSSadaf Ebrahimi       of a pattern of the form
1118*22dc650dSSadaf Ebrahimi
1119*22dc650dSSadaf Ebrahimi         (*LIMIT_MATCH=ddd)
1120*22dc650dSSadaf Ebrahimi
1121*22dc650dSSadaf Ebrahimi       where ddd is a decimal number. However, such a setting is  ignored  un-
1122*22dc650dSSadaf Ebrahimi       less  ddd  is less than the limit set by the caller of pcre2_match() or
1123*22dc650dSSadaf Ebrahimi       pcre2_dfa_match() or, if no such limit is set, less than the default.
1124*22dc650dSSadaf Ebrahimi
1125*22dc650dSSadaf Ebrahimi       int pcre2_set_depth_limit(pcre2_match_context *mcontext,
1126*22dc650dSSadaf Ebrahimi         uint32_t value);
1127*22dc650dSSadaf Ebrahimi
1128*22dc650dSSadaf Ebrahimi       This  parameter  limits   the   depth   of   nested   backtracking   in
1129*22dc650dSSadaf Ebrahimi       pcre2_match().   Each time a nested backtracking point is passed, a new
1130*22dc650dSSadaf Ebrahimi       memory frame is used to remember the state of matching at  that  point.
1131*22dc650dSSadaf Ebrahimi       Thus,  this  parameter  indirectly  limits the amount of memory that is
1132*22dc650dSSadaf Ebrahimi       used in a match. However, because the size of each memory frame depends
1133*22dc650dSSadaf Ebrahimi       on the number of capturing parentheses, the actual memory limit  varies
1134*22dc650dSSadaf Ebrahimi       from  pattern to pattern. This limit was more useful in versions before
1135*22dc650dSSadaf Ebrahimi       10.30, where function recursion was used for backtracking.
1136*22dc650dSSadaf Ebrahimi
1137*22dc650dSSadaf Ebrahimi       The depth limit is not relevant, and is ignored, when matching is  done
1138*22dc650dSSadaf Ebrahimi       using JIT compiled code. However, it is supported by pcre2_dfa_match(),
1139*22dc650dSSadaf Ebrahimi       which  uses it to limit the depth of nested internal recursive function
1140*22dc650dSSadaf Ebrahimi       calls that implement atomic groups, lookaround assertions, and  pattern
1141*22dc650dSSadaf Ebrahimi       recursions. This limits, indirectly, the amount of system stack that is
1142*22dc650dSSadaf Ebrahimi       used.  It  was  more useful in versions before 10.32, when stack memory
1143*22dc650dSSadaf Ebrahimi       was used for local workspace vectors for recursive function calls. From
1144*22dc650dSSadaf Ebrahimi       version 10.32, only local variables are allocated on the stack  and  as
1145*22dc650dSSadaf Ebrahimi       each call uses only a few hundred bytes, even a small stack can support
1146*22dc650dSSadaf Ebrahimi       quite a lot of recursion.
1147*22dc650dSSadaf Ebrahimi
1148*22dc650dSSadaf Ebrahimi       If  the depth of internal recursive function calls is great enough, lo-
1149*22dc650dSSadaf Ebrahimi       cal workspace vectors are allocated on the heap from version 10.32  on-
1150*22dc650dSSadaf Ebrahimi       wards,  so  the  depth  limit also indirectly limits the amount of heap
1151*22dc650dSSadaf Ebrahimi       memory that is used. A recursive pattern such as /(.(?2))((?1)|)/, when
1152*22dc650dSSadaf Ebrahimi       matched to a very long string using pcre2_dfa_match(), can use a  great
1153*22dc650dSSadaf Ebrahimi       deal  of memory. However, it is probably better to limit heap usage di-
1154*22dc650dSSadaf Ebrahimi       rectly by calling pcre2_set_heap_limit().
1155*22dc650dSSadaf Ebrahimi
1156*22dc650dSSadaf Ebrahimi       The default value for the depth limit can be set when PCRE2  is  built;
1157*22dc650dSSadaf Ebrahimi       if  it  is not, the default is set to the same value as the default for
1158*22dc650dSSadaf Ebrahimi       the  match  limit.   If  the  limit  is  exceeded,   pcre2_match()   or
1159*22dc650dSSadaf Ebrahimi       pcre2_dfa_match() returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth
1160*22dc650dSSadaf Ebrahimi       limit  may also be supplied by an item at the start of a pattern of the
1161*22dc650dSSadaf Ebrahimi       form
1162*22dc650dSSadaf Ebrahimi
1163*22dc650dSSadaf Ebrahimi         (*LIMIT_DEPTH=ddd)
1164*22dc650dSSadaf Ebrahimi
1165*22dc650dSSadaf Ebrahimi       where ddd is a decimal number. However, such a setting is  ignored  un-
1166*22dc650dSSadaf Ebrahimi       less  ddd  is less than the limit set by the caller of pcre2_match() or
1167*22dc650dSSadaf Ebrahimi       pcre2_dfa_match() or, if no such limit is set, less than the default.
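
       The rule stated above  for  (*LIMIT_HEAP=ddd),  (*LIMIT_MATCH=ddd),  and
       (*LIMIT_DEPTH=ddd) can be modelled by a small function. This is a  model
       of the documented behaviour, not code taken from PCRE2, and  the  name
       effective_limit() is an assumption made for illustration.

```c
#include <stdint.h>

/* Models the documented rule: a value given at the start of a pattern can
   lower the limit in force but can never raise it. `caller_limit` is the value
   set in the match context or, if none was set, the build-time default. */
uint32_t effective_limit(uint32_t caller_limit, uint32_t pattern_limit)
{
  return (pattern_limit < caller_limit) ? pattern_limit : caller_limit;
}
```

       For example, with a caller limit of 1000, a  pattern  that  starts  with
       (*LIMIT_MATCH=5000) still runs with a limit of 1000, whereas  one  that
       starts with (*LIMIT_MATCH=500) runs with a limit of 500.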
1168*22dc650dSSadaf Ebrahimi
1169*22dc650dSSadaf Ebrahimi
1170*22dc650dSSadaf EbrahimiCHECKING BUILD-TIME OPTIONS
1171*22dc650dSSadaf Ebrahimi
1172*22dc650dSSadaf Ebrahimi       int pcre2_config(uint32_t what, void *where);
1173*22dc650dSSadaf Ebrahimi
1174*22dc650dSSadaf Ebrahimi       The function pcre2_config() makes it possible for  a  PCRE2  client  to
1175*22dc650dSSadaf Ebrahimi       find  the  value  of  certain  configuration parameters and to discover
1176*22dc650dSSadaf Ebrahimi       which optional features have been compiled into the PCRE2 library.  The
1177*22dc650dSSadaf Ebrahimi       pcre2build documentation has more details about these features.
1178*22dc650dSSadaf Ebrahimi
1179*22dc650dSSadaf Ebrahimi       The  first  argument  for pcre2_config() specifies which information is
1180*22dc650dSSadaf Ebrahimi       required. The second argument is a pointer to memory into which the in-
1181*22dc650dSSadaf Ebrahimi       formation is placed. If NULL is passed, the function returns the amount
1182*22dc650dSSadaf Ebrahimi       of memory that is needed for the requested information. For calls  that
       return numerical values, this amount is in bytes; when requesting these
1184*22dc650dSSadaf Ebrahimi       values, where should point to appropriately aligned memory.  For  calls
1185*22dc650dSSadaf Ebrahimi       that  return  strings,  the required length is given in code units, not
1186*22dc650dSSadaf Ebrahimi       counting the terminating zero.
1187*22dc650dSSadaf Ebrahimi
1188*22dc650dSSadaf Ebrahimi       When requesting information, the returned value from pcre2_config()  is
1189*22dc650dSSadaf Ebrahimi       non-negative  on success, or the negative error code PCRE2_ERROR_BADOP-
1190*22dc650dSSadaf Ebrahimi       TION if the value in the first argument is not recognized. The  follow-
1191*22dc650dSSadaf Ebrahimi       ing information is available:
1192*22dc650dSSadaf Ebrahimi
1193*22dc650dSSadaf Ebrahimi         PCRE2_CONFIG_BSR
1194*22dc650dSSadaf Ebrahimi
1195*22dc650dSSadaf Ebrahimi       The  output  is a uint32_t integer whose value indicates what character
1196*22dc650dSSadaf Ebrahimi       sequences the \R  escape  sequence  matches  by  default.  A  value  of
1197*22dc650dSSadaf Ebrahimi       PCRE2_BSR_UNICODE  means  that  \R  matches any Unicode line ending se-
1198*22dc650dSSadaf Ebrahimi       quence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR, LF,
1199*22dc650dSSadaf Ebrahimi       or CRLF. The default can be overridden when a pattern is compiled.
1200*22dc650dSSadaf Ebrahimi
1201*22dc650dSSadaf Ebrahimi         PCRE2_CONFIG_COMPILED_WIDTHS
1202*22dc650dSSadaf Ebrahimi
1203*22dc650dSSadaf Ebrahimi       The output is a uint32_t integer whose lower bits indicate  which  code
1204*22dc650dSSadaf Ebrahimi       unit  widths  were  selected  when PCRE2 was built. The 1-bit indicates
1205*22dc650dSSadaf Ebrahimi       8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit  sup-
1206*22dc650dSSadaf Ebrahimi       port, respectively.
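
       Decoding this value is a matter of testing  individual  bits,  as  the
       following sketch shows. The name width_is_supported() is an assumption
       made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decodes the value returned for PCRE2_CONFIG_COMPILED_WIDTHS: the 1-bit
   means 8-bit support, the 2-bit 16-bit support, the 4-bit 32-bit support. */
bool width_is_supported(uint32_t widths, unsigned code_unit_bits)
{
  switch (code_unit_bits) {
    case 8:  return (widths & 1u) != 0;
    case 16: return (widths & 2u) != 0;
    case 32: return (widths & 4u) != 0;
    default: return false;  /* not a code unit width that PCRE2 provides */
  }
}
```

       A value of 5, for instance, indicates that the 8-bit and 32-bit libraries
       were built but the 16-bit library was not.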
1207*22dc650dSSadaf Ebrahimi
1208*22dc650dSSadaf Ebrahimi         PCRE2_CONFIG_DEPTHLIMIT
1209*22dc650dSSadaf Ebrahimi
1210*22dc650dSSadaf Ebrahimi       The  output  is a uint32_t integer that gives the default limit for the
1211*22dc650dSSadaf Ebrahimi       depth of nested backtracking in pcre2_match() or the  depth  of  nested
1212*22dc650dSSadaf Ebrahimi       recursions,  lookarounds,  and atomic groups in pcre2_dfa_match(). Fur-
1213*22dc650dSSadaf Ebrahimi       ther details are given with pcre2_set_depth_limit() above.
1214*22dc650dSSadaf Ebrahimi
1215*22dc650dSSadaf Ebrahimi         PCRE2_CONFIG_HEAPLIMIT
1216*22dc650dSSadaf Ebrahimi
1217*22dc650dSSadaf Ebrahimi       The output is a uint32_t integer that gives, in kibibytes, the  default
1218*22dc650dSSadaf Ebrahimi       limit   for  the  amount  of  heap  memory  used  by  pcre2_match()  or
1219*22dc650dSSadaf Ebrahimi       pcre2_dfa_match().     Further     details     are      given      with
1220*22dc650dSSadaf Ebrahimi       pcre2_set_heap_limit() above.
1221*22dc650dSSadaf Ebrahimi
1222*22dc650dSSadaf Ebrahimi         PCRE2_CONFIG_JIT
1223*22dc650dSSadaf Ebrahimi
1224*22dc650dSSadaf Ebrahimi       The  output  is  a  uint32_t  integer that is set to one if support for
1225*22dc650dSSadaf Ebrahimi       just-in-time compiling is included in the library; otherwise it is  set
1226*22dc650dSSadaf Ebrahimi       to zero. Note that having the support in the library does not guarantee
1227*22dc650dSSadaf Ebrahimi       that  JIT will be used for any given match. See the pcre2jit documenta-
1228*22dc650dSSadaf Ebrahimi       tion for more details.
1229*22dc650dSSadaf Ebrahimi
1230*22dc650dSSadaf Ebrahimi         PCRE2_CONFIG_JITTARGET
1231*22dc650dSSadaf Ebrahimi
1232*22dc650dSSadaf Ebrahimi       The where argument should point to a buffer that is at  least  48  code
1233*22dc650dSSadaf Ebrahimi       units  long.  (The  exact  length  required  can  be  found  by calling
1234*22dc650dSSadaf Ebrahimi       pcre2_config() with where set to NULL.) The buffer  is  filled  with  a
1235*22dc650dSSadaf Ebrahimi       string  that  contains  the  name of the architecture for which the JIT
1236*22dc650dSSadaf Ebrahimi       compiler is configured, for example "x86 32bit  (little  endian  +  un-
1237*22dc650dSSadaf Ebrahimi       aligned)".  If  JIT  support is not available, PCRE2_ERROR_BADOPTION is
1238*22dc650dSSadaf Ebrahimi       returned, otherwise the number of code units used is returned. This  is
1239*22dc650dSSadaf Ebrahimi       the length of the string, plus one unit for the terminating zero.
1240*22dc650dSSadaf Ebrahimi
1241*22dc650dSSadaf Ebrahimi         PCRE2_CONFIG_LINKSIZE
1242*22dc650dSSadaf Ebrahimi
1243*22dc650dSSadaf Ebrahimi       The output is a uint32_t integer that contains the number of bytes used
1244*22dc650dSSadaf Ebrahimi       for  internal  linkage  in  compiled regular expressions. When PCRE2 is
1245*22dc650dSSadaf Ebrahimi       configured, the value can be set to 2, 3, or 4, with the default  being
1246*22dc650dSSadaf Ebrahimi       2.  This is the value that is returned by pcre2_config(). However, when
1247*22dc650dSSadaf Ebrahimi       the 16-bit library is compiled, a value of 3 is rounded up  to  4,  and
1248*22dc650dSSadaf Ebrahimi       when  the  32-bit  library  is compiled, internal linkages always use 4
1249*22dc650dSSadaf Ebrahimi       bytes, so the configured value is not relevant.
1250*22dc650dSSadaf Ebrahimi
1251*22dc650dSSadaf Ebrahimi       The default value of 2 for the 8-bit and 16-bit libraries is sufficient
1252*22dc650dSSadaf Ebrahimi       for all but the most massive patterns, since it allows the size of  the
1253*22dc650dSSadaf Ebrahimi       compiled  pattern  to  be  up  to 65535 code units. Larger values allow
1254*22dc650dSSadaf Ebrahimi       larger regular expressions to be compiled by those two  libraries,  but
1255*22dc650dSSadaf Ebrahimi       at the expense of slower matching.
1256*22dc650dSSadaf Ebrahimi
1257*22dc650dSSadaf Ebrahimi         PCRE2_CONFIG_MATCHLIMIT
1258*22dc650dSSadaf Ebrahimi
1259*22dc650dSSadaf Ebrahimi       The output is a uint32_t integer that gives the default match limit for
1260*22dc650dSSadaf Ebrahimi       pcre2_match().  Further  details are given with pcre2_set_match_limit()
1261*22dc650dSSadaf Ebrahimi       above.

         PCRE2_CONFIG_NEWLINE

       The output is a uint32_t integer whose value specifies the default
       character sequence that is recognized as meaning "newline". The values
       are:

         PCRE2_NEWLINE_CR       Carriage return (CR)
         PCRE2_NEWLINE_LF       Linefeed (LF)
         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
         PCRE2_NEWLINE_ANY      Any Unicode line ending
         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
         PCRE2_NEWLINE_NUL      The NUL character (binary zero)

       The default should normally correspond to the standard sequence for
       your operating system.

         PCRE2_CONFIG_NEVER_BACKSLASH_C

       The output is a uint32_t integer that is set to one if the use of \C
       was permanently disabled when PCRE2 was built; otherwise it is set to
       zero.

         PCRE2_CONFIG_PARENSLIMIT

       The output is a uint32_t integer that gives the maximum depth of
       nesting of parentheses (of any kind) in a pattern. This limit is
       imposed to cap the amount of system stack used when a pattern is
       compiled. It is specified when PCRE2 is built; the default is 250. This
       limit does not take into account the stack that may already be used by
       the calling application. For finer control over compilation stack
       usage, see pcre2_set_compile_recursion_guard().

         PCRE2_CONFIG_STACKRECURSE

       This parameter is obsolete and should not be used in new code. The
       output is a uint32_t integer that is always set to zero.

         PCRE2_CONFIG_TABLES_LENGTH

       The output is a uint32_t integer that gives the length of PCRE2's
       character processing tables in bytes. For details of these tables see
       the section on locale support below.

         PCRE2_CONFIG_UNICODE_VERSION

       The where argument should point to a buffer that is at least 24 code
       units long. (The exact length required can be found by calling
       pcre2_config() with where set to NULL.) If PCRE2 has been compiled
       without Unicode support, the buffer is filled with the text "Unicode
       not supported". Otherwise, the Unicode version string (for example,
       "8.0.0") is inserted. The number of code units used is returned. This
       is the length of the string plus one unit for the terminating zero.

         PCRE2_CONFIG_UNICODE

       The output is a uint32_t integer that is set to one if Unicode support
       is available; otherwise it is set to zero. Unicode support implies UTF
       support.

         PCRE2_CONFIG_VERSION

       The where argument should point to a buffer that is at least 24 code
       units long. (The exact length required can be found by calling
       pcre2_config() with where set to NULL.) The buffer is filled with the
       PCRE2 version string, zero-terminated. The number of code units used is
       returned. This is the length of the string plus one unit for the
       terminating zero.


COMPILING A PATTERN

       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
         pcre2_compile_context *ccontext);

       void pcre2_code_free(pcre2_code *code);

       pcre2_code *pcre2_code_copy(const pcre2_code *code);

       pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);

       The pcre2_compile() function compiles a pattern into an internal form.
       The pattern is defined by a pointer to a string of code units and a
       length in code units. If the pattern is zero-terminated, the length can
       be specified as PCRE2_ZERO_TERMINATED. A NULL pattern pointer with a
       length of zero is treated as an empty string (NULL with a non-zero
       length causes an error return). The function returns a pointer to a
       block of memory that contains the compiled pattern and related data, or
       NULL if an error occurred.

       If the compile context argument ccontext is NULL, memory for the
       compiled pattern is obtained by calling malloc(). Otherwise, it is
       obtained from the same memory function that was used for the compile
       context. The caller must free the memory by calling pcre2_code_free()
       when it is no longer needed. If pcre2_code_free() is called with a NULL
       argument, it returns immediately, without doing anything.

       The function pcre2_code_copy() makes a copy of the compiled code in new
       memory, using the same memory allocator as was used for the original.
       However, if the code has been processed by the JIT compiler (see
       below), the JIT information cannot be copied (because it is
       position-dependent). The new copy can initially be used only for
       non-JIT matching, though it can be passed to pcre2_jit_compile() if
       required. If pcre2_code_copy() is called with a NULL argument, it
       returns NULL.

       The pcre2_code_copy() function provides a way for individual threads in
       a multithreaded application to acquire a private copy of shared
       compiled code. However, it does not make a copy of the character tables
       used by the compiled pattern; the new pattern code points to the same
       tables as the original code. (See "Locale Support" below for details
       of these character tables.) In many applications the same tables are
       used throughout, so this behaviour is appropriate. Nevertheless, there
       are occasions when a copy of a compiled pattern and the relevant tables
       are needed. The pcre2_code_copy_with_tables() function provides this
       facility. Copies of both the code and the tables are made, with the new
       code pointing to the new tables. The memory for the new tables is
       automatically freed when pcre2_code_free() is called for the new copy
       of the compiled code. If pcre2_code_copy_with_tables() is called with a
       NULL argument, it returns NULL.

       NOTE: When one of the matching functions is called, pointers to the
       compiled pattern and the subject string are set in the match data block
       so that they can be referenced by the substring extraction functions
       after a successful match. After running a match, you must not free a
       compiled pattern or a subject string until after all operations on the
       match data block have taken place, unless, in the case of the subject
       string, you have used the PCRE2_COPY_MATCHED_SUBJECT option, which is
       described in the section entitled "Option bits for pcre2_match()"
       below.

       The options argument for pcre2_compile() contains various bit settings
       that affect the compilation. It should be zero if none of them are
       required. The available options are described below. Some of them (in
       particular, those that are compatible with Perl, but some others as
       well) can also be set and unset from within the pattern (see the
       detailed description in the pcre2pattern documentation).

       For those options that can be different in different parts of the
       pattern, the contents of the options argument specify their settings at
       the start of compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and
       PCRE2_NO_UTF_CHECK options can be set at the time of matching as well
       as at compile time.

       Some additional options and less frequently required compile-time
       parameters (for example, the newline setting) can be provided in a
       compile context (as described above).

       If errorcode or erroroffset is NULL, pcre2_compile() returns NULL
       immediately. Otherwise, the variables to which these point are set to
       an error code and an offset (number of code units) within the pattern,
       respectively, when pcre2_compile() returns NULL because a compilation
       error has occurred.

       There are nearly 100 positive error codes that pcre2_compile() may
       return if it finds an error in the pattern. There are also some
       negative error codes that are used for invalid UTF strings when
       validity checking is in force. These are the same as given by
       pcre2_match() and pcre2_dfa_match(), and are described in the
       pcre2unicode documentation. There is no separate documentation for the
       positive error codes, because the textual error messages that are
       obtained by calling the pcre2_get_error_message() function (see
       "Obtaining a textual error message" below) should be self-explanatory.
       Macro names starting with PCRE2_ERROR_ are defined for both positive
       and negative error codes in pcre2.h. When compilation is successful,
       errorcode is set to a value that returns the message "no error" if
       passed to pcre2_get_error_message().

       The value returned in erroroffset is an indication of where in the
       pattern an error occurred. When there is no error, zero is returned. A
       non-zero value is not necessarily the furthest point in the pattern
       that was read. For example, after the error "lookbehind assertion is
       not fixed length", the error offset points to the start of the failing
       assertion. For an invalid UTF-8 or UTF-16 string, the offset is that of
       the first code unit of the failing character.

       Some errors are not detected until the whole pattern has been scanned;
       in these cases, the offset passed back is the length of the pattern.
       Note that the offset is in code units, not characters, even in a UTF
       mode. It may sometimes point into the middle of a UTF-8 or UTF-16
       character.

       This code fragment shows a typical straightforward call to
       pcre2_compile():

         #define PCRE2_CODE_UNIT_WIDTH 8  /* select the 8-bit library */
         #include <pcre2.h>

         pcre2_code *re;
         PCRE2_SIZE erroffset;
         int errorcode;
         re = pcre2_compile(
           (PCRE2_SPTR)"^A.*Z",    /* the pattern, cast from char * */
           PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
           0,                      /* default options */
           &errorcode,             /* for error code */
           &erroffset,             /* for error offset */
           NULL);                  /* no compile context */


   Main compile options

       The following names for option bits are defined in the pcre2.h header
       file:

         PCRE2_ANCHORED

       If this bit is set, the pattern is forced to be "anchored", that is, it
       is constrained to match only at the first matching point in the string
       that is being searched (the "subject string"). This effect can also be
       achieved by appropriate constructs in the pattern itself, which is the
       only way to do it in Perl.

         PCRE2_ALLOW_EMPTY_CLASS

       By default, for compatibility with Perl, a closing square bracket that
       immediately follows an opening one is treated as a data character for
       the class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the
       class, which therefore contains no characters and so can never match.

         PCRE2_ALT_BSUX

       This option requests alternative handling of three escape sequences,
       which makes PCRE2's behaviour more like ECMAscript (aka JavaScript).
       When it is set:

       (1) \U matches an upper case "U" character; by default \U causes a
       compile time error (Perl uses \U to upper case subsequent characters).

       (2) \u matches a lower case "u" character unless it is followed by four
       hexadecimal digits, in which case the hexadecimal number defines the
       code point to match. By default, \u causes a compile time error (Perl
       uses it to upper case the following character).

       (3) \x matches a lower case "x" character unless it is followed by two
       hexadecimal digits, in which case the hexadecimal number defines the
       code point to match. By default, as in Perl, a hexadecimal number is
       always expected after \x, but it may have zero, one, or two digits (so,
       for example, \xz matches a binary zero character followed by z).

       ECMAscript 6 added additional functionality to \u. This can be accessed
       using the PCRE2_EXTRA_ALT_BSUX extra option (see "Extra compile
       options" below). Note that this alternative escape handling applies
       only to patterns. Neither of these options affects the processing of
       replacement strings passed to pcre2_substitute().

         PCRE2_ALT_CIRCUMFLEX

       In multiline mode (when PCRE2_MULTILINE is set), the circumflex
       metacharacter matches at the start of the subject (unless PCRE2_NOTBOL
       is set), and also after any internal newline. However, it does not
       match after a newline at the end of the subject, for compatibility with
       Perl. If you want a multiline circumflex also to match after a
       terminating newline, you must set PCRE2_ALT_CIRCUMFLEX.

         PCRE2_ALT_VERBNAMES

       By default, for compatibility with Perl, the name in any verb sequence
       such as (*MARK:NAME) is any sequence of characters that does not
       include a closing parenthesis. The name is not processed in any way,
       and it is not possible to include a closing parenthesis in the name.
       However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash
       processing is applied to verb names and only an unescaped closing
       parenthesis terminates the name. A closing parenthesis can be included
       in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED or
       PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
       whitespace in verb names is skipped and #-comments are recognized,
       exactly as in the rest of the pattern.

         PCRE2_AUTO_CALLOUT

       If this bit is set, pcre2_compile() automatically inserts callout
       items, all with number 255, before each pattern item, except
       immediately before or after an explicit callout in the pattern. For
       discussion of the callout facility, see the pcre2callout documentation.

         PCRE2_CASELESS

       If this bit is set, letters in the pattern match both upper and lower
       case letters in the subject. It is equivalent to Perl's /i option, and
       it can be changed within a pattern by a (?i) option setting. If either
       PCRE2_UTF or PCRE2_UCP is set, Unicode properties are used for all
       characters with more than one other case, and for all characters whose
       code points are greater than U+007F. Note that there are two ASCII
       characters, K and S, that, in addition to their lower case ASCII
       equivalents, are case-equivalent with U+212A (Kelvin sign) and U+017F
       (long S) respectively. If you do not want this case equivalence, you
       can suppress it by setting PCRE2_EXTRA_CASELESS_RESTRICT.

       For lower valued characters with only one other case, a lookup table is
       used for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup
       table is used for all code points less than 256, and higher code points
       (available only in 16-bit or 32-bit mode) are treated as not having
       another case.

         PCRE2_DOLLAR_ENDONLY

       If this bit is set, a dollar metacharacter in the pattern matches only
       at the end of the subject string. Without this option, a dollar also
       matches immediately before a newline at the end of the string (but not
       before any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
       if PCRE2_MULTILINE is set. There is no equivalent to this option in
       Perl, and no way to set it within a pattern.

         PCRE2_DOTALL

       If this bit is set, a dot metacharacter in the pattern matches any
       character, including one that indicates a newline. However, it only
       ever matches one character, even if newlines are coded as CRLF. Without
       this option, a dot does not match when the current position in the
       subject is at a newline. This option is equivalent to Perl's /s option,
       and it can be changed within a pattern by a (?s) option setting. A
       negative class such as [^a] always matches newline characters, and the
       \N escape sequence always matches a non-newline character, independent
       of the setting of PCRE2_DOTALL.

         PCRE2_DUPNAMES

       If this bit is set, names used to identify capture groups need not be
       unique. This can be helpful for certain types of pattern when it is
       known that only one instance of the named group can ever be matched.
       There are more details of named capture groups below; see also the
       pcre2pattern documentation.

         PCRE2_ENDANCHORED

       If this bit is set, the end of any pattern match must be right at the
       end of the string being searched (the "subject string"). If the pattern
       match succeeds by reaching (*ACCEPT), but does not reach the end of the
       subject, the match fails at the current starting point. For unanchored
       patterns, a new match is then tried at the next starting point.
       However, if the match succeeds by reaching the end of the pattern, but
       not the end of the subject, backtracking occurs and an alternative
       match may be found. Consider these two patterns:

         .(*ACCEPT)|..
         .|..

       If matched against "abc" with PCRE2_ENDANCHORED set, the first matches
       "c" whereas the second matches "bc". The effect of PCRE2_ENDANCHORED
       can also be achieved by appropriate constructs in the pattern itself,
       which is the only way to do it in Perl.

       For DFA matching with pcre2_dfa_match(), PCRE2_ENDANCHORED applies only
       to the first (that is, the longest) matched string. Other parallel
       matches, which are necessarily substrings of the first one, must
       obviously end before the end of the subject.

         PCRE2_EXTENDED

       If this bit is set, most white space characters in the pattern are
       totally ignored except when escaped, inside a character class, or
       inside a \Q...\E sequence. However, white space is not allowed within
       sequences such as (?> that introduce various parenthesized groups, nor
       within numerical quantifiers such as {1,3}. Ignorable white space is
       permitted between an item and a following quantifier and between a
       quantifier and a following + that indicates possessiveness.
       PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be changed
       within a pattern by a (?x) option setting.

       When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED
       recognizes as white space only those characters with code points less
       than 256 that are flagged as white space in its low-character table.
       The table is normally created by pcre2_maketables(), which uses the
       isspace() function to identify space characters. In most ASCII
       environments, the relevant characters are those with code points 0x0009
       (tab), 0x000A (linefeed), 0x000B (vertical tab), 0x000C (formfeed),
       0x000D (carriage return), and 0x0020 (space).

       When PCRE2 is compiled with Unicode support, in addition to these
       characters, five more Unicode "Pattern White Space" characters are
       recognized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E
       (left-to-right mark), U+200F (right-to-left mark), U+2028 (line
       separator), and U+2029 (paragraph separator). This set of characters is
       the same as recognized by Perl's /x option. Note that the horizontal
       and vertical space characters that are matched by the \h and \v escapes
       in patterns are a much bigger set.

       As well as ignoring most white space, PCRE2_EXTENDED also causes
       characters between an unescaped # outside a character class and the
       next newline, inclusive, to be ignored, which makes it possible to
       include comments inside complicated patterns. Note that the end of this
       type of comment is a literal newline sequence in the pattern; escape
       sequences that happen to represent a newline do not count.

       Which characters are interpreted as newlines can be specified by a
       setting in the compile context that is passed to pcre2_compile() or by
       a special sequence at the start of the pattern, as described in the
       section entitled "Newline conventions" in the pcre2pattern
       documentation. A default is defined when PCRE2 is built.

         PCRE2_EXTENDED_MORE

       This option has the effect of PCRE2_EXTENDED, but, in addition,
       unescaped space and horizontal tab characters are ignored inside a
       character class. Note: only these two characters are ignored, not the
       full set of pattern white space characters that are ignored outside a
       character class. PCRE2_EXTENDED_MORE is equivalent to Perl's /xx
       option, and it can be changed within a pattern by a (?xx) option
       setting.

         PCRE2_FIRSTLINE

       If this option is set, the start of an unanchored pattern match must be
       before or at the first newline in  the  subject  string  following  the
       start  of  matching, though the matched text may continue over the new-
       line. If startoffset is non-zero, the limiting newline is not necessar-
       ily the first newline in the  subject.  For  example,  if  the  subject
       string is "abc\nxyz" (where \n represents a single-character newline) a
       pattern  match for "yz" succeeds with PCRE2_FIRSTLINE if startoffset is
       greater than 3. See also PCRE2_USE_OFFSET_LIMIT, which provides a  more
       general  limiting  facility.  If  PCRE2_FIRSTLINE is set with an offset
       limit, a match must occur in the first line and also within the  offset
       limit. In other words, whichever limit comes first is used. This option
       has no effect for anchored patterns.

         PCRE2_LITERAL

       If this option is set, all meta-characters in the pattern are disabled,
       and  it is treated as a literal string. Matching literal strings with a
       regular expression engine is not the most efficient way of doing it. If
       you are doing a lot of literal matching and  are  worried  about  effi-
       ciency, you should consider using other approaches. The only other main
       options  that  are  allowed  with  PCRE2_LITERAL  are:  PCRE2_ANCHORED,
       PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE,
       PCRE2_MATCH_INVALID_UTF,  PCRE2_NO_START_OPTIMIZE,  PCRE2_NO_UTF_CHECK,
       PCRE2_UTF,  and  PCRE2_USE_OFFSET_LIMIT.  The  extra  options PCRE2_EX-
       TRA_MATCH_LINE and PCRE2_EXTRA_MATCH_WORD are also supported. Any other
       options cause an error.

         PCRE2_MATCH_INVALID_UTF

       This option forces PCRE2_UTF (see below) and also enables  support  for
       matching  by  pcre2_match() in subject strings that contain invalid UTF
       sequences.  Note, however, that the 16-bit and 32-bit  PCRE2  libraries
       process  strings as sequences of uint16_t or uint32_t code points. They
       cannot find valid UTF sequences within an arbitrary string of bytes un-
       less such sequences are suitably aligned. This  facility  is  not  sup-
       ported  for  DFA matching. For details, see the pcre2unicode documenta-
       tion.

         PCRE2_MATCH_UNSET_BACKREF

       If this option is set,  a  backreference  to  an  unset  capture  group
       matches  an  empty  string (by default this causes the current matching
       alternative to fail).  A pattern such as (\1)(a) succeeds when this op-
       tion is set (assuming it can find an "a" in the  subject),  whereas  it
       fails  by  default,  for  Perl compatibility. Setting this option makes
       PCRE2 behave more like ECMAscript (aka JavaScript).

         PCRE2_MULTILINE

       By default, for the purposes of matching "start of line"  and  "end  of
       line",  PCRE2  treats the subject string as consisting of a single line
       of characters, even if it actually contains  newlines.  The  "start  of
       line"  metacharacter  (^)  matches only at the start of the string, and
       the "end of line" metacharacter ($) matches only  at  the  end  of  the
       string,  or  before a terminating newline (except when PCRE2_DOLLAR_EN-
       DONLY is set). Note, however, that unless PCRE2_DOTALL is set, the "any
       character" metacharacter (.) does not match at a newline.  This  behav-
       iour (for ^, $, and dot) is the same as Perl.

       When  PCRE2_MULTILINE  is  set,  the "start of line" and "end of line"
       constructs match immediately following or immediately  before  internal
       newlines  in  the  subject string, respectively, as well as at the very
       start and end. This is equivalent to Perl's /m option, and  it  can  be
       changed within a pattern by a (?m) option setting. Note that the "start
       of line" metacharacter does not match after a newline at the end of the
       subject,  for compatibility with Perl.  However, you can change this by
       setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in  a
       subject  string,  or  no  occurrences  of  ^ or $ in a pattern, setting
       PCRE2_MULTILINE has no effect.

         PCRE2_NEVER_BACKSLASH_C

       This option locks out the use of \C in the pattern that is  being  com-
       piled.   This  escape  can  cause  unpredictable  behaviour in UTF-8 or
       UTF-16 modes, because it may leave the current matching  point  in  the
       middle of a multi-code-unit character. This option may be useful in ap-
       plications that process patterns from external sources. Note that there
       is also a build-time option that permanently locks out the use of \C.

         PCRE2_NEVER_UCP

       This  option  locks  out the use of Unicode properties for handling \B,
       \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
       described for the PCRE2_UCP option below. In  particular,  it  prevents
       the  creator of the pattern from enabling this facility by starting the
       pattern with (*UCP). This option may be  useful  in  applications  that
       process  patterns  from  external  sources.  The   option   combination
       PCRE2_UCP and PCRE2_NEVER_UCP causes an error.

         PCRE2_NEVER_UTF

       This  option  locks out interpretation of the pattern as UTF-8, UTF-16,
       or UTF-32, depending on which library is in use. In particular, it pre-
       vents the creator of the pattern from switching to  UTF  interpretation
       by  starting  the pattern with (*UTF). This option may be useful in ap-
       plications that process patterns from external sources. The combination
       of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.

         PCRE2_NO_AUTO_CAPTURE

       If this option is set, it disables the use of numbered capturing paren-
       theses in the pattern. Any opening parenthesis that is not followed  by
       ?  behaves as if it were followed by ?: but named parentheses can still
       be used for capturing (and they acquire numbers in the usual way). This
       is the same as Perl's /n option.  Note that, when this option  is  set,
       references  to  capture  groups (backreferences or recursion/subroutine
       calls) may only refer to named groups, though the reference can  be  by
       name or by number.

         PCRE2_NO_AUTO_POSSESS

       If this option is set, it disables "auto-possessification", which is an
       optimization  that,  for example, turns a+b into a++b in order to avoid
       backtracks into a+ that can never be successful. However,  if  callouts
       are  in  use,  auto-possessification means that some callouts are never
       taken. You can set this option if you want the matching functions to do
       a full unoptimized search and run all the callouts, but  it  is  mainly
       provided for testing purposes.

         PCRE2_NO_DOTSTAR_ANCHOR

       If this option is set, it disables an optimization that is applied when
       .*  is  the  first significant item in a top-level branch of a pattern,
       and all the other branches also start with .* or with \A or  \G  or  ^.
       The  optimization  is  automatically disabled for .* if it is inside an
       atomic group or a capture group that is the subject of a backreference,
       or if the pattern contains (*PRUNE) or (*SKIP). When  the  optimization
       is   not   disabled,  such  a  pattern  is  automatically  anchored  if
       PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
       for any ^ items. Otherwise, the fact that any match must  start  either
       at  the start of the subject or following a newline is remembered. Like
       other optimizations, this can cause callouts to be skipped.

         PCRE2_NO_START_OPTIMIZE

       This is an option whose main effect is at matching time.  It  does  not
       change what pcre2_compile() generates, but it does affect the output of
       the JIT compiler.

       There  are  a  number of optimizations that may occur at the start of a
       match, in order to speed up the process. For example, if  it  is  known
       that  an  unanchored  match must start with a specific code unit value,
       the matching code searches the subject for that value, and fails  imme-
       diately  if it cannot find it, without actually running the main match-
       ing function. This means that a special item such as (*COMMIT)  at  the
       start  of  a  pattern is not considered until after a suitable starting
       point for the match has been found.  Also,  when  callouts  or  (*MARK)
       items  are  in use, these "start-up" optimizations can cause them to be
       skipped if the pattern is never actually used. The  start-up  optimiza-
       tions  are  in effect a pre-scan of the subject that takes place before
       the pattern is run.

       The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
       possibly causing performance to suffer,  but  ensuring  that  in  cases
       where  the  result is "no match", the callouts do occur, and that items
       such as (*COMMIT) and (*MARK) are considered at every possible starting
       position in the subject string.

       Setting PCRE2_NO_START_OPTIMIZE may change the outcome  of  a  matching
       operation.  Consider the pattern

         (*COMMIT)ABC

       When  this  is compiled, PCRE2 records the fact that a match must start
       with the character "A". Suppose the subject  string  is  "DEFABC".  The
       start-up  optimization  scans along the subject, finds "A" and runs the
       first match attempt from there. The (*COMMIT) item means that the  pat-
       tern  must  match the current starting position, which in this case, it
       does. However, if the same match is  run  with  PCRE2_NO_START_OPTIMIZE
       set,  the  initial  scan  along the subject string does not happen. The
       first match attempt is run starting  from  "D"  and  when  this  fails,
       (*COMMIT)  prevents any further matches being tried, so the overall re-
       sult is "no match".

       Another start-up optimization makes use of a  minimum  length  for  a
       matching subject, which is recorded when possible. Consider the pattern

         (*MARK:1)B(*MARK:2)(X|Y)

       The  minimum  length  for  a match is two characters. If the subject is
       "XXBB", the "starting character" optimization skips "XX", then tries to
       match "BB", which is long enough. In the process, (*MARK:2) is  encoun-
       tered  and  remembered.  When  the match attempt fails, the next "B" is
       found, but there is only one character left, so there are no  more  at-
       tempts,  and  "no  match"  is returned with the "last mark seen" set to
       "2". If PCRE2_NO_START_OPTIMIZE is set, however, matches are  tried  at
       every  possible  starting  position,  including  at  the  end  of  the
       subject, where (*MARK:1) is encountered, but there is no  "B",  so  the
       "last  mark  seen"  that  is  returned  is "1". In this case, the opti-
       mizations do not affect the overall match result, which is  still  "no
       match", but they do affect the auxiliary information that is returned.

         PCRE2_NO_UTF_CHECK

       When PCRE2_UTF is set, the validity of the pattern as a UTF  string  is
       automatically  checked.  There  are  discussions  about the validity of
       UTF-8 strings, UTF-16 strings, and UTF-32 strings in  the  pcre2unicode
       document.  If an invalid UTF sequence is found, pcre2_compile() returns
       a negative error code.

       If you know that your pattern is a valid UTF string, and  you  want  to
       skip   this   check   for   performance   reasons,   you  can  set  the
       PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an in-
       valid UTF string as a pattern is undefined. It may cause  your  program
       to crash or loop.

       Note  that  this  option  can  also  be  passed  to  pcre2_match()  and
       pcre2_dfa_match(), to suppress UTF validity  checking  of  the  subject
       string.

       Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis-
       able  the error that is given if an escape sequence for an invalid Uni-
       code code point is encountered in the pattern. In particular,  the  so-
       called  "surrogate"  code points (0xd800 to 0xdfff) are invalid. If you
       want to allow escape  sequences  such  as  \x{d800}  you  can  set  the
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES  extra  option, as described in the
       section entitled "Extra compile options" below.  However, this is  pos-
       sible only in UTF-8 and UTF-32 modes, because these values are not rep-
       resentable in UTF-16.

         PCRE2_UCP

       This  option  has  two  effects.  Firstly,  it  changes  the way PCRE2
       processes \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX charac-
       ter classes. By default, only ASCII characters are recognized, but  if
       PCRE2_UCP  is  set, Unicode properties are used to classify characters.
       There are some PCRE2_EXTRA options (see below) that add  finer  control
       to  this  behaviour.  More  details are given in the section on generic
       character types in the pcre2pattern page.

       The second effect of PCRE2_UCP is to force the use of  Unicode  proper-
       ties for upper/lower casing operations, even when PCRE2_UTF is not set.
       This  makes  it  possible  to process strings in the 16-bit UCS-2 code.
       This option is available only if PCRE2 has been compiled  with  Unicode
       support  (which is the default).  The PCRE2_EXTRA_CASELESS_RESTRICT op-
       tion (see below) restricts caseless matching such that ASCII characters
       match only ASCII characters and non-ASCII characters  match  only  non-
       ASCII characters.

         PCRE2_UNGREEDY

       This  option  inverts  the "greediness" of the quantifiers so that they
       are not greedy by default, but become greedy if followed by "?". It  is
       not  compatible  with Perl. It can also be set by a (?U) option setting
       within the pattern.

         PCRE2_USE_OFFSET_LIMIT

       This option must be set for pcre2_compile() if pcre2_set_offset_limit()
       is going to be used to set a non-default offset limit in a  match  con-
       text  for  matches  that  use this pattern. An error is generated if an
       offset limit is set without this option. For more details, see the  de-
       scription  of  pcre2_set_offset_limit()  in  the section that describes
       match contexts. See also the PCRE2_FIRSTLINE option above.

         PCRE2_UTF

       This option causes PCRE2 to regard both the  pattern  and  the  subject
       strings  that  are  subsequently processed as strings of UTF characters
       instead of single-code-unit strings. It  is  available  when  PCRE2  is
       built  to  include  Unicode  support (which is the default). If Unicode
       support is not available, the use of this option provokes an error. De-
       tails of how PCRE2_UTF changes the behaviour of PCRE2 are given in  the
       pcre2unicode  page.  In  particular,  note  that  it  changes  the  way
       PCRE2_CASELESS works.

   Extra compile options

       The option bits that can be set in a compile  context  by  calling  the
       pcre2_set_compile_extra_options() function are as follows:

         PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK

       Since release 10.38 PCRE2 has forbidden the use of \K within lookaround
       assertions, following Perl's lead. This option is provided to re-enable
       the previous behaviour (act in positive lookarounds, ignore in negative
       ones) in case anybody is relying on it.

         PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES

       This  option  applies when compiling a pattern in UTF-8 or UTF-32 mode.
       It is forbidden in UTF-16 mode, and ignored in non-UTF  modes.  Unicode
       "surrogate" code points in the range 0xd800 to 0xdfff are used in pairs
       in  UTF-16  to  encode  code points with values in the range 0x10000 to
       0x10ffff. The surrogates cannot therefore  be  represented  in  UTF-16.
       They can be represented in UTF-8 and UTF-32, but are defined as invalid
       code  points,  and  cause  errors  if  encountered in a UTF-8 or UTF-32
       string that is being checked for validity by PCRE2.

       These values also cause errors if encountered in escape sequences  such
       as \x{d912} within a pattern. However, it seems that some applications,
       when using PCRE2 to check for unwanted characters in UTF-8 strings, ex-
       plicitly   test   for   the  surrogates  using  escape  sequences.  The
       PCRE2_NO_UTF_CHECK option does not disable the error that  occurs,  be-
       cause it applies only to the testing of input strings for UTF validity.

       If  the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surro-
       gate code point values in UTF-8 and UTF-32 patterns no  longer  provoke
       errors  and are incorporated in the compiled pattern. However, they can
       only match subject characters if the matching function is  called  with
       PCRE2_NO_UTF_CHECK set.
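
       The  reason  that  surrogates  cannot stand for themselves in UTF-16 is
       the pairing arithmetic, which can be sketched in plain C. Nothing in
       this  fragment  calls  PCRE2; is_surrogate() and to_surrogate_pair()
       are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Returns 1 if `cp` lies in the surrogate range 0xd800 to 0xdfff. */
int is_surrogate(uint32_t cp)
{
    return cp >= 0xd800 && cp <= 0xdfff;
}

/* Splits a code point in the range 0x10000 to 0x10ffff into a UTF-16
   surrogate pair. Returns 1 on success, or 0 if `cp` is out of range. */
int to_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10ffff) return 0;

    cp -= 0x10000;                              /* 20 significant bits */
    *high = (uint16_t)(0xd800 + (cp >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xdc00 + (cp & 0x3ff));  /* bottom 10 bits */
    return 1;
}
```

       For  example,  the code point 0x1f600 becomes the pair 0xd83d, 0xde00.
       Both halves fall inside the surrogate range, which is why that range is
       reserved and is not valid as a code point in its own right.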

         PCRE2_EXTRA_ALT_BSUX

       The  original option PCRE2_ALT_BSUX causes PCRE2 to process \U, \u, and
       \x in the way that ECMAscript (aka JavaScript) does.  Additional  func-
       tionality was defined by ECMAscript 6; setting PCRE2_EXTRA_ALT_BSUX has
       the  effect  of PCRE2_ALT_BSUX, but in addition it recognizes \u{hhh..}
       as a hexadecimal character code, where hhh.. is any number of hexadeci-
       mal digits.

         PCRE2_EXTRA_ASCII_BSD

       This option forces \d to match only ASCII digits, even  when  PCRE2_UCP
       is  set.   It can be changed within a pattern by means of the (?aD) op-
       tion setting.

         PCRE2_EXTRA_ASCII_BSS

       This option forces \s to match only ASCII space characters,  even  when
       PCRE2_UCP  is  set.  It can be changed within a pattern by means of the
       (?aS) option setting.

         PCRE2_EXTRA_ASCII_BSW

       This option forces \w to match only ASCII word  characters,  even  when
       PCRE2_UCP  is  set.  It can be changed within a pattern by means of the
       (?aW) option setting.

         PCRE2_EXTRA_ASCII_DIGIT

       This option forces the POSIX character classes [:digit:] and [:xdigit:]
       to match only ASCII digits, even when  PCRE2_UCP  is  set.  It  can  be
       changed within a pattern by means of the (?aT) option setting.

         PCRE2_EXTRA_ASCII_POSIX

       This option forces all the POSIX character classes, including [:digit:]
       and  [:xdigit:], to match only ASCII characters, even when PCRE2_UCP is
       set. It can be changed within a pattern by means of  the  (?aP)  option
       setting,  but note that this also sets PCRE2_EXTRA_ASCII_DIGIT in order
       to ensure that (?-aP) unsets all ASCII restrictions for POSIX classes.

         PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL

       This is a dangerous option. Use with care. By default, an  unrecognized
       escape  such  as \j or a malformed one such as \x{2z} causes a compile-
       time error when detected by pcre2_compile(). Perl is somewhat inconsis-
       tent in handling such items: for example, \j is treated  as  a  literal
       "j",  and non-hexadecimal digits in \x{} are just ignored, though warn-
       ings are given in both cases if Perl's warning switch is enabled.  How-
       ever,  a  malformed  octal  number  after \o{ always causes an error in
       Perl.

       If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL  extra  option  is  passed  to
       pcre2_compile(),  all  unrecognized  or  malformed escape sequences are
       treated as single-character escapes. For example, \j is a  literal  "j"
       and  \x{2z}  is treated as the literal string "x{2z}". Setting this op-
       tion means that typos in patterns may go undetected and have unexpected
       results. Also note that a sequence such as [\N{] is  interpreted  as  a
       malformed  attempt  at [\N{...}] and so is treated as [N{] whereas [\N]
       gives an error because an unqualified \N is a valid escape sequence but
       is not supported in a character class. To reiterate: this is a  danger-
       ous option. Use with great care.
2025*22dc650dSSadaf Ebrahimi
2026*22dc650dSSadaf Ebrahimi         PCRE2_EXTRA_CASELESS_RESTRICT
2027*22dc650dSSadaf Ebrahimi
2028*22dc650dSSadaf Ebrahimi       When  either  PCRE2_UCP  or PCRE2_UTF is set, caseless matching follows
2029*22dc650dSSadaf Ebrahimi       Unicode rules, which allow for more than two cases per character. There
2030*22dc650dSSadaf Ebrahimi       are two case-equivalent character sets that contain both ASCII and non-
2031*22dc650dSSadaf Ebrahimi       ASCII characters. The ASCII letter S is case-equivalent to U+017f (long
2032*22dc650dSSadaf Ebrahimi       S) and the ASCII letter K is case-equivalent to U+212a  (Kelvin  sign).
2033*22dc650dSSadaf Ebrahimi       This  option  disables  recognition of case-equivalences that cross the
2034*22dc650dSSadaf Ebrahimi       ASCII/non-ASCII boundary. In a caseless match, both characters must ei-
2035*22dc650dSSadaf Ebrahimi       ther be ASCII or non-ASCII. The option can be changed with a pattern by
2036*22dc650dSSadaf Ebrahimi       the (?r) option setting.

         PCRE2_EXTRA_ESCAPED_CR_IS_LF

       There are some legacy applications where the escape sequence  \r  in  a
       pattern  is expected to match a newline. If this option is set, \r in a
       pattern is converted to \n so that it matches a LF  (linefeed)  instead
       of  a CR (carriage return) character. The option does not affect a lit-
       eral CR in the pattern, nor does it affect CR specified as an  explicit
       code point such as \x{0D}.

         PCRE2_EXTRA_MATCH_LINE

       This  option  is  provided  for  use  by the -x option of pcre2grep. It
       causes the pattern only to match complete lines. This  is  achieved  by
       automatically  inserting  the  code for "^(?:" at the start of the com-
       piled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE  is  set,
       the  matched  line may be in the middle of the subject string. This op-
       tion can be used with PCRE2_LITERAL.

         PCRE2_EXTRA_MATCH_WORD

       This option is provided for use by  the  -w  option  of  pcre2grep.  It
       causes  the  pattern only to match strings that have a word boundary at
       the start and the end. This is achieved by automatically inserting  the
       code  for "\b(?:" at the start of the compiled pattern and ")\b" at the
       end. The option may be used with PCRE2_LITERAL. However, it is  ignored
       if PCRE2_EXTRA_MATCH_LINE is also set.


JUST-IN-TIME (JIT) COMPILATION

       int pcre2_jit_compile(pcre2_code *code, uint32_t options);

       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       pcre2_jit_stack *pcre2_jit_stack_create(size_t startsize,
         size_t maxsize, pcre2_general_context *gcontext);

       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
         pcre2_jit_callback callback_function, void *callback_data);

       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);

       These  functions  provide  support  for  JIT compilation, which, if the
       just-in-time compiler is available, further processes a  compiled  pat-
       tern into machine code that executes much faster than the pcre2_match()
       interpretive  matching function. Full details are given in the pcre2jit
       documentation.

       JIT compilation is a heavyweight optimization. It can  take  some  time
       for  patterns  to  be analyzed, and for one-off matches and simple pat-
       terns the benefit of faster execution might be offset by a much  slower
       compilation  time.  Most (but not all) patterns can be optimized by the
       JIT compiler.


LOCALE SUPPORT

       const uint8_t *pcre2_maketables(pcre2_general_context *gcontext);

       void pcre2_maketables_free(pcre2_general_context *gcontext,
         const uint8_t *tables);

       PCRE2 handles caseless matching, and determines whether characters  are
       letters,  digits, or whatever, by reference to a set of tables, indexed
       by character code point. However, this applies only to characters whose
       code points are less than 256. By default,  higher-valued  code  points
       never match escapes such as \w or \d.

       When PCRE2 is built with Unicode support (the default), certain Unicode
       character  properties  can be tested with \p and \P, or, alternatively,
       the PCRE2_UCP option can be set when a pattern is compiled; this causes
       \w and friends to use Unicode property support instead of the  built-in
       tables.  PCRE2_UCP also causes upper/lower casing operations on charac-
       ters with code points greater than 127 to use Unicode properties. These
       effects  apply even when PCRE2_UTF is not set. There are, however, some
       PCRE2_EXTRA options (see above) that can be used to modify or  suppress
       them.

       The  use  of  locales  with Unicode is discouraged. If you are handling
       characters with code points greater than 127,  you  should  either  use
       Unicode support, or use locales, but not try to mix the two.

       PCRE2  contains a built-in set of character tables that are used by de-
       fault.  These are sufficient for many applications. Normally,  the  in-
       ternal  tables  recognize only ASCII characters. However, when PCRE2 is
       built, it is possible to cause the internal tables to be rebuilt in the
       default "C" locale of the local system, which may cause them to be dif-
       ferent.

       The built-in tables can be overridden by tables supplied by the  appli-
       cation  that  calls  PCRE2.  These may be created in a different locale
       from the default.  As more and more applications change to  using  Uni-
       code, the need for this locale support is expected to die away.

       External  tables  are built by calling the pcre2_maketables() function,
       in the relevant locale. The only argument to this function is a general
       context, which can be used to pass a custom memory  allocator.  If  the
       argument is NULL, the system malloc() is used. The result can be passed
       to pcre2_compile() as often as necessary, by creating a compile context
       and  calling  pcre2_set_character_tables()  to  set  the tables pointer
       therein.

       For example, to build and use  tables  that  are  appropriate  for  the
       French  locale  (where accented characters with values greater than 127
       are treated as letters), the following code could be used:

         setlocale(LC_CTYPE, "fr_FR");
         tables = pcre2_maketables(NULL);
         ccontext = pcre2_compile_context_create(NULL);
         pcre2_set_character_tables(ccontext, tables);
         re = pcre2_compile(..., ccontext);

       The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
       if you are using Windows, the name for the French locale is "french".

       The pointer that is passed (via the compile context) to pcre2_compile()
       is saved with the compiled pattern, and the same tables are used by the
       matching  functions.  Thus,  for  any  single  pattern, compilation and
       matching both happen in the same locale, but different patterns can  be
       processed in different locales.

       It  is the caller's responsibility to ensure that the memory containing
       the tables remains available while they are still in use. When they are
       no longer needed, you can discard them  using  pcre2_maketables_free(),
       which  should be passed as its first argument the same general context
       that was used to create the tables.

   Saving locale tables

       The tables described above are just a sequence of binary  bytes,  which
       makes  them  independent of hardware characteristics such as endianness
       or whether the processor is 32-bit or 64-bit. A copy of the  result  of
       pcre2_maketables()  can  therefore  be saved in a file or elsewhere and
       re-used later, even in a different program or on another computer.  The
       size  of  the  tables  (number  of  bytes)  must be obtained by calling
       pcre2_config()  with  the  PCRE2_CONFIG_TABLES_LENGTH  option   because
       pcre2_maketables()   does   not   return  this  value.  Note  that  the
       pcre2_dftables program, which is part of the PCRE2 build system, can be
       used stand-alone to create a file that contains a set of binary tables.
       See the pcre2build documentation for details.


INFORMATION ABOUT A COMPILED PATTERN

       int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
         void *where);

       The pcre2_pattern_info() function returns general information  about  a
       compiled pattern. For information about callouts, see the next section.
       The  first  argument  for pcre2_pattern_info() is a pointer to the com-
       piled pattern. The second argument specifies which piece of information
       is required, and the third argument is a pointer to a variable  to  re-
       ceive  the  data.  If the third argument is NULL, the first argument is
       ignored, and the function returns the size in  bytes  of  the  variable
       that is required for the information requested. Otherwise, the yield of
       the function is zero for success, or one of the following negative num-
       bers:

         PCRE2_ERROR_NULL           the argument code was NULL
         PCRE2_ERROR_BADMAGIC       the "magic number" was not found
         PCRE2_ERROR_BADOPTION      the value of what was invalid
         PCRE2_ERROR_UNSET          the requested field is not set

       The "magic number" is placed at the start of each compiled pattern as a
       simple  check  against  passing  an arbitrary memory pointer. Here is a
       typical call of pcre2_pattern_info(), to obtain the length of the  com-
       piled pattern:

         int rc;
         size_t length;
         rc = pcre2_pattern_info(
           re,               /* result of pcre2_compile() */
           PCRE2_INFO_SIZE,  /* what is required */
           &length);         /* where to put the data */

       The possible values for the second argument are defined in pcre2.h, and
       are as follows:

         PCRE2_INFO_ALLOPTIONS
         PCRE2_INFO_ARGOPTIONS
         PCRE2_INFO_EXTRAOPTIONS

       Return copies of the pattern's options. The third argument should point
       to  a  uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the op-
       tions that were passed to  pcre2_compile(),  whereas  PCRE2_INFO_ALLOP-
       TIONS  returns  the compile options as modified by any top-level (*XXX)
       option settings such as (*UTF) at the  start  of  the  pattern  itself.
       PCRE2_INFO_EXTRAOPTIONS  returns the extra options that were set in the
       compile context by calling the pcre2_set_compile_extra_options()  func-
       tion.

       For  example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EX-
       TENDED option, the result for PCRE2_INFO_ALLOPTIONS  is  PCRE2_EXTENDED
       and  PCRE2_UTF.   Option settings such as (?i) that can change within a
       pattern do not affect the result of PCRE2_INFO_ALLOPTIONS, even if they
       appear right at the start of the pattern. (This was different  in  some
       earlier releases.)

       A  pattern compiled without PCRE2_ANCHORED is automatically anchored by
       PCRE2 if the first significant item in every top-level branch is one of
       the following:

         ^     unless PCRE2_MULTILINE is set
         \A    always
         \G    always
         .*    sometimes - see below

       When .* is the first significant item, anchoring is possible only  when
       all the following are true:

         .* is not in an atomic group
         .* is not in a capture group that is the subject
              of a backreference
         PCRE2_DOTALL is in force for .*
         Neither (*PRUNE) nor (*SKIP) appears in the pattern
         PCRE2_NO_DOTSTAR_ANCHOR is not set

       For  patterns  that are auto-anchored, the PCRE2_ANCHORED bit is set in
       the options returned for PCRE2_INFO_ALLOPTIONS.

         PCRE2_INFO_BACKREFMAX

       Return the number of the highest  backreference  in  the  pattern.  The
       third  argument  should  point  to  a  uint32_t variable. Named capture
       groups acquire numbers as well as names, and these  count  towards  the
       highest  backreference.  Backreferences  such as \4 or \g{12} match the
       captured characters of the given group, but in addition, the check that
       a capture group is set in a conditional group such as (?(3)a|b) is also
       a backreference.  Zero is returned if there are no backreferences.

         PCRE2_INFO_BSR

       The output is a uint32_t integer whose value indicates  what  character
       sequences  the \R escape sequence matches. A value of PCRE2_BSR_UNICODE
       means that \R matches any Unicode line  ending  sequence;  a  value  of
       PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF.

         PCRE2_INFO_CAPTURECOUNT

       Return  the  highest  capture  group number in the pattern. In patterns
       where (?| is not used, this is also the total number of capture groups.
       The third argument should point to a uint32_t variable.

         PCRE2_INFO_DEPTHLIMIT

       If the pattern set a backtracking depth limit by including an  item  of
       the  form  (*LIMIT_DEPTH=nnnn) at the start, the value is returned. The
       third argument should point to a uint32_t integer. If no such value has
       been set, the call to pcre2_pattern_info() returns the error  PCRE2_ER-
       ROR_UNSET. Note that this limit will only be used during matching if it
       is  less  than  the  limit  set or defaulted by the caller of the match
       function.

         PCRE2_INFO_FIRSTBITMAP

       In the absence of a single first code unit for a non-anchored  pattern,
       pcre2_compile()  may construct a 256-bit table that defines a fixed set
       of values for the first code unit in any match. For example, a  pattern
       that  starts  with  [abc]  results in a table with three bits set. When
       code unit values greater than 255 are supported, the flag bit  for  255
       means  "any  code unit of value 255 or above". If such a table was con-
       structed, a pointer to it is returned. Otherwise NULL is returned.  The
       third argument should point to a const uint8_t * variable.
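
       Testing a code unit against such a table can be sketched as below. The
       bit  layout  used  here  (32  bytes, with code unit c held in bit c % 8
       of byte c / 8) is not stated in this section and is an assumption of
       the sketch, as is the hypothetical name may_start_with().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if a match may start with code unit `c`
   according to `bitmap` (the pointer obtained with
   PCRE2_INFO_FIRSTBITMAP), and 0 if it cannot. A NULL bitmap means no
   table was built, so no code unit can be ruled out. */
int may_start_with(const uint8_t *bitmap, uint32_t c)
{
    if (bitmap == NULL) return 1;

    /* The bit for 255 stands for "any code unit of value 255 or above". */
    if (c > 255) c = 255;

    /* Assumed layout: bit (c % 8) of byte (c / 8). */
    return (int)((bitmap[c / 8] >> (c % 8)) & 1u);
}
```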

         PCRE2_INFO_FIRSTCODETYPE

       Return information about the first code unit of any matched string, for
       a  non-anchored  pattern. The third argument should point to a uint32_t
       variable. If there is a fixed first value, for example, the letter  "c"
       from  a  pattern such as (cat|cow|coyote), 1 is returned, and the value
       can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is  no  fixed
       first  value,  but it is known that a match can occur only at the start
       of the subject or following a newline in the subject,  2  is  returned.
       Otherwise, and for anchored patterns, 0 is returned.

         PCRE2_INFO_FIRSTCODEUNIT

       Return  the  value  of  the first code unit of any matched string for a
       pattern where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise  return  0.
       The  third  argument  should point to a uint32_t variable. In the 8-bit
       library, the value is always less than 256. In the 16-bit  library  the
       value  can  be  up  to 0xffff. In the 32-bit library in UTF-32 mode the
       value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
       mode.

         PCRE2_INFO_FRAMESIZE

       Return the size (in bytes) of the data frames that are used to remember
       backtracking positions when the pattern is processed  by  pcre2_match()
       without  the  use  of  JIT. The third argument should point to a size_t
       variable. The frame size depends on the number of capturing parentheses
       in the pattern. Each additional capture group adds two PCRE2_SIZE vari-
       ables.

         PCRE2_INFO_HASBACKSLASHC

       Return 1 if the pattern contains any instances of \C, otherwise 0.  The
       third argument should point to a uint32_t variable.

         PCRE2_INFO_HASCRORLF

       Return  1  if  the  pattern  contains any explicit matches for CR or LF
       characters, otherwise 0. The third argument should point to a  uint32_t
       variable.  An explicit match is either a literal CR or LF character, or
       \r or \n or one of the  equivalent  hexadecimal  or  octal  escape  se-
       quences.

         PCRE2_INFO_HEAPLIMIT

       If the pattern set a heap memory limit by including an item of the form
       (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu-
       ment should point to a uint32_t integer. If no such value has been set,
       the  call  to pcre2_pattern_info() returns the error PCRE2_ERROR_UNSET.
       Note that this limit will only be used during matching if  it  is  less
       than the limit set or defaulted by the caller of the match function.

         PCRE2_INFO_JCHANGED

       Return  1  if  the (?J) or (?-J) option setting is used in the pattern,
       otherwise 0. The third argument should point to  a  uint32_t  variable.
       (?J)  and  (?-J) set and unset the local PCRE2_DUPNAMES option, respec-
       tively.

         PCRE2_INFO_JITSIZE

       If the compiled pattern was successfully  processed  by  pcre2_jit_com-
       pile(),  return  the  size  of  the JIT compiled code, otherwise return
       zero. The third argument should point to a size_t variable.

         PCRE2_INFO_LASTCODETYPE

       Return  1 if there is a rightmost literal code unit that must exist in
       any  matched string, other than at its start. The third argument should
       point to a uint32_t variable. If there is no such value, 0 is returned.
       When 1 is returned, the code unit value itself can be  retrieved  using
       PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
       recorded  only if it follows something of variable length. For example,
       for the pattern /^a\d+z\d+/ the returned value is 1 (with "z"  returned
       from  PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is
       0.

         PCRE2_INFO_LASTCODEUNIT

       Return the value of the rightmost literal code unit that must exist  in
       any  matched  string,  other  than  at  its  start, for a pattern where
       PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu-
       ment should point to a uint32_t variable.

         PCRE2_INFO_MATCHEMPTY

       Return 1 if the pattern might match an empty string, otherwise  0.  The
       third argument should point to a uint32_t variable. When a pattern con-
       tains recursive subroutine calls it is not always possible to determine
       whether or not it can match an empty string. PCRE2 takes a cautious ap-
       proach and returns 1 in such cases.

         PCRE2_INFO_MATCHLIMIT

       If  the  pattern  set  a  match  limit by including an item of the form
       (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third  ar-
       gument  should  point  to a uint32_t integer. If no such value has been
       set, the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UN-
       SET. Note that this limit will only be used during matching  if  it  is
       less  than  the limit set or defaulted by the caller of the match func-
       tion.

         PCRE2_INFO_MAXLOOKBEHIND

       A lookbehind assertion moves back a certain number of  characters  (not
       code  units)  when  it starts to process each of its branches. This re-
       quest returns the largest of these backward moves. The  third  argument
       should point to a uint32_t integer. The simple assertions \b and \B re-
       quire  a one-character lookbehind and cause PCRE2_INFO_MAXLOOKBEHIND to
       return 1 in the absence of anything longer. \A also  registers  a  one-
       character  lookbehind, though it does not actually inspect the previous
       character.

       Note that this information is useful for multi-segment matching only if
       the pattern contains no nested lookbehinds. For  example,  the  pattern
       (?<=a(?<=ba)c)  returns  a  maximum  lookbehind  of  2,  but when it is
       processed, the first lookbehind moves back by two  characters,  matches
       one  character, then the nested lookbehind also moves back by two char-
       acters. This puts the matching point three characters earlier  than  it
       was  at the start.  PCRE2_INFO_MAXLOOKBEHIND is really only useful as a
       debugging tool. See the pcre2partial documentation for a discussion  of
       multi-segment matching.

         PCRE2_INFO_MINLENGTH

       If  a  minimum  length  for  matching subject strings was computed, its
       value is returned. Otherwise the returned value is 0. This value is not
       computed when PCRE2_NO_START_OPTIMIZE is set. The value is a number  of
       characters,  which in UTF mode may be different from the number of code
       units. The third argument should point  to  a  uint32_t  variable.  The
       value  is a lower bound to the length of any matching string. There may
       not be any strings of that length that do  actually  match,  but  every
       string that does match is at least that long.

         PCRE2_INFO_NAMECOUNT
         PCRE2_INFO_NAMEENTRYSIZE
         PCRE2_INFO_NAMETABLE

       PCRE2 supports the use of named as well as numbered capturing parenthe-
       ses.  The names are just an additional way of identifying the parenthe-
       ses, which still acquire numbers. Several convenience functions such as
       pcre2_substring_get_byname() are provided for extracting captured  sub-
       strings  by  name. It is also possible to extract the data directly, by
       first converting the name to a number in order to  access  the  correct
       pointers  in the output vector (described with pcre2_match() below). To
       do the conversion, you need to use the name-to-number map, which is de-
       scribed by these three values.

       The map consists of a number of  fixed-size  entries.  PCRE2_INFO_NAME-
       COUNT  gives  the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives
       the size of each entry in code units; both of these return  a  uint32_t
       value. The entry size depends on the length of the longest name.

       PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table.
       This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit li-
       brary,  the first two bytes of each entry are the number of the captur-
       ing parenthesis, most significant byte first. In  the  16-bit  library,
       the  pointer  points  to 16-bit code units, the first of which contains
       the parenthesis number. In the 32-bit library, the  pointer  points  to
       32-bit  code units, the first of which contains the parenthesis number.
       The rest of the entry is the corresponding name, zero terminated.

       The names are in alphabetical order. If (?| is used to create  multiple
       capture groups with the same number, as described in the section on du-
       plicate group numbers in the pcre2pattern page, the groups may be given
       the  same  name,  but  there  is only one entry in the table. Different
       names for groups of the same number are not permitted.

       Duplicate names for capture groups with different numbers  are  permit-
       ted, but only if PCRE2_DUPNAMES is set. They appear in the table in the
       order  in  which  they were found in the pattern. In the absence of (?|
       this is the order of increasing number; when (?| is used  this  is  not
       necessarily  the  case because later capture groups may have lower num-
       bers.

       As a simple example of the name/number table,  consider  the  following
       pattern  after  compilation by the 8-bit library (assume PCRE2_EXTENDED
       is set, so white space - including newlines - is ignored):

         (?<date> (?<year>(\d\d)?\d\d) -
         (?<month>\d\d) - (?<day>\d\d) )

       There are four named capture groups, so the table has four entries, and
       each entry in the table is eight bytes long. The table is  as  follows,
       with non-printing bytes shown in hexadecimal, and undefined bytes shown
       as ??:

         00 01 d  a  t  e  00 ??
         00 05 d  a  y  00 ?? ??
         00 04 m  o  n  t  h  00
         00 02 y  e  a  r  00 ??

       When  writing  code to extract data from named capture groups using the
       name-to-number map, remember that the length of the entries  is  likely
       to be different for each compiled pattern.
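
       The table layout described above can be walked with a few lines of
       C. This is a minimal sketch for the 8-bit library only; the function
       name is hypothetical, and its table, name_count, and entry_size
       parameters are assumed to hold the values obtained for
       PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT, and
       PCRE2_INFO_NAMEENTRYSIZE respectively.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: look up a group name in an 8-bit name table and
   return its group number, or -1 if the name is not present. When duplicate
   names are in use, only the first matching entry is reported. */
static int group_number_from_name(const unsigned char *table,
                                  uint32_t name_count,
                                  uint32_t entry_size,
                                  const char *name)
{
    for (uint32_t i = 0; i < name_count; i++) {
        /* Entries are fixed-size, so entry i starts at i * entry_size. */
        const unsigned char *entry = table + (size_t)i * entry_size;

        /* First two bytes: group number, most significant byte first. */
        int number = (entry[0] << 8) | entry[1];

        /* Remaining bytes: the zero-terminated name. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return number;
    }
    return -1;
}
```

       Applied to the four-entry table shown above with an entry size of 8,
       a lookup of "month" yields 4 and a lookup of "year" yields 2. The
       entry size is passed in rather than hard-coded because it varies
       from one compiled pattern to another.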

         PCRE2_INFO_NEWLINE

       The output is one of the following uint32_t values:

         PCRE2_NEWLINE_CR       Carriage return (CR)
         PCRE2_NEWLINE_LF       Linefeed (LF)
         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
         PCRE2_NEWLINE_ANY      Any Unicode line ending
         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
         PCRE2_NEWLINE_NUL      The NUL character (binary zero)

       This identifies the character sequence that will be recognized as mean-
       ing "newline" while matching.

         PCRE2_INFO_SIZE

       Return  the  size  of  the compiled pattern in bytes (for all three li-
       braries). The third argument should point to a  size_t  variable.  This
       value  includes  the  size  of the general data block that precedes the
       code units of the compiled pattern itself. The value that is used  when
       pcre2_compile()  is  getting memory in which to place the compiled pat-
       tern may be slightly larger than the value returned by this option, be-
       cause there are cases where the code that calculates the  size  has  to
       over-estimate.  Processing a pattern with the JIT compiler does not al-
       ter the value returned by this option.


INFORMATION ABOUT A PATTERN'S CALLOUTS

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       A script language that supports the use of string arguments in callouts
       might like to scan all the callouts in a  pattern  before  running  the
       match. This can be done by calling pcre2_callout_enumerate(). The first
       argument  is  a  pointer  to a compiled pattern, the second points to a
       callback function, and the third is arbitrary user data.  The  callback
       function  is  called  for  every callout in the pattern in the order in
       which they appear. Its first argument is a pointer to a callout enumer-
       ation block, and its second argument is the user_data  value  that  was
       passed  to  pcre2_callout_enumerate(). The contents of the callout enu-
       meration block are described in the pcre2callout  documentation,  which
       also gives further details about callouts.


SERIALIZATION AND PRECOMPILING

       It  is possible to save compiled patterns on disc or elsewhere, and re-
       load them later, subject to a number of restrictions. The host on which
       the patterns are reloaded must be running the same  version  of  PCRE2,
       with  the same code unit width, and must also have the same endianness,
       pointer width, and PCRE2_SIZE type. Before  compiled  patterns  can  be
       saved, they must be converted to a "serialized" form, which in the case
       of PCRE2 is really just a bytecode dump.  The functions whose names be-
       gin with pcre2_serialize_ are used for converting to and from the seri-
       alized  form.  They  are described in the pcre2serialize documentation.
       Note that PCRE2 serialization does not convert compiled patterns to  an
       abstract format like Java or .NET serialization.


THE MATCH DATA BLOCK

       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
         pcre2_general_context *gcontext);

       pcre2_match_data *pcre2_match_data_create_from_pattern(
         const pcre2_code *code, pcre2_general_context *gcontext);

       void pcre2_match_data_free(pcre2_match_data *match_data);

       Information  about  a  successful  or unsuccessful match is placed in a
       match data block, which is an opaque  structure  that  is  accessed  by
       function  calls.  In particular, the match data block contains a vector
       of offsets into the subject string that define the matched parts of the
       subject. This is known as the ovector.

       Before calling pcre2_match(), pcre2_dfa_match(),  or  pcre2_jit_match()
       you must create a match data block by calling one of the creation func-
       tions  above.  For pcre2_match_data_create(), the first argument is the
       number of pairs of offsets in the ovector.

       When using pcre2_match(), one pair of offsets is required  to  identify
       the  string that matched the whole pattern, with an additional pair for
       each captured substring. For example, a value of 4 creates enough space
       to record the matched portion of the subject plus three  captured  sub-
       strings.

       When  using  pcre2_dfa_match() there may be multiple matched substrings
       of different lengths at the same point  in  the  subject.  The  ovector
       should be made large enough to hold as many as are expected.

       A minimum of 1 pair is imposed by pcre2_match_data_create(),
       so it is always possible to return the overall matched  string  in  the
       case   of   pcre2_match()   or   the  longest  match  in  the  case  of
       pcre2_dfa_match(). The maximum number of pairs is 65535; if  the  first
       argument  of  pcre2_match_data_create()  is greater than this, 65535 is
       used.
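
       The sizing rules above amount to simple arithmetic, sketched below.
       Both function names are hypothetical. The first assumes its argument
       is the capture group count of the pattern (for example, the value
       obtained for PCRE2_INFO_CAPTURECOUNT); the second is only a model of
       the documented limits, not the library's own code.

```c
#include <stdint.h>

/* Pairs needed by pcre2_match(): one for the whole match plus one for each
   capture group. */
static uint32_t pairs_for_captures(uint32_t capture_count)
{
    return capture_count + 1;
}

/* Model of the documented limits on the first argument of
   pcre2_match_data_create(): at least 1 pair, at most 65535 pairs. */
static uint32_t effective_pairs(uint32_t requested)
{
    if (requested < 1)
        return 1;
    if (requested > 65535)
        return 65535;
    return requested;
}
```

       For a pattern with three capture groups, pairs_for_captures(3) gives
       4, which is the value used in the example above.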

       The second argument of pcre2_match_data_create() is a pointer to a gen-
       eral context, which can specify custom memory management for  obtaining
       the memory for the match data block. If you are not using custom memory
       management, pass NULL, which causes malloc() to be used.

       For  pcre2_match_data_create_from_pattern(),  the  first  argument is a
       pointer to a compiled pattern. The ovector is created to be exactly the
       right size to hold all the substrings  a  pattern  might  capture  when
       matched using pcre2_match(). You should not use this call when matching
       with  pcre2_dfa_match().  The  second  argument is again a pointer to a
       general context, but in this case if NULL is passed, the memory is  ob-
       tained  using the same allocator that was used for the compiled pattern
       (custom or default).

       A match data block can be used many times, with the same  or  different
       compiled  patterns. You can extract information from a match data block
       after a match operation has finished,  using  functions  that  are  de-
       scribed in the sections on matched strings and other match data below.

       When  a  call  of  pcre2_match()  fails, valid data is available in the
       match block only  when  the  error  is  PCRE2_ERROR_NOMATCH,  PCRE2_ER-
       ROR_PARTIAL,  or  one of the error codes for an invalid UTF string. Ex-
       actly what is available depends on the error, and is detailed below.

       When one of the matching functions is called, pointers to the  compiled
       pattern  and the subject string are set in the match data block so that
       they can be referenced by the extraction functions after  a  successful
       match. After running a match, you must not free a compiled pattern or a
       subject  string until after all operations on the match data block (for
       that match) have taken place,  unless,  in  the  case  of  the  subject
       string,  you  have used the PCRE2_COPY_MATCHED_SUBJECT option, which is
       described in the section entitled "Option bits for  pcre2_match()"  be-
       low.

       When  a match data block itself is no longer needed, it should be freed
       by calling pcre2_match_data_free(). If this function is called  with  a
       NULL argument, it returns immediately, without doing anything.


MEMORY USE FOR MATCH DATA BLOCKS

       PCRE2_SIZE pcre2_get_match_data_size(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_match_data_heapframes_size(
         pcre2_match_data *match_data);

       The  size of a match data block depends on the size of the ovector that
       it contains. The function pcre2_get_match_data_size() returns the size,
       in bytes, of the block that is its argument.

       When pcre2_match() runs interpretively (that is, without using JIT), it
       makes use of a vector of data frames for remembering backtracking posi-
       tions.  The size of each individual frame depends on the number of cap-
       turing parentheses in the  pattern  and  can  be  obtained  by  calling
       pcre2_pattern_info() with the PCRE2_INFO_FRAMESIZE option (see the sec-
       tion entitled "Information about a compiled pattern" above).

       Heap  memory is used for the frames vector; if the initial memory block
       turns out to be too small during  matching,  it  is  automatically  ex-
       panded.  When  pcre2_match()  returns, the memory is not freed, but re-
       mains attached to the match data  block,  for  use  by  any  subsequent
       matches  that  use  the  same block. It is automatically freed when the
       match data block itself is freed.

       You can find the current size of the frames vector that  a  match  data
       block  owns  by  calling  pcre2_get_match_data_heapframes_size(). For a
       newly created match data block the size will be  zero.  Some  types  of
       match may require a lot of frames and thus a large vector; applications
       that run in environments where memory is constrained can check this and
       free the match data block if the heap frames vector has become too big.


MATCHING A PATTERN: THE TRADITIONAL FUNCTION

       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       The  function pcre2_match() is called to match a subject string against
       a compiled pattern, which is passed in the code argument. You can  call
       pcre2_match() with the same code argument as many times as you like, in
       order  to  find multiple matches in the subject string or to match dif-
       ferent subject strings with the same pattern.

       This function is the main matching facility of the library, and it  op-
       erates  in  a Perl-like manner. For specialist use there is also an al-
       ternative matching function, which is described below  in  the  section
       about the pcre2_dfa_match() function.

       Here is an example of a simple call to pcre2_match():

         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
         int rc = pcre2_match(
           re,             /* result of pcre2_compile() */
           "some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           md,             /* the match data block */
           NULL);          /* a match context; NULL means use defaults */

       If  the  subject  string is zero-terminated, the length can be given as
       PCRE2_ZERO_TERMINATED. A match context must be provided if certain less
       common matching parameters are to be changed. For details, see the sec-
       tion on the match context above.

   The string to be matched by pcre2_match()

       The subject string is passed to pcre2_match() as a pointer in  subject,
       a  length  in  length, and a starting offset in startoffset. The length
       and offset are in code units, not characters.  That  is,  they  are  in
       bytes  for the 8-bit library, 16-bit code units for the 16-bit library,
       and 32-bit code units for the 32-bit library, whether or not  UTF  pro-
       cessing is enabled. As a special case, if subject is NULL and length is
       zero,  the  subject is assumed to be an empty string. If length is non-
       zero, an error occurs if subject is NULL.

       If startoffset is greater than the length of the subject, pcre2_match()
       returns PCRE2_ERROR_BADOFFSET. When the starting offset  is  zero,  the
       search  for a match starts at the beginning of the subject, and this is
       by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
       set must point to the start of a character, or to the end of  the  sub-
       ject  (in  UTF-32 mode, one code unit equals one character, so all off-
       sets are valid). Like the pattern string, the subject may  contain  bi-
       nary zeros.
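
       For the 8-bit library in UTF-8 mode, the rule about starting offsets
       can be checked before the call with a small helper. This is a sketch
       under the assumption that the subject is valid UTF-8; the function
       name is hypothetical and is not part of the PCRE2 API.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if startoffset is acceptable for a UTF-8
   subject of the given length in bytes, 0 otherwise. */
static int utf8_start_offset_ok(const unsigned char *subject, size_t length,
                                size_t startoffset)
{
    /* Beyond the end: pcre2_match() would return PCRE2_ERROR_BADOFFSET. */
    if (startoffset > length)
        return 0;

    /* The end of the subject is always an acceptable starting point. */
    if (startoffset == length)
        return 1;

    /* Otherwise the offset must not land on a continuation byte
       (bit pattern 10xxxxxx), which would be the middle of a character. */
    return (subject[startoffset] & 0xC0) != 0x80;
}
```

       For the six-byte subject "h\xC3\xA9llo", offsets 0, 1, and 6 are
       accepted, while offset 2 is rejected because it points into the
       middle of the two-byte character.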

       A  non-zero  starting offset is useful when searching for another match
       in the same subject by calling pcre2_match()  again  after  a  previous
       success.   Setting  startoffset  differs  from passing over a shortened
       string and setting PCRE2_NOTBOL in the case of a  pattern  that  begins
       with any kind of lookbehind. For example, consider the pattern

         \Biss\B

       which  finds  occurrences  of "iss" in the middle of words. (\B matches
       only if the current position in the subject is not  a  word  boundary.)
       When   applied   to   the   string  "Mississippi"  the  first  call  to
       pcre2_match() finds the first occurrence. If  pcre2_match()  is  called
       again with just the remainder of the subject, namely "issippi", it does
       not  match,  because  \B  is  always false at the start of the subject,
       which is deemed to be a word boundary.  However,  if  pcre2_match()  is
       passed the entire string again, but with startoffset set to 4, it finds
       the  second  occurrence  of "iss" because it is able to look behind the
       starting point to discover that it is preceded by a letter.
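
       The difference between the two approaches can be reproduced without
       the library. The sketch below is a hand-written stand-in for the
       pattern \Biss\B on ASCII subjects, not PCRE2 code, and all function
       names are hypothetical. It only illustrates why being able to look
       at the character before the starting offset changes the result.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* 1 if position i is inside the subject and holds a word character. */
static int is_word_at(const char *s, size_t length, ptrdiff_t i)
{
    if (i < 0 || (size_t)i >= length)
        return 0;   /* outside the subject counts as a non-word character */
    unsigned char c = (unsigned char)s[i];
    return isalnum(c) != 0 || c == '_';
}

/* \B: the characters on either side of pos are both word characters or
   both non-word characters. */
static int not_word_boundary(const char *s, size_t length, size_t pos)
{
    return is_word_at(s, length, (ptrdiff_t)pos - 1) ==
           is_word_at(s, length, (ptrdiff_t)pos);
}

/* Search from startoffset for "iss" with \B on both sides. Returns the
   offset of the match, or -1 if there is none. */
static ptrdiff_t find_inner_iss(const char *s, size_t length,
                                size_t startoffset)
{
    for (size_t p = startoffset; p + 3 <= length; p++) {
        if (memcmp(s + p, "iss", 3) == 0 &&
            not_word_boundary(s, length, p) &&
            not_word_boundary(s, length, p + 3))
            return (ptrdiff_t)p;
    }
    return -1;
}
```

       Searching "Mississippi" from offset 0 finds the match at offset 1,
       and searching the same string from offset 4 finds the second one at
       offset 4. Searching the shortened string "issippi" from offset 0
       finds nothing, because the start of the subject is a word boundary.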

       Finding all the matches in a subject is tricky  when  the  pattern  can
       match an empty string. It is possible to emulate Perl's /g behaviour by
       first   trying   the   match   again  at  the  same  offset,  with  the
       PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED options,  and  then  if  that
       fails,  advancing  the  starting  offset  and  trying an ordinary match
       again. There is some code that demonstrates  how  to  do  this  in  the
       pcre2demo  sample  program. In the most general case, you have to check
       to see if the newline convention recognizes CRLF as a newline,  and  if
       so,  and the current character is CR followed by LF, advance the start-
       ing offset by two characters instead of one.
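
       The "advance the starting offset" step can be sketched as a small
       function for the 8-bit library. The function name and its parameters
       are hypothetical. The crlf_is_newline flag is assumed to have been
       derived by the caller from the PCRE2_INFO_NEWLINE value (true for
       PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANY, and PCRE2_NEWLINE_ANYCRLF),
       and utf8 is assumed to reflect whether the pattern was compiled in
       UTF mode. The match calls themselves are left out.

```c
#include <stddef.h>

/* Hypothetical helper: given the offset at which the anchored, non-empty
   retry failed, return the offset for the next ordinary match attempt. */
static size_t advance_past_character(const unsigned char *subject,
                                     size_t length, size_t offset,
                                     int crlf_is_newline, int utf8)
{
    if (offset >= length)
        return length;          /* nothing left to step over */

    /* Step over CR LF as one unit when CRLF is a recognized newline. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    offset++;

    /* In UTF-8, also step over the continuation bytes (10xxxxxx) so that
       the new offset points at the start of a character. */
    if (utf8) {
        while (offset < length && (subject[offset] & 0xC0) == 0x80)
            offset++;
    }
    return offset;
}
```

       A caller would stop the loop once the offset reaches the end of the
       subject instead of calling this function again.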

       If a non-zero starting offset is passed when the pattern is anchored, a
       single attempt to match at the given offset is made. This can only suc-
       ceed if the pattern does not require the match to be at  the  start  of
       the  subject.  In other words, the anchoring must be the result of set-
       ting the PCRE2_ANCHORED option or the use of .* with PCRE2_DOTALL,  not
       by starting the pattern with ^ or \A.

   Option bits for pcre2_match()

       The unused bits of the options argument for pcre2_match() must be zero.
       The    only    bits    that    may    be    set   are   PCRE2_ANCHORED,
       PCRE2_COPY_MATCHED_SUBJECT, PCRE2_DISABLE_RECURSELOOP_CHECK,  PCRE2_EN-
       DANCHORED,       PCRE2_NOTBOL,       PCRE2_NOTEOL,      PCRE2_NOTEMPTY,
       PCRE2_NOTEMPTY_ATSTART,  PCRE2_NO_JIT,  PCRE2_NO_UTF_CHECK,  PCRE2_PAR-
       TIAL_HARD, and PCRE2_PARTIAL_SOFT.  Their action is described below.

       Setting  PCRE2_ANCHORED  or PCRE2_ENDANCHORED at match time is not sup-
       ported by the just-in-time (JIT) compiler. If it is set,  JIT  matching
       is  disabled  and  the  interpretive  code  in  pcre2_match()  is  run.
       PCRE2_DISABLE_RECURSELOOP_CHECK is  ignored  by  JIT,  but  apart  from
       PCRE2_NO_JIT  (obviously),  the remaining options are supported for JIT
       matching.

         PCRE2_ANCHORED

       The PCRE2_ANCHORED option limits pcre2_match() to matching at the first
       matching position. If a pattern was compiled  with  PCRE2_ANCHORED,  or
       turned  out to be anchored by virtue of its contents, it cannot be made
       unanchored at matching time. Note that setting the option at match time
       disables JIT matching.

         PCRE2_COPY_MATCHED_SUBJECT

       By  default,  a  pointer to the subject is remembered in the match data
       block so that, after a successful match, it can be  referenced  by  the
       substring  extraction  functions.  This means that the subject's memory
       must not be freed until all such operations are complete. For some  ap-
       plications  where the lifetime of the subject string is not guaranteed,
       it may be necessary to make a copy of the subject  string,  but  it  is
       wasteful  to do this unless the match is successful. After a successful
       match, if PCRE2_COPY_MATCHED_SUBJECT is set, the subject is copied  and
       the  new  pointer  is remembered in the match data block instead of the
       original subject pointer. The memory allocator that was  used  for  the
       match  block  itself  is  used.  The  copy  is automatically freed when
       pcre2_match_data_free() is called to free the match data block.  It  is
       also automatically freed if the match data block is re-used for another
       match operation.

         PCRE2_DISABLE_RECURSELOOP_CHECK

       This  option  is relevant only to pcre2_match() for interpretive match-
       ing.   It  is  ignored  when  JIT  is  used,  and  is   forbidden   for
       pcre2_dfa_match().

       The use of recursion in patterns can lead to infinite loops. In the in-
       terpretive  matcher  these  would  be eventually caught by the match or
       heap limits, but this could take a long time and/or use a lot of memory
       if the limits are large. There is therefore a check  at  the  start  of
       each  recursion.   If  the  same  group is still active from a previous
       call, and the current subject pointer is the same  as  it  was  at  the
       start  of  that group, and the furthest inspected character of the sub-
       ject has not changed, an error is generated.
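
       The three conditions of the check can be expressed as a comparison
       against the recursions that are currently active. The sketch below
       is purely illustrative: the structure and function are hypothetical
       and do not reflect how PCRE2 stores this information internally.

```c
#include <stddef.h>

/* Hypothetical record of one recursion that is still active. */
struct active_recursion {
    unsigned group;             /* number of the group being recursed into */
    size_t subject_pos;         /* subject position when it was entered */
    size_t furthest_inspected;  /* furthest subject position seen by then */
};

/* Returns 1 if starting a recursion into `group` now would repeat a state
   that is already active, which is the condition described above. */
static int recursion_would_loop(const struct active_recursion *stack,
                                size_t depth, unsigned group,
                                size_t subject_pos,
                                size_t furthest_inspected)
{
    for (size_t i = 0; i < depth; i++) {
        if (stack[i].group == group &&
            stack[i].subject_pos == subject_pos &&
            stack[i].furthest_inspected == furthest_inspected)
            return 1;
    }
    return 0;
}
```

       A change in any one of the three values is enough for the recursion
       to be treated as making progress.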

       There are rare cases of matches that would complete,  but  nevertheless
       trigger  this  error.  This  option  disables the check. It is provided
       mainly for testing when comparing JIT and interpretive behaviour.

         PCRE2_ENDANCHORED

       If the PCRE2_ENDANCHORED option is set, any string  that  pcre2_match()
       matches  must be right at the end of the subject string. Note that set-
       ting the option at match time disables JIT matching.
2830*22dc650dSSadaf Ebrahimi
2831*22dc650dSSadaf Ebrahimi         PCRE2_NOTBOL
2832*22dc650dSSadaf Ebrahimi
       This option specifies that the first character of the subject string is
       not the beginning of a line, so the circumflex metacharacter should not
2835*22dc650dSSadaf Ebrahimi       match  before  it.  Setting  this without having set PCRE2_MULTILINE at
2836*22dc650dSSadaf Ebrahimi       compile time causes circumflex never to match. This option affects only
2837*22dc650dSSadaf Ebrahimi       the behaviour of the circumflex metacharacter. It does not affect \A.
2838*22dc650dSSadaf Ebrahimi
2839*22dc650dSSadaf Ebrahimi         PCRE2_NOTEOL
2840*22dc650dSSadaf Ebrahimi
2841*22dc650dSSadaf Ebrahimi       This option specifies that the end of the subject string is not the end
2842*22dc650dSSadaf Ebrahimi       of a line, so the dollar metacharacter should not match it nor  (except
2843*22dc650dSSadaf Ebrahimi       in  multiline mode) a newline immediately before it. Setting this with-
2844*22dc650dSSadaf Ebrahimi       out having set PCRE2_MULTILINE at compile time causes dollar  never  to
2845*22dc650dSSadaf Ebrahimi       match. This option affects only the behaviour of the dollar metacharac-
2846*22dc650dSSadaf Ebrahimi       ter. It does not affect \Z or \z.
2847*22dc650dSSadaf Ebrahimi
2848*22dc650dSSadaf Ebrahimi         PCRE2_NOTEMPTY
2849*22dc650dSSadaf Ebrahimi
2850*22dc650dSSadaf Ebrahimi       An empty string is not considered to be a valid match if this option is
2851*22dc650dSSadaf Ebrahimi       set.  If  there are alternatives in the pattern, they are tried. If all
2852*22dc650dSSadaf Ebrahimi       the alternatives match the empty string, the entire  match  fails.  For
2853*22dc650dSSadaf Ebrahimi       example, if the pattern
2854*22dc650dSSadaf Ebrahimi
2855*22dc650dSSadaf Ebrahimi         a?b?
2856*22dc650dSSadaf Ebrahimi
2857*22dc650dSSadaf Ebrahimi       is  applied  to  a  string not beginning with "a" or "b", it matches an
2858*22dc650dSSadaf Ebrahimi       empty string at the start of the subject. With PCRE2_NOTEMPTY set, this
2859*22dc650dSSadaf Ebrahimi       match is not valid, so pcre2_match() searches further into  the  string
2860*22dc650dSSadaf Ebrahimi       for occurrences of "a" or "b".
2861*22dc650dSSadaf Ebrahimi
2862*22dc650dSSadaf Ebrahimi         PCRE2_NOTEMPTY_ATSTART
2863*22dc650dSSadaf Ebrahimi
2864*22dc650dSSadaf Ebrahimi       This  is  like PCRE2_NOTEMPTY, except that it locks out an empty string
2865*22dc650dSSadaf Ebrahimi       match only at the first matching position, that is, at the start of the
2866*22dc650dSSadaf Ebrahimi       subject plus the starting offset. An empty string match  later  in  the
2867*22dc650dSSadaf Ebrahimi       subject is permitted.  If the pattern is anchored, such a match can oc-
2868*22dc650dSSadaf Ebrahimi       cur only if the pattern contains \K.
2869*22dc650dSSadaf Ebrahimi
2870*22dc650dSSadaf Ebrahimi         PCRE2_NO_JIT
2871*22dc650dSSadaf Ebrahimi
2872*22dc650dSSadaf Ebrahimi       By   default,   if   a  pattern  has  been  successfully  processed  by
2873*22dc650dSSadaf Ebrahimi       pcre2_jit_compile(), JIT is automatically used  when  pcre2_match()  is
2874*22dc650dSSadaf Ebrahimi       called  with  options  that JIT supports. Setting PCRE2_NO_JIT disables
2875*22dc650dSSadaf Ebrahimi       the use of JIT; it forces matching to be done by the interpreter.
2876*22dc650dSSadaf Ebrahimi
2877*22dc650dSSadaf Ebrahimi         PCRE2_NO_UTF_CHECK
2878*22dc650dSSadaf Ebrahimi
2879*22dc650dSSadaf Ebrahimi       When PCRE2_UTF is set at compile time, the validity of the subject as a
2880*22dc650dSSadaf Ebrahimi       UTF  string  is  checked  unless  PCRE2_NO_UTF_CHECK   is   passed   to
2881*22dc650dSSadaf Ebrahimi       pcre2_match() or PCRE2_MATCH_INVALID_UTF was passed to pcre2_compile().
2882*22dc650dSSadaf Ebrahimi       The latter special case is discussed in detail in the pcre2unicode doc-
2883*22dc650dSSadaf Ebrahimi       umentation.
2884*22dc650dSSadaf Ebrahimi
2885*22dc650dSSadaf Ebrahimi       In  the default case, if a non-zero starting offset is given, the check
2886*22dc650dSSadaf Ebrahimi       is applied only to that part of the subject  that  could  be  inspected
2887*22dc650dSSadaf Ebrahimi       during  matching,  and there is a check that the starting offset points
2888*22dc650dSSadaf Ebrahimi       to the first code unit of a character or to the end of the subject.  If
2889*22dc650dSSadaf Ebrahimi       there  are no lookbehind assertions in the pattern, the check starts at
2890*22dc650dSSadaf Ebrahimi       the starting offset.  Otherwise, it starts at the length of the longest
2891*22dc650dSSadaf Ebrahimi       lookbehind before the starting offset, or at the start of  the  subject
2892*22dc650dSSadaf Ebrahimi       if  there are not that many characters before the starting offset. Note
2893*22dc650dSSadaf Ebrahimi       that the sequences \b and \B are one-character lookbehinds.
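
       As a rough illustration of the rule above, the following sketch works
       out where the validity check would begin for a UTF-8 subject, and
       whether a starting offset lands on the first code unit of a character.
       Both function names are hypothetical helpers, not part of the PCRE2
       API, and the sketch assumes the subject is already valid UTF-8 and
       that max_lookbehind is the longest lookbehind counted in characters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if offset is the end of the subject or points
   at a byte that is not a UTF-8 continuation byte (10xxxxxx). */
static bool is_char_start(const unsigned char *subject, size_t length,
                          size_t offset)
{
    assert(offset <= length);
    return offset == length || (subject[offset] & 0xC0) != 0x80;
}

/* Hypothetical helper: step back max_lookbehind characters from
   start_offset, stopping early at the start of the subject. The result
   models the code unit offset at which the check would begin. */
static size_t utf_check_start(const unsigned char *subject,
                              size_t start_offset, size_t max_lookbehind)
{
    size_t pos = start_offset;
    while (max_lookbehind > 0 && pos > 0) {
        pos--;                                   /* last byte of previous char */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;                               /* skip continuation bytes */
        max_lookbehind--;
    }
    return pos;
}
```

       With no lookbehind (max_lookbehind of 0) the result is the starting
       offset itself; a pattern that uses \b or \B would pass 1.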
2894*22dc650dSSadaf Ebrahimi
2895*22dc650dSSadaf Ebrahimi       The check is carried out before any other processing takes place, and a
2896*22dc650dSSadaf Ebrahimi       negative error code is returned if the check fails. There  are  several
2897*22dc650dSSadaf Ebrahimi       UTF  error  codes  for each code unit width, corresponding to different
2898*22dc650dSSadaf Ebrahimi       problems with the code unit sequence. There are discussions  about  the
2899*22dc650dSSadaf Ebrahimi       validity  of  UTF-8  strings, UTF-16 strings, and UTF-32 strings in the
2900*22dc650dSSadaf Ebrahimi       pcre2unicode documentation.
2901*22dc650dSSadaf Ebrahimi
2902*22dc650dSSadaf Ebrahimi       If you know that your subject is valid, and you want to skip this check
2903*22dc650dSSadaf Ebrahimi       for performance reasons, you can set the PCRE2_NO_UTF_CHECK option when
2904*22dc650dSSadaf Ebrahimi       calling pcre2_match(). You might want to do this  for  the  second  and
2905*22dc650dSSadaf Ebrahimi       subsequent  calls  to pcre2_match() if you are making repeated calls to
2906*22dc650dSSadaf Ebrahimi       find multiple matches in the same subject string.
2907*22dc650dSSadaf Ebrahimi
2908*22dc650dSSadaf Ebrahimi       Warning: Unless PCRE2_MATCH_INVALID_UTF was set at compile  time,  when
2909*22dc650dSSadaf Ebrahimi       PCRE2_NO_UTF_CHECK  is  set  at match time the effect of passing an in-
2910*22dc650dSSadaf Ebrahimi       valid string as a subject, or an invalid value of startoffset, is unde-
2911*22dc650dSSadaf Ebrahimi       fined.  Your program may crash or loop indefinitely or give  wrong  re-
2912*22dc650dSSadaf Ebrahimi       sults.
2913*22dc650dSSadaf Ebrahimi
2914*22dc650dSSadaf Ebrahimi         PCRE2_PARTIAL_HARD
2915*22dc650dSSadaf Ebrahimi         PCRE2_PARTIAL_SOFT
2916*22dc650dSSadaf Ebrahimi
2917*22dc650dSSadaf Ebrahimi       These options turn on the partial matching feature. A partial match oc-
2918*22dc650dSSadaf Ebrahimi       curs  if  the  end  of  the subject string is reached successfully, but
2919*22dc650dSSadaf Ebrahimi       there are not enough subject characters to complete the match. In addi-
2920*22dc650dSSadaf Ebrahimi       tion, either at least one character must have  been  inspected  or  the
2921*22dc650dSSadaf Ebrahimi       pattern  must  contain  a  lookbehind,  or the pattern must be one that
2922*22dc650dSSadaf Ebrahimi       could match an empty string.
2923*22dc650dSSadaf Ebrahimi
2924*22dc650dSSadaf Ebrahimi       If this situation arises when PCRE2_PARTIAL_SOFT  (but  not  PCRE2_PAR-
2925*22dc650dSSadaf Ebrahimi       TIAL_HARD) is set, matching continues by testing any remaining alterna-
2926*22dc650dSSadaf Ebrahimi       tives.  Only  if  no complete match can be found is PCRE2_ERROR_PARTIAL
2927*22dc650dSSadaf Ebrahimi       returned instead of PCRE2_ERROR_NOMATCH.  In  other  words,  PCRE2_PAR-
2928*22dc650dSSadaf Ebrahimi       TIAL_SOFT  specifies  that  the  caller is prepared to handle a partial
2929*22dc650dSSadaf Ebrahimi       match, but only if no complete match can be found.
2930*22dc650dSSadaf Ebrahimi
2931*22dc650dSSadaf Ebrahimi       If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In  this
2932*22dc650dSSadaf Ebrahimi       case,  if  a  partial match is found, pcre2_match() immediately returns
2933*22dc650dSSadaf Ebrahimi       PCRE2_ERROR_PARTIAL, without considering  any  other  alternatives.  In
       other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid-
       ered to be more important than an alternative complete match.
2936*22dc650dSSadaf Ebrahimi
2937*22dc650dSSadaf Ebrahimi       There is a more detailed discussion of partial and multi-segment match-
2938*22dc650dSSadaf Ebrahimi       ing, with examples, in the pcre2partial documentation.
2939*22dc650dSSadaf Ebrahimi
2940*22dc650dSSadaf Ebrahimi
2941*22dc650dSSadaf EbrahimiNEWLINE HANDLING WHEN MATCHING
2942*22dc650dSSadaf Ebrahimi
2943*22dc650dSSadaf Ebrahimi       When  PCRE2 is built, a default newline convention is set; this is usu-
2944*22dc650dSSadaf Ebrahimi       ally the standard convention for the operating system. The default  can
2945*22dc650dSSadaf Ebrahimi       be  overridden  in a compile context by calling pcre2_set_newline(). It
2946*22dc650dSSadaf Ebrahimi       can also be overridden by starting a pattern string with, for  example,
2947*22dc650dSSadaf Ebrahimi       (*CRLF),  as  described  in  the  section on newline conventions in the
2948*22dc650dSSadaf Ebrahimi       pcre2pattern page. During matching, the newline choice affects the  be-
2949*22dc650dSSadaf Ebrahimi       haviour  of the dot, circumflex, and dollar metacharacters. It may also
2950*22dc650dSSadaf Ebrahimi       alter the way the match starting position is  advanced  after  a  match
2951*22dc650dSSadaf Ebrahimi       failure for an unanchored pattern.
2952*22dc650dSSadaf Ebrahimi
2953*22dc650dSSadaf Ebrahimi       When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
2954*22dc650dSSadaf Ebrahimi       set  as  the  newline convention, and a match attempt for an unanchored
2955*22dc650dSSadaf Ebrahimi       pattern fails when the current starting position is at a CRLF sequence,
2956*22dc650dSSadaf Ebrahimi       and the pattern contains no explicit matches for CR or  LF  characters,
2957*22dc650dSSadaf Ebrahimi       the  match  position  is  advanced by two characters instead of one, in
2958*22dc650dSSadaf Ebrahimi       other words, to after the CRLF.
2959*22dc650dSSadaf Ebrahimi
2960*22dc650dSSadaf Ebrahimi       The above rule is a compromise that makes the most common cases work as
2961*22dc650dSSadaf Ebrahimi       expected. For example, if the pattern is .+A (and the PCRE2_DOTALL  op-
2962*22dc650dSSadaf Ebrahimi       tion  is  not set), it does not match the string "\r\nA" because, after
2963*22dc650dSSadaf Ebrahimi       failing at the start, it skips both the CR and the LF before  retrying.
2964*22dc650dSSadaf Ebrahimi       However,  the  pattern  [\r\n]A does match that string, because it con-
2965*22dc650dSSadaf Ebrahimi       tains an explicit CR or LF reference, and so advances only by one char-
2966*22dc650dSSadaf Ebrahimi       acter after the first failure.
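
       The advance rule can be modelled in a few lines. This is only a sketch
       of the behaviour described above, not the library's code: the function
       name and both boolean parameters are hypothetical, and it assumes
       single-code-unit characters (no UTF).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of where the next match attempt starts after an
   unanchored attempt at pos has failed.
   crlf_is_newline:               the convention is CRLF, ANYCRLF, or ANY
   pattern_has_explicit_cr_or_lf: the pattern mentions CR or LF explicitly */
static size_t next_start_after_failure(const char *subject, size_t length,
                                       size_t pos, bool crlf_is_newline,
                                       bool pattern_has_explicit_cr_or_lf)
{
    assert(pos < length);
    bool at_crlf = pos + 1 < length &&
                   subject[pos] == '\r' && subject[pos + 1] == '\n';
    if (crlf_is_newline && !pattern_has_explicit_cr_or_lf && at_crlf)
        return pos + 2;          /* step over the CRLF as a unit */
    return pos + 1;              /* ordinary one-character advance */
}
```

       For the subject "\r\nA", a pattern such as .+A gives a next start of
       2, whereas [\r\n]A gives 1, which is why only the latter can match.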
2967*22dc650dSSadaf Ebrahimi
       An explicit match for CR or LF is either a literal appearance of one of
2969*22dc650dSSadaf Ebrahimi       those characters in the pattern, or one of the \r or \n  or  equivalent
2970*22dc650dSSadaf Ebrahimi       octal or hexadecimal escape sequences. Implicit matches such as [^X] do
2971*22dc650dSSadaf Ebrahimi       not  count, nor does \s, even though it includes CR and LF in the char-
2972*22dc650dSSadaf Ebrahimi       acters that it matches.
2973*22dc650dSSadaf Ebrahimi
2974*22dc650dSSadaf Ebrahimi       Notwithstanding the above, anomalous effects may still occur when  CRLF
2975*22dc650dSSadaf Ebrahimi       is a valid newline sequence and explicit \r or \n escapes appear in the
2976*22dc650dSSadaf Ebrahimi       pattern.
2977*22dc650dSSadaf Ebrahimi
2978*22dc650dSSadaf Ebrahimi
HOW pcre2_match() RETURNS A STRING AND CAPTURED SUBSTRINGS
2980*22dc650dSSadaf Ebrahimi
2981*22dc650dSSadaf Ebrahimi       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
2982*22dc650dSSadaf Ebrahimi
2983*22dc650dSSadaf Ebrahimi       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
2984*22dc650dSSadaf Ebrahimi
2985*22dc650dSSadaf Ebrahimi       In  general, a pattern matches a certain portion of the subject, and in
2986*22dc650dSSadaf Ebrahimi       addition, further substrings from the subject  may  be  picked  out  by
2987*22dc650dSSadaf Ebrahimi       parenthesized  parts  of  the  pattern.  Following the usage in Jeffrey
2988*22dc650dSSadaf Ebrahimi       Friedl's book, this is called "capturing"  in  what  follows,  and  the
2989*22dc650dSSadaf Ebrahimi       phrase  "capture  group" (Perl terminology) is used for a fragment of a
2990*22dc650dSSadaf Ebrahimi       pattern that picks out a substring. PCRE2 supports several other  kinds
2991*22dc650dSSadaf Ebrahimi       of parenthesized group that do not cause substrings to be captured. The
2992*22dc650dSSadaf Ebrahimi       pcre2_pattern_info()  function can be used to find out how many capture
2993*22dc650dSSadaf Ebrahimi       groups there are in a compiled pattern.
2994*22dc650dSSadaf Ebrahimi
2995*22dc650dSSadaf Ebrahimi       You can use auxiliary functions for accessing  captured  substrings  by
2996*22dc650dSSadaf Ebrahimi       number or by name, as described in sections below.
2997*22dc650dSSadaf Ebrahimi
2998*22dc650dSSadaf Ebrahimi       Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
2999*22dc650dSSadaf Ebrahimi       ues,  called  the  ovector,  which  contains  the  offsets  of captured
3000*22dc650dSSadaf Ebrahimi       strings.  It  is  part  of  the  match  data   block.    The   function
3001*22dc650dSSadaf Ebrahimi       pcre2_get_ovector_pointer()  returns  the  address  of the ovector, and
3002*22dc650dSSadaf Ebrahimi       pcre2_get_ovector_count() returns the number of pairs of values it con-
3003*22dc650dSSadaf Ebrahimi       tains.
3004*22dc650dSSadaf Ebrahimi
3005*22dc650dSSadaf Ebrahimi       Within the ovector, the first in each pair of values is set to the off-
3006*22dc650dSSadaf Ebrahimi       set of the first code unit of a substring, and the second is set to the
       offset of the first code unit after the end of a substring. These val-
       ues are always code unit offsets, not character offsets. That is, they
       count bytes in the 8-bit library, 16-bit units in the 16-bit library,
       and 32-bit units in the 32-bit library.
3011*22dc650dSSadaf Ebrahimi
3012*22dc650dSSadaf Ebrahimi       After a partial match  (error  return  PCRE2_ERROR_PARTIAL),  only  the
3013*22dc650dSSadaf Ebrahimi       first  pair  of  offsets  (that is, ovector[0] and ovector[1]) are set.
3014*22dc650dSSadaf Ebrahimi       They identify the part of the subject that was partially  matched.  See
3015*22dc650dSSadaf Ebrahimi       the pcre2partial documentation for details of partial matching.
3016*22dc650dSSadaf Ebrahimi
3017*22dc650dSSadaf Ebrahimi       After  a  fully  successful match, the first pair of offsets identifies
3018*22dc650dSSadaf Ebrahimi       the portion of the subject string that was matched by the  entire  pat-
3019*22dc650dSSadaf Ebrahimi       tern.  The  next  pair is used for the first captured substring, and so
3020*22dc650dSSadaf Ebrahimi       on. The value returned by pcre2_match() is one more  than  the  highest
3021*22dc650dSSadaf Ebrahimi       numbered  pair  that  has been set. For example, if two substrings have
3022*22dc650dSSadaf Ebrahimi       been captured, the returned value is 3. If there are no  captured  sub-
3023*22dc650dSSadaf Ebrahimi       strings, the return value from a successful match is 1, indicating that
3024*22dc650dSSadaf Ebrahimi       just the first pair of offsets has been set.
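
       The following sketch shows one way the offset pairs and the return
       value might be used together to copy out a substring in the 8-bit
       case. It does not call PCRE2: copy_group is a hypothetical helper,
       size_t stands in for PCRE2_SIZE, and UNSET_OFFSET stands in for
       PCRE2_UNSET. The ovector and rc arguments are assumed to come from
       pcre2_get_ovector_pointer() and a successful pcre2_match() call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define UNSET_OFFSET (~(size_t)0)   /* stand-in for PCRE2_UNSET */

/* Hypothetical helper: copy capture group `group` (0 is the whole match)
   into out as a zero-terminated string. Returns the substring length, or
   -1 if the pair was not set, is inverted, or does not fit in out. */
static long copy_group(const char *subject, const size_t *ovector, int rc,
                       int group, char *out, size_t out_size)
{
    assert(rc > 0 && group >= 0 && out_size > 0);
    if (group >= rc)
        return -1;                        /* beyond the highest set pair */
    size_t start = ovector[2 * group];    /* first code unit of substring */
    size_t end = ovector[2 * group + 1];  /* first code unit after it */
    if (start == UNSET_OFFSET || start > end)
        return -1;                        /* unused group, or \K inversion */
    size_t len = end - start;
    if (len >= out_size)
        return -1;                        /* no room for the trailing zero */
    memcpy(out, subject + start, len);
    out[len] = '\0';
    return (long)len;
}
```

       In real code the auxiliary functions described in the sections below
       are usually more convenient than handling the offsets by hand.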
3025*22dc650dSSadaf Ebrahimi
3026*22dc650dSSadaf Ebrahimi       If  a  pattern uses the \K escape sequence within a positive assertion,
3027*22dc650dSSadaf Ebrahimi       the reported start of a successful match can be greater than the end of
3028*22dc650dSSadaf Ebrahimi       the match.  For example, if the pattern  (?=ab\K)  is  matched  against
3029*22dc650dSSadaf Ebrahimi       "ab", the start and end offset values for the match are 2 and 0.
3030*22dc650dSSadaf Ebrahimi
3031*22dc650dSSadaf Ebrahimi       If  a  capture group is matched repeatedly within a single match opera-
3032*22dc650dSSadaf Ebrahimi       tion, it is the last portion of the subject that it matched that is re-
3033*22dc650dSSadaf Ebrahimi       turned.
3034*22dc650dSSadaf Ebrahimi
3035*22dc650dSSadaf Ebrahimi       If the ovector is too small to hold all the captured substring offsets,
3036*22dc650dSSadaf Ebrahimi       as much as possible is filled in, and the function returns a  value  of
3037*22dc650dSSadaf Ebrahimi       zero.  If captured substrings are not of interest, pcre2_match() may be
3038*22dc650dSSadaf Ebrahimi       called with a match data block whose ovector is of minimum length (that
3039*22dc650dSSadaf Ebrahimi       is, one pair).
3040*22dc650dSSadaf Ebrahimi
3041*22dc650dSSadaf Ebrahimi       It is possible for capture group number n+1 to match some part  of  the
3042*22dc650dSSadaf Ebrahimi       subject  when  group  n  has  not been used at all. For example, if the
3043*22dc650dSSadaf Ebrahimi       string "abc" is matched against the pattern (a|(z))(bc) the return from
3044*22dc650dSSadaf Ebrahimi       the function is 4, and groups 1 and 3 are matched, but 2 is  not.  When
3045*22dc650dSSadaf Ebrahimi       this  happens,  both values in the offset pairs corresponding to unused
3046*22dc650dSSadaf Ebrahimi       groups are set to PCRE2_UNSET.
3047*22dc650dSSadaf Ebrahimi
3048*22dc650dSSadaf Ebrahimi       Offset values that correspond to unused groups at the end  of  the  ex-
3049*22dc650dSSadaf Ebrahimi       pression  are also set to PCRE2_UNSET. For example, if the string "abc"
3050*22dc650dSSadaf Ebrahimi       is matched against the pattern (abc)(x(yz)?)? groups 2 and  3  are  not
3051*22dc650dSSadaf Ebrahimi       matched.  The  return  from the function is 2, because the highest used
3052*22dc650dSSadaf Ebrahimi       capture group number is 1. The offsets for the second and third capture
3053*22dc650dSSadaf Ebrahimi       groups (assuming the vector is large enough,  of  course)  are  set  to
3054*22dc650dSSadaf Ebrahimi       PCRE2_UNSET.
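
       The two examples above can be written out as ovector contents. In this
       sketch the arrays are filled in by hand from the description rather
       than produced by PCRE2, UNSET_OFFSET stands in for PCRE2_UNSET, and
       both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define UNSET_OFFSET (~(size_t)0)   /* stand-in for PCRE2_UNSET */

/* Hypothetical helper: true if the given pair holds real offsets. */
static bool group_is_set(const size_t *ovector, size_t pair_count,
                         size_t group)
{
    assert(group < pair_count);
    return ovector[2 * group] != UNSET_OFFSET;
}

/* Hypothetical helper: one more than the highest numbered set pair, which
   is the value a successful match is described as returning when the
   ovector is large enough. Returns 0 if no pair is set. */
static int highest_set_pair_plus_one(const size_t *ovector,
                                     size_t pair_count)
{
    for (size_t i = pair_count; i > 0; i--)
        if (group_is_set(ovector, pair_count, i - 1))
            return (int)i;
    return 0;
}
```

       For (a|(z))(bc) against "abc" the pairs are (0,3), (0,1), unset, and
       (1,3), so the result is 4 even though group 2 is unset. For
       (abc)(x(yz)?)? the pairs are (0,3), (0,3), unset, and unset, giving 2.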
3055*22dc650dSSadaf Ebrahimi
3056*22dc650dSSadaf Ebrahimi       Elements in the ovector that do not correspond to capturing parentheses
3057*22dc650dSSadaf Ebrahimi       in the pattern are never changed. That is, if a pattern contains n cap-
3058*22dc650dSSadaf Ebrahimi       turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
3059*22dc650dSSadaf Ebrahimi       pcre2_match().  The  other  elements retain whatever values they previ-
3060*22dc650dSSadaf Ebrahimi       ously had. After a failed match attempt, the contents  of  the  ovector
3061*22dc650dSSadaf Ebrahimi       are unchanged.
3062*22dc650dSSadaf Ebrahimi
3063*22dc650dSSadaf Ebrahimi
3064*22dc650dSSadaf EbrahimiOTHER INFORMATION ABOUT A MATCH
3065*22dc650dSSadaf Ebrahimi
3066*22dc650dSSadaf Ebrahimi       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
3067*22dc650dSSadaf Ebrahimi
3068*22dc650dSSadaf Ebrahimi       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
3069*22dc650dSSadaf Ebrahimi
3070*22dc650dSSadaf Ebrahimi       As  well as the offsets in the ovector, other information about a match
3071*22dc650dSSadaf Ebrahimi       is retained in the match data block and can be retrieved by  the  above
3072*22dc650dSSadaf Ebrahimi       functions  in  appropriate  circumstances.  If they are called at other
3073*22dc650dSSadaf Ebrahimi       times, the result is undefined.
3074*22dc650dSSadaf Ebrahimi
3075*22dc650dSSadaf Ebrahimi       After a successful match, a partial match (PCRE2_ERROR_PARTIAL),  or  a
3076*22dc650dSSadaf Ebrahimi       failure  to  match (PCRE2_ERROR_NOMATCH), a mark name may be available.
3077*22dc650dSSadaf Ebrahimi       The function pcre2_get_mark() can be called to access this name,  which
3078*22dc650dSSadaf Ebrahimi       can  be  specified  in  the  pattern by any of the backtracking control
3079*22dc650dSSadaf Ebrahimi       verbs, not just (*MARK). The same function applies to all the verbs. It
3080*22dc650dSSadaf Ebrahimi       returns a pointer to the zero-terminated name, which is within the com-
3081*22dc650dSSadaf Ebrahimi       piled pattern. If no name is available, NULL is returned. The length of
3082*22dc650dSSadaf Ebrahimi       the name (excluding the terminating zero) is stored in  the  code  unit
3083*22dc650dSSadaf Ebrahimi       that  precedes  the name. You should use this length instead of relying
3084*22dc650dSSadaf Ebrahimi       on the terminating zero if the name might contain a binary zero.
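
       Reading the stored length can be sketched as follows for the 8-bit
       library. The helper name is hypothetical, and const unsigned char *
       stands in for the PCRE2_SPTR that pcre2_get_mark() returns; the
       pointer must be non-NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the length of a mark name from the code
   unit that precedes it, so that names containing a binary zero are
   measured correctly. */
static size_t mark_length(const unsigned char *mark)
{
    assert(mark != NULL);
    return (size_t)mark[-1];
}
```

       A name such as "A\0B" would be reported as three code units here,
       whereas a scan for the terminating zero would stop after one.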
3085*22dc650dSSadaf Ebrahimi
3086*22dc650dSSadaf Ebrahimi       After a successful match, the name that is returned is  the  last  mark
3087*22dc650dSSadaf Ebrahimi       name encountered on the matching path through the pattern. Instances of
3088*22dc650dSSadaf Ebrahimi       backtracking  verbs  without  names do not count. Thus, for example, if
3089*22dc650dSSadaf Ebrahimi       the matching path contains (*MARK:A)(*PRUNE), the name "A" is returned.
3090*22dc650dSSadaf Ebrahimi       After a "no match" or a partial match, the last encountered name is re-
3091*22dc650dSSadaf Ebrahimi       turned. For example, consider this pattern:
3092*22dc650dSSadaf Ebrahimi
3093*22dc650dSSadaf Ebrahimi         ^(*MARK:A)((*MARK:B)a|b)c
3094*22dc650dSSadaf Ebrahimi
3095*22dc650dSSadaf Ebrahimi       When it matches "bc", the returned name is A. The B mark is  "seen"  in
3096*22dc650dSSadaf Ebrahimi       the  first  branch of the group, but it is not on the matching path. On
3097*22dc650dSSadaf Ebrahimi       the other hand, when this pattern fails to  match  "bx",  the  returned
3098*22dc650dSSadaf Ebrahimi       name is B.
3099*22dc650dSSadaf Ebrahimi
3100*22dc650dSSadaf Ebrahimi       Warning:  By  default, certain start-of-match optimizations are used to
3101*22dc650dSSadaf Ebrahimi       give a fast "no match" result in some situations. For example,  if  the
3102*22dc650dSSadaf Ebrahimi       anchoring  is removed from the pattern above, there is an initial check
3103*22dc650dSSadaf Ebrahimi       for the presence of "c" in the subject before running the matching  en-
3104*22dc650dSSadaf Ebrahimi       gine. This check fails for "bx", causing a match failure without seeing
3105*22dc650dSSadaf Ebrahimi       any  marks. You can disable the start-of-match optimizations by setting
3106*22dc650dSSadaf Ebrahimi       the PCRE2_NO_START_OPTIMIZE option for pcre2_compile() or  by  starting
3107*22dc650dSSadaf Ebrahimi       the pattern with (*NO_START_OPT).
3108*22dc650dSSadaf Ebrahimi
3109*22dc650dSSadaf Ebrahimi       After  a  successful  match, a partial match, or one of the invalid UTF
3110*22dc650dSSadaf Ebrahimi       errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar()  can
3111*22dc650dSSadaf Ebrahimi       be called. After a successful or partial match it returns the code unit
3112*22dc650dSSadaf Ebrahimi       offset  of  the character at which the match started. For a non-partial
3113*22dc650dSSadaf Ebrahimi       match, this can be different to the value of ovector[0] if the  pattern
3114*22dc650dSSadaf Ebrahimi       contains  the  \K escape sequence. After a partial match, however, this
3115*22dc650dSSadaf Ebrahimi       value is always the same as ovector[0] because \K does not  affect  the
3116*22dc650dSSadaf Ebrahimi       result of a partial match.
3117*22dc650dSSadaf Ebrahimi
3118*22dc650dSSadaf Ebrahimi       After  a UTF check failure, pcre2_get_startchar() can be used to obtain
3119*22dc650dSSadaf Ebrahimi       the code unit offset of the invalid UTF character. Details are given in
3120*22dc650dSSadaf Ebrahimi       the pcre2unicode page.
3121*22dc650dSSadaf Ebrahimi
3122*22dc650dSSadaf Ebrahimi
3123*22dc650dSSadaf EbrahimiERROR RETURNS FROM pcre2_match()
3124*22dc650dSSadaf Ebrahimi
3125*22dc650dSSadaf Ebrahimi       If pcre2_match() fails, it returns a negative number. This can be  con-
3126*22dc650dSSadaf Ebrahimi       verted  to a text string by calling the pcre2_get_error_message() func-
3127*22dc650dSSadaf Ebrahimi       tion (see "Obtaining a textual error message" below).   Negative  error
3128*22dc650dSSadaf Ebrahimi       codes  are  also  returned  by other functions, and are documented with
3129*22dc650dSSadaf Ebrahimi       them. The codes are given names in the header file. If UTF checking  is
3130*22dc650dSSadaf Ebrahimi       in force and an invalid UTF subject string is detected, one of a number
3131*22dc650dSSadaf Ebrahimi       of  UTF-specific negative error codes is returned. Details are given in
3132*22dc650dSSadaf Ebrahimi       the pcre2unicode page. The following are the other errors that  may  be
3133*22dc650dSSadaf Ebrahimi       returned by pcre2_match():
3134*22dc650dSSadaf Ebrahimi
3135*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_NOMATCH
3136*22dc650dSSadaf Ebrahimi
3137*22dc650dSSadaf Ebrahimi       The subject string did not match the pattern.
3138*22dc650dSSadaf Ebrahimi
3139*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_PARTIAL
3140*22dc650dSSadaf Ebrahimi
3141*22dc650dSSadaf Ebrahimi       The  subject  string did not match, but it did match partially. See the
3142*22dc650dSSadaf Ebrahimi       pcre2partial documentation for details of partial matching.
3143*22dc650dSSadaf Ebrahimi
3144*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_BADMAGIC
3145*22dc650dSSadaf Ebrahimi
3146*22dc650dSSadaf Ebrahimi       PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
3147*22dc650dSSadaf Ebrahimi       to catch the case when it is passed a junk pointer. This is  the  error
3148*22dc650dSSadaf Ebrahimi       that is returned when the magic number is not present.
3149*22dc650dSSadaf Ebrahimi
3150*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_BADMODE
3151*22dc650dSSadaf Ebrahimi
3152*22dc650dSSadaf Ebrahimi       This  error is given when a compiled pattern is passed to a function in
3153*22dc650dSSadaf Ebrahimi       a library of a different code unit width, for example, a  pattern  com-
3154*22dc650dSSadaf Ebrahimi       piled  by  the  8-bit  library  is passed to a 16-bit or 32-bit library
3155*22dc650dSSadaf Ebrahimi       function.
3156*22dc650dSSadaf Ebrahimi
3157*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_BADOFFSET
3158*22dc650dSSadaf Ebrahimi
3159*22dc650dSSadaf Ebrahimi       The value of startoffset was greater than the length of the subject.
3160*22dc650dSSadaf Ebrahimi
3161*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_BADOPTION
3162*22dc650dSSadaf Ebrahimi
3163*22dc650dSSadaf Ebrahimi       An unrecognized bit was set in the options argument.
3164*22dc650dSSadaf Ebrahimi
3165*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_BADUTFOFFSET
3166*22dc650dSSadaf Ebrahimi
3167*22dc650dSSadaf Ebrahimi       The UTF code unit sequence that was passed as a subject was checked and
3168*22dc650dSSadaf Ebrahimi       found to be valid (the PCRE2_NO_UTF_CHECK option was not set), but  the
3169*22dc650dSSadaf Ebrahimi       value  of startoffset did not point to the beginning of a UTF character
3170*22dc650dSSadaf Ebrahimi       or the end of the subject.
3171*22dc650dSSadaf Ebrahimi
3172*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_CALLOUT
3173*22dc650dSSadaf Ebrahimi
3174*22dc650dSSadaf Ebrahimi       This error is never generated by pcre2_match() itself. It  is  provided
3175*22dc650dSSadaf Ebrahimi       for  use  by  callout  functions  that  want  to cause pcre2_match() or
3176*22dc650dSSadaf Ebrahimi       pcre2_callout_enumerate() to return a distinctive error code.  See  the
3177*22dc650dSSadaf Ebrahimi       pcre2callout documentation for details.
3178*22dc650dSSadaf Ebrahimi
3179*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_DEPTHLIMIT
3180*22dc650dSSadaf Ebrahimi
3181*22dc650dSSadaf Ebrahimi       The nested backtracking depth limit was reached.
3182*22dc650dSSadaf Ebrahimi
3183*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_HEAPLIMIT
3184*22dc650dSSadaf Ebrahimi
3185*22dc650dSSadaf Ebrahimi       The heap limit was reached.
3186*22dc650dSSadaf Ebrahimi
3187*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_INTERNAL
3188*22dc650dSSadaf Ebrahimi
3189*22dc650dSSadaf Ebrahimi       An  unexpected  internal error has occurred. This error could be caused
3190*22dc650dSSadaf Ebrahimi       by a bug in PCRE2 or by overwriting of the compiled pattern.
3191*22dc650dSSadaf Ebrahimi
3192*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_JIT_STACKLIMIT
3193*22dc650dSSadaf Ebrahimi
       This error is returned when a pattern that was successfully processed
       by pcre2_jit_compile() is being matched, but the memory available for
       the just-in-time processing stack is not large enough. See the pcre2jit
       documentation for more details.
3198*22dc650dSSadaf Ebrahimi
3199*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_MATCHLIMIT
3200*22dc650dSSadaf Ebrahimi
3201*22dc650dSSadaf Ebrahimi       The backtracking match limit was reached.
3202*22dc650dSSadaf Ebrahimi
3203*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_NOMEMORY
3204*22dc650dSSadaf Ebrahimi
3205*22dc650dSSadaf Ebrahimi       Heap  memory  is  used  to  remember backtracking points. This error is
3206*22dc650dSSadaf Ebrahimi       given when the memory allocation function (default  or  custom)  fails.
3207*22dc650dSSadaf Ebrahimi       Note  that  a  different  error, PCRE2_ERROR_HEAPLIMIT, is given if the
3208*22dc650dSSadaf Ebrahimi       amount of memory needed exceeds the heap limit. PCRE2_ERROR_NOMEMORY is
3209*22dc650dSSadaf Ebrahimi       also returned if PCRE2_COPY_MATCHED_SUBJECT is set and  memory  alloca-
3210*22dc650dSSadaf Ebrahimi       tion fails.
3211*22dc650dSSadaf Ebrahimi
3212*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_NULL
3213*22dc650dSSadaf Ebrahimi
3214*22dc650dSSadaf Ebrahimi       Either the code, subject, or match_data argument was passed as NULL.
3215*22dc650dSSadaf Ebrahimi
3216*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_RECURSELOOP
3217*22dc650dSSadaf Ebrahimi
3218*22dc650dSSadaf Ebrahimi       This  error  is  returned  when  pcre2_match() detects a recursion loop
3219*22dc650dSSadaf Ebrahimi       within the pattern. Specifically, it means that either the  whole  pat-
3220*22dc650dSSadaf Ebrahimi       tern or a capture group has been called recursively for the second time
3221*22dc650dSSadaf Ebrahimi       at  the  same position in the subject string. Some simple patterns that
3222*22dc650dSSadaf Ebrahimi       might do this are detected and faulted at compile time, but  more  com-
3223*22dc650dSSadaf Ebrahimi       plicated  cases,  in particular mutual recursions between two different
3224*22dc650dSSadaf Ebrahimi       groups, cannot be detected until matching is attempted.


OBTAINING A TEXTUAL ERROR MESSAGE

       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
         PCRE2_SIZE bufflen);

       A text message for an error code  from  any  PCRE2  function  (compile,
       match,  or  auxiliary)  can be obtained by calling pcre2_get_error_mes-
       sage(). The code is passed as the first argument,  with  the  remaining
       two  arguments  specifying  a  code  unit buffer and its length in code
       units, into which the text message is placed. The message  is  returned
       in  code  units  of the appropriate width for the library that is being
       used.

       The returned message is terminated with a trailing zero, and the  func-
       tion  returns  the  number  of  code units used, excluding the trailing
       zero. If the error number is unknown, the negative error code PCRE2_ER-
       ROR_BADDATA is returned. If the buffer is too  small,  the  message  is
       truncated (but still with a trailing zero), and the negative error code
       PCRE2_ERROR_NOMEMORY  is returned.  None of the messages are very long;
       a buffer size of 120 code units is ample.


EXTRACTING CAPTURED SUBSTRINGS BY NUMBER

       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_SIZE *length);

       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR *buffer,
         PCRE2_SIZE *bufflen);

       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR **bufferptr,
         PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       Captured substrings can be accessed directly by using  the  ovector  as
       described above.  For convenience, auxiliary functions are provided for
       extracting   captured  substrings  as  new,  separate,  zero-terminated
       strings. A substring that contains a binary zero is correctly extracted
       and has a further zero added on the end, but  the  result  is  not,  of
       course, a C string.

       The functions in this section identify substrings by number. The number
       zero refers to the entire matched substring, with higher numbers refer-
       ring  to  substrings  captured by parenthesized groups. After a partial
       match, only substring zero is available.  An  attempt  to  extract  any
       other  substring  gives the error PCRE2_ERROR_PARTIAL. The next section
       describes similar functions for extracting captured substrings by name.

       If a pattern uses the \K escape sequence within a  positive  assertion,
       the reported start of a successful match can be greater than the end of
       the  match.   For  example,  if the pattern (?=ab\K) is matched against
       "ab", the start and end offset values for the match are  2  and  0.  In
       this  situation,  calling  these functions with a zero substring number
       extracts a zero-length empty string.

       You can find the length in code units of a captured  substring  without
       extracting  it  by calling pcre2_substring_length_bynumber(). The first
       argument is a pointer to the match data block, the second is the  group
       number,  and the third is a pointer to a variable into which the length
       is placed. If you just want to know whether or not  the  substring  has
       been captured, you can pass the third argument as NULL.

       The  pcre2_substring_copy_bynumber()  function  copies  a captured sub-
       string into a supplied buffer,  whereas  pcre2_substring_get_bynumber()
       copies  it  into  new memory, obtained using the same memory allocation
       function that was used for the match data block. The  first  two  argu-
       ments  of  these  functions are a pointer to the match data block and a
       capture group number.

       The final arguments of pcre2_substring_copy_bynumber() are a pointer to
       the buffer and a pointer to a variable that contains its length in code
       units.  This is updated to contain the actual number of code units used
       for the extracted substring, excluding the terminating zero.

       For pcre2_substring_get_bynumber() the third and fourth arguments point
       to variables that are updated with a pointer to the new memory and  the
       number  of  code units that comprise the substring, again excluding the
       terminating zero. When the substring is no longer  needed,  the  memory
       should be freed by calling pcre2_substring_free().

       The  return  value  from  all these functions is zero for success, or a
       negative error code. If the pattern match  failed,  the  match  failure
       code  is returned.  If a substring number greater than zero is used af-
       ter a partial match, PCRE2_ERROR_PARTIAL is  returned.  Other  possible
       error codes are:

         PCRE2_ERROR_NOMEMORY

       The  buffer  was  too small for pcre2_substring_copy_bynumber(), or the
       attempt to get memory failed for pcre2_substring_get_bynumber().

         PCRE2_ERROR_NOSUBSTRING

       There is no substring with that number in the  pattern,  that  is,  the
       number is greater than the number of capturing parentheses.

         PCRE2_ERROR_UNAVAILABLE

       The substring number, though not greater than the number of captures in
       the pattern, is greater than the number of slots in the ovector, so the
       substring could not be captured.

         PCRE2_ERROR_UNSET

       The  substring  did  not  participate in the match. For example, if the
       pattern is (abc)|(def) and the subject is "def", and the  ovector  con-
       tains at least two capturing slots, substring number 1 is unset.


EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS

       int pcre2_substring_list_get(pcre2_match_data *match_data,
         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);

       void pcre2_substring_list_free(PCRE2_UCHAR **list);

       The  pcre2_substring_list_get()  function  extracts  all available sub-
       strings and builds a list of pointers to  them.  It  also  (optionally)
       builds  a  second list that contains their lengths (in code units), ex-
       cluding a terminating zero that is added to each of them. All  this  is
       done in a single block of memory that is obtained using the same memory
       allocation function that was used to get the match data block.

       This  function  must be called only after a successful match. If called
       after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.

       The address of the memory block is returned via listptr, which is  also
       the start of the list of string pointers. The end of the list is marked
       by  a  NULL pointer. The address of the list of lengths is returned via
       lengthsptr. If your strings do not contain binary zeros and you do  not
       therefore need the lengths, you may supply NULL as the lengthsptr argu-
       ment  to  disable  the  creation of a list of lengths. The yield of the
       function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the  mem-
       ory  block could not be obtained. When the list is no longer needed, it
       should be freed by calling pcre2_substring_list_free().

       If this function encounters a substring that is unset, which can happen
       when capture group number n+1 matches some part  of  the  subject,  but
       group  n has not been used at all, it returns an empty string. This can
       be distinguished from a genuine zero-length substring by inspecting the
       appropriate offset in the ovector, which contains PCRE2_UNSET for unset
       substrings, or by calling pcre2_substring_length_bynumber().


EXTRACTING CAPTURED SUBSTRINGS BY NAME

       int pcre2_substring_number_from_name(const pcre2_code *code,
         PCRE2_SPTR name);

       int pcre2_substring_length_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_SIZE *length);

       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

       int pcre2_substring_get_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       To extract a substring by name, you first have to find  the  associated
       number.  For example, for this pattern:

         (a+)b(?<xxx>\d+)...

       the number of the capture group called "xxx" is 2. If the name is known
       to be unique (PCRE2_DUPNAMES was not set), you can find the number from
       the name by calling pcre2_substring_number_from_name(). The first argu-
       ment is the compiled pattern, and the second is the name. The yield  of
       the  function  is the group number, PCRE2_ERROR_NOSUBSTRING if there is
       no group with that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if  there  is
       more  than one group with that name.  Given the number, you can extract
       the substring directly from the ovector, or use one of  the  "bynumber"
       functions described above.

       For  convenience,  there are also "byname" functions that correspond to
       the "bynumber" functions, the only difference being that the second ar-
       gument is a name instead of a number.  If  PCRE2_DUPNAMES  is  set  and
       there are duplicate names, these functions scan all the groups with the
       given  name,  and  return  the  captured substring from the first named
       group that is set.

       If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING  is
       returned.  If  all  groups  with the name have numbers that are greater
       than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is re-
       turned. If there is at least one group with a slot in the ovector,  but
       no group is found to be set, PCRE2_ERROR_UNSET is returned.

       Warning: If the pattern uses the (?| feature to set up multiple capture
       groups  with  the same number, as described in the section on duplicate
       group numbers in the pcre2pattern page, you cannot use names to distin-
       guish the different capture groups, because names are not  included  in
       the  compiled  code.  The  matching process uses only numbers. For this
       reason, the use of different names for  groups  with  the  same  number
       causes an error at compile time.


CREATING A NEW STRING WITH SUBSTITUTIONS

       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext, PCRE2_SPTR replacement,
         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
         PCRE2_SIZE *outlengthptr);

       This  function  optionally calls pcre2_match() and then makes a copy of
       the subject string in outputbuffer, replacing parts that  were  matched
       with the replacement string, whose length is supplied in rlength, which
       can  be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. As
       a special case, if replacement is NULL and rlength  is  zero,  the  re-
       placement  is assumed to be an empty string. If rlength is non-zero, an
       error occurs if replacement is NULL.

       There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to re-
       turn just the replacement string(s). The default action is  to  perform
       just  one  replacement  if  the pattern matches, but there is an option
       that requests multiple replacements  (see  PCRE2_SUBSTITUTE_GLOBAL  be-
       low).

       If  successful,  pcre2_substitute() returns the number of substitutions
       that were carried out. This may be zero if no match was found,  and  is
       never  greater  than one unless PCRE2_SUBSTITUTE_GLOBAL is set. A nega-
       tive value is returned if an error is detected.

       Matches in which a \K item in a lookahead in  the  pattern  causes  the
       match  to  end  before it starts are not supported, and give rise to an
       error return. For global replacements, matches in which \K in a lookbe-
       hind causes the match to start earlier than the point that was  reached
       in the previous iteration are also not supported.

       The  first  seven  arguments  of pcre2_substitute() are the same as for
       pcre2_match(), except that the partial matching options are not permit-
       ted, and match_data may be passed as NULL, in which case a  match  data
       block  is obtained and freed within this function, using memory manage-
       ment functions from the match context, if provided, or else those  that
       were used to allocate memory for the compiled code.

       If  match_data is not NULL and PCRE2_SUBSTITUTE_MATCHED is not set, the
       provided block is used for all calls to pcre2_match(), and its contents
       afterwards are the result of the final call. For global  changes,  this
       will always be a no-match error. The contents of the ovector within the
       match data block may or may not have been changed.

       As  well as the usual options for pcre2_match(), a number of additional
       options can be set in the options argument of pcre2_substitute().   One
       such  option is PCRE2_SUBSTITUTE_MATCHED. When this is set, an external
       match_data block must be provided, and it must have already  been  used
       for an external call to pcre2_match() with the same pattern and subject
       arguments.  The  data in the match_data block (return code, offset vec-
       tor) is then  used  for  the  first  substitution  instead  of  calling
       pcre2_match()  from  within pcre2_substitute(). This allows an applica-
       tion to check for a match before choosing to substitute, without having
       to repeat the match.

       The contents of the  externally  supplied  match  data  block  are  not
       changed   when   PCRE2_SUBSTITUTE_MATCHED   is  set.  If  PCRE2_SUBSTI-
       TUTE_GLOBAL is also set, pcre2_match() is called after the  first  sub-
       stitution  to  check for further matches, but this is done using an in-
       ternally obtained match data block, thus always  leaving  the  external
       block unchanged.

       The  code  argument is not used for matching before the first substitu-
       tion when PCRE2_SUBSTITUTE_MATCHED is set, but  it  must  be  provided,
       even  when  PCRE2_SUBSTITUTE_GLOBAL is not set, because it contains in-
       formation such as the UTF setting and the number of capturing parenthe-
       ses in the pattern.

       The default action of pcre2_substitute() is to return  a  copy  of  the
       subject string with matched substrings replaced. However, if PCRE2_SUB-
       STITUTE_REPLACEMENT_ONLY  is  set,  only the replacement substrings are
       returned. In the global case, multiple replacements are concatenated in
       the output buffer. Substitution callouts (see below)  can  be  used  to
       separate them if necessary.

       The  outlengthptr  argument of pcre2_substitute() must point to a vari-
       able that contains the length, in code units, of the output buffer.  If
       the  function is successful, the value is updated to contain the length
       in code units of the new string, excluding the trailing  zero  that  is
       automatically added.

       If  the  function is not successful, the value set via outlengthptr de-
       pends on the type of  error.  For  syntax  errors  in  the  replacement
       string, the value is the offset in the replacement string where the er-
       ror  was  detected.  For  other errors, the value is PCRE2_UNSET by de-
       fault. This includes the case of the output buffer being too small, un-
       less PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set.

       PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when  the  output
       buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
       ORY  immediately.  If  this  option is set, however, pcre2_substitute()
       continues to go through the motions of matching and substituting (with-
       out, of course, writing anything) in  order  to  compute  the  size  of
       buffer  that  is needed. This value is passed back via the outlengthptr
       variable, with  the  result  of  the  function  still  being  PCRE2_ER-
       ROR_NOMEMORY.

       Passing  a  buffer  size  of zero is a permitted way of finding out how
       much memory is needed for a given substitution. However, this does mean
       that the entire operation is carried out twice. Depending on the appli-
       cation,  it  may  be more efficient to allocate a large buffer and free
       the  excess  afterwards,  instead   of   using   PCRE2_SUBSTITUTE_OVER-
       FLOW_LENGTH.

       The  replacement  string,  which  is interpreted as a UTF string in UTF
       mode, is checked for UTF validity unless PCRE2_NO_UTF_CHECK is set.  An
       invalid UTF replacement string causes an immediate return with the rel-
       evant UTF error code.

       If  PCRE2_SUBSTITUTE_LITERAL  is set, the replacement string is not in-
       terpreted in any way. By default, however, a dollar character is an es-
       cape character that can specify the insertion of characters  from  cap-
       ture  groups  and names from (*MARK) or other control verbs in the pat-
       tern. Dollar is the only escape character (backslash is treated as lit-
       eral). The following forms are always recognized:

         $$                  insert a dollar character
         $<n> or ${<n>}      insert the contents of group <n>
         $*MARK or ${*MARK}  insert a control verb name

       Either a group number or a group name  can  be  given  for  <n>.  Curly
       brackets  are  required only if the following character would be inter-
       preted as part of the number or name. The number may be zero to include
       the entire matched string.   For  example,  if  the  pattern  a(b)c  is
       matched  with "=abc=" and the replacement string "+$1$0$1+", the result
       is "=+babcb+=".

       $*MARK inserts the name from the last encountered backtracking  control
       verb  on the matching path that has a name. (*MARK) must always include
       a name, but the other verbs need not.  For  example,  in  the  case  of
       (*MARK:A)(*PRUNE) the name inserted is "A", but for (*MARK:A)(*PRUNE:B)
       the  relevant  name is "B". This facility can be used to perform simple
       simultaneous substitutions, as this pcre2test example shows:

         /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK}
             apple lemon
          2: pear orange

       PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
       string, replacing every matching substring. If this option is not  set,
       only  the  first matching substring is replaced. The search for matches
       takes place in the original subject string (that is, previous  replace-
       ments  do  not  affect  it).  Iteration is implemented by advancing the
       startoffset value for each search, which is always  passed  the  entire
       subject string. If an offset limit is set in the match context, search-
       ing stops when that limit is reached.

       You  can  restrict  the effect of a global substitution to a portion of
       the subject string by setting either or both of startoffset and an off-
       set limit. Here is a pcre2test example:

         /B/g,replace=!,use_offset_limit
         ABC ABC ABC ABC\=offset=3,offset_limit=12
          2: ABC A!C A!C ABC

       When continuing with global substitutions after  matching  a  substring
       with zero length, an attempt to find a non-empty match at the same off-
       set is performed.  If this is not successful, the offset is advanced by
       one character except when CRLF is a valid newline sequence and the next
       two  characters are CR, LF. In this case, the offset is advanced by two
       characters.

       PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capture groups that
       do not appear in the pattern to be treated as unset groups. This option
       should be used with care, because it means that a typo in a group  name
       or number no longer causes the PCRE2_ERROR_NOSUBSTRING error.

       PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including un-
       known  groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated
       as empty strings when inserted as described above. If  this  option  is
       not set, an attempt to insert an unset group causes the PCRE2_ERROR_UN-
       SET  error.  This  option  does not influence the extended substitution
       syntax described below.

       PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to  the
       replacement  string.  Without this option, only the dollar character is
       special, and only the group insertion forms  listed  above  are  valid.
       When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:

       Firstly,  backslash in a replacement string is interpreted as an escape
       character. The usual forms such as \n or \x{ddd} can be used to specify
       particular character codes, and backslash followed by any  non-alphanu-
       meric  character  quotes  that character. Extended quoting can be coded
       using \Q...\E, exactly as in pattern strings.

       There are also four escape sequences for forcing the case  of  inserted
       letters.   The  insertion  mechanism has three states: no case forcing,
       force upper case, and force lower case. The escape sequences change the
       current state: \U and \L change to upper or lower case forcing, respec-
       tively, and \E (when not terminating a \Q quoted sequence)  reverts  to
       no  case  forcing. The sequences \u and \l force the next character (if
       it is a letter) to upper or lower  case,  respectively,  and  then  the
       state automatically reverts to no case forcing. Case forcing applies to
       all  inserted  characters, including those from capture groups and let-
       ters within \Q...\E quoted sequences. If either PCRE2_UTF or  PCRE2_UCP
       was  set when the pattern was compiled, Unicode properties are used for
       case forcing characters whose code points are greater than 127.
3627*22dc650dSSadaf Ebrahimi
3628*22dc650dSSadaf Ebrahimi       Note that case forcing sequences such as \U...\E do not nest. For exam-
3629*22dc650dSSadaf Ebrahimi       ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc";  the  final
3630*22dc650dSSadaf Ebrahimi       \E  has  no  effect.  Note  also  that the PCRE2_ALT_BSUX and PCRE2_EX-
3631*22dc650dSSadaf Ebrahimi       TRA_ALT_BSUX options do not apply to replacement strings.
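       The case forcing states described above can be illustrated with a small
       stand-alone model. It is only a sketch under stated assumptions and  is
       not  how  the  library  is implemented: force_case() and its types  are
       hypothetical names, only ASCII text and the sequences \U, \L,  \E,  \u,
       and  \l  are  handled,  and  \Q...\E quoting and group insertion    are
       left out. The sketch also assumes that a pending \u or \l is used up by
       the next character even when that character is not a letter.

```c
#include <ctype.h>
#include <stddef.h>

enum case_state { CASE_NONE, CASE_UPPER, CASE_LOWER };

/* Hypothetical model, not PCRE2 code. Copies `repl` to `out`, applying the
 * case forcing escapes. Returns the number of characters written (excluding
 * the terminating NUL), or (size_t)-1 if `out` is too small. */
static size_t force_case(const char *repl, char *out, size_t outsize)
{
    enum case_state current = CASE_NONE;  /* applied to the next character */
    enum case_state reset = CASE_NONE;    /* state restored after it */
    size_t n = 0;

    for (const char *p = repl; *p != '\0'; p++) {
        if (*p == '\\' && p[1] != '\0') {
            switch (p[1]) {
            case 'U': current = reset = CASE_UPPER; p++; continue;
            case 'L': current = reset = CASE_LOWER; p++; continue;
            case 'E': current = reset = CASE_NONE;  p++; continue;
            case 'u': current = CASE_UPPER; reset = CASE_NONE; p++; continue;
            case 'l': current = CASE_LOWER; reset = CASE_NONE; p++; continue;
            default:  break;  /* other escapes are outside this model */
            }
        }
        if (n + 1 >= outsize)
            return (size_t)-1;
        unsigned char c = (unsigned char)*p;
        if (current == CASE_UPPER)
            c = (unsigned char)toupper(c);
        else if (current == CASE_LOWER)
            c = (unsigned char)tolower(c);
        out[n++] = (char)c;
        current = reset;      /* \u and \l last for one character only */
    }
    if (outsize == 0)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

       Because there is a single current state rather than a stack, a  second
       \U or \L simply replaces the first, which is why the sequences do  not
       nest and why the final \E in the example above has no effect.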
3632*22dc650dSSadaf Ebrahimi
       The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to  add  more
       flexibility  to  capture  group  substitution. The syntax is similar to
       that used by Bash:

         ${<n>:-<string>}
         ${<n>:+<string1>:<string2>}

       As before, <n> may be a group number or a name. The first  form  speci-
       fies  a  default  value. If group <n> is set, its value is inserted; if
       not, <string> is expanded and the  result  inserted.  The  second  form
       specifies  strings that are expanded and inserted when group <n> is set
       or unset, respectively. The first form is just a  convenient  shorthand
       for

         ${<n>:+${<n>}:<string>}

       Backslash  can  be  used to escape colons and closing curly brackets in
       the replacement strings. A change of the case forcing  state  within  a
       replacement  string  remains  in  force  afterwards,  as  shown in this
       pcre2test example:

         /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
             body
          1: hello
             somebody
          1: HELLO

       The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these  extended
       substitutions.  However,  PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause un-
       known groups in the extended syntax forms to be treated as unset.
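       The  choice  made  by  the  two  extended  forms  can  be  modelled  in
       a few lines of C. This is a sketch only: the function names are  hypo-
       thetical,  an  unset  group  is represented by a NULL pointer, and  the
       further expansion that the library applies to the chosen string is  not
       modelled.

```c
#include <stddef.h>

/* Hypothetical model of ${<n>:+<string1>:<string2>}: `group` is the captured
 * text of group <n>, or NULL when the group is unset. A group that matched
 * the empty string is still set, so it is passed as "" rather than NULL. */
static const char *expand_if_set(const char *group, const char *if_set,
                                 const char *if_unset)
{
    return (group != NULL) ? if_set : if_unset;
}

/* Model of ${<n>:-<string>}, written as the shorthand it stands for:
 * ${<n>:+${<n>}:<string>}. */
static const char *expand_default(const char *group, const char *fallback)
{
    return expand_if_set(group, group, fallback);
}
```

       In the pcre2test example above, group 1 is unset for the subject "body"
       and  set  for  "somebody",  so  the model selects \L and \U respec-
       tively, matching the two outputs shown.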
3663*22dc650dSSadaf Ebrahimi
       If  PCRE2_SUBSTITUTE_LITERAL  is  set,  PCRE2_SUBSTITUTE_UNKNOWN_UNSET,
       PCRE2_SUBSTITUTE_UNSET_EMPTY, and PCRE2_SUBSTITUTE_EXTENDED are irrele-
       vant and are ignored.

   Substitution errors

       In  the  event of an error, pcre2_substitute() returns a negative error
       code. Except for PCRE2_ERROR_NOMATCH (which is never returned),  errors
       from pcre2_match() are passed straight back.

       PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
       tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.

       PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
       ing  an  unknown  substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set)
       when the simple (non-extended) syntax is used and  PCRE2_SUBSTITUTE_UN-
       SET_EMPTY is not set.
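       The interaction of these two errors with the  option  bits  amounts  to
       a small decision table for a simple (non-extended) group insertion.  The
       following C sketch spells it out. The enum and function names are hypo-
       thetical;  FAIL_NOSUBSTRING  and  FAIL_UNSET  stand  for  PCRE2_ER-
       ROR_NOSUBSTRING and PCRE2_ERROR_UNSET, and the two flag parameters stand
       for the corresponding option bits.

```c
/* Hypothetical model, not PCRE2 code. */
enum insert_outcome {
    INSERT_VALUE,      /* the captured text is inserted */
    INSERT_EMPTY,      /* an empty string is inserted */
    FAIL_NOSUBSTRING,  /* stands for PCRE2_ERROR_NOSUBSTRING */
    FAIL_UNSET         /* stands for PCRE2_ERROR_UNSET */
};

static enum insert_outcome simple_insert_outcome(int group_exists,
                                                 int group_is_set,
                                                 int unknown_unset,
                                                 int unset_empty)
{
    if (!group_exists) {
        if (!unknown_unset)
            return FAIL_NOSUBSTRING;
        group_is_set = 0;  /* an unknown group is treated as unset */
    }
    if (!group_is_set)
        return unset_empty ? INSERT_EMPTY : FAIL_UNSET;
    return INSERT_VALUE;
}
```

       The model shows why PCRE2_SUBSTITUTE_UNKNOWN_UNSET on its own  does  not
       silence a mistyped group name: the reference then fails as unset unless
       PCRE2_SUBSTITUTE_UNSET_EMPTY is set as well.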
3681*22dc650dSSadaf Ebrahimi
       PCRE2_ERROR_NOMEMORY  is  returned  if  the  output  buffer  is not big
       enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size
       of buffer that is needed is returned via outlengthptr. Note  that  this
       does not happen by default.

       PCRE2_ERROR_NULL is returned if PCRE2_SUBSTITUTE_MATCHED is set but the
       match_data  argument is NULL or if the subject or replacement arguments
       are NULL. For backward compatibility reasons an exception is  made  for
       the replacement argument if the rlength argument is also 0.

       PCRE2_ERROR_BADREPLACEMENT  is  used for miscellaneous syntax errors in
       the replacement string, with more  particular  errors  being  PCRE2_ER-
       ROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REPMISSINGBRACE
       (closing  curly bracket not found), PCRE2_ERROR_BADSUBSTITUTION (syntax
       error in extended group substitution),  and  PCRE2_ERROR_BADSUBSPATTERN
       (the pattern match ended before it started or the match started earlier
       than  the  current  position  in the subject, which can happen if \K is
       used in an assertion).

       As for all PCRE2 errors, a text message that describes the error can be
       obtained by calling the pcre2_get_error_message()  function  (see  "Ob-
       taining a textual error message" above).

   Substitution callouts

       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_substitute_callout_block *, void *),
         void *callout_data);

       The pcre2_set_substitute_callout() function can be used to  specify  a
       callout function for pcre2_substitute(). This information is passed  in
       a match context. The callout function is called after each substitution
       has been processed, but it can cause the replacement not to happen. The
       callout  function is not called for simulated substitutions that happen
       as a result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option.

       The first argument of the callout function is a pointer to a substitute
       callout block structure, which contains the following fields, not  nec-
       essarily in this order:

         uint32_t    version;
         uint32_t    subscount;
         PCRE2_SPTR  input;
         PCRE2_SPTR  output;
         PCRE2_SIZE *ovector;
         uint32_t    oveccount;
         PCRE2_SIZE  output_offsets[2];

       The  version field contains the version number of the block format. The
       current version is 0. The version number will  increase  in  future  if
       more  fields are added, but the intention is never to remove any of the
       existing fields.

       The subscount field is the number of the current match. It is 1 for the
       first callout, 2 for the second, and so on. The input and output point-
       ers are copies of the values passed to pcre2_substitute().

       The ovector field points to the ovector, which contains the  result  of
       the most recent match. The oveccount field contains the number of pairs
       that are set in the ovector, and is always greater than zero.

       The  output_offsets  vector  contains the offsets of the replacement in
       the output string. This has already been processed for dollar  and  (if
       requested) backslash substitutions as described above.

       The  second  argument  of  the  callout function is the value passed as
       callout_data when the function was registered. The  value  returned  by
       the callout function is interpreted as follows:

       If  the  value is zero, the replacement is accepted, and, if PCRE2_SUB-
       STITUTE_GLOBAL is set, processing continues with a search for the  next
       match.  If  the  value  is not zero, the current replacement is not ac-
       cepted. If the value is greater than zero,  processing  continues  when
       PCRE2_SUBSTITUTE_GLOBAL  is set. Otherwise (the value is less than zero
       or PCRE2_SUBSTITUTE_GLOBAL is not set), the rest of the input is copied
       to the output and the call to pcre2_substitute() exits,  returning  the
       number of matches so far.
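       As an illustration, here is a sketch of a callout function that rejects
       any  replacement  longer  than  a limit supplied through  callout_data.
       To keep the sketch compilable without pcre2.h, it uses  a  hypothetical
       stand-in  structure  with  the fields listed above; real code would use
       pcre2_substitute_callout_block  itself.  The   function   name   is
       hypothetical too, and the sketch assumes that the two output_offsets
       elements are the start and end offsets of the replacement.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for pcre2_substitute_callout_block (an assumption made so that
 * this sketch is self-contained; include pcre2.h and use the library's own
 * type in real code). */
typedef struct {
    uint32_t version;
    uint32_t subscount;
    const unsigned char *input;
    const unsigned char *output;
    size_t *ovector;
    uint32_t oveccount;
    size_t output_offsets[2];
} substitute_callout_block_model;

/* Hypothetical callout: callout_data points to a size_t length limit.
 * Returns 0 to accept the replacement, or 1 to reject it; after a rejection
 * processing continues only when PCRE2_SUBSTITUTE_GLOBAL is set. */
static int limit_replacement_length(substitute_callout_block_model *block,
                                    void *callout_data)
{
    size_t limit = *(const size_t *)callout_data;
    size_t replaced = block->output_offsets[1] - block->output_offsets[0];
    return (replaced > limit) ? 1 : 0;
}
```

       With  the  real  block  type,  such  a  function  would  be  registered
       by  passing  it  and  a  pointer  to  the  limit  to  pcre2_set_substi-
       tute_callout(). Returning a negative value instead of 1 would  end  the
       call to pcre2_substitute() at the first over-long replacement.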
3759*22dc650dSSadaf Ebrahimi
3760*22dc650dSSadaf Ebrahimi
DUPLICATE CAPTURE GROUP NAMES

       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

       When  a  pattern  is compiled with the PCRE2_DUPNAMES option, names for
       capture groups are not required to be unique. Duplicate names  are  al-
       ways  allowed for groups with the same number, created by using the (?|
       feature. Indeed, if such groups are named, they are required to use the
       same names.

       Normally, patterns that use duplicate names are such that  in  any  one
       match,  only  one of each set of identically-named groups participates.
       An example is shown in the pcre2pattern documentation.

       When  duplicates   are   present,   pcre2_substring_copy_byname()   and
       pcre2_substring_get_byname()  return  the first substring corresponding
       to the given name that is set. Only if none are set is  PCRE2_ERROR_UN-
       SET returned. The pcre2_substring_number_from_name() function   returns
       the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are duplicate names.

       If  you want to get full details of all captured substrings for a given
       name, you must use the pcre2_substring_nametable_scan()  function.  The
       first  argument is the compiled pattern, and the second is the name. If
       the third and fourth arguments are NULL, the function returns  a  group
       number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.

       When the third and fourth arguments are not NULL, they must be pointers
       to  variables  that are updated by the function. After it has run, they
       point to the first and last entries in the name-to-number table for the
       given name, and the function returns the length of each entry  in  code
       units.  In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
       no entries for the given name.

       The format of the name table is described above in the section entitled
       "Information about a pattern". Given all the relevant entries for   the
       name,  you  can  extract  each of their numbers, and hence the captured
       data.
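       Extracting  the  group  numbers  from  a  run  of  entries might   look
       like the following sketch. The helper name is hypothetical, and the code
       assumes the 8-bit library, in which each entry starts with two bytes of
       group  number, most significant byte first, followed by the zero-termi-
       nated name.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2. Walks the name-table entries from
 * `first` to `last` inclusive, `entry_size` code units apart, and stores up
 * to `max` group numbers in `numbers`. Returns the number of entries seen.
 * Assumption: 8-bit library layout (two bytes of group number, most
 * significant first, then the zero-terminated name). */
static size_t group_numbers_for_name(const unsigned char *first,
                                     const unsigned char *last,
                                     size_t entry_size,
                                     unsigned *numbers, size_t max)
{
    size_t count = 0;

    if (entry_size == 0)
        return 0;
    for (const unsigned char *entry = first; entry <= last;
         entry += entry_size) {
        if (count < max)
            numbers[count] = ((unsigned)entry[0] << 8) | entry[1];
        count++;
    }
    return count;
}
```

       Here first and last would be the pointers filled in by pcre2_substring_
       nametable_scan(),  and  entry_size  its positive return value. Each re-
       sulting number can then be passed  to  the  functions  that  extract  a
       substring by number.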
3800*22dc650dSSadaf Ebrahimi
3801*22dc650dSSadaf Ebrahimi
FINDING ALL POSSIBLE MATCHES AT ONE POSITION

       The traditional matching function uses a  similar  algorithm  to  Perl,
       which  stops when it finds the first match at a given point in the sub-
       ject. If you want to find all possible matches, or the longest possible
       match at a given position,  consider  using  the  alternative  matching
       function  (see  below) instead. If you cannot use the alternative func-
       tion, you can kludge it up by making use of the callout facility, which
       is described in the pcre2callout documentation.

       What you have to do is to insert a callout right at the end of the pat-
       tern.  When your callout function is called, extract and save the  cur-
       rent  matched  substring.  Then return 1, which forces pcre2_match() to
       backtrack and try other alternatives. Ultimately, when it runs  out  of
       matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.


MATCHING A PATTERN: THE ALTERNATIVE FUNCTION

       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext,
         int *workspace, PCRE2_SIZE wscount);

       The  function  pcre2_dfa_match()  is  called  to match a subject string
       against a compiled pattern, using a matching algorithm that  scans  the
       subject string just once (not counting lookaround assertions), and does
       not  backtrack (except when processing lookaround assertions). This has
       different characteristics to the normal algorithm, and is not  compati-
       ble  with  Perl.  Some  of  the features of PCRE2 patterns are not sup-
       ported. Nevertheless, there are times when this kind of matching can be
       useful. For a discussion of the two matching algorithms, and a list  of
       features that pcre2_dfa_match() does not support, see the pcre2matching
       documentation.

       The  arguments  for  the pcre2_dfa_match() function are the same as for
       pcre2_match(), plus two extras. The ovector within the match data block
       is used in a different way, and this is described below. The other com-
       mon arguments are used in the same way as for pcre2_match(),  so  their
       description is not repeated here.

       The  two  additional  arguments provide workspace for the function. The
       workspace vector should contain at least 20 elements. It  is  used  for
       keeping  track  of  multiple paths through the pattern tree. More work-
       space is needed for patterns and subjects where there are a lot of  po-
       tential matches.

       Here is an example of a simple call to pcre2_dfa_match():

         int wspace[20];
         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
         int rc = pcre2_dfa_match(
           re,             /* result of pcre2_compile() */
           "some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           md,             /* the match data block */
           NULL,           /* a match context; NULL means use defaults */
           wspace,         /* working space vector */
           20);            /* number of elements (NOT size in bytes) */

   Option bits for pcre2_dfa_match()

       The  unused  bits of the options argument for pcre2_dfa_match() must be
       zero.  The  only   bits   that   may   be   set   are   PCRE2_ANCHORED,
       PCRE2_COPY_MATCHED_SUBJECT,  PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO-
       TEOL,   PCRE2_NOTEMPTY,   PCRE2_NOTEMPTY_ATSTART,   PCRE2_NO_UTF_CHECK,
       PCRE2_PARTIAL_HARD,    PCRE2_PARTIAL_SOFT,    PCRE2_DFA_SHORTEST,   and
       PCRE2_DFA_RESTART. All but the last four of these are exactly the  same
       as for pcre2_match(), so their description is not repeated here.

         PCRE2_PARTIAL_HARD
         PCRE2_PARTIAL_SOFT

       These  have  the  same general effect as they do for pcre2_match(), but
       the details are slightly different. When PCRE2_PARTIAL_HARD is set  for
       pcre2_dfa_match(),  it  returns  PCRE2_ERROR_PARTIAL  if the end of the
       subject is reached and there is still at least one matching possibility
       that requires additional characters. This happens even if some complete
       matches have already been found. When PCRE2_PARTIAL_SOFT  is  set,  the
       return  code  PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
       if the end of the subject is  reached,  there  have  been  no  complete
       matches, but there is still at least one matching possibility. The por-
       tion  of  the  string that was inspected when the longest partial match
       was found is set as the first matching string in both cases. There is a
       more detailed discussion of partial and  multi-segment  matching,  with
       examples, in the pcre2partial documentation.
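       The difference between the two  partial  matching  options  can  be
       summarized  as  a  small decision function. This is only a model of the
       rules just described, with hypothetical names  throughout;  it  is  not
       taken  from  the library. It takes a single mode, so the case where both
       option bits are set is outside the model.

```c
/* Hypothetical model, not PCRE2 code. */
enum partial_mode { PARTIAL_NONE, PARTIAL_SOFT, PARTIAL_HARD };
enum dfa_outcome { OUTCOME_MATCH, OUTCOME_NOMATCH, OUTCOME_PARTIAL };

/* `complete_matches` is the number of complete matches found. `pending` is
 * nonzero when the end of the subject was reached while at least one
 * matching possibility still required more characters. */
static enum dfa_outcome dfa_partial_outcome(enum partial_mode mode,
                                            int complete_matches,
                                            int pending)
{
    if (mode == PARTIAL_HARD && pending)
        return OUTCOME_PARTIAL;   /* even if complete matches were found */
    if (complete_matches > 0)
        return OUTCOME_MATCH;
    if (mode == PARTIAL_SOFT && pending)
        return OUTCOME_PARTIAL;   /* no-match is converted to partial */
    return OUTCOME_NOMATCH;
}
```

       The only inputs for which the two options differ are those with  both  a
       complete match and a pending possibility: the hard option reports a par-
       tial match, whereas the soft option reports the complete match.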
3891*22dc650dSSadaf Ebrahimi
         PCRE2_DFA_SHORTEST

       Setting  the PCRE2_DFA_SHORTEST option causes the matching algorithm to
       stop as soon as it has found one match. Because of the way the alterna-
       tive algorithm works, this is necessarily the shortest  possible  match
       at the first possible matching point in the subject string.

         PCRE2_DFA_RESTART

       When  pcre2_dfa_match() returns a partial match, it is possible to call
       it again, with additional subject characters, and have it continue with
       the same match. The PCRE2_DFA_RESTART option requests this action; when
       it is set, the workspace and wscount arguments must reference the  same
       vector  as  before  because data about the match so far is left in them
       after a partial match. There is more discussion of this facility in the
       pcre2partial documentation.

   Successful returns from pcre2_dfa_match()

       When pcre2_dfa_match() succeeds, it may have matched more than one sub-
       string in the subject. Note, however, that all the matches from one run
       of the function start at the same point in  the  subject.  The  shorter
       matches  are all initial substrings of the longer matches. For example,
       if the pattern

         <.*>

       is matched against the string

         This is <something> <something else> <something further> no more

       the three matched strings are

         <something> <something else> <something further>
         <something> <something else>
         <something>

       On success, the yield of the function is a number  greater  than  zero,
       which  is  the  number  of  matched substrings. The offsets of the sub-
       strings are returned in the ovector, and can be extracted by number  in
       the  same way as for pcre2_match(), but the numbers bear no relation to
       any capture groups that may exist in the pattern, because DFA  matching
       does not support capturing.

       Calls  to the convenience functions that extract substrings by name re-
       turn the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used af-
       ter a DFA match. The convenience functions that extract  substrings  by
       number never return PCRE2_ERROR_NOSUBSTRING.

       The  matched  strings  are  stored  in  the ovector in reverse order of
       length; that is, the longest matching string is first.  If  there  were
       too  many matches to fit into the ovector, the yield of the function is
       zero, and the vector is filled with the longest matches.
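       Reading the matches back out of the ovector could be  sketched  as  fol-
       lows.  The helper names are hypothetical, plain size_t and char are used
       in place of PCRE2_SIZE and PCRE2_SPTR so that the sketch  is  self-con-
       tained,  and  the ovector is assumed to hold pairs of start and end off-
       sets with the longest match first.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of usable offset pairs after a DFA match that
 * returned `rc`, for an ovector with room for `ovec_pairs` pairs. */
static size_t dfa_pairs_available(int rc, size_t ovec_pairs)
{
    if (rc > 0)
        return (size_t)rc;   /* rc is the number of matched substrings */
    if (rc == 0)
        return ovec_pairs;   /* ovector too small: it holds the longest ones */
    return 0;                /* negative: no match or an error */
}

/* Hypothetical helper: copies match number `index` (0 is the longest) into
 * `buf` as a NUL-terminated string. Returns its length, or (size_t)-1 if
 * `buf` is too small. */
static size_t dfa_copy_match(const char *subject, const size_t *ovector,
                             size_t index, char *buf, size_t bufsize)
{
    size_t start = ovector[2 * index];
    size_t end = ovector[2 * index + 1];
    size_t len = end - start;

    if (len + 1 > bufsize)
        return (size_t)-1;
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return len;
}
```

       For the example above, all three pairs start at offset 8,  and  the  end
       offsets  are  56,  36, and 19 (worked out by hand from the subject, not
       taken from a run), so index 0 yields the longest string  and  index  2
       yields "<something>".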
3945*22dc650dSSadaf Ebrahimi
       NOTE: PCRE2's "auto-possessification" optimization usually  applies  to
       character  repeats at the end of a pattern (as well as internally). For
       example, the pattern "a\d+" is compiled as if it were "a\d++". For  DFA
       matching,  this means that only one possible match is found. If you re-
       ally do want multiple matches in such cases, either use an ungreedy re-
       peat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when  com-
       piling.

   Error returns from pcre2_dfa_match()

       The pcre2_dfa_match() function returns a negative number when it fails.
       Many  of  the  errors  are  the same as for pcre2_match(), as described
       above.  There are in addition the following errors that are specific to
       pcre2_dfa_match():

         PCRE2_ERROR_DFA_UITEM

       This return is given if pcre2_dfa_match() encounters  an  item  in  the
       pattern  that it does not support, for instance, the use of \C in a UTF
       mode or a backreference.

         PCRE2_ERROR_DFA_UCOND

       This return is given if pcre2_dfa_match() encounters a  condition  item
       that uses a backreference for the condition, or a test for recursion in
       a specific capture group. These are not supported.

         PCRE2_ERROR_DFA_UINVALID_UTF

       This  return is given if pcre2_dfa_match() is called for a pattern that
       was compiled with PCRE2_MATCH_INVALID_UTF. This is  not  supported  for
       DFA matching.

         PCRE2_ERROR_DFA_WSSIZE

       This  return  is  given  if  pcre2_dfa_match() runs out of space in the
       workspace vector.

         PCRE2_ERROR_DFA_RECURSE

       When a recursion or subroutine call is processed, the matching function
       calls itself recursively, using private  memory  for  the  ovector  and
       workspace.   This  error  is given if the internal ovector is not large
       enough. This should be extremely rare, as a  vector  of  size  1000  is
       used.

         PCRE2_ERROR_DFA_BADRESTART

       When  pcre2_dfa_match()  is  called  with the PCRE2_DFA_RESTART option,
       some plausibility checks are made on the  contents  of  the  workspace,
       which  should  contain data about the previous partial match. If any of
       these checks fail, this error is given.


SEE ALSO

       pcre2build(3),   pcre2callout(3),    pcre2demo(3),    pcre2matching(3),
       pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3).


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 24 April 2024
       Copyright (c) 1997-2024 University of Cambridge.


PCRE2 10.44                      24 April 2024                     PCRE2API(3)
------------------------------------------------------------------------------



PCRE2BUILD(3)              Library Functions Manual              PCRE2BUILD(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


BUILDING PCRE2

       PCRE2  is distributed with a configure script that can be used to build
       the library in Unix-like environments using the applications  known  as
       Autotools. Also in the distribution are files to support building using
       CMake  instead  of configure. The text file README contains general in-
       formation about building with Autotools (some of which is repeated  be-
       low),  and  also  has some comments about building on various operating
       systems. The files in the vms directory support building under OpenVMS.
       There is a lot more information about building PCRE2 without using  Au-
       totools  (including  information  about  using  CMake  and building "by
       hand") in the text file called NON-AUTOTOOLS-BUILD.  You should consult
       this file as well as the README file if you are building in a non-Unix-
       like environment.
4045*22dc650dSSadaf Ebrahimi
4046*22dc650dSSadaf Ebrahimi
4047*22dc650dSSadaf EbrahimiPCRE2 BUILD-TIME OPTIONS
4048*22dc650dSSadaf Ebrahimi
4049*22dc650dSSadaf Ebrahimi       The rest of this document describes the optional features of PCRE2 that
4050*22dc650dSSadaf Ebrahimi       can be selected when the library is compiled. It  assumes  use  of  the
4051*22dc650dSSadaf Ebrahimi       configure  script,  where  the  optional features are selected or dese-
4052*22dc650dSSadaf Ebrahimi       lected by providing options to configure before running the  make  com-
4053*22dc650dSSadaf Ebrahimi       mand.  However,  the same options can be selected in both Unix-like and
4054*22dc650dSSadaf Ebrahimi       non-Unix-like environments if you are using CMake instead of  configure
4055*22dc650dSSadaf Ebrahimi       to build PCRE2.
4056*22dc650dSSadaf Ebrahimi
4057*22dc650dSSadaf Ebrahimi       If  you  are not using Autotools or CMake, option selection can be done
4058*22dc650dSSadaf Ebrahimi       by editing the config.h file, or by passing parameter settings  to  the
4059*22dc650dSSadaf Ebrahimi       compiler, as described in NON-AUTOTOOLS-BUILD.
4060*22dc650dSSadaf Ebrahimi
4061*22dc650dSSadaf Ebrahimi       The complete list of options for configure (which includes the standard
4062*22dc650dSSadaf Ebrahimi       ones  such  as  the selection of the installation directory) can be ob-
4063*22dc650dSSadaf Ebrahimi       tained by running
4064*22dc650dSSadaf Ebrahimi
4065*22dc650dSSadaf Ebrahimi         ./configure --help
4066*22dc650dSSadaf Ebrahimi
4067*22dc650dSSadaf Ebrahimi       The following sections include descriptions of "on/off"  options  whose
4068*22dc650dSSadaf Ebrahimi       names begin with --enable or --disable. Because of the way that config-
4069*22dc650dSSadaf Ebrahimi       ure  works, --enable and --disable always come in pairs, so the comple-
4070*22dc650dSSadaf Ebrahimi       mentary option always exists as well, but as it specifies the  default,
4071*22dc650dSSadaf Ebrahimi       it is not described.  Options that specify values have names that start
4072*22dc650dSSadaf Ebrahimi       with --with. At the end of a configure run, a summary of the configura-
4073*22dc650dSSadaf Ebrahimi       tion is output.
4074*22dc650dSSadaf Ebrahimi
4075*22dc650dSSadaf Ebrahimi
4076*22dc650dSSadaf EbrahimiBUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
4077*22dc650dSSadaf Ebrahimi
4078*22dc650dSSadaf Ebrahimi       By  default, a library called libpcre2-8 is built, containing functions
4079*22dc650dSSadaf Ebrahimi       that take string arguments contained in arrays  of  bytes,  interpreted
4080*22dc650dSSadaf Ebrahimi       either  as single-byte characters, or UTF-8 strings. You can also build
4081*22dc650dSSadaf Ebrahimi       two other libraries, called libpcre2-16 and libpcre2-32, which  process
4082*22dc650dSSadaf Ebrahimi       strings  that  are contained in arrays of 16-bit and 32-bit code units,
4083*22dc650dSSadaf Ebrahimi       respectively. These can be interpreted either as single-unit characters
4084*22dc650dSSadaf Ebrahimi       or UTF-16/UTF-32 strings. To build these additional libraries, add  one
4085*22dc650dSSadaf Ebrahimi       or both of the following to the configure command:
4086*22dc650dSSadaf Ebrahimi
4087*22dc650dSSadaf Ebrahimi         --enable-pcre2-16
4088*22dc650dSSadaf Ebrahimi         --enable-pcre2-32
4089*22dc650dSSadaf Ebrahimi
4090*22dc650dSSadaf Ebrahimi       If you do not want the 8-bit library, add
4091*22dc650dSSadaf Ebrahimi
4092*22dc650dSSadaf Ebrahimi         --disable-pcre2-8
4093*22dc650dSSadaf Ebrahimi
       as well. At least one of the three libraries must be built. Note that
       the POSIX wrapper is for the 8-bit library only, and that pcre2grep is
       an 8-bit program. Neither of these is built if you select only the
       16-bit or 32-bit libraries.
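
       As an illustration only, the rules above can be sketched as a small
       shell function that turns a list of wanted code unit widths into
       configure options. The function name pcre2_width_options is
       hypothetical and is not part of the PCRE2 distribution; only the
       option names are taken from the text above.

```shell
# Hypothetical helper (not part of PCRE2): print the configure options
# needed to build exactly the requested libraries, e.g. "16 32".
pcre2_width_options() {
  want8=no; want16=no; want32=no
  for w in "$@"; do
    case "$w" in
      8)  want8=yes ;;
      16) want16=yes ;;
      32) want32=yes ;;
      *)  echo "unknown width: $w" >&2; return 1 ;;
    esac
  done
  # At least one of the three libraries must be built.
  if [ "$want8$want16$want32" = "nonono" ]; then
    echo "select at least one of 8, 16, 32" >&2
    return 1
  fi
  opts=""
  # The 8-bit library is built by default, so it needs an option only
  # when it is being left out.
  if [ "$want8" = no ]; then opts="$opts --disable-pcre2-8"; fi
  if [ "$want16" = yes ]; then opts="$opts --enable-pcre2-16"; fi
  if [ "$want32" = yes ]; then opts="$opts --enable-pcre2-32"; fi
  # Drop the leading space, if there is one.
  echo "${opts# }"
}

pcre2_width_options 16 32
# prints: --disable-pcre2-8 --enable-pcre2-16 --enable-pcre2-32
```

       The output could then be appended to a configure command, for
       example ./configure $(pcre2_width_options 16 32).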
4098*22dc650dSSadaf Ebrahimi
4099*22dc650dSSadaf Ebrahimi
4100*22dc650dSSadaf EbrahimiBUILDING SHARED AND STATIC LIBRARIES
4101*22dc650dSSadaf Ebrahimi
4102*22dc650dSSadaf Ebrahimi       The Autotools PCRE2 building process uses libtool to build both  shared
4103*22dc650dSSadaf Ebrahimi       and  static  libraries by default. You can suppress an unwanted library
4104*22dc650dSSadaf Ebrahimi       by adding one of
4105*22dc650dSSadaf Ebrahimi
4106*22dc650dSSadaf Ebrahimi         --disable-shared
4107*22dc650dSSadaf Ebrahimi         --disable-static
4108*22dc650dSSadaf Ebrahimi
4109*22dc650dSSadaf Ebrahimi       to the configure command. Setting --disable-shared ensures  that  PCRE2
4110*22dc650dSSadaf Ebrahimi       libraries  are  built  as  static libraries. The binaries that are then
4111*22dc650dSSadaf Ebrahimi       created as part of  the  build  process  (for  example,  pcre2test  and
4112*22dc650dSSadaf Ebrahimi       pcre2grep)  are linked statically with one or more PCRE2 libraries, but
4113*22dc650dSSadaf Ebrahimi       may also be dynamically linked with other libraries such  as  libc.  If
4114*22dc650dSSadaf Ebrahimi       you  want these binaries to be fully statically linked, you can set LD-
4115*22dc650dSSadaf Ebrahimi       FLAGS like this:
4116*22dc650dSSadaf Ebrahimi
         LDFLAGS=--static ./configure --disable-shared
4118*22dc650dSSadaf Ebrahimi
4119*22dc650dSSadaf Ebrahimi       Note the two hyphens in --static. Of course, this works only if  static
4120*22dc650dSSadaf Ebrahimi       versions of all the relevant libraries are available for linking.
4121*22dc650dSSadaf Ebrahimi
4122*22dc650dSSadaf Ebrahimi
4123*22dc650dSSadaf EbrahimiUNICODE AND UTF SUPPORT
4124*22dc650dSSadaf Ebrahimi
4125*22dc650dSSadaf Ebrahimi       By  default,  PCRE2 is built with support for Unicode and UTF character
4126*22dc650dSSadaf Ebrahimi       strings.  To build it without Unicode support, add
4127*22dc650dSSadaf Ebrahimi
4128*22dc650dSSadaf Ebrahimi         --disable-unicode
4129*22dc650dSSadaf Ebrahimi
4130*22dc650dSSadaf Ebrahimi       to the configure command. This setting applies to all three  libraries.
4131*22dc650dSSadaf Ebrahimi       It  is  not  possible to build one library with Unicode support and an-
4132*22dc650dSSadaf Ebrahimi       other without in the same configuration.
4133*22dc650dSSadaf Ebrahimi
4134*22dc650dSSadaf Ebrahimi       Of itself, Unicode support does not make PCRE2 treat strings as  UTF-8,
4135*22dc650dSSadaf Ebrahimi       UTF-16 or UTF-32. To do that, applications that use the library can set
4136*22dc650dSSadaf Ebrahimi       the  PCRE2_UTF  option when they call pcre2_compile() to compile a pat-
4137*22dc650dSSadaf Ebrahimi       tern.  Alternatively, patterns may be started with  (*UTF)  unless  the
4138*22dc650dSSadaf Ebrahimi       application has locked this out by setting PCRE2_NEVER_UTF.
4139*22dc650dSSadaf Ebrahimi
4140*22dc650dSSadaf Ebrahimi       UTF support allows the libraries to process character code points up to
4141*22dc650dSSadaf Ebrahimi       0x10ffff  in  the  strings that they handle. Unicode support also gives
4142*22dc650dSSadaf Ebrahimi       access to the Unicode properties of characters, using  pattern  escapes
4143*22dc650dSSadaf Ebrahimi       such as \P, \p, and \X. Only the general category properties such as Lu
4144*22dc650dSSadaf Ebrahimi       and Nd, script names, and some bi-directional properties are supported.
4145*22dc650dSSadaf Ebrahimi       Details are given in the pcre2pattern documentation.
4146*22dc650dSSadaf Ebrahimi
4147*22dc650dSSadaf Ebrahimi       Pattern escapes such as \d and \w do not by default make use of Unicode
4148*22dc650dSSadaf Ebrahimi       properties.  The  application  can  request that they do by setting the
4149*22dc650dSSadaf Ebrahimi       PCRE2_UCP option. Unless the application  has  set  PCRE2_NEVER_UCP,  a
4150*22dc650dSSadaf Ebrahimi       pattern may also request this by starting with (*UCP).
4151*22dc650dSSadaf Ebrahimi
4152*22dc650dSSadaf Ebrahimi
4153*22dc650dSSadaf EbrahimiDISABLING THE USE OF \C
4154*22dc650dSSadaf Ebrahimi
4155*22dc650dSSadaf Ebrahimi       The \C escape sequence, which matches a single code unit, even in a UTF
4156*22dc650dSSadaf Ebrahimi       mode,  can  cause unpredictable behaviour because it may leave the cur-
4157*22dc650dSSadaf Ebrahimi       rent matching point in the middle of a multi-code-unit  character.  The
4158*22dc650dSSadaf Ebrahimi       application  can lock it out by setting the PCRE2_NEVER_BACKSLASH_C op-
4159*22dc650dSSadaf Ebrahimi       tion when calling pcre2_compile(). There is also a build-time option
4160*22dc650dSSadaf Ebrahimi
4161*22dc650dSSadaf Ebrahimi         --enable-never-backslash-C
4162*22dc650dSSadaf Ebrahimi
4163*22dc650dSSadaf Ebrahimi       (note the upper case C) which locks out the use of \C entirely.
4164*22dc650dSSadaf Ebrahimi
4165*22dc650dSSadaf Ebrahimi
4166*22dc650dSSadaf EbrahimiJUST-IN-TIME COMPILER SUPPORT
4167*22dc650dSSadaf Ebrahimi
4168*22dc650dSSadaf Ebrahimi       Just-in-time (JIT) compiler support is included in the build by  speci-
4169*22dc650dSSadaf Ebrahimi       fying
4170*22dc650dSSadaf Ebrahimi
4171*22dc650dSSadaf Ebrahimi         --enable-jit
4172*22dc650dSSadaf Ebrahimi
       This support is available only for certain hardware architectures. If
       this option is set for an unsupported architecture, a build error
       occurs. If in doubt, use
4176*22dc650dSSadaf Ebrahimi
4177*22dc650dSSadaf Ebrahimi         --enable-jit=auto
4178*22dc650dSSadaf Ebrahimi
4179*22dc650dSSadaf Ebrahimi       which  enables  JIT  only if the current hardware is supported. You can
4180*22dc650dSSadaf Ebrahimi       check if JIT is enabled in the configuration summary that is output  at
4181*22dc650dSSadaf Ebrahimi       the  end  of a configure run. If you are enabling JIT under SELinux you
4182*22dc650dSSadaf Ebrahimi       may also want to add
4183*22dc650dSSadaf Ebrahimi
4184*22dc650dSSadaf Ebrahimi         --enable-jit-sealloc
4185*22dc650dSSadaf Ebrahimi
4186*22dc650dSSadaf Ebrahimi       which enables the use of an execmem allocator in JIT that is compatible
4187*22dc650dSSadaf Ebrahimi       with SELinux. This has no  effect  if  JIT  is  not  enabled.  See  the
4188*22dc650dSSadaf Ebrahimi       pcre2jit  documentation for a discussion of JIT usage. When JIT support
4189*22dc650dSSadaf Ebrahimi       is enabled, pcre2grep automatically makes use of it, unless you add
4190*22dc650dSSadaf Ebrahimi
4191*22dc650dSSadaf Ebrahimi         --disable-pcre2grep-jit
4192*22dc650dSSadaf Ebrahimi
4193*22dc650dSSadaf Ebrahimi       to the configure command.
4194*22dc650dSSadaf Ebrahimi
4195*22dc650dSSadaf Ebrahimi
4196*22dc650dSSadaf EbrahimiNEWLINE RECOGNITION
4197*22dc650dSSadaf Ebrahimi
4198*22dc650dSSadaf Ebrahimi       By default, PCRE2 interprets the linefeed (LF) character as  indicating
4199*22dc650dSSadaf Ebrahimi       the  end  of  a line. This is the normal newline character on Unix-like
4200*22dc650dSSadaf Ebrahimi       systems. You can compile PCRE2 to use carriage return (CR) instead,  by
4201*22dc650dSSadaf Ebrahimi       adding
4202*22dc650dSSadaf Ebrahimi
4203*22dc650dSSadaf Ebrahimi         --enable-newline-is-cr
4204*22dc650dSSadaf Ebrahimi
4205*22dc650dSSadaf Ebrahimi       to  the  configure command. There is also an --enable-newline-is-lf op-
4206*22dc650dSSadaf Ebrahimi       tion, which explicitly specifies linefeed as the newline character.
4207*22dc650dSSadaf Ebrahimi
4208*22dc650dSSadaf Ebrahimi       Alternatively, you can specify that line endings are to be indicated by
4209*22dc650dSSadaf Ebrahimi       the two-character sequence CRLF (CR immediately followed by LF). If you
4210*22dc650dSSadaf Ebrahimi       want this, add
4211*22dc650dSSadaf Ebrahimi
4212*22dc650dSSadaf Ebrahimi         --enable-newline-is-crlf
4213*22dc650dSSadaf Ebrahimi
4214*22dc650dSSadaf Ebrahimi       to the configure command. There is a fourth option, specified by
4215*22dc650dSSadaf Ebrahimi
4216*22dc650dSSadaf Ebrahimi         --enable-newline-is-anycrlf
4217*22dc650dSSadaf Ebrahimi
4218*22dc650dSSadaf Ebrahimi       which causes PCRE2 to recognize any of the three sequences CR,  LF,  or
4219*22dc650dSSadaf Ebrahimi       CRLF as indicating a line ending. A fifth option, specified by
4220*22dc650dSSadaf Ebrahimi
4221*22dc650dSSadaf Ebrahimi         --enable-newline-is-any
4222*22dc650dSSadaf Ebrahimi
4223*22dc650dSSadaf Ebrahimi       causes  PCRE2  to  recognize  any Unicode newline sequence. The Unicode
4224*22dc650dSSadaf Ebrahimi       newline sequences are the three just mentioned, plus the single charac-
4225*22dc650dSSadaf Ebrahimi       ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
4226*22dc650dSSadaf Ebrahimi       U+0085), LS (line separator,  U+2028),  and  PS  (paragraph  separator,
4227*22dc650dSSadaf Ebrahimi       U+2029). The final option is
4228*22dc650dSSadaf Ebrahimi
4229*22dc650dSSadaf Ebrahimi         --enable-newline-is-nul
4230*22dc650dSSadaf Ebrahimi
4231*22dc650dSSadaf Ebrahimi       which  causes  NUL  (binary  zero) to be set as the default line-ending
4232*22dc650dSSadaf Ebrahimi       character.
4233*22dc650dSSadaf Ebrahimi
4234*22dc650dSSadaf Ebrahimi       Whatever default line ending convention is selected when PCRE2 is built
4235*22dc650dSSadaf Ebrahimi       can be overridden by applications that use the library. At  build  time
4236*22dc650dSSadaf Ebrahimi       it is recommended to use the standard for your operating system.
4237*22dc650dSSadaf Ebrahimi
4238*22dc650dSSadaf Ebrahimi
4239*22dc650dSSadaf EbrahimiWHAT \R MATCHES
4240*22dc650dSSadaf Ebrahimi
4241*22dc650dSSadaf Ebrahimi       By  default,  the  sequence \R in a pattern matches any Unicode newline
4242*22dc650dSSadaf Ebrahimi       sequence, independently of what has been selected as  the  line  ending
4243*22dc650dSSadaf Ebrahimi       sequence. If you specify
4244*22dc650dSSadaf Ebrahimi
4245*22dc650dSSadaf Ebrahimi         --enable-bsr-anycrlf
4246*22dc650dSSadaf Ebrahimi
4247*22dc650dSSadaf Ebrahimi       the  default  is changed so that \R matches only CR, LF, or CRLF. What-
4248*22dc650dSSadaf Ebrahimi       ever is selected when PCRE2 is built can be overridden by  applications
4249*22dc650dSSadaf Ebrahimi       that use the library.
4250*22dc650dSSadaf Ebrahimi
4251*22dc650dSSadaf Ebrahimi
4252*22dc650dSSadaf EbrahimiHANDLING VERY LARGE PATTERNS
4253*22dc650dSSadaf Ebrahimi
4254*22dc650dSSadaf Ebrahimi       Within  a  compiled  pattern,  offset values are used to point from one
4255*22dc650dSSadaf Ebrahimi       part to another (for example, from an opening parenthesis to an  alter-
4256*22dc650dSSadaf Ebrahimi       nation  metacharacter).  By default, in the 8-bit and 16-bit libraries,
4257*22dc650dSSadaf Ebrahimi       two-byte values are used for these offsets, leading to a  maximum  size
4258*22dc650dSSadaf Ebrahimi       for a compiled pattern of around 64 thousand code units. This is suffi-
4259*22dc650dSSadaf Ebrahimi       cient  to handle all but the most gigantic patterns. Nevertheless, some
4260*22dc650dSSadaf Ebrahimi       people do want to process truly enormous patterns, so it is possible to
4261*22dc650dSSadaf Ebrahimi       compile PCRE2 to use three-byte or four-byte offsets by adding  a  set-
4262*22dc650dSSadaf Ebrahimi       ting such as
4263*22dc650dSSadaf Ebrahimi
4264*22dc650dSSadaf Ebrahimi         --with-link-size=3
4265*22dc650dSSadaf Ebrahimi
4266*22dc650dSSadaf Ebrahimi       to  the  configure command. The value given must be 2, 3, or 4. For the
4267*22dc650dSSadaf Ebrahimi       16-bit library, a value of 3 is rounded up to 4.  In  these  libraries,
4268*22dc650dSSadaf Ebrahimi       using  longer  offsets slows down the operation of PCRE2 because it has
4269*22dc650dSSadaf Ebrahimi       to load additional data when handling them. For the 32-bit library  the
4270*22dc650dSSadaf Ebrahimi       value  is  always 4 and cannot be overridden; the value of --with-link-
4271*22dc650dSSadaf Ebrahimi       size is ignored.
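
       The rounding rules above can be summarised in a short shell sketch.
       The function names effective_link_size and max_offset are
       hypothetical. max_offset only computes the largest value that fits
       in the given number of bytes; the exact pattern size limit enforced
       by the library is not stated here and may be lower.

```shell
# Hypothetical helper: report the link size actually used, given the
# library width (8, 16 or 32) and the value given to --with-link-size.
effective_link_size() {
  width=$1
  requested=$2
  case "$requested" in
    2|3|4) ;;
    *) echo "link size must be 2, 3, or 4" >&2; return 1 ;;
  esac
  case "$width" in
    8)  echo "$requested" ;;
    # In the 16-bit library a value of 3 is rounded up to 4.
    16) if [ "$requested" -eq 3 ]; then echo 4; else echo "$requested"; fi ;;
    # In the 32-bit library the value is always 4.
    32) echo 4 ;;
    *)  echo "unknown width: $width" >&2; return 1 ;;
  esac
}

# Largest offset that fits in n bytes: 2^(8*n) - 1.
# Assumes the shell does 64-bit integer arithmetic.
max_offset() {
  echo $(( (1 << (8 * $1)) - 1 ))
}

effective_link_size 16 3   # prints 4
effective_link_size 32 2   # prints 4
max_offset 2               # prints 65535
```

       The value 65535 for two-byte offsets corresponds to the "around 64
       thousand code units" mentioned above.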
4272*22dc650dSSadaf Ebrahimi
4273*22dc650dSSadaf Ebrahimi
4274*22dc650dSSadaf EbrahimiLIMITING PCRE2 RESOURCE USAGE
4275*22dc650dSSadaf Ebrahimi
4276*22dc650dSSadaf Ebrahimi       The pcre2_match() function increments a counter each time it goes round
4277*22dc650dSSadaf Ebrahimi       its main loop. Putting a limit on this counter controls the  amount  of
4278*22dc650dSSadaf Ebrahimi       computing  resource  used  by a single call to pcre2_match(). The limit
4279*22dc650dSSadaf Ebrahimi       can be changed at run time, as described in the pcre2api documentation.
4280*22dc650dSSadaf Ebrahimi       The default is 10 million, but this can be changed by adding a  setting
4281*22dc650dSSadaf Ebrahimi       such as
4282*22dc650dSSadaf Ebrahimi
4283*22dc650dSSadaf Ebrahimi         --with-match-limit=500000
4284*22dc650dSSadaf Ebrahimi
4285*22dc650dSSadaf Ebrahimi       to   the   configure   command.   This  setting  also  applies  to  the
4286*22dc650dSSadaf Ebrahimi       pcre2_dfa_match() matching function, and to JIT  matching  (though  the
4287*22dc650dSSadaf Ebrahimi       counting is done differently).
4288*22dc650dSSadaf Ebrahimi
4289*22dc650dSSadaf Ebrahimi       The  pcre2_match()  function  uses  heap  memory to record backtracking
4290*22dc650dSSadaf Ebrahimi       points. The more nested backtracking points there  are  (that  is,  the
4291*22dc650dSSadaf Ebrahimi       deeper  the  search tree), the more memory is needed. There is an upper
       limit, specified in kibibytes (units of 1024 bytes). This limit can be
       changed at run time, as described in the pcre2api documentation. The
       default limit is 20 million KiB, which is in effect unlimited. You can
       change this by a setting such as
4296*22dc650dSSadaf Ebrahimi
4297*22dc650dSSadaf Ebrahimi         --with-heap-limit=500
4298*22dc650dSSadaf Ebrahimi
4299*22dc650dSSadaf Ebrahimi       which  limits the amount of heap to 500 KiB. This limit applies only to
4300*22dc650dSSadaf Ebrahimi       interpretive matching in pcre2_match() and pcre2_dfa_match(), which may
4301*22dc650dSSadaf Ebrahimi       also use the heap for internal workspace  when  processing  complicated
4302*22dc650dSSadaf Ebrahimi       patterns.  This limit does not apply when JIT (which has its own memory
4303*22dc650dSSadaf Ebrahimi       arrangements) is used.
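
       Because the heap limit is expressed in kibibytes, it can help to
       convert a setting to bytes when choosing a value. A minimal sketch,
       assuming a shell with 64-bit integer arithmetic; kib_to_bytes is a
       hypothetical helper name.

```shell
# Hypothetical helper: convert a --with-heap-limit value (KiB) to bytes.
kib_to_bytes() {
  echo $(( $1 * 1024 ))
}

kib_to_bytes 500        # prints 512000
kib_to_bytes 20000000   # prints 20480000000 (the default limit)
```

       The second figure shows why the default of 20 million is described
       as in effect unlimited: it is roughly 20 gigabytes.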
4304*22dc650dSSadaf Ebrahimi
4305*22dc650dSSadaf Ebrahimi       You can also explicitly limit the depth of nested backtracking  in  the
4306*22dc650dSSadaf Ebrahimi       pcre2_match() interpreter. This limit defaults to the value that is set
4307*22dc650dSSadaf Ebrahimi       for  --with-match-limit.  You  can set a lower default limit by adding,
4308*22dc650dSSadaf Ebrahimi       for example,
4309*22dc650dSSadaf Ebrahimi
4310*22dc650dSSadaf Ebrahimi         --with-match-limit-depth=10000
4311*22dc650dSSadaf Ebrahimi
4312*22dc650dSSadaf Ebrahimi       to the configure command. This value can be  overridden  at  run  time.
4313*22dc650dSSadaf Ebrahimi       This  depth  limit  indirectly limits the amount of heap memory that is
4314*22dc650dSSadaf Ebrahimi       used, but because the size of each backtracking "frame" depends on  the
4315*22dc650dSSadaf Ebrahimi       number  of  capturing parentheses in a pattern, the amount of heap that
4316*22dc650dSSadaf Ebrahimi       is used before the limit is reached varies  from  pattern  to  pattern.
4317*22dc650dSSadaf Ebrahimi       This limit was more useful in versions before 10.30, where function re-
4318*22dc650dSSadaf Ebrahimi       cursion was used for backtracking.
4319*22dc650dSSadaf Ebrahimi
4320*22dc650dSSadaf Ebrahimi       As well as applying to pcre2_match(), the depth limit also controls the
4321*22dc650dSSadaf Ebrahimi       depth  of recursive function calls in pcre2_dfa_match(). These are used
4322*22dc650dSSadaf Ebrahimi       for lookaround assertions, atomic groups,  and  recursion  within  pat-
4323*22dc650dSSadaf Ebrahimi       terns.  The limit does not apply to JIT matching.
4324*22dc650dSSadaf Ebrahimi
4325*22dc650dSSadaf Ebrahimi
4326*22dc650dSSadaf EbrahimiLIMITING VARIABLE-LENGTH LOOKBEHIND ASSERTIONS
4327*22dc650dSSadaf Ebrahimi
4328*22dc650dSSadaf Ebrahimi       Lookbehind  assertions  in which one or more branches can match a vari-
4329*22dc650dSSadaf Ebrahimi       able number of characters are supported only  if  there  is  a  maximum
4330*22dc650dSSadaf Ebrahimi       matching  length  for  each  top-level branch. There is a limit to this
4331*22dc650dSSadaf Ebrahimi       maximum that defaults to 255 characters. You can alter this default  by
4332*22dc650dSSadaf Ebrahimi       a setting such as
4333*22dc650dSSadaf Ebrahimi
4334*22dc650dSSadaf Ebrahimi         --with-max-varlookbehind=100
4335*22dc650dSSadaf Ebrahimi
4336*22dc650dSSadaf Ebrahimi       The limit can be changed at runtime by calling pcre2_set_max_varlookbe-
4337*22dc650dSSadaf Ebrahimi       hind().  Lookbehind  assertions  in  which every branch matches a fixed
4338*22dc650dSSadaf Ebrahimi       number of characters (not necessarily all the same) are not constrained
4339*22dc650dSSadaf Ebrahimi       by this limit.
4340*22dc650dSSadaf Ebrahimi
4341*22dc650dSSadaf Ebrahimi
4342*22dc650dSSadaf EbrahimiCREATING CHARACTER TABLES AT BUILD TIME
4343*22dc650dSSadaf Ebrahimi
4344*22dc650dSSadaf Ebrahimi       PCRE2 uses fixed tables for processing characters whose code points are
4345*22dc650dSSadaf Ebrahimi       less than 256. By default, PCRE2 is built with a set of tables that are
4346*22dc650dSSadaf Ebrahimi       distributed in the file src/pcre2_chartables.c.dist. These  tables  are
4347*22dc650dSSadaf Ebrahimi       for ASCII codes only. If you add
4348*22dc650dSSadaf Ebrahimi
4349*22dc650dSSadaf Ebrahimi         --enable-rebuild-chartables
4350*22dc650dSSadaf Ebrahimi
       to the configure command, the distributed tables are no longer used.
       Instead, a program called pcre2_dftables is compiled and run. This
       outputs the source for a new set of tables, created in the default
       locale of your C run-time system. This method of replacing the tables
       does not work if you are cross compiling, because pcre2_dftables needs
       to be run on the local host and therefore must not be compiled with
       the cross compiler.
4357*22dc650dSSadaf Ebrahimi
4358*22dc650dSSadaf Ebrahimi       If you need to create alternative tables when cross compiling, you will
4359*22dc650dSSadaf Ebrahimi       have  to  do so "by hand". There may also be other reasons for creating
4360*22dc650dSSadaf Ebrahimi       tables manually.  To cause pcre2_dftables to  be  built  on  the  local
4361*22dc650dSSadaf Ebrahimi       host, run a normal compiling command, and then run the program with the
4362*22dc650dSSadaf Ebrahimi       output file as its argument, for example:
4363*22dc650dSSadaf Ebrahimi
4364*22dc650dSSadaf Ebrahimi         cc src/pcre2_dftables.c -o pcre2_dftables
4365*22dc650dSSadaf Ebrahimi         ./pcre2_dftables src/pcre2_chartables.c
4366*22dc650dSSadaf Ebrahimi
4367*22dc650dSSadaf Ebrahimi       This  builds the tables in the default locale of the local host. If you
4368*22dc650dSSadaf Ebrahimi       want to specify a locale, you must use the -L option:
4369*22dc650dSSadaf Ebrahimi
4370*22dc650dSSadaf Ebrahimi         LC_ALL=fr_FR ./pcre2_dftables -L src/pcre2_chartables.c
4371*22dc650dSSadaf Ebrahimi
4372*22dc650dSSadaf Ebrahimi       You can also specify -b (with or without -L). This causes the tables to
4373*22dc650dSSadaf Ebrahimi       be written in binary instead of as source code. A set of binary  tables
4374*22dc650dSSadaf Ebrahimi       can  be  loaded  into memory by an application and passed to pcre2_com-
4375*22dc650dSSadaf Ebrahimi       pile() in the same way as tables created by calling pcre2_maketables().
4376*22dc650dSSadaf Ebrahimi       The tables are just a string of bytes, independent of hardware  charac-
4377*22dc650dSSadaf Ebrahimi       teristics  such  as  endianness. This means they can be bundled with an
4378*22dc650dSSadaf Ebrahimi       application that runs in different environments, to  ensure  consistent
4379*22dc650dSSadaf Ebrahimi       behaviour.
4380*22dc650dSSadaf Ebrahimi
4381*22dc650dSSadaf Ebrahimi
4382*22dc650dSSadaf EbrahimiUSING EBCDIC CODE
4383*22dc650dSSadaf Ebrahimi
4384*22dc650dSSadaf Ebrahimi       PCRE2  assumes  by default that it will run in an environment where the
4385*22dc650dSSadaf Ebrahimi       character code is ASCII or Unicode, which is a superset of ASCII.  This
4386*22dc650dSSadaf Ebrahimi       is the case for most computer operating systems. PCRE2 can, however, be
4387*22dc650dSSadaf Ebrahimi       compiled to run in an 8-bit EBCDIC environment by adding
4388*22dc650dSSadaf Ebrahimi
4389*22dc650dSSadaf Ebrahimi         --enable-ebcdic --disable-unicode
4390*22dc650dSSadaf Ebrahimi
4391*22dc650dSSadaf Ebrahimi       to the configure command. This setting implies --enable-rebuild-charta-
4392*22dc650dSSadaf Ebrahimi       bles.  You should only use it if you know that you are in an EBCDIC en-
4393*22dc650dSSadaf Ebrahimi       vironment (for example, an IBM mainframe operating system).
4394*22dc650dSSadaf Ebrahimi
4395*22dc650dSSadaf Ebrahimi       It is not possible to support both EBCDIC and UTF-8 codes in  the  same
4396*22dc650dSSadaf Ebrahimi       version  of  the  library. Consequently, --enable-unicode and --enable-
4397*22dc650dSSadaf Ebrahimi       ebcdic are mutually exclusive.
4398*22dc650dSSadaf Ebrahimi
4399*22dc650dSSadaf Ebrahimi       The EBCDIC character that corresponds to an ASCII LF is assumed to have
4400*22dc650dSSadaf Ebrahimi       the value 0x15 by default. However, in some EBCDIC  environments,  0x25
4401*22dc650dSSadaf Ebrahimi       is used. In such an environment you should use
4402*22dc650dSSadaf Ebrahimi
4403*22dc650dSSadaf Ebrahimi         --enable-ebcdic-nl25
4404*22dc650dSSadaf Ebrahimi
4405*22dc650dSSadaf Ebrahimi       as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
4406*22dc650dSSadaf Ebrahimi       has  the  same  value  as in ASCII, namely, 0x0d. Whichever of 0x15 and
4407*22dc650dSSadaf Ebrahimi       0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
4408*22dc650dSSadaf Ebrahimi       acter (which, in Unicode, is 0x85).
4409*22dc650dSSadaf Ebrahimi
4410*22dc650dSSadaf Ebrahimi       The options that select newline behaviour, such as --enable-newline-is-
4411*22dc650dSSadaf Ebrahimi       cr, and equivalent run-time options, refer to these character values in
4412*22dc650dSSadaf Ebrahimi       an EBCDIC environment.
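
       The way the two candidate values swap roles can be sketched as
       follows. This is an illustration only; ebcdic_newline_codes is a
       hypothetical helper, and its argument "nl25" stands for a build
       configured with --enable-ebcdic-nl25.

```shell
# Hypothetical helper: show which EBCDIC code points stand for LF, NEL
# and CR, according to the description above.
ebcdic_newline_codes() {
  case "$1" in
    # With --enable-ebcdic-nl25, 0x25 is LF and 0x15 becomes NEL.
    nl25) echo "LF=0x25 NEL=0x15 CR=0x0d" ;;
    # Otherwise 0x15 is LF and 0x25 becomes NEL.
    *)    echo "LF=0x15 NEL=0x25 CR=0x0d" ;;
  esac
}

ebcdic_newline_codes default   # prints LF=0x15 NEL=0x25 CR=0x0d
ebcdic_newline_codes nl25      # prints LF=0x25 NEL=0x15 CR=0x0d
```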
4413*22dc650dSSadaf Ebrahimi
4414*22dc650dSSadaf Ebrahimi
4415*22dc650dSSadaf EbrahimiPCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS
4416*22dc650dSSadaf Ebrahimi
4417*22dc650dSSadaf Ebrahimi       By default pcre2grep supports the use of callouts with string arguments
4418*22dc650dSSadaf Ebrahimi       within the patterns it is matching. There are two kinds: one that  gen-
4419*22dc650dSSadaf Ebrahimi       erates output using local code, and another that calls an external pro-
4420*22dc650dSSadaf Ebrahimi       gram  or  script.   If --disable-pcre2grep-callout-fork is added to the
4421*22dc650dSSadaf Ebrahimi       configure command, only the first kind  of  callout  is  supported;  if
4422*22dc650dSSadaf Ebrahimi       --disable-pcre2grep-callout  is  used,  all callouts are completely ig-
4423*22dc650dSSadaf Ebrahimi       nored. For more details of pcre2grep callouts, see the pcre2grep  docu-
4424*22dc650dSSadaf Ebrahimi       mentation.
4425*22dc650dSSadaf Ebrahimi
4426*22dc650dSSadaf Ebrahimi
4427*22dc650dSSadaf EbrahimiPCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
4428*22dc650dSSadaf Ebrahimi
4429*22dc650dSSadaf Ebrahimi       By  default,  pcre2grep reads all files as plain text. You can build it
4430*22dc650dSSadaf Ebrahimi       so that it recognizes files whose names end in .gz or .bz2,  and  reads
4431*22dc650dSSadaf Ebrahimi       them with libz or libbz2, respectively, by adding one or both of
4432*22dc650dSSadaf Ebrahimi
4433*22dc650dSSadaf Ebrahimi         --enable-pcre2grep-libz
4434*22dc650dSSadaf Ebrahimi         --enable-pcre2grep-libbz2
4435*22dc650dSSadaf Ebrahimi
4436*22dc650dSSadaf Ebrahimi       to the configure command. These options naturally require that the rel-
4437*22dc650dSSadaf Ebrahimi       evant  libraries  are installed on your system. Configuration will fail
4438*22dc650dSSadaf Ebrahimi       if they are not.


PCRE2GREP BUFFER SIZE

       pcre2grep uses an internal buffer to hold a "window" on the file it  is
       scanning, in order to be able to output "before" and "after" lines when
       it finds a match. The default starting size of the buffer is 20KiB. The
       buffer  itself  is  three times this size, but because of the way it is
       used for holding "before" lines, the longest line that is guaranteed to
       be processable is the notional buffer size. If a longer line is encoun-
       tered, pcre2grep automatically expands the buffer, up  to  a  specified
       maximum  size, whose default is 1MiB or the starting size, whichever is
       the larger. You can change the default parameter values by adding,  for
       example,

         --with-pcre2grep-bufsize=51200
         --with-pcre2grep-max-bufsize=2097152

       to  the  configure  command. The caller of pcre2grep can override these
       values by using --buffer-size  and  --max-buffer-size  on  the  command
       line.
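
       The size arithmetic described above can be sketched in C as follows.
       The function names are invented for this example, and growth by
       doubling is an assumption made only so that the sketch is complete;
       the text above does not say how pcre2grep chooses each new size.

```c
#include <assert.h>
#include <stddef.h>

#define KIB ((size_t)1024)
#define MIB (KIB * KIB)

/* The default maximum is 1MiB or the starting size, whichever is larger. */
static size_t default_max_bufsize(size_t start)
{
    return start > MIB ? start : MIB;
}

/* The memory actually allocated is three times the notional size. */
static size_t allocated_bytes(size_t notional)
{
    return 3 * notional;
}

/*
 * Notional size after growing enough to hold a line of line_len bytes, or 0
 * if the line is longer than max. Doubling is an assumed growth policy.
 */
static size_t grow_for_line(size_t notional, size_t max, size_t line_len)
{
    assert(notional > 0 && notional <= max);
    while (notional < line_len) {
        if (notional == max) return 0;            /* cannot grow further */
        notional = (notional > max / 2) ? max : notional * 2;
    }
    return notional;
}
```

       With the defaults, a 20KiB notional size means 60KiB of allocated
       memory, and a line longer than 1MiB cannot be accommodated.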


PCRE2TEST OPTION FOR LIBREADLINE SUPPORT

       If you add one of

         --enable-pcre2test-libreadline
         --enable-pcre2test-libedit

       to  the configure command, pcre2test is linked with the libreadline or
       libedit library, respectively, and when its input is from  a  terminal,
       it  reads  it using the readline() function. This provides line-editing
       and history facilities. Note that libreadline is  GPL-licensed,  so  if
       you  distribute  a binary of pcre2test linked in this way, there may be
       licensing issues. These can be avoided by linking instead with libedit,
       which has a BSD licence.

       Setting --enable-pcre2test-libreadline causes the -lreadline option  to
       be  added to the pcre2test build. In many operating environments with a
       system-installed readline library this is sufficient. However, in  some
       environments (e.g. if an unmodified distribution version of readline is
       in  use),  some  extra configuration may be necessary. The INSTALL file
       for libreadline says this:

         "Readline uses the termcap functions, but does not link with
         the termcap or curses library itself, allowing applications
         which link with readline to choose an appropriate library."

       If your environment has not been set up so that an appropriate  library
       is automatically included, you may need to add something like

         LIBS="-lncurses"

       immediately before the configure command.


INCLUDING DEBUGGING CODE

       If you add

         --enable-debug

       to  the configure command, additional debugging code is included in the
       build. This feature is intended for use by the PCRE2 maintainers.


DEBUGGING WITH VALGRIND SUPPORT

       If you add

         --enable-valgrind

       to the configure command, PCRE2 will use valgrind annotations  to  mark
       certain  memory  regions as unaddressable. This allows it to detect in-
       valid memory accesses, and is mostly useful for debugging PCRE2 itself.


CODE COVERAGE REPORTING

       If your C compiler is gcc, you can build a version of  PCRE2  that  can
       generate a code coverage report for its test suite. To enable this, you
       must install lcov version 1.6 or above. Then specify

         --enable-coverage

       to the configure command and build PCRE2 in the usual way.

       Note that using ccache (a caching C compiler) is incompatible with code
       coverage  reporting. If you have configured ccache to run automatically
       on your system, you must set the environment variable

         CCACHE_DISABLE=1

       before running make to build PCRE2, so that ccache is not used.

       When --enable-coverage is used, the following  additional  targets  are
       added to the Makefile:

         make coverage

       This  creates  a  fresh coverage report for the PCRE2 test suite. It is
       equivalent to running "make coverage-reset", "make  coverage-baseline",
       "make check", and then "make coverage-report".

         make coverage-reset

       This zeroes the coverage counters, but does nothing else.

         make coverage-baseline

       This captures baseline coverage information.

         make coverage-report

       This creates the coverage report.

         make coverage-clean-report

       This  removes the generated coverage report without cleaning the cover-
       age data itself.

         make coverage-clean-data

       This removes the captured coverage data without removing  the  coverage
       files created at compile time (*.gcno).

         make coverage-clean

       This  cleans all coverage data including the generated coverage report.
       For more information about code coverage, see the gcov and  lcov  docu-
       mentation.


DISABLING THE Z AND T FORMATTING MODIFIERS

       The  C99  standard  defines formatting modifiers z and t for size_t and
       ptrdiff_t values, respectively. By default, PCRE2 uses these  modifiers
       in environments other than old versions of Microsoft Visual Studio when
       __STDC_VERSION__  is  defined  and has a value greater than or equal to
       199901L (indicating support for C99).  However, there is at  least  one
       environment that claims to be C99 but does not support these modifiers.
       If

         --disable-percent-zt

       is specified, no use is made of the z or t modifiers. Instead of %td or
       %zu,  a  suitable  format is used depending on the size of long for the
       platform.
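
       The following C sketch shows the general idea of switching between the
       C99 modifiers and a fallback. The macro names (NO_PERCENT_ZT, SIZE_FMT
       and so on) are invented for this example and are not the names used
       inside PCRE2. The fallback shown assumes that long is at least as wide
       as size_t and ptrdiff_t, which does not hold on every platform; that
       is why the real choice depends on the size of long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Define NO_PERCENT_ZT (a name invented for this sketch) to avoid %zu/%td. */
#ifdef NO_PERCENT_ZT
#define SIZE_FMT        "%lu"
#define PTRDIFF_FMT     "%ld"
#define SIZE_ARG(x)     ((unsigned long)(x))   /* assumes long is wide enough */
#define PTRDIFF_ARG(x)  ((long)(x))
#else
#define SIZE_FMT        "%zu"
#define PTRDIFF_FMT     "%td"
#define SIZE_ARG(x)     (x)
#define PTRDIFF_ARG(x)  (x)
#endif

/* Format "length=<n> offset=<d>" into out; returns what snprintf() returns. */
static int format_sizes(char *out, size_t outsize, size_t length,
                        ptrdiff_t offset)
{
    assert(out != NULL && outsize > 0);
    return snprintf(out, outsize, "length=" SIZE_FMT " offset=" PTRDIFF_FMT,
                    SIZE_ARG(length), PTRDIFF_ARG(offset));
}
```

       Keeping the format string and the argument cast together in one pair
       of macros means the two cannot drift apart.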


SUPPORT FOR FUZZERS

       There is a special option for use by people who  want  to  run  fuzzing
       tests on PCRE2:

         --enable-fuzz-support

       At present this applies only to the 8-bit library. If set, it causes an
       extra  library  called  libpcre2-fuzzsupport.a to be built, but not in-
       stalled. This contains a single  function  called  LLVMFuzzerTestOneIn-
       put()  whose  arguments are a pointer to a string and the length of the
       string. When called, this function tries to compile  the  string  as  a
       pattern,  and if that succeeds, to match it.  This is done both with no
       options and with some random option bits that are  generated  from  the
       string.

       Setting  --enable-fuzz-support  also  causes  a binary called pcre2fuz-
       zcheck to be created. This is normally run under valgrind or used  when
       PCRE2 is compiled with address sanitizing enabled. It calls the fuzzing
       function  and  outputs  information  about  what it is doing. The input
       strings are specified by arguments: if an argument starts with "="  the
       rest  of it is a literal input string. Otherwise, it is assumed to be a
       file name, and the contents of the file are the test string.
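
       The argument convention can be sketched as a small C driver. This is
       not the pcre2fuzzcheck source. demo_fuzz_target() is a stand-in with
       the usual libFuzzer entry-point signature that only records the input
       length, so the sketch builds without the fuzz support library;
       run_argument() is likewise an invented name.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in fuzz target: records the size instead of compiling and matching. */
static size_t last_size;
static int demo_fuzz_target(const uint8_t *data, size_t size)
{
    (void)data;
    last_size = size;
    return 0;
}

/* Feed one command-line argument to the target: "=text" is a literal input,
   anything else names a file. Returns 0 on success, -1 on I/O failure. */
static int run_argument(const char *arg)
{
    assert(arg != NULL);
    if (arg[0] == '=') {
        const char *text = arg + 1;
        return demo_fuzz_target((const uint8_t *)text, strlen(text));
    }
    FILE *f = fopen(arg, "rb");
    if (f == NULL) return -1;
    uint8_t *buf = NULL;
    size_t used = 0, cap = 0, n;
    uint8_t chunk[4096];
    while ((n = fread(chunk, 1, sizeof chunk, f)) > 0) {
        if (used + n > cap) {                     /* grow the file buffer */
            cap = (used + n) * 2;
            uint8_t *grown = realloc(buf, cap);
            if (grown == NULL) { free(buf); fclose(f); return -1; }
            buf = grown;
        }
        memcpy(buf + used, chunk, n);
        used += n;
    }
    fclose(f);
    int rc = demo_fuzz_target(buf, used);
    free(buf);
    return rc;
}
```

       A real driver would call LLVMFuzzerTestOneInput() from the fuzz
       support library in place of the stand-in.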


OBSOLETE OPTION

       In versions of PCRE2 prior to 10.30, there were two  ways  of  handling
       backtracking  in the pcre2_match() function. The default was to use the
       system stack, but if

         --disable-stack-for-recursion

       was set, memory on the heap was used. From release 10.30  onwards  this
       has  changed  (the  stack  is  no longer used) and this option now does
       nothing except give a warning.


SEE ALSO

       pcre2api(3), pcre2-config(3).


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 15 April 2024
       Copyright (c) 1997-2024 University of Cambridge.


PCRE2 10.44                      15 April 2024                   PCRE2BUILD(3)
------------------------------------------------------------------------------



PCRE2CALLOUT(3)            Library Functions Manual            PCRE2CALLOUT(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


SYNOPSIS

       #include <pcre2.h>

       int (*pcre2_callout)(pcre2_callout_block *, void *);

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);


DESCRIPTION

       PCRE2  provides  a  feature  called "callout", which is a means of tem-
       porarily passing control to the caller of PCRE2 in the middle  of  pat-
       tern  matching.  The  caller  of PCRE2 provides an external function by
       putting its entry point in a match context (see pcre2_set_callout()  in
       the pcre2api documentation).

       When  using the pcre2_substitute() function, an additional callout fea-
       ture is available. This does a callout after each change to the subject
       string and is described in the pcre2api documentation; the rest of this
       document is concerned with callouts during pattern matching.

       Within a regular expression, (?C<arg>) indicates a point at  which  the
       external  function  is  to  be  called. Different callout points can be
       identified by putting a number less than 256 after the  letter  C.  The
       default  value is zero.  Alternatively, the argument may be a delimited
       string. The starting delimiter must be one of ` ' " ^ % # $ {  and  the
       ending delimiter is the same as the start, except for {, where the end-
       ing  delimiter  is  }.  If  the  ending  delimiter is needed within the
       string, it must be doubled. For example, this pattern has  two  callout
       points:

         (?C1)abc(?C"some ""arbitrary"" text")def
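
       The delimiter rules can be made concrete with a short C sketch that
       reads a delimited callout string and collapses doubled ending delim-
       iters. It is an illustration of the syntax only, written for 8-bit
       text; parse_callout_string() is an invented name and this is not how
       pcre2_compile() is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Parse a delimited callout string that starts at src[0]. The content, with
 * doubled ending delimiters collapsed, is written to out as a C string.
 * Returns the number of source characters consumed (both delimiters
 * included), or -1 if the starting delimiter is not allowed, the string is
 * unterminated, or out is too small.
 */
static long parse_callout_string(const char *src, char *out, size_t outsize)
{
    assert(src != NULL && out != NULL);
    if (src[0] == '\0' || strchr("`'\"^%#${", src[0]) == NULL) return -1;
    char end = (src[0] == '{') ? '}' : src[0];
    size_t i = 1, o = 0;
    for (;;) {
        if (src[i] == '\0') return -1;            /* no closing delimiter */
        if (src[i] == end) {
            if (src[i + 1] != end) break;         /* single: end of string */
            i++;                                  /* doubled: keep one copy */
        }
        if (o + 1 >= outsize) return -1;          /* leave room for the NUL */
        out[o++] = src[i++];
    }
    if (o >= outsize) return -1;
    out[o] = '\0';
    return (long)(i + 1);
}
```

       Applied to the second callout in the pattern above, the sketch yields
       the text some "arbitrary" text, with each doubled quote reduced to
       one.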

       If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
       PCRE2  automatically inserts callouts, all with number 255, before each
       item in the pattern except for immediately before or after an  explicit
       callout. For example, if PCRE2_AUTO_CALLOUT is used with the pattern

         A(?C3)B

       it is processed as if it were

         (?C255)A(?C3)B(?C255)

       Here is a more complicated example:

         A(\d{2}|--)

       With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were

         (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)

       Notice  that  there  is a callout before and after each parenthesis and
       alternation bar. If the pattern contains a conditional group whose con-
       dition is an assertion, an automatic callout  is  inserted  immediately
       before  the  condition. Such a callout may also be inserted explicitly,
       for example:

         (?(?C9)(?=a)ab|de)  (?(?C%text%)(?!=d)ab|de)

       This applies only to assertion conditions (because they are  themselves
       independent groups).

       Callouts  can  be useful for tracking the progress of pattern matching.
       The pcre2test program has a pattern qualifier (/auto_callout) that sets
       automatic callouts.  When any callouts are  present,  the  output  from
       pcre2test  indicates  how  the pattern is being matched. This is useful
       information when you are trying to optimize the performance of  a  par-
       ticular pattern.


MISSING CALLOUTS

       You  should  be  aware  that, because of optimizations in the way PCRE2
       compiles and matches patterns, callouts sometimes do not happen exactly
       as you might expect.

   Auto-possessification

       At compile time, PCRE2 "auto-possessifies" repeated items when it knows
       that what follows cannot be part of the repeat. For example, a+[bc]  is
       compiled  as if it were a++[bc]. The pcre2test output when this pattern
       is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
       to the string "aaaa" is:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
         No match

       This indicates that when matching [bc] fails, there is no  backtracking
       into a+ (because it is being treated as a++) and therefore the callouts
       that  would  be  taken for the backtracks do not occur. You can disable
       the  auto-possessify  feature  by  passing   PCRE2_NO_AUTO_POSSESS   to
       pcre2_compile(),  or  starting  the pattern with (*NO_AUTO_POSSESS). In
       this case, the output changes to this:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
          +2 ^  ^     [bc]
          +2 ^ ^      [bc]
          +2 ^^       [bc]
         No match

       This time, when matching [bc] fails, the matcher backtracks into a+ and
       tries again, repeatedly, until a+ itself fails.

   Automatic .* anchoring

       By default, an optimization is applied when .* is the first significant
       item in a pattern. If PCRE2_DOTALL is set, so that the  dot  can  match
       any  character,  the pattern is automatically anchored. If PCRE2_DOTALL
       is not set, a match can start only after an internal newline or at  the
       beginning of the subject, and pcre2_compile() remembers this. If a pat-
       tern  has more than one top-level branch, automatic anchoring occurs if
       all branches are anchorable.

       This optimization is disabled, however, if .* is in an atomic group  or
       if  there  is a backreference to the capture group in which it appears.
       It is also disabled if the pattern contains (*PRUNE) or  (*SKIP).  How-
       ever, the presence of callouts does not affect it.

       For  example,  if  the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
       and applied to the string "aa", the pcre2test output is:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
         No match

       This shows that all match attempts start at the beginning of  the  sub-
       ject. In other words, the pattern is anchored. You can disable this op-
       timization  by  passing  PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
       starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the  out-
       put changes to:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
          +0  ^     .*
          +2  ^^    \d
          +2  ^     \d
         No match

       This  shows more match attempts, starting at the second subject charac-
       ter.  Another optimization, described in the next section,  means  that
       there is no subsequent attempt to match with an empty subject.

   Other optimizations

       Other  optimizations  that  provide fast "no match" results also affect
       callouts.  For example, if the pattern is

         ab(?C4)cd

       PCRE2 knows that any matching string must contain the  letter  "d".  If
       the  subject  string  is  "abyz",  the  lack of "d" means that matching
       doesn't ever start, and the callout is  never  reached.  However,  with
       "abyd", though the result is still no match, the callout is obeyed.

       For  most  patterns  PCRE2  also knows the minimum length of a matching
       string, and will immediately give a "no match" return without  actually
       running  a  match if the subject is not long enough, or, for unanchored
       patterns, if it has been scanned far enough.

       You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
       MIZE option  to  pcre2_compile(),  or  by  starting  the  pattern  with
       (*NO_START_OPT).  This slows down the matching process, but does ensure
       that callouts such as the example above are obeyed.
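
       The effect of these pre-checks on callouts can be pictured with a much
       simplified C sketch. It is not PCRE2's implementation: the function
       name is invented, and the required character and minimum length are
       passed in as parameters, whereas PCRE2 derives such facts itself when
       it compiles the pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Simplified pre-check in the spirit of the start-of-match optimizations:
 * give up before matching if the subject is shorter than min_length or does
 * not contain the required character. Returns 1 if matching (and therefore
 * any callout) could start, 0 for an immediate "no match".
 */
static int matching_can_start(const char *subject, size_t length,
                              size_t min_length, char required)
{
    assert(subject != NULL);
    if (length < min_length) return 0;
    return memchr(subject, required, length) != NULL;
}
```

       For ab(?C4)cd, taking the required character as "d" and the minimum
       length as 4, the sketch rejects "abyz" before any matching starts, so
       the callout is never reached, but lets "abyd" through to the matcher.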


THE CALLOUT INTERFACE

       During matching, when PCRE2 reaches a callout  point,  if  an  external
       function  is  provided in the match context, it is called. This applies
       to normal, DFA, and JIT matching. The first  argument  to  the  callout
       function  is  a pointer to a pcre2_callout_block structure. The second
       argument is the void * callout data that was supplied when the  callout
       was set up by calling pcre2_set_callout() (see the pcre2api documenta-
       tion). The callout block structure contains the following fields,  not
       necessarily in this order:

         uint32_t      version;
         uint32_t      callout_number;
         uint32_t      capture_top;
         uint32_t      capture_last;
         uint32_t      callout_flags;
         PCRE2_SIZE   *offset_vector;
         PCRE2_SPTR    mark;
         PCRE2_SPTR    subject;
         PCRE2_SIZE    subject_length;
         PCRE2_SIZE    start_match;
         PCRE2_SIZE    current_position;
         PCRE2_SIZE    pattern_position;
         PCRE2_SIZE    next_item_length;
         PCRE2_SIZE    callout_string_offset;
         PCRE2_SIZE    callout_string_length;
         PCRE2_SPTR    callout_string;

       The  version field contains the version number of the block format. The
       current version is 2; the three callout string fields  were  added  for
       version  1, and the callout_flags field for version 2. If you are writ-
       ing an application that might use an  earlier  release  of  PCRE2,  you
       should  check  the version number before accessing any of these fields.
       The version number will increase in future if more  fields  are  added,
       but the intention is never to remove any of the existing fields.
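
       A version check of this kind might look like the following C sketch.
       So that it can be built without pcre2.h, it uses a stand-in structure,
       demo_callout_block, holding only the fields it touches; a real callout
       function receives a pcre2_callout_block instead. The helper names are
       invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in holding only the fields this sketch touches. */
typedef struct {
    uint32_t       version;
    uint32_t       callout_number;
    uint32_t       callout_flags;          /* valid from version 2 */
    size_t         callout_string_length;  /* valid from version 1 */
    const uint8_t *callout_string;         /* valid from version 1 */
} demo_callout_block;

/* Returns the flags, or 0 when the block predates the field. */
static uint32_t safe_callout_flags(const demo_callout_block *cb)
{
    assert(cb != NULL);
    return (cb->version >= 2) ? cb->callout_flags : 0;
}

/* Returns the string argument and its length, or NULL for numerical
   callouts and for version 0 blocks, which have no string fields. */
static const uint8_t *safe_callout_string(const demo_callout_block *cb,
                                          size_t *length)
{
    assert(cb != NULL && length != NULL);
    if (cb->version < 1 || cb->callout_string == NULL) {
        *length = 0;
        return NULL;
    }
    *length = cb->callout_string_length;
    return cb->callout_string;
}
```

       The point of the pattern is that the version is tested before a newer
       field is read, never after.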

   Fields for numerical callouts

       For  a  numerical  callout,  callout_string is NULL, and callout_number
       contains the number of the callout, in the range  0-255.  This  is  the
       number  that follows (?C for callouts that are part of the pattern; it
       is 255 for automatically generated callouts.

   Fields for string callouts

       For callouts with string arguments, callout_number is always zero,  and
       callout_string  points  to the string that is contained within the com-
       piled pattern. Its length is given by callout_string_length. Duplicated
       ending delimiters that were present in the original pattern string have
       been turned into single characters, but there is no other processing of
       the callout string argument. An additional code unit containing  binary
       zero  is  present  after the string, but is not included in the length.
       The delimiter that was used to start the string is also  stored  within
       the  pattern, immediately before the string itself. You can access this
       delimiter as callout_string[-1] if you need it.

       The callout_string_offset field is the code unit offset to the start of
       the callout argument string within the original pattern string. This is
       provided for the benefit of applications such as script languages  that
       might need to report errors in the callout string within the pattern.
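
       The distinction between the two kinds of callout, and the use of
       callout_string[-1], can be sketched in C as follows. The structure
       demo_callout_block is a stand-in containing only the fields used here,
       and it assumes the 8-bit library, where a code unit is one byte; real
       code would work on the pcre2_callout_block it is given. The function
       names are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in with only the fields used here (8-bit code units assumed). */
typedef struct {
    uint32_t       callout_number;
    size_t         callout_string_offset;
    size_t         callout_string_length;
    const uint8_t *callout_string;        /* NULL for a numerical callout */
} demo_callout_block;

/* Opening delimiter of a string callout, or 0 for a numerical callout. */
static char callout_delimiter(const demo_callout_block *cb)
{
    assert(cb != NULL);
    return (cb->callout_string == NULL) ? 0 : (char)cb->callout_string[-1];
}

/* Write a one-line description to out; returns what snprintf() returns. */
static int describe_callout(const demo_callout_block *cb, char *out,
                            size_t outsize)
{
    assert(cb != NULL && out != NULL && outsize > 0);
    if (cb->callout_string == NULL)
        return snprintf(out, outsize, "number %u",
                        (unsigned)cb->callout_number);
    /* The length field is used rather than relying on the trailing zero. */
    return snprintf(out, outsize, "string [%.*s] delimiter %c offset %lu",
                    (int)cb->callout_string_length,
                    (const char *)cb->callout_string,
                    callout_delimiter(cb),
                    (unsigned long)cb->callout_string_offset);
}
```

       Testing callout_string against NULL first is what makes the access to
       callout_string[-1] safe.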
4897*22dc650dSSadaf Ebrahimi
4898*22dc650dSSadaf Ebrahimi   Fields for all callouts
4899*22dc650dSSadaf Ebrahimi
4900*22dc650dSSadaf Ebrahimi       The  remaining  fields in the callout block are the same for both kinds
4901*22dc650dSSadaf Ebrahimi       of callout.
4902*22dc650dSSadaf Ebrahimi
4903*22dc650dSSadaf Ebrahimi       The offset_vector field is a pointer to a vector of  capturing  offsets
4904*22dc650dSSadaf Ebrahimi       (the "ovector"). You may read the elements in this vector, but you must
4905*22dc650dSSadaf Ebrahimi       not change any of them.
4906*22dc650dSSadaf Ebrahimi
4907*22dc650dSSadaf Ebrahimi       For  calls  to pcre2_match(), the offset_vector field is not (since re-
4908*22dc650dSSadaf Ebrahimi       lease 10.30) a pointer to the actual ovector that  was  passed  to  the
4909*22dc650dSSadaf Ebrahimi       matching  function in the match data block. Instead it points to an in-
4910*22dc650dSSadaf Ebrahimi       ternal ovector of a size large enough to  hold  all  possible  captured
4911*22dc650dSSadaf Ebrahimi       substrings in the pattern. Note that whenever a recursion or subroutine
4912*22dc650dSSadaf Ebrahimi       call  within  a pattern completes, the capturing state is reset to what
4913*22dc650dSSadaf Ebrahimi       it was before.
4914*22dc650dSSadaf Ebrahimi
4915*22dc650dSSadaf Ebrahimi       The capture_last field contains the number of the  most  recently  cap-
4916*22dc650dSSadaf Ebrahimi       tured  substring,  and the capture_top field contains one more than the
4917*22dc650dSSadaf Ebrahimi       number of the highest numbered captured substring so far.  If  no  sub-
4918*22dc650dSSadaf Ebrahimi       strings  have yet been captured, the value of capture_last is 0 and the
4919*22dc650dSSadaf Ebrahimi       value of capture_top is 1. The values of these  fields  do  not  always
4920*22dc650dSSadaf Ebrahimi       differ   by   one;  for  example,  when  the  callout  in  the  pattern
4921*22dc650dSSadaf Ebrahimi       ((a)(b))(?C2) is taken, capture_last is 1 but capture_top is 4.
4922*22dc650dSSadaf Ebrahimi
4923*22dc650dSSadaf Ebrahimi       The contents of ovector[2] to  ovector[<capture_top>*2-1]  can  be  in-
4924*22dc650dSSadaf Ebrahimi       spected  in  order to extract substrings that have been matched so far,
4925*22dc650dSSadaf Ebrahimi       in the same way as extracting substrings after a match  has  completed.
4926*22dc650dSSadaf Ebrahimi       The  values in ovector[0] and ovector[1] are always PCRE2_UNSET because
4927*22dc650dSSadaf Ebrahimi       the match is by definition not complete. Substrings that have not  been
4928*22dc650dSSadaf Ebrahimi       captured  but whose numbers are less than capture_top also have both of
4929*22dc650dSSadaf Ebrahimi       their ovector slots set to PCRE2_UNSET.
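
       As a minimal sketch of the inspection described above, the helper below
       walks the pairs from ovector[2] up to the limit implied by capture_top
       and skips groups that are unset. The names sketch_dump_captures and
       SKETCH_UNSET are hypothetical: SKETCH_UNSET stands in for PCRE2_UNSET
       so that the fragment compiles without pcre2.h, and the subject is
       assumed to be an 8-bit string. A real callout function would pass the
       subject, offset_vector, and capture_top fields of its callout block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for PCRE2_UNSET (assumption: all bits set in a size_t).
   Real code should use the macro from pcre2.h instead. */
#define SKETCH_UNSET (~(size_t)0)

/* Hypothetical helper: print every group captured so far and return
   how many groups were set. */
static unsigned sketch_dump_captures(const char *subject,
                                     const size_t *ovector,
                                     uint32_t capture_top)
{
    unsigned set_count = 0;

    assert(subject != NULL && ovector != NULL && capture_top >= 1);

    /* Pair 0 is skipped: it is always unset while a match is in progress. */
    for (uint32_t n = 1; n < capture_top; n++) {
        size_t start = ovector[2 * n];
        size_t end = ovector[2 * n + 1];

        if (start == SKETCH_UNSET || end == SKETCH_UNSET) {
            printf("group %u: unset\n", (unsigned)n);
            continue;
        }
        printf("group %u: \"%.*s\"\n", (unsigned)n,
               (int)(end - start), subject + start);
        set_count++;
    }
    return set_count;
}
```

       For the pattern ((a)(b))(?C2) matched against "ab", the callout would
       be expected to see the pairs (0,2), (0,1), and (1,2) with capture_top
       equal to 4, so the helper would report three captured groups.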
4930*22dc650dSSadaf Ebrahimi
4931*22dc650dSSadaf Ebrahimi       For DFA matching, the offset_vector field points to  the  ovector  that
4932*22dc650dSSadaf Ebrahimi       was  passed  to the matching function in the match data block for call-
4933*22dc650dSSadaf Ebrahimi       outs at the top level, but to an internal ovector during the processing
4934*22dc650dSSadaf Ebrahimi       of pattern recursions, lookarounds, and atomic groups.  However,  these
4935*22dc650dSSadaf Ebrahimi       ovectors  hold no useful information because pcre2_dfa_match() does not
4936*22dc650dSSadaf Ebrahimi       support substring capturing. The value of capture_top is always  1  and
4937*22dc650dSSadaf Ebrahimi       the value of capture_last is always 0 for DFA matching.
4938*22dc650dSSadaf Ebrahimi
4939*22dc650dSSadaf Ebrahimi       The subject and subject_length fields contain copies of the values that
4940*22dc650dSSadaf Ebrahimi       were passed to the matching function.
4941*22dc650dSSadaf Ebrahimi
4942*22dc650dSSadaf Ebrahimi       The  start_match  field normally contains the offset within the subject
4943*22dc650dSSadaf Ebrahimi       at which the current match attempt started. However, if the escape  se-
4944*22dc650dSSadaf Ebrahimi       quence  \K  has  been encountered, this value is changed to reflect the
4945*22dc650dSSadaf Ebrahimi       modified starting point. If the pattern is not  anchored,  the  callout
4946*22dc650dSSadaf Ebrahimi       function may be called several times from the same point in the pattern
4947*22dc650dSSadaf Ebrahimi       for different starting points in the subject.
4948*22dc650dSSadaf Ebrahimi
4949*22dc650dSSadaf Ebrahimi       The  current_position  field  contains the offset within the subject of
4950*22dc650dSSadaf Ebrahimi       the current match pointer.
4951*22dc650dSSadaf Ebrahimi
4952*22dc650dSSadaf Ebrahimi       The pattern_position field contains the offset in the pattern string to
4953*22dc650dSSadaf Ebrahimi       the next item to be matched.
4954*22dc650dSSadaf Ebrahimi
4955*22dc650dSSadaf Ebrahimi       The next_item_length field contains the length of the next item  to  be
4956*22dc650dSSadaf Ebrahimi       processed  in the pattern string. When the callout is at the end of the
4957*22dc650dSSadaf Ebrahimi       pattern, the length is zero.  When  the  callout  precedes  an  opening
4958*22dc650dSSadaf Ebrahimi       parenthesis, the length includes meta characters that follow the paren-
4959*22dc650dSSadaf Ebrahimi       thesis.  For  example,  in a callout before an assertion such as (?=ab)
4960*22dc650dSSadaf Ebrahimi       the length is 3. For an alternation bar or a closing  parenthesis,  the
4961*22dc650dSSadaf Ebrahimi       length  is  one,  unless a closing parenthesis is followed by a quanti-
4962*22dc650dSSadaf Ebrahimi       fier, in which case its length is included. (This  changed  in  release
4963*22dc650dSSadaf Ebrahimi       10.23.  In  earlier  releases, before an opening parenthesis the length
4964*22dc650dSSadaf Ebrahimi       was that of the entire group, and before an alternation bar or a  clos-
4965*22dc650dSSadaf Ebrahimi       ing parenthesis the length was zero.)
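
       The following sketch shows one way an application might use
       pattern_position and next_item_length to recover the text of the next
       pattern item, much as pcre2test does when it displays callout
       information. The helper name sketch_next_item is hypothetical, and the
       pattern is assumed to be an 8-bit, zero-terminated string whose offsets
       are counted in code units.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the next pattern item into buf as a
   zero-terminated string. Returns 0 on success, or -1 if the offsets lie
   outside the pattern or the item does not fit in buf. */
static int sketch_next_item(const char *pattern, size_t pattern_position,
                            size_t next_item_length,
                            char *buf, size_t buf_size)
{
    size_t pattern_length;

    assert(pattern != NULL && buf != NULL);
    pattern_length = strlen(pattern);

    if (pattern_position > pattern_length ||
        next_item_length > pattern_length - pattern_position ||
        next_item_length >= buf_size)
        return -1;

    memcpy(buf, pattern + pattern_position, next_item_length);
    buf[next_item_length] = '\0';
    return 0;
}
```

       With the pattern a(?C1)(?=ab)c, a callout whose pattern_position is 6
       and whose next_item_length is 3 yields the string "(?=". A callout at
       the end of the pattern yields an empty string.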
4966*22dc650dSSadaf Ebrahimi
4967*22dc650dSSadaf Ebrahimi       The  pattern_position  and next_item_length fields are intended to help
4968*22dc650dSSadaf Ebrahimi       in distinguishing between different automatic callouts, which all  have
4969*22dc650dSSadaf Ebrahimi       the  same  callout  number. However, they are set for all callouts, and
4970*22dc650dSSadaf Ebrahimi       are used by pcre2test to show the next item to be matched when display-
4971*22dc650dSSadaf Ebrahimi       ing callout information.
4972*22dc650dSSadaf Ebrahimi
4973*22dc650dSSadaf Ebrahimi       In callouts from pcre2_match() the mark field contains a pointer to the
4974*22dc650dSSadaf Ebrahimi       zero-terminated name of the most recently passed (*MARK), (*PRUNE),  or
4975*22dc650dSSadaf Ebrahimi       (*THEN)  item  in the match, or NULL if no such items have been passed.
4976*22dc650dSSadaf Ebrahimi       Instances of (*PRUNE) or (*THEN) without a name  do  not  obliterate  a
4977*22dc650dSSadaf Ebrahimi       previous (*MARK). In callouts from the DFA matching function this field
4978*22dc650dSSadaf Ebrahimi       always contains NULL.
4979*22dc650dSSadaf Ebrahimi
4980*22dc650dSSadaf Ebrahimi       The   callout_flags   field   is   always   zero   in   callouts   from
4981*22dc650dSSadaf Ebrahimi       pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
4982*22dc650dSSadaf Ebrahimi       JIT is used, the following bits may be set:
4983*22dc650dSSadaf Ebrahimi
4984*22dc650dSSadaf Ebrahimi         PCRE2_CALLOUT_STARTMATCH
4985*22dc650dSSadaf Ebrahimi
4986*22dc650dSSadaf Ebrahimi       This is set for the first callout after the start of matching for  each
4987*22dc650dSSadaf Ebrahimi       new starting position in the subject.
4988*22dc650dSSadaf Ebrahimi
4989*22dc650dSSadaf Ebrahimi         PCRE2_CALLOUT_BACKTRACK
4990*22dc650dSSadaf Ebrahimi
4991*22dc650dSSadaf Ebrahimi       This  is  set if there has been a matching backtrack since the previous
4992*22dc650dSSadaf Ebrahimi       callout, or since the start of matching if this is  the  first  callout
4993*22dc650dSSadaf Ebrahimi       from a pcre2_match() run.
4994*22dc650dSSadaf Ebrahimi
4995*22dc650dSSadaf Ebrahimi       Both  bits  are  set when a backtrack has caused a "bumpalong" to a new
4996*22dc650dSSadaf Ebrahimi       starting position in the subject. Output from pcre2test does not  indi-
4997*22dc650dSSadaf Ebrahimi       cate  the  presence  of these bits unless the callout_extra modifier is
4998*22dc650dSSadaf Ebrahimi       set.
4999*22dc650dSSadaf Ebrahimi
5000*22dc650dSSadaf Ebrahimi       The information in the callout_flags field is provided so that applica-
5001*22dc650dSSadaf Ebrahimi       tions can track and tell their users how matching with backtracking  is
5002*22dc650dSSadaf Ebrahimi       done.  This  can be useful when trying to optimize patterns, or just to
5003*22dc650dSSadaf Ebrahimi       understand how PCRE2 works. There is no  support  in  pcre2_dfa_match()
       because there is no backtracking in DFA matching, and there is no
       support in JIT because JIT is all about maximizing matching performance.
5006*22dc650dSSadaf Ebrahimi       In both these cases the callout_flags field is always zero.
5007*22dc650dSSadaf Ebrahimi
5008*22dc650dSSadaf Ebrahimi
5009*22dc650dSSadaf EbrahimiRETURN VALUES FROM CALLOUTS
5010*22dc650dSSadaf Ebrahimi
5011*22dc650dSSadaf Ebrahimi       The external callout function returns an integer to PCRE2. If the value
5012*22dc650dSSadaf Ebrahimi       is zero, matching proceeds as normal. If  the  value  is  greater  than
5013*22dc650dSSadaf Ebrahimi       zero,  matching  fails  at  the current point, but the testing of other
5014*22dc650dSSadaf Ebrahimi       matching possibilities goes ahead, just as if a lookahead assertion had
5015*22dc650dSSadaf Ebrahimi       failed. If the value is less than zero, the match is abandoned, and the
5016*22dc650dSSadaf Ebrahimi       matching function returns the negative value.
5017*22dc650dSSadaf Ebrahimi
5018*22dc650dSSadaf Ebrahimi       Negative values should normally be chosen from  the  set  of  PCRE2_ER-
5019*22dc650dSSadaf Ebrahimi       ROR_xxx  values.  In  particular, PCRE2_ERROR_NOMATCH forces a standard
5020*22dc650dSSadaf Ebrahimi       "no match" failure. The error number  PCRE2_ERROR_CALLOUT  is  reserved
5021*22dc650dSSadaf Ebrahimi       for use by callout functions; it will never be used by PCRE2 itself.
5022*22dc650dSSadaf Ebrahimi
5023*22dc650dSSadaf Ebrahimi
5024*22dc650dSSadaf EbrahimiCALLOUT ENUMERATION
5025*22dc650dSSadaf Ebrahimi
5026*22dc650dSSadaf Ebrahimi       int pcre2_callout_enumerate(const pcre2_code *code,
5027*22dc650dSSadaf Ebrahimi         int (*callback)(pcre2_callout_enumerate_block *, void *),
5028*22dc650dSSadaf Ebrahimi         void *user_data);
5029*22dc650dSSadaf Ebrahimi
5030*22dc650dSSadaf Ebrahimi       A script language that supports the use of string arguments in callouts
5031*22dc650dSSadaf Ebrahimi       might  like  to  scan  all the callouts in a pattern before running the
5032*22dc650dSSadaf Ebrahimi       match. This can be done by calling pcre2_callout_enumerate(). The first
5033*22dc650dSSadaf Ebrahimi       argument is a pointer to a compiled pattern, the  second  points  to  a
5034*22dc650dSSadaf Ebrahimi       callback  function,  and the third is arbitrary user data. The callback
5035*22dc650dSSadaf Ebrahimi       function is called for every callout in the pattern  in  the  order  in
5036*22dc650dSSadaf Ebrahimi       which they appear. Its first argument is a pointer to a callout enumer-
5037*22dc650dSSadaf Ebrahimi       ation  block,  and  its second argument is the user_data value that was
5038*22dc650dSSadaf Ebrahimi       passed to pcre2_callout_enumerate(). The data block contains  the  fol-
5039*22dc650dSSadaf Ebrahimi       lowing fields:
5040*22dc650dSSadaf Ebrahimi
5041*22dc650dSSadaf Ebrahimi         version                Block version number
5042*22dc650dSSadaf Ebrahimi         pattern_position       Offset to next item in pattern
5043*22dc650dSSadaf Ebrahimi         next_item_length       Length of next item in pattern
5044*22dc650dSSadaf Ebrahimi         callout_number         Number for numbered callouts
5045*22dc650dSSadaf Ebrahimi         callout_string_offset  Offset to string within pattern
5046*22dc650dSSadaf Ebrahimi         callout_string_length  Length of callout string
5047*22dc650dSSadaf Ebrahimi         callout_string         Points to callout string or is NULL
5048*22dc650dSSadaf Ebrahimi
5049*22dc650dSSadaf Ebrahimi       The  version  number is currently 0. It will increase if new fields are
5050*22dc650dSSadaf Ebrahimi       ever added to the block. The remaining fields are  the  same  as  their
5051*22dc650dSSadaf Ebrahimi       namesakes  in  the pcre2_callout block that is used for callouts during
5052*22dc650dSSadaf Ebrahimi       matching, as described above.
5053*22dc650dSSadaf Ebrahimi
5054*22dc650dSSadaf Ebrahimi       Note that the value of pattern_position is  unique  for  each  callout.
5055*22dc650dSSadaf Ebrahimi       However,  if  a callout occurs inside a group that is quantified with a
5056*22dc650dSSadaf Ebrahimi       non-zero minimum or a fixed maximum, the group is replicated inside the
5057*22dc650dSSadaf Ebrahimi       compiled pattern. For example, a pattern such as /(a){2}/  is  compiled
5058*22dc650dSSadaf Ebrahimi       as  if it were /(a)(a)/. This means that the callout will be enumerated
5059*22dc650dSSadaf Ebrahimi       more than once, but with the same value for  pattern_position  in  each
5060*22dc650dSSadaf Ebrahimi       case.
5061*22dc650dSSadaf Ebrahimi
5062*22dc650dSSadaf Ebrahimi       The callback function should normally return zero. If it returns a non-
5063*22dc650dSSadaf Ebrahimi       zero value, scanning the pattern stops, and that value is returned from
5064*22dc650dSSadaf Ebrahimi       pcre2_callout_enumerate().
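
       Because replicated groups cause the same callout to be enumerated more
       than once, an application may want to record each pattern_position only
       once. The sketch below shows one possible approach; sketch_callout_set,
       sketch_note_callout, and the fixed table size are hypothetical choices
       made for illustration. Its return value follows the convention
       described above: zero to continue scanning, non-zero to stop.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_CALLOUTS 64  /* arbitrary limit for this sketch */

typedef struct {
    size_t positions[SKETCH_MAX_CALLOUTS];  /* distinct pattern positions */
    size_t count;
} sketch_callout_set;

/* Hypothetical helper: record a callout position unless it has been seen
   before. Returns 0 normally, or 1 when the table is full. */
static int sketch_note_callout(sketch_callout_set *set,
                               size_t pattern_position)
{
    assert(set != NULL);

    for (size_t i = 0; i < set->count; i++)
        if (set->positions[i] == pattern_position)
            return 0;  /* replicated group: already recorded */

    if (set->count == SKETCH_MAX_CALLOUTS)
        return 1;      /* a non-zero return would stop the enumeration */

    set->positions[set->count++] = pattern_position;
    return 0;
}
```

       A real callback could receive a pointer to the set as its user_data
       argument and pass the pattern_position field of the enumeration block
       to this helper, returning the helper's result.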
5065*22dc650dSSadaf Ebrahimi
5066*22dc650dSSadaf Ebrahimi
5067*22dc650dSSadaf EbrahimiAUTHOR
5068*22dc650dSSadaf Ebrahimi
5069*22dc650dSSadaf Ebrahimi       Philip Hazel
5070*22dc650dSSadaf Ebrahimi       Retired from University Computing Service
5071*22dc650dSSadaf Ebrahimi       Cambridge, England.
5072*22dc650dSSadaf Ebrahimi
5073*22dc650dSSadaf Ebrahimi
5074*22dc650dSSadaf EbrahimiREVISION
5075*22dc650dSSadaf Ebrahimi
5076*22dc650dSSadaf Ebrahimi       Last updated: 19 January 2024
5077*22dc650dSSadaf Ebrahimi       Copyright (c) 1997-2024 University of Cambridge.
5078*22dc650dSSadaf Ebrahimi
5079*22dc650dSSadaf Ebrahimi
5080*22dc650dSSadaf EbrahimiPCRE2 10.43                     19 January 2024                PCRE2CALLOUT(3)
5081*22dc650dSSadaf Ebrahimi------------------------------------------------------------------------------
5082*22dc650dSSadaf Ebrahimi
5083*22dc650dSSadaf Ebrahimi
5084*22dc650dSSadaf Ebrahimi
5085*22dc650dSSadaf EbrahimiPCRE2COMPAT(3)             Library Functions Manual             PCRE2COMPAT(3)
5086*22dc650dSSadaf Ebrahimi
5087*22dc650dSSadaf Ebrahimi
5088*22dc650dSSadaf EbrahimiNAME
5089*22dc650dSSadaf Ebrahimi       PCRE2 - Perl-compatible regular expressions (revised API)
5090*22dc650dSSadaf Ebrahimi
5091*22dc650dSSadaf Ebrahimi
5092*22dc650dSSadaf EbrahimiDIFFERENCES BETWEEN PCRE2 AND PERL
5093*22dc650dSSadaf Ebrahimi
5094*22dc650dSSadaf Ebrahimi       This  document describes some of the known differences in the ways that
5095*22dc650dSSadaf Ebrahimi       PCRE2 and Perl handle regular expressions.  The  differences  described
5096*22dc650dSSadaf Ebrahimi       here  are  with  respect  to  Perl version 5.38.0, but as both Perl and
5097*22dc650dSSadaf Ebrahimi       PCRE2 are continually changing, the information may at times be out  of
5098*22dc650dSSadaf Ebrahimi       date.
5099*22dc650dSSadaf Ebrahimi
5100*22dc650dSSadaf Ebrahimi       1.  When  PCRE2_DOTALL  (equivalent to Perl's /s qualifier) is not set,
5101*22dc650dSSadaf Ebrahimi       the behaviour of the '.' metacharacter differs from Perl. In PCRE2, '.'
5102*22dc650dSSadaf Ebrahimi       matches the next character unless it is the  start  of  a  newline  se-
5103*22dc650dSSadaf Ebrahimi       quence.  This  means  that, if the newline setting is CR, CRLF, or NUL,
5104*22dc650dSSadaf Ebrahimi       '.' will match the code point LF (0x0A) in ASCII/Unicode  environments,
5105*22dc650dSSadaf Ebrahimi       and  NL  (either  0x15 or 0x25) when using EBCDIC. In Perl, '.' appears
5106*22dc650dSSadaf Ebrahimi       never to match LF, even when 0x0A is not a newline indicator.
5107*22dc650dSSadaf Ebrahimi
5108*22dc650dSSadaf Ebrahimi       2. PCRE2 has only a subset of Perl's Unicode support. Details  of  what
5109*22dc650dSSadaf Ebrahimi       it does have are given in the pcre2unicode page.
5110*22dc650dSSadaf Ebrahimi
5111*22dc650dSSadaf Ebrahimi       3.  Like  Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
5112*22dc650dSSadaf Ebrahimi       tions, but they do not mean what you might think. For example, (?!a){3}
5113*22dc650dSSadaf Ebrahimi       does not assert that the next three characters are not "a". It just as-
5114*22dc650dSSadaf Ebrahimi       serts that the next character is not "a"  three  times  (in  principle;
5115*22dc650dSSadaf Ebrahimi       PCRE2  optimizes this to run the assertion just once). Perl allows some
5116*22dc650dSSadaf Ebrahimi       repeat quantifiers on other assertions, for example, \b* , but these do
5117*22dc650dSSadaf Ebrahimi       not seem to have any use. PCRE2 does not allow any kind  of  quantifier
5118*22dc650dSSadaf Ebrahimi       on non-lookaround assertions.
5119*22dc650dSSadaf Ebrahimi
5120*22dc650dSSadaf Ebrahimi       4.  If a braced quantifier such as {1,2} appears where there is nothing
5121*22dc650dSSadaf Ebrahimi       to repeat (for example, at the start of a branch), PCRE2 raises an  er-
5122*22dc650dSSadaf Ebrahimi       ror whereas Perl treats the quantifier characters as literal.
5123*22dc650dSSadaf Ebrahimi
5124*22dc650dSSadaf Ebrahimi       5.  Capture groups that occur inside negative lookaround assertions are
5125*22dc650dSSadaf Ebrahimi       counted, but their entries in the offsets vector are set  only  when  a
5126*22dc650dSSadaf Ebrahimi       negative  assertion is a condition that has a matching branch (that is,
5127*22dc650dSSadaf Ebrahimi       the condition is false).  Perl may set such  capture  groups  in  other
5128*22dc650dSSadaf Ebrahimi       circumstances.
5129*22dc650dSSadaf Ebrahimi
5130*22dc650dSSadaf Ebrahimi       6.  The  following Perl escape sequences are not supported: \F, \l, \L,
5131*22dc650dSSadaf Ebrahimi       \u, \U, and \N when followed by a character name. \N on its own, match-
5132*22dc650dSSadaf Ebrahimi       ing a non-newline character, and \N{U+dd..}, matching  a  Unicode  code
5133*22dc650dSSadaf Ebrahimi       point,  are  supported.  The  escapes that modify the case of following
5134*22dc650dSSadaf Ebrahimi       letters are implemented by Perl's general string-handling and  are  not
5135*22dc650dSSadaf Ebrahimi       part of its pattern matching engine. If any of these are encountered by
5136*22dc650dSSadaf Ebrahimi       PCRE2,  an  error  is  generated  by default. However, if either of the
5137*22dc650dSSadaf Ebrahimi       PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U  and  \u  are
5138*22dc650dSSadaf Ebrahimi       interpreted as ECMAScript interprets them.
5139*22dc650dSSadaf Ebrahimi
5140*22dc650dSSadaf Ebrahimi       7. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
5141*22dc650dSSadaf Ebrahimi       is built with Unicode support (the default). The properties that can be
5142*22dc650dSSadaf Ebrahimi       tested  with  \p  and \P are limited to the general category properties
5143*22dc650dSSadaf Ebrahimi       such as Lu and Nd, the derived properties  Any  and  LC  (synonym  L&),
5144*22dc650dSSadaf Ebrahimi       script  names such as Greek or Han, Bidi_Class, Bidi_Control, and a few
5145*22dc650dSSadaf Ebrahimi       binary properties. Both PCRE2 and Perl support the Cs (surrogate) prop-
5146*22dc650dSSadaf Ebrahimi       erty, but in PCRE2 its use is limited. See the pcre2pattern  documenta-
5147*22dc650dSSadaf Ebrahimi       tion  for  details. The long synonyms for property names that Perl sup-
5148*22dc650dSSadaf Ebrahimi       ports (such as \p{Letter}) are not supported by PCRE2, nor is  it  per-
5149*22dc650dSSadaf Ebrahimi       mitted to prefix any of these properties with "Is".
5150*22dc650dSSadaf Ebrahimi
5151*22dc650dSSadaf Ebrahimi       8. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
5152*22dc650dSSadaf Ebrahimi       in between are treated as literals. However, this is slightly different
5153*22dc650dSSadaf Ebrahimi       from  Perl  in  that  $  and  @ are also handled as literals inside the
5154*22dc650dSSadaf Ebrahimi       quotes. In Perl, they cause variable interpolation (PCRE2 does not have
5155*22dc650dSSadaf Ebrahimi       variables). Also, Perl does "double-quotish backslash interpolation" on
5156*22dc650dSSadaf Ebrahimi       any backslashes between \Q and \E which, its documentation  says,  "may
5157*22dc650dSSadaf Ebrahimi       lead  to confusing results". PCRE2 treats a backslash between \Q and \E
5158*22dc650dSSadaf Ebrahimi       just like any other character. Note the following examples:
5159*22dc650dSSadaf Ebrahimi
5160*22dc650dSSadaf Ebrahimi           Pattern            PCRE2 matches     Perl matches
5161*22dc650dSSadaf Ebrahimi
5162*22dc650dSSadaf Ebrahimi           \Qabc$xyz\E        abc$xyz           abc followed by the
5163*22dc650dSSadaf Ebrahimi                                                  contents of $xyz
5164*22dc650dSSadaf Ebrahimi           \Qabc\$xyz\E       abc\$xyz          abc\$xyz
5165*22dc650dSSadaf Ebrahimi           \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
5166*22dc650dSSadaf Ebrahimi           \QA\B\E            A\B               A\B
5167*22dc650dSSadaf Ebrahimi           \Q\\E              \                 \\E
5168*22dc650dSSadaf Ebrahimi
5169*22dc650dSSadaf Ebrahimi       The \Q...\E sequence is recognized both inside  and  outside  character
5170*22dc650dSSadaf Ebrahimi       classes by both PCRE2 and Perl.
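
       Since PCRE2 treats everything between \Q and \E literally, including $,
       @, and backslashes, an application can quote arbitrary text for use in
       a pattern with a small helper such as the sketch below. The names
       sketch_append and sketch_quote_literal are hypothetical, and the text
       is assumed to be an 8-bit, zero-terminated string. The only sequence
       needing special care is a literal "\E", which is emitted by closing the
       quote, adding an escaped backslash followed by E, and reopening the
       quote. This relies on the PCRE2 behaviour described above and is not
       expected to give the same results in Perl.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append s to out, keeping it zero-terminated. Returns 1 on success and
   0 if there is not enough room. */
static int sketch_append(char *out, size_t out_size, size_t *used,
                         const char *s)
{
    size_t n = strlen(s);
    if (n >= out_size - *used) return 0;
    memcpy(out + *used, s, n + 1);
    *used += n;
    return 1;
}

/* Hypothetical helper: write \Q<literal>\E into out. Returns the length
   written (excluding the terminating zero), or 0 if out is too small. */
static size_t sketch_quote_literal(const char *literal, char *out,
                                   size_t out_size)
{
    size_t used = 0;

    assert(literal != NULL && out != NULL);
    if (out_size == 0) return 0;
    out[0] = '\0';

    if (!sketch_append(out, out_size, &used, "\\Q")) return 0;
    for (const char *p = literal; *p != '\0'; p++) {
        if (p[0] == '\\' && p[1] == 'E') {
            /* Close the quote, emit \\E (backslash, E), reopen it. */
            if (!sketch_append(out, out_size, &used, "\\E\\\\E\\Q"))
                return 0;
            p++;  /* skip the E that was just handled */
        } else {
            char one[2] = { *p, '\0' };
            if (!sketch_append(out, out_size, &used, one)) return 0;
        }
    }
    if (!sketch_append(out, out_size, &used, "\\E")) return 0;
    return used;
}
```

       For example, the text abc$xyz becomes \Qabc$xyz\E, and the text a\Eb
       becomes \Qa\E\\E\Qb\E.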
5171*22dc650dSSadaf Ebrahimi
5172*22dc650dSSadaf Ebrahimi       9.   Fairly  obviously,  PCRE2  does  not  support  the  (?{code})  and
5173*22dc650dSSadaf Ebrahimi       (??{code}) constructions. However, PCRE2 does have a "callout" feature,
5174*22dc650dSSadaf Ebrahimi       which allows an external function to be called during pattern matching.
5175*22dc650dSSadaf Ebrahimi       See the pcre2callout documentation for details.
5176*22dc650dSSadaf Ebrahimi
5177*22dc650dSSadaf Ebrahimi       10. Subroutine calls (whether recursive or not) were treated as  atomic
5178*22dc650dSSadaf Ebrahimi       groups  up to PCRE2 release 10.23, but from release 10.30 this changed,
5179*22dc650dSSadaf Ebrahimi       and backtracking into subroutine calls is now supported, as in Perl.
5180*22dc650dSSadaf Ebrahimi
5181*22dc650dSSadaf Ebrahimi       11. In PCRE2, if any of the backtracking control verbs are  used  in  a
5182*22dc650dSSadaf Ebrahimi       group  that  is  called  as  a subroutine (whether or not recursively),
5183*22dc650dSSadaf Ebrahimi       their effect is confined to that group; it does not extend to the  sur-
5184*22dc650dSSadaf Ebrahimi       rounding  pattern.  This is not always the case in Perl. In particular,
5185*22dc650dSSadaf Ebrahimi       if (*THEN) is present in a group that is called as  a  subroutine,  its
5186*22dc650dSSadaf Ebrahimi       action is limited to that group, even if the group does not contain any
5187*22dc650dSSadaf Ebrahimi       |  characters.  Note  that such groups are processed as anchored at the
5188*22dc650dSSadaf Ebrahimi       point where they are tested.
5189*22dc650dSSadaf Ebrahimi
5190*22dc650dSSadaf Ebrahimi       12. If a pattern contains more than one backtracking control verb,  the
5191*22dc650dSSadaf Ebrahimi       first  one  that  is backtracked onto acts. For example, in the pattern
5192*22dc650dSSadaf Ebrahimi       A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but  a  failure
5193*22dc650dSSadaf Ebrahimi       in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
5194*22dc650dSSadaf Ebrahimi       it is the same as PCRE2, but there are cases where it differs.
5195*22dc650dSSadaf Ebrahimi
5196*22dc650dSSadaf Ebrahimi       13.  There are some differences that are concerned with the settings of
5197*22dc650dSSadaf Ebrahimi       captured strings when part of  a  pattern  is  repeated.  For  example,
5198*22dc650dSSadaf Ebrahimi       matching  "aba"  against the pattern /^(a(b)?)+$/ in Perl leaves $2 un-
5199*22dc650dSSadaf Ebrahimi       set, but in PCRE2 it is set to "b".
5200*22dc650dSSadaf Ebrahimi
5201*22dc650dSSadaf Ebrahimi       14. PCRE2's handling of duplicate capture group numbers  and  names  is
       not as general as Perl's. This is a consequence of the fact that PCRE2
5203*22dc650dSSadaf Ebrahimi       works internally just with numbers, using an external table  to  trans-
5204*22dc650dSSadaf Ebrahimi       late  between  numbers  and  names.  In  particular,  a pattern such as
5205*22dc650dSSadaf Ebrahimi       (?|(?<a>A)|(?<b>B)), where the two capture groups have the same  number
5206*22dc650dSSadaf Ebrahimi       but  different  names, is not supported, and causes an error at compile
5207*22dc650dSSadaf Ebrahimi       time. If it were allowed, it would not be possible to distinguish which
5208*22dc650dSSadaf Ebrahimi       group matched, because both names map to capture  group  number  1.  To
5209*22dc650dSSadaf Ebrahimi       avoid this confusing situation, an error is given at compile time.
5210*22dc650dSSadaf Ebrahimi
       15. Perl used to recognize comments in some places that PCRE2 does not,
       for example, between the ( and ? at the start of a group. When the /x
       modifier was set, Perl also allowed white space between ( and ?, but
       recent Perl releases give an error (for a while this was merely
       deprecated). There may still be some cases where Perl behaves
       differently.
5216*22dc650dSSadaf Ebrahimi
5217*22dc650dSSadaf Ebrahimi       16. Perl, when in warning mode, gives warnings  for  character  classes
5218*22dc650dSSadaf Ebrahimi       such  as  [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
5219*22dc650dSSadaf Ebrahimi       als. PCRE2 has no warning features, so it gives an error in these cases
5220*22dc650dSSadaf Ebrahimi       because they are almost certainly user mistakes.
5221*22dc650dSSadaf Ebrahimi
5222*22dc650dSSadaf Ebrahimi       17. In PCRE2, the upper/lower case character properties Lu and  Ll  are
5223*22dc650dSSadaf Ebrahimi       not  affected when case-independent matching is specified. For example,
5224*22dc650dSSadaf Ebrahimi       \p{Lu} always matches an upper case letter. I think Perl has changed in
5225*22dc650dSSadaf Ebrahimi       this respect; in the release at the time of writing (5.38), \p{Lu}  and
5226*22dc650dSSadaf Ebrahimi       \p{Ll} match all letters, regardless of case, when case independence is
5227*22dc650dSSadaf Ebrahimi       specified.
5228*22dc650dSSadaf Ebrahimi
5229*22dc650dSSadaf Ebrahimi       18. From release 5.32.0, Perl locks out the use of \K in lookaround as-
5230*22dc650dSSadaf Ebrahimi       sertions.  From  release 10.38 PCRE2 does the same by default. However,
5231*22dc650dSSadaf Ebrahimi       there is an option for re-enabling the previous  behaviour.  When  this
5232*22dc650dSSadaf Ebrahimi       option  is  set,  \K is acted on when it occurs in positive assertions,
5233*22dc650dSSadaf Ebrahimi       but is ignored in negative assertions.
5234*22dc650dSSadaf Ebrahimi
5235*22dc650dSSadaf Ebrahimi       19. PCRE2 provides some extensions to the Perl regular  expression  fa-
5236*22dc650dSSadaf Ebrahimi       cilities.   Perl  5.10  included  new features that were not in earlier
5237*22dc650dSSadaf Ebrahimi       versions of Perl, some of which (such as  named  parentheses)  were  in
5238*22dc650dSSadaf Ebrahimi       PCRE2 for some time before. This list is with respect to Perl 5.38:
5239*22dc650dSSadaf Ebrahimi
5240*22dc650dSSadaf Ebrahimi       (a)  If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the
5241*22dc650dSSadaf Ebrahimi       $ meta-character matches only at the very end of the string.
5242*22dc650dSSadaf Ebrahimi
5243*22dc650dSSadaf Ebrahimi       (b) A backslash followed  by  a  letter  with  no  special  meaning  is
5244*22dc650dSSadaf Ebrahimi       faulted. (Perl can be made to issue a warning.)
5245*22dc650dSSadaf Ebrahimi
5246*22dc650dSSadaf Ebrahimi       (c)  If PCRE2_UNGREEDY is set, the greediness of the repetition quanti-
5247*22dc650dSSadaf Ebrahimi       fiers is inverted, that is, by default they are not greedy, but if fol-
5248*22dc650dSSadaf Ebrahimi       lowed by a question mark they are.
5249*22dc650dSSadaf Ebrahimi
5250*22dc650dSSadaf Ebrahimi       (d) PCRE2_ANCHORED can be used at matching time to force a  pattern  to
5251*22dc650dSSadaf Ebrahimi       be tried only at the first matching position in the subject string.
5252*22dc650dSSadaf Ebrahimi
5253*22dc650dSSadaf Ebrahimi       (e)     The     PCRE2_NOTBOL,    PCRE2_NOTEOL,    PCRE2_NOTEMPTY    and
5254*22dc650dSSadaf Ebrahimi       PCRE2_NOTEMPTY_ATSTART options have no Perl equivalents.
5255*22dc650dSSadaf Ebrahimi
5256*22dc650dSSadaf Ebrahimi       (f) The \R escape sequence can be restricted to match only CR,  LF,  or
5257*22dc650dSSadaf Ebrahimi       CRLF by the PCRE2_BSR_ANYCRLF option.
5258*22dc650dSSadaf Ebrahimi
5259*22dc650dSSadaf Ebrahimi       (g)  The  callout  facility is PCRE2-specific. Perl supports codeblocks
5260*22dc650dSSadaf Ebrahimi       and variable interpolation, but not general hooks on every match.
5261*22dc650dSSadaf Ebrahimi
5262*22dc650dSSadaf Ebrahimi       (h) The partial matching facility is PCRE2-specific.
5263*22dc650dSSadaf Ebrahimi
       (i) The alternative matching function (pcre2_dfa_match()) matches in a
       different way and is not Perl-compatible.
5266*22dc650dSSadaf Ebrahimi
5267*22dc650dSSadaf Ebrahimi       (j)  PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT)
5268*22dc650dSSadaf Ebrahimi       at the start of a pattern. These set overall  options  that  cannot  be
5269*22dc650dSSadaf Ebrahimi       changed within the pattern.
5270*22dc650dSSadaf Ebrahimi
5271*22dc650dSSadaf Ebrahimi       (k)  PCRE2  supports non-atomic positive lookaround assertions. This is
5272*22dc650dSSadaf Ebrahimi       an extension to the lookaround facilities. The default, Perl-compatible
5273*22dc650dSSadaf Ebrahimi       lookarounds are atomic.
5274*22dc650dSSadaf Ebrahimi
5275*22dc650dSSadaf Ebrahimi       (l) There are three syntactical items in patterns that can refer  to  a
5276*22dc650dSSadaf Ebrahimi       capturing  group  by  number: back references such as \g{2}, subroutine
5277*22dc650dSSadaf Ebrahimi       calls such as (?3), and condition references such as  (?(4)...).  PCRE2
5278*22dc650dSSadaf Ebrahimi       supports  relative  group numbers such as +2 and -4 in all three cases.
5279*22dc650dSSadaf Ebrahimi       Perl supports both plus and minus for subroutine calls, but only  minus
5280*22dc650dSSadaf Ebrahimi       for back references, and no relative numbering at all for conditions.
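
       The arithmetic behind relative group numbers can be sketched as
       follows. The helper sketch_resolve_relative is hypothetical and is not
       part of the PCRE2 API; it assumes that a negative number counts back
       from the most recently opened capture group and that a positive number
       counts forward from the next group to be opened.

```c
#include <assert.h>

/* Hypothetical helper: convert a relative group number to an absolute one.
   groups_opened is the number of capture groups whose opening parenthesis
   precedes the reference; relative is the signed number as written (for
   example +2 or -4). Returns the absolute group number, or 0 if a negative
   reference points before group 1. */
static unsigned sketch_resolve_relative(unsigned groups_opened, int relative)
{
    unsigned back;

    assert(relative != 0);  /* a relative number is never zero */

    if (relative > 0)
        return groups_opened + (unsigned)relative;

    back = 0u - (unsigned)relative;  /* magnitude of the negative offset */
    if (back > groups_opened)
        return 0;
    return groups_opened - back + 1;
}
```

       After three groups have been opened, -1 resolves to group 3 and +2 to
       group 5. Whether a forward reference names a group that actually
       exists can only be checked once the whole pattern has been scanned.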
5281*22dc650dSSadaf Ebrahimi
       20. Perl has different limits than PCRE2. See the pcre2limits
       documentation for details. In release 5.10, Perl switched from
       recursion to iteration, keeping the intermediate matches on the heap;
       this is about 10% slower but cannot run into a stack-overflow limit.
       PCRE2 made a similar change at release 10.30, and also has many
       build-time and run-time customizable limits.
5288*22dc650dSSadaf Ebrahimi
       21. Unlike Perl, PCRE2 does not have character set modifiers and, in
       particular, has no way to choose the character set by context, as
       Perl's "/d" does. A regular expression using PCRE2_UTF and PCRE2_UCP
       will use rules similar to Perl's "/u"; something closer to "/a" could
       be selected by adding PCRE2_EXTRA_ASCII* options on top.
5294*22dc650dSSadaf Ebrahimi
5295*22dc650dSSadaf Ebrahimi       22. Some recursive patterns that Perl diagnoses as infinite  recursions
5296*22dc650dSSadaf Ebrahimi       can be handled by PCRE2, either by the interpreter or the JIT. An exam-
5297*22dc650dSSadaf Ebrahimi       ple is /(?:|(?0)abcd)(?(R)|\z)/, which matches a sequence of any number
5298*22dc650dSSadaf Ebrahimi       of repeated "abcd" substrings at the end of the subject.
5299*22dc650dSSadaf Ebrahimi
5300*22dc650dSSadaf Ebrahimi
5301*22dc650dSSadaf EbrahimiAUTHOR
5302*22dc650dSSadaf Ebrahimi
5303*22dc650dSSadaf Ebrahimi       Philip Hazel
5304*22dc650dSSadaf Ebrahimi       Retired from University Computing Service
5305*22dc650dSSadaf Ebrahimi       Cambridge, England.
5306*22dc650dSSadaf Ebrahimi
5307*22dc650dSSadaf Ebrahimi
5308*22dc650dSSadaf EbrahimiREVISION
5309*22dc650dSSadaf Ebrahimi
5310*22dc650dSSadaf Ebrahimi       Last updated: 30 November 2023
5311*22dc650dSSadaf Ebrahimi       Copyright (c) 1997-2023 University of Cambridge.
5312*22dc650dSSadaf Ebrahimi
5313*22dc650dSSadaf Ebrahimi
5314*22dc650dSSadaf EbrahimiPCRE2 10.43                    30 November 2023                 PCRE2COMPAT(3)
5315*22dc650dSSadaf Ebrahimi------------------------------------------------------------------------------
5316*22dc650dSSadaf Ebrahimi
5317*22dc650dSSadaf Ebrahimi
5318*22dc650dSSadaf Ebrahimi
5319*22dc650dSSadaf EbrahimiPCRE2JIT(3)                Library Functions Manual                PCRE2JIT(3)
5320*22dc650dSSadaf Ebrahimi
5321*22dc650dSSadaf Ebrahimi
5322*22dc650dSSadaf EbrahimiNAME
5323*22dc650dSSadaf Ebrahimi       PCRE2 - Perl-compatible regular expressions (revised API)
5324*22dc650dSSadaf Ebrahimi
5325*22dc650dSSadaf Ebrahimi
5326*22dc650dSSadaf EbrahimiPCRE2 JUST-IN-TIME COMPILER SUPPORT
5327*22dc650dSSadaf Ebrahimi
5328*22dc650dSSadaf Ebrahimi       Just-in-time  compiling  is a heavyweight optimization that can greatly
5329*22dc650dSSadaf Ebrahimi       speed up pattern matching. However, it comes at the cost of extra  pro-
5330*22dc650dSSadaf Ebrahimi       cessing  before  the  match is performed, so it is of most benefit when
5331*22dc650dSSadaf Ebrahimi       the same pattern is going to be matched many times. This does not  nec-
5332*22dc650dSSadaf Ebrahimi       essarily  mean many calls of a matching function; if the pattern is not
5333*22dc650dSSadaf Ebrahimi       anchored, matching attempts may take place many times at various  posi-
5334*22dc650dSSadaf Ebrahimi       tions in the subject, even for a single call. Therefore, if the subject
5335*22dc650dSSadaf Ebrahimi       string  is  very  long,  it  may  still pay to use JIT even for one-off
5336*22dc650dSSadaf Ebrahimi       matches. JIT support is available for all  of  the  8-bit,  16-bit  and
5337*22dc650dSSadaf Ebrahimi       32-bit PCRE2 libraries.
5338*22dc650dSSadaf Ebrahimi
5339*22dc650dSSadaf Ebrahimi       JIT  support  applies  only to the traditional Perl-compatible matching
5340*22dc650dSSadaf Ebrahimi       function.  It does not apply when the DFA matching  function  is  being
5341*22dc650dSSadaf Ebrahimi       used. The code for JIT support was written by Zoltan Herczeg.
5342*22dc650dSSadaf Ebrahimi
5343*22dc650dSSadaf Ebrahimi
5344*22dc650dSSadaf EbrahimiAVAILABILITY OF JIT SUPPORT
5345*22dc650dSSadaf Ebrahimi
5346*22dc650dSSadaf Ebrahimi       JIT  support  is  an  optional feature of PCRE2. The "configure" option
5347*22dc650dSSadaf Ebrahimi       --enable-jit (or equivalent CMake option) must be  set  when  PCRE2  is
5348*22dc650dSSadaf Ebrahimi       built  if  you want to use JIT. The support is limited to the following
5349*22dc650dSSadaf Ebrahimi       hardware platforms:
5350*22dc650dSSadaf Ebrahimi
5351*22dc650dSSadaf Ebrahimi         ARM 32-bit (v7, and Thumb2)
5352*22dc650dSSadaf Ebrahimi         ARM 64-bit
5353*22dc650dSSadaf Ebrahimi         IBM s390x 64 bit
5354*22dc650dSSadaf Ebrahimi         Intel x86 32-bit and 64-bit
5355*22dc650dSSadaf Ebrahimi         LoongArch 64 bit
5356*22dc650dSSadaf Ebrahimi         MIPS 32-bit and 64-bit
5357*22dc650dSSadaf Ebrahimi         Power PC 32-bit and 64-bit
5358*22dc650dSSadaf Ebrahimi         RISC-V 32-bit and 64-bit
5359*22dc650dSSadaf Ebrahimi
5360*22dc650dSSadaf Ebrahimi       If --enable-jit is set on an unsupported platform, compilation fails.
5361*22dc650dSSadaf Ebrahimi
5362*22dc650dSSadaf Ebrahimi       A client program can tell  if  JIT  support  is  available  by  calling
5363*22dc650dSSadaf Ebrahimi       pcre2_config()  with  the PCRE2_CONFIG_JIT option. The result is one if
5364*22dc650dSSadaf Ebrahimi       PCRE2 was built with JIT support, and zero otherwise.  However,  having
5365*22dc650dSSadaf Ebrahimi       the  JIT code available does not guarantee that it will be used for any
5366*22dc650dSSadaf Ebrahimi       particular match. One reason for this is that there are a number of op-
5367*22dc650dSSadaf Ebrahimi       tions and pattern items that are not supported by JIT (see below).  An-
5368*22dc650dSSadaf Ebrahimi       other  reason  is that in some environments JIT is unable to get memory
5369*22dc650dSSadaf Ebrahimi       in which to build its compiled code. The only guarantee from pcre2_con-
5370*22dc650dSSadaf Ebrahimi       fig() is that if it returns zero, JIT will definitely not be used.
5371*22dc650dSSadaf Ebrahimi
5372*22dc650dSSadaf Ebrahimi       A simple program does not need to check availability in  order  to  use
5373*22dc650dSSadaf Ebrahimi       JIT  when  possible. The API is implemented in a way that falls back to
5374*22dc650dSSadaf Ebrahimi       the interpretive code if JIT is not available or cannot be used  for  a
5375*22dc650dSSadaf Ebrahimi       given  match.  For  programs  that  need the best possible performance,
5376*22dc650dSSadaf Ebrahimi       there is a "fast path" API that is JIT-specific.
5377*22dc650dSSadaf Ebrahimi
5378*22dc650dSSadaf Ebrahimi
5379*22dc650dSSadaf EbrahimiSIMPLE USE OF JIT
5380*22dc650dSSadaf Ebrahimi
5381*22dc650dSSadaf Ebrahimi       To make use of the JIT support in the simplest way, all you have to  do
5382*22dc650dSSadaf Ebrahimi       is  to  call pcre2_jit_compile() after successfully compiling a pattern
5383*22dc650dSSadaf Ebrahimi       with pcre2_compile(). This function has two arguments: the first is the
5384*22dc650dSSadaf Ebrahimi       compiled pattern pointer that was returned by pcre2_compile(), and  the
5385*22dc650dSSadaf Ebrahimi       second  is  zero  or  more of the following option bits: PCRE2_JIT_COM-
5386*22dc650dSSadaf Ebrahimi       PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
5387*22dc650dSSadaf Ebrahimi
5388*22dc650dSSadaf Ebrahimi       If JIT support is not available, a  call  to  pcre2_jit_compile()  does
5389*22dc650dSSadaf Ebrahimi       nothing  and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
5390*22dc650dSSadaf Ebrahimi       pattern is passed to the JIT compiler, which turns it into machine code
5391*22dc650dSSadaf Ebrahimi       that executes much faster than the normal interpretive code, but yields
5392*22dc650dSSadaf Ebrahimi       exactly the same results. The returned value  from  pcre2_jit_compile()
5393*22dc650dSSadaf Ebrahimi       is zero on success, or a negative error code.
5394*22dc650dSSadaf Ebrahimi
5395*22dc650dSSadaf Ebrahimi       There  is  a limit to the size of pattern that JIT supports, imposed by
5396*22dc650dSSadaf Ebrahimi       the size of machine stack that it uses. The exact rules are  not  docu-
5397*22dc650dSSadaf Ebrahimi       mented because they may change at any time, in particular, when new op-
5398*22dc650dSSadaf Ebrahimi       timizations  are  introduced.   If  a  pattern  is  too  big, a call to
5399*22dc650dSSadaf Ebrahimi       pcre2_jit_compile() returns PCRE2_ERROR_NOMEMORY.
5400*22dc650dSSadaf Ebrahimi
5401*22dc650dSSadaf Ebrahimi       PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for  com-
5402*22dc650dSSadaf Ebrahimi       plete  matches. If you want to run partial matches using the PCRE2_PAR-
5403*22dc650dSSadaf Ebrahimi       TIAL_HARD or PCRE2_PARTIAL_SOFT options of  pcre2_match(),  you  should
5404*22dc650dSSadaf Ebrahimi       set  one  or  both  of  the  other  options  as  well as, or instead of
5405*22dc650dSSadaf Ebrahimi       PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code
5406*22dc650dSSadaf Ebrahimi       for each of the three modes (normal, soft partial, hard partial).  When
5407*22dc650dSSadaf Ebrahimi       pcre2_match()  is  called,  the appropriate code is run if it is avail-
5408*22dc650dSSadaf Ebrahimi       able. Otherwise, the pattern is matched using interpretive code.
5409*22dc650dSSadaf Ebrahimi
5410*22dc650dSSadaf Ebrahimi       You can call pcre2_jit_compile() multiple times for the  same  compiled
5411*22dc650dSSadaf Ebrahimi       pattern.  It does nothing if it has previously compiled code for any of
5412*22dc650dSSadaf Ebrahimi       the option bits. For example, you can call it once with  PCRE2_JIT_COM-
5413*22dc650dSSadaf Ebrahimi       PLETE  and  (perhaps  later,  when  you find you need partial matching)
5414*22dc650dSSadaf Ebrahimi       again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time  it
5415*22dc650dSSadaf Ebrahimi       will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
5416*22dc650dSSadaf Ebrahimi       ing. If pcre2_jit_compile() is called with no option bits set, it imme-
5417*22dc650dSSadaf Ebrahimi       diately returns zero. This is an alternative way of testing whether JIT
5418*22dc650dSSadaf Ebrahimi       is available.
5419*22dc650dSSadaf Ebrahimi
5420*22dc650dSSadaf Ebrahimi       At  present,  it  is not possible to free JIT compiled code except when
5421*22dc650dSSadaf Ebrahimi       the entire compiled pattern is freed by calling pcre2_code_free().
5422*22dc650dSSadaf Ebrahimi
5423*22dc650dSSadaf Ebrahimi       In some circumstances you may need to call additional functions.  These
5424*22dc650dSSadaf Ebrahimi       are  described  in the section entitled "Controlling the JIT stack" be-
5425*22dc650dSSadaf Ebrahimi       low.
5426*22dc650dSSadaf Ebrahimi
5427*22dc650dSSadaf Ebrahimi       There are some pcre2_match() options that are not supported by JIT, and
5428*22dc650dSSadaf Ebrahimi       there are also some pattern items that JIT cannot handle.  Details  are
5429*22dc650dSSadaf Ebrahimi       given  below.   In both cases, matching automatically falls back to the
5430*22dc650dSSadaf Ebrahimi       interpretive code. If you want to know whether JIT  was  actually  used
5431*22dc650dSSadaf Ebrahimi       for  a particular match, you should arrange for a JIT callback function
5432*22dc650dSSadaf Ebrahimi       to be set up as described in the section entitled "Controlling the  JIT
5433*22dc650dSSadaf Ebrahimi       stack"  below,  even  if  you  do  not need to supply a non-default JIT
5434*22dc650dSSadaf Ebrahimi       stack. Such a callback function is called whenever JIT code is about to
5435*22dc650dSSadaf Ebrahimi       be obeyed. If the match-time options are not right for  JIT  execution,
5436*22dc650dSSadaf Ebrahimi       the callback function is not obeyed.
5437*22dc650dSSadaf Ebrahimi
5438*22dc650dSSadaf Ebrahimi       If  the  JIT  compiler finds an unsupported item, no JIT data is gener-
5439*22dc650dSSadaf Ebrahimi       ated. You can find out if JIT compilation was successful for a compiled
5440*22dc650dSSadaf Ebrahimi       pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JITSIZE op-
5441*22dc650dSSadaf Ebrahimi       tion. A non-zero result means that JIT compilation  was  successful.  A
5442*22dc650dSSadaf Ebrahimi       result of 0 means that JIT support is not available, or the pattern was
5443*22dc650dSSadaf Ebrahimi       not  processed by pcre2_jit_compile(), or the JIT compiler was not able
5444*22dc650dSSadaf Ebrahimi       to handle the pattern. Successful JIT compilation  does  not,  however,
5445*22dc650dSSadaf Ebrahimi       guarantee  the  use  of  JIT at match time because there are some match
5446*22dc650dSSadaf Ebrahimi       time options that are not supported by JIT.
5447*22dc650dSSadaf Ebrahimi
5448*22dc650dSSadaf Ebrahimi
5449*22dc650dSSadaf EbrahimiMATCHING SUBJECTS CONTAINING INVALID UTF
5450*22dc650dSSadaf Ebrahimi
5451*22dc650dSSadaf Ebrahimi       When a pattern is compiled with the PCRE2_UTF option,  subject  strings
5452*22dc650dSSadaf Ebrahimi       are  normally expected to be a valid sequence of UTF code units. By de-
5453*22dc650dSSadaf Ebrahimi       fault, this is checked at the start of matching and an error is  gener-
5454*22dc650dSSadaf Ebrahimi       ated  if  invalid UTF is detected. The PCRE2_NO_UTF_CHECK option can be
5455*22dc650dSSadaf Ebrahimi       passed to pcre2_match() to skip the check (for improved performance) if
5456*22dc650dSSadaf Ebrahimi       you are sure that a subject string is valid. If  this  option  is  used
5457*22dc650dSSadaf Ebrahimi       with  an  invalid  string, the result is undefined. The calling program
5458*22dc650dSSadaf Ebrahimi       may crash or loop or otherwise misbehave.
5459*22dc650dSSadaf Ebrahimi
       However, a way of running matches on strings that may contain invalid
       UTF sequences is available. Calling pcre2_compile() with the
       PCRE2_MATCH_INVALID_UTF option has two effects: it tells the
       interpreter in pcre2_match() to support invalid UTF, and, if
       pcre2_jit_compile() is subsequently called, the compiled JIT code also
       supports invalid UTF. Details of how this support works, in both the
       JIT and the interpretive cases, are given in the pcre2unicode
       documentation.
5467*22dc650dSSadaf Ebrahimi
5468*22dc650dSSadaf Ebrahimi       There  is  also  an  obsolete  option  for  pcre2_jit_compile()  called
5469*22dc650dSSadaf Ebrahimi       PCRE2_JIT_INVALID_UTF, which currently exists only for backward compat-
5470*22dc650dSSadaf Ebrahimi       ibility.     It   is   superseded   by   the   pcre2_compile()   option
5471*22dc650dSSadaf Ebrahimi       PCRE2_MATCH_INVALID_UTF and should no longer be used. It may be removed
5472*22dc650dSSadaf Ebrahimi       in future.
5473*22dc650dSSadaf Ebrahimi
5474*22dc650dSSadaf Ebrahimi
5475*22dc650dSSadaf EbrahimiUNSUPPORTED OPTIONS AND PATTERN ITEMS
5476*22dc650dSSadaf Ebrahimi
5477*22dc650dSSadaf Ebrahimi       The pcre2_match() options that  are  supported  for  JIT  matching  are
5478*22dc650dSSadaf Ebrahimi       PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
5479*22dc650dSSadaf Ebrahimi       PCRE2_NOTEMPTY_ATSTART,   PCRE2_NO_UTF_CHECK,  PCRE2_PARTIAL_HARD,  and
5480*22dc650dSSadaf Ebrahimi       PCRE2_PARTIAL_SOFT. The PCRE2_ANCHORED  and  PCRE2_ENDANCHORED  options
5481*22dc650dSSadaf Ebrahimi       are not supported at match time.
5482*22dc650dSSadaf Ebrahimi
5483*22dc650dSSadaf Ebrahimi       If  the  PCRE2_NO_JIT option is passed to pcre2_match() it disables the
5484*22dc650dSSadaf Ebrahimi       use of JIT, forcing matching by the interpreter code.
5485*22dc650dSSadaf Ebrahimi
5486*22dc650dSSadaf Ebrahimi       The only unsupported pattern items are \C (match a  single  data  unit)
5487*22dc650dSSadaf Ebrahimi       when  running in a UTF mode, and a callout immediately before an asser-
5488*22dc650dSSadaf Ebrahimi       tion condition in a conditional group.
5489*22dc650dSSadaf Ebrahimi
5490*22dc650dSSadaf Ebrahimi
5491*22dc650dSSadaf EbrahimiRETURN VALUES FROM JIT MATCHING
5492*22dc650dSSadaf Ebrahimi
5493*22dc650dSSadaf Ebrahimi       When a pattern is matched using JIT, the return values are the same  as
5494*22dc650dSSadaf Ebrahimi       those  given  by the interpretive pcre2_match() code, with the addition
5495*22dc650dSSadaf Ebrahimi       of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means that  the
5496*22dc650dSSadaf Ebrahimi       memory  used  for  the JIT stack was insufficient. See "Controlling the
5497*22dc650dSSadaf Ebrahimi       JIT stack" below for a discussion of JIT stack usage.
5498*22dc650dSSadaf Ebrahimi
5499*22dc650dSSadaf Ebrahimi       The error code PCRE2_ERROR_MATCHLIMIT is returned by the  JIT  code  if
5500*22dc650dSSadaf Ebrahimi       searching  a  very large pattern tree goes on for too long, as it is in
5501*22dc650dSSadaf Ebrahimi       the same circumstance when JIT is not used, but the details of  exactly
5502*22dc650dSSadaf Ebrahimi       what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
5503*22dc650dSSadaf Ebrahimi       is never returned when JIT matching is used.
5504*22dc650dSSadaf Ebrahimi
5505*22dc650dSSadaf Ebrahimi
5506*22dc650dSSadaf EbrahimiCONTROLLING THE JIT STACK
5507*22dc650dSSadaf Ebrahimi
5508*22dc650dSSadaf Ebrahimi       When the compiled JIT code runs, it needs a block of memory to use as a
5509*22dc650dSSadaf Ebrahimi       stack.   By  default, it uses 32KiB on the machine stack. However, some
5510*22dc650dSSadaf Ebrahimi       large or complicated patterns need more than this. The error  PCRE2_ER-
5511*22dc650dSSadaf Ebrahimi       ROR_JIT_STACKLIMIT is given when there is not enough stack. Three func-
5512*22dc650dSSadaf Ebrahimi       tions are provided for managing blocks of memory for use as JIT stacks.
5513*22dc650dSSadaf Ebrahimi       There  is further discussion about the use of JIT stacks in the section
5514*22dc650dSSadaf Ebrahimi       entitled "JIT stack FAQ" below.
5515*22dc650dSSadaf Ebrahimi
5516*22dc650dSSadaf Ebrahimi       The pcre2_jit_stack_create() function creates a JIT  stack.  Its  argu-
5517*22dc650dSSadaf Ebrahimi       ments  are  a starting size, a maximum size, and a general context (for
5518*22dc650dSSadaf Ebrahimi       memory allocation functions, or NULL for standard  memory  allocation).
5519*22dc650dSSadaf Ebrahimi       It returns a pointer to an opaque structure of type pcre2_jit_stack, or
5520*22dc650dSSadaf Ebrahimi       NULL  if there is an error. The pcre2_jit_stack_free() function is used
5521*22dc650dSSadaf Ebrahimi       to free a stack that is no longer needed. If its argument is NULL, this
5522*22dc650dSSadaf Ebrahimi       function returns immediately, without doing anything. (For the  techni-
5523*22dc650dSSadaf Ebrahimi       cally  minded: the address space is allocated by mmap or VirtualAlloc.)
5524*22dc650dSSadaf Ebrahimi       A maximum stack size of 512KiB to 1MiB should be more than  enough  for
5525*22dc650dSSadaf Ebrahimi       any pattern.
5526*22dc650dSSadaf Ebrahimi
5527*22dc650dSSadaf Ebrahimi       The  pcre2_jit_stack_assign()  function  specifies which stack JIT code
5528*22dc650dSSadaf Ebrahimi       should use. Its arguments are as follows:
5529*22dc650dSSadaf Ebrahimi
5530*22dc650dSSadaf Ebrahimi         pcre2_match_context  *mcontext
5531*22dc650dSSadaf Ebrahimi         pcre2_jit_callback    callback
5532*22dc650dSSadaf Ebrahimi         void                 *data
5533*22dc650dSSadaf Ebrahimi
       The first argument is a pointer to a match context. When this is
       subsequently passed to a matching function, its information determines
       which JIT stack is used. If this argument is NULL, the function returns
       immediately, without doing anything. There are three cases for the
       values of the other two arguments:
5539*22dc650dSSadaf Ebrahimi
5540*22dc650dSSadaf Ebrahimi         (1) If callback is NULL and data is NULL, an internal 32KiB block
5541*22dc650dSSadaf Ebrahimi             on the machine stack is used. This is the default when a match
5542*22dc650dSSadaf Ebrahimi             context is created.
5543*22dc650dSSadaf Ebrahimi
5544*22dc650dSSadaf Ebrahimi         (2) If callback is NULL and data is not NULL, data must be
5545*22dc650dSSadaf Ebrahimi             a pointer to a valid JIT stack, the result of calling
5546*22dc650dSSadaf Ebrahimi             pcre2_jit_stack_create().
5547*22dc650dSSadaf Ebrahimi
5548*22dc650dSSadaf Ebrahimi         (3) If callback is not NULL, it must point to a function that is
5549*22dc650dSSadaf Ebrahimi             called with data as an argument at the start of matching, in
5550*22dc650dSSadaf Ebrahimi             order to set up a JIT stack. If the return from the callback
5551*22dc650dSSadaf Ebrahimi             function is NULL, the internal 32KiB stack is used; otherwise the
5552*22dc650dSSadaf Ebrahimi             return value must be a valid JIT stack, the result of calling
5553*22dc650dSSadaf Ebrahimi             pcre2_jit_stack_create().
5554*22dc650dSSadaf Ebrahimi
       A callback function is obeyed whenever JIT code is about to be run; it
       is not obeyed when pcre2_match() is called with options that are
       incompatible with JIT matching. A callback function can therefore be
       used to determine whether a match operation was executed by JIT or by
       the interpreter.
5560*22dc650dSSadaf Ebrahimi
5561*22dc650dSSadaf Ebrahimi       You may safely use the same JIT stack for more than one pattern (either
5562*22dc650dSSadaf Ebrahimi       by assigning directly or by callback), as  long  as  the  patterns  are
5563*22dc650dSSadaf Ebrahimi       matched sequentially in the same thread. Currently, the only way to set
5564*22dc650dSSadaf Ebrahimi       up  non-sequential matches in one thread is to use callouts: if a call-
5565*22dc650dSSadaf Ebrahimi       out function starts another match, that match must use a different  JIT
5566*22dc650dSSadaf Ebrahimi       stack to the one used for currently suspended match(es).
5567*22dc650dSSadaf Ebrahimi
5568*22dc650dSSadaf Ebrahimi       In  a multithread application, if you do not specify a JIT stack, or if
5569*22dc650dSSadaf Ebrahimi       you assign or pass back NULL from a callback, that is thread-safe,  be-
5570*22dc650dSSadaf Ebrahimi       cause  each thread has its own machine stack. However, if you assign or
5571*22dc650dSSadaf Ebrahimi       pass back a non-NULL JIT stack, this must be a different stack for each
5572*22dc650dSSadaf Ebrahimi       thread so that the application is thread-safe.
5573*22dc650dSSadaf Ebrahimi
5574*22dc650dSSadaf Ebrahimi       Strictly speaking, even more is allowed. You can assign the  same  non-
5575*22dc650dSSadaf Ebrahimi       NULL  stack  to a match context that is used by any number of patterns,
5576*22dc650dSSadaf Ebrahimi       as long as they are not used for matching by multiple  threads  at  the
5577*22dc650dSSadaf Ebrahimi       same  time.  For  example, you could use the same stack in all compiled
5578*22dc650dSSadaf Ebrahimi       patterns, with a global mutex in the callback to wait until  the  stack
5579*22dc650dSSadaf Ebrahimi       is available for use. However, this is an inefficient solution, and not
5580*22dc650dSSadaf Ebrahimi       recommended.
5581*22dc650dSSadaf Ebrahimi
5582*22dc650dSSadaf Ebrahimi       This  is a suggestion for how a multithreaded program that needs to set
5583*22dc650dSSadaf Ebrahimi       up non-default JIT stacks might operate:
5584*22dc650dSSadaf Ebrahimi
5585*22dc650dSSadaf Ebrahimi         During thread initialization
5586*22dc650dSSadaf Ebrahimi           thread_local_var = pcre2_jit_stack_create(...)
5587*22dc650dSSadaf Ebrahimi
5588*22dc650dSSadaf Ebrahimi         During thread exit
5589*22dc650dSSadaf Ebrahimi           pcre2_jit_stack_free(thread_local_var)
5590*22dc650dSSadaf Ebrahimi
5591*22dc650dSSadaf Ebrahimi         Use a one-line callback function
5592*22dc650dSSadaf Ebrahimi           return thread_local_var
5593*22dc650dSSadaf Ebrahimi
5594*22dc650dSSadaf Ebrahimi       All the functions described in this section do nothing if  JIT  is  not
5595*22dc650dSSadaf Ebrahimi       available.
5596*22dc650dSSadaf Ebrahimi
5597*22dc650dSSadaf Ebrahimi
5598*22dc650dSSadaf EbrahimiJIT STACK FAQ
5599*22dc650dSSadaf Ebrahimi
5600*22dc650dSSadaf Ebrahimi       (1) Why do we need JIT stacks?
5601*22dc650dSSadaf Ebrahimi
       PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
       where the local data of the current node is pushed before checking its
       child nodes. Allocating real machine stack on some platforms is
       difficult. For example, on PowerPC the stack chain needs to be updated
       every time the stack is extended. Although this is possible, the
       overhead of updating it decreases performance. So we do the recursion
       in memory.
5608*22dc650dSSadaf Ebrahimi
5609*22dc650dSSadaf Ebrahimi       (2) Why don't we simply allocate blocks of memory with malloc()?
5610*22dc650dSSadaf Ebrahimi
5611*22dc650dSSadaf Ebrahimi       Modern  operating  systems have a nice feature: they can reserve an ad-
5612*22dc650dSSadaf Ebrahimi       dress space instead of allocating memory. We can safely allocate memory
5613*22dc650dSSadaf Ebrahimi       pages inside this address space, so the stack could grow without moving
5614*22dc650dSSadaf Ebrahimi       memory data (this is important because of pointers). Thus we can  allo-
5615*22dc650dSSadaf Ebrahimi       cate  1MiB  address  space,  and use only a single memory page (usually
5616*22dc650dSSadaf Ebrahimi       4KiB) if that is enough. However, we can still grow up to 1MiB  anytime
5617*22dc650dSSadaf Ebrahimi       if needed.
5618*22dc650dSSadaf Ebrahimi
5619*22dc650dSSadaf Ebrahimi       (3) Who "owns" a JIT stack?
5620*22dc650dSSadaf Ebrahimi
       The owner of the stack is the user program, not the JIT-compiled
       pattern or anything else. While a stack is being used by pcre2_match()
       (that is, it is assigned to a match context that is passed to the
       pattern currently running), the user program must ensure that the stack
       is not used by any other thread (to avoid overwriting the same memory
       area). The best practice for multithreaded programs is to allocate a
       stack for each thread, and return this stack through the JIT callback
       function.
5628*22dc650dSSadaf Ebrahimi
5629*22dc650dSSadaf Ebrahimi       (4) When should a JIT stack be freed?
5630*22dc650dSSadaf Ebrahimi
       You can free a JIT stack at any time, as long as it will not be used by
       pcre2_match() again. When you assign the stack to a match context, only
       a pointer is set. There is no reference counting or any other magic.
       You can free compiled patterns, contexts, and stacks in any order, at
       any time. Just do not call pcre2_match() with a match context pointing
       to an already freed stack, as that is likely to cause a segmentation
       fault. (Also, do not free a stack currently used by pcre2_match() in
       another thread.) You can also replace the stack in a context at any
       time when it is not in use. You should free the previous stack before
       assigning a replacement.
5640*22dc650dSSadaf Ebrahimi
5641*22dc650dSSadaf Ebrahimi       (5)  Should  I  allocate/free  a  stack every time before/after calling
5642*22dc650dSSadaf Ebrahimi       pcre2_match()?
5643*22dc650dSSadaf Ebrahimi
       No, because this is too costly in terms of resources. However, you
       could implement a scheme that releases the stack if it has not been
       used for, say, two minutes. The JIT callback can help to achieve this
       without keeping a list of patterns.
5648*22dc650dSSadaf Ebrahimi
5649*22dc650dSSadaf Ebrahimi       (6)  OK, the stack is for long term memory allocation. But what happens
5650*22dc650dSSadaf Ebrahimi       if a pattern causes stack overflow with a stack of 1MiB? Is  that  1MiB
5651*22dc650dSSadaf Ebrahimi       kept until the stack is freed?
5652*22dc650dSSadaf Ebrahimi
       Especially on embedded systems, it might be a good idea to release
       memory sometimes without freeing the stack. There is no API for this at
       the moment. A function that returns the memory currently allocated for
       a stack, and another that releases memory (shrinking the stack), would
       probably be a good idea if someone needs this.
5658*22dc650dSSadaf Ebrahimi
5659*22dc650dSSadaf Ebrahimi       (7) This is too much of a headache. Isn't there any better solution for
5660*22dc650dSSadaf Ebrahimi       JIT stack handling?
5661*22dc650dSSadaf Ebrahimi
5662*22dc650dSSadaf Ebrahimi       No, thanks to Windows. If POSIX threads were used everywhere, we  could
5663*22dc650dSSadaf Ebrahimi       throw out this complicated API.
5664*22dc650dSSadaf Ebrahimi
5665*22dc650dSSadaf Ebrahimi
5666*22dc650dSSadaf EbrahimiFREEING JIT SPECULATIVE MEMORY
5667*22dc650dSSadaf Ebrahimi
5668*22dc650dSSadaf Ebrahimi       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
5669*22dc650dSSadaf Ebrahimi
       The JIT executable allocator does not always free memory as soon as it
       could. It expects new allocations, and keeps some free memory around to
       improve allocation speed. However, in low memory conditions, it might
       be better to free all possible memory. You can cause this to happen by
       calling pcre2_jit_free_unused_memory(). Its argument is a general
       context, for custom memory management, or NULL for standard memory
       management.
5677*22dc650dSSadaf Ebrahimi
5678*22dc650dSSadaf Ebrahimi
5679*22dc650dSSadaf EbrahimiEXAMPLE CODE
5680*22dc650dSSadaf Ebrahimi
5681*22dc650dSSadaf Ebrahimi       This  is  a  single-threaded example that specifies a JIT stack without
5682*22dc650dSSadaf Ebrahimi       using a callback. A real program should include  error  checking  after
5683*22dc650dSSadaf Ebrahimi       all the function calls.
5684*22dc650dSSadaf Ebrahimi
         int rc;
         int errornumber;         /* error code from pcre2_compile() */
         PCRE2_SIZE erroffset;    /* offset in the pattern of any error */
         pcre2_code *re;
         pcre2_match_data *match_data;
         pcre2_match_context *mcontext;
         pcre2_jit_stack *jit_stack;

         /* pattern, subject, and length are assumed to be set up elsewhere */
         re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
           &errornumber, &erroffset, NULL);
         rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
         mcontext = pcre2_match_context_create(NULL);
         /* The stack starts at 32KiB and may grow to 512KiB */
         jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
         pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
         /* Arguments are the number of ovector pairs and a general context */
         match_data = pcre2_match_data_create(10, NULL);
         rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
         /* Process result */

         pcre2_code_free(re);
         pcre2_match_data_free(match_data);
         pcre2_match_context_free(mcontext);
         pcre2_jit_stack_free(jit_stack);
5705*22dc650dSSadaf Ebrahimi
5706*22dc650dSSadaf Ebrahimi
5707*22dc650dSSadaf EbrahimiJIT FAST PATH API
5708*22dc650dSSadaf Ebrahimi
       Because the API described above falls back to interpreted matching when
       JIT is not available, it is convenient for programs that are written
       for general use in many environments. However, calling JIT via
       pcre2_match() does have a performance impact. Programs that are written
       for use where JIT is known to be available, and which need the best
       possible performance, can use a "fast path" API that calls JIT matching
       directly rather than going through pcre2_match(). This is possible only
       for patterns that have been successfully processed by
       pcre2_jit_compile().
5717*22dc650dSSadaf Ebrahimi
5718*22dc650dSSadaf Ebrahimi       The fast path function is called pcre2_jit_match(), and  it  takes  ex-
5719*22dc650dSSadaf Ebrahimi       actly  the same arguments as pcre2_match(). However, the subject string
5720*22dc650dSSadaf Ebrahimi       must be specified with a  length;  PCRE2_ZERO_TERMINATED  is  not  sup-
5721*22dc650dSSadaf Ebrahimi       ported.  Unsupported  option  bits  (for  example,  PCRE2_ANCHORED  and
5722*22dc650dSSadaf Ebrahimi       PCRE2_ENDANCHORED) are ignored, as is the PCRE2_NO_JIT option. The  re-
5723*22dc650dSSadaf Ebrahimi       turn  values  are  also  the  same as for pcre2_match(), plus PCRE2_ER-
5724*22dc650dSSadaf Ebrahimi       ROR_JIT_BADOPTION if a matching mode (partial or complete) is requested
5725*22dc650dSSadaf Ebrahimi       that was not compiled.
5726*22dc650dSSadaf Ebrahimi
5727*22dc650dSSadaf Ebrahimi       When you call pcre2_match(), as well as testing for invalid options,  a
5728*22dc650dSSadaf Ebrahimi       number of other sanity checks are performed on the arguments. For exam-
5729*22dc650dSSadaf Ebrahimi       ple,  if the subject pointer is NULL but the length is non-zero, an im-
5730*22dc650dSSadaf Ebrahimi       mediate error is given. Also, unless PCRE2_NO_UTF_CHECK is set,  a  UTF
5731*22dc650dSSadaf Ebrahimi       subject string is tested for validity. In the interests of speed, these
5732*22dc650dSSadaf Ebrahimi       checks  do  not  happen  on  the  JIT fast path. If invalid UTF data is
5733*22dc650dSSadaf Ebrahimi       passed when PCRE2_MATCH_INVALID_UTF was not  set  for  pcre2_compile(),
5734*22dc650dSSadaf Ebrahimi       the  result  is  undefined. The program may crash or loop or give wrong
5735*22dc650dSSadaf Ebrahimi       results. In the absence  of  PCRE2_MATCH_INVALID_UTF  you  should  call
5736*22dc650dSSadaf Ebrahimi       pcre2_jit_match()  in  UTF  mode  only  if  you are sure the subject is
5737*22dc650dSSadaf Ebrahimi       valid.
5738*22dc650dSSadaf Ebrahimi
5739*22dc650dSSadaf Ebrahimi       Bypassing the sanity checks and the  pcre2_match()  wrapping  can  give
5740*22dc650dSSadaf Ebrahimi       speedups of more than 10%.
5741*22dc650dSSadaf Ebrahimi
5742*22dc650dSSadaf Ebrahimi
5743*22dc650dSSadaf EbrahimiSEE ALSO
5744*22dc650dSSadaf Ebrahimi
5745*22dc650dSSadaf Ebrahimi       pcre2api(3), pcre2unicode(3)
5746*22dc650dSSadaf Ebrahimi
5747*22dc650dSSadaf Ebrahimi
5748*22dc650dSSadaf EbrahimiAUTHOR
5749*22dc650dSSadaf Ebrahimi
5750*22dc650dSSadaf Ebrahimi       Philip Hazel (FAQ by Zoltan Herczeg)
5751*22dc650dSSadaf Ebrahimi       Retired from University Computing Service
5752*22dc650dSSadaf Ebrahimi       Cambridge, England.
5753*22dc650dSSadaf Ebrahimi
5754*22dc650dSSadaf Ebrahimi
5755*22dc650dSSadaf EbrahimiREVISION
5756*22dc650dSSadaf Ebrahimi
5757*22dc650dSSadaf Ebrahimi       Last updated: 21 February 2024
5758*22dc650dSSadaf Ebrahimi       Copyright (c) 1997-2024 University of Cambridge.
5759*22dc650dSSadaf Ebrahimi
5760*22dc650dSSadaf Ebrahimi
5761*22dc650dSSadaf EbrahimiPCRE2 10.43                    21 February 2024                    PCRE2JIT(3)
5762*22dc650dSSadaf Ebrahimi------------------------------------------------------------------------------
5763*22dc650dSSadaf Ebrahimi
5764*22dc650dSSadaf Ebrahimi
5765*22dc650dSSadaf Ebrahimi
5766*22dc650dSSadaf EbrahimiPCRE2LIMITS(3)             Library Functions Manual             PCRE2LIMITS(3)
5767*22dc650dSSadaf Ebrahimi
5768*22dc650dSSadaf Ebrahimi
5769*22dc650dSSadaf EbrahimiNAME
5770*22dc650dSSadaf Ebrahimi       PCRE2 - Perl-compatible regular expressions (revised API)
5771*22dc650dSSadaf Ebrahimi
5772*22dc650dSSadaf Ebrahimi
5773*22dc650dSSadaf EbrahimiSIZE AND OTHER LIMITATIONS
5774*22dc650dSSadaf Ebrahimi
5775*22dc650dSSadaf Ebrahimi       There are some size limitations in PCRE2 but it is hoped that they will
5776*22dc650dSSadaf Ebrahimi       never in practice be relevant.
5777*22dc650dSSadaf Ebrahimi
5778*22dc650dSSadaf Ebrahimi       The  maximum  size  of  a compiled pattern is approximately 64 thousand
5779*22dc650dSSadaf Ebrahimi       code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with
5780*22dc650dSSadaf Ebrahimi       the default internal linkage size, which  is  2  bytes  for  these  li-
5781*22dc650dSSadaf Ebrahimi       braries.  If  you  want  to  process regular expressions that are truly
5782*22dc650dSSadaf Ebrahimi       enormous, you can compile PCRE2 with an internal linkage size of 3 or 4
5783*22dc650dSSadaf Ebrahimi       (when building the 16-bit library, 3 is  rounded  up  to  4).  See  the
5784*22dc650dSSadaf Ebrahimi       README file in the source distribution and the pcre2build documentation
5785*22dc650dSSadaf Ebrahimi       for  details.  In  these cases the limit is substantially larger.  How-
5786*22dc650dSSadaf Ebrahimi       ever, the speed of execution is slower. In the 32-bit library, the  in-
5787*22dc650dSSadaf Ebrahimi       ternal linkage size is always 4.
5788*22dc650dSSadaf Ebrahimi
5789*22dc650dSSadaf Ebrahimi       The maximum length of a source pattern string is essentially unlimited;
5790*22dc650dSSadaf Ebrahimi       it  is  the largest number a PCRE2_SIZE variable can hold. However, the
5791*22dc650dSSadaf Ebrahimi       program that calls pcre2_compile() can specify a smaller limit.
5792*22dc650dSSadaf Ebrahimi
5793*22dc650dSSadaf Ebrahimi       The maximum length (in code units) of a subject string is one less than
5794*22dc650dSSadaf Ebrahimi       the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an un-
5795*22dc650dSSadaf Ebrahimi       signed integer type, usually defined as size_t. Its maximum value (that
5796*22dc650dSSadaf Ebrahimi       is ~(PCRE2_SIZE)0) is reserved as a special indicator  for  zero-termi-
5797*22dc650dSSadaf Ebrahimi       nated strings and unset offsets.
5798*22dc650dSSadaf Ebrahimi
5799*22dc650dSSadaf Ebrahimi       All values in repeating quantifiers must be less than 65536.
5800*22dc650dSSadaf Ebrahimi
       There are two different limits that apply to branches of lookbehind
       assertions. If every branch in such an assertion matches a fixed number
       of characters, the maximum length of any branch is 65535 characters. If
       any branch matches a variable number of characters, then the maximum
       matching length for every branch is limited. The default limit is set
       when PCRE2 is built (255 unless otherwise specified), but it can be
       changed by the calling program.
5808*22dc650dSSadaf Ebrahimi
5809*22dc650dSSadaf Ebrahimi       There  is no limit to the number of parenthesized groups, but there can
5810*22dc650dSSadaf Ebrahimi       be no more than 65535 capture groups, and there is a limit to the depth
5811*22dc650dSSadaf Ebrahimi       of nesting of parenthesized subpatterns of all kinds. This  is  imposed
5812*22dc650dSSadaf Ebrahimi       in  order to limit the amount of system stack used at compile time. The
5813*22dc650dSSadaf Ebrahimi       default limit can be specified when PCRE2 is built; if not, the default
5814*22dc650dSSadaf Ebrahimi       is set to  250.  An  application  can  change  this  limit  by  calling
5815*22dc650dSSadaf Ebrahimi       pcre2_set_parens_nest_limit() to set the limit in a compile context.
5816*22dc650dSSadaf Ebrahimi
       The maximum length of a name for a named capture group is 32 code
       units, and the maximum number of such groups is 10000.
5819*22dc650dSSadaf Ebrahimi
5820*22dc650dSSadaf Ebrahimi       The maximum length of a  name  in  a  (*MARK),  (*PRUNE),  (*SKIP),  or
5821*22dc650dSSadaf Ebrahimi       (*THEN)  verb  is  255  code units for the 8-bit library and 65535 code
5822*22dc650dSSadaf Ebrahimi       units for the 16-bit and 32-bit libraries.
5823*22dc650dSSadaf Ebrahimi
5824*22dc650dSSadaf Ebrahimi       The maximum length of a string argument to a  callout  is  the  largest
5825*22dc650dSSadaf Ebrahimi       number a 32-bit unsigned integer can hold.
5826*22dc650dSSadaf Ebrahimi
5827*22dc650dSSadaf Ebrahimi       The  maximum  amount  of heap memory used for matching is controlled by
5828*22dc650dSSadaf Ebrahimi       the heap limit, which can be set in a pattern or in  a  match  context.
5829*22dc650dSSadaf Ebrahimi       The default is a very large number, effectively unlimited.
5830*22dc650dSSadaf Ebrahimi
5831*22dc650dSSadaf Ebrahimi
5832*22dc650dSSadaf EbrahimiAUTHOR
5833*22dc650dSSadaf Ebrahimi
5834*22dc650dSSadaf Ebrahimi       Philip Hazel
5835*22dc650dSSadaf Ebrahimi       Retired from University Computing Service
5836*22dc650dSSadaf Ebrahimi       Cambridge, England.
5837*22dc650dSSadaf Ebrahimi
5838*22dc650dSSadaf Ebrahimi
5839*22dc650dSSadaf EbrahimiREVISION
5840*22dc650dSSadaf Ebrahimi
5841*22dc650dSSadaf Ebrahimi       Last updated: August 2023
5842*22dc650dSSadaf Ebrahimi       Copyright (c) 1997-2023 University of Cambridge.
5843*22dc650dSSadaf Ebrahimi
5844*22dc650dSSadaf Ebrahimi
5845*22dc650dSSadaf EbrahimiPCRE2 10.43                      1 August 2023                  PCRE2LIMITS(3)
5846*22dc650dSSadaf Ebrahimi------------------------------------------------------------------------------
5847*22dc650dSSadaf Ebrahimi
5848*22dc650dSSadaf Ebrahimi
5849*22dc650dSSadaf Ebrahimi
5850*22dc650dSSadaf EbrahimiPCRE2MATCHING(3)           Library Functions Manual           PCRE2MATCHING(3)
5851*22dc650dSSadaf Ebrahimi
5852*22dc650dSSadaf Ebrahimi
5853*22dc650dSSadaf EbrahimiNAME
5854*22dc650dSSadaf Ebrahimi       PCRE2 - Perl-compatible regular expressions (revised API)
5855*22dc650dSSadaf Ebrahimi
5856*22dc650dSSadaf Ebrahimi
5857*22dc650dSSadaf EbrahimiPCRE2 MATCHING ALGORITHMS
5858*22dc650dSSadaf Ebrahimi
       This document describes the two different algorithms that are available
       in PCRE2 for matching a compiled regular expression against a given
       subject string. The "standard" algorithm is the one provided by the
       pcre2_match() function. This works in the same way as Perl's matching
       function, and provides a Perl-compatible matching operation. The
       just-in-time (JIT) optimization that is described in the pcre2jit
       documentation is compatible with this function.
5866*22dc650dSSadaf Ebrahimi
5867*22dc650dSSadaf Ebrahimi       An alternative algorithm is provided by the pcre2_dfa_match() function;
5868*22dc650dSSadaf Ebrahimi       it operates in a different way, and is not Perl-compatible. This alter-
5869*22dc650dSSadaf Ebrahimi       native has advantages and disadvantages compared with the standard  al-
5870*22dc650dSSadaf Ebrahimi       gorithm, and these are described below.
5871*22dc650dSSadaf Ebrahimi
5872*22dc650dSSadaf Ebrahimi       When there is only one possible way in which a given subject string can
5873*22dc650dSSadaf Ebrahimi       match  a pattern, the two algorithms give the same answer. A difference
5874*22dc650dSSadaf Ebrahimi       arises, however, when there are multiple possibilities. For example, if
5875*22dc650dSSadaf Ebrahimi       the pattern
5876*22dc650dSSadaf Ebrahimi
5877*22dc650dSSadaf Ebrahimi         ^<.*>
5878*22dc650dSSadaf Ebrahimi
5879*22dc650dSSadaf Ebrahimi       is matched against the string
5880*22dc650dSSadaf Ebrahimi
5881*22dc650dSSadaf Ebrahimi         <something> <something else> <something further>
5882*22dc650dSSadaf Ebrahimi
5883*22dc650dSSadaf Ebrahimi       there are three possible answers. The standard algorithm finds only one
5884*22dc650dSSadaf Ebrahimi       of them, whereas the alternative algorithm finds all three.
5885*22dc650dSSadaf Ebrahimi
5886*22dc650dSSadaf Ebrahimi
5887*22dc650dSSadaf EbrahimiREGULAR EXPRESSIONS AS TREES
5888*22dc650dSSadaf Ebrahimi
5889*22dc650dSSadaf Ebrahimi       The set of strings that are matched by a regular expression can be rep-
5890*22dc650dSSadaf Ebrahimi       resented as a tree structure. An unlimited repetition  in  the  pattern
5891*22dc650dSSadaf Ebrahimi       makes  the  tree of infinite size, but it is still a tree. Matching the
5892*22dc650dSSadaf Ebrahimi       pattern to a given subject string (from a given starting point) can  be
5893*22dc650dSSadaf Ebrahimi       thought  of  as  a  search of the tree.  There are two ways to search a
5894*22dc650dSSadaf Ebrahimi       tree: depth-first and breadth-first, and these correspond  to  the  two
5895*22dc650dSSadaf Ebrahimi       matching algorithms provided by PCRE2.
5896*22dc650dSSadaf Ebrahimi
5897*22dc650dSSadaf Ebrahimi
5898*22dc650dSSadaf EbrahimiTHE STANDARD MATCHING ALGORITHM
5899*22dc650dSSadaf Ebrahimi
5900*22dc650dSSadaf Ebrahimi       In  the terminology of Jeffrey Friedl's book "Mastering Regular Expres-
5901*22dc650dSSadaf Ebrahimi       sions", the standard algorithm is an "NFA  algorithm".  It  conducts  a
5902*22dc650dSSadaf Ebrahimi       depth-first  search  of  the pattern tree. That is, it proceeds along a
5903*22dc650dSSadaf Ebrahimi       single path through the tree, checking that the subject matches what is
5904*22dc650dSSadaf Ebrahimi       required. When there is a mismatch, the algorithm  tries  any  alterna-
5905*22dc650dSSadaf Ebrahimi       tives  at  the  current point, and if they all fail, it backs up to the
5906*22dc650dSSadaf Ebrahimi       previous branch point in the  tree,  and  tries  the  next  alternative
5907*22dc650dSSadaf Ebrahimi       branch  at  that  level.  This often involves backing up (moving to the
5908*22dc650dSSadaf Ebrahimi       left) in the subject string as well.  The  order  in  which  repetition
5909*22dc650dSSadaf Ebrahimi       branches  are  tried  is controlled by the greedy or ungreedy nature of
5910*22dc650dSSadaf Ebrahimi       the quantifier.
5911*22dc650dSSadaf Ebrahimi
5912*22dc650dSSadaf Ebrahimi       If a leaf node is reached, a matching string has  been  found,  and  at
5913*22dc650dSSadaf Ebrahimi       that  point the algorithm stops. Thus, if there is more than one possi-
5914*22dc650dSSadaf Ebrahimi       ble match, this algorithm returns the first one that it finds.  Whether
5915*22dc650dSSadaf Ebrahimi       this  is the shortest, the longest, or some intermediate length depends
5916*22dc650dSSadaf Ebrahimi       on the way the alternations and the greedy or ungreedy repetition quan-
5917*22dc650dSSadaf Ebrahimi       tifiers are specified in the pattern.
5918*22dc650dSSadaf Ebrahimi
5919*22dc650dSSadaf Ebrahimi       Because it ends up with a single path through the  tree,  it  is  rela-
5920*22dc650dSSadaf Ebrahimi       tively  straightforward  for  this  algorithm to keep track of the sub-
5921*22dc650dSSadaf Ebrahimi       strings that are matched by portions of  the  pattern  in  parentheses.
5922*22dc650dSSadaf Ebrahimi       This provides support for capturing parentheses and backreferences.
5923*22dc650dSSadaf Ebrahimi
5924*22dc650dSSadaf Ebrahimi
5925*22dc650dSSadaf EbrahimiTHE ALTERNATIVE MATCHING ALGORITHM
5926*22dc650dSSadaf Ebrahimi
5927*22dc650dSSadaf Ebrahimi       This  algorithm  conducts  a breadth-first search of the tree. Starting
5928*22dc650dSSadaf Ebrahimi       from the first matching point in the  subject,  it  scans  the  subject
5929*22dc650dSSadaf Ebrahimi       string from left to right, once, character by character, and as it does
5930*22dc650dSSadaf Ebrahimi       this,  it remembers all the paths through the tree that represent valid
5931*22dc650dSSadaf Ebrahimi       matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
5932*22dc650dSSadaf Ebrahimi       though  it is not implemented as a traditional finite state machine (it
5933*22dc650dSSadaf Ebrahimi       keeps multiple states active simultaneously).
5934*22dc650dSSadaf Ebrahimi
5935*22dc650dSSadaf Ebrahimi       Although the general principle of this matching algorithm  is  that  it
5936*22dc650dSSadaf Ebrahimi       scans  the subject string only once, without backtracking, there is one
5937*22dc650dSSadaf Ebrahimi       exception: when a lookaround assertion is encountered,  the  characters
5938*22dc650dSSadaf Ebrahimi       following  or  preceding the current point have to be independently in-
5939*22dc650dSSadaf Ebrahimi       spected.
5940*22dc650dSSadaf Ebrahimi
5941*22dc650dSSadaf Ebrahimi       The scan continues until either the end of the subject is  reached,  or
5942*22dc650dSSadaf Ebrahimi       there  are  no more unterminated paths. At this point, terminated paths
5943*22dc650dSSadaf Ebrahimi       represent the different matching possibilities (if there are none,  the
5944*22dc650dSSadaf Ebrahimi       match  has  failed).   Thus,  if there is more than one possible match,
5945*22dc650dSSadaf Ebrahimi       this algorithm finds all of them,  and  in  particular,  it  finds  the
5946*22dc650dSSadaf Ebrahimi       longest.  The  matches  are returned in the output vector in decreasing
5947*22dc650dSSadaf Ebrahimi       order of length. There is an option to stop  the  algorithm  after  the
5948*22dc650dSSadaf Ebrahimi       first match (which is necessarily the shortest) is found.
5949*22dc650dSSadaf Ebrahimi
       Note that the size of the vector needed to contain all the results
       depends on the number of simultaneous matches, not on the number of
       parentheses in the pattern. Using
       pcre2_match_data_create_from_pattern() to create the match data block
       is therefore not advisable when doing DFA matching.
5955*22dc650dSSadaf Ebrahimi
5956*22dc650dSSadaf Ebrahimi       Note also that all the matches that are found start at the  same  point
5957*22dc650dSSadaf Ebrahimi       in the subject. If the pattern
5958*22dc650dSSadaf Ebrahimi
5959*22dc650dSSadaf Ebrahimi         cat(er(pillar)?)?
5960*22dc650dSSadaf Ebrahimi
5961*22dc650dSSadaf Ebrahimi       is  matched  against the string "the caterpillar catchment", the result
5962*22dc650dSSadaf Ebrahimi       is the three strings "caterpillar", "cater", and "cat"  that  start  at
5963*22dc650dSSadaf Ebrahimi       the  fifth  character  of the subject. The algorithm does not automati-
5964*22dc650dSSadaf Ebrahimi       cally move on to find matches that start at later positions.
5965*22dc650dSSadaf Ebrahimi
5966*22dc650dSSadaf Ebrahimi       PCRE2's "auto-possessification" optimization usually applies to charac-
5967*22dc650dSSadaf Ebrahimi       ter repeats at the end of a pattern (as well as internally). For  exam-
5968*22dc650dSSadaf Ebrahimi       ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
5969*22dc650dSSadaf Ebrahimi       is  no  point even considering the possibility of backtracking into the
5970*22dc650dSSadaf Ebrahimi       repeated digits. For DFA matching, this means that  only  one  possible
5971*22dc650dSSadaf Ebrahimi       match  is  found. If you really do want multiple matches in such cases,
5972*22dc650dSSadaf Ebrahimi       either use an ungreedy repeat ("a\d+?") or set  the  PCRE2_NO_AUTO_POS-
5973*22dc650dSSadaf Ebrahimi       SESS option when compiling.
5974*22dc650dSSadaf Ebrahimi
5975*22dc650dSSadaf Ebrahimi       There  are  a  number of features of PCRE2 regular expressions that are
5976*22dc650dSSadaf Ebrahimi       not supported or behave differently in the alternative  matching  func-
5977*22dc650dSSadaf Ebrahimi       tion. Those that are not supported cause an error if encountered.
5978*22dc650dSSadaf Ebrahimi
5979*22dc650dSSadaf Ebrahimi       1.  Because the algorithm finds all possible matches, the greedy or un-
5980*22dc650dSSadaf Ebrahimi       greedy nature of repetition quantifiers is not relevant (though it  may
5981*22dc650dSSadaf Ebrahimi       affect  auto-possessification,  as  just  described).  During matching,
5982*22dc650dSSadaf Ebrahimi       greedy and ungreedy quantifiers are treated in exactly  the  same  way.
5983*22dc650dSSadaf Ebrahimi       However, possessive quantifiers can make a difference when what follows
5984*22dc650dSSadaf Ebrahimi       could  also  match  what  is  quantified, for example in a pattern like
5985*22dc650dSSadaf Ebrahimi       this:
5986*22dc650dSSadaf Ebrahimi
5987*22dc650dSSadaf Ebrahimi         ^a++\w!
5988*22dc650dSSadaf Ebrahimi
5989*22dc650dSSadaf Ebrahimi       This pattern matches "aaab!" but not "aaa!", which would be matched  by
5990*22dc650dSSadaf Ebrahimi       a  non-possessive quantifier. Similarly, if an atomic group is present,
5991*22dc650dSSadaf Ebrahimi       it is matched as if it were a standalone pattern at the current  point,
5992*22dc650dSSadaf Ebrahimi       and  the  longest match is then "locked in" for the rest of the overall
5993*22dc650dSSadaf Ebrahimi       pattern.
5994*22dc650dSSadaf Ebrahimi
5995*22dc650dSSadaf Ebrahimi       2. When dealing with multiple paths through the tree simultaneously, it
5996*22dc650dSSadaf Ebrahimi       is not straightforward to keep track of  captured  substrings  for  the
5997*22dc650dSSadaf Ebrahimi       different  matching  possibilities,  and PCRE2's implementation of this
5998*22dc650dSSadaf Ebrahimi       algorithm does not attempt to do this. This means that no captured sub-
5999*22dc650dSSadaf Ebrahimi       strings are available.
6000*22dc650dSSadaf Ebrahimi
6001*22dc650dSSadaf Ebrahimi       3. Because no substrings are captured, backreferences within  the  pat-
6002*22dc650dSSadaf Ebrahimi       tern are not supported.
6003*22dc650dSSadaf Ebrahimi
6004*22dc650dSSadaf Ebrahimi       4.  For  the same reason, conditional expressions that use a backrefer-
6005*22dc650dSSadaf Ebrahimi       ence as the condition or test for a specific group  recursion  are  not
6006*22dc650dSSadaf Ebrahimi       supported.
6007*22dc650dSSadaf Ebrahimi
6008*22dc650dSSadaf Ebrahimi       5. Again for the same reason, script runs are not supported.
6009*22dc650dSSadaf Ebrahimi
6010*22dc650dSSadaf Ebrahimi       6. Because many paths through the tree may be active, the \K escape se-
6011*22dc650dSSadaf Ebrahimi       quence,  which  resets the start of the match when encountered (but may
6012*22dc650dSSadaf Ebrahimi       be on some paths and not on others), is not supported.
6013*22dc650dSSadaf Ebrahimi
6014*22dc650dSSadaf Ebrahimi       7. Callouts are supported, but the value of the  capture_top  field  is
6015*22dc650dSSadaf Ebrahimi       always 1, and the value of the capture_last field is always 0.
6016*22dc650dSSadaf Ebrahimi
6017*22dc650dSSadaf Ebrahimi       8.  The  \C  escape  sequence, which (in the standard algorithm) always
6018*22dc650dSSadaf Ebrahimi       matches a single code unit, even in a UTF mode,  is  not  supported  in
6019*22dc650dSSadaf Ebrahimi       these  modes,  because the alternative algorithm moves through the sub-
6020*22dc650dSSadaf Ebrahimi       ject string one character (not code unit) at a  time,  for  all  active
6021*22dc650dSSadaf Ebrahimi       paths through the tree.
6022*22dc650dSSadaf Ebrahimi
6023*22dc650dSSadaf Ebrahimi       9.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
6024*22dc650dSSadaf Ebrahimi       are not supported. (*FAIL) is supported, and  behaves  like  a  failing
6025*22dc650dSSadaf Ebrahimi       negative assertion.
6026*22dc650dSSadaf Ebrahimi
6027*22dc650dSSadaf Ebrahimi       10.  The PCRE2_MATCH_INVALID_UTF option for pcre2_compile() is not sup-
6028*22dc650dSSadaf Ebrahimi       ported by pcre2_dfa_match().
6029*22dc650dSSadaf Ebrahimi
6030*22dc650dSSadaf Ebrahimi
6031*22dc650dSSadaf EbrahimiADVANTAGES OF THE ALTERNATIVE ALGORITHM
6032*22dc650dSSadaf Ebrahimi
6033*22dc650dSSadaf Ebrahimi       The main advantage of the alternative algorithm is  that  all  possible
6034*22dc650dSSadaf Ebrahimi       matches (at a single point in the subject) are automatically found, and
6035*22dc650dSSadaf Ebrahimi       in  particular, the longest match is found. To find more than one match
6036*22dc650dSSadaf Ebrahimi       at the same point using the standard algorithm, you have to  do  kludgy
6037*22dc650dSSadaf Ebrahimi       things with callouts.
6038*22dc650dSSadaf Ebrahimi
6039*22dc650dSSadaf Ebrahimi       Partial  matching  is  possible with this algorithm, though it has some
6040*22dc650dSSadaf Ebrahimi       limitations. The pcre2partial documentation gives  details  of  partial
6041*22dc650dSSadaf Ebrahimi       matching and discusses multi-segment matching.
6042*22dc650dSSadaf Ebrahimi
6043*22dc650dSSadaf Ebrahimi
6044*22dc650dSSadaf EbrahimiDISADVANTAGES OF THE ALTERNATIVE ALGORITHM
6045*22dc650dSSadaf Ebrahimi
6046*22dc650dSSadaf Ebrahimi       The alternative algorithm suffers from a number of disadvantages:
6047*22dc650dSSadaf Ebrahimi
6048*22dc650dSSadaf Ebrahimi       1.  It  is  substantially  slower  than the standard algorithm. This is
6049*22dc650dSSadaf Ebrahimi       partly because it has to search for all possible matches, but  is  also
6050*22dc650dSSadaf Ebrahimi       because it is less susceptible to optimization.
6051*22dc650dSSadaf Ebrahimi
       2. Capturing parentheses, backreferences, script runs, and matching
       within invalid UTF strings are not supported.
6054*22dc650dSSadaf Ebrahimi
6055*22dc650dSSadaf Ebrahimi       3. Although atomic groups are supported, their use does not provide the
6056*22dc650dSSadaf Ebrahimi       performance advantage that it does for the standard algorithm.
6057*22dc650dSSadaf Ebrahimi
6058*22dc650dSSadaf Ebrahimi       4. JIT optimization is not supported.
6059*22dc650dSSadaf Ebrahimi
6060*22dc650dSSadaf Ebrahimi
6061*22dc650dSSadaf EbrahimiAUTHOR
6062*22dc650dSSadaf Ebrahimi
6063*22dc650dSSadaf Ebrahimi       Philip Hazel
6064*22dc650dSSadaf Ebrahimi       Retired from University Computing Service
6065*22dc650dSSadaf Ebrahimi       Cambridge, England.
6066*22dc650dSSadaf Ebrahimi
6067*22dc650dSSadaf Ebrahimi
6068*22dc650dSSadaf EbrahimiREVISION
6069*22dc650dSSadaf Ebrahimi
6070*22dc650dSSadaf Ebrahimi       Last updated: 19 January 2024
6071*22dc650dSSadaf Ebrahimi       Copyright (c) 1997-2024 University of Cambridge.
6072*22dc650dSSadaf Ebrahimi
6073*22dc650dSSadaf Ebrahimi
6074*22dc650dSSadaf EbrahimiPCRE2 10.43                     19 January 2024               PCRE2MATCHING(3)
6075*22dc650dSSadaf Ebrahimi------------------------------------------------------------------------------
6076*22dc650dSSadaf Ebrahimi
6077*22dc650dSSadaf Ebrahimi
6078*22dc650dSSadaf Ebrahimi
6079*22dc650dSSadaf EbrahimiPCRE2PARTIAL(3)            Library Functions Manual            PCRE2PARTIAL(3)
6080*22dc650dSSadaf Ebrahimi
6081*22dc650dSSadaf Ebrahimi
6082*22dc650dSSadaf EbrahimiNAME
6083*22dc650dSSadaf Ebrahimi       PCRE2 - Perl-compatible regular expressions
6084*22dc650dSSadaf Ebrahimi
6085*22dc650dSSadaf Ebrahimi
6086*22dc650dSSadaf EbrahimiPARTIAL MATCHING IN PCRE2
6087*22dc650dSSadaf Ebrahimi
6088*22dc650dSSadaf Ebrahimi       In  normal use of PCRE2, if there is a match up to the end of a subject
6089*22dc650dSSadaf Ebrahimi       string, but more characters are needed to  match  the  entire  pattern,
6090*22dc650dSSadaf Ebrahimi       PCRE2_ERROR_NOMATCH  is  returned,  just  like any other failing match.
6091*22dc650dSSadaf Ebrahimi       There are circumstances where it might be helpful to  distinguish  this
6092*22dc650dSSadaf Ebrahimi       "partial match" case.
6093*22dc650dSSadaf Ebrahimi
6094*22dc650dSSadaf Ebrahimi       One  example  is  an application where the subject string is very long,
6095*22dc650dSSadaf Ebrahimi       and not all available at once. The requirement here is to be able to do
6096*22dc650dSSadaf Ebrahimi       the matching segment by segment, but special action is  needed  when  a
6097*22dc650dSSadaf Ebrahimi       matched substring spans the boundary between two segments.
6098*22dc650dSSadaf Ebrahimi
6099*22dc650dSSadaf Ebrahimi       Another  example is checking a user input string as it is typed, to en-
6100*22dc650dSSadaf Ebrahimi       sure that it conforms to a required format. Invalid characters  can  be
6101*22dc650dSSadaf Ebrahimi       immediately diagnosed and rejected, giving instant feedback.
6102*22dc650dSSadaf Ebrahimi
6103*22dc650dSSadaf Ebrahimi       Partial  matching  is a PCRE2-specific feature; it is not Perl-compati-
6104*22dc650dSSadaf Ebrahimi       ble. It is requested  by  setting  one  of  the  PCRE2_PARTIAL_HARD  or
6105*22dc650dSSadaf Ebrahimi       PCRE2_PARTIAL_SOFT  options  when calling a matching function. The dif-
6106*22dc650dSSadaf Ebrahimi       ference between the two options is whether or not a  partial  match  is
6107*22dc650dSSadaf Ebrahimi       preferred  to  an alternative complete match, though the details differ
6108*22dc650dSSadaf Ebrahimi       between the two types of matching function. If both  options  are  set,
6109*22dc650dSSadaf Ebrahimi       PCRE2_PARTIAL_HARD takes precedence.
6110*22dc650dSSadaf Ebrahimi
6111*22dc650dSSadaf Ebrahimi       If  you  want to use partial matching with just-in-time optimized code,
6112*22dc650dSSadaf Ebrahimi       as well as setting a partial match option for  the  matching  function,
6113*22dc650dSSadaf Ebrahimi       you  must  also  call pcre2_jit_compile() with one or both of these op-
6114*22dc650dSSadaf Ebrahimi       tions:
6115*22dc650dSSadaf Ebrahimi
6116*22dc650dSSadaf Ebrahimi         PCRE2_JIT_PARTIAL_HARD
6117*22dc650dSSadaf Ebrahimi         PCRE2_JIT_PARTIAL_SOFT
6118*22dc650dSSadaf Ebrahimi
6119*22dc650dSSadaf Ebrahimi       PCRE2_JIT_COMPLETE should also be set if you are going to run  non-par-
6120*22dc650dSSadaf Ebrahimi       tial  matches  on  the same pattern. Separate code is compiled for each
6121*22dc650dSSadaf Ebrahimi       mode. If the appropriate JIT mode has not been  compiled,  interpretive
6122*22dc650dSSadaf Ebrahimi       matching code is used.

       Setting  a partial matching option disables two of PCRE2's standard op-
       timization hints. PCRE2 remembers the last literal code unit in a  pat-
       tern,  and  abandons  matching  immediately if it is not present in the
       subject string.  This optimization cannot be used for a subject  string
       that  might match only partially. PCRE2 also remembers a minimum length
       of a matching string, and does not bother to run the matching  function
       on  shorter  strings.  This  optimization  is also disabled for partial
       matching.


REQUIREMENTS FOR A PARTIAL MATCH

       A possible partial match occurs during matching when  the  end  of  the
       subject  string is reached successfully, but either more characters are
       needed to complete the match, or the addition of more characters  might
       change what is matched.

       Example  1: if the pattern is /abc/ and the subject is "ab", more char-
       acters are definitely needed to complete a match.  In  this  case  both
       hard and soft matching options yield a partial match.

       Example  2: if the pattern is /ab+/ and the subject is "ab", a complete
       match can be found, but the addition of more  characters  might  change
       what  is  matched. In this case, only PCRE2_PARTIAL_HARD returns a par-
       tial match; PCRE2_PARTIAL_SOFT returns the complete match.

       On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set,  if
       the next pattern item is \z, \Z, \b, \B, or $ there is always a partial
       match.   Otherwise, for both options, the next pattern item must be one
       that inspects a character, and at least one of the  following  must  be
       true:

       (1)  At  least  one  character has already been inspected. An inspected
       character need not form part of the final  matched  string;  lookbehind
       assertions  and the \K escape sequence provide ways of inspecting char-
       acters before the start of a matched string.

       (2) The pattern contains one or more lookbehind assertions. This condi-
       tion exists in case there is a lookbehind that inspects characters  be-
       fore the start of the match.

       (3)  There  is a special case when the whole pattern can match an empty
       string.  When the starting point is at the  end  of  the  subject,  the
       empty  string  match is a possibility, and if PCRE2_PARTIAL_SOFT is set
       and neither of the above conditions is true, it is  returned.  However,
       because  adding  more  characters  might  result  in a non-empty match,
       PCRE2_PARTIAL_HARD returns a partial match, which in  this  case  means
       "there  is going to be a match at this point, but until some more char-
       acters are added, we do not know if it will be an empty string or some-
       thing longer".


PARTIAL MATCHING USING pcre2_match()

       When  a  partial  matching  option  is  set,  the  result  of   calling
       pcre2_match() can be one of the following:

       A successful match
         A complete match has been found, starting and ending within this sub-
         ject.

       PCRE2_ERROR_NOMATCH
         No match can start anywhere in this subject.

       PCRE2_ERROR_PARTIAL
         Adding  more  characters may result in a complete match that uses one
         or more characters from the end of this subject.

       When a partial match is returned, the first two elements in the ovector
       point to the portion of the subject that was matched, but the values in
       the rest of the ovector are undefined. The appearance of \K in the pat-
       tern has no effect for a partial match. Consider this pattern:

         /abc\K123/

       If it is matched against "456abc123xyz" the result is a complete match,
       and the ovector defines the matched string as "123", because \K  resets
       the  "start  of  match" point. However, if a partial match is requested
       and the subject string is "456abc12", a partial match is found for  the
       string  "abc12",  because  all these characters are needed for a subse-
       quent re-match with additional characters.

       If there is more than one partial match, the first one that  was  found
       provides the data that is returned. Consider this pattern:

         /123\w+X|dogY/

       If  this is matched against the subject string "abc123dog", both alter-
       natives fail to match, but the end of the  subject  is  reached  during
       matching,  so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3
       and 9, identifying "123dog" as the first partial match. (In this  exam-
       ple,  there are two partial matches, because "dog" on its own partially
       matches the second alternative.)

   How a partial match is processed by pcre2_match()

       What happens when a partial match is identified depends on which of the
       two partial matching options is set.

       If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned  as  soon
       as  a partial match is found, without continuing to search for possible
       complete matches. This option is "hard" because it prefers  an  earlier
       partial match over a later complete match. For this reason, the assump-
       tion  is  made  that  the end of the supplied subject string is not the
       true end of the available data, which is why \z, \Z, \b, \B, and $  al-
       ways give a partial match.

       If  PCRE2_PARTIAL_SOFT  is  set,  the  partial match is remembered, but
       matching continues as normal, and other alternatives in the pattern are
       tried. If no complete match can be found,  PCRE2_ERROR_PARTIAL  is  re-
       turned instead of PCRE2_ERROR_NOMATCH. This option is "soft" because it
       prefers a complete match over a partial match. All the various matching
       items  in a pattern behave as if the subject string is potentially com-
       plete; \z, \Z, and $ match at the end of the subject,  as  normal,  and
       for \b and \B the end of the subject is treated as a non-alphanumeric.

       The  difference  between the two partial matching options can be illus-
       trated by a pattern such as:

         /dog(sbody)?/

       This matches either "dog" or "dogsbody", greedily (that is, it  prefers
       the  longer  string  if  possible). If it is matched against the string
       "dog" with PCRE2_PARTIAL_SOFT, it yields a complete  match  for  "dog".
       However,  if  PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
       TIAL. On the other hand, if the pattern is made ungreedy the result  is
       different:

         /dog(sbody)??/

       In  this  case  the  result  is always a complete match because that is
       found first, and matching never  continues  after  finding  a  complete
       match. It might be easier to follow this explanation by thinking of the
       two patterns like this:

         /dog(sbody)?/    is the same as  /dogsbody|dog/
         /dog(sbody)??/   is the same as  /dog|dogsbody/

       The  second pattern will never match "dogsbody", because it will always
       find the shorter match first.

   Example of partial matching using pcre2test

       The pcre2test data modifiers partial_hard (or ph) and partial_soft  (or
       ps)  set  PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT, respectively, when
       calling pcre2_match(). Here is a run of pcre2test using a pattern  that
       matches the whole subject in the form of a date:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25dec3\=ph
         Partial match: 25dec3
         data> 3ju\=ph
         Partial match: 3ju
         data> 3juj\=ph
         No match

       This  example  gives  the  same  results for both hard and soft partial
       matching options. Here is an example where there is a difference:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25jun04\=ps
          0: 25jun04
          1: jun
         data> 25jun04\=ph
         Partial match: 25jun04

       With  PCRE2_PARTIAL_SOFT,  the  subject  is  matched  completely.   For
       PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete,
       so there is only a partial match.


MULTI-SEGMENT MATCHING WITH pcre2_match()

       PCRE  was  not originally designed with multi-segment matching in mind.
       However, over time, features (including  partial  matching)  that  make
       multi-segment matching possible have been added. A very long string can
       be  searched  segment  by  segment by calling pcre2_match() repeatedly,
       with the aim of achieving the same results that would happen if the en-
       tire string was available for searching all  the  time.  Normally,  the
       strings  that  are  being  sought are much shorter than each individual
       segment, and are in the middle of very long strings, so the pattern  is
       normally not anchored.

       Special  logic  must  be implemented to handle a matched substring that
       spans a segment boundary. PCRE2_PARTIAL_HARD should be used, because it
       returns a partial match at the end of a segment whenever there  is  the
       possibility  of  changing  the  match  by  adding  more characters. The
       PCRE2_NOTBOL option should also be set for all but the first segment.

       When a partial match occurs, the next segment must be added to the cur-
       rent subject and the match re-run, using the  startoffset  argument  of
       pcre2_match()  to  begin  at the point where the partial match started.
       For example:

           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
         data> ...the date is 23ja\=ph
         Partial match: 23ja
         data> ...the date is 23jan19 and on that day...\=offset=15
          0: 23jan19
          1: jan

       Note the use of the offset modifier to start the new  match  where  the
       partial match was found. In this example, the next segment was added to
       the  one  in  which  the  partial  match  was  found.  This is the most
       straightforward approach, typically using a memory buffer that is twice
       the size of each segment. After a partial match, the first half of  the
       buffer  is  discarded,  the  second  half  is moved to the start of the
       buffer, and a new segment is added before repeating the match as in the
       example above. After a no match, the entire buffer can be discarded.
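
       The double-size buffer scheme can be sketched in C as follows. The
       window type and the window_slide function are hypothetical names, and
       the sketch makes one assumption beyond the text: the first half is
       dropped only when the next segment would not otherwise fit. The
       function returns the adjusted offset of the partial match, which is
       the value to pass as startoffset when the match is re-run.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sliding window over the input stream. */
typedef struct {
    char  *data;      /* storage with room for 2 * seg_size bytes */
    size_t seg_size;  /* size of one segment */
    size_t len;       /* number of bytes currently held */
} window;

/* After a partial match that starts at offset `partial_start`, make room
   for the next segment and append it. Returns the new offset of the
   partial match start, or (size_t)-1 if the arguments do not fit the
   scheme (segment too large, or the partial match begins in the half
   that would have to be discarded). On failure the window is unchanged. */
size_t window_slide(window *w, size_t partial_start,
                    const char *next, size_t next_len)
{
    size_t capacity = 2 * w->seg_size;
    size_t drop = 0;

    if (next_len > w->seg_size || partial_start > w->len)
        return (size_t)-1;

    if (w->len + next_len > capacity) {
        /* Discard the first half and move the rest to the front. */
        drop = w->seg_size;
        if (partial_start < drop)
            return (size_t)-1;
        memmove(w->data, w->data + drop, w->len - drop);
        w->len -= drop;
    }

    memcpy(w->data + w->len, next, next_len);
    w->len += next_len;
    return partial_start - drop;
}
```

       This layout relies on the sought strings being much shorter than a
       segment, as noted above, so that a partial match normally begins in
       the second half of the buffer.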

       If there are memory constraints, you may want to discard text that pre-
       cedes a partial match before adding the  next  segment.  Unfortunately,
       this  is  not  at  present straightforward. In cases such as the above,
       where the pattern does not contain any lookbehinds, it is sufficient to
       retain only the partially matched substring. However,  if  the  pattern
       contains  a  lookbehind assertion, characters that precede the start of
       the partial match may have been inspected during the matching  process.
       When  pcre2test displays a partial match, it indicates these characters
       with '<' if the allusedtext modifier is set:

           re> "(?<=123)abc"
         data> xx123ab\=ph,allusedtext
         Partial match: 123ab
                        <<<

       However, the allusedtext modifier is not available  for  JIT  matching,
       because  JIT  matching  does  not  record the first (or last) consulted
       characters.  For this reason, this information is not available via the
       API. It is therefore not possible in general to obtain the exact number
       of characters that must be retained in order to get the right match re-
       sult. If you cannot retain the  entire  segment,  you  must  find  some
       heuristic way of choosing.

       If  you know the approximate length of the matching substrings, you can
       use that to decide how much text to retain. The only lookbehind  infor-
       mation  that  is  currently  available via the API is the length of the
       longest individual lookbehind in a pattern, but this can be  misleading
       if  there  are  nested  lookbehinds.  The  value  returned  by  calling
       pcre2_pattern_info() with the PCRE2_INFO_MAXLOOKBEHIND  option  is  the
       maximum number of characters (not code units) that any individual look-
       behind   moves   back   when   it  is  processed.  A  pattern  such  as
       "(?<=(?<!b)a)" has a maximum lookbehind value of one, but inspects  two
       characters before its starting point.

       In  a  non-UTF or a 32-bit case, moving back is just a subtraction, but
       in UTF-8 or UTF-16 you have  to  count  characters  while  moving  back
       through the code units.
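
       For UTF-8, the counting can be done as in this sketch. The function
       name utf8_back is hypothetical; it assumes valid UTF-8 and a starting
       offset that lies on a character boundary. In UTF-8 every continuation
       byte has the bit pattern 10xxxxxx, so moving back one character means
       skipping such bytes until a lead byte is reached.

```c
#include <stddef.h>
#include <stdint.h>

/* Move back `nchars` characters from code-unit offset `pos` in a UTF-8
   buffer and return the resulting offset. Stops at offset 0 if fewer
   characters are available. */
size_t utf8_back(const uint8_t *buf, size_t pos, uint32_t nchars)
{
    while (nchars > 0 && pos > 0) {
        pos--;                                   /* last byte of a character */
        while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
            pos--;                               /* skip continuation bytes */
        nchars--;
    }
    return pos;
}
```

       In the non-UTF and 32-bit cases the same step reduces to subtracting
       nchars from pos, clamped at zero.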


PARTIAL MATCHING USING pcre2_dfa_match()

       The DFA function moves along the subject string character by character,
       without  backtracking,  searching  for  all possible matches simultane-
       ously. If the end of the subject is reached before the end of the  pat-
       tern, there is the possibility of a partial match.

       When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
       there  have  been  no complete matches. Otherwise, the complete matches
       are returned.  If PCRE2_PARTIAL_HARD is  set,  a  partial  match  takes
       precedence  over  any  complete matches. The portion of the string that
       was matched when the longest partial match was  found  is  set  as  the
       first matching string.

       Because  the DFA function always searches for all possible matches, and
       there is no difference between greedy and ungreedy repetition, its  be-
       haviour  is  different  from  that of pcre2_match(). Consider the string
       "dog" matched against this ungreedy pattern:

         /dog(sbody)??/

       Whereas the standard function stops as soon as it  finds  the  complete
       match  for  "dog",  the  DFA  function also finds the partial match for
       "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.


MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()

       When a partial match has been found using the DFA matching function, it
       is possible to continue the match by providing additional subject  data
       and  calling  the function again with the same compiled regular expres-
       sion, this time setting the PCRE2_DFA_RESTART option. You must pass the
       same working space as before, because this is where details of the pre-
       vious partial match are stored. You can set the  PCRE2_PARTIAL_SOFT  or
       PCRE2_PARTIAL_HARD  options  with PCRE2_DFA_RESTART to continue partial
       matching over multiple segments. Here is an example using pcre2test:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\=dfa,ps
         Partial match: 23ja
         data> n05\=dfa,dfa_restart
          0: n05

       The first call has "23ja" as the subject, and requests  partial  match-
       ing;  the  second  call  has  "n05"  as  the  subject for the continued
       (restarted) match.  Notice that when the match is  complete,  only  the
       last  part  is  shown;  PCRE2 does not retain the previously partially-
       matched string. It is up to the calling program to do that if it  needs
       to.  This  means  that, for an unanchored pattern, if a continued match
       fails, it is not possible to try again at a  new  starting  point.  All
       this facility is capable of doing is continuing with the previous match
       attempt. For example, consider this pattern:

         1234|3789

       If  the  first  part of the subject is "ABC123", a partial match of the
       first alternative is found at offset 3. There is no partial  match  for
       the second alternative, because such a match does not start at the same
       point  in  the  subject  string. Attempting to continue with the string
       "7890" does not yield a match  because  only  those  alternatives  that
       match  at one point in the subject are remembered. Depending on the ap-
       plication, this may or may not be what you want.

       If you do want to allow for starting again at the next  character,  one
       way  of  doing it is to retain some or all of the segment and try a new
       complete match, as described for pcre2_match() above. Another possibil-
       ity is to work with two buffers. If a partial match at offset n in  the
       first  buffer  is followed by "no match" when PCRE2_DFA_RESTART is used
       on the second buffer, you can then try a new match starting  at  offset
       n+1 in the first buffer.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 04 September 2019
       Copyright (c) 1997-2019 University of Cambridge.


PCRE2 10.34                    04 September 2019               PCRE2PARTIAL(3)
------------------------------------------------------------------------------



PCRE2PATTERN(3)            Library Functions Manual            PCRE2PATTERN(3)


NAME
       PCRE2 - Perl-compatible regular expressions (revised API)


PCRE2 REGULAR EXPRESSION DETAILS

       The  syntax and semantics of the regular expressions that are supported
       by PCRE2 are described in detail below. There is a quick-reference syn-
       tax summary in the pcre2syntax page. PCRE2 tries to match  Perl  syntax
       and  semantics as closely as it can.  PCRE2 also supports some alterna-
       tive regular expression syntax (which does not conflict with  the  Perl
       syntax) in order to provide some compatibility with regular expressions
       in Python, .NET, and Oniguruma.

       Perl's  regular expressions are described in its own documentation, and
       regular expressions in general are covered in a number of  books,  some
       of which have copious examples. Jeffrey Friedl's "Mastering Regular Ex-
       pressions",  published by O'Reilly, covers regular expressions in great
       detail. This description of PCRE2's regular expressions is intended  as
       reference material.

       This  document  discusses the regular expression patterns that are sup-
       ported by PCRE2 when its  main  matching  function,  pcre2_match(),  is
       used.    PCRE2    also    has   an   alternative   matching   function,
       pcre2_dfa_match(), which matches using a different  algorithm  that  is
       not  Perl-compatible.  Some  of  the  features  discussed below are not
       available when DFA matching is used. The advantages  and  disadvantages
       of  the  alternative function, and how it differs from the normal func-
       tion, are discussed in the pcre2matching page.


SPECIAL START-OF-PATTERN ITEMS

       A number of options that can be passed to pcre2_compile() can  also  be
       set by special items at the start of a pattern. These are not Perl-com-
       patible,  but  are provided to make these options accessible to pattern
       writers who are not able to change the program that processes the  pat-
       tern.  Any  number  of these items may appear, but they must all be to-
       gether right at the start of the pattern string, and the  letters  must
       be in upper case.

   UTF support

       In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
       as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
       can  be  specified  for the 32-bit library, in which case it constrains
       the character values to valid  Unicode  code  points.  To  process  UTF
       strings,  PCRE2  must be built to include Unicode support (which is the
       default). When using UTF strings you must  either  call  the  compiling
       function  with  one or both of the PCRE2_UTF or PCRE2_MATCH_INVALID_UTF
       options, or the pattern must start with the  special  sequence  (*UTF),
       which  is  equivalent  to  setting  the  PCRE2_UTF option. How setting a
       UTF mode affects pattern matching is mentioned in several places below.
       There is also a summary of features in the pcre2unicode page.

       Some applications that allow their users to supply patterns may wish to
       restrict  them  to  non-UTF  data  for   security   reasons.   If   the
       PCRE2_NEVER_UTF  option is passed to pcre2_compile(), (*UTF) is not al-
       lowed, and its appearance in a pattern causes an error.

   Unicode property support

       Another special sequence that may appear at the start of a  pattern  is
       (*UCP).   This  has the same effect as setting the PCRE2_UCP option: it
       causes sequences such as \d and \w to use Unicode properties to  deter-
       mine character types, instead of recognizing only characters with codes
       less than 256 via a lookup table. It also causes upper/lower casing op-
       erations  to  use  Unicode  properties  for characters with code points
       greater than 127, even when UTF is not set.  These  behaviours  can  be
       changed  within  the pattern; see the section entitled "Internal Option
       Setting" below.

       Some applications that allow their users to supply patterns may wish to
       restrict them for security reasons. If the  PCRE2_NEVER_UCP  option  is
       passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in
       a pattern causes an error.

   Locking out empty string matching

       Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same
       effect  as  passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option
       to whichever matching function is subsequently called to match the pat-
       tern. These options lock out the matching of empty strings, either  en-
       tirely, or only at the start of the subject.

   Disabling auto-possessification

       If  a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as
       setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from  making
       quantifiers  possessive  when  what  follows  cannot match the repeated
       item. For example, by default a+b is treated as a++b. For more details,
       see the pcre2api documentation.

   Disabling start-up optimizations

       If a pattern starts with (*NO_START_OPT), it has  the  same  effect  as
       setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti-
       mizations  for  quickly  reaching "no match" results. For more details,
       see the pcre2api documentation.

   Disabling automatic anchoring

       If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the  same  effect
       as  setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza-
       tions that apply to patterns whose top-level branches all start with .*
6571*22dc650dSSadaf Ebrahimi       (match any number of arbitrary characters). For more details,  see  the
6572*22dc650dSSadaf Ebrahimi       pcre2api documentation.
6573*22dc650dSSadaf Ebrahimi
6574*22dc650dSSadaf Ebrahimi   Disabling JIT compilation
6575*22dc650dSSadaf Ebrahimi
6576*22dc650dSSadaf Ebrahimi       If  a  pattern  that starts with (*NO_JIT) is successfully compiled, an
6577*22dc650dSSadaf Ebrahimi       attempt by the application to apply the  JIT  optimization  by  calling
6578*22dc650dSSadaf Ebrahimi       pcre2_jit_compile() is ignored.
6579*22dc650dSSadaf Ebrahimi
6580*22dc650dSSadaf Ebrahimi   Setting match resource limits
6581*22dc650dSSadaf Ebrahimi
6582*22dc650dSSadaf Ebrahimi       The pcre2_match() function contains a counter that is incremented every
6583*22dc650dSSadaf Ebrahimi       time it goes round its main loop. The caller of pcre2_match() can set a
6584*22dc650dSSadaf Ebrahimi       limit  on  this counter, which therefore limits the amount of computing
6585*22dc650dSSadaf Ebrahimi       resource used for a match. The maximum depth of nested backtracking can
6586*22dc650dSSadaf Ebrahimi       also be limited; this indirectly restricts the amount  of  heap  memory
6587*22dc650dSSadaf Ebrahimi       that  is  used,  but there is also an explicit memory limit that can be
6588*22dc650dSSadaf Ebrahimi       set.
6589*22dc650dSSadaf Ebrahimi
6590*22dc650dSSadaf Ebrahimi       These facilities are provided to catch runaway matches  that  are  pro-
6591*22dc650dSSadaf Ebrahimi       voked  by patterns with huge matching trees. A common example is a pat-
6592*22dc650dSSadaf Ebrahimi       tern with nested unlimited repeats applied to a long string  that  does
6593*22dc650dSSadaf Ebrahimi       not  match. When one of these limits is reached, pcre2_match() gives an
6594*22dc650dSSadaf Ebrahimi       error return. The limits can also be set by items at the start  of  the
6595*22dc650dSSadaf Ebrahimi       pattern of the form
6596*22dc650dSSadaf Ebrahimi
6597*22dc650dSSadaf Ebrahimi         (*LIMIT_HEAP=d)
6598*22dc650dSSadaf Ebrahimi         (*LIMIT_MATCH=d)
6599*22dc650dSSadaf Ebrahimi         (*LIMIT_DEPTH=d)
6600*22dc650dSSadaf Ebrahimi
6601*22dc650dSSadaf Ebrahimi       where d is any number of decimal digits. However, the value of the set-
6602*22dc650dSSadaf Ebrahimi       ting  must  be  less than the value set (or defaulted) by the caller of
6603*22dc650dSSadaf Ebrahimi       pcre2_match() for it to have any effect. In other  words,  the  pattern
6604*22dc650dSSadaf Ebrahimi       writer  can lower the limits set by the programmer, but not raise them.
6605*22dc650dSSadaf Ebrahimi       If there is more than one setting of one of  these  limits,  the  lower
6606*22dc650dSSadaf Ebrahimi       value  is used. The heap limit is specified in kibibytes (units of 1024
6607*22dc650dSSadaf Ebrahimi       bytes).
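
       As an illustration only, the lowering rule can be sketched in C. The
       function names below are hypothetical and are not part of the PCRE2
       API; the sketch assumes that the values of the (*LIMIT_...=d) items
       have already been extracted from the pattern.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE2 API: returns the limit that is
   in force given the caller's value and the (*LIMIT_...=d) values found at
   the start of a pattern. A pattern value takes effect only when it is
   lower than the caller's value, and the lowest of several settings wins. */
uint32_t effective_limit(uint32_t caller_limit,
                         const uint32_t *pattern_values, size_t count)
{
    uint32_t limit = caller_limit;
    for (size_t i = 0; i < count; i++) {
        if (pattern_values[i] < limit)
            limit = pattern_values[i];
    }
    return limit;
}

/* The heap limit is expressed in kibibytes; convert it to bytes. */
uint64_t heap_limit_bytes(uint32_t heap_limit_kib)
{
    return (uint64_t)heap_limit_kib * 1024u;
}
```

       With a caller limit of 100000 and pattern settings of 5000, 200000,
       and 3000, effective_limit() yields 3000; the setting of 200000 has
       no effect because it would raise the limit.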
6608*22dc650dSSadaf Ebrahimi
6609*22dc650dSSadaf Ebrahimi       Prior to release 10.30, LIMIT_DEPTH was  called  LIMIT_RECURSION.  This
6610*22dc650dSSadaf Ebrahimi       name is still recognized for backwards compatibility.
6611*22dc650dSSadaf Ebrahimi
6612*22dc650dSSadaf Ebrahimi       The heap limit applies only when the pcre2_match() or pcre2_dfa_match()
6613*22dc650dSSadaf Ebrahimi       interpreters are used for matching. It does not apply to JIT. The match
6614*22dc650dSSadaf Ebrahimi       limit  is used (but in a different way) when JIT is being used, or when
6615*22dc650dSSadaf Ebrahimi       pcre2_dfa_match() is called, to limit computing resource usage by those
6616*22dc650dSSadaf Ebrahimi       matching functions. The depth limit is ignored by JIT but  is  relevant
6617*22dc650dSSadaf Ebrahimi       for  DFA  matching, which uses function recursion for recursions within
6618*22dc650dSSadaf Ebrahimi       the pattern and for lookaround assertions and atomic  groups.  In  this
6619*22dc650dSSadaf Ebrahimi       case, the depth limit controls the depth of such recursion.
6620*22dc650dSSadaf Ebrahimi
6621*22dc650dSSadaf Ebrahimi   Newline conventions
6622*22dc650dSSadaf Ebrahimi
6623*22dc650dSSadaf Ebrahimi       PCRE2  supports six different conventions for indicating line breaks in
6624*22dc650dSSadaf Ebrahimi       strings: a single CR (carriage return) character, a  single  LF  (line-
6625*22dc650dSSadaf Ebrahimi       feed) character, the two-character sequence CRLF, any of the three pre-
6626*22dc650dSSadaf Ebrahimi       ceding,  any  Unicode  newline  sequence,  or the NUL character (binary
6627*22dc650dSSadaf Ebrahimi       zero). The pcre2api page has further  discussion  about  newlines,  and
6628*22dc650dSSadaf Ebrahimi       shows how to set the newline convention when calling pcre2_compile().
6629*22dc650dSSadaf Ebrahimi
6630*22dc650dSSadaf Ebrahimi       It  is also possible to specify a newline convention by starting a pat-
6631*22dc650dSSadaf Ebrahimi       tern string with one of the following sequences:
6632*22dc650dSSadaf Ebrahimi
6633*22dc650dSSadaf Ebrahimi         (*CR)        carriage return
6634*22dc650dSSadaf Ebrahimi         (*LF)        linefeed
6635*22dc650dSSadaf Ebrahimi         (*CRLF)      carriage return, followed by linefeed
6636*22dc650dSSadaf Ebrahimi         (*ANYCRLF)   any of the three above
6637*22dc650dSSadaf Ebrahimi         (*ANY)       all Unicode newline sequences
6638*22dc650dSSadaf Ebrahimi         (*NUL)       the NUL character (binary zero)
6639*22dc650dSSadaf Ebrahimi
6640*22dc650dSSadaf Ebrahimi       These override the default and the options given to the compiling func-
6641*22dc650dSSadaf Ebrahimi       tion. For example, on a Unix system where LF is the default newline se-
6642*22dc650dSSadaf Ebrahimi       quence, the pattern
6643*22dc650dSSadaf Ebrahimi
6644*22dc650dSSadaf Ebrahimi         (*CR)a.b
6645*22dc650dSSadaf Ebrahimi
6646*22dc650dSSadaf Ebrahimi       changes the convention to CR. That pattern matches "a\nb" because LF is
6647*22dc650dSSadaf Ebrahimi       no longer a newline. If more than one of these settings is present, the
6648*22dc650dSSadaf Ebrahimi       last one is used.
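
       The "last one is used" rule can be sketched as a scan over the items
       at the start of a pattern. This is a simplified illustration, not the
       PCRE2 implementation: leading_newline_setting() is a hypothetical
       name, and the sketch skips every leading "(*...)" item without
       checking whether PCRE2 would accept it in that position.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the PCRE2 API. Scans the "(*NAME)" items
   at the very start of a pattern and returns the name of the last newline
   setting found, or NULL if there is none. Simplification: every leading
   "(*...)" item is skipped, whether or not PCRE2 would accept it there. */
const char *leading_newline_setting(const char *pattern)
{
    static const char *const names[] = {
        "CR", "LF", "CRLF", "ANYCRLF", "ANY", "NUL"
    };
    const char *found = NULL;

    while (strncmp(pattern, "(*", 2) == 0) {
        const char *name = pattern + 2;
        const char *close = strchr(name, ')');
        if (close == NULL)
            break;                      /* unterminated item: stop scanning */
        size_t len = (size_t)(close - name);
        for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
            if (strlen(names[i]) == len && strncmp(name, names[i], len) == 0)
                found = names[i];       /* a later setting replaces an earlier one */
        }
        pattern = close + 1;
    }
    return found;
}
```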
6649*22dc650dSSadaf Ebrahimi
6650*22dc650dSSadaf Ebrahimi       The newline convention affects where the circumflex and  dollar  asser-
6651*22dc650dSSadaf Ebrahimi       tions are true. It also affects the interpretation of the dot metachar-
6652*22dc650dSSadaf Ebrahimi       acter  when  PCRE2_DOTALL  is not set, and the behaviour of \N when not
6653*22dc650dSSadaf Ebrahimi       followed by an opening brace. However, it does not affect what  the  \R
6654*22dc650dSSadaf Ebrahimi       escape  sequence  matches.  By default, this is any Unicode newline se-
6655*22dc650dSSadaf Ebrahimi       quence, for Perl compatibility. However, this can be changed;  see  the
6656*22dc650dSSadaf Ebrahimi       next section and the description of \R in the section entitled "Newline
6657*22dc650dSSadaf Ebrahimi       sequences"  below. A change of \R setting can be combined with a change
6658*22dc650dSSadaf Ebrahimi       of newline convention.
6659*22dc650dSSadaf Ebrahimi
6660*22dc650dSSadaf Ebrahimi   Specifying what \R matches
6661*22dc650dSSadaf Ebrahimi
6662*22dc650dSSadaf Ebrahimi       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
6663*22dc650dSSadaf Ebrahimi       the complete set  of  Unicode  line  endings)  by  setting  the  option
6664*22dc650dSSadaf Ebrahimi       PCRE2_BSR_ANYCRLF  at compile time. This effect can also be achieved by
6665*22dc650dSSadaf Ebrahimi       starting a pattern with (*BSR_ANYCRLF).  For  completeness,  (*BSR_UNI-
6666*22dc650dSSadaf Ebrahimi       CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.
6667*22dc650dSSadaf Ebrahimi
6668*22dc650dSSadaf Ebrahimi
6669*22dc650dSSadaf EbrahimiEBCDIC CHARACTER CODES
6670*22dc650dSSadaf Ebrahimi
6671*22dc650dSSadaf Ebrahimi       PCRE2  can be compiled to run in an environment that uses EBCDIC as its
6672*22dc650dSSadaf Ebrahimi       character code instead of ASCII or Unicode (typically a mainframe  sys-
6673*22dc650dSSadaf Ebrahimi       tem).  In  the  sections below, character code values are ASCII or Uni-
6674*22dc650dSSadaf Ebrahimi       code; in an EBCDIC environment these characters may have different code
6675*22dc650dSSadaf Ebrahimi       values, and there are no code points greater than 255.
6676*22dc650dSSadaf Ebrahimi
6677*22dc650dSSadaf Ebrahimi
6678*22dc650dSSadaf EbrahimiCHARACTERS AND METACHARACTERS
6679*22dc650dSSadaf Ebrahimi
6680*22dc650dSSadaf Ebrahimi       A regular expression is a pattern that is  matched  against  a  subject
6681*22dc650dSSadaf Ebrahimi       string  from  left  to right. Most characters stand for themselves in a
6682*22dc650dSSadaf Ebrahimi       pattern, and match the corresponding characters in the  subject.  As  a
6683*22dc650dSSadaf Ebrahimi       trivial example, the pattern
6684*22dc650dSSadaf Ebrahimi
6685*22dc650dSSadaf Ebrahimi         The quick brown fox
6686*22dc650dSSadaf Ebrahimi
6687*22dc650dSSadaf Ebrahimi       matches a portion of a subject string that is identical to itself. When
6688*22dc650dSSadaf Ebrahimi       caseless  matching  is  specified  (the  PCRE2_CASELESS  option or (?i)
6689*22dc650dSSadaf Ebrahimi       within the pattern), letters are matched independently  of  case.  Note
6690*22dc650dSSadaf Ebrahimi       that  there  are  two  ASCII  characters, K and S, that, in addition to
6691*22dc650dSSadaf Ebrahimi       their lower case ASCII equivalents, are  case-equivalent  with  Unicode
6692*22dc650dSSadaf Ebrahimi       U+212A  (Kelvin  sign)  and  U+017F  (long  S) respectively when either
6693*22dc650dSSadaf Ebrahimi       PCRE2_UTF or PCRE2_UCP is set, unless the PCRE2_EXTRA_CASELESS_RESTRICT
6694*22dc650dSSadaf Ebrahimi       option is in force (either passed to pcre2_compile()  or  set  by  (?r)
6695*22dc650dSSadaf Ebrahimi       within the pattern).
6696*22dc650dSSadaf Ebrahimi
6697*22dc650dSSadaf Ebrahimi       The power of regular expressions comes from the ability to include wild
6698*22dc650dSSadaf Ebrahimi       cards, character classes, alternatives, and repetitions in the pattern.
6699*22dc650dSSadaf Ebrahimi       These are encoded in the pattern by the use of metacharacters, which do
6700*22dc650dSSadaf Ebrahimi       not  stand  for  themselves but instead are interpreted in some special
6701*22dc650dSSadaf Ebrahimi       way.
6702*22dc650dSSadaf Ebrahimi
6703*22dc650dSSadaf Ebrahimi       There are two different sets of metacharacters: those that  are  recog-
6704*22dc650dSSadaf Ebrahimi       nized  anywhere in the pattern except within square brackets, and those
6705*22dc650dSSadaf Ebrahimi       that are recognized within square brackets.  Outside  square  brackets,
6706*22dc650dSSadaf Ebrahimi       the metacharacters are as follows:
6707*22dc650dSSadaf Ebrahimi
6708*22dc650dSSadaf Ebrahimi         \      general escape character with several uses
6709*22dc650dSSadaf Ebrahimi         ^      assert start of string (or line, in multiline mode)
6710*22dc650dSSadaf Ebrahimi         $      assert end of string (or line, in multiline mode)
6711*22dc650dSSadaf Ebrahimi         .      match any character except newline (by default)
6712*22dc650dSSadaf Ebrahimi         [      start character class definition
6713*22dc650dSSadaf Ebrahimi         |      start of alternative branch
6714*22dc650dSSadaf Ebrahimi         (      start group or control verb
6715*22dc650dSSadaf Ebrahimi         )      end group or control verb
6716*22dc650dSSadaf Ebrahimi         *      0 or more quantifier
6717*22dc650dSSadaf Ebrahimi         +      1 or more quantifier; also "possessive quantifier"
6718*22dc650dSSadaf Ebrahimi         ?      0 or 1 quantifier; also quantifier minimizer
6719*22dc650dSSadaf Ebrahimi         {      potential start of min/max quantifier
6720*22dc650dSSadaf Ebrahimi
6721*22dc650dSSadaf Ebrahimi       Brace  characters  {  and } are also used to enclose data for construc-
6722*22dc650dSSadaf Ebrahimi       tions such as \g{2} or \k{name}. In almost all uses  of  braces,  space
6723*22dc650dSSadaf Ebrahimi       and/or horizontal tab characters that follow { or precede } are allowed
6724*22dc650dSSadaf Ebrahimi       and  are  ignored. In the case of quantifiers, they may also appear be-
6725*22dc650dSSadaf Ebrahimi       fore or after the comma. The exception to this is \u{...} which  is  an
6726*22dc650dSSadaf Ebrahimi       ECMAScript  compatibility  feature  that  is  recognized  only when the
6727*22dc650dSSadaf Ebrahimi       PCRE2_EXTRA_ALT_BSUX option is set. ECMAScript  does  not  ignore  such
6728*22dc650dSSadaf Ebrahimi       white space; it causes the item to be interpreted as literal.
6729*22dc650dSSadaf Ebrahimi
6730*22dc650dSSadaf Ebrahimi       Part  of  a  pattern  that is in square brackets is called a "character
6731*22dc650dSSadaf Ebrahimi       class". In a character class the only metacharacters are:
6732*22dc650dSSadaf Ebrahimi
6733*22dc650dSSadaf Ebrahimi         \      general escape character
6734*22dc650dSSadaf Ebrahimi         ^      negate the class, but only if the first character
6735*22dc650dSSadaf Ebrahimi         -      indicates character range
6736*22dc650dSSadaf Ebrahimi         [      POSIX character class (if followed by POSIX syntax)
6737*22dc650dSSadaf Ebrahimi         ]      terminates the character class
6738*22dc650dSSadaf Ebrahimi
6739*22dc650dSSadaf Ebrahimi       If a pattern is compiled with the PCRE2_EXTENDED option, most white
6740*22dc650dSSadaf Ebrahimi       space in the pattern is ignored, except in a character class or within
6741*22dc650dSSadaf Ebrahimi       a \Q...\E sequence. Text from a # outside a character class up to the
6742*22dc650dSSadaf Ebrahimi       next newline, inclusive, is also ignored. An escaping backslash can be
6743*22dc650dSSadaf Ebrahimi       used to include a white space or a # character in the pattern. If the
6744*22dc650dSSadaf Ebrahimi       PCRE2_EXTENDED_MORE option is set, the same applies,  but  in  addition
6745*22dc650dSSadaf Ebrahimi       unescaped  space  and  horizontal  tab  characters are ignored inside a
6746*22dc650dSSadaf Ebrahimi       character class. Note: only these two characters are ignored,  not  the
6747*22dc650dSSadaf Ebrahimi       full  set  of pattern white space characters that are ignored outside a
6748*22dc650dSSadaf Ebrahimi       character class. Option settings can be changed within a  pattern;  see
6749*22dc650dSSadaf Ebrahimi       the section entitled "Internal Option Setting" below.
6750*22dc650dSSadaf Ebrahimi
6751*22dc650dSSadaf Ebrahimi       The following sections describe the use of each of the metacharacters.
6752*22dc650dSSadaf Ebrahimi
6753*22dc650dSSadaf Ebrahimi
6754*22dc650dSSadaf EbrahimiBACKSLASH
6755*22dc650dSSadaf Ebrahimi
6756*22dc650dSSadaf Ebrahimi       The backslash character has several uses. Firstly, if it is followed by
6757*22dc650dSSadaf Ebrahimi       a  character that is not a digit or a letter, it takes away any special
6758*22dc650dSSadaf Ebrahimi       meaning that character may have. This use of  backslash  as  an  escape
6759*22dc650dSSadaf Ebrahimi       character applies both inside and outside character classes.
6760*22dc650dSSadaf Ebrahimi
6761*22dc650dSSadaf Ebrahimi       For  example,  if you want to match a * character, you must write \* in
6762*22dc650dSSadaf Ebrahimi       the pattern. This escaping action applies whether or not the  following
6763*22dc650dSSadaf Ebrahimi       character  would  otherwise be interpreted as a metacharacter, so it is
6764*22dc650dSSadaf Ebrahimi       always safe to precede a non-alphanumeric  with  backslash  to  specify
6765*22dc650dSSadaf Ebrahimi       that it stands for itself.  In particular, if you want to match a back-
6766*22dc650dSSadaf Ebrahimi       slash, you write \\.
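
       Because escaping a non-alphanumeric character is always safe, an
       application can turn arbitrary text into a literal pattern by
       escaping every such character. The following C sketch shows one way
       to do this; escape_literal() is a hypothetical name, and the choice
       to leave bytes above 127 untouched is an assumption made here so
       that multi-byte UTF-8 sequences are not split.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: writes 'text' to 'out' with a backslash before every
   ASCII character that is not a letter or digit. Bytes above 127 are copied
   unchanged. 'out' must have room for 2 * strlen(text) + 1 bytes.
   Returns the length of the escaped string. */
size_t escape_literal(const char *text, char *out)
{
    size_t n = 0;
    for (; *text != '\0'; text++) {
        unsigned char c = (unsigned char)*text;
        if (c < 128 && !isalnum(c))
            out[n++] = '\\';
        out[n++] = (char)c;
    }
    out[n] = '\0';
    return n;
}
```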
6767*22dc650dSSadaf Ebrahimi
6768*22dc650dSSadaf Ebrahimi       Only  ASCII  digits  and letters have any special meaning after a back-
6769*22dc650dSSadaf Ebrahimi       slash. All other characters (in particular, those whose code points are
6770*22dc650dSSadaf Ebrahimi       greater than 127) are treated as literals.
6771*22dc650dSSadaf Ebrahimi
6772*22dc650dSSadaf Ebrahimi       If you want to treat all characters in a sequence as literals, you  can
6773*22dc650dSSadaf Ebrahimi       do  so by putting them between \Q and \E. Note that this includes white
6774*22dc650dSSadaf Ebrahimi       space even when the PCRE2_EXTENDED option is set  so  that  most  other
6775*22dc650dSSadaf Ebrahimi       white  space is ignored. The behaviour is different from Perl in that $
6776*22dc650dSSadaf Ebrahimi       and @ are handled as literals in \Q...\E sequences in PCRE2, whereas in
6777*22dc650dSSadaf Ebrahimi       Perl, $ and @ cause variable interpolation. Also,  Perl  does  "double-
6778*22dc650dSSadaf Ebrahimi       quotish  backslash  interpolation" on any backslashes between \Q and \E
6779*22dc650dSSadaf Ebrahimi       which, its documentation says, "may lead to confusing  results".  PCRE2
6780*22dc650dSSadaf Ebrahimi       treats  a  backslash  between  \Q and \E just like any other character.
6781*22dc650dSSadaf Ebrahimi       Note the following examples:
6782*22dc650dSSadaf Ebrahimi
6783*22dc650dSSadaf Ebrahimi         Pattern            PCRE2 matches  Perl matches
6784*22dc650dSSadaf Ebrahimi
6785*22dc650dSSadaf Ebrahimi         \Qabc$xyz\E        abc$xyz        abc followed by the
6786*22dc650dSSadaf Ebrahimi                                             contents of $xyz
6787*22dc650dSSadaf Ebrahimi         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
6788*22dc650dSSadaf Ebrahimi         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
6789*22dc650dSSadaf Ebrahimi         \QA\B\E            A\B            A\B
6790*22dc650dSSadaf Ebrahimi         \Q\\E              \              \\E
6791*22dc650dSSadaf Ebrahimi
6792*22dc650dSSadaf Ebrahimi       The \Q...\E sequence is recognized both inside  and  outside  character
6793*22dc650dSSadaf Ebrahimi       classes.   An  isolated \E that is not preceded by \Q is ignored. If \Q
6794*22dc650dSSadaf Ebrahimi       is not followed by \E later in the pattern, the literal  interpretation
6795*22dc650dSSadaf Ebrahimi       continues  to  the  end  of  the pattern (that is, \E is assumed at the
6796*22dc650dSSadaf Ebrahimi       end). If the isolated \Q is inside a character class,  this  causes  an
6797*22dc650dSSadaf Ebrahimi       error,  because the character class is then not terminated by a closing
6798*22dc650dSSadaf Ebrahimi       square bracket.
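
       The extent of a literal run can be sketched in C as follows. The
       function name quoted_length() is hypothetical; the sketch assumes the
       pattern is a zero-terminated string and that the caller has already
       found a \Q.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: 'after_q' points just past a \Q in a pattern.
   Returns the number of characters that are taken literally: everything up
   to the next \E, or up to the end of the pattern if there is no \E. */
size_t quoted_length(const char *after_q)
{
    const char *end = strstr(after_q, "\\E");
    return end != NULL ? (size_t)(end - after_q) : strlen(after_q);
}
```

       For the pattern \Q\\E the function is called on the text \\E and
       returns 1, because the first \E begins at the second backslash. This
       agrees with the PCRE2 column of the table above.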
6799*22dc650dSSadaf Ebrahimi
6800*22dc650dSSadaf Ebrahimi   Non-printing characters
6801*22dc650dSSadaf Ebrahimi
6802*22dc650dSSadaf Ebrahimi       A second use of backslash provides a way of encoding non-printing char-
6803*22dc650dSSadaf Ebrahimi       acters in patterns in a visible manner. There is no restriction on  the
6804*22dc650dSSadaf Ebrahimi       appearance  of non-printing characters in a pattern, but when a pattern
6805*22dc650dSSadaf Ebrahimi       is being prepared by text editing, it is often easier to use one of the
6806*22dc650dSSadaf Ebrahimi       following escape sequences instead of the binary  character  it  repre-
6807*22dc650dSSadaf Ebrahimi       sents.  In  an  ASCII or Unicode environment, these escapes are as fol-
6808*22dc650dSSadaf Ebrahimi       lows:
6809*22dc650dSSadaf Ebrahimi
6810*22dc650dSSadaf Ebrahimi         \a          alarm, that is, the BEL character (hex 07)
6811*22dc650dSSadaf Ebrahimi         \cx         "control-x", where x is a non-control ASCII character
6812*22dc650dSSadaf Ebrahimi         \e          escape (hex 1B)
6813*22dc650dSSadaf Ebrahimi         \f          form feed (hex 0C)
6814*22dc650dSSadaf Ebrahimi         \n          linefeed (hex 0A)
6815*22dc650dSSadaf Ebrahimi         \r          carriage return (hex 0D) (but see below)
6816*22dc650dSSadaf Ebrahimi         \t          tab (hex 09)
6817*22dc650dSSadaf Ebrahimi         \0dd        character with octal code 0dd
6818*22dc650dSSadaf Ebrahimi         \ddd        character with octal code ddd, or backreference
6819*22dc650dSSadaf Ebrahimi         \o{ddd..}   character with octal code ddd..
6820*22dc650dSSadaf Ebrahimi         \xhh        character with hex code hh
6821*22dc650dSSadaf Ebrahimi         \x{hhh..}   character with hex code hhh..
6822*22dc650dSSadaf Ebrahimi         \N{U+hhh..} character with Unicode hex code point hhh..
6823*22dc650dSSadaf Ebrahimi
6824*22dc650dSSadaf Ebrahimi       By default, after \x that is not followed by {, from zero to two  hexa-
6825*22dc650dSSadaf Ebrahimi       decimal  digits  are  read (letters can be in upper or lower case). Any
6826*22dc650dSSadaf Ebrahimi       number of hexadecimal digits may appear between \x{ and }. If a charac-
6827*22dc650dSSadaf Ebrahimi       ter other than a hexadecimal digit appears between \x{  and  },  or  if
6828*22dc650dSSadaf Ebrahimi       there is no terminating }, an error occurs.
6829*22dc650dSSadaf Ebrahimi
6830*22dc650dSSadaf Ebrahimi       Characters whose code points are less than 256 can be defined by either
6831*22dc650dSSadaf Ebrahimi       of the two syntaxes for \x or by an octal sequence. There is no differ-
6832*22dc650dSSadaf Ebrahimi       ence in the way they are handled. For example, \xdc is exactly the same
6833*22dc650dSSadaf Ebrahimi       as  \x{dc}  or \334.  However, using the braced versions does make such
6834*22dc650dSSadaf Ebrahimi       sequences easier to read.
6835*22dc650dSSadaf Ebrahimi
6836*22dc650dSSadaf Ebrahimi       Support is available for some ECMAScript (aka  JavaScript)  escape  se-
6837*22dc650dSSadaf Ebrahimi       quences via two compile-time options. If PCRE2_ALT_BSUX is set, the se-
6838*22dc650dSSadaf Ebrahimi       quence  \x  followed  by { is not recognized. Only if \x is followed by
6839*22dc650dSSadaf Ebrahimi       two hexadecimal digits is it recognized as a character  escape.  Other-
6840*22dc650dSSadaf Ebrahimi       wise  it  is interpreted as a literal "x" character. In this mode, sup-
6841*22dc650dSSadaf Ebrahimi       port for code points greater than 255 is provided by \u, which must  be
6842*22dc650dSSadaf Ebrahimi       followed  by  four hexadecimal digits; otherwise it is interpreted as a
6843*22dc650dSSadaf Ebrahimi       literal "u" character.
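
       A minimal C sketch of the PCRE2_ALT_BSUX rules for \x and \u follows.
       It is an illustration only: read_bsux_escape() is a hypothetical
       name, and the sketch covers just the two fixed-length forms
       described in this paragraph.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: 'p' points at the 'x' or 'u' that follows a
   backslash. If the letter is followed by the required number of
   hexadecimal digits (two for x, four for u), the character code is stored
   in *code and the number of characters consumed after the backslash is
   returned. Otherwise the letter itself is stored as a literal and 1 is
   returned. */
size_t read_bsux_escape(const char *p, unsigned *code)
{
    size_t need = (p[0] == 'u') ? 4 : 2;
    unsigned value = 0;

    for (size_t i = 1; i <= need; i++) {
        unsigned char c = (unsigned char)p[i];
        if (!isxdigit(c)) {
            *code = (unsigned char)p[0];    /* literal "x" or "u" */
            return 1;
        }
        value = value * 16 + (isdigit(c) ? (unsigned)(c - '0')
                                         : (unsigned)(tolower(c) - 'a') + 10u);
    }
    *code = value;
    return need + 1;
}
```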
6844*22dc650dSSadaf Ebrahimi
6845*22dc650dSSadaf Ebrahimi       PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in  ad-
6846*22dc650dSSadaf Ebrahimi       dition, \u{hhh..} is recognized as the character specified by hexadeci-
6847*22dc650dSSadaf Ebrahimi       mal code point.  There may be any number of hexadecimal digits, but un-
6848*22dc650dSSadaf Ebrahimi       like  other places that also use curly brackets, spaces are not allowed
6849*22dc650dSSadaf Ebrahimi       and would result in the string being interpreted  as  a  literal.  This
6850*22dc650dSSadaf Ebrahimi       syntax is from ECMAScript 6.
6851*22dc650dSSadaf Ebrahimi
6852*22dc650dSSadaf Ebrahimi       The  \N{U+hhh..} escape sequence is recognized only when PCRE2 is oper-
6853*22dc650dSSadaf Ebrahimi       ating in UTF mode. Perl also uses \N{name}  to  specify  characters  by
6854*22dc650dSSadaf Ebrahimi       Unicode  name;  PCRE2  does  not support this. Note that when \N is not
6855*22dc650dSSadaf Ebrahimi       followed by an opening brace (curly bracket) it has an entirely differ-
6856*22dc650dSSadaf Ebrahimi       ent meaning, matching any character that is not a newline.
6857*22dc650dSSadaf Ebrahimi
6858*22dc650dSSadaf Ebrahimi       There are some legacy applications where the escape sequence \r is  ex-
6859*22dc650dSSadaf Ebrahimi       pected  to  match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option
6860*22dc650dSSadaf Ebrahimi       is set, \r in a pattern is converted to \n so  that  it  matches  a  LF
6861*22dc650dSSadaf Ebrahimi       (linefeed) instead of a CR (carriage return) character.
6862*22dc650dSSadaf Ebrahimi
6863*22dc650dSSadaf Ebrahimi       An  error  occurs if \c is not followed by a character whose ASCII code
6864*22dc650dSSadaf Ebrahimi       point is in the range 32 to 126. The precise effect of \cx is  as  fol-
6865*22dc650dSSadaf Ebrahimi       lows:  if x is a lower case letter, it is converted to upper case. Then
6866*22dc650dSSadaf Ebrahimi       bit 6 of the character (hex 40) is inverted. Thus \cA to \cZ become hex
6867*22dc650dSSadaf Ebrahimi       01 to hex 1A (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B),  and
6868*22dc650dSSadaf Ebrahimi       \c;  becomes hex 7B (; is 3B). If the code unit following \c has a code
6869*22dc650dSSadaf Ebrahimi       point less than 32 or greater than 126, a compile-time error occurs.
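
       The arithmetic for \cx in an ASCII environment can be written out in
       C as below. The function name control_escape() is hypothetical, and
       returning -1 stands in for the compile-time error.

```c
/* Hypothetical helper for an ASCII environment: returns the code produced
   by \cx, or -1 where PCRE2 would report a compile-time error. */
int control_escape(int x)
{
    if (x < 32 || x > 126)
        return -1;
    if (x >= 'a' && x <= 'z')
        x = x - 'a' + 'A';      /* lower case letters are converted to upper case */
    return x ^ 0x40;            /* invert bit 6 (hex 40) */
}
```

       For example, control_escape('A') gives hex 01, control_escape('{')
       gives hex 3B, and control_escape(';') gives hex 7B.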
6870*22dc650dSSadaf Ebrahimi
6871*22dc650dSSadaf Ebrahimi       When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..}  is  not  supported.
6872*22dc650dSSadaf Ebrahimi       \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
6873*22dc650dSSadaf Ebrahimi       The \c escape is processed as specified for Perl in the perlebcdic doc-
6874*22dc650dSSadaf Ebrahimi       ument.  The  only characters that are allowed after \c are A-Z, a-z, or
6875*22dc650dSSadaf Ebrahimi       one of @, [, \, ], ^, _, or ?. Any other character provokes a  compile-
6876*22dc650dSSadaf Ebrahimi       time  error.  The  sequence  \c@ encodes character code 0; after \c the
6877*22dc650dSSadaf Ebrahimi       letters (in either case) encode characters 1-26 (hex 01 to hex 1A);  [,
6878*22dc650dSSadaf Ebrahimi       \,  ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c? be-
6879*22dc650dSSadaf Ebrahimi       comes either 255 (hex FF) or 95 (hex 5F).
6880*22dc650dSSadaf Ebrahimi
6881*22dc650dSSadaf Ebrahimi       Thus, apart from \c?, these escapes generate the  same  character  code
6882*22dc650dSSadaf Ebrahimi       values  as  they do in an ASCII environment, though the meanings of the
6883*22dc650dSSadaf Ebrahimi       values mostly differ. For example, \cG always generates code  value  7,
6884*22dc650dSSadaf Ebrahimi       which is BEL in ASCII but DEL in EBCDIC.
6885*22dc650dSSadaf Ebrahimi
6886*22dc650dSSadaf Ebrahimi       The  sequence  \c? generates DEL (127, hex 7F) in an ASCII environment,
6887*22dc650dSSadaf Ebrahimi       but because 127 is not a control character in  EBCDIC,  Perl  makes  it
6888*22dc650dSSadaf Ebrahimi       generate  the  APC character. Unfortunately, there are several variants
6889*22dc650dSSadaf Ebrahimi       of EBCDIC. In most of them the APC character has  the  value  255  (hex
6890*22dc650dSSadaf Ebrahimi       FF),  but  in  the one Perl calls POSIX-BC its value is 95 (hex 5F). If
6891*22dc650dSSadaf Ebrahimi       certain other characters have POSIX-BC values, PCRE2 makes \c? generate
6892*22dc650dSSadaf Ebrahimi       95; otherwise it generates 255.
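
       The EBCDIC rules for \c can be summarized by the following C sketch.
       It is not the PCRE2 implementation: ebcdic_control_escape() is a
       hypothetical name, and the posix_bc argument is an assumed flag that
       stands in for PCRE2's own detection of the POSIX-BC variant.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper sketching the EBCDIC rules for \cx. Returns the
   character code, or -1 for a character that is not allowed after \c.
   'posix_bc' is an assumed flag selecting the POSIX-BC value for \c?. */
int ebcdic_control_escape(int x, int posix_bc)
{
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char punct[] = "[\\]^_";
    const char *p;

    if (x == '\0') return -1;           /* strchr would match the terminator */
    if (x == '@') return 0;
    if (x == '?') return posix_bc ? 95 : 255;
    if ((p = strchr(upper, x)) != NULL) return (int)(p - upper) + 1;
    if ((p = strchr(lower, x)) != NULL) return (int)(p - lower) + 1;
    if ((p = strchr(punct, x)) != NULL) return (int)(p - punct) + 27;
    return -1;
}
```

       The letters are looked up in explicit tables rather than by
       arithmetic on character codes, because letters are not contiguous
       in EBCDIC.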
6893*22dc650dSSadaf Ebrahimi
6894*22dc650dSSadaf Ebrahimi       After \0 up to two further octal digits are read. If  there  are  fewer
6895*22dc650dSSadaf Ebrahimi       than  two  digits,  just  those that are present are used. Thus the se-
6896*22dc650dSSadaf Ebrahimi       quence \0\x\015 specifies two binary zeros followed by a  CR  character
6897*22dc650dSSadaf Ebrahimi       (code value 13). Make sure you supply two digits after the initial zero
6898*22dc650dSSadaf Ebrahimi       if the pattern character that follows is itself an octal digit.
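
       The reading of \0 and of unbraced \x in the default mode can be
       sketched in C as follows. The names are hypothetical and the sketch
       handles only these two escapes; applied to the sequence \0\x\015 it
       produces the codes 0, 0, and 13, as described above.

```c
#include <ctype.h>
#include <stddef.h>

/* Value of one hexadecimal digit (the caller guarantees isxdigit(c)). */
static unsigned hex_value(int c)
{
    if (isdigit(c)) return (unsigned)(c - '0');
    return (unsigned)(tolower(c) - 'a') + 10u;
}

/* Hypothetical helper: 'p' points at the character after a backslash, which
   must be '0' or 'x'. Stores the character code in *code and returns the
   number of pattern characters consumed after the backslash.
   Covers only \0dd and unbraced \xhh in the default mode. */
size_t read_numeric_escape(const char *p, unsigned *code)
{
    size_t used = 1;            /* the '0' or 'x' itself */
    unsigned value = 0;

    if (p[0] == '0') {
        /* up to two further octal digits */
        while (used < 3 && p[used] >= '0' && p[used] <= '7')
            value = value * 8 + (unsigned)(p[used++] - '0');
    } else {
        /* from zero to two hexadecimal digits */
        while (used < 3 && isxdigit((unsigned char)p[used]))
            value = value * 16 + hex_value((unsigned char)p[used++]);
    }
    *code = value;
    return used;
}
```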
6899*22dc650dSSadaf Ebrahimi
6900*22dc650dSSadaf Ebrahimi       The  escape \o must be followed by a sequence of octal digits, enclosed
6901*22dc650dSSadaf Ebrahimi       in braces. An error occurs if this is not the case. This  escape  is  a
6902*22dc650dSSadaf Ebrahimi       recent addition to Perl; it provides a way of specifying character code
6903*22dc650dSSadaf Ebrahimi       points as octal numbers greater than 0777, and  it  also  allows  octal
       numbers and backreferences to be unambiguously specified.

       For greater clarity and unambiguity, it is best to avoid following \ by
       a  digit  greater than zero. Instead, use \o{...} or \x{...} to specify
       numerical character code points, and \g{...} to specify backreferences.
       The following paragraphs describe the old, ambiguous syntax.

       The handling of a backslash followed by a digit other than 0 is compli-
       cated, and Perl has changed over time, causing PCRE2 also to change.

       Outside a character class, PCRE2 reads the digit and any following dig-
       its as a decimal number. If the number is less than 10, begins with the
       digit 8 or 9, or if there are  at  least  that  many  previous  capture
       groups  in the expression, the entire sequence is taken as a backrefer-
       ence. A description of how this works is  given  later,  following  the
       discussion  of parenthesized groups.  Otherwise, up to three octal dig-
       its are read to form a character code.

       Inside a character class, PCRE2 handles \8 and \9 as the literal  char-
       acters  "8"  and "9", and otherwise reads up to three octal digits fol-
       lowing the backslash, using them to generate a data character. Any sub-
       sequent digits stand for themselves. For example, outside  a  character
       class:

         \040   is another way of writing an ASCII space
         \40    is the same, provided there are fewer than 40
                   previous capture groups
         \7     is always a backreference
         \11    might be a backreference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a backreference, otherwise the
                   character with octal code 113
         \377   might be a backreference, otherwise
                   the value 255 (decimal)
         \81    is always a backreference

       Note  that octal values of 100 or greater that are specified using this
       syntax must not be introduced by a leading zero, because no  more  than
       three octal digits are ever read.
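
       As an illustration only, the decision described above for  a  backslash
       followed by a non-zero digit outside a character class can be sketched
       in C as follows. This is not PCRE2's own code: the function name
       classify_digit_escape(), the digit_escape structure, and the
       groups_so_far parameter are assumptions made for this example, and
       overflow of very long digit strings is ignored.

```c
#include <stddef.h>

/* Result of interpreting a backslash followed by a non-zero digit
   outside a character class (hypothetical type for this sketch). */
typedef struct {
    int is_backref;     /* 1: backreference, 0: octal character code */
    unsigned value;     /* group number, or character code */
    size_t consumed;    /* digits used; any that follow are literals */
} digit_escape;

/* "digits" points at the first digit after the backslash, which must
   be in the range 1-9. "groups_so_far" is the number of capture groups
   that precede the escape in the pattern. */
digit_escape classify_digit_escape(const char *digits, unsigned groups_so_far)
{
    digit_escape r = {0, 0, 0};
    unsigned decimal = 0;
    size_t n = 0;

    /* Read the digit and any following digits as a decimal number. */
    while (digits[n] >= '0' && digits[n] <= '9') {
        decimal = decimal * 10 + (unsigned)(digits[n] - '0');
        n++;
    }

    /* Less than 10, starts with 8 or 9, or enough previous groups. */
    if (decimal < 10 || digits[0] == '8' || digits[0] == '9' ||
        decimal <= groups_so_far) {
        r.is_backref = 1;
        r.value = decimal;
        r.consumed = n;
        return r;
    }

    /* Otherwise read up to three octal digits as a character code. */
    while (r.consumed < 3 &&
           digits[r.consumed] >= '0' && digits[r.consumed] <= '7') {
        r.value = r.value * 8 + (unsigned)(digits[r.consumed] - '0');
        r.consumed++;
    }
    return r;
}
```

       With no previous capture groups, the digits "40" give the octal value
       040 (a space), whereas "7" and "81" are reported as backreferences.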

   Constraints on character values

       Characters  that  are  specified using octal or hexadecimal numbers are
       limited to certain values, as follows:

         8-bit non-UTF mode    no greater than 0xff
         16-bit non-UTF mode   no greater than 0xffff
         32-bit non-UTF mode   no greater than 0xffffffff
         All UTF modes         no greater than 0x10ffff and a valid code point

       Invalid Unicode code points are all those in the range 0xd800 to 0xdfff
       (the so-called "surrogate" code points). The check  for  these  can  be
       disabled  by  the  caller  of  pcre2_compile()  by  setting  the option
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only  in
       UTF-8  and  UTF-32 modes, because these values are not representable in
       UTF-16.
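
       The limits above can be expressed as a small range check. The following
       C sketch is illustrative only; the function escape_value_ok() and its
       parameters are hypothetical and are not part of the PCRE2 API.

```c
#include <stdint.h>

/* Returns 1 if the code point "cp" may be given as an octal or
   hexadecimal escape, 0 otherwise.
     width            code unit width: 8, 16, or 32
     utf              non-zero in a UTF mode
     allow_surrogates non-zero to model the effect of
                      PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES */
int escape_value_ok(uint32_t cp, int width, int utf, int allow_surrogates)
{
    if (utf) {
        if (cp > 0x10FFFF) return 0;
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            /* Surrogates cannot be represented in UTF-16, so the
               check can be disabled only in UTF-8 and UTF-32. */
            return allow_surrogates && width != 16;
        }
        return 1;
    }
    if (width == 8) return cp <= 0xFF;
    if (width == 16) return cp <= 0xFFFF;
    return 1;   /* 32-bit non-UTF mode: any 32-bit value */
}
```

       For example, 0xd800 is rejected in UTF-8 mode unless the surrogate
       check is disabled, but is accepted in 16-bit non-UTF mode.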

   Escape sequences in character classes

       All the sequences that define a single character value can be used both
       inside and outside character classes. In addition, inside  a  character
       class, \b is interpreted as the backspace character (hex 08).

       When not followed by an opening brace, \N is not allowed in a character
       class.   \B,  \R, and \X are not special inside a character class. Like
       other unrecognized alphabetic escape sequences, they  cause  an  error.
       Outside a character class, these sequences have different meanings.

   Unsupported escape sequences

       In  Perl,  the  sequences  \F, \l, \L, \u, and \U are recognized by its
       string handler and used to modify the case of following characters.  By
       default,  PCRE2  does  not  support these escape sequences in patterns.
       However, if either of the PCRE2_ALT_BSUX  or  PCRE2_EXTRA_ALT_BSUX  op-
       tions  is set, \U matches a "U" character, and \u can be used to define
       a character by code point, as described above.

   Absolute and relative backreferences

       The sequence \g followed by a signed or unsigned number, optionally en-
       closed in braces, is an absolute or  relative  backreference.  A  named
       backreference  can  be  coded as \g{name}. Backreferences are discussed
       later, following the discussion of parenthesized groups.

   Absolute and relative subroutine calls

       For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
       name or a number enclosed either in angle brackets or single quotes, is
       an  alternative syntax for referencing a capture group as a subroutine.
       Details are discussed later.   Note  that  \g{...}  (Perl  syntax)  and
       \g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
       erence; the latter is a subroutine call.

   Generic character types

       Another use of backslash is for specifying generic character types:

         \d     any decimal digit
         \D     any character that is not a decimal digit
         \h     any horizontal white space character
         \H     any character that is not a horizontal white space character
         \N     any character that is not a newline
         \s     any white space character
         \S     any character that is not a white space character
         \v     any vertical white space character
         \V     any character that is not a vertical white space character
         \w     any "word" character
         \W     any "non-word" character

       The  \N  escape  sequence has the same meaning as the "." metacharacter
       when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not  change
       the meaning of \N. Note that when \N is followed by an opening brace it
       has a different meaning. See the section entitled "Non-printing charac-
       ters"  above for details. Perl also uses \N{name} to specify characters
       by Unicode name; PCRE2 does not support this.

       Each pair of lower and upper case escape sequences partitions the  com-
       plete  set  of  characters  into two disjoint sets. Any given character
       matches one, and only one, of each pair. The sequences can appear  both
       inside  and outside character classes. They each match one character of
       the appropriate type. If the current matching point is at  the  end  of
       the  subject string, all of them fail, because there is no character to
       match.

       The default \s characters are HT (9), LF (10), VT  (11),  FF  (12),  CR
       (13),  and  space (32), which are defined as white space in the "C" lo-
       cale. This list may vary if locale-specific matching is  taking  place.
       For  example, in some locales the "non-breaking space" character (\xA0)
       is recognized as white space, and in others the VT character is not.

       A "word" character is an underscore or any character that is  a  letter
       or  digit.   By  default,  the definition of letters and digits is con-
       trolled by PCRE2's low-valued character tables, and may vary if locale-
       specific matching is taking place (see "Locale support" in the pcre2api
       page). For example, in a French locale such  as  "fr_FR"  in  Unix-like
       systems,  or "french" in Windows, some character codes greater than 127
       are used for accented letters, and these are then matched  by  \w.  The
       use of locales with Unicode is discouraged.
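
       The default sets for \d, \s, and \w (no locale-specific tables and no
       PCRE2_UCP) can be pictured with the following C sketch. It is an
       illustration only and assumes an ASCII execution character set; the
       function names are hypothetical, and PCRE2 itself uses lookup tables
       rather than comparisons like these.

```c
#include <stdint.h>

/* Default \d: the ASCII decimal digits only. */
int default_digit(uint32_t c)
{
    return c >= '0' && c <= '9';
}

/* Default \s: HT (9), LF (10), VT (11), FF (12), CR (13), space (32). */
int default_space(uint32_t c)
{
    return (c >= 9 && c <= 13) || c == 32;
}

/* Default \w: underscore, ASCII letters, and ASCII digits. */
int default_word(uint32_t c)
{
    return c == '_' || default_digit(c) ||
           (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
```

       Under these defaults no code point above 127 is matched, so \xA0 is not
       a space and an accented letter such as \xE9 is not a word character.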

       By  default,  characters  whose  code points are greater than 127 never
       match \d, \s, or \w, and always match \D, \S, and \W, although this may
       be different for characters in the range 128-255  when  locale-specific
       matching  is  happening.   These escape sequences retain their original
       meanings from before Unicode support was available,  mainly  for  effi-
       ciency  reasons.  If  the  PCRE2_UCP  option  is  set, the behaviour is
       changed so that Unicode properties  are  used  to  determine  character
       types, as follows:

         \d  any character that matches \p{Nd} (decimal digit)
         \s  any character that matches \p{Z} or \h or \v
         \w  any character that matches \p{L}, \p{N}, \p{Mn}, or \p{Pc}

       The addition of \p{Mn} (non-spacing mark) and the replacement of an ex-
       plicit  test  for underscore with a test for \p{Pc} (connector punctua-
       tion) happened in PCRE2 release 10.43. This brings PCRE2 into line with
       Perl.

       The upper case escapes match the inverse sets of characters. Note  that
       \d  matches  only decimal digits, whereas \w matches any Unicode digit,
       as well as other character categories. Note also that PCRE2_UCP affects
       \b and \B, because they are defined in terms of  \w  and  \W.  Matching
       these sequences is noticeably slower when PCRE2_UCP is set.

       The  effect  of  PCRE2_UCP  on any one of these escape sequences can be
       negated by the  options  PCRE2_EXTRA_ASCII_BSD,  PCRE2_EXTRA_ASCII_BSS,
       and  PCRE2_EXTRA_ASCII_BSW,  respectively. These options can be set and
       reset within a pattern by means of an internal option setting (see  be-
       low).

       The  sequences  \h, \H, \v, and \V, in contrast to the other sequences,
       which match only ASCII characters by default, always match  a  specific
       list  of  code  points, whether or not PCRE2_UCP is set. The horizontal
       space characters are:

         U+0009     Horizontal tab (HT)
         U+0020     Space
         U+00A0     Non-break space
         U+1680     Ogham space mark
         U+180E     Mongolian vowel separator
         U+2000     En quad
         U+2001     Em quad
         U+2002     En space
         U+2003     Em space
         U+2004     Three-per-em space
         U+2005     Four-per-em space
         U+2006     Six-per-em space
         U+2007     Figure space
         U+2008     Punctuation space
         U+2009     Thin space
         U+200A     Hair space
         U+202F     Narrow no-break space
         U+205F     Medium mathematical space
         U+3000     Ideographic space

       The vertical space characters are:

         U+000A     Linefeed (LF)
         U+000B     Vertical tab (VT)
         U+000C     Form feed (FF)
         U+000D     Carriage return (CR)
         U+0085     Next line (NEL)
         U+2028     Line separator
         U+2029     Paragraph separator

       In 8-bit, non-UTF-8 mode, only the characters  with  code  points  less
       than 256 are relevant.
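
       Because \h and \v match fixed lists of code points, membership can be
       tested directly. The C sketch below simply restates the two lists
       above; the names is_hspace() and is_vspace() are chosen for this
       example and are not PCRE2 functions.

```c
#include <stdint.h>

/* Returns 1 if "c" is in the horizontal space list (\h). */
int is_hspace(uint32_t c)
{
    switch (c) {
    case 0x0009: case 0x0020: case 0x00A0: case 0x1680:
    case 0x180E: case 0x202F: case 0x205F: case 0x3000:
        return 1;
    default:
        /* U+2000 (en quad) through U+200A (hair space) are contiguous. */
        return c >= 0x2000 && c <= 0x200A;
    }
}

/* Returns 1 if "c" is in the vertical space list (\v). */
int is_vspace(uint32_t c)
{
    switch (c) {
    case 0x000A: case 0x000B: case 0x000C: case 0x000D:
    case 0x0085: case 0x2028: case 0x2029:
        return 1;
    default:
        return 0;
    }
}
```

       The upper case forms \H and \V correspond to the negation of these
       tests.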

   Newline sequences

       Outside  a  character class, by default, the escape sequence \R matches
       any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is  equivalent
       to the following:

         (?>\r\n|\n|\x0b|\f|\r|\x85)

       This is an example of an "atomic group", details of which are given be-
       low.   This  particular group matches either the two-character sequence
       CR followed by LF, or  one  of  the  single  characters  LF  (linefeed,
       U+000A),  VT  (vertical  tab, U+000B), FF (form feed, U+000C), CR (car-
       riage return, U+000D), or NEL (next line, U+0085). Because this  is  an
       atomic  group,  the  two-character sequence is treated as a single unit
       that cannot be split.

       In other modes, two additional characters whose code points are greater
       than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
       rator, U+2029).  Unicode support is not needed for these characters  to
       be recognized.

       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
       the  complete  set  of  Unicode  line  endings)  by  setting the option
       PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation  for  "back-
       slash R".) This can be made the default when PCRE2 is built; if this is
       the  case,  the other behaviour can be requested via the PCRE2_BSR_UNI-
       CODE option. It is also possible to specify these settings by  starting
       a pattern string with one of the following sequences:

         (*BSR_ANYCRLF)   CR, LF, or CRLF only
         (*BSR_UNICODE)   any Unicode newline sequence

       These override the default and the options given to the compiling func-
       tion.  Note that these special settings, which are not Perl-compatible,
       are  recognized only at the very start of a pattern, and that they must
       be in upper case. If more than one of them is present, the last one  is
       used. They can be combined with a change of newline convention; for ex-
       ample, a pattern can start with:

         (*ANY)(*BSR_ANYCRLF)

       They  can also be combined with the (*UTF) or (*UCP) special sequences.
       Inside a character class, \R is treated as an unrecognized  escape  se-
       quence, and causes an error.
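
       The behaviour of \R described in this section can be sketched in C as
       a function that reports how many code points \R would consume at a
       given position. This is a simplified model, not PCRE2's matching code:
       the function match_bsr() and its parameters are hypothetical, and the
       subject is assumed to be already decoded into code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the number of code points that \R matches at s[pos], or 0
   if there is no match.
     anycrlf            non-zero to model PCRE2_BSR_ANYCRLF
     eight_bit_non_utf  non-zero in 8-bit non-UTF mode, where LS and
                        PS are not added to the set */
size_t match_bsr(const uint32_t *s, size_t len, size_t pos,
                 int anycrlf, int eight_bit_non_utf)
{
    uint32_t c;

    if (pos >= len) return 0;
    c = s[pos];

    /* CR followed by LF is taken as one unit and is never split. */
    if (c == 0x0D)
        return (pos + 1 < len && s[pos + 1] == 0x0A) ? 2 : 1;
    if (c == 0x0A) return 1;

    /* With the ANYCRLF restriction nothing else matches. */
    if (anycrlf) return 0;

    if (c == 0x0B || c == 0x0C || c == 0x85) return 1;
    if (!eight_bit_non_utf && (c == 0x2028 || c == 0x2029)) return 1;
    return 0;
}
```

       Because the CRLF case is tried first and the result is a single
       length, the sketch mirrors the atomic grouping: a caller cannot
       retry with only the CR consumed.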

   Unicode character properties

       When  PCRE2  is  built  with Unicode support (the default), three addi-
       tional escape sequences that match characters with specific  properties
       are available. They can be used in any mode, though in 8-bit and 16-bit
       non-UTF  modes these sequences are of course limited to testing charac-
       ters whose code points are less than U+0100 and U+10000,  respectively.
       In  32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode
       limit) may be encountered. These are all treated as being  in  the  Un-
       known script and with an unassigned type.

       Matching  characters by Unicode property is not fast, because PCRE2 has
       to do a multistage table lookup in order to find  a  character's  prop-
       erty. That is why the traditional escape sequences such as \d and \w do
       not  use  Unicode  properties  in PCRE2 by default, though you can make
       them do so by setting the PCRE2_UCP option or by starting  the  pattern
       with (*UCP).

       The extra escape sequences that provide property support are:

         \p{xx}   a character with the xx property
         \P{xx}   a character without the xx property
         \X       a Unicode extended grapheme cluster

       The  property names represented by xx above are not case-sensitive, and
       in accordance with Unicode's "loose matching" rules,  spaces,  hyphens,
       and underscores are ignored. There is support for Unicode script names,
       Unicode general category properties, "Any", which matches any character
       (including  newline),  Bidi_Class,  a number of binary (yes/no) proper-
       ties, and some special PCRE2  properties  (described  below).   Certain
       other  Perl  properties such as "InMusicalSymbols" are not supported by
       PCRE2. Note that \P{Any} does  not  match  any  characters,  so  always
       causes a match failure.
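
       The loose matching of property names mentioned above (case is ignored,
       as are spaces, hyphens, and underscores) amounts to the comparison
       sketched below in C. The function names are assumptions made for this
       example, and only ASCII names are considered.

```c
#include <ctype.h>

/* Skips characters that loose matching ignores. */
static const char *skip_ignored(const char *p)
{
    while (*p == ' ' || *p == '-' || *p == '_') p++;
    return p;
}

/* Returns 1 if the two property names are equal under loose matching:
   case-insensitive, with spaces, hyphens, and underscores ignored. */
int loose_equal(const char *a, const char *b)
{
    for (;;) {
        a = skip_ignored(a);
        b = skip_ignored(b);
        if (*a == '\0' || *b == '\0')
            return *a == *b;        /* equal only if both have ended */
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
}
```

       Under this comparison "Bidi_Class", "bidiclass", and "BIDI CLASS" all
       name the same property.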

   Script properties for \p and \P

       There are three different syntax forms for matching a script. Each Uni-
       code  character  has  a  basic  script and, optionally, a list of other
       scripts ("Script Extensions") with which it is commonly used. Using the
       Adlam script as an example, \p{sc:Adlam} matches characters whose basic
       script is Adlam, whereas \p{scx:Adlam} matches, in addition, characters
       that have Adlam in their extensions list. The full names  "script"  and
       "script extensions" for the property types are recognized, and an
       equals sign is an alternative to the colon. If a script name is given
       without a property type, for example, \p{Adlam}, it is treated as
       \p{scx:Adlam}. Perl changed to this interpretation at release 5.26 and
       PCRE2 changed at release 10.40.
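
       The three syntax forms can be told apart by looking for a property type
       before a colon or equals sign. The C sketch below is illustrative only:
       the enumeration and the function parse_script_property() are
       hypothetical, and the caller is assumed to know already that a name
       given without a property type is a script name rather than, say, a
       general category.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { NOT_SCRIPT, SCRIPT_BASIC, SCRIPT_EXTENSIONS } script_kind;

/* Loosely compares the first n characters of s with a lower-case
   literal: case, spaces, hyphens, and underscores in s are ignored. */
static int loose_eq_n(const char *s, size_t n, const char *lit)
{
    size_t i;
    for (i = 0; i < n; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c == ' ' || c == '-' || c == '_') continue;
        if (*lit == '\0' || tolower(c) != *lit) return 0;
        lit++;
    }
    return *lit == '\0';
}

/* "body" is the text between the braces of \p{...}. On return, *name
   points at the script name within body. */
script_kind parse_script_property(const char *body, const char **name)
{
    size_t n = strcspn(body, ":=");

    if (body[n] == '\0') {          /* no property type: \p{Adlam} */
        *name = body;
        return SCRIPT_EXTENSIONS;   /* treated as scx since 10.40 */
    }
    *name = body + n + 1;
    if (loose_eq_n(body, n, "sc") || loose_eq_n(body, n, "script"))
        return SCRIPT_BASIC;
    if (loose_eq_n(body, n, "scx") ||
        loose_eq_n(body, n, "scriptextensions"))
        return SCRIPT_EXTENSIONS;
    return NOT_SCRIPT;              /* some other property type */
}
```

       For example, "sc:Adlam" and "script=Adlam" select the basic script,
       whereas "scx:Adlam" and plain "Adlam" select script extensions.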

       Unassigned characters (and in non-UTF 32-bit mode, characters with code
       points greater than 0x10FFFF) are assigned the "Unknown" script. Others
       that are not part of an identified script are lumped together as  "Com-
       mon". The current list of recognized script names and their 4-character
       abbreviations can be obtained by running this command:

         pcre2test -LS


   The general category property for \p and \P

       Each character has exactly one Unicode general category property, spec-
       ified  by a two-letter abbreviation. For compatibility with Perl, nega-
       tion can be specified by including a  circumflex  between  the  opening
       brace  and  the  property  name.  For  example,  \p{^Lu} is the same as
       \P{Lu}.

       If only one letter is specified with \p or \P, it includes all the gen-
       eral category properties that start with that letter. In this case,  in
       the  absence of negation, the curly brackets in the escape sequence are
       optional; these two examples have the same effect:

         \p{L}
         \pL

       The following general category property codes are supported:

         C     Other
         Cc    Control
         Cf    Format
         Cn    Unassigned
         Co    Private use
         Cs    Surrogate

         L     Letter
         Ll    Lower case letter
         Lm    Modifier letter
         Lo    Other letter
         Lt    Title case letter
         Lu    Upper case letter

         M     Mark
         Mc    Spacing mark
         Me    Enclosing mark
         Mn    Non-spacing mark

         N     Number
         Nd    Decimal number
         Nl    Letter number
         No    Other number

         P     Punctuation
         Pc    Connector punctuation
         Pd    Dash punctuation
         Pe    Close punctuation
         Pf    Final punctuation
         Pi    Initial punctuation
         Po    Other punctuation
         Ps    Open punctuation

         S     Symbol
         Sc    Currency symbol
         Sk    Modifier symbol
         Sm    Mathematical symbol
         So    Other symbol

         Z     Separator
         Zl    Line separator
         Zp    Paragraph separator
         Zs    Space separator

       The special property LC, which has the synonym L&, is  also  supported:
       it  matches  a  character that has the Lu, Ll, or Lt property, in other
       words, a letter that is not classified as a modifier or "other".
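
       Given a character's two-letter general category, the rules above for
       one-letter codes, circumflex negation, and the LC property reduce to a
       few string comparisons. The C sketch below is an illustration under
       stated assumptions: category_matches() is a hypothetical name, the
       category lookup itself is assumed to happen elsewhere, and names are
       compared exactly rather than loosely.

```c
#include <string.h>

/* "prop" is the text between the braces of \p{...}, for example "L",
   "Lu", "^Lu", "LC", or "L&". "cat" is the character's two-letter
   general category, for example "Lu". Returns 1 on a match. */
int category_matches(const char *prop, const char *cat)
{
    int negate = 0;
    int hit;

    if (*prop == '^') {             /* Perl-compatible negation */
        negate = 1;
        prop++;
    }

    if (strcmp(prop, "LC") == 0 || strcmp(prop, "L&") == 0) {
        /* Cased letters: upper, lower, or title case. */
        hit = strcmp(cat, "Lu") == 0 || strcmp(cat, "Ll") == 0 ||
              strcmp(cat, "Lt") == 0;
    } else if (prop[0] != '\0' && prop[1] == '\0') {
        /* One letter: every category that starts with that letter. */
        hit = cat[0] == prop[0];
    } else {
        hit = strcmp(prop, cat) == 0;
    }
    return negate ? !hit : hit;
}
```

       A match for \P{xx} is the negation of the result for \p{xx}, so
       "^Lu" used with \p gives the same answer as "Lu" used with \P.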

       The Cs (Surrogate) property  applies  only  to  characters  whose  code
       points  are in the range U+D800 to U+DFFF. These characters are no dif-
       ferent to any other character when PCRE2 is not in UTF mode (using  the
       16-bit  or  32-bit  library).   However,  they are not valid in Unicode
       strings and so cannot be tested by PCRE2 in UTF mode, unless UTF valid-
       ity  checking  has   been   turned   off   (see   the   discussion   of
       PCRE2_NO_UTF_CHECK in the pcre2api page).

       The  long  synonyms  for  property  names  that  Perl supports (such as
       \p{Letter}) are not supported by PCRE2, nor is it permitted  to  prefix
       any of these properties with "Is".

       No character that is in the Unicode table has the Cn (unassigned) prop-
       erty.  Instead, this property is assumed for any code point that is not
       in the Unicode table.

       Specifying  caseless  matching  does not affect these escape sequences.
       For example, \p{Lu} always matches only upper  case  letters.  This  is
       different from the behaviour of current versions of Perl.

   Binary (yes/no) properties for \p and \P

       Unicode  defines  a  number  of  binary properties, that is, properties
       whose only values are true or false. You can obtain  a  list  of  those
       that  are  recognized  by \p and \P, along with their abbreviations, by
       running this command:

         pcre2test -LP


   The Bidi_Class property for \p and \P

         \p{Bidi_Class:<class>}   matches a character with the given class
         \p{BC:<class>}           matches a character with the given class

       The recognized classes are:

         AL          Arabic letter
         AN          Arabic number
         B           paragraph separator
         BN          boundary neutral
         CS          common separator
         EN          European number
         ES          European separator
         ET          European terminator
         FSI         first strong isolate
         L           left-to-right
         LRE         left-to-right embedding
         LRI         left-to-right isolate
         LRO         left-to-right override
         NSM         non-spacing mark
         ON          other neutral
         PDF         pop directional format
         PDI         pop directional isolate
         R           right-to-left
         RLE         right-to-left embedding
         RLI         right-to-left isolate
         RLO         right-to-left override
         S           segment separator
         WS          white space

       An equals sign may be used instead of a  colon.  The  class  names  are
       case-insensitive; only the short names listed above are recognized.

   Extended grapheme clusters

       The  \X  escape  matches  any number of Unicode characters that form an
       "extended grapheme cluster", and treats the sequence as an atomic group
       (see below).  Unicode supports various kinds of composite character  by
       giving  each  character  a grapheme breaking property, and having rules
       that use these properties to define the boundaries of extended grapheme
       clusters. The rules are defined in Unicode Standard Annex 29,  "Unicode
       Text  Segmentation".  Unicode 11.0.0 abandoned the use of some previous
       properties that had been used for emojis.  Instead it introduced  vari-
       ous  emoji-specific  properties.  PCRE2  uses  only the Extended Picto-
       graphic property.

       \X always matches at least one character. Then it  decides  whether  to
       add additional characters according to the following rules for ending a
       cluster:

       1. End at the end of the subject string.

       2.  Do not end between CR and LF; otherwise end after any control char-
       acter.

       3. Do not break Hangul (a Korean  script)  syllable  sequences.  Hangul
       characters  are of five types: L, V, T, LV, and LVT. An L character may
       be followed by an L, V, LV, or LVT character; an LV or V character  may
       be  followed  by  a V or T character; an LVT or T character may be fol-
       lowed only by a T character.

       4. Do not end before extending characters or spacing marks or the zero-
       width joiner (ZWJ) character. Characters with the "mark"  property  al-
       ways have the "extend" grapheme breaking property.

       5. Do not end after prepend characters.

       6.  Do not end within emoji modifier sequences or emoji ZWJ (zero-width
7380*22dc650dSSadaf Ebrahimi       joiner) sequences. An emoji ZWJ sequence consists of a  character  with
7381*22dc650dSSadaf Ebrahimi       the  Extended_Pictographic property, optionally followed by one or more
7382*22dc650dSSadaf Ebrahimi       characters with the Extend property, followed  by  the  ZWJ  character,
7383*22dc650dSSadaf Ebrahimi       followed by another Extended_Pictographic character.
7384*22dc650dSSadaf Ebrahimi
7385*22dc650dSSadaf Ebrahimi       7.  Do not break within emoji flag sequences. That is, do not break be-
7386*22dc650dSSadaf Ebrahimi       tween regional indicator (RI) characters if there are an odd number  of
7387*22dc650dSSadaf Ebrahimi       RI characters before the break point.
7388*22dc650dSSadaf Ebrahimi
7389*22dc650dSSadaf Ebrahimi       8. Otherwise, end the cluster.
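
       Rules 3 and 7 above can be modelled as small lookups. The following
       is a minimal Python sketch, not PCRE2's implementation: the function
       names are hypothetical, and the caller is assumed to have already
       classified each character (real \X matching needs the full Unicode
       grapheme breaking property tables).

```python
# Hypothetical model of rules 3 and 7 only. Character classification
# (Hangul type, regional indicator) is assumed to be done by the caller.

# Rule 3: for each Hangul type, the types that may follow without a break.
HANGUL_FOLLOWERS = {
    "L": {"L", "V", "LV", "LVT"},
    "LV": {"V", "T"},
    "V": {"V", "T"},
    "LVT": {"T"},
    "T": {"T"},
}


def hangul_keeps_together(prev_type, next_type):
    """Return True if rule 3 forbids ending the cluster between the two."""
    return next_type in HANGUL_FOLLOWERS.get(prev_type, set())


def ri_keeps_together(ri_count_before):
    """Rule 7: ri_count_before is the number of consecutive regional
    indicator characters immediately before the candidate break point;
    the next character is assumed to be a regional indicator too."""
    return ri_count_before % 2 == 1
```

       With this model, three regional indicators in a row form one flag
       pair followed by a separate single indicator, because the count
       before the third character is even.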
7390*22dc650dSSadaf Ebrahimi
7391*22dc650dSSadaf Ebrahimi   PCRE2's additional properties
7392*22dc650dSSadaf Ebrahimi
7393*22dc650dSSadaf Ebrahimi       As  well as the standard Unicode properties described above, PCRE2 sup-
7394*22dc650dSSadaf Ebrahimi       ports four more that make it possible to convert traditional escape se-
7395*22dc650dSSadaf Ebrahimi       quences such as \w and \s to use Unicode properties. PCRE2  uses  these
7396*22dc650dSSadaf Ebrahimi       non-standard,  non-Perl  properties  internally  when PCRE2_UCP is set.
7397*22dc650dSSadaf Ebrahimi       However, they may also be used explicitly. These properties are:
7398*22dc650dSSadaf Ebrahimi
7399*22dc650dSSadaf Ebrahimi         Xan   Any alphanumeric character
7400*22dc650dSSadaf Ebrahimi         Xps   Any POSIX space character
7401*22dc650dSSadaf Ebrahimi         Xsp   Any Perl space character
7402*22dc650dSSadaf Ebrahimi         Xwd   Any Perl "word" character
7403*22dc650dSSadaf Ebrahimi
7404*22dc650dSSadaf Ebrahimi       Xan matches characters that have either the L (letter) or the  N  (num-
7405*22dc650dSSadaf Ebrahimi       ber)  property. Xps matches the characters tab, linefeed, vertical tab,
7406*22dc650dSSadaf Ebrahimi       form feed, or carriage return, and any other character that has  the  Z
7407*22dc650dSSadaf Ebrahimi       (separator)  property.  Xsp is the same as Xps; in PCRE1 it used to ex-
7408*22dc650dSSadaf Ebrahimi       clude vertical tab, for  Perl  compatibility,  but  Perl  changed.  Xwd
7409*22dc650dSSadaf Ebrahimi       matches the same characters as Xan, plus those that match Mn (non-spac-
7410*22dc650dSSadaf Ebrahimi       ing mark) or Pc (connector punctuation, which includes underscore).
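
       The four definitions can be restated in terms of Unicode general
       categories. This is an illustrative Python sketch with hypothetical
       helper names; it relies on Python's unicodedata tables, whose Unicode
       version may differ from the one PCRE2 was built with.

```python
import unicodedata


def is_xan(ch):
    """Xan: any character with the L (letter) or N (number) property."""
    return unicodedata.category(ch)[0] in ("L", "N")


def is_xps(ch):
    """Xps: tab, linefeed, vertical tab, form feed, carriage return,
    or any character with the Z (separator) property."""
    return ch in "\t\n\v\f\r" or unicodedata.category(ch)[0] == "Z"


# Xsp is described as the same set as Xps.
is_xsp = is_xps


def is_xwd(ch):
    """Xwd: Xan plus Mn (non-spacing mark) and Pc (connector punctuation)."""
    return is_xan(ch) or unicodedata.category(ch) in ("Mn", "Pc")
```

       Underscore is in category Pc, so it satisfies is_xwd() but not
       is_xan(), which mirrors the difference between Xwd and Xan.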
7411*22dc650dSSadaf Ebrahimi
7412*22dc650dSSadaf Ebrahimi       There  is another non-standard property, Xuc, which matches any charac-
7413*22dc650dSSadaf Ebrahimi       ter that can be represented by a Universal Character Name  in  C++  and
7414*22dc650dSSadaf Ebrahimi       other  programming  languages.  These are the characters $, @, ` (grave
7415*22dc650dSSadaf Ebrahimi       accent), and all characters with Unicode code points  greater  than  or
7416*22dc650dSSadaf Ebrahimi       equal  to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
7417*22dc650dSSadaf Ebrahimi       most base (ASCII) characters are excluded. (Universal  Character  Names
7418*22dc650dSSadaf Ebrahimi       are  of  the  form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
7419*22dc650dSSadaf Ebrahimi       Note that the Xuc property does not match these sequences but the char-
7420*22dc650dSSadaf Ebrahimi       acters that they represent.)
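
       The Xuc set is simple enough to express as a code point test. The
       following Python sketch (is_xuc is a hypothetical name) restates the
       description above and is not taken from the PCRE2 source.

```python
def is_xuc(code_point):
    """Return True if the code point is in the Xuc set as described:
    $, @, grave accent, or any code point >= U+00A0 that is not a surrogate."""
    if code_point in (0x24, 0x40, 0x60):  # $  @  `
        return True
    if 0xD800 <= code_point <= 0xDFFF:  # surrogates are excluded
        return False
    return code_point >= 0xA0
```

       For example, is_xuc(ord("A")) is false, because most ASCII characters
       are excluded, whereas is_xuc(0xE9) is true.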
7421*22dc650dSSadaf Ebrahimi
7422*22dc650dSSadaf Ebrahimi   Resetting the match start
7423*22dc650dSSadaf Ebrahimi
7424*22dc650dSSadaf Ebrahimi       In normal use, the escape sequence \K  causes  any  previously  matched
7425*22dc650dSSadaf Ebrahimi       characters not to be included in the final matched sequence that is re-
7426*22dc650dSSadaf Ebrahimi       turned. For example, the pattern:
7427*22dc650dSSadaf Ebrahimi
7428*22dc650dSSadaf Ebrahimi         foo\Kbar
7429*22dc650dSSadaf Ebrahimi
7430*22dc650dSSadaf Ebrahimi       matches  "foobar",  but  reports that it has matched "bar". \K does not
7431*22dc650dSSadaf Ebrahimi       interact with anchoring in any way. The pattern:
7432*22dc650dSSadaf Ebrahimi
7433*22dc650dSSadaf Ebrahimi         ^foo\Kbar
7434*22dc650dSSadaf Ebrahimi
7435*22dc650dSSadaf Ebrahimi       matches only when the subject begins  with  "foobar"  (in  single  line
7436*22dc650dSSadaf Ebrahimi       mode),  though  it again reports the matched string as "bar". This fea-
7437*22dc650dSSadaf Ebrahimi       ture is similar to a lookbehind assertion (described  below),  but  the
7438*22dc650dSSadaf Ebrahimi       part of the pattern that precedes \K is not constrained to match a lim-
7439*22dc650dSSadaf Ebrahimi       ited  number  of characters, as is required for a lookbehind assertion.
7440*22dc650dSSadaf Ebrahimi       The use of \K does not interfere with  the  setting  of  captured  sub-
7441*22dc650dSSadaf Ebrahimi       strings.  For example, when the pattern
7442*22dc650dSSadaf Ebrahimi
7443*22dc650dSSadaf Ebrahimi         (foo)\Kbar
7444*22dc650dSSadaf Ebrahimi
7445*22dc650dSSadaf Ebrahimi       matches "foobar", the first substring is still set to "foo".
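
       The effect of \K on the reported match can be imitated in an engine
       that lacks it. The sketch below uses Python's re module, which has no
       \K, and emulates it with a named group; match_after_prefix and
       lookbehind_compiles are hypothetical helpers. It also shows why \K is
       more flexible than a lookbehind: Python, like many engines, rejects
       a lookbehind whose length is not fixed.

```python
import re


def match_after_prefix(prefix, rest, subject):
    """Emulate PREFIX\\KREST: both parts must match, but only the span
    of REST is reported. Returns (start, end) or None."""
    m = re.search("(?:%s)(?P<rest>%s)" % (prefix, rest), subject)
    return m.span("rest") if m else None


def lookbehind_compiles(pattern):
    """Return True if Python's re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False
```

       Here match_after_prefix("fo+", "bar", "foooobar") reports only the
       span of "bar", although the prefix has variable length, while
       lookbehind_compiles(r"(?<=fo+)bar") is false in Python.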
7446*22dc650dSSadaf Ebrahimi
7447*22dc650dSSadaf Ebrahimi       From  version  5.32.0  Perl  forbids the use of \K in lookaround asser-
7448*22dc650dSSadaf Ebrahimi       tions. From release 10.38 PCRE2 also forbids this by default.  However,
7449*22dc650dSSadaf Ebrahimi       the  PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK  option  can be used when calling
7450*22dc650dSSadaf Ebrahimi       pcre2_compile() to re-enable the previous behaviour. When  this  option
7451*22dc650dSSadaf Ebrahimi       is set, \K is acted upon when it occurs inside positive assertions, but
7452*22dc650dSSadaf Ebrahimi       is  ignored  in  negative  assertions. Note that when a pattern such as
7453*22dc650dSSadaf Ebrahimi       (?=ab\K) matches, the reported start of the match can be  greater  than
7454*22dc650dSSadaf Ebrahimi       the  end  of the match. Using \K in a lookbehind assertion at the start
7455*22dc650dSSadaf Ebrahimi       of a pattern can also lead to odd effects. For example,  consider  this
7456*22dc650dSSadaf Ebrahimi       pattern:
7457*22dc650dSSadaf Ebrahimi
7458*22dc650dSSadaf Ebrahimi         (?<=\Kfoo)bar
7459*22dc650dSSadaf Ebrahimi
7460*22dc650dSSadaf Ebrahimi       If  the  subject  is  "foobar", a call to pcre2_match() with a starting
7461*22dc650dSSadaf Ebrahimi       offset of 3 succeeds and reports the matching string as "foobar",  that
7462*22dc650dSSadaf Ebrahimi       is,  the  start  of  the reported match is earlier than where the match
7463*22dc650dSSadaf Ebrahimi       started.
7464*22dc650dSSadaf Ebrahimi
7465*22dc650dSSadaf Ebrahimi   Simple assertions
7466*22dc650dSSadaf Ebrahimi
7467*22dc650dSSadaf Ebrahimi       The final use of backslash is for certain simple assertions. An  asser-
7468*22dc650dSSadaf Ebrahimi       tion  specifies a condition that has to be met at a particular point in
7469*22dc650dSSadaf Ebrahimi       a match, without consuming any characters from the subject string.  The
7470*22dc650dSSadaf Ebrahimi       use  of groups for more complicated assertions is described below.  The
7471*22dc650dSSadaf Ebrahimi       backslashed assertions are:
7472*22dc650dSSadaf Ebrahimi
7473*22dc650dSSadaf Ebrahimi         \b     matches at a word boundary
7474*22dc650dSSadaf Ebrahimi         \B     matches when not at a word boundary
7475*22dc650dSSadaf Ebrahimi         \A     matches at the start of the subject
7476*22dc650dSSadaf Ebrahimi         \Z     matches at the end of the subject
7477*22dc650dSSadaf Ebrahimi                 also matches before a newline at the end of the subject
7478*22dc650dSSadaf Ebrahimi         \z     matches only at the end of the subject
7479*22dc650dSSadaf Ebrahimi         \G     matches at the first matching position in the subject
7480*22dc650dSSadaf Ebrahimi
7481*22dc650dSSadaf Ebrahimi       Inside a character class, \b has a different meaning;  it  matches  the
7482*22dc650dSSadaf Ebrahimi       backspace  character.  If  any  other  of these assertions appears in a
7483*22dc650dSSadaf Ebrahimi       character class, an "invalid escape sequence" error is generated.
7484*22dc650dSSadaf Ebrahimi
7485*22dc650dSSadaf Ebrahimi       A word boundary is a position in the subject string where  the  current
7486*22dc650dSSadaf Ebrahimi       character  and  the previous character do not both match \w or \W (i.e.
7487*22dc650dSSadaf Ebrahimi       one matches \w and the other matches \W), or the start or  end  of  the
7488*22dc650dSSadaf Ebrahimi       string  if  the  first or last character matches \w, respectively. When
7489*22dc650dSSadaf Ebrahimi       PCRE2 is built with Unicode support, the meanings of \w and \W  can  be
7490*22dc650dSSadaf Ebrahimi       changed by setting the PCRE2_UCP option. When this is done, it also af-
7491*22dc650dSSadaf Ebrahimi       fects  \b and \B. Neither PCRE2 nor Perl has a separate "start of word"
7492*22dc650dSSadaf Ebrahimi       or "end of word" metasequence. However, whatever  follows  \b  normally
7493*22dc650dSSadaf Ebrahimi       determines  which  it  is. For example, the fragment \ba matches "a" at
7494*22dc650dSSadaf Ebrahimi       the start of a word.
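
       The word boundary definition can be written as a comparison of the
       characters on either side of a position. This Python sketch assumes
       the ASCII meaning of \w (letters, digits, and underscore, that is,
       PCRE2_UCP not set); the function names are hypothetical.

```python
def is_word_char(ch):
    """ASCII \\w: a letter, a digit, or underscore."""
    return ch.isascii() and (ch.isalnum() or ch == "_")


def at_word_boundary(subject, pos):
    """True where exactly one of the characters around pos is a word
    character. A missing neighbour (start or end of the subject) counts
    as a non-word character."""
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    return before != after
```

       For the subject "ab cd", the boundaries are at offsets 0, 2, 3, and
       5; \B is true at the remaining offsets.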
7495*22dc650dSSadaf Ebrahimi
7496*22dc650dSSadaf Ebrahimi       The \A, \Z, and \z assertions differ from  the  traditional  circumflex
7497*22dc650dSSadaf Ebrahimi       and dollar (described in the next section) in that they only ever match
7498*22dc650dSSadaf Ebrahimi       at  the  very start and end of the subject string, whatever options are
7499*22dc650dSSadaf Ebrahimi       set. Thus, they are independent of multiline mode. These  three  asser-
7500*22dc650dSSadaf Ebrahimi       tions  are  not  affected  by the PCRE2_NOTBOL or PCRE2_NOTEOL options,
7501*22dc650dSSadaf Ebrahimi       which affect only the behaviour of the circumflex and dollar  metachar-
7502*22dc650dSSadaf Ebrahimi       acters.  However,  if the startoffset argument of pcre2_match() is non-
7503*22dc650dSSadaf Ebrahimi       zero, indicating that matching is to start at a point  other  than  the
7504*22dc650dSSadaf Ebrahimi       beginning  of  the subject, \A can never match.  The difference between
7505*22dc650dSSadaf Ebrahimi       \Z and \z is that \Z matches before a newline at the end of the  string
7506*22dc650dSSadaf Ebrahimi       as well as at the very end, whereas \z matches only at the end.
7507*22dc650dSSadaf Ebrahimi
7508*22dc650dSSadaf Ebrahimi       The  \G assertion is true only when the current matching position is at
7509*22dc650dSSadaf Ebrahimi       the start point of the matching process, as specified by the  startoff-
7510*22dc650dSSadaf Ebrahimi       set  argument  of  pcre2_match().  It differs from \A when the value of
7511*22dc650dSSadaf Ebrahimi       startoffset is non-zero. By calling pcre2_match() multiple  times  with
7512*22dc650dSSadaf Ebrahimi       appropriate  arguments,  you  can  mimic Perl's /g option, and it is in
7513*22dc650dSSadaf Ebrahimi       this kind of implementation where \G can be useful.
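
       A loop of this kind can be sketched as follows. Python's re module
       has no \G, so the sketch uses Pattern.match(subject, pos), which is
       anchored at pos, as a stand-in for a pattern that begins with \G run
       with startoffset set to pos. The name scan_from is hypothetical, and
       the loop simply stops on an empty match rather than handling it.

```python
import re


def scan_from(pattern, subject, start=0):
    """Collect successive matches, each of which must begin exactly
    where the previous one ended (the role played by \\G)."""
    compiled = re.compile(pattern)
    pos = start
    found = []
    while pos < len(subject):
        m = compiled.match(subject, pos)  # anchored at pos
        if m is None or m.end() == pos:  # no match, or empty match
            break
        found.append(m.group())
        pos = m.end()  # next start offset
    return found
```

       With the pattern \d+,? and the subject "12,34,x56", the scan yields
       "12," and "34," and then stops at "x", because the next match would
       not begin at the current start offset.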
7514*22dc650dSSadaf Ebrahimi
7515*22dc650dSSadaf Ebrahimi       Note, however, that PCRE2's implementation of \G,  being  true  at  the
7516*22dc650dSSadaf Ebrahimi       starting  character  of  the matching process, is subtly different from
7517*22dc650dSSadaf Ebrahimi       Perl's, which defines it as true at the end of the previous  match.  In
7518*22dc650dSSadaf Ebrahimi       Perl,  these  can  be  different when the previously matched string was
7519*22dc650dSSadaf Ebrahimi       empty. Because PCRE2 does just one match at a time, it cannot reproduce
7520*22dc650dSSadaf Ebrahimi       this behaviour.
7521*22dc650dSSadaf Ebrahimi
7522*22dc650dSSadaf Ebrahimi       If all the alternatives of a pattern begin with \G, the  expression  is
7523*22dc650dSSadaf Ebrahimi       anchored to the starting match position, and the "anchored" flag is set
7524*22dc650dSSadaf Ebrahimi       in the compiled regular expression.
7525*22dc650dSSadaf Ebrahimi
7526*22dc650dSSadaf Ebrahimi
7527*22dc650dSSadaf EbrahimiCIRCUMFLEX AND DOLLAR
7528*22dc650dSSadaf Ebrahimi
7529*22dc650dSSadaf Ebrahimi       The  circumflex  and  dollar  metacharacters are zero-width assertions.
7530*22dc650dSSadaf Ebrahimi       That is, they test for a particular condition being true  without  con-
7531*22dc650dSSadaf Ebrahimi       suming any characters from the subject string. These two metacharacters
7532*22dc650dSSadaf Ebrahimi       are  concerned  with matching the starts and ends of lines. If the new-
7533*22dc650dSSadaf Ebrahimi       line convention is set so that only the two-character sequence CRLF  is
7534*22dc650dSSadaf Ebrahimi       recognized  as  a newline, isolated CR and LF characters are treated as
7535*22dc650dSSadaf Ebrahimi       ordinary data characters, and are not recognized as newlines.
7536*22dc650dSSadaf Ebrahimi
7537*22dc650dSSadaf Ebrahimi       Outside a character class, in the default matching mode, the circumflex
7538*22dc650dSSadaf Ebrahimi       character is an assertion that is true only  if  the  current  matching
7539*22dc650dSSadaf Ebrahimi       point  is  at the start of the subject string. If the startoffset argu-
7540*22dc650dSSadaf Ebrahimi       ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is  set,  circum-
7541*22dc650dSSadaf Ebrahimi       flex  can  never match if the PCRE2_MULTILINE option is unset. Inside a
7542*22dc650dSSadaf Ebrahimi       character class, circumflex has an entirely different meaning (see  be-
7543*22dc650dSSadaf Ebrahimi       low).
7544*22dc650dSSadaf Ebrahimi
7545*22dc650dSSadaf Ebrahimi       Circumflex  need  not be the first character of the pattern if a number
7546*22dc650dSSadaf Ebrahimi       of alternatives are involved, but it should be the first thing in  each
7547*22dc650dSSadaf Ebrahimi       alternative  in  which  it appears if the pattern is ever to match that
7548*22dc650dSSadaf Ebrahimi       branch. If all possible alternatives start with a circumflex, that  is,
7549*22dc650dSSadaf Ebrahimi       if  the  pattern  is constrained to match only at the start of the sub-
7550*22dc650dSSadaf Ebrahimi       ject, it is said to be an "anchored" pattern.  (There  are  also  other
7551*22dc650dSSadaf Ebrahimi       constructs that can cause a pattern to be anchored.)
7552*22dc650dSSadaf Ebrahimi
7553*22dc650dSSadaf Ebrahimi       The  dollar  character is an assertion that is true only if the current
7554*22dc650dSSadaf Ebrahimi       matching point is at the end of the subject string, or immediately  be-
7555*22dc650dSSadaf Ebrahimi       fore  a newline at the end of the string (by default), unless PCRE2_NO-
7556*22dc650dSSadaf Ebrahimi       TEOL is set. Note, however, that it does not actually  match  the  new-
7557*22dc650dSSadaf Ebrahimi       line.  Dollar need not be the last character of the pattern if a number
7558*22dc650dSSadaf Ebrahimi       of alternatives are involved, but it should be the  last  item  in  any
7559*22dc650dSSadaf Ebrahimi       branch  in which it appears. Dollar has no special meaning in a charac-
7560*22dc650dSSadaf Ebrahimi       ter class.
7561*22dc650dSSadaf Ebrahimi
7562*22dc650dSSadaf Ebrahimi       The meaning of dollar can be changed so that it  matches  only  at  the
7563*22dc650dSSadaf Ebrahimi       very  end  of the string, by setting the PCRE2_DOLLAR_ENDONLY option at
7564*22dc650dSSadaf Ebrahimi       compile time. This does not affect the \Z assertion.
7565*22dc650dSSadaf Ebrahimi
7566*22dc650dSSadaf Ebrahimi       The meanings of the circumflex and dollar metacharacters are changed if
7567*22dc650dSSadaf Ebrahimi       the PCRE2_MULTILINE option is set. When this  is  the  case,  a  dollar
7568*22dc650dSSadaf Ebrahimi       character  matches before any newlines in the string, as well as at the
7569*22dc650dSSadaf Ebrahimi       very end, and a circumflex matches immediately after internal  newlines
7570*22dc650dSSadaf Ebrahimi       as  well as at the start of the subject string. It does not match after
7571*22dc650dSSadaf Ebrahimi       a newline that ends the string, for compatibility with  Perl.  However,
7572*22dc650dSSadaf Ebrahimi       this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.
7573*22dc650dSSadaf Ebrahimi
7574*22dc650dSSadaf Ebrahimi       For  example, the pattern /^abc$/ matches the subject string "def\nabc"
7575*22dc650dSSadaf Ebrahimi       (where \n represents a newline) in multiline mode, but  not  otherwise.
7576*22dc650dSSadaf Ebrahimi       Consequently,  patterns  that  are anchored in single line mode because
7577*22dc650dSSadaf Ebrahimi       all branches start with ^ are not anchored in  multiline  mode,  and  a
7578*22dc650dSSadaf Ebrahimi       match  for  circumflex  is  possible  when  the startoffset argument of
7579*22dc650dSSadaf Ebrahimi       pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option  is  ignored
7580*22dc650dSSadaf Ebrahimi       if PCRE2_MULTILINE is set.
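
       The positions at which circumflex and dollar can be true under these
       options can be enumerated directly. The Python sketch below is a
       simplified model with hypothetical function names; it assumes LF as
       the only newline, a startoffset of zero, and that neither
       PCRE2_NOTBOL nor PCRE2_NOTEOL is set.

```python
def circumflex_positions(subject, multiline=False, alt_circumflex=False):
    """Offsets at which ^ is true."""
    positions = [0]
    if multiline:
        for i, ch in enumerate(subject):
            if ch == "\n":
                after = i + 1
                # By default ^ does not match after a final newline.
                if after < len(subject) or alt_circumflex:
                    positions.append(after)
    return positions


def dollar_positions(subject, multiline=False, dollar_endonly=False):
    """Offsets at which $ is true."""
    end = len(subject)
    if multiline:
        # Before every newline and at the very end; in this mode the
        # dollar_endonly setting is ignored.
        return [i for i, ch in enumerate(subject) if ch == "\n"] + [end]
    positions = []
    if not dollar_endonly and subject.endswith("\n"):
        positions.append(end - 1)
    positions.append(end)
    return positions
```

       For "def\nabc", circumflex is true only at offset 0 in the default
       mode but at offsets 0 and 4 in multiline mode, which is why /^abc$/
       matches there only when PCRE2_MULTILINE is set.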
7581*22dc650dSSadaf Ebrahimi
7582*22dc650dSSadaf Ebrahimi       When  the  newline  convention (see "Newline conventions" below) recog-
7583*22dc650dSSadaf Ebrahimi       nizes the two-character sequence CRLF as a newline, this is  preferred,
7584*22dc650dSSadaf Ebrahimi       even  if  the  single  characters CR and LF are also recognized as new-
7585*22dc650dSSadaf Ebrahimi       lines. For example, if the newline convention  is  "any",  a  multiline
7586*22dc650dSSadaf Ebrahimi       mode  circumflex matches before "xyz" in the string "abc\r\nxyz" rather
7587*22dc650dSSadaf Ebrahimi       than after CR, even though CR on its own is a valid newline.  (It  also
7588*22dc650dSSadaf Ebrahimi       matches at the very start of the string, of course.)
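
       The preference for CRLF can be illustrated by a scanner that tries
       the two-character sequence first. This Python sketch is a simplified
       model: it recognizes only CR, LF, and CRLF (the "any" convention
       also recognizes other Unicode line endings, which are omitted here),
       and multiline_line_starts is a hypothetical name.

```python
def multiline_line_starts(subject):
    """Offsets at which a multiline-mode ^ can match when CR, LF, and
    CRLF are all newlines and CRLF is preferred."""
    starts = [0]
    i = 0
    n = len(subject)
    while i < n:
        if subject.startswith("\r\n", i):
            i += 2  # CRLF is consumed as a single newline
        elif subject[i] in "\r\n":
            i += 1  # isolated CR or LF
        else:
            i += 1
            continue
        if i < n:  # no line start after a newline that ends the string
            starts.append(i)
    return starts
```

       For "abc\r\nxyz" the result is offsets 0 and 5; offset 4, between
       CR and LF, is not a line start.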
7589*22dc650dSSadaf Ebrahimi
7590*22dc650dSSadaf Ebrahimi       Note  that  the sequences \A, \Z, and \z can be used to match the start
7591*22dc650dSSadaf Ebrahimi       and end of the subject in both modes, and if all branches of a  pattern
7592*22dc650dSSadaf Ebrahimi       start  with \A it is always anchored, whether or not PCRE2_MULTILINE is
7593*22dc650dSSadaf Ebrahimi       set.
7594*22dc650dSSadaf Ebrahimi
7595*22dc650dSSadaf Ebrahimi
7596*22dc650dSSadaf EbrahimiFULL STOP (PERIOD, DOT) AND \N
7597*22dc650dSSadaf Ebrahimi
7598*22dc650dSSadaf Ebrahimi       Outside a character class, a dot in the pattern matches any one charac-
7599*22dc650dSSadaf Ebrahimi       ter in the subject string except (by default) a character  that  signi-
7600*22dc650dSSadaf Ebrahimi       fies the end of a line. One or more characters may be specified as line
7601*22dc650dSSadaf Ebrahimi       terminators (see "Newline conventions" above).
7602*22dc650dSSadaf Ebrahimi
7603*22dc650dSSadaf Ebrahimi       Dot  never matches a single line-ending character. When the two-charac-
7604*22dc650dSSadaf Ebrahimi       ter sequence CRLF is the only line ending, dot does not match CR if  it
7605*22dc650dSSadaf Ebrahimi       is  immediately followed by LF, but otherwise it matches all characters
7606*22dc650dSSadaf Ebrahimi       (including isolated CRs and LFs). When ANYCRLF  is  selected  for  line
       endings, no occurrences of CR or LF match dot. When all Unicode line
7608*22dc650dSSadaf Ebrahimi       endings are being recognized, dot does not match CR or LF or any of the
7609*22dc650dSSadaf Ebrahimi       other line ending characters.
7610*22dc650dSSadaf Ebrahimi
7611*22dc650dSSadaf Ebrahimi       The behaviour of dot with regard to newlines can  be  changed.  If  the
7612*22dc650dSSadaf Ebrahimi       PCRE2_DOTALL  option  is  set, a dot matches any one character, without
7613*22dc650dSSadaf Ebrahimi       exception.  If the two-character sequence CRLF is present in  the  sub-
7614*22dc650dSSadaf Ebrahimi       ject string, it takes two dots to match it.
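
       The interaction between dot, the newline convention, and
       PCRE2_DOTALL can be summarized as a predicate. This Python sketch is
       a model of the text above with a hypothetical function name; it
       covers only the LF, CR, CRLF, and ANYCRLF conventions.

```python
def dot_matches(subject, pos, newline="LF", dotall=False):
    """Return True if a dot would match the character at pos."""
    if pos >= len(subject):
        return False  # dot always needs a character to consume
    if dotall:
        return True  # any one character, without exception
    ch = subject[pos]
    if newline == "LF":
        return ch != "\n"
    if newline == "CR":
        return ch != "\r"
    if newline == "CRLF":
        # Only a CR that is immediately followed by LF is refused.
        return not subject.startswith("\r\n", pos)
    if newline == "ANYCRLF":
        return ch not in "\r\n"
    raise ValueError("convention not modelled: %r" % newline)
```

       Under the CRLF convention, dot_matches("\rx", 0, "CRLF") is true
       because the CR is isolated, while dot_matches("\r\n", 0, "CRLF") is
       false.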
7615*22dc650dSSadaf Ebrahimi
7616*22dc650dSSadaf Ebrahimi       The  handling of dot is entirely independent of the handling of circum-
7617*22dc650dSSadaf Ebrahimi       flex and dollar, the only relationship being  that  they  both  involve
7618*22dc650dSSadaf Ebrahimi       newlines. Dot has no special meaning in a character class.
7619*22dc650dSSadaf Ebrahimi
7620*22dc650dSSadaf Ebrahimi       The  escape  sequence  \N when not followed by an opening brace behaves
7621*22dc650dSSadaf Ebrahimi       like a dot, except that it is not affected by the PCRE2_DOTALL  option.
7622*22dc650dSSadaf Ebrahimi       In  other words, it matches any character except one that signifies the
7623*22dc650dSSadaf Ebrahimi       end of a line.
7624*22dc650dSSadaf Ebrahimi
7625*22dc650dSSadaf Ebrahimi       When \N is followed by an opening brace it has a different meaning. See
7626*22dc650dSSadaf Ebrahimi       the section entitled "Non-printing characters" above for details.  Perl
7627*22dc650dSSadaf Ebrahimi       also  uses  \N{name}  to specify characters by Unicode name; PCRE2 does
7628*22dc650dSSadaf Ebrahimi       not support this.
7629*22dc650dSSadaf Ebrahimi
7630*22dc650dSSadaf Ebrahimi
7631*22dc650dSSadaf EbrahimiMATCHING A SINGLE CODE UNIT
7632*22dc650dSSadaf Ebrahimi
7633*22dc650dSSadaf Ebrahimi       Outside a character class, the escape sequence \C matches any one  code
7634*22dc650dSSadaf Ebrahimi       unit,  whether or not a UTF mode is set. In the 8-bit library, one code
7635*22dc650dSSadaf Ebrahimi       unit is one byte; in the 16-bit library it is a  16-bit  unit;  in  the
7636*22dc650dSSadaf Ebrahimi       32-bit  library  it  is  a 32-bit unit. Unlike a dot, \C always matches
7637*22dc650dSSadaf Ebrahimi       line-ending characters. The feature is provided in  Perl  in  order  to
7638*22dc650dSSadaf Ebrahimi       match individual bytes in UTF-8 mode, but it is unclear how it can use-
7639*22dc650dSSadaf Ebrahimi       fully be used.
7640*22dc650dSSadaf Ebrahimi
7641*22dc650dSSadaf Ebrahimi       Because  \C  breaks  up characters into individual code units, matching
7642*22dc650dSSadaf Ebrahimi       one unit with \C in UTF-8 or UTF-16 mode means that  the  rest  of  the
7643*22dc650dSSadaf Ebrahimi       string may start with a malformed UTF character. This has undefined re-
7644*22dc650dSSadaf Ebrahimi       sults, because PCRE2 assumes that it is matching character by character
7645*22dc650dSSadaf Ebrahimi       in a valid UTF string (by default it checks the subject string's valid-
7646*22dc650dSSadaf Ebrahimi       ity  at  the  start  of  processing  unless  the  PCRE2_NO_UTF_CHECK or
7647*22dc650dSSadaf Ebrahimi       PCRE2_MATCH_INVALID_UTF option is used).
7648*22dc650dSSadaf Ebrahimi
7649*22dc650dSSadaf Ebrahimi       An  application  can  lock  out  the  use  of   \C   by   setting   the
7650*22dc650dSSadaf Ebrahimi       PCRE2_NEVER_BACKSLASH_C  option  when  compiling  a pattern. It is also
7651*22dc650dSSadaf Ebrahimi       possible to build PCRE2 with the use of \C permanently disabled.
7652*22dc650dSSadaf Ebrahimi
7653*22dc650dSSadaf Ebrahimi       PCRE2 does not allow \C to appear in lookbehind  assertions  (described
7654*22dc650dSSadaf Ebrahimi       below)  in UTF-8 or UTF-16 modes, because this would make it impossible
7655*22dc650dSSadaf Ebrahimi       to calculate the length of  the  lookbehind.  Neither  the  alternative
7656*22dc650dSSadaf Ebrahimi       matching function pcre2_dfa_match() nor the JIT optimizer support \C in
7657*22dc650dSSadaf Ebrahimi       these UTF modes.  The former gives a match-time error; the latter fails
7658*22dc650dSSadaf Ebrahimi       to optimize and so the match is always run using the interpreter.
7659*22dc650dSSadaf Ebrahimi
7660*22dc650dSSadaf Ebrahimi       In  the  32-bit  library, however, \C is always supported (when not ex-
7661*22dc650dSSadaf Ebrahimi       plicitly locked out) because it always  matches  a  single  code  unit,
7662*22dc650dSSadaf Ebrahimi       whether or not UTF-32 is specified.
7663*22dc650dSSadaf Ebrahimi
7664*22dc650dSSadaf Ebrahimi       In general, the \C escape sequence is best avoided. However, one way of
7665*22dc650dSSadaf Ebrahimi       using  it  that avoids the problem of malformed UTF-8 or UTF-16 charac-
7666*22dc650dSSadaf Ebrahimi       ters is to use a lookahead to check the length of the  next  character,
7667*22dc650dSSadaf Ebrahimi       as  in  this  pattern,  which could be used with a UTF-8 string (ignore
7668*22dc650dSSadaf Ebrahimi       white space and line breaks):
7669*22dc650dSSadaf Ebrahimi
7670*22dc650dSSadaf Ebrahimi         (?| (?=[\x00-\x7f])(\C) |
7671*22dc650dSSadaf Ebrahimi             (?=[\x80-\x{7ff}])(\C)(\C) |
7672*22dc650dSSadaf Ebrahimi             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
7673*22dc650dSSadaf Ebrahimi             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
7674*22dc650dSSadaf Ebrahimi
7675*22dc650dSSadaf Ebrahimi       In this example, a group that starts  with  (?|  resets  the  capturing
7676*22dc650dSSadaf Ebrahimi       parentheses  numbers in each alternative (see "Duplicate Group Numbers"
7677*22dc650dSSadaf Ebrahimi       below). The assertions at the start of each branch check the next UTF-8
7678*22dc650dSSadaf Ebrahimi       character for values whose encoding uses 1, 2, 3, or 4  bytes,  respec-
7679*22dc650dSSadaf Ebrahimi       tively.  The  character's individual bytes are then captured by the ap-
7680*22dc650dSSadaf Ebrahimi       propriate number of \C groups.
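
       The code point ranges used in the four lookaheads correspond to the
       UTF-8 encoded lengths. The Python sketch below (utf8_length is a
       hypothetical name) restates those ranges so that they can be checked
       against an actual encoder.

```python
def utf8_length(code_point):
    """Number of bytes used to encode the code point, following the
    ranges in the pattern above."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x1FFFFF:
        return 4
    raise ValueError("code point out of range")
```

       For any non-surrogate code point cp up to U+10FFFF, utf8_length(cp)
       is expected to equal len(chr(cp).encode("utf-8")).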
7681*22dc650dSSadaf Ebrahimi
7682*22dc650dSSadaf Ebrahimi
7683*22dc650dSSadaf EbrahimiSQUARE BRACKETS AND CHARACTER CLASSES
7684*22dc650dSSadaf Ebrahimi
7685*22dc650dSSadaf Ebrahimi       An opening square bracket introduces a character class, terminated by a
7686*22dc650dSSadaf Ebrahimi       closing square bracket. A closing square bracket on its own is not spe-
7687*22dc650dSSadaf Ebrahimi       cial by default.  If a closing square bracket is required as  a  member
7688*22dc650dSSadaf Ebrahimi       of the class, it should be the first data character in the class (after
7689*22dc650dSSadaf Ebrahimi       an  initial  circumflex,  if present) or escaped with a backslash. This
7690*22dc650dSSadaf Ebrahimi       means that, by default, an empty class cannot be defined.  However,  if
7691*22dc650dSSadaf Ebrahimi       the  PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
7692*22dc650dSSadaf Ebrahimi       the start does end the (empty) class.
7693*22dc650dSSadaf Ebrahimi
7694*22dc650dSSadaf Ebrahimi       A character class matches a single character in the subject. A  matched
7695*22dc650dSSadaf Ebrahimi       character must be in the set of characters defined by the class, unless
7696*22dc650dSSadaf Ebrahimi       the  first  character in the class definition is a circumflex, in which
7697*22dc650dSSadaf Ebrahimi       case the subject character must not be in the set defined by the class.
7698*22dc650dSSadaf Ebrahimi       If a circumflex is actually required as a member of the  class,  ensure
7699*22dc650dSSadaf Ebrahimi       it is not the first character, or escape it with a backslash.
7700*22dc650dSSadaf Ebrahimi
7701*22dc650dSSadaf Ebrahimi       For  example, the character class [aeiou] matches any lower case vowel,
7702*22dc650dSSadaf Ebrahimi       while [^aeiou] matches any character that is not a  lower  case  vowel.
7703*22dc650dSSadaf Ebrahimi       Note that a circumflex is just a convenient notation for specifying the
7704*22dc650dSSadaf Ebrahimi       characters  that  are in the class by enumerating those that are not. A
7705*22dc650dSSadaf Ebrahimi       class that starts with a circumflex is not an assertion; it still  con-
7706*22dc650dSSadaf Ebrahimi       sumes  a  character  from the subject string, and therefore it fails if
7707*22dc650dSSadaf Ebrahimi       the current pointer is at the end of the string.
7708*22dc650dSSadaf Ebrahimi
7709*22dc650dSSadaf Ebrahimi       Characters in a class may be specified by their code points  using  \o,
7710*22dc650dSSadaf Ebrahimi       \x,  or \N{U+hh..} in the usual way. When caseless matching is set, any
7711*22dc650dSSadaf Ebrahimi       letters in a class represent both their upper case and lower case  ver-
7712*22dc650dSSadaf Ebrahimi       sions,  so  for example, a caseless [aeiou] matches "A" as well as "a",
7713*22dc650dSSadaf Ebrahimi       and a caseless [^aeiou] does not match "A", whereas a  caseful  version
7714*22dc650dSSadaf Ebrahimi       would.  Note that there are two ASCII characters, K and S, that, in ad-
7715*22dc650dSSadaf Ebrahimi       dition to their lower case ASCII equivalents, are case-equivalent  with
7716*22dc650dSSadaf Ebrahimi       Unicode  U+212A (Kelvin sign) and U+017F (long S) respectively when ei-
7717*22dc650dSSadaf Ebrahimi       ther PCRE2_UTF or PCRE2_UCP is set.
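
       Negation and caseless matching combine as shown in this Python
       sketch. It is an approximation with a hypothetical function name:
       it uses str.casefold(), which is Python's full case folding and is
       not identical to the case equivalence that PCRE2 applies.

```python
def class_matches(members, ch, negated=False, caseless=False):
    """Return True if the single character ch is matched by a class
    whose literal members are given as a string."""
    if caseless:
        in_set = ch.casefold() in {m.casefold() for m in members}
    else:
        in_set = ch in members
    # A leading circumflex inverts the membership test.
    return in_set != negated
```

       In this model, class_matches("k", "\u212a", caseless=True) is true,
       which corresponds to the Kelvin sign being case-equivalent to K.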
7718*22dc650dSSadaf Ebrahimi
       Characters that might indicate line breaks are  never  treated  in  any
       special  way  when matching character classes, whatever line-ending se-
       quence is  in  use,  and  whatever  setting  of  the  PCRE2_DOTALL  and
       PCRE2_MULTILINE  options  is  used. A class such as [^a] always matches
       one of these characters.

       The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
       \S, \v, \V, \w, and \W may appear in a character  class,  and  add  the
       characters  that  they  match  to  the  class.  For example, [\dABCDEF]
       matches any hexadecimal digit. In UTF modes, the PCRE2_UCP  option  af-
       fects the meanings of \d, \s, \w and their upper case partners, just as
       it does when they appear outside a character class, as described in the
       section  entitled  "Generic character types" above. The escape sequence
       \b has a different meaning inside a character  class;  it  matches  the
       backspace  character.  The sequences \B, \R, and \X are not special in-
       side a character class. Like any other unrecognized  escape  sequences,
       they  cause  an  error. The same is true for \N when not followed by an
       opening brace.

       The minus (hyphen) character can be used to specify a range of  charac-
       ters  in  a  character class. For example, [d-m] matches any letter be-
       tween d and m, inclusive. If a minus character is required in a  class,
       it  must  be  escaped with a backslash or appear in a position where it
       cannot be interpreted as indicating a range, typically as the first  or
       last character in the class, or immediately after a range. For example,
       [b-d-z] matches letters in the range b to d, a hyphen character, or z.

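       The hyphen placement rules can be checked mechanically. The sketch below
       uses  Python's  re  module  as  a  stand-in  (an assumption; for these
       particular classes its parsing agrees with the rules described  above).
       The helper name is hypothetical.

```python
import re

def matched_chars(char_class, candidates):
    """Return the single characters from candidates that char_class matches."""
    return [ch for ch in candidates if re.fullmatch(char_class, ch)]

print(matched_chars(r"[d-m]", "cdmn-"))       # range d..m, inclusive
print(matched_chars(r"[b-d-z]", "abcde-yz"))  # range b..d, then literal "-" and "z"
print(matched_chars(r"[-az]", "-abz"))        # hyphen first: literal
print(matched_chars(r"[a\-z]", "-abz"))       # escaped hyphen: literal
```
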
       Perl treats a hyphen as a literal if it appears before or after a POSIX
       class (see below) or before or after a character type escape such as \d
       or  \H.  However, unless the hyphen is the last character in the class,
       Perl outputs a warning in its warning mode, as this is  most  likely  a
       user  error. As PCRE2 has no facility for warning, an error is given in
       these cases.

       It is not possible to have the literal character "]" as the end charac-
       ter of a range. A pattern such as [W-]46] is interpreted as a class  of
       two  characters ("W" and "-") followed by a literal string "46]", so it
       would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
       backslash  it is interpreted as the end of range, so [W-\]46] is inter-
       preted as a class containing a range followed by two other  characters.
       The  octal or hexadecimal representation of "]" can also be used to end
       a range.

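       The difference between the unescaped and escaped "]" is easy to see  by
       experiment.  The  sketch  below  again uses Python's re module, which is
       assumed here only because it reads these three patterns the same way as
       described above; it is not PCRE2 itself.

```python
import re

unescaped = re.compile(r"[W-]46]")     # class of "W" and "-", then literal "46]"
escaped = re.compile(r"[W-\]46]")      # class: range W..], plus "4" and "6"
hex_form = re.compile(r"[W-\x5d46]")   # same class, "]" written in hexadecimal

print([s for s in ("W46]", "-46]", "X") if unescaped.fullmatch(s)])
print([s for s in ("W", "]", "Z", "4", "6", "-") if escaped.fullmatch(s)])
print([s for s in ("W", "]", "Z", "4", "6", "-") if hex_form.fullmatch(s)])
```
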
       Ranges normally include all code points between the start and end char-
       acters, inclusive. They can also be used for code points specified  nu-
       merically,  for  example [\000-\037]. Ranges can include any characters
       that are valid for the current mode. In any  UTF  mode,  the  so-called
       "surrogate"  characters (those whose code points lie between 0xd800 and
       0xdfff inclusive) may not  be  specified  explicitly  by  default  (the
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES  option  disables this check). How-
       ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
       are always permitted.

       There is a special case in EBCDIC environments  for  ranges  whose  end
       points are both specified as literal letters in the same case. For com-
       patibility  with Perl, EBCDIC code points within the range that are not
       letters are omitted. For example, [h-k] matches only  four  characters,
       even though the codes for h and k are 0x88 and 0x92, a range of 11 code
       points.  However,  if  the range is specified numerically, for example,
       [\x88-\x92] or [h-\x92], all code points are included.

       If a range that includes letters is used when caseless matching is set,
       it matches the letters in either case. For example, [W-c] is equivalent
       to [][\\^_`wxyzabc], matched caselessly, and  in  a  non-UTF  mode,  if
       character  tables  for  a French locale are in use, [\xc8-\xcb] matches
       accented E characters in both cases.

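       The  stated  equivalence  for  [W-c]  can be confirmed by enumerating the
       ASCII characters each class accepts. This is a minimal sketch that  as-
       sumes  Python's  re  module  with re.IGNORECASE as a stand-in for PCRE2
       caseless matching; only the ASCII range is examined.

```python
import re

range_class = re.compile(r"[W-c]", re.IGNORECASE)
explicit = re.compile(r"[][\\^_`wxyzabc]", re.IGNORECASE)

ascii_chars = [chr(c) for c in range(128)]
by_range = {ch for ch in ascii_chars if range_class.fullmatch(ch)}
by_list = {ch for ch in ascii_chars if explicit.fullmatch(ch)}

print("".join(sorted(by_range)))
print(by_range == by_list)
```
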
       A circumflex can conveniently be used with  the  upper  case  character
       types  to specify a more restricted set of characters than the matching
       lower case type.  For example, the class [^\W_] matches any  letter  or
       digit, but not underscore, whereas [\w] includes underscore. A positive
       character class should be read as "something OR something OR ..." and a
       negative class as "NOT something AND NOT something AND NOT ...".

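       The "NOT ... AND NOT ..." reading of [^\W_] can be illustrated as follows.
       The sketch assumes Python's re module in ASCII mode as a stand-in for  a
       default (non-UCP) PCRE2 compile.

```python
import re

alnum_only = re.compile(r"[^\W_]", re.ASCII)  # NOT a non-word char AND NOT underscore
word_char = re.compile(r"[\w]", re.ASCII)     # word characters, underscore included

for ch in ("a", "7", "_", "-"):
    print(repr(ch), bool(alnum_only.fullmatch(ch)), bool(word_char.fullmatch(ch)))
```
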
       The  only  metacharacters  that are recognized in character classes are
       backslash, hyphen (only where it can be  interpreted  as  specifying  a
       range),  circumflex  (only  at the start), opening square bracket (only
       when it can be interpreted as introducing a POSIX class name, or for  a
       special  compatibility  feature  -  see the next two sections), and the
       terminating closing square bracket.  However,  escaping  other  non-al-
       phanumeric characters does no harm.


POSIX CHARACTER CLASSES

       Perl supports the POSIX notation for character classes. This uses names
       enclosed  by [: and :] within the enclosing square brackets. PCRE2 also
       supports this notation. For example,

         [01[:alpha:]%]

       matches "0", "1", any alphabetic character, or "%". The supported class
       names are:

         alnum    letters and digits
         alpha    letters
         ascii    character codes 0 - 127
         blank    space or tab only
         cntrl    control characters
         digit    decimal digits (same as \d)
         graph    printing characters, excluding space
         lower    lower case letters
         print    printing characters, including space
         punct    printing characters, excluding letters and digits and space
         space    white space (the same as \s from PCRE 8.34)
         upper    upper case letters
         word     "word" characters (same as \w)
         xdigit   hexadecimal digits

       The default "space" characters are HT (9), LF (10), VT (11),  FF  (12),
       CR  (13),  and space (32). If locale-specific matching is taking place,
       the list of space characters may be different; there may  be  fewer  or
       more  of  them.  "Space" and \s match the same set of characters, as do
       "word" and \w.

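       The six default "space" characters listed above are the same  six  that
       Python's  \s  accepts in ASCII mode, so that engine can serve as a quick
       cross-check. This is only a sketch under that assumption; it  says  no-
       thing about locale-specific tables.

```python
import re

# Code points and names taken from the list above.
DEFAULT_SPACE = {9: "HT", 10: "LF", 11: "VT", 12: "FF", 13: "CR", 32: "space"}

# Characters below 256 that Python's \s accepts when restricted to ASCII.
ascii_space = {c for c in range(256) if re.fullmatch(r"\s", chr(c), re.ASCII)}

print(sorted(ascii_space))
print(ascii_space == set(DEFAULT_SPACE))
```
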
       The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
       from  Perl  5.8. Another Perl extension is negation, which is indicated
       by a ^ character after the colon. For example,

         [12[:^digit:]]

       matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
       these are not supported, and an error is given if they are encountered.

       By default, characters with values greater than 127 do not match any of
       the POSIX character classes, although this may be different for charac-
       ters in the range 128-255 when locale-specific matching  is  happening.
       However,  in UCP mode, unless certain options are set (see below), some
       of the classes are changed so that  Unicode  character  properties  are
       used. This is achieved by replacing POSIX classes with other sequences,
       as follows:

         [:alnum:]  becomes  \p{Xan}
         [:alpha:]  becomes  \p{L}
         [:blank:]  becomes  \h
         [:cntrl:]  becomes  \p{Cc}
         [:digit:]  becomes  \p{Nd}
         [:lower:]  becomes  \p{Ll}
         [:space:]  becomes  \p{Xps}
         [:upper:]  becomes  \p{Lu}
         [:word:]   becomes  \p{Xwd}

       Negated  versions,  such as [:^alpha:] use \P instead of \p. Four other
       POSIX classes are handled specially in UCP mode:

       [:graph:] This matches characters that have glyphs that mark  the  page
                 when printed. In Unicode property terms, it matches all char-
                 acters with the L, M, N, P, S, or Cf properties, except for:

                   U+061C           Arabic Letter Mark
                   U+180E           Mongolian Vowel Separator
                   U+2066 - U+2069  Various "isolate"s

       [:print:] This  matches  the  same  characters  as [:graph:] plus space
                 characters that are not controls, that  is,  characters  with
                 the Zs property.

       [:punct:] This matches all characters that have the Unicode P (punctua-
                 tion)  property,  plus those characters with code points less
                 than 256 that have the S (Symbol) property.

       [:xdigit:]
                 In addition  to  the  ASCII  hexadecimal  digits,  this  also
                 matches  the  "fullwidth" versions of those characters, whose
                 Unicode code points start at U+FF10. This is  a  change  that
                 was made in PCRE2 release 10.43 for Perl compatibility.

       The  other  POSIX  classes  are  unchanged by PCRE2_UCP, and match only
       characters with code points less than 256.

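       The UCP replacement table can be expressed as a small lookup. The sketch
       below is not PCRE2 code: the function name is hypothetical,  the  table
       is  copied  from  the text above, and the treatment of a negated [:blank:]
       as \H is an assumption made by analogy with the \p to \P rule.

```python
import re

# Replacement table copied from the text; applies in UCP mode only.
UCP_REPLACEMENTS = {
    "alnum": r"\p{Xan}", "alpha": r"\p{L}", "blank": r"\h",
    "cntrl": r"\p{Cc}", "digit": r"\p{Nd}", "lower": r"\p{Ll}",
    "space": r"\p{Xps}", "upper": r"\p{Lu}", "word": r"\p{Xwd}",
}

def ucp_equivalent(posix_item):
    """Map '[:name:]' or '[:^name:]' to its UCP-mode replacement sequence.

    Returns None for class names that the table does not cover.
    """
    m = re.fullmatch(r"\[:(\^?)([a-z]+):\]", posix_item)
    if m is None:
        raise ValueError(f"not a POSIX class item: {posix_item!r}")
    negated, name = m.group(1) == "^", m.group(2)
    replacement = UCP_REPLACEMENTS.get(name)
    if replacement is None:
        return None
    if negated:
        # \p becomes \P; \h is assumed to become \H in the same way.
        return replacement[0] + replacement[1].upper() + replacement[2:]
    return replacement

print(ucp_equivalent("[:alpha:]"), ucp_equivalent("[:^alpha:]"))
print(ucp_equivalent("[:graph:]"))
```
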
       There are two options that can be used to restrict the POSIX classes to
       ASCII  characters  when  PCRE2_UCP  is  set.   The   option   PCRE2_EX-
       TRA_ASCII_DIGIT  affects  just  [:digit:] and [:xdigit:]. Within a pat-
       tern, this can be set and unset by  (?aT)  and  (?-aT).  The  PCRE2_EX-
       TRA_ASCII_POSIX  option  disables UCP processing for all POSIX classes,
       including [:digit:] and [:xdigit:]. Within a pattern, (?aP) and  (?-aP)
       set and unset both these options for consistency.


COMPATIBILITY FEATURE FOR WORD BOUNDARIES

       In  the POSIX.2 compliant library that was included in 4.4BSD Unix, the
       ugly syntax [[:<:]] and [[:>:]] is used for matching  "start  of  word"
       and "end of word". PCRE2 treats these items as follows:

         [[:<:]]  is converted to  \b(?=\w)
         [[:>:]]  is converted to  \b(?<=\w)

       Only these exact character sequences are recognized. A sequence such as
       [a[:<:]b] provokes an error for an unrecognized POSIX class name.  This
       support is not compatible with Perl. It is provided to help  migrations
       from other environments, and is best not used in any new patterns. Note
       that  \b matches at the start and the end of a word (see "Simple asser-
       tions" above), and in a Perl-style pattern the preceding  or  following
       character  normally shows which is wanted, without the need for the as-
       sertions that are used above in order to give exactly the POSIX  behav-
       iour.  Note  also  that  the PCRE2_UCP option changes the meaning of \w
       (and therefore \b) by default, so  it  also  affects  these  POSIX  se-
       quences.


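       The two converted forms are ordinary assertions, so their effect can be
       shown  with  any engine that supports \b, lookahead, and lookbehind. The
       sketch below assumes Python's re module as such an engine; it does  not
       use the [[:<:]] and [[:>:]] spellings, which Python does not accept.

```python
import re

text = "the cat"

# \b(?=\w) is what [[:<:]] is converted to: a boundary followed by a word character.
word_starts = [m.start() for m in re.finditer(r"\b(?=\w)", text)]
# \b(?<=\w) is what [[:>:]] is converted to: a boundary preceded by a word character.
word_ends = [m.start() for m in re.finditer(r"\b(?<=\w)", text)]

print(word_starts, word_ends)
```
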
VERTICAL BAR

       Vertical  bar characters are used to separate alternative patterns. For
       example, the pattern

         gilbert|sullivan

       matches either "gilbert" or "sullivan". Any number of alternatives  may
       appear,  and  an  empty  alternative  is  permitted (matching the empty
       string). The matching process tries each alternative in turn, from left
       to right, and the first one that succeeds is used. If the  alternatives
       are  within a group (defined below), "succeeds" means matching the rest
       of the main pattern as well as the alternative in the group.


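       The left-to-right rule, and what "succeeds" means inside a  group,  can
       be  demonstrated  briefly.  The sketch assumes Python's re module, whose
       backtracking matcher orders alternatives in the same way;  it  is  used
       only as a stand-in for PCRE2.

```python
import re

first = re.match(r"a|ab", "ab").group()              # "a" is tried first and succeeds
reordered = re.match(r"ab|a", "ab").group()          # now "ab" is tried first
in_group = re.fullmatch(r"(a|ab)c", "abc").group(1)  # "a" fails because "c" cannot follow
empty_ok = re.fullmatch(r"gilbert|sullivan|", "") is not None  # empty alternative

print(first, reordered, in_group, empty_ok)
```
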
INTERNAL OPTION SETTING

       The settings of several options can be changed within a  pattern  by  a
       sequence  of  letters  enclosed between "(?" and ")". The following are
       Perl-compatible, and are described in detail in the pcre2api documenta-
       tion. The option letters are:

         i  for PCRE2_CASELESS
         m  for PCRE2_MULTILINE
         n  for PCRE2_NO_AUTO_CAPTURE
         s  for PCRE2_DOTALL
         x  for PCRE2_EXTENDED
         xx for PCRE2_EXTENDED_MORE

       For example, (?im) sets caseless, multiline matching. It is also possi-
       ble to unset these options by preceding the relevant letters with a hy-
       phen, for example (?-im). The two "extended" options are  not  indepen-
       dent; unsetting either one cancels the effects of both of them.

       A   combined  setting  and  unsetting  such  as  (?im-sx),  which  sets
       PCRE2_CASELESS and PCRE2_MULTILINE  while  unsetting  PCRE2_DOTALL  and
       PCRE2_EXTENDED,  is  also  permitted. Only one hyphen may appear in the
       options string. If a letter appears both before and after  the  hyphen,
       the  option  is unset. An empty options setting "(?)" is allowed. Need-
       less to say, it has no effect.

       If the first character following (? is a circumflex, it causes  all  of
       the  above  options  to  be unset. Letters may follow the circumflex to
       cause some options to be re-instated, but a hyphen may not appear.

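       The rules for the letters between "(?" and ")" amount to a  small  state
       update.  The following sketch models them in plain Python. It is not how
       PCRE2 implements option handling; the function names  are  hypothetical,
       options are represented by their names as strings, and only the six Perl-
       compatible letters listed above are handled.

```python
PERL_OPTIONS = {
    "i": "PCRE2_CASELESS", "m": "PCRE2_MULTILINE", "n": "PCRE2_NO_AUTO_CAPTURE",
    "s": "PCRE2_DOTALL", "x": "PCRE2_EXTENDED", "xx": "PCRE2_EXTENDED_MORE",
}
EXTENDED = {"PCRE2_EXTENDED", "PCRE2_EXTENDED_MORE"}

def _tokens(letters):
    """Split option letters into items, treating 'xx' as a single item."""
    items, i = [], 0
    while i < len(letters):
        item = "xx" if letters.startswith("xx", i) else letters[i]
        if item not in PERL_OPTIONS:
            raise ValueError(f"unknown option letter {item!r}")
        items.append(item)
        i += len(item)
    return items

def apply_option_letters(active, letters):
    """Return the option set after (?letters), e.g. letters='im-sx' or '^i'."""
    active = set(active)
    if letters.startswith("^"):
        if "-" in letters:
            raise ValueError("a hyphen may not appear after a circumflex")
        active -= set(PERL_OPTIONS.values())   # unset all of the listed options
        letters = letters[1:]
    if letters.count("-") > 1:
        raise ValueError("only one hyphen may appear")
    to_set, _, to_unset = letters.partition("-")
    for item in _tokens(to_set):
        active.add(PERL_OPTIONS[item])
    for item in _tokens(to_unset):             # applied last, so unsetting wins
        name = PERL_OPTIONS[item]
        active -= EXTENDED if name in EXTENDED else {name}
    return active

print(sorted(apply_option_letters({"PCRE2_DOTALL", "PCRE2_EXTENDED"}, "im-sx")))
```

       Processing the letters after the hyphen last is what makes a letter that
       appears on both sides end up unset.
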
       Some PCRE2-specific options can be changed by the same mechanism  using
       these pairs or individual letters:

         aD for PCRE2_EXTRA_ASCII_BSD
         aS for PCRE2_EXTRA_ASCII_BSS
         aW for PCRE2_EXTRA_ASCII_BSW
         aP for PCRE2_EXTRA_ASCII_POSIX and PCRE2_EXTRA_ASCII_DIGIT
         aT for PCRE2_EXTRA_ASCII_DIGIT
         r  for PCRE2_EXTRA_CASELESS_RESTRICT
         J  for PCRE2_DUPNAMES
         U  for PCRE2_UNGREEDY

       However,  except for 'r', these are not unset by (?^), which is equiva-
       lent to (?-imnrsx). If 'a' is not followed by any  of  the  upper  case
       letters shown above, it sets (or unsets) all the ASCII options.

       PCRE2_EXTRA_ASCII_DIGIT   has   no  additional  effect  when  PCRE2_EX-
       TRA_ASCII_POSIX is set, but including it in  (?aP)  means  that  (?-aP)
       suppresses all ASCII restrictions for POSIX classes.

       When  one of these option changes occurs at top level (that is, not in-
       side group parentheses), the change applies until a subsequent  change,
       or  the  end of the pattern. An option change within a group (see below
       for a description of groups) affects only that part of the  group  that
       follows  it.  At  the  end  of the group these options are reset to the
       state they were in before the group. For example,

         (a(?i)b)c

       matches abc and aBc and no other strings  (assuming  PCRE2_CASELESS  is
       not  set  externally).  Any changes made in one alternative do carry on
       into subsequent branches within the same group. For example,

         (a(?i)b|c)

       matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
       first  branch  is  abandoned before the option setting. This is because
       the effects of option settings happen at compile time. There  would  be
       some very weird behaviour otherwise.

       As  a  convenient shorthand, if any option settings are required at the
       start of a non-capturing group (see the next section), the option  let-
       ters may appear between the "?" and the ":". Thus the two patterns

         (?i:saturday|sunday)
         (?:(?i)saturday|sunday)

       match exactly the same set of strings.

       Note:  There  are  other  PCRE2-specific options, applying to the whole
       pattern, which can be set by the application when the  compiling  func-
       tion  is  called.  In addition, the pattern can contain special leading
       sequences such as (*CRLF) to override what the application has  set  or
       what  has  been  defaulted.   Details are given in the section entitled
       "Newline sequences" above. There are also the (*UTF) and (*UCP) leading
       sequences that can be used to set UTF and Unicode property modes;  they
       are  equivalent to setting the PCRE2_UTF and PCRE2_UCP options, respec-
       tively.  However,  the  application  can  set  the  PCRE2_NEVER_UTF  or
       PCRE2_NEVER_UCP  options,  which  lock  out  the  use of the (*UTF) and
       (*UCP) sequences.


GROUPS

       Groups are delimited by parentheses  (round  brackets),  which  can  be
       nested.  Turning part of a pattern into a group does two things:

       1. It localizes a set of alternatives. For example, the pattern

         cat(aract|erpillar|)

       matches  "cataract",  "caterpillar", or "cat". Without the parentheses,
       it would match "cataract", "erpillar" or an empty string.

       2. It creates a "capture group". This means that, when the  whole  pat-
       tern  matches, the portion of the subject string that matched the group
       is passed back to the caller, separately from the portion that  matched
       the  whole  pattern.   (This  applies  only to the traditional matching
       function; the DFA matching function does not support capturing.)

       Opening parentheses are counted from left to right (starting from 1) to
       obtain numbers for capture groups. For example, if the string "the  red
       king" is matched against the pattern

         the ((red|white) (king|queen))

       the captured substrings are "red king", "red", and "king", and are num-
       bered 1, 2, and 3, respectively.

       The  fact  that  plain  parentheses  fulfil two functions is not always
       helpful.  There are often times when grouping is required without  cap-
       turing.  If an opening parenthesis is followed by a question mark and a
       colon, the group does not do any capturing, and  is  not  counted  when
       computing  the number of any subsequent capture groups. For example, if
       the string "the white queen" is matched against the pattern

         the ((?:red|white) (king|queen))

       the captured substrings are "white queen" and "queen", and are numbered
       1 and 2. The maximum number of capture groups is 65535.

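       The numbering of capturing and non-capturing groups in the examples  a-
       bove can be reproduced directly. The sketch assumes Python's re module,
       which  counts  opening  parentheses  in  the same way and shares the (?:
       syntax; it stands in for a PCRE2 match call.

```python
import re

capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
print(capturing.groups())       # groups 1, 2 and 3, in order of opening parenthesis

non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
print(non_capturing.groups())   # (?:...) is skipped when numbering

localized = re.fullmatch(r"cat(aract|erpillar|)", "cat")
print(repr(localized.group(1)))  # the empty alternative matched
```
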
       As a convenient shorthand, if any option settings are required  at  the
       start  of  a non-capturing group, the option letters may appear between
       the "?" and the ":". Thus the two patterns

         (?i:saturday|sunday)
         (?:(?i)saturday|sunday)

       match exactly the same set of strings. Because alternative branches are
       tried from left to right, and options are not reset until  the  end  of
       the  group is reached, an option setting in one branch does affect sub-
       sequent branches, so the above patterns match "SUNDAY" as well as "Sat-
       urday".


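       The  scoped  form  can be exercised as shown below. The sketch assumes
       Python's re module, which supports the (?i:...) spelling; only that form
       is shown because recent versions of Python's re reject an  option  set-
       ting  that is not at the start of the pattern, unlike PCRE2 and Perl.

```python
import re

scoped = re.compile(r"(?i:saturday|sunday)")
print([s for s in ("Saturday", "SUNDAY", "monday") if scoped.fullmatch(s)])

# The option is reset at the end of the group: "day" here is matched casefully.
partial = re.compile(r"(?i:sun)day")
print(bool(partial.fullmatch("SUNday")), bool(partial.fullmatch("SUNDAY")))
```
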
DUPLICATE GROUP NUMBERS

       Perl 5.10 introduced a feature whereby each alternative in a group uses
       the same numbers for its capturing parentheses.  Such  a  group  starts
       with  (?|  and  is  itself a non-capturing group. For example, consider
       this pattern:

         (?|(Sat)ur|(Sun))day

       Because the two alternatives are inside a (?| group, both sets of  cap-
       turing  parentheses  are  numbered one. Thus, when the pattern matches,
       you can look at captured substring number  one,  whichever  alternative
       matched.  This  construct  is useful when you want to capture part, but
       not all, of one of a number of alternatives. Inside a (?| group, paren-
       theses are numbered as usual, but the number is reset at the  start  of
       each  branch.  The numbers of any capturing parentheses that follow the
       whole group start after the highest number used in any branch. The fol-
       lowing example is taken from the Perl documentation. The numbers under-
       neath show in which buffer the captured content will be stored.

         # before  ---------------branch-reset----------- after
         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
         # 1            2         2  3        2     3     4

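       The  branch-reset  numbering rule can be written out as a short walk over
       the pattern text. The function below is a simplified  illustration,  not
       PCRE2's parser: its name is hypothetical, and it assumes a pattern that
       contains  only  plain  groups,  (?: groups, (?| groups, and alternation,
       with no escapes, character classes, or other (? constructs.

```python
def number_groups(pattern):
    """Return the capture number given to each plain '(' in left-to-right order."""
    numbers = []
    stack = []   # one entry per open group: None, or [start, highest] for (?|
    n = 0        # most recently allocated capture number in the current branch
    i = 0
    while i < len(pattern):
        if pattern.startswith("(?|", i):
            stack.append([n, n])      # remember the number to reset to
            i += 3
            continue
        if pattern.startswith("(?:", i):
            stack.append(None)        # non-capturing: no number allocated
            i += 3
            continue
        ch = pattern[i]
        if ch == "(":
            n += 1
            numbers.append(n)
            stack.append(None)
        elif ch == "|" and stack and stack[-1] is not None:
            frame = stack[-1]         # a new branch of a (?| group
            frame[1] = max(frame[1], n)
            n = frame[0]              # numbering restarts for this branch
        elif ch == ")":
            frame = stack.pop()
            if frame is not None:
                n = max(frame[1], n)  # continue after the highest number used
        i += 1
    return numbers

print(number_groups("( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z )"))
```

       Applied to the Perl example, the result matches the row of numbers shown
       under the pattern.
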
       A backreference to a capture group uses the most recent value  that  is
       set for the group. The following pattern matches "abcabc" or "defdef":

         /(?|(abc)|(def))\1/

       In  contrast, a subroutine call to a capture group always refers to the
       first one in the pattern with the given number. The  following  pattern
       matches "abcabc" or "defabc":

         /(?|(abc)|(def))(?1)/

       A relative reference such as (?-1) is no different: it is just a conve-
       nient way of computing an absolute group number.

       If a condition test for a group's having matched refers to a non-unique
       number, the test is true if any group with that number has matched.

       An  alternative approach to using this "branch reset" feature is to use
       duplicate named groups, as described in the next section.


8127*22dc650dSSadaf EbrahimiNAMED CAPTURE GROUPS
8128*22dc650dSSadaf Ebrahimi
8129*22dc650dSSadaf Ebrahimi       Identifying capture groups by number is simple, but it can be very hard
8130*22dc650dSSadaf Ebrahimi       to keep track of the numbers in complicated patterns.  Furthermore,  if
8131*22dc650dSSadaf Ebrahimi       an  expression  is  modified, the numbers may change. To help with this
8132*22dc650dSSadaf Ebrahimi       difficulty, PCRE2 supports the naming of capture groups.  This  feature
8133*22dc650dSSadaf Ebrahimi       was  not  added to Perl until release 5.10. Python had the feature ear-
8134*22dc650dSSadaf Ebrahimi       lier, and PCRE1 introduced it at release 4.0, using the Python  syntax.
8135*22dc650dSSadaf Ebrahimi       PCRE2 supports both the Perl and the Python syntax.
8136*22dc650dSSadaf Ebrahimi
8137*22dc650dSSadaf Ebrahimi       In  PCRE2,  a  capture  group  can  be  named  in  one  of  three ways:
8138*22dc650dSSadaf Ebrahimi       (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python.
8139*22dc650dSSadaf Ebrahimi       Names may be up to 128 code units long. When PCRE2_UTF is not set, they
8140*22dc650dSSadaf Ebrahimi       may contain only ASCII alphanumeric  characters  and  underscores,  but
8141*22dc650dSSadaf Ebrahimi       must start with a non-digit. When PCRE2_UTF is set, the syntax of group
8142*22dc650dSSadaf Ebrahimi       names is extended to allow any Unicode letter or Unicode decimal digit.
8143*22dc650dSSadaf Ebrahimi       In other words, group names must match one of these patterns:
8144*22dc650dSSadaf Ebrahimi
8145*22dc650dSSadaf Ebrahimi         ^[_A-Za-z][_A-Za-z0-9]*\z   when PCRE2_UTF is not set
8146*22dc650dSSadaf Ebrahimi         ^[_\p{L}][_\p{L}\p{Nd}]*\z  when PCRE2_UTF is set
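
       The two patterns above, together with the length limit, can be checked
       outside PCRE2 as well. The following is a minimal sketch in Python, not
       part of the PCRE2 API: the helper name is_valid_group_name is
       hypothetical, and it assumes the 8-bit library, so that in UTF mode a
       code unit is one UTF-8 byte.

```python
import string
import unicodedata

MAX_NAME_CODE_UNITS = 128  # limit quoted in the text above


def _is_letter(ch):
    # Corresponds to \p{L}: any Unicode general category starting with "L".
    return unicodedata.category(ch).startswith("L")


def _is_decimal_digit(ch):
    # Corresponds to \p{Nd}.
    return unicodedata.category(ch) == "Nd"


def is_valid_group_name(name, utf=False):
    """Hypothetical helper mirroring the two patterns shown above."""
    if not name:
        return False
    if utf:
        # Assumption: 8-bit library, so code units are UTF-8 bytes.
        units = len(name.encode("utf-8"))
        first_ok = name[0] == "_" or _is_letter(name[0])
        rest_ok = all(
            c == "_" or _is_letter(c) or _is_decimal_digit(c) for c in name[1:]
        )
    else:
        units = len(name)
        first_ok = name[0] == "_" or name[0] in string.ascii_letters
        allowed = "_" + string.ascii_letters + string.digits
        rest_ok = all(c in allowed for c in name[1:])
    return first_ok and rest_ok and units <= MAX_NAME_CODE_UNITS
```

       For example, "DN" and "_x9" are accepted in either mode, "1st" is
       rejected because it starts with a digit, and a name containing a
       non-ASCII letter is accepted only when utf is true.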
8147*22dc650dSSadaf Ebrahimi
8148*22dc650dSSadaf Ebrahimi       References  to  capture groups from other parts of the pattern, such as
8149*22dc650dSSadaf Ebrahimi       backreferences, recursion, and conditions, can all be made by  name  as
8150*22dc650dSSadaf Ebrahimi       well as by number.
8151*22dc650dSSadaf Ebrahimi
8152*22dc650dSSadaf Ebrahimi       Named capture groups are allocated numbers as well as names, exactly as
8153*22dc650dSSadaf Ebrahimi       if  the  names were not present. In both PCRE2 and Perl, capture groups
8154*22dc650dSSadaf Ebrahimi       are primarily identified by numbers; any names  are  just  aliases  for
8155*22dc650dSSadaf Ebrahimi       these numbers. The PCRE2 API provides function calls for extracting the
8156*22dc650dSSadaf Ebrahimi       complete  name-to-number  translation table from a compiled pattern, as
8157*22dc650dSSadaf Ebrahimi       well as convenience functions for  extracting  captured  substrings  by
8158*22dc650dSSadaf Ebrahimi       name.
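
       The idea that names are aliases for numbers can be seen in any engine
       that follows the same model. The sketch below uses Python's re module,
       not the PCRE2 API, purely as an illustration; Python accepts only the
       (?P<name>...) form of the three syntaxes listed above, and the pattern
       and subject string are invented for the example.

```python
import re

# Two named groups; each one also receives an ordinary group number.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Name-to-number table, similar in spirit to the table PCRE2 can report.
name_table = dict(date.groupindex)

m = date.fullmatch("2024-05")
by_name = m.group("year")   # extraction by name
by_number = m.group(1)      # extraction by number yields the same substring
```

       Here name_table maps "year" to 1 and "month" to 2, and both by_name
       and by_number hold "2024".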
8159*22dc650dSSadaf Ebrahimi
8160*22dc650dSSadaf Ebrahimi       Warning:  When  more than one capture group has the same number, as de-
8161*22dc650dSSadaf Ebrahimi       scribed in the previous section, a name given to one of them applies to
8162*22dc650dSSadaf Ebrahimi       all of them. Perl allows identically numbered groups to have  different
8163*22dc650dSSadaf Ebrahimi       names.  Consider this pattern, where there are two capture groups, both
8164*22dc650dSSadaf Ebrahimi       numbered 1:
8165*22dc650dSSadaf Ebrahimi
8166*22dc650dSSadaf Ebrahimi         (?|(?<AA>aa)|(?<BB>bb))
8167*22dc650dSSadaf Ebrahimi
8168*22dc650dSSadaf Ebrahimi       Perl  allows  this,  with  both  names AA and BB as aliases of group 1.
8169*22dc650dSSadaf Ebrahimi       Thus, after a successful match, both names yield the same value (either
8170*22dc650dSSadaf Ebrahimi       "aa" or "bb").
8171*22dc650dSSadaf Ebrahimi
8172*22dc650dSSadaf Ebrahimi       In an attempt to reduce confusion, PCRE2 does not allow the same  group
8173*22dc650dSSadaf Ebrahimi       number to be associated with more than one name. The example above pro-
8174*22dc650dSSadaf Ebrahimi       vokes  a  compile-time  error. However, there is still scope for confu-
8175*22dc650dSSadaf Ebrahimi       sion. Consider this pattern:
8176*22dc650dSSadaf Ebrahimi
8177*22dc650dSSadaf Ebrahimi         (?|(?<AA>aa)|(bb))
8178*22dc650dSSadaf Ebrahimi
8179*22dc650dSSadaf Ebrahimi       Although the second group number 1 is not explicitly named, the name AA
8180*22dc650dSSadaf Ebrahimi       is still an alias for any group 1. Whether the pattern matches "aa"  or
8181*22dc650dSSadaf Ebrahimi       "bb", a reference by name to group AA yields the matched string.
8182*22dc650dSSadaf Ebrahimi
8183*22dc650dSSadaf Ebrahimi       By  default, a name must be unique within a pattern, except that dupli-
8184*22dc650dSSadaf Ebrahimi       cate names are permitted for groups with the same number, for example:
8185*22dc650dSSadaf Ebrahimi
8186*22dc650dSSadaf Ebrahimi         (?|(?<AA>aa)|(?<AA>bb))
8187*22dc650dSSadaf Ebrahimi
8188*22dc650dSSadaf Ebrahimi       The duplicate name constraint can be disabled by setting the PCRE2_DUP-
8189*22dc650dSSadaf Ebrahimi       NAMES option at compile time, or by the use of (?J) within the pattern,
8190*22dc650dSSadaf Ebrahimi       as described in the section entitled "Internal Option Setting" above.
8191*22dc650dSSadaf Ebrahimi
8192*22dc650dSSadaf Ebrahimi       Duplicate names can be useful for patterns where only one  instance  of
8193*22dc650dSSadaf Ebrahimi       the  named  capture group can match. Suppose you want to match the name
8194*22dc650dSSadaf Ebrahimi       of a weekday, either as a 3-letter abbreviation or as  the  full  name,
8195*22dc650dSSadaf Ebrahimi       and  in  both  cases you want to extract the abbreviation. This pattern
8196*22dc650dSSadaf Ebrahimi       (ignoring the line breaks) does the job:
8197*22dc650dSSadaf Ebrahimi
8198*22dc650dSSadaf Ebrahimi         (?J)
8199*22dc650dSSadaf Ebrahimi         (?<DN>Mon|Fri|Sun)(?:day)?|
8200*22dc650dSSadaf Ebrahimi         (?<DN>Tue)(?:sday)?|
8201*22dc650dSSadaf Ebrahimi         (?<DN>Wed)(?:nesday)?|
8202*22dc650dSSadaf Ebrahimi         (?<DN>Thu)(?:rsday)?|
8203*22dc650dSSadaf Ebrahimi         (?<DN>Sat)(?:urday)?
8204*22dc650dSSadaf Ebrahimi
       There are five capture groups, but only one is ever set after a match.
       The convenience functions for extracting the data by name return the
       substring for the first (and in this example, the only) group of that
       name that matched. This saves searching to find which numbered group it
       was. (An alternative way of solving this problem is to use a "branch
       reset" group, as described in the previous section.)
8211*22dc650dSSadaf Ebrahimi
8212*22dc650dSSadaf Ebrahimi       If you make a backreference to a non-unique named group from  elsewhere
8213*22dc650dSSadaf Ebrahimi       in  the pattern, the groups to which the name refers are checked in the
8214*22dc650dSSadaf Ebrahimi       order in which they appear in the overall pattern. The first  one  that
8215*22dc650dSSadaf Ebrahimi       is  set  is  used  for the reference. For example, this pattern matches
8216*22dc650dSSadaf Ebrahimi       both "foofoo" and "barbar" but not "foobar" or "barfoo":
8217*22dc650dSSadaf Ebrahimi
8218*22dc650dSSadaf Ebrahimi         (?J)(?:(?<n>foo)|(?<n>bar))\k<n>
8219*22dc650dSSadaf Ebrahimi
8220*22dc650dSSadaf Ebrahimi
8221*22dc650dSSadaf Ebrahimi       If you make a subroutine call to a non-unique named group, the one that
8222*22dc650dSSadaf Ebrahimi       corresponds to the first occurrence of the name is used. In the absence
8223*22dc650dSSadaf Ebrahimi       of duplicate numbers this is the one with the lowest number.
8224*22dc650dSSadaf Ebrahimi
8225*22dc650dSSadaf Ebrahimi       If you use a named reference in a condition test (see the section about
8226*22dc650dSSadaf Ebrahimi       conditions below), either to check whether a capture group has matched,
8227*22dc650dSSadaf Ebrahimi       or to check for recursion, all groups with the same name are tested. If
8228*22dc650dSSadaf Ebrahimi       the condition is true for any one of them,  the  overall  condition  is
8229*22dc650dSSadaf Ebrahimi       true.  This is the same behaviour as testing by number. For further de-
8230*22dc650dSSadaf Ebrahimi       tails of the interfaces for handling  named  capture  groups,  see  the
8231*22dc650dSSadaf Ebrahimi       pcre2api documentation.
8232*22dc650dSSadaf Ebrahimi
8233*22dc650dSSadaf Ebrahimi
8234*22dc650dSSadaf EbrahimiREPETITION
8235*22dc650dSSadaf Ebrahimi
8236*22dc650dSSadaf Ebrahimi       Repetition  is  specified  by  quantifiers, which may follow any one of
8237*22dc650dSSadaf Ebrahimi       these items:
8238*22dc650dSSadaf Ebrahimi
8239*22dc650dSSadaf Ebrahimi         a literal data character
8240*22dc650dSSadaf Ebrahimi         the dot metacharacter
8241*22dc650dSSadaf Ebrahimi         the \C escape sequence
8242*22dc650dSSadaf Ebrahimi         the \R escape sequence
8243*22dc650dSSadaf Ebrahimi         the \X escape sequence
8244*22dc650dSSadaf Ebrahimi         any escape sequence that matches a single character
8245*22dc650dSSadaf Ebrahimi         a character class
8246*22dc650dSSadaf Ebrahimi         a backreference
8247*22dc650dSSadaf Ebrahimi         a parenthesized group (including lookaround assertions)
8248*22dc650dSSadaf Ebrahimi         a subroutine call (recursive or otherwise)
8249*22dc650dSSadaf Ebrahimi
8250*22dc650dSSadaf Ebrahimi       If a quantifier does not follow a repeatable item, an error occurs. The
8251*22dc650dSSadaf Ebrahimi       general repetition quantifier specifies a minimum and maximum number of
8252*22dc650dSSadaf Ebrahimi       permitted matches by giving two numbers  in  curly  brackets  (braces),
8253*22dc650dSSadaf Ebrahimi       separated  by  a  comma.  The  numbers must be less than 65536, and the
8254*22dc650dSSadaf Ebrahimi       first must be less than or equal to the second. For example,
8255*22dc650dSSadaf Ebrahimi
8256*22dc650dSSadaf Ebrahimi         z{2,4}
8257*22dc650dSSadaf Ebrahimi
8258*22dc650dSSadaf Ebrahimi       matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a
8259*22dc650dSSadaf Ebrahimi       special  character.  If  the second number is omitted, but the comma is
8260*22dc650dSSadaf Ebrahimi       present, there is no upper limit; if the second number  and  the  comma
8261*22dc650dSSadaf Ebrahimi       are  both omitted, the quantifier specifies an exact number of required
8262*22dc650dSSadaf Ebrahimi       matches. Thus
8263*22dc650dSSadaf Ebrahimi
8264*22dc650dSSadaf Ebrahimi         [aeiou]{3,}
8265*22dc650dSSadaf Ebrahimi
8266*22dc650dSSadaf Ebrahimi       matches at least 3 successive vowels, but may match many more, whereas
8267*22dc650dSSadaf Ebrahimi
8268*22dc650dSSadaf Ebrahimi         \d{8}
8269*22dc650dSSadaf Ebrahimi
8270*22dc650dSSadaf Ebrahimi       matches exactly 8 digits. If the first number  is  omitted,  the  lower
8271*22dc650dSSadaf Ebrahimi       limit is taken as zero; in this case the upper limit must be present.
8272*22dc650dSSadaf Ebrahimi
8273*22dc650dSSadaf Ebrahimi         X{,4} is interpreted as X{0,4}
8274*22dc650dSSadaf Ebrahimi
8275*22dc650dSSadaf Ebrahimi       This  is  a  change in behaviour that happened in Perl 5.34.0 and PCRE2
8276*22dc650dSSadaf Ebrahimi       10.43. In earlier versions such a sequence was  not  interpreted  as  a
8277*22dc650dSSadaf Ebrahimi       quantifier. Other regular expression engines may behave either way.
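
       The brace forms described so far can be tried out in most engines. The
       sketch below uses Python's re module as a stand-in, which is an
       assumption of this example rather than anything PCRE2 provides; Python
       is one of the engines that reads {,4} as {0,4}. The helper
       fully_matches is hypothetical.

```python
import re


def fully_matches(pattern, subject):
    # True when the pattern matches the whole subject string.
    return re.fullmatch(pattern, subject) is not None


# z{2,4}: between two and four repetitions.
bounded = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz")
           if fully_matches(r"z{2,4}", s)]

# {3,}: no upper limit; {8}: an exact count.
at_least_three = fully_matches(r"[aeiou]{3,}", "aeiouae")
exactly_eight = fully_matches(r"\d{8}", "20240531")
seven_digits = fully_matches(r"\d{8}", "2024053")

# {,4}: the lower limit defaults to zero in Python's engine.
no_lower_bound = [s for s in ("", "XXXX", "XXXXX")
                  if fully_matches(r"X{,4}", s)]
```

       Only "zz", "zzz", and "zzzz" end up in bounded, and only the empty
       string and "XXXX" end up in no_lower_bound.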
8278*22dc650dSSadaf Ebrahimi
8279*22dc650dSSadaf Ebrahimi       If  the characters that follow an opening brace do not match the syntax
8280*22dc650dSSadaf Ebrahimi       of a quantifier, the brace is taken as a literal character. In particu-
8281*22dc650dSSadaf Ebrahimi       lar, this means that {,} is a literal string of three characters.
8282*22dc650dSSadaf Ebrahimi
8283*22dc650dSSadaf Ebrahimi       Note that not every opening brace is potentially the start of a quanti-
8284*22dc650dSSadaf Ebrahimi       fier because braces are used  in  other  items  such  as  \N{U+345}  or
8285*22dc650dSSadaf Ebrahimi       \k{name}.
8286*22dc650dSSadaf Ebrahimi
8287*22dc650dSSadaf Ebrahimi       In UTF modes, quantifiers apply to characters rather than to individual
8288*22dc650dSSadaf Ebrahimi       code  units. Thus, for example, \x{100}{2} matches two characters, each
8289*22dc650dSSadaf Ebrahimi       of which is represented by a two-byte sequence in a UTF-8 string. Simi-
8290*22dc650dSSadaf Ebrahimi       larly, \X{3} matches three Unicode extended grapheme clusters, each  of
8291*22dc650dSSadaf Ebrahimi       which  may  be  several  code  units long (and they may be of different
8292*22dc650dSSadaf Ebrahimi       lengths).
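
       The difference between characters and code units can be illustrated
       with Python, whose re module always works on whole characters when
       given a text string. This is only an analogy to PCRE2's UTF mode, and
       the variable names are invented for the example.

```python
import re

subject = "\u0100\u0100"           # two characters, U+0100 twice
encoded = subject.encode("utf-8")  # the same text as UTF-8 code units

# The quantifier {2} counts characters, not bytes.
m = re.fullmatch("\u0100{2}", subject)
chars_matched = len(m.group())
bytes_needed = len(encoded)
```

       The match covers two characters, although the UTF-8 form of the
       subject occupies four bytes.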
8293*22dc650dSSadaf Ebrahimi
8294*22dc650dSSadaf Ebrahimi       The quantifier {0} is permitted, causing the expression to behave as if
8295*22dc650dSSadaf Ebrahimi       the previous item and the quantifier were not present. This may be use-
8296*22dc650dSSadaf Ebrahimi       ful for capture groups that are referenced as  subroutines  from  else-
8297*22dc650dSSadaf Ebrahimi       where  in the pattern (but see also the section entitled "Defining cap-
8298*22dc650dSSadaf Ebrahimi       ture groups for use by reference only" below). Except for parenthesized
8299*22dc650dSSadaf Ebrahimi       groups, items that have a {0} quantifier are omitted from the  compiled
8300*22dc650dSSadaf Ebrahimi       pattern.
8301*22dc650dSSadaf Ebrahimi
8302*22dc650dSSadaf Ebrahimi       For  convenience, the three most common quantifiers have single-charac-
8303*22dc650dSSadaf Ebrahimi       ter abbreviations:
8304*22dc650dSSadaf Ebrahimi
8305*22dc650dSSadaf Ebrahimi         *    is equivalent to {0,}
8306*22dc650dSSadaf Ebrahimi         +    is equivalent to {1,}
8307*22dc650dSSadaf Ebrahimi         ?    is equivalent to {0,1}
8308*22dc650dSSadaf Ebrahimi
8309*22dc650dSSadaf Ebrahimi       It is possible to construct infinite loops by following  a  group  that
8310*22dc650dSSadaf Ebrahimi       can  match no characters with a quantifier that has no upper limit, for
8311*22dc650dSSadaf Ebrahimi       example:
8312*22dc650dSSadaf Ebrahimi
8313*22dc650dSSadaf Ebrahimi         (a?)*
8314*22dc650dSSadaf Ebrahimi
8315*22dc650dSSadaf Ebrahimi       Earlier versions of Perl and PCRE1 used to give  an  error  at  compile
8316*22dc650dSSadaf Ebrahimi       time for such patterns. However, because there are cases where this can
8317*22dc650dSSadaf Ebrahimi       be useful, such patterns are now accepted, but whenever an iteration of
8318*22dc650dSSadaf Ebrahimi       such  a group matches no characters, matching moves on to the next item
8319*22dc650dSSadaf Ebrahimi       in the pattern instead of repeatedly matching  an  empty  string.  This
8320*22dc650dSSadaf Ebrahimi       does  not  prevent  backtracking into any of the iterations if a subse-
8321*22dc650dSSadaf Ebrahimi       quent item fails to match.
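
       Other backtracking engines also accept such patterns and stop
       iterating when an iteration matches nothing. The sketch below uses
       Python's re module as an illustration only; its treatment of the value
       captured by the empty iteration is its own and is not examined here.

```python
import re

# The group can match the empty string, yet the unlimited repeat terminates.
m1 = re.match(r"(a?)*", "aaab")
whole = m1.group(0)

# With nothing to consume, the overall match is empty rather than a loop.
m2 = re.match(r"(a?)*", "bbb")
empty = m2.group(0)

# A following item still matches after the repeat has finished.
m3 = re.match(r"(a?)*b", "aaab")
with_tail = m3.group(0)
```

       The three results are "aaa", the empty string, and "aaab".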
8322*22dc650dSSadaf Ebrahimi
8323*22dc650dSSadaf Ebrahimi       By default, quantifiers are "greedy", that is, they match  as  much  as
8324*22dc650dSSadaf Ebrahimi       possible  (up  to the maximum number of permitted repetitions), without
8325*22dc650dSSadaf Ebrahimi       causing the rest of the pattern to fail. The classic example  of  where
8326*22dc650dSSadaf Ebrahimi       this gives problems is in trying to match comments in C programs. These
8327*22dc650dSSadaf Ebrahimi       appear  between  /*  and  */ and within the comment, individual * and /
8328*22dc650dSSadaf Ebrahimi       characters may appear. An attempt to match C comments by  applying  the
8329*22dc650dSSadaf Ebrahimi       pattern
8330*22dc650dSSadaf Ebrahimi
8331*22dc650dSSadaf Ebrahimi         /\*.*\*/
8332*22dc650dSSadaf Ebrahimi
8333*22dc650dSSadaf Ebrahimi       to the string
8334*22dc650dSSadaf Ebrahimi
8335*22dc650dSSadaf Ebrahimi         /* first comment */  not comment  /* second comment */
8336*22dc650dSSadaf Ebrahimi
8337*22dc650dSSadaf Ebrahimi       fails,  because it matches the entire string owing to the greediness of
8338*22dc650dSSadaf Ebrahimi       the .*  item. However, if a quantifier is followed by a question  mark,
8339*22dc650dSSadaf Ebrahimi       it ceases to be greedy, and instead matches the minimum number of times
8340*22dc650dSSadaf Ebrahimi       possible, so the pattern
8341*22dc650dSSadaf Ebrahimi
8342*22dc650dSSadaf Ebrahimi         /\*.*?\*/
8343*22dc650dSSadaf Ebrahimi
8344*22dc650dSSadaf Ebrahimi       does  the right thing with C comments. The meaning of the various quan-
8345*22dc650dSSadaf Ebrahimi       tifiers is not otherwise changed, just the preferred number of matches.
8346*22dc650dSSadaf Ebrahimi       Do not confuse this use of question mark with its use as  a  quantifier
8347*22dc650dSSadaf Ebrahimi       in  its  own  right.   Because it has two uses, it can sometimes appear
8348*22dc650dSSadaf Ebrahimi       doubled, as in
8349*22dc650dSSadaf Ebrahimi
8350*22dc650dSSadaf Ebrahimi         \d??\d
8351*22dc650dSSadaf Ebrahimi
8352*22dc650dSSadaf Ebrahimi       which matches one digit by preference, but can match two if that is the
8353*22dc650dSSadaf Ebrahimi       only way the rest of the pattern matches.
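
       The C comment example and the doubled question mark behave the same
       way in other Perl-style engines. The following sketch runs them with
       Python's re module, which is an assumption of this example and not
       part of PCRE2.

```python
import re

source = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last */ in the string.
greedy = re.search(r"/\*.*\*/", source).group()

# Lazy .*? stops at the first */ after each opening /*.
lazy = re.findall(r"/\*.*?\*/", source)

# \d??\d prefers one digit, but takes two when the whole subject must match.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
```

       The greedy result is the entire source string, whereas lazy holds the
       two comments separately.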
8354*22dc650dSSadaf Ebrahimi
8355*22dc650dSSadaf Ebrahimi       If the PCRE2_UNGREEDY option is set (an option that is not available in
8356*22dc650dSSadaf Ebrahimi       Perl), the quantifiers are not greedy by default, but  individual  ones
8357*22dc650dSSadaf Ebrahimi       can  be  made  greedy  by following them with a question mark. In other
8358*22dc650dSSadaf Ebrahimi       words, it inverts the default behaviour.
8359*22dc650dSSadaf Ebrahimi
8360*22dc650dSSadaf Ebrahimi       When a parenthesized group is quantified with a  minimum  repeat  count
8361*22dc650dSSadaf Ebrahimi       that  is  greater  than 1 or with a limited maximum, more memory is re-
8362*22dc650dSSadaf Ebrahimi       quired for the compiled pattern, in proportion to the size of the mini-
8363*22dc650dSSadaf Ebrahimi       mum or maximum.
8364*22dc650dSSadaf Ebrahimi
8365*22dc650dSSadaf Ebrahimi       If a pattern starts with  .*  or  .{0,}  and  the  PCRE2_DOTALL  option
8366*22dc650dSSadaf Ebrahimi       (equivalent  to  Perl's /s) is set, thus allowing the dot to match new-
8367*22dc650dSSadaf Ebrahimi       lines, the pattern is implicitly  anchored,  because  whatever  follows
8368*22dc650dSSadaf Ebrahimi       will  be  tried against every character position in the subject string,
8369*22dc650dSSadaf Ebrahimi       so there is no point in retrying the overall match at any position  af-
8370*22dc650dSSadaf Ebrahimi       ter  the  first. PCRE2 normally treats such a pattern as though it were
8371*22dc650dSSadaf Ebrahimi       preceded by \A.
8372*22dc650dSSadaf Ebrahimi
8373*22dc650dSSadaf Ebrahimi       In cases where it is known that the subject  string  contains  no  new-
8374*22dc650dSSadaf Ebrahimi       lines,  it  is worth setting PCRE2_DOTALL in order to obtain this opti-
8375*22dc650dSSadaf Ebrahimi       mization, or alternatively, using ^ to indicate anchoring explicitly.
8376*22dc650dSSadaf Ebrahimi
8377*22dc650dSSadaf Ebrahimi       However, there are some cases where the optimization  cannot  be  used.
8378*22dc650dSSadaf Ebrahimi       When  .*   is  inside  capturing  parentheses that are the subject of a
8379*22dc650dSSadaf Ebrahimi       backreference elsewhere in the pattern, a match at the start  may  fail
8380*22dc650dSSadaf Ebrahimi       where a later one succeeds. Consider, for example:
8381*22dc650dSSadaf Ebrahimi
8382*22dc650dSSadaf Ebrahimi         (.*)abc\1
8383*22dc650dSSadaf Ebrahimi
8384*22dc650dSSadaf Ebrahimi       If  the subject is "xyz123abc123" the match point is the fourth charac-
8385*22dc650dSSadaf Ebrahimi       ter. For this reason, such a pattern is not implicitly anchored.
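
       The match point can be confirmed with any engine that supports
       backreferences. This sketch uses Python's re module as a stand-in for
       PCRE2; offsets in Python are zero-based, so the fourth character has
       offset 3.

```python
import re

pattern = r"(.*)abc\1"
subject = "xyz123abc123"

# An unanchored search finds the match that starts at the fourth character.
m = re.search(pattern, subject)
start = m.start()
captured = m.group(1)

# Forcing the match to begin at the start of the subject fails.
anchored = re.match(pattern, subject)
```

       The search succeeds at offset 3 with "123" captured, while the
       anchored attempt returns no match, which is why implicit anchoring
       would be wrong for this pattern.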
8386*22dc650dSSadaf Ebrahimi
8387*22dc650dSSadaf Ebrahimi       Another case where implicit anchoring is not applied is when the  lead-
8388*22dc650dSSadaf Ebrahimi       ing  .* is inside an atomic group. Once again, a match at the start may
8389*22dc650dSSadaf Ebrahimi       fail where a later one succeeds. Consider this pattern:
8390*22dc650dSSadaf Ebrahimi
8391*22dc650dSSadaf Ebrahimi         (?>.*?a)b
8392*22dc650dSSadaf Ebrahimi
       It matches "ab" in the subject "aab". The use of the backtracking
       control verbs (*PRUNE) and (*SKIP) also disables this optimization, and
       there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
8396*22dc650dSSadaf Ebrahimi
8397*22dc650dSSadaf Ebrahimi       When a capture group is repeated, the value captured is  the  substring
8398*22dc650dSSadaf Ebrahimi       that matched the final iteration. For example, after
8399*22dc650dSSadaf Ebrahimi
8400*22dc650dSSadaf Ebrahimi         (tweedle[dume]{3}\s*)+
8401*22dc650dSSadaf Ebrahimi
8402*22dc650dSSadaf Ebrahimi       has matched "tweedledum tweedledee" the value of the captured substring
8403*22dc650dSSadaf Ebrahimi       is  "tweedledee". However, if there are nested capture groups, the cor-
8404*22dc650dSSadaf Ebrahimi       responding captured values may have been set  in  previous  iterations.
8405*22dc650dSSadaf Ebrahimi       For example, after
8406*22dc650dSSadaf Ebrahimi
8407*22dc650dSSadaf Ebrahimi         (a|(b))+
8408*22dc650dSSadaf Ebrahimi
8409*22dc650dSSadaf Ebrahimi       matches "aba" the value of the second captured substring is "b".
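
       Both examples can be reproduced in Python's re module, which keeps
       captured values across iterations in the same way for these patterns.
       Using Python here is an assumption made for illustration; it is not
       the PCRE2 API.

```python
import re

# A repeated group reports the substring from its final iteration.
m = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m.group(1)

# The nested group took no part in the final iteration, so it keeps the
# value set during the second iteration.
n = re.fullmatch(r"(a|(b))+", "aba")
outer = n.group(1)
inner = n.group(2)
```

       Here last_iteration is "tweedledee", outer is "a", and inner is "b".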
8410*22dc650dSSadaf Ebrahimi
8411*22dc650dSSadaf Ebrahimi
8412*22dc650dSSadaf EbrahimiATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
8413*22dc650dSSadaf Ebrahimi
       With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
       repetition, failure of what follows normally causes the repeated item
       to be re-evaluated to see if a different number of repeats allows the
       rest of the pattern to match. Sometimes it is useful to prevent this,
       either to change the nature of the match, or to cause it to fail
       earlier than it otherwise might, when the author of the pattern knows
       there is no point in carrying on.
8421*22dc650dSSadaf Ebrahimi
8422*22dc650dSSadaf Ebrahimi       Consider,  for  example, the pattern \d+foo when applied to the subject
8423*22dc650dSSadaf Ebrahimi       line
8424*22dc650dSSadaf Ebrahimi
8425*22dc650dSSadaf Ebrahimi         123456bar
8426*22dc650dSSadaf Ebrahimi
8427*22dc650dSSadaf Ebrahimi       After matching all 6 digits and then failing to match "foo", the normal
8428*22dc650dSSadaf Ebrahimi       action of the matcher is to try again with only 5 digits  matching  the
8429*22dc650dSSadaf Ebrahimi       \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
8430*22dc650dSSadaf Ebrahimi       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
8431*22dc650dSSadaf Ebrahimi       the means for specifying that once a group has matched, it is not to be
8432*22dc650dSSadaf Ebrahimi       re-evaluated in this way.
8433*22dc650dSSadaf Ebrahimi
8434*22dc650dSSadaf Ebrahimi       If  we  use atomic grouping for the previous example, the matcher gives
8435*22dc650dSSadaf Ebrahimi       up immediately on failing to match "foo" the first time.  The  notation
8436*22dc650dSSadaf Ebrahimi       is a kind of special parenthesis, starting with (?> as in this example:
8437*22dc650dSSadaf Ebrahimi
8438*22dc650dSSadaf Ebrahimi         (?>\d+)foo
8439*22dc650dSSadaf Ebrahimi
8440*22dc650dSSadaf Ebrahimi       Perl  5.28  introduced an experimental alphabetic form starting with (*
8441*22dc650dSSadaf Ebrahimi       which may be easier to remember:
8442*22dc650dSSadaf Ebrahimi
8443*22dc650dSSadaf Ebrahimi         (*atomic:\d+)foo
8444*22dc650dSSadaf Ebrahimi
8445*22dc650dSSadaf Ebrahimi       This kind of parenthesized group "locks up" the part of the pattern  it
8446*22dc650dSSadaf Ebrahimi       contains once it has matched, and a failure further into the pattern is
8447*22dc650dSSadaf Ebrahimi       prevented  from  backtracking into it. Backtracking past it to previous
8448*22dc650dSSadaf Ebrahimi       items, however, works as normal.
8449*22dc650dSSadaf Ebrahimi
8450*22dc650dSSadaf Ebrahimi       An alternative description is that a group of this type matches exactly
8451*22dc650dSSadaf Ebrahimi       the string of characters that an  identical  standalone  pattern  would
8452*22dc650dSSadaf Ebrahimi       match, if anchored at the current point in the subject string.
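
       This alternative description can be turned into a small sketch. The
       Python code below is not how PCRE2 implements atomic groups; it is a
       hypothetical helper, match_atomic_then, that models only the simplest
       case of an atomic group at the start of a pattern followed by the
       rest of the pattern, by running the group's contents as a standalone
       match at the current point and never revisiting the result.

```python
import re


def match_atomic_then(inner, rest, subject, pos=0):
    """Hypothetical helper: treat `inner` as atomic, then match `rest`."""
    # Standalone match of the group's contents at the current point.
    head = re.compile(inner).match(subject, pos)
    if head is None:
        return None
    # The rest of the pattern must match where the group stopped.
    tail = re.compile(rest).match(subject, head.end())
    if tail is None:
        return None  # no retry with a shorter match for `inner`
    return subject[pos:tail.end()]


# An ordinary repeat gives back one digit so that the final \d can match.
ordinary = re.match(r"\d+\d", "1234")

# The atomic version swallows all four digits and then fails.
atomic = match_atomic_then(r"\d+", r"\d", "1234")

with_foo = match_atomic_then(r"\d+", "foo", "123456foo")
with_bar = match_atomic_then(r"\d+", "foo", "123456bar")
```

       The ordinary pattern matches "1234", the atomic variant fails, and
       the \d+ followed by "foo" case succeeds or fails after a single
       attempt.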
8453*22dc650dSSadaf Ebrahimi
8454*22dc650dSSadaf Ebrahimi       Atomic  groups  are  not capture groups. Simple cases such as the above
8455*22dc650dSSadaf Ebrahimi       example can be thought of as a  maximizing  repeat  that  must  swallow
8456*22dc650dSSadaf Ebrahimi       everything  it can.  So, while both \d+ and \d+? are prepared to adjust
8457*22dc650dSSadaf Ebrahimi       the number of digits they match in order to make the rest of  the  pat-
8458*22dc650dSSadaf Ebrahimi       tern match, (?>\d+) can only match an entire sequence of digits.
8459*22dc650dSSadaf Ebrahimi
       Atomic groups in general can of course contain arbitrarily complicated
       expressions, and can be nested. However, when the contents of an atomic
       group are just a single repeated item, as in the example above, a
       simpler notation, called a "possessive quantifier", can be used. This
       consists of an additional + character following a quantifier. Using
       this notation, the previous example can be rewritten as
8466*22dc650dSSadaf Ebrahimi
8467*22dc650dSSadaf Ebrahimi         \d++foo
8468*22dc650dSSadaf Ebrahimi
8469*22dc650dSSadaf Ebrahimi       Note that a possessive quantifier can be used with an entire group, for
8470*22dc650dSSadaf Ebrahimi       example:
8471*22dc650dSSadaf Ebrahimi
8472*22dc650dSSadaf Ebrahimi         (abc|xyz){2,3}+
8473*22dc650dSSadaf Ebrahimi
8474*22dc650dSSadaf Ebrahimi       Possessive  quantifiers are always greedy; the setting of the PCRE2_UN-
8475*22dc650dSSadaf Ebrahimi       GREEDY option is ignored. They are a convenient notation for  the  sim-
8476*22dc650dSSadaf Ebrahimi       pler  forms  of  atomic  group.  However, there is no difference in the
8477*22dc650dSSadaf Ebrahimi       meaning of a possessive quantifier and  the  equivalent  atomic  group,
8478*22dc650dSSadaf Ebrahimi       though  there  may  be a performance difference; possessive quantifiers
8479*22dc650dSSadaf Ebrahimi       should be slightly faster.
8480*22dc650dSSadaf Ebrahimi
8481*22dc650dSSadaf Ebrahimi       The possessive quantifier syntax is an extension to the Perl  5.8  syn-
8482*22dc650dSSadaf Ebrahimi       tax.   Jeffrey  Friedl  originated the idea (and the name) in the first
8483*22dc650dSSadaf Ebrahimi       edition of his book. Mike McCloskey liked it, so implemented it when he
8484*22dc650dSSadaf Ebrahimi       built Sun's Java package, and PCRE1 copied it from there. It found  its
8485*22dc650dSSadaf Ebrahimi       way into Perl at release 5.10.
8486*22dc650dSSadaf Ebrahimi
8487*22dc650dSSadaf Ebrahimi       PCRE2  has  an  optimization  that automatically "possessifies" certain
8488*22dc650dSSadaf Ebrahimi       simple pattern constructs. For example, the sequence A+B is treated  as
8489*22dc650dSSadaf Ebrahimi       A++B  because  there is no point in backtracking into a sequence of A's
8490*22dc650dSSadaf Ebrahimi       when B must follow.  This feature can be disabled by the PCRE2_NO_AUTO-
8491*22dc650dSSadaf Ebrahimi       POSSESS option, or starting the pattern with (*NO_AUTO_POSSESS).
8492*22dc650dSSadaf Ebrahimi
8493*22dc650dSSadaf Ebrahimi       When a pattern contains an unlimited repeat inside a group that can it-
8494*22dc650dSSadaf Ebrahimi       self be repeated an unlimited number of times, the  use  of  an  atomic
8495*22dc650dSSadaf Ebrahimi       group  is the only way to avoid some failing matches taking a very long
8496*22dc650dSSadaf Ebrahimi       time indeed. The pattern
8497*22dc650dSSadaf Ebrahimi
8498*22dc650dSSadaf Ebrahimi         (\D+|<\d+>)*[!?]
8499*22dc650dSSadaf Ebrahimi
8500*22dc650dSSadaf Ebrahimi       matches an unlimited number of substrings that either consist  of  non-
8501*22dc650dSSadaf Ebrahimi       digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
8502*22dc650dSSadaf Ebrahimi       matches, it runs quickly. However, if it is applied to
8503*22dc650dSSadaf Ebrahimi
8504*22dc650dSSadaf Ebrahimi         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
8505*22dc650dSSadaf Ebrahimi
8506*22dc650dSSadaf Ebrahimi       it takes a long time before reporting  failure.  This  is  because  the
8507*22dc650dSSadaf Ebrahimi       string  can be divided between the internal \D+ repeat and the external
8508*22dc650dSSadaf Ebrahimi       * repeat in a large number of ways, and all have to be tried. (The  ex-
8509*22dc650dSSadaf Ebrahimi       ample uses [!?] rather than a single character at the end, because both
8510*22dc650dSSadaf Ebrahimi       PCRE2 and Perl have an optimization that allows for fast failure when a
8511*22dc650dSSadaf Ebrahimi       single  character is used. They remember the last single character that
8512*22dc650dSSadaf Ebrahimi       is required for a match, and fail early if it is  not  present  in  the
8513*22dc650dSSadaf Ebrahimi       string.)  If  the  pattern  is changed so that it uses an atomic group,
8514*22dc650dSSadaf Ebrahimi       like this:
8515*22dc650dSSadaf Ebrahimi
8516*22dc650dSSadaf Ebrahimi         ((?>\D+)|<\d+>)*[!?]
8517*22dc650dSSadaf Ebrahimi
8518*22dc650dSSadaf Ebrahimi       sequences of non-digits cannot be broken, and failure happens quickly.
8519*22dc650dSSadaf Ebrahimi
8520*22dc650dSSadaf Ebrahimi
8521*22dc650dSSadaf EbrahimiBACKREFERENCES
8522*22dc650dSSadaf Ebrahimi
8523*22dc650dSSadaf Ebrahimi       Outside a character class, a backslash followed by a digit greater than
8524*22dc650dSSadaf Ebrahimi       0 (and possibly further digits) is a backreference to a  capture  group
8525*22dc650dSSadaf Ebrahimi       earlier (that is, to its left) in the pattern, provided there have been
8526*22dc650dSSadaf Ebrahimi       that many previous capture groups.
8527*22dc650dSSadaf Ebrahimi
8528*22dc650dSSadaf Ebrahimi       However,  if the decimal number following the backslash is less than 8,
8529*22dc650dSSadaf Ebrahimi       it is always taken as a backreference, and  causes  an  error  only  if
8530*22dc650dSSadaf Ebrahimi       there  are not that many capture groups in the entire pattern. In other
       words, the group that is referenced need not be to the left of the ref-
       erence for numbers less than 8. A "forward backreference" of this  type
       can make sense when a repetition is involved and the group to the right
       has participated in an earlier iteration.

       It  is  not  possible  to have a numerical "forward backreference" to a
       group whose number is 8 or more using this syntax  because  a  sequence
       such  as  \50  is  interpreted as a character defined in octal. See the
       subsection entitled "Non-printing characters" above for further details
       of the handling of digits following a backslash. Other forms  of  back-
       referencing  do  not suffer from this restriction. In particular, there
       is no problem when named capture groups are used (see below).

       Another way of avoiding the ambiguity inherent in  the  use  of  digits
       following  a  backslash  is  to use the \g escape sequence. This escape
       must be followed by a signed or unsigned number, optionally enclosed in
       braces. These examples are all equivalent:

         (ring), \1
         (ring), \g1
         (ring), \g{1}

       An unsigned number specifies an absolute reference without the  ambigu-
       ity that is present in the older syntax. It is also useful when literal
       digits  follow  the reference. A signed number is a relative reference.
       Consider this example:

         (abc(def)ghi)\g{-1}

       The sequence \g{-1} is a reference to the capture group whose number is
       one less than the number of the next group to be started,  so  in  this
       example  (where the next group would be numbered 3) it is equivalent to
       \2, and \g{-2} would be equivalent to \1. Note that if  this  construct
       is  inside  a capture group, that group is included in the count, so in
       this example \g{-2} also refers to group 1:

         (A)(\g{-2}B)

       The use of relative references can be helpful  in  long  patterns,  and
       also  in  patterns  that are created by joining together fragments that
       contain references within themselves.

       The sequence \g{+1} is a reference to the next capture  group  that  is
       started  after  this item, and \g{+2} refers to the one after that, and
       so on. This kind of forward reference can be useful  in  patterns  that
       repeat. Perl does not support the use of + in this way.
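
       The counting rule for relative references can be sketched in Python.
       This is an illustration of the arithmetic only, not PCRE2's parser.
       The helper name resolve_relative is hypothetical, and the sketch
       assumes a pattern with no character classes, named groups, or (?|
       groups, so that every "(" not followed by "?" or "*" starts a numbered
       capture group.

```python
def resolve_relative(pattern: str, ref_pos: int, offset: int) -> int:
    """Map a relative reference at index ref_pos to an absolute group number.

    offset is the signed number inside the braces, for example -1 or +2.
    """
    opened = 0  # capture groups started to the left of the reference
    i = 0
    while i < ref_pos:
        if pattern[i] == "\\":
            i += 2  # skip an escaped character
            continue
        if pattern[i] == "(" and pattern[i + 1:i + 2] not in ("?", "*"):
            opened += 1
        i += 1
    if offset == 0:
        raise ValueError("a relative reference needs a non-zero offset")
    # Negative: count back from the next group to be started (opened + 1).
    # Positive: count forward from the groups already started.
    number = opened + 1 + offset if offset < 0 else opened + offset
    if number < 1:
        raise ValueError("reference points before the first group")
    return number


first = r"(abc(def)ghi)\g{-1}"
second = r"(A)(\g{-2}B)"
print(resolve_relative(first, first.index(r"\g"), -1))    # prints 2
print(resolve_relative(second, second.index(r"\g"), -2))  # prints 1
```

       In the second call the enclosing group has already been started when
       the reference is reached, which is why it is included in the count.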

       A  backreference  matches  whatever  actually most recently matched the
       capture group in the current subject string, rather  than  anything  at
       all that matches the group (see "Groups as subroutines" below for a way
       of doing that). So the pattern

         (sens|respons)e and \1ibility

       matches  "sense and sensibility" and "response and responsibility", but
       not "sense and responsibility". If caseful matching is in force at  the
       time  of  the backreference, the case of letters is relevant. For exam-
       ple,

         ((?i)rah)\s+\1

       matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
       original capture group is matched caselessly.
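
       Both examples can be tried in any engine that shares this behaviour.
       The sketch below uses Python's re module, which is a different engine
       from PCRE2 and is used here only because it is widely available.
       Recent Python versions reject (?i) in the middle of a pattern, so the
       scoped form (?i:rah) stands in for it; treating the two as equivalent
       is an assumption of this sketch.

```python
import re

# A backreference repeats the text that the group actually matched.
pair = re.compile(r"(sens|respons)e and \1ibility")
print(bool(pair.fullmatch("sense and sensibility")))        # True
print(bool(pair.fullmatch("response and responsibility")))  # True
print(bool(pair.fullmatch("sense and responsibility")))     # False

# The group matches caselessly, but the backreference outside it is caseful.
cheer = re.compile(r"((?i:rah))\s+\1")
print(bool(cheer.fullmatch("rah rah")))  # True
print(bool(cheer.fullmatch("RAH RAH")))  # True
print(bool(cheer.fullmatch("RAH rah")))  # False
```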

       There  are  several  different  ways of writing backreferences to named
       capture groups. The .NET syntax  is  \k{name},  the  Python  syntax  is
       (?P=name), and the original Perl syntax is \k<name> or \k'name'. All of
       these are now supported by both Perl and  PCRE2.  Perl  5.10's  unified
       backreference  syntax,  in  which  \g  can be used for both numeric and
       named references, is also supported by PCRE2.   We  could  rewrite  the
       above example in any of the following ways:

         (?<p1>(?i)rah)\s+\k<p1>
         (?'p1'(?i)rah)\s+\k{p1}
         (?P<p1>(?i)rah)\s+(?P=p1)
         (?<p1>(?i)rah)\s+\g{p1}

       A  capture  group  that is referenced by name may appear in the pattern
       before or after the reference.
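
       Of the spellings listed above, the (?P<p1>...) and (?P=p1) pair is also
       accepted by Python's own re module, so that variant can be exercised
       without PCRE2. This is a sketch against a different engine; as in other
       Python examples, the scoped (?i:rah) replaces the mid-pattern (?i),
       which is an assumption about equivalence rather than PCRE2 syntax.

```python
import re

named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

match = named.fullmatch("Rah Rah")
print(match.group("p1"))                   # Rah
print(named.fullmatch("Rah rah") is None)  # True: the reference is caseful
```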

       There may be more than one backreference to the same group. If a  group
       has  not actually been used in a particular match, backreferences to it
       always fail by default. For example, the pattern

         (a|(bc))\2

       always fails if it starts to match "a" rather than  "bc".  However,  if
       the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref-
       erence to an unset value matches an empty string.
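
       The default behaviour can be observed with Python's re module, which
       also fails a backreference to a group that has not been set. It is a
       separate engine with no counterpart to PCRE2_MATCH_UNSET_BACKREF, so
       only the default case is sketched here.

```python
import re

unset = re.compile(r"(a|(bc))\2")

print(unset.match("bcbc").group(0))  # bcbc
print(unset.match("aa"))             # None: group 2 never took part
```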

       Because  there may be many capture groups in a pattern, all digits fol-
       lowing a backslash are taken as part of a potential backreference  num-
       ber.  If  the  pattern continues with a digit character, some delimiter
       must be used to terminate the backreference. If the  PCRE2_EXTENDED  or
       PCRE2_EXTENDED_MORE  option is set, this can be white space. Otherwise,
       the \g{} syntax or an empty comment (see "Comments" below) can be used.
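
       Two of these delimiters, white space in extended mode and an empty
       comment, also exist in Python's re module, so they can be sketched
       there. Python is a different engine and does not accept \g{} inside a
       pattern, so that form is not shown; re.VERBOSE is used as the nearest
       equivalent of PCRE2_EXTENDED, which is an assumption of this sketch.

```python
import re

# An empty comment ends the backreference, so the 0 is a literal digit.
with_comment = re.compile(r"(x)\1(?#)0")
print(bool(with_comment.fullmatch("xx0")))  # True

# In extended (verbose) mode, white space does the same job.
with_space = re.compile(r"(x) \1 0", re.VERBOSE)
print(bool(with_space.fullmatch("xx0")))  # True
```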

   Recursive backreferences

       A backreference that occurs inside the group to which it  refers  fails
       when  the  group  is  first used, so, for example, (a\1) never matches.
       However, such references can be useful inside repeated groups. For  ex-
       ample, the pattern

         (a|b\1)+

       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
       ation of the group, the backreference matches the character string cor-
       responding  to  the  previous iteration. In order for this to work, the
       pattern must be such that the first iteration does not  need  to  match
       the  backreference. This can be done using alternation, as in the exam-
       ple above, or by a quantifier with a minimum of zero.
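
       The iteration rule can be followed by hand. The Python function below
       is a hypothetical simulation of the repeated group applied to a whole
       subject string; it is not how PCRE2 implements the match, and Python's
       own re module is not used because it rejects a reference to a group
       that is still open.

```python
def matches_recursive(subject: str) -> bool:
    """Simulate the repeated group above against the whole of subject."""

    def iterate(pos: int, previous) -> bool:
        # previous is what group 1 captured last time round (None = unset).
        if previous is not None and pos == len(subject):
            return True  # at least one iteration done, nothing left over
        # First alternative: a literal "a".
        if subject.startswith("a", pos) and iterate(pos + 1, "a"):
            return True
        # Second alternative: "b" followed by the previous capture. While
        # the group is unset the backreference fails, so this is skipped.
        if previous is not None:
            piece = "b" + previous
            if subject.startswith(piece, pos) and iterate(pos + len(piece), piece):
                return True
        return False

    return iterate(0, None)


for text in ("aaa", "aba", "ababbaa", "ab", "b"):
    print(text, matches_recursive(text))
```

       For "ababbaa" the successive captures are "a", "ba", "bba" and "a",
       each "b" branch reusing the text of the iteration before it.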

       In versions of PCRE2 earlier than 10.25, backreferences of this type
       caused the group that they reference to be treated as an atomic group.
       This restriction no longer applies, and backtracking into such groups
       can occur as normal.


ASSERTIONS

       An assertion is a test on the characters  following  or  preceding  the
       current matching point that does not consume any characters. The simple
       assertions  coded  as  \b,  \B,  \A,  \G, \Z, \z, ^ and $ are described
       above.

       More complicated assertions are coded as  parenthesized  groups.  There
       are  two  kinds:  those  that look ahead of the current position in the
       subject string, and those that look behind it, and in each case an  as-
       sertion  may  be  positive (must match for the assertion to be true) or
       negative (must not match for the assertion to be  true).  An  assertion
       group is matched in the normal way, and if it is true, matching contin-
       ues  after it, but with the matching position in the subject string re-
       set to what it was before the assertion was processed.

       The Perl-compatible lookaround assertions are atomic. If  an  assertion
       is  true, but there is a subsequent matching failure, there is no back-
       tracking into the assertion. However, there are some cases  where  non-
       atomic  assertions can be useful. PCRE2 has some support for these, de-
       scribed in the section entitled "Non-atomic assertions" below, but they
       are not Perl-compatible.

       A lookaround assertion may appear as the  condition  in  a  conditional
       group  (see  below). In this case, the result of matching the assertion
       determines which branch of the condition is followed.

       Assertion groups are not capture groups. If an assertion contains  cap-
       ture  groups within it, these are counted for the purposes of numbering
       the capture groups in the whole pattern. Within each branch of  an  as-
       sertion,  locally  captured  substrings  may be referenced in the usual
       way. For example, a sequence such as (.)\g{-1} can  be  used  to  check
       that two adjacent characters are the same.

       When  a  branch within an assertion fails to match, any substrings that
       were captured are discarded (as happens with any  pattern  branch  that
       fails  to  match).  A  negative  assertion  is  true  only when all its
       branches fail to match; this means that no captured substrings are ever
       retained after a successful negative assertion. When an assertion  con-
       tains a matching branch, what happens depends on the type of assertion.

       For  a  positive  assertion, internally captured substrings in the suc-
       cessful branch are retained, and matching continues with the next  pat-
       tern  item  after  the  assertion. For a negative assertion, a matching
       branch means that the assertion is not true. If such  an  assertion  is
       being  used as a condition in a conditional group (see below), captured
       substrings are retained,  because  matching  continues  with  the  "no"
       branch of the condition. For other failing negative assertions, control
       passes to the previous backtracking point, thus discarding any captured
       strings within the assertion.
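
       The two simplest outcomes, a capture kept by a positive assertion and
       nothing kept by a successful negative one, can be seen in Python's re
       module. It is a separate engine, used here as an assumption-laden
       stand-in; the conditional-group case has no direct equivalent there
       and is not shown.

```python
import re

# A capture made inside a successful positive lookahead is retained.
kept = re.match(r"(?=(\w+))\w", "hello")
print(kept.group(0), kept.group(1))  # h hello

# After a successful negative lookahead the inner group is unset.
dropped = re.match(r"(?!(\d))\w", "hello")
print(dropped.group(1))  # None
```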

       Most  assertion groups may be repeated; though it makes no sense to as-
       sert the same thing several times, the side effect of capturing in pos-
       itive assertions may occasionally be useful. However, an assertion that
       forms the condition for a conditional  group  may  not  be  quantified.
       PCRE2  used  to restrict the repetition of assertions, but from release
       10.35 the only restriction is that an unlimited maximum  repetition  is
       changed  to  be one more than the minimum. For example, {3,} is treated
       as {3,4}.

   Alphabetic assertion names

       Traditionally, symbolic sequences such as (?= and (?<= have  been  used
       to  specify lookaround assertions. Perl 5.28 introduced some experimen-
       tal alphabetic alternatives which might be easier to remember. They all
       start with (* instead of (? and must be written using lower  case  let-
       ters. PCRE2 supports the following synonyms:

         (*positive_lookahead:  or (*pla: is the same as (?=
         (*negative_lookahead:  or (*nla: is the same as (?!
         (*positive_lookbehind: or (*plb: is the same as (?<=
         (*negative_lookbehind: or (*nlb: is the same as (?<!

       For  example,  (*pla:foo) is the same assertion as (?=foo). In the fol-
       lowing sections, the various assertions are described using the  origi-
       nal symbolic forms.
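
       The synonym table can be written down as a lookup. The Python helper
       below, to_symbolic, is hypothetical and performs a naive text
       substitution: it assumes the alphabetic openers do not occur inside
       character classes or after an escaping backslash, which a real pattern
       parser would have to check.

```python
# Synonyms from the table above, keyed by alphabetic name.
SYMBOLIC = {
    "positive_lookahead": "(?=", "pla": "(?=",
    "negative_lookahead": "(?!", "nla": "(?!",
    "positive_lookbehind": "(?<=", "plb": "(?<=",
    "negative_lookbehind": "(?<!", "nlb": "(?<!",
}


def to_symbolic(pattern: str) -> str:
    """Rewrite (*name: openers as their symbolic equivalents."""
    for name, symbol in SYMBOLIC.items():
        pattern = pattern.replace("(*" + name + ":", symbol)
    return pattern


print(to_symbolic("(*pla:foo)"))                 # (?=foo)
print(to_symbolic("(*negative_lookbehind:x)y"))  # (?<!x)y
```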

   Lookahead assertions

       Lookahead assertions start with (?= for positive assertions and (?! for
       negative assertions. For example,

         \w+(?=;)

       matches  a word followed by a semicolon, but does not include the semi-
       colon in the match, and

         foo(?!bar)

       matches any occurrence of "foo" that is not  followed  by  "bar".  Note
       that the apparently similar pattern

         (?!foo)bar

       does  not  find  an  occurrence  of "bar" that is preceded by something
       other than "foo"; it finds any occurrence of "bar" whatsoever,  because
       the assertion (?!foo) is always true when the next three characters are
       "bar". A lookbehind assertion is needed to achieve the other effect.

       If you want to force a matching failure at some point in a pattern, the
       most  convenient  way to do it is with (?!) because an empty string al-
       ways matches, so an assertion that requires there not to  be  an  empty
       string must always fail.  The backtracking control verb (*FAIL) or (*F)
       is a synonym for (?!).
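
       These lookahead forms are common to many engines, so the examples can
       be checked in Python's re module. This is a sketch against a different
       engine; the subject strings are invented for illustration, and the
       (*FAIL) verb is PCRE2-specific and is not shown.

```python
import re

print(re.search(r"\w+(?=;)", "total;").group(0))          # total
print(re.search(r"foo(?!bar)", "foobar foobaz").start())  # 7

# (?!foo)bar still finds the "bar" in "foobar": at offset 3 the next three
# characters are "bar", so the negative lookahead for "foo" is true.
print(re.search(r"(?!foo)bar", "foobar").start())         # 3

# An empty negative lookahead can never be true.
print(re.search(r"(?!)", "anything"))                     # None
```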

   Lookbehind assertions

       Lookbehind  assertions start with (?<= for positive assertions and (?<!
       for negative assertions. For example,

         (?<!foo)bar

       does find an occurrence of "bar" that is not  preceded  by  "foo".  The
       contents  of a lookbehind assertion are restricted such that there must
       be a known maximum to the lengths of all the strings it matches.  There
       are two cases:

       If every top-level alternative matches a fixed length, for example

         (?<=colour|color)

       there  is a limit of 65535 characters to the lengths, which do not have
       to be the same, as this example demonstrates. This is the only kind  of
       lookbehind  supported  by  PCRE2 versions earlier than 10.43 and by the
       alternative matching function pcre2_dfa_match().

       In PCRE2 10.43 and later, pcre2_match() supports lookbehind  assertions
       in  which  one  or  more top-level alternatives can match more than one
       string length, for example

         (?<=colou?r)

       The maximum matching length for any branch of the lookbehind is limited
       to a value set by the calling program (default 255 characters).  Unlim-
       ited  repetition (for example \d*) is not supported. In some cases, the
       escape sequence \K (see above) can be used instead of a lookbehind  as-
       sertion  at  the  start  of a pattern to get round the length limit re-
       striction.
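
       The fixed-length case can be tried in Python's re module, which is a
       separate engine and is stricter than PCRE2: it accepts a lookbehind
       only when every alternative has the same fixed length, so neither
       (?<=colour|color) nor (?<=colou?r) is accepted there. The sketch below
       therefore shows only the negative example from the start of this
       subsection and a single-length positive lookbehind chosen for
       illustration.

```python
import re

# "bar" at offset 3 is preceded by "foo"; the one at offset 7 is not.
print(re.search(r"(?<!foo)bar", "foobar bar").start())   # 7

# Match the rest of a word only when "color" comes immediately before it.
print(re.search(r"(?<=color)\w+", "colorful").group(0))  # ful
```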

       In UTF-8 and UTF-16 modes, PCRE2 does not allow the  \C  escape  (which
       matches  a single code unit even in a UTF mode) to appear in lookbehind
       assertions, because it makes it impossible to calculate the  length  of
       the  lookbehind.  The \X and \R escapes, which can match different num-
       bers of code units, are never permitted in lookbehinds.

       "Subroutine" calls (see below) such as (?2) or (?&X) are  permitted  in
       lookbehinds,  as  long  as  the called capture group matches a limited-
       length string. However, recursion, that is, a "subroutine" call into  a
       group that is already active, is not supported.

       PCRE2  supports backreferences in lookbehinds, but only if certain con-
       ditions are met. The PCRE2_MATCH_UNSET_BACKREF option must not be  set,
       there  must be no use of (?| in the pattern (it creates duplicate group
       numbers), and if the backreference is by name, the name must be unique.
       Of course, the referenced group must itself match a limited length sub-
       string. The following pattern matches words  containing  at  least  two
       characters that begin and end with the same character:

         \b(\w)\w++(?<=\1)

       Possessive  quantifiers  can be used in conjunction with lookbehind as-
       sertions to specify efficient matching at the end of  subject  strings.
       Consider a simple pattern such as

         abcd$

       when  applied  to  a  long string that does not match. Because matching
       proceeds from left to right, PCRE2 will look for each "a" in  the  sub-
       ject  and  then see if what follows matches the rest of the pattern. If
       the pattern is specified as

         ^.*abcd$

       the initial .* matches the entire string at first, but when this  fails
       (because there is no following "a"), it backtracks to match all but the
       last  character,  then all but the last two characters, and so on. Once
       again the search for "a" covers the entire string, from right to  left,
       so we are no better off. However, if the pattern is written as

         ^.*+(?<=abcd)

       there can be no backtracking for the .*+ item because of the possessive
       quantifier; it can match only the entire string. The subsequent lookbe-
       hind  assertion  does  a single test on the last four characters. If it
       fails, the match fails immediately. For  long  strings,  this  approach
       makes a significant difference to the processing time.

   Using multiple assertions

       Several assertions (of any sort) may occur in succession. For example,

         (?<=\d{3})(?<!999)foo

       matches  "foo" preceded by three digits that are not "999". Notice that
       each of the assertions is applied independently at the  same  point  in
       the  subject  string.  First  there  is a check that the previous three
       characters are all digits, and then there is  a  check  that  the  same
       three characters are not "999". This pattern does not match "foo"
       preceded by six characters, the first three of which are digits and the
       last three of which are not "999". For example, it does not match
       "123abcfoo". A pattern to do that is

         (?<=\d{3}...)(?<!999)foo

       This  time  the  first assertion looks at the preceding six characters,
       checking that the first three are digits, and then the second assertion
       checks that the preceding three characters are not "999".

       Assertions can be nested in any combination. For example,

         (?<=(?<!foo)bar)baz

       matches an occurrence of "baz" that is preceded by "bar" which in  turn
       is not preceded by "foo", while

         (?<=\d{3}(?!999)...)foo

       is  another pattern that matches "foo" preceded by three digits and any
       three characters that are not "999".
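
       All of the lookbehinds in this subsection have a single fixed length,
       so the patterns can be checked in Python's re module as well. This is
       a sketch against a different engine, with subject strings invented for
       illustration.

```python
import re

# Both assertions inspect the same three characters before "foo".
same_point = re.compile(r"(?<=\d{3})(?<!999)foo")
print(bool(same_point.search("123foo")))     # True
print(bool(same_point.search("999foo")))     # False
print(bool(same_point.search("123abcfoo")))  # False

# The first assertion now looks back six characters.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")
print(bool(six_back.search("123abcfoo")))  # True
print(bool(six_back.search("123999foo")))  # False

# A lookbehind nested inside another lookbehind.
nested = re.compile(r"(?<=(?<!foo)bar)baz")
print(bool(nested.search("barbaz")))     # True
print(bool(nested.search("foobarbaz")))  # False
```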


NON-ATOMIC ASSERTIONS

       Traditional lookaround assertions are atomic. That is, if an  assertion
       is  true, but there is a subsequent matching failure, there is no back-
       tracking into the assertion. However, there are some cases  where  non-
       atomic  positive  assertions  can be useful. PCRE2 provides these using
       the following syntax:

         (*non_atomic_positive_lookahead:  or (*napla: or (?*
         (*non_atomic_positive_lookbehind: or (*naplb: or (?<*

       Consider the problem of finding the right-most word in  a  string  that
       also  appears  earlier  in the string, that is, it must appear at least
       twice in total.  This pattern returns the required result  as  captured
       substring 1:

         ^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2}

       For  a subject such as "word1 word2 word3 word2 word3 word4" the result
       is "word3". How does it work? At the start, ^(?x) anchors  the  pattern
       and sets the "x" option, which causes white space (introduced for read-
       ability)  to  be  ignored. Inside the assertion, the greedy .* at first
       consumes the entire string, but then has to backtrack until the rest of
       the assertion can match a word, which is captured by group 1. In  other
       words,  when  the  assertion first succeeds, it captures the right-most
       word in the string.

       The current matching point is then reset to the start of  the  subject,
       and  the  rest  of  the pattern match checks for two occurrences of the
       captured word, using an ungreedy .*? to scan from  the  left.  If  this
       succeeds,  we are done, but if the last word in the string does not oc-
       cur twice, this part of the pattern  fails.  If  a  traditional  atomic
       lookahead  (?=  or (*pla: had been used, the assertion could not be re-
       entered, and the whole match would fail. The pattern would succeed only
       if the very last word in the subject was found twice.

       Using a non-atomic lookahead, however, means that when  the  last  word
       does  not  occur  twice  in the string, the lookahead can backtrack and
       find the second-last word, and so on, until either the match  succeeds,
       or all words have been tested.
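
       The order in which candidates are tried can be written out as ordinary
       code. The Python function below is a procedural restatement of the
       search, not a description of how PCRE2 executes the pattern; the name
       rightmost_repeated_word is hypothetical, and words are assumed to be
       runs of \w characters.

```python
import re


def rightmost_repeated_word(subject: str):
    """Return the right-most word that occurs at least twice, or None."""
    words = re.findall(r"\w+", subject)
    # Try the last word first, then the second-last, and so on, which is
    # the order in which backtracking into the assertion offers candidates.
    for candidate in reversed(words):
        if words.count(candidate) >= 2:
            return candidate
    return None


print(rightmost_repeated_word("word1 word2 word3 word2 word3 word4"))  # word3
```

       The first candidate, "word4", occurs only once; moving on to "word3"
       corresponds to the backtrack that an atomic lookahead would forbid.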

       Two conditions must be met for a non-atomic assertion to be useful: the
       contents  of one or more capturing groups must change after a backtrack
       into the assertion, and there must be  a  backreference  to  a  changed
       group  later  in  the pattern. If this is not the case, the rest of the
       pattern match fails exactly as before because nothing has  changed,  so
       using a non-atomic assertion just wastes resources.

       There  is one exception to backtracking into a non-atomic assertion. If
       an (*ACCEPT) control verb is triggered, the assertion  succeeds  atomi-
       cally.  That  is,  a subsequent match failure cannot backtrack into the
       assertion.

       Non-atomic assertions are not supported  by  the  alternative  matching
       function pcre2_dfa_match(). They are supported by JIT, but only if they
       do not contain any control verbs such as (*ACCEPT). (This may change in
       the future.) Note that assertions that appear as conditions for
       conditional groups (see below) must be atomic.


SCRIPT RUNS

       In  concept, a script run is a sequence of characters that are all from
       the same Unicode script such as Latin or Greek. However,  because  some
       scripts  are  commonly  used together, and because some diacritical and
       other marks are used with multiple scripts,  it  is  not  that  simple.
       There is a full description of the rules that PCRE2 uses in the section
       entitled "Script Runs" in the pcre2unicode documentation.

       If  part  of a pattern is enclosed between (*script_run: or (*sr: and a
       closing parenthesis, it fails if the sequence  of  characters  that  it
       matches  is  not a script run. After a failure, normal backtracking oc-
       curs. Script runs can be used to detect spoofing attacks using  charac-
       ters  that  look  the  same, but are from different scripts. The string
       "paypal.com" is an infamous example, where the letters could be a  mix-
       ture of Latin and Cyrillic. This pattern ensures that the matched char-
       acters in a sequence of non-spaces that follow white space are a script
       run:

         \s+(*sr:\S+)

       To  be  sure  that  they are all from the Latin script (for example), a
       lookahead can be used:

         \s+(?=\p{Latin})(*sr:\S+)

       This works as long as the first character is expected to be a character
       in that script, and not (for example)  punctuation,  which  is  allowed
       with  any script. If this is not the case, a more creative lookahead is
       needed. For example, if digits, underscore, and dots are  permitted  at
       the start:

         \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
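
       The spoofing problem that motivates script runs can be approximated
       outside a regex engine. The Python sketch below is a rough stand-in
       only: it labels each letter by the first word of its Unicode character
       name (for example "LATIN" or "CYRILLIC"), which is an assumption and
       not the Unicode Script property or the script run rules that PCRE2
       applies. The function name looks_like_one_script is hypothetical.

```python
import unicodedata


def looks_like_one_script(text: str) -> bool:
    """Rough check that all letters in text share one script label."""
    labels = set()
    for ch in text:
        if ch.isalpha():
            # First word of the character name, e.g. "LATIN" or "CYRILLIC".
            labels.add(unicodedata.name(ch).split()[0])
    return len(labels) <= 1


print(looks_like_one_script("paypal.com"))       # True
print(looks_like_one_script("p\u0430ypal.com"))  # False: U+0430 is Cyrillic
```

       Non-letters such as the dot are ignored here, loosely mirroring the
       point made above that punctuation is allowed with any script.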


       In  many  cases, backtracking into a script run pattern fragment is not
       desirable. The script run can employ an atomic group to  prevent  this.
       Because  this is a common requirement, a shorthand notation is provided
       by (*atomic_script_run: or (*asr:

         (*asr:...) is the same as (*sr:(?>...))

       Note that the atomic group is inside the script run. Putting it outside
       would not prevent backtracking into the script run pattern.

       Support for script runs is not available if PCRE2 is  compiled  without
       Unicode support. A compile-time error is given if any of the above con-
       structs  is encountered. Script runs are not supported by the alternate
       matching function, pcre2_dfa_match(), because they use the same  mecha-
       nism as capturing parentheses.

       Warning:  The  (*ACCEPT)  control  verb  (see below) should not be used
       within a script run group, because it causes an immediate exit from the
       group, bypassing the script run checking.


CONDITIONAL GROUPS

       It is possible to cause the matching process to obey a pattern fragment
       conditionally or to choose between two alternative fragments, depending
       on the result of an assertion, or whether a specific capture group  has
       already been matched. The two possible forms of conditional group are:

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

       If  the  condition is satisfied, the yes-pattern is used; otherwise the
       no-pattern (if present) is used. An absent no-pattern is equivalent  to
       an  empty string (it always matches). If there are more than two alter-
       natives in the group, a compile-time error occurs. Each of the two  al-
       ternatives may itself contain nested groups of any form, including con-
       ditional  groups;  the  restriction to two alternatives applies only at
       the level of the condition itself. This pattern fragment is an  example
       where the alternatives are complex:

         (?(1) (A|B|C) | (D | (?(2)E|F) | E) )


       There are five kinds of condition: references to capture groups, refer-
       ences  to  recursion,  two pseudo-conditions called DEFINE and VERSION,
       and assertions.

   Checking for a used capture group by number

       If the text between the parentheses consists of a sequence  of  digits,
       the  condition is true if a capture group of that number has previously
       matched. If there is more than one capture group with the  same  number
       (see  the earlier section about duplicate group numbers), the condition
       is true if any of them have matched. An alternative notation, which  is
       a PCRE2 extension, not supported by Perl, is to precede the digits with
       a plus or minus sign. In this case, the group number is relative rather
       than  absolute.  The most recently opened capture group (which could be
       enclosing this condition) can be referenced by (?(-1),  the  next  most
       recent by (?(-2), and so on. Inside loops it can also make sense to re-
       fer  to  subsequent groups.  The next capture group to be opened can be
       referenced as (?(+1), and so on. The value zero in any of  these  forms
       is not used; it provokes a compile-time error.

       Consider  the  following  pattern, which contains non-significant white
       space to make it more readable (assume the PCRE2_EXTENDED  option)  and
       to divide it into three parts for ease of discussion:

         ( \( )?    [^()]+    (?(1) \) )

       The  first  part  matches  an optional opening parenthesis, and if that
       character is present, sets it as the first captured substring. The sec-
       ond part matches one or more characters that are not  parentheses.  The
       third  part  is a conditional group that tests whether or not the first
       capture group matched. If it did, that is, if the subject started with an
       opening  parenthesis,  the condition is true, and so the yes-pattern is
       executed and a closing parenthesis is required.  Otherwise,  since  no-
       pattern is not present, the conditional group matches nothing. In other
       words,  this  pattern matches a sequence of non-parentheses, optionally
       enclosed in parentheses.
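
       The behaviour of this pattern can be tried out without PCRE2, because
       Python's standard re module accepts the same (?(1)...) syntax for an
       absolute group number. The sketch below is an illustration using
       Python rather than the PCRE2 API; re.VERBOSE stands in for
       PCRE2_EXTENDED, and the names OPTIONAL_PARENS and
       is_optionally_parenthesized are hypothetical.

```python
import re

# Same three-part pattern as above; re.VERBOSE ignores the white space.
OPTIONAL_PARENS = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)


def is_optionally_parenthesized(text):
    """True if text is a run of non-parentheses, optionally enclosed in ()."""
    return OPTIONAL_PARENS.fullmatch(text) is not None


for sample in ["abc", "(abc)", "(abc", "abc)"]:
    print(repr(sample), is_optionally_parenthesized(sample))
```

       Relative forms such as (?(-1)...) are a PCRE2 extension and are not
       accepted by Python's re module.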

       If you were embedding this pattern in a larger one,  you  could  use  a
       relative reference:

         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...

       This  makes  the  fragment independent of the parentheses in the larger
       pattern.

   Checking for a used capture group by name

       Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
       used  capture group by name. For compatibility with earlier versions of
       PCRE1, which had this facility before Perl, the syntax (?(name)...)  is
       also  recognized.   Note, however, that undelimited names consisting of
       the letter R followed by digits are ambiguous (see the  following  sec-
       tion). Rewriting the above example to use a named group gives this:

         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )

       If  the  name used in a condition of this kind is a duplicate, the test
       is applied to all groups of the same name, and is true if  any  one  of
       them has matched.
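
       As a rough illustration in Python rather than PCRE2: Python's re mod-
       ule spells the named group (?P<OPEN>...) and accepts only the undelim-
       ited condition form (?(OPEN)...). The name NAMED_PARENS below is hypo-
       thetical, and re.VERBOSE stands in for PCRE2_EXTENDED.

```python
import re

# Named-group version of the optionally parenthesized pattern.
NAMED_PARENS = re.compile(
    r"(?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )", re.VERBOSE
)

with_parens = NAMED_PARENS.fullmatch("(abc)")
without_parens = NAMED_PARENS.fullmatch("abc")

# The OPEN group is set only when the subject started with "(".
print(with_parens.group("OPEN"), without_parens.group("OPEN"))
```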

   Checking for pattern recursion

       "Recursion"  in  this sense refers to any subroutine-like call from one
       part of the pattern to another, whether or not it  is  actually  recur-
       sive.  See  the  sections  entitled "Recursive patterns" and "Groups as
       subroutines" below for details of recursion and subroutine calls.

       If a condition is the string (R), and there is no  capture  group  with
       the  name R, the condition is true if matching is currently in a recur-
       sion or subroutine call to the whole pattern or any capture  group.  If
       digits  follow  the letter R, and there is no group with that name, the
       condition is true if the most recent call is  into  a  group  with  the
       given  number,  which must exist somewhere in the overall pattern. This
       is a contrived example that is equivalent to a+b:

         ((?(R1)a+|(?1)b))

       However, in both cases, if there is a capture  group  with  a  matching
       name,  the  condition tests for its being set, as described in the sec-
       tion above, instead of testing for recursion. For example,  creating  a
       group  with  the  name  R1  by adding (?<R1>) to the above pattern com-
       pletely changes its meaning.

       If a name preceded by ampersand follows the letter R, for example:

         (?(R&name)...)

       the condition is true if the most recent recursion is into a  group  of
       that name (which must exist within the pattern).

       This condition does not check the entire recursion stack. It tests only
       the  current  level.  If the name used in a condition of this kind is a
       duplicate, the test is applied to all groups of the same name,  and  is
       true if any one of them is the most recent recursion.

       At "top level", all these recursion test conditions are false.

   Defining capture groups for use by reference only

       If the condition is the string (DEFINE), the condition is always false,
       even  if there is a group with the name DEFINE. In this case, there may
       be only one alternative in the rest of the conditional group. It is al-
       ways skipped if control reaches this point in the pattern; the idea  of
       DEFINE  is that it can be used to define subroutines that can be refer-
       enced from elsewhere. (The use of subroutines is described below.)  For
       example,  a  pattern  to match an IPv4 address such as "192.168.23.245"
       could be written like this (ignore white space and line breaks):

         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
         \b (?&byte) (\.(?&byte)){3} \b

       The first part of the pattern is a DEFINE group  inside  which  another
       group  named "byte" is defined. This matches an individual component of
       an IPv4 address (a number less than 256). When  matching  takes  place,
       this  part  of  the pattern is skipped because DEFINE acts like a false
       condition. The rest of the pattern uses references to the  named  group
       to  match the four dot-separated components of an IPv4 address, insist-
       ing on a word boundary at each end.
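
       Python's re module has neither DEFINE nor subroutine calls, so the
       following sketch only imitates the idea: the "byte" fragment is writ-
       ten once and reused by string substitution instead of by (?&byte). It
       is an approximation for experimentation, not PCRE2 behaviour; the
       names BYTE and IPV4 are hypothetical, and a non-capturing group re-
       places the capturing group used in the pattern above.

```python
import re

# Stand-in for (?<byte> ...): one fragment, defined once, reused below.
BYTE = r"(?: 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d )"

# Four dot-separated components with a word boundary at each end.
IPV4 = re.compile(rf"\b {BYTE} (?: \. {BYTE} ){{3}} \b", re.VERBOSE)

for sample in ["192.168.23.245", "256.1.1.1"]:
    print(sample, IPV4.fullmatch(sample) is not None)
```

       String substitution expands the fragment at each use, whereas a PCRE2
       subroutine call refers to a single compiled group; the accepted
       strings are intended to be the same for this example.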

   Checking the PCRE2 version

       Programs that link with a PCRE2 library can check the version by  call-
       ing  pcre2_config()  with  appropriate arguments. Users of applications
       that do not have access to the underlying code cannot do this.  A  spe-
       cial  "condition" called VERSION exists to allow such users to discover
       which version of PCRE2 they are dealing with by using this condition to
       match a string such as "yesno". VERSION must be followed either by  "="
       or ">=" and a version number.  For example:

         (?(VERSION>=10.4)yes|no)

       This pattern matches "yes" if the PCRE2 version is greater than or equal
       to 10.4, or "no" otherwise. The fractional part of the version number
       may not contain more than two digits.

   Assertion conditions

       If  the  condition  is  not  in  any of the above formats, it must be a
       parenthesized assertion. This may be a positive or  negative  lookahead
       or  lookbehind  assertion. However, it must be a traditional atomic as-
       sertion, not one of the non-atomic assertions.

       Consider this pattern, again containing  non-significant  white  space,
       and with the two alternatives on the second line:

         (?(?=[^a-z]*[a-z])
         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

       The  condition  is  a  positive lookahead assertion that matches an op-
       tional sequence of non-letters followed by a letter. In other words, it
       tests for the presence of at least one letter in the subject. If a let-
       ter is found, the subject is matched  against  the  first  alternative;
       otherwise  it  is  matched  against  the  second.  This pattern matches
       strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
       letters and dd are digits.
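
       Python's re module does not accept an assertion as a condition, but
       the selection logic can be imitated there by guarding each alternative
       with the assertion or its negation. The sketch below is such an emula-
       tion, not PCRE2 syntax, and it ignores the capture-retention differ-
       ences between the two forms; the names HAS_LETTER and DATE_LIKE are
       hypothetical.

```python
import re

# The condition from the pattern above: some letter lies ahead.
HAS_LETTER = r"[^a-z]*[a-z]"

DATE_LIKE = re.compile(
    rf"""(?: (?={HAS_LETTER}) \d{{2}}-[a-z]{{3}}-\d{{2}}   # condition true
           | (?!{HAS_LETTER}) \d{{2}}-\d{{2}}-\d{{2}}      # condition false
         )""",
    re.VERBOSE,
)

# "12-34-56x" contains a letter, so only the first alternative is tried,
# and it fails at "34"; the all-digit alternative is never used.
for sample in ["12-abc-34", "12-34-56", "12-34-56x"]:
    print(sample, DATE_LIKE.match(sample) is not None)
```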

       When an assertion that is a condition contains capture groups, any cap-
       turing  that  occurs  in  a matching branch is retained afterwards, for
       both positive and negative assertions, because matching always  contin-
       ues  after  the  assertion, whether it succeeds or fails. (Compare non-
       conditional assertions, for which captures are retained only for  posi-
       tive assertions that succeed.)


COMMENTS

       There are two ways of including comments in patterns that are processed
       by  PCRE2.  In  both  cases,  the start of the comment must not be in a
       character class, nor in the middle of any  other  sequence  of  related
       characters  such  as (?: or a group name or number. The characters that
       make up a comment play no part in the pattern matching.

       The sequence (?# marks the start of a comment that continues up to  the
       next  closing parenthesis. Nested parentheses are not permitted. If the
       PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is  set,  an  unescaped  #
       character  also  introduces  a comment, which in this case continues to
       immediately after the next newline character or character  sequence  in
       the pattern. Which characters are interpreted as newlines is controlled
       by  an option passed to the compiling function or by a special sequence
       at the start of the pattern, as described in the section entitled "New-
       line conventions" above. Note that the end of this type of comment is a
       literal newline sequence in the pattern; escape sequences  that  happen
       to represent a newline do not count. For example, consider this pattern
       when  PCRE2_EXTENDED is set, and the default newline convention (a sin-
       gle linefeed character) is in force:

         abc #comment \n still comment

       On encountering the # character, pcre2_compile() skips  along,  looking
       for  a newline in the pattern. The sequence \n is still literal at this
       stage, so it does not terminate the comment. Only an  actual  character
       with the code value 0x0a (the default newline) does so.
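
       The distinction between an escape sequence and an actual linefeed can
       be observed in Python, whose re.VERBOSE mode treats # comments in a
       comparable way. This is an observation about Python's re module, of-
       fered as an analogy rather than as a statement about pcre2_compile();
       the variable names are hypothetical.

```python
import re

# Raw string: the two characters backslash and n sit inside the comment, so
# the comment runs to the end of the pattern and only "abc" is compiled.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real linefeed (0x0a), which ends the comment,
# so "def" becomes part of the pattern.
literal = re.compile("abc #comment \n def", re.VERBOSE)

print(escaped.fullmatch("abc") is not None)
print(literal.fullmatch("abcdef") is not None)
```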


RECURSIVE PATTERNS

       Consider  the problem of matching a string in parentheses, allowing for
       unlimited nested parentheses. Without the use of  recursion,  the  best
       that  can  be  done  is  to use a pattern that matches up to some fixed
       depth of nesting. It is not possible to  handle  an  arbitrary  nesting
       depth.

       For some time, Perl has provided a facility that allows regular expres-
       sions  to recurse (amongst other things). It does this by interpolating
       Perl code in the expression at run time, and the code can refer to  the
       expression itself. A Perl pattern using code interpolation to solve the
       parentheses problem can be created like this:

         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

       The (?p{...}) item interpolates Perl code at run time, and in this case
       refers recursively to the pattern in which it appears.

       Obviously,  PCRE2  cannot  support  the interpolation of Perl code. In-
       stead, it supports special syntax for recursion of the entire  pattern,
       and also for individual capture group recursion. After its introduction
       in PCRE1 and Python, this kind of recursion was subsequently introduced
       into Perl at release 5.10.

       A  special  item  that consists of (? followed by a number greater than
       zero and a closing parenthesis is a recursive subroutine  call  of  the
       capture  group of the given number, provided that it occurs inside that
       group. (If not, it is a non-recursive subroutine  call,  which  is  de-
       scribed in the next section.) The special item (?R) or (?0) is a recur-
       sive call of the entire regular expression.

       This  PCRE2  pattern  solves the nested parentheses problem (assume the
       PCRE2_EXTENDED option is set so that white space is ignored):

         \( ( [^()]++ | (?R) )* \)

       First it matches an opening parenthesis. Then it matches any number  of
       substrings  which can either be a sequence of non-parentheses, or a re-
       cursive match of the pattern itself (that is, a correctly parenthesized
       substring).  Finally there is a closing parenthesis. Note the use of  a
       possessive  quantifier  to  avoid  backtracking  into sequences of non-
       parentheses.
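
       To make the structure of the recursion concrete, the same three steps
       can be written as an ordinary recursive function. The Python sketch
       below is a hand-written equivalent for illustration, not a use of
       PCRE2 (Python's re module has no (?R)); the function name match_parens
       is hypothetical.

```python
def match_parens(subject, pos=0):
    """Return the index just past a balanced (...) starting at pos, or -1."""
    # \(  : an opening parenthesis is required first.
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1          # \)  : closing parenthesis ends the match
        if ch == "(":
            pos = match_parens(subject, pos)   # plays the role of (?R)
            if pos < 0:
                return -1
        else:
            pos += 1                # plays the role of [^()]++
    return -1                       # ran out of subject before the closing )


print(match_parens("(ab(cd)ef)"))
print(match_parens("(aaaa()"))
```

       Because the function never revisits characters it has already con-
       sumed, it behaves like the possessive form of the pattern and fails
       quickly on unbalanced input.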

       If this were part of a larger pattern, you would not  want  to  recurse
       the entire pattern, so instead you could use this:

         ( \( ( [^()]++ | (?1) )* \) )

       We  have  put the pattern into parentheses, and caused the recursion to
       refer to them instead of the whole pattern.

       In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
       tricky.  This is made easier by the use of relative references. Instead
       of (?1) in the pattern above you can write (?-2) to refer to the second
       most recently opened parentheses  preceding  the  recursion.  In  other
       words,  a  negative  number counts capturing parentheses leftwards from
       the point at which it is encountered.

       Be aware, however, that if duplicate capture group numbers are in  use,
       relative  references  refer  to the earliest group with the appropriate
       number. Consider, for example:

         (?|(a)|(b)) (c) (?-2)

       The first two capture groups (a) and (b) are both numbered 1, and group
       (c) is number 2. When the reference (?-2) is  encountered,  the  second
       most  recently opened parentheses has the number 1, but it is the first
       such group (the (a) group) to which the recursion refers. This would be
       the same if an absolute reference (?1) was used. In other words,  rela-
       tive references are just a shorthand for computing a group number.

       It  is  also possible to refer to subsequent capture groups, by writing
       references such as (?+2). However, these cannot  be  recursive  because
       the  reference  is not inside the parentheses that are referenced. They
       are always non-recursive subroutine calls, as  described  in  the  next
       section.

       An  alternative  approach  is to use named parentheses. The Perl syntax
       for this is (?&name); PCRE1's earlier syntax  (?P>name)  is  also  sup-
       ported. We could rewrite the above example as follows:

         (?<pn> \( ( [^()]++ | (?&pn) )* \) )

       If there is more than one group with the same name, the earliest one is
       used.

       The example pattern that we have been looking at contains nested unlim-
       ited  repeats,  and  so the use of a possessive quantifier for matching
       strings of non-parentheses is important when applying  the  pattern  to
       strings that do not match. For example, when this pattern is applied to

         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

       it  yields  "no  match" quickly. However, if a possessive quantifier is
       not used, the match runs for a very long time indeed because there  are
       so  many  different  ways the + and * repeats can carve up the subject,
       and all have to be tested before failure can be reported.

       At the end of a match, the values of capturing  parentheses  are  those
       from  the outermost level. If you want to obtain intermediate values, a
       callout function can be used (see below and the pcre2callout documenta-
       tion). If the pattern above is matched against

         (ab(cd)ef)

       the value for the inner capturing parentheses  (numbered  2)  is  "ef",
       which  is  the last value taken on at the top level. If a capture group
       is not matched at the top level, its final  captured  value  is  unset,
       even  if it was (temporarily) set at a deeper level during the matching
       process.

       Do not confuse the (?R) item with the condition (R),  which  tests  for
       recursion.   Consider  this pattern, which matches text in angle brack-
       ets, allowing for arbitrary nesting. Only digits are allowed in  nested
       brackets  (that is, when recursing), whereas any characters are permit-
       ted at the outer level.

         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >

       In this pattern, (?(R) is the start of a conditional  group,  with  two
       different  alternatives  for the recursive and non-recursive cases. The
       (?R) item is the actual recursive call.
9326*22dc650dSSadaf Ebrahimi
9327*22dc650dSSadaf Ebrahimi   Differences in recursion processing between PCRE2 and Perl
9328*22dc650dSSadaf Ebrahimi
9329*22dc650dSSadaf Ebrahimi       Some former differences between PCRE2 and Perl no longer exist.
9330*22dc650dSSadaf Ebrahimi
9331*22dc650dSSadaf Ebrahimi       Before release 10.30, recursion processing in PCRE2 differed from  Perl
9332*22dc650dSSadaf Ebrahimi       in  that  a  recursive  subroutine call was always treated as an atomic
9333*22dc650dSSadaf Ebrahimi       group. That is, once it had matched some of the subject string, it  was
9334*22dc650dSSadaf Ebrahimi       never  re-entered,  even if it contained untried alternatives and there
9335*22dc650dSSadaf Ebrahimi       was a subsequent matching failure. (Historical note:  PCRE  implemented
9336*22dc650dSSadaf Ebrahimi       recursion before Perl did.)
9337*22dc650dSSadaf Ebrahimi
9338*22dc650dSSadaf Ebrahimi       Starting  with  release 10.30, recursive subroutine calls are no longer
       treated as atomic. That is, they can be re-entered to try unused alter-
       natives if there is a matching failure later in the  pattern.  This  is
       now  compatible  with the way Perl works. If you want a subroutine call
       to be atomic, you must explicitly enclose it in an atomic group.

       Supporting backtracking into recursions simplifies certain types of re-
       cursive pattern. For example, this pattern matches palindromic strings:

         ^((.)(?1)\2|.?)$

       The second branch in the group matches a single  central  character  in
       the  palindrome  when there are an odd number of characters, or nothing
       when there are an even number of characters, but in order  to  work  it
       has  to  be  able  to  try the second case when the rest of the pattern
       match fails. If you want to match typical palindromic phrases, the pat-
       tern has to ignore all non-word characters,  which  can  be  done  like
       this:

         ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$

       If  run  with  the  PCRE2_CASELESS option, this pattern matches phrases
       such as "A man, a plan, a canal: Panama!". Note the use of the  posses-
       sive  quantifier  *+  to  avoid backtracking into sequences of non-word
       characters. Without this, PCRE2 takes a great deal longer (ten times or
       more) to match typical phrases, and Perl takes so long that  you  think
       it has gone into a loop.
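
       As a rough illustration of why backtracking into the recursion is
       needed, the first palindrome pattern above can be modelled by a
       small hand-written backtracking matcher. This is only a sketch in
       Python, not PCRE2 code: the names group1 and is_palindrome are
       invented for the example, and it assumes that the subject has no
       trailing newline for $ to skip.

```python
def group1(s, pos):
    """Yield every end position where group 1, (.)(?1)\\2|.?, can match from pos."""
    can_take_char = pos < len(s) and s[pos] != "\n"  # '.' does not match newline
    # First alternative: (.)(?1)\2
    if can_take_char:
        ch = s[pos]
        # Each end position offered by the recursive call is a backtracking
        # point: if \2 fails here, the recursion is re-entered for the next one.
        for mid in group1(s, pos + 1):
            if mid < len(s) and s[mid] == ch:
                yield mid + 1
    # Second alternative: .? (greedy, so one character first, then none)
    if can_take_char:
        yield pos + 1
    yield pos


def is_palindrome(s):
    # ^ ... $ : group 1 must start at 0 and end exactly at the end of the subject
    return any(end == len(s) for end in group1(s, 0))
```

       If the recursive call were atomic, only its first result would be
       available and the loop over alternative end positions would not
       exist, so a subject such as "abba" could not be matched this way.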

       Another  way  in which PCRE2 and Perl used to differ in their recursion
       processing is in the handling of captured  values.  Formerly  in  Perl,
       when  a  group  was called recursively or as a subroutine (see the next
       section), it had no access to any values that were captured outside the
       recursion, whereas in PCRE2 these values can  be  referenced.  Consider
       this pattern:

         ^(.)(\1|a(?2))

       This  pattern matches "bab". The first capturing parentheses match "b",
       then in the second group, when the backreference \1 fails to match "b",
       the second alternative matches "a" and then recurses. In the recursion,
       \1 does now match "b" and so the whole match succeeds. This match  used
       to fail in Perl, but in later versions (I tried 5.024) it now works.


GROUPS AS SUBROUTINES

       If  the syntax for a recursive group call (either by number or by name)
       is used outside the parentheses to which it refers, it operates  a  bit
       like  a  subroutine  in  a programming language. More accurately, PCRE2
       treats the referenced group as an independent subpattern which it tries
       to match at the current matching position. The called group may be  de-
       fined  before  or  after the reference. A numbered reference can be ab-
       solute or relative, as in these examples:

         (...(absolute)...)...(?2)...
         (...(relative)...)...(?-1)...
         (...(?+1)...(relative)...

       An earlier example pointed out that the pattern

         (sens|respons)e and \1ibility

       matches "sense and sensibility" and "response and responsibility",  but
       not "sense and responsibility". If instead the pattern

         (sens|respons)e and (?1)ibility

       is  used, it does match "sense and responsibility" as well as the other
       two strings. Another example is  given  in  the  discussion  of  DEFINE
       above.
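
       The difference between the two patterns can be tried out with
       Python's standard re module. That module has no (?1) syntax, so
       in this sketch the subroutine call is imitated by writing the
       group's sub-pattern out a second time. That is an assumption
       which holds only for a simple, non-recursive call like this one.

```python
import re

# Backreference: \1 must repeat the text that group 1 actually matched.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Stand-in for (sens|respons)e and (?1)ibility: the group's pattern is
# applied again, independently of what group 1 matched.
subcall = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

backref_hits = [s for s in subjects if backref.fullmatch(s)]
subcall_hits = [s for s in subjects if subcall.fullmatch(s)]
```

       Here backref_hits holds only the first two subjects, whereas
       subcall_hits holds all three.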

       Like  recursions,  subroutine  calls  used to be treated as atomic, but
       this changed at PCRE2 release 10.30, so  backtracking  into  subroutine
       calls  can  now  occur. However, any capturing parentheses that are set
       during the subroutine call revert to their previous values afterwards.

       Processing options such as case-independence are fixed when a group  is
       defined,  so  if  it  is  used  as a subroutine, such options cannot be
       changed for different calls. For example, consider this pattern:

         (abc)(?i:(?-1))

       It matches "abcabc". It does not match "abcABC" because the  change  of
       processing option does not affect the called group.

       The  behaviour  of  backtracking control verbs in groups when called as
       subroutines is described in the section entitled "Backtracking verbs in
       subroutines" below.


ONIGURUMA SUBROUTINE SYNTAX

       For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
       name or a number enclosed either in angle brackets or single quotes, is
       an alternative syntax for calling a group as a subroutine, possibly re-
       cursively.  Here  are  two  of the examples used above, rewritten using
       this syntax:

         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
         (sens|respons)e and \g'1'ibility

       PCRE2 supports an extension to Oniguruma: if a number is preceded by  a
       plus or a minus sign it is taken as a relative reference. For example:

         (abc)(?i:\g<-1>)

       Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
       synonymous. The former is a backreference; the latter is  a  subroutine
       call.


CALLOUTS

       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
       Perl  code to be obeyed in the middle of matching a regular expression.
       This makes it possible, amongst other things, to extract different sub-
       strings that match the same pair of parentheses when there is a repeti-
       tion.

       PCRE2 provides a similar feature, but of course it  cannot  obey  arbi-
       trary  Perl  code. The feature is called "callout". The caller of PCRE2
       provides an external function by putting its entry  point  in  a  match
       context  using  the function pcre2_set_callout(), and then passing that
       context to pcre2_match() or pcre2_dfa_match(). If no match  context  is
       passed, or if the callout entry point is set to NULL, callouts are dis-
       abled.

       Within  a  regular expression, (?C<arg>) indicates a point at which the
       external function is to be called. There  are  two  kinds  of  callout:
       those  with a numerical argument and those with a string argument. (?C)
       on its own with no argument is treated as (?C0). A  numerical  argument
       allows  the  application  to  distinguish  between  different callouts.
       String arguments were added for release 10.20 to make it  possible  for
       script  languages that use PCRE2 to embed short scripts within patterns
       in a similar way to Perl.

       During matching, when PCRE2 reaches a callout point, the external func-
       tion is called. It is provided with the number or  string  argument  of
       the  callout, the position in the pattern, and one item of data that is
       also set in the match block. The callout function may cause matching to
       proceed, to backtrack, or to fail.

       By default, PCRE2 implements a  number  of  optimizations  at  matching
       time,  and  one  side-effect is that sometimes callouts are skipped. If
       you need all possible callouts to happen, you need to set options  that
       disable  the relevant optimizations. More details, including a complete
       description of the programming interface to the callout  function,  are
       given in the pcre2callout documentation.

   Callouts with numerical arguments

       If  you  just  want  to  have  a means of identifying different callout
       points, put a number less than 256 after the  letter  C.  For  example,
       this pattern has two callout points:

         (?C1)abc(?C2)def

       If  the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical
       callouts are automatically installed before each item in  the  pattern.
       They  are all numbered 255. If there is a conditional group in the pat-
       tern whose condition is an assertion, an additional callout is inserted
       just before the condition. An explicit callout may also be set at  this
       position, as in this example:

         (?(?C9)(?=a)abc|def)

       Note that this applies only to assertion conditions, not to other types
       of condition.

   Callouts with string arguments

       A  delimited  string may be used instead of a number as a callout argu-
       ment. The starting delimiter must be one of ` ' " ^ % #  $  {  and  the
       ending delimiter is the same as the start, except for {, where the end-
       ing  delimiter  is  }.  If  the  ending  delimiter is needed within the
       string, it must be doubled. For example:

         (?C'ab ''c'' d')xyz(?C{any text})pqr

       The doubling is removed before the string  is  passed  to  the  callout
       function.
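
       The delimiter rule can be sketched in a few lines of Python. This
       is a model of the rule as stated above, not PCRE2's own parser;
       the names CLOSERS and read_callout_string are invented for the
       example, and the input is assumed to begin at the opening delim-
       iter, that is, just after "(?C".

```python
# Opening delimiter -> closing delimiter, as listed in the text above.
CLOSERS = {"`": "`", "'": "'", '"': '"', "^": "^",
           "%": "%", "#": "#", "$": "$", "{": "}"}


def read_callout_string(text):
    """Return (argument, index just past the closing delimiter)."""
    if not text or text[0] not in CLOSERS:
        raise ValueError("not a callout string delimiter")
    close = CLOSERS[text[0]]
    out = []
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == close:
            if text[i + 1:i + 2] == close:
                # A doubled closing delimiter stands for one literal delimiter.
                out.append(close)
                i += 2
                continue
            return "".join(out), i + 1
        out.append(ch)
        i += 1
    raise ValueError("missing closing delimiter")
```

       Applied to the first callout in the example pattern, this yields
       the string ab 'c' d with the doubling removed, and the returned
       index points at the closing parenthesis of the callout.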


BACKTRACKING CONTROL

       There  are  a  number  of  special "Backtracking Control Verbs" (to use
       Perl's terminology) that modify the behaviour  of  backtracking  during
       matching.  They are generally of the form (*VERB) or (*VERB:NAME). Some
       verbs take either form, and may behave differently depending on whether
       or not a name argument is present. The names are  not  required  to  be
       unique within the pattern.

       By  default,  for  compatibility  with  Perl, a name is any sequence of
       characters that does not include a closing parenthesis. The name is not
       processed in any way, and it is  not  possible  to  include  a  closing
       parenthesis   in  the  name.   This  can  be  changed  by  setting  the
       PCRE2_ALT_VERBNAMES option, but the result is no  longer  Perl-compati-
       ble.

       When  PCRE2_ALT_VERBNAMES  is  set,  backslash processing is applied to
       verb names and only an unescaped  closing  parenthesis  terminates  the
       name.  However, the only backslash items that are permitted are \Q, \E,
       and sequences such as \x{100} that define character code points.  Char-
       acter type escapes such as \d are faulted.

       A closing parenthesis can be included in a name either as \) or between
       \Q  and  \E. In addition to backslash processing, if the PCRE2_EXTENDED
       or PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb
       names is skipped, and #-comments are recognized, exactly as in the rest
       of the pattern.  PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do  not  affect
       verb names unless PCRE2_ALT_VERBNAMES is also set.

       The  maximum  length of a name is 255 in the 8-bit library and 65535 in
       the 16-bit and 32-bit libraries. If the name is empty, that is, if  the
       closing  parenthesis immediately follows the colon, the effect is as if
       the colon were not there. Any number of these verbs may occur in a pat-
       tern. Except for (*ACCEPT), they may not be quantified.

       Since these verbs are specifically related  to  backtracking,  most  of
       them  can be used only when the pattern is to be matched using the tra-
       ditional matching function, because that uses a backtracking algorithm.
       With the exception of (*FAIL), which behaves like  a  failing  negative
       assertion, the backtracking control verbs cause an error if encountered
       by the DFA matching function.

       The  behaviour  of  these  verbs in repeated groups, assertions, and in
       capture groups called as subroutines (whether or  not  recursively)  is
       documented below.

   Optimizations that affect backtracking verbs

       PCRE2 contains some optimizations that are used to speed up matching by
       running some checks at the start of each match attempt. For example, it
       may  know  the minimum length of matching subject, or that a particular
       character must be present. When one of these optimizations bypasses the
       running of a match,  any  included  backtracking  verbs  will  not,  of
       course, be processed. You can suppress the start-of-match optimizations
       by  setting  the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com-
       pile(), or by starting the pattern with (*NO_START_OPT). There is  more
       discussion of this option in the section entitled "Compiling a pattern"
       in the pcre2api documentation.

       Experiments  with  Perl  suggest that it too has similar optimizations,
       and like PCRE2, turning them off can change the result of a match.

   Verbs that act immediately

       The following verbs act as soon as they are encountered.

          (*ACCEPT) or (*ACCEPT:NAME)

       This verb causes the match to end successfully, skipping the  remainder
       of  the  pattern.  However,  when  it is inside a capture group that is
       called as a subroutine, only that group is ended successfully. Matching
       then continues at the outer level. If (*ACCEPT) is triggered in a posi-
       tive assertion, the assertion succeeds; in a  negative  assertion,  the
       assertion fails.

       If  (*ACCEPT)  is inside capturing parentheses, the data so far is cap-
       tured. For example:

         A((?:A|B(*ACCEPT)|C)D)

       This matches "AB", "AAD", or "ACD"; when it matches "AB", "B"  is  cap-
       tured by the outer parentheses.

       (*ACCEPT)  is  the only backtracking verb that is allowed to be quanti-
       fied because an ungreedy quantification with a  minimum  of  zero  acts
       only when a backtrack happens. Consider, for example,

         (A(*ACCEPT)??B)C

       where  A,  B, and C may be complex expressions. After matching "A", the
       matcher processes "BC"; if that fails, causing a  backtrack,  (*ACCEPT)
       is  triggered  and the match succeeds. In both cases, all but C is cap-
       tured. Whereas (*COMMIT) (see below) means "fail on backtrack",  a  re-
       peated (*ACCEPT) of this type means "succeed on backtrack".

       Warning:  (*ACCEPT)  should  not be used within a script run group, be-
       cause it causes an immediate exit from the group, bypassing the  script
       run checking.

         (*FAIL) or (*FAIL:NAME)

       This  verb causes a matching failure, forcing backtracking to occur. It
       may be abbreviated to (*F). It is equivalent  to  (?!)  but  easier  to
       read. The Perl documentation notes that it is probably useful only when
       combined with (?{}) or (??{}). Those are, of course, Perl features that
       are  not  present  in PCRE2. The nearest equivalent is the callout fea-
       ture, as for example in this pattern:

         a+(?C)(*FAIL)

       A match with the string "aaaa" always fails, but the callout  is  taken
       before each backtrack happens (in this example, 10 times).
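
       The figure of 10 can be checked with a small model of the back-
       tracking, written here in Python. It is a sketch rather than
       PCRE2 itself: the function name count_callouts is invented, and
       the model assumes an unanchored match in which no optimization
       skips a start position or a callout.

```python
def count_callouts(subject):
    """Count how often (?C) is reached when a+(?C)(*FAIL) is tried on subject."""
    n = len(subject)
    calls = 0
    for start in range(n + 1):           # "bumpalong": every start position
        run = 0
        while start + run < n and subject[start + run] == "a":
            run += 1                      # longest run of 'a' from this start
        # Greedy a+ first takes the whole run, then backtracks to run-1, ...,
        # down to 1 character; each of those attempts reaches (?C) once
        # before (*FAIL) forces the next backtrack.
        calls += run
    return calls
```

       For "aaaa" the four start positions that begin with "a" con-
       tribute 4, 3, 2, and 1 callouts, which gives the total of 10.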

       (*ACCEPT:NAME)  and  (*FAIL:NAME)  behave the same as (*MARK:NAME)(*AC-
       CEPT) and (*MARK:NAME)(*FAIL), respectively,  that  is,  a  (*MARK)  is
       recorded just before the verb acts.

   Recording which path was taken

       There  is  one  verb whose main purpose is to track how a match was ar-
       rived at, though it also has a secondary use in  conjunction  with  ad-
       vancing the match starting point (see (*SKIP) below).

         (*MARK:NAME) or (*:NAME)

       A  name is always required with this verb. For all the other backtrack-
       ing control verbs, a NAME argument is optional.

       When a match succeeds, the last-encountered mark name on  the  matching
       path  is passed back to the caller as described in the section entitled
       "Other information about the match" in the pcre2api documentation. This
       applies to all instances of (*MARK) and other  verbs,  including  those
       inside  assertions  and  atomic groups. However, there are differences
       in those cases when (*MARK) is used in conjunction with (*SKIP) as  de-
       scribed below.

       The  mark name that was last encountered on the matching path is passed
       back. A verb without a NAME argument is ignored for this purpose.  Here
       is  an  example of pcre2test output, where the "mark" modifier requests
       the retrieval and outputting of (*MARK) data:

           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
         data> XY
          0: XY
         MK: A
         data> XZ
          0: XZ
         MK: B

       The (*MARK) name is tagged with "MK:" in this output, and in this exam-
       ple it indicates which of the two alternatives matched. This is a  more
       efficient  way of obtaining this information than putting each alterna-
       tive in its own capturing parentheses.
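
       For comparison, the capturing-parentheses approach mentioned
       above looks like this when written with Python's standard re mod-
       ule, which has no (*MARK) verb. The group names A and B and the
       function name which_alternative are chosen only for this sketch.

```python
import re

# Each alternative gets its own (named) capture group, so the group that
# took part in the match identifies the alternative that was taken.
pattern = re.compile(r"(?P<A>XY)|(?P<B>XZ)")


def which_alternative(subject):
    m = pattern.fullmatch(subject)
    # lastgroup is the name of the last capture group that matched.
    return m.lastgroup if m else None
```

       This reports A for "XY" and B for "XZ", at the cost of one cap-
       ture group per alternative, which is the overhead that (*MARK)
       avoids.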

       If a verb with a name is encountered in a positive  assertion  that  is
       true,  the  name  is recorded and passed back if it is the last-encoun-
       tered. This does not happen for negative assertions or failing positive
       assertions.

       After a partial match or a failed match, the last encountered  name  in
       the entire match process is returned. For example:

           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
         data> XP
         No match, mark = B

       Note  that  in  this  unanchored  example the mark is retained from the
       match attempt that started at the letter "X" in the subject. Subsequent
       match attempts starting at "P" and then with an empty string do not get
       as far as the (*MARK) item, but nevertheless do not reset it.

       If you are interested in  (*MARK)  values  after  failed  matches,  you
       should  probably  set the PCRE2_NO_START_OPTIMIZE option (see above) to
       ensure that the match is always attempted.

   Verbs that act after backtracking

       The following verbs do nothing when they are encountered. Matching con-
       tinues with what follows, but if there is a subsequent  match  failure,
       causing  a  backtrack  to the verb, a failure is forced. That is, back-
       tracking cannot pass to the left of the  verb.  However,  when  one  of
       these verbs appears inside an atomic group or in a lookaround assertion
       that  is  true,  its effect is confined to that group, because once the
       group has been matched, there is never any backtracking into it.  Back-
       tracking from beyond an assertion or an atomic group ignores the entire
       group, and seeks a preceding backtracking point.

       These  verbs  differ  in exactly what kind of failure occurs when back-
       tracking reaches them. The behaviour described below  is  what  happens
       when  the  verb is not in a subroutine or an assertion. Subsequent sec-
       tions cover these special cases.

         (*COMMIT) or (*COMMIT:NAME)

       This verb causes the whole match to fail outright if there is  a  later
       matching failure that causes backtracking to reach it. Even if the pat-
       tern  is  unanchored,  no further attempts to find a match by advancing
       the starting point take place. If (*COMMIT) is  the  only  backtracking
       verb that is encountered, once it has been passed pcre2_match() is com-
       mitted to finding a match at the current starting point, or not at all.
       For example:

         a+(*COMMIT)b

       This  matches  "xxaab" but not "aacaab". It can be thought of as a kind
       of dynamic anchor, or "I've started, so I must finish."

       The behaviour of (*COMMIT:NAME) is not the same  as  (*MARK:NAME)(*COM-
       MIT).  It is like (*MARK:NAME) in that the name is remembered for pass-
       ing back to the caller. However, (*SKIP:NAME) searches only  for  names
       that are set with (*MARK), ignoring those set by any of the other back-
       tracking verbs.

       If  there  is more than one backtracking verb in a pattern, a different
       one that follows (*COMMIT) may be triggered first,  so  merely  passing
       (*COMMIT) during a match does not always guarantee that a match must be
       at this starting point.

       Note that (*COMMIT) at the start of a pattern is not the same as an an-
       chor,  unless  PCRE2's  start-of-match optimizations are turned off, as
       shown in this output from pcre2test:

           re> /(*COMMIT)abc/
         data> xyzabc
          0: abc
         data>
           re> /(*COMMIT)abc/no_start_optimize
         data> xyzabc
         No match

       For the first pattern, PCRE2 knows that any match must start with  "a",
       so  the optimization skips along the subject to "a" before applying the
       pattern to the first set of data. The match attempt then succeeds.  The
       second  pattern disables the optimization that skips along to the first
       character. The pattern is now applied  starting  at  "x",  and  so  the
       (*COMMIT)  causes  the  match to fail without trying any other starting
       points.
9758*22dc650dSSadaf Ebrahimi
9759*22dc650dSSadaf Ebrahimi         (*PRUNE) or (*PRUNE:NAME)
9760*22dc650dSSadaf Ebrahimi
9761*22dc650dSSadaf Ebrahimi       This verb causes the match to fail at the current starting position  in
9762*22dc650dSSadaf Ebrahimi       the subject if there is a later matching failure that causes backtrack-
9763*22dc650dSSadaf Ebrahimi       ing  to  reach it. If the pattern is unanchored, the normal "bumpalong"
9764*22dc650dSSadaf Ebrahimi       advance to the next starting character then happens.  Backtracking  can
9765*22dc650dSSadaf Ebrahimi       occur  as  usual to the left of (*PRUNE), before it is reached, or when
9766*22dc650dSSadaf Ebrahimi       matching to the right of (*PRUNE), but if there  is  no  match  to  the
9767*22dc650dSSadaf Ebrahimi       right,  backtracking cannot cross (*PRUNE). In simple cases, the use of
9768*22dc650dSSadaf Ebrahimi       (*PRUNE) is just an alternative to an atomic group or possessive  quan-
9769*22dc650dSSadaf Ebrahimi       tifier, but there are some uses of (*PRUNE) that cannot be expressed in
9770*22dc650dSSadaf Ebrahimi       any  other  way. In an anchored pattern (*PRUNE) has the same effect as
9771*22dc650dSSadaf Ebrahimi       (*COMMIT).
9772*22dc650dSSadaf Ebrahimi
9773*22dc650dSSadaf Ebrahimi       The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
9774*22dc650dSSadaf Ebrahimi       It is like (*MARK:NAME) in that the name is remembered for passing back
9775*22dc650dSSadaf Ebrahimi       to the caller. However, (*SKIP:NAME) searches only for names  set  with
9776*22dc650dSSadaf Ebrahimi       (*MARK), ignoring those set by other backtracking verbs.
9777*22dc650dSSadaf Ebrahimi
9778*22dc650dSSadaf Ebrahimi         (*SKIP)
9779*22dc650dSSadaf Ebrahimi
9780*22dc650dSSadaf Ebrahimi       This  verb, when given without a name, is like (*PRUNE), except that if
9781*22dc650dSSadaf Ebrahimi       the pattern is unanchored, the "bumpalong" advance is not to  the  next
9782*22dc650dSSadaf Ebrahimi       character, but to the position in the subject where (*SKIP) was encoun-
9783*22dc650dSSadaf Ebrahimi       tered.  (*SKIP)  signifies that whatever text was matched leading up to
9784*22dc650dSSadaf Ebrahimi       it cannot be part of a successful match if there is a  later  mismatch.
9785*22dc650dSSadaf Ebrahimi       Consider:
9786*22dc650dSSadaf Ebrahimi
9787*22dc650dSSadaf Ebrahimi         a+(*SKIP)b
9788*22dc650dSSadaf Ebrahimi
9789*22dc650dSSadaf Ebrahimi       If  the  subject  is  "aaaac...",  after  the first match attempt fails
9790*22dc650dSSadaf Ebrahimi       (starting at the first character in the  string),  the  starting  point
9791*22dc650dSSadaf Ebrahimi       skips on to start the next attempt at "c". Note that a possessive quan-
9792*22dc650dSSadaf Ebrahimi       tifier does not have the same effect as this example; although it would
9793*22dc650dSSadaf Ebrahimi       suppress  backtracking  during  the first match attempt, the second at-
9794*22dc650dSSadaf Ebrahimi       tempt would start at the second character instead  of  skipping  on  to
9795*22dc650dSSadaf Ebrahimi       "c".
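
       The difference in starting positions can be sketched with a small
       model. The Python function below is illustrative only: it hard-codes
       the two patterns a+b and a+(*SKIP)b instead of calling PCRE2, it
       ignores PCRE2's start-of-match optimizations, and it does not model
       the final attempt at the very end of the subject. The function name
       start_offsets_tried is hypothetical.

```python
from typing import List


def start_offsets_tried(subject: str, use_skip: bool) -> List[int]:
    """List the start offsets attempted for a+b or a+(*SKIP)b.

    Hand-written model of these two patterns only, not a regex engine.
    """
    tried: List[int] = []
    start = 0
    while start < len(subject):
        tried.append(start)
        # a+ greedily consumes the run of "a" characters at this offset.
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        matched_a = end > start
        if matched_a and subject.startswith("b", end):
            break  # "b" follows the run, so the whole pattern matches here
        if use_skip and matched_a:
            # (*SKIP) was passed at offset `end`; the later failure of "b"
            # backtracks onto it, so the next attempt starts there.
            start = end
        else:
            # Normal "bumpalong": advance one character. This also covers a
            # possessive a++b, which only suppresses backtracking.
            start += 1
    return tried


print(start_offsets_tried("aaaac", use_skip=False))  # [0, 1, 2, 3, 4]
print(start_offsets_tried("aaaac", use_skip=True))   # [0, 4]
```

       In this model the subject "aaaac" is tried at every offset without
       (*SKIP), but only at offsets 0 and 4 (the "c") with it.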
9796*22dc650dSSadaf Ebrahimi
9797*22dc650dSSadaf Ebrahimi       If  (*SKIP) is used to specify a new starting position that is the same
9798*22dc650dSSadaf Ebrahimi       as the starting position of the current match, or (by  being  inside  a
9799*22dc650dSSadaf Ebrahimi       lookbehind)  earlier, the position specified by (*SKIP) is ignored, and
9800*22dc650dSSadaf Ebrahimi       instead the normal "bumpalong" occurs.
9801*22dc650dSSadaf Ebrahimi
9802*22dc650dSSadaf Ebrahimi         (*SKIP:NAME)
9803*22dc650dSSadaf Ebrahimi
9804*22dc650dSSadaf Ebrahimi       When (*SKIP) has an associated name, its behaviour  is  modified.  When
9805*22dc650dSSadaf Ebrahimi       such  a  (*SKIP) is triggered, the previous path through the pattern is
9806*22dc650dSSadaf Ebrahimi       searched for the most recent (*MARK) that has the same name. If one  is
9807*22dc650dSSadaf Ebrahimi       found,  the  "bumpalong" advance is to the subject position that corre-
9808*22dc650dSSadaf Ebrahimi       sponds to that (*MARK) instead of to where (*SKIP) was encountered.  If
9809*22dc650dSSadaf Ebrahimi       no (*MARK) with a matching name is found, the (*SKIP) is ignored.
9810*22dc650dSSadaf Ebrahimi
9811*22dc650dSSadaf Ebrahimi       The  search  for a (*MARK) name uses the normal backtracking mechanism,
9812*22dc650dSSadaf Ebrahimi       which means that it does not  see  (*MARK)  settings  that  are  inside
9813*22dc650dSSadaf Ebrahimi       atomic groups or assertions, because they are never re-entered by back-
9814*22dc650dSSadaf Ebrahimi       tracking. Compare the following pcre2test examples:
9815*22dc650dSSadaf Ebrahimi
9816*22dc650dSSadaf Ebrahimi           re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
9817*22dc650dSSadaf Ebrahimi         data: abc
9818*22dc650dSSadaf Ebrahimi          0: a
9819*22dc650dSSadaf Ebrahimi          1: a
9820*22dc650dSSadaf Ebrahimi         data:
9821*22dc650dSSadaf Ebrahimi           re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
9822*22dc650dSSadaf Ebrahimi         data: abc
9823*22dc650dSSadaf Ebrahimi          0: b
9824*22dc650dSSadaf Ebrahimi          1: b
9825*22dc650dSSadaf Ebrahimi
9826*22dc650dSSadaf Ebrahimi       In  the first example, the (*MARK) setting is in an atomic group, so it
9827*22dc650dSSadaf Ebrahimi       is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored.
9828*22dc650dSSadaf Ebrahimi       This allows the second branch of the pattern to be tried at  the  first
9829*22dc650dSSadaf Ebrahimi       character  position.  In the second example, the (*MARK) setting is not
9830*22dc650dSSadaf Ebrahimi       in an atomic group. This allows (*SKIP:X) to find the (*MARK)  when  it
9831*22dc650dSSadaf Ebrahimi       backtracks, and this causes a new matching attempt to start at the sec-
9832*22dc650dSSadaf Ebrahimi       ond  character.  This  time, the (*MARK) is never seen because "a" does
9833*22dc650dSSadaf Ebrahimi       not match "b", so the matcher immediately jumps to the second branch of
9834*22dc650dSSadaf Ebrahimi       the pattern.
9835*22dc650dSSadaf Ebrahimi
9836*22dc650dSSadaf Ebrahimi       Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME).  It
9837*22dc650dSSadaf Ebrahimi       ignores names that are set by other backtracking verbs.
9838*22dc650dSSadaf Ebrahimi
9839*22dc650dSSadaf Ebrahimi         (*THEN) or (*THEN:NAME)
9840*22dc650dSSadaf Ebrahimi
9841*22dc650dSSadaf Ebrahimi       This  verb  causes  a skip to the next innermost alternative when back-
9842*22dc650dSSadaf Ebrahimi       tracking reaches it. That  is,  it  cancels  any  further  backtracking
9843*22dc650dSSadaf Ebrahimi       within  the  current  alternative.  Its name comes from the observation
9844*22dc650dSSadaf Ebrahimi       that it can be used for a pattern-based if-then-else block:
9845*22dc650dSSadaf Ebrahimi
9846*22dc650dSSadaf Ebrahimi         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
9847*22dc650dSSadaf Ebrahimi
9848*22dc650dSSadaf Ebrahimi       If the COND1 pattern matches, FOO is tried (and possibly further  items
9849*22dc650dSSadaf Ebrahimi       after  the  end  of the group if FOO succeeds); on failure, the matcher
9850*22dc650dSSadaf Ebrahimi       skips to the second alternative and tries COND2,  without  backtracking
9851*22dc650dSSadaf Ebrahimi       into  COND1.  If that succeeds and BAR fails, COND3 is tried. If subse-
9852*22dc650dSSadaf Ebrahimi       quently BAZ fails, there are no more alternatives, so there is a  back-
9853*22dc650dSSadaf Ebrahimi       track  to  whatever came before the entire group. If (*THEN) is not in-
9854*22dc650dSSadaf Ebrahimi       side an alternation, it acts like (*PRUNE).
9855*22dc650dSSadaf Ebrahimi
9856*22dc650dSSadaf Ebrahimi       The behaviour of (*THEN:NAME) is not the same  as  (*MARK:NAME)(*THEN).
9857*22dc650dSSadaf Ebrahimi       It is like (*MARK:NAME) in that the name is remembered for passing back
9858*22dc650dSSadaf Ebrahimi       to  the  caller. However, (*SKIP:NAME) searches only for names set with
9859*22dc650dSSadaf Ebrahimi       (*MARK), ignoring those set by other backtracking verbs.
9860*22dc650dSSadaf Ebrahimi
9861*22dc650dSSadaf Ebrahimi       A group that does not contain a | character is just a part of  the  en-
9862*22dc650dSSadaf Ebrahimi       closing  alternative;  it is not a nested alternation with only one al-
9863*22dc650dSSadaf Ebrahimi       ternative. The effect of (*THEN) extends beyond such a group to the en-
9864*22dc650dSSadaf Ebrahimi       closing alternative.  Consider this pattern, where A, B, etc. are  com-
9865*22dc650dSSadaf Ebrahimi       plex  pattern  fragments  that  do not contain any | characters at this
9866*22dc650dSSadaf Ebrahimi       level:
9867*22dc650dSSadaf Ebrahimi
9868*22dc650dSSadaf Ebrahimi         A (B(*THEN)C) | D
9869*22dc650dSSadaf Ebrahimi
9870*22dc650dSSadaf Ebrahimi       If A and B are matched, but there is a failure in C, matching does  not
9871*22dc650dSSadaf Ebrahimi       backtrack into A; instead it moves to the next alternative, that is, D.
9872*22dc650dSSadaf Ebrahimi       However,  if  the  group containing (*THEN) is given an alternative, it
9873*22dc650dSSadaf Ebrahimi       behaves differently:
9874*22dc650dSSadaf Ebrahimi
9875*22dc650dSSadaf Ebrahimi         A (B(*THEN)C | (*FAIL)) | D
9876*22dc650dSSadaf Ebrahimi
9877*22dc650dSSadaf Ebrahimi       The effect of (*THEN) is now confined to the inner group. After a fail-
9878*22dc650dSSadaf Ebrahimi       ure in C, matching moves to (*FAIL), which causes the  whole  group  to
9879*22dc650dSSadaf Ebrahimi       fail  because  there  are  no  more  alternatives to try. In this case,
9880*22dc650dSSadaf Ebrahimi       matching does backtrack into A.
9881*22dc650dSSadaf Ebrahimi
9882*22dc650dSSadaf Ebrahimi       Note that a conditional group is not considered as having two  alterna-
9883*22dc650dSSadaf Ebrahimi       tives,  because  only one is ever used. In other words, the | character
9884*22dc650dSSadaf Ebrahimi       in a conditional group has a different meaning. Ignoring  white  space,
9885*22dc650dSSadaf Ebrahimi       consider:
9886*22dc650dSSadaf Ebrahimi
9887*22dc650dSSadaf Ebrahimi         ^.*? (?(?=a) a | b(*THEN)c )
9888*22dc650dSSadaf Ebrahimi
       If the subject is "ba", this pattern does not match. Because .*? is
       ungreedy, it initially matches zero characters. The condition (?=a)
       then fails, the character "b" is matched, but "c" is not. At this
       point, matching does not backtrack to .*? as might perhaps be expected
       from the presence of the | character. The conditional group is part of
       the single alternative that comprises the whole pattern, and so the
       match fails. (If there were a backtrack into .*?, allowing it to match
       "b", the match would succeed.)
9897*22dc650dSSadaf Ebrahimi
9898*22dc650dSSadaf Ebrahimi       The  verbs just described provide four different "strengths" of control
9899*22dc650dSSadaf Ebrahimi       when subsequent matching fails. (*THEN) is the weakest, carrying on the
9900*22dc650dSSadaf Ebrahimi       match at the next alternative. (*PRUNE) comes next, failing  the  match
9901*22dc650dSSadaf Ebrahimi       at  the  current starting position, but allowing an advance to the next
9902*22dc650dSSadaf Ebrahimi       character (for an unanchored pattern). (*SKIP) is similar, except  that
9903*22dc650dSSadaf Ebrahimi       the advance may be more than one character. (*COMMIT) is the strongest,
9904*22dc650dSSadaf Ebrahimi       causing the entire match to fail.
9905*22dc650dSSadaf Ebrahimi
9906*22dc650dSSadaf Ebrahimi   More than one backtracking verb
9907*22dc650dSSadaf Ebrahimi
9908*22dc650dSSadaf Ebrahimi       If  more  than  one  backtracking verb is present in a pattern, the one
9909*22dc650dSSadaf Ebrahimi       that is backtracked onto first acts. For example,  consider  this  pat-
9910*22dc650dSSadaf Ebrahimi       tern, where A, B, etc. are complex pattern fragments:
9911*22dc650dSSadaf Ebrahimi
9912*22dc650dSSadaf Ebrahimi         (A(*COMMIT)B(*THEN)C|ABD)
9913*22dc650dSSadaf Ebrahimi
       If A matches but B fails, the backtrack to (*COMMIT) causes the entire
       match to fail. However, if A and B match, but C fails, the backtrack to
       (*THEN) causes the next alternative (ABD) to be tried. This behaviour
       is consistent, but is not always the same as Perl's. It means that if
       two or more backtracking verbs appear in succession, all but the last
       of them have no effect. Consider this example:
9920*22dc650dSSadaf Ebrahimi
9921*22dc650dSSadaf Ebrahimi         ...(*COMMIT)(*PRUNE)...
9922*22dc650dSSadaf Ebrahimi
9923*22dc650dSSadaf Ebrahimi       If there is a matching failure to the right, backtracking onto (*PRUNE)
9924*22dc650dSSadaf Ebrahimi       causes  it to be triggered, and its action is taken. There can never be
9925*22dc650dSSadaf Ebrahimi       a backtrack onto (*COMMIT).
9926*22dc650dSSadaf Ebrahimi
9927*22dc650dSSadaf Ebrahimi   Backtracking verbs in repeated groups
9928*22dc650dSSadaf Ebrahimi
9929*22dc650dSSadaf Ebrahimi       PCRE2 sometimes differs from Perl in its handling of backtracking verbs
9930*22dc650dSSadaf Ebrahimi       in repeated groups. For example, consider:
9931*22dc650dSSadaf Ebrahimi
9932*22dc650dSSadaf Ebrahimi         /(a(*COMMIT)b)+ac/
9933*22dc650dSSadaf Ebrahimi
9934*22dc650dSSadaf Ebrahimi       If the subject is "abac", Perl matches  unless  its  optimizations  are
9935*22dc650dSSadaf Ebrahimi       disabled,  but  PCRE2  always fails because the (*COMMIT) in the second
9936*22dc650dSSadaf Ebrahimi       repeat of the group acts.
9937*22dc650dSSadaf Ebrahimi
9938*22dc650dSSadaf Ebrahimi   Backtracking verbs in assertions
9939*22dc650dSSadaf Ebrahimi
       (*FAIL) in any assertion has its normal effect: it forces an immediate
       backtrack. The behaviour of the other backtracking verbs depends on
       whether the assertion is standalone or is acting as the condition in a
       conditional group.
9944*22dc650dSSadaf Ebrahimi
       (*ACCEPT) in a standalone positive assertion causes the assertion to
       succeed without any further processing; captured substrings and a mark
       name (if set) are retained. In a standalone negative assertion,
       (*ACCEPT) causes the assertion to fail without any further processing;
       captured substrings and any mark name are discarded.
9950*22dc650dSSadaf Ebrahimi
9951*22dc650dSSadaf Ebrahimi       If the assertion is a condition, (*ACCEPT) causes the condition  to  be
9952*22dc650dSSadaf Ebrahimi       true  for  a  positive assertion and false for a negative one; captured
9953*22dc650dSSadaf Ebrahimi       substrings are retained in both cases.
9954*22dc650dSSadaf Ebrahimi
9955*22dc650dSSadaf Ebrahimi       The remaining verbs act only when a later failure causes a backtrack to
9956*22dc650dSSadaf Ebrahimi       reach them. This means that, for the Perl-compatible assertions,  their
9957*22dc650dSSadaf Ebrahimi       effect is confined to the assertion, because Perl lookaround assertions
9958*22dc650dSSadaf Ebrahimi       are atomic. A backtrack that occurs after such an assertion is complete
9959*22dc650dSSadaf Ebrahimi       does  not  jump  back  into  the  assertion.  Note in particular that a
9960*22dc650dSSadaf Ebrahimi       (*MARK) name that is set in an assertion is not "seen" by  an  instance
9961*22dc650dSSadaf Ebrahimi       of (*SKIP:NAME) later in the pattern.
9962*22dc650dSSadaf Ebrahimi
9963*22dc650dSSadaf Ebrahimi       PCRE2  now supports non-atomic positive assertions, as described in the
9964*22dc650dSSadaf Ebrahimi       section entitled "Non-atomic assertions" above. These  assertions  must
9965*22dc650dSSadaf Ebrahimi       be  standalone  (not used as conditions). They are not Perl-compatible.
9966*22dc650dSSadaf Ebrahimi       For these assertions, a later backtrack does jump back into the  asser-
9967*22dc650dSSadaf Ebrahimi       tion,  and  therefore verbs such as (*COMMIT) can be triggered by back-
9968*22dc650dSSadaf Ebrahimi       tracks from later in the pattern.
9969*22dc650dSSadaf Ebrahimi
9970*22dc650dSSadaf Ebrahimi       The effect of (*THEN) is not allowed to escape beyond an assertion.  If
9971*22dc650dSSadaf Ebrahimi       there  are no more branches to try, (*THEN) causes a positive assertion
9972*22dc650dSSadaf Ebrahimi       to be false, and a negative assertion to be true.
9973*22dc650dSSadaf Ebrahimi
9974*22dc650dSSadaf Ebrahimi       The other backtracking verbs are not treated specially if  they  appear
9975*22dc650dSSadaf Ebrahimi       in  a  standalone  positive assertion. In a conditional positive asser-
9976*22dc650dSSadaf Ebrahimi       tion, backtracking (from within the assertion) into (*COMMIT), (*SKIP),
9977*22dc650dSSadaf Ebrahimi       or (*PRUNE) causes the condition to be false. However, for both  stand-
9978*22dc650dSSadaf Ebrahimi       alone and conditional negative assertions, backtracking into (*COMMIT),
9979*22dc650dSSadaf Ebrahimi       (*SKIP), or (*PRUNE) causes the assertion to be true, without consider-
9980*22dc650dSSadaf Ebrahimi       ing any further alternative branches.
9981*22dc650dSSadaf Ebrahimi
9982*22dc650dSSadaf Ebrahimi   Backtracking verbs in subroutines
9983*22dc650dSSadaf Ebrahimi
       The behaviours described in this section apply whether or not the
       group is called recursively.
9985*22dc650dSSadaf Ebrahimi
9986*22dc650dSSadaf Ebrahimi       (*ACCEPT) in a group called as a subroutine causes the subroutine match
9987*22dc650dSSadaf Ebrahimi       to  succeed without any further processing. Matching then continues af-
9988*22dc650dSSadaf Ebrahimi       ter the subroutine call. Perl documents this behaviour.  Perl's  treat-
9989*22dc650dSSadaf Ebrahimi       ment of the other verbs in subroutines is different in some cases.
9990*22dc650dSSadaf Ebrahimi
9991*22dc650dSSadaf Ebrahimi       (*FAIL)  in  a  group  called as a subroutine has its normal effect: it
9992*22dc650dSSadaf Ebrahimi       forces an immediate backtrack.
9993*22dc650dSSadaf Ebrahimi
9994*22dc650dSSadaf Ebrahimi       (*COMMIT), (*SKIP), and (*PRUNE) cause the  subroutine  match  to  fail
9995*22dc650dSSadaf Ebrahimi       when  triggered  by being backtracked to in a group called as a subrou-
9996*22dc650dSSadaf Ebrahimi       tine. There is then a backtrack at the outer level.
9997*22dc650dSSadaf Ebrahimi
9998*22dc650dSSadaf Ebrahimi       (*THEN), when triggered, skips to the next alternative in the innermost
9999*22dc650dSSadaf Ebrahimi       enclosing group that has alternatives (its normal behaviour).  However,
10000*22dc650dSSadaf Ebrahimi       if there is no such group within the subroutine's group, the subroutine
10001*22dc650dSSadaf Ebrahimi       match fails and there is a backtrack at the outer level.
10002*22dc650dSSadaf Ebrahimi
10003*22dc650dSSadaf Ebrahimi
10004*22dc650dSSadaf EbrahimiSEE ALSO
10005*22dc650dSSadaf Ebrahimi
10006*22dc650dSSadaf Ebrahimi       pcre2api(3),    pcre2callout(3),    pcre2matching(3),   pcre2syntax(3),
10007*22dc650dSSadaf Ebrahimi       pcre2(3).
10008*22dc650dSSadaf Ebrahimi
10009*22dc650dSSadaf Ebrahimi
10010*22dc650dSSadaf EbrahimiAUTHOR
10011*22dc650dSSadaf Ebrahimi
10012*22dc650dSSadaf Ebrahimi       Philip Hazel
10013*22dc650dSSadaf Ebrahimi       Retired from University Computing Service
10014*22dc650dSSadaf Ebrahimi       Cambridge, England.
10015*22dc650dSSadaf Ebrahimi
10016*22dc650dSSadaf Ebrahimi
10017*22dc650dSSadaf EbrahimiREVISION
10018*22dc650dSSadaf Ebrahimi
10019*22dc650dSSadaf Ebrahimi       Last updated: 04 June 2024
10020*22dc650dSSadaf Ebrahimi       Copyright (c) 1997-2024 University of Cambridge.
10021*22dc650dSSadaf Ebrahimi
10022*22dc650dSSadaf Ebrahimi
10023*22dc650dSSadaf EbrahimiPCRE2 10.44                      04 June 2024                  PCRE2PATTERN(3)
10024*22dc650dSSadaf Ebrahimi------------------------------------------------------------------------------
10025*22dc650dSSadaf Ebrahimi
10026*22dc650dSSadaf Ebrahimi
10027*22dc650dSSadaf Ebrahimi
10028*22dc650dSSadaf EbrahimiPCRE2PERFORM(3)            Library Functions Manual            PCRE2PERFORM(3)
10029*22dc650dSSadaf Ebrahimi
10030*22dc650dSSadaf Ebrahimi
10031*22dc650dSSadaf EbrahimiNAME
10032*22dc650dSSadaf Ebrahimi       PCRE2 - Perl-compatible regular expressions (revised API)
10033*22dc650dSSadaf Ebrahimi
10034*22dc650dSSadaf Ebrahimi
10035*22dc650dSSadaf EbrahimiPCRE2 PERFORMANCE
10036*22dc650dSSadaf Ebrahimi
10037*22dc650dSSadaf Ebrahimi       Two  aspects  of performance are discussed below: memory usage and pro-
10038*22dc650dSSadaf Ebrahimi       cessing time. The way you express your pattern as a regular  expression
10039*22dc650dSSadaf Ebrahimi       can affect both of them.
10040*22dc650dSSadaf Ebrahimi
10041*22dc650dSSadaf Ebrahimi
10042*22dc650dSSadaf EbrahimiCOMPILED PATTERN MEMORY USAGE
10043*22dc650dSSadaf Ebrahimi
10044*22dc650dSSadaf Ebrahimi       Patterns are compiled by PCRE2 into a reasonably efficient interpretive
10045*22dc650dSSadaf Ebrahimi       code,  so  that most simple patterns do not use much memory for storing
10046*22dc650dSSadaf Ebrahimi       the compiled version. However, there is one case where the memory usage
10047*22dc650dSSadaf Ebrahimi       of a compiled pattern can be unexpectedly  large.  If  a  parenthesized
10048*22dc650dSSadaf Ebrahimi       group  has  a quantifier with a minimum greater than 1 and/or a limited
10049*22dc650dSSadaf Ebrahimi       maximum, the whole group is repeated in the compiled code. For example,
10050*22dc650dSSadaf Ebrahimi       the pattern
10051*22dc650dSSadaf Ebrahimi
10052*22dc650dSSadaf Ebrahimi         (abc|def){2,4}
10053*22dc650dSSadaf Ebrahimi
10054*22dc650dSSadaf Ebrahimi       is compiled as if it were
10055*22dc650dSSadaf Ebrahimi
10056*22dc650dSSadaf Ebrahimi         (abc|def)(abc|def)((abc|def)(abc|def)?)?
10057*22dc650dSSadaf Ebrahimi
10058*22dc650dSSadaf Ebrahimi       (Technical aside: It is done this way so that backtrack  points  within
10059*22dc650dSSadaf Ebrahimi       each of the repetitions can be independently maintained.)
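
       The unrolling can be imitated at the text level. The Python sketch
       below is an illustration under stated assumptions, not PCRE2's
       compiler: it produces the "as if it were" pattern string for a
       parenthesized group with a {min,max} quantifier, so that the growth in
       size is easy to see. The function name expand_bounded_group is
       hypothetical, and the input is assumed to be a complete parenthesized
       group.

```python
def expand_bounded_group(group: str, minimum: int, maximum: int) -> str:
    """Return the unrolled textual form of group{minimum,maximum}.

    Models only the textual expansion described above; the real compiled
    code is byte code, not a pattern string.
    """
    if not 0 <= minimum <= maximum:
        raise ValueError("need 0 <= minimum <= maximum")
    # Mandatory copies: the group must match at least `minimum` times.
    required = group * minimum
    # Optional copies nest from the inside out: G?, then (GG?)?, and so on.
    optional = ""
    for _ in range(maximum - minimum):
        optional = f"{group}?" if not optional else f"({group}{optional})?"
    return required + optional


print(expand_bounded_group("(abc|def)", 2, 4))
# (abc|def)(abc|def)((abc|def)(abc|def)?)?

# The number of copies of the group grows linearly with the maximum.
print(expand_bounded_group("(ab)", 1, 1000).count("(ab)"))  # 1000
```

       Because every copy is a separate piece of compiled code, nesting one
       such quantifier inside another multiplies the number of copies.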
10060*22dc650dSSadaf Ebrahimi
10061*22dc650dSSadaf Ebrahimi       For  regular expressions whose quantifiers use only small numbers, this
10062*22dc650dSSadaf Ebrahimi       is not usually a problem. However, if the numbers are large,  and  par-
10063*22dc650dSSadaf Ebrahimi       ticularly  if  such repetitions are nested, the memory usage can become
10064*22dc650dSSadaf Ebrahimi       an embarrassment. For example, the very simple pattern
10065*22dc650dSSadaf Ebrahimi
10066*22dc650dSSadaf Ebrahimi         ((ab){1,1000}c){1,3}
10067*22dc650dSSadaf Ebrahimi
10068*22dc650dSSadaf Ebrahimi       uses over 50KiB when compiled using the 8-bit library.  When  PCRE2  is
10069*22dc650dSSadaf Ebrahimi       compiled  with its default internal pointer size of two bytes, the size
10070*22dc650dSSadaf Ebrahimi       limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
10071*22dc650dSSadaf Ebrahimi       libraries, and this is reached with the above pattern if the outer rep-
10072*22dc650dSSadaf Ebrahimi       etition is increased from 3 to 4. PCRE2 can be compiled to  use  larger
10073*22dc650dSSadaf Ebrahimi       internal  pointers  and thus handle larger compiled patterns, but it is
10074*22dc650dSSadaf Ebrahimi       better to try to rewrite your pattern to use less memory if you can.
10075*22dc650dSSadaf Ebrahimi
10076*22dc650dSSadaf Ebrahimi       One way of reducing the memory usage for such patterns is to  make  use
10077*22dc650dSSadaf Ebrahimi       of PCRE2's "subroutine" facility. Re-writing the above pattern as
10078*22dc650dSSadaf Ebrahimi
10079*22dc650dSSadaf Ebrahimi         ((ab)(?2){0,999}c)(?1){0,2}
10080*22dc650dSSadaf Ebrahimi
       reduces the memory requirements to around 16KiB, and indeed it remains
       under 20KiB even with the outer repetition increased to 100. However,
       this kind of pattern is not always exactly equivalent, because any
       captures within subroutine calls are lost when the subroutine
       completes. If this is not a problem, this kind of rewriting will allow
       you to process patterns that PCRE2 cannot otherwise handle. The
       matching performance of the two different versions of the pattern is
       roughly the same. (This applies from release 10.30; things were
       different in earlier releases.)
10090*22dc650dSSadaf Ebrahimi
10091*22dc650dSSadaf Ebrahimi
10092*22dc650dSSadaf EbrahimiSTACK AND HEAP USAGE AT RUN TIME
10093*22dc650dSSadaf Ebrahimi
10094*22dc650dSSadaf Ebrahimi       From release 10.30, the interpretive (non-JIT) version of pcre2_match()
10095*22dc650dSSadaf Ebrahimi       uses very little system stack at run time. In earlier  releases  recur-
10096*22dc650dSSadaf Ebrahimi       sive  function  calls  could  use a great deal of stack, and this could
10097*22dc650dSSadaf Ebrahimi       cause problems, but this usage has been eliminated. Backtracking  posi-
10098*22dc650dSSadaf Ebrahimi       tions  are now explicitly remembered in memory frames controlled by the
10099*22dc650dSSadaf Ebrahimi       code.
10100*22dc650dSSadaf Ebrahimi
10101*22dc650dSSadaf Ebrahimi       The size of each frame depends on the size of pointer variables and the
10102*22dc650dSSadaf Ebrahimi       number of capturing parenthesized groups in the pattern being  matched.
10103*22dc650dSSadaf Ebrahimi       On a 64-bit system the frame size for a pattern with no captures is 128
10104*22dc650dSSadaf Ebrahimi       bytes. For each capturing group the size increases by 16 bytes.
10105*22dc650dSSadaf Ebrahimi
       Until release 10.41, an initial 20KiB frames vector was allocated on
       the system stack, but this still caused some issues for multithreaded
       applications in which each thread has a very small stack. From release
       10.41, backtracking memory frames are always held in heap memory. An
       initial heap allocation is obtained the first time any match data
       block is passed to pcre2_match(). This is remembered with the match
       data block and re-used if that block is used for another match. It is
       freed when the match data block itself is freed.
10114*22dc650dSSadaf Ebrahimi
10115*22dc650dSSadaf Ebrahimi       The  size  of the initial block is the larger of 20KiB or ten times the
10116*22dc650dSSadaf Ebrahimi       pattern's frame size, unless the heap limit is less than this, in which
10117*22dc650dSSadaf Ebrahimi       case the heap limit is used. If the initial  block  proves  to  be  too
10118*22dc650dSSadaf Ebrahimi       small during matching, it is replaced by a larger block, subject to the
10119*22dc650dSSadaf Ebrahimi       heap  limit.  The  heap limit is checked only when a new block is to be
10120*22dc650dSSadaf Ebrahimi       allocated. Reducing the heap limit between calls to pcre2_match()  with
10121*22dc650dSSadaf Ebrahimi       the same match data block does not affect the saved block.
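
       The sizing rules above amount to a little arithmetic. The following
       Python sketch simply restates them, using the 64-bit figures quoted
       earlier (128 bytes plus 16 bytes per capturing group). It is not taken
       from the PCRE2 source; the function names are hypothetical, and the
       heap limit is passed in bytes purely to keep the model simple.

```python
from typing import Optional

KIB = 1024


def frame_size(capture_groups: int) -> int:
    """Frame size in bytes on a 64-bit system, per the figures above."""
    return 128 + 16 * capture_groups


def initial_block_size(capture_groups: int,
                       heap_limit_bytes: Optional[int] = None) -> int:
    """Initial heap block: larger of 20KiB or ten frames, capped by the limit."""
    size = max(20 * KIB, 10 * frame_size(capture_groups))
    if heap_limit_bytes is not None:
        # A heap limit below the computed size is used instead.
        size = min(size, heap_limit_bytes)
    return size


print(frame_size(3))              # 176
print(initial_block_size(3))      # 20480: ten frames are only 1760 bytes
print(initial_block_size(200))    # 33280: ten 3328-byte frames exceed 20KiB
print(initial_block_size(3, heap_limit_bytes=4 * KIB))  # 4096
```

       In this model the 20KiB floor dominates unless the pattern has well
       over a hundred capturing groups.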
10122*22dc650dSSadaf Ebrahimi
10123*22dc650dSSadaf Ebrahimi       In  contrast  to  pcre2_match(),  pcre2_dfa_match()  does use recursive
10124*22dc650dSSadaf Ebrahimi       function calls, but only for processing atomic groups,  lookaround  as-
10125*22dc650dSSadaf Ebrahimi       sertions, and recursion within the pattern. The original version of the
10126*22dc650dSSadaf Ebrahimi       code  used  to  allocate  quite large internal workspace vectors on the
10127*22dc650dSSadaf Ebrahimi       stack, which caused some problems for  some  patterns  in  environments
10128*22dc650dSSadaf Ebrahimi       with  small  stacks.  From release 10.32 the code for pcre2_dfa_match()
10129*22dc650dSSadaf Ebrahimi       has been re-factored to use heap memory  when  necessary  for  internal
10130*22dc650dSSadaf Ebrahimi       workspace  when  recursing,  though  recursive function calls are still
10131*22dc650dSSadaf Ebrahimi       used.
10132*22dc650dSSadaf Ebrahimi
10133*22dc650dSSadaf Ebrahimi       The "match depth" parameter can be used to limit the depth of  function
10134*22dc650dSSadaf Ebrahimi       recursion,  and  the  "match  heap"  parameter  to limit heap memory in
10135*22dc650dSSadaf Ebrahimi       pcre2_dfa_match().
10136*22dc650dSSadaf Ebrahimi
10137*22dc650dSSadaf Ebrahimi
10138*22dc650dSSadaf EbrahimiPROCESSING TIME
10139*22dc650dSSadaf Ebrahimi
       Certain items in regular expression patterns are processed more
       efficiently than others. It is more efficient to use a character class
       like [aeiou] than a set of single-character alternatives such as
       (a|e|i|o|u). In general, the simplest construction that provides the
       required behaviour is usually the most efficient. Jeffrey Friedl's
       book "Mastering Regular Expressions" contains a lot of useful general
       discussion about optimizing regular expressions for efficient
       performance. This document contains a few observations about PCRE2.
10148*22dc650dSSadaf Ebrahimi
10149*22dc650dSSadaf Ebrahimi       Using Unicode character properties (the \p,  \P,  and  \X  escapes)  is
10150*22dc650dSSadaf Ebrahimi       slow,  because  PCRE2 has to use a multi-stage table lookup whenever it
10151*22dc650dSSadaf Ebrahimi       needs a character's property. If you can find  an  alternative  pattern
10152*22dc650dSSadaf Ebrahimi       that does not use character properties, it will probably be faster.
10153*22dc650dSSadaf Ebrahimi
10154*22dc650dSSadaf Ebrahimi       By  default,  the  escape  sequences  \b, \d, \s, and \w, and the POSIX
10155*22dc650dSSadaf Ebrahimi       character classes such as [:alpha:]  do  not  use  Unicode  properties,
10156*22dc650dSSadaf Ebrahimi       partly for backwards compatibility, and partly for performance reasons.
10157*22dc650dSSadaf Ebrahimi       However,  you  can  set  the PCRE2_UCP option or start the pattern with
10158*22dc650dSSadaf Ebrahimi       (*UCP) if you want Unicode character properties to be  used.  This  can
10159*22dc650dSSadaf Ebrahimi       double  the  matching  time  for  items  such  as \d, when matched with
10160*22dc650dSSadaf Ebrahimi       pcre2_match(); the performance loss is less with a DFA  matching  func-
10161*22dc650dSSadaf Ebrahimi       tion, and in both cases there is not much difference for \b.
10162*22dc650dSSadaf Ebrahimi
10163*22dc650dSSadaf Ebrahimi       When  a pattern begins with .* not in atomic parentheses, nor in paren-
10164*22dc650dSSadaf Ebrahimi       theses that are the subject of a backreference,  and  the  PCRE2_DOTALL
10165*22dc650dSSadaf Ebrahimi       option  is  set,  the pattern is implicitly anchored by PCRE2, since it
10166*22dc650dSSadaf Ebrahimi       can match only at the start of a subject string.  If  the  pattern  has
10167*22dc650dSSadaf Ebrahimi       multiple top-level branches, they must all be anchorable. The optimiza-
10168*22dc650dSSadaf Ebrahimi       tion  can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is au-
10169*22dc650dSSadaf Ebrahimi       tomatically disabled if the pattern contains (*PRUNE) or (*SKIP).
10170*22dc650dSSadaf Ebrahimi
10171*22dc650dSSadaf Ebrahimi       If PCRE2_DOTALL is not set, PCRE2 cannot make  this  optimization,  be-
10172*22dc650dSSadaf Ebrahimi       cause  the  dot metacharacter does not then match a newline, and if the
10173*22dc650dSSadaf Ebrahimi       subject string contains newlines, the pattern may match from the  char-
10174*22dc650dSSadaf Ebrahimi       acter immediately following one of them instead of from the very start.
10175*22dc650dSSadaf Ebrahimi       For example, the pattern
10176*22dc650dSSadaf Ebrahimi
10177*22dc650dSSadaf Ebrahimi         .*second
10178*22dc650dSSadaf Ebrahimi
10179*22dc650dSSadaf Ebrahimi       matches  the subject "first\nand second" (where \n stands for a newline
10180*22dc650dSSadaf Ebrahimi       character), with the match starting at the seventh character. In  order
10181*22dc650dSSadaf Ebrahimi       to  do  this, PCRE2 has to retry the match starting after every newline
10182*22dc650dSSadaf Ebrahimi       in the subject.
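
       The effect is easy to observe in any engine with the same dot
       semantics. The sketch below uses Python's standard re module as a
       stand-in for PCRE2 (an assumption made only so that the example is
       self-contained); re.DOTALL plays the role of PCRE2_DOTALL, and the
       helper name match_start is hypothetical.

```python
import re

SUBJECT = "first\nand second"


def match_start(pattern: str, flags: int = 0) -> int:
    """Return the 0-based offset at which the pattern first matches SUBJECT."""
    found = re.search(pattern, SUBJECT, flags)
    if found is None:
        raise ValueError("pattern did not match")
    return found.start()


# Without DOTALL the dot cannot cross the newline, so every start offset up
# to the newline fails and the match begins just after it.
print(match_start(r".*second"))             # 6, i.e. the seventh character

# With DOTALL, .* runs across the newline and the match begins at offset 0.
print(match_start(r".*second", re.DOTALL))  # 0
```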
10183*22dc650dSSadaf Ebrahimi
10184*22dc650dSSadaf Ebrahimi       If you are using such a pattern with subject strings that do  not  con-
10185*22dc650dSSadaf Ebrahimi       tain   newlines,   the   best   performance   is  obtained  by  setting
10186*22dc650dSSadaf Ebrahimi       PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate  ex-
10187*22dc650dSSadaf Ebrahimi       plicit  anchoring.  That saves PCRE2 from having to scan along the sub-
10188*22dc650dSSadaf Ebrahimi       ject looking for a newline to restart at.
10189*22dc650dSSadaf Ebrahimi
10190*22dc650dSSadaf Ebrahimi       Beware of patterns that contain nested indefinite  repeats.  These  can
10191*22dc650dSSadaf Ebrahimi       take  a  long time to run when applied to a string that does not match.
10192*22dc650dSSadaf Ebrahimi       Consider the pattern fragment
10193*22dc650dSSadaf Ebrahimi
10194*22dc650dSSadaf Ebrahimi         ^(a+)*
10195*22dc650dSSadaf Ebrahimi
10196*22dc650dSSadaf Ebrahimi       This can match "aaaa" in 16 different ways, and this  number  increases
10197*22dc650dSSadaf Ebrahimi       very  rapidly  as the string gets longer. (The * repeat can match 0, 1,
10198*22dc650dSSadaf Ebrahimi       2, 3, or 4 times, and for each of those cases other than 0 or 4, the  +
10199*22dc650dSSadaf Ebrahimi       repeats  can  match  different numbers of times.) When the remainder of
10200*22dc650dSSadaf Ebrahimi       the pattern is such that the entire match is going to fail,  PCRE2  has
10201*22dc650dSSadaf Ebrahimi       in  principle to try every possible variation, and this can take an ex-
10202*22dc650dSSadaf Ebrahimi       tremely long time, even for relatively short strings.
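
       The growth in the number of alternatives can be checked with a short
       calculation. The following sketch does not call PCRE2; it only counts
       the ways in which ^(a+)* could match a prefix of a string of n "a"
       characters. The function name count_match_ways() is hypothetical.

```c
/* Count the ways (a+)* can match a prefix of a string of n 'a' characters.
   The group may run zero times (one way), or it may consume k characters
   (1 <= k <= n) split into one or more non-empty runs, which can be done
   in 2^(k-1) ways. Valid for n < 64; larger values overflow. */
unsigned long long count_match_ways(unsigned n)
{
    unsigned long long total = 1;    /* zero iterations of the group */
    unsigned long long splits = 1;   /* ways to split k = 1 characters */
    unsigned k;

    for (k = 1; k <= n; k++) {
        total += splits;
        splits *= 2;                 /* one more character doubles the splits */
    }
    return total;
}
```

       count_match_ways(4) gives 16, in agreement with the figure above, and
       the result doubles with each additional character.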
10203*22dc650dSSadaf Ebrahimi
10204*22dc650dSSadaf Ebrahimi       An optimization catches some of the simpler cases, such as
10205*22dc650dSSadaf Ebrahimi
10206*22dc650dSSadaf Ebrahimi         (a+)*b
10207*22dc650dSSadaf Ebrahimi
10208*22dc650dSSadaf Ebrahimi       where a literal character follows. Before  embarking  on  the  standard
10209*22dc650dSSadaf Ebrahimi       matching  procedure, PCRE2 checks that there is a "b" later in the sub-
10210*22dc650dSSadaf Ebrahimi       ject string, and if there is not, it fails the match immediately.  How-
10211*22dc650dSSadaf Ebrahimi       ever,  when  there  is no following literal this optimization cannot be
10212*22dc650dSSadaf Ebrahimi       used. You can see the difference by comparing the behaviour of
10213*22dc650dSSadaf Ebrahimi
10214*22dc650dSSadaf Ebrahimi         (a+)*\d
10215*22dc650dSSadaf Ebrahimi
10216*22dc650dSSadaf Ebrahimi       with the pattern above. The pattern ending in "b" fails almost instantly
10217*22dc650dSSadaf Ebrahimi       when applied to a whole line of "a" characters, whereas the \d version
10218*22dc650dSSadaf Ebrahimi       takes an appreciable time with strings longer than about 20 characters.
10219*22dc650dSSadaf Ebrahimi
10220*22dc650dSSadaf Ebrahimi       In many cases, the solution to this kind of performance issue is to use
10221*22dc650dSSadaf Ebrahimi       an atomic group or a possessive quantifier. This can often reduce  mem-
10222*22dc650dSSadaf Ebrahimi       ory requirements as well. As another example, consider this pattern:
10223*22dc650dSSadaf Ebrahimi
10224*22dc650dSSadaf Ebrahimi         ([^<]|<(?!inet))+
10225*22dc650dSSadaf Ebrahimi
10226*22dc650dSSadaf Ebrahimi       It  matches  from wherever it starts until it encounters "<inet" or the
10227*22dc650dSSadaf Ebrahimi       end of the data, and is the kind of pattern that  might  be  used  when
10228*22dc650dSSadaf Ebrahimi       processing an XML file. Each iteration of the outer parentheses matches
10229*22dc650dSSadaf Ebrahimi       either  one  character that is not "<" or a "<" that is not followed by
10230*22dc650dSSadaf Ebrahimi       "inet". However, each time a parenthesis is processed,  a  backtracking
10231*22dc650dSSadaf Ebrahimi       position  is  passed,  so this formulation uses a memory frame for each
10232*22dc650dSSadaf Ebrahimi       matched character. For a long string, a lot of memory is required. Con-
10233*22dc650dSSadaf Ebrahimi       sider now this  rewritten  pattern,  which  matches  exactly  the  same
10234*22dc650dSSadaf Ebrahimi       strings:
10235*22dc650dSSadaf Ebrahimi
10236*22dc650dSSadaf Ebrahimi         ([^<]++|<(?!inet))+
10237*22dc650dSSadaf Ebrahimi
10238*22dc650dSSadaf Ebrahimi       This runs much faster, because sequences of characters that do not con-
10239*22dc650dSSadaf Ebrahimi       tain "<" are "swallowed" in one item inside the parentheses, and a pos-
10240*22dc650dSSadaf Ebrahimi       sessive  quantifier  is  used to stop any backtracking into the runs of
10241*22dc650dSSadaf Ebrahimi       non-"<" characters. This version also uses a lot  less  memory  because
10242*22dc650dSSadaf Ebrahimi       entry  to  a  new  set of parentheses happens only when a "<" character
10243*22dc650dSSadaf Ebrahimi       that is not followed by "inet" is encountered (and we  assume  this  is
10244*22dc650dSSadaf Ebrahimi       relatively rare).
10245*22dc650dSSadaf Ebrahimi
10246*22dc650dSSadaf Ebrahimi       This example shows that one way of optimizing performance when matching
10247*22dc650dSSadaf Ebrahimi       long  subject strings is to write repeated parenthesized subpatterns to
10248*22dc650dSSadaf Ebrahimi       match more than one character whenever possible.
10249*22dc650dSSadaf Ebrahimi
10250*22dc650dSSadaf Ebrahimi   SETTING RESOURCE LIMITS
10251*22dc650dSSadaf Ebrahimi
10252*22dc650dSSadaf Ebrahimi       You can set limits on the amount of processing that  takes  place  when
10253*22dc650dSSadaf Ebrahimi       matching,  and  on  the amount of heap memory that is used. The default
10254*22dc650dSSadaf Ebrahimi       values of the limits are very large, and unlikely ever to operate. They
10255*22dc650dSSadaf Ebrahimi       can be changed when PCRE2 is built, and  they  can  also  be  set  when
10256*22dc650dSSadaf Ebrahimi       pcre2_match()  or pcre2_dfa_match() is called. For details of these in-
10257*22dc650dSSadaf Ebrahimi       terfaces, see the pcre2build documentation  and  the  section  entitled
10258*22dc650dSSadaf Ebrahimi       "The match context" in the pcre2api documentation.
10259*22dc650dSSadaf Ebrahimi
10260*22dc650dSSadaf Ebrahimi       The  pcre2test  test program has a modifier called "find_limits" which,
10261*22dc650dSSadaf Ebrahimi       if applied to a subject line, causes it to  find  the  smallest  limits
10262*22dc650dSSadaf Ebrahimi       that allow a pattern to match. This is done by repeatedly matching with
10263*22dc650dSSadaf Ebrahimi       different limits.
10264*22dc650dSSadaf Ebrahimi
10265*22dc650dSSadaf Ebrahimi
10266*22dc650dSSadaf EbrahimiAUTHOR
10267*22dc650dSSadaf Ebrahimi
10268*22dc650dSSadaf Ebrahimi       Philip Hazel
10269*22dc650dSSadaf Ebrahimi       Retired from University Computing Service
10270*22dc650dSSadaf Ebrahimi       Cambridge, England.
10271*22dc650dSSadaf Ebrahimi
10272*22dc650dSSadaf Ebrahimi
10273*22dc650dSSadaf EbrahimiREVISION
10274*22dc650dSSadaf Ebrahimi
10275*22dc650dSSadaf Ebrahimi       Last updated: 27 July 2022
10276*22dc650dSSadaf Ebrahimi       Copyright (c) 1997-2022 University of Cambridge.
10277*22dc650dSSadaf Ebrahimi
10278*22dc650dSSadaf Ebrahimi
10279*22dc650dSSadaf EbrahimiPCRE2 10.41                      27 July 2022                  PCRE2PERFORM(3)
10280*22dc650dSSadaf Ebrahimi------------------------------------------------------------------------------
10281*22dc650dSSadaf Ebrahimi
10282*22dc650dSSadaf Ebrahimi
10283*22dc650dSSadaf Ebrahimi
10284*22dc650dSSadaf EbrahimiPCRE2POSIX(3)              Library Functions Manual              PCRE2POSIX(3)
10285*22dc650dSSadaf Ebrahimi
10286*22dc650dSSadaf Ebrahimi
10287*22dc650dSSadaf EbrahimiNAME
10288*22dc650dSSadaf Ebrahimi       PCRE2 - Perl-compatible regular expressions (revised API)
10289*22dc650dSSadaf Ebrahimi
10290*22dc650dSSadaf Ebrahimi
10291*22dc650dSSadaf EbrahimiSYNOPSIS
10292*22dc650dSSadaf Ebrahimi
10293*22dc650dSSadaf Ebrahimi       #include <pcre2posix.h>
10294*22dc650dSSadaf Ebrahimi
10295*22dc650dSSadaf Ebrahimi       int pcre2_regcomp(regex_t *preg, const char *pattern,
10296*22dc650dSSadaf Ebrahimi            int cflags);
10297*22dc650dSSadaf Ebrahimi
10298*22dc650dSSadaf Ebrahimi       int pcre2_regexec(const regex_t *preg, const char *string,
10299*22dc650dSSadaf Ebrahimi            size_t nmatch, regmatch_t pmatch[], int eflags);
10300*22dc650dSSadaf Ebrahimi
10301*22dc650dSSadaf Ebrahimi       size_t pcre2_regerror(int errcode, const regex_t *preg,
10302*22dc650dSSadaf Ebrahimi            char *errbuf, size_t errbuf_size);
10303*22dc650dSSadaf Ebrahimi
10304*22dc650dSSadaf Ebrahimi       void pcre2_regfree(regex_t *preg);
10305*22dc650dSSadaf Ebrahimi
10306*22dc650dSSadaf Ebrahimi
10307*22dc650dSSadaf EbrahimiDESCRIPTION
10308*22dc650dSSadaf Ebrahimi
10309*22dc650dSSadaf Ebrahimi       This  set of functions provides a POSIX-style API for the PCRE2 regular
10310*22dc650dSSadaf Ebrahimi       expression 8-bit library. There are no POSIX-style wrappers for PCRE2's
10311*22dc650dSSadaf Ebrahimi       16-bit and 32-bit libraries. See the pcre2api documentation for  a  de-
10312*22dc650dSSadaf Ebrahimi       scription  of  PCRE2's native API, which contains much additional func-
10313*22dc650dSSadaf Ebrahimi       tionality.
10314*22dc650dSSadaf Ebrahimi
10315*22dc650dSSadaf Ebrahimi       IMPORTANT NOTE: The functions described here are NOT  thread-safe,  and
10316*22dc650dSSadaf Ebrahimi       should  not  be used in multi-threaded applications. They are also lim-
10317*22dc650dSSadaf Ebrahimi       ited to processing subjects that are not bigger than 2GB. Use  the  na-
10318*22dc650dSSadaf Ebrahimi       tive API instead.
10319*22dc650dSSadaf Ebrahimi
10320*22dc650dSSadaf Ebrahimi       These  functions  are  wrapper functions that ultimately call the PCRE2
10321*22dc650dSSadaf Ebrahimi       native API. Their prototypes are defined  in  the  pcre2posix.h  header
10322*22dc650dSSadaf Ebrahimi       file, and they all have unique names starting with pcre2_. However, the
10323*22dc650dSSadaf Ebrahimi       pcre2posix.h  header  also  contains macro definitions that convert the
10324*22dc650dSSadaf Ebrahimi       standard POSIX names such as regcomp() into pcre2_regcomp()  etc.  This
10325*22dc650dSSadaf Ebrahimi       means  that a program can use the usual POSIX names without running the
10326*22dc650dSSadaf Ebrahimi       risk of accidentally linking with POSIX functions from a different  li-
10327*22dc650dSSadaf Ebrahimi       brary.
10328*22dc650dSSadaf Ebrahimi
10329*22dc650dSSadaf Ebrahimi       On  Unix-like systems the PCRE2 POSIX library is called libpcre2-posix,
10330*22dc650dSSadaf Ebrahimi       so can be accessed by adding -lpcre2-posix to the command  for  linking
10331*22dc650dSSadaf Ebrahimi       an application. Because the POSIX functions call the native ones, it is
10332*22dc650dSSadaf Ebrahimi       also necessary to add -lpcre2-8.
10333*22dc650dSSadaf Ebrahimi
10334*22dc650dSSadaf Ebrahimi       On Windows systems, if you are linking to a DLL version of the library,
10335*22dc650dSSadaf Ebrahimi       it  is  recommended  that PCRE2POSIX_SHARED is defined before including
10336*22dc650dSSadaf Ebrahimi       the pcre2posix.h header, as it will allow for a more efficient  way  to
10337*22dc650dSSadaf Ebrahimi       invoke the functions by adding the __declspec(dllimport) decorator.
10338*22dc650dSSadaf Ebrahimi
10339*22dc650dSSadaf Ebrahimi       Although  they were not defined as prototypes in pcre2posix.h, releases
10340*22dc650dSSadaf Ebrahimi       10.33 to 10.36 of the library contained functions with the POSIX  names
10341*22dc650dSSadaf Ebrahimi       regcomp()  etc.  These simply passed their arguments to the PCRE2 func-
10342*22dc650dSSadaf Ebrahimi       tions. These functions were provided for backwards  compatibility  with
10343*22dc650dSSadaf Ebrahimi       earlier  versions  of  PCRE2, which had only POSIX names. However, this
10344*22dc650dSSadaf Ebrahimi       has proved troublesome in situations where a program links with several
10345*22dc650dSSadaf Ebrahimi       libraries, some of which use PCRE2's POSIX interface while  others  use
10346*22dc650dSSadaf Ebrahimi       the  real  POSIX functions.  For this reason, the POSIX names have been
10347*22dc650dSSadaf Ebrahimi       removed since release 10.37.
10348*22dc650dSSadaf Ebrahimi
10349*22dc650dSSadaf Ebrahimi       Naming  the header file pcre2posix.h avoids  any  conflict  with  other
10350*22dc650dSSadaf Ebrahimi       POSIX  libraries.  It can, of course, be renamed or aliased as regex.h,
10351*22dc650dSSadaf Ebrahimi       which is the "correct" name, if there is  no  clash.  It  provides  two
10352*22dc650dSSadaf Ebrahimi       structure  types,  regex_t  for compiled internal forms, and regmatch_t
10353*22dc650dSSadaf Ebrahimi       for returning captured substrings. It also defines some constants whose
10354*22dc650dSSadaf Ebrahimi       names start with "REG_"; these are used for setting options and identi-
10355*22dc650dSSadaf Ebrahimi       fying error codes.
10356*22dc650dSSadaf Ebrahimi
10357*22dc650dSSadaf Ebrahimi
10358*22dc650dSSadaf EbrahimiUSING THE POSIX FUNCTIONS
10359*22dc650dSSadaf Ebrahimi
10360*22dc650dSSadaf Ebrahimi       Note that these functions are just POSIX-style wrappers for PCRE2's na-
10361*22dc650dSSadaf Ebrahimi       tive API.  They do not give POSIX  regular  expression  behaviour,  and
10362*22dc650dSSadaf Ebrahimi       they are not thread-safe or even POSIX compatible.
10363*22dc650dSSadaf Ebrahimi
10364*22dc650dSSadaf Ebrahimi       Those  POSIX  option bits that can reasonably be mapped to PCRE2 native
10365*22dc650dSSadaf Ebrahimi       options have been implemented. In addition, the option REG_EXTENDED  is
10366*22dc650dSSadaf Ebrahimi       defined  with  the  value  zero. This has no effect, but since programs
10367*22dc650dSSadaf Ebrahimi       that are written to the POSIX interface often use  it,  this  makes  it
10368*22dc650dSSadaf Ebrahimi       easier  to  slot in PCRE2 as a replacement library. Other POSIX options
10369*22dc650dSSadaf Ebrahimi       are not even defined.
10370*22dc650dSSadaf Ebrahimi
10371*22dc650dSSadaf Ebrahimi       There are also some options that are not defined by POSIX.  These  have
10372*22dc650dSSadaf Ebrahimi       been  added  at  the  request  of users who want to make use of certain
10373*22dc650dSSadaf Ebrahimi       PCRE2-specific features via the POSIX calling interface or to  add  BSD
10374*22dc650dSSadaf Ebrahimi       or GNU functionality.
10375*22dc650dSSadaf Ebrahimi
10376*22dc650dSSadaf Ebrahimi       When  PCRE2  is  called via these functions, it is only the API that is
10377*22dc650dSSadaf Ebrahimi       POSIX-like in style. The syntax and semantics of  the  regular  expres-
10378*22dc650dSSadaf Ebrahimi       sions  themselves  are  still  those of Perl, subject to the setting of
10379*22dc650dSSadaf Ebrahimi       various PCRE2 options, as described below. "POSIX-like in style"  means
10380*22dc650dSSadaf Ebrahimi       that  the  API  approximates  to  the POSIX definition; it is not fully
10381*22dc650dSSadaf Ebrahimi       POSIX-compatible, and in multi-unit encoding  domains  it  is  probably
10382*22dc650dSSadaf Ebrahimi       even less compatible.
10383*22dc650dSSadaf Ebrahimi
10384*22dc650dSSadaf Ebrahimi       The  descriptions  below use the actual names of the functions, but, as
10385*22dc650dSSadaf Ebrahimi       described above, the standard POSIX names (without the  pcre2_  prefix)
10386*22dc650dSSadaf Ebrahimi       may also be used.
10387*22dc650dSSadaf Ebrahimi
10388*22dc650dSSadaf Ebrahimi
10389*22dc650dSSadaf EbrahimiCOMPILING A PATTERN
10390*22dc650dSSadaf Ebrahimi
10391*22dc650dSSadaf Ebrahimi       The function pcre2_regcomp() is called to compile a pattern into an in-
10392*22dc650dSSadaf Ebrahimi       ternal  form. By default, the pattern is a C string terminated by a bi-
10393*22dc650dSSadaf Ebrahimi       nary zero (but see REG_PEND below). The preg argument is a pointer to a
10394*22dc650dSSadaf Ebrahimi       regex_t structure that is used as a base for storing information  about
10395*22dc650dSSadaf Ebrahimi       the  compiled  regular  expression.  It  is  also  used  for input when
10396*22dc650dSSadaf Ebrahimi       REG_PEND is set. The regex_t structure used by pcre2_regcomp()  is  de-
10397*22dc650dSSadaf Ebrahimi       fined  in  pcre2posix.h  and  is  not the same as the structure used by
10398*22dc650dSSadaf Ebrahimi       other libraries that provide POSIX-style matching.
10399*22dc650dSSadaf Ebrahimi
10400*22dc650dSSadaf Ebrahimi       The argument cflags is either zero, or contains one or more of the bits
10401*22dc650dSSadaf Ebrahimi       defined by the following macros:
10402*22dc650dSSadaf Ebrahimi
10403*22dc650dSSadaf Ebrahimi         REG_DOTALL
10404*22dc650dSSadaf Ebrahimi
10405*22dc650dSSadaf Ebrahimi       The PCRE2_DOTALL option is set when the regular  expression  is  passed
10406*22dc650dSSadaf Ebrahimi       for  compilation  to  the  native function. Note that REG_DOTALL is not
10407*22dc650dSSadaf Ebrahimi       part of the POSIX standard.
10408*22dc650dSSadaf Ebrahimi
10409*22dc650dSSadaf Ebrahimi         REG_ICASE
10410*22dc650dSSadaf Ebrahimi
10411*22dc650dSSadaf Ebrahimi       The PCRE2_CASELESS option is set when the regular expression is  passed
10412*22dc650dSSadaf Ebrahimi       for compilation to the native function.
10413*22dc650dSSadaf Ebrahimi
10414*22dc650dSSadaf Ebrahimi         REG_NEWLINE
10415*22dc650dSSadaf Ebrahimi
10416*22dc650dSSadaf Ebrahimi       The PCRE2_MULTILINE option is set when the regular expression is passed
10417*22dc650dSSadaf Ebrahimi       for  compilation  to the native function. Note that this does not mimic
10418*22dc650dSSadaf Ebrahimi       the defined POSIX behaviour for REG_NEWLINE  (see  the  following  sec-
10419*22dc650dSSadaf Ebrahimi       tion).
10420*22dc650dSSadaf Ebrahimi
10421*22dc650dSSadaf Ebrahimi         REG_NOSPEC
10422*22dc650dSSadaf Ebrahimi
10423*22dc650dSSadaf Ebrahimi       The  PCRE2_LITERAL  option is set when the regular expression is passed
10424*22dc650dSSadaf Ebrahimi       for compilation to the native function. This disables all meta  charac-
10425*22dc650dSSadaf Ebrahimi       ters  in the pattern, causing it to be treated as a literal string. The
10426*22dc650dSSadaf Ebrahimi       only other options that are  allowed  with  REG_NOSPEC  are  REG_ICASE,
10427*22dc650dSSadaf Ebrahimi       REG_NOSUB,  REG_PEND,  and REG_UTF. Note that REG_NOSPEC is not part of
10428*22dc650dSSadaf Ebrahimi       the POSIX standard.
10429*22dc650dSSadaf Ebrahimi
10430*22dc650dSSadaf Ebrahimi         REG_NOSUB
10431*22dc650dSSadaf Ebrahimi
10432*22dc650dSSadaf Ebrahimi       When  a  pattern  that  is  compiled  with  this  flag  is  passed   to
10433*22dc650dSSadaf Ebrahimi       pcre2_regexec()  for  matching, the nmatch and pmatch arguments are ig-
10434*22dc650dSSadaf Ebrahimi       nored, and no captured strings are returned. Versions of the PCRE2  li-
10435*22dc650dSSadaf Ebrahimi       brary  prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile op-
10436*22dc650dSSadaf Ebrahimi       tion, but this no longer happens because it disables the use  of  back-
10437*22dc650dSSadaf Ebrahimi       references.
10438*22dc650dSSadaf Ebrahimi
10439*22dc650dSSadaf Ebrahimi         REG_PEND
10440*22dc650dSSadaf Ebrahimi
10441*22dc650dSSadaf Ebrahimi       If  this option is set, the re_endp field in the preg structure  (which
10442*22dc650dSSadaf Ebrahimi       has the type const char *) must be set to point to the character beyond
10443*22dc650dSSadaf Ebrahimi       the end of the pattern before calling pcre2_regcomp(). The pattern  it-
10444*22dc650dSSadaf Ebrahimi       self  may  now  contain binary zeros, which are treated as data charac-
10445*22dc650dSSadaf Ebrahimi       ters. Without REG_PEND, a binary zero terminates the  pattern  and  the
10446*22dc650dSSadaf Ebrahimi       re_endp field is ignored. This is a GNU extension to the POSIX standard
10447*22dc650dSSadaf Ebrahimi       and  should be used with caution in software intended to be portable to
10448*22dc650dSSadaf Ebrahimi       other systems.
10449*22dc650dSSadaf Ebrahimi
10450*22dc650dSSadaf Ebrahimi         REG_UCP
10451*22dc650dSSadaf Ebrahimi
10452*22dc650dSSadaf Ebrahimi       The PCRE2_UCP option is set when the regular expression is  passed  for
10453*22dc650dSSadaf Ebrahimi       compilation  to  the  native function. This causes PCRE2 to use Unicode
10454*22dc650dSSadaf Ebrahimi       properties when matching \d, \w,  etc.,  instead  of  just  recognizing
10455*22dc650dSSadaf Ebrahimi       ASCII values. Note that REG_UCP is not part of the POSIX standard.
10456*22dc650dSSadaf Ebrahimi
10457*22dc650dSSadaf Ebrahimi         REG_UNGREEDY
10458*22dc650dSSadaf Ebrahimi
10459*22dc650dSSadaf Ebrahimi       The  PCRE2_UNGREEDY option is set when the regular expression is passed
10460*22dc650dSSadaf Ebrahimi       for compilation to the native function. Note that REG_UNGREEDY  is  not
10461*22dc650dSSadaf Ebrahimi       part of the POSIX standard.
10462*22dc650dSSadaf Ebrahimi
10463*22dc650dSSadaf Ebrahimi         REG_UTF
10464*22dc650dSSadaf Ebrahimi
10465*22dc650dSSadaf Ebrahimi       The  PCRE2_UTF  option is set when the regular expression is passed for
10466*22dc650dSSadaf Ebrahimi       compilation to the native function. This causes the pattern itself  and
10467*22dc650dSSadaf Ebrahimi       all  data  strings used for matching it to be treated as UTF-8 strings.
10468*22dc650dSSadaf Ebrahimi       Note that REG_UTF is not part of the POSIX standard.
10469*22dc650dSSadaf Ebrahimi
10470*22dc650dSSadaf Ebrahimi       In the absence of these flags, no options  are  passed  to  the  native
10471*22dc650dSSadaf Ebrahimi       function.  This means that the regex is compiled with PCRE2 default se-
10472*22dc650dSSadaf Ebrahimi       mantics.  In  particular,  the way it handles newline characters in the
10473*22dc650dSSadaf Ebrahimi       subject string is the Perl way, not the POSIX way.  Note  that  setting
10474*22dc650dSSadaf Ebrahimi       PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
10475*22dc650dSSadaf Ebrahimi       It  does not affect the way newlines are matched by the dot metacharac-
10476*22dc650dSSadaf Ebrahimi       ter (they are not) or by a negative class such as [^a] (they are).
10477*22dc650dSSadaf Ebrahimi
10478*22dc650dSSadaf Ebrahimi       The yield of pcre2_regcomp() is zero on success,  and  non-zero  other-
10479*22dc650dSSadaf Ebrahimi       wise.  The preg structure is filled in on success, and one other member
10480*22dc650dSSadaf Ebrahimi       of  the  structure (as well as re_endp) is public: re_nsub contains the
10481*22dc650dSSadaf Ebrahimi       number of capturing subpatterns in the regular expression. Various  er-
10482*22dc650dSSadaf Ebrahimi       ror codes are defined in the header file.
10483*22dc650dSSadaf Ebrahimi
10484*22dc650dSSadaf Ebrahimi       NOTE: If the yield of pcre2_regcomp() is non-zero, you must not attempt
10485*22dc650dSSadaf Ebrahimi       to use the contents of the preg structure. If, for example, you pass it
10486*22dc650dSSadaf Ebrahimi       to  pcre2_regexec(), the result is undefined and your program is likely
10487*22dc650dSSadaf Ebrahimi       to crash.
10488*22dc650dSSadaf Ebrahimi
10489*22dc650dSSadaf Ebrahimi
10490*22dc650dSSadaf EbrahimiMATCHING NEWLINE CHARACTERS
10491*22dc650dSSadaf Ebrahimi
10492*22dc650dSSadaf Ebrahimi       This area is not simple, because POSIX and Perl take different views of
10493*22dc650dSSadaf Ebrahimi       things.  It is not possible to get PCRE2 to obey POSIX  semantics,  but
10494*22dc650dSSadaf Ebrahimi       then PCRE2 was never intended to be a POSIX engine. The following table
10495*22dc650dSSadaf Ebrahimi       lists  the  different  possibilities for matching newline characters in
10496*22dc650dSSadaf Ebrahimi       Perl and PCRE2:
10497*22dc650dSSadaf Ebrahimi
10498*22dc650dSSadaf Ebrahimi                                 Default   Change with
10499*22dc650dSSadaf Ebrahimi
10500*22dc650dSSadaf Ebrahimi         . matches newline          no     PCRE2_DOTALL
10501*22dc650dSSadaf Ebrahimi         newline matches [^a]       yes    not changeable
10502*22dc650dSSadaf Ebrahimi         $ matches \n at end        yes    PCRE2_DOLLAR_ENDONLY
10503*22dc650dSSadaf Ebrahimi         $ matches \n in middle     no     PCRE2_MULTILINE
10504*22dc650dSSadaf Ebrahimi         ^ matches \n in middle     no     PCRE2_MULTILINE
10505*22dc650dSSadaf Ebrahimi
10506*22dc650dSSadaf Ebrahimi       This is the equivalent table for a POSIX-compatible pattern matcher:
10507*22dc650dSSadaf Ebrahimi
10508*22dc650dSSadaf Ebrahimi                                 Default   Change with
10509*22dc650dSSadaf Ebrahimi
10510*22dc650dSSadaf Ebrahimi         . matches newline          yes    REG_NEWLINE
10511*22dc650dSSadaf Ebrahimi         newline matches [^a]       yes    REG_NEWLINE
10512*22dc650dSSadaf Ebrahimi         $ matches \n at end        no     REG_NEWLINE
10513*22dc650dSSadaf Ebrahimi         $ matches \n in middle     no     REG_NEWLINE
10514*22dc650dSSadaf Ebrahimi         ^ matches \n in middle     no     REG_NEWLINE
10515*22dc650dSSadaf Ebrahimi
10516*22dc650dSSadaf Ebrahimi       This behaviour is not what happens when PCRE2 is called via  its  POSIX
10517*22dc650dSSadaf Ebrahimi       API.  By  default, PCRE2's behaviour is the same as Perl's, except that
10518*22dc650dSSadaf Ebrahimi       there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both  PCRE2
10519*22dc650dSSadaf Ebrahimi       and Perl, there is no way to stop newline from matching [^a].
10520*22dc650dSSadaf Ebrahimi
10521*22dc650dSSadaf Ebrahimi       Default  POSIX newline handling can be obtained by setting PCRE2_DOTALL
10522*22dc650dSSadaf Ebrahimi       and PCRE2_DOLLAR_ENDONLY when  calling  pcre2_compile()  directly,  but
10523*22dc650dSSadaf Ebrahimi       there is no way to make PCRE2 behave exactly as for the REG_NEWLINE ac-
10524*22dc650dSSadaf Ebrahimi       tion.  When  using  the  POSIX  API,  passing  REG_NEWLINE  to  PCRE2's
10525*22dc650dSSadaf Ebrahimi       pcre2_regcomp()  function  causes  PCRE2_MULTILINE  to  be  passed   to
10526*22dc650dSSadaf Ebrahimi       pcre2_compile(), and REG_DOTALL passes PCRE2_DOTALL. There is no way to
10527*22dc650dSSadaf Ebrahimi       pass PCRE2_DOLLAR_ENDONLY.
10528*22dc650dSSadaf Ebrahimi
10529*22dc650dSSadaf Ebrahimi
10530*22dc650dSSadaf EbrahimiMATCHING A PATTERN
10531*22dc650dSSadaf Ebrahimi
10532*22dc650dSSadaf Ebrahimi       The function pcre2_regexec() is called to match a compiled pattern preg
10533*22dc650dSSadaf Ebrahimi       against  a  given string, which is by default terminated by a zero byte
10534*22dc650dSSadaf Ebrahimi       (but see REG_STARTEND below), subject to the options in eflags.   These
10535*22dc650dSSadaf Ebrahimi       can be:
10536*22dc650dSSadaf Ebrahimi
10537*22dc650dSSadaf Ebrahimi         REG_NOTBOL
10538*22dc650dSSadaf Ebrahimi
10539*22dc650dSSadaf Ebrahimi       The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match-
10540*22dc650dSSadaf Ebrahimi       ing function.
10541*22dc650dSSadaf Ebrahimi
10542*22dc650dSSadaf Ebrahimi         REG_NOTEMPTY
10543*22dc650dSSadaf Ebrahimi
10544*22dc650dSSadaf Ebrahimi       The  PCRE2_NOTEMPTY  option  is  set  when calling the underlying PCRE2
10545*22dc650dSSadaf Ebrahimi       matching function. Note that REG_NOTEMPTY is  not  part  of  the  POSIX
10546*22dc650dSSadaf Ebrahimi       standard.  However, setting this option can give more POSIX-like behav-
10547*22dc650dSSadaf Ebrahimi       iour in some situations.
10548*22dc650dSSadaf Ebrahimi
10549*22dc650dSSadaf Ebrahimi         REG_NOTEOL
10550*22dc650dSSadaf Ebrahimi
10551*22dc650dSSadaf Ebrahimi       The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match-
10552*22dc650dSSadaf Ebrahimi       ing function.
10553*22dc650dSSadaf Ebrahimi
10554*22dc650dSSadaf Ebrahimi         REG_STARTEND
10555*22dc650dSSadaf Ebrahimi
10556*22dc650dSSadaf Ebrahimi       When this option  is  set,  the  subject  string  starts  at  string  +
10557*22dc650dSSadaf Ebrahimi       pmatch[0].rm_so  and  ends  at  string  + pmatch[0].rm_eo, which should
10558*22dc650dSSadaf Ebrahimi       point to the first character beyond the string. There may be binary ze-
10559*22dc650dSSadaf Ebrahimi       ros within the subject string, and indeed, using  REG_STARTEND  is  the
10560*22dc650dSSadaf Ebrahimi       only way to pass a subject string that contains a binary zero.
10561*22dc650dSSadaf Ebrahimi
10562*22dc650dSSadaf Ebrahimi       Whatever  the  value  of  pmatch[0].rm_so,  the  offsets of the matched
10563*22dc650dSSadaf Ebrahimi       string and any captured substrings are  still  given  relative  to  the
10564*22dc650dSSadaf Ebrahimi       start  of  string  itself. (Before PCRE2 release 10.30 these were given
10565*22dc650dSSadaf Ebrahimi       relative to string + pmatch[0].rm_so, but this differs from  other  im-
10566*22dc650dSSadaf Ebrahimi       plementations.)
10567*22dc650dSSadaf Ebrahimi
10568*22dc650dSSadaf Ebrahimi       This  is  a  BSD  extension,  compatible with but not specified by IEEE
10569*22dc650dSSadaf Ebrahimi       Standard 1003.2 (POSIX.2), and should be used with caution in  software
10570*22dc650dSSadaf Ebrahimi       intended  to  be  portable to other systems. Note that a non-zero rm_so
10571*22dc650dSSadaf Ebrahimi       does not imply REG_NOTBOL; REG_STARTEND affects only the  location  and
10572*22dc650dSSadaf Ebrahimi       length  of  the string, not how it is matched. Setting REG_STARTEND and
10573*22dc650dSSadaf Ebrahimi       passing pmatch as NULL are mutually exclusive; if both  are  done,  the
10574*22dc650dSSadaf Ebrahimi       error REG_INVARG is returned.
10575*22dc650dSSadaf Ebrahimi
10576*22dc650dSSadaf Ebrahimi       If  the pattern was compiled with the REG_NOSUB flag, no data about any
10577*22dc650dSSadaf Ebrahimi       matched strings  is  returned.  The  nmatch  and  pmatch  arguments  of
10578*22dc650dSSadaf Ebrahimi       pcre2_regexec()  are  ignored  (except  possibly as input for REG_STAR-
10579*22dc650dSSadaf Ebrahimi       TEND).
10580*22dc650dSSadaf Ebrahimi
10581*22dc650dSSadaf Ebrahimi       The value of nmatch may be zero, and the value of pmatch may  be  NULL
10582*22dc650dSSadaf Ebrahimi       (unless REG_STARTEND is set); in both these cases no data  about  any
10583*22dc650dSSadaf Ebrahimi       matched strings is returned.
10584*22dc650dSSadaf Ebrahimi
10585*22dc650dSSadaf Ebrahimi       Otherwise, the portion of the string that was  matched,  and  also  any
10586*22dc650dSSadaf Ebrahimi       captured substrings, are returned via the pmatch argument, which points
10587*22dc650dSSadaf Ebrahimi       to  an  array  of  nmatch structures of type regmatch_t, containing the
10588*22dc650dSSadaf Ebrahimi       members rm_so and rm_eo. These contain the byte  offset  to  the  first
10589*22dc650dSSadaf Ebrahimi       character of each substring and the offset to the first character after
10590*22dc650dSSadaf Ebrahimi       the  end of each substring, respectively. The 0th element of the vector
10591*22dc650dSSadaf Ebrahimi       relates to the entire portion of string that  was  matched;  subsequent
10592*22dc650dSSadaf Ebrahimi       elements relate to the capturing subpatterns of the regular expression.
10593*22dc650dSSadaf Ebrahimi       Unused entries in the array have both structure members set to -1.
10594*22dc650dSSadaf Ebrahimi
10595*22dc650dSSadaf Ebrahimi       regmatch_t  as  well  as  the  regoff_t  typedef it uses are defined in
10596*22dc650dSSadaf Ebrahimi       pcre2posix.h and are not warranted to have the same size or  layout  as
10597*22dc650dSSadaf Ebrahimi       other  similarly  named  types from other libraries that provide POSIX-
10598*22dc650dSSadaf Ebrahimi       style matching.
10599*22dc650dSSadaf Ebrahimi
10600*22dc650dSSadaf Ebrahimi       A successful match yields a zero return; various error  codes  are  de-
10601*22dc650dSSadaf Ebrahimi       fined  in the header file, of which REG_NOMATCH is the "expected" fail-
10602*22dc650dSSadaf Ebrahimi       ure code.
10603*22dc650dSSadaf Ebrahimi
10604*22dc650dSSadaf Ebrahimi
10605*22dc650dSSadaf EbrahimiERROR MESSAGES
10606*22dc650dSSadaf Ebrahimi
10607*22dc650dSSadaf Ebrahimi       The pcre2_regerror() function maps a  non-zero  errorcode  from  either
10608*22dc650dSSadaf Ebrahimi       pcre2_regcomp()  or  pcre2_regexec() to a printable message. If preg is
10609*22dc650dSSadaf Ebrahimi       not NULL, the error should have arisen from the use of that  structure.
10610*22dc650dSSadaf Ebrahimi       A  message  terminated  by  a  binary  zero is placed in errbuf. If the
10611*22dc650dSSadaf Ebrahimi       buffer is too short, only the first errbuf_size - 1 characters  of  the
10612*22dc650dSSadaf Ebrahimi       error message are used. The yield of the function is the size of buffer
10613*22dc650dSSadaf Ebrahimi       needed  to hold the whole message, including the terminating zero. This
10614*22dc650dSSadaf Ebrahimi       value is greater than errbuf_size if the message was truncated.
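
       Because the yield is the size needed for the whole message, a caller
       can obtain an untruncated message in two calls. The sketch below
       illustrates this under an assumption: it includes the system <regex.h>
       and calls the POSIX regerror() as a stand-in so that it builds without
       PCRE2; with PCRE2 the header would be pcre2posix.h and the call would
       be pcre2_regerror(). The helper name full_error_message() is
       hypothetical.

```c
#include <regex.h>    /* assumption: system POSIX header used as a stand-in */
#include <stdlib.h>

/* Returns a malloc()ed copy of the complete message for `errcode`, or NULL
   if memory cannot be obtained. The caller frees the result with free(). */
static char *full_error_message(int errcode, const regex_t *preg)
{
    char small[16];

    /* First call: the yield is the size needed for the whole message,
       including the terminating zero, even if `small` was too short. */
    size_t needed = regerror(errcode, preg, small, sizeof small);

    char *msg = malloc(needed);
    if (msg == NULL)
        return NULL;

    /* Second call, with a buffer that is known to be large enough. */
    regerror(errcode, preg, msg, needed);
    return msg;
}
```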
10615*22dc650dSSadaf Ebrahimi
10616*22dc650dSSadaf Ebrahimi
10617*22dc650dSSadaf EbrahimiMEMORY USAGE
10618*22dc650dSSadaf Ebrahimi
10619*22dc650dSSadaf Ebrahimi       Compiling a regular expression causes memory to be allocated and  asso-
10620*22dc650dSSadaf Ebrahimi       ciated  with the preg structure. The function pcre2_regfree() frees all
10621*22dc650dSSadaf Ebrahimi       such memory, after which preg may no longer be used as a  compiled  ex-
10622*22dc650dSSadaf Ebrahimi       pression.
10623*22dc650dSSadaf Ebrahimi
10624*22dc650dSSadaf Ebrahimi
10625*22dc650dSSadaf EbrahimiAUTHOR
10626*22dc650dSSadaf Ebrahimi
10627*22dc650dSSadaf Ebrahimi       Philip Hazel
10628*22dc650dSSadaf Ebrahimi       Retired from University Computing Service
10629*22dc650dSSadaf Ebrahimi       Cambridge, England.
10630*22dc650dSSadaf Ebrahimi
10631*22dc650dSSadaf Ebrahimi
10632*22dc650dSSadaf EbrahimiREVISION
10633*22dc650dSSadaf Ebrahimi
10634*22dc650dSSadaf Ebrahimi       Last updated: 19 January 2024
10635*22dc650dSSadaf Ebrahimi       Copyright (c) 1997-2024 University of Cambridge.
10636*22dc650dSSadaf Ebrahimi
10637*22dc650dSSadaf Ebrahimi
10638*22dc650dSSadaf EbrahimiPCRE2 10.43                     19 January 2024                  PCRE2POSIX(3)
10639*22dc650dSSadaf Ebrahimi------------------------------------------------------------------------------
10640*22dc650dSSadaf Ebrahimi
10641*22dc650dSSadaf Ebrahimi
10642*22dc650dSSadaf Ebrahimi
10643*22dc650dSSadaf EbrahimiPCRE2SAMPLE(3)             Library Functions Manual             PCRE2SAMPLE(3)
10644*22dc650dSSadaf Ebrahimi
10645*22dc650dSSadaf Ebrahimi
10646*22dc650dSSadaf EbrahimiNAME
10647*22dc650dSSadaf Ebrahimi       PCRE2 - Perl-compatible regular expressions (revised API)
10648*22dc650dSSadaf Ebrahimi
10649*22dc650dSSadaf Ebrahimi
10650*22dc650dSSadaf EbrahimiPCRE2 SAMPLE PROGRAM
10651*22dc650dSSadaf Ebrahimi
10652*22dc650dSSadaf Ebrahimi       A  simple, complete demonstration program to get you started with using
10653*22dc650dSSadaf Ebrahimi       PCRE2 is supplied in the file pcre2demo.c in the src directory  in  the
10654*22dc650dSSadaf Ebrahimi       PCRE2 distribution. A listing of this program is given in the pcre2demo
10655*22dc650dSSadaf Ebrahimi       documentation. If you do not have a copy of the PCRE2 distribution, you
10656*22dc650dSSadaf Ebrahimi       can save this listing to re-create the contents of pcre2demo.c.
10657*22dc650dSSadaf Ebrahimi
10658*22dc650dSSadaf Ebrahimi       The  demonstration  program compiles the regular expression that is its
10659*22dc650dSSadaf Ebrahimi       first argument, and matches it against the subject string in its second
10660*22dc650dSSadaf Ebrahimi       argument. No PCRE2 options are set, and default  character  tables  are
10661*22dc650dSSadaf Ebrahimi       used. If matching succeeds, the program outputs the portion of the sub-
10662*22dc650dSSadaf Ebrahimi       ject  that  matched,  together  with  the contents of any captured sub-
10663*22dc650dSSadaf Ebrahimi       strings.
10664*22dc650dSSadaf Ebrahimi
10665*22dc650dSSadaf Ebrahimi       If the -g option is given on the command line, the program then goes on
10666*22dc650dSSadaf Ebrahimi       to check for further matches of the same regular expression in the same
10667*22dc650dSSadaf Ebrahimi       subject string. The logic is a little bit tricky because of the  possi-
10668*22dc650dSSadaf Ebrahimi       bility  of  matching an empty string. Comments in the code explain what
10669*22dc650dSSadaf Ebrahimi       is going on.
10670*22dc650dSSadaf Ebrahimi
10671*22dc650dSSadaf Ebrahimi       The code in pcre2demo.c is an 8-bit program that uses the  PCRE2  8-bit
10672*22dc650dSSadaf Ebrahimi       library.  It  handles  strings  and characters that are stored in 8-bit
10673*22dc650dSSadaf Ebrahimi       code units.  By default, one character corresponds to  one  code  unit,
10674*22dc650dSSadaf Ebrahimi       but  if  the  pattern starts with "(*UTF)", both it and the subject are
10675*22dc650dSSadaf Ebrahimi       treated as UTF-8 strings, where characters  may  occupy  multiple  code
10676*22dc650dSSadaf Ebrahimi       units.
10677*22dc650dSSadaf Ebrahimi
10678*22dc650dSSadaf Ebrahimi       If  PCRE2  is installed in the standard include and library directories
10679*22dc650dSSadaf Ebrahimi       for your operating system, you should be able to compile the demonstra-
10680*22dc650dSSadaf Ebrahimi       tion program using a command like this:
10681*22dc650dSSadaf Ebrahimi
10682*22dc650dSSadaf Ebrahimi         cc -o pcre2demo pcre2demo.c -lpcre2-8
10683*22dc650dSSadaf Ebrahimi
       If PCRE2 is installed elsewhere, you may need to add further options to
       the command line. For example, on a Unix-like system that has PCRE2
       installed in /usr/local, you can compile the demonstration program
       using a command like this:
10688*22dc650dSSadaf Ebrahimi
10689*22dc650dSSadaf Ebrahimi         cc -o pcre2demo -I/usr/local/include pcre2demo.c \
10690*22dc650dSSadaf Ebrahimi            -L/usr/local/lib -lpcre2-8
10691*22dc650dSSadaf Ebrahimi
10692*22dc650dSSadaf Ebrahimi       Once you have built the demonstration program, you can run simple tests
10693*22dc650dSSadaf Ebrahimi       like this:
10694*22dc650dSSadaf Ebrahimi
10695*22dc650dSSadaf Ebrahimi         ./pcre2demo 'cat|dog' 'the cat sat on the mat'
10696*22dc650dSSadaf Ebrahimi         ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
10697*22dc650dSSadaf Ebrahimi
10698*22dc650dSSadaf Ebrahimi       Note that there is a  much  more  comprehensive  test  program,  called
10699*22dc650dSSadaf Ebrahimi       pcre2test,  which supports many more facilities for testing regular ex-
10700*22dc650dSSadaf Ebrahimi       pressions using all three PCRE2 libraries (8-bit, 16-bit,  and  32-bit,
10701*22dc650dSSadaf Ebrahimi       though  not all three need be installed). The pcre2demo program is pro-
10702*22dc650dSSadaf Ebrahimi       vided as a relatively simple coding example.
10703*22dc650dSSadaf Ebrahimi
10704*22dc650dSSadaf Ebrahimi       If you try to run pcre2demo when PCRE2 is not installed in the standard
10705*22dc650dSSadaf Ebrahimi       library directory, you may get an error like  this  on  some  operating
10706*22dc650dSSadaf Ebrahimi       systems (e.g. Solaris):
10707*22dc650dSSadaf Ebrahimi
         ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file or directory
10710*22dc650dSSadaf Ebrahimi
10711*22dc650dSSadaf Ebrahimi       This  is  caused  by the way shared library support works on those sys-
10712*22dc650dSSadaf Ebrahimi       tems. You need to add
10713*22dc650dSSadaf Ebrahimi
10714*22dc650dSSadaf Ebrahimi         -R/usr/local/lib
10715*22dc650dSSadaf Ebrahimi
10716*22dc650dSSadaf Ebrahimi       (for example) to the compile command to get round this problem.
10717*22dc650dSSadaf Ebrahimi
10718*22dc650dSSadaf Ebrahimi
10719*22dc650dSSadaf EbrahimiAUTHOR
10720*22dc650dSSadaf Ebrahimi
10721*22dc650dSSadaf Ebrahimi       Philip Hazel
10722*22dc650dSSadaf Ebrahimi       Retired from University Computing Service
10723*22dc650dSSadaf Ebrahimi       Cambridge, England.
10724*22dc650dSSadaf Ebrahimi
10725*22dc650dSSadaf Ebrahimi
10726*22dc650dSSadaf EbrahimiREVISION
10727*22dc650dSSadaf Ebrahimi
10728*22dc650dSSadaf Ebrahimi       Last updated: 02 February 2016
10729*22dc650dSSadaf Ebrahimi       Copyright (c) 1997-2016 University of Cambridge.
10730*22dc650dSSadaf Ebrahimi
10731*22dc650dSSadaf Ebrahimi
10732*22dc650dSSadaf EbrahimiPCRE2 10.22                    02 February 2016                 PCRE2SAMPLE(3)
10733*22dc650dSSadaf Ebrahimi------------------------------------------------------------------------------
10734*22dc650dSSadaf Ebrahimi
10735*22dc650dSSadaf EbrahimiPCRE2SERIALIZE(3)          Library Functions Manual          PCRE2SERIALIZE(3)
10736*22dc650dSSadaf Ebrahimi
10737*22dc650dSSadaf Ebrahimi
10738*22dc650dSSadaf EbrahimiNAME
10739*22dc650dSSadaf Ebrahimi       PCRE2 - Perl-compatible regular expressions (revised API)
10740*22dc650dSSadaf Ebrahimi
10741*22dc650dSSadaf Ebrahimi
10742*22dc650dSSadaf EbrahimiSAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS
10743*22dc650dSSadaf Ebrahimi
10744*22dc650dSSadaf Ebrahimi       int32_t pcre2_serialize_decode(pcre2_code **codes,
10745*22dc650dSSadaf Ebrahimi         int32_t number_of_codes, const uint8_t *bytes,
10746*22dc650dSSadaf Ebrahimi         pcre2_general_context *gcontext);
10747*22dc650dSSadaf Ebrahimi
10748*22dc650dSSadaf Ebrahimi       int32_t pcre2_serialize_encode(const pcre2_code **codes,
10749*22dc650dSSadaf Ebrahimi         int32_t number_of_codes, uint8_t **serialized_bytes,
10750*22dc650dSSadaf Ebrahimi         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);
10751*22dc650dSSadaf Ebrahimi
10752*22dc650dSSadaf Ebrahimi       void pcre2_serialize_free(uint8_t *bytes);
10753*22dc650dSSadaf Ebrahimi
10754*22dc650dSSadaf Ebrahimi       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);
10755*22dc650dSSadaf Ebrahimi
10756*22dc650dSSadaf Ebrahimi       If  you  are running an application that uses a large number of regular
10757*22dc650dSSadaf Ebrahimi       expression patterns, it may be useful to store them  in  a  precompiled
10758*22dc650dSSadaf Ebrahimi       form  instead  of  having to compile them every time the application is
10759*22dc650dSSadaf Ebrahimi       run. However, if you are using the just-in-time  optimization  feature,
10760*22dc650dSSadaf Ebrahimi       it is not possible to save and reload the JIT data, because it is posi-
10761*22dc650dSSadaf Ebrahimi       tion-dependent.  The  host  on  which the patterns are reloaded must be
10762*22dc650dSSadaf Ebrahimi       running the same version of PCRE2, with the same code unit  width,  and
10763*22dc650dSSadaf Ebrahimi       must  also have the same endianness, pointer width and PCRE2_SIZE type.
10764*22dc650dSSadaf Ebrahimi       For example, patterns compiled on a 32-bit system using PCRE2's  16-bit
10765*22dc650dSSadaf Ebrahimi       library cannot be reloaded on a 64-bit system, nor can they be reloaded
10766*22dc650dSSadaf Ebrahimi       using the 8-bit library.
10767*22dc650dSSadaf Ebrahimi
10768*22dc650dSSadaf Ebrahimi       Note  that  "serialization" in PCRE2 does not convert compiled patterns
10769*22dc650dSSadaf Ebrahimi       to an abstract format like Java or .NET serialization.  The  serialized
10770*22dc650dSSadaf Ebrahimi       output  is really just a bytecode dump, which is why it can only be re-
10771*22dc650dSSadaf Ebrahimi       loaded in the same environment as the one that created  it.  Hence  the
10772*22dc650dSSadaf Ebrahimi       restrictions  mentioned  above.   Applications  that are not statically
10773*22dc650dSSadaf Ebrahimi       linked with a fixed version of PCRE2 must be prepared to recompile pat-
10774*22dc650dSSadaf Ebrahimi       terns from their sources, in order to be immune to PCRE2 upgrades.
10775*22dc650dSSadaf Ebrahimi
10776*22dc650dSSadaf Ebrahimi
10777*22dc650dSSadaf EbrahimiSECURITY CONCERNS
10778*22dc650dSSadaf Ebrahimi
10779*22dc650dSSadaf Ebrahimi       The facility for saving and restoring compiled patterns is intended for
10780*22dc650dSSadaf Ebrahimi       use within individual applications.  As  such,  the  data  supplied  to
10781*22dc650dSSadaf Ebrahimi       pcre2_serialize_decode()  is expected to be trusted data, not data from
10782*22dc650dSSadaf Ebrahimi       arbitrary external sources.  There  is  only  some  simple  consistency
10783*22dc650dSSadaf Ebrahimi       checking, not complete validation of what is being re-loaded. Corrupted
10784*22dc650dSSadaf Ebrahimi       data may cause undefined results. For example, if the length field of a
10785*22dc650dSSadaf Ebrahimi       pattern in the serialized data is corrupted, the deserializing code may
10786*22dc650dSSadaf Ebrahimi       read beyond the end of the byte stream that is passed to it.
10787*22dc650dSSadaf Ebrahimi
10788*22dc650dSSadaf Ebrahimi
10789*22dc650dSSadaf EbrahimiSAVING COMPILED PATTERNS
10790*22dc650dSSadaf Ebrahimi
10791*22dc650dSSadaf Ebrahimi       Before compiled patterns can be saved they must be serialized, which in
10792*22dc650dSSadaf Ebrahimi       PCRE2  means converting the pattern to a stream of bytes. A single byte
10793*22dc650dSSadaf Ebrahimi       stream may contain any number of compiled patterns, but they  must  all
10794*22dc650dSSadaf Ebrahimi       use  the same character tables. A single copy of the tables is included
10795*22dc650dSSadaf Ebrahimi       in the byte stream (its size is 1088 bytes). For more details of  char-
10796*22dc650dSSadaf Ebrahimi       acter  tables,  see the section on locale support in the pcre2api docu-
10797*22dc650dSSadaf Ebrahimi       mentation.
10798*22dc650dSSadaf Ebrahimi
10799*22dc650dSSadaf Ebrahimi       The function pcre2_serialize_encode() creates a serialized byte  stream
10800*22dc650dSSadaf Ebrahimi       from  a  list of compiled patterns. Its first two arguments specify the
10801*22dc650dSSadaf Ebrahimi       list, being a pointer to a vector of pointers to compiled patterns, and
10802*22dc650dSSadaf Ebrahimi       the length of the vector. The third and fourth arguments point to vari-
10803*22dc650dSSadaf Ebrahimi       ables which are set to point to the created byte stream and its length,
10804*22dc650dSSadaf Ebrahimi       respectively. The final argument is a pointer  to  a  general  context,
10805*22dc650dSSadaf Ebrahimi       which  can  be  used  to specify custom memory management functions. If
10806*22dc650dSSadaf Ebrahimi       this argument is NULL, malloc() is used to obtain memory for  the  byte
10807*22dc650dSSadaf Ebrahimi       stream. The yield of the function is the number of serialized patterns,
10808*22dc650dSSadaf Ebrahimi       or one of the following negative error codes:
10809*22dc650dSSadaf Ebrahimi
10810*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_BADDATA      the number of patterns is zero or less
10811*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_BADMAGIC     mismatch of id bytes in one of the patterns
10812*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_NOMEMORY     memory allocation failed
10813*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_MIXEDTABLES  the patterns do not all use the same tables
10814*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_NULL         the 1st, 3rd, or 4th argument is NULL
10815*22dc650dSSadaf Ebrahimi
10816*22dc650dSSadaf Ebrahimi       PCRE2_ERROR_BADMAGIC  means  either that a pattern's code has been cor-
10817*22dc650dSSadaf Ebrahimi       rupted, or that a slot in the vector does not point to a compiled  pat-
10818*22dc650dSSadaf Ebrahimi       tern.
10819*22dc650dSSadaf Ebrahimi
10820*22dc650dSSadaf Ebrahimi       Once a set of patterns has been serialized you can save the data in any
10821*22dc650dSSadaf Ebrahimi       appropriate  manner. Here is sample code that compiles two patterns and
10822*22dc650dSSadaf Ebrahimi       writes them to a file. It assumes that the variable fd refers to a file
10823*22dc650dSSadaf Ebrahimi       that is open for output. The error checking that should be present in a
10824*22dc650dSSadaf Ebrahimi       real application has been omitted for simplicity.
10825*22dc650dSSadaf Ebrahimi
         int errorcode;
         uint8_t *bytes;
         PCRE2_SIZE erroroffset;
         PCRE2_SIZE bytescount;
         pcre2_code *list_of_codes[2];

         /* Compile the two patterns that are to be saved. */
         list_of_codes[0] = pcre2_compile((PCRE2_SPTR)"first pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
         list_of_codes[1] = pcre2_compile((PCRE2_SPTR)"second pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);

         /* Serialize both patterns into a single byte stream. The yield is
            the number of patterns serialized, or a negative error code. */
         errorcode = pcre2_serialize_encode((const pcre2_code **)list_of_codes,
           2, &bytes, &bytescount, NULL);

         /* fd is a FILE * that is already open for output. */
         errorcode = fwrite(bytes, 1, bytescount, fd);
10838*22dc650dSSadaf Ebrahimi
10839*22dc650dSSadaf Ebrahimi       Note that the serialized data is binary data that may  contain  any  of
10840*22dc650dSSadaf Ebrahimi       the  256  possible  byte values. On systems that make a distinction be-
10841*22dc650dSSadaf Ebrahimi       tween binary and non-binary data, be sure that the file is  opened  for
10842*22dc650dSSadaf Ebrahimi       binary output.
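
       As a hedged illustration of the point about binary output, the sketch
       below writes a byte stream with the standard C library, opening the
       file in "wb" mode and checking the results that the sample code above
       leaves unchecked. The helper name save_serialized() and its return
       convention are hypothetical; the bytes and count arguments stand for
       the stream and length set by pcre2_serialize_encode().

```c
#include <stdint.h>
#include <stdio.h>

/* Writes `count` bytes to `path`, opening the file in binary mode ("wb")
   so that no byte value is translated. Returns 0 on success, -1 on error. */
static int save_serialized(const char *path, const uint8_t *bytes,
                           size_t count)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;

    size_t written = fwrite(bytes, 1, count, f);

    /* fclose() flushes buffered data, so its result matters too. */
    if (fclose(f) != 0 || written != count)
        return -1;
    return 0;
}
```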
10843*22dc650dSSadaf Ebrahimi
       Serializing a set of patterns leaves the original compiled patterns
       untouched, so they can still be used for matching. Their memory must
       eventually be freed in the usual way by calling pcre2_code_free(). When
       you have finished with the byte stream, it too must be freed by calling
       pcre2_serialize_free(). If this function is called with a NULL
       argument, it returns immediately without doing anything.
10850*22dc650dSSadaf Ebrahimi
10851*22dc650dSSadaf Ebrahimi
10852*22dc650dSSadaf EbrahimiRE-USING PRECOMPILED PATTERNS
10853*22dc650dSSadaf Ebrahimi
10854*22dc650dSSadaf Ebrahimi       In order to re-use a set of saved patterns you must first make the  se-
10855*22dc650dSSadaf Ebrahimi       rialized  byte stream available in main memory (for example, by reading
10856*22dc650dSSadaf Ebrahimi       from a file). The management of this memory block is up to the applica-
10857*22dc650dSSadaf Ebrahimi       tion. You can use the pcre2_serialize_get_number_of_codes() function to
10858*22dc650dSSadaf Ebrahimi       find out how many compiled patterns are in the serialized data  without
10859*22dc650dSSadaf Ebrahimi       actually decoding the patterns:
10860*22dc650dSSadaf Ebrahimi
10861*22dc650dSSadaf Ebrahimi         uint8_t *bytes = <serialized data>;
10862*22dc650dSSadaf Ebrahimi         int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes);
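
       One way of making a saved byte stream available in main memory is
       sketched below, using only the standard C library. The helper name
       load_serialized() is hypothetical, and the block it returns is managed
       by the application with free(). The result is what would be passed as
       the bytes argument of the functions described in this section, so the
       file should come from a trusted source.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads the whole of `path` in binary mode into a malloc()ed block.
   On success returns the block and stores its length in *size_out;
   on any error returns NULL. The caller frees the block with free(). */
static uint8_t *load_serialized(const char *path, size_t *size_out)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    uint8_t *buf = NULL;
    size_t used = 0, cap = 0;

    for (;;) {
        if (used == cap) {                  /* grow the block when full */
            size_t new_cap = cap ? cap * 2 : 4096;
            uint8_t *tmp = realloc(buf, new_cap);
            if (tmp == NULL) { free(buf); fclose(f); return NULL; }
            buf = tmp;
            cap = new_cap;
        }
        size_t n = fread(buf + used, 1, cap - used, f);
        used += n;
        if (n == 0)
            break;                          /* end of file or read error */
    }

    if (ferror(f)) { free(buf); fclose(f); return NULL; }
    fclose(f);
    *size_out = used;
    return buf;
}
```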
10863*22dc650dSSadaf Ebrahimi
10864*22dc650dSSadaf Ebrahimi       The pcre2_serialize_decode() function reads a byte stream and recreates
10865*22dc650dSSadaf Ebrahimi       the compiled patterns in new memory blocks, setting pointers to them in
10866*22dc650dSSadaf Ebrahimi       a  vector.  The  first two arguments are a pointer to a suitable vector
10867*22dc650dSSadaf Ebrahimi       and its length, and the third argument points to a byte stream. The fi-
10868*22dc650dSSadaf Ebrahimi       nal argument is a pointer to a general context, which can  be  used  to
10869*22dc650dSSadaf Ebrahimi       specify custom memory management functions for the decoded patterns. If
10870*22dc650dSSadaf Ebrahimi       this argument is NULL, malloc() and free() are used. After deserializa-
10871*22dc650dSSadaf Ebrahimi       tion, the byte stream is no longer needed and can be discarded.
10872*22dc650dSSadaf Ebrahimi
10873*22dc650dSSadaf Ebrahimi         pcre2_code *list_of_codes[2];
10874*22dc650dSSadaf Ebrahimi         uint8_t *bytes = <serialized data>;
10875*22dc650dSSadaf Ebrahimi         int32_t number_of_codes =
10876*22dc650dSSadaf Ebrahimi           pcre2_serialize_decode(list_of_codes, 2, bytes, NULL);
10877*22dc650dSSadaf Ebrahimi
10878*22dc650dSSadaf Ebrahimi       If  the  vector  is  not  large enough for all the patterns in the byte
10879*22dc650dSSadaf Ebrahimi       stream, it is filled with those that fit, and  the  remainder  are  ig-
10880*22dc650dSSadaf Ebrahimi       nored.  The yield of the function is the number of decoded patterns, or
10881*22dc650dSSadaf Ebrahimi       one of the following negative error codes:
10882*22dc650dSSadaf Ebrahimi
10883*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_BADDATA    second argument is zero or less
10884*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_BADMAGIC   mismatch of id bytes in the data
10885*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_BADMODE    mismatch of code unit size or PCRE2 version
10886*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_BADSERIALIZEDDATA  other sanity check failure
         PCRE2_ERROR_NOMEMORY   memory allocation failed
10888*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_NULL       first or third argument is NULL
10889*22dc650dSSadaf Ebrahimi
10890*22dc650dSSadaf Ebrahimi       PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it  was
10891*22dc650dSSadaf Ebrahimi       compiled on a system with different endianness.
10892*22dc650dSSadaf Ebrahimi
10893*22dc650dSSadaf Ebrahimi       Decoded patterns can be used for matching in the usual way, and must be
10894*22dc650dSSadaf Ebrahimi       freed  by  calling pcre2_code_free(). However, be aware that there is a
10895*22dc650dSSadaf Ebrahimi       potential race issue if you are using multiple patterns that  were  de-
10896*22dc650dSSadaf Ebrahimi       coded  from a single byte stream in a multithreaded application. A sin-
10897*22dc650dSSadaf Ebrahimi       gle copy of the character tables is used by all  the  decoded  patterns
10898*22dc650dSSadaf Ebrahimi       and a reference count is used to arrange for its memory to be automati-
10899*22dc650dSSadaf Ebrahimi       cally  freed when the last pattern is freed, but there is no locking on
10900*22dc650dSSadaf Ebrahimi       this reference count. Therefore, if you want to call  pcre2_code_free()
10901*22dc650dSSadaf Ebrahimi       for  these  patterns  in  different  threads, you must arrange your own
10902*22dc650dSSadaf Ebrahimi       locking, and ensure that pcre2_code_free()  cannot  be  called  by  two
10903*22dc650dSSadaf Ebrahimi       threads at the same time.
10904*22dc650dSSadaf Ebrahimi
10905*22dc650dSSadaf Ebrahimi       If  a pattern was processed by pcre2_jit_compile() before being serial-
10906*22dc650dSSadaf Ebrahimi       ized, the JIT data is discarded and so is no longer available  after  a
10907*22dc650dSSadaf Ebrahimi       save/restore  cycle.  You can, however, process a restored pattern with
10908*22dc650dSSadaf Ebrahimi       pcre2_jit_compile() if you wish.
10909*22dc650dSSadaf Ebrahimi
10910*22dc650dSSadaf Ebrahimi
10911*22dc650dSSadaf EbrahimiAUTHOR
10912*22dc650dSSadaf Ebrahimi
10913*22dc650dSSadaf Ebrahimi       Philip Hazel
10914*22dc650dSSadaf Ebrahimi       Retired from University Computing Service
10915*22dc650dSSadaf Ebrahimi       Cambridge, England.
10916*22dc650dSSadaf Ebrahimi
10917*22dc650dSSadaf Ebrahimi
10918*22dc650dSSadaf EbrahimiREVISION
10919*22dc650dSSadaf Ebrahimi
10920*22dc650dSSadaf Ebrahimi       Last updated: 27 June 2018
10921*22dc650dSSadaf Ebrahimi       Copyright (c) 1997-2018 University of Cambridge.
10922*22dc650dSSadaf Ebrahimi
10923*22dc650dSSadaf Ebrahimi
10924*22dc650dSSadaf EbrahimiPCRE2 10.32                      27 June 2018                PCRE2SERIALIZE(3)
10925*22dc650dSSadaf Ebrahimi------------------------------------------------------------------------------
10926*22dc650dSSadaf Ebrahimi
10927*22dc650dSSadaf Ebrahimi
10928*22dc650dSSadaf Ebrahimi
10929*22dc650dSSadaf EbrahimiPCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)
10930*22dc650dSSadaf Ebrahimi
10931*22dc650dSSadaf Ebrahimi
10932*22dc650dSSadaf EbrahimiNAME
10933*22dc650dSSadaf Ebrahimi       PCRE2 - Perl-compatible regular expressions (revised API)
10934*22dc650dSSadaf Ebrahimi
10935*22dc650dSSadaf Ebrahimi
10936*22dc650dSSadaf EbrahimiPCRE2 REGULAR EXPRESSION SYNTAX SUMMARY
10937*22dc650dSSadaf Ebrahimi
10938*22dc650dSSadaf Ebrahimi       The  full syntax and semantics of the regular expressions that are sup-
10939*22dc650dSSadaf Ebrahimi       ported by PCRE2 are described in the pcre2pattern  documentation.  This
10940*22dc650dSSadaf Ebrahimi       document contains a quick-reference summary of the syntax.
10941*22dc650dSSadaf Ebrahimi
10942*22dc650dSSadaf Ebrahimi
10943*22dc650dSSadaf EbrahimiQUOTING
10944*22dc650dSSadaf Ebrahimi
10945*22dc650dSSadaf Ebrahimi         \x         where x is non-alphanumeric is a literal x
10946*22dc650dSSadaf Ebrahimi         \Q...\E    treat enclosed characters as literal
10947*22dc650dSSadaf Ebrahimi
10948*22dc650dSSadaf Ebrahimi       Note that white space inside \Q...\E is always treated as literal, even
10949*22dc650dSSadaf Ebrahimi       if PCRE2_EXTENDED is set, causing most other white space to be ignored.
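
       The first rule above can be used to turn arbitrary text into a pattern
       that matches it literally. The sketch below is a hypothetical helper
       (escape_literal() is not part of PCRE2) that puts a backslash before
       every ASCII character that is not a letter or digit. It assumes an
       ASCII environment and leaves bytes above 127 unchanged.

```c
#include <stddef.h>

/* Writes to `out` a copy of `text` in which every ASCII character that is
   not a letter or digit is preceded by a backslash. Bytes above 127 are
   copied unchanged. Returns the length written (excluding the terminating
   zero), or -1 if `out_size` is too small. Assumes ASCII character codes. */
static int escape_literal(const char *text, char *out, size_t out_size)
{
    size_t o = 0;

    for (const unsigned char *p = (const unsigned char *)text; *p; p++) {
        int alnum = (*p >= '0' && *p <= '9') ||
                    (*p >= 'A' && *p <= 'Z') ||
                    (*p >= 'a' && *p <= 'z');
        size_t need = (alnum || *p > 127) ? 1 : 2;

        if (o + need + 1 > out_size)        /* keep room for the zero */
            return -1;
        if (need == 2)
            out[o++] = '\\';
        out[o++] = (char)*p;
    }

    if (out_size == 0)
        return -1;
    out[o] = '\0';
    return (int)o;
}
```

       For example, the text a.b*c becomes a\.b\*c, which matches only the
       five characters of the original text.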
10950*22dc650dSSadaf Ebrahimi
10951*22dc650dSSadaf Ebrahimi
10952*22dc650dSSadaf EbrahimiBRACED ITEMS
10953*22dc650dSSadaf Ebrahimi
10954*22dc650dSSadaf Ebrahimi       With  one  exception, wherever brace characters { and } are required to
10955*22dc650dSSadaf Ebrahimi       enclose data for constructions such as \g{2} or \k{name}, space  and/or
10956*22dc650dSSadaf Ebrahimi       horizontal  tab  characters  that follow { or precede } are allowed and
10957*22dc650dSSadaf Ebrahimi       are ignored. In the case of quantifiers, they may also appear before or
10958*22dc650dSSadaf Ebrahimi       after the comma. The exception is \u{...} which is not  Perl-compatible
10959*22dc650dSSadaf Ebrahimi       and is recognized only when PCRE2_EXTRA_ALT_BSUX is set. This is an EC-
10960*22dc650dSSadaf Ebrahimi       MAScript compatibility feature, and follows ECMAScript's behaviour.
10961*22dc650dSSadaf Ebrahimi
10962*22dc650dSSadaf Ebrahimi
10963*22dc650dSSadaf EbrahimiESCAPED CHARACTERS
10964*22dc650dSSadaf Ebrahimi
10965*22dc650dSSadaf Ebrahimi       This  table  applies to ASCII and Unicode environments. An unrecognized
10966*22dc650dSSadaf Ebrahimi       escape sequence causes an error.
10967*22dc650dSSadaf Ebrahimi
10968*22dc650dSSadaf Ebrahimi         \a         alarm, that is, the BEL character (hex 07)
10969*22dc650dSSadaf Ebrahimi         \cx        "control-x", where x is a non-control ASCII character
10970*22dc650dSSadaf Ebrahimi         \e         escape (hex 1B)
10971*22dc650dSSadaf Ebrahimi         \f         form feed (hex 0C)
10972*22dc650dSSadaf Ebrahimi         \n         newline (hex 0A)
10973*22dc650dSSadaf Ebrahimi         \r         carriage return (hex 0D)
10974*22dc650dSSadaf Ebrahimi         \t         tab (hex 09)
10975*22dc650dSSadaf Ebrahimi         \0dd       character with octal code 0dd
10976*22dc650dSSadaf Ebrahimi         \ddd       character with octal code ddd, or backreference
10977*22dc650dSSadaf Ebrahimi         \o{ddd..}  character with octal code ddd..
10978*22dc650dSSadaf Ebrahimi         \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
10979*22dc650dSSadaf Ebrahimi         \xhh       character with hex code hh
10980*22dc650dSSadaf Ebrahimi         \x{hh..}   character with hex code hh..
10981*22dc650dSSadaf Ebrahimi
10982*22dc650dSSadaf Ebrahimi       If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
10983*22dc650dSSadaf Ebrahimi       following are also recognized:
10984*22dc650dSSadaf Ebrahimi
10985*22dc650dSSadaf Ebrahimi         \U         the character "U"
10986*22dc650dSSadaf Ebrahimi         \uhhhh     character with hex code hhhh
10987*22dc650dSSadaf Ebrahimi         \u{hh..}   character with hex code hh.. but only for EXTRA_ALT_BSUX
10988*22dc650dSSadaf Ebrahimi
10989*22dc650dSSadaf Ebrahimi       When \x is not followed by {, from zero to two hexadecimal  digits  are
10990*22dc650dSSadaf Ebrahimi       read,  but in ALT_BSUX mode \x must be followed by two hexadecimal dig-
10991*22dc650dSSadaf Ebrahimi       its to be recognized as a hexadecimal escape; otherwise  it  matches  a
10992*22dc650dSSadaf Ebrahimi       literal  "x".   Likewise,  if  \u (in ALT_BSUX mode) is not followed by
10993*22dc650dSSadaf Ebrahimi       four hexadecimal digits or (in EXTRA_ALT_BSUX mode) a sequence  of  hex
10994*22dc650dSSadaf Ebrahimi       digits in curly brackets, it matches a literal "u".
10995*22dc650dSSadaf Ebrahimi
10996*22dc650dSSadaf Ebrahimi       Note that \0dd is always an octal code. The treatment of backslash fol-
10997*22dc650dSSadaf Ebrahimi       lowed  by  a non-zero digit is complicated; for details see the section
10998*22dc650dSSadaf Ebrahimi       "Non-printing characters" in the pcre2pattern documentation, where  de-
       tails  of  escape  processing  in  EBCDIC  environments are also given.
       \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in
       EBCDIC environments. Note that \N not  followed  by  an  opening  curly
       bracket has a different meaning (see below).


CHARACTER TYPES

         .          any character except newline;
                      in dotall mode, any character whatsoever
         \C         one code unit, even in UTF mode (best avoided)
         \d         a decimal digit
         \D         a character that is not a decimal digit
         \h         a horizontal white space character
         \H         a character that is not a horizontal white space character
         \N         a character that is not a newline
         \p{xx}     a character with the xx property
         \P{xx}     a character without the xx property
         \R         a newline sequence
         \s         a white space character
         \S         a character that is not a white space character
         \v         a vertical white space character
         \V         a character that is not a vertical white space character
         \w         a "word" character
         \W         a "non-word" character
         \X         a Unicode extended grapheme cluster

       \C  is dangerous because it may leave the current matching point in the
       middle of a UTF-8 or UTF-16 character. The application can lock out the
       use of \C by setting the PCRE2_NEVER_BACKSLASH_C  option.  It  is  also
       possible to build PCRE2 with the use of \C permanently disabled.
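
       The hazard with \C can be seen without any regular expression  engine.
       The  sketch  below  is an illustration in plain Python (the choice of
       language and the helper name is_valid_utf8 are  assumptions  of  this
       example,  not  part of PCRE2). It encodes one character as UTF-8 and
       shows that stopping after a single code unit leaves bytes that do not
       form a valid character.

```python
# Illustration only: plain Python, no regular expression engine involved.
text = "\u00e9"                 # one character, U+00E9 (e with acute accent)
units = text.encode("utf-8")    # two UTF-8 code units: b'\xc3\xa9'


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Consuming exactly one code unit splits the character in two.
consumed = units[:1]
remaining = units[1:]

print(len(units), is_valid_utf8(units))
print(is_valid_utf8(consumed), is_valid_utf8(remaining))
```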

       By  default,  \d, \s, and \w match only ASCII characters, even in UTF-8
       mode or in the 16-bit and 32-bit libraries. However, if locale-specific
       matching is happening, \s and \w may also match  characters  with  code
       points in the range 128-255. If the PCRE2_UCP option is set, the behav-
       iour of these escape sequences is changed to use Unicode properties and
       they  match  many  more  characters, but there are some option settings
       that can restrict individual sequences to matching only  ASCII  charac-
       ters.

       Property descriptions in \p and \P are matched caselessly; hyphens, un-
       derscores,  and  white  space are ignored, in accordance with Unicode's
       "loose matching" rules.
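
       The loose matching rule for property names can be sketched as a  small
       normalization step. The helper names loose_key and same_property are
       hypothetical and written in Python for illustration; this is not  how
       PCRE2 itself implements the comparison.

```python
def loose_key(name: str) -> str:
    """Fold case and drop hyphens, underscores, and white space."""
    kept = [ch for ch in name if ch not in "-_" and not ch.isspace()]
    return "".join(kept).lower()


def same_property(a: str, b: str) -> bool:
    """Two spellings name the same property if their loose keys agree."""
    return loose_key(a) == loose_key(b)


print(loose_key("Bidi_Class"))
print(same_property("Old-Italic", "old italic"))
```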


GENERAL CATEGORY PROPERTIES FOR \p AND \P

         C          Other
         Cc         Control
         Cf         Format
         Cn         Unassigned
         Co         Private use
         Cs         Surrogate

         L          Letter
         Ll         Lower case letter
         Lm         Modifier letter
         Lo         Other letter
         Lt         Title case letter
         Lu         Upper case letter
         Lc         Ll, Lu, or Lt
         L&         Ll, Lu, or Lt

         M          Mark
         Mc         Spacing mark
         Me         Enclosing mark
         Mn         Non-spacing mark

         N          Number
         Nd         Decimal number
         Nl         Letter number
         No         Other number

         P          Punctuation
         Pc         Connector punctuation
         Pd         Dash punctuation
         Pe         Close punctuation
         Pf         Final punctuation
         Pi         Initial punctuation
         Po         Other punctuation
         Ps         Open punctuation

         S          Symbol
         Sc         Currency symbol
         Sk         Modifier symbol
         Sm         Mathematical symbol
         So         Other symbol

         Z          Separator
         Zl         Line separator
         Zp         Paragraph separator
         Zs         Space separator
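
       The  relationship  between the one-letter and two-letter names can be
       explored with Python's  standard  unicodedata  module,  which  reports
       the  two-letter  general  category  of  a character. The helper name
       has_category is hypothetical, and the Unicode version bundled  with  a
       given  Python may differ from the one PCRE2 was built with, so treat
       this as a sketch rather than a reference.

```python
import unicodedata


def has_category(ch: str, prop: str) -> bool:
    """Test one character against a one- or two-letter category name."""
    cat = unicodedata.category(ch)        # always two letters, e.g. 'Lu'
    if prop in ("Lc", "L&"):              # cased letter: Ll, Lu, or Lt
        return cat in ("Ll", "Lu", "Lt")
    if len(prop) == 1:                    # 'L' covers every L? subcategory
        return cat[0] == prop
    return cat == prop


for ch in "Aa5$_ ":
    print(repr(ch), unicodedata.category(ch))
```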


PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p AND \P

         Xan        Alphanumeric: union of properties L and N
         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
         Xsp        Perl space: property Z or tab, NL, VT, FF, CR
         Xuc        Universally-named character: one that can be
                      represented by a Universal Character Name
         Xwd        Perl word: property Xan or underscore

       Perl and POSIX space are now the same. Perl added VT to its space char-
       acter set at release 5.18.
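
       The definitions of Xan, Xsp, and Xwd above translate directly into set
       tests. The following Python sketch uses the standard unicodedata  mod-
       ule;  the  function names are hypothetical, NL is taken to mean line-
       feed, and results depend on the Unicode data shipped with Python.

```python
import unicodedata


def is_xan(ch: str) -> bool:
    """Alphanumeric: general category L or N."""
    return unicodedata.category(ch)[0] in "LN"


def is_xsp(ch: str) -> bool:
    """Perl/POSIX space: category Z, or tab, NL, VT, FF, CR."""
    return unicodedata.category(ch)[0] == "Z" or ch in "\t\n\v\f\r"


def is_xwd(ch: str) -> bool:
    """Perl word: Xan or underscore."""
    return is_xan(ch) or ch == "_"


print(is_xan("_"), is_xwd("_"), is_xsp("\v"))
```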


BINARY PROPERTIES FOR \p AND \P

       Unicode defines a number of  binary  properties,  that  is,  properties
       whose  only  values  are  true or false. You can obtain a list of those
       that are recognized by \p and \P, along with  their  abbreviations,  by
       running this command:

         pcre2test -LP


SCRIPT MATCHING WITH \p AND \P

       Many  script  names  and their 4-letter abbreviations are recognized in
       \p{sc:...} or \p{scx:...} items, or on their own with \p (and  also  \P
       of course). You can obtain a list of these scripts by running this com-
       mand:

         pcre2test -LS


THE BIDI_CLASS PROPERTY FOR \p AND \P

         \p{Bidi_Class:<class>}   matches a character with the given class
         \p{BC:<class>}           matches a character with the given class

       The recognized classes are:

         AL          Arabic letter
         AN          Arabic number
         B           paragraph separator
         BN          boundary neutral
         CS          common separator
         EN          European number
         ES          European separator
         ET          European terminator
         FSI         first strong isolate
         L           left-to-right
         LRE         left-to-right embedding
         LRI         left-to-right isolate
         LRO         left-to-right override
         NSM         non-spacing mark
         ON          other neutral
         PDF         pop directional format
         PDI         pop directional isolate
         R           right-to-left
         RLE         right-to-left embedding
         RLI         right-to-left isolate
         RLO         right-to-left override
         S           segment separator
         WS          white space
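
       The class abbreviations in this table are the standard Unicode Bidi_Class
       values,  so  they  can  be looked up for individual characters with
       Python's unicodedata.bidirectional(). This is offered  only  as  a
       convenient way to see which class a character belongs to; it does not
       exercise PCRE2, and the sample characters are arbitrary choices.

```python
import unicodedata

samples = {
    "a": "Latin letter",
    "\u05d0": "Hebrew letter alef",
    "\u0627": "Arabic letter alef",
    "1": "ASCII digit",
    "\u0663": "Arabic-Indic digit three",
    " ": "space",
    ",": "comma",
    "\t": "tab",
    "\n": "linefeed",
}

# Map each sample character to its Bidi_Class abbreviation.
bidi = {ch: unicodedata.bidirectional(ch) for ch in samples}

for ch, description in samples.items():
    print(f"{description:26} {bidi[ch]}")
```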


CHARACTER CLASSES

         [...]       positive character class
         [^...]      negative character class
         [x-y]       range (can be used for hex characters)
         [[:xxx:]]   positive POSIX named set
         [[:^xxx:]]  negative POSIX named set

         alnum       alphanumeric
         alpha       alphabetic
         ascii       0-127
         blank       space or tab
         cntrl       control character
         digit       decimal digit
         graph       printing, excluding space
         lower       lower case letter
         print       printing, including space
         punct       printing, excluding alphanumeric
         space       white space
         upper       upper case letter
         word        same as \w
         xdigit      hexadecimal digit

       In  PCRE2, POSIX character set names recognize only ASCII characters by
       default, but some of them use Unicode properties if PCRE2_UCP  is  set.
       You can use \Q...\E inside a character class.


QUANTIFIERS

         ?           0 or 1, greedy
         ?+          0 or 1, possessive
         ??          0 or 1, lazy
         *           0 or more, greedy
         *+          0 or more, possessive
         *?          0 or more, lazy
         +           1 or more, greedy
         ++          1 or more, possessive
         +?          1 or more, lazy
         {n}         exactly n
         {n,m}       at least n, no more than m, greedy
         {n,m}+      at least n, no more than m, possessive
         {n,m}?      at least n, no more than m, lazy
         {n,}        n or more, greedy
         {n,}+       n or more, possessive
         {n,}?       n or more, lazy
         {,m}        zero up to m, greedy
         {,m}+       zero up to m, possessive
         {,m}?       zero up to m, lazy
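
       The difference between greedy and lazy quantifiers  is  easiest  to  see
       side by side. The sketch below uses Python's built-in re module purely
       as a readily available engine that accepts the same greedy  and  lazy
       syntax;  it  is  not  PCRE2,  and  behaviour may differ elsewhere. The
       possessive forms are left out because older versions of Python do not
       accept them.

```python
import re

subject = "<a><b>"

# Greedy: take as much as possible, then give back only if needed.
greedy = re.match(r"<.+>", subject).group()

# Lazy: take as little as possible, then extend only if needed.
lazy = re.match(r"<.+?>", subject).group()

# The same distinction applies to bounded repeats.
bounded_greedy = re.match(r"a{2,3}", "aaaa").group()
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group()

print(greedy, lazy, bounded_greedy, bounded_lazy)
```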


ANCHORS AND SIMPLE ASSERTIONS

         \b          word boundary
         \B          not a word boundary
         ^           start of subject
                       also after an internal newline in multiline mode
                       (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
         \A          start of subject
         $           end of subject
                       also before newline at end of subject
                       also before internal newline in multiline mode
         \Z          end of subject
                       also before newline at end of subject
         \z          end of subject
         \G          first matching position in subject
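
       The effect of multiline mode on ^ and $ can be  sketched  as  follows.
       Python's  re  module is used here only as a stand-in engine for these
       four items (^, $, \A, and the (?m) option), which it  handles  in  the
       way  the  table  describes  for this subject string. It is not PCRE2;
       in particular \Z has a different meaning in Python, so it is not used.

```python
import re

subject = "one\ntwo\n"

# Default mode: ^ matches only at the start of the subject.
default_mode = re.findall(r"^\w+$", subject)

# Multiline mode: ^ and $ also match around internal newlines.
multiline = re.findall(r"(?m)^\w+$", subject)

# $ matches before a newline that ends the subject, even by default.
before_final_newline = re.search(r"two$", subject) is not None

# \A matches only at the very start, whatever the mode.
start_only = re.findall(r"(?m)\A\w+", subject)

print(default_mode, multiline, before_final_newline, start_only)
```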


REPORTED MATCH POINT SETTING

         \K          set reported start of match

       From  release 10.38 \K is not permitted by default in lookaround asser-
       tions, for compatibility with Perl.  However,  if  the  PCRE2_EXTRA_AL-
       LOW_LOOKAROUND_BSK option is set, the previous behaviour is re-enabled.
       When this option is set, \K is honoured in positive assertions, but ig-
       nored in negative ones.


ALTERNATION

         expr|expr|expr...


CAPTURING

         (...)           capture group
         (?<name>...)    named capture group (Perl)
         (?'name'...)    named capture group (Perl)
         (?P<name>...)   named capture group (Python)
         (?:...)         non-capture group
         (?|...)         non-capture group; reset group numbers for
                          capture groups in each alternative

       In  non-UTF  modes, names may contain underscores and ASCII letters and
       digits; in UTF modes, any Unicode letters and  Unicode  decimal  digits
       are permitted. In both cases, a name must not start with a digit.
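
       How capture, named capture, and non-capture groups are numbered  can  be
       sketched  with  the  Python-style  (?P<name>...)  syntax listed above.
       Python's re module serves here only as a convenient engine that shares
       this  syntax;  it  is  not PCRE2 and does not accept the Perl forms or
       (?|...). The date pattern is an arbitrary example.

```python
import re

# Group 1 is named, the month is a non-capture group, the day is group 2.
pattern = re.compile(r"(?P<year>\d{4})-(?:0[1-9]|1[0-2])-(\d{2})")
m = pattern.fullmatch("2023-10-12")

print(pattern.groups)      # number of capture groups in the pattern
print(m.group("year"))     # named groups are also reachable by number
print(m.group(1), m.group(2))
```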


ATOMIC GROUPS

         (?>...)         atomic non-capture group
         (*atomic:...)   atomic non-capture group


COMMENT

         (?#....)        comment (not nestable)


OPTION SETTING

       Changes  of these options within a group are automatically cancelled at
       the end of the group.

         (?a)            all ASCII options
         (?aD)           restrict \d to ASCII in UCP mode
         (?aS)           restrict \s to ASCII in UCP mode
         (?aW)           restrict \w to ASCII in UCP mode
         (?aP)           restrict all POSIX classes to ASCII in UCP mode
         (?aT)           restrict POSIX digit classes to ASCII in UCP mode
         (?i)            caseless
         (?J)            allow duplicate named groups
         (?m)            multiline
         (?n)            no auto capture
         (?r)            restrict caseless to either ASCII or non-ASCII
         (?s)            single line (dotall)
         (?U)            default ungreedy (lazy)
         (?x)            ignore white space except in classes or \Q...\E
         (?xx)           as (?x) but also ignore space and tab in classes
         (?-...)         unset the given option(s)
         (?^)            unset imnrsx options

       (?aP) implies (?aT) as well, though this has no additional effect. How-
       ever, it means that (?-aP) is really (?-PT) which  disables  all  ASCII
       restrictions for POSIX classes.

       Unsetting  x or xx unsets both. Several options may be set at once, and
       a mixture of setting and unsetting such as (?i-x) is allowed, but there
       may be only one hyphen. Setting (but not unsetting) is allowed after
       (?^, for example (?^in). An option setting may appear at the start of a
       non-capture group, for example (?i:...).
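
       An  option set at the start of a non-capture group applies only inside
       that group. The sketch below shows this for (?i:...) and  also  shows
       (?x)  at  the  start of a pattern. Python's re module is used only as
       a stand-in engine that accepts these two  forms;  it  is  not  PCRE2,
       and it does not support most of the other options listed above.

```python
import re

# Caseless matching applies to "abc" only; "def" stays case-sensitive.
scoped = re.compile(r"(?i:abc)def")
inside_group = scoped.fullmatch("ABCdef") is not None
outside_group = scoped.fullmatch("ABCDEF") is not None

# With (?x), white space in the pattern is ignored outside classes.
extended = re.compile(r"(?x) \d+ \s* kg")
with_space = extended.fullmatch("12 kg") is not None
without_space = extended.fullmatch("12kg") is not None

print(inside_group, outside_group, with_space, without_space)
```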

       The following are recognized only at the very start of a pattern or af-
       ter one of the newline or \R options with similar syntax. More than one
       of them may appear. For the first three, d is a decimal number.

         (*LIMIT_DEPTH=d) set the backtracking limit to d
         (*LIMIT_HEAP=d)  set the heap size limit to d * 1024 bytes
         (*LIMIT_MATCH=d) set the match limit to d
         (*NOTEMPTY)      set PCRE2_NOTEMPTY when matching
         (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
         (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
         (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
         (*NO_JIT)       disable JIT optimization
         (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
         (*UTF)          set appropriate UTF mode for the library in use
         (*UCP)          set PCRE2_UCP (use Unicode properties for \d etc)

       Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce  the
       value   of   the   limits   set  by  the  caller  of  pcre2_match()  or
       pcre2_dfa_match(), not increase them. LIMIT_RECURSION  is  an  obsolete
       synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
       and  (*UCP)  by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
       respectively, at compile time.
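
       The rule that a limit given in the pattern can only reduce the caller's
       limit amounts to taking the smaller of the two values.  The  following
       Python  sketch  states  that arithmetic, together with the conversion
       used by LIMIT_HEAP. The function names are hypothetical and  this  is
       a model of the rule as described above, not PCRE2's own code.

```python
from typing import Optional


def effective_limit(caller_limit: int,
                    pattern_limit: Optional[int] = None) -> int:
    """A limit in the pattern may lower the caller's limit, never raise it."""
    if pattern_limit is None:
        return caller_limit
    return min(caller_limit, pattern_limit)


def heap_limit_bytes(d: int) -> int:
    """(*LIMIT_HEAP=d) expresses the heap limit as d * 1024 bytes."""
    return d * 1024


print(effective_limit(1000, 200))    # the pattern lowers the limit
print(effective_limit(1000, 5000))   # an attempt to raise it has no effect
print(heap_limit_bytes(20))
```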


NEWLINE CONVENTION

       These are recognized only at the very start of the pattern or after op-
       tion settings with a similar syntax.

         (*CR)           carriage return only
         (*LF)           linefeed only
         (*CRLF)         carriage return followed by linefeed
         (*ANYCRLF)      all three of the above
         (*ANY)          any Unicode newline sequence
         (*NUL)          the NUL character (binary zero)


WHAT \R MATCHES

       These are recognized only at the very start of the pattern or after op-
       tion settings with a similar syntax.

         (*BSR_ANYCRLF)  CR, LF, or CRLF
         (*BSR_UNICODE)  any Unicode newline sequence


LOOKAHEAD AND LOOKBEHIND ASSERTIONS

         (?=...)                     )
         (*pla:...)                  ) positive lookahead
         (*positive_lookahead:...)   )

         (?!...)                     )
         (*nla:...)                  ) negative lookahead
         (*negative_lookahead:...)   )

         (?<=...)                    )
         (*plb:...)                  ) positive lookbehind
         (*positive_lookbehind:...)  )

         (?<!...)                    )
         (*nlb:...)                  ) negative lookbehind
         (*negative_lookbehind:...)  )

       Each top-level branch of a lookbehind must have a limit for the  number
       of  characters it matches. If any branch can match a variable number of
       characters, the maximum for each branch is limited to a  value  set  by
       the  caller  of  pcre2_compile()  or defaulted. The default is set when
       PCRE2 is built (ultimate default 255). If every branch matches a  fixed
       number of characters, the limit for each branch is 65535 characters.
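
       The four basic assertions can be sketched  with  fixed-length  examples.
       Python's  re  module is used only as a stand-in engine for the (?=...),
       (?!...), (?<=...), and (?<!...) forms; it is not PCRE2, it  does  not
       accept the (*pla:...) style names, and its lookbehind rules differ from
       the limits described above. The subject strings are arbitrary.

```python
import re

# Positive lookbehind: digits that follow a dollar sign.
amounts = re.findall(r"(?<=\$)\d+", "cost $42 or 17")

# Negative lookbehind: whole numbers that do not follow a dollar sign.
no_dollar = re.findall(r"(?<!\$)\b\d+", "cost $42 or 17")

# Positive lookahead: digits that are followed by "px".
before_px = re.findall(r"\d+(?=px)", "10px 20em 30px")

# Negative lookahead: "foo" that is not followed by "bar".
not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

print(amounts, no_dollar, before_px, not_bar)
```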


NON-ATOMIC LOOKAROUND ASSERTIONS

       These assertions are specific to PCRE2 and are not Perl-compatible.

         (?*...)                                )
         (*napla:...)                           ) synonyms
         (*non_atomic_positive_lookahead:...)   )

         (?<*...)                               )
         (*naplb:...)                           ) synonyms
         (*non_atomic_positive_lookbehind:...)  )


SCRIPT RUNS

         (*script_run:...)           ) script run, can be backtracked into
         (*sr:...)                   )

         (*atomic_script_run:...)    ) atomic script run
         (*asr:...)                  )


BACKREFERENCES

         \n              reference by number (can be ambiguous)
         \gn             reference by number
         \g{n}           reference by number
         \g+n            relative reference by number (PCRE2 extension)
         \g-n            relative reference by number
         \g{+n}          relative reference by number (PCRE2 extension)
         \g{-n}          relative reference by number
         \k<name>        reference by name (Perl)
         \k'name'        reference by name (Perl)
         \g{name}        reference by name (Perl)
         \k{name}        reference by name (.NET)
         (?P=name)       reference by name (Python)
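
       A backreference matches the same text that the  referenced  group  cap-
       tured.  The  sketch  below  shows  the  \n  and (?P=name) forms using
       Python's re module, chosen only because it accepts  those  two  forms;
       it  is not PCRE2 and does not accept the \g or \k forms in patterns.
       The subject strings are arbitrary examples.

```python
import re

# \1 must repeat exactly what group 1 captured: find a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")

# (?P=q) must repeat the opening quote character captured by group q.
quoted = re.compile(r'''(?P<q>["']).*?(?P=q)''')
same_quotes = quoted.fullmatch('"hi"') is not None
mixed_quotes = quoted.fullmatch("'hi\"") is not None

print(doubled.group(1), same_quotes, mixed_quotes)
```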


SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

         (?R)            recurse whole pattern
         (?n)            call subroutine by absolute number
         (?+n)           call subroutine by relative number
         (?-n)           call subroutine by relative number
         (?&name)        call subroutine by name (Perl)
         (?P>name)       call subroutine by name (Python)
         \g<name>        call subroutine by name (Oniguruma)
         \g'name'        call subroutine by name (Oniguruma)
         \g<n>           call subroutine by absolute number (Oniguruma)
         \g'n'           call subroutine by absolute number (Oniguruma)
         \g<+n>          call subroutine by relative number (PCRE2 extension)
         \g'+n'          call subroutine by relative number (PCRE2 extension)
         \g<-n>          call subroutine by relative number (PCRE2 extension)
         \g'-n'          call subroutine by relative number (PCRE2 extension)


CONDITIONAL PATTERNS

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

         (?(n)               absolute reference condition
         (?(+n)              relative reference condition (PCRE2 extension)
         (?(-n)              relative reference condition (PCRE2 extension)
         (?(<name>)          named reference condition (Perl)
         (?('name')          named reference condition (Perl)
         (?(name)            named reference condition (PCRE2, deprecated)
         (?(R)               overall recursion condition
         (?(Rn)              specific numbered group recursion condition
         (?(R&name)          specific named group recursion condition
         (?(DEFINE)          define groups for reference
         (?(VERSION[>]=n.m)  test PCRE2 version
         (?(assert)          assertion condition

       Note  the  ambiguity of (?(R) and (?(Rn) which might be named reference
       conditions or recursion tests. Such a condition  is  interpreted  as  a
       reference condition if the relevant named group exists.
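
       An absolute reference condition, (?(n)...), tests whether group  n  has
       matched. A common use is to require a closing parenthesis only when an
       opening one was seen. Python's re module is  used  below  only  as  a
       stand-in  engine  for  this  one  form; it is not PCRE2 and does not
       support the recursion, DEFINE, VERSION, or assertion conditions.

```python
import re

# Group 1 optionally matches "(". The condition (?(1)\)) then requires
# ")" only if group 1 took part in the match.
balanced = re.compile(r"^(\()?\d+(?(1)\))$")

results = {s: balanced.match(s) is not None
           for s in ["(42)", "42", "(42", "42)"]}

print(results)
```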


BACKTRACKING CONTROL

       All  backtracking  control  verbs  may be in the form (*VERB:NAME). For
       (*MARK) the name is mandatory, for the others it is  optional.  (*SKIP)
       changes  its  behaviour if :NAME is present. The others just set a name
       for passing back to the caller, but this is not a name that (*SKIP) can
       see. The following act as soon as they are reached:

         (*ACCEPT)       force successful match
         (*FAIL)         force backtrack; synonym (*F)
         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)

       The following act only when a subsequent match failure causes  a  back-
       track to reach them. They all force a match failure, but they differ in
       what happens afterwards. Those that advance the start-of-match point do
       so only if the pattern is not anchored.

         (*COMMIT)       overall failure, no advance of starting point
         (*PRUNE)        advance to next starting character
         (*SKIP)         advance to current matching position
         (*SKIP:NAME)    advance to position corresponding to an earlier
                         (*MARK:NAME); if not found, the (*SKIP) is ignored
         (*THEN)         local failure, backtrack to next alternation

       The  effect  of one of these verbs in a group called as a subroutine is
       confined to the subroutine call.


CALLOUTS

         (?C)            callout (assumed number 0)
         (?Cn)           callout with numerical data n
         (?C"text")      callout with string data

       The allowed string delimiters are ` ' " ^ % # $ (which are the same for
       the start and the end), and the starting delimiter { matched  with  the
       ending  delimiter  }. To encode the ending delimiter within the string,
       double it.
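
       The delimiter rules for string callouts can be sketched as a small  for-
       matting  helper that builds the pattern text for a given string. The
       name callout_item is hypothetical and the code is  plain  Python  that
       only  assembles  text  according to the rules stated above; it does
       not compile or run the resulting pattern.

```python
ALLOWED_START_DELIMITERS = "`'\"^%#${"


def callout_item(text: str, delimiter: str = '"') -> str:
    """Build (?C<delim>text<delim>), doubling the ending delimiter in text."""
    if len(delimiter) != 1 or delimiter not in ALLOWED_START_DELIMITERS:
        raise ValueError("delimiter must be one of ` ' \" ^ % # $ {")
    # "{" is closed by "}"; every other delimiter closes with itself.
    closing = "}" if delimiter == "{" else delimiter
    escaped = text.replace(closing, closing * 2)
    return "(?C" + delimiter + escaped + closing + ")"


print(callout_item('say "hi"'))
print(callout_item("a}b", "{"))
```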


SEE ALSO

       pcre2pattern(3),   pcre2api(3),   pcre2callout(3),    pcre2matching(3),
       pcre2(3).


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 12 October 2023
       Copyright (c) 1997-2023 University of Cambridge.


PCRE2 10.43                     12 October 2023                 PCRE2SYNTAX(3)
------------------------------------------------------------------------------
11513*22dc650dSSadaf Ebrahimi
11514*22dc650dSSadaf Ebrahimi
11515*22dc650dSSadaf Ebrahimi
11516*22dc650dSSadaf EbrahimiPCRE2UNICODE(3)            Library Functions Manual            PCRE2UNICODE(3)
11517*22dc650dSSadaf Ebrahimi
11518*22dc650dSSadaf Ebrahimi
11519*22dc650dSSadaf EbrahimiNAME
       PCRE2 - Perl-compatible regular expressions (revised API)
11521*22dc650dSSadaf Ebrahimi
11522*22dc650dSSadaf Ebrahimi
11523*22dc650dSSadaf EbrahimiUNICODE AND UTF SUPPORT
11524*22dc650dSSadaf Ebrahimi
11525*22dc650dSSadaf Ebrahimi       PCRE2 is normally built with Unicode support, though if you do not need
11526*22dc650dSSadaf Ebrahimi       it,  you  can  build  it  without,  in  which  case the library will be
11527*22dc650dSSadaf Ebrahimi       smaller. With Unicode support, PCRE2 has knowledge of Unicode character
11528*22dc650dSSadaf Ebrahimi       properties and can process strings of text in UTF-8, UTF-16, and UTF-32
11529*22dc650dSSadaf Ebrahimi       format (depending on the code unit width), but this is not the default.
11530*22dc650dSSadaf Ebrahimi       Unless specifically requested, PCRE2 treats each code unit in a  string
11531*22dc650dSSadaf Ebrahimi       as one character.
11532*22dc650dSSadaf Ebrahimi
11533*22dc650dSSadaf Ebrahimi       There  are two ways of telling PCRE2 to switch to UTF mode, where char-
11534*22dc650dSSadaf Ebrahimi       acters may consist of more than one code unit and the range  of  values
11535*22dc650dSSadaf Ebrahimi       is constrained. The program can call pcre2_compile() with the PCRE2_UTF
11536*22dc650dSSadaf Ebrahimi       option,  or  the  pattern may start with the sequence (*UTF).  However,
11537*22dc650dSSadaf Ebrahimi       the latter facility can be locked out by  the  PCRE2_NEVER_UTF  option.
11538*22dc650dSSadaf Ebrahimi       That  is,  the  programmer can prevent the supplier of the pattern from
11539*22dc650dSSadaf Ebrahimi       switching to UTF mode.
11540*22dc650dSSadaf Ebrahimi
11541*22dc650dSSadaf Ebrahimi       Note  that  the  PCRE2_MATCH_INVALID_UTF  option  (see  below)   forces
11542*22dc650dSSadaf Ebrahimi       PCRE2_UTF to be set.
11543*22dc650dSSadaf Ebrahimi
11544*22dc650dSSadaf Ebrahimi       In  UTF mode, both the pattern and any subject strings that are matched
11545*22dc650dSSadaf Ebrahimi       against it are treated as UTF strings instead of strings of  individual
11546*22dc650dSSadaf Ebrahimi       one-code-unit  characters. There are also some other changes to the way
11547*22dc650dSSadaf Ebrahimi       characters are handled, as documented below.
11548*22dc650dSSadaf Ebrahimi
11549*22dc650dSSadaf Ebrahimi
11550*22dc650dSSadaf EbrahimiUNICODE PROPERTY SUPPORT
11551*22dc650dSSadaf Ebrahimi
11552*22dc650dSSadaf Ebrahimi       When PCRE2 is built with Unicode support, the escape sequences  \p{..},
11553*22dc650dSSadaf Ebrahimi       \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set-
11554*22dc650dSSadaf Ebrahimi       ting.   The Unicode properties that can be tested are a subset of those
11555*22dc650dSSadaf Ebrahimi       that Perl supports. Currently they are limited to the general  category
11556*22dc650dSSadaf Ebrahimi       properties such as Lu for an upper case letter or Nd for a decimal num-
11557*22dc650dSSadaf Ebrahimi       ber, the derived properties Any and LC (synonym L&), the Unicode script
11558*22dc650dSSadaf Ebrahimi       names such as Arabic or Han, Bidi_Class, Bidi_Control, and a few binary
11559*22dc650dSSadaf Ebrahimi       properties.
11560*22dc650dSSadaf Ebrahimi
11561*22dc650dSSadaf Ebrahimi       The full lists are given in the pcre2pattern and pcre2syntax documenta-
11562*22dc650dSSadaf Ebrahimi       tion.  In  general,  only the short names for properties are supported.
11563*22dc650dSSadaf Ebrahimi       For example, \p{L} matches a letter. Its longer synonym, \p{Letter}, is
11564*22dc650dSSadaf Ebrahimi       not supported. Furthermore, in Perl, many properties may optionally  be
11565*22dc650dSSadaf Ebrahimi       prefixed  by "Is", for compatibility with Perl 5.6. PCRE2 does not sup-
11566*22dc650dSSadaf Ebrahimi       port this.
11567*22dc650dSSadaf Ebrahimi
11568*22dc650dSSadaf Ebrahimi
11569*22dc650dSSadaf EbrahimiWIDE CHARACTERS AND UTF MODES
11570*22dc650dSSadaf Ebrahimi
11571*22dc650dSSadaf Ebrahimi       Code points less than 256 can be specified in patterns by either braced
11572*22dc650dSSadaf Ebrahimi       or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
11573*22dc650dSSadaf Ebrahimi       Larger values have to use braced sequences. Unbraced octal code  points
11574*22dc650dSSadaf Ebrahimi       up to \777 are also recognized; larger ones can be coded using \o{...}.
11575*22dc650dSSadaf Ebrahimi
11576*22dc650dSSadaf Ebrahimi       The  escape sequence \N{U+<hex digits>} is recognized as another way of
11577*22dc650dSSadaf Ebrahimi       specifying a Unicode character by code point in a UTF mode. It  is  not
11578*22dc650dSSadaf Ebrahimi       allowed in non-UTF mode.
11579*22dc650dSSadaf Ebrahimi
11580*22dc650dSSadaf Ebrahimi       In  UTF  mode, repeat quantifiers apply to complete UTF characters, not
11581*22dc650dSSadaf Ebrahimi       to individual code units.
11582*22dc650dSSadaf Ebrahimi
11583*22dc650dSSadaf Ebrahimi       In UTF mode, the dot metacharacter matches one UTF character instead of
11584*22dc650dSSadaf Ebrahimi       a single code unit.
11585*22dc650dSSadaf Ebrahimi
11586*22dc650dSSadaf Ebrahimi       In UTF mode, capture group names are not restricted to ASCII,  and  may
11587*22dc650dSSadaf Ebrahimi       contain any Unicode letters and decimal digits, as well as underscore.
11588*22dc650dSSadaf Ebrahimi
11589*22dc650dSSadaf Ebrahimi       The  escape  sequence \C can be used to match a single code unit in UTF
11590*22dc650dSSadaf Ebrahimi       mode, but its use can lead to some strange effects because it breaks up
11591*22dc650dSSadaf Ebrahimi       multi-unit characters (see the description of \C  in  the  pcre2pattern
11592*22dc650dSSadaf Ebrahimi       documentation). For this reason, there is a build-time option that dis-
11593*22dc650dSSadaf Ebrahimi       ables  support  for  \C completely. There is also a less draconian com-
11594*22dc650dSSadaf Ebrahimi       pile-time option for locking out the use of \C when a pattern  is  com-
11595*22dc650dSSadaf Ebrahimi       piled.
11596*22dc650dSSadaf Ebrahimi
11597*22dc650dSSadaf Ebrahimi       The  use  of  \C  is not supported by the alternative matching function
11598*22dc650dSSadaf Ebrahimi       pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
11599*22dc650dSSadaf Ebrahimi       ter may consist of more than one code unit. The  use  of  \C  in  these
11600*22dc650dSSadaf Ebrahimi       modes  provokes a match-time error. Also, the JIT optimization does not
11601*22dc650dSSadaf Ebrahimi       support \C in these modes. If JIT optimization is requested for a UTF-8
11602*22dc650dSSadaf Ebrahimi       or UTF-16 pattern that contains \C, it will not succeed,  and  so  when
11603*22dc650dSSadaf Ebrahimi       pcre2_match() is called, the matching will be carried out by the inter-
11604*22dc650dSSadaf Ebrahimi       pretive function.
11605*22dc650dSSadaf Ebrahimi
11606*22dc650dSSadaf Ebrahimi       The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
11607*22dc650dSSadaf Ebrahimi       characters  of  any  code  value,  but, by default, the characters that
11608*22dc650dSSadaf Ebrahimi       PCRE2 recognizes as digits, spaces, or word characters remain the  same
11609*22dc650dSSadaf Ebrahimi       set  as  in  non-UTF mode, all with code points less than 256. This re-
11610*22dc650dSSadaf Ebrahimi       mains true even when PCRE2 is built to include Unicode support, because
11611*22dc650dSSadaf Ebrahimi       to do otherwise would slow down matching in  many  common  cases.  Note
11612*22dc650dSSadaf Ebrahimi       that  this also applies to \b and \B, because they are defined in terms
11613*22dc650dSSadaf Ebrahimi       of \w and \W. If you want to test for a wider sense of,  say,  "digit",
11614*22dc650dSSadaf Ebrahimi       you  can  use  explicit Unicode property tests such as \p{Nd}. Alterna-
11615*22dc650dSSadaf Ebrahimi       tively, if you set the PCRE2_UCP option, the way that the character es-
11616*22dc650dSSadaf Ebrahimi       capes work is changed so that Unicode properties are used to  determine
11617*22dc650dSSadaf Ebrahimi       which  characters  match,  though  there are some options that suppress
11618*22dc650dSSadaf Ebrahimi       this for individual escapes. For details see  the  section  on  generic
11619*22dc650dSSadaf Ebrahimi       character types in the pcre2pattern documentation.
11620*22dc650dSSadaf Ebrahimi
11621*22dc650dSSadaf Ebrahimi       Like  the  escapes,  characters  that  match  the POSIX named character
11622*22dc650dSSadaf Ebrahimi       classes are all low-valued characters unless the  PCRE2_UCP  option  is
11623*22dc650dSSadaf Ebrahimi       set, but there is an option to override this.
11624*22dc650dSSadaf Ebrahimi
11625*22dc650dSSadaf Ebrahimi       In contrast to the character escapes and character classes, the special
11626*22dc650dSSadaf Ebrahimi       horizontal  and  vertical  white  space escapes (\h, \H, \v, and \V) do
11627*22dc650dSSadaf Ebrahimi       match all the appropriate Unicode characters, whether or not  PCRE2_UCP
11628*22dc650dSSadaf Ebrahimi       is set.
11629*22dc650dSSadaf Ebrahimi
11630*22dc650dSSadaf Ebrahimi
11631*22dc650dSSadaf EbrahimiUNICODE CASE-EQUIVALENCE
11632*22dc650dSSadaf Ebrahimi
11633*22dc650dSSadaf Ebrahimi       If  either  PCRE2_UTF  or PCRE2_UCP is set, upper/lower case processing
11634*22dc650dSSadaf Ebrahimi       makes use of Unicode properties except for characters whose code points
11635*22dc650dSSadaf Ebrahimi       are less than 128 and that have at most two case-equivalent values. For
11636*22dc650dSSadaf Ebrahimi       these, a direct table lookup is used for speed. A few  Unicode  charac-
11637*22dc650dSSadaf Ebrahimi       ters  such as Greek sigma have more than two code points that are case-
11638*22dc650dSSadaf Ebrahimi       equivalent, and these are treated specially. Setting PCRE2_UCP  without
11639*22dc650dSSadaf Ebrahimi       PCRE2_UTF  allows  Unicode-style  case processing for non-UTF character
11640*22dc650dSSadaf Ebrahimi       encodings such as UCS-2.
11641*22dc650dSSadaf Ebrahimi
11642*22dc650dSSadaf Ebrahimi       There are two ASCII characters (S and K) that,  in  addition  to  their
11643*22dc650dSSadaf Ebrahimi       ASCII  lower case equivalents, have a non-ASCII one as well (long S and
11644*22dc650dSSadaf Ebrahimi       Kelvin sign).  Recognition of these non-ASCII characters as case-equiv-
11645*22dc650dSSadaf Ebrahimi       alent to their ASCII  counterparts  can  be  disabled  by  setting  the
11646*22dc650dSSadaf Ebrahimi       PCRE2_EXTRA_CASELESS_RESTRICT  option. When this is set, all characters
11647*22dc650dSSadaf Ebrahimi       in a case equivalence must either be ASCII or non-ASCII; there  can  be
11648*22dc650dSSadaf Ebrahimi       no mixing.
11649*22dc650dSSadaf Ebrahimi
11650*22dc650dSSadaf Ebrahimi
11651*22dc650dSSadaf EbrahimiSCRIPT RUNS
11652*22dc650dSSadaf Ebrahimi
11653*22dc650dSSadaf Ebrahimi       The  pattern constructs (*script_run:...) and (*atomic_script_run:...),
11654*22dc650dSSadaf Ebrahimi       with synonyms (*sr:...) and (*asr:...), verify that the string  matched
11655*22dc650dSSadaf Ebrahimi       within  the  parentheses is a script run. In concept, a script run is a
11656*22dc650dSSadaf Ebrahimi       sequence of characters that are all from the same Unicode script.  How-
11657*22dc650dSSadaf Ebrahimi       ever, because some scripts are commonly used together, and because some
11658*22dc650dSSadaf Ebrahimi       diacritical  and  other marks are used with multiple scripts, it is not
11659*22dc650dSSadaf Ebrahimi       that simple.
11660*22dc650dSSadaf Ebrahimi
11661*22dc650dSSadaf Ebrahimi       Every Unicode character has a Script property, mostly with a value cor-
11662*22dc650dSSadaf Ebrahimi       responding to the name of a script, such as Latin, Greek, or  Cyrillic.
11663*22dc650dSSadaf Ebrahimi       There are also three special values:
11664*22dc650dSSadaf Ebrahimi
11665*22dc650dSSadaf Ebrahimi       "Unknown" is used for code points that have not been assigned, and also
11666*22dc650dSSadaf Ebrahimi       for  the surrogate code points. In the PCRE2 32-bit library, characters
11667*22dc650dSSadaf Ebrahimi       whose code points are greater  than  the  Unicode  maximum  (U+10FFFF),
11668*22dc650dSSadaf Ebrahimi       which  are  accessible  only  in non-UTF mode, are assigned the Unknown
11669*22dc650dSSadaf Ebrahimi       script.
11670*22dc650dSSadaf Ebrahimi
11671*22dc650dSSadaf Ebrahimi       "Common" is used for characters that are used with many scripts.  These
11672*22dc650dSSadaf Ebrahimi       include  punctuation,  emoji,  mathematical, musical, and currency sym-
11673*22dc650dSSadaf Ebrahimi       bols, and the ASCII digits 0 to 9.
11674*22dc650dSSadaf Ebrahimi
11675*22dc650dSSadaf Ebrahimi       "Inherited" is used for characters such as diacritical marks that  mod-
11676*22dc650dSSadaf Ebrahimi       ify a previous character. These are considered to take on the script of
11677*22dc650dSSadaf Ebrahimi       the character that they modify.
11678*22dc650dSSadaf Ebrahimi
11679*22dc650dSSadaf Ebrahimi       Some  Inherited characters are used with many scripts, but many of them
11680*22dc650dSSadaf Ebrahimi       are only normally used with a small number  of  scripts.  For  example,
11681*22dc650dSSadaf Ebrahimi       U+102E0 (Coptic Epact thousands mark) is used only with Arabic and Cop-
11682*22dc650dSSadaf Ebrahimi       tic.  In  order  to  make it possible to check this, a Unicode property
11683*22dc650dSSadaf Ebrahimi       called Script Extension exists. Its value is a list of scripts that ap-
11684*22dc650dSSadaf Ebrahimi       ply to the character. For the majority of characters, the list contains
11685*22dc650dSSadaf Ebrahimi       just one script, the same one as  the  Script  property.  However,  for
11686*22dc650dSSadaf Ebrahimi       characters  such  as  U+102E0 more than one Script is listed. There are
11687*22dc650dSSadaf Ebrahimi       also some Common characters that have a single,  non-Common  script  in
11688*22dc650dSSadaf Ebrahimi       their Script Extension list.
11689*22dc650dSSadaf Ebrahimi
11690*22dc650dSSadaf Ebrahimi       The next section describes the basic rules for deciding whether a given
11691*22dc650dSSadaf Ebrahimi       string  of  characters  is  a script run. Note, however, that there are
11692*22dc650dSSadaf Ebrahimi       some special cases involving the Chinese Han script, and an  additional
11693*22dc650dSSadaf Ebrahimi       constraint  for  decimal  digits.  These are covered in subsequent sec-
11694*22dc650dSSadaf Ebrahimi       tions.
11695*22dc650dSSadaf Ebrahimi
11696*22dc650dSSadaf Ebrahimi   Basic script run rules
11697*22dc650dSSadaf Ebrahimi
11698*22dc650dSSadaf Ebrahimi       A string that is less than two characters long is a script run. This is
11699*22dc650dSSadaf Ebrahimi       the only case in which an Unknown character can be  part  of  a  script
11700*22dc650dSSadaf Ebrahimi       run.  Longer strings are checked using only the Script Extensions prop-
11701*22dc650dSSadaf Ebrahimi       erty, not the basic Script property.
11702*22dc650dSSadaf Ebrahimi
11703*22dc650dSSadaf Ebrahimi       If a character's Script Extension property is the single value  "Inher-
11704*22dc650dSSadaf Ebrahimi       ited", it is always accepted as part of a script run. This is also true
11705*22dc650dSSadaf Ebrahimi       for  the  property  "Common", subject to the checking of decimal digits
11706*22dc650dSSadaf Ebrahimi       described below. All the remaining characters in a script run must have
11707*22dc650dSSadaf Ebrahimi       at least one script in common in their Script Extension lists. In  set-
11708*22dc650dSSadaf Ebrahimi       theoretic terminology, the intersection of all the sets of scripts must
11709*22dc650dSSadaf Ebrahimi       not be empty.
11710*22dc650dSSadaf Ebrahimi
11711*22dc650dSSadaf Ebrahimi       A  simple example is an Internet name such as "google.com". The letters
11712*22dc650dSSadaf Ebrahimi       are all in the Latin script, and the dot is Common, so this string is a
11713*22dc650dSSadaf Ebrahimi       script run.  However, the Cyrillic letter "o" looks exactly the same as
11714*22dc650dSSadaf Ebrahimi       the Latin "o"; a string that looks the same, but with Cyrillic "o"s  is
11715*22dc650dSSadaf Ebrahimi       not a script run.
11716*22dc650dSSadaf Ebrahimi
11717*22dc650dSSadaf Ebrahimi       More  interesting examples involve characters with more than one script
11718*22dc650dSSadaf Ebrahimi       in their Script Extension. Consider the following characters:
11719*22dc650dSSadaf Ebrahimi
11720*22dc650dSSadaf Ebrahimi         U+060C  Arabic comma
11721*22dc650dSSadaf Ebrahimi         U+06D4  Arabic full stop
11722*22dc650dSSadaf Ebrahimi
11723*22dc650dSSadaf Ebrahimi       The first has the Script Extension list Arabic, Hanifi  Rohingya,  Syr-
11724*22dc650dSSadaf Ebrahimi       iac,  and  Thaana; the second has just Arabic and Hanifi Rohingya. Both
11725*22dc650dSSadaf Ebrahimi       of them could appear in script runs of  either  Arabic  or  Hanifi  Ro-
11726*22dc650dSSadaf Ebrahimi       hingya.  The  first  could also appear in Syriac or Thaana script runs,
11727*22dc650dSSadaf Ebrahimi       but the second could not.
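
       The intersection rule can be sketched in C with one bit per script.
       This is an illustration only, not the way PCRE2 stores its Unicode
       data: the SCRIPT_* flags and script_sets_intersect() are hypothetical
       names, and the sketch ignores the special handling of Common,
       Inherited, Han, and decimal digits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-bit-per-script flags, for illustration only. */
enum {
  SCRIPT_ARABIC          = 1u << 0,
  SCRIPT_HANIFI_ROHINGYA = 1u << 1,
  SCRIPT_SYRIAC          = 1u << 2,
  SCRIPT_THAANA          = 1u << 3
};

/* Returns 1 if the Script Extension sets of all n characters share at
   least one script, that is, if their intersection is not empty. */
int script_sets_intersect(const unsigned *sets, size_t n)
{
  unsigned common = ~0u;                 /* start with "every script" */
  assert(sets != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    common &= sets[i];                   /* keep only shared scripts */
  return common != 0;
}
```

       With the two lists quoted above, U+060C and U+06D4 intersect (Arabic
       and Hanifi Rohingya remain), whereas U+06D4 and a character whose
       only script is Syriac do not.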
11728*22dc650dSSadaf Ebrahimi
11729*22dc650dSSadaf Ebrahimi   The Chinese Han script
11730*22dc650dSSadaf Ebrahimi
11731*22dc650dSSadaf Ebrahimi       The Chinese Han script is  commonly  used  in  conjunction  with  other
11732*22dc650dSSadaf Ebrahimi       scripts  for  writing certain languages. Japanese uses the Hiragana and
11733*22dc650dSSadaf Ebrahimi       Katakana scripts together with Han; Korean uses Hangul  and  Han;  Tai-
11734*22dc650dSSadaf Ebrahimi       wanese  Mandarin  uses  Bopomofo  and Han. These three combinations are
11735*22dc650dSSadaf Ebrahimi       treated as special cases when checking script runs and are, in  effect,
11736*22dc650dSSadaf Ebrahimi       "virtual  scripts".  Thus,  a script run may contain a mixture of Hira-
11737*22dc650dSSadaf Ebrahimi       gana, Katakana, and Han, or a mixture of Hangul and Han, or  a  mixture
11738*22dc650dSSadaf Ebrahimi       of  Bopomofo  and  Han,  but  not, for example, a mixture of Hangul and
       Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical
       Standard 39 ("Unicode Security Mechanisms",
       http://unicode.org/reports/tr39/) in allowing such mixtures.
11742*22dc650dSSadaf Ebrahimi
11743*22dc650dSSadaf Ebrahimi   Decimal digits
11744*22dc650dSSadaf Ebrahimi
11745*22dc650dSSadaf Ebrahimi       Unicode contains many sets of 10 decimal digits in  different  scripts,
11746*22dc650dSSadaf Ebrahimi       and  some  scripts  (including the Common script) contain more than one
       set. Some of these decimal digits are visually indistinguishable
11748*22dc650dSSadaf Ebrahimi       from  the  common  ASCII digits. In addition to the script checking de-
11749*22dc650dSSadaf Ebrahimi       scribed above, if a script run contains any decimal digits,  they  must
11750*22dc650dSSadaf Ebrahimi       all come from the same set of 10 adjacent characters.
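
       The digit constraint can be sketched as follows. The table holds only
       four of the many digit sets in Unicode and is an assumption made for
       brevity; digit_set_of() and digits_from_one_set() are hypothetical
       helper names, not PCRE2 functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First code point of a few sets of 10 adjacent decimal digits.
   A deliberately tiny subset of the Unicode data. */
static const uint32_t digit_set_base[] = {
  0x0030,  /* ASCII digits */
  0x0660,  /* Arabic-Indic digits */
  0x0966,  /* Devanagari digits */
  0xFF10   /* Fullwidth digits */
};

/* Returns the base of the digit set containing cp, or 0 if cp is not a
   digit in the table above. */
uint32_t digit_set_of(uint32_t cp)
{
  size_t count = sizeof digit_set_base / sizeof digit_set_base[0];
  for (size_t i = 0; i < count; i++)
    if (cp >= digit_set_base[i] && cp < digit_set_base[i] + 10)
      return digit_set_base[i];
  return 0;
}

/* Returns 1 if every decimal digit in cps[0..n-1] comes from one set. */
int digits_from_one_set(const uint32_t *cps, size_t n)
{
  uint32_t seen = 0;
  assert(cps != NULL || n == 0);
  for (size_t i = 0; i < n; i++) {
    uint32_t base = digit_set_of(cps[i]);
    if (base == 0) continue;            /* not a digit: ignored here */
    if (seen == 0) seen = base;         /* the first digit fixes the set */
    else if (base != seen) return 0;    /* digits from two different sets */
  }
  return 1;
}
```

       A string that mixes ASCII "1" with U+0662 (Arabic-Indic digit two)
       fails this check even though both characters are decimal digits.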
11751*22dc650dSSadaf Ebrahimi
11752*22dc650dSSadaf Ebrahimi
11753*22dc650dSSadaf EbrahimiVALIDITY OF UTF STRINGS
11754*22dc650dSSadaf Ebrahimi
11755*22dc650dSSadaf Ebrahimi       When  the  PCRE2_UTF  option is set, the strings passed as patterns and
11756*22dc650dSSadaf Ebrahimi       subjects are (by default) checked for validity on entry to the relevant
11757*22dc650dSSadaf Ebrahimi       functions. If an invalid UTF string is passed, a negative error code is
11758*22dc650dSSadaf Ebrahimi       returned. The code unit offset to the offending character  can  be  ex-
11759*22dc650dSSadaf Ebrahimi       tracted  from  the  match  data block by calling pcre2_get_startchar(),
11760*22dc650dSSadaf Ebrahimi       which is used for this purpose after a UTF error.
11761*22dc650dSSadaf Ebrahimi
11762*22dc650dSSadaf Ebrahimi       In some situations, you may already know that your strings  are  valid,
11763*22dc650dSSadaf Ebrahimi       and  therefore  want  to  skip these checks in order to improve perfor-
11764*22dc650dSSadaf Ebrahimi       mance, for example in the case of a long subject string that  is  being
11765*22dc650dSSadaf Ebrahimi       scanned  repeatedly.   If you set the PCRE2_NO_UTF_CHECK option at com-
11766*22dc650dSSadaf Ebrahimi       pile time or at match time, PCRE2 assumes that the pattern  or  subject
11767*22dc650dSSadaf Ebrahimi       it is given (respectively) contains only valid UTF code unit sequences.
11768*22dc650dSSadaf Ebrahimi
11769*22dc650dSSadaf Ebrahimi       If  you  pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
11770*22dc650dSSadaf Ebrahimi       result is undefined and your program may crash or loop indefinitely  or
11771*22dc650dSSadaf Ebrahimi       give  incorrect  results.  There is, however, one mode of matching that
11772*22dc650dSSadaf Ebrahimi       can handle invalid UTF subject strings.  This  is  enabled  by  passing
       PCRE2_MATCH_INVALID_UTF to pcre2_compile() and is discussed in the
       next section. The rest of this section covers the case when
11775*22dc650dSSadaf Ebrahimi       PCRE2_MATCH_INVALID_UTF is not set.
11776*22dc650dSSadaf Ebrahimi
11777*22dc650dSSadaf Ebrahimi       Passing  PCRE2_NO_UTF_CHECK  to  pcre2_compile()  just disables the UTF
11778*22dc650dSSadaf Ebrahimi       check for the pattern; it does not also apply to  subject  strings.  If
11779*22dc650dSSadaf Ebrahimi       you  want  to disable the check for a subject string you must pass this
11780*22dc650dSSadaf Ebrahimi       same option to pcre2_match() or pcre2_dfa_match().
11781*22dc650dSSadaf Ebrahimi
       UTF-16 and UTF-32 strings can indicate their endianness by a special
       code known as a byte-order mark (BOM). The PCRE2 functions do not handle
       this, expecting strings to be in host byte order.
11785*22dc650dSSadaf Ebrahimi
11786*22dc650dSSadaf Ebrahimi       Unless  PCRE2_NO_UTF_CHECK  is  set, a UTF string is checked before any
11787*22dc650dSSadaf Ebrahimi       other  processing  takes  place.  In  the  case  of  pcre2_match()  and
11788*22dc650dSSadaf Ebrahimi       pcre2_dfa_match()  calls  with a non-zero starting offset, the check is
11789*22dc650dSSadaf Ebrahimi       applied only to that part of the subject that could be inspected during
11790*22dc650dSSadaf Ebrahimi       matching, and there is a check that the starting offset points  to  the
11791*22dc650dSSadaf Ebrahimi       first  code  unit of a character or to the end of the subject. If there
11792*22dc650dSSadaf Ebrahimi       are no lookbehind assertions in the pattern, the check  starts  at  the
       starting offset. Otherwise, it starts before the starting offset by the
       length of the longest lookbehind, or at the start of the subject if
       there are not that many characters before the starting offset. Note
11796*22dc650dSSadaf Ebrahimi       that the sequences \b and \B are one-character lookbehinds.
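
       The position at which such a check begins can be sketched for UTF-8
       as follows. This is an assumption about how the rule might be coded,
       not a copy of the PCRE2 source; utf8_check_start() is a hypothetical
       name, and max_lookbehind is counted in characters.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the byte offset at which validity checking would begin: up to
   max_lookbehind characters before start_offset, but never before the
   start of the subject. */
size_t utf8_check_start(const unsigned char *subject, size_t start_offset,
                        size_t max_lookbehind)
{
  size_t p = start_offset;
  assert(subject != NULL);
  while (max_lookbehind > 0 && p > 0) {
    p--;                                   /* step back one code unit */
    while (p > 0 && (subject[p] & 0xc0) == 0x80)
      p--;                                 /* skip continuation bytes */
    max_lookbehind--;
  }
  return p;
}
```

       For the five-byte subject "a", U+20AC, "b" with a starting offset of
       4, a longest lookbehind of one character moves the check back to
       offset 1, the first byte of the three-byte character.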
11797*22dc650dSSadaf Ebrahimi
11798*22dc650dSSadaf Ebrahimi       In addition to checking the format of the string, there is a  check  to
11799*22dc650dSSadaf Ebrahimi       ensure that all code points lie in the range U+0 to U+10FFFF, excluding
11800*22dc650dSSadaf Ebrahimi       the  surrogate  area. The so-called "non-character" code points are not
11801*22dc650dSSadaf Ebrahimi       excluded because Unicode corrigendum #9 makes it clear that they should
11802*22dc650dSSadaf Ebrahimi       not be.
11803*22dc650dSSadaf Ebrahimi
11804*22dc650dSSadaf Ebrahimi       Characters in the "Surrogate Area" of Unicode are reserved for  use  by
11805*22dc650dSSadaf Ebrahimi       UTF-16,  where they are used in pairs to encode code points with values
11806*22dc650dSSadaf Ebrahimi       greater than 0xFFFF. The code points that are encoded by  UTF-16  pairs
11807*22dc650dSSadaf Ebrahimi       are  available  independently  in  the  UTF-8 and UTF-32 encodings. (In
11808*22dc650dSSadaf Ebrahimi       other words, the whole surrogate thing is a fudge for UTF-16 which  un-
11809*22dc650dSSadaf Ebrahimi       fortunately messes up UTF-8 and UTF-32.)
11810*22dc650dSSadaf Ebrahimi
11811*22dc650dSSadaf Ebrahimi       Setting  PCRE2_NO_UTF_CHECK  at compile time does not disable the error
11812*22dc650dSSadaf Ebrahimi       that is given if an escape sequence for an invalid Unicode  code  point
11813*22dc650dSSadaf Ebrahimi       is  encountered  in  the pattern. If you want to allow escape sequences
       such as \x{d800} (a surrogate code point) you can set the
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is
       possible only in UTF-8 and UTF-32 modes, because these values are not
       representable in UTF-16.
11818*22dc650dSSadaf Ebrahimi
11819*22dc650dSSadaf Ebrahimi   Errors in UTF-8 strings
11820*22dc650dSSadaf Ebrahimi
11821*22dc650dSSadaf Ebrahimi       The following negative error codes are given for invalid UTF-8 strings:
11822*22dc650dSSadaf Ebrahimi
11823*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR1
11824*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR2
11825*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR3
11826*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR4
11827*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR5
11828*22dc650dSSadaf Ebrahimi
11829*22dc650dSSadaf Ebrahimi       The  string  ends  with a truncated UTF-8 character; the code specifies
11830*22dc650dSSadaf Ebrahimi       how many bytes are missing (1 to 5). Although RFC 3629 restricts  UTF-8
11831*22dc650dSSadaf Ebrahimi       characters  to  be  no longer than 4 bytes, the encoding scheme (origi-
11832*22dc650dSSadaf Ebrahimi       nally defined by RFC 2279) allows for  up  to  6  bytes,  and  this  is
11833*22dc650dSSadaf Ebrahimi       checked first; hence the possibility of 4 or 5 missing bytes.
11834*22dc650dSSadaf Ebrahimi
11835*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR6
11836*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR7
11837*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR8
11838*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR9
11839*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR10
11840*22dc650dSSadaf Ebrahimi
11841*22dc650dSSadaf Ebrahimi       The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
11842*22dc650dSSadaf Ebrahimi       the  character  do  not have the binary value 0b10 (that is, either the
11843*22dc650dSSadaf Ebrahimi       most significant bit is 0, or the next bit is 1).
11844*22dc650dSSadaf Ebrahimi
11845*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR11
11846*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR12
11847*22dc650dSSadaf Ebrahimi
11848*22dc650dSSadaf Ebrahimi       A character that is valid by the RFC 2279 rules is either 5 or 6  bytes
11849*22dc650dSSadaf Ebrahimi       long; these code points are excluded by RFC 3629.
11850*22dc650dSSadaf Ebrahimi
11851*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR13
11852*22dc650dSSadaf Ebrahimi
11853*22dc650dSSadaf Ebrahimi       A 4-byte character has a value greater than 0x10ffff; these code points
11854*22dc650dSSadaf Ebrahimi       are excluded by RFC 3629.
11855*22dc650dSSadaf Ebrahimi
11856*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR14
11857*22dc650dSSadaf Ebrahimi
       A 3-byte character has a value in the range 0xd800 to 0xdfff; this
       range of code points is reserved by RFC 3629 for use with UTF-16, and
       so is excluded from UTF-8.
11861*22dc650dSSadaf Ebrahimi
11862*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR15
11863*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR16
11864*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR17
11865*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR18
11866*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR19
11867*22dc650dSSadaf Ebrahimi
11868*22dc650dSSadaf Ebrahimi       A  2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
11869*22dc650dSSadaf Ebrahimi       for a value that can be represented by fewer bytes, which  is  invalid.
11870*22dc650dSSadaf Ebrahimi       For  example,  the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
11871*22dc650dSSadaf Ebrahimi       rect coding uses just one byte.
11872*22dc650dSSadaf Ebrahimi
11873*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR20
11874*22dc650dSSadaf Ebrahimi
11875*22dc650dSSadaf Ebrahimi       The two most significant bits of the first byte of a character have the
11876*22dc650dSSadaf Ebrahimi       binary value 0b10 (that is, the most significant bit is 1 and the  sec-
11877*22dc650dSSadaf Ebrahimi       ond  is  0). Such a byte can only validly occur as the second or subse-
11878*22dc650dSSadaf Ebrahimi       quent byte of a multi-byte character.
11879*22dc650dSSadaf Ebrahimi
11880*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF8_ERR21
11881*22dc650dSSadaf Ebrahimi
11882*22dc650dSSadaf Ebrahimi       The first byte of a character has the value 0xfe or 0xff. These  values
11883*22dc650dSSadaf Ebrahimi       can never occur in a valid UTF-8 string.
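
       The kinds of error described above can be illustrated by a simplified
       checker for a single character. It is a sketch, not the PCRE2
       validator: the u8_status values are hypothetical and do not correspond
       one-to-one to the PCRE2_ERROR_UTF8_* codes, the order of the tests
       differs, and 5- and 6-byte forms are rejected from the first byte
       alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes; not the PCRE2_ERROR_UTF8_* values. */
enum u8_status {
  U8_OK, U8_TRUNCATED, U8_BAD_CONTINUATION, U8_FIVE_OR_SIX_BYTES,
  U8_TOO_LARGE, U8_SURROGATE, U8_OVERLONG, U8_ISOLATED_0X80, U8_FE_OR_FF
};

/* Checks the one character that starts at s[0]; len is the number of
   bytes available and must be greater than zero. */
enum u8_status check_utf8_char(const unsigned char *s, size_t len)
{
  /* Smallest value that needs n bytes, indexed by n. */
  static const unsigned long min_value[] = { 0, 0, 0x80, 0x800, 0x10000 };
  unsigned char c;
  unsigned long cp;
  size_t n, i;

  assert(s != NULL && len > 0);
  c = s[0];
  if (c < 0x80) return U8_OK;                   /* one-byte character */
  if (c < 0xc0) return U8_ISOLATED_0X80;        /* 0b10xxxxxx comes first */
  if (c >= 0xfe) return U8_FE_OR_FF;
  if (c >= 0xf8) return U8_FIVE_OR_SIX_BYTES;   /* valid in RFC 2279 only */
  n = (c >= 0xf0) ? 4 : (c >= 0xe0) ? 3 : 2;    /* total length in bytes */
  if (len < n) return U8_TRUNCATED;
  cp = c & (0xffu >> (n + 1));                  /* data bits of first byte */
  for (i = 1; i < n; i++) {
    if ((s[i] & 0xc0) != 0x80) return U8_BAD_CONTINUATION;
    cp = (cp << 6) | (s[i] & 0x3f);             /* add six more data bits */
  }
  if (cp > 0x10ffff) return U8_TOO_LARGE;
  if (cp >= 0xd800 && cp <= 0xdfff) return U8_SURROGATE;
  if (cp < min_value[n]) return U8_OVERLONG;
  return U8_OK;
}
```

       Given the two bytes 0xc0, 0xae from the example above, this function
       decodes the value 0x2e and reports it as overlong.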
11884*22dc650dSSadaf Ebrahimi
11885*22dc650dSSadaf Ebrahimi   Errors in UTF-16 strings
11886*22dc650dSSadaf Ebrahimi
11887*22dc650dSSadaf Ebrahimi       The  following  negative  error  codes  are  given  for  invalid UTF-16
11888*22dc650dSSadaf Ebrahimi       strings:
11889*22dc650dSSadaf Ebrahimi
11890*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
11891*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
11892*22dc650dSSadaf Ebrahimi         PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate
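
       The three conditions can be sketched as a scan over 16-bit code
       units. check_utf16() is a hypothetical helper, and its return values
       1, 2, and 3 merely follow the order of the list above; they are not
       the values of the PCRE2 error codes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if s[0..len-1] is valid UTF-16, otherwise 1 (missing low
   surrogate at end), 2 (invalid low surrogate), or 3 (isolated low
   surrogate). */
int check_utf16(const uint16_t *s, size_t len)
{
  assert(s != NULL || len == 0);
  for (size_t i = 0; i < len; i++) {
    uint16_t c = s[i];
    if (c < 0xD800 || c > 0xDFFF) continue;    /* not a surrogate */
    if (c >= 0xDC00) return 3;                 /* low surrogate first */
    if (i + 1 == len) return 1;                /* high surrogate at end */
    if (s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
      return 2;                                /* high not followed by low */
    i++;                                       /* skip the low surrogate */
  }
  return 0;
}
```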


   Errors in UTF-32 strings

       The following negative error codes are given for invalid UTF-32
       strings:

         PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff)
         PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff
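
       The two UTF-32 conditions reduce to range tests on a single code unit.
       The sketch below is illustrative only; classify_utf32() and its enum
       values are hypothetical names, not part of the PCRE2 API.

```c
#include <stdint.h>

/* Hypothetical status values mirroring the two conditions in the table. */
enum utf32_status { U32_OK, U32_SURROGATE, U32_TOO_LARGE };

static enum utf32_status classify_utf32(uint32_t c)
{
  if (c >= 0xd800 && c <= 0xdfff) return U32_SURROGATE; /* cf. ERR1 */
  if (c > 0x10ffff) return U32_TOO_LARGE;               /* cf. ERR2 */
  return U32_OK;
}
```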


MATCHING IN INVALID UTF STRINGS

       You can run pattern matches on subject strings that may contain invalid
       UTF sequences if you call pcre2_compile() with the
       PCRE2_MATCH_INVALID_UTF option. This is supported by pcre2_match(),
       including JIT matching, but not by pcre2_dfa_match(). When
       PCRE2_MATCH_INVALID_UTF is set, it forces PCRE2_UTF to be set as well.
       Note, however, that the pattern itself must be a valid UTF string.

       If you do not set PCRE2_MATCH_INVALID_UTF when calling pcre2_compile(),
       and you are not certain that your subject strings are valid UTF
       sequences, you should not make use of the JIT "fast path" function
       pcre2_jit_match() because it bypasses sanity checks, including the one
       for UTF validity. An invalid string may cause undefined behaviour,
       including looping, crashing, or giving the wrong answer.

       Setting PCRE2_MATCH_INVALID_UTF does not affect what pcre2_compile()
       generates, but if pcre2_jit_compile() is subsequently called, it does
       generate different code. If JIT is not used, the option affects the
       behaviour of the interpretive code in pcre2_match(). When
       PCRE2_MATCH_INVALID_UTF is set at compile time, PCRE2_NO_UTF_CHECK is
       ignored at match time.

       In this mode, an invalid code unit sequence in the subject never
       matches any pattern item. It does not match dot, it does not match
       \p{Any}, it does not even match negative items such as [^X]. A
       lookbehind assertion fails if it encounters an invalid sequence while
       moving the current point backwards. In other words, an invalid UTF code
       unit sequence acts as a barrier which no match can cross.

       You can also think of this as the subject being split up into fragments
       of valid UTF, delimited internally by invalid code unit sequences. The
       pattern is matched fragment by fragment. The result of a successful
       match, however, is given as code unit offsets in the entire subject
       string in the usual way. There are a few points to consider:

       The internal boundaries are not interpreted as the beginnings or ends
       of lines and so do not match circumflex or dollar characters in the
       pattern.

       If pcre2_match() is called with an offset that points to an invalid
       UTF-sequence, that sequence is skipped, and the match starts at the
       next valid UTF character, or the end of the subject.

       At internal fragment boundaries, \b and \B behave in the same way as at
       the beginning and end of the subject. For example, a sequence such as
       \bWORD\b would match an instance of WORD that is surrounded by invalid
       UTF code units.
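
       The fragment model can be sketched in plain C, independently of PCRE2.
       The code below is an illustration under stated assumptions, not the
       library's algorithm: utf8_valid_len() and next_fragment() are
       hypothetical helpers, validity follows the current UTF-8 definition
       (at most four bytes, no surrogates, no overlong codings), and invalid
       bytes are skipped one at a time, which may not be how PCRE2 delimits an
       invalid sequence internally.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the length (1-4) of a valid UTF-8 character at s, or 0 if the
   bytes at s do not start a valid character. n is the number of bytes left. */
static size_t utf8_valid_len(const uint8_t *s, size_t n)
{
  size_t need;
  uint32_t cp, min_cp;

  if (n == 0) return 0;
  if (s[0] < 0x80) return 1;
  if ((s[0] & 0xe0) == 0xc0)      { need = 2; cp = s[0] & 0x1f; min_cp = 0x80; }
  else if ((s[0] & 0xf0) == 0xe0) { need = 3; cp = s[0] & 0x0f; min_cp = 0x800; }
  else if ((s[0] & 0xf8) == 0xf0) { need = 4; cp = s[0] & 0x07; min_cp = 0x10000; }
  else return 0;                  /* continuation byte, or 0xf8 and above */

  if (n < need) return 0;         /* truncated character */
  for (size_t i = 1; i < need; i++) {
    if ((s[i] & 0xc0) != 0x80) return 0;   /* not a continuation byte */
    cp = (cp << 6) | (uint32_t)(s[i] & 0x3f);
  }
  if (cp < min_cp) return 0;                    /* overlong coding */
  if (cp > 0x10ffff) return 0;                  /* beyond Unicode range */
  if (cp >= 0xd800 && cp <= 0xdfff) return 0;   /* surrogate */
  return need;
}

/* Find the next maximal run of valid characters at or after *pos.
   On success, return 1, set [*start, *end) to the fragment's offsets in
   the whole subject, and advance *pos to *end. Return 0 if none remains. */
static int next_fragment(const uint8_t *s, size_t len, size_t *pos,
                         size_t *start, size_t *end)
{
  size_t i = *pos;
  size_t k;

  /* Skip invalid bytes (assumption: one byte at a time). */
  while (i < len && utf8_valid_len(s + i, len - i) == 0) i++;
  if (i >= len) { *pos = len; return 0; }

  *start = i;
  while (i < len && (k = utf8_valid_len(s + i, len - i)) != 0) i += k;
  *end = i;
  *pos = i;
  return 1;
}
```

       For the nine bytes 'a', 'b', 0xff, 'c', 'd', 0xc0, 0xae, 'e', 'f',
       successive calls yield fragments at offsets 0-2, 3-5, and 7-9 (end
       offsets exclusive). The offsets are relative to the whole subject, in
       the same way that match offsets are reported.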

       Using PCRE2_MATCH_INVALID_UTF, an application can run matches on
       arbitrary data, knowing that any matched strings that are returned are
       valid UTF. This can be useful when searching for UTF text in executable
       or other binary files.

       Note, however, that the 16-bit and 32-bit PCRE2 libraries process
       strings as sequences of uint16_t or uint32_t code units. They cannot
       find valid UTF sequences within an arbitrary string of bytes unless
       such sequences are suitably aligned.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 12 October 2023
       Copyright (c) 1997-2023 University of Cambridge.


PCRE2 10.43                    04 February 2023                PCRE2UNICODE(3)
------------------------------------------------------------------------------