Technical notes about PCRE2
---------------------------

These are very rough technical notes that record potentially useful information
about PCRE2 internals. PCRE2 is a library based on the original PCRE library,
but with a revised (and incompatible) API. To avoid confusion, the original
library is referred to as PCRE1 below. For information about testing PCRE2, see
the pcre2test documentation and the comment at the head of the RunTest file.

PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
releases carried on the 8.xx series, up to the final 8.45 release. PCRE2
releases started at 10.00 to avoid confusion with PCRE1.


Historical note 1
-----------------

Many years ago I implemented some regular expression functions to an algorithm
suggested by Martin Richards. The rather simple patterns were not Unix-like in
form, and were quite restricted in what they could do by comparison with Perl.
The interesting part about the algorithm was that the amount of space required
to hold the compiled form of an expression was known in advance. The code to
apply an expression did not operate by backtracking, as the original Henry
Spencer code and current PCRE2 and Perl code do, but instead checked all
possibilities simultaneously by keeping a list of current states and checking
all of them as it advanced through the subject string. In the terminology of
Jeffrey Friedl's book, it was a "DFA algorithm", though it was not a
traditional Finite State Machine (FSM). When the pattern was all used up, all
remaining states were possible matches, and the one matching the longest
substring of the subject string was chosen. This did not necessarily maximize
the individual wild portions of the pattern, as is expected in Unix and
Perl-style regular expressions.


Historical note 2
-----------------

By contrast, the code originally written by Henry Spencer (which was
subsequently heavily modified for Perl) compiles the expression twice: once in
a dummy mode in order to find out how much store will be needed, and then for
real. (The Perl version may or may not still do this; I'm talking about the
original library.) The execution function operates by backtracking and
maximizing (or, optionally, minimizing, in Perl) the amount of the subject that
matches individual wild portions of the pattern. This is an "NFA algorithm" in
Friedl's terminology.


OK, here's the real stuff
-------------------------

For the set of functions that formed the original PCRE1 library in 1997 (which
are unrelated to those mentioned above), I tried at first to invent an
algorithm that used an amount of store bounded by a multiple of the number of
characters in the pattern, to save on compiling time. However, because of the
greater complexity in Perl regular expressions, I couldn't do this, even though
the then current Perl 5.004 patterns were much simpler than those supported
nowadays. In any case, a first pass through the pattern is helpful for other
reasons.


Support for 16-bit and 32-bit data strings
-------------------------------------------

The PCRE2 library can be compiled in any combination of 8-bit, 16-bit or 32-bit
modes, creating up to three different libraries. In the description that
follows, the word "short" is used for a 16-bit data quantity, and the phrase
"code unit" is used for a quantity that is a byte in 8-bit mode, a short in
16-bit mode and a 32-bit word in 32-bit mode. The names of PCRE2 functions are
given in generic form, without the _8, _16, or _32 suffix.


Computing the memory requirement: how it was
--------------------------------------------

Up to and including release 6.7, PCRE1 worked by running a very degenerate
first pass to calculate a maximum memory requirement, and then a second pass to
do the real compile - which might use a bit less than the predicted amount of
memory. The idea was that this would turn out faster than the Henry Spencer
code because the first pass is degenerate and the second pass can just store
stuff straight into memory, which it knows is big enough.


Computing the memory requirement: how it is
-------------------------------------------

By the time I was working on a potential 6.8 release, the degenerate first pass
had become very complicated and hard to maintain. Indeed one of the early
things I did for 6.8 was to fix Yet Another Bug in the memory computation. Then
I had a flash of inspiration as to how I could run the real compile function in
a "fake" mode that enables it to compute how much memory it would need, while
in most cases only ever using a small amount of working memory, and without too
many tests of the mode that might slow it down. So I refactored the compiling
functions to work this way. This got rid of about 600 lines of source and made
further maintenance and development easier. As this was such a major change, I
never released 6.8, instead upping the number to 7.0 (other quite major changes
were also present in the 7.0 release).

A side effect of this work was that the previous limit of 200 on the nesting
depth of parentheses was removed. However, there was a downside: compiling ran
more slowly than before (30% or more, depending on the pattern) because it now
did a full analysis of the pattern. My hope was that this would not be a big
issue, and in the event, nobody has commented on it.

At release 8.34, a limit on the nesting depth of parentheses was re-introduced
(default 250, settable at build time) so as to put a limit on the amount of
system stack used by the compile function, which uses recursive function calls
for nested parenthesized groups. This is a safety feature for environments with
small stacks where the patterns are provided by users.


Yet another pattern scan
------------------------

History repeated itself for PCRE2 release 10.20. A number of bugs relating to
named subpatterns had been discovered by fuzzers. Most of these were related to
the handling of forward references when it was not known if the named group was
unique. (References to non-unique names use a different opcode and more
memory.) The use of duplicate group numbers (the (?| facility) also caused
issues.

To get around these problems I adopted a new approach by adding a third pass
over the pattern (really a "pre-pass"), which did nothing other than identify
all the named subpatterns and their corresponding group numbers. This means
that the actual compile (both the memory-computing dummy run and the real
compile) has full knowledge of group names and numbers throughout. Several
dozen lines of messy code were eliminated, though the new pre-pass was not
short. In particular, parsing and skipping over [] classes is complicated.

While working on 10.22 I realized that I could simplify yet again by moving
more of the parsing into the pre-pass, thus avoiding doing it in two places, so
after 10.22 was released, the code underwent yet another big refactoring. This
is how it is from 10.23 onwards:

The function called parse_regex() scans the pattern characters, parsing them
into literal data and meta characters. It converts escapes such as \x{123}
into literals, handles \Q...\E, and skips over comments and non-significant
white space. The result of the scanning is put into a vector of 32-bit unsigned
integers. Values less than 0x80000000 are literal data. Higher values represent
meta-characters. The top 16 bits of such values identify the meta-character,
and these are given names such as META_CAPTURE. The lower 16 bits are available
for data, for example, the capturing group number. The only situation in which
literal data values greater than 0x7fffffff can appear is when the 32-bit
library is running in non-UTF mode. This is handled by having a special
meta-character that is followed by the 32-bit data value.
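
As a rough illustration of this encoding, the following C sketch classifies a
parsed pattern element. The sk_ and SK_ names are hypothetical, and the value
given to SK_META_CAPTURE is made up for the example; only the 0x80000000
threshold and the 16/16 split are taken from the description above.

```c
#include <stdint.h>

/* Hypothetical names for illustration; these are not the library's own
definitions. */

#define SK_META_END      0x80000000u  /* lowest meta value, per the text */
#define SK_META_CAPTURE  0x80080000u  /* assumed code, for the example only */

/* Values below 0x80000000 are literal data. */

static int sk_is_literal(uint32_t element)
{
return element < SK_META_END;
}

/* The top 16 bits identify the meta-character. */

static uint32_t sk_meta_code(uint32_t element)
{
return element & 0xffff0000u;
}

/* The lower 16 bits carry data such as a capturing group number. */

static uint32_t sk_meta_data(uint32_t element)
{
return element & 0x0000ffffu;
}
```

For instance, an element built as SK_META_CAPTURE | 3 would be treated as the
start of capturing group 3.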

The size of the parsed pattern vector, when auto-callouts are not enabled, is
bounded by the length of the pattern (with one exception). The code is written
so that each item in the pattern uses no more vector elements than the number
of code units in the item itself. The exception is the aforementioned large
32-bit number handling. For this reason, 32-bit non-UTF patterns are scanned in
advance to check for such values. When auto-callouts are enabled, the generous
assumption is made that there will be a callout for each pattern code unit
(which of course is only actually true if all code units are literals) plus one
at the end. A default parsed pattern vector is defined on the system stack, to
minimize memory handling, but if this is not big enough, heap memory is used.
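
The stack-first, heap-as-fallback arrangement might look something like the
sketch below. The function names and the used_heap flag are hypothetical, and
plain malloc() and free() are used here as a simplifying assumption. The
caller is assumed to have already worked out how many elements are needed.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Return a vector with room for "needed" elements. The caller's on-stack
vector is used when it is big enough; otherwise heap memory is obtained.
*used_heap records which case applied so that the vector can be released
correctly. Returns NULL if heap allocation fails. */

static uint32_t *sk_choose_vector(uint32_t *stack_vector,
  size_t stack_elements, size_t needed, int *used_heap)
{
if (needed <= stack_elements)
  {
  *used_heap = 0;
  return stack_vector;
  }
*used_heap = 1;
return (uint32_t *)malloc(needed * sizeof(uint32_t));
}

/* Free the vector only if it came from the heap. */

static void sk_release_vector(uint32_t *vector, int used_heap)
{
if (used_heap) free(vector);
}
```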

As before, the actual compiling function is run twice, the first time to
determine the amount of memory needed for the final compiled pattern. It
now processes the parsed pattern vector, not the pattern itself, although some
of the parsed items refer to strings in the pattern - for example, group
names. As escapes and comments have already been processed, the code is a bit
simpler than before.
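
The run-it-twice idea can be shown with a toy. The sketch below is not how the
real compiling function is organized; the function names, the two-bytes-per-
element output format and the use of malloc() are all assumptions made for the
example. The point is only that one function both measures and generates,
depending on whether it is given somewhere to store its output.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy "compiler": each parsed element becomes a pretend opcode byte followed
by one data byte, and a pretend end opcode finishes the code. When out is
NULL nothing is stored; the required length is still computed and returned. */

static size_t sk_compile(const uint32_t *parsed, size_t count, uint8_t *out)
{
size_t length = 0;
for (size_t i = 0; i < count; i++)
  {
  if (out != NULL)
    {
    out[length] = 1;                                /* pretend opcode */
    out[length + 1] = (uint8_t)(parsed[i] & 0xffu); /* pretend data */
    }
  length += 2;
  }
if (out != NULL) out[length] = 0;                   /* pretend end opcode */
return length + 1;
}

/* First run measures, second run fills memory known to be big enough. */

static uint8_t *sk_compile_twice(const uint32_t *parsed, size_t count,
  size_t *length)
{
uint8_t *code;
*length = sk_compile(parsed, count, NULL);
code = (uint8_t *)malloc(*length);
if (code != NULL) (void)sk_compile(parsed, count, code);
return code;
}
```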

Most errors can be diagnosed during the parsing scan. For those that cannot
(for example, "lookbehind assertion is not fixed length"), the parsed code
contains offsets into the pattern so that the actual compiling code can
report where errors are.


The elements of the parsed pattern vector
-----------------------------------------

The word "offset" below means a code unit offset into the pattern. When
PCRE2_SIZE (which is usually size_t) is no bigger than uint32_t, an offset is
stored in a single parsed pattern element. Otherwise (typically on 64-bit
systems) it occupies two elements. The following meta items occupy just one
element, with no data:

META_ACCEPT           (*ACCEPT)
META_ASTERISK         *
META_ASTERISK_PLUS    *+
META_ASTERISK_QUERY   *?
META_ATOMIC           (?> start of atomic group
META_CIRCUMFLEX       ^ metacharacter
META_CLASS            [ start of non-empty class
META_CLASS_EMPTY      [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS
META_CLASS_EMPTY_NOT  [^] negative empty class - ditto
META_CLASS_END        ] end of non-empty class
META_CLASS_NOT        [^ start non-empty negative class
META_COMMIT           (*COMMIT) - no argument (see below for with argument)
META_COND_ASSERT      (?(?assertion)
META_DOLLAR           $ metacharacter
META_DOT              . metacharacter
META_END              End of pattern (this value is 0x80000000)
META_FAIL             (*FAIL)
META_KET              ) closing parenthesis
META_LOOKAHEAD        (?= start of lookahead
META_LOOKAHEAD_NA     (*napla: start of non-atomic lookahead
META_LOOKAHEADNOT     (?! start of negative lookahead
META_NOCAPTURE        (?: no capture parens
META_PLUS             +
META_PLUS_PLUS        ++
META_PLUS_QUERY       +?
META_PRUNE            (*PRUNE) - no argument (see below for with argument)
META_QUERY            ?
META_QUERY_PLUS       ?+
META_QUERY_QUERY      ??
META_RANGE_ESCAPED    hyphen in class range with at least one escape
META_RANGE_LITERAL    hyphen in class range defined literally
META_SKIP             (*SKIP) - no argument (see below for with argument)
META_THEN             (*THEN) - no argument (see below for with argument)

The two RANGE values occur only in character classes. They are positioned
between two literals that define the start and end of the range. In an EBCDIC
environment it is necessary to know whether either of the range values was
specified as an escape. In an ASCII/Unicode environment the distinction is not
relevant.

The following have data in the lower 16 bits, and may be followed by other data
elements:

META_ALT              | alternation
META_BACKREF          back reference
META_CAPTURE          start of capturing group
META_ESCAPE           non-literal escape sequence
META_RECURSE          recursion call

If the data for META_ALT is non-zero, it is inside a lookbehind, and the data
is the maximum length of its branch (see META_LOOKBEHIND below for more
detail).

META_BACKREF, META_CAPTURE, and META_RECURSE have the capture group number as
their data in the lower 16 bits of the element. META_RECURSE is followed by an
offset, for use in error messages.

META_BACKREF is followed by an offset if the back reference group number is 10
or more. The offsets of the first occurrences of references to groups whose
numbers are less than 10 are put in cb->small_ref_offset[] (only the first
occurrence is useful). On 64-bit systems this avoids using more than two parsed
pattern elements for items such as \3. The offset is used when an error occurs
because the reference is to a non-existent group.
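
To make the one-or-two-element rule for offsets concrete, here is a sketch
that uses size_t as a stand-in for PCRE2_SIZE. The function names are
hypothetical, and the choice to store the more significant half first in the
two-element case is an assumption made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Store an offset starting at p and return a pointer to the next free
element. Where size_t is wider than 32 bits the offset takes two elements
(high half first: an assumption); otherwise it takes one. */

static uint32_t *sk_put_offset(uint32_t *p, size_t offset)
{
#if SIZE_MAX > UINT32_MAX
*p++ = (uint32_t)(offset >> 32);
*p++ = (uint32_t)(offset & 0xffffffffu);
#else
*p++ = (uint32_t)offset;
#endif
return p;
}

/* Read an offset back and return a pointer to the element that follows it. */

static const uint32_t *sk_get_offset(const uint32_t *p, size_t *offset)
{
#if SIZE_MAX > UINT32_MAX
*offset = ((size_t)p[0] << 32) | (size_t)p[1];
return p + 2;
#else
*offset = (size_t)p[0];
return p + 1;
#endif
}
```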

META_ESCAPE has an ESC_xxx value as its data. For ESC_P and ESC_p, the next
element contains the 16-bit type and data property values, packed together.
ESC_g and ESC_k are used only for named references - numerical ones are turned
into META_RECURSE or META_BACKREF as appropriate. ESC_g and ESC_k are followed
by a length and an offset into the pattern to specify the name.

The following have one data item that follows in the next vector element:

META_BIGVALUE         Next is a literal >= META_END
META_POSIX            POSIX class item (data identifies the class)
META_POSIX_NEG        negative POSIX class item (ditto)

The following are followed by a length element, then a number of character code
values (which should match with the length):

META_MARK             (*MARK:xxxx)
META_COMMIT_ARG       (*COMMIT:xxxx)
META_PRUNE_ARG        (*PRUNE:xxxx)
META_SKIP_ARG         (*SKIP:xxxx)
META_THEN_ARG         (*THEN:xxxx)
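
The layout of these verb-with-argument items (meta element, length element,
then the characters) can be sketched as follows. The function name and the
value of SK_META_MARK are hypothetical, and the argument is taken from a C
string purely for convenience.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_META_MARK 0x80280000u   /* made-up code, for the example only */

/* Append a verb item with an argument: the meta element, then the length,
then one element per character of the name. The caller must provide room for
strlen(name) + 2 elements. Returns the number of elements written. */

static size_t sk_emit_verb_arg(uint32_t *out, uint32_t meta, const char *name)
{
size_t length = strlen(name);
size_t n = 0;
out[n++] = meta;
out[n++] = (uint32_t)length;
for (size_t i = 0; i < length; i++)
  out[n++] = (uint32_t)(unsigned char)name[i];
return n;
}
```

With this layout, something like (*MARK:abc) would occupy five elements.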

The following are followed by a length element, then an offset in the pattern
that identifies the name:

META_COND_NAME        (?(<name>) or (?('name') or (?(name)
META_COND_RNAME       (?(R&name)
META_COND_RNUMBER     (?(Rdigits)
META_RECURSE_BYNAME   (?&name)
META_BACKREF_BYNAME   \k'name'

META_COND_RNUMBER is used for names that start with R and continue with digits,
because this is an ambiguous case. It could be a back reference to a group with
that name, or it could be a recursion test on a numbered group.

This one is followed by an offset, for use in error messages, then a number:

META_COND_NUMBER       (?([+-]digits)

The following is followed just by an offset, for use in error messages:

META_COND_DEFINE      (?(DEFINE)

The following are at first also followed just by an offset for use in error
messages. After the lengths of the branches of a lookbehind group have been
checked the error offset is no longer needed. The lower 16 bits of the main
word are now set to the maximum length of the first branch of the lookbehind
group, and the second word is set to the minimum matching length for a
variable-length lookbehind group, or to LOOKBEHIND_MAX for a group whose
branches are all of fixed length. These values are used when generating
OP_REVERSE or OP_VREVERSE for the first branch. The minimum value is also used
for any subsequent branches because there is only room for one value (the
branch maximum length) in a META_ALT item.

META_LOOKBEHIND       (?<=      start of lookbehind
META_LOOKBEHIND_NA    (*naplb:  start of non-atomic lookbehind
META_LOOKBEHINDNOT    (?<!      start of negative lookbehind

The following are followed by two elements, the minimum and maximum. The
maximum value is limited to 65535 (MAX_REPEAT_COUNT). A maximum value of
"unlimited" is represented by REPEAT_UNLIMITED, which is bigger than it:

META_MINMAX           {n,m}  repeat
META_MINMAX_PLUS      {n,m}+ repeat
META_MINMAX_QUERY     {n,m}? repeat
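
A sketch of how a {n,m} item might be laid out and range-checked is given
below. The SK_ names are hypothetical; the value chosen for
SK_REPEAT_UNLIMITED (one more than the limit) and the made-up code for
SK_META_MINMAX are assumptions. Applying the 65535 limit to the minimum as
well as the maximum is also an assumption of this example.

```c
#include <stdint.h>

#define SK_MAX_REPEAT_COUNT  65535u                      /* limit from the text */
#define SK_REPEAT_UNLIMITED  (SK_MAX_REPEAT_COUNT + 1u)  /* assumed value */
#define SK_META_MINMAX       0x80300000u                 /* made-up code */

/* Write the meta element followed by the minimum and maximum (three elements
in all). has_max == 0 stands for the open-ended form {n,}. Returns 0 on
success, or -1 if a count is out of range or the maximum is below the
minimum. */

static int sk_emit_minmax(uint32_t *out, uint32_t min, uint32_t max,
  int has_max)
{
if (min > SK_MAX_REPEAT_COUNT) return -1;
if (has_max && (max > SK_MAX_REPEAT_COUNT || max < min)) return -1;
out[0] = SK_META_MINMAX;
out[1] = min;
out[2] = has_max ? max : SK_REPEAT_UNLIMITED;
return 0;
}
```

Because the "unlimited" marker lies outside the valid range, later code can
tell {3,} apart from any explicit maximum with a single comparison.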

This one is followed by two elements, giving the new option settings for the
main and extra options, respectively.

META_OPTIONS          (?i) and friends

This one is followed by three elements. The first is 0 for '=' and 1 for '>=';
the next two are the major and minor numbers:

META_COND_VERSION     (?(VERSION<op>x.y)

Callouts are converted into one of two items:

META_CALLOUT_NUMBER   (?C with numerical argument
META_CALLOUT_STRING   (?C with string argument

In both cases, the next two elements contain the offset and length of the next
item in the pattern. Then there is either one callout number, or a length and
an offset for the string argument. The length includes both delimiters.


Traditional matching function
-----------------------------

The "traditional", and original, matching function is called pcre2_match(), and
it implements an NFA algorithm, similar to the original Henry Spencer algorithm
and the way that Perl works. This is not surprising, since it is intended to be
as compatible with Perl as possible. This is the function most users of PCRE2
will use most of the time. If PCRE2 is compiled with just-in-time (JIT)
support, and studying a compiled pattern with JIT is successful, the JIT code
is run instead of the normal pcre2_match() code, but the result is the same.


Supplementary matching function
-------------------------------

There is also a supplementary matching function called pcre2_dfa_match(). This
implements a DFA matching algorithm that searches simultaneously for all
possible matches that start at one point in the subject string. (Going back to
my roots: see Historical Note 1 above.) This function interprets the same
compiled pattern data as pcre2_match(); however, not all the facilities are
available, and those that are do not always work in quite the same way. See the
user documentation for details.

The algorithm that is used for pcre2_dfa_match() is not a traditional FSM,
because it may have a number of states active at one time. More work would be
needed at compile time to produce a traditional FSM where only one state is
ever active at once. I believe some other regex matchers work this way. JIT
support is not available for this kind of matching.
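
The idea of keeping several states active at once can be shown with a toy
matcher. This is in no way the code of pcre2_dfa_match(): it is a hypothetical
sketch that works directly on a pattern string, supports only literal
characters, '.', and a postfix '*' on a single character, assumes a
well-formed pattern of at most SK_MAX_PATTERN characters, and always starts at
the beginning of the subject. A state is simply an index into the pattern.

```c
#include <stddef.h>
#include <string.h>

#define SK_MAX_PATTERN 64

/* Mark state i as active. If the item at i is starred, the state after it is
also reachable without consuming anything (zero repetitions), so keep going. */

static void sk_add_state(char *active, const char *pat, size_t len, size_t i)
{
while (!active[i])
  {
  active[i] = 1;
  if (i + 1 < len && pat[i + 1] == '*') i += 2; else break;
  }
}

/* Return the length of the longest match of pat at the start of subject, or
-1 if there is no match. All active states are advanced together for each
subject character; there is no backtracking. */

static long sk_longest_match(const char *pat, const char *subject)
{
size_t len = strlen(pat);
size_t pos = 0;
long best = -1;
char current[SK_MAX_PATTERN + 1], next[SK_MAX_PATTERN + 1];

if (len > SK_MAX_PATTERN) return -1;
memset(current, 0, sizeof(current));
sk_add_state(current, pat, len, 0);

for (;;)
  {
  int any = 0;
  if (current[len]) best = (long)pos;     /* pattern used up: a match */
  if (subject[pos] == '\0') break;
  memset(next, 0, sizeof(next));
  for (size_t i = 0; i < len; i++)
    {
    int starred;
    if (!current[i]) continue;
    if (pat[i] != '.' && pat[i] != subject[pos]) continue;
    starred = (i + 1 < len && pat[i + 1] == '*');
    sk_add_state(next, pat, len, starred ? i : i + 1);
    any = 1;
    }
  if (!any) break;                        /* no state survived */
  memcpy(current, next, sizeof(current));
  pos++;
  }
return best;
}
```

Because every surviving state is carried forward, the longest match falls out
naturally, but nothing records how individual parts of the pattern matched.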


Changeable options
------------------

The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL) and
some others may be changed in the middle of patterns by items such as (?i).
Their processing is handled entirely at compile time by generating different
opcodes for the different settings. The runtime functions do not need to keep
track of an option's state.
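
A sketch of what "different opcodes for the different settings" means in
practice: the compiler looks at the options in force at that point in the
pattern and picks the opcode accordingly. The option bits, the enum values
and the function name below are hypothetical; only the opcode names echo the
OP_ANY/OP_ALLANY, OP_CIRC/OP_CIRCM and OP_DOLL/OP_DOLLM pairs described in
this document.

```c
#include <stdint.h>

/* Hypothetical option bits and opcode numbers, for illustration only. */

enum { SK_MULTILINE = 0x1, SK_DOTALL = 0x2 };

enum { SK_OP_ANY, SK_OP_ALLANY, SK_OP_CIRC, SK_OP_CIRCM, SK_OP_DOLL,
       SK_OP_DOLLM, SK_OP_NONE };

/* Choose the opcode for a metacharacter from the current option settings.
The matcher then needs no knowledge of the options at all. */

static int sk_opcode_for(char metachar, uint32_t options)
{
switch (metachar)
  {
  case '.': return (options & SK_DOTALL)? SK_OP_ALLANY : SK_OP_ANY;
  case '^': return (options & SK_MULTILINE)? SK_OP_CIRCM : SK_OP_CIRC;
  case '$': return (options & SK_MULTILINE)? SK_OP_DOLLM : SK_OP_DOLL;
  default:  return SK_OP_NONE;
  }
}
```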

PCRE2_DUPNAMES, PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE
are tracked and processed during the parsing pre-pass. The others are handled
from META_OPTIONS items during the main compile phase.


Format of compiled patterns
---------------------------

The compiled form of a pattern is a vector of unsigned code units (bytes in
8-bit mode, shorts in 16-bit mode, 32-bit words in 32-bit mode), containing
items of variable length. The first code unit in an item contains an opcode,
and the length of the item is either implicit in the opcode or contained in the
data that follows it.

In many cases listed below, LINK_SIZE data values are specified for offsets
within the compiled pattern. LINK_SIZE always specifies a number of bytes. The
default value for LINK_SIZE is 2, except for the 32-bit library, where it can
only be 4. The 8-bit library can be compiled to use 3-byte or 4-byte values,
and the 16-bit library can be compiled to use 4-byte values, though this
impairs performance. Specifying a LINK_SIZE larger than 2 for these libraries is
necessary only when patterns whose compiled length is greater than 65535 code
units are going to be processed. When a LINK_SIZE value uses more than one code
unit, the most significant unit is first.

In this description, we assume the "normal" compilation options. Data values
that are counts (e.g. quantifiers) are always two bytes long in 8-bit mode
(most significant byte first), and one code unit in 16-bit and 32-bit modes.
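
For the 8-bit library, "most significant unit first" amounts to the following.
This is a sketch with hypothetical function names, taking the link size as a
run-time argument for the sake of the example; it covers link sizes of 2, 3
or 4 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Store value in link_size bytes (2, 3 or 4), most significant byte first. */

static void sk_put_link(uint8_t *code, size_t link_size, uint32_t value)
{
for (size_t i = 0; i < link_size; i++)
  code[i] = (uint8_t)(value >> (8 * (link_size - 1 - i)));
}

/* Read a link_size-byte value stored most significant byte first. */

static uint32_t sk_get_link(const uint8_t *code, size_t link_size)
{
uint32_t value = 0;
for (size_t i = 0; i < link_size; i++)
  value = (value << 8) | code[i];
return value;
}
```

With a link size of 2 the largest value that can be stored is 65535, which is
where the limit on compiled pattern length mentioned above comes from.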


Opcodes with no following data
------------------------------

These items are all just one unit long:

  OP_END                 end of pattern
  OP_ANY                 match any one character other than newline
  OP_ALLANY              match any one character, including newline
  OP_ANYBYTE             match any single code unit, even in UTF-8/16 mode
  OP_SOD                 match start of data: \A
  OP_SOM                 start of match (subject + offset): \G
  OP_SET_SOM             set start of match (\K)
  OP_CIRC                ^ (start of data)
  OP_CIRCM               ^ multiline mode (start of data or after newline)
  OP_NOT_WORD_BOUNDARY   \B
  OP_WORD_BOUNDARY       \b
  OP_NOT_DIGIT           \D
  OP_DIGIT               \d
  OP_NOT_HSPACE          \H
  OP_HSPACE              \h
  OP_NOT_WHITESPACE      \S
  OP_WHITESPACE          \s
  OP_NOT_VSPACE          \V
  OP_VSPACE              \v
  OP_NOT_WORDCHAR        \W
  OP_WORDCHAR            \w
  OP_EODN                match end of data or newline at end: \Z
  OP_EOD                 match end of data: \z
  OP_DOLL                $ (end of data, or before final newline)
  OP_DOLLM               $ multiline mode (end of data or before newline)
  OP_EXTUNI              match an extended Unicode grapheme cluster
  OP_ANYNL               match any Unicode newline sequence
428*22dc650dSSadaf Ebrahimi
429*22dc650dSSadaf Ebrahimi  OP_ASSERT_ACCEPT       )
430*22dc650dSSadaf Ebrahimi  OP_ACCEPT              ) These are Perl 5.10's "backtracking control
431*22dc650dSSadaf Ebrahimi  OP_COMMIT              ) verbs". If OP_ACCEPT is inside capturing
432*22dc650dSSadaf Ebrahimi  OP_FAIL                ) parentheses, it may be preceded by one or more
433*22dc650dSSadaf Ebrahimi  OP_PRUNE               ) OP_CLOSE, each followed by a number that
434*22dc650dSSadaf Ebrahimi  OP_SKIP                ) indicates which parentheses must be closed.
435*22dc650dSSadaf Ebrahimi  OP_THEN                )
436*22dc650dSSadaf Ebrahimi
437*22dc650dSSadaf EbrahimiOP_ASSERT_ACCEPT is used when (*ACCEPT) is encountered within an assertion.
438*22dc650dSSadaf EbrahimiThis ends the assertion, not the entire pattern match. The assertion (?!) is
439*22dc650dSSadaf Ebrahimialways optimized to OP_FAIL.
440*22dc650dSSadaf Ebrahimi
441*22dc650dSSadaf EbrahimiOP_ALLANY is used for '.' when PCRE2_DOTALL is set. It is also used for \C in
442*22dc650dSSadaf Ebrahiminon-UTF modes and in UTF-32 mode (since one code unit still equals one
443*22dc650dSSadaf Ebrahimicharacter). Another use is for [^] when empty classes are permitted
444*22dc650dSSadaf Ebrahimi(PCRE2_ALLOW_EMPTY_CLASS is set).
445*22dc650dSSadaf Ebrahimi
446*22dc650dSSadaf Ebrahimi
447*22dc650dSSadaf EbrahimiBacktracking control verbs
448*22dc650dSSadaf Ebrahimi--------------------------
449*22dc650dSSadaf Ebrahimi
450*22dc650dSSadaf EbrahimiVerbs with no arguments generate opcodes with no following data (as listed
451*22dc650dSSadaf Ebrahimiin the section above).
452*22dc650dSSadaf Ebrahimi
453*22dc650dSSadaf Ebrahimi(*MARK:NAME) generates OP_MARK followed by the mark name, preceded by a
454*22dc650dSSadaf Ebrahimilength in one code unit, and followed by a binary zero. The name length is
455*22dc650dSSadaf Ebrahimilimited by the size of the code unit.
456*22dc650dSSadaf Ebrahimi
457*22dc650dSSadaf Ebrahimi(*ACCEPT:NAME) and (*FAIL:NAME) are compiled as (*MARK:NAME)(*ACCEPT) and
458*22dc650dSSadaf Ebrahimi(*MARK:NAME)(*FAIL) respectively.
459*22dc650dSSadaf Ebrahimi
460*22dc650dSSadaf EbrahimiFor (*COMMIT:NAME), (*PRUNE:NAME), (*SKIP:NAME), and (*THEN:NAME), the opcodes
461*22dc650dSSadaf EbrahimiOP_COMMIT_ARG, OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG are used, with the
462*22dc650dSSadaf Ebrahiminame following in the same format as for OP_MARK.
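
As an illustration of the OP_MARK layout described above, here is a minimal
sketch for the 8-bit library only. The function name and the opcode value are
made up for this example; the real opcode numbers are in pcre2_internal.h, and
this is not the code that PCRE2 itself uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative value only; not the real opcode number. */
enum { SKETCH_OP_MARK = 0x99 };

/* Writes [opcode] [length] [name...] [0] as 8-bit code units.
   Returns the number of code units written, or 0 if the name does not fit
   in a one-unit length or the buffer is too small. */
size_t sketch_put_mark(uint8_t *out, size_t outsize, const char *name)
{
  size_t len;
  assert(out != NULL && name != NULL);
  len = strlen(name);
  if (len > 255) return 0;           /* length must fit in one 8-bit unit */
  if (outsize < len + 3) return 0;   /* opcode + length + name + zero */
  out[0] = SKETCH_OP_MARK;
  out[1] = (uint8_t)len;
  memcpy(out + 2, name, len);
  out[2 + len] = 0;
  return len + 3;
}
```

The same shape would apply to the OP_xxx_ARG opcodes, with only the first code
unit differing.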


Matching literal characters
---------------------------

The OP_CHAR opcode is followed by a single character that is to be matched
casefully. For caseless matching of characters that have at most two
case-equivalent code points, OP_CHARI is used. In UTF-8 or UTF-16 modes, the
character may be more than one code unit long. In UTF-32 mode, characters are
always exactly one code unit long.

If there is only one character in a character class, OP_CHAR or OP_CHARI is
used for a positive class, and OP_NOT or OP_NOTI for a negative one (that is,
for something like [^a]).

Caseless matching (positive or negative) of characters that have more than two
case-equivalent code points (which is possible only in UTF mode) is handled by
compiling a Unicode property item (see below), with the pseudo-property
PT_CLIST. The value of this property is an offset in a vector called
"ucd_caseless_sets" which identifies the start of a short list of case
equivalent characters, terminated by the value NOTACHAR (0xffffffff).
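
The PT_CLIST lookup can be pictured with the following sketch. The table here
is a small hand-made stand-in, not the real contents of ucd_caseless_sets, and
the function name is hypothetical. The two sets shown (K, k, KELVIN SIGN and
S, s, LATIN SMALL LETTER LONG S) are genuine examples of characters with more
than two case-equivalent code points.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NOTACHAR 0xffffffffu

/* Illustrative table: each list is terminated by SKETCH_NOTACHAR. */
static const uint32_t sketch_caseless_sets[] = {
  SKETCH_NOTACHAR,                          /* offset 0: empty list */
  0x004b, 0x006b, 0x212a, SKETCH_NOTACHAR,  /* offset 1: K, k, KELVIN SIGN */
  0x0053, 0x0073, 0x017f, SKETCH_NOTACHAR   /* offset 5: S, s, LONG S */
};

/* Returns true if c is in the list that starts at the given offset. */
bool sketch_in_caseless_set(uint32_t c, size_t offset)
{
  const uint32_t *p;
  assert(offset < sizeof(sketch_caseless_sets) / sizeof(sketch_caseless_sets[0]));
  p = sketch_caseless_sets + offset;
  while (*p != SKETCH_NOTACHAR) {
    if (*p++ == c) return true;
  }
  return false;
}
```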


Repeating single characters
---------------------------

The common repeats (*, +, ?), when applied to a single character, use the
following opcodes, which come in caseful and caseless versions:

  Caseful         Caseless
  OP_STAR         OP_STARI
  OP_MINSTAR      OP_MINSTARI
  OP_POSSTAR      OP_POSSTARI
  OP_PLUS         OP_PLUSI
  OP_MINPLUS      OP_MINPLUSI
  OP_POSPLUS      OP_POSPLUSI
  OP_QUERY        OP_QUERYI
  OP_MINQUERY     OP_MINQUERYI
  OP_POSQUERY     OP_POSQUERYI

Each opcode is followed by the character that is to be repeated. In ASCII or
UTF-32 modes, these are two-code-unit items; in UTF-8 or UTF-16 modes, the
length is variable. Those with "MIN" in their names are the minimizing
versions. Those with "POS" in their names are possessive versions. Other kinds
of repeat make use of these opcodes:

  Caseful         Caseless
  OP_UPTO         OP_UPTOI
  OP_MINUPTO      OP_MINUPTOI
  OP_POSUPTO      OP_POSUPTOI
  OP_EXACT        OP_EXACTI

Each of these is followed by a count and then the repeated character. The count
is two bytes long in 8-bit mode (most significant byte first), or one code unit
in 16-bit and 32-bit modes.

OP_UPTO matches from 0 to the given number. A repeat with a non-zero minimum
and a fixed maximum is coded as an OP_EXACT followed by an OP_UPTO (or
OP_MINUPTO or OP_POSUPTO).

Another set of matching repeating opcodes (called OP_NOTSTAR, OP_NOTSTARI,
etc.) is used for repeated, negated, single-character classes such as [^a]*.
The normal single-character opcodes (OP_STAR, etc.) are used for repeated
positive single-character classes.
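
To make the OP_EXACT plus OP_UPTO coding concrete, the sketch below lays out
x{2,5} for the 8-bit library with a single-byte character. It assumes that the
OP_UPTO count is the maximum minus the minimum, which follows from OP_UPTO
matching from 0 to the given number. The opcode values and function names are
invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative values only; not the real opcode numbers. */
enum { SKETCH_OP_UPTO = 1, SKETCH_OP_EXACT = 2 };

/* Writes [opcode] [count, most significant byte first] [character]. */
static size_t sketch_put_counted(uint8_t *out, uint8_t op, unsigned count,
                                 uint8_t ch)
{
  out[0] = op;
  out[1] = (uint8_t)(count >> 8);
  out[2] = (uint8_t)(count & 0xff);
  out[3] = ch;
  return 4;
}

/* Encodes a greedy, caseful ch{min,max} with 0 < min < max <= 65535.
   The output buffer must have room for 8 bytes. */
size_t sketch_put_bounded_repeat(uint8_t *out, unsigned min, unsigned max,
                                 uint8_t ch)
{
  size_t n;
  assert(out != NULL && min > 0 && max > min && max <= 0xffff);
  n = sketch_put_counted(out, SKETCH_OP_EXACT, min, ch);
  n += sketch_put_counted(out + n, SKETCH_OP_UPTO, max - min, ch);
  return n;
}
```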


Repeating character types
-------------------------

Repeats of things like \d are done exactly as for single characters, except
that instead of a character, the opcode for the type (e.g. OP_DIGIT) is stored
in the next code unit. The opcodes are:

  OP_TYPESTAR
  OP_TYPEMINSTAR
  OP_TYPEPOSSTAR
  OP_TYPEPLUS
  OP_TYPEMINPLUS
  OP_TYPEPOSPLUS
  OP_TYPEQUERY
  OP_TYPEMINQUERY
  OP_TYPEPOSQUERY
  OP_TYPEUPTO
  OP_TYPEMINUPTO
  OP_TYPEPOSUPTO
  OP_TYPEEXACT


Match by Unicode property
-------------------------

OP_PROP and OP_NOTPROP are used for positive and negative matches of a
character by testing its Unicode property (the \p and \P escape sequences).
Each is followed by two code units that encode the desired property as a type
and a value. The types are a set of #defines of the form PT_xxx, and the values
are enumerations of the form ucp_xx, defined in the pcre2_ucp.h source file.
The value is relevant only for PT_GC (General Category), PT_PC (Particular
Category), PT_SC (Script), PT_BIDICL (Bidi Class), PT_BOOL (Boolean property),
and the pseudo-property PT_CLIST, which is used to identify a list of
case-equivalent characters when there are three or more (see above).

Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
three code units: OP_PROP or OP_NOTPROP, and then the desired property type and
value.


Character classes
-----------------

If there is only one character in a class, OP_CHAR or OP_CHARI is used for a
positive class, and OP_NOT or OP_NOTI for a negative one (that is, for
something like [^a]), except when caselessly matching a character that has more
than two case-equivalent code points (which can happen only in UTF mode). In
this case a Unicode property item is used, as described above in "Matching
literal characters".

A set of repeating opcodes (called OP_NOTSTAR etc.) are used for repeated,
negated, single-character classes. The normal single-character opcodes
(OP_STAR, etc.) are used for repeated positive single-character classes.

When there is more than one character in a class, and all the code points are
less than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a
negative one. In either case, the opcode is followed by a 32-byte (16-short,
8-word) bit map containing a 1 bit for every character that is acceptable. The
bits are counted from the least significant end of each unit. In caseless mode,
bits for both cases are set.

The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 and
16-bit and 32-bit modes, subject characters with values greater than 255 can be
handled correctly. For OP_CLASS they do not match, whereas for OP_NCLASS they
do.
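
A rough sketch of the bit map test, and of the difference between OP_CLASS and
OP_NCLASS for code points above 255, is given below. It treats the map as 32
bytes, which is an assumption made for simplicity, and all the names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t bits[32]; } sketch_classmap;

void sketch_map_clear(sketch_classmap *m)
{
  assert(m != NULL);
  memset(m->bits, 0, sizeof m->bits);
}

/* Marks c as acceptable; bits count from the least significant end. */
void sketch_map_add(sketch_classmap *m, uint8_t c)
{
  assert(m != NULL);
  m->bits[c / 8] |= (uint8_t)(1u << (c & 7));
}

/* For code points above 255 the map cannot say anything, so the answer
   depends on the opcode: no match for a positive class (OP_CLASS), a match
   for a negative one (OP_NCLASS). */
bool sketch_map_match(const sketch_classmap *m, uint32_t c, bool is_nclass)
{
  assert(m != NULL);
  if (c > 255) return is_nclass;
  return (m->bits[c / 8] & (1u << (c & 7))) != 0;
}
```

For a negative class such as [^ab], the map would be filled with every value
from 0 to 255 except 'a' and 'b', because the map always lists the characters
that are acceptable.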

For classes containing characters with values greater than 255 or that contain
\p or \P, OP_XCLASS is used. It optionally uses a bit map if any acceptable
code points are less than 256, followed by a list of pairs (for a range) and/or
single characters and/or properties. In caseless mode, all equivalent
characters are explicitly listed.

OP_XCLASS is followed by a LINK_SIZE value containing the total length of the
opcode and its data. This is followed by a code unit containing flag bits:
XCL_NOT indicates that this is a negative class, and XCL_MAP indicates that a
bit map is present. There follows the bit map, if XCL_MAP is set, and then a
sequence of items coded as follows:

  XCL_END      marks the end of the list
  XCL_SINGLE   one character follows
  XCL_RANGE    two characters follow
  XCL_PROP     a Unicode property (type, value) follows
  XCL_NOTPROP  a Unicode property (type, value) follows

If a range starts with a code point less than 256 and ends with one greater
than 255, it is split into two ranges, with characters less than 256 being
indicated in the bit map, and the rest with XCL_RANGE.

When XCL_NOT is set, the bit map, if present, contains bits for characters that
are allowed (exactly as for OP_NCLASS), but the list of items that follow it
specifies characters and properties that are not allowed.
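
The interplay between XCL_NOT, the bit map, and the item list can be sketched
as follows. This is a simplified model, not the real matching code: it assumes
32-bit code units (so each character is one unit), takes the map and the
negation flag as separate arguments rather than decoding the flag unit, and
leaves out XCL_PROP and XCL_NOTPROP. With no property items in the list, it
also assumes that the map alone decides code points below 256. The XCL values
used here are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative values; the real XCL_xxx definitions are in
   pcre2_internal.h. */
enum { SKETCH_XCL_END = 0, SKETCH_XCL_SINGLE = 1, SKETCH_XCL_RANGE = 2 };

/* map:     32-byte bit map, or NULL when XCL_MAP is not set
   list:    items terminated by SKETCH_XCL_END
   negated: true when XCL_NOT is set */
bool sketch_xclass_match(const uint8_t *map, const uint32_t *list,
                         bool negated, uint32_t c)
{
  assert(list != NULL);

  /* The map always holds the characters that are allowed. */
  if (c < 256) {
    if (map == NULL) return negated;
    return (map[c / 8] & (1u << (c & 7))) != 0;
  }

  /* The list holds what is allowed for a positive class, and what is not
     allowed for a negative one. */
  for (;;) {
    switch (*list++) {
      case SKETCH_XCL_SINGLE:
        if (c == *list++) return !negated;
        break;
      case SKETCH_XCL_RANGE:
        if (c >= list[0] && c <= list[1]) return !negated;
        list += 2;
        break;
      default:  /* SKETCH_XCL_END */
        return negated;
    }
  }
}
```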


Back references
---------------

OP_REF (caseful) or OP_REFI (caseless) is followed by a count containing the
reference number when the reference is to a unique capturing group (either by
number or by name). When named groups are used, there may be more than one
group with the same name. In this case, a reference to such a group by name
generates OP_DNREF or OP_DNREFI. These are followed by two counts: the index
(not the byte offset) in the group name table of the first entry for the
required name, followed by the number of groups with the same name. The
matching code can then search for the first one that is set.
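
The search that OP_DNREF makes possible might look like the sketch below. The
flat arrays stand in for the real name table and for the record of which
groups have captured something; their layout and the function name are
assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* group_of_entry[i] is the group number held in name table entry i, with
   entries for the same name adjacent to each other. is_set[g] is nonzero
   if group g has captured something. Returns the number of the first group
   that is set, or 0 if none of them is. */
unsigned sketch_dnref_first_set(const uint16_t *group_of_entry,
                                const unsigned char *is_set,
                                unsigned index, unsigned count)
{
  assert(group_of_entry != NULL && is_set != NULL);
  for (unsigned i = 0; i < count; i++) {
    unsigned g = group_of_entry[index + i];
    if (is_set[g]) return g;
  }
  return 0;
}
```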


Repeating character classes and back references
-----------------------------------------------

Single-character classes are handled specially (see above). This section
applies to other classes and also to back references. In both cases, the repeat
information follows the base item. The matching code looks at the following
opcode to see if it is one of these:

  OP_CRSTAR
  OP_CRMINSTAR
  OP_CRPOSSTAR
  OP_CRPLUS
  OP_CRMINPLUS
  OP_CRPOSPLUS
  OP_CRQUERY
  OP_CRMINQUERY
  OP_CRPOSQUERY
  OP_CRRANGE
  OP_CRMINRANGE
  OP_CRPOSRANGE

All but the last three are single-code-unit items, with no data. The range
opcodes are followed by the minimum and maximum repeat counts.


Brackets and alternation
------------------------

A pair of non-capturing round brackets is wrapped round each expression at
compile time, so alternation always happens in the context of brackets.

[Note for North Americans: "bracket" to some English speakers, including
myself, can be round, square, curly, or pointy. Hence this usage rather than
"parentheses".]

Non-capturing brackets use the opcode OP_BRA, capturing brackets use OP_CBRA. A
bracket opcode is followed by a LINK_SIZE value which gives the offset to the
next alternative OP_ALT or, if there aren't any branches, to the terminating
opcode. Each OP_ALT is followed by a LINK_SIZE value giving the offset to the
next one, or to the final opcode. For capturing brackets, the bracket number is
a count that immediately follows the offset.

There are several opcodes that mark the end of a subpattern group. OP_KET is
used for subpatterns that do not repeat indefinitely, OP_KETRMIN and
OP_KETRMAX are used for indefinite repetitions, minimally or maximally
respectively, and OP_KETRPOS for possessive repetitions (see below for more
details). All four are followed by a LINK_SIZE value giving (as a positive
number) the offset back to the matching opening bracket opcode.
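
The way these links chain together can be followed with a small sketch. It
assumes the 8-bit library with a LINK_SIZE of 2 and links stored most
significant byte first. The opcode values and function names are invented for
the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative values only; not the real opcode numbers. */
enum { SKETCH_OP_CHAR = 1, SKETCH_OP_BRA = 2, SKETCH_OP_ALT = 3,
       SKETCH_OP_KET = 4 };

/* Reads the two-byte link that follows the opcode at p. */
static unsigned sketch_get_link(const uint8_t *p)
{
  return ((unsigned)p[1] << 8) | p[2];
}

/* Counts the branches of the non-capturing group that starts at
   code[start], then checks that the link after OP_KET leads back to the
   opening bracket. Returns 0 if the chain is not well formed. */
unsigned sketch_count_branches(const uint8_t *code, size_t start)
{
  size_t pos = start;
  unsigned branches = 0;
  assert(code != NULL);
  if (code[pos] != SKETCH_OP_BRA) return 0;
  do {
    branches++;
    pos += sketch_get_link(code + pos);   /* to next OP_ALT or to OP_KET */
  } while (code[pos] == SKETCH_OP_ALT);
  if (code[pos] != SKETCH_OP_KET) return 0;
  return (pos - sketch_get_link(code + pos) == start) ? branches : 0;
}
```

Under these assumptions, (?:a|bc) would be laid out as OP_BRA [0] [5],
OP_CHAR a, OP_ALT [0] [7], OP_CHAR b, OP_CHAR c, OP_KET [0] [12], and the
function reports two branches.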

If a subpattern is quantified such that it is permitted to match zero times, it
is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
single-unit opcodes that tell the matcher that skipping the following
subpattern entirely is a valid match. In the case of the first two, not
skipping the pattern is also valid (greedy and non-greedy). The third is used
when a pattern has the quantifier {0,0}. It cannot be entirely discarded,
because it may be called as a subroutine from elsewhere in the pattern.

A subpattern with an indefinite maximum repetition is replicated in the
compiled data its minimum number of times (or once with OP_BRAZERO if the
minimum is zero), with the final copy terminating with OP_KETRMIN or OP_KETRMAX
as appropriate.

A subpattern with a bounded maximum repetition is replicated in a nested
fashion up to the maximum number of times, with OP_BRAZERO or OP_BRAMINZERO
before each replication after the minimum, so that, for example, (abc){2,5} is
compiled as (abc)(abc)((abc)((abc)(abc)?)?)?, except that each bracketed group
has the same number.
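
The nesting in that example can be reproduced mechanically. The sketch below
builds only the textual equivalent of a greedy bounded repeat; it says nothing
about the compiled opcodes themselves, and the function names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Appends s to buf if it fits; returns 0 on overflow. */
static int sketch_append(char *buf, size_t size, const char *s)
{
  if (strlen(buf) + strlen(s) + 1 > size) return 0;
  strcat(buf, s);
  return 1;
}

/* Writes the nested equivalent of group{min,max} into buf.
   Returns 1 on success, 0 if max < min or the buffer is too small. */
int sketch_expand_repeat(char *buf, size_t size, const char *group,
                         unsigned min, unsigned max)
{
  unsigned optional, i;
  assert(buf != NULL && group != NULL);
  if (size == 0 || max < min) return 0;
  optional = max - min;
  buf[0] = '\0';

  /* The mandatory copies come first. */
  for (i = 0; i < min; i++)
    if (!sketch_append(buf, size, group)) return 0;

  /* Every optional copy except the innermost opens a wrapper bracket. */
  for (i = 1; i < optional; i++)
    if (!sketch_append(buf, size, "(") ||
        !sketch_append(buf, size, group)) return 0;

  /* The innermost optional copy is the group itself followed by "?". */
  if (optional > 0 &&
      (!sketch_append(buf, size, group) || !sketch_append(buf, size, "?")))
    return 0;

  /* Close the wrappers, each of which is optional as a whole. */
  for (i = 1; i < optional; i++)
    if (!sketch_append(buf, size, ")?")) return 0;
  return 1;
}
```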

When a repeated subpattern has an unbounded upper limit, it is checked to see
whether it could match an empty string. If this is the case, the opcode in the
final replication is changed to OP_SBRA or OP_SCBRA. This tells the matcher
that it needs to check for matching an empty string when it hits OP_KETRMIN or
OP_KETRMAX, and if so, to break the loop.


Possessive brackets
-------------------

When a repeated group (capturing or non-capturing) is marked as possessive by
the "+" notation, e.g. (abc)++, different opcodes are used. Their names all
have POS on the end, e.g. OP_BRAPOS instead of OP_BRA and OP_SCBRAPOS instead
of OP_SCBRA. The end of such a group is marked by OP_KETRPOS. If the minimum
repetition is zero, the group is preceded by OP_BRAPOSZERO.


Once-only (atomic) groups
-------------------------

These are just like other subpatterns, but they start with the opcode OP_ONCE.
The check for matching an empty string in an unbounded repeat is handled
entirely at runtime, so there is just this one opcode for atomic groups.


Assertions
----------

Forward assertions are also just like other subpatterns, but starting with one
of the opcodes OP_ASSERT, OP_ASSERT_NA (non-atomic assertion), or
OP_ASSERT_NOT.

Backward assertions use the opcodes OP_ASSERTBACK, OP_ASSERTBACK_NA, and
OP_ASSERTBACK_NOT. If all the branches of a backward assertion are of fixed
length (not necessarily the same), the first opcode inside each branch is
OP_REVERSE, followed by an IMM2_SIZE count of the number of characters to move
back the pointer in the subject string, thus allowing each branch to have a
different (but fixed) length.

Variable-length backward assertions whose maximum matching length is limited
are also supported. For such assertions, the first opcode inside each branch is
OP_VREVERSE, followed by the minimum and maximum lengths for that branch,
unless these happen to be equal, in which case OP_REVERSE is used. These
IMM2_SIZE values occupy two code units each in 8-bit mode, and one code unit in
16-bit and 32-bit modes.

In ASCII or UTF-32 mode, the character counts in OP_REVERSE and OP_VREVERSE are
also the number of code units, but in UTF-8/16 mode each character may occupy
more than one code unit.
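
The difference between characters and code units matters when the pointer is
moved back. Here is a sketch of stepping back a given number of characters in
a UTF-8 subject. It assumes the subject is valid UTF-8, so that every byte of
the form 10xxxxxx is a continuation byte; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Moves back nchars characters from p, never going before start.
   Returns the new position, or NULL if there are fewer than nchars
   characters between start and p. */
const uint8_t *sketch_utf8_back(const uint8_t *start, const uint8_t *p,
                                unsigned nchars)
{
  assert(start != NULL && p != NULL && p >= start);
  while (nchars-- > 0) {
    if (p == start) return NULL;
    p--;
    /* Skip continuation bytes to reach the first byte of the character. */
    while (p > start && (*p & 0xc0) == 0x80) p--;
  }
  return p;
}
```

In a non-UTF or UTF-32 subject the same operation is plain pointer
subtraction, because each character is exactly one code unit.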


Conditional subpatterns
-----------------------

These are like other subpatterns, but they start with the opcode OP_COND, or
OP_SCOND for one that might match an empty string in an unbounded repeat.

If the condition is a back reference, this is stored at the start of the
subpattern using the opcode OP_CREF followed by a count containing the
reference number, provided that the reference is to a unique capturing group.
If the reference was by name and there is more than one group with that name,
OP_DNCREF is used instead. It is followed by two counts: the index in the group
names table, and the number of groups with the same name. This allows the
matcher to check if any group with the given name is set.

If the condition is "in recursion" (coded as "(?(R)"), or "in recursion of
group x" (coded as "(?(Rx)"), the group number is stored at the start of the
subpattern using the opcode OP_RREF (with a value of RREF_ANY (0xffff) for "the
whole pattern") or OP_DNRREF (with data as for OP_DNCREF).

For a DEFINE condition, OP_FALSE is used (with no associated data). During
compilation, however, a DEFINE condition is coded as OP_DEFINE so that, when
the conditional group is complete, there can be a check to ensure that it
contains only one top-level branch. Once this has happened, the opcode is
changed to OP_FALSE, so the matcher never sees OP_DEFINE.

There is a special PCRE2-specific condition of the form (VERSION[>]=x.y), which
tests the PCRE2 version number. This compiles into one of the opcodes OP_TRUE
or OP_FALSE.

If a condition is not a back reference, recursion test, DEFINE, or VERSION, it
must start with a parenthesized atomic assertion, whose opcode normally
immediately follows OP_COND or OP_SCOND. However, if automatic callouts are
enabled, a callout is inserted immediately before the assertion. It is also
possible to insert a manual callout at this point. Only assertion conditions
may have callouts preceding the condition.

A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
parts of the pattern, so this is another opcode that may appear as a condition.
It is treated the same as OP_FALSE.


Recursion
---------

Recursion either matches the current pattern, or some subexpression. The opcode
OP_RECURSE is followed by a LINK_SIZE value that is the offset to the starting
bracket from the start of the whole pattern. OP_RECURSE is also used for
"subroutine" calls, even though they are not strictly a recursion. Until
release 10.30, recursions were treated as atomic groups, making them
incompatible with Perl (but PCRE had them well before Perl did). From 10.30,
backtracking into recursions is supported.

Repeated recursions used to be wrapped inside OP_ONCE brackets, which not only
forced no backtracking, but also allowed repetition to be handled as for other
bracketed groups. From 10.30 onwards, repeated recursions are duplicated for
their minimum repetitions, and then wrapped in non-capturing brackets for the
remainder. For example, (?1){3} is treated as (?1)(?1)(?1), and (?1){2,4} is
treated as (?1)(?1)(?:(?1)){0,2}.
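
The rewriting of a repeated recursion can be mimicked at the text level, as in
the sketch below. It reproduces only the two examples given above and is not a
description of how the compiler does the work; the function name is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Writes the equivalent of call{min,max} into buf: min plain copies of the
   call, followed by a non-capturing group that repeats the remainder.
   Returns 1 on success, 0 if max < min or the buffer is too small. */
int sketch_expand_recursion(char *buf, size_t size, const char *call,
                            unsigned min, unsigned max)
{
  size_t used = 0;
  size_t clen;
  assert(buf != NULL && call != NULL);
  if (size == 0 || max < min) return 0;
  clen = strlen(call);
  buf[0] = '\0';

  for (unsigned i = 0; i < min; i++) {
    if (clen >= size - used) return 0;
    memcpy(buf + used, call, clen + 1);   /* includes the terminating zero */
    used += clen;
  }

  if (max > min) {
    int n = snprintf(buf + used, size - used, "(?:%s){0,%u}", call,
                     max - min);
    if (n < 0 || (size_t)n >= size - used) return 0;
  }
  return 1;
}
```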


Callouts
--------

A callout may have either a numerical argument or a string argument. These use
OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case the opcode is followed
by two LINK_SIZE values: the offset in the pattern string to the start of the
following item, and the length of that item. These values make it possible for
pcre2test to output useful tracing information using callouts.

In the case of a numeric callout, after these two values there is a single code
unit containing the callout number, in the range 0-255, with 255 being used for
callouts that are automatically inserted as a result of the PCRE2_AUTO_CALLOUT
option. Thus, this opcode item is of fixed length:

  [OP_CALLOUT] [PATTERN_OFFSET] [PATTERN_LENGTH] [NUMBER]

For callouts with string arguments, OP_CALLOUT_STR has three more data items:
a LINK_SIZE value giving the complete length of the entire opcode item, a
LINK_SIZE item containing the offset within the pattern string to the start of
the string argument, and the string itself, preceded by its starting delimiter
and followed by a binary zero. When a callout function is called, a pointer to
the actual string is passed, but the delimiter can be accessed as string[-1] if
the application needs it. In the 8-bit library, the callout in /X(?C'abc')Y/ is
compiled as the following bytes (decimal numbers represent binary values):

  [OP_CALLOUT_STR]  [0] [10]  [0] [1]  [0] [14]  [0] [5] ['] [a] [b] [c] [0]
                    --------  -------  --------  -------
                       |         |        |         |
                       ------- LINK_SIZE items ------
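
The byte sequence in that diagram can be rebuilt with the sketch below, which
assumes the 8-bit library, a LINK_SIZE of 2, and links stored most significant
byte first. The opcode value and the function names are made up for the
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative value only; not the real opcode number. */
enum { SKETCH_OP_CALLOUT_STR = 0x77 };
#define SKETCH_LINK_SIZE 2

static uint8_t *sketch_put_link(uint8_t *p, unsigned v)
{
  p[0] = (uint8_t)(v >> 8);
  p[1] = (uint8_t)(v & 0xff);
  return p + SKETCH_LINK_SIZE;
}

/* Lays out one OP_CALLOUT_STR item. Returns its total length in bytes, or
   0 if it does not fit in the buffer or in a two-byte link. */
size_t sketch_put_callout_str(uint8_t *out, size_t outsize,
                              unsigned next_item_offset,
                              unsigned next_item_length,
                              unsigned string_offset,
                              char delimiter, const char *string)
{
  size_t slen, total;
  uint8_t *p = out;
  assert(out != NULL && string != NULL);
  slen = strlen(string);
  /* opcode + four links + delimiter + string + binary zero */
  total = 1 + 4 * SKETCH_LINK_SIZE + 1 + slen + 1;
  if (total > outsize || total > 0xffff) return 0;
  *p++ = SKETCH_OP_CALLOUT_STR;
  p = sketch_put_link(p, next_item_offset);
  p = sketch_put_link(p, next_item_length);
  p = sketch_put_link(p, (unsigned)total);
  p = sketch_put_link(p, string_offset);
  *p++ = (uint8_t)delimiter;
  memcpy(p, string, slen + 1);
  return total;
}
```

For /X(?C'abc')Y/ the following item is the Y at pattern offset 10 with length
1, and the string argument starts at offset 5, which gives the total of 14
shown in the diagram.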
844*22dc650dSSadaf Ebrahimi
845*22dc650dSSadaf EbrahimiOpcode table checking
846*22dc650dSSadaf Ebrahimi---------------------
847*22dc650dSSadaf Ebrahimi
848*22dc650dSSadaf EbrahimiThe last opcode that is defined in pcre2_internal.h is OP_TABLE_LENGTH. This is
849*22dc650dSSadaf Ebrahiminot a real opcode, but is used to check at compile time that tables indexed by
850*22dc650dSSadaf Ebrahimiopcode are the correct length, in order to catch updating errors.
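
One way to express such a check in C11 is a static assertion on the table
size, as sketched below. This shows the idea only; it makes no claim about the
mechanism that the PCRE2 sources actually use. The enumeration, table, and
function name are invented for the example.

```c
#include <assert.h>

enum sketch_opcode {
  SKETCH_OP_END,
  SKETCH_OP_ANY,
  SKETCH_OP_CHAR,
  SKETCH_OP_TABLE_LENGTH   /* not a real opcode: counts the entries above */
};

/* A table indexed by opcode. If an opcode is added to the enumeration
   without a matching entry here, compilation fails. */
static const unsigned char sketch_op_lengths[] = { 1, 1, 2 };

_Static_assert(sizeof(sketch_op_lengths) / sizeof(sketch_op_lengths[0])
               == SKETCH_OP_TABLE_LENGTH,
               "opcode table length does not match the opcode list");

unsigned sketch_op_length(enum sketch_opcode op)
{
  assert(op < SKETCH_OP_TABLE_LENGTH);
  return sketch_op_lengths[op];
}
```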

Philip Hazel
November 2023