Technical notes about PCRE2
---------------------------

These are very rough technical notes that record potentially useful information
about PCRE2 internals. PCRE2 is a library based on the original PCRE library,
but with a revised (and incompatible) API. To avoid confusion, the original
library is referred to as PCRE1 below. For information about testing PCRE2, see
the pcre2test documentation and the comment at the head of the RunTest file.

PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
releases carried on the 8.xx series, up to the final 8.45 release. PCRE2
releases started at 10.00 to avoid confusion with PCRE1.


Historical note 1
-----------------

Many years ago I implemented some regular expression functions to an algorithm
suggested by Martin Richards. The rather simple patterns were not Unix-like in
form, and were quite restricted in what they could do by comparison with Perl.
The interesting part about the algorithm was that the amount of space required
to hold the compiled form of an expression was known in advance.
The code to apply an expression did not operate by backtracking, as the
original Henry Spencer code and current PCRE2 and Perl code do, but instead
checked all possibilities simultaneously by keeping a list of current states
and checking all of them as it advanced through the subject string. In the
terminology of Jeffrey Friedl's book, it was a "DFA algorithm", though it was
not a traditional Finite State Machine (FSM). When the pattern was all used up,
all remaining states were possible matches, and the one matching the longest
subset of the subject string was chosen. This did not necessarily maximize the
individual wild portions of the pattern, as is expected in Unix and Perl-style
regular expressions.


Historical note 2
-----------------

By contrast, the code originally written by Henry Spencer (which was
subsequently heavily modified for Perl) compiles the expression twice: once in
a dummy mode in order to find out how much store will be needed, and then for
real. (The Perl version may or may not still do this; I'm talking about the
original library.) The execution function operates by backtracking and
maximizing (or, optionally, minimizing, in Perl) the amount of the subject that
matches individual wild portions of the pattern. This is an "NFA algorithm" in
Friedl's terminology.


OK, here's the real stuff
-------------------------

For the set of functions that formed the original PCRE1 library in 1997 (which
are unrelated to those mentioned above), I tried at first to invent an
algorithm that used an amount of store bounded by a multiple of the number of
characters in the pattern, to save on compiling time. However, because of the
greater complexity in Perl regular expressions, I couldn't do this, even though
the then current Perl 5.004 patterns were much simpler than those supported
nowadays. In any case, a first pass through the pattern is helpful for other
reasons.


Support for 16-bit and 32-bit data strings
------------------------------------------

The PCRE2 library can be compiled in any combination of 8-bit, 16-bit or 32-bit
modes, creating up to three different libraries. In the description that
follows, the word "short" is used for a 16-bit data quantity, and the phrase
"code unit" is used for a quantity that is a byte in 8-bit mode, a short in
16-bit mode and a 32-bit word in 32-bit mode. The names of PCRE2 functions are
given in generic form, without the _8, _16, or _32 suffix.


Computing the memory requirement: how it was
--------------------------------------------

Up to and including release 6.7, PCRE1 worked by running a very degenerate
first pass to calculate a maximum memory requirement, and then a second pass to
do the real compile - which might use a bit less than the predicted amount of
memory. The idea was that this would turn out faster than the Henry Spencer
code because the first pass is degenerate and the second pass can just store
stuff straight into memory, which it knows is big enough.


Computing the memory requirement: how it is
-------------------------------------------

By the time I was working on a potential 6.8 release, the degenerate first pass
had become very complicated and hard to maintain. Indeed one of the early
things I did for 6.8 was to fix Yet Another Bug in the memory computation. Then
I had a flash of inspiration as to how I could run the real compile function in
a "fake" mode that enables it to compute how much memory it would need, while
in most cases only ever using a small amount of working memory, and without too
many tests of the mode that might slow it down. So I refactored the compiling
functions to work this way. This got rid of about 600 lines of source and made
further maintenance and development easier.
As this was such a major change, I never released 6.8, instead upping the
number to 7.0 (other quite major changes were also present in the 7.0 release).

A side effect of this work was that the previous limit of 200 on the nesting
depth of parentheses was removed. However, there was a downside: compiling ran
more slowly than before (30% or more, depending on the pattern) because it now
did a full analysis of the pattern. My hope was that this would not be a big
issue, and in the event, nobody has commented on it.

At release 8.34, a limit on the nesting depth of parentheses was re-introduced
(default 250, settable at build time) so as to put a limit on the amount of
system stack used by the compile function, which uses recursive function calls
for nested parenthesized groups. This is a safety feature for environments with
small stacks where the patterns are provided by users.


Yet another pattern scan
------------------------

History repeated itself for PCRE2 release 10.20. A number of bugs relating to
named subpatterns had been discovered by fuzzers. Most of these were related to
the handling of forward references when it was not known if the named group was
unique. (References to non-unique names use a different opcode and more
memory.)
The use of duplicate group numbers (the (?| facility) also caused issues.

To get around these problems I adopted a new approach by adding a third pass
over the pattern (really a "pre-pass"), which did nothing other than identify
all the named subpatterns and their corresponding group numbers. This means
that the actual compile (both the memory-computing dummy run and the real
compile) has full knowledge of group names and numbers throughout. Several
dozen lines of messy code were eliminated, though the new pre-pass was not
short. In particular, parsing and skipping over [] classes is complicated.

While working on 10.22 I realized that I could simplify yet again by moving
more of the parsing into the pre-pass, thus avoiding doing it in two places, so
after 10.22 was released, the code underwent yet another big refactoring. This
is how it is from 10.23 onwards:

The function called parse_regex() scans the pattern characters, parsing them
into literal data and meta characters. It converts escapes such as \x{123}
into literals, handles \Q...\E, and skips over comments and non-significant
white space. The result of the scanning is put into a vector of 32-bit unsigned
integers. Values less than 0x80000000 are literal data. Higher values represent
meta-characters. The top 16 bits of such values identify the meta-character,
and these are given names such as META_CAPTURE.
The lower 16 bits are available for data, for example, the capturing group
number. The only situation in which literal data values greater than 0x7fffffff
can appear is when the 32-bit library is running in non-UTF mode. This is
handled by having a special meta-character that is followed by the 32-bit data
value.

The size of the parsed pattern vector, when auto-callouts are not enabled, is
bounded by the length of the pattern (with one exception). The code is written
so that each item in the pattern uses no more vector elements than the number
of code units in the item itself. The exception is the aforementioned large
32-bit number handling. For this reason, 32-bit non-UTF patterns are scanned in
advance to check for such values. When auto-callouts are enabled, the generous
assumption is made that there will be a callout for each pattern code unit
(which of course is only actually true if all code units are literals) plus one
at the end. A default parsed pattern vector is defined on the system stack, to
minimize memory handling, but if this is not big enough, heap memory is used.

As before, the actual compiling function is run twice, the first time to
determine the amount of memory needed for the final compiled pattern. It
now processes the parsed pattern vector, not the pattern itself, although some
of the parsed items refer to strings in the pattern - for example, group
names.
As escapes and comments have already been processed, the code is a bit simpler
than before.

Most errors can be diagnosed during the parsing scan. For those that cannot
(for example, "lookbehind assertion is not fixed length"), the parsed code
contains offsets into the pattern so that the actual compiling code can
report where errors are.


The elements of the parsed pattern vector
-----------------------------------------

The word "offset" below means a code unit offset into the pattern. When
PCRE2_SIZE (which is usually size_t) is no bigger than uint32_t, an offset is
stored in a single parsed pattern element. Otherwise (typically on 64-bit
systems) it occupies two elements. The following meta items occupy just one
element, with no data:

META_ACCEPT           (*ACCEPT)
META_ASTERISK         *
META_ASTERISK_PLUS    *+
META_ASTERISK_QUERY   *?
META_ATOMIC           (?> start of atomic group
META_CIRCUMFLEX       ^ metacharacter
META_CLASS            [ start of non-empty class
META_CLASS_EMPTY      [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS
META_CLASS_EMPTY_NOT  [^] negative empty class - ditto
META_CLASS_END        ] end of non-empty class
META_CLASS_NOT        [^ start non-empty negative class
META_COMMIT           (*COMMIT) - no argument (see below for with argument)
META_COND_ASSERT      (?(?assertion)
META_DOLLAR           $ metacharacter
META_DOT              . metacharacter
META_END              End of pattern (this value is 0x80000000)
META_FAIL             (*FAIL)
META_KET              ) closing parenthesis
META_LOOKAHEAD        (?= start of lookahead
META_LOOKAHEAD_NA     (*napla: start of non-atomic lookahead
META_LOOKAHEADNOT     (?! start of negative lookahead
META_NOCAPTURE        (?: no capture parens
META_PLUS             +
META_PLUS_PLUS        ++
META_PLUS_QUERY       +?
META_PRUNE            (*PRUNE) - no argument (see below for with argument)
META_QUERY            ?
META_QUERY_PLUS       ?+
META_QUERY_QUERY      ??
META_RANGE_ESCAPED    hyphen in class range with at least one escape
META_RANGE_LITERAL    hyphen in class range defined literally
META_SKIP             (*SKIP) - no argument (see below for with argument)
META_THEN             (*THEN) - no argument (see below for with argument)

The two RANGE values occur only in character classes. They are positioned
between two literals that define the start and end of the range. In an EBCDIC
environment it is necessary to know whether either of the range values was
specified as an escape. In an ASCII/Unicode environment the distinction is not
relevant.

The following have data in the lower 16 bits, and may be followed by other data
elements:

META_ALT              | alternation
META_BACKREF          back reference
META_CAPTURE          start of capturing group
META_ESCAPE           non-literal escape sequence
META_RECURSE          recursion call

If the data for META_ALT is non-zero, it is inside a lookbehind, and the data
is the maximum length of its branch (see META_LOOKBEHIND below for more
detail).

META_BACKREF, META_CAPTURE, and META_RECURSE have the capture group number as
their data in the lower 16 bits of the element. META_RECURSE is followed by an
offset, for use in error messages.
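
The split of a parsed pattern element into a meta code (top 16 bits) and data
(lower 16 bits) might be sketched in C as follows. This is an illustration
only: the EX_ names and helper functions are hypothetical, and the value given
to EX_META_CAPTURE is made up for the example. The only value taken from these
notes is META_END (0x80000000).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* EX_META_END mirrors the documented META_END value. EX_META_CAPTURE is a
   hypothetical code chosen for illustration; it is not taken from the PCRE2
   source. */
#define EX_META_END      0x80000000u
#define EX_META_CAPTURE  0x80080000u

/* Elements below 0x80000000 are literal data. */
static bool ex_is_literal(uint32_t element)
{
  return element < EX_META_END;
}

/* The top 16 bits identify the meta item. */
static uint32_t ex_meta_code(uint32_t element)
{
  return element & 0xffff0000u;
}

/* The lower 16 bits carry data, such as a capture group number. */
static uint32_t ex_meta_data(uint32_t element)
{
  return element & 0x0000ffffu;
}

/* Build a "start of capturing group" element for the given group number. */
static uint32_t ex_make_capture(uint32_t group)
{
  assert(group <= 0xffffu);   /* the data must fit in the lower 16 bits */
  return EX_META_CAPTURE | group;
}
```

With these helpers, ex_make_capture(3) yields one element whose code part is
EX_META_CAPTURE and whose data part is 3, while an element such as 0x41 is
treated as literal data.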

META_BACKREF is followed by an offset if the back reference group number is 10
or more. The offsets of the first occurrences of references to groups whose
numbers are less than 10 are put in cb->small_ref_offset[] (only the first
occurrence is useful). On 64-bit systems this avoids using more than two parsed
pattern elements for items such as \3. The offset is used when an error occurs
because the reference is to a non-existent group.

META_ESCAPE has an ESC_xxx value as its data. For ESC_P and ESC_p, the next
element contains the 16-bit type and data property values, packed together.
ESC_g and ESC_k are used only for named references - numerical ones are turned
into META_RECURSE or META_BACKREF as appropriate. ESC_g and ESC_k are followed
by a length and an offset into the pattern to specify the name.

The following have one data item that follows in the next vector element:

META_BIGVALUE         Next is a literal >= META_END
META_POSIX            POSIX class item (data identifies the class)
META_POSIX_NEG        negative POSIX class item (ditto)

The following are followed by a length element, then a number of character code
values (which should match with the length):

META_MARK             (*MARK:xxxx)
META_COMMIT_ARG       (*COMMIT:xxxx)
META_PRUNE_ARG        (*PRUNE:xxxx)
META_SKIP_ARG         (*SKIP:xxxx)
META_THEN_ARG         (*THEN:xxxx)

The following are followed by a length element, then an offset in the pattern
that identifies the name:

META_COND_NAME        (?(<name>) or (?('name') or (?(name)
META_COND_RNAME       (?(R&name)
META_COND_RNUMBER     (?(Rdigits)
META_RECURSE_BYNAME   (?&name)
META_BACKREF_BYNAME   \k'name'

META_COND_RNUMBER is used for names that start with R and continue with digits,
because this is an ambiguous case. It could be a back reference to a group with
that name, or it could be a recursion test on a numbered group.

This one is followed by an offset, for use in error messages, then a number:

META_COND_NUMBER      (?([+-]digits)

The following is followed just by an offset, for use in error messages:

META_COND_DEFINE      (?(DEFINE)

The following are at first also followed just by an offset for use in error
messages. After the lengths of the branches of a lookbehind group have been
checked, the error offset is no longer needed. The lower 16 bits of the main
word are now set to the maximum length of the first branch of the lookbehind
group, and the second word is set to the minimum matching length for a
variable-length lookbehind group, or to LOOKBEHIND_MAX for a group whose
branches are all of fixed length. These values are used when generating
OP_REVERSE or OP_VREVERSE for the first branch. The minimum value is also used
for any subsequent branches because there is only room for one value (the
branch maximum length) in a META_ALT item.

META_LOOKBEHIND       (?<= start of lookbehind
META_LOOKBEHIND_NA    (*naplb: start of non-atomic lookbehind
META_LOOKBEHINDNOT    (?<! start of negative lookbehind

The following are followed by two elements, the minimum and maximum. The
maximum value is limited to 65535 (MAX_REPEAT_COUNT).
A maximum value of "unlimited" is represented by REPEAT_UNLIMITED, which is
bigger than it:

META_MINMAX           {n,m} repeat
META_MINMAX_PLUS      {n,m}+ repeat
META_MINMAX_QUERY     {n,m}? repeat

This one is followed by two elements, giving the new option settings for the
main and extra options, respectively.

META_OPTIONS          (?i) and friends

This one is followed by three elements. The first is 0 for '>' and 1 for '>=';
the next two are the major and minor numbers:

META_COND_VERSION     (?(VERSION<op>x.y)

Callouts are converted into one of two items:

META_CALLOUT_NUMBER   (?C with numerical argument
META_CALLOUT_STRING   (?C with string argument

In both cases, the next two elements contain the offset and length of the next
item in the pattern. Then there is either one callout number, or a length and
an offset for the string argument. The length includes both delimiters.


Traditional matching function
-----------------------------

The "traditional", and original, matching function is called pcre2_match(), and
it implements an NFA algorithm, similar to the original Henry Spencer algorithm
and the way that Perl works.
This is not surprising, since it is intended to be as compatible with Perl as
possible. This is the function most users of PCRE2 will use most of the time.
If PCRE2 is compiled with just-in-time (JIT) support, and studying a compiled
pattern with JIT is successful, the JIT code is run instead of the normal
pcre2_match() code, but the result is the same.


Supplementary matching function
-------------------------------

There is also a supplementary matching function called pcre2_dfa_match(). This
implements a DFA matching algorithm that searches simultaneously for all
possible matches that start at one point in the subject string. (Going back to
my roots: see Historical Note 1 above.) This function interprets the same
compiled pattern data as pcre2_match(); however, not all the facilities are
available, and those that are do not always work in quite the same way. See the
user documentation for details.

The algorithm that is used for pcre2_dfa_match() is not a traditional FSM,
because it may have a number of states active at one time. More work would be
needed at compile time to produce a traditional FSM where only one state is
ever active at once. I believe some other regex matchers work this way. JIT
support is not available for this kind of matching.


Changeable options
------------------

The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL) and
some others may be changed in the middle of patterns by items such as (?i).
Their processing is handled entirely at compile time by generating different
opcodes for the different settings. The runtime functions do not need to keep
track of an option's state.

PCRE2_DUPNAMES, PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE
are tracked and processed during the parsing pre-pass. The others are handled
from META_OPTIONS items during the main compile phase.


Format of compiled patterns
---------------------------

The compiled form of a pattern is a vector of unsigned code units (bytes in
8-bit mode, shorts in 16-bit mode, 32-bit words in 32-bit mode), containing
items of variable length. The first code unit in an item contains an opcode,
and the length of the item is either implicit in the opcode or contained in the
data that follows it.

In many cases listed below, LINK_SIZE data values are specified for offsets
within the compiled pattern. LINK_SIZE always specifies a number of bytes.
The default value for LINK_SIZE is 2, except for the 32-bit library, where it
can only be 4. The 8-bit library can be compiled to use 3-byte or 4-byte
values, and the 16-bit library can be compiled to use 4-byte values, though
this impairs performance. Specifying a LINK_SIZE larger than 2 for these
libraries is necessary only when patterns whose compiled length is greater than
65535 code units are going to be processed. When a LINK_SIZE value uses more
than one code unit, the most significant unit is first.

In this description, we assume the "normal" compilation options. Data values
that are counts (e.g. quantifiers) are always two bytes long in 8-bit mode
(most significant byte first), and one code unit in 16-bit and 32-bit modes.
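
As an illustration of the "most significant unit first" layout, here is a
minimal sketch for the 8-bit library with the default LINK_SIZE of 2. The
names ex_put_link() and ex_get_link() are invented for this example and are
not the names used in the PCRE2 source. The same two-byte layout applies to
counts in 8-bit mode.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 8-bit code units and a link size of 2 bytes. */
#define EX_LINK_SIZE 2

/* Store an offset in EX_LINK_SIZE bytes, most significant byte first. */
static void ex_put_link(uint8_t *code, uint32_t value)
{
  assert(value <= 0xffffu);   /* larger offsets need a bigger link size */
  code[0] = (uint8_t)(value >> 8);
  code[1] = (uint8_t)(value & 0xffu);
}

/* Read back an offset stored by ex_put_link(). */
static uint32_t ex_get_link(const uint8_t *code)
{
  return ((uint32_t)code[0] << 8) | (uint32_t)code[1];
}
```

Storing 0x1234 writes the bytes 0x12 then 0x34. A two-byte link cannot hold a
value above 65535, which is why larger compiled patterns need a larger
LINK_SIZE.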


Opcodes with no following data
------------------------------

These items are all just one code unit long:

  OP_END                 end of pattern
  OP_ANY                 match any one character other than newline
  OP_ALLANY              match any one character, including newline
  OP_ANYBYTE             match any single code unit, even in UTF-8/16 mode
  OP_SOD                 match start of data: \A
  OP_SOM                 start of match (subject + offset): \G
  OP_SET_SOM             set start of match (\K)
  OP_CIRC                ^ (start of data)
  OP_CIRCM               ^ multiline mode (start of data or after newline)
  OP_NOT_WORD_BOUNDARY   \B
  OP_WORD_BOUNDARY       \b
  OP_NOT_DIGIT           \D
  OP_DIGIT               \d
  OP_NOT_HSPACE          \H
  OP_HSPACE              \h
  OP_NOT_WHITESPACE      \S
  OP_WHITESPACE          \s
  OP_NOT_VSPACE          \V
  OP_VSPACE              \v
  OP_NOT_WORDCHAR        \W
  OP_WORDCHAR            \w
  OP_EODN                match end of data or newline at end: \Z
  OP_EOD                 match end of data: \z
  OP_DOLL                $ (end of data, or before final newline)
  OP_DOLLM               $ multiline mode (end of data or before newline)
  OP_EXTUNI              match an extended Unicode grapheme cluster
  OP_ANYNL               match any Unicode newline sequence

  OP_ASSERT_ACCEPT       )
  OP_ACCEPT              ) These are Perl 5.10's "backtracking control
  OP_COMMIT              ) verbs". If OP_ACCEPT is inside capturing
  OP_FAIL                ) parentheses, it may be preceded by one or more
  OP_PRUNE               ) OP_CLOSE, each followed by a number that
  OP_SKIP                ) indicates which parentheses must be closed.
  OP_THEN                )

OP_ASSERT_ACCEPT is used when (*ACCEPT) is encountered within an assertion.
This ends the assertion, not the entire pattern match. The assertion (?!) is
always optimized to OP_FAIL.

OP_ALLANY is used for '.' when PCRE2_DOTALL is set. It is also used for \C in
non-UTF modes and in UTF-32 mode (since one code unit still equals one
character). Another use is for [^] when empty classes are permitted
(PCRE2_ALLOW_EMPTY_CLASS is set).


Backtracking control verbs
--------------------------

Verbs with no arguments generate opcodes with no following data (as listed
in the section above).

(*MARK:NAME) generates OP_MARK followed by the mark name, preceded by a
length in one code unit, and followed by a binary zero. The name length is
limited by the size of the code unit.

(*ACCEPT:NAME) and (*FAIL:NAME) are compiled as (*MARK:NAME)(*ACCEPT) and
(*MARK:NAME)(*FAIL) respectively.
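
The layout described for OP_MARK (opcode, one-code-unit length, name, binary
zero) might look like this in the 8-bit library. This is a sketch under
assumptions: ex_emit_mark() is a hypothetical helper, and EX_OP_MARK is an
arbitrary stand-in for the real opcode value, which is not given in these
notes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_OP_MARK 0x99u   /* hypothetical opcode value, for illustration */

/* Write OP_MARK, a one-code-unit length, the name, and a terminating zero.
   Returns the number of code units written, or 0 if the name is too long for
   its length to fit in one 8-bit code unit. The caller must supply a buffer
   of at least strlen(name) + 3 bytes. */
static size_t ex_emit_mark(uint8_t *code, const char *name)
{
  size_t len;
  assert(code != NULL && name != NULL);
  len = strlen(name);
  if (len > 255) return 0;
  code[0] = EX_OP_MARK;
  code[1] = (uint8_t)len;
  memcpy(code + 2, name, len);
  code[2 + len] = 0;
  return len + 3;
}
```

For the name "AB" this writes five code units: the opcode, the length 2, 'A',
'B', and a zero. The 255 limit in the sketch reflects the statement that the
name length is limited by the size of the code unit.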

For (*COMMIT:NAME), (*PRUNE:NAME), (*SKIP:NAME), and (*THEN:NAME), the opcodes
OP_COMMIT_ARG, OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG are used, with the
name following in the same format as for OP_MARK.


Matching literal characters
---------------------------

The OP_CHAR opcode is followed by a single character that is to be matched
casefully. For caseless matching of characters that have at most two
case-equivalent code points, OP_CHARI is used. In UTF-8 or UTF-16 modes, the
character may be more than one code unit long. In UTF-32 mode, characters are
always exactly one code unit long.

If there is only one character in a character class, OP_CHAR or OP_CHARI is
used for a positive class, and OP_NOT or OP_NOTI for a negative one (that is,
for something like [^a]).

Caseless matching (positive or negative) of characters that have more than two
case-equivalent code points (which is possible only in UTF mode) is handled by
compiling a Unicode property item (see below), with the pseudo-property
PT_CLIST. The value of this property is an offset in a vector called
"ucd_caseless_sets" which identifies the start of a short list of case
equivalent characters, terminated by the value NOTACHAR (0xffffffff).
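
A PT_CLIST lookup amounts to scanning a NOTACHAR-terminated list from a given
offset. The sketch below shows the idea with a small made-up table; the names
demo_caseless_sets and demo_in_caseless_set are hypothetical, and the table
contents and offsets are chosen for illustration rather than copied from the
real "ucd_caseless_sets" vector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NOTACHAR 0xffffffffu   /* list terminator, as described above */

/* Made-up stand-in for the real vector. Offset 1 starts the set for the
   letter K, offset 5 the set for the letter S. */
static const uint32_t demo_caseless_sets[] = {
  DEMO_NOTACHAR,
  0x004B, 0x006B, 0x212A, DEMO_NOTACHAR,   /* K, k, KELVIN SIGN */
  0x0053, 0x0073, 0x017F, DEMO_NOTACHAR    /* S, s, LATIN SMALL LETTER LONG S */
};

/* Return 1 if c is in the list that starts at the given offset, else 0. */
static int demo_in_caseless_set(size_t offset, uint32_t c)
{
  const uint32_t *p;
  assert(offset < sizeof demo_caseless_sets / sizeof demo_caseless_sets[0]);
  p = demo_caseless_sets + offset;
  while (*p != DEMO_NOTACHAR)
    if (*p++ == c) return 1;
  return 0;
}
```

With this table, a property value of 1 would accept U+212A (KELVIN SIGN) as a
caseless match for "k", which a two-case comparison alone would miss.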


Repeating single characters
---------------------------

The common repeats (*, +, ?), when applied to a single character, use the
following opcodes, which come in caseful and caseless versions:

  Caseful         Caseless
  OP_STAR         OP_STARI
  OP_MINSTAR      OP_MINSTARI
  OP_POSSTAR      OP_POSSTARI
  OP_PLUS         OP_PLUSI
  OP_MINPLUS      OP_MINPLUSI
  OP_POSPLUS      OP_POSPLUSI
  OP_QUERY        OP_QUERYI
  OP_MINQUERY     OP_MINQUERYI
  OP_POSQUERY     OP_POSQUERYI

Each opcode is followed by the character that is to be repeated. In ASCII or
UTF-32 modes, these are two-code-unit items; in UTF-8 or UTF-16 modes, the
length is variable. Those with "MIN" in their names are the minimizing
versions. Those with "POS" in their names are possessive versions. Other kinds
of repeat make use of these opcodes:

  Caseful         Caseless
  OP_UPTO         OP_UPTOI
  OP_MINUPTO      OP_MINUPTOI
  OP_POSUPTO      OP_POSUPTOI
  OP_EXACT        OP_EXACTI

Each of these is followed by a count and then the repeated character. The count
is two bytes long in 8-bit mode (most significant byte first), or one code unit
in 16-bit and 32-bit modes.
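
Reading and writing such a count might look like the following. This is a
minimal sketch of the storage rule just stated, using invented helper names
(demo_get_count8 and so on) rather than the library's own macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 8-bit library: the count occupies two code units, high byte first. */
static unsigned demo_get_count8(const uint8_t *code)
{
  assert(code != NULL);
  return ((unsigned)code[0] << 8) | (unsigned)code[1];
}

static void demo_put_count8(uint8_t *code, unsigned count)
{
  assert(code != NULL && count <= 0xffffu);
  code[0] = (uint8_t)(count >> 8);      /* most significant byte */
  code[1] = (uint8_t)(count & 0xffu);   /* least significant byte */
}

/* 16-bit library: the count fits in a single code unit. */
static unsigned demo_get_count16(const uint16_t *code)
{
  assert(code != NULL);
  return code[0];
}
```

A count of 300, for example, is stored as the two bytes 0x01 0x2C in 8-bit
mode, but as the single code unit 0x012C in 16-bit mode.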

OP_UPTO matches from 0 to the given number. A repeat with a non-zero minimum
and a fixed maximum is coded as an OP_EXACT followed by an OP_UPTO (or
OP_MINUPTO or OP_POSUPTO).

Another set of matching repeating opcodes (called OP_NOTSTAR, OP_NOTSTARI,
etc.) are used for repeated, negated, single-character classes such as [^a]*.
The normal single-character opcodes (OP_STAR, etc.) are used for repeated
positive single-character classes.


Repeating character types
-------------------------

Repeats of things like \d are done exactly as for single characters, except
that instead of a character, the opcode for the type (e.g. OP_DIGIT) is stored
in the next code unit.
The opcodes are:

  OP_TYPESTAR
  OP_TYPEMINSTAR
  OP_TYPEPOSSTAR
  OP_TYPEPLUS
  OP_TYPEMINPLUS
  OP_TYPEPOSPLUS
  OP_TYPEQUERY
  OP_TYPEMINQUERY
  OP_TYPEPOSQUERY
  OP_TYPEUPTO
  OP_TYPEMINUPTO
  OP_TYPEPOSUPTO
  OP_TYPEEXACT


Match by Unicode property
-------------------------

OP_PROP and OP_NOTPROP are used for positive and negative matches of a
character by testing its Unicode property (the \p and \P escape sequences).
Each is followed by two code units that encode the desired property as a type
and a value. The types are a set of #defines of the form PT_xxx, and the values
are enumerations of the form ucp_xx, defined in the pcre2_ucp.h source file.
The value is relevant only for PT_GC (General Category), PT_PC (Particular
Category), PT_SC (Script), PT_BIDICL (Bidi Class), PT_BOOL (Boolean property),
and the pseudo-property PT_CLIST, which is used to identify a list of
case-equivalent characters when there are three or more (see above).

Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
three code units: OP_PROP or OP_NOTPROP, and then the desired property type and
value.


Character classes
-----------------

If there is only one character in a class, OP_CHAR or OP_CHARI is used for a
positive class, and OP_NOT or OP_NOTI for a negative one (that is, for
something like [^a]), except when caselessly matching a character that has more
than two case-equivalent code points (which can happen only in UTF mode). In
this case a Unicode property item is used, as described above in "Matching
literal characters".

A set of repeating opcodes (called OP_NOTSTAR etc.) are used for repeated,
negated, single-character classes. The normal single-character opcodes
(OP_STAR, etc.) are used for repeated positive single-character classes.

When there is more than one character in a class, and all the code points are
less than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a
negative one. In either case, the opcode is followed by a 32-byte (16-short,
8-word) bit map containing a 1 bit for every character that is acceptable. The
bits are counted from the least significant end of each unit. In caseless mode,
bits for both cases are set.

The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 and
16-bit and 32-bit modes, subject characters with values greater than 255 can be
handled correctly.
For OP_CLASS they do not match, whereas for OP_NCLASS they do.

For classes containing characters with values greater than 255 or that contain
\p or \P, OP_XCLASS is used. It optionally uses a bit map if any acceptable
code points are less than 256, followed by a list of pairs (for a range) and/or
single characters and/or properties. In caseless mode, all equivalent
characters are explicitly listed.

OP_XCLASS is followed by a LINK_SIZE value containing the total length of the
opcode and its data. This is followed by a code unit containing flag bits:
XCL_NOT indicates that this is a negative class, and XCL_MAP indicates that a
bit map is present. There follows the bit map, if XCL_MAP is set, and then a
sequence of items coded as follows:

  XCL_END      marks the end of the list
  XCL_SINGLE   one character follows
  XCL_RANGE    two characters follow
  XCL_PROP     a Unicode property (type, value) follows
  XCL_NOTPROP  a Unicode property (type, value) follows

If a range starts with a code point less than 256 and ends with one greater
than 255, it is split into two ranges, with characters less than 256 being
indicated in the bit map, and the rest with XCL_RANGE.
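
The bit map indexing and the range splitting can be sketched as below for the
8-bit library, where each unit of the map is one byte. All the names
(demo_map_set, demo_map_test, demo_add_range) are invented for this note; the
compiler's real code is organized differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Set the bit for code point c (< 256) in a 32-byte class bit map. Bits are
   counted from the least significant end of each byte. */
static void demo_map_set(uint8_t map[32], unsigned c)
{
  assert(map != NULL && c < 256);
  map[c / 8] |= (uint8_t)(1u << (c & 7));
}

/* Test a code point; anything above 255 is never in the map. */
static int demo_map_test(const uint8_t map[32], uint32_t c)
{
  assert(map != NULL);
  return c < 256 && (map[c / 8] & (1u << (c & 7))) != 0;
}

/* Add the range lo..hi. Code points below 256 go into the bit map. If part
   of the range lies above 255, it is returned in *rest_lo and *rest_hi (to
   be coded with XCL_RANGE) and the function returns 1; otherwise 0. */
static int demo_add_range(uint8_t map[32], uint32_t lo, uint32_t hi,
  uint32_t *rest_lo, uint32_t *rest_hi)
{
  uint32_t c;
  assert(lo <= hi && rest_lo != NULL && rest_hi != NULL);
  for (c = lo; c <= hi && c < 256; c++) demo_map_set(map, (unsigned)c);
  if (hi < 256) return 0;
  *rest_lo = (lo < 256)? 256 : lo;
  *rest_hi = hi;
  return 1;
}
```

For the range U+00F0 to U+0120, this sets the last two bytes of the map to
0xFF and leaves U+0100 to U+0120 to be coded as an explicit XCL_RANGE item.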

When XCL_NOT is set, the bit map, if present, contains bits for characters that
are allowed (exactly as for OP_NCLASS), but the list of items that follow it
specifies characters and properties that are not allowed.


Back references
---------------

OP_REF (caseful) or OP_REFI (caseless) is followed by a count containing the
reference number when the reference is to a unique capturing group (either by
number or by name). When named groups are used, there may be more than one
group with the same name. In this case, a reference to such a group by name
generates OP_DNREF or OP_DNREFI. These are followed by two counts: the index
(not the byte offset) in the group name table of the first entry for the
required name, followed by the number of groups with the same name. The
matching code can then search for the first one that is set.


Repeating character classes and back references
-----------------------------------------------

Single-character classes are handled specially (see above). This section
applies to other classes and also to back references. In both cases, the repeat
information follows the base item.
The matching code looks at the following opcode to see if it is one of these:

  OP_CRSTAR
  OP_CRMINSTAR
  OP_CRPOSSTAR
  OP_CRPLUS
  OP_CRMINPLUS
  OP_CRPOSPLUS
  OP_CRQUERY
  OP_CRMINQUERY
  OP_CRPOSQUERY
  OP_CRRANGE
  OP_CRMINRANGE
  OP_CRPOSRANGE

All but the last three are single-code-unit items, with no data. The range
opcodes are followed by the minimum and maximum repeat counts.


Brackets and alternation
------------------------

A pair of non-capturing round brackets is wrapped round each expression at
compile time, so alternation always happens in the context of brackets.

[Note for North Americans: "bracket" to some English speakers, including
myself, can be round, square, curly, or pointy. Hence this usage rather than
"parentheses".]

Non-capturing brackets use the opcode OP_BRA, capturing brackets use OP_CBRA. A
bracket opcode is followed by a LINK_SIZE value which gives the offset to the
next alternative OP_ALT or, if there aren't any branches, to the terminating
opcode.
Each OP_ALT is followed by a LINK_SIZE value giving the offset to the
next one, or to the final opcode. For capturing brackets, the bracket number is
a count that immediately follows the offset.

There are several opcodes that mark the end of a subpattern group. OP_KET is
used for subpatterns that do not repeat indefinitely, OP_KETRMIN and
OP_KETRMAX are used for indefinite repetitions, minimally or maximally
respectively, and OP_KETRPOS for possessive repetitions (see below for more
details). All four are followed by a LINK_SIZE value giving (as a positive
number) the offset back to the matching opening bracket opcode.

If a subpattern is quantified such that it is permitted to match zero times, it
is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
single-unit opcodes that tell the matcher that skipping the following
subpattern entirely is a valid match. In the case of the first two, not
skipping the pattern is also valid (greedy and non-greedy). The third is used
when a pattern has the quantifier {0,0}. It cannot be entirely discarded,
because it may be called as a subroutine from elsewhere in the pattern.

A subpattern with an indefinite maximum repetition is replicated in the
compiled data its minimum number of times (or once with OP_BRAZERO if the
minimum is zero), with the final copy terminating with OP_KETRMIN or OP_KETRMAX
as appropriate.
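
The way the forward links chain a bracket to its alternatives and then to the
closing opcode can be sketched as follows. This is an illustration under
stated assumptions, not library code: the opcode numbers are invented, only
non-capturing OP_BRA groups are handled, and a LINK_SIZE of two 8-bit code
units stored high byte first is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode numbers; the real ones are in pcre2_internal.h. */
enum { DEMO_OP_CHAR = 1, DEMO_OP_BRA, DEMO_OP_ALT, DEMO_OP_KET };

/* Read a link value, assumed here to be two code units, high byte first. */
static unsigned demo_get_link(const uint8_t *p)
{
  assert(p != NULL);
  return ((unsigned)p[0] << 8) | (unsigned)p[1];
}

/* Follow the forward links from a non-capturing bracket through each OP_ALT
   until the terminating opcode is reached. Returns a pointer to that opcode
   and stores the number of branches in *branches. */
static const uint8_t *demo_find_ket(const uint8_t *bra, unsigned *branches)
{
  const uint8_t *code = bra;
  assert(bra != NULL && branches != NULL && *bra == DEMO_OP_BRA);
  *branches = 0;
  do
    {
    code += demo_get_link(code + 1);   /* jump to next OP_ALT or the ket */
    (*branches)++;
    }
  while (*code == DEMO_OP_ALT);
  return code;
}
```

Under these assumptions, (?:a|bc) would be laid out as

  [BRA] [0] [5] [CHAR] [a] [ALT] [0] [7] [CHAR] [b] [CHAR] [c] [KET] [0] [12]

and following the links visits offsets 0, 5, and 12, finding two branches. The
value 12 after the ket is the offset back to the opening bracket.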

A subpattern with a bounded maximum repetition is replicated in a nested
fashion up to the maximum number of times, with OP_BRAZERO or OP_BRAMINZERO
before each replication after the minimum, so that, for example, (abc){2,5} is
compiled as (abc)(abc)((abc)((abc)(abc)?)?)?, except that each bracketed group
has the same number.

When a repeated subpattern has an unbounded upper limit, it is checked to see
whether it could match an empty string. If this is the case, the opcode in the
final replication is changed to OP_SBRA or OP_SCBRA. This tells the matcher
that it needs to check for matching an empty string when it hits OP_KETRMIN or
OP_KETRMAX, and if so, to break the loop.


Possessive brackets
-------------------

When a repeated group (capturing or non-capturing) is marked as possessive by
the "+" notation, e.g. (abc)++, different opcodes are used. Their names all
have POS on the end, e.g. OP_BRAPOS instead of OP_BRA and OP_SCBRAPOS instead
of OP_SCBRA. The end of such a group is marked by OP_KETRPOS. If the minimum
repetition is zero, the group is preceded by OP_BRAPOSZERO.


Once-only (atomic) groups
-------------------------

These are just like other subpatterns, but they start with the opcode OP_ONCE.
The check for matching an empty string in an unbounded repeat is handled
entirely at runtime, so there is just this one opcode for atomic groups.


Assertions
----------

Forward assertions are also just like other subpatterns, but starting with one
of the opcodes OP_ASSERT, OP_ASSERT_NA (non-atomic assertion), or
OP_ASSERT_NOT.

Backward assertions use the opcodes OP_ASSERTBACK, OP_ASSERTBACK_NA, and
OP_ASSERTBACK_NOT. If all the branches of a backward assertion are of fixed
length (not necessarily the same), the first opcode inside each branch is
OP_REVERSE, followed by an IMM2_SIZE count of the number of characters to move
back the pointer in the subject string, thus allowing each branch to have a
different (but fixed) length.

Variable-length backward assertions whose maximum matching length is limited
are also supported. For such assertions, the first opcode inside each branch is
OP_VREVERSE, followed by the minimum and maximum lengths for that branch,
unless these happen to be equal, in which case OP_REVERSE is used. These
IMM2_SIZE values occupy two code units each in 8-bit mode, and 1 code unit in
16/32 bit modes.

In ASCII or UTF-32 mode, the character counts in OP_REVERSE and OP_VREVERSE are
also the number of code units, but in UTF-8/16 mode each character may occupy
more than one code unit.


Conditional subpatterns
-----------------------

These are like other subpatterns, but they start with the opcode OP_COND, or
OP_SCOND for one that might match an empty string in an unbounded repeat.

If the condition is a back reference, this is stored at the start of the
subpattern using the opcode OP_CREF followed by a count containing the
reference number, provided that the reference is to a unique capturing group.
If the reference was by name and there is more than one group with that name,
OP_DNCREF is used instead. It is followed by two counts: the index in the group
names table, and the number of groups with the same name. This allows the
matcher to check if any group with the given name is set.

If the condition is "in recursion" (coded as "(?(R)"), or "in recursion of
group x" (coded as "(?(Rx)"), the group number is stored at the start of the
subpattern using the opcode OP_RREF (with a value of RREF_ANY (0xffff) for "the
whole pattern") or OP_DNRREF (with data as for OP_DNCREF).

For a DEFINE condition, OP_FALSE is used (with no associated data).
During compilation, however, a DEFINE condition is coded as OP_DEFINE so that,
when the conditional group is complete, there can be a check to ensure that it
contains only one top-level branch. Once this has happened, the opcode is
changed to OP_FALSE, so the matcher never sees OP_DEFINE.

There is a special PCRE2-specific condition of the form (VERSION[>]=x.y), which
tests the PCRE2 version number. This compiles into one of the opcodes OP_TRUE
or OP_FALSE.

If a condition is not a back reference, recursion test, DEFINE, or VERSION, it
must start with a parenthesized atomic assertion, whose opcode normally
immediately follows OP_COND or OP_SCOND. However, if automatic callouts are
enabled, a callout is inserted immediately before the assertion. It is also
possible to insert a manual callout at this point. Only assertion conditions
may have callouts preceding the condition.

A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
parts of the pattern, so this is another opcode that may appear as a condition.
It is treated the same as OP_FALSE.


Recursion
---------

Recursion either matches the current pattern, or some subexpression.
The opcode OP_RECURSE is followed by a LINK_SIZE value that is the offset to
the starting bracket from the start of the whole pattern. OP_RECURSE is also
used for "subroutine" calls, even though they are not strictly a recursion. Up
till release 10.30 recursions were treated as atomic groups, making them
incompatible with Perl (but PCRE had them well before Perl did). From 10.30,
backtracking into recursions is supported.

Repeated recursions used to be wrapped inside OP_ONCE brackets, which not only
forced no backtracking, but also allowed repetition to be handled as for other
bracketed groups. From 10.30 onwards, repeated recursions are duplicated for
their minimum repetitions, and then wrapped in non-capturing brackets for the
remainder. For example, (?1){3} is treated as (?1)(?1)(?1), and (?1){2,4} is
treated as (?1)(?1)(?:(?1)){0,2}.


Callouts
--------

A callout may have either a numerical argument or a string argument. These use
OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are followed by
two LINK_SIZE values giving the offset in the pattern string to the start of
the following item, and another count giving the length of this item. These
values make it possible for pcre2test to output useful tracing information
using callouts.

In the case of a numeric callout, after these two values there is a single code
unit containing the callout number, in the range 0-255, with 255 being used for
callouts that are automatically inserted as a result of the PCRE2_AUTO_CALLOUT
option. Thus, this opcode item is of fixed length:

  [OP_CALLOUT] [PATTERN_OFFSET] [PATTERN_LENGTH] [NUMBER]

For callouts with string arguments, OP_CALLOUT_STR has three more data items:
a LINK_SIZE value giving the complete length of the entire opcode item, a
LINK_SIZE item containing the offset within the pattern string to the start of
the string argument, and the string itself, preceded by its starting delimiter
and followed by a binary zero. When a callout function is called, a pointer to
the actual string is passed, but the delimiter can be accessed as string[-1] if
the application needs it. In the 8-bit library, the callout in /X(?C'abc')Y/ is
compiled as the following bytes (decimal numbers represent binary values):

  [OP_CALLOUT_STR]  [0] [10]  [0] [1]  [0] [14]  [0] [5] ['] [a] [b] [c] [0]
                    --------  -------  --------  -------
                       |         |        |         |
                       ------- LINK_SIZE items ------

Opcode table checking
---------------------

The last opcode that is defined in pcre2_internal.h is OP_TABLE_LENGTH.
This is not a real opcode, but is used to check at compile time that tables
indexed by opcode are the correct length, in order to catch updating errors.

Philip Hazel
November 2023