PCRE2TEST(1)                General Commands Manual               PCRE2TEST(1)


NAME
       pcre2test - a program for testing Perl-compatible regular expressions.


SYNOPSIS

       pcre2test [options] [input file [output file]]

       pcre2test is a test program for the PCRE2 regular expression libraries,
       but it can also be used for experimenting with regular expressions.
       This document describes the features of the test program; for details
       of the regular expressions themselves, see the pcre2pattern
       documentation. For details of the PCRE2 library function calls and
       their options, see the pcre2api documentation.

       The input for pcre2test is a sequence of regular expression patterns
       and subject strings to be matched. There are also command lines for
       setting defaults and controlling some special actions. The output shows
       the result of each match attempt. Modifiers on external or internal
       command lines, the patterns, and the subject lines specify PCRE2
       function options, control how the subject is processed, and what
       output is produced.

       There are many obscure modifiers, some of which are specifically
       designed for use in conjunction with the test script and data files
       that are distributed as part of PCRE2.
       All the modifiers are documented here, some without much
       justification, but many of them are unlikely to be of use except when
       testing the libraries.


PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES

       Different versions of the PCRE2 library can be built to support
       character strings that are encoded in 8-bit, 16-bit, or 32-bit code
       units. One, two, or all three of these libraries may be simultaneously
       installed. The pcre2test program can be used to test all the
       libraries. However, its own input and output are always in 8-bit
       format. When testing the 16-bit or 32-bit libraries, patterns and
       subject strings are converted to 16-bit or 32-bit format before being
       passed to the library functions. Results are converted back to 8-bit
       code units for output.

       In the rest of this document, the names of library functions and
       structures are given in generic form, for example, pcre2_compile().
       The actual names used in the libraries have a suffix _8, _16, or _32,
       as appropriate.


INPUT ENCODING

       Input to pcre2test is processed line by line, either by calling the C
       library's fgets() function, or via the libreadline or libedit library.
       In some Windows environments character 26 (hex 1A) causes an immediate
       end of file, and no further data is read, so this character should be
       avoided unless you really want that action.

       The input is processed using C's string functions, so must not contain
       binary zeros, even though in Unix-like environments, fgets() treats any
       bytes other than newline as data characters. An error is generated if a
       binary zero is encountered. By default subject lines are processed for
       backslash escapes, which makes it possible to include any data value in
       strings that are passed to the library for matching. For patterns,
       there is a facility for specifying some or all of the 8-bit input
       characters as hexadecimal pairs, which makes it possible to include
       binary zeros.

   Input for the 16-bit and 32-bit libraries

       When testing the 16-bit or 32-bit libraries, there is a need to be able
       to generate character code points greater than 255 in the strings that
       are passed to the library. For subject lines, backslash escapes can be
       used. In addition, when the utf modifier (see "Setting compilation
       options" below) is set, the pattern and any following subject lines
       are interpreted as UTF-8 strings and translated to UTF-16 or UTF-32 as
       appropriate.
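
       As a rough illustration of why the translation step matters, the
       following Python sketch counts how many code units the same text
       occupies in each encoding. It uses Python's own codecs, not any
       pcre2test code, and the helper name code_units is hypothetical.

```python
def code_units(text):
    """Count the code units needed for text in each UTF encoding.

    Illustrative helper only; it is not part of pcre2test or PCRE2.
    """
    return {
        "utf-8": len(text.encode("utf-8")),            # 8-bit code units
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }


if __name__ == "__main__":
    # U+0100 is the first code point above 255.
    print(code_units("\u0100"))
    # U+1F600 lies outside the Basic Multilingual Plane.
    print(code_units("\U0001F600"))
```

       A code point above 255 takes more than one byte of 8-bit input, yet
       may fit in a single 16-bit or 32-bit code unit after translation.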

       For non-UTF testing of wide characters, the utf8_input modifier can be
       used. This is mutually exclusive with utf, and is allowed only in
       16-bit or 32-bit mode. It causes the pattern and following subject
       lines to be treated as UTF-8 according to the original definition (RFC
       2279), which allows for character values up to 0x7fffffff. Each
       character is placed in one 16-bit or 32-bit code unit (in the 16-bit
       case, values greater than 0xffff cause an error to occur).

       UTF-8 (in its original definition) is not capable of encoding values
       greater than 0x7fffffff, but such values can be handled by the 32-bit
       library. When testing this library in non-UTF mode with utf8_input set,
       if any character is preceded by the byte 0xff (which is an invalid byte
       in UTF-8) 0x80000000 is added to the character's value. This is the
       only way of passing such code points in a pattern string. For subject
       strings, using an escape sequence is preferable.


COMMAND LINE OPTIONS

       -8        If the 8-bit library has been built, this option causes it to
                 be used (this is the default). If the 8-bit library has not
                 been built, this option causes an error.

       -16       If the 16-bit library has been built, this option causes it
                 to be used. If the 8-bit library has not been built, this is
                 the default.
                 If the 16-bit library has not been built, this option causes
                 an error.

       -32       If the 32-bit library has been built, this option causes it
                 to be used. If no other library has been built, this is the
                 default. If the 32-bit library has not been built, this
                 option causes an error.

       -ac       Behave as if each pattern has the auto_callout modifier, that
                 is, insert automatic callouts into every pattern that is
                 compiled.

       -AC       As for -ac, but in addition behave as if each subject line
                 has the callout_extra modifier, that is, show additional
                 information from callouts.

       -b        Behave as if each pattern has the fullbincode modifier; the
                 full internal binary form of the pattern is output after
                 compilation.

       -C        Output the version number of the PCRE2 library, and all
                 available information about the optional features that are
                 included, and then exit with zero exit code. All other
                 options are ignored. If both -C and -LM are present,
                 whichever is first is recognized.

       -C option Output information about a specific build-time option, then
                 exit. This functionality is intended for use in scripts such
                 as RunTest.
                 The following options output the value and set the exit code
                 as indicated:

                   ebcdic-nl  the code for LF (= NL) in an EBCDIC environment:
                                0x15 or 0x25
                                0 if used in an ASCII environment
                                exit code is always 0
                   linksize   the configured internal link size (2, 3, or 4)
                                exit code is set to the link size
                   newline    the default newline setting:
                                CR, LF, CRLF, ANYCRLF, ANY, or NUL
                                exit code is always 0
                   bsr        the default setting for what \R matches:
                                ANYCRLF or ANY
                                exit code is always 0

                 The following options output 1 for true or 0 for false, and
                 set the exit code to the same value:

                   backslash-C  \C is supported (not locked out)
                   ebcdic       compiled for an EBCDIC environment
                   jit          just-in-time support is available
                   pcre2-16     the 16-bit library was built
                   pcre2-32     the 32-bit library was built
                   pcre2-8      the 8-bit library was built
                   unicode      Unicode support is available

                 If an unknown option is given, an error message is output;
                 the exit code is 0.

       -d        Behave as if each pattern has the debug modifier; the
                 internal form and information about the compiled pattern is
                 output after compilation; -d is equivalent to -b -i.

       -dfa      Behave as if each subject line has the dfa modifier; matching
                 is done using the pcre2_dfa_match() function instead of the
                 default pcre2_match().

       -error number[,number,...]
                 Call pcre2_get_error_message() for each of the error numbers
                 in the comma-separated list, display the resulting messages
                 on the standard output, then exit with zero exit code. The
                 numbers may be positive or negative. This is a convenience
                 facility for PCRE2 maintainers.

       -help     Output a brief summary of these options and then exit.

       -i        Behave as if each pattern has the info modifier; information
                 about the compiled pattern is given after compilation.

       -jit      Behave as if each pattern line has the jit modifier; after
                 successful compilation, each pattern is passed to the
                 just-in-time compiler, if available.

       -jitfast  Behave as if each pattern line has the jitfast modifier;
                 after successful compilation, each pattern is passed to the
                 just-in-time compiler, if available, and each subject line is
                 passed directly to the JIT matcher via its "fast path".

       -jitverify
                 Behave as if each pattern line has the jitverify modifier;
                 after successful compilation, each pattern is passed to the
                 just-in-time compiler, if available, and the use of JIT for
                 matching is verified.

       -LM       List modifiers: write a list of available pattern and subject
                 modifiers to the standard output, then exit with zero exit
                 code. All other options are ignored. If both -C and any -Lx
                 options are present, whichever is first is recognized.

       -LP       List properties: write a list of recognized Unicode
                 properties to the standard output, then exit with zero exit
                 code. All other options are ignored. If both -C and any -Lx
                 options are present, whichever is first is recognized.

       -LS       List scripts: write a list of recognized Unicode script names
                 to the standard output, then exit with zero exit code. All
                 other options are ignored. If both -C and any -Lx options are
                 present, whichever is first is recognized.

       -pattern modifier-list
                 Behave as if each pattern line contains the given modifiers.

       -q        Do not output the version number of pcre2test at the start of
                 execution.

       -S size   On Unix-like systems, set the size of the run-time stack to
                 size mebibytes (units of 1024*1024 bytes).

       -subject modifier-list
                 Behave as if each subject line contains the given modifiers.

       -t        Run each compile and match many times with a timer, and
                 output the resulting times per compile or match. When JIT is
                 used, separate times are given for the initial compile and
                 the JIT compile. You can control the number of iterations
                 that are used for timing by following -t with a number (as a
                 separate item on the command line). For example, "-t 1000"
                 iterates 1000 times. The default is to iterate 500,000 times.

       -tm       This is like -t except that it times only the matching phase,
                 not the compile phase.

       -T -TM    These behave like -t and -tm, but in addition, at the end of
                 a run, the total times for all compiles and matches are
                 output.

       -version  Output the PCRE2 version number and then exit.


DESCRIPTION

       If pcre2test is given two filename arguments, it reads from the first
       and writes to the second. If the first name is "-", input is taken from
       the standard input.
       If pcre2test is given only one argument, it reads from that file and
       writes to stdout. Otherwise, it reads from stdin and writes to stdout.

       When pcre2test is built, a configuration option can specify that it
       should be linked with the libreadline or libedit library. When this is
       done, if the input is from a terminal, it is read using the readline()
       function. This provides line-editing and history facilities. The output
       from the -help option states whether or not readline() will be used.

       The program handles any number of tests, each of which consists of a
       set of input lines. Each set starts with a regular expression pattern,
       followed by any number of subject lines to be matched against that
       pattern. In between sets of test data, command lines that begin with #
       may appear. This file format, with some restrictions, can also be
       processed by the perltest.sh script that is distributed with PCRE2 as
       a means of checking that the behaviour of PCRE2 and Perl is the same.
       For a specification of perltest.sh, see the comments near its
       beginning. See also the #perltest command below.

       When the input is a terminal, pcre2test prompts for each line of input,
       using "re>" to prompt for regular expression patterns, and "data>" to
       prompt for subject lines. Command lines starting with # can be entered
       only in response to the "re>" prompt.
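
       The filename rules at the start of this section can be summarised by
       the following Python sketch. It is an illustration, not pcre2test
       code: the function name resolve_streams and the "<stdin>" and
       "<stdout>" placeholders are hypothetical, and treating "-" as the
       standard input when only one name is given is an assumption.

```python
def resolve_streams(filenames):
    """Map the filename arguments to (input, output) names.

    Illustrative only: "<stdin>" and "<stdout>" stand for the standard
    streams, and filenames holds the arguments left after the options.
    """
    if len(filenames) > 2:
        raise ValueError("at most two file names are expected")
    source = "<stdin>"
    sink = "<stdout>"
    # A first name of "-" means the standard input. The text documents this
    # for the two-name form; applying it to one name is an assumption.
    if len(filenames) >= 1 and filenames[0] != "-":
        source = filenames[0]
    if len(filenames) == 2:
        sink = filenames[1]
    return source, sink


if __name__ == "__main__":
    for args in ([], ["in.txt"], ["in.txt", "out.txt"], ["-", "out.txt"]):
        print(args, "->", resolve_streams(args))
```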

       Each subject line is matched separately and independently. If you want
       to do multi-line matches, you have to use the \n escape sequence (or \r
       or \r\n, etc., depending on the newline setting) in a single line of
       input to encode the newline sequences. There is no limit on the length
       of subject lines; the input buffer is automatically extended if it is
       too small. There are replication features that make it possible to
       generate long repetitive pattern or subject lines without having to
       supply them explicitly.

       An empty line or the end of the file signals the end of the subject
       lines for a test, at which point a new pattern or command line is
       expected if there is still input to be read.


COMMAND LINES

       In between sets of test data, a line that begins with # is interpreted
       as a command line. If the first character is followed by white space or
       an exclamation mark, the line is treated as a comment, and ignored.
       Otherwise, the following commands are recognized:

         #forbid_utf

       Subsequent patterns automatically have the PCRE2_NEVER_UTF and
       PCRE2_NEVER_UCP options set, which locks out the use of the PCRE2_UTF
       and PCRE2_UCP options and the use of (*UTF) and (*UCP) at the start of
       patterns.
       This command also forces an error if a subsequent pattern contains any
       occurrences of \P, \p, or \X, which are still supported when PCRE2_UTF
       is not set, but which require Unicode property support to be included
       in the library.

       This is a trigger guard that is used in test files to ensure that UTF
       or Unicode property tests are not accidentally added to files that are
       used when Unicode support is not included in the library. Setting
       PCRE2_NEVER_UTF and PCRE2_NEVER_UCP as a default can also be obtained
       by the use of #pattern; the difference is that #forbid_utf cannot be
       unset, and the automatic options are not displayed in pattern
       information, to avoid cluttering up test output.

         #load <filename>

       This command is used to load a set of precompiled patterns from a file,
       as described in the section entitled "Saving and restoring compiled
       patterns" below.

         #loadtables <filename>

       This command is used to load a set of binary character tables that can
       be accessed by the tables=3 qualifier. Such tables can be created by
       the pcre2_dftables program with the -b option.

         #newline_default [<newline-list>]

       When PCRE2 is built, a default newline convention can be specified.
       This determines which characters and/or character pairs are recognized
       as indicating a newline in a pattern or subject string. The default can
       be overridden when a pattern is compiled. The standard test files
       contain tests of various newline conventions, but the majority of the
       tests expect a single linefeed to be recognized as a newline by
       default. Without special action the tests would fail when PCRE2 is
       compiled with either CR or CRLF as the default newline.

       The #newline_default command specifies a list of newline types that are
       acceptable as the default. The types must be one of CR, LF, CRLF,
       ANYCRLF, ANY, or NUL (in upper or lower case), for example:

         #newline_default LF Any anyCRLF

       If the default newline is in the list, this command has no effect.
       Otherwise, except when testing the POSIX API, a newline modifier that
       specifies the first newline convention in the list (LF in the above
       example) is added to any pattern that does not already have a newline
       modifier. If the newline list is empty, the feature is turned off. This
       command is present in a number of the standard test input files.

       When the POSIX API is being tested there is no way to override the
       default newline convention, though it is possible to set the newline
       convention from within the pattern.
       A warning is given if the posix or posix_nosub modifier is used when
       #newline_default would set a default for the non-POSIX API.

         #pattern <modifier-list>

       This command sets a default modifier list that applies to all
       subsequent patterns. Modifiers on a pattern can change these settings.

         #perltest

       This line is used in test files that can also be processed by
       perltest.sh to confirm that Perl gives the same results as PCRE2.
       Subsequent tests are checked for the use of pcre2test features that
       are incompatible with the perltest.sh script.

       Patterns must use '/' as their delimiter, and only certain modifiers
       are supported. Comment lines, #pattern commands, and #subject commands
       that set or unset "mark" are recognized and acted on. The #perltest,
       #forbid_utf, and #newline_default commands, which are needed in the
       relevant pcre2test files, are silently ignored. All other command lines
       are ignored, but give a warning message. The #perltest command helps
       detect tests that are accidentally put in the wrong file or use the
       wrong delimiter. For more details of the perltest.sh script see the
       comments it contains.
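
       Returning to #newline_default, described earlier in this section: its
       selection rule can be sketched in Python as follows. This is a
       simplified illustration, not the pcre2test implementation. The
       function name and its arguments are hypothetical, and the POSIX API
       exception is deliberately not modelled.

```python
NEWLINE_TYPES = {"CR", "LF", "CRLF", "ANYCRLF", "ANY", "NUL"}


def newline_modifier_to_add(build_default, acceptable, pattern_modifiers):
    """Return the newline type to add to a pattern, or None.

    build_default     -- the newline convention PCRE2 was built with
    acceptable        -- the types listed after #newline_default
    pattern_modifiers -- the modifier items already on the pattern
    """
    wanted = [name.upper() for name in acceptable]
    unknown = [name for name in wanted if name not in NEWLINE_TYPES]
    if unknown:
        raise ValueError("unknown newline type: " + unknown[0])
    if not wanted:
        return None                      # empty list: feature turned off
    if build_default.upper() in wanted:
        return None                      # default is acceptable: no effect
    for item in pattern_modifiers:
        if item.strip().lower().startswith("newline="):
            return None                  # pattern already sets a newline
    return wanted[0]                     # first convention in the list


if __name__ == "__main__":
    acceptable = ["LF", "Any", "anyCRLF"]
    print(newline_modifier_to_add("CRLF", acceptable, ["i"]))
    print(newline_modifier_to_add("LF", acceptable, ["i"]))
```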

         #pop [<modifiers>]
         #popcopy [<modifiers>]

       These commands are used to manipulate the stack of compiled patterns,
       as described in the section entitled "Saving and restoring compiled
       patterns" below.

         #save <filename>

       This command is used to save a set of compiled patterns to a file, as
       described in the section entitled "Saving and restoring compiled
       patterns" below.

         #subject <modifier-list>

       This command sets a default modifier list that applies to all
       subsequent subject lines. Modifiers on a subject line can change these
       settings.


MODIFIER SYNTAX

       Modifier lists are used with both pattern and subject lines. Items in a
       list are separated by commas followed by optional white space. Trailing
       whitespace in a modifier list is ignored. Some modifiers may be given
       for both patterns and subject lines, whereas others are valid only for
       one or the other. Each modifier has a long name, for example
       "anchored", and some of them must be followed by an equals sign and a
       value, for example, "offset=12". Values cannot contain comma
       characters, but may contain spaces.
       Modifiers that do not take values may be preceded by a minus sign to
       turn off a previous setting.

       A few of the more common modifiers can also be specified as single
       letters, for example "i" for "caseless". In documentation, following
       the Perl convention, these are written with a slash ("the /i
       modifier") for clarity. Abbreviated modifiers must all be concatenated
       in the first item of a modifier list. If the first item is not
       recognized as a long modifier name, it is interpreted as a sequence of
       these abbreviations. For example:

         /abc/ig,newline=cr,jit=3

       This is a pattern line whose modifier list starts with two one-letter
       modifiers (/i and /g). The lower-case abbreviated modifiers are the
       same as used in Perl.


PATTERN SYNTAX

       A pattern line must start with one of the following characters (common
       symbols, excluding pattern meta-characters):

         / ! " ' ` - = _ : ; , % & @ ~

       This is interpreted as the pattern's delimiter. A regular expression
       may be continued over several input lines, in which case the newline
       characters are included within it.
       It is possible to include the delimiter as a literal within the
       pattern by escaping it with a backslash, for example

         /abc\/def/

       If you do this, the escape and the delimiter form part of the pattern,
       but since the delimiters are all non-alphanumeric, the inclusion of the
       backslash does not affect the pattern's interpretation. Note, however,
       that this trick does not work within \Q...\E literal bracketing because
       the backslash will itself be interpreted as a literal. If the
       terminating delimiter is immediately followed by a backslash, for
       example,

         /abc/\

       a backslash is added to the end of the pattern. This is done to provide
       a way of testing the error condition that arises if a pattern finishes
       with a backslash, because

         /abc\/

       is interpreted as the first line of a pattern that starts with "abc/",
       causing pcre2test to read the next line as a continuation of the
       regular expression.

       A pattern can be followed by a modifier list (details below).
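
       The delimiter rules above can be illustrated with a short Python
       sketch. It is a simplified model under stated assumptions, not the
       pcre2test source: the function name split_pattern_line is
       hypothetical, only a single input line is handled, and a return value
       of None stands for "the pattern continues on the next line".

```python
DELIMITERS = set("/!\"'`-=_:;,%&@~")


def split_pattern_line(line):
    """Split one pattern line into (pattern, modifier_text).

    Returns None when no terminating delimiter is found on this line.
    """
    if not line or line[0] not in DELIMITERS:
        raise ValueError("line does not start with a pattern delimiter")
    delimiter = line[0]
    pos = 1
    while pos < len(line):
        char = line[pos]
        if char == "\\" and pos + 1 < len(line):
            pos += 2      # the escape and the escaped character both stay
            continue
        if char == delimiter:
            break
        pos += 1
    else:
        return None       # unterminated: more pattern lines are needed
    pattern = line[1:pos]
    rest = line[pos + 1:]
    if rest.startswith("\\"):
        # A backslash straight after the closing delimiter is appended to
        # the pattern, which allows testing a pattern ending in a backslash.
        pattern += "\\"
        rest = rest[1:]
    return pattern, rest


if __name__ == "__main__":
    for text in ["/abc/ig,newline=cr,jit=3", "/abc\\/def/", "/abc/\\",
                 "/abc\\/"]:
        print(repr(text), "->", split_pattern_line(text))
```

       The returned modifier text is left unparsed here; splitting it into
       items follows the rules given under MODIFIER SYNTAX.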


SUBJECT LINE SYNTAX

       Before each subject line is passed to pcre2_match(), pcre2_dfa_match(),
       or pcre2_jit_match(), leading and trailing white space is removed, and
       the line is scanned for backslash escapes, unless the subject_literal
       modifier was set for the pattern. The following provide a means of
       encoding non-printing characters in a visible way:

         \a         alarm (BEL, \x07)
         \b         backspace (\x08)
         \e         escape (\x1b)
         \f         form feed (\x0c)
         \n         newline (\x0a)
         \r         carriage return (\x0d)
         \t         tab (\x09)
         \v         vertical tab (\x0b)
         \nnn       octal character (up to 3 octal digits); always
                      a byte unless > 255 in UTF-8 or 16-bit or 32-bit mode
         \o{dd...}  octal character (any number of octal digits)
         \xhh       hexadecimal byte (up to 2 hex digits)
         \x{hh...}  hexadecimal character (any number of hex digits)

       The use of \x{hh...} is not dependent on the use of the utf modifier on
       the pattern. It is recognized always. There may be any number of
       hexadecimal digits inside the braces; invalid values provoke error
       messages.
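
       As a rough illustration of the escape table above, the following
       Python sketch turns a subject string into a list of numeric code
       values. It is a simplification under stated assumptions, not
       pcre2test's implementation: the name decode_subject is hypothetical,
       the special sequences \= and \[ are not handled, and no encoding
       into 8-bit, 16-bit, or 32-bit code units is attempted.

```python
SIMPLE = {"a": 0x07, "b": 0x08, "e": 0x1B, "f": 0x0C,
          "n": 0x0A, "r": 0x0D, "t": 0x09, "v": 0x0B}
OCT_DIGITS = "01234567"
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_subject(text):
    """Return the code values described by a subject string (sketch only)."""
    values = []
    i = 0
    while i < len(text):
        ch = text[i]
        i += 1
        if ch != "\\":
            values.append(ord(ch))
            continue
        if i == len(text):
            break  # a backslash that ends the line is ignored
        c = text[i]
        if c in SIMPLE:
            values.append(SIMPLE[c])
            i += 1
        elif c in OCT_DIGITS:  # \nnn: up to 3 octal digits
            j = i
            while j < len(text) and j - i < 3 and text[j] in OCT_DIGITS:
                j += 1
            values.append(int(text[i:j], 8))
            i = j
        elif c in "ox" and text[i + 1:i + 2] == "{":  # \o{...} and \x{...}
            end = text.index("}", i + 2)  # ValueError if the brace is not closed
            values.append(int(text[i + 2:end], 8 if c == "o" else 16))
            i = end + 1
        elif c == "x":  # \xhh: up to 2 hex digits
            j = i + 1
            while j < len(text) and j - (i + 1) < 2 and text[j] in HEX_DIGITS:
                j += 1
            if j == i + 1:
                raise ValueError("sketch requires at least one hex digit")
            values.append(int(text[i + 1:j], 16))
            i = j
        elif not c.isalnum():
            values.append(ord(c))  # other non-alphanumerics are just escaped
            i += 1
        else:
            raise ValueError("unsupported escape \\" + c)
    return values
```

       Note how the digit limits matter: in this sketch "\x414" decodes to
       the two values 0x41 and 0x34, because \xhh takes at most two
       hexadecimal digits.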

       Note that \xhh specifies one byte rather than one character in UTF-8
       mode; this makes it possible to construct invalid UTF-8 sequences for
       testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
       character in UTF-8 mode, generating more than one byte if the value is
       greater than 127. When testing the 8-bit library not in UTF-8 mode,
       \x{hh} generates one byte for values less than 256, and causes an error
       for greater values.

       In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
       possible to construct invalid UTF-16 sequences for testing purposes.

       In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
       makes it possible to construct invalid UTF-32 sequences for testing
       purposes.

       There is a special backslash sequence that specifies replication of one
       or more characters:

         \[<characters>]{<count>}

       This makes it possible to test long strings without having to provide
       them as part of the file. For example:

         \[abc]{4}

       is converted to "abcabcabcabc". This feature does not support nesting.
       To include a closing square bracket in the characters, code it as \x5D.
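
       The replication sequence can be mimicked with a regular expression
       substitution. The Python sketch below is illustrative only; the name
       replicate is hypothetical, and it works on the text of the line,
       leaving any escapes inside the brackets (such as \x5D) to be decoded
       afterwards.

```python
import re

# \[<characters>]{<count>}: the characters may not contain "]" (no nesting).
_REPLICATION = re.compile(r"\\\[([^\]]*)\]\{(\d+)\}")


def replicate(subject):
    """Expand every \\[chars]{count} item in a subject line (sketch only)."""
    return _REPLICATION.sub(lambda m: m.group(1) * int(m.group(2)), subject)
```

       For instance, replicate(r"\[abc]{4}") gives "abcabcabcabc", matching
       the example above.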

       A backslash followed by an equals sign marks the end of the subject
       string and the start of a modifier list. For example:

         abc\=notbol,notempty

       If the subject string is empty and \= is followed by whitespace, the
       line is treated as a comment line, and is not used for matching. For
       example:

         \= This is a comment.
         abc\= This is an invalid modifier list.

       A backslash followed by any other non-alphanumeric character just
       escapes that character. A backslash followed by anything else causes an
       error. However, if the very last character in the line is a backslash
       (and there is no modifier list), it is ignored. This gives a way of
       passing an empty line as data, since a real empty line terminates the
       data input.

       If the subject_literal modifier is set for a pattern, all subject lines
       that follow are treated as literals, with no special treatment of
       backslashes. No replication is possible, and any subject modifiers must
       be set as defaults by a #subject command.


PATTERN MODIFIERS

       There are several types of modifier that can appear in pattern lines.
       Except where noted below, they may also be used in #pattern commands.
       A pattern's modifier list can add to or override default modifiers that
       were set by a previous #pattern command.

   Setting compilation options

       The following modifiers set options for pcre2_compile(). Most of them
       set bits in the options argument of that function, but those whose
       names start with PCRE2_EXTRA are additional options that are set in the
       compile context. Some of these options have single-letter
       abbreviations. There is special handling for /x: if a second x is
       present, PCRE2_EXTENDED is converted into PCRE2_EXTENDED_MORE as in
       Perl. A third appearance adds PCRE2_EXTENDED as well, though this makes
       no difference to the way pcre2_compile() behaves. See pcre2api for a
       description of the effects of these options.

             allow_empty_class         set PCRE2_ALLOW_EMPTY_CLASS
             allow_lookaround_bsk      set PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
             allow_surrogate_escapes   set PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
             alt_bsux                  set PCRE2_ALT_BSUX
             alt_circumflex            set PCRE2_ALT_CIRCUMFLEX
             alt_verbnames             set PCRE2_ALT_VERBNAMES
             anchored                  set PCRE2_ANCHORED
         /a  ascii_all                 set all ASCII options
             ascii_bsd                 set PCRE2_EXTRA_ASCII_BSD
             ascii_bss                 set PCRE2_EXTRA_ASCII_BSS
             ascii_bsw                 set PCRE2_EXTRA_ASCII_BSW
             ascii_digit               set PCRE2_EXTRA_ASCII_DIGIT
             ascii_posix               set PCRE2_EXTRA_ASCII_POSIX
             auto_callout              set PCRE2_AUTO_CALLOUT
             bad_escape_is_literal     set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
         /i  caseless                  set PCRE2_CASELESS
         /r  caseless_restrict         set PCRE2_EXTRA_CASELESS_RESTRICT
             dollar_endonly            set PCRE2_DOLLAR_ENDONLY
         /s  dotall                    set PCRE2_DOTALL
             dupnames                  set PCRE2_DUPNAMES
             endanchored               set PCRE2_ENDANCHORED
             escaped_cr_is_lf          set PCRE2_EXTRA_ESCAPED_CR_IS_LF
         /x  extended                  set PCRE2_EXTENDED
         /xx extended_more             set PCRE2_EXTENDED_MORE
             extra_alt_bsux            set PCRE2_EXTRA_ALT_BSUX
             firstline                 set PCRE2_FIRSTLINE
             literal                   set PCRE2_LITERAL
             match_line                set PCRE2_EXTRA_MATCH_LINE
             match_invalid_utf         set PCRE2_MATCH_INVALID_UTF
             match_unset_backref       set PCRE2_MATCH_UNSET_BACKREF
             match_word                set PCRE2_EXTRA_MATCH_WORD
         /m  multiline                 set PCRE2_MULTILINE
             never_backslash_c         set PCRE2_NEVER_BACKSLASH_C
             never_ucp                 set PCRE2_NEVER_UCP
             never_utf                 set PCRE2_NEVER_UTF
         /n  no_auto_capture           set PCRE2_NO_AUTO_CAPTURE
             no_auto_possess           set PCRE2_NO_AUTO_POSSESS
             no_dotstar_anchor         set PCRE2_NO_DOTSTAR_ANCHOR
             no_start_optimize         set PCRE2_NO_START_OPTIMIZE
             no_utf_check              set PCRE2_NO_UTF_CHECK
             ucp                       set PCRE2_UCP
             ungreedy                  set PCRE2_UNGREEDY
             use_offset_limit          set PCRE2_USE_OFFSET_LIMIT
             utf                       set PCRE2_UTF

       As well as turning on the PCRE2_UTF option, the utf modifier causes all
       non-printing characters in output strings to be printed using the
       \x{hh...} notation. Otherwise, those less than 0x100 are output in hex
       without the curly brackets. Setting utf in 16-bit or 32-bit mode also
       causes pattern and subject strings to be translated to UTF-16 or
       UTF-32, respectively, before being passed to library functions.

   Setting compilation controls

       The following modifiers affect the compilation process or request
       information about the pattern. There are single-letter abbreviations
       for some that are heavily used in the test files.

             bsr=[anycrlf|unicode]     specify \R handling
         /B  bincode                   show binary code without lengths
             callout_info              show callout information
             convert=<options>         request foreign pattern conversion
             convert_glob_escape=c     set glob escape character
             convert_glob_separator=c  set glob separator character
             convert_length            set convert buffer length
             debug                     same as info,fullbincode
             framesize                 show matching frame size
             fullbincode               show binary code with lengths
         /I  info                      show info about compiled pattern
             hex                       unquoted characters are hexadecimal
             jit[=<number>]            use JIT
             jitfast                   use JIT fast path
             jitverify                 verify JIT use
             locale=<name>             use this locale
             max_pattern_compiled      ) set maximum compiled pattern
               _length=<n>             )   length (bytes)
             max_pattern_length=<n>    set maximum pattern length (code units)
             max_varlookbehind=<n>     set maximum variable lookbehind length
             memory                    show memory used
             newline=<type>            set newline type
             null_context              compile with a NULL context
             null_pattern              pass pattern as NULL
             parens_nest_limit=<n>     set maximum parentheses depth
             posix                     use the POSIX API
             posix_nosub               use the POSIX API with REG_NOSUB
             push                      push compiled pattern onto the stack
             pushcopy                  push a copy onto the stack
             stackguard=<number>       test the stackguard feature
             subject_literal           treat all subject lines as literal
             tables=[0|1|2|3]          select internal tables
             use_length                do not zero-terminate the pattern
             utf8_input                treat input as UTF-8

       The effects of these modifiers are described in the following sections.

   Newline and \R handling

       The bsr modifier specifies what \R in a pattern should match. If it is
       set to "anycrlf", \R matches CR, LF, or CRLF only. If it is set to
       "unicode", \R matches any Unicode newline sequence. The default can be
       specified when PCRE2 is built; if it is not, the default is set to
       Unicode.

       The newline modifier specifies which characters are to be interpreted
       as newlines, both in the pattern and in subject lines. The type must be
       one of CR, LF, CRLF, ANYCRLF, ANY, or NUL (in upper or lower case).

   Information about a pattern

       The debug modifier is a shorthand for info,fullbincode, requesting all
       available information.

       The bincode modifier causes a representation of the compiled code to be
       output after compilation.
       This information does not contain length and offset values, which
       ensures that the same output is generated for different internal link
       sizes and different code unit widths. By using bincode, the same
       regression tests can be used in different environments.

       The fullbincode modifier, by contrast, does include length and offset
       values. This is used in a few special tests that run only for specific
       code unit widths and link sizes, and is also useful for one-off tests.

       The info modifier requests information about the compiled pattern
       (whether it is anchored, has a fixed first character, and so on). The
       information is obtained from the pcre2_pattern_info() function.
       Here are some typical examples:

           re> /(?i)(^a|^b)/m,info
         Capture group count = 1
         Compile options: multiline
         Overall options: caseless multiline
         First code unit at start or follows newline
         Subject length lower bound = 1

           re> /(?i)abc/info
         Capture group count = 0
         Compile options: <none>
         Overall options: caseless
         First code unit = 'a' (caseless)
         Last code unit = 'c' (caseless)
         Subject length lower bound = 3

       "Compile options" are those specified by modifiers; "overall options"
       have added options that are taken or deduced from the pattern. If both
       sets of options are the same, just a single "options" line is output;
       if there are no options, the line is omitted. "First code unit" is
       where any match must start; if there is more than one they are listed
       as "starting code units". "Last code unit" is the last literal code
       unit that must be present in any match. This is not necessarily the
       last character. These lines are omitted if no starting or ending code
       units are recorded. The subject length line is omitted when
       no_start_optimize is set because the minimum length is not calculated
       when it can never be used.

       The framesize modifier shows the size, in bytes, of each storage frame
       used by pcre2_match() for handling backtracking. The size depends on
       the number of capturing parentheses in the pattern. A vector of these
       frames is used at matching time; its overall size is shown when the
       heapframes_size subject modifier is set.

       The callout_info modifier requests information about all the callouts
       in the pattern. A list of them is output at the end of any other
       information that is requested. For each callout, either its number or
       string is given, followed by the item that follows it in the pattern.

   Passing a NULL context

       Normally, pcre2test passes a context block to pcre2_compile(). If the
       null_context modifier is set, however, NULL is passed. This is for
       testing that pcre2_compile() behaves correctly in this case (it uses
       default values).

   Passing a NULL pattern

       The null_pattern modifier is for testing the behaviour of
       pcre2_compile() when the pattern argument is NULL. The length value
       passed is the default PCRE2_ZERO_TERMINATED unless use_length is set.
       Any length other than zero causes an error.

   Specifying pattern characters in hexadecimal

       The hex modifier specifies that the characters of the pattern, except
       for substrings enclosed in single or double quotes, are to be
       interpreted as pairs of hexadecimal digits. This feature is provided as
       a way of creating patterns that contain binary zeros and other
       non-printing characters. White space is permitted between pairs of
       digits. For example, this pattern contains three characters:

         /ab 32 59/hex

       Parts of such a pattern are taken literally if quoted. This pattern
       contains nine characters, only two of which are specified in
       hexadecimal:

         /ab "literal" 32/hex

       Either single or double quotes may be used. There is no way of
       including the delimiter within a substring. The hex and expand
       modifiers are mutually exclusive.

   Specifying the pattern's length

       By default, patterns are passed to the compiling functions as
       zero-terminated strings, but they can instead be passed by length.
       The use_length modifier causes this to happen.
       Using a length happens automatically (whether or not use_length is set)
       when hex is set, because patterns specified in hexadecimal may contain
       binary zeros.

       If hex or use_length is used with the POSIX wrapper API (see "Using the
       POSIX wrapper API" below), the REG_PEND extension is used to pass the
       pattern's length.

   Specifying a maximum for variable lookbehinds

       Variable lookbehind assertions are supported only if, for each one,
       there is a maximum length (in characters) that it can match. There is a
       limit on this, whose default can be set at build time, with an ultimate
       default of 255. The max_varlookbehind modifier uses the
       pcre2_set_max_varlookbehind() function to change the limit. Lookbehinds
       whose branches each match a fixed length are limited to 65535
       characters per branch.

   Specifying wide characters in 16-bit and 32-bit modes

       In 16-bit and 32-bit modes, all input is automatically treated as UTF-8
       and translated to UTF-16 or UTF-32 when the utf modifier is set. For
       testing the 16-bit and 32-bit libraries in non-UTF mode, the utf8_input
       modifier can be used. It is mutually exclusive with utf. Input lines
       are interpreted as UTF-8 as a means of specifying wide characters. More
       details are given in "Input encoding" above.
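
       Returning to the hex modifier described earlier in this section, the
       following Python sketch shows one way to read such a pattern: quoted
       substrings are copied literally and everything else is taken as pairs
       of hexadecimal digits. It is an illustration only, assuming the 8-bit
       library (the result is a byte string); the name decode_hex_pattern is
       hypothetical and error reporting is reduced to ValueError.

```python
import string


def decode_hex_pattern(text):
    """Decode the body of a pattern that carries the hex modifier (sketch)."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1  # white space is allowed between pairs of digits
        elif ch in "'\"":
            end = text.index(ch, i + 1)  # ValueError if the quote is not closed
            out += text[i + 1:end].encode("latin-1")
            i = end + 1
        else:
            pair = text[i:i + 2]
            if len(pair) != 2 or any(c not in string.hexdigits for c in pair):
                raise ValueError("expected a pair of hexadecimal digits")
            out.append(int(pair, 16))
            i += 2
    return bytes(out)
```

       Applied to the two examples given for the hex modifier, the sketch
       returns 3 bytes for 'ab 32 59' and 9 bytes for 'ab "literal" 32'.
       Because the result may contain binary zeros, it has to be passed by
       length, which is why hex implies the behaviour of use_length.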

   Generating long repetitive patterns

       Some tests use long patterns that are very repetitive. Instead of
       creating a very long input line for such a pattern, you can use a
       special repetition feature, similar to the one described for subject
       lines above. If the expand modifier is present on a pattern, parts of
       the pattern that have the form

         \[<characters>]{<count>}

       are expanded before the pattern is passed to pcre2_compile(). For
       example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This
       construction cannot be nested. An initial "\[" sequence is recognized
       only if "]{" followed by decimal digits and "}" is found later in the
       pattern. If not, the characters remain in the pattern unaltered. The
       expand and hex modifiers are mutually exclusive.

       If part of an expanded pattern looks like an expansion, but is really
       part of the actual pattern, unwanted expansion can be avoided by giving
       two values in the quantifier. For example, \[AB]{6000,6000} is not
       recognized as an expansion item.

       If the info modifier is set on an expanded pattern, the result of the
       expansion is included in the information that is output.
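
       The recognition rule for expansion items can be illustrated with a
       short Python sketch. It reflects one reading of the rule stated
       above (the first "]{" after "\[" must be followed by decimal digits
       and "}"); the name expand_pattern is hypothetical and this is not
       pcre2test's own code.

```python
import re

# "\[" then characters up to the first "]{", then a plain decimal count and "}".
_EXPANSION = re.compile(r"\\\[((?:(?!\]\{).)*)\]\{(\d+)\}", re.S)


def expand_pattern(pattern):
    """Expand \\[chars]{count} items; anything else is left unaltered."""
    return _EXPANSION.sub(lambda m: m.group(1) * int(m.group(2)), pattern)
```

       In this sketch, expand_pattern(r"\[AB]{3}") gives "ABABAB", while
       r"\[AB]{6000,6000}" is returned unchanged because the comma prevents
       it from being recognized as an expansion item.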

   JIT compilation

       Just-in-time (JIT) compiling is a heavyweight optimization that can
       greatly speed up pattern matching. See the pcre2jit documentation for
       details. JIT compiling happens, optionally, after a pattern has been
       successfully compiled into an internal form. The JIT compiler converts
       this to optimized machine code. It needs to know whether the match-time
       options PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used,
       because different code is generated for the different cases. See the
       partial modifier in "Subject Modifiers" below for details of how these
       options are specified for each match attempt.

       JIT compilation is requested by the jit pattern modifier, which may
       optionally be followed by an equals sign and a number in the range 0 to
       7.
       The three bits that make up the number specify which of the three
       JIT operating modes are to be compiled:

         1  compile JIT code for non-partial matching
         2  compile JIT code for soft partial matching
         4  compile JIT code for hard partial matching

       The possible values for the jit modifier are therefore:

         0  disable JIT
         1  normal matching only
         2  soft partial matching only
         3  normal and soft partial matching
         4  hard partial matching only
         5  normal and hard partial matching
         6  soft and hard partial matching only
         7  all three modes

       If no number is given, 7 is assumed. The phrase "partial matching"
       means a call to pcre2_match() with either the PCRE2_PARTIAL_SOFT or the
       PCRE2_PARTIAL_HARD option set. Note that such a call may return a
       complete match; the options enable the possibility of a partial match,
       but do not require it. Note also that if you request JIT compilation
       only for partial matching (for example, jit=2) but do not set the
       partial modifier on a subject line, that match will not use JIT code
       because none was compiled for non-partial matching.

       If JIT compilation is successful, the compiled JIT code will
       automatically be used when an appropriate type of match is run, except
       when incompatible run-time options are specified.
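
       The way the jit number selects modes is ordinary bit testing. The
       Python sketch below decodes a value and checks whether JIT code
       would be available for a given kind of match; the function names and
       mode labels are hypothetical and are used here only for illustration.

```python
JIT_MODES = ((1, "non-partial"), (2, "soft partial"), (4, "hard partial"))


def jit_modes(value=7):
    """List the matching modes for which JIT code is compiled (default 7)."""
    if not 0 <= value <= 7:
        raise ValueError("jit value must be in the range 0 to 7")
    return [name for bit, name in JIT_MODES if value & bit]


def jit_code_available(value, partial=None):
    """partial is None (normal match), "soft", or "hard"."""
    needed = {None: 1, "soft": 2, "hard": 4}[partial]
    return bool(value & needed)


# jit=2 compiles code for soft partial matching only, so a match run
# without the partial modifier finds no JIT code to use.
print(jit_modes(2), jit_code_available(2), jit_code_available(2, "soft"))
```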
       For more details, see the pcre2jit documentation. See also the
       jitstack modifier below for a way of setting the size of the JIT
       stack.

       If the jitfast modifier is specified, matching is done using the JIT
       "fast path" interface, pcre2_jit_match(), which skips some of the
       sanity checks that are done by pcre2_match(), and of course does not
       work when JIT is not supported. If jitfast is specified without jit,
       jit=7 is assumed.

       If the jitverify modifier is specified, information about the compiled
       pattern shows whether JIT compilation was or was not successful. If
       jitverify is specified without jit, jit=7 is assumed. If JIT
       compilation is successful when jitverify is set, the text "(JIT)" is
       added to the first output line after a match or non-match when
       JIT-compiled code was actually used in the match.

   Setting a locale

       The locale modifier must specify the name of a locale, for example:

         /pattern/locale=fr_FR

       The given locale is set, pcre2_maketables() is called to build a set of
       character tables for the locale, and this is then passed to
       pcre2_compile() when compiling the regular expression. The same tables
       are used when matching the following subject lines.
       The locale modifier applies only to the pattern on which it appears,
       but can be given in a #pattern command if a default is needed. Setting
       a locale and alternate character tables are mutually exclusive.

   Showing pattern memory

       The memory modifier causes the size in bytes of the memory used to hold
       the compiled pattern to be output. This does not include the size of
       the pcre2_code block; it is just the actual compiled data. If the
       pattern is subsequently passed to the JIT compiler, the size of the JIT
       compiled code is also output. Here is an example:

           re> /a(b)c/jit,memory
         Memory allocation (code space): 21
         Memory allocation (JIT code): 1910


   Limiting nested parentheses

       The parens_nest_limit modifier sets a limit on the depth of nested
       parentheses in a pattern. Breaching the limit causes a compilation
       error. The default for the library is set when PCRE2 is built, but
       pcre2test sets its own default of 220, which is required for running
       the standard test suite.

   Limiting the pattern length

       The max_pattern_length modifier sets a limit, in code units, to the
       length of pattern that pcre2_compile() will accept.
       Breaching the limit causes a compilation error. The default is the
       largest number a PCRE2_SIZE variable can hold (essentially unlimited).

   Limiting the size of a compiled pattern

       The max_pattern_compiled_length modifier sets a limit, in bytes, to the
       amount of memory used by a compiled pattern. Breaching the limit causes
       a compilation error. The default is the largest number a PCRE2_SIZE
       variable can hold (essentially unlimited).

   Using the POSIX wrapper API

       The posix and posix_nosub modifiers cause pcre2test to call PCRE2 via
       the POSIX wrapper API rather than its native API. When posix_nosub is
       used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX
       wrapper supports only the 8-bit library. Note that it does not imply
       POSIX matching semantics; for more detail see the pcre2posix
       documentation. The following pattern modifiers set options for the
       regcomp() function:

         caseless           REG_ICASE
         multiline          REG_NEWLINE
         dotall             REG_DOTALL     )
         ungreedy           REG_UNGREEDY   ) These options are not part of
         ucp                REG_UCP        )   the POSIX standard
         utf                REG_UTF8       )

       The regerror_buffsize modifier specifies a size for the error buffer
       that is passed to regerror() in the event of a compilation error.
       For example:

         /abc/posix,regerror_buffsize=20

       This provides a means of testing the behaviour of regerror() when the
       buffer is too small for the error message. If this modifier has not
       been set, a large buffer is used.

       The aftertext and allaftertext subject modifiers work as described be-
       low. All other modifiers are either ignored, with a warning message, or
       cause an error.

       The pattern is passed to regcomp() as a zero-terminated string by de-
       fault, but if the use_length or hex modifier is set, the REG_PEND ex-
       tension is used to pass it by length.

   Testing the stack guard feature

       The stackguard modifier is used to test the use of
       pcre2_set_compile_recursion_guard(), a function that is provided to
       enable stack availability to be checked during compilation (see the
       pcre2api documentation for details). If the number specified by the
       modifier is greater than zero, pcre2_set_compile_recursion_guard() is
       called to set up a callback from pcre2_compile() to a local function.
       The argument it receives is the current parenthesis nesting depth; if
       this is greater than the value given by the modifier, non-zero is
       returned, causing the compilation to be aborted.
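
The guard logic described under "Testing the stack guard feature" can be
modelled in a few lines. The sketch below is an illustration in Python, not
PCRE2 code: make_guard() and toy_compile() are hypothetical names, and the
toy walker only counts unescaped parentheses, which is an assumption made
for brevity rather than a description of how pcre2_compile() parses.

```python
def make_guard(limit):
    """Build a guard callback that rejects depths greater than `limit`."""
    def guard(depth):
        # Non-zero means "abort compilation", zero means "carry on".
        return 1 if depth > limit else 0
    return guard


def toy_compile(pattern, guard):
    """Walk a pattern, calling guard(depth) at each opening parenthesis.

    Returns True if the walk completes, False if the guard aborts it.
    Only backslash escapes are honoured; character classes are ignored.
    """
    depth = 0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2          # skip the escaped character
            continue
        if ch == "(":
            depth += 1
            if guard(depth) != 0:
                return False
        elif ch == ")":
            depth -= 1
        i += 1
    return True


if __name__ == "__main__":
    guard = make_guard(2)
    print(toy_compile("((a)(b))", guard))   # nesting depth never exceeds 2
    print(toy_compile("(((a)))", guard))    # depth 3 trips the guard
```

With a limit of 2, a pattern nested two deep is accepted and one nested
three deep is rejected, which mirrors the "greater than the value given by
the modifier" rule above.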

   Using alternative character tables

       The value specified for the tables modifier must be one of the digits
       0, 1, 2, or 3. It causes a specific set of built-in character tables to
       be passed to pcre2_compile(). This is used in the PCRE2 tests to check
       behaviour with different character tables. The digit specifies the ta-
       bles as follows:

         0   do not pass any special character tables
         1   the default ASCII tables, as distributed in
               pcre2_chartables.c.dist
         2   a set of tables defining ISO 8859 characters
         3   a set of tables loaded by the #loadtables command

       In tables 2, some characters whose codes are greater than 128 are iden-
       tified as letters, digits, spaces, etc. Tables 3 can be used only after
       a #loadtables command has loaded them from a binary file. Setting al-
       ternate character tables and a locale are mutually exclusive.

   Setting certain match controls

       The following modifiers are really subject modifiers, and are described
       under "Subject Modifiers" below. However, they may be included in a
       pattern's modifier list, in which case they are applied to every sub-
       ject line that is processed with that pattern. These modifiers do not
       affect the compilation process.

           aftertext                   show text after match
           allaftertext                show text after captures
           allcaptures                 show all captures
           allvector                   show the entire ovector
           allusedtext                 show all consulted text
           altglobal                   alternative global matching
       /g  global                      global matching
           heapframes_size             show match data heapframes size
           jitstack=<n>                set size of JIT stack
           mark                        show mark values
           replace=<string>            specify a replacement string
           startchar                   show starting character when relevant
           substitute_callout          use substitution callouts
           substitute_extended         use PCRE2_SUBSTITUTE_EXTENDED
           substitute_literal          use PCRE2_SUBSTITUTE_LITERAL
           substitute_matched          use PCRE2_SUBSTITUTE_MATCHED
           substitute_overflow_length  use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
           substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
           substitute_skip=<n>         skip substitution <n>
           substitute_stop=<n>         skip substitution <n> and following
           substitute_unknown_unset    use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
           substitute_unset_empty      use PCRE2_SUBSTITUTE_UNSET_EMPTY

       These modifiers may not appear in a #pattern command. If you want them
       as defaults, set them in a #subject command.

   Specifying literal subject lines

       If the subject_literal modifier is present on a pattern, all the sub-
       ject lines that it matches are taken as literal strings, with no inter-
       pretation of backslashes. It is not possible to set subject modifiers
       on such lines, but any that are set as defaults by a #subject command
       are recognized.

   Saving a compiled pattern

       When a pattern with the push modifier is successfully compiled, it is
       pushed onto a stack of compiled patterns, and pcre2test expects the
       next line to contain a new pattern (or a command) instead of a subject
       line. This facility is used when saving compiled patterns to a file, as
       described in the section entitled "Saving and restoring compiled pat-
       terns" below. If pushcopy is used instead of push, a copy of the com-
       piled pattern is stacked, leaving the original as current, ready to
       match the following input lines. This provides a way of testing the
       pcre2_code_copy() function. The push and pushcopy modifiers are in-
       compatible with pattern modifiers such as global that act at match
       time. Any that are specified are ignored (for the stacked copy), with a
       warning message, except for replace, which causes an error.
       Note that jitverify, which is allowed, does not carry through to any
       subsequent matching that uses a stacked pattern.

   Testing foreign pattern conversion

       The experimental foreign pattern conversion functions in PCRE2 can be
       tested by setting the convert modifier. Its argument is a colon-sepa-
       rated list of options, which set the equivalent option for the
       pcre2_pattern_convert() function:

         glob                    PCRE2_CONVERT_GLOB
         glob_no_starstar        PCRE2_CONVERT_GLOB_NO_STARSTAR
         glob_no_wild_separator  PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR
         posix_basic             PCRE2_CONVERT_POSIX_BASIC
         posix_extended          PCRE2_CONVERT_POSIX_EXTENDED
         unset                   Unset all options

       The "unset" value is useful for turning off a default that has been set
       by a #pattern command. When one of these options is set, the input pat-
       tern is passed to pcre2_pattern_convert(). If the conversion is suc-
       cessful, the result is reflected in the output and then passed to
       pcre2_compile(). The normal utf and no_utf_check options, if set, cause
       the PCRE2_CONVERT_UTF and PCRE2_CONVERT_NO_UTF_CHECK options to be
       passed to pcre2_pattern_convert().

       By default, the conversion function is allowed to allocate a buffer for
       its output.
       However, if the convert_length modifier is set to a value greater than
       zero, pcre2test passes a buffer of the given length. This makes it
       possible to test the length check.

       The convert_glob_escape and convert_glob_separator modifiers can be
       used to specify the escape and separator characters for glob process-
       ing, overriding the defaults, which are operating-system dependent.


SUBJECT MODIFIERS

       The modifiers that can appear in subject lines and the #subject command
       are of two types.

   Setting match options

       The following modifiers set options for pcre2_match() or
       pcre2_dfa_match(). See pcre2api for a description of their effects.

           anchored                    set PCRE2_ANCHORED
           endanchored                 set PCRE2_ENDANCHORED
           dfa_restart                 set PCRE2_DFA_RESTART
           dfa_shortest                set PCRE2_DFA_SHORTEST
           disable_recurseloop_check   set PCRE2_DISABLE_RECURSELOOP_CHECK
           no_jit                      set PCRE2_NO_JIT
           no_utf_check                set PCRE2_NO_UTF_CHECK
           notbol                      set PCRE2_NOTBOL
           notempty                    set PCRE2_NOTEMPTY
           notempty_atstart            set PCRE2_NOTEMPTY_ATSTART
           noteol                      set PCRE2_NOTEOL
           partial_hard (or ph)        set PCRE2_PARTIAL_HARD
           partial_soft (or ps)        set PCRE2_PARTIAL_SOFT

       The partial matching modifiers are provided with abbreviations because
       they appear frequently in tests.

       If the posix or posix_nosub modifier was present on the pattern, caus-
       ing the POSIX wrapper API to be used, the only option-setting modifiers
       that have any effect are notbol, notempty, and noteol, causing
       REG_NOTBOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to
       regexec(). The other modifiers are ignored, with a warning message.

       There is one additional modifier that can be used with the POSIX wrap-
       per. It is ignored (with a warning) if used for non-POSIX matching.

           posix_startend=<n>[:<m>]

       This causes the subject string to be passed to regexec() using the
       REG_STARTEND option, which uses offsets to specify which part of the
       string is searched. If only one number is given, the end offset is
       passed as the end of the subject string. For more detail of
       REG_STARTEND, see the pcre2posix documentation. If the subject string
       contains binary zeros (coded as escapes such as \x{00} because
       pcre2test does not support actual binary zeros in its input), you must
       use posix_startend to specify its length.

   Setting match controls

       The following modifiers affect the matching process or request addi-
       tional information. Some of them may also be specified on a pattern
       line (see above), in which case they apply to every subject line that
       is matched against that pattern, but can be overridden by modifiers on
       the subject.

           aftertext                   show text after match
           allaftertext                show text after captures
           allcaptures                 show all captures
           allvector                   show the entire ovector
           allusedtext                 show all consulted text (non-JIT only)
           altglobal                   alternative global matching
           callout_capture             show captures at callout time
           callout_data=<n>            set a value to pass via callouts
           callout_error=<n>[:<m>]     control callout error
           callout_extra               show extra callout information
           callout_fail=<n>[:<m>]      control callout failure
           callout_no_where            do not show position of a callout
           callout_none                do not supply a callout function
           copy=<number or name>       copy captured substring
           depth_limit=<n>             set a depth limit
           dfa                         use pcre2_dfa_match()
           find_limits                 find heap, match and depth limits
           find_limits_noheap          find match and depth limits
           get=<number or name>        extract captured substring
           getall                      extract all captured substrings
       /g  global                      global matching
           heapframes_size             show match data heapframes size
           heap_limit=<n>              set a limit on heap memory (Kbytes)
           jitstack=<n>                set size of JIT stack
           mark                        show mark values
           match_limit=<n>             set a match limit
           memory                      show heap memory usage
           null_context                match with a NULL context
           null_replacement            substitute with NULL replacement
           null_subject                match with NULL subject
           offset=<n>                  set starting offset
           offset_limit=<n>            set offset limit
           ovector=<n>                 set size of output vector
           recursion_limit=<n>         obsolete synonym for depth_limit
           replace=<string>            specify a replacement string
           startchar                   show startchar when relevant
           startoffset=<n>             same as offset=<n>
           substitute_callout          use substitution callouts
           substitute_extended         use PCRE2_SUBSTITUTE_EXTENDED
           substitute_literal          use PCRE2_SUBSTITUTE_LITERAL
           substitute_matched          use PCRE2_SUBSTITUTE_MATCHED
           substitute_overflow_length  use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
           substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
           substitute_skip=<n>         skip substitution number n
           substitute_stop=<n>         skip substitution number n and greater
           substitute_unknown_unset    use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
           substitute_unset_empty      use PCRE2_SUBSTITUTE_UNSET_EMPTY
           zero_terminate              pass the subject as zero-terminated

       The effects of these modifiers are described in the following sections.
       When matching via the POSIX wrapper API, the aftertext, allaftertext,
       and ovector subject modifiers work as described below. All other modi-
       fiers are either ignored, with a warning message, or cause an error.

   Showing more text

       The aftertext modifier requests that as well as outputting the part of
       the subject string that matched the entire pattern, pcre2test should in
       addition output the remainder of the subject string. This is useful for
       tests where the subject contains multiple copies of the same substring.
       The allaftertext modifier requests the same action for captured sub-
       strings as well as the main matched substring. In each case the remain-
       der is output on the following line with a plus character following the
       capture number.

       The allusedtext modifier requests that all the text that was consulted
       during a successful pattern match by the interpreter should be shown,
       for both full and partial matches. This feature is not supported for
       JIT matching, and if requested with JIT it is ignored (with a warning
       message). Setting this modifier affects the output if there is a look-
       behind at the start of a match, or, for a complete match, a lookahead
       at the end, or if \K is used in the pattern. Characters that precede or
       follow the start and end of the actual match are indicated in the out-
       put by '<' or '>' characters underneath them.
       Here is an example:

         re> /(?<=pqr)abc(?=xyz)/
       data> 123pqrabcxyz456\=allusedtext
        0: pqrabcxyz
           <<<   >>>
       data> 123pqrabcxy\=ph,allusedtext
       Partial match: pqrabcxy
                      <<<

       The first, complete match shows that the matched string is "abc", with
       the preceding and following strings "pqr" and "xyz" having been con-
       sulted during the match (when processing the assertions). The partial
       match can indicate only the preceding string.

       The startchar modifier requests that the starting character for the
       match be indicated, if it is different to the start of the matched
       string. The only time when this occurs is when \K has been processed as
       part of the match. In this situation, the output for the matched string
       is displayed from the starting character instead of from the match
       point, with circumflex characters under the earlier characters. For ex-
       ample:

         re> /abc\Kxyz/
       data> abcxyz\=startchar
        0: abcxyz
           ^^^

       Unlike allusedtext, the startchar modifier can be used with JIT. How-
       ever, these two modifiers are mutually exclusive.
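
The marker lines in the allusedtext examples above follow directly from four
offsets: where the consulted text starts, where the match starts and ends,
and where the consulted text ends. The sketch below is a Python model of
that layout only. used_text_display() is a hypothetical helper, and the
offsets are supplied by hand because nothing here calls PCRE2.

```python
def used_text_display(subject, left, start, end, right):
    """Return (shown_text, marker_line) for an allusedtext-style display.

    subject[left:start] is text consulted before the match (lookbehind),
    subject[start:end] is the match itself, and subject[end:right] is
    text consulted after it (lookahead).
    """
    shown = subject[left:right]
    markers = "<" * (start - left) + " " * (end - start) + ">" * (right - end)
    # Trailing blanks carry no information, so drop them.
    return shown, markers.rstrip()


if __name__ == "__main__":
    # Offsets for /(?<=pqr)abc(?=xyz)/ against "123pqrabcxyz456".
    text, marks = used_text_display("123pqrabcxyz456", 3, 6, 9, 12)
    print(text)
    print(marks)
```

For the partial match in the example, the match runs to the end of the
subject, so the same helper called with end equal to right yields only the
'<' markers.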

   Showing the value of all capture groups

       The allcaptures modifier requests that the values of all potential cap-
       tured parentheses be output after a match. By default, only those up to
       the highest one actually used in the match are output (corresponding to
       the return code from pcre2_match()). Groups that did not take part in
       the match are output as "<unset>". This modifier is not relevant for
       DFA matching (which does no capturing) and does not apply when replace
       is specified; it is ignored, with a warning message, if present.

   Showing the entire ovector, for all outcomes

       The allvector modifier requests that the entire ovector be shown, what-
       ever the outcome of the match. Compare allcaptures, which shows only up
       to the maximum number of capture groups for the pattern, and then only
       for a successful complete non-DFA match. This modifier, which acts af-
       ter any match result, and also for DFA matching, provides a means of
       checking that there are no unexpected modifications to ovector fields.
       Before each match attempt, the ovector is filled with a special value,
       and if this is found in both elements of a capturing pair,
       "<unchanged>" is output. After a successful match, this applies to all
       groups after the maximum capture group for the pattern.
       In other cases it applies to the entire ovector. After a partial match,
       the first two elements are the only ones that should be set. After a
       DFA match, the amount of ovector that is used depends on the number of
       matches that were found.

   Testing pattern callouts

       A callout function is supplied when pcre2test calls the library match-
       ing functions, unless callout_none is specified. Its behaviour can be
       controlled by various modifiers listed above whose names begin with
       callout_. Details are given in the section entitled "Callouts" below.
       Testing callouts from pcre2_substitute() is described separately in
       "Testing the substitution function" below.

   Finding all matches in a string

       Searching for all possible matches within a subject can be requested by
       the global or altglobal modifier. After finding a match, the matching
       function is called again to search the remainder of the subject. The
       difference between global and altglobal is that the former uses the
       start_offset argument to pcre2_match() or pcre2_dfa_match() to start
       searching at a new point within the entire string (which is what Perl
       does), whereas the latter passes over a shortened subject.
       This makes a difference to the matching process if the pattern begins
       with a lookbehind assertion (including \b or \B).

       If an empty string is matched, the next match is done with the
       PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
       for another, non-empty, match at the same point in the subject. If this
       match fails, the start offset is advanced, and the normal match is re-
       tried. This imitates the way Perl handles such cases when using the /g
       modifier or the split() function. Normally, the start offset is ad-
       vanced by one character, but if the newline convention recognizes CRLF
       as a newline, and the current character is CR followed by LF, an ad-
       vance of two characters occurs.

   Testing substring extraction functions

       The copy and get modifiers can be used to test the
       pcre2_substring_copy_xxx() and pcre2_substring_get_xxx() functions.
       They can be given more than once, and each can specify a capture group
       name or number, for example:

          abcd\=copy=1,copy=3,get=G1

       If the #subject command is used to set default copy and/or get lists,
       these can be unset by specifying a negative number to cancel all num-
       bered groups and an empty name to cancel all named groups.
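
As an aside on the empty-match rule given under "Finding all matches in a
string" above, the start-offset advance can be written as a small function.
This is a Python model of the rule as stated, not pcre2test source:
advance_after_empty_match() is a hypothetical name, and offsets count
Python characters, which corresponds to code units only in non-UTF 8-bit
mode.

```python
def advance_after_empty_match(subject, offset, crlf_is_newline):
    """Return the next start offset after the anchored retry has failed.

    The advance is one character, or two when the newline convention
    recognizes CRLF and the current character is CR followed by LF.
    """
    if crlf_is_newline and subject[offset:offset + 2] == "\r\n":
        return offset + 2
    return offset + 1


if __name__ == "__main__":
    subject = "a\r\nb"
    print(advance_after_empty_match(subject, 1, True))    # steps over CRLF
    print(advance_after_empty_match(subject, 1, False))   # steps over CR only
```

Stepping over CR and LF together avoids restarting a search in the middle
of a two-character newline.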

       The getall modifier tests pcre2_substring_list_get(), which extracts
       all captured substrings.

       If the subject line is successfully matched, the substrings extracted
       by the convenience functions are output with C, G, or L after the
       string number instead of a colon. This is in addition to the normal
       full list. The string length (that is, the return from the extraction
       function) is given in parentheses after each substring, followed by the
       name when the extraction was by name.

   Testing the substitution function

       If the replace modifier is set, the pcre2_substitute() function is
       called instead of one of the matching functions (or after one call of
       pcre2_match() in the case of PCRE2_SUBSTITUTE_MATCHED). Note that re-
       placement strings cannot contain commas, because a comma signifies the
       end of a modifier. This is not thought to be an issue in a test pro-
       gram.

       Specifying a completely empty replacement string disables this modi-
       fier. However, it is possible to specify an empty replacement by pro-
       viding a buffer length, as described below, for an otherwise empty re-
       placement.

       Unlike subject strings, replacement strings are not processed by
       pcre2test for escape sequences.
       In UTF mode, a replacement string is checked to see if it is a valid
       UTF-8 string. If so, it is correctly converted to a UTF string of the
       appropriate code unit width. If it is not a valid UTF-8 string, the
       individual code units are copied directly. This provides a means of
       passing an invalid UTF-8 string for testing purposes.

       The following modifiers set options (in addition to the normal match
       options) for pcre2_substitute():

         global                       PCRE2_SUBSTITUTE_GLOBAL
         substitute_extended          PCRE2_SUBSTITUTE_EXTENDED
         substitute_literal           PCRE2_SUBSTITUTE_LITERAL
         substitute_matched           PCRE2_SUBSTITUTE_MATCHED
         substitute_overflow_length   PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
         substitute_replacement_only  PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
         substitute_unknown_unset     PCRE2_SUBSTITUTE_UNKNOWN_UNSET
         substitute_unset_empty       PCRE2_SUBSTITUTE_UNSET_EMPTY

       See the pcre2api documentation for details of these options.

       After a successful substitution, the modified string is output, pre-
       ceded by the number of replacements. This may be zero if there were no
       matches.
       Here is a simple example of a substitution test:

         /abc/replace=xxx
             =abc=abc=
          1: =xxx=abc=
             =abc=abc=\=global
          2: =xxx=xxx=

       Subject and replacement strings should be kept relatively short (fewer
       than 256 characters) for substitution tests, as fixed-size buffers are
       used. To make it easy to test for buffer overflow, if the replacement
       string starts with a number in square brackets, that number is passed
       to pcre2_substitute() as the size of the output buffer, with the re-
       placement string starting at the next character. Here is an example
       that tests the edge case:

         /abc/
             123abc123\=replace=[10]XYZ
          1: 123XYZ123
             123abc123\=replace=[9]XYZ
         Failed: error -47: no more memory

       The default action of pcre2_substitute() is to return
       PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if
       the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the
       substitute_overflow_length modifier), pcre2_substitute() continues to
       go through the motions of matching and substituting (but not doing any
       callouts), in order to compute the size of buffer that is required.
       When this happens, pcre2test shows the required buffer length (which
       includes space for the trailing zero) as part of the error message.
       For example:

         /abc/substitute_overflow_length
             123abc123\=replace=[9]XYZ
         Failed: error -47: no more memory: 10 code units are needed

       A replacement string is ignored with POSIX and DFA matching.
       Specifying partial matching provokes an error return ("bad option
       value") from pcre2_substitute().

   Testing substitute callouts

       If the substitute_callout modifier is set, a substitution callout
       function is set up. The null_context modifier must not be set,
       because the address of the callout function is passed in a match
       context. When the callout function is called (after each
       substitution), details of the input and output strings are output.
       For example:

         /abc/g,replace=<$0>,substitute_callout
             abcdefabcpqr
          1(1) Old 0 3 "abc" New 0 5 "<abc>"
          2(1) Old 6 9 "abc" New 8 13 "<abc>"
          2: <abc>def<abc>pqr

       The first number on each callout line is the count of matches. The
       parenthesized number is the number of pairs that are set in the
       ovector (that is, one more than the number of capturing groups that
       were set).
       The offsets and contents of the old substring are listed next,
       followed by the same for the replacement.

       By default, the substitution callout function returns zero, which
       accepts the replacement and causes matching to continue if /g was
       used. Two further modifiers can be used to test other return values.
       If substitute_skip is set to a value greater than zero, the callout
       function returns +1 for the match of that number, and similarly
       substitute_stop returns -1. Both cause the replacement to be
       rejected, and -1 also causes no further matching to take place. If
       either of them is set, substitute_callout is assumed. For example:

         /abc/g,replace=<$0>,substitute_skip=1
             abcdefabcpqr
          1(1) Old 0 3 "abc" New 0 5 "<abc> SKIPPED"
          2(1) Old 6 9 "abc" New 6 11 "<abc>"
          2: abcdef<abc>pqr
             abcdefabcpqr\=substitute_stop=1
          1(1) Old 0 3 "abc" New 0 5 "<abc> STOPPED"
          1: abcdefabcpqr

       If both are set for the same number, stop takes precedence. Only a
       single skip or stop is supported, which is sufficient for testing
       that the feature works.
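
       The offset bookkeeping in the substitute callout lines can be modelled
       with a short Python sketch. This is not pcre2test's code and it uses
       Python's re module rather than PCRE2. The function name, its
       parameters, and the event tuple layout are all hypothetical, chosen
       only to mirror the "Old" and "New" offsets and the skip and stop
       behaviour described above.

```python
import re

def substitute_with_callout(pattern, subject, make_replacement, skip=0, stop=0):
    """Model a global substitution that reports every replacement.

    Returns (result, events). Each event is the tuple
    (match_number, old_start, old_end, new_start, new_end, action),
    where action is "", "SKIPPED" or "STOPPED".
    """
    pieces = []     # parts of the output string built so far
    out_len = 0     # current length of the output string
    last = 0        # end of the previous match in the subject
    events = []
    for n, m in enumerate(re.finditer(pattern, subject), start=1):
        # Copy the unmatched text that lies before this match.
        gap = subject[last:m.start()]
        pieces.append(gap)
        out_len += len(gap)

        replacement = make_replacement(m)
        new_start, new_end = out_len, out_len + len(replacement)

        # Stop is tested first, so it wins if both name the same match.
        if n == stop:
            action = "STOPPED"
        elif n == skip:
            action = "SKIPPED"
        else:
            action = ""
        events.append((n, m.start(), m.end(), new_start, new_end, action))

        # A rejected replacement leaves the matched text unchanged.
        kept = m.group(0) if action else replacement
        pieces.append(kept)
        out_len += len(kept)
        last = m.end()
        if action == "STOPPED":
            break
    pieces.append(subject[last:])
    return "".join(pieces), events
```

       With the pattern abc, the subject abcdefabcpqr, and a replacement of
       the match wrapped in angle brackets, the sketch yields the same
       offsets as the transcripts above: 0 5 and 8 13 with no modifiers, and
       0 5 followed by 6 11 when the first match is skipped.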

   Setting the JIT stack size

       The jitstack modifier provides a way of setting the maximum stack
       size that is used by the just-in-time optimization code. It is
       ignored if JIT optimization is not being used. The value is a number
       of kibibytes (units of 1024 bytes). Setting zero reverts to the
       default of 32KiB. Providing a stack that is larger than the default
       is necessary only for very complicated patterns. If jitstack is set
       non-zero on a subject line it overrides any value that was set on the
       pattern.

   Setting heap, match, and depth limits

       The heap_limit, match_limit, and depth_limit modifiers set the
       appropriate limits in the match context. These values are ignored
       when the find_limits or find_limits_noheap modifier is specified.

   Finding minimum limits

       If the find_limits modifier is present on a subject line, pcre2test
       calls the relevant matching function several times, setting different
       values in the match context via pcre2_set_heap_limit(),
       pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds
       the smallest value for each parameter that allows the match to
       complete without a "limit exceeded" error. The match itself may
       succeed or fail. An alternative modifier, find_limits_noheap, omits
       the heap limit.
       This is used in the standard tests, because the minimum heap limit
       varies between systems. If JIT is being used, only the match limit is
       relevant, and the other two are automatically omitted.

       When using this modifier, the pattern should not contain any limit
       settings such as (*LIMIT_MATCH=...) within it. If such a setting is
       present and is lower than the minimum matching value, the minimum
       value cannot be found because pcre2_set_match_limit() etc. are only
       able to reduce the value of an in-pattern limit; they cannot increase
       it.

       For non-DFA matching, the minimum depth_limit number is a measure of
       how much nested backtracking happens (that is, how deeply the
       pattern's tree is searched). In the case of DFA matching, depth_limit
       controls the depth of recursive calls of the internal function that
       is used for handling pattern recursion, lookaround assertions, and
       atomic groups.

       For non-DFA matching, the match_limit number is a measure of the
       amount of backtracking that takes place, and learning the minimum
       value can be instructive. For most simple matches, the number is
       quite small, but for patterns with very large numbers of matching
       possibilities, it can become large very quickly with increasing
       length of subject string.
       In the case of DFA matching, match_limit controls the total number of
       calls, both recursive and non-recursive, to the internal matching
       function, thus controlling the overall amount of computing resource
       that is used.

       For both kinds of matching, the heap_limit number, which is in
       kibibytes (units of 1024 bytes), limits the amount of heap memory
       used for matching.

   Showing MARK names

       The mark modifier causes the names from backtracking control verbs
       that are returned from calls to pcre2_match() to be displayed. If a
       mark is returned for a match, non-match, or partial match, pcre2test
       shows it. For a match, it is on a line by itself, tagged with "MK:".
       Otherwise, it is added to the non-match message.

   Showing memory usage

       The memory modifier causes pcre2test to log the sizes of all heap
       memory allocation and freeing calls that occur during a call to
       pcre2_match() or pcre2_dfa_match(). In the latter case, heap memory
       is used only when a match requires more internal workspace than the
       default allocation on the stack, so in many cases there will be no
       output. No heap memory is allocated during matching with JIT.
       For this modifier to work, the null_context modifier must not be set
       on both the pattern and the subject, though it can be set on one or
       the other.

   Showing the heap frame overall vector size

       The heapframes_size modifier is relevant for matches using
       pcre2_match() without JIT. After a match has run (whether successful
       or not) the size, in bytes, of the allocated heap frames vector that
       is left attached to the match data block is shown. If the matching
       action involved several calls to pcre2_match() (for example, global
       matching or for timing) only the final value is shown.

       This modifier is ignored, with a warning, for POSIX or DFA matching.
       JIT matching does not use the heap frames vector, so the size is
       always zero, unless there was a previous non-JIT match. Note that
       specifying a size of zero for the output vector (see below) causes
       pcre2test to free its match data block (and associated heap frames
       vector) and allocate a new one.

   Setting a starting offset

       The offset modifier sets an offset in the subject string at which
       matching starts. Its value is a number of code units, not characters.

   Setting an offset limit

       The offset_limit modifier sets a limit for unanchored matches.
       If a match cannot be found starting at or before this offset in the
       subject, a "no match" return is given. The data value is a number of
       code units, not characters. When this modifier is used, the
       use_offset_limit modifier must have been set for the pattern; if not,
       an error is generated.

   Setting the size of the output vector

       The ovector modifier applies only to the subject line in which it
       appears, though of course it can also be used to set a default in a
       #subject command. It specifies the number of pairs of offsets that
       are available for storing matching information. The default is 15.

       A value of zero is useful when testing the POSIX API because it
       causes regexec() to be called with a NULL capture vector. When not
       testing the POSIX API, a value of zero is used to cause
       pcre2_match_data_create_from_pattern() to be called, in order to
       create a new match block of exactly the right size for the pattern.
       (It is not possible to create a match block with a zero-length
       ovector; there is always at least one pair of offsets.) The old match
       data block is freed.

   Passing the subject as zero-terminated

       By default, the subject string is passed to a native API matching
       function with its correct length.
       In order to test the facility for passing a zero-terminated string,
       the zero_terminate modifier is provided. It causes the length to be
       passed as PCRE2_ZERO_TERMINATED. When matching via the POSIX
       interface, this modifier is ignored, with a warning.

       When testing pcre2_substitute(), this modifier also has the effect of
       passing the replacement string as zero-terminated.

   Passing a NULL context, subject, or replacement

       Normally, pcre2test passes a context block to pcre2_match(),
       pcre2_dfa_match(), pcre2_jit_match() or pcre2_substitute(). If the
       null_context modifier is set, however, NULL is passed. This is for
       testing that the matching and substitution functions behave correctly
       in this case (they use default values). This modifier cannot be used
       with the find_limits, find_limits_noheap, or substitute_callout
       modifiers.

       Similarly, for testing purposes, if the null_subject or
       null_replacement modifier is set, the subject or replacement string
       pointers are passed as NULL, respectively, to the relevant functions.


THE ALTERNATIVE MATCHING FUNCTION

       By default, pcre2test uses the standard PCRE2 matching function,
       pcre2_match(), to match each subject line.
       PCRE2 also supports an alternative matching function,
       pcre2_dfa_match(), which operates in a different way, and has some
       restrictions. The differences between the two functions are described
       in the pcre2matching documentation.

       If the dfa modifier is set, the alternative matching function is
       used. This function finds all possible matches at a given point in
       the subject. If, however, the dfa_shortest modifier is set,
       processing stops after the first match is found. This is always the
       shortest possible match.


DEFAULT OUTPUT FROM pcre2test

       This section describes the output when the normal matching function,
       pcre2_match(), is being used.

       When a match succeeds, pcre2test outputs the list of captured
       substrings, starting with number 0 for the string that matched the
       whole pattern. Otherwise, it outputs "No match" when the return is
       PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
       matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that
       this is the entire substring that was inspected during the partial
       match; it may include characters before the actual match start if a
       lookbehind assertion, \K, \b, or \B was involved.)

       For any other return, pcre2test outputs the PCRE2 negative error
       number and a short descriptive phrase. If the error is a failed UTF
       string check, the code unit offset of the start of the failing
       character is also output. Here is an example of an interactive
       pcre2test run.

         $ pcre2test
         PCRE2 version 10.22 2016-07-29

           re> /^abc(\d+)/
         data> abc123
          0: abc123
          1: 123
         data> xyz
         No match

       Unset capturing substrings that are not followed by one that is set
       are not shown by pcre2test unless the allcaptures modifier is
       specified. In the following example, there are two capturing
       substrings, but when the first data line is matched, the second,
       unset substring is not shown. An "internal" unset substring is shown
       as "<unset>", as for the second data line.

           re> /(a)|(b)/
         data> a
          0: a
          1: a
         data> b
          0: b
          1: <unset>
          2: b

       If the strings contain any non-printing characters, they are output
       as \xhh escapes if the value is less than 256 and UTF mode is not
       set.
       Otherwise they are output as \x{hh...} escapes. See below for the
       definition of non-printing characters. If the aftertext modifier is
       set, the output for substring 0 is followed by the rest of the
       subject string, identified by "0+" like this:

           re> /cat/aftertext
         data> cataract
          0: cat
          0+ aract

       If global matching is requested, the results of successive matching
       attempts are output in sequence, like this:

           re> /\Bi(\w\w)/g
         data> Mississippi
          0: iss
          1: ss
          0: iss
          1: ss
          0: ipp
          1: pp

       "No match" is output only if the first match attempt fails. Here is
       an example of a failure message (the offset 4 that is specified by
       the offset modifier is past the end of the subject string):

           re> /xyz/
         data> xyz\=offset=4
         Error -24 (bad offset value)

       Note that whereas patterns can be continued over several lines (a
       plain ">" prompt is used for continuations), subject lines may not.
       However, newlines can be included in a subject by means of the \n
       escape (or \r, \r\n, etc., depending on the newline sequence
       setting).
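
       The listing rules in this section (numbered substrings, trailing
       unset groups dropped, internal unset groups shown as "<unset>", and
       hex escapes for non-printing characters) can be approximated in a few
       lines of Python. This is a rough model built on Python's re module,
       not pcre2test's implementation. The function names are hypothetical,
       and the two-column number width and lowercase hex digits are
       assumptions about layout. The locale case is not modelled.

```python
import re

def escape(text, utf=False):
    """Render characters outside 32-126 as hex escapes."""
    out = []
    for ch in text:
        code = ord(ch)
        if 32 <= code <= 126:
            out.append(ch)                      # printing character
        elif code < 256 and not utf:
            out.append(f"\\x{code:02x}")        # \xhh form
        else:
            out.append(f"\\x{{{code:02x}}}")    # \x{hh...} form
    return "".join(out)

def show_match(m, utf=False):
    """List substring 0 and the captures, dropping trailing unset groups."""
    groups = [m.group(0), *m.groups()]
    while groups[-1] is None:       # group 0 is never None, so this ends
        groups.pop()
    return [f"{i:2d}: " + ("<unset>" if g is None else escape(g, utf))
            for i, g in enumerate(groups)]

def show_global(pattern, subject):
    """Concatenate the listings of successive matches, as for /g."""
    lines = []
    for m in re.finditer(pattern, subject):
        lines.extend(show_match(m))
    return lines or ["No match"]
```

       Calling show_global(r"\Bi(\w\w)", "Mississippi") produces the same
       six lines as the global matching transcript above.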


OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION

       When the alternative matching function, pcre2_dfa_match(), is used,
       the output consists of a list of all the matches that start at the
       first point in the subject where there is at least one match. For
       example:

           re> /(tang|tangerine|tan)/
         data> yellow tangerine\=dfa
          0: tangerine
          1: tang
          2: tan

       Using the normal matching function on this data finds only "tang".
       The longest matching string is always given first (and numbered
       zero). After a PCRE2_ERROR_PARTIAL return, the output is "Partial
       match:", followed by the partially matching substring. Note that this
       is the entire substring that was inspected during the partial match;
       it may include characters before the actual match start if a
       lookbehind assertion, \b, or \B was involved. (\K is not supported
       for DFA matching.)

       If global matching is requested, the search for further matches
       resumes at the end of the longest match.
       For example:

           re> /(tang|tangerine|tan)/g
         data> yellow tangerine and tangy sultana\=dfa
          0: tangerine
          1: tang
          2: tan
          0: tang
          1: tan
          0: tan

       The alternative matching function does not support substring capture,
       so the modifiers that are concerned with captured substrings are not
       relevant.


RESTARTING AFTER A PARTIAL MATCH

       When the alternative matching function has given the
       PCRE2_ERROR_PARTIAL return, indicating that the subject partially
       matched the pattern, you can restart the match with additional
       subject data by means of the dfa_restart modifier. For example:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\=ps,dfa
         Partial match: 23ja
         data> n05\=dfa,dfa_restart
          0: n05

       For further information about partial matching, see the pcre2partial
       documentation.


CALLOUTS

       If the pattern contains any callout requests, pcre2test's callout
       function is called during matching unless callout_none is specified.
       This works with both matching functions, and with JIT, though there
       are some differences in behaviour. The output for callouts with
       numerical arguments and those with string arguments is slightly
       different.

   Callouts with numerical arguments

       By default, the callout function displays the callout number, the
       start and current positions in the subject text at the callout time,
       and the next pattern item to be tested. For example:

         --->pqrabcdef
           0    ^  ^     \d

       This output indicates that callout number 0 occurred for a match
       attempt starting at the fourth character of the subject string, when
       the pointer was at the seventh character, and when the next pattern
       item was \d. Just one circumflex is output if the start and current
       positions are the same, or if the current position precedes the start
       position, which can happen if the callout is in a lookbehind
       assertion.

       Callouts numbered 255 are assumed to be automatic callouts, inserted
       as a result of the auto_callout pattern modifier. In this case,
       instead of showing the callout number, the offset in the pattern,
       preceded by a plus, is output.
       For example:

           re> /\d?[A-E]\*/auto_callout
         data> E*
         --->E*
          +0 ^      \d?
          +3 ^      [A-E]
          +8 ^^     \*
         +10 ^ ^
          0: E*

       If a pattern contains (*MARK) items, an additional line is output
       whenever a change of latest mark is passed to the callout function.
       For example:

           re> /a(*MARK:X)bc/auto_callout
         data> abc
         --->abc
          +0 ^       a
          +1 ^^      (*MARK:X)
         +10 ^^      b
         Latest Mark: X
         +11 ^ ^     c
         +12 ^  ^
          0: abc

       The mark changes between matching "a" and "b", but stays the same for
       the rest of the match, so nothing more is output. If, as a result of
       backtracking, the mark reverts to being unset, the text "<unset>" is
       output.
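
       The circumflex markers in the callout lines of this section sit under
       the start and current positions of the echoed subject. A small Python
       sketch of that placement follows. The function names are
       hypothetical. The three-character label field (so that the markers
       line up under the text after "--->") and the choice to put the single
       circumflex at the start position are assumptions, not taken from
       pcre2test's source.

```python
def caret_markers(start, current):
    """Return the circumflex string for the given subject offsets."""
    if current <= start:
        # Same position, or current precedes start: one circumflex only.
        return " " * start + "^"
    return " " * start + "^" + " " * (current - start - 1) + "^"

def callout_line(label, start, current):
    """Prefix the markers with a callout number or a "+offset" label."""
    # Right-align the label in three columns, then one space, which is
    # the same width as the "--->" that precedes the echoed subject.
    return f"{label:>3} " + caret_markers(start, current)
```

       For instance, callout_line("0", 3, 6) places the circumflexes under
       the fourth and seventh subject characters, as in the pqrabcdef
       example.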

   Callouts with string arguments

       The output for a callout with a string argument is similar, except
       that instead of outputting a callout number before the position
       indicators, the callout string and its offset in the pattern string
       are output before the reflection of the subject string, and the
       subject string is reflected for each callout. For example:

           re> /^ab(?C'first')cd(?C"second")ef/
         data> abcdefg
         Callout (7): 'first'
         --->abcdefg
             ^ ^         c
         Callout (20): "second"
         --->abcdefg
             ^   ^       e
          0: abcdef


   Callout modifiers

       The callout function in pcre2test returns zero (carry on matching) by
       default, but you can use a callout_fail modifier in a subject line to
       change this and other parameters of the callout (see below).

       If the callout_capture modifier is set, the current captured groups
       are output when a callout occurs. This is useful only for non-DFA
       matching, as pcre2_dfa_match() does not support capturing, so no
       captures are ever shown.

       The normal callout output, showing the callout number or pattern
       offset (as described above) is suppressed if the callout_no_where
       modifier is set.

       When using the interpretive matching function pcre2_match() without
       JIT, setting the callout_extra modifier causes additional output from
       pcre2test's callout function to be generated. For the first callout
       in a match attempt at a new starting position in the subject, "New
       match attempt" is output. If there has been a backtrack since the
       last callout (or start of matching if this is the first callout),
       "Backtrack" is output, followed by "No other matching paths" if the
       backtrack ended the previous match attempt.
       For example:

           re> /(a+)b/auto_callout,no_start_optimize,no_auto_possess
         data> aac\=callout_extra
         New match attempt
         --->aac
          +0 ^       (
          +1 ^       a+
          +3 ^ ^     )
          +4 ^ ^     b
         Backtrack
         --->aac
          +3 ^^      )
          +4 ^^      b
         Backtrack
         No other matching paths
         New match attempt
         --->aac
          +0  ^      (
          +1  ^      a+
          +3  ^^     )
          +4  ^^     b
         Backtrack
         No other matching paths
         New match attempt
         --->aac
          +0   ^     (
          +1   ^     a+
         Backtrack
         No other matching paths
         New match attempt
         --->aac
          +0    ^    (
          +1    ^    a+
         No match

       Notice that various optimizations must be turned off if you want all
       possible matching paths to be scanned. If no_start_optimize is not
       used, there is an immediate "no match", without any callouts, because
       the starting optimization fails to find "b" in the subject, which it
       knows must be present for any match.
       If no_auto_possess is not used, the "a+" item is turned into "a++",
       which reduces the number of backtracks.

       The callout_extra modifier has no effect if used with the DFA matching
       function, or with JIT.

   Return values from callouts

       The default return from the callout function is zero, which allows
       matching to continue. The callout_fail modifier can be given one or two
       numbers. If there is only one number, 1 is returned instead of 0
       (causing matching to backtrack) when a callout of that number is
       reached. If two numbers (<n>:<m>) are given, 1 is returned when callout
       <n> is reached and there have been at least <m> callouts. The
       callout_error modifier is similar, except that PCRE2_ERROR_CALLOUT is
       returned, causing the entire matching process to be aborted. If both
       these modifiers are set for the same callout number, callout_error
       takes precedence. Note that callouts with string arguments are always
       given the number zero.

       The callout_data modifier can be given an unsigned or a negative
       number. This is set as the "user data" that is passed to the matching
       function, and passed back when the callout function is invoked. Any
       value other than zero is used as a return from pcre2test's callout
       function.
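
       As a rough sketch of how these modifiers might be exercised, the
       following input defines two numbered callouts and applies a different
       return-value modifier to each subject line. The pattern and subjects
       are invented for illustration and are not taken from the PCRE2 test
       files. No output is shown, because it depends on the build of
       pcre2test in use.

```
/a(?C1)b(?C2)c/
    abc
    abc\=callout_fail=1
    abc\=callout_fail=2:2
    abc\=callout_error=2
    abc\=callout_data=1
```

       Following the rules above, the first subject lets every callout return
       zero. The second returns 1 when callout 1 is reached. The third
       returns 1 when callout 2 is reached and there have been at least two
       callouts. The fourth returns PCRE2_ERROR_CALLOUT when callout 2 is
       reached. The fifth passes 1 as user data, which is then used as the
       return value from the callout function.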

       Inserting callouts can be helpful when using pcre2test to check
       complicated regular expressions. For further information about
       callouts, see the pcre2callout documentation.


NON-PRINTING CHARACTERS

       When pcre2test is outputting text in the compiled version of a pattern,
       bytes other than 32-126 are always treated as non-printing characters
       and are therefore shown as hex escapes.

       When pcre2test is outputting text that is a matched part of a subject
       string, it behaves in the same way, unless a different locale has been
       set for the pattern (using the locale modifier). In this case, the
       isprint() function is used to distinguish printing and non-printing
       characters.


SAVING AND RESTORING COMPILED PATTERNS

       It is possible to save compiled patterns on disc or elsewhere, and
       reload them later, subject to a number of restrictions. JIT data cannot
       be saved. The host on which the patterns are reloaded must be running
       the same version of PCRE2, with the same code unit width, and must also
       have the same endianness, pointer width and PCRE2_SIZE type. Before
       compiled patterns can be saved they must be serialized, that is,
       converted to a stream of bytes.
       A single byte stream may contain any number of compiled patterns, but
       they must all use the same character tables. A single copy of the
       tables is included in the byte stream (its size is 1088 bytes).

       The functions whose names begin with pcre2_serialize_ are used for
       serializing and de-serializing. They are described in the
       pcre2serialize documentation. In this section we describe the features
       of pcre2test that can be used to test these functions.

       Note that "serialization" in PCRE2 does not convert compiled patterns
       to an abstract format, as serialization in Java or .NET does. It just
       makes a reloadable byte code stream. Hence the restrictions on
       reloading mentioned above.

       In pcre2test, when a pattern with the push modifier is successfully
       compiled, it is pushed onto a stack of compiled patterns, and pcre2test
       expects the next line to contain a new pattern (or command) instead of
       a subject line. By contrast, the pushcopy modifier causes a copy of the
       compiled pattern to be stacked, leaving the original available for
       immediate matching. By using push and/or pushcopy, a number of patterns
       can be compiled and retained. These modifiers are incompatible with
       posix, and control modifiers that act at match time are ignored (with a
       message) for the stacked patterns. The jitverify modifier applies only
       at compile time.
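
       The difference between the two modifiers can be sketched with a short
       input fragment. The patterns and the subject are invented for
       illustration, and no output is shown.

```
/abc/push
/xyz/pushcopy
    xyz
```

       Here /abc/ is only stacked, so the line that follows it must be another
       pattern or a command. In contrast, /xyz/ is stacked as a copy while the
       original remains current, so a subject line may follow it directly.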

       The command

         #save <filename>

       causes all the stacked patterns to be serialized and the result written
       to the named file. Afterwards, all the stacked patterns are freed. The
       command

         #load <filename>

       reads the data in the file, and then arranges for it to be
       de-serialized, with the resulting compiled patterns added to the
       pattern stack. The pattern on the top of the stack can be retrieved by
       the #pop command, which must be followed by lines of subjects that are
       to be matched with the pattern, terminated as usual by an empty line or
       end of file. This command may be followed by a modifier list containing
       only control modifiers that act after a pattern has been compiled. In
       particular, hex, posix, posix_nosub, push, and pushcopy are not
       allowed, nor are any option-setting modifiers. The JIT modifiers are,
       however, permitted. Here is an example that saves and reloads two
       patterns.

         /abc/push
         /xyz/push
         #save tempfile
         #load tempfile
         #pop info
         xyz

         #pop jit,bincode
         abc

       If jitverify is used with #pop, it does not automatically imply jit,
       which is different behaviour from when it is used on a pattern.

       The #popcopy command is analogous to the pushcopy modifier in that it
       makes current a copy of the topmost stack pattern, leaving the original
       still on the stack.


SEE ALSO

       pcre2(3), pcre2api(3), pcre2callout(3), pcre2jit(3), pcre2matching(3),
       pcre2partial(3), pcre2pattern(3), pcre2serialize(3).


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 24 April 2024
       Copyright (c) 1997-2024 University of Cambridge.


PCRE 10.44                       24 April 2024                    PCRE2TEST(1)