<html>
<head>
<title>pcre2test specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2test man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SYNOPSIS</a>
<li><a name="TOC2" href="#SEC2">PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a>
<li><a name="TOC3" href="#SEC3">INPUT ENCODING</a>
<li><a name="TOC4" href="#SEC4">COMMAND LINE OPTIONS</a>
<li><a name="TOC5" href="#SEC5">DESCRIPTION</a>
<li><a name="TOC6" href="#SEC6">COMMAND LINES</a>
<li><a name="TOC7" href="#SEC7">MODIFIER SYNTAX</a>
<li><a name="TOC8" href="#SEC8">PATTERN SYNTAX</a>
<li><a name="TOC9" href="#SEC9">SUBJECT LINE SYNTAX</a>
<li><a name="TOC10" href="#SEC10">PATTERN MODIFIERS</a>
<li><a name="TOC11" href="#SEC11">SUBJECT MODIFIERS</a>
<li><a name="TOC12" href="#SEC12">THE ALTERNATIVE MATCHING FUNCTION</a>
<li><a name="TOC13" href="#SEC13">DEFAULT OUTPUT FROM pcre2test</a>
<li><a name="TOC14" href="#SEC14">OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION</a>
<li><a name="TOC15" href="#SEC15">RESTARTING AFTER A PARTIAL MATCH</a>
<li><a name="TOC16" href="#SEC16">CALLOUTS</a>
<li><a name="TOC17" href="#SEC17">NON-PRINTING CHARACTERS</a>
<li><a name="TOC18" href="#SEC18">SAVING AND RESTORING COMPILED PATTERNS</a>
<li><a name="TOC19" href="#SEC19">SEE ALSO</a>
<li><a name="TOC20" href="#SEC20">AUTHOR</a>
<li><a name="TOC21" href="#SEC21">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
<P>
<b>pcre2test [options] [input file [output file]]</b>
<br>
<br>
<b>pcre2test</b> is a test program for the PCRE2 regular expression libraries,
but it can also be used for
experimenting with regular expressions. This
document describes the features of the test program; for details of the regular
expressions themselves, see the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation. For details of the PCRE2 library function calls and their
options, see the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<P>
The input for <b>pcre2test</b> is a sequence of regular expression patterns and
subject strings to be matched. There are also command lines for setting
defaults and controlling some special actions. The output shows the result of
each match attempt. Modifiers on external or internal command lines, the
patterns, and the subject lines specify PCRE2 function options, control how the
subject is processed, and what output is produced.
</P>
<P>
There are many obscure modifiers, some of which are specifically designed for
use in conjunction with the test script and data files that are distributed as
part of PCRE2. All the modifiers are documented here, some without much
justification, but many of them are unlikely to be of use except when testing
the libraries.
</P>
<br><a name="SEC2" href="#TOC1">PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a><br>
<P>
Different versions of the PCRE2 library can be built to support character
strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
all three of these libraries may be simultaneously installed. The
<b>pcre2test</b> program can be used to test all the libraries. However, its own
input and output are always in 8-bit format. When testing the 16-bit or 32-bit
libraries, patterns and subject strings are converted to 16-bit or 32-bit
format before being passed to the library functions. Results are converted back
to 8-bit code units for output.
</P>
<P>
In the rest of this document, the names of library functions and structures
are given in generic form, for example, <b>pcre2_compile()</b>. The actual
names used in the libraries have a suffix _8, _16, or _32, as appropriate.
<a name="inputencoding"></a></P>
<br><a name="SEC3" href="#TOC1">INPUT ENCODING</a><br>
<P>
Input to <b>pcre2test</b> is processed line by line, either by calling the C
library's <b>fgets()</b> function, or via the <b>libreadline</b> or <b>libedit</b>
library. In some Windows environments character 26 (hex 1A) causes an immediate
end of file, and no further data is read, so this character should be avoided
unless you really want that action.
</P>
<P>
The input is processed using C's string functions, so must not contain binary
zeros, even though in Unix-like environments, <b>fgets()</b> treats any bytes
other than newline as data characters. An error is generated if a binary zero
is encountered. By default subject lines are processed for backslash escapes,
which makes it possible to include any data value in strings that are passed to
the library for matching. For patterns, there is a facility for specifying some
or all of the 8-bit input characters as hexadecimal pairs, which makes it
possible to include binary zeros.
</P>
<br><b>
Input for the 16-bit and 32-bit libraries
</b><br>
<P>
When testing the 16-bit or 32-bit libraries, there is a need to be able to
generate character code points greater than 255 in the strings that are passed
to the library. For subject lines, backslash escapes can be used. In addition,
when the <b>utf</b> modifier (see
<a href="#optionmodifiers">"Setting compilation options"</a>
below) is set, the pattern and any following subject lines are interpreted as
UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate.
</P>
<P>
For non-UTF testing of wide characters, the <b>utf8_input</b> modifier can be
used.
This is mutually exclusive with <b>utf</b>, and is allowed only in 16-bit
or 32-bit mode. It causes the pattern and following subject lines to be treated
as UTF-8 according to the original definition (RFC 2279), which allows for
character values up to 0x7fffffff. Each character is placed in one 16-bit or
32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
to occur).
</P>
<P>
UTF-8 (in its original definition) is not capable of encoding values greater
than 0x7fffffff, but such values can be handled by the 32-bit library. When
testing this library in non-UTF mode with <b>utf8_input</b> set, if any
character is preceded by the byte 0xff (which is an invalid byte in UTF-8)
0x80000000 is added to the character's value. This is the only way of passing
such code points in a pattern string. For subject strings, using an escape
sequence is preferable.
</P>
<br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
<P>
<b>-8</b>
If the 8-bit library has been built, this option causes it to be used (this is
the default). If the 8-bit library has not been built, this option causes an
error.
</P>
<P>
<b>-16</b>
If the 16-bit library has been built, this option causes it to be used. If the
8-bit library has not been built, this is the default. If the 16-bit library
has not been built, this option causes an error.
</P>
<P>
<b>-32</b>
If the 32-bit library has been built, this option causes it to be used. If no
other library has been built, this is the default. If the 32-bit library has
not been built, this option causes an error.
</P>
<P>
<b>-ac</b>
Behave as if each pattern has the <b>auto_callout</b> modifier, that is, insert
automatic callouts into every pattern that is compiled.
</P>
<P>
<b>-AC</b>
As for <b>-ac</b>, but in addition behave as if each subject line has the
<b>callout_extra</b> modifier, that is, show additional information from
callouts.
</P>
<P>
<b>-b</b>
Behave as if each pattern has the <b>fullbincode</b> modifier; the full
internal binary form of the pattern is output after compilation.
</P>
<P>
<b>-C</b>
Output the version number of the PCRE2 library, and all available information
about the optional features that are included, and then exit with zero exit
code. All other options are ignored. If both -C and any -Lx options are
present, whichever is first is recognized.
</P>
<P>
<b>-C</b> <i>option</i>
Output information about a specific build-time option, then exit. This
functionality is intended for use in scripts such as <b>RunTest</b>. The
following options output the value and set the exit code as indicated:
<pre>
  ebcdic-nl  the code for LF (= NL) in an EBCDIC environment:
               0x15 or 0x25
               0 if used in an ASCII environment
               exit code is always 0
  linksize   the configured internal link size (2, 3, or 4)
               exit code is set to the link size
  newline    the default newline setting:
               CR, LF, CRLF, ANYCRLF, ANY, or NUL
               exit code is always 0
  bsr        the default setting for what \R matches:
               ANYCRLF or ANY
               exit code is always 0
</pre>
The following options output 1 for true or 0 for false, and set the exit code
to the same value:
<pre>
  backslash-C  \C is supported (not locked out)
  ebcdic       compiled for an EBCDIC environment
  jit          just-in-time support is available
  pcre2-16     the 16-bit library was built
  pcre2-32     the 32-bit library was built
  pcre2-8      the 8-bit library was built
  unicode      Unicode support is available
</pre>
If an unknown option is given, an error message is output; the exit code is 0.
</P>
<P>
<b>-d</b>
Behave as if each pattern has the <b>debug</b> modifier; the internal
form and information about the compiled pattern is output after compilation;
<b>-d</b> is equivalent to <b>-b -i</b>.
</P>
<P>
<b>-dfa</b>
Behave as if each subject line has the <b>dfa</b> modifier; matching is done
using the <b>pcre2_dfa_match()</b> function instead of the default
<b>pcre2_match()</b>.
</P>
<P>
<b>-error</b> <i>number[,number,...]</i>
Call <b>pcre2_get_error_message()</b> for each of the error numbers in the
comma-separated list, display the resulting messages on the standard output,
then exit with zero exit code. The numbers may be positive or negative. This is
a convenience facility for PCRE2 maintainers.
</P>
<P>
<b>-help</b>
Output a brief summary of these options and then exit.
</P>
<P>
<b>-i</b>
Behave as if each pattern has the <b>info</b> modifier; information about the
compiled pattern is given after compilation.
</P>
<P>
<b>-jit</b>
Behave as if each pattern line has the <b>jit</b> modifier; after successful
compilation, each pattern is passed to the just-in-time compiler, if available.
</P>
<P>
<b>-jitfast</b>
Behave as if each pattern line has the <b>jitfast</b> modifier; after
successful compilation, each pattern is passed to the just-in-time compiler, if
available, and each subject line is passed directly to the JIT matcher via its
"fast path".
</P>
<P>
<b>-jitverify</b>
Behave as if each pattern line has the <b>jitverify</b> modifier; after
successful compilation, each pattern is passed to the just-in-time compiler, if
available, and the use of JIT for matching is verified.
</P>
<P>
<b>-LM</b>
List modifiers: write a list of available pattern and subject modifiers to the
standard output, then exit with zero exit code. All other options are ignored.
If both -C and any -Lx options are present, whichever is first is recognized.
</P>
<P>
<b>-LP</b>
List properties: write a list of recognized Unicode properties to the standard
output, then exit with zero exit code. All other options are ignored. If both
-C and any -Lx options are present, whichever is first is recognized.
</P>
<P>
<b>-LS</b>
List scripts: write a list of recognized Unicode script names to the standard
output, then exit with zero exit code. All other options are ignored. If both
-C and any -Lx options are present, whichever is first is recognized.
</P>
<P>
<b>-pattern</b> <i>modifier-list</i>
Behave as if each pattern line contains the given modifiers.
</P>
<P>
<b>-q</b>
Do not output the version number of <b>pcre2test</b> at the start of execution.
</P>
<P>
<b>-S</b> <i>size</i>
On Unix-like systems, set the size of the run-time stack to <i>size</i>
mebibytes (units of 1024*1024 bytes).
</P>
<P>
<b>-subject</b> <i>modifier-list</i>
Behave as if each subject line contains the given modifiers.
</P>
<P>
<b>-t</b>
Run each compile and match many times with a timer, and output the resulting
times per compile or match. When JIT is used, separate times are given for the
initial compile and the JIT compile. You can control the number of iterations
that are used for timing by following <b>-t</b> with a number (as a separate
item on the command line). For example, "-t 1000" iterates 1000 times. The
default is to iterate 500,000 times.
</P>
<P>
<b>-tm</b>
This is like <b>-t</b> except that it times only the matching phase, not the
compile phase.
</P>
<P>
<b>-T</b> <b>-TM</b>
These behave like <b>-t</b> and <b>-tm</b>, but in addition, at the end of a run,
the total times for all compiles and matches are output.
</P>
<P>
<b>-version</b>
Output the PCRE2 version number and then exit.
</P>
<br><a name="SEC5" href="#TOC1">DESCRIPTION</a><br>
<P>
If <b>pcre2test</b> is given two filename arguments, it reads from the first and
writes to the second. If the first name is "-", input is taken from the
standard input. If <b>pcre2test</b> is given only one argument, it reads from
that file and writes to stdout. Otherwise, it reads from stdin and writes to
stdout.
</P>
<P>
When <b>pcre2test</b> is built, a configuration option can specify that it
should be linked with the <b>libreadline</b> or <b>libedit</b> library. When this
is done, if the input is from a terminal, it is read using the <b>readline()</b>
function. This provides line-editing and history facilities. The output from
the <b>-help</b> option states whether or not <b>readline()</b> will be used.
</P>
<P>
The program handles any number of tests, each of which consists of a set of
input lines. Each set starts with a regular expression pattern, followed by any
number of subject lines to be matched against that pattern. In between sets of
test data, command lines that begin with # may appear. This file format, with
some restrictions, can also be processed by the <b>perltest.sh</b> script that
is distributed with PCRE2 as a means of checking that the behaviour of PCRE2
and Perl is the same. For a specification of <b>perltest.sh</b>, see the
comments near its beginning. See also the #perltest command below.
</P>
<P>
When the input is a terminal, <b>pcre2test</b> prompts for each line of input,
using "re&#62;" to prompt for regular expression patterns, and "data&#62;" to prompt
for subject lines. Command lines starting with # can be entered only in
response to the "re&#62;" prompt.
</P>
<P>
Each subject line is matched separately and independently.
If you want to do
multi-line matches, you have to use the \n escape sequence (or \r or \r\n,
etc., depending on the newline setting) in a single line of input to encode the
newline sequences. There is no limit on the length of subject lines; the input
buffer is automatically extended if it is too small. There are replication
features that make it possible to generate long repetitive pattern or subject
lines without having to supply them explicitly.
</P>
<P>
An empty line or the end of the file signals the end of the subject lines for a
test, at which point a new pattern or command line is expected if there is
still input to be read.
</P>
<br><a name="SEC6" href="#TOC1">COMMAND LINES</a><br>
<P>
In between sets of test data, a line that begins with # is interpreted as a
command line. If the first character is followed by white space or an
exclamation mark, the line is treated as a comment, and ignored. Otherwise, the
following commands are recognized:
<pre>
  #forbid_utf
</pre>
Subsequent patterns automatically have the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP
options set, which locks out the use of the PCRE2_UTF and PCRE2_UCP options and
the use of (*UTF) and (*UCP) at the start of patterns. This command also forces
an error if a subsequent pattern contains any occurrences of \P, \p, or \X,
which are still supported when PCRE2_UTF is not set, but which require Unicode
property support to be included in the library.
</P>
<P>
This is a trigger guard that is used in test files to ensure that UTF or
Unicode property tests are not accidentally added to files that are used when
Unicode support is not included in the library.
Setting PCRE2_NEVER_UTF and
PCRE2_NEVER_UCP as a default can also be obtained by the use of <b>#pattern</b>;
the difference is that <b>#forbid_utf</b> cannot be unset, and the automatic
options are not displayed in pattern information, to avoid cluttering up test
output.
<pre>
  #load &#60;filename&#62;
</pre>
This command is used to load a set of precompiled patterns from a file, as
described in the section entitled "Saving and restoring compiled patterns"
<a href="#saverestore">below.</a>
<pre>
  #loadtables &#60;filename&#62;
</pre>
This command is used to load a set of binary character tables that can be
accessed by the tables=3 qualifier. Such tables can be created by the
<b>pcre2_dftables</b> program with the -b option.
<pre>
  #newline_default [&#60;newline-list&#62;]
</pre>
When PCRE2 is built, a default newline convention can be specified. This
determines which characters and/or character pairs are recognized as indicating
a newline in a pattern or subject string. The default can be overridden when a
pattern is compiled. The standard test files contain tests of various newline
conventions, but the majority of the tests expect a single linefeed to be
recognized as a newline by default. Without special action the tests would fail
when PCRE2 is compiled with either CR or CRLF as the default newline.
</P>
<P>
The #newline_default command specifies a list of newline types that are
acceptable as the default. The types must be one of CR, LF, CRLF, ANYCRLF,
ANY, or NUL (in upper or lower case), for example:
<pre>
  #newline_default LF Any anyCRLF
</pre>
If the default newline is in the list, this command has no effect. Otherwise,
except when testing the POSIX API, a <b>newline</b> modifier that specifies the
first newline convention in the list (LF in the above example) is added to any
pattern that does not already have a <b>newline</b> modifier.
If the newline
list is empty, the feature is turned off. This command is present in a number
of the standard test input files.
</P>
<P>
When the POSIX API is being tested there is no way to override the default
newline convention, though it is possible to set the newline convention from
within the pattern. A warning is given if the <b>posix</b> or <b>posix_nosub</b>
modifier is used when <b>#newline_default</b> would set a default for the
non-POSIX API.
<pre>
  #pattern &#60;modifier-list&#62;
</pre>
This command sets a default modifier list that applies to all subsequent
patterns. Modifiers on a pattern can change these settings.
<pre>
  #perltest
</pre>
This line is used in test files that can also be processed by <b>perltest.sh</b>
to confirm that Perl gives the same results as PCRE2. Subsequent tests are
checked for the use of <b>pcre2test</b> features that are incompatible with the
<b>perltest.sh</b> script.
</P>
<P>
Patterns must use '/' as their delimiter, and only certain modifiers are
supported. Comment lines, #pattern commands, and #subject commands that set or
unset "mark" are recognized and acted on. The #perltest, #forbid_utf, and
#newline_default commands, which are needed in the relevant pcre2test files,
are silently ignored. All other command lines are ignored, but give a warning
message. The <b>#perltest</b> command helps detect tests that are accidentally
put in the wrong file or use the wrong delimiter. For more details of the
<b>perltest.sh</b> script see the comments it contains.
<pre>
  #pop [&#60;modifiers&#62;]
  #popcopy [&#60;modifiers&#62;]
</pre>
These commands are used to manipulate the stack of compiled patterns, as
described in the section entitled "Saving and restoring compiled patterns"
<a href="#saverestore">below.</a>
<pre>
  #save &#60;filename&#62;
</pre>
This command is used to save a set of compiled patterns to a file, as described
in the section entitled "Saving and restoring compiled patterns"
<a href="#saverestore">below.</a>
<pre>
  #subject &#60;modifier-list&#62;
</pre>
This command sets a default modifier list that applies to all subsequent
subject lines. Modifiers on a subject line can change these settings.
</P>
<br><a name="SEC7" href="#TOC1">MODIFIER SYNTAX</a><br>
<P>
Modifier lists are used with both pattern and subject lines. Items in a list
are separated by commas followed by optional white space. Trailing whitespace
in a modifier list is ignored. Some modifiers may be given for both patterns
and subject lines, whereas others are valid only for one or the other. Each
modifier has a long name, for example "anchored", and some of them must be
followed by an equals sign and a value, for example, "offset=12". Values cannot
contain comma characters, but may contain spaces. Modifiers that do not take
values may be preceded by a minus sign to turn off a previous setting.
</P>
<P>
A few of the more common modifiers can also be specified as single letters, for
example "i" for "caseless". In documentation, following the Perl convention,
these are written with a slash ("the /i modifier") for clarity. Abbreviated
modifiers must all be concatenated in the first item of a modifier list. If the
first item is not recognized as a long modifier name, it is interpreted as a
sequence of these abbreviations.
For example:
<pre>
  /abc/ig,newline=cr,jit=3
</pre>
This is a pattern line whose modifier list starts with two one-letter modifiers
(/i and /g). The lower-case abbreviated modifiers are the same as used in Perl.
</P>
<br><a name="SEC8" href="#TOC1">PATTERN SYNTAX</a><br>
<P>
A pattern line must start with one of the following characters (common symbols,
excluding pattern meta-characters):
<pre>
  / ! " ' ` - = _ : ; , % &#38; @ ~
</pre>
This is interpreted as the pattern's delimiter. A regular expression may be
continued over several input lines, in which case the newline characters are
included within it. It is possible to include the delimiter as a literal within
the pattern by escaping it with a backslash, for example
<pre>
  /abc\/def/
</pre>
If you do this, the escape and the delimiter form part of the pattern, but
since the delimiters are all non-alphanumeric, the inclusion of the backslash
does not affect the pattern's interpretation. Note, however, that this trick
does not work within \Q...\E literal bracketing because the backslash will
itself be interpreted as a literal. If the terminating delimiter is immediately
followed by a backslash, for example,
<pre>
  /abc/\
</pre>
a backslash is added to the end of the pattern. This is done to provide a way
of testing the error condition that arises if a pattern finishes with a
backslash, because
<pre>
  /abc\/
</pre>
is interpreted as the first line of a pattern that starts with "abc/", causing
pcre2test to read the next line as a continuation of the regular expression.
</P>
<P>
A pattern can be followed by a modifier list (details below).
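As a minimal sketch of the delimiter rules, each of the following hypothetical
lines is a complete pattern line. They are shown as alternatives rather than as
one input file, and the patterns themselves are illustrative assumptions, not
taken from the PCRE2 test files:
<pre>
  /abc\/def/i
  !abc/def!i
  "abc/def"
</pre>
The first line keeps / as the delimiter and escapes the slash inside the
pattern. The other two choose a different delimiter so that no escape is
needed. The first two lines also carry the single-letter modifier /i.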
</P>
<br><a name="SEC9" href="#TOC1">SUBJECT LINE SYNTAX</a><br>
<P>
Before each subject line is passed to <b>pcre2_match()</b>,
<b>pcre2_dfa_match()</b>, or <b>pcre2_jit_match()</b>, leading and trailing white
space is removed, and the line is scanned for backslash escapes, unless the
<b>subject_literal</b> modifier was set for the pattern. The following provide a
means of encoding non-printing characters in a visible way:
<pre>
  \a         alarm (BEL, \x07)
  \b         backspace (\x08)
  \e         escape (\x1b)
  \f         form feed (\x0c)
  \n         newline (\x0a)
  \r         carriage return (\x0d)
  \t         tab (\x09)
  \v         vertical tab (\x0b)
  \nnn       octal character (up to 3 octal digits); always
               a byte unless &#62; 255 in UTF-8 or 16-bit or 32-bit mode
  \o{dd...}  octal character (any number of octal digits)
  \xhh       hexadecimal byte (up to 2 hex digits)
  \x{hh...}  hexadecimal character (any number of hex digits)
</pre>
The use of \x{hh...} is not dependent on the use of the <b>utf</b> modifier on
the pattern. It is recognized always. There may be any number of hexadecimal
digits inside the braces; invalid values provoke error messages.
</P>
<P>
Note that \xhh specifies one byte rather than one character in UTF-8 mode;
this makes it possible to construct invalid UTF-8 sequences for testing
purposes. On the other hand, \x{hh} is interpreted as a UTF-8 character in
UTF-8 mode, generating more than one byte if the value is greater than 127.
When testing the 8-bit library not in UTF-8 mode, \x{hh} generates one byte
for values less than 256, and causes an error for greater values.
</P>
<P>
In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
possible to construct invalid UTF-16 sequences for testing purposes.
</P>
<P>
In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This makes it
possible to construct invalid UTF-32 sequences for testing purposes.
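The difference between the \xhh and \x{hh...} forms can be sketched with a
hypothetical test for the 8-bit library (the pattern and subjects are
illustrative assumptions):
<pre>
  /caf./utf
      caf\xc3\xa9
      caf\x{e9}
</pre>
In UTF-8 mode the first subject supplies the bytes 0xc3 and 0xa9 individually,
whereas the second asks for the character U+00E9, which is encoded as the same
two bytes. Writing only \xc3 would leave a truncated, and therefore invalid,
UTF-8 sequence at the end of the subject.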
</P>
<P>
There is a special backslash sequence that specifies replication of one or more
characters:
<pre>
  \[&#60;characters&#62;]{&#60;count&#62;}
</pre>
This makes it possible to test long strings without having to provide them as
part of the file. For example:
<pre>
  \[abc]{4}
</pre>
is converted to "abcabcabcabc". This feature does not support nesting. To
include a closing square bracket in the characters, code it as \x5D.
</P>
<P>
A backslash followed by an equals sign marks the end of the subject string and
the start of a modifier list. For example:
<pre>
  abc\=notbol,notempty
</pre>
If the subject string is empty and \= is followed by whitespace, the line is
treated as a comment line, and is not used for matching. For example:
<pre>
  \= This is a comment.
  abc\= This is an invalid modifier list.
</pre>
A backslash followed by any other non-alphanumeric character just escapes that
character. A backslash followed by anything else causes an error. However, if
the very last character in the line is a backslash (and there is no modifier
list), it is ignored. This gives a way of passing an empty line as data, since
a real empty line terminates the data input.
</P>
<P>
If the <b>subject_literal</b> modifier is set for a pattern, all subject lines
that follow are treated as literals, with no special treatment of backslashes.
No replication is possible, and any subject modifiers must be set as defaults
by a <b>#subject</b> command.
</P>
<br><a name="SEC10" href="#TOC1">PATTERN MODIFIERS</a><br>
<P>
There are several types of modifier that can appear in pattern lines. Except
where noted below, they may also be used in <b>#pattern</b> commands. A
pattern's modifier list can add to or override default modifiers that were set
by a previous <b>#pattern</b> command.
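For example, in the following hypothetical input (an illustrative sketch, not
taken from the PCRE2 test files), the <b>#pattern</b> command makes
<b>caseless</b> a default, the first pattern adds <b>multiline</b> to it, and
the second pattern turns the default off again with a minus sign:
<pre>
  #pattern caseless
  /^abc/multiline
      ABC

  /^abc/-caseless
      ABC
</pre>
The empty line is needed because it ends the subject lines of the first test;
without it, the second pattern line would be read as another subject.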
<a name="optionmodifiers"></a></P>
<br><b>
Setting compilation options
</b><br>
<P>
The following modifiers set options for <b>pcre2_compile()</b>. Most of them set
bits in the options argument of that function, but those whose names start with
PCRE2_EXTRA are additional options that are set in the compile context.
Some of these options have single-letter abbreviations. There is special
handling for /x: if a second x is present, PCRE2_EXTENDED is converted into
PCRE2_EXTENDED_MORE as in Perl. A third appearance adds PCRE2_EXTENDED as well,
though this makes no difference to the way <b>pcre2_compile()</b> behaves. See
<a href="pcre2api.html"><b>pcre2api</b></a>
for a description of the effects of these options.
<pre>
      allow_empty_class         set PCRE2_ALLOW_EMPTY_CLASS
      allow_lookaround_bsk      set PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
      allow_surrogate_escapes   set PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
      alt_bsux                  set PCRE2_ALT_BSUX
      alt_circumflex            set PCRE2_ALT_CIRCUMFLEX
      alt_verbnames             set PCRE2_ALT_VERBNAMES
      anchored                  set PCRE2_ANCHORED
  /a  ascii_all                 set all ASCII options
      ascii_bsd                 set PCRE2_EXTRA_ASCII_BSD
      ascii_bss                 set PCRE2_EXTRA_ASCII_BSS
      ascii_bsw                 set PCRE2_EXTRA_ASCII_BSW
      ascii_digit               set PCRE2_EXTRA_ASCII_DIGIT
      ascii_posix               set PCRE2_EXTRA_ASCII_POSIX
      auto_callout              set PCRE2_AUTO_CALLOUT
      bad_escape_is_literal     set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
  /i  caseless                  set PCRE2_CASELESS
  /r  caseless_restrict         set PCRE2_EXTRA_CASELESS_RESTRICT
      dollar_endonly            set PCRE2_DOLLAR_ENDONLY
  /s  dotall                    set PCRE2_DOTALL
      dupnames                  set PCRE2_DUPNAMES
      endanchored               set PCRE2_ENDANCHORED
      escaped_cr_is_lf          set PCRE2_EXTRA_ESCAPED_CR_IS_LF
  /x  extended                  set PCRE2_EXTENDED
  /xx extended_more             set PCRE2_EXTENDED_MORE
      extra_alt_bsux            set PCRE2_EXTRA_ALT_BSUX
      firstline                 set PCRE2_FIRSTLINE
      literal                   set PCRE2_LITERAL
      match_line                set PCRE2_EXTRA_MATCH_LINE
      match_invalid_utf         set PCRE2_MATCH_INVALID_UTF
      match_unset_backref       set PCRE2_MATCH_UNSET_BACKREF
      match_word                set PCRE2_EXTRA_MATCH_WORD
  /m  multiline                 set PCRE2_MULTILINE
      never_backslash_c         set PCRE2_NEVER_BACKSLASH_C
      never_ucp                 set PCRE2_NEVER_UCP
      never_utf                 set PCRE2_NEVER_UTF
  /n  no_auto_capture           set PCRE2_NO_AUTO_CAPTURE
      no_auto_possess           set PCRE2_NO_AUTO_POSSESS
      no_dotstar_anchor         set PCRE2_NO_DOTSTAR_ANCHOR
      no_start_optimize         set PCRE2_NO_START_OPTIMIZE
      no_utf_check              set PCRE2_NO_UTF_CHECK
      ucp                       set PCRE2_UCP
      ungreedy                  set PCRE2_UNGREEDY
      use_offset_limit          set PCRE2_USE_OFFSET_LIMIT
      utf                       set PCRE2_UTF
</pre>
As well as turning on the PCRE2_UTF option, the <b>utf</b> modifier causes all
non-printing characters in output strings to be printed using the \x{hh...}
notation. Otherwise, those less than 0x100 are output in hex without the curly
brackets. Setting <b>utf</b> in 16-bit or 32-bit mode also causes pattern and
subject strings to be translated to UTF-16 or UTF-32, respectively, before
being passed to library functions.
<a name="controlmodifiers"></a></P>
<br><b>
Setting compilation controls
</b><br>
<P>
The following modifiers affect the compilation process or request information
about the pattern. There are single-letter abbreviations for some that are
heavily used in the test files.
<pre>
      bsr=[anycrlf|unicode]     specify \R handling
  /B  bincode                   show binary code without lengths
      callout_info              show callout information
      convert=&#60;options&#62;         request foreign pattern conversion
      convert_glob_escape=c     set glob escape character
      convert_glob_separator=c  set glob separator character
      convert_length            set convert buffer length
      debug                     same as info,fullbincode
      framesize                 show matching frame size
      fullbincode               show binary code with lengths
  /I  info                      show info about compiled pattern
      hex                       unquoted characters are hexadecimal
      jit[=&#60;number&#62;]            use JIT
      jitfast                   use JIT fast path
      jitverify                 verify JIT use
      locale=&#60;name&#62;             use this locale
      max_pattern_compiled      ) set maximum compiled pattern
        _length=&#60;n&#62;             )   length (bytes)
      max_pattern_length=&#60;n&#62;    set maximum pattern length (code units)
      max_varlookbehind=&#60;n&#62;     set maximum variable lookbehind length
      memory                    show memory used
      newline=&#60;type&#62;            set newline type
      null_context              compile with a NULL context
      null_pattern              pass pattern as NULL
      parens_nest_limit=&#60;n&#62;     set maximum parentheses depth
      posix                     use the POSIX API
      posix_nosub               use the POSIX API with REG_NOSUB
      push                      push compiled pattern onto the stack
      pushcopy                  push a copy onto the stack
      stackguard=&#60;number&#62;       test the stackguard feature
      subject_literal           treat all subject lines as literal
      tables=[0|1|2|3]          select internal tables
      use_length                do not zero-terminate the pattern
      utf8_input                treat input as UTF-8
</pre>
The effects of these modifiers are described in the following sections.
</P>
<br><b>
Newline and \R handling
</b><br>
<P>
The <b>bsr</b> modifier specifies what \R in a pattern should match. If it is
set to "anycrlf", \R matches CR, LF, or CRLF only. If it is set to "unicode",
\R matches any Unicode newline sequence. The default can be specified when
PCRE2 is built; if it is not, the default is set to Unicode.
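A hypothetical test that exercises this setting (the pattern and subjects are
illustrative assumptions, not taken from the PCRE2 test files):
<pre>
  /a\Rb/bsr=anycrlf
      a\r\nb
      a\x0bb
</pre>
With <b>bsr=anycrlf</b> only the first subject should match, because a vertical
tab (\x0b) is a Unicode newline but is not CR, LF, or CRLF. With
<b>bsr=unicode</b> both subjects should match.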
728</P> 729<P> 730The <b>newline</b> modifier specifies which characters are to be interpreted as 731newlines, both in the pattern and in subject lines. The type must be one of CR, 732LF, CRLF, ANYCRLF, ANY, or NUL (in upper or lower case). 733</P> 734<br><b> 735Information about a pattern 736</b><br> 737<P> 738The <b>debug</b> modifier is a shorthand for <b>info,fullbincode</b>, requesting 739all available information. 740</P> 741<P> 742The <b>bincode</b> modifier causes a representation of the compiled code to be 743output after compilation. This information does not contain length and offset 744values, which ensures that the same output is generated for different internal 745link sizes and different code unit widths. By using <b>bincode</b>, the same 746regression tests can be used in different environments. 747</P> 748<P> 749The <b>fullbincode</b> modifier, by contrast, <i>does</i> include length and 750offset values. This is used in a few special tests that run only for specific 751code unit widths and link sizes, and is also useful for one-off tests. 752</P> 753<P> 754The <b>info</b> modifier requests information about the compiled pattern 755(whether it is anchored, has a fixed first character, and so on). The 756information is obtained from the <b>pcre2_pattern_info()</b> function. Here are 757some typical examples: 758<pre> 759 re> /(?i)(^a|^b)/m,info 760 Capture group count = 1 761 Compile options: multiline 762 Overall options: caseless multiline 763 First code unit at start or follows newline 764 Subject length lower bound = 1 765 766 re> /(?i)abc/info 767 Capture group count = 0 768 Compile options: <none> 769 Overall options: caseless 770 First code unit = 'a' (caseless) 771 Last code unit = 'c' (caseless) 772 Subject length lower bound = 3 773</pre> 774"Compile options" are those specified by modifiers; "overall options" have 775added options that are taken or deduced from the pattern. 
If both sets of 776options are the same, just a single "options" line is output; if there are no 777options, the line is omitted. "First code unit" is where any match must start; 778if there is more than one they are listed as "starting code units". "Last code 779unit" is the last literal code unit that must be present in any match. This is 780not necessarily the last character. These lines are omitted if no starting or 781ending code units are recorded. The subject length line is omitted when 782<b>no_start_optimize</b> is set because the minimum length is not calculated 783when it can never be used. 784</P> 785<P> 786The <b>framesize</b> modifier shows the size, in bytes, of each storage frame 787used by <b>pcre2_match()</b> for handling backtracking. The size depends on the 788number of capturing parentheses in the pattern. A vector of these frames is 789used at matching time; its overall size is shown when the <b>heapframes_size</b> 790subject modifier is set. 791</P> 792<P> 793The <b>callout_info</b> modifier requests information about all the callouts in 794the pattern. A list of them is output at the end of any other information that 795is requested. For each callout, either its number or string is given, followed 796by the item that follows it in the pattern. 797</P> 798<br><b> 799Passing a NULL context 800</b><br> 801<P> 802Normally, <b>pcre2test</b> passes a context block to <b>pcre2_compile()</b>. If 803the <b>null_context</b> modifier is set, however, NULL is passed. This is for 804testing that <b>pcre2_compile()</b> behaves correctly in this case (it uses 805default values). 806</P> 807<br><b> 808Passing a NULL pattern 809</b><br> 810<P> 811The <b>null_pattern</b> modifier is for testing the behaviour of 812<b>pcre2_compile()</b> when the pattern argument is NULL. The length value 813passed is the default PCRE2_ZERO_TERMINATED unless <b>use_length</b> is set. 814Any length other than zero causes an error.
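A sketch of how this might be exercised follows; both pattern lines are
illustrative assumptions derived from the description above, not extracts from
the distributed test files:
<pre>
  //null_pattern,use_length

  /abc/null_pattern
</pre>
In the first case the length passed is zero, so the NULL pattern should be
accepted. In the second, the length is PCRE2_ZERO_TERMINATED, which is not
zero, so a compilation error is expected.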
815</P> 816<br><b> 817Specifying pattern characters in hexadecimal 818</b><br> 819<P> 820The <b>hex</b> modifier specifies that the characters of the pattern, except for 821substrings enclosed in single or double quotes, are to be interpreted as pairs 822of hexadecimal digits. This feature is provided as a way of creating patterns 823that contain binary zeros and other non-printing characters. White space is 824permitted between pairs of digits. For example, this pattern contains three 825characters: 826<pre> 827 /ab 32 59/hex 828</pre> 829Parts of such a pattern are taken literally if quoted. This pattern contains 830nine characters, only two of which are specified in hexadecimal: 831<pre> 832 /ab "literal" 32/hex 833</pre> 834Either single or double quotes may be used. There is no way of including 835the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are 836mutually exclusive. 837</P> 838<br><b> 839Specifying the pattern's length 840</b><br> 841<P> 842By default, patterns are passed to the compiling functions as zero-terminated 843strings but can be passed by length instead of being zero-terminated. The 844<b>use_length</b> modifier causes this to happen. Using a length happens 845automatically (whether or not <b>use_length</b> is set) when <b>hex</b> is set, 846because patterns specified in hexadecimal may contain binary zeros. 847</P> 848<P> 849If <b>hex</b> or <b>use_length</b> is used with the POSIX wrapper API (see 850<a href="#posixwrapper">"Using the POSIX wrapper API"</a> 851below), the REG_PEND extension is used to pass the pattern's length. 852</P> 853<br><b> 854Specifying a maximum for variable lookbehinds 855</b><br> 856<P> 857Variable lookbehind assertions are supported only if, for each one, there is a 858maximum length (in characters) that it can match. There is a limit on this, 859whose default can be set at build time, with an ultimate default of 255. 
The 860<b>max_varlookbehind</b> modifier uses the <b>pcre2_set_max_varlookbehind()</b> 861function to change the limit. Lookbehinds whose branches each match a fixed 862length are limited to 65535 characters per branch. 863</P> 864<br><b> 865Specifying wide characters in 16-bit and 32-bit modes 866</b><br> 867<P> 868In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and 869translated to UTF-16 or UTF-32 when the <b>utf</b> modifier is set. For testing 870the 16-bit and 32-bit libraries in non-UTF mode, the <b>utf8_input</b> modifier 871can be used. It is mutually exclusive with <b>utf</b>. Input lines are 872interpreted as UTF-8 as a means of specifying wide characters. More details are 873given in 874<a href="#inputencoding">"Input encoding"</a> 875above. 876</P> 877<br><b> 878Generating long repetitive patterns 879</b><br> 880<P> 881Some tests use long patterns that are very repetitive. Instead of creating a 882very long input line for such a pattern, you can use a special repetition 883feature, similar to the one described for subject lines above. If the 884<b>expand</b> modifier is present on a pattern, parts of the pattern that have 885the form 886<pre> 887 \[<characters>]{<count>} 888</pre> 889are expanded before the pattern is passed to <b>pcre2_compile()</b>. For 890example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction 891cannot be nested. An initial "\[" sequence is recognized only if "]{" followed 892by decimal digits and "}" is found later in the pattern. If not, the characters 893remain in the pattern unaltered. The <b>expand</b> and <b>hex</b> modifiers are 894mutually exclusive. 895</P> 896<P> 897If part of an expanded pattern looks like an expansion, but is really part of 898the actual pattern, unwanted expansion can be avoided by giving two values in 899the quantifier. For example, \[AB]{6000,6000} is not recognized as an 900expansion item. 
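The two cases can be sketched as follows (the patterns and subject lines are
invented for illustration):
<pre>
  /\[ab]{3}c/expand
    abababc

  /\[ab]{3,3}c/expand
    [ab]]]c
</pre>
The first pattern should be expanded to /abababc/ before it is compiled. The
second has two values in the quantifier, so it should be left unaltered and
compiled as an ordinary pattern: a literal "[", then "ab", then three
closing square brackets, then "c".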
901</P> 902<P> 903If the <b>info</b> modifier is set on an expanded pattern, the result of the 904expansion is included in the information that is output. 905</P> 906<br><b> 907JIT compilation 908</b><br> 909<P> 910Just-in-time (JIT) compiling is a heavyweight optimization that can greatly 911speed up pattern matching. See the 912<a href="pcre2jit.html"><b>pcre2jit</b></a> 913documentation for details. JIT compiling happens, optionally, after a pattern 914has been successfully compiled into an internal form. The JIT compiler converts 915this to optimized machine code. It needs to know whether the match-time options 916PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used, because 917different code is generated for the different cases. See the <b>partial</b> 918modifier in "Subject Modifiers" 919<a href="#subjectmodifiers">below</a> 920for details of how these options are specified for each match attempt. 921</P> 922<P> 923JIT compilation is requested by the <b>jit</b> pattern modifier, which may 924optionally be followed by an equals sign and a number in the range 0 to 7. 925The three bits that make up the number specify which of the three JIT operating 926modes are to be compiled: 927<pre> 928 1 compile JIT code for non-partial matching 929 2 compile JIT code for soft partial matching 930 4 compile JIT code for hard partial matching 931</pre> 932The possible values for the <b>jit</b> modifier are therefore: 933<pre> 934 0 disable JIT 935 1 normal matching only 936 2 soft partial matching only 937 3 normal and soft partial matching 938 4 hard partial matching only 939 6 soft and hard partial matching only 940 7 all three modes 941</pre> 942If no number is given, 7 is assumed. The phrase "partial matching" means a call 943to <b>pcre2_match()</b> with either the PCRE2_PARTIAL_SOFT or the 944PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete 945match; the options enable the possibility of a partial match, but do not 946require it. 
Note also that if you request JIT compilation only for partial 947matching (for example, jit=2) but do not set the <b>partial</b> modifier on a 948subject line, that match will not use JIT code because none was compiled for 949non-partial matching. 950</P> 951<P> 952If JIT compilation is successful, the compiled JIT code will automatically be 953used when an appropriate type of match is run, except when incompatible 954run-time options are specified. For more details, see the 955<a href="pcre2jit.html"><b>pcre2jit</b></a> 956documentation. See also the <b>jitstack</b> modifier below for a way of 957setting the size of the JIT stack. 958</P> 959<P> 960If the <b>jitfast</b> modifier is specified, matching is done using the JIT 961"fast path" interface, <b>pcre2_jit_match()</b>, which skips some of the sanity 962checks that are done by <b>pcre2_match()</b>, and of course does not work when 963JIT is not supported. If <b>jitfast</b> is specified without <b>jit</b>, jit=7 is 964assumed. 965</P> 966<P> 967If the <b>jitverify</b> modifier is specified, information about the compiled 968pattern shows whether JIT compilation was or was not successful. If 969<b>jitverify</b> is specified without <b>jit</b>, jit=7 is assumed. If JIT 970compilation is successful when <b>jitverify</b> is set, the text "(JIT)" is 971added to the first output line after a match or non match when JIT-compiled 972code was actually used in the match. 973</P> 974<br><b> 975Setting a locale 976</b><br> 977<P> 978The <b>locale</b> modifier must specify the name of a locale, for example: 979<pre> 980 /pattern/locale=fr_FR 981</pre> 982The given locale is set, <b>pcre2_maketables()</b> is called to build a set of 983character tables for the locale, and this is then passed to 984<b>pcre2_compile()</b> when compiling the regular expression. The same tables 985are used when matching the following subject lines. 
The <b>locale</b> modifier 986applies only to the pattern on which it appears, but can be given in a 987<b>#pattern</b> command if a default is needed. Setting a locale and alternate 988character tables are mutually exclusive. 989</P> 990<br><b> 991Showing pattern memory 992</b><br> 993<P> 994The <b>memory</b> modifier causes the size in bytes of the memory used to hold 995the compiled pattern to be output. This does not include the size of the 996<b>pcre2_code</b> block; it is just the actual compiled data. If the pattern is 997subsequently passed to the JIT compiler, the size of the JIT compiled code is 998also output. Here is an example: 999<pre> 1000 re> /a(b)c/jit,memory 1001 Memory allocation (code space): 21 1002 Memory allocation (JIT code): 1910 1003 1004</PRE> 1005</P> 1006<br><b> 1007Limiting nested parentheses 1008</b><br> 1009<P> 1010The <b>parens_nest_limit</b> modifier sets a limit on the depth of nested 1011parentheses in a pattern. Breaching the limit causes a compilation error. 1012The default for the library is set when PCRE2 is built, but <b>pcre2test</b> 1013sets its own default of 220, which is required for running the standard test 1014suite. 1015</P> 1016<br><b> 1017Limiting the pattern length 1018</b><br> 1019<P> 1020The <b>max_pattern_length</b> modifier sets a limit, in code units, to the 1021length of pattern that <b>pcre2_compile()</b> will accept. Breaching the limit 1022causes a compilation error. The default is the largest number a PCRE2_SIZE 1023variable can hold (essentially unlimited). 1024</P> 1025<br><b> 1026Limiting the size of a compiled pattern 1027</b><br> 1028<P> 1029The <b>max_pattern_compiled_length</b> modifier sets a limit, in bytes, to the 1030amount of memory used by a compiled pattern. Breaching the limit causes a 1031compilation error. The default is the largest number a PCRE2_SIZE variable can 1032hold (essentially unlimited). 
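As a sketch (the patterns and limit values are arbitrary choices made for this
example), both of the following patterns are expected to fail to compile. The
first is eight code units long but only four are allowed; the second cannot be
compiled into ten bytes:
<pre>
  /abcdefgh/max_pattern_length=4

  /abcdefghijklmnopqrstuvwxyz/max_pattern_compiled_length=10
</pre>
The patterns are separated by an empty line so that the second is read as a
new pattern rather than as a subject line belonging to the first.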
1033<a name="posixwrapper"></a></P> 1034<br><b> 1035Using the POSIX wrapper API 1036</b><br> 1037<P> 1038The <b>posix</b> and <b>posix_nosub</b> modifiers cause <b>pcre2test</b> to call 1039PCRE2 via the POSIX wrapper API rather than its native API. When 1040<b>posix_nosub</b> is used, the POSIX option REG_NOSUB is passed to 1041<b>regcomp()</b>. The POSIX wrapper supports only the 8-bit library. Note that 1042it does not imply POSIX matching semantics; for more detail see the 1043<a href="pcre2posix.html"><b>pcre2posix</b></a> 1044documentation. The following pattern modifiers set options for the 1045<b>regcomp()</b> function: 1046<pre> 1047 caseless REG_ICASE 1048 multiline REG_NEWLINE 1049 dotall REG_DOTALL ) 1050 ungreedy REG_UNGREEDY ) These options are not part of 1051 ucp REG_UCP ) the POSIX standard 1052 utf REG_UTF8 ) 1053</pre> 1054The <b>regerror_buffsize</b> modifier specifies a size for the error buffer that 1055is passed to <b>regerror()</b> in the event of a compilation error. For example: 1056<pre> 1057 /abc/posix,regerror_buffsize=20 1058</pre> 1059This provides a means of testing the behaviour of <b>regerror()</b> when the 1060buffer is too small for the error message. If this modifier has not been set, a 1061large buffer is used. 1062</P> 1063<P> 1064The <b>aftertext</b> and <b>allaftertext</b> subject modifiers work as described 1065below. All other modifiers are either ignored, with a warning message, or cause 1066an error. 1067</P> 1068<P> 1069The pattern is passed to <b>regcomp()</b> as a zero-terminated string by 1070default, but if the <b>use_length</b> or <b>hex</b> modifiers are set, the 1071REG_PEND extension is used to pass it by length. 
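A minimal sketch, usable only when the 8-bit library is being tested (the
pattern and subject are invented for illustration):
<pre>
  /abc/posix,use_length
    xyzabc\=aftertext
</pre>
Here the pattern should reach <b>regcomp()</b> by length, using REG_PEND, and
<b>aftertext</b> is one of the subject modifiers that remains available.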
1072</P> 1073<br><b> 1074Testing the stack guard feature 1075</b><br> 1076<P> 1077The <b>stackguard</b> modifier is used to test the use of 1078<b>pcre2_set_compile_recursion_guard()</b>, a function that is provided to 1079enable stack availability to be checked during compilation (see the 1080<a href="pcre2api.html"><b>pcre2api</b></a> 1081documentation for details). If the number specified by the modifier is greater 1082than zero, <b>pcre2_set_compile_recursion_guard()</b> is called to set up a 1083callback from <b>pcre2_compile()</b> to a local function. The argument it 1084receives is the current parenthesis nesting depth; if this is greater than the 1085value given by the modifier, non-zero is returned, causing the compilation to 1086be aborted. 1087</P> 1088<br><b> 1089Using alternative character tables 1090</b><br> 1091<P> 1092The value specified for the <b>tables</b> modifier must be one of the digits 0, 10931, 2, or 3. It causes a specific set of built-in character tables to be passed 1094to <b>pcre2_compile()</b>. This is used in the PCRE2 tests to check behaviour 1095with different character tables. The digit specifies the tables as follows: 1096<pre> 1097 0 do not pass any special character tables 1098 1 the default ASCII tables, as distributed in 1099 pcre2_chartables.c.dist 1100 2 a set of tables defining ISO 8859 characters 1101 3 a set of tables loaded by the #loadtables command 1102</pre> 1103In tables 2, some characters whose codes are greater than 128 are identified as 1104letters, digits, spaces, etc. Tables 3 can be used only after a 1105<b>#loadtables</b> command has loaded them from a binary file. Setting alternate 1106character tables and a locale are mutually exclusive. 1107</P> 1108<br><b> 1109Setting certain match controls 1110</b><br> 1111<P> 1112The following modifiers are really subject modifiers, and are described under 1113"Subject Modifiers" below.
However, they may be included in a pattern's 1114modifier list, in which case they are applied to every subject line that is 1115processed with that pattern. These modifiers do not affect the compilation 1116process. 1117<pre> 1118 aftertext show text after match 1119 allaftertext show text after captures 1120 allcaptures show all captures 1121 allvector show the entire ovector 1122 allusedtext show all consulted text 1123 altglobal alternative global matching 1124 /g global global matching 1125 heapframes_size show match data heapframes size 1126 jitstack=<n> set size of JIT stack 1127 mark show mark values 1128 replace=<string> specify a replacement string 1129 startchar show starting character when relevant 1130 substitute_callout use substitution callouts 1131 substitute_extended use PCRE2_SUBSTITUTE_EXTENDED 1132 substitute_literal use PCRE2_SUBSTITUTE_LITERAL 1133 substitute_matched use PCRE2_SUBSTITUTE_MATCHED 1134 substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH 1135 substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY 1136 substitute_skip=<n> skip substitution <n> 1137 substitute_stop=<n> skip substitution <n> and following 1138 substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET 1139 substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY 1140</pre> 1141These modifiers may not appear in a <b>#pattern</b> command. If you want them as 1142defaults, set them in a <b>#subject</b> command. 1143</P> 1144<br><b> 1145Specifying literal subject lines 1146</b><br> 1147<P> 1148If the <b>subject_literal</b> modifier is present on a pattern, all the subject 1149lines that it matches are taken as literal strings, with no interpretation of 1150backslashes. It is not possible to set subject modifiers on such lines, but any 1151that are set as defaults by a <b>#subject</b> command are recognized. 
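For instance, in this sketch (pattern and subject invented for illustration)
the subject line consists of the four characters "a", backslash, "n", and "b",
because the backslash is not interpreted:
<pre>
  /a\\nb/subject_literal
    a\nb
</pre>
The pattern matches a literal backslash followed by "n", so this subject
should match. Without <b>subject_literal</b>, the same subject line would be
read as "a", newline, "b".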
1152</P> 1153<br><b> 1154Saving a compiled pattern 1155</b><br> 1156<P> 1157When a pattern with the <b>push</b> modifier is successfully compiled, it is 1158pushed onto a stack of compiled patterns, and <b>pcre2test</b> expects the next 1159line to contain a new pattern (or a command) instead of a subject line. This 1160facility is used when saving compiled patterns to a file, as described in the 1161section entitled "Saving and restoring compiled patterns" 1162<a href="#saverestore">below.</a> 1163If <b>pushcopy</b> is used instead of <b>push</b>, a copy of the compiled 1164pattern is stacked, leaving the original as current, ready to match the 1165following input lines. This provides a way of testing the 1166<b>pcre2_code_copy()</b> function. 1167The <b>push</b> and <b>pushcopy</b> modifiers are incompatible with pattern 1168modifiers such as <b>global</b> that act at match time. Any that are specified 1169are ignored (for the stacked copy), with a warning message, except for 1170<b>replace</b>, which causes an error. Note that <b>jitverify</b>, which is 1171allowed, does not carry through to any subsequent matching that uses a stacked 1172pattern. 1173</P> 1174<br><b> 1175Testing foreign pattern conversion 1176</b><br> 1177<P> 1178The experimental foreign pattern conversion functions in PCRE2 can be tested by 1179setting the <b>convert</b> modifier. Its argument is a colon-separated list of 1180options, which set the equivalent option for the <b>pcre2_pattern_convert()</b> 1181function: 1182<pre> 1183 glob PCRE2_CONVERT_GLOB 1184 glob_no_starstar PCRE2_CONVERT_GLOB_NO_STARSTAR 1185 glob_no_wild_separator PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR 1186 posix_basic PCRE2_CONVERT_POSIX_BASIC 1187 posix_extended PCRE2_CONVERT_POSIX_EXTENDED 1188 unset Unset all options 1189</pre> 1190The "unset" value is useful for turning off a default that has been set by a 1191<b>#pattern</b> command.
When one of these options is set, the input pattern is 1192passed to <b>pcre2_pattern_convert()</b>. If the conversion is successful, the 1193result is reflected in the output and then passed to <b>pcre2_compile()</b>. The 1194normal <b>utf</b> and <b>no_utf_check</b> options, if set, cause the 1195PCRE2_CONVERT_UTF and PCRE2_CONVERT_NO_UTF_CHECK options to be passed to 1196<b>pcre2_pattern_convert()</b>. 1197</P> 1198<P> 1199By default, the conversion function is allowed to allocate a buffer for its 1200output. However, if the <b>convert_length</b> modifier is set to a value greater 1201than zero, <b>pcre2test</b> passes a buffer of the given length. This makes it 1202possible to test the length check. 1203</P> 1204<P> 1205The <b>convert_glob_escape</b> and <b>convert_glob_separator</b> modifiers can be 1206used to specify the escape and separator characters for glob processing, 1207overriding the defaults, which are operating-system dependent. 1208<a name="subjectmodifiers"></a></P> 1209<br><a name="SEC11" href="#TOC1">SUBJECT MODIFIERS</a><br> 1210<P> 1211The modifiers that can appear in subject lines and the <b>#subject</b> 1212command are of two types. 1213</P> 1214<br><b> 1215Setting match options 1216</b><br> 1217<P> 1218The following modifiers set options for <b>pcre2_match()</b> or 1219<b>pcre2_dfa_match()</b>. See 1220<a href="pcre2api.html"><b>pcre2api</b></a> 1221for a description of their effects.
1222<pre> 1223 anchored set PCRE2_ANCHORED 1224 endanchored set PCRE2_ENDANCHORED 1225 dfa_restart set PCRE2_DFA_RESTART 1226 dfa_shortest set PCRE2_DFA_SHORTEST 1227 disable_recurseloop_check set PCRE2_DISABLE_RECURSELOOP_CHECK 1228 no_jit set PCRE2_NO_JIT 1229 no_utf_check set PCRE2_NO_UTF_CHECK 1230 notbol set PCRE2_NOTBOL 1231 notempty set PCRE2_NOTEMPTY 1232 notempty_atstart set PCRE2_NOTEMPTY_ATSTART 1233 noteol set PCRE2_NOTEOL 1234 partial_hard (or ph) set PCRE2_PARTIAL_HARD 1235 partial_soft (or ps) set PCRE2_PARTIAL_SOFT 1236</pre> 1237The partial matching modifiers are provided with abbreviations because they 1238appear frequently in tests. 1239</P> 1240<P> 1241If the <b>posix</b> or <b>posix_nosub</b> modifier was present on the pattern, 1242causing the POSIX wrapper API to be used, the only option-setting modifiers 1243that have any effect are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>, 1244causing REG_NOTBOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to 1245<b>regexec()</b>. The other modifiers are ignored, with a warning message. 1246</P> 1247<P> 1248There is one additional modifier that can be used with the POSIX wrapper. It is 1249ignored (with a warning) if used for non-POSIX matching. 1250<pre> 1251 posix_startend=<n>[:<m>] 1252</pre> 1253This causes the subject string to be passed to <b>regexec()</b> using the 1254REG_STARTEND option, which uses offsets to specify which part of the string is 1255searched. If only one number is given, the end offset is passed as the end of 1256the subject string. For more detail of REG_STARTEND, see the 1257<a href="pcre2posix.html"><b>pcre2posix</b></a> 1258documentation. If the subject string contains binary zeros (coded as escapes 1259such as \x{00} because <b>pcre2test</b> does not support actual binary zeros in 1260its input), you must use <b>posix_startend</b> to specify its length. 
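As a sketch (the pattern, subjects, and offsets are invented for this
example), the first subject line below limits the search to offsets 2 to 5,
which cover exactly "abc", and the second gives only a start offset, so the
end offset defaults to the end of the subject:
<pre>
  /abc/posix
    xxabcxx\=posix_startend=2:5
    abcabc\=posix_startend=3
</pre>
Both subjects are expected to match, the second one at the later occurrence of
"abc".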
1261</P> 1262<br><b> 1263Setting match controls 1264</b><br> 1265<P> 1266The following modifiers affect the matching process or request additional 1267information. Some of them may also be specified on a pattern line (see above), 1268in which case they apply to every subject line that is matched against that 1269pattern, but can be overridden by modifiers on the subject. 1270<pre> 1271 aftertext show text after match 1272 allaftertext show text after captures 1273 allcaptures show all captures 1274 allvector show the entire ovector 1275 allusedtext show all consulted text (non-JIT only) 1276 altglobal alternative global matching 1277 callout_capture show captures at callout time 1278 callout_data=<n> set a value to pass via callouts 1279 callout_error=<n>[:<m>] control callout error 1280 callout_extra show extra callout information 1281 callout_fail=<n>[:<m>] control callout failure 1282 callout_no_where do not show position of a callout 1283 callout_none do not supply a callout function 1284 copy=<number or name> copy captured substring 1285 depth_limit=<n> set a depth limit 1286 dfa use <b>pcre2_dfa_match()</b> 1287 find_limits find heap, match and depth limits 1288 find_limits_noheap find match and depth limits 1289 get=<number or name> extract captured substring 1290 getall extract all captured substrings 1291 /g global global matching 1292 heapframes_size show match data heapframes size 1293 heap_limit=<n> set a limit on heap memory (Kbytes) 1294 jitstack=<n> set size of JIT stack 1295 mark show mark values 1296 match_limit=<n> set a match limit 1297 memory show heap memory usage 1298 null_context match with a NULL context 1299 null_replacement substitute with NULL replacement 1300 null_subject match with NULL subject 1301 offset=<n> set starting offset 1302 offset_limit=<n> set offset limit 1303 ovector=<n> set size of output vector 1304 recursion_limit=<n> obsolete synonym for depth_limit 1305 replace=<string> specify a replacement string 1306 startchar show 
startchar when relevant 1307 startoffset=<n> same as offset=<n> 1308 substitute_callout use substitution callouts 1309 substitute_extended use PCRE2_SUBSTITUTE_EXTENDED 1310 substitute_literal use PCRE2_SUBSTITUTE_LITERAL 1311 substitute_matched use PCRE2_SUBSTITUTE_MATCHED 1312 substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH 1313 substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY 1314 substitute_skip=<n> skip substitution number n 1315 substitute_stop=<n> skip substitution number n and greater 1316 substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET 1317 substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY 1318 zero_terminate pass the subject as zero-terminated 1319</pre> 1320The effects of these modifiers are described in the following sections. When 1321matching via the POSIX wrapper API, the <b>aftertext</b>, <b>allaftertext</b>, 1322and <b>ovector</b> subject modifiers work as described below. All other 1323modifiers are either ignored, with a warning message, or cause an error. 1324</P> 1325<br><b> 1326Showing more text 1327</b><br> 1328<P> 1329The <b>aftertext</b> modifier requests that as well as outputting the part of 1330the subject string that matched the entire pattern, <b>pcre2test</b> should in 1331addition output the remainder of the subject string. This is useful for tests 1332where the subject contains multiple copies of the same substring. The 1333<b>allaftertext</b> modifier requests the same action for captured substrings as 1334well as the main matched substring. In each case the remainder is output on the 1335following line with a plus character following the capture number. 1336</P> 1337<P> 1338The <b>allusedtext</b> modifier requests that all the text that was consulted 1339during a successful pattern match by the interpreter should be shown, for both 1340full and partial matches. This feature is not supported for JIT matching, and 1341if requested with JIT it is ignored (with a warning message).
Setting this 1342modifier affects the output if there is a lookbehind at the start of a match, 1343or, for a complete match, a lookahead at the end, or if \K is used in the 1344pattern. Characters that precede or follow the start and end of the actual 1345match are indicated in the output by '<' or '>' characters underneath them. 1346Here is an example: 1347<pre> 1348 re> /(?<=pqr)abc(?=xyz)/ 1349 data> 123pqrabcxyz456\=allusedtext 1350 0: pqrabcxyz 1351 <<< >>> 1352 data> 123pqrabcxy\=ph,allusedtext 1353 Partial match: pqrabcxy 1354 <<< 1355</pre> 1356The first, complete match shows that the matched string is "abc", with the 1357preceding and following strings "pqr" and "xyz" having been consulted during 1358the match (when processing the assertions). The partial match can indicate only 1359the preceding string. 1360</P> 1361<P> 1362The <b>startchar</b> modifier requests that the starting character for the match 1363be indicated, if it is different to the start of the matched string. The only 1364time when this occurs is when \K has been processed as part of the match. In 1365this situation, the output for the matched string is displayed from the 1366starting character instead of from the match point, with circumflex characters 1367under the earlier characters. For example: 1368<pre> 1369 re> /abc\Kxyz/ 1370 data> abcxyz\=startchar 1371 0: abcxyz 1372 ^^^ 1373</pre> 1374Unlike <b>allusedtext</b>, the <b>startchar</b> modifier can be used with JIT. 1375However, these two modifiers are mutually exclusive. 1376</P> 1377<br><b> 1378Showing the value of all capture groups 1379</b><br> 1380<P> 1381The <b>allcaptures</b> modifier requests that the values of all potential 1382captured parentheses be output after a match. By default, only those up to the 1383highest one actually used in the match are output (corresponding to the return 1384code from <b>pcre2_match()</b>). Groups that did not take part in the match 1385are output as "<unset>". 
This modifier is not relevant for DFA matching (which 1386does no capturing) and does not apply when <b>replace</b> is specified; it is 1387ignored, with a warning message, if present. 1388</P> 1389<br><b> 1390Showing the entire ovector, for all outcomes 1391</b><br> 1392<P> 1393The <b>allvector</b> modifier requests that the entire ovector be shown, 1394whatever the outcome of the match. Compare <b>allcaptures</b>, which shows only 1395up to the maximum number of capture groups for the pattern, and then only for a 1396successful complete non-DFA match. This modifier, which acts after any match 1397result, and also for DFA matching, provides a means of checking that there are 1398no unexpected modifications to ovector fields. Before each match attempt, the 1399ovector is filled with a special value, and if this is found in both elements 1400of a capturing pair, "<unchanged>" is output. After a successful match, this 1401applies to all groups after the maximum capture group for the pattern. In other 1402cases it applies to the entire ovector. After a partial match, the first two 1403elements are the only ones that should be set. After a DFA match, the amount of 1404ovector that is used depends on the number of matches that were found. 1405</P> 1406<br><b> 1407Testing pattern callouts 1408</b><br> 1409<P> 1410A callout function is supplied when <b>pcre2test</b> calls the library matching 1411functions, unless <b>callout_none</b> is specified. Its behaviour can be 1412controlled by various modifiers listed above whose names begin with 1413<b>callout_</b>. 
Details are given in the section entitled "Callouts" 1414<a href="#callouts">below.</a> 1415Testing callouts from <b>pcre2_substitute()</b> is described separately in 1416"Testing the substitution function" 1417<a href="#substitution">below.</a> 1418</P> 1419<br><b> 1420Finding all matches in a string 1421</b><br> 1422<P> 1423Searching for all possible matches within a subject can be requested by the 1424<b>global</b> or <b>altglobal</b> modifier. After finding a match, the matching 1425function is called again to search the remainder of the subject. The difference 1426between <b>global</b> and <b>altglobal</b> is that the former uses the 1427<i>start_offset</i> argument to <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b> 1428to start searching at a new point within the entire string (which is what Perl 1429does), whereas the latter passes over a shortened subject. This makes a 1430difference to the matching process if the pattern begins with a lookbehind 1431assertion (including \b or \B). 1432</P> 1433<P> 1434If an empty string is matched, the next match is done with the 1435PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for 1436another, non-empty, match at the same point in the subject. If this match 1437fails, the start offset is advanced, and the normal match is retried. This 1438imitates the way Perl handles such cases when using the <b>/g</b> modifier or 1439the <b>split()</b> function. Normally, the start offset is advanced by one 1440character, but if the newline convention recognizes CRLF as a newline, and the 1441current character is CR followed by LF, an advance of two characters occurs. 1442</P> 1443<br><b> 1444Testing substring extraction functions 1445</b><br> 1446<P> 1447The <b>copy</b> and <b>get</b> modifiers can be used to test the 1448<b>pcre2_substring_copy_xxx()</b> and <b>pcre2_substring_get_xxx()</b> functions. 
They can be given more than once, and each can specify a capture group name or
number, for example:
<pre>
  abcd\=copy=1,copy=3,get=G1
</pre>
If the <b>#subject</b> command is used to set default copy and/or get lists,
these can be unset by specifying a negative number to cancel all numbered
groups and an empty name to cancel all named groups.
</P>
<P>
The <b>getall</b> modifier tests <b>pcre2_substring_list_get()</b>, which
extracts all captured substrings.
</P>
<P>
If the subject line is successfully matched, the substrings extracted by the
convenience functions are output with C, G, or L after the string number
instead of a colon. This is in addition to the normal full list. The string
length (that is, the return from the extraction function) is given in
parentheses after each substring, followed by the name when the extraction was
by name.
<a name="substitution"></a></P>
<br><b>
Testing the substitution function
</b><br>
<P>
If the <b>replace</b> modifier is set, the <b>pcre2_substitute()</b> function is
called instead of one of the matching functions (or after one call of
<b>pcre2_match()</b> in the case of PCRE2_SUBSTITUTE_MATCHED). Note that
replacement strings cannot contain commas, because a comma signifies the end of
a modifier. This is not thought to be an issue in a test program.
</P>
<P>
Specifying a completely empty replacement string disables this modifier.
However, it is possible to specify an empty replacement by providing a buffer
length, as described below, for an otherwise empty replacement.
</P>
<P>
Unlike subject strings, <b>pcre2test</b> does not process replacement strings
for escape sequences. In UTF mode, a replacement string is checked to see if it
is a valid UTF-8 string. If so, it is correctly converted to a UTF string of
the appropriate code unit width.
If it is not a valid UTF-8 string, the
individual code units are copied directly. This provides a means of passing an
invalid UTF-8 string for testing purposes.
</P>
<P>
The following modifiers set options (in addition to the normal match options)
for <b>pcre2_substitute()</b>:
<pre>
  global                      PCRE2_SUBSTITUTE_GLOBAL
  substitute_extended         PCRE2_SUBSTITUTE_EXTENDED
  substitute_literal          PCRE2_SUBSTITUTE_LITERAL
  substitute_matched          PCRE2_SUBSTITUTE_MATCHED
  substitute_overflow_length  PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
  substitute_replacement_only PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
  substitute_unknown_unset    PCRE2_SUBSTITUTE_UNKNOWN_UNSET
  substitute_unset_empty      PCRE2_SUBSTITUTE_UNSET_EMPTY
</pre>
See the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation for details of these options.
</P>
<P>
After a successful substitution, the modified string is output, preceded by the
number of replacements. This may be zero if there were no matches. Here is a
simple example of a substitution test:
<pre>
  /abc/replace=xxx
      =abc=abc=
   1: =xxx=abc=
      =abc=abc=\=global
   2: =xxx=xxx=
</pre>
Subject and replacement strings should be kept relatively short (fewer than 256
characters) for substitution tests, as fixed-size buffers are used. To make it
easy to test for buffer overflow, if the replacement string starts with a
number in square brackets, that number is passed to <b>pcre2_substitute()</b> as
the size of the output buffer, with the replacement string starting at the next
character. Here is an example that tests the edge case:
<pre>
  /abc/
      123abc123\=replace=[10]XYZ
   1: 123XYZ123
      123abc123\=replace=[9]XYZ
  Failed: error -47: no more memory
</pre>
The default action of <b>pcre2_substitute()</b> is to return
PCRE2_ERROR_NOMEMORY when the output buffer is too small.
However, if the
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the
<b>substitute_overflow_length</b> modifier), <b>pcre2_substitute()</b> continues
to go through the motions of matching and substituting (but not doing any
callouts), in order to compute the size of buffer that is required. When this
happens, <b>pcre2test</b> shows the required buffer length (which includes space
for the trailing zero) as part of the error message. For example:
<pre>
  /abc/substitute_overflow_length
      123abc123\=replace=[9]XYZ
  Failed: error -47: no more memory: 10 code units are needed
</pre>
A replacement string is ignored with POSIX and DFA matching. Specifying partial
matching provokes an error return ("bad option value") from
<b>pcre2_substitute()</b>.
</P>
<br><b>
Testing substitute callouts
</b><br>
<P>
If the <b>substitute_callout</b> modifier is set, a substitution callout
function is set up. The <b>null_context</b> modifier must not be set, because
the address of the callout function is passed in a match context. When the
callout function is called (after each substitution), details of the input
and output strings are output. For example:
<pre>
  /abc/g,replace=&lt;$0&gt;,substitute_callout
      abcdefabcpqr
   1(1) Old 0 3 "abc" New 0 5 "&lt;abc&gt;"
   2(1) Old 6 9 "abc" New 8 13 "&lt;abc&gt;"
   2: &lt;abc&gt;def&lt;abc&gt;pqr
</pre>
The first number on each callout line is the count of matches. The
parenthesized number is the number of pairs that are set in the ovector (that
is, one more than the number of capturing groups that were set). Then are
listed the offsets of the old substring, its contents, and the same for the
replacement.
</P>
<P>
By default, the substitution callout function returns zero, which accepts the
replacement and causes matching to continue if /g was used. Two further
modifiers can be used to test other return values.
If <b>substitute_skip</b> is
set to a value greater than zero the callout function returns +1 for the match
of that number, and similarly <b>substitute_stop</b> returns -1. These cause the
replacement to be rejected, and -1 causes no further matching to take place. If
either of them is set, <b>substitute_callout</b> is assumed. For example:
<pre>
  /abc/g,replace=&lt;$0&gt;,substitute_skip=1
      abcdefabcpqr
   1(1) Old 0 3 "abc" New 0 5 "&lt;abc&gt; SKIPPED"
   2(1) Old 6 9 "abc" New 6 11 "&lt;abc&gt;"
   2: abcdef&lt;abc&gt;pqr
      abcdefabcpqr\=substitute_stop=1
   1(1) Old 0 3 "abc" New 0 5 "&lt;abc&gt; STOPPED"
   1: abcdefabcpqr
</pre>
If both are set for the same number, stop takes precedence. Only a single skip
or stop is supported, which is sufficient for testing that the feature works.
</P>
<br><b>
Setting the JIT stack size
</b><br>
<P>
The <b>jitstack</b> modifier provides a way of setting the maximum stack size
that is used by the just-in-time optimization code. It is ignored if JIT
optimization is not being used. The value is a number of kibibytes (units of
1024 bytes). Setting zero reverts to the default of 32KiB. Providing a stack
that is larger than the default is necessary only for very complicated
patterns. If <b>jitstack</b> is set non-zero on a subject line it overrides any
value that was set on the pattern.
</P>
<br><b>
Setting heap, match, and depth limits
</b><br>
<P>
The <b>heap_limit</b>, <b>match_limit</b>, and <b>depth_limit</b> modifiers set
the appropriate limits in the match context. These values are ignored when the
<b>find_limits</b> or <b>find_limits_noheap</b> modifier is specified.
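As a rough sketch, limits might be given on individual subject lines like this.
The pattern, subject, and limit values are illustrative assumptions rather than
part of the distributed tests, and no output is shown because it depends on the
build:
<pre>
  /(a+)+b/
      aaaaaaaaaaaaaaaaaaaaaaaac\=match_limit=1000
      aaaaaaaaaaaaaaaaaaaaaaaac\=depth_limit=10,heap_limit=1
</pre>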
</P>
<br><b>
Finding minimum limits
</b><br>
<P>
If the <b>find_limits</b> modifier is present on a subject line, <b>pcre2test</b>
calls the relevant matching function several times, setting different values in
the match context via <b>pcre2_set_heap_limit()</b>,
<b>pcre2_set_match_limit()</b>, or <b>pcre2_set_depth_limit()</b> until it finds
the smallest value for each parameter that allows the match to complete without
a "limit exceeded" error. The match itself may succeed or fail. An alternative
modifier, <b>find_limits_noheap</b>, omits the heap limit. This is used in the
standard tests, because the minimum heap limit varies between systems. If JIT
is being used, only the match limit is relevant, and the other two are
automatically omitted.
</P>
<P>
When using this modifier, the pattern should not contain any limit settings
such as (*LIMIT_MATCH=...) within it. If such a setting is present and is
lower than the minimum matching value, the minimum value cannot be found
because <b>pcre2_set_match_limit()</b> etc. are only able to reduce the value of
an in-pattern limit; they cannot increase it.
</P>
<P>
For non-DFA matching, the minimum <i>depth_limit</i> number is a measure of how
much nested backtracking happens (that is, how deeply the pattern's tree is
searched). In the case of DFA matching, <i>depth_limit</i> controls the depth of
recursive calls of the internal function that is used for handling pattern
recursion, lookaround assertions, and atomic groups.
</P>
<P>
For non-DFA matching, the <i>match_limit</i> number is a measure of the amount
of backtracking that takes place, and learning the minimum value can be
instructive. For most simple matches, the number is quite small, but for
patterns with very large numbers of matching possibilities, it can become large
very quickly with increasing length of subject string.
In the case of DFA
matching, <i>match_limit</i> controls the total number of calls, both recursive
and non-recursive, to the internal matching function, thus controlling the
overall amount of computing resource that is used.
</P>
<P>
For both kinds of matching, the <i>heap_limit</i> number, which is in kibibytes
(units of 1024 bytes), limits the amount of heap memory used for matching.
</P>
<br><b>
Showing MARK names
</b><br>
<P>
The <b>mark</b> modifier causes the names from backtracking control verbs that
are returned from calls to <b>pcre2_match()</b> to be displayed. If a mark is
returned for a match, non-match, or partial match, <b>pcre2test</b> shows it.
For a match, it is on a line by itself, tagged with "MK:". Otherwise, it
is added to the non-match message.
</P>
<br><b>
Showing memory usage
</b><br>
<P>
The <b>memory</b> modifier causes <b>pcre2test</b> to log the sizes of all heap
memory allocation and freeing calls that occur during a call to
<b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>. In the latter case, heap memory
is used only when a match requires more internal workspace than the default
allocation on the stack, so in many cases there will be no output. No heap
memory is allocated during matching with JIT. For this modifier to work, the
<b>null_context</b> modifier must not be set on both the pattern and the
subject, though it can be set on one or the other.
</P>
<br><b>
Showing the heap frame overall vector size
</b><br>
<P>
The <b>heapframes_size</b> modifier is relevant for matches using
<b>pcre2_match()</b> without JIT. After a match has run (whether successful or
not) the size, in bytes, of the allocated heap frames vector that is left
attached to the match data block is shown.
If the matching action involved
several calls to <b>pcre2_match()</b> (for example, global matching or for
timing) only the final value is shown.
</P>
<P>
This modifier is ignored, with a warning, for POSIX or DFA matching. JIT
matching does not use the heap frames vector, so the size is always zero,
unless there was a previous non-JIT match. Note that specifying a size of zero
for the output vector (see below) causes <b>pcre2test</b> to free its match data
block (and associated heap frames vector) and allocate a new one.
</P>
<br><b>
Setting a starting offset
</b><br>
<P>
The <b>offset</b> modifier sets an offset in the subject string at which
matching starts. Its value is a number of code units, not characters.
</P>
<br><b>
Setting an offset limit
</b><br>
<P>
The <b>offset_limit</b> modifier sets a limit for unanchored matches. If a match
cannot be found starting at or before this offset in the subject, a "no match"
return is given. The data value is a number of code units, not characters. When
this modifier is used, the <b>use_offset_limit</b> modifier must have been set
for the pattern; if not, an error is generated.
</P>
<br><b>
Setting the size of the output vector
</b><br>
<P>
The <b>ovector</b> modifier applies only to the subject line in which it
appears, though of course it can also be used to set a default in a
<b>#subject</b> command. It specifies the number of pairs of offsets that are
available for storing matching information. The default is 15.
</P>
<P>
A value of zero is useful when testing the POSIX API because it causes
<b>regexec()</b> to be called with a NULL capture vector. When not testing the
POSIX API, a value of zero is used to cause
<b>pcre2_match_data_create_from_pattern()</b> to be called, in order to create a
new match block of exactly the right size for the pattern.
(It is not possible
to create a match block with a zero-length ovector; there is always at least
one pair of offsets.) The old match data block is freed.
</P>
<br><b>
Passing the subject as zero-terminated
</b><br>
<P>
By default, the subject string is passed to a native API matching function with
its correct length. In order to test the facility for passing a zero-terminated
string, the <b>zero_terminate</b> modifier is provided. It causes the length to
be passed as PCRE2_ZERO_TERMINATED. When matching via the POSIX interface,
this modifier is ignored, with a warning.
</P>
<P>
When testing <b>pcre2_substitute()</b>, this modifier also has the effect of
passing the replacement string as zero-terminated.
</P>
<br><b>
Passing a NULL context, subject, or replacement
</b><br>
<P>
Normally, <b>pcre2test</b> passes a context block to <b>pcre2_match()</b>,
<b>pcre2_dfa_match()</b>, <b>pcre2_jit_match()</b> or <b>pcre2_substitute()</b>.
If the <b>null_context</b> modifier is set, however, NULL is passed. This is for
testing that the matching and substitution functions behave correctly in this
case (they use default values). This modifier cannot be used with the
<b>find_limits</b>, <b>find_limits_noheap</b>, or <b>substitute_callout</b>
modifiers.
</P>
<P>
Similarly, for testing purposes, if the <b>null_subject</b> or
<b>null_replacement</b> modifier is set, the subject or replacement string
pointers are passed as NULL, respectively, to the relevant functions.
</P>
<br><a name="SEC12" href="#TOC1">THE ALTERNATIVE MATCHING FUNCTION</a><br>
<P>
By default, <b>pcre2test</b> uses the standard PCRE2 matching function,
<b>pcre2_match()</b> to match each subject line. PCRE2 also supports an
alternative matching function, <b>pcre2_dfa_match()</b>, which operates in a
different way, and has some restrictions.
The differences between the two
functions are described in the
<a href="pcre2matching.html"><b>pcre2matching</b></a>
documentation.
</P>
<P>
If the <b>dfa</b> modifier is set, the alternative matching function is used.
This function finds all possible matches at a given point in the subject. If,
however, the <b>dfa_shortest</b> modifier is set, processing stops after the
first match is found. This is always the shortest possible match.
</P>
<br><a name="SEC13" href="#TOC1">DEFAULT OUTPUT FROM pcre2test</a><br>
<P>
This section describes the output when the normal matching function,
<b>pcre2_match()</b>, is being used.
</P>
<P>
When a match succeeds, <b>pcre2test</b> outputs the list of captured substrings,
starting with number 0 for the string that matched the whole pattern.
Otherwise, it outputs "No match" when the return is PCRE2_ERROR_NOMATCH, or
"Partial match:" followed by the partially matching substring when the
return is PCRE2_ERROR_PARTIAL. (Note that this is the
entire substring that was inspected during the partial match; it may include
characters before the actual match start if a lookbehind assertion, \K, \b,
or \B was involved.)
</P>
<P>
For any other return, <b>pcre2test</b> outputs the PCRE2 negative error number
and a short descriptive phrase. If the error is a failed UTF string check, the
code unit offset of the start of the failing character is also output. Here is
an example of an interactive <b>pcre2test</b> run.
<pre>
  $ pcre2test
  PCRE2 version 10.22 2016-07-29

    re> /^abc(\d+)/
  data> abc123
   0: abc123
   1: 123
  data> xyz
  No match
</pre>
Unset capturing substrings that are not followed by one that is set are not
shown by <b>pcre2test</b> unless the <b>allcaptures</b> modifier is specified.
In
the following example, there are two capturing substrings, but when the first
data line is matched, the second, unset substring is not shown. An "internal"
unset substring is shown as "&lt;unset&gt;", as for the second data line.
<pre>
    re> /(a)|(b)/
  data> a
   0: a
   1: a
  data> b
   0: b
   1: &lt;unset&gt;
   2: b
</pre>
If the strings contain any non-printing characters, they are output as \xhh
escapes if the value is less than 256 and UTF mode is not set. Otherwise they
are output as \x{hh...} escapes. See below for the definition of non-printing
characters. If the <b>aftertext</b> modifier is set, the output for substring 0
is followed by the rest of the subject string, identified by "0+" like this:
<pre>
    re> /cat/aftertext
  data> cataract
   0: cat
   0+ aract
</pre>
If global matching is requested, the results of successive matching attempts
are output in sequence, like this:
<pre>
    re> /\Bi(\w\w)/g
  data> Mississippi
   0: iss
   1: ss
   0: iss
   1: ss
   0: ipp
   1: pp
</pre>
"No match" is output only if the first match attempt fails. Here is an example
of a failure message (the offset 4 that is specified by the <b>offset</b>
modifier is past the end of the subject string):
<pre>
    re> /xyz/
  data> xyz\=offset=4
  Error -24 (bad offset value)
</PRE>
</P>
<P>
Note that whereas patterns can be continued over several lines (a plain ">"
prompt is used for continuations), subject lines may not. However, newlines can
be included in a subject by means of the \n escape (or \r, \r\n, etc.,
depending on the newline sequence setting).
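For instance, a two-line subject can be supplied on a single input line, as in
this illustrative sketch (the pattern and subject are assumptions chosen for
the example; the output is omitted):
<pre>
    re> /first\nsecond/
  data> first\nsecond
</pre>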
</P>
<br><a name="SEC14" href="#TOC1">OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION</a><br>
<P>
When the alternative matching function, <b>pcre2_dfa_match()</b>, is used, the
output consists of a list of all the matches that start at the first point in
the subject where there is at least one match. For example:
<pre>
    re> /(tang|tangerine|tan)/
  data> yellow tangerine\=dfa
   0: tangerine
   1: tang
   2: tan
</pre>
Using the normal matching function on this data finds only "tang". The
longest matching string is always given first (and numbered zero). After a
PCRE2_ERROR_PARTIAL return, the output is "Partial match:", followed by the
partially matching substring. Note that this is the entire substring that was
inspected during the partial match; it may include characters before the actual
match start if a lookbehind assertion, \b, or \B was involved. (\K is not
supported for DFA matching.)
</P>
<P>
If global matching is requested, the search for further matches resumes
at the end of the longest match. For example:
<pre>
    re> /(tang|tangerine|tan)/g
  data> yellow tangerine and tangy sultana\=dfa
   0: tangerine
   1: tang
   2: tan
   0: tang
   1: tan
   0: tan
</pre>
The alternative matching function does not support substring capture, so the
modifiers that are concerned with captured substrings are not relevant.
</P>
<br><a name="SEC15" href="#TOC1">RESTARTING AFTER A PARTIAL MATCH</a><br>
<P>
When the alternative matching function has given the PCRE2_ERROR_PARTIAL
return, indicating that the subject partially matched the pattern, you can
restart the match with additional subject data by means of the
<b>dfa_restart</b> modifier.
For example:
<pre>
    re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
  data> 23ja\=ps,dfa
  Partial match: 23ja
  data> n05\=dfa,dfa_restart
   0: n05
</pre>
For further information about partial matching, see the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation.
<a name="callouts"></a></P>
<br><a name="SEC16" href="#TOC1">CALLOUTS</a><br>
<P>
If the pattern contains any callout requests, <b>pcre2test</b>'s callout
function is called during matching unless <b>callout_none</b> is specified. This
works with both matching functions, and with JIT, though there are some
differences in behaviour. The output for callouts with numerical arguments and
those with string arguments is slightly different.
</P>
<br><b>
Callouts with numerical arguments
</b><br>
<P>
By default, the callout function displays the callout number, the start and
current positions in the subject text at the callout time, and the next pattern
item to be tested. For example:
<pre>
  --->pqrabcdef
    0    ^  ^     \d
</pre>
This output indicates that callout number 0 occurred for a match attempt
starting at the fourth character of the subject string, when the pointer was at
the seventh character, and when the next pattern item was \d. Just
one circumflex is output if the start and current positions are the same, or if
the current position precedes the start position, which can happen if the
callout is in a lookbehind assertion.
</P>
<P>
Callouts numbered 255 are assumed to be automatic callouts, inserted as a
result of the <b>auto_callout</b> pattern modifier. In this case, instead of
showing the callout number, the offset in the pattern, preceded by a plus, is
output. For example:
<pre>
    re> /\d?[A-E]\*/auto_callout
  data> E*
  --->E*
   +0 ^      \d?
   +3 ^      [A-E]
   +8 ^^     \*
  +10 ^ ^
   0: E*
</pre>
If a pattern contains (*MARK) items, an additional line is output whenever
a change of latest mark is passed to the callout function. For example:
<pre>
    re> /a(*MARK:X)bc/auto_callout
  data> abc
  --->abc
   +0 ^       a
   +1 ^^      (*MARK:X)
  +10 ^^      b
  Latest Mark: X
  +11 ^ ^     c
  +12 ^  ^
   0: abc
</pre>
The mark changes between matching "a" and "b", but stays the same for the rest
of the match, so nothing more is output. If, as a result of backtracking, the
mark reverts to being unset, the text "&lt;unset&gt;" is output.
</P>
<br><b>
Callouts with string arguments
</b><br>
<P>
The output for a callout with a string argument is similar, except that instead
of outputting a callout number before the position indicators, the callout
string and its offset in the pattern string are output before the reflection of
the subject string, and the subject string is reflected for each callout. For
example:
<pre>
    re> /^ab(?C'first')cd(?C"second")ef/
  data> abcdefg
  Callout (7): 'first'
  --->abcdefg
      ^ ^         c
  Callout (20): "second"
  --->abcdefg
      ^   ^       e
   0: abcdef

</PRE>
</P>
<br><b>
Callout modifiers
</b><br>
<P>
The callout function in <b>pcre2test</b> returns zero (carry on matching) by
default, but you can use a <b>callout_fail</b> modifier in a subject line to
change this and other parameters of the callout (see below).
</P>
<P>
If the <b>callout_capture</b> modifier is set, the current captured groups are
output when a callout occurs. This is useful only for non-DFA matching, as
<b>pcre2_dfa_match()</b> does not support capturing, so no captures are ever
shown.
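A minimal sketch of a test that uses this modifier follows. The pattern and
subject are assumptions made for illustration, with a numbered callout placed
after two capture groups; the output is omitted:
<pre>
  /(a)(b)(?C1)c/
      abc\=callout_capture
</pre>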
</P>
<P>
The normal callout output, showing the callout number or pattern offset (as
described above) is suppressed if the <b>callout_no_where</b> modifier is set.
</P>
<P>
When using the interpretive matching function <b>pcre2_match()</b> without JIT,
setting the <b>callout_extra</b> modifier causes additional output from
<b>pcre2test</b>'s callout function to be generated. For the first callout in a
match attempt at a new starting position in the subject, "New match attempt" is
output. If there has been a backtrack since the last callout (or start of
matching if this is the first callout), "Backtrack" is output, followed by "No
other matching paths" if the backtrack ended the previous match attempt. For
example:
<pre>
    re> /(a+)b/auto_callout,no_start_optimize,no_auto_possess
  data> aac\=callout_extra
  New match attempt
  --->aac
   +0 ^       (
   +1 ^       a+
   +3 ^ ^     )
   +4 ^ ^     b
  Backtrack
  --->aac
   +3 ^^      )
   +4 ^^      b
  Backtrack
  No other matching paths
  New match attempt
  --->aac
   +0  ^      (
   +1  ^      a+
   +3  ^^     )
   +4  ^^     b
  Backtrack
  No other matching paths
  New match attempt
  --->aac
   +0   ^     (
   +1   ^     a+
  Backtrack
  No other matching paths
  New match attempt
  --->aac
   +0    ^    (
   +1    ^    a+
  No match
</pre>
Notice that various optimizations must be turned off if you want all possible
matching paths to be scanned. If <b>no_start_optimize</b> is not used, there is
an immediate "no match", without any callouts, because the starting
optimization fails to find "b" in the subject, which it knows must be present
for any match. If <b>no_auto_possess</b> is not used, the "a+" item is turned
into "a++", which reduces the number of backtracks.
</P>
<P>
The <b>callout_extra</b> modifier has no effect if used with the DFA matching
function, or with JIT.
</P>
<br><b>
Return values from callouts
</b><br>
<P>
The default return from the callout function is zero, which allows matching to
continue. The <b>callout_fail</b> modifier can be given one or two numbers. If
there is only one number, 1 is returned instead of 0 (causing matching to
backtrack) when a callout of that number is reached. If two numbers (&lt;n&gt;:&lt;m&gt;)
are given, 1 is returned when callout &lt;n&gt; is reached and there have been at
least &lt;m&gt; callouts. The <b>callout_error</b> modifier is similar, except that
PCRE2_ERROR_CALLOUT is returned, causing the entire matching process to be
aborted. If both these modifiers are set for the same callout number,
<b>callout_error</b> takes precedence. Note that callouts with string arguments
are always given the number zero.
</P>
<P>
The <b>callout_data</b> modifier can be given an unsigned or a negative number.
This is set as the "user data" that is passed to the matching function, and
passed back when the callout function is invoked. Any value other than zero is
used as a return from <b>pcre2test</b>'s callout function.
</P>
<P>
Inserting callouts can be helpful when using <b>pcre2test</b> to check
complicated regular expressions. For further information about callouts, see
the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation.
</P>
<br><a name="SEC17" href="#TOC1">NON-PRINTING CHARACTERS</a><br>
<P>
When <b>pcre2test</b> is outputting text in the compiled version of a pattern,
bytes other than 32-126 are always treated as non-printing characters and are
therefore shown as hex escapes.
</P>
<P>
When <b>pcre2test</b> is outputting text that is a matched part of a subject
string, it behaves in the same way, unless a different locale has been set for
the pattern (using the <b>locale</b> modifier).
In this case, the
<b>isprint()</b> function is used to distinguish printing and non-printing
characters.
<a name="saverestore"></a></P>
<br><a name="SEC18" href="#TOC1">SAVING AND RESTORING COMPILED PATTERNS</a><br>
<P>
It is possible to save compiled patterns on disc or elsewhere, and reload them
later, subject to a number of restrictions. JIT data cannot be saved. The host
on which the patterns are reloaded must be running the same version of PCRE2,
with the same code unit width, and must also have the same endianness, pointer
width and PCRE2_SIZE type. Before compiled patterns can be saved they must be
serialized, that is, converted to a stream of bytes. A single byte stream may
contain any number of compiled patterns, but they must all use the same
character tables. A single copy of the tables is included in the byte stream
(its size is 1088 bytes).
</P>
<P>
The functions whose names begin with <b>pcre2_serialize_</b> are used
for serializing and de-serializing. They are described in the
<a href="pcre2serialize.html"><b>pcre2serialize</b></a>
documentation. In this section we describe the features of <b>pcre2test</b> that
can be used to test these functions.
</P>
<P>
Note that "serialization" in PCRE2 does not convert compiled patterns to an
abstract format like Java or .NET. It just makes a reloadable byte code stream.
Hence the restrictions on reloading mentioned above.
</P>
<P>
In <b>pcre2test</b>, when a pattern with the <b>push</b> modifier is successfully
compiled, it is pushed onto a stack of compiled patterns, and <b>pcre2test</b>
expects the next line to contain a new pattern (or command) instead of a
subject line. By contrast, the <b>pushcopy</b> modifier causes a copy of the
compiled pattern to be stacked, leaving the original available for immediate
matching.
By using <b>push</b> and/or <b>pushcopy</b>, a number of patterns can
be compiled and retained. These modifiers are incompatible with <b>posix</b>,
and control modifiers that act at match time are ignored (with a message) for
the stacked patterns. The <b>jitverify</b> modifier applies only at compile
time.
</P>
<P>
The command
<pre>
  #save &lt;filename&gt;
</pre>
causes all the stacked patterns to be serialized and the result written to the
named file. Afterwards, all the stacked patterns are freed. The command
<pre>
  #load &lt;filename&gt;
</pre>
reads the data in the file, and then arranges for it to be de-serialized, with
the resulting compiled patterns added to the pattern stack. The pattern on the
top of the stack can be retrieved by the #pop command, which must be followed
by lines of subjects that are to be matched with the pattern, terminated as
usual by an empty line or end of file. This command may be followed by a
modifier list containing only
<a href="#controlmodifiers">control modifiers</a>
that act after a pattern has been compiled. In particular, <b>hex</b>,
<b>posix</b>, <b>posix_nosub</b>, <b>push</b>, and <b>pushcopy</b> are not allowed,
nor are any
<a href="#optionmodifiers">option-setting modifiers.</a>
The JIT modifiers are, however, permitted. Here is an example that saves and
reloads two patterns.
<pre>
  /abc/push
  /xyz/push
  #save tempfile
  #load tempfile
  #pop info
  xyz

  #pop jit,bincode
  abc
</pre>
If <b>jitverify</b> is used with #pop, it does not automatically imply
<b>jit</b>, which is different behaviour from when it is used on a pattern.
</P>
<P>
The #popcopy command is analogous to the <b>pushcopy</b> modifier in that it
makes current a copy of the topmost stack pattern, leaving the original still
on the stack.
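As an illustrative sketch (the pattern and subjects are assumptions made for
the example), the following input matches against a copy of the stacked
pattern and then retrieves the original with #pop:
<pre>
  /abc/push
  #popcopy
  abc

  #pop
  xabcx
</pre>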
</P>
<br><a name="SEC19" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
<b>pcre2jit</b>(3), <b>pcre2matching</b>(3), <b>pcre2partial</b>(3),
<b>pcre2pattern</b>(3), <b>pcre2serialize</b>(3).
</P>
<br><a name="SEC20" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
Last updated: 24 April 2024
<br>
Copyright © 1997-2024 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>